
Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

  • Based on 3,398 reviews
Condition: New
$23.37
Save $36.62 (was $59.99)

Buy Now, Pay Later


As low as $5 / mo
  • – 4-month term
  • – No impact on credit
  • – Instant approval decision
  • – Secure and straightforward checkout

Ready to go? Add this product to your cart and select a plan during checkout. Payment plans are offered through our trusted finance partners Klarna, PayTomorrow, Affirm, Apple Pay, and PayPal. No-credit-needed leasing options through Acima may also be available at checkout.

Learn more about financing & leasing here.


Free shipping on this product

This item is eligible for return within 30 days of receipt

To qualify for a full refund, items must be returned in their original, unused condition. If an item is returned in a used, damaged, or materially different state, you may be granted a partial refund.

To initiate a return, please visit our Returns Center.

View our full returns policy here.


Availability: Only 4 left in stock, order soon!
Fulfilled by Amazon

Arrives Saturday, May 25
Order within 4 hours and 42 minutes
Available payment plans shown during checkout

Format: Paperback, Illustrated


Description

Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords?

In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.

  • Peer under the hood of the systems you already use, and learn how to use and operate them more effectively
  • Make informed decisions by identifying the strengths and weaknesses of different tools
  • Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
  • Understand the distributed systems research upon which modern databases are built
  • Peek behind the scenes of major online services, and learn from their architectures


Publisher ‏ : ‎ O'Reilly Media; 1st edition (May 2, 2017)


Language ‏ : ‎ English


Paperback ‏ : ‎ 611 pages


ISBN-10 ‏ : ‎ 1449373321


ISBN-13 ‏ : ‎ 20


Item Weight ‏ : ‎ 2.15 pounds


Dimensions ‏ : ‎ 7.01 x 1.24 x 9.17 inches


Best Sellers Rank:
  • #2,364 in Books
  • #1 in Data Modeling & Design (Books)
  • #1 in MySQL Guides
  • #1 in Desktop Database Books




Frequently asked questions

If you place your order now, the estimated arrival date for this product is: Saturday, May 25

Yes, absolutely! You may return this product for a full refund within 30 days of receiving it.


  • Klarna Financing
  • Affirm Pay in 4
  • Affirm Financing
  • PayTomorrow Financing
  • Apple Pay Later
Leasing options through Acima may also be available during checkout.


Top Amazon Reviews


  • Highly recommended
Kleppmann mentioned during his "Turning the Database Inside Out with Apache Samza" talk at Strange Loop 2014 (see my notes) that he was on sabbatical working on this book, and while waiting quite some time for it to be published, I ended up experimenting with his Bottled Water project as well as Apache Kafka (which was only at release 0.8.2.2 at that point in time). Other reviewers are correct that much of the material included in this book is available elsewhere, but this book is packaged well (although still at 550 pages and heavyweight), with most of the key topics associated with data-intensive applications under one roof with good explanations and numerous footnotes which point to resources providing additional detail. Content is broken down into 3 sections and 12 chapters: (1) foundations of data systems, which covers reliable, scalable, and maintainable applications, data models and query languages, storage and retrieval, and encoding and evolution, (2) distributed data, which covers replication, partitioning, transactions, the trouble with distributed systems, and consistency and consensus, and (3) derived data, which covers batch processing, stream processing, and the future of data systems. The latter 6 chapters are weighted more heavily, with chapter 9 on consistency and consensus, and chapter 12 on the future of data systems, the most lengthy with each comprising about 12% of the book. Some potential readers might be disappointed that this book is all theory, but while the author does not provide any code he discusses practical implementation and specific details when applicable for comparisons within a product category. In my opinion, the last chapter is probably the most abstract simply because it explores ideas about how the tools covered in the prior two chapters might be used in the future to build reliable, scalable, and maintainable applications.
Similarly, the chapter on the opposite end of this book sets the stage well for any developer of nontrivial applications with its section on thinking about database systems and the concerns around reliability, scalability, and maintainability. About a year ago, I recall an executive colleague responding to me with a quizzical look when I mentioned that tooling for data and application development is converging over time, and just a few months prior I mentioned in a presentation to developers that transactional and analytical capabilities are being provided more and more by single database products, with one executive in the audience shaking his head in disagreement that kappa rather than lambda architectures are the way to go. Kleppmann mentions that we typically think of databases, message brokers, caches, etc. as residing in very different categories of tooling because each of these has very different access patterns, meaning different performance characteristics and therefore different implementations. So why should all of this tooling not be lumped together under an umbrella term such as 'data systems'? Many products for data storage and processing have emerged in recent years, optimized for a variety of use cases and no longer neatly fitting into traditional categories: the boundaries between categories are simply becoming blurred, and since a single tool can no longer satisfy the data processing and storage needs for many applications, work is broken down into tasks that can be performed efficiently on a single system that is often comprised of different tooling stitched together by application code under the covers. In addition to the author's abundant and effective simple line diagrams that are reminiscent (although more sophisticated) of his earlier diagrams, one aspect that I especially appreciate is the nomenclature comparisons between products when walking through terminology.
For example, at the beginning of chapter 6, the author specifically calls out the terminological confusion that exists with respect to partitioning. "What we call a 'partition' here is called a 'shard' in MongoDB, Elasticsearch, and SolrCloud; it's known as a 'region' in HBase, a 'tablet' in Bigtable, a 'vnode' in Cassandra and Riak, and a 'vBucket' in Couchbase. However, partitioning is the most established term, so we'll stick to that." In addition, Kleppmann walks through differences between products when the same terminology is being used, which can also lead to confusion. For example, in chapter 7 the author provides a great 5-page discussion on the meaning of "ACID" (atomicity, consistency, isolation, and durability), which was an effective reminder to me that while this term was coined in 1983 in an effort to establish precise terminology for fault-tolerance mechanisms in databases, in practice one database's implementation of ACID does not equal another's implementation. "Today, when a system claims to be 'ACID compliant', it's unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term." If you've ever found yourself confused about the concept of "consistency", the author offers a sanity check that your confusion is warranted, not only because the term is "terribly overloaded" with at least four different meanings, but because "the letter C doesn't really belong in ACID" since it was "tossed in to make the acronym work" in the original paper, and that "it wasn't considered important at the time." The reality is that "atomicity, isolation, and durability are properties of the database, whereas consistency (in the ACID sense) is a property of the application. The application may rely on the database's atomicity and isolation properties in order to achieve consistency, but it's not up to the database alone." 
And later in chapter 9 where he discusses consistency and consensus, the author provides a great sidebar on "the unhelpful CAP theorem". As Kleppmann later comments, "the CAP theorem as formally defined is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, or nodes that are alive but disconnected from each other). It doesn't say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems." The author concludes in a sidebar by commenting that "all in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided." This is because "CAP is sometimes presented as 'Consistency, Availability, Partition tolerance: pick 2 out of 3'. Unfortunately, putting it this way is misleading because network partitions are a kind of fault, so they aren't something about which you have a choice: they will happen whether you like it or not...A better way of phrasing CAP would be 'either Consistent or Available when Partitioned'. A more reliable network needs to make this choice less often, but at some point the choice is inevitable." While the second section of this text on distributed data was most beneficial to me, the third section on derived data was least beneficial, mainly because I'm already familiar with these topics from recent readings and experience, and because I needed to refamiliarize myself with the content discussed in the second section. However, the author presents derived data well, and I certainly do not recommend skipping this section. As Kleppmann comments, the issues around integrating multiple different data systems into one coherent application architecture are often overlooked by vendors who claim that their product can satisfy all of your needs.
In reality, integrating disparate systems (which can be grouped into the two broad categories of "systems of record" and "derived data systems") is one of the most important things that needs to be done in a nontrivial application. I highly recommend this text.
Reviewed in the United States on June 2, 2018 by Erik Gfesser
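
The review above quotes the book's point that atomicity, isolation, and durability are properties of the database, while consistency (in the ACID sense) is a property of the application. A minimal sketch of that split, using Python's standard `sqlite3` module: the `accounts` table, the `transfer` and `total` helpers, and the "no negative balance" rule are hypothetical names chosen for illustration and are not taken from the book. The database supplies the all-or-nothing rollback; the invariant is defined and checked by the application code.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes the block
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            # Application-level invariant: the database does not know this rule.
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        # Atomicity (a database property) has already undone the partial debit.
        return False

def total(conn):
    """Sum of all balances; a transfer should never change it."""
    return conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]

# Assumed example data, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

If the invariant check is removed, the database will still apply both updates atomically, but nothing stops a balance from going negative, which is the sense in which consistency is not up to the database alone.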

  • Essential reading for anyone working on distributed systems in any capacity
Designing Data-Intensive Applications really exceeded my expectations. Even if you are experienced in this area this book will reinforce things you know (or sort of know) and bring to light new ways of thinking about solving distributed systems and data problems. It will give you a solid understanding of how to choose the right tech for different use cases. The book really pulls you in with an intro that is more high level, but mentions problems and solutions that really anyone who has worked on these types of applications has either encountered or heard mention of. The promise it makes is to take these issues such as scalability, maintainability and durability and explain how to decide on the right solutions to these issues for the problems you are solving. It does an amazing job of that throughout the book. This book covers a lot, but at the same time it knows exactly when to go deep on a subject. Right when it seems like it may be going too deep on things like how different types of databases are implemented (SSTables, B-trees, etc.) or on comparing different consensus algorithms, it is quick to point out how and why those things are important to practical real-world problems and how understanding those things is actually vital to the success of a system. Along those same lines it is excellent at circling back to concepts introduced at prior points in the book. For example the book goes into how log-based storage is used for some databases as their core way of storing data and for durability in other cases. Later in the book when getting into different message/eventing systems such as Kafka and ActiveMQ things swing back to how these systems utilize log-based storage in similar ways. Even if you have prior knowledge or even have worked with these technologies, how and why they work and the pros and cons of each become crystal clear and really solidified.
Same can be said of its great explanations of things like ZooKeeper and why specific solutions like Kafka make use of it. This book is also amazing at shedding light on the fact that so little of what is out there is totally new; it attempts to go back as far as it can at times on where a certain technology's ideas originated (back to the 1800s at some points!). Bringing in this history really gives a lot of context around the original problems that were being solved, which in turn helps understanding pros and cons. One example is the way it goes through the history of batch processing systems and HDFS. The author starts with MapReduce and relating it to tech that was developed decades before. This really clarifies how we got from batch processing systems on proprietary hardware to things like MapReduce on commodity hardware thanks in part to HDFS, eventually to stream based processing. It also does great at explaining the pros and cons of each and when one might choose one technology over the other. That's really the theme of this book, teaching the reader how to compare and contrast different technologies for solving distributed systems and data problems. It teaches you to read between the lines on how certain technologies work so that you can identify the pros and cons early and without needing them to be spelled out by the authors of those technologies. When thinking about databases it teaches you to really consider the durability/scalability model and how things are nowhere near black and white between "consistent" vs "eventually consistent"; there is a ton of nuance there and it goes deep on things like single vs multi leader vs leaderless, linearizability, total order broadcast, and different consensus algorithms. I could go on forever about this book.
To name a few other things it touches on to get a good idea of the breadth here: networking (and networking faults), OLAP, OLTP, 2 phase locking, graph databases, 2 phase commit, data encoding, general fault tolerance, compatibility, message passing, everything I mentioned above, and the list goes on and on and on. I recommend anyone who does any kind of work with these systems takes the time to read this book. All 600ish pages are worth reading, and it's presented in an excellent, engaging way with real world practical examples for everything.
Reviewed in the United States on June 1, 2020 by Joey
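
The review above mentions log-based storage as the core way some databases store data. As a rough sketch of that idea, and not the design of any particular product, the following keeps an append-only log plus an in-memory index from each key to the byte offset of its latest record. The `LogStore` class, the `key,value` line format, and the use of `io.BytesIO` in place of a file on disk are all assumptions made for illustration; keys are assumed to contain no commas or newlines, and values no newlines.

```python
import io

class LogStore:
    """Append-only key-value log with an in-memory offset index (illustrative only)."""

    def __init__(self):
        self.log = io.BytesIO()  # stands in for an append-only file on disk
        self.index = {}          # key -> byte offset of the most recent record

    def set(self, key, value):
        # Writes never modify existing bytes; they only append a new record.
        offset = self.log.seek(0, io.SEEK_END)
        self.log.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        # Jump straight to the latest record for this key and read one line.
        self.log.seek(offset)
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return value
```

Overwriting a key leaves the old record in the log and only moves the index entry, which is why real log-structured stores also need some form of compaction; that step is omitted here.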

  • Designing Surveillance-intensive applications
Great read, only took 2 years to get through it. Had to learn how to meditate and develop self control just to finish this book. 10/10 self help book with the added benefit of teaching me how to scale up my surveillance operations. Jokes aside this really is a foundational book for system design at scale.
Reviewed in the United States on November 2, 2022 by Harry Simpson

  • Great system design book
This book is a very good system design book
Reviewed in the United States on November 14, 2022 by Zee

  • Absolutely a fantastic book!
You would love not just the amazing, vast technical breadth but also the even more amazing succinct, incisive style of writing. There is enough sharp humor in places. A thoroughly informative and pleasurable read. You would hardly find such technical books.
Reviewed in the United States on October 9, 2022 by Amazon Customer

  • Great book!
Recommend chapters 2,3,5,6
Reviewed in the United States on October 27, 2022 by Jerry Huang

  • Great book, bad print quality
No doubt about the quality of the content, but the pages are so thin that you can see the writing on the other side.
Reviewed in the United States on August 30, 2022 by rachid
