Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
Most recently, Andy Konwinski co-founded Databricks. Before that he was a PhD student and then postdoc in the AMPLab at UC Berkeley, focused on large scale distributed computing and cluster scheduling. He co-created and is a committer on the Apache Mesos project. He also worked with systems engineers and researchers at Google on the design of Omega, their next generation cluster scheduling system. More recently, he developed and led the AMP Camp Big Data Bootcamps and first Spark Summit, and has been contributing to the Spark project.
Patrick Wendell is an engineer at Databricks as well as a Spark Committer and PMC member. In the Spark project, Patrick has acted as release manager for several Spark releases, including Spark 1.0. Patrick also maintains several subsystems of Spark's core engine. Before helping start Databricks, Patrick obtained an M.S. in Computer Science at UC Berkeley. His research focused on low latency scheduling for large scale analytics workloads. He holds a B.S.E. in Computer Science from Princeton University.
Matei Zaharia is the creator of Apache Spark and CTO at Databricks. He holds a PhD from UC Berkeley, where he started Spark as a research project. He now serves as its Vice President at Apache. Apart from Spark, he has made research and open source contributions to other projects in the cluster computing area, including Apache Hadoop (where he is a committer) and Apache Mesos (which he also helped start at Berkeley).
Features & Highlights
Data in all domains is getting bigger. How can you work with it efficiently?
Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.
Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You’ll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.
Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell
Leverage Spark’s powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib
Use one programming paradigm instead of mixing and matching tools like Hive, Hadoop, Mahout, and Storm
Learn how to deploy interactive, batch, and streaming applications
Connect to data sources including HDFS, Hive, JSON, and S3
Master advanced topics like data partitioning and shared variables
Customer Reviews
Rating Breakdown
★★★★★ 30% (102)
★★★★ 25% (85)
★★★ 15% (51)
★★ 7% (24)
★ 23% (78)
Most Helpful Reviews
★★★★★
2.0
AEXZZTNHH5NBUVEPO74Y...
✓ Verified Purchase
Your best bet would be to read some slides on Slideshare
Spark is already at 2.2, but this book is still based on Spark 1.0/1.1. It covers only the very basics of Spark; none of the advanced Spark concepts are covered. Your best bet would be to read some slides on Slideshare and follow the Databricks documentation. There are some decent YouTube videos as well, and Apache Spark's own documentation is not bad at all.
28 people found this helpful
★★★★★
4.0
AGB36PIYRQ42FYPYEN3F...
✓ Verified Purchase
Best of the Books Currently Available
Good text. I've purchased pretty much all of the other options out there, and while this book still lacks detail (and borrows much of its content from the Databricks examples & Spark documentation available online), it's a worthwhile investment. The detailed treatment of RDDs, Spark SQL and streaming capability alone is enough to justify the cost.
9 people found this helpful
★★★★★
4.0
AHLH2EVWXNMZEEGWL2DF...
✓ Verified Purchase
Great intro, although a bit outdated already!
The only reason for the 4-star rating and not higher is that the book is already a bit outdated (from a Scala perspective). Newer versions of Spark do not support some of the examples in the book. This does not change or distort the overall big picture of the book, however. Still a very intuitive and straightforward intro to Spark.
8 people found this helpful
★★★★★
1.0
AG76W56KKXHBGWQCNPJG...
✓ Verified Purchase
but it's important, and this book is far enough out of date to make it nearly useless. You will learn the wrong way to do ...
Hopelessly out of date. Yes, it's difficult to keep up with early releases of software that is new and changing massively from version to version, but it's important, and this book is far enough out of date to make it nearly useless. You will learn the wrong way to do things and increase your troubleshooting time.
7 people found this helpful
★★★★★
1.0
AFFTFK2G7476G2N7JHH2...
✓ Verified Purchase
12 fails to suppress the annoying diagnostics. Example 2-11 fails because saveAsTextFile fails with ...
This book is not what I expected, and the instructions in it that I tried don't actually work. For instance, the Scala shell of the latest version, 1.6.1, doesn't load cleanly (it gives a long list of stack trace exceptions), and the one in the book, 1.3.0, is dated by more than a year. But OK, I went with 1.3.0 and found that the instruction on p. 12 fails to suppress the annoying diagnostics. Example 2-11 fails because saveAsTextFile fails with a null pointer exception. I skipped ahead and found snippets in Scala, Java, and Python in which the authors show what "should" work in theory, except there's no way to really test the code because it is written in isolation from prior code. In other words, the book is not a tutorial. Instead, it emphasizes the speed of Spark, but I couldn't get it to run reliably, if at all, with simple instructions and toy examples on a single-node environment. What can one expect in more complex, production scenarios? I have to wonder if all the Hadoop talk and big data are the latest IT hype. An interested reader may wish to see Kugler, L., "When Big Data Blunders," CACM, June 2016.
7 people found this helpful
★★★★★
3.0
AER44IKUO7CS7CKS4SZH...
✓ Verified Purchase
Still 1.6
It used to be one of the best books. Very good explanation of core concepts.
Now, with Spark 2.0, this book is outdated.
6 people found this helpful
★★★★★
2.0
AFRTL4GOLNWHMHG4WROU...
✓ Verified Purchase
A decent guided tour of Spark and its major components.
Over the last few years Big Data has gathered an incredible amount of momentum. All this fuzz and buzz has led top companies, as well as fearless start-ups, to invest hours and cash in data solutions, some of which have emerged to establish new standards. Being in the spotlight has often resulted in these projects becoming open source. Among these is Spark, a cluster computing framework recently adopted by the Apache Foundation. Despite being a hot topic of 2015, the literature dedicated to the subject is still very limited. Among the few titles available, Learning Spark provides the curious reader with a decent overview of the major features provided by the framework.
Written by a group of enthusiasts and developers, including the original creator of the framework itself, Matei, Learning Spark targets data scientists and engineers. As expressly written on the back cover, this book is neither a reference nor a cookbook. Its goal is to present a different, faster alternative to Hadoop’s MapReduce paradigm and to the elephant made in Apache itself.
The reader is given a quick overview of the capabilities of the framework, such as the built-in libraries, Spark SQL and the many different data sources it can interact with. While not all the main features are presented, those that are found within these almost three hundred pages come with plenty of well-explained examples.
The examples are, on the other hand, one of the many perplexities raised by this text: each is presented in Python, Java and Scala. While it is great to see many different bindings in action, any averagely skilled Pythonist can easily understand what happens in Java. And vice versa. This is even more true in the case of Scala, another much-wanted topic of recent years, inevitably related to Java and its ecosystem.
Another thumbs down for the complete absence of anything related to Spark’s internal architecture. The car looks nice, but what about the engine? How does it work? Magic? Witchery?
Again, the examples presented are clear and well explained, but there is no real world case shown. Spark is meant to get executed on huge clusters with scary amounts of data. True, this is a quick overview of the product, but “hello world” per se does not make me wanna learn more.
Overall, a good read for that early morning hour of commute. It helps the curious reader to pick up the basics of the framework. On the other hand, there is nothing presented here that can’t be found on the web pages of the Apache Software Foundation.
As usual, you can find more reviews on my personal blog (http://books.lostinmalloc.com). Feel free to pass by and share your thoughts!
6 people found this helpful
★★★★★
4.0
AG3ILND4GBCCA2KG2DDT...
✓ Verified Purchase
No-nonsense attempt at explaining Spark
I thought this was a pretty good book, but I agree with some reviewers that the way code snippets were presented is problematic. The code examples, especially the later ones, are very hard to recreate, in part due to the fast-moving release cycle of Spark, but also because, unless you are in a big shop with lots of servers, it's going to be hard to recreate the conditions. Most importantly, however, the examples are not self-contained and leave the reader having to infer what some of the variables are (say, from previous examples, continued implicitly). Maybe they did this for space considerations, as the book is modest in size at 240 pages.
Having said that, there aren't many Spark books out there, and it does a good job with the writing in terms of describing the platform and maybe not as good a job with the code examples. For anyone who in the past has been involved in a roll-your-own distributed computing environment, Spark itself is an incredibly welcome addition.
I happened to like the way the Scala vs Python vs Java breakdown is presented, as some things are not available typically in Python, and it's useful to see the variations (or similarities) in how things are done in the respective languages. The Spark API itself for these languages is elegant in its solution. Particularly prominent is the length of Java code compared to Scala. Spark (written in Scala, which in turn runs on the Java Virtual Machine) can be leveraged in Scala with very few lines of code.
I only played around with the platform in Scala and Python using the spark-shell in a Mac environment and could not make it work within cygwin on Windows (spark-shell seems to be not supported at the time of this writing for Windows/Cygwin). I did not exercise any of the later code examples.
The introductory chapters were very good, while the chapter on Spark Streaming was difficult and hard to follow. The Spark SQL chapter was also good. I found only a couple of typos (not counting any code errors which would be hard to characterize) - so it seems it was edited well. There was not a lot of editorializing or attempts at humor which I appreciated. Apparently the authors were developers of Spark so their perspective has legitimacy.
Overall I thought it was a solid book on an exciting, future oriented computing topic, and the main thing to improve upon would be to make the example code better. The naming conventions used in the code were somewhat cumbersome, but that is a topic in itself and it's always hard to name variables and functions in a way that is readable and yet not too long and confusing.
Note on my reviews: I have thousands of books in my library and carefully select the next books to read in my reading list so as to have a favorable, positive experience. Therefore there is a good chance I'm going to like the book that I read next, and in turn give it a good review - I have no desire to read bad books (if someone paid me, maybe I would do it). Sometimes I am wrong and I end up reading a real clunker and you will see negative reviews from me. More than likely I will not finish the book in which case I won't review it (I only review books which I read all the way through). So yes, there is a bias in my reviews but it is not for the obvious reasons (i.e. that authors are friends of mine, or have sent me a review copy, or that I just give high ratings to everything ...)
5 people found this helpful
★★★★★
3.0
AETFBGQIQ3NCHBWIMAOQ...
✓ Verified Purchase
this book was published in 2015 and is outdated
The core Spark concepts are there, but Spark: The Definitive Guide (which I subsequently purchased) would be a better purchase to make than Learning Spark. It's unfortunate there's not an updated edition of Learning Spark because it's a great introduction to Spark IMO despite the dated content in certain areas. I recommend ignoring the installation instructions altogether and just jumping forward to the chapter content.
4 people found this helpful
★★★★★
3.0
AEAGUTQBITQDNSWYCTIQ...
✓ Verified Purchase
Anyone using Spark on the job regularly will find that learning to load and print a schema is so basic that it is almost useless. No mention of how one would change the ...
Pros: The book is the only book on the market that gives examples in Scala, Python and Java. The authors attempt to cover a wide range of topics. The authors also give insight into using sbt and Maven with projects.
Cons: Generally the book is very basic. It does not go into anything beyond basics and in some areas does not explain the basics very well. Also, the book is very repetitive, saying the same basic things in multiple places without any depth. For example, in chapter 5 it talks about JSON data, then in chapter 9 it talks about JSON data again. In both chapters it talks about loading JSON data; however, in chapter 9 it shows how to print the schema. That is literally all that is mentioned on JSON. Anyone using Spark on the job regularly will find that learning to load and print a schema is so basic that it is almost useless. No mention of how one would change the schema or add columns to a nested schema.
When I first started learning Spark, I struggled to follow this book. Now that I understand Scala and Spark in more depth, I realize that the book is lacking in explanation and in depth and does not help the reader to learn how to actually do anything beyond the basics.
There are not many options on the market so I would recommend getting the book. But you will quickly find it does not have much depth.