Spark is a platform for running computations or tasks on large amounts of data (we are talking petabytes) – these tasks can range from map-reduce-style batch jobs to streaming and machine learning applications. The real power of Spark comes from the extensive APIs and supported languages (Python, Java, Scala, and R) that developers can use to create and manage data-based workflows, plus the fact that it supports integrations with pretty much any data store you’d want to use.
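To make the "map-reduce-style" part concrete, here is a minimal PySpark sketch of a word count; the input path ("logs.txt") is hypothetical and just stands in for whatever data store you actually point Spark at.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()

# Read a text file into an RDD, split each line into words (the "map" side),
# then sum the per-word counts across partitions (the "reduce" side).
counts = (
    spark.sparkContext.textFile("logs.txt")   # hypothetical input file
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
)

print(counts.take(10))
spark.stop()
```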
Why is Spark used today? Companies like Databricks have poured tons of money into the technology, keeping the project alive and up to date. It is also extremely fast – at its core, Spark does a great job of distributing computations across multiple nodes, which compute and cache whatever operation is being requested, while a driver coordinates requests and collects results. It is often said to be faster and easier to use than competitors built on Hadoop MapReduce for big data processing.
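The caching mentioned above is a big part of the speed story. Here is a small sketch of how it looks in PySpark; the dataset is synthetic and the numbers are arbitrary, just to show the pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Build a DataFrame of ten million rows on the executors.
df = spark.range(0, 10_000_000).withColumnRenamed("id", "value")

# cache() asks the executors to keep the computed partitions in memory,
# so later actions reuse them instead of recomputing from scratch.
df.cache()

print(df.count())                              # first action: computes and caches
print(df.filter("value % 2 = 0").count())      # second action: reuses the cached data

spark.stop()
```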
However, there are reportedly plenty of issues with Spark as well. For instance, it can take a lot of debugging to get configurations right so that memory errors are not encountered, and the supported language APIs are not always updated in unison. Overall, things tend to work well when going through tutorials and typical use cases, but large-scale usage can be finicky.
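For a sense of what that configuration tuning looks like, here is a hedged sketch of setting a few common memory-related options in PySpark; the specific values are arbitrary and depend entirely on your cluster, and driver memory in particular usually has to be set before the driver JVM starts (e.g. via spark-submit) rather than in code.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning")
    # Per-executor JVM heap; too small and shuffles or caching hit OOM errors.
    .config("spark.executor.memory", "8g")
    # Number of partitions used after shuffles; affects memory pressure per task.
    .config("spark.sql.shuffle.partitions", "200")
    # Driver memory shown for illustration; in practice pass --driver-memory to
    # spark-submit, since this setting is read before the driver JVM launches.
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```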


