Introduction to Apache Spark (edX course)

Posted on July 13, 2016

Tags: spark, courses

Apache Spark is a hot topic in the world of big data and data analysis. It is commonly used to process huge amounts of data and is positioned as a faster alternative to Hadoop. So I decided to learn more about this technology and what makes it so special.

Here are just some numbers that can impress and give you some insight:

Apache Spark was able to sort 100 TB of data in 23 minutes!
It has also sorted a full 1 PB of data!

During the course, you will work with the Databricks service, which lets you create and launch Spark jobs on a cluster and provides a simple, user-friendly interface for managing clusters. The course uses Python for working with data frames and building queries inside Python notebooks (in the real world you can also use Scala or Java for that).

This course will give you a basic understanding of how to work with Spark data frames, build aggregation queries and visualize the results. You will work with a real log from the NASA website and extract some statistics from it.
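To give a feel for that kind of exercise, here is a minimal sketch, not the course's actual notebook code. It assumes the log lines follow the Common Log Format; the names `LOG_PATTERN`, `parse_line` and `run_spark_demo` and the column names are my own illustrations. The Spark part needs PySpark installed, so it is wrapped in a function that is not called automatically.

```python
import re

# Assumed line layout (Common Log Format), for example:
# host - - [01/Aug/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1839
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\d+|-)$'
)


def parse_line(line):
    """Turn one log line into a tuple, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, timestamp, method, path, status, size = match.groups()
    # A "-" in the size field means no body was sent; treat it as 0 bytes.
    return (host, timestamp, method, path, int(status),
            0 if size == "-" else int(size))


def run_spark_demo(lines):
    """Count requests per HTTP status with the DataFrame API (needs PySpark)."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("log-demo").getOrCreate()
    rows = [row for row in map(parse_line, lines) if row is not None]
    columns = ["host", "timestamp", "method", "path", "status", "size"]
    df = spark.createDataFrame(rows, columns)
    # Aggregation query: number of requests for each status code.
    df.groupBy("status").count().orderBy("count", ascending=False).show()
    spark.stop()
```

With PySpark available, calling `run_spark_demo(open("access.log"))` on a log file of your own (the file name here is hypothetical) prints a small table of status codes and their request counts.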

A few words about functional programming

Many people are still sceptical about functional programming and its value. What I can say is that the Apache Spark project is a bright demonstration of functional programming in use and real proof that it works. Functional programming patterns are a good fit for building reliable distributed systems and for multicore processing. It also shows the power of data streaming.
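A tiny plain-Python sketch of why this style suits distributed work (no Spark involved; the sample data and the names `count_partition` and `merge_counts` are made up for illustration). Each chunk is summarized by a pure function, and the partial results are merged with an associative operation, so it does not matter how the data is split or in which order the pieces are combined.

```python
from collections import Counter
from functools import reduce

# Hypothetical sample data: HTTP status codes as they might appear in a web log.
statuses = [200, 404, 200, 500, 200, 404, 200, 304]


def count_partition(partition):
    """Pure function: summarize one chunk without touching shared state."""
    return Counter(partition)


def merge_counts(left, right):
    """Associative merge: partial results can be combined in any grouping."""
    return left + right


# Split the data into chunks, roughly as a cluster spreads it across workers.
partitions = [statuses[i:i + 3] for i in range(0, len(statuses), 3)]

# "Map" each chunk to a partial count, then "reduce" the partial counts.
partial_counts = map(count_partition, partitions)
total = reduce(merge_counts, partial_counts, Counter())

# The chunked result matches a single sequential pass over all the data.
sequential = Counter(statuses)
```

Because neither function depends on outside state, each chunk could just as well be processed on a different core or machine, which is the property Spark relies on at a much larger scale.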

In short, if you are interested in big data and want to get familiar with Spark, this course is a good place to start. But if you are already familiar with the Spark API, I think you can go straight to the next course in that series on edX.

I will be glad if you suggest more courses and materials about Spark in the comments (: