Disclosure: when you buy through links on our site, we may earn an affiliate commission.

Apache Spark Hands on Specialization for Big Data Analytics

In-depth course to master Apache Spark Development using Scala for Big Data (with 30+ real-world & hands-on examples)
(533 reviews)
12,341 students

Platform: Udemy
Video: 11h 50m
Language: English
Next start: On Demand

What if you could catapult your career in one of the most lucrative domains, i.e. Big Data, by learning a state-of-the-art technology in the Hadoop ecosystem (Apache Spark) that is considered mandatory for virtually all current jobs in this industry?

What if you could develop your skill-set in one of the hottest Big Data technologies, i.e. Apache Spark, by taking one of the most comprehensive courses out there (with 10+ hours of content), packed with dozens of hands-on, real-world examples, use-cases, challenges and best practices?
What if you could learn from an instructor who works in the world’s largest consultancy firm, has worked end-to-end on Australia’s biggest Big Data projects to date, and has a proven track record on Udemy with highly positive reviews and thousands of students already enrolled in his previous course(s)?

If you have such aspirations and goals, then you and this course are a perfect match made in heaven!
Why Apache Spark?
Apache Spark has revolutionised the way big data processing and machine learning are done by virtue of its unprecedented in-memory and optimised computational model. It has been widely hailed as the future of Big Data. It is the tool of choice around the world, allowing data scientists, engineers and developers to acquire and process data for a number of use-cases such as scalable machine learning, stream processing and graph analytics, to name a few. Leading organisations like Amazon, eBay and Yahoo, among many others, have embraced this technology to address their Big Data processing requirements.
Additionally, Gartner has repeatedly highlighted Apache Spark as a leader among Data Science platforms. The certification programs of Hadoop vendors like Cloudera and Hortonworks, which are held in high esteem in the industry, have oriented their curricula to focus heavily on Apache Spark. Almost all jobs in the Big Data and Machine Learning space demand proficiency in Apache Spark.
This is what John Tripier, Alliances and Ecosystem Lead at Databricks has to say, “The adoption of Apache Spark by businesses large and small is growing at an incredible rate across a wide range of industries, and the demand for developers with certified expertise is quickly following suit”.
All of these facts point to the same conclusion: learning this amazing technology will give you a strong competitive edge in your career.
Why this course?
Firstly, this is the most comprehensive and in-depth course ever produced on Apache Spark. I’ve carefully and critically surveyed the resources out there, and almost all of them fail to cover this technology in the depth it truly deserves. Some lack coverage of Apache Spark’s theoretical concepts, such as its architecture and how it works in conjunction with Hadoop; some fall short in thoroughly describing how to use the Apache Spark APIs optimally for complex big data problems; some ignore the hands-on aspects of Apache Spark programming on real-world use-cases; and almost all of them skip the industry best practices and the mistakes that many professionals make in the field.
This course addresses all of the limitations that are prevalent in the currently available courses. Apart from that, having attended trainings from leading Big Data vendors like Cloudera (for which they charge thousands of dollars), I’ve ensured that the course is aligned with the educational patterns and best practices followed in those trainings, so that you get the best and most effective learning experience.
Each section of the course covers concepts in extensive detail and from scratch, so you won’t face any difficulties even if you are new to this domain. Also, each section has an accompanying assignment section where we will work together on a number of real-world challenges and use-cases employing real-world datasets. The datasets themselves belong to different niches, ranging from retail and web server logs to telecommunications, and some of them come from Kaggle (the world’s leading Data Science competition platform).
The course uses Scala instead of Python. Wherever possible, references to Python development are also given, but the course is primarily based on Scala. This decision was made for a number of rational reasons. Scala is the de-facto language for development in Apache Spark: Apache Spark itself is developed in Scala, and as a result all new features are made available in Scala first and only later in other languages like Python. Additionally, there is a significant performance difference when using Apache Spark with Scala compared to Python. Scala is also one of the highest-paid programming languages, and you will be developing strong skills in that language along the way as well.
The course also includes a number of quizzes to further test your skills. For additional support, you can always ask questions, to which you will get a prompt response. I will also be sharing best practices and tips with my students on a regular basis.

What are you going to learn in this course?
The course consists of two major sections:
• Section 1: We’ll start off with an introduction to Apache Spark and will understand its potential and business use-cases in the context of the overall Hadoop ecosystem. We’ll then focus on how Apache Spark actually works and will take a deep dive into the architectural components of Spark, as this is crucial for a thorough understanding.
• Section 2: After developing an understanding of Spark’s architecture, we will move to the next section of the course, where we will employ the Scala language and the Apache Spark APIs to develop distributed computation programs. Please note that you don’t need prior knowledge of Scala for this course, as I will start with the very basics of Scala; as a result, you will also be developing your skills in one of the highest-paying programming languages.
In this section, we will comprehensively understand how Spark performs distributed computation using abstractions like RDDs, what the caveats are in loading data into Apache Spark, what the different ways to create RDDs are, how to leverage parallelism, and much more.
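To preview the ideas above: there are two common ways to create an RDD, reading a file or parallelizing a local collection. Real Spark code needs a live SparkContext, so the Spark calls below are shown only as comments; the plain Scala code merely mimics how parallelize splits data across a chosen number of partitions (the file path and partition count are illustrative, not from the course).

```scala
// Spark versions (require a SparkContext, shown for reference only):
// val linesRdd = sc.textFile("hdfs:///data/retail.csv")  // RDD from a file in HDFS
// val numsRdd  = sc.parallelize(1 to 10, 4)              // RDD from a local collection, 4 partitions

// parallelize distributes elements across partitions roughly like this:
val data = (1 to 10).toList
val numPartitions = 4
val partitions = data.zipWithIndex
  .groupBy { case (_, i) => i % numPartitions } // round-robin-style split by index
  .toList.sortBy(_._1)
  .map { case (_, chunk) => chunk.map(_._1) }   // keep only the values

partitions.foreach(p => println(p.mkString("[", ", ", "]")))
```

More partitions mean more units of parallel work; in real Spark, each partition is processed by a separate task.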
Furthermore, as transformations and actions constitute the gist of the Apache Spark APIs, it’s imperative to have a sound understanding of them. We will therefore focus on a number of Spark transformations and actions that are heavily used in industry and will go into the details of each. Each API’s usage will be complemented with a series of real-world examples and datasets, e.g. retail, web server logs, customer churn, and also data from Kaggle. Each section of the course will have a number of assignments where you will practically apply the learned concepts to further consolidate your skills.
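The key semantic point is that transformations (map, filter, …) are lazy and only an action (count, collect, …) triggers computation. Scala’s lazy views behave the same way, so this small sketch previews the idea on hypothetical web-server-log lines; in Spark you would call the same map and filter on an RDD instead of a view.

```scala
// Hypothetical log lines: "<status code> <path>"
val logLines = List(
  "200 /index.html",
  "404 /missing.html",
  "200 /products.html",
  "500 /checkout"
)

// "Transformations": build a lazy pipeline; no work is done yet.
val errorCodes = logLines.view
  .map(_.split(" ")(0).toInt) // extract the status code
  .filter(_ >= 400)           // keep only error responses

// "Action": forces evaluation, like count() or collect() on an RDD.
val numErrors = errorCodes.size
println(s"error responses: $numErrors")
```

Laziness lets Spark fuse the whole pipeline into one pass over the data instead of materialising intermediate results.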
A significant section of the course will also be dedicated to key-value pair RDDs, which form the basis of working optimally on a number of big data problems.
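To give a flavour of key-value RDDs: the workhorse operation is reduceByKey, which merges the values for each key. On a real pair RDD you would simply write `pairs.reduceByKey(_ + _)`; the plain-Scala groupBy below produces the same per-key sums on hypothetical retail data (the category names and amounts are made up for illustration).

```scala
// Hypothetical (category, sale amount) pairs, as they might come from a retail dataset.
val pairs = List(
  ("electronics", 120.0),
  ("grocery", 35.5),
  ("electronics", 80.0),
  ("grocery", 12.5)
)

// Equivalent of pairRdd.reduceByKey(_ + _): sum the values for each key.
val totalsByKey: Map[String, Double] =
  pairs.groupBy(_._1).map { case (key, kvs) =>
    key -> kvs.map(_._2).sum
  }

totalsByKey.toList.sortBy(_._1).foreach { case (k, v) => println(s"$k -> $v") }
```

In Spark, reduceByKey is preferred over groupByKey because it combines values on each partition before shuffling data across the cluster.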
In addition to covering the crux of the Spark APIs, I will also highlight a number of valuable best practices based on my experience and exposure, and will point out mistakes that many people make in the field. You will rarely find such information anywhere else.
Each topic will be covered in a lot of detail, with a strong emphasis on being hands-on, thus ensuring that you learn Apache Spark in the best possible way.
The course is applicable to and valid for all versions of Spark, i.e. 1.6 and 2.0.
After completing this course, you will develop a strong foundation and an extended skill-set for using Spark on complex big data processing tasks. Big data is one of the most lucrative career domains, where data engineers command high salaries. This course will also substantially help you in your job interviews. And if you are looking to excel further in your big data career by passing Hadoop certifications like those of Cloudera and Hortonworks, this course will prove extremely helpful in that context as well.
Lastly, once enrolled, you will have lifetime access to the lectures and resources. It’s a self-paced course, and you can watch the lecture videos on any device, like a smartphone or laptop. Also, you are backed by Udemy’s rock-solid 30-day money-back guarantee. So if you are serious about learning Apache Spark, enrol in this course now and let’s start this amazing journey together!

You will learn

✓ Understand the relationship between Apache Spark and Hadoop Ecosystem
✓ Understand Apache Spark use-cases and advanced characteristics
✓ Understand Apache Spark Architecture and how it works
✓ Understand how Apache Spark on YARN (Hadoop) works in multiple modes
✓ Understand development life-cycle of Apache Spark Applications in Python and Scala
✓ Learn the foundations of Scala programming language
✓ Understand Apache Spark’s primary data abstraction (RDDs)
✓ Understand and use RDDs advanced characteristics (e.g. partitioning)
✓ Learn nuances in loading files in Hadoop Distributed File system in Apache Spark
✓ Learn the implications of delimiters in text files and their processing in Spark
✓ Create and use RDDs by parallelizing Scala’s collection objects and implications
✓ Learn the usage of Spark and YARN Web UI to gain in-depth operational insights
✓ Understand Spark’s Directed Acyclic Graph (DAG) based execution model and its implications
✓ Learn Transformations and their lazy execution semantics
✓ Learn Map transformation and master its applications in real-world challenges
✓ Learn Filter transformation and master its usage in real-world challenges
✓ Learn Apache Spark’s advanced Transformations and Actions
✓ Learn and use RDDs of different JVM objects including collections and understanding critical nuances
✓ Learn and use Apache Spark for statistical analysis
✓ Learn and master Key Value Pair RDDs and their applications in complex Big Data problems
✓ Learn and master Join Operations on complex Key Value Pair RDDs in Apache Spark
✓ Learn how RDDs caching works and use it for advanced performance optimization
✓ Learn how to use Apache Spark for Data Ranking problems
✓ Learn how to use Apache Spark for handling and processing structured and unstructured data
✓ Learn how to use Apache Spark for advanced Business Analytics
✓ Learn how to use Apache Spark for advanced data integrity and quality checks
✓ Learn how to use Scala’s advanced features like functional programming and pattern matching
✓ Learn how to use Apache Spark for logs processing


Requirements

• Background knowledge of Big Data would be helpful but is not necessary, as everything will be taught from scratch
• Past experience with a programming language would be helpful but is not necessary, as everything will be taught from scratch
• A computer system (Laptop/Desktop) with either Windows, Linux or Mac installed for hands-on practice
• All the software and tools used are freely available
• The most important requirement: Thirst and commitment to learn!

This course is for

• Anyone who has the passion to develop expertise in Big Data and specifically Apache Spark
• Software Engineers or Developers
• Data Warehousing or Business Intelligence Professionals
• Data Scientist and Machine Learning Enthusiasts
• Data Engineers and Big Data Architects
Data Scientist in the world’s largest consultancy firm
A full-stack scalable analytics specialist, working in the world’s largest consultancy firm in Australia, with a growing portfolio of successful projects delivering substantial impact and value in multiple capacities across the telecom, retail, energy and health-care sectors.

• Artificial Intelligence (AI) stream-lead in Deloitte Australia’s Azure Enablement Initiative
• Member of Deloitte Australia’s ClearLight initiative managing AWS and Azure platform for enablement and assets prototyping
• Trainer of Deloitte’s internal Data Science training program
• Author of “Scala Programming for Big Data Analytics” book published by Apress
• Technical reviewer of “Next-Generation Big Data: A Practical Guide to Apache Kudu, Impala, and Spark” book published by Apress
• Instructor of Apache Spark and R Programming courses on Udemy with thousands of students enrolled from all around the world
• Designated author of the largest Data Science publication (Towards Data Science) on Medium
• Speaker at DataWorks Summit in 2017 in Sydney on in-memory Big Data Technologies
• Speaker in Data Analytics Explained meetup and in multiple universities all around the world
