A practical workshop on bringing the modern data platform into your enterprise environment. This workshop is based on our experience delivering enterprise-grade data solutions. During the lifecycle of any Big Data system, data engineering teams face requests and requirements from the departments responsible for data quality, monitoring and security. Knowing best practices drawn from real-world scenarios helps attendees get ready on day one and eases their path in managing these complex environments.
The workshop takes place during May. For more information, leave your contact details.
The workshop is 40% theory and 60% hands-on. Each participant creates their own cluster setup within a provided cloud environment and submits completed work through their GitHub account. The advantage of this approach is that the workshop tasks can be replicated later in the same cloud environment.
Your company has decided to adopt big data or machine learning and has asked you to figure out the implementation.
A consulting company has left you with an expensive Hadoop cluster that is neither doing anything nor being managed in any way.
You’ve opted for open source and realized it’s not that easy to meet enterprise demands.
You plan to use software in the Cloudera stack for your end solution.
Basic understanding of the Big Data landscape
Advanced English (the workshop is primarily in English)
What you need for the workshop
Laptop with working Wi-Fi
GitHub account (private or public)
Big Data and Hadoop
A brief reminder of why these systems exist and what the benefits of adopting these technologies are.
Why use Gauss and Cloudera
Understanding the pros and cons of adopting open source software and how companies like Gauss Algorithmic and Cloudera can help.
Big Data Projects
A discussion of what you need to get a Big Data project running.
The advantages of running Big Data environments on-premises, in a hybrid setup or in the cloud.
How to figure out the right cluster size for your project and how to predict scale.
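As an illustration of the sizing question, here is a minimal back-of-the-envelope sketch in Python. It is not the workshop's method: the function name, the 25% working-space overhead, the usable capacity per node and the growth figures are all assumptions chosen for the example. Only the replication factor of 3 reflects the HDFS default.

```python
import math


def estimate_worker_nodes(raw_tb, replication=3, overhead=0.25,
                          usable_tb_per_node=36.0,
                          yearly_growth=0.0, years=0):
    """Rough count of storage nodes needed for a given data volume.

    raw_tb             -- data volume today, before replication (TB)
    replication        -- copies kept of each block (HDFS defaults to 3)
    overhead           -- assumed extra share for temporary/intermediate data
    usable_tb_per_node -- assumed usable disk capacity per worker node (TB)
    yearly_growth      -- assumed yearly growth rate (0.5 means +50% per year)
    years              -- planning horizon in years
    """
    # Project the raw volume forward with compound growth.
    projected_raw = raw_tb * (1 + yearly_growth) ** years
    # Account for replicated copies and working space.
    required_tb = projected_raw * replication * (1 + overhead)
    # Round up: a fraction of a node still means one more node.
    return math.ceil(required_tb / usable_tb_per_node)


if __name__ == "__main__":
    print(estimate_worker_nodes(100))
    print(estimate_worker_nodes(100, yearly_growth=0.5, years=2))
```

A storage-only estimate like this is just a starting point; CPU, memory and workload mix usually shift the final number.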
Security and compliance
A closer look at architectural designs with security in mind, covering topics like authorization, authentication, transparency and encryption.
Discussing people and skills, data governance, external monitoring or high availability.
LABS: Building a cluster in the cloud
In this lab each participant will create the latest Cloudera CDH cluster using Cloudera Director in a public cloud environment.
Connecting to the existing enterprise ecosystem
The best methods for collecting files, exporting data from RDBMS systems and connecting to cloud storage.
The options available for real-time streaming and the things to consider during implementation.
- Best practices
Hadoop Distributed File System
A closer look at the distributed file system and when it’s best to use it. We’ll also cover some basic operational tasks and health checks.
New storage technologies (Apache Kudu)
Kudu is a newer storage technology developed to tackle use cases in the IoT domain in particular. We’ll introduce this technology and compare it with other storage options.
Apache Kafka is the go-to technology for real-time data processing. Though Kafka’s functionality is rather simple, it can be a challenge to understand what’s really happening inside. This discussion covers basic setup, operation and monitoring.
Creating a basic real-time pipeline with Flume, Kafka and Kudu.
Basic operations on HDFS
A walk through the most common operations on the Hadoop Distributed File System: moving data around, setting file permissions, removing data and emptying the trash.
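The operations listed above map onto the standard `hdfs dfs` command line. The sketch below is only an illustration: it builds the commands as Python argument lists and prints them in dry-run mode by default. All paths, the file name `events.csv` and the `etl:analysts` owner and group are hypothetical, and actually running the commands assumes the `hdfs` client is installed and configured for a cluster.

```python
import shlex
import subprocess


def hdfs_dfs(*args):
    """Build the argument list for one `hdfs dfs` invocation."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths and names, used purely for illustration.
COMMON_OPERATIONS = [
    # Create directories (with parents) and upload a local file.
    hdfs_dfs("-mkdir", "-p", "/data/landing", "/data/archive"),
    hdfs_dfs("-put", "events.csv", "/data/landing/"),
    # Move data around inside HDFS.
    hdfs_dfs("-mv", "/data/landing/events.csv", "/data/archive/events.csv"),
    # Set file permissions and ownership.
    hdfs_dfs("-chmod", "640", "/data/archive/events.csv"),
    hdfs_dfs("-chown", "etl:analysts", "/data/archive/events.csv"),
    # Remove data: goes to the user's trash when trash is enabled.
    hdfs_dfs("-rm", "-r", "/data/landing"),
    # Remove data permanently, bypassing the trash.
    hdfs_dfs("-rm", "-r", "-skipTrash", "/data/archive"),
    # Trigger cleanup of expired trash checkpoints.
    hdfs_dfs("-expunge"),
]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(COMMON_OPERATIONS)
```

Keeping the dry run as the default makes it safe to review the sequence before pointing it at a real cluster.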
Kafka Messaging System
A fun and simple way to play around with Kafka is to create a messaging system where topics are used as group chat spaces.
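To make the idea concrete before touching a real cluster, here is a small in-memory stand-in for the "topic as chat room" model. It does not use any Kafka client or API. The class and method names are invented for this sketch, and it only mimics two Kafka concepts: a topic as an append-only log, and a per-reader offset that tracks how far each participant has read.

```python
from collections import defaultdict


class ChatLog:
    """In-memory imitation of topics used as chat rooms (not real Kafka)."""

    def __init__(self):
        # room name -> append-only list of (sender, text) messages
        self._topics = defaultdict(list)
        # (reader, room) -> index of the next unread message
        self._offsets = defaultdict(int)

    def send(self, room, sender, text):
        """Append a message to a room and return its offset in the log."""
        self._topics[room].append((sender, text))
        return len(self._topics[room]) - 1

    def poll(self, reader, room):
        """Return messages the reader has not seen yet and advance the offset."""
        start = self._offsets[(reader, room)]
        messages = self._topics[room][start:]
        self._offsets[(reader, room)] = start + len(messages)
        return messages


if __name__ == "__main__":
    chat = ChatLog()
    chat.send("lunch", "alice", "Pizza today?")
    chat.send("lunch", "bob", "Sure.")
    print(chat.poll("carol", "lunch"))  # carol reads both messages
    print(chat.poll("carol", "lunch"))  # nothing new on the second poll
```

Because each reader keeps its own offset, a late joiner still sees the full history of a room, which is the behaviour that makes the chat exercise a useful way to explore topics and consumers.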
SQL (Impala, Hive)
SQL is still relevant as it offers a well-known and simple way for data analysts to query data. Impala and Hive are among the most popular and commonly used engines in the Big Data world. We’ll take a closer look at these engines and see where they work best.
Apache Spark needs little introduction as it’s one of the world’s most widely adopted open source projects. It’s a great framework for large-scale data processing, but it’s wise to understand when and how to use it correctly.
In large enterprises it’s very likely that clusters are multi-tenant. They carry diverse workloads, so knowing how to manage resources correctly is vital to keeping end users productive.
Playing with HUE (Hadoop User Experience)
HUE is a basic tool within the Cloudera environment that simplifies access to various data on HDFS and other supported storage spaces. This exercise goes through the basic features.
Connecting BI tools
Preparing for data science
Clusters for data science workloads
Data science techniques like machine learning are opening new innovation paths for enterprises. To get the most out of your data, researchers should be able to access all data with the freedom to work how they want, when they want. We’ll look at the latest available options.
LABS: Implementing Notebooks (Jupyter, Cloudera Data Science Workbench)