Introduction
Data Science with Python Masterclass E-Learning Training
This journey, with more than 120 hours of online content, first provides a foundation in data architecture, statistics, and programming skills for data analysis using Python and R. That foundation is the first step toward moving away from disparate and legacy data sources. You then learn to wrangle data with Python and R and to integrate that data with Spark and Hadoop. Next, you learn how to operationalize and scale data while taking compliance and governance into account. To complete the journey, you learn how to take that data and visualize it in order to make smart business decisions.
Description
This learning path, with more than 120 hours of online content, is divided into the following four tracks:
- Data Science Track 1: Data Analyst
- Data Science Track 2: Data Wrangler
- Data Science Track 3: Data Ops
- Data Science Track 4: Data Scientist
Data Science Track 1: Data Analyst
This track focuses on the data analyst role, covering Python, R, architecture, statistics, and Spark.
Course content
Data Architecture Primer
Course: 1 Hour, 4 Minutes
- Course Overview
- Data Defined
- Data Privacy
- The Data Lifecycle
- SQL vs. NoSQL
- Create an Entity Relationship Diagram
- Implement a SQL Solution
- Implement a NoSQL Solution
- Big Data
- Data Architecture and Governance
- IT Data System Architecture Types
- Data Analytics and Reporting
- Exercise: Implement Data Architecture Best Practices
Data Engineering Fundamentals
Course: 46 Minutes
- Course Overview
- Overview of Distributed Systems
- Batch vs. In-Memory Processing
- NoSQL Stores
- Tools for Data Management
- What is ETL?
- ETL with Talend Open Studio
- Data Modeling
- AI and Machine Learning
- Data Partitioning
- Data Engineering
- Data Reporting
- Exercise: Create a Data Model
Python for Data Science: Introduction to NumPy for Multi-dimensional Data
Course: 1 Hour
- Course Overview
- Introduction to NumPy and the NumPy Ecosystem
- Array Creation - Part 1
- Array Creation - Part 2
- Printing Arrays
- Basic Array Operations
- Universal Functions
- Indexing and Slicing
- Iterating Over Arrays
- Reshaping Arrays
- Exercise: Python NumPy Array Operations
Python for Data Science: Advanced Operations with NumPy Arrays
Course: 1 Hour, 8 Minutes
- Course Overview
- Splitting NumPy Arrays
- Images as Arrays
- Image Manipulation Using NumPy
- Views and NumPy Arrays
- Deep Copies of Arrays
- Introduction to Index Masks
- Applying Index Masks
- Indexing with Boolean Masks
- Structured Arrays
- Understanding Array Broadcasting
- Applying Broadcasting Rules on Array Operations
- Exercise: NumPy Multi-dimensional Array Operations
Python for Data Science: Introduction to Pandas
Course: 1 Hour, 6 Minutes
- Course Overview
- Features of Pandas and the Pandas Ecosystem
- Introduction to Pandas
- Work with Pandas
- Introduction to DataFrames
- Work with DataFrames
- Load Data into a DataFrame
- Add and Delete DataFrame Contents
- Select Parts of a DataFrame
- Access Pandas DataFrames
- Introduction to Multi-Indexing in a DataFrame
- Reshape DataFrames
- Reshape DataFrames Using Stack and Melt Operations
- Exercise: Pandas for Basic Tabular Data Manipulation
Python for Data Science: Manipulating and Analyzing Data in Pandas DataFrames
Course: 45 Minutes
- Course Overview
- Iterating Over the Contents of a DataFrame
- Exporting a DataFrame
- Sorting
- Handling Missing Data
- Grouping with a Multi-Index
- Merging DataFrames
- Applying Join Operations on DataFrames
- Pandas and Relational Databases
- Exercise: Pandas for Advanced Data Manipulation
R for Data Science: Data Structures
Course: 52 Minutes
- Course Overview
- Creating Vectors
- Manipulating Vectors
- Sorting Vectors
- Using Lists
- Creating Matrices
- Matrix Operations
- Creating Factors
- Creating Data Frames
- Data Frame Operations
- Exercise: Creating and Using a Data Frame
R for Data Science: Importing and Exporting Data
Course: 34 Minutes
- Course Overview
- Reading from CSV
- Reading from Excel
- Reading from HTML
- Exporting to CSV
- Exporting to Excel
- Exporting to HTML
- Exercise: Reading and Writing Data
R for Data Science: Data Exploration
Course: 41 Minutes
- Course Overview
- Creating dplyr Tables
- Selecting Subsets
- Filtering Tabular Data
- Piping Data
- Mutating Data
- Summarizing Data
- Combining Datasets
- Grouping Data
- Exercise: Querying Data
R for Data Science: Regression Methods
Course: 37 Minutes
- Course Overview
- Linear Data Preparation
- Creating Linear Models
- Interpreting Model Output
- Using Linear Prediction
- Logistic Data Preparation
- Using glm
- Exercise: Creating a Linear Model
R for Data Science: Classification & Clustering
Course: 39 Minutes
- Course Overview
- Preparing Data for Classification
- Using rpart
- Using ctree
- Preparing Data for Clustering
- Using K-Means Clustering
- Using Hierarchical Clustering
- Exercise: Creating a Decision Tree
Data Science Statistics: Simple Descriptive Statistics
Course: 1 Hour, 11 Minutes
- Course Overview
- Descriptive and Inferential Statistics
- Population vs. Sample
- Probability vs. Non-Probability Sampling
- Mean
- Median
- Mode
- IQR
- Variance
- Exercise: Using Descriptive Statistics
Data Science Statistics: Common Approaches to Sampling Data
Course: 47 Minutes
- Course Overview
- Terms in Sampling
- Sampling Bias
- Simple Random Sampling
- Systematic Random Sampling
- Stratified Sampling
- Non-Probability Sampling
- Exercise: Efficient and Correct Sampling
Data Science Statistics: Inferential Statistics
Course: 1 Hour, 2 Minutes
- Course Overview
- Gaussian Distribution
- Inferential Statistics and Hypothesis Testing
- Simplified Example of Hypothesis Testing
- T-tests
- Skewness and Kurtosis
- Correlation and Autocorrelation
- Introducing Linear Regression
- Overfitting and Goodness-of-Fit
- Exercise: Basic Inferential Statistics
Accessing Data with Spark: An Introduction to Spark
Course: 1 Hour, 7 Minutes
- Course Overview
- Introduction to Spark and Hadoop
- Resilient Distributed Datasets (RDDs)
- RDD Operations
- Spark DataFrames
- Spark Architecture
- Spark Installation
- Working with RDDs
- Creating DataFrames from RDDs
- Contents of a DataFrame
- The SQLContext
- The map() Function of an RDD
- Accessing the Contents of a DataFrame
- DataFrames in Spark and Pandas
- Exercise: Working with Spark
Getting Started with Hadoop: Fundamentals & MapReduce
Course: 1 Hour, 4 Minutes
- Course Overview
- An Introduction to Big Data
- Building Systems to Scale with Data
- A Quick Overview of Hadoop
- MapReduce Overview
- The Map Phase of a MapReduce
- The Shuffle and Reduce Phases
- Exercise: Fundamentals of Hadoop and MapReduce
Getting Started with Hadoop: Developing a Basic MapReduce Application
Course: 1 Hour, 14 Minutes
- Course Overview
- Provisioning a Hadoop Cluster on the Cloud
- Browsing the Hadoop Web Applications
- Creating a MapReduce Project
- Coding the Map Phase
- Coding the Reduce Phase
- Defining the Driver Program
- Building the Application
- Executing the MapReduce Application
- Exercise: Developing a Basic MapReduce Application
Hadoop HDFS: Introduction
Course: 1 Hour, 15 Minutes
- Course Overview
- Scaling Datasets
- Horizontal Scaling for Big Data
- Distributed Clusters and Horizontal Scaling
- Overview of HDFS
- HDFS Architectures
- MapReduce for HDFS
- YARN for HDFS
- The Mechanism of Resource Allocation in Hadoop
- Apache Zookeeper for HDFS
- The Hadoop Ecosystem
- Exercise: An Introduction to HDFS
Hadoop HDFS: Introduction to the Shell
Course: 53 Minutes
- Course Overview
- Creating a Hadoop Cluster on the Google Cloud
- Exploring Hadoop Clusters
- The YARN Cluster Manager UI
- The HDFS NameNode UI
- Browsing the Packaged Hadoop Tools
- Configuring HDFS
- The HDFS Shells
- Exercise: Introduction to the HDFS Shell
Hadoop HDFS: Working with Files
Course: 48 Minutes
- Course Overview
- Basic Directory Commands in HDFS
- Using the copyFromLocal Command in HDFS
- Using the put Command in HDFS
- Using the copyToLocal Command in HDFS
- Retrieving Files from HDFS
- Append and Delete Operations in HDFS
- Exercise: Working with Files on HDFS
Hadoop HDFS: File Permissions
Course: 49 Minutes
- Course Overview
- The HDFS count and du Commands
- Viewing and Setting File Permissions in HDFS
- Applying Permissions Recursively in HDFS
- An Introduction to Bash Scripting
- Scripting HDFS Operations
- Exploring the HDFS NameNode UI
- Cleanup Operations in HDFS
- Exercise: File Permissions on HDFS
Data Silos, Lakes, & Streams: Introduction
Course: 1 Hour, 20 Minutes
- Course Overview
- Data Silos
- Data Lakes
- Characteristics of Data Lakes
- Data Lake Architecture, Features, and Challenges
- Data Warehouses
- Data Warehouses vs. Data Lakes
- Data Streams
- Migrating Data to AWS
- Data Lakes on AWS
- Working with Data Lakes on AWS
- Exercise: Data Silos, Lakes, and Streams
Data Silos, Lakes, and Streams: Data Lakes on AWS
Course: 1 Hour, 10 Minutes
- Course Overview
- Create a Role for the AWS Glue Service
- Upload Data to S3
- Explore the Glue Web Console
- Manually Create Glue Tables
- Query the Data Lake Using Amazon Athena
- Configure and Run Glue Crawlers
- Access Data in Crawled Tables
- Crawl Multiple CSV Files in the Same Folder Path
- Merge Data in Multiple Files in the Same Folder Path
- Work with Files Having the Exact Same Schema
- Exercise: Data Lakes on AWS with S3 and Glue
Data Silos, Lakes, & Streams: Sources, Visualizations, & ETL Operations
Course: 1 Hour, 29 Minutes
- Course Overview
- Set Up a Redshift Cluster
- Create Tables and Load Data From S3
- Establish a JDBC Connection to Redshift
- Crawl Redshift Using a JDBC Connection
- Crawl DynamoDB
- Configure QuickSight to Visualize Data
- Visualize Data in QuickSight
- Configure a Job to Perform Extract, Transform, Load
- Execute an ETL Operation in Glue
- Perform ETL to Back Up Redshift Data in S3 Buckets
- Perform ETL to Back Up DynamoDB Data in S3 Buckets
- Exercise: Multiple Sources, Visualizations, and ETL
Data Analysis Application
Course: 1 Hour, 25 Minutes
- Course Overview
- Install and Configure Anaconda Python
- Install R Using Anaconda
- Use Jupyter Notebook
- Import and Export Data in Python
- Import and Export Data in R
- Deal with Missing Data in R
- Transform Data in R
- Work with NumPy
- Work with Pandas
- Mean, Median, and Mode in R
- Analyze Data with Pandas
- Plot Data in R
- Visualize Data in Python
- Exercise: Perform Data Analysis
Online Mentor
- You can reach your Mentor by entering chats or submitting an email.
Final Exam assessment
- Estimated duration: 65 minutes
Practice Labs: Analyzing Data with Python (estimated duration: 8 hours)
- Practice performing data analysis tasks using Python by configuring VS Code, loading data from SQLite into Pandas, grouping data, and using box plots. Then, test your skills by answering assessment questions after using Python to calculate frequency distribution, measures of center, and coefficient of dispersion.
- This lab provides access to several tools commonly used in data science, including VS Code, Anaconda, Jupyter Notebook + Hub, Pandas, NumPy, SciPy, Seaborn Library, and Spyder IDE.
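The analysis tasks named in the practice lab above could look roughly like the following. This is only a minimal sketch and not the lab's actual material: the `sales` table, its columns, and the sample values are hypothetical, and the coefficient of variation is assumed here as one common coefficient of dispersion.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory table standing in for the lab's SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 20.0), ("south", 30.0),
     ("south", 30.0), ("south", 60.0)],
)
conn.commit()

# Load the table from SQLite into a Pandas DataFrame.
df = pd.read_sql_query("SELECT region, amount FROM sales", conn)
conn.close()

# Frequency distribution: number of rows per region.
freq = df["region"].value_counts()

# Measures of center for the numeric column.
mean_amount = df["amount"].mean()
median_amount = df["amount"].median()
mode_amount = df["amount"].mode().iloc[0]

# Coefficient of variation (assumed choice of dispersion coefficient):
# sample standard deviation divided by the mean.
cv = df["amount"].std() / mean_amount

# Grouping: mean amount per region.
by_region = df.groupby("region")["amount"].mean()

print(freq, mean_amount, median_amount, mode_amount, cv, by_region, sep="\n")
```

In a notebook, `df.boxplot(column="amount", by="region")` would draw the box plots the lab mentions, provided Matplotlib is installed.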
Data Science Track 2: Data Wrangler
In this track, the focus will be on the data wrangler role. We will explore areas such as wrangling with Python, Mongo, and Hadoop.
Content: E-learning courses
Data Wrangling with Pandas: Working with Series & DataFrames
Course: 1 Hour, 11 Minutes
- Course Overview
- Installing Pandas
- Pandas Series Objects
- Operations on Series
- Appending and Sorting Series Values
- Pandas DataFrames
- Indexing Operations with DataFrames
- Missing Data
- Column Aggregations
- Statistical Operations
Data Wrangling with Pandas: Visualizations and Time-Series Data
Course: 1 Hour, 29 Minutes
- Course Overview
- Pandas and Matplotlib for Visualizations
- Pie Charts, Box Plots, and Scatter Plots
- Time-Series Data
- Deltas and Percentage Change Calculations
- Time Deltas and Date Ranges
- Mismatched DataFrames and Missing Data
- Working with String Data
- Advanced Operations on Strings
- Applying Functions on Series
- Transforming Data With User-Defined Functions
- Applying Functions on DataFrames
- Exercise: Plot Charts and Transform Column Values
Data Wrangling with Pandas: Advanced Features
Course: 1 Hour, 12 Minutes
- Course Overview
- Grouping and Aggregations
- MultiIndex DataFrames
- Grouping and Aggregations with MultiIndex DataFrames
- General Aggregation Functions
- Filtering
- Masking Column Values
- Working with Duplicates
- Working with Categorical Data
- Filtering, Adding, and Removing Categories
- Reindexing
- Exercise: Filtering, Duplicates and Categorical Data
Data Wrangler 4: Cleaning Data in R
Course: 1 Hour, 3 Minutes
- Course Overview
- Types of Unclean Data
- Data Quality
- Downloading JSON Data
- Excel Sheets
- Reading Dirty CSVs
- Querying Relational Databases
- Joining Tabular Data
- Spreading Data
- Summarizing Data
- Imputing Data
- Extracting Matches
- Exercise: Wrangling Data
Data Tools: Technology Landscape & Tools for Data Management
Course: 27 Minutes
- Course Overview
- Technology Landscape and Tools
- Tool Comparison
- Machine Learning in Data Analytics
- Machine Learning Tools
- Machine Learning Implementation
- Python and R for Data Management
- Cloud and Machine Learning
- Exercise: Implement Machine Learning on Scikit-learn
Data Tools: Machine Learning & Deep Learning in the Cloud
Course: 23 Minutes
- Course Overview
- Microsoft Machine Learning Toolkit
- AWS and Machine Learning
- Spark Machine Learning Capabilities
- Deep Learning Frameworks
- Deep Learning Implementation
- Data Mining and Analytical Tools
- KNIME Capabilities
- Exercise: Implement Deep Learning
Trifacta for Data Wrangling: Wrangling Data
Course: 50 Minutes
- Course Overview
- Standardizing Data
- Formatting Dates
- Filtering Rows
- Replacing Values
- Counting Matches
- Splitting Columns
- Merging Columns
- Extracting Data
- Conditional Aggregation
- Reshaping Data
- Joining Data
- Exercise: Wrangling Data
MongoDB for Data Wrangling: Querying
Course: 1 Hour, 8 Minutes
- Course Overview
- Introduction to PyMongo
- Document Structure
- CRUD Operations
- ObjectID and Timestamp
- Query Operations
- Projection Queries
- Comparison Operators
- Element Query Operators
- The Regex Operator
- Using the Size and All Operators
- Text Search
- Using mongoimport
- Using mongoexport
- Exercise: Performing a Query
MongoDB for Data Wrangling: Aggregation
Course: 51 Minutes
- Course Overview
- Aggregation Framework
- Using Group
- Using Match
- Using Project
- Using Limit and Sort
- Using Unwind
- Using Lookup
- Using Indexes
- Using Geospatial Indexes
- Exercise: Performing an Aggregate Query
Getting Started with Hive: Introduction
Course: 56 Minutes
- Course Overview
- Hive as a Data Warehouse
- Overview of Relational Databases
- OLTP and OLAP
- Hive and the Hadoop Ecosystem
- HiveServer and The Metastore
- Hive on Cloud Computing Platforms
- Data Types in Hive
- Data and Tables in Hive
- Exercise: Introduction to Hive
Getting Started with Hive: Loading and Querying Data
Course: 1 Hour, 20 Minutes
- Course Overview
- Setting up a Hadoop Cluster on the Google Cloud
- Creating a Hive Table
- Running Simple Queries in Hive
- Executing Hive Queries from the Shell
- Joining Tables in Hive
- Exploring the Hive Warehouse
- External Tables in Hive
- Modifying Tables in Hive
- Temporary Tables in Hive
- Loading Data into Tables in Hive
- Populating Multiple Tables in Hive
- Exercise: Loading and Querying Data in Hive
Getting Started with Hive: Viewing and Querying Complex Data
Course: 1 Hour, 14 Minutes
- Course Overview
- The Array Data Type in Hive
- The Map Data Type in Hive
- The Struct Type in Hive
- The explode and posexplode Functions in Hive
- Lateral Views in Hive
- Multiple Lateral Views in Hive
- Set Operations in Hive
- The IN and EXISTS clauses in Hive
- Creating and Populating Tables in Hive
- Views in Hive
- Exercise: Viewing and Querying Complex Data
Getting Started with Hive: Optimizing Query Executions
Course: 43 Minutes
- Course Overview
- Hive Queries as MapReduce Jobs
- Techniques to Improve Query Performance in Hive
- Partitioning Tables in Hive
- Bucketing Tables in Hive
- Structuring Join Queries in Hive
- Exercise: Optimizing Query Execution in Hive
Getting Started with Hive: Optimizing Query Executions with Partitioning
Course: 1 Hour, 1 Minute
- Course Overview
- Setting up a Hadoop Cluster on the Google Cloud
- Creating a Partitioned Table in Hive
- Working with Partitions in Hive
- Populating Partitions in Hive
- Partitioning External Tables in Hive
- Modifying Partitions in Hive
- Dynamic Partitions in Hive
- Using Multiple Columns for Partitioning in Hive
- Exercise: Optimize Executions with Partitioning
Getting Started with Hive: Bucketing & Window Functions
Course: 1 Hour, 4 Minutes
- Course Overview
- Apply Bucketing for a Table in Hive
- Using Bucketing and Partitioning Together in Hive
- Sorting a Bucket's Contents in Hive
- Sampling a Table in Hive
- Joining Multiple Tables in Hive
- Introducing Window Functions in Hive
- Window Functions with Partitions in Hive
- Exercise: Bucketing and Window Functions in Hive
Getting Started with Hadoop: Filtering Data Using MapReduce
Course: 59 Minutes
- Course Overview
- Counting the Data Points in Each Category
- The Reducer and Driver Programs
- Building and Executing the Application
- A Simple Filter Using MapReduce
- Executing and Examining the Output
- Extracting the Unique Values in a Column
- Viewing the Distinct Values Extracted
- Exercise: Filtering Data Using MapReduce
Getting Started with Hadoop: MapReduce Applications With Combiners
Course: 1 Hour, 24 Minutes
- Course Overview
- Combiners in MapReduce
- Revisiting MapReduce
- Working with Combiners
- Using Combiners for Calculating Averages
- Creating a Project to Calculate Averages
- Coding the Map and Reduce Phases
- Configure the Application in the Driver
- Executing the Application and Examining the Output
- Adding a Combiner to a MapReduce Application
- Conveying a Pair of Numbers from the Mapper
- Running the Fixed Application
- Exercise: Optimizing MapReduce With Combiners
Getting Started with Hadoop: Advanced Operations Using MapReduce
Course: 49 Minutes
- Course Overview
- Defining a User-Defined Type for a PriorityQueue
- Implementing a PriorityQueue in a Mapper
- Using a PriorityQueue in a Reducer
- Running and Verifying the Results
- Building an Inverted Index - Map Phase
- Building an Inverted Index - Reduce Phase
- Executing the Application and Viewing the Index
- Exercise: Advanced Operations Using MapReduce
Accessing Data with Spark: Data Analysis Using the Spark DataFrame API
Course: 1 Hour, 12 Minutes
- Course Overview
- Performance Improvements in Spark
- Broadcast Variables and Accumulators
- Loading Data into a DataFrame
- Sampling the Contents of a DataFrame
- Grouping and Aggregations
- Visualizing Data in a DataFrame
- Trimming and Cleaning Data
- User-Defined Functions and DataFrames
- Combining Filters, Aggregations, and Sorting
- Using Broadcast Variables
- Using Accumulators
- Exporting DataFrame Contents
- Custom Accumulators
- Join Operations
- Exercise: Data Analysis Using the DataFrame API
Accessing Data with Spark: Data Analysis using Spark SQL
Course: 55 Minutes
- Course Overview
- The Spark Catalyst Optimizer
- Introduction to Spark SQL
- Preparing Data for Analysis
- Running SQL Queries
- Inferred and Explicit Schemas
- Windowing in Spark
- Applying Window Functions
- Exercise: Data Analysis Using Spark SQL
Data Lake: Framework & Design Implementation
Course: 34 Minutes
- Course Overview
- Data Lakes and Data Warehouses
- Data Lake Selection Criteria
- Data Lake and Data Democratization
- Data Lake Design Principles
- AWS Data Lake Architecture
- Implement AWS Data Store
- Data Lake For On-Premise and Multi-Cloud
- Data Processing Frameworks for Data Lake
- Exercise: Implement AWS Data Store
Data Lake: Architectures & Data Management Principles
Course: 35 Minutes
- Course Overview
- Real-Time Big Data Architectures
- Data Lake Reference Architecture
- Data Ingestion and File Formats
- Ingestion Using Sqoop
- Data Processing Strategies
- Deriving Value from Data Lakes
- Data Life Cycle
- S3 and Glacier
- Exercise: Ingest Data and Implement Archival Policy
Data Architecture - Deep Dive: Design & Implementation
Course: 36 Minutes
- Course Overview
- Data Complexity Management Strategies
- Data Modeling Process
- Distributed Data Management
- Partitioning Methods and Criteria
- MongoDB Partitioning
- Hybrid Data Architectures
- Implement Directed Acyclic Graph
- CAP Theorem
- Batch vs. Streaming
- Read and Write Concerns
- Exercise: Implement Serverless Architecture
Data Architecture - Deep Dive: Microservices & Serverless Computing
Course: 26 Minutes
- Course Overview
- Microservices and Data
- Serverless and Lambda Architecture
- Lambda Implementation
- Cluster Benefits
- Data Architecture Types
- Data Discovery Process
- Data Risk Types
- Data POC
- Exercise: Implement Lambda Architecture
Online Mentor
- You can reach your Mentor by entering chats or submitting an email.
Final Exam assessment
- Estimated duration: 90 minutes
Nova Learning, January 2021
Practice Labs: Data Wrangling with Python (estimated duration: 8 hours)
- Perform data wrangling tasks including using a Pandas DataFrame to convert multiple Excel sheets to separate JSON documents, extract a table from an HTML file, use mean substitution, and convert dates within a DataFrame. Then, test your skills by answering assessment questions after using a Pandas DataFrame to convert a CSV document to a JSON document, replace missing values with a default value, split a column with a delimiter, and combine two columns by concatenating text.
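Several of the wrangling tasks listed for this lab can be sketched in a few lines of Pandas. The example below is illustrative only and is not taken from the lab: the CSV content, the column names, and the default value `"Unknown"` are all hypothetical.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content standing in for a lab input file.
csv_text = """full_name,city,score
Ada Lovelace,London,90
Alan Turing,,70
Grace Hopper,New York,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Replace missing text values with a default value.
df["city"] = df["city"].fillna("Unknown")

# Mean substitution: fill missing numbers with the column mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Split a column on a delimiter into two new columns.
df[["first", "last"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Combine two columns by concatenating text.
df["label"] = df["last"] + ", " + df["city"]

# Convert the result to a JSON document (one object per row).
json_text = df.to_json(orient="records")
records = json.loads(json_text)
print(json_text)
```

Using `orient="records"` is a design choice for readability; other orientations such as `"index"` or `"columns"` produce differently shaped JSON from the same DataFrame.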
Data Science Track 3: Data Ops
The track's objective is to help prepare the learner for a Data Ops role with a focus on governance, security, and harnessing volume and velocity.
Content: E-learning courses
Deploying Data Tools: Data Science Tools
Course: 48 Minutes
- Course Overview
- Data Science Platform
- Challenges of Deploying Data Science Tools
- Considerations for Data Science Tools
- Data Science Workflow
- Data Science Analytic Tools
- Data Science Visualization Tools
- Data Science Database Tools
- Benefits of Deploying Cloud-Based Tools
- Challenges of Deploying Cloud-Based Tools
- What is DevOps
- DevOps for Data Science
- Exercise: Identifying Uses of Data Science Tools
Delivering Dashboards: Management Patterns
Course: 34 Minutes
- Course Overview
- Analytical Visualization
- Dashboard Types
- Data Management
- Dashboard Components
- Dashboard Best Practices
- Dashboard Using ELK
- Dashboard Using Power BI
- Chart Selection Criteria
- Leaderboards and Scorecards
- Scorecard Types
- Exercise: Create Dashboards with Power BI and ELK
Delivering Dashboards: Exploration & Analytics
Course: 31 Minutes
- Course Overview
- Data Exploration Using Charts
- Analytical Visualization Tools
- Bar and Line Charts
- Dashboarding with Kibana
- Dashboard Sharing with Kibana
- Dashboarding with Tableau
- Dashboarding with QlikView
- Data Ingest and Dashboards
- Dashboard Patterns
- Monitoring Dashboards
- Exercise: Create Dashboards Using Kibana and Tableau
Cloud Data Architecture: DevOps & Containerization
Course: 45 Minutes
- Course Overview
- Containerization on the Cloud
- Benefits of Containers
- Serverless Computing
- DevOps in the Cloud
- AWS OpsWorks
- Storage Classification
- Cloud and Machine Learning
- Cloud and BI Analytics
- Exercise: Containerization and Serverless Computing
Compliance Issues and Strategies: Data Compliance
Course: 44 Minutes
- Course Overview
- Data Compliance Issues
- Data Regulations
- The Importance of Global Standards
- Risk and Company Standards
- Myths and Facts of Data Compliance
- Compliance Training for Users
- Compliance Training for Management
- The Benefits of a Data Compliance Program
- Elements of a Good Compliance Strategy
- Building a Compliance Strategy
- Reporting and Response Procedures
- Exercise: Explain the Importance of Data Compliance
Implementing Governance Strategies
Course: 46 Minutes
- Course Overview
- Governance and its Relationship with Big Data
- Why Big Data Requires Governance
- Requirements for Big Data Governance
- Why is Big Data Different?
- Identifying Data
- Identifying Stakeholders
- Cloud Technologies and Data Governance
- Designing a Data Governance Process
- Managing a Data Governance Strategy
- Monitoring a Data Governance Strategy
- Maintaining a Data Governance Strategy
- Exercise: Defining Data Governance Strategies
Data Access & Governance Policies: Data Access Oversight and IAM
Course: 59 Minutes
- Course Overview
- Data Access Governance
- Risk and Data Safety Compliance
- Data Access Patterns
- Data Breach Prevention
- Least Privilege
- Assign and View Effective File System Permissions
- Identity and Access Management
- Create an AWS IAM User and Group
- Assign AWS IAM Group Permissions
- Vulnerability Assessments
- Implement Effective Security Controls
- Exercise: Implement Data Access Governance Solutions
Data Access & Governance Policies: Data Classification, Encryption, and Monitoring
Course: 1 Hour, 19 Minutes
- Course Overview
- Data Classification
- Classify Data Using Microsoft FSRM
- Data Encryption
- Encrypt Data at Rest
- Encrypt Data in Motion
- Implement Security Compliance Checking
- Examine Data Access Trends
- Data Access Monitoring Solutions
- Logging, Auditing, and Data Analytics
- Configure a Custom Filtered Log View
- Enable Windows Data Access Auditing
- Exercise: Implement Data Confidentiality
Streaming Data Architectures: An Introduction to Streaming Data
Course: 51 Minutes
- Course Overview
- Introduction to Streaming Data
- The Stream Processing Model
- The Message Transport
- Stream Processing with RDDs
- Structured Streaming for Continuous Applications
- Streaming vs Structured Streaming
- Triggers and Output Modes
- Exercise: Working with Streaming Data
Streaming Data Architectures: Processing Streaming Data
Course: 53 Minutes
- Course Overview
- PySpark Setup
- Setting Up a Socket Stream with Netcat
- The Update Output Mode
- Using a File Input Stream
- The Append Output Mode
- The Complete Output Mode
- Aggregations on Streaming Data
- SQL Operations on Streaming Data
- User-Defined Functions (UDFs)
- Exercise: Processing Streaming Data
Scalable Data Architectures: Introduction
Course: 53 Minutes
- Course Overview
- Scalable Architectures with Distributed Computing
- Introducing Data Warehouses
- Contrasting Warehouses with Relational Databases
- Data Warehouses for Analytical Processing
- Data Warehouse Architectural Components
- Amazon Redshift - A Data Warehouse on the Cloud
- Exercise: Scalable Data Architectures
Scalable Data Architectures: Introduction to Amazon Redshift
Course: 55 Minutes
- Course Overview
- Provisioning a Redshift Cluster Using Quick Launch
- Creating a Redshift Cluster With Additional Detail
- Exploring the Redshift Configs and Metrics
- Attaching an IAM Role to a Redshift Cluster
- Creating an AWS User to Work With Redshift
- Installing and Configuring the AWS CLI
- Running Queries from the Redshift Query Editor
- Exercise: An Introduction to Amazon Redshift
Scalable Data Architectures: Working with Amazon Redshift & QuickSight
Course: 1 Hour, 18 Minutes
- Course Overview
- Loading Data from Amazon S3 to a Redshift Cluster
- Running Queries and Evaluating Their Execution
- Querying a Redshift Cluster Using a SQL client
- Working with Automated Snapshots
- Restoring Tables from a Snapshot
- Horizontal Scaling of a Redshift Cluster
- Vertical and Horizontal Scaling of a Cluster
- Configuring Access from QuickSight to Redshift
- Loading a Dataset to QuickSight
- Creating Visualizations with QuickSight
- Exercise: Working with Redshift and QuickSight
Building Data Pipelines
Course: 1 Hour, 10 Minutes
- Course Overview
- Data Pipelines Overview
- Traditional ETL Pipeline with Batch Processing
- Data Pipeline Tools
- Setup and Install Airflow
- Apache Airflow
- Airflow Workflows
- Airflow Tasks
- Airflow Dependencies
- ETL Pipeline with Airflow
- Automated Pipeline without ETL
- Airflow Command Line Testing
- Exercise: Using Apache Airflow
Data Pipeline: Process Implementation Using Tableau & AWS
Course: 39 Minutes
- Course Overview
- Data Pipeline
- Data Pipeline Processes
- Data Pipeline Stages
- Data Pipeline Technologies
- Data Source Types
- Scheduled Data Pipeline
- Tableau Server and Utilities
- Data Pipeline Using Tableau
- Data Pipeline on AWS
- Exercise: Build Data Pipelines with Tableau
Data Pipeline: Using Frameworks for Advanced Data Management
Course: 33 Minutes
- Course Overview
- Celery and Luigi
- Data Pipeline with Python Luigi
- Working with Dask Library
- Dask Arrays
- Data Exploration and Visualization Frameworks
- Spark and Tableau
- Streaming Data Visualization with Python
- Data Pipeline Open Source Tools
- Exercise: Implement Data Pipelines with Luigi
Data Sources: Integration
Course: 40 Minutes
- Course Overview
- Elements of IoT Solutions
- Service Categories in IoT
- IoT Capabilities and Maturity Model
- IoT Design Principles
- IoT Cloud Architectures
- MQTT and XMPP
- IoT Controllers
- IoT Data Management
- Securing IoT
- Exercise: Generating Data Streams
Data Sources: Implementing Edge on the Cloud
Course: 31 Minutes
- Course Overview
- AWS IoT Greengrass
- GCP IoT Edge
- AWS IoT over WebSockets
- IoT Device Simulator
- Generating Streams of Data Using MQTT
- Exercise: Working with IoT Device Simulators
Securing Big Data Streams
Course: 1 Hour, 3 Minutes
- Course Overview
- Big Data Security Concerns
- Streaming Data Security Concerns
- NoSQL Database Security Concerns
- Distributed Processing Security Risks
- Data Mining and Analytics Privacy Flaws
- End-Point Device Tampering Risks
- Secure Big Data
- Secure Data Streams
- Secure Data In Motion
- End-Point Input Validation and Filtering
- Secure Data at Rest with Symmetric Ciphers
- Exercise: Securing Big Data Streams
Harnessing Data Volume & Velocity: Big Data to Smart Data
Course: 39 Minutes
- Course Overview
- Comparing Big Data and Smart Data
- Smart Data and Edge Technologies
- Big Data to Smart Data Formation
- Smart Data and Smart Processes
- Smart Data Use Cases
- Smart Data Life Cycle
- Big Data to Smart Data Using k-NN
- Smart Data Frameworks
- Smart Data to Business
- Clustering Smart Data
- Smart Data Integration
- Exercise: Transform Big Data to Smart Data
Data Rollbacks: Transaction Rollbacks & Their Impact
Course: 36 Minutes
- Course Overview
- Rollback Process
- State of Transactions
- Transaction Types
- SQL Transaction Management
- Transaction Log Operations
- Deadlock Management
- SQL Server Rollback Mechanism
- SQL Server Rollback Mechanism Implementation
- Exercise: Implement Transactions with SQL Server
Data Rollbacks: Transaction Management & Rollbacks in NoSQL
Course: 29 Minutes
- Course Overview
- NoSQL and SQL Transaction Management
- MongoDB Transactions
- Manage Multi-Document Transactions in MongoDB
- Change Data Capture
- Change Stream in MongoDB
- MongoDB Change Stream Implementation
- Exercise: MongoDB Transactions and Change Streams
Online Mentor
- You can reach your Mentor by entering chats or submitting an email.
Final Exam assessment
- Estimated duration: 90 minutes
Practice Labs: Implementing Data Ops with Python (estimated duration: 8 hours)
- Perform data ops tasks with Python including working with row subsets, creating new columns with Regex, performing joins, and spreading rows. Then, test your skills by answering assessment questions after working with field subsets and computed columns, and performing set operations and binding rows.
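As a rough illustration of the lab tasks above, the following sketch shows one way row subsets, a regex-derived column, a join, spreading rows, and binding rows might be done with Pandas. The choice of Pandas, the `orders` and `customers` frames, and every column name are assumptions made for the example, not details from the lab.

```python
import pandas as pd

# Hypothetical sample data.
orders = pd.DataFrame({
    "order_id": ["A-001", "A-002", "B-003"],
    "customer": ["c1", "c2", "c1"],
    "amount": [50, 150, 200],
})
customers = pd.DataFrame({"customer": ["c1", "c2"], "name": ["Ann", "Bob"]})

# Row subset: keep only orders above a threshold.
large = orders[orders["amount"] > 100]

# New column with a regex: extract the leading letter of order_id.
orders["series"] = orders["order_id"].str.extract(r"^([A-Z])-", expand=False)

# Join: attach customer names to each order.
joined = orders.merge(customers, on="customer", how="left")

# Spreading rows: one row per customer name, one column per series.
wide = joined.pivot_table(index="name", columns="series",
                          values="amount", aggfunc="sum", fill_value=0)

# Binding rows: stack two frames that share the same columns.
more = pd.DataFrame({"customer": ["c3"], "name": ["Cy"]})
all_customers = pd.concat([customers, more], ignore_index=True)

print(large, joined, wide, all_customers, sep="\n\n")
```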
Data Science Track 4: Data Scientist
For this track, the focus will be on the Data Scientist role. Here we will explore areas such as visualization, APIs, and ML and DL algorithms.
Content: E-learning courses
Balancing the Four Vs of Data: The Four Vs of Data
Course: 40 Minutes
- Course Overview
- Overview of the Four Vs
- The Importance of Volume
- The Importance of Variety
- The Importance of Velocity
- The Importance of Veracity
- The Relationship Between the Four Vs
- Variety and Data Structure
- Validity and Volatility
- Finding Balance in the Four Vs
- Use Cases
- Extracting Value from the Four Vs
- Exercise: Describe the Four Vs of Big Data
Data Driven Organizations
Course: 1 Hour, 15 Minutes
- Course Overview
- Data Driven Organizations
- Decision Making
- Analytic Maturity
- Analytic Roles
- Data Source Priority
- Facets of Data Quality
- Power BI Data Visualization
- Missing Data
- Duplicate Data
- Truncated Data
- Data Provenance
Raw Data to Insights: Data Ingestion & Statistical Analysis
Course: 54 Minutes
- Course Overview
- Statistical Analysis
- Data Correction
- Outlier Detection
- Data Architecture Pattern
- Data Ingestion Tools
- Kafka and Apache NiFi
- Apache Sqoop Ingest
- Ingest Using WaveFront
Raw Data to Insights: Data Management & Decision Making
Course: 57 Minutes
- Course Overview
- Data-driven Decision Making Framework
- Loading Data into R
- Preparing Data
- Data Correction Approach
- Data Correction Using Simple Transformation
- Data Correction Using Deductive Correction
- Distributed Data Management
- Data Analytics
- Data Analytics Using R
- Predictive Modeling
Tableau Desktop: Real Time Dashboards
Course: 1 Hour, 8 Minutes
- Course Overview
- Introducing Real Time Dashboards
- Creating Real Time Dashboards with Tableau
- Build a Tableau Dashboard
- Real Time Dashboard Updates in Tableau
- Organizing Your Tableau Dashboard
- Formatting Your Tableau Dashboard
- Interactive Tableau Dashboard
- Tableau Dashboard Starters
- Tableau Dashboard Extensions
- Tableau Dashboards and Story Points
- Sharing your Tableau Dashboard
Storytelling with Data: Introduction
Course: 47 Minutes
- Course Overview
- Storytelling Process
- Interpreting Context
- Analysis Types
- Who, What, and How of Storytelling
- Visualization for Storytelling
- Graphical Tools for Data Elaboration
- Storytelling Scenarios
- Storyboarding
Storytelling with Data: Tableau & PowerBI
Course: 57 Minutes
- Course Overview
- Visual Selection
- Slopegraphs
- Bar Charts and Types of Bar Charts
- Clutter and Clutter Elimination
- Gestalt Principle
- Story Design Best Practices
- Tools for Storytelling
- Decluttering
- Crafting Visual Data
- Visual Design Concerns
- Storytelling with Power BI
- Model Visual and Tableau
Python for Data Science: Basic Data Visualization Using Seaborn
Course: 1 Hour, 7 Minutes
- Course Overview
- Introduction to Seaborn
- Install Seaborn
- Simple Univariate Distributions
- Configure Univariate Distribution Plots
- Simple Bivariate Distributions
- Explore Different Types of Bivariate Distributions
- Analyze Multiple Variable Pairs
- Regression Plots
- Themes and Styles in Seaborn
Python for Data Science: Advanced Data Visualization Using Seaborn
Course: 1 Hour, 4 Minutes
- Course Overview
- Searching for Patterns in a Dataset
- Configuring Plot Aesthetics
- Normal Distribution and Outliers
- Distributions Within Categories - Part
- Distributions Within Categories - Part
- Analyzing Categories with Facet Grids - Part
- Analyzing Categories with Facet Grids - Part
- Introducing Color Palettes
- Using Color Palettes
Data Science Statistics: Using Python to Compute & Visualize Statistics
Course: 1 Hour, 16 Minutes
- Course Overview
- An Introduction to Matplotlib
- Analyzing Data Using NumPy and Pandas
- Visualizing Univariate and Bivariate Distributions
- Summary Statistics Using Native Python Functions
- Summary Statistics Using NumPy
- Summary Statistics Using the SciPy Library
- Correlation and Covariance
- Z-score
R for Data Science: Data Visualization
Course: 33 Minutes
- Course Overview
- An Introduction to Matplotlib
- Analyzing Data Using NumPy and Pandas
- Visualizing Univariate and Bivariate Distributions
- Summary Statistics Using Native Python Functions
- Summary Statistics Using NumPy
- Summary Statistics Using the SciPy Library
- Correlation and Covariance
- Z-score
Advanced Visualizations & Dashboards: Visualization Using Python
Course: 38 Minutes
- Course Overview
- Relevance of Data Visualization for Business
- Libraries for Data Visualization in Python
- Python Data Visualization Environment Configuration
- Matplotlib Libraries for Visualization
- Bar Chart Using ggplot
- Bokeh and Pygal
- Select Visualization Libraries
- Interactive Graphs and Image Files
- Plot Graphs
- Multiple Lines in Graphs
Advanced Visualizations & Dashboards: Visualization Using R
Course: 35 Minutes
- Course Overview
- Chart Types
- Stacked Bar Plot
- Animate Plots with Matplotlib
- Plotting in Jupyter Notebook
- Graphics in R
- Heat Map and Scatter Plot in R
- Correlogram and Area Chart in R
- ggplot2 Capabilities
- Customize ggplot2 Graphs
Powering Recommendation Engines: Recommendation Engines
Course: 1 Hour, 5 Minutes
- Course Overview
- Describing Recommendation Engines
- Comparing the Types of Recommendation Engines
- Collecting and Manipulating Data
- Manipulating Data in R
- Describing Similarity and Neighborhoods
- Creating a Recommendation Engine
- Recommending Another Item
- Finding Items to Recommend
- Recommending Items Based on Other Items
- Evaluating a Recommendation System
- Validating a Recommendation System
Data Insights, Anomalies, & Verification: Handling Anomalies
Course: 46 Minutes
- Course Overview
- Data and Anomaly Sources
- Decomposition and Forecasting
- Examine Data Using Randomization Tests
- Anomaly Detection
- Anomaly Detection Techniques
- Anomaly Detection with scikit-learn
- Anomaly Detection Tools
- Anomaly Detection Rules
Data Insights, Anomalies, & Verification: Machine Learning & Visualization Tools
Course: 51 Minutes
- Course Overview
- Machine Learning Anomaly Detection Techniques
- Comparing Anomaly Detection Algorithms
- Anomaly Detection Using R
- Online Anomaly Detection Components
- Online Anomaly Detection Approaches
- Anomaly Detection Use Cases
- Anomaly Detection with Visualization Tools
- Anomaly Detection with Mathematical Approaches
- Cluster-Based Anomaly Detection
Data Science Statistics: Applied Inferential Statistics
Course: 1 Hour, 19 Minutes
- Course Overview
- The One-Sample T-test
- Independent and Paired T-tests
- Testing Hypotheses with T-tests
- Loading and Analyzing a Skewed Dataset
- Measuring Skewness and Kurtosis
- Preparing a Dataset for Regression
- Simple Linear Regression
- Multiple Linear Regression
Data Research Techniques
Course: 33 Minutes
- Course Overview
- Data Research Fundamentals
- Data Research Steps
- Values, Variables, and Observations
- JMP Scale of Measurement
- Non-experimental and Experimental Research
- Descriptive and Inferential Statistical Analysis
- Inferential Tests
- Case Study of Clinical Data Research
- Data Research in Sales Management
Data Research Exploration Techniques
Course: 50 Minutes
- Course Overview
- Fundamentals of Exploratory Data Analysis
- Data Exploration Types
- Working with R
- Data Exploration in R
- Data Exploration Using Plots
- Python Packages for Data Exploration
- Data Exploration Using Python
- Data Research Using Linear Algebra
- Linear Algebra for Data Research
Data Research Statistical Approaches
Course: 43 Minutes
- Course Overview
- Role of Statistics in Data Research
- Discrete vs. Continuous Distribution
- PDF and CDF
- Binomial Distribution
- Interval Estimation
- Point and Interval Estimation
- Data Visualization Techniques
- Data Visualization Using R
- Data Integration Techniques
- Creating Plots
- Missing Values and Outliers
Machine & Deep Learning Algorithms: Introduction
Course: 46 Minutes
- Course Overview
- Machine Learning Algorithms
- How Machine Learning Works
- Introduction to Pandas ML
- Support Vector Machines
- Overfitting
Machine & Deep Learning Algorithms: Regression & Clustering
Course: 49 Minutes
- Course Overview
- The Confusion Matrix
- An Introduction to Regression
- Applications of Regression
- Supervised and Unsupervised Learning
- Clustering
- Principal Component Analysis
Machine & Deep Learning Algorithms: Data Preparation in Pandas ML
Course: 1 Hour, 4 Minutes
- Course Overview
- Data Preparation in scikit-learn
- Training and Evaluating Models in scikit-learn
- Introducing the Pandas ML ModelFrame
- Training and Evaluating Models in Pandas ML
- Preparing Data for Regression
- Evaluating Regression Models
- Preparing Data for Clustering
- The K-Means Clustering Algorithm
Machine & Deep Learning Algorithms: Imbalanced Datasets Using Pandas ML
Course: 1 Hour, 24 Minutes
- Course Overview
- Analyzing an Imbalanced Dataset
- The RandomOverSampler
- The SMOTE Oversampler
- Undersampling Using imbalanced-learn
- Ensemble Classifiers for Imbalanced Data
- Combination Samplers
- Finding Correlations in a Dataset
- Building a Multi-Label Classification Model
- Dimensionality Reduction with PCA
- Imbalanced Learn and PCA
Creating Data APIs Using Node.js
Course: 1 Hour, 31 Minutes
- Course Overview
- API Prerequisites
- Building a RESTful API Using Node.js and Express.js
- RESTful API with OAuth
- HTTP Server with Hapi.js
- API Modules
- Returning Data with JSON
- Nodemon for Development Workflow
- API Requests
- Postman for API
- Deploying APIs
- Social Media APIs
- Exercise: Building RESTful APIs
Online Mentor
- You can reach your Mentor by entering chats or submitting an email.
Final Exam assessment
- Estimated duration: 90 minutes
Practice Labs: Data Visualization with Python (estimated duration: 8 hours)
- Perform data visualization tasks with Python such as creating scatter plots, plotting linear regression, using logistic regression, and creating a decision tree. Then, test your skills by answering assessment questions after creating time-series graphs, resampling observations, creating histograms, and using a grid pair.