introducing-data-science-machine-learning-python.pdf
14.6 MB
Introducing Data Science
by DAVY CIELEN
ARNO D. B. MEYSMAN
MOHAMED ALI
Python Data Science Handbook
Python Data Science Handbook: full text in Jupyter Notebooks. This repository contains the entire Python Data Science Handbook, in the form of (free!) Jupyter notebooks.
Creator: Jake Vanderplas
Stars: 39k
Forks: 17.1k
Repo: https://github.com/jakevdp/PythonDataScienceHandbook
Join @github_repositories_bds for more cool data science materials.
*This channel belongs to @bigdataspecialist group
SQL for Data Analysis: Solving real-world problems with data
A simple & concise MySQL course (applicable to any SQL), perfect for data analysis, data science, business intelligence.
Rating: 4.3 out of 5
Students: 47,690
Duration: 1hr 57min of on-demand video
Created by: Max SQL
Course Link
#data_analytics #data #SQL #programming
Join @datascience_bds for more
A few years ago I was learning about transformers and wrote down some notes for myself. I recently came across those notes and decided to share part of them here in case any of you find them useful.
Most famous transformers
1. BERT (Bidirectional Encoder Representations from Transformers): BERT is a pre-trained transformer model developed by Google. It has achieved state-of-the-art results in various NLP tasks, such as question answering, sentiment analysis, and text classification.
2. GPT (Generative Pre-trained Transformer): GPT is a series of transformer-based models developed by OpenAI. GPT-3, the most recent version when these notes were written, is a highly influential model known for its impressive language generation capabilities. It has been used in various creative applications, including text completion, language translation, and dialogue generation.
3. Transformer-XL: Transformer-XL is a transformer-based model developed by researchers at Carnegie Mellon University and Google Brain. It addresses the fixed-length context limitation of standard transformers by adding a segment-level recurrence mechanism that captures longer-term dependencies in the input sequence. It has been successful in tasks that require modeling long-range context, such as language modeling.
4. T5 (Text-to-Text Transfer Transformer): T5, developed by Google, is a versatile transformer model capable of performing a wide range of NLP tasks. It follows a "text-to-text" framework, where different tasks are cast as text generation problems. T5 has demonstrated strong performance across various benchmarks and has been widely adopted in the NLP community.
5. RoBERTa (Robustly Optimized BERT Pretraining Approach): RoBERTa is a variant of BERT developed by Facebook AI. It addresses some limitations of the original BERT model by tweaking the training setup and introducing additional data. RoBERTa has achieved improved performance on several NLP tasks, including text classification and named entity recognition.
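To make the "text-to-text" idea from item 4 a bit more concrete, here is a minimal sketch of how different tasks can be turned into plain input strings. The helper name and the task keys are made up for illustration; the prefix wording follows the examples given in the T5 paper, and no model is loaded or called.

```python
# Hypothetical helper: casts several NLP tasks as "text in, text out" inputs.
# The task keys are invented for this sketch; the prefix wording follows the
# examples shown in the T5 paper.
TASK_TEMPLATES = {
    "translate_en_de": "translate English to German: {text}",
    "summarize": "summarize: {text}",
    "acceptability": "cola sentence: {text}",
}


def to_text_to_text(task: str, text: str) -> str:
    """Return the single input string a text-to-text model would receive."""
    return TASK_TEMPLATES[task].format(text=text)


if __name__ == "__main__":
    print(to_text_to_text("translate_en_de", "That is good."))
    print(to_text_to_text("summarize", "Transformers rely on attention."))
```

The point is only that translation, summarization, and classification all share the same interface: one string goes in, and the model is expected to generate another string as the answer.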
BERT vs RoBERTa vs DistilBERT vs ALBERT
BERT - created by Google in 2018. Used for question answering, summarization, and sequence classification. BERT-base stacks 12 encoder layers (BERT-large has 24) and serves as the baseline for the others.
RoBERTa - created by Facebook AI in 2019. Same architecture as BERT, but it improves on BERT by carefully tuning the pretraining procedure and hyperparameters. It is trained on more data, with a larger vocabulary and longer sequences, and it outperforms BERT.
DistilBERT - created by Hugging Face, October 2019. Roughly the same general architecture as BERT but smaller, with only 6 encoder layers. DistilBERT has 40% fewer parameters than the original BERT-base model, runs 60% faster, and retains over 95% of its performance (the DistilBERT paper reports 97% of BERT's language understanding capability).
ALBERT (A Lite BERT) - introduced at around the same time as DistilBERT. An ALBERT configuration comparable to BERT-large has 18x fewer parameters and trains about 1.7x faster. The two models take different routes: DistilBERT is trained by distillation, with BERT as the teacher, and accepts a small drop in accuracy, while ALBERT is trained from scratch like BERT. At the time of publication, the largest ALBERT configuration set state-of-the-art results on several benchmarks, ahead of BERT, RoBERTa, and XLNet.
Note: Training speed matters less to end users because all of these are pre-trained models. Still, in some cases we need to fine-tune a model on our own dataset, and that is where speed becomes important. Smaller and faster models like DistilBERT and ALBERT can also be advantageous when memory or compute is limited.
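The "40% fewer parameters" figure for DistilBERT can be checked with simple arithmetic. The sketch below counts encoder weights only (no task heads), assuming the commonly published base configurations: hidden size 768, feed-forward size 3072, 512 positions, a 30,522-token vocabulary for BERT and DistilBERT, and a 30,000-token vocabulary with 128-dimensional embeddings and one shared layer for ALBERT-base. The function names are my own, and the configuration values are assumptions rather than something stated in these notes.

```python
from typing import Optional


def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Weights and biases of one standard transformer encoder layer."""
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, each with a scale and a bias
    return attention + feed_forward + layer_norms


def encoder_params(*, vocab: int, layers: int, hidden: int = 768, ffn: int = 3072,
                   max_positions: int = 512, token_types: int = 2,
                   embed: Optional[int] = None, share_layers: bool = False,
                   pooler: bool = True) -> int:
    """Rough parameter count for a BERT-style encoder (assumed configuration)."""
    embed = hidden if embed is None else embed
    # Token, position, and segment embeddings plus one LayerNorm over them.
    embeddings = (vocab + max_positions + token_types) * embed + 2 * embed
    # ALBERT-style factorization: project small embeddings up to the hidden size.
    projection = 0 if embed == hidden else embed * hidden + hidden
    # ALBERT-style sharing: every layer reuses one set of weights.
    distinct_layers = 1 if share_layers else layers
    body = distinct_layers * encoder_layer_params(hidden, ffn)
    head = hidden * hidden + hidden if pooler else 0
    return embeddings + projection + body + head


bert_base = encoder_params(vocab=30522, layers=12)
distilbert = encoder_params(vocab=30522, layers=6, token_types=0, pooler=False)
albert_base = encoder_params(vocab=30000, layers=12, embed=128, share_layers=True)

if __name__ == "__main__":
    print(f"BERT-base:   {bert_base:,}")    # 109,482,240
    print(f"DistilBERT:  {distilbert:,}")   # 66,362,880
    print(f"ALBERT-base: {albert_base:,}")  # 11,683,584
    print(f"DistilBERT reduction: {1 - distilbert / bert_base:.1%}")  # 39.4%
```

Under these assumptions, halving the number of layers removes about 39% of BERT-base's parameters, which lines up with the 40% figure. The ALBERT line shows why its savings are much larger: the embedding table is factorized and all 12 layers share one set of weights. Note that sharing weights reduces memory, but the model still runs every layer at inference time.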
Introduction to Datascience [R20DS501].pdf
5.3 MB
Introduction to Data Science
[R20DS501]
DIGITAL NOTES
Beyond Jupyter Notebooks
Build your own Data science platform with Docker & Python
Rating: 4.7 out of 5
Students: 5,018
Duration: 1hr 26min of on-demand video
Created by: Joshua Görner
Course Link
#data_science #Jupyter #python #Docker
Best YouTube Playlists for Data Science
Python: Playlist Link
SQL: Playlist Link
Data Analysis: Playlist Link
Data Analyst: Playlist Link
Linear Algebra: Playlist Link
Calculus: Playlist Link
Statistics: Playlist Link
Machine Learning: Playlist Link
Deep Learning: Playlist Link
Excel Power Query: Playlist Link
Ruby: Playlist Link
Microsoft Excel: Playlist Link
The Top 5 Machine Learning Libraries in Python
A Gentle Introduction to the Top Python Libraries used in Applied Machine Learning
Rating: 4.4 out of 5
Students: 103,885
Duration: 1hr 27min of on-demand video
Created by: Mike West
Course Link
#Python #Libraries #Machine_Learning #programming
Drag, Drop, Analyze: The Rise of No-Code Data Science
No-code or low-code functionalities in data science have gained significant traction in recent years. These solutions are well-proven and matured, and they make data science more accessible to a wider range of people.
No-code or low-code data science solutions can be very rewarding. "The first and most important benefit is that they can lead to better forms of collaboration," Mierswa underscores. "Everyone can understand visual workflows or models if they are explained, however, not everyone is a computer scientist or programmer, and not everyone can understand code." So, in order to collaborate effectively, you need to understand what assets the team is collectively producing. "Data science is, at the end of the day, a team sport. You need people who understand the business problems, whether or not they can code, as coding may not be their daily business."
Read more
Data Scientist Roadmap
|
|-- 1. Basic Foundations
|   |-- a. Mathematics
|   |   |-- i. Linear Algebra
|   |   |-- ii. Calculus
|   |   |-- iii. Probability
|   |   `-- iv. Statistics
|   |
|   |-- b. Programming
|   |   |-- i. Python
|   |   |   |-- 1. Syntax and Basic Concepts
|   |   |   |-- 2. Data Structures
|   |   |   |-- 3. Control Structures
|   |   |   |-- 4. Functions
|   |   |   `-- 5. Object-Oriented Programming
|   |   |
|   |   `-- ii. R (optional, based on preference)
|   |
|   |-- c. Data Manipulation
|   |   |-- i. Numpy (Python)
|   |   |-- ii. Pandas (Python)
|   |   `-- iii. Dplyr (R)
|   |
|   `-- d. Data Visualization
|       |-- i. Matplotlib (Python)
|       |-- ii. Seaborn (Python)
|       `-- iii. ggplot2 (R)
|
|-- 2. Data Exploration and Preprocessing
|   |-- a. Exploratory Data Analysis (EDA)
|   |-- b. Feature Engineering
|   |-- c. Data Cleaning
|   |-- d. Handling Missing Data
|   `-- e. Data Scaling and Normalization
|
|-- 3. Machine Learning
|   |-- a. Supervised Learning
|   |   |-- i. Regression
|   |   |   |-- 1. Linear Regression
|   |   |   `-- 2. Polynomial Regression
|   |   |
|   |   `-- ii. Classification
|   |       |-- 1. Logistic Regression
|   |       |-- 2. k-Nearest Neighbors
|   |       |-- 3. Support Vector Machines
|   |       |-- 4. Decision Trees
|   |       `-- 5. Random Forest
|   |
|   |-- b. Unsupervised Learning
|   |   |-- i. Clustering
|   |   |   |-- 1. K-means
|   |   |   |-- 2. DBSCAN
|   |   |   `-- 3. Hierarchical Clustering
|   |   |
|   |   `-- ii. Dimensionality Reduction
|   |       |-- 1. Principal Component Analysis (PCA)
|   |       |-- 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
|   |       `-- 3. Linear Discriminant Analysis (LDA)
|   |
|   |-- c. Reinforcement Learning
|   |-- d. Model Evaluation and Validation
|   |   |-- i. Cross-validation
|   |   |-- ii. Hyperparameter Tuning
|   |   `-- iii. Model Selection
|   |
|   `-- e. ML Libraries and Frameworks
|       |-- i. Scikit-learn (Python)
|       |-- ii. TensorFlow (Python)
|       |-- iii. Keras (Python)
|       `-- iv. PyTorch (Python)
|
|-- 4. Deep Learning
|   |-- a. Neural Networks
|   |   |-- i. Perceptron
|   |   `-- ii. Multi-Layer Perceptron
|   |
|   |-- b. Convolutional Neural Networks (CNNs)
|   |   |-- i. Image Classification
|   |   |-- ii. Object Detection
|   |   `-- iii. Image Segmentation
|   |
|   |-- c. Recurrent Neural Networks (RNNs)
|   |   |-- i. Sequence-to-Sequence Models
|   |   |-- ii. Text Classification
|   |   `-- iii. Sentiment Analysis
|   |
|   |-- d. Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU)
|   |   |-- i. Time Series Forecasting
|   |   `-- ii. Language Modeling
|   |
|   `-- e. Generative Adversarial Networks (GANs)
|       |-- i. Image Synthesis
|       |-- ii. Style Transfer
|       `-- iii. Data Augmentation
|
|-- 5. Big Data Technologies
|   |-- a. Hadoop
|   |   |-- i. HDFS
|   |   `-- ii. MapReduce
|   |
|   |-- b. Spark
|   |   |-- i. RDDs
|   |   |-- ii. DataFrames
|   |   `-- iii. MLlib
|   |
|   `-- c. NoSQL Databases
|       |-- i. MongoDB
|       |-- ii. Cassandra
|       |-- iii. HBase
|       `-- iv. Couchbase
|
|-- 6. Data Visualization and Reporting
|   |-- a. Dashboarding Tools
|   |   |-- i. Tableau
|   |   |-- ii. Power BI
|   |   |-- iii. Dash (Python)
|   |   `-- iv. Shiny (R)
|   |
|   |-- b. Storytelling with Data
|   `-- c. Effective Communication
|
|-- 7. Domain Knowledge and Soft Skills
|   |-- a. Industry-specific Knowledge
|   |-- b. Problem-solving
|   |-- c. Communication Skills
|   |-- d. Time Management
|   `-- e. Teamwork
|
`-- 8. Staying Updated and Continuous Learning
    |-- a. Online Courses
    |-- b. Books and Research Papers
    |-- c. Blogs and Podcasts
    |-- d. Conferences and Workshops
    `-- e. Networking and Community Engagement
8 Books that Will Teach You the Basics of Data Science
In an era where data is hailed as the new oil, the demand for data scientists continues to soar. Data science, a multidisciplinary field that extracts insights and knowledge from data, has become a cornerstone of many industries. For those aspiring to enter this dynamic field, building a solid foundation is essential. Books are a timeless source of knowledge, and in this article, we'll explore eight must-read books that will teach you the basics of data science, making your journey into this fascinating world more accessible.
1. "Python for Data Analysis" by Wes McKinney
Wes McKinney's book is a fantastic starting point for beginners. It focuses on the practical use of Python, one of the most popular programming languages in data science. You'll learn how to work with data structures, perform data cleaning, and apply statistical analysis. The book also introduces the powerful Pandas library for data manipulation.
Source-Link: analyticsinsight