Machine Learning Notes.pdf
226.8 KB
Stanford CS lecture notes covering supervised and unsupervised algorithms, neural networks, and SVMs, with math proofs and Python pseudocode.
📚 Data Science Riddle
Two team members run the same notebook but get different results. What's the culprit?
Anonymous Quiz
- Loss Curves: 6%
- Batch shapes: 12%
- Random seeds: 59%
- Metric choice: 23%
📚 Data Science Riddle
A query runs slowly due to large table scans. What's the most targeted fix?
Anonymous Quiz
- Add indexes: 54%
- Use aliases: 17%
- Add DISTINCT: 16%
- Increase RAM: 13%
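To see what an index changes, here is a small sketch with Python's built-in `sqlite3`. The `orders` table, its columns, and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

query = "SELECT SUM(amount) FROM orders WHERE customer_id = 42"

def plan(sql):
    """Return SQLite's query-plan description for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text

plan_before = plan(query)  # full table scan: every row is read
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(query)   # index search: only matching rows are read

print(plan_before)
print(plan_after)
```

The first plan reports a scan of the whole table; the second reports a search that uses `idx_orders_customer`.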
📚 Data Science Riddle
You want to detect extreme values visually in one plot. Which one is best?
Anonymous Quiz
- Box plot: 53%
- Heatmap: 30%
- Line chart: 9%
- Area plot: 8%
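A box plot conventionally draws its whiskers at 1.5 × IQR beyond the quartiles and marks anything outside as an individual point. A rough sketch of that rule with the standard library; the data and the function name are made up for illustration:

```python
import statistics

def box_plot_outliers(values, k=1.5):
    """Return the values a box plot would draw beyond its whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [12, 13, 13, 14, 15, 15, 16, 17, 18, 95]
outliers = box_plot_outliers(data)
print(outliers)  # [95]
```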
Mining of Massive Datasets (Leskovec, Stanford).pdf
2.9 MB
The Big Data bible from Stanford: MapReduce, Spark, recommendation systems, PageRank, locality-sensitive hashing, large-scale machine learning, and mining social networks and streams, all explained clearly with real algorithms you can code today. 500 pages of pure gold.
📚 Data Science Riddle
You want to prevent inconsistent data across environments. What helps most?
Anonymous Quiz
- Checkpoints: 32%
- Contracts: 20%
- Indexes: 39%
- Sharding: 10%
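If "Contracts" here means data contracts (an assumption, since the quiz doesn't define the term), the idea is that every environment checks records against one agreed schema. A minimal sketch; the `CONTRACT` fields and the `violations` helper are hypothetical:

```python
# Hypothetical contract: field name -> required Python type.
CONTRACT = {"user_id": int, "email": str, "signup_ts": float}

def violations(record, contract=CONTRACT):
    """List every way a record breaks the contract (empty list = valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            got = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {got}")
    for field in record:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems

good = {"user_id": 17, "email": "a@example.com", "signup_ts": 1700000000.0}
bad = {"user_id": "17", "email": "a@example.com"}

print(violations(good))  # []
print(violations(bad))   # ['user_id: expected int, got str', 'missing field: signup_ts']
```

Running the same check in dev, staging, and production catches a drifting schema at the boundary instead of downstream.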
🛠️ Running Code in Jupyter Notebooks
Jupyter Notebooks let you write & run code interactively.
Here’s a quick guide to make your workflow smoother:
▶️ Kernel & Code Cells
- Each notebook is tied to a single kernel (e.g. IPython).
- Code cells are where you write and execute code.
⌨️ Useful Shortcuts
- Shift + Enter → run current cell, move to next
- Alt + Enter → run current cell, insert new one below
- Ctrl + Enter → run current cell, stay in place
🔄 Kernel Management
- Interrupt the kernel if code hangs.
- Restart kernel to reset memory & variables.
🖥️ Output Handling
- Results & errors appear directly under the cell.
- Output from long-running code appears as it is generated.
- Large outputs can be scrolled or collapsed for clarity.
💡 Pro Tip:
Always “Restart & Run All” before sharing or saving a notebook.
This confirms every cell runs top to bottom from a clean kernel, with no hidden state left over from out-of-order execution.
📚 Data Science Riddle
You need fast reads of small files. Which storage option fits best?
Anonymous Quiz
- Distributed FS: 23%
- Cold storage: 9%
- Object Storage: 20%
- Local SSD: 49%
📚 Data Science Riddle
A feature has low importance but domain experts insist it matters. What do you do?
Anonymous Quiz
- Encode it differently: 26%
- Scale it: 22%
- Drop the feature: 13%
- Check interaction effects: 40%
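A feature can look useless on its own and still matter in combination with another one. A toy sketch on synthetic data (everything here is made up for illustration): feature `a` alone is nearly uncorrelated with the target, but the product `a * b` tracks it closely.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(0)
n = 5000
a = [rng.choice([-1, 1]) for _ in range(n)]
b = [rng.choice([-1, 1]) for _ in range(n)]
# The target depends only on the interaction a * b, plus noise.
target = [x * y + rng.gauss(0, 0.5) for x, y in zip(a, b)]

corr_alone = pearson(a, target)
corr_interaction = pearson([x * y for x, y in zip(a, b)], target)

print(f"a alone: {corr_alone:+.2f}")
print(f"a * b:   {corr_interaction:+.2f}")
```

A single-feature importance score would rank `a` near zero here, which is the pattern the domain experts may be pointing at.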
Advanced Data Science on Spark.pdf
1.8 MB
From Stanford University: covers Spark for ML, graph processing (GraphFrames), and integration with Hadoop.
📚 Data Science Riddle
Your estimate has high variance. Best fix?
Anonymous Quiz
- Increase sample size: 57%
- Change confidence level: 27%
- Reduce bin count: 8%
- Switch to bootstrap: 7%
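The spread of a sample mean shrinks roughly as 1/√n, so a 16× larger sample should cut it to about a quarter. A small simulation sketch with the standard library; the function name, trial count, and seed are arbitrary choices for illustration:

```python
import random
import statistics

def spread_of_mean(sample_size, n_trials=500, seed=0):
    """Standard deviation of the sample mean across repeated draws from N(0, 1)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(0, 1) for _ in range(sample_size)])
        for _ in range(n_trials)
    ]
    return statistics.stdev(means)

spread_small = spread_of_mean(25)    # theory: 1 / sqrt(25)  = 0.20
spread_large = spread_of_mean(400)   # theory: 1 / sqrt(400) = 0.05

print(f"n=25:  {spread_small:.3f}")
print(f"n=400: {spread_large:.3f}")
```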
The Difference Between Model Accuracy and Business Accuracy
A model can be 95% accurate…
yet deliver 0% business value.
Why❔
Because data science metrics ≠ business metrics.
📌 Examples:
- A fraud model catches small frauds but misses large ones
- A churn model predicts already obvious churners
- A recommendation model boosts clicks but reduces revenue
Always align ML metrics with business KPIs.
Otherwise, your “great model” is just a great illusion.
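The fraud example in numbers, as a rough sketch. All transactions below are invented for illustration: the model is right on 98 of 100 rows yet recovers barely 1% of the money at risk.

```python
# Hypothetical transactions: (amount, is_fraud, flagged_by_model)
transactions = (
    [(20.0, False, False)] * 90     # legitimate, correctly passed
    + [(15.0, True, True)] * 8      # small frauds, caught
    + [(5000.0, True, False)] * 2   # large frauds, missed
)

# ML metric: share of rows where the prediction matches the label.
correct = sum(is_fraud == flagged for _, is_fraud, flagged in transactions)
accuracy = correct / len(transactions)

# Business metric: share of fraudulent money the model actually caught.
fraud_value = sum(amt for amt, is_fraud, _ in transactions if is_fraud)
caught_value = sum(amt for amt, is_fraud, flagged in transactions if is_fraud and flagged)
value_recall = caught_value / fraud_value

print(f"accuracy:           {accuracy:.0%}")      # 98%
print(f"fraud value caught: {value_recall:.1%}")  # 1.2%
```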
📚 Data Science Riddle
Your model's loss fluctuates but doesn't decrease overall. What's the most likely issue?
Anonymous Quiz
- Gradient exploding: 25%
- Weak regularization: 40%
- Small batch size: 22%
- Slow optimizer: 13%
✅ Complete AI (Artificial Intelligence) Roadmap 🤖🚀
1️⃣ Basics of AI
🔹 What is AI?
🔹 Types: Narrow AI vs General AI
🔹 AI vs ML vs DL
🔹 Real-world applications
2️⃣ Python for AI
🔹 Python syntax & libraries
🔹 NumPy, Pandas for data handling
🔹 Matplotlib, Seaborn for visualization
3️⃣ Math Foundation
🔹 Linear Algebra: Vectors, Matrices
🔹 Probability & Statistics
🔹 Calculus basics
🔹 Optimization techniques
4️⃣ Machine Learning (ML)
🔹 Supervised vs Unsupervised
🔹 Regression, Classification, Clustering
🔹 Scikit-learn for ML
🔹 Model evaluation metrics
5️⃣ Deep Learning (DL)
🔹 Neural Networks basics
🔹 Activation functions, backpropagation
🔹 TensorFlow / PyTorch
🔹 CNNs, RNNs, LSTMs
6️⃣ NLP (Natural Language Processing)
🔹 Text cleaning & tokenization
🔹 Word embeddings (Word2Vec, GloVe)
🔹 Transformers & BERT
🔹 Chatbots & summarization
7️⃣ Computer Vision
🔹 Image processing basics
🔹 OpenCV for CV tasks
🔹 Object detection, image classification
🔹 CNN architectures (ResNet, YOLO)
8️⃣ Model Deployment
🔹 Streamlit / Flask APIs
🔹 Docker for containerization
🔹 Deploy on cloud: Render, Hugging Face, AWS
9️⃣ Tools & Ecosystem
🔹 Git & GitHub
🔹 Jupyter Notebooks
🔹 DVC, MLflow (for versioning data and tracking experiments & models)
🔟 Build AI Projects
🔹 Chatbot, Face recognition
🔹 Spam classifier, Stock prediction
🔹 Language translator, Object detector
📚 Data Science Riddle - CNN Kernels
Which convolution increases channel depth but not spatial size?
Anonymous Quiz
- 1x1 convolution: 0%
- 3x3 convolution: 38%
- Depthwise convolution: 54%
- Transposed convolution: 8%
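A 1x1 convolution mixes channels at each pixel independently, so with more filters than input channels it raises depth while height and width stay put. A pure-Python sketch on nested lists (no framework; the tensor sizes and weights are arbitrary, chosen for illustration):

```python
def conv1x1(x, weights):
    """x: [c_in][h][w], weights: [c_out][c_in]  ->  output: [c_out][h][w]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    return [
        [
            [sum(kernel[c] * x[c][i][j] for c in range(c_in)) for j in range(w)]
            for i in range(h)
        ]
        for kernel in weights  # one output channel per filter
    ]

def shape(t):
    """(channels, height, width) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))

# Input with 3 channels, height 4, width 5.
x = [[[float(c + i + j) for j in range(5)] for i in range(4)] for c in range(3)]
# 8 filters, each a weight vector over the 3 input channels.
weights = [[1.0 if c == o % 3 else 0.0 for c in range(3)] for o in range(8)]

y = conv1x1(x, weights)
print(shape(x))  # (3, 4, 5)
print(shape(y))  # (8, 4, 5)
```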
