📊 Load Machine Learning Data From Scratch in Python
🧠 Introduction
Most tutorials use libraries like pandas or scikit-learn to load datasets. But learning how to load data from scratch helps you:
- Understand the internal structure of datasets
- Handle custom data formats
- Improve debugging skills
📁 Table of Contents
- 1. Why Load Data from Scratch?
- 2. Types of Machine Learning Data
- 3. File Formats in ML (CSV, TSV, JSON, etc.)
- 4. Understanding a Dataset Structure
- 5. Reading a File Line by Line in Python
- 6. Splitting Lines into Columns
- 7. Converting Data Types
- 8. Handling Missing Values
- 9. Handling Categorical Data
- 10. Normalizing Data
- 11. Splitting Data into Train/Test Sets
- 12. One-Hot Encoding
- 13. Creating a Loader Function
- 14. Testing with Iris Dataset
- 15. Conclusion and Best Practices
1. Why Load Data from Scratch?
Working without high-level libraries allows you to understand and customize the data preprocessing pipeline.
2. Types of Machine Learning Data
- Structured: CSV, Excel, SQL
- Semi-structured: JSON, XML
- Unstructured: Text, Images, Audio
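To see what "semi-structured" means in practice, here is a minimal sketch that parses a JSON record with Python's built-in `json` module (the record and field names are made up for illustration): fields can be nested and have no fixed column order, unlike a flat CSV row.

```python
import json

# A semi-structured record: nested fields, no fixed column order
raw = '{"sepal_length": 5.1, "species": {"name": "setosa", "id": 0}}'

record = json.loads(raw)
print(record["species"]["name"])  # nested access, unlike flat CSV columns
```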
3. Common File Formats
- `.csv`: Comma-separated values
- `.tsv`: Tab-separated values
- `.json`: Hierarchical key-value data
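For delimited text formats, the only difference between CSV and TSV is the separator character. A quick sketch with a made-up Iris row:

```python
# The same row in CSV and TSV form; only the delimiter changes
csv_line = "5.1,3.5,1.4,0.2,setosa"
tsv_line = "5.1\t3.5\t1.4\t0.2\tsetosa"

print(csv_line.split(','))   # ['5.1', '3.5', '1.4', '0.2', 'setosa']
print(tsv_line.split('\t'))  # same result
```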
4. Dataset Example
```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
```
5. Reading CSV Manually
```python
with open('iris.csv', 'r') as file:
    for line in file:
        print(line)
```
6. Splitting Lines
```python
with open('iris.csv', 'r') as file:
    for line in file:
        values = line.strip().split(',')
        print(values)
```
7. Converting to Floats
```python
line = "5.1,3.5,1.4,0.2,setosa"
values = line.strip().split(',')
features = list(map(float, values[:4]))
label = values[4]
```
8. Missing Value Handling
```python
def is_missing(value):
    return value == '' or value.lower() == 'nan'

cleaned = [float(val) if not is_missing(val) else 0.0 for val in values[:4]]
```
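Substituting 0.0 can distort features whose values are far from zero. A common alternative is to impute the column mean instead; a minimal sketch (the helper name `impute_mean` is our own, not from the tutorial):

```python
def impute_mean(column):
    """Replace missing entries ('' or 'nan') with the mean of the present values."""
    present = [float(v) for v in column if v not in ('', 'nan', 'NaN')]
    mean = sum(present) / len(present)
    return [float(v) if v not in ('', 'nan', 'NaN') else mean for v in column]

print(impute_mean(['5.1', '', '4.9']))  # missing value becomes 5.0, the column mean
```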
9. Encoding Labels
```python
label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
label = label_map[values[4]]
```
10. Feature Normalization
```python
def normalize_column(col):
    min_val = min(col)
    max_val = max(col)
    return [(x - min_val) / (max_val - min_val) for x in col]
```
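Note that `normalize_column` operates on one column at a time, while the dataset is stored as rows. One way to apply it column-wise is to transpose with `zip`, scale, and transpose back; a sketch with hypothetical feature rows (the function is repeated so the snippet runs on its own):

```python
def normalize_column(col):
    min_val = min(col)
    max_val = max(col)
    return [(x - min_val) / (max_val - min_val) for x in col]

rows = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.8]]            # hypothetical feature rows
columns = list(zip(*rows))                              # transpose: rows -> columns
normalized = [normalize_column(list(c)) for c in columns]
normalized_rows = [list(r) for r in zip(*normalized)]   # transpose back
print(normalized_rows)
```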
11. Train/Test Split
```python
import random

# X: list of feature rows, y: list of labels (built in the earlier steps)
data = list(zip(X, y))
random.shuffle(data)
split = int(0.8 * len(data))
train_data = data[:split]
test_data = data[split:]
```
12. One-Hot Encoding
```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector
```
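Combined with the `label_map` from section 9, `one_hot` turns a species string into a vector (both are repeated here so the snippet is self-contained):

```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector

label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
print(one_hot(label_map['versicolor']))  # [0, 1, 0]
```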
13. Full Loader Function
```python
def load_dataset(filename):
    data = []
    label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    with open(filename, 'r') as file:
        next(file)  # skip header
        for line in file:
            parts = line.strip().split(',')
            features = list(map(float, parts[:4]))
            label = label_map[parts[4]]
            data.append((features, label))
    return data
```
14. Try with a CSV File
```python
dataset = load_dataset('iris.csv')
print("Sample:", dataset[:3])
```
📝 Practice: Write a script that loads, splits, normalizes, and one-hot encodes a dataset from scratch!
15. Conclusion
This tutorial taught you how to:
- Load CSV manually using basic Python
- Handle missing and categorical data
- Normalize and prepare for training
💡 Next Step: Learn to build a neural network from scratch to use this data!