📊 Load Machine Learning Data From Scratch in Python
🧠 Introduction
Most tutorials use libraries like pandas or scikit-learn to load datasets. But learning how to load data from scratch helps you:
- Understand the internal structure of datasets
- Handle custom data formats
- Improve debugging skills
📁 Table of Contents
- 1. Why Load Data from Scratch?
- 2. Types of Machine Learning Data
- 3. Common File Formats (CSV, TSV, JSON)
- 4. Understanding a Dataset Structure
- 5. Reading a File Line by Line in Python
- 6. Splitting Lines into Columns
- 7. Converting Data Types
- 8. Handling Missing Values
- 9. Handling Categorical Data
- 10. Normalizing Data
- 11. Splitting Data into Train/Test Sets
- 12. One-Hot Encoding
- 13. Creating a Loader Function
- 14. Testing with Iris Dataset
- 15. Conclusion and Best Practices
1. Why Load Data from Scratch?
Working without high-level libraries allows you to understand and customize the data preprocessing pipeline.
2. Types of Machine Learning Data
- Structured: CSV, Excel, SQL
- Semi-structured: JSON, XML
- Unstructured: Text, Images, Audio
3. Common File Formats
- .csv: Comma-separated values
- .tsv: Tab-separated values
- .json: Hierarchical (nested key/value) data
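For the delimited formats, the same split-based parsing works for both CSV and TSV; only the delimiter changes. A minimal sketch (the sample rows and the `parse_row` helper are made up for illustration):

```python
row_csv = "5.1,3.5,1.4,0.2,setosa"
row_tsv = "5.1\t3.5\t1.4\t0.2\tsetosa"

def parse_row(line, delimiter=","):
    """Split one raw line into a list of string fields."""
    return line.strip().split(delimiter)

print(parse_row(row_csv))        # ['5.1', '3.5', '1.4', '0.2', 'setosa']
print(parse_row(row_tsv, "\t"))  # same fields, different delimiter
```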
4. Dataset Example
```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
```
5. Reading CSV Manually
```python
with open('iris.csv', 'r') as file:
    for line in file:
        print(line)
```
6. Splitting Lines
```python
with open('iris.csv', 'r') as file:
    for line in file:
        values = line.strip().split(',')
        print(values)
```
7. Converting to Floats
```python
line = "5.1,3.5,1.4,0.2,setosa"
values = line.strip().split(',')
features = list(map(float, values[:4]))
label = values[4]
```
8. Missing Value Handling
```python
def is_missing(value):
    return value == '' or value.lower() == 'nan'

cleaned = [float(val) if not is_missing(val) else 0.0 for val in values[:4]]
```
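As a quick check, here is the same cleaning step applied to a hypothetical row whose third field is blank (the sample row is made up; `is_missing` is repeated so the snippet runs on its own):

```python
def is_missing(value):
    return value == '' or value.lower() == 'nan'

# Hypothetical row where the third field is blank
values = "5.1,3.5,,0.2,setosa".split(',')
cleaned = [float(v) if not is_missing(v) else 0.0 for v in values[:4]]
print(cleaned)  # [5.1, 3.5, 0.0, 0.2]
```

Note that substituting 0.0 is only one strategy; replacing missing values with the column mean is often a better choice.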
9. Encoding Labels
```python
label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
label = label_map[values[4]]
```
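The same mapping extends naturally to a whole column of string labels. A small sketch (the `raw` list is made-up sample data):

```python
label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

raw = ['setosa', 'virginica', 'setosa']
encoded = [label_map[name] for name in raw]
print(encoded)  # [0, 2, 0]
```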
10. Feature Normalization
```python
def normalize_column(col):
    min_val = min(col)
    max_val = max(col)
    if max_val == min_val:
        return [0.0] * len(col)  # avoid division by zero for constant columns
    return [(x - min_val) / (max_val - min_val) for x in col]
```
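Since `normalize_column` works on a single column, a dataset stored as rows needs to be transposed first. A minimal sketch with a made-up feature matrix (`normalize_column` is repeated here so the snippet runs on its own):

```python
def normalize_column(col):
    min_val, max_val = min(col), max(col)
    return [(x - min_val) / (max_val - min_val) for x in col]

# Tiny made-up feature matrix (rows = samples, columns = features)
rows = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2]]

# Transpose to columns, normalize each column, transpose back to rows
cols = [normalize_column(list(c)) for c in zip(*rows)]
normalized = [list(r) for r in zip(*cols)]
print(normalized)
```

Each feature now lies in the range [0, 1], independently of the others.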
11. Train/Test Split
```python
import random

# X is the list of feature rows, y the matching list of labels
data = list(zip(X, y))
random.shuffle(data)
split = int(0.8 * len(data))
train_data = data[:split]
test_data = data[split:]
```
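A runnable sketch of the split, using toy `X` and `y` lists standing in for a real dataset, and `random.seed` so the shuffle is reproducible:

```python
import random

# Toy feature/label lists standing in for a real dataset
X = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [6.3, 3.3]]
y = [0, 0, 1, 1, 2]

random.seed(42)  # reproducible shuffle for this demo
data = list(zip(X, y))
random.shuffle(data)
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]

# Separate back into feature and label lists for training code
X_train = [features for features, _ in train_data]
y_train = [label for _, label in train_data]
print(len(X_train), len(test_data))  # prints: 4 1
```

Shuffling before splitting matters: files like iris.csv are often sorted by class, so an unshuffled split would put entire classes in only one of the two sets.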
12. One-Hot Encoding
```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector
```
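Applied to a list of integer labels, each label becomes a vector with a single 1 at its class index (the `labels` list is made-up sample data):

```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector

labels = [0, 2, 1]
encoded = [one_hot(label) for label in labels]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```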
13. Full Loader Function
```python
def load_dataset(filename):
    data = []
    label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    with open(filename, 'r') as file:
        next(file)  # skip the header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines (e.g. a trailing newline)
            parts = line.split(',')
            features = list(map(float, parts[:4]))
            label = label_map[parts[4]]
            data.append((features, label))
    return data
14. Testing with the Iris Dataset
```python
dataset = load_dataset('iris.csv')
print("Sample:", dataset[:3])
```
📝 Practice: Write a script that loads, splits, normalizes, and one-hot encodes a dataset from scratch!
15. Conclusion
This tutorial taught you how to:
- Load CSV manually using basic Python
- Handle missing and categorical data
- Normalize and prepare for training
💡 Next Step: Learn to build a neural network from scratch to use this data!