📊 Load Machine Learning Data From Scratch in Python
🧠 Introduction
Most tutorials use libraries like pandas or scikit-learn to load datasets. But learning how to load data from scratch helps you:
- Understand the internal structure of datasets
- Handle custom data formats
- Improve debugging skills
📁 Table of Contents
- 1. Why Load Data from Scratch?
- 2. Types of Machine Learning Data
- 3. Common File Formats (CSV, TSV, JSON)
- 4. Understanding a Dataset Structure
- 5. Reading a File Line by Line in Python
- 6. Splitting Lines into Columns
- 7. Converting Data Types
- 8. Handling Missing Values
- 9. Handling Categorical Data
- 10. Normalizing Data
- 11. Splitting Data into Train/Test Sets
- 12. One-Hot Encoding
- 13. Creating a Loader Function
- 14. Testing with Iris Dataset
- 15. Conclusion and Best Practices
1. Why Load Data from Scratch?
Working without high-level libraries allows you to understand and customize the data preprocessing pipeline.
2. Types of Machine Learning Data
- Structured: CSV, Excel, SQL
- Semi-structured: JSON, XML
- Unstructured: Text, Images, Audio
3. Common File Formats
- .csv: Comma-separated values
- .tsv: Tab-separated values
- .json: Hierarchical (nested key/value) data
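For the delimited formats, the same split-based parsing works for both CSV and TSV; only the delimiter changes. A minimal sketch (the sample rows and the `parse_row` helper are made up for illustration):

```python
row_csv = "5.1,3.5,1.4,0.2,setosa"
row_tsv = "5.1\t3.5\t1.4\t0.2\tsetosa"

def parse_row(line, delimiter=","):
    """Split one raw line into a list of string fields."""
    return line.strip().split(delimiter)

print(parse_row(row_csv))        # ['5.1', '3.5', '1.4', '0.2', 'setosa']
print(parse_row(row_tsv, "\t"))  # same fields, different delimiter
```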
4. Dataset Example
```
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
```
5. Reading CSV Manually
```python
with open('iris.csv', 'r') as file:
    for line in file:
        print(line)
```
6. Splitting Lines
```python
with open('iris.csv', 'r') as file:
    for line in file:
        values = line.strip().split(',')
        print(values)
```
7. Converting to Floats
```python
line = "5.1,3.5,1.4,0.2,setosa"
values = line.strip().split(',')
features = list(map(float, values[:4]))
label = values[4]
```
8. Missing Value Handling
```python
def is_missing(value):
    return value == '' or value.lower() == 'nan'

cleaned = [float(val) if not is_missing(val) else 0.0 for val in values[:4]]
```
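As a quick check, here is the same cleaning step applied to a hypothetical row whose third field is blank (the sample row is made up; `is_missing` is repeated so the snippet runs on its own):

```python
def is_missing(value):
    return value == '' or value.lower() == 'nan'

# Hypothetical row where the third field is blank
values = "5.1,3.5,,0.2,setosa".split(',')
cleaned = [float(v) if not is_missing(v) else 0.0 for v in values[:4]]
print(cleaned)  # [5.1, 3.5, 0.0, 0.2]
```

Note that substituting 0.0 is only one strategy; replacing missing values with the column mean is often a better choice.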
9. Encoding Labels
```python
label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
label = label_map[values[4]]
```
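The same mapping extends naturally to a whole column of string labels. A small sketch (the `raw` list is made-up sample data):

```python
label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

raw = ['setosa', 'virginica', 'setosa']
encoded = [label_map[name] for name in raw]
print(encoded)  # [0, 2, 0]
```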
10. Feature Normalization
```python
def normalize_column(col):
    min_val = min(col)
    max_val = max(col)
    if max_val == min_val:
        return [0.0] * len(col)  # avoid division by zero for constant columns
    return [(x - min_val) / (max_val - min_val) for x in col]
```
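Since `normalize_column` works on a single column, a dataset stored as rows needs to be transposed first. A minimal sketch with a made-up feature matrix (`normalize_column` is repeated here so the snippet runs on its own):

```python
def normalize_column(col):
    min_val, max_val = min(col), max(col)
    return [(x - min_val) / (max_val - min_val) for x in col]

# Tiny made-up feature matrix (rows = samples, columns = features)
rows = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2]]

# Transpose to columns, normalize each column, transpose back to rows
cols = [normalize_column(list(c)) for c in zip(*rows)]
normalized = [list(r) for r in zip(*cols)]
print(normalized)
```

Each feature now lies in the range [0, 1], independently of the others.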
11. Train/Test Split
```python
import random

# X is the list of feature rows, y the matching list of labels
data = list(zip(X, y))
random.shuffle(data)
split = int(0.8 * len(data))
train_data = data[:split]
test_data = data[split:]
```
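A runnable sketch of the split, using toy `X` and `y` lists standing in for a real dataset, and `random.seed` so the shuffle is reproducible:

```python
import random

# Toy feature/label lists standing in for a real dataset
X = [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [6.3, 3.3]]
y = [0, 0, 1, 1, 2]

random.seed(42)  # reproducible shuffle for this demo
data = list(zip(X, y))
random.shuffle(data)
split = int(0.8 * len(data))
train_data, test_data = data[:split], data[split:]

# Separate back into feature and label lists for training code
X_train = [features for features, _ in train_data]
y_train = [label for _, label in train_data]
print(len(X_train), len(test_data))  # prints: 4 1
```

Shuffling before splitting matters: files like iris.csv are often sorted by class, so an unshuffled split would put entire classes in only one of the two sets.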
12. One-Hot Encoding
```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector
```
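Applied to a list of integer labels, each label becomes a vector with a single 1 at its class index (the `labels` list is made-up sample data):

```python
def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector

labels = [0, 2, 1]
encoded = [one_hot(label) for label in labels]
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```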
13. Full Loader Function
```python
def load_dataset(filename):
    data = []
    label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
    with open(filename, 'r') as file:
        next(file)  # skip the header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # skip blank lines (e.g. a trailing newline)
            parts = line.split(',')
            features = list(map(float, parts[:4]))
            label = label_map[parts[4]]
            data.append((features, label))
    return data
14. Testing with the Iris Dataset
```python
dataset = load_dataset('iris.csv')
print("Sample:", dataset[:3])
```
📝 Practice: Write a script that loads, splits, normalizes, and one-hot encodes a dataset from scratch!
15. Conclusion
This tutorial taught you how to:
- Load CSV manually using basic Python
- Handle missing and categorical data
- Normalize and prepare for training
💡 Next Step: Learn to build a neural network from scratch to use this data!