📊 Load Machine Learning Data From Scratch in Python

🧠 Introduction

Most tutorials use libraries like pandas or scikit-learn to load datasets. But learning how to load data from scratch helps you:

  • Understand the internal structure of datasets
  • Handle custom data formats
  • Improve debugging skills

📁 Table of Contents

  • 1. Why Load Data from Scratch?
  • 2. Types of Machine Learning Data
  • 3. Common File Formats
  • 4. Dataset Example
  • 5. Reading CSV Manually
  • 6. Splitting Lines
  • 7. Converting to Floats
  • 8. Missing Value Handling
  • 9. Encoding Labels
  • 10. Feature Normalization
  • 11. Train/Test Split
  • 12. One-Hot Encoding
  • 13. Full Loader Function
  • 14. Testing with the Iris Dataset
  • 15. Conclusion

1. Why Load Data from Scratch?

Working without high-level libraries allows you to understand and customize the data preprocessing pipeline.

2. Types of Machine Learning Data

  • Structured: CSV, Excel, SQL
  • Semi-structured: JSON, XML
  • Unstructured: Text, Images, Audio

3. Common File Formats

  • .csv: Comma-separated
  • .tsv: Tab-separated
  • .json: Hierarchical
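Each of these formats calls for a different parsing strategy. A minimal sketch, using made-up one-record samples rather than real files:

```python
import json

# Illustrative single records, one per format
csv_line = "5.1,3.5,1.4,0.2,setosa"
tsv_line = "5.1\t3.5\t1.4\t0.2\tsetosa"
json_record = '{"sepal_length": 5.1, "species": "setosa"}'

print(csv_line.split(','))      # split on commas
print(tsv_line.split('\t'))     # split on tabs
print(json.loads(json_record))  # nested data is easiest via the json module
```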

4. Dataset Example

sepal_length,sepal_width,petal_length,petal_width,species

5.1,3.5,1.4,0.2,setosa
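To follow along without downloading the full dataset, you can first write a tiny sample file in this format. The three rows below are illustrative only; the real Iris dataset has 150 rows.

```python
# Create a minimal iris.csv so the snippets below have something to read
sample = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

with open('iris.csv', 'w') as f:
    f.write(sample)
```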

5. Reading CSV Manually


with open('iris.csv', 'r') as file:
    for line in file:
        print(line.strip())  # strip() removes the trailing newline

6. Splitting Lines


with open('iris.csv', 'r') as file:
    for line in file:
        values = line.strip().split(',')
        print(values)

7. Converting to Floats


line = "5.1,3.5,1.4,0.2,setosa"
values = line.strip().split(',')
features = list(map(float, values[:4]))  # the first four columns are numeric
label = values[4]                        # the last column is the class name

8. Missing Value Handling


def is_missing(value):
    return value == '' or value.lower() == 'nan'

# Replace missing feature values with 0.0 (a simple imputation strategy)
cleaned = [float(val) if not is_missing(val) else 0.0 for val in values[:4]]

9. Encoding Labels


label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
label = label_map[values[4]]

10. Feature Normalization


def normalize_column(col):
    min_val = min(col)
    max_val = max(col)
    if max_val == min_val:
        return [0.0] * len(col)  # constant column: avoid division by zero
    return [(x - min_val) / (max_val - min_val) for x in col]
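The function works on a single column, but a dataset is stored as rows, so you transpose, normalize each column, and transpose back. A sketch with three illustrative rows (the helper is repeated here so the snippet runs on its own):

```python
def normalize_column(col):  # same helper as defined above
    min_val, max_val = min(col), max(col)
    return [(x - min_val) / (max_val - min_val) for x in col]

# Three illustrative feature rows, not the real dataset
X = [[5.1, 3.5, 1.4, 0.2],
     [7.0, 3.2, 4.7, 1.4],
     [6.3, 3.3, 6.0, 2.5]]

columns = list(zip(*X))                                    # rows -> columns
normalized_columns = [normalize_column(list(c)) for c in columns]
X_norm = [list(row) for row in zip(*normalized_columns)]   # columns -> rows
print(X_norm[0])  # → [0.0, 1.0, 0.0, 0.0]
```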

11. Train/Test Split


import random

# Assumes X (feature rows) and y (labels) were built as in the sections above
data = list(zip(X, y))
random.shuffle(data)

split = int(0.8 * len(data))  # 80% train / 20% test
train_data = data[:split]
test_data = data[split:]

12. One-Hot Encoding


def one_hot(label, num_classes=3):
    vector = [0] * num_classes
    vector[label] = 1
    return vector
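The function maps one integer class index (from section 9) to a one-hot vector; in practice you encode the whole label list in one pass. A short sketch, with the helper repeated so it runs standalone:

```python
def one_hot(label, num_classes=3):  # same helper as defined above
    vector = [0] * num_classes
    vector[label] = 1
    return vector

# Illustrative integer labels produced by the label_map from section 9
y = [0, 2, 1]
y_encoded = [one_hot(label) for label in y]
print(y_encoded)  # → [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```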

13. Full Loader Function


def load_dataset(filename):
    data = []
    label_map = {'setosa': 0, 'versicolor': 1, 'virginica': 2}

    with open(filename, 'r') as file:
        next(file)  # skip header
        for line in file:
            if not line.strip():
                continue  # skip blank lines
            parts = line.strip().split(',')
            features = list(map(float, parts[:4]))
            label = label_map[parts[4]]
            data.append((features, label))

    return data

14. Testing with the Iris Dataset


dataset = load_dataset('iris.csv')
print("Sample:", dataset[:3])
📝 Practice: Write a script that loads, splits, normalizes, and one-hot encodes a dataset from scratch!

15. Conclusion

This tutorial taught you how to:

  • Load CSV manually using basic Python
  • Handle missing and categorical data
  • Normalize and prepare for training
💡 Next Step: Learn to build a neural network from scratch to use this data!
