Final Project

Your Mission

Congratulations on making it this far :) You’ve learned some powerful mathematical and computational tools. Now it’s time to put them to work on something you care about.

Your final project is straightforward: Pick a real-world question, find data to answer it, and use the techniques you’ve learned to tell a compelling story.

In the video above, we walk through what makes a great final project and show you an example of how to successfully apply what you have learned.

What Makes a Great Project?

The best projects have three ingredients:

A Question You Actually Care About

It doesn’t have to be earth-shattering. It just has to be something that matters to you and your community. The more specific and local, the better. The best questions are the ones you see in your daily life.

Data You Can Get Your Hands On

More on this below - but don’t worry, we’ll show you where to look!

A Technique That Fits Your Question

Use at least one technique from this course, or explore more advanced methods if they help answer your question better. The goal of this project is not to restrict you to course tools, but to help you use data science to answer a meaningful real-world question.

Project Ideas to Get You Started

Not sure where to begin? Start with a question that genuinely matters to you and your community. The most powerful machine learning projects are not the ones that sound impressive; they are the ones that solve real problems close to home. For this course, we encourage you to focus on African challenges and local realities. Think about the issues you see every day: flood prediction in informal settlements, crop disease detection for smallholder farmers, electricity outage patterns in your county, traffic congestion in your city, air quality near busy roads, or youth unemployment trends in your area.

For example, let’s say you’re from Kenya and you’ve noticed how loud matatu music can be during long commutes. You might ask: Is there a correlation between prolonged exposure to high-decibel matatu music and hearing issues among commuters? That question can turn into a real machine learning project: collecting survey data, measuring exposure time, analysing patterns, and testing whether there is a correlation (and carefully exploring whether causation can be inferred).
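As a rough illustration of what "testing whether there is a correlation" could look like, here is a minimal sketch. The column names and every number in it are invented for this example; it is not real survey data, and a real project would need many more responses and a careful look at confounders before saying anything about causation.

```python
import pandas as pd

# Invented survey responses (NOT real data): weekly hours of exposure to
# loud matatu music and a hypothetical hearing-test score out of 100.
survey = pd.DataFrame({
    "weekly_exposure_hours": [2, 4, 6, 8, 10, 12],
    "hearing_score": [95, 92, 90, 85, 83, 78],
})

# Pearson correlation coefficient between the two columns.
# Values near -1 or +1 suggest a strong linear relationship; near 0, a weak one.
r = survey["weekly_exposure_hours"].corr(survey["hearing_score"])
print(f"Pearson r = {r:.2f}")
```

A strong negative value here would only say that the two columns move together in this made-up sample. It would not, on its own, show that the music causes the hearing issues.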

This is exactly the spirit of the Kujenga course: Start with what you see. Ask a meaningful question. Use data to investigate it. Your project does not have to be perfect. It just has to be relevant and thoughtful.

Where to Find Data

The internet is full of free, accessible datasets, but don’t limit yourself to global sources only. Some of the most impactful projects come from local African datasets. You can explore:

Government open data portals (for example, national statistics offices, health ministries, meteorological departments).
African research institutions and NGOs.
County-level public records.
International datasets with African coverage (World Bank, UN data, etc.).
Community-collected surveys and field data.

And remember: you are not restricted to pre-existing datasets. You can collect your own data through surveys, simple measurements, interviews, or partnerships with local organisations. If you ever feel stuck, these repositories below can be your starting point. The goal is to find data that helps you answer a question you genuinely care about.

General Data Repositories

Kaggle Datasets - Thousands of clean, ready-to-use datasets
Google Dataset Search - Like Google but for data
Our World in Data - Beautiful, clean data on global issues
Data.gov - US government open data
UCI Machine Learning Repository - Classic datasets

Cleaning Your Data

Real-world data is messy. Here’s how to clean it:

import pandas as pd

Step 1: Load and Inspect

First, load your data and take a look at what you’re working with. This helps you understand the structure and identify potential issues.

# Load your data

#data = pd.read_csv('your_data.csv')

# Take a look
#print(data.head())
#print(data.info())
#print(data.describe())
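If you want to try the inspection step before you have a dataset, the sketch below reads a tiny CSV from a string instead of a file. The CSV content, column names, and values are made up for illustration and simply stand in for 'your_data.csv'.

```python
import io

import pandas as pd

# Made-up CSV text standing in for 'your_data.csv' (illustrative only).
csv_text = """county,date,rainfall_mm
Nairobi,2024-01-01,12.5
Kisumu,2024-01-01,30.1
Mombasa,2024-01-02,
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)       # (number of rows, number of columns)
print(data.head())      # first few rows
print(data.dtypes)      # data type of each column
print(data.describe())  # summary statistics for numeric columns
```

Notice that the last row has no rainfall value. pandas reads it as a missing value (NaN), which is exactly the kind of issue this first look is meant to surface.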

Step 2: Handle Missing Values

Missing data is common. You can either remove rows with missing values or fill them with appropriate values like the mean or median.

# See where data is missing
#print(data.isnull().sum())

# Option 1: Drop rows with missing values
#data_clean = data.dropna()

# Option 2: Fill missing values with mean/median
#data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
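Here is a small runnable sketch of both options on a toy table. The column names and values are invented for illustration. It assigns the filled column back to the DataFrame rather than relying on inplace=True on a selected column, which newer pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (illustrative values only).
data = pd.DataFrame({
    "age": [21, 35, np.nan, 50],
    "income": [300.0, np.nan, 450.0, 1200.0],
})

# Count missing values per column.
missing_per_column = data.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that has at least one missing value.
dropped = data.dropna()

# Option 2: fill the gaps, here with the mean for one column
# and the median for the other.
filled = data.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["income"] = filled["income"].fillna(filled["income"].median())
print(filled)
```

Dropping rows is simple but throws data away (here, half the table). The median is usually the safer fill value when a column has a few very large or very small entries, because it is less affected by them than the mean.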

Step 3: Remove Duplicates

Duplicate rows can skew your analysis. Check for and remove them.

# Check for duplicates
#print(data.duplicated().sum())

# Remove them

#data_clean = data.drop_duplicates()
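The same two calls on a toy table, as a minimal sketch (the respondent labels and values are invented):

```python
import pandas as pd

# Toy data in which respondent "B" was recorded twice (illustrative only).
data = pd.DataFrame({
    "respondent": ["A", "B", "B", "C"],
    "commute_minutes": [40, 55, 55, 30],
})

# duplicated() marks rows that repeat an earlier row exactly.
n_duplicates = int(data.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row.
data_clean = data.drop_duplicates().reset_index(drop=True)
print(data_clean)
```

By default a row only counts as a duplicate if every column matches. If one column should be unique on its own (for example a respondent ID), drop_duplicates(subset=[...]) restricts the comparison to those columns.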

Step 4: Fix Data Types

Make sure each column has the correct data type. Dates should be datetime objects, numbers should be numeric types, etc.

# Convert to the right type

#data['date_column'] = pd.to_datetime(data['date_column'])
#data['numeric_column'] = pd.to_numeric(data['numeric_column'])
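Real columns often contain a few entries that cannot be converted at all. The sketch below, on invented values, assumes you would rather turn those entries into missing values than have the conversion raise an error, which is what errors="coerce" does.

```python
import pandas as pd

# Toy data where everything was read in as text (illustrative only).
data = pd.DataFrame({
    "date_column": ["2024-01-05", "2024-02-10", "not a date"],
    "numeric_column": ["12", "7.5", "unknown"],
})

# errors="coerce" replaces unparseable entries with NaT / NaN
# instead of stopping with an exception.
data["date_column"] = pd.to_datetime(data["date_column"], errors="coerce")
data["numeric_column"] = pd.to_numeric(data["numeric_column"], errors="coerce")

print(data.dtypes)
print(data)
```

After coercing, check how many values became missing. A large number usually means the column has a formatting problem worth fixing at the source.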

Step 5: Filter Outliers (if needed)

Extreme values can distort your results. Use the interquartile range (IQR) method to identify and remove outliers.

# Remove extreme values

#Q1 = data['column'].quantile(0.25)
#Q3 = data['column'].quantile(0.75)
#IQR = Q3 - Q1

#data_clean = data[(data['column'] >= Q1 - 1.5*IQR) &
#                  (data['column'] <= Q3 + 1.5*IQR)]
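A runnable sketch of the IQR rule on a toy column (values invented), which also keeps the flagged rows so you can look at them before deciding to drop anything:

```python
import pandas as pd

# Toy column with one value far from the rest (illustrative only).
data = pd.DataFrame({"column": [10, 12, 11, 13, 12, 11, 95]})

q1 = data["column"].quantile(0.25)
q3 = data["column"].quantile(0.75)
iqr = q3 - q1

# Anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = ~data["column"].between(lower, upper)

outliers = data[is_outlier]     # rows to inspect
data_clean = data[~is_outlier]  # rows that stay

print(outliers)
print(data_clean)
```

An extreme value is not always an error. It may be a typo, or it may be the most interesting observation in your dataset, so inspect flagged rows and explain in your notebook why you removed them.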

Need more help? Check out the Pandas documentation or this data cleaning tutorial.

Project Requirements

Your final project should include:

A Clear Question - What are you trying to find out?
Data Description - Where did you get your data? What does it contain?
Data Cleaning - Show how you cleaned and prepared your data
Analysis - Use at least one technique from the course (regression, t-test, modeling, etc.)
Visualization - Create at least 2 plots that help tell your story
Conclusion - What did you discover? What are the limitations?
Code - Submit your Jupyter notebook or Python script
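To make the Analysis item concrete, here is a minimal sketch of one of the techniques it mentions, a simple linear regression, fitted with NumPy. The data points are invented, and np.polyfit is just one convenient way to fit a line; the course may have introduced a different tool for the same job.

```python
import numpy as np

# Invented data points (illustrative only): x could be years of schooling,
# y could be some outcome you measured.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares fit of a straight line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# R^2: the share of the variation in y that the fitted line explains.
predictions = slope * x + intercept
ss_res = np.sum((y - predictions) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

In your notebook, follow a result like this with a sentence or two in a Markdown cell saying what the slope means for your question and how much you trust it.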

Submission Instructions

In addition to submitting your Jupyter notebook, you are also required to create a GitHub repository for your final project.

Why? Because when tutors run your notebook, the data should load easily and automatically. The best way to do this is to store your dataset inside your GitHub repository and load it directly from there in your notebook.
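One way to load a CSV directly from a public GitHub repository is through its "raw" file URL. The sketch below builds such a URL; the username, repository name, branch, and file path are placeholders you would replace with your own, and github_raw_url is a hypothetical helper written for this example, not part of any library.

```python
import pandas as pd


def github_raw_url(user, repo, branch, path):
    """Build the raw-file URL for a file in a public GitHub repository."""
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


# Placeholder values: replace them with your own repository details.
url = github_raw_url("your-username", "your-repo", "main", "data/your_data.csv")
print(url)

# Once the file exists in your repository, pandas can read it straight from the URL:
# data = pd.read_csv(url)
```

Loading from the URL means a tutor can run your notebook without downloading anything by hand. This only works for public repositories and for files small enough to store on GitHub.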

What you should submit:

A Jupyter Notebook (.ipynb) - File naming: yourname_yourcountry_final_project.ipynb
A GitHub repository link to your project containing:
- Your notebook
- Your dataset (or a link to the dataset)
- A short README explaining your project
- Any other files you see fit to submit (requirements file, etc.)

Submit a Jupyter notebook HERE that includes:

Markdown cells explaining your thinking
Code cells showing your analysis
Visualizations
A final conclusion section

Tips for Success

Start simple - Better to do one thing well than many things poorly

Tell a story - Guide your reader through your thinking

Make it visual - Good plots make your findings memorable

Be honest - If your hypothesis was wrong, that’s okay! Explain what you learned

Ask for help - Stuck? Reach out to your instructor or classmates

Getting Started

Pick a question that excites you

Find a dataset

Download the data and start exploring

Clean it up and apply what you’ve learned

Create visualizations that tell your story

Share your findings

Celebrate your hard work and new skills!

Remember: The best projects are the ones where you learn something new about a topic you care about.

Have Any Questions? Ask your instructor.
