Project 001 — Code
Pipeline & key excerpts (Python). Preprocessing → Correlation → Clustering → Supervised Learning
My Role (Team Lead)
- Team leader (coordination, timeline, task allocation)
- Implemented the Correlation Analysis component (one-hot + Pearson + visualizations)
- Wrote most of the report’s main analysis sections
- Prepared and refined the presentation: rehearsal plan, content allocation, editing
Pipeline Overview
The project investigates whether household income relates to travel mode choice. The workflow is organised as four scripts, each producing outputs (CSV / figures / metrics) used by later steps.
Inputs
- /course/household.csv
- /course/journey_to_work.csv
- /course/journey_to_education.csv
(Paths above reflect the original team submission environment; you can adjust them to your local setup.)
Core Outputs
- individual_modes.csv (per-person)
- household_modes.csv (per-household)
- figs/ (heatmaps, boxplots, scatter, bar chart)
- results/ (metrics JSON, confusion matrices, CV summary)
1) Preprocessing — dataset_processed.py
This script cleans household income, merges household + travel datasets, and normalises raw travel-mode text into three categories: public, private, active. It then exports two datasets: individual_modes.csv and household_modes.csv.
Income extraction (regex-based)
import re

def extract_lower_inc(s):
    s = str(s)
    # standard range like "($65,000-$77,999)"
    match_range = re.search(r'\((\$?[\d,]+)-(\$?[\d,]+)\)', s)
    if match_range:
        num = match_range.group(1).replace(",", "").replace("$", "")
        return int(num)
    # open-ended range like "($190,000 or more)"
    match_open = re.search(r'\((\$?[\d,]+)\s*or more\)', s, re.IGNORECASE)
    if match_open:
        num = match_open.group(1).replace(",", "").replace("$", "")
        return int(num)
    return None

df_household['hhinc_numeric'] = df_household['hhinc_group'].apply(extract_lower_inc)
df_household['hhinc_group'] = df_household['hhinc_numeric']
Mode grouping (public / private / active)
mode_public = ['bus', 'train', 'tram']
mode_private = ['car', 'vehicle', 'taxi', 'driver', 'passenger']
mode_active = ['walking', 'bicycle']
def categorise_mode(mode):
    if pd.isna(mode):
        return None
    m = str(mode).strip().lower()
    if any(p in m for p in mode_public):
        return "public"
    elif any(p in m for p in mode_private):
        return "private"
    elif any(p in m for p in mode_active):
        return "active"
    return None
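The aggregation below applies a helper `most_common_mode` row-wise that is not included in the excerpts. A plausible sketch, assuming each merged row carries one categorised mode per journey type (the column names `mode_work` and `mode_education` are illustrative, not the team's actual schema):

```python
import pandas as pd
from collections import Counter

def most_common_mode(row):
    # Collect this row's non-missing mode values; column names are
    # hypothetical stand-ins for the merged journey columns.
    candidates = [row[c] for c in ("mode_work", "mode_education")
                  if c in row and pd.notna(row[c])]
    if not candidates:
        return None
    # Counter.most_common breaks ties by first-encountered order
    return Counter(candidates).most_common(1)[0][0]
```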
Household mode aggregation
# apply across rows
df_merged["main_mode_group"] = df_merged.apply(most_common_mode, axis=1)
# export per-individual
output_df = df_merged[["hhid", "hhinc_group", "main_mode_group"]]
output_df.to_csv("individual_modes.csv", index=False)
# export per-household (mode by household)
df_household_modes = (
    df_merged.groupby("hhid")["main_mode_group"]
    .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
    .reset_index()
)
df_final = df_household_modes.merge(
    df_merged[["hhid", "hhinc_group"]].drop_duplicates(),
    on="hhid", how="left"
)
df_final = df_final[["hhid", "hhinc_group", "main_mode_group"]]
df_final.to_csv("household_modes.csv", index=False)
2) Correlation Analysis (My Part) — correlaiton.py
This part focuses on the relationship between income and travel-mode choice using one-hot encoding + Pearson correlation, plus multiple visualisations (heatmaps, boxplots, scatter). It also produces summary statistics of household counts and shares by mode.
One-hot + Pearson correlation
encoded = pd.get_dummies(df["main_mode_group"], prefix="mode")
corr_df = pd.concat([df["hhinc_group"], encoded], axis=1).corr(method="pearson")
print("pearson correlation of household income and travel modes:")
print(corr_df["hhinc_group"].sort_values(ascending=False))
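Pearson correlation between a numeric column and a 0/1 indicator column is the point-biserial correlation, so this one-hot approach measures, per mode, how strongly membership in that mode tracks income. A minimal self-contained demo on made-up data (the names mirror the script, but the values are synthetic):

```python
import pandas as pd

# toy data: higher incomes lean "private", lower incomes lean "public"
df = pd.DataFrame({
    "hhinc_group": [20000, 30000, 40000, 90000, 110000, 150000],
    "main_mode_group": ["public", "public", "active",
                        "private", "private", "private"],
})
# one-hot encode modes (cast to int so corr() treats them numerically)
encoded = pd.get_dummies(df["main_mode_group"], prefix="mode").astype(int)
corr_df = pd.concat([df["hhinc_group"], encoded], axis=1).corr(method="pearson")
print(corr_df["hhinc_group"].sort_values(ascending=False))
```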
Heatmap outputs
plot_heatmap_full(
    corr_df,
    "Correlation Heatmap of Income and Travel Modes (Pearson)",
    "correlation_heatmap_full.png"
)
subset_cols = ["hhinc_group"] + [c for c in corr_df.columns if c.startswith("mode_")]
subset = corr_df.loc[["hhinc_group"], subset_cols]
# saved to: figs/correlation_heatmap_income_vs_modes.png
Distribution view (boxplot + annotations)
plt.boxplot(groups, labels=modes, showfliers=True, patch_artist=True)
plt.title("Distribution of Household Income under Different Travel Modes")
# compute Q1 / median / Q3 / IQR, whiskers, outliers then annotate on plot
q1, q2, q3 = np.percentile(s, [25, 50, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
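The quartile/whisker arithmetic used for the annotations can be sanity-checked on a small array (the values here are arbitrary):

```python
import numpy as np

s = np.array([10, 20, 30, 40, 50, 200])  # 200 is a deliberate outlier
q1, q2, q3 = np.percentile(s, [25, 50, 75])
iqr = q3 - q1
# Tukey fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lo) | (s > hi)]
```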
Household counts & shares (summary table + bar chart)
household_counts = (
    df.groupby("main_mode_group")["hhid"].nunique()
    .rename("households")
    .sort_values(ascending=False)
)
summary = pd.DataFrame(household_counts)
summary["share_households"] = (summary["households"] / summary["households"].sum()).round(3)
summary.to_csv("figs/mode_household_counts_v4.csv", encoding="utf-8-sig")
plt.bar(summary.index, summary["households"])
plt.title("Number of Households by Travel Mode")
plt.savefig("figs/households_by_mode_totals_v4.png", dpi=200)
Note: The filename correlaiton.py is kept as-is from the team submission.
3) Clustering — kMeans.py
This script explores whether simple clustering can separate travel-mode patterns by income. It normalises income, maps modes to numeric values, uses the elbow method to inspect candidate values of K, and visualises the result for k = 3.
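The income normalisation step is not shown in the excerpts below. A typical min-max scaling to [0, 1], assuming the same column name, would be (a sketch, not the team's exact code):

```python
import pandas as pd

def normalise_income(df):
    # Min-max scale hhinc_group to [0, 1] so it is on the same footing
    # as the 0 / 0.5 / 1 mode encoding used by toNum below.
    lo, hi = df["hhinc_group"].min(), df["hhinc_group"].max()
    df["hhinc_group"] = (df["hhinc_group"] - lo) / (hi - lo)
    return df
```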
Mode mapping + elbow method
def toNum(mode):
    if mode == 'private':
        return 0
    elif mode == 'active':
        return 0.5
    elif mode == 'public':
        return 1
    return None
def elbow(data):
    distortions = []
    for k in range(1, 10):
        kmean = KMeans(n_clusters=k)
        kmean.fit(data[['hhinc_group', 'main_mode_group']])
        distortions.append(kmean.inertia_)
    plt.plot(range(1, 10), distortions, 'bx-')
    plt.savefig('elbow_method.png')
KMeans (k=3) visualisation
KMEANS = KMeans(n_clusters=3)
KMEANS.fit(df[['hhinc_group', 'main_mode_group']])
plt.scatter(
    df['hhinc_group'], df['main_mode_group'],
    c=[colormap.get(x) for x in KMEANS.labels_]
)
plt.yticks([0, 0.5, 1], ['private', 'active', 'public'])
plt.title('k = 3')
plt.savefig('kmeans.png')
4) Supervised Learning & Evaluation — main.py
This script treats travel-mode choice as a classification task using income as the feature. It evaluates Logistic Regression (with standard scaling) and a Decision Tree, runs a small cross-validation search, and outputs metrics and confusion matrices.
Tiny cross-validation grid search
def tiny_cv(model, grid, X, y, scoring="f1_macro", cv=5, seed=7):
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for params in grid:
        m = model.set_params(**params)
        scores = cross_val_score(m, X, y, cv=skf, scoring=scoring)
        mean = float(scores.mean())
        if mean > best_score:
            best_score, best_params = mean, params
    return best_params
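tiny_cv can be exercised end-to-end on synthetic data; the function is repeated here so the snippet runs standalone, and the grid values are illustrative rather than the ones the team actually searched:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def tiny_cv(model, grid, X, y, scoring="f1_macro", cv=5, seed=7):
    # keep the grid point with the best mean cross-validated score
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for params in grid:
        scores = cross_val_score(model.set_params(**params), X, y,
                                 cv=skf, scoring=scoring)
        if scores.mean() > best_score:
            best_score, best_params = scores.mean(), params
    return best_params

# synthetic 3-class problem standing in for the income/mode data
X, y = make_classification(n_samples=150, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)
grid = [{"max_depth": d} for d in (2, 4, 8)]
best = tiny_cv(DecisionTreeClassifier(random_state=0), grid, X, y)
print(best)
```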
Model definitions + evaluation outputs
logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(multi_class="multinomial", max_iter=500, random_state=args.seed))
])
dtree = DecisionTreeClassifier(random_state=args.seed)
# evaluate and save outputs:
# results/logreg_metrics.json, results/logreg_cm.png
# results/dtree_metrics.json, results/dtree_cm.png
The supervised models are intentionally kept simple to test whether income alone provides a useful predictive signal.
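The evaluation step summarised in the comments above (metrics JSON plus a confusion-matrix figure per model) can be sketched as follows; the file names match the comments, while the train/test split and helper name `evaluate` are assumptions:

```python
import json
import os
import matplotlib
matplotlib.use("Agg")  # headless backend so savefig works without a display
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay)

def evaluate(model, name, X_train, X_test, y_train, y_test):
    """Fit a model, then write results/<name>_metrics.json and
    results/<name>_cm.png (a sketch of main.py's output step)."""
    os.makedirs("results", exist_ok=True)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    metrics = {
        "accuracy": float(accuracy_score(y_test, pred)),
        "f1_macro": float(f1_score(y_test, pred, average="macro")),
    }
    with open(os.path.join("results", f"{name}_metrics.json"), "w") as f:
        json.dump(metrics, f, indent=2)
    ConfusionMatrixDisplay(confusion_matrix(y_test, pred)).plot()
    plt.savefig(os.path.join("results", f"{name}_cm.png"), dpi=200)
    plt.close()
    return metrics
```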
How to Run (Local)
The original team submission used local file paths. Below is a clean, repeatable way to run it on your machine. Adjust filenames/paths if your CSV names differ.
- Place raw datasets in a folder (example: data/).
- Run the four scripts in order; each step's outputs feed the next:
# 1) Preprocess
python dataset_processed.py
# 2) Correlation analysis (generates figs/)
python correlaiton.py
# 3) KMeans clustering (generates elbow_method.png and kmeans.png)
python kMeans.py
# 4) Supervised learning (generates results/)
python main.py
Dependencies: pandas, numpy, matplotlib, scikit-learn (seaborn is additionally imported by kMeans.py).