Project 001 — Code

Pipeline & key excerpts (Python). Preprocessing → Correlation → Clustering → Supervised Learning

My Role (Team Lead)

Pipeline Overview

The project investigates whether household income relates to travel mode choice. The workflow is organised as four scripts, each producing outputs (CSV / figures / metrics) used by later steps.

Inputs

  • /course/household.csv
  • /course/journey_to_work.csv
  • /course/journey_to_education.csv

(Paths above reflect the original team submission environment; you can adjust them to your local setup.)

Core Outputs

  • individual_modes.csv (per-person)
  • household_modes.csv (per-household)
  • figs/ (heatmaps, boxplots, scatter, bar chart)
  • results/ (metrics JSON, confusion matrices, CV summary)

1) Preprocessing — dataset_processed.py

This script cleans household income, merges household + travel datasets, and normalises raw travel-mode text into three categories: public, private, active. It then exports two datasets: individual_modes.csv and household_modes.csv.

Income extraction (regex-based)

import re

def extract_lower_inc(s):
    """Return the bracket's lower bound as an int, or None if no pattern matches."""
    s = str(s)
    # standard range like "($65,000-$77,999)"
    match_range = re.search(r'\((\$?[\d,]+)-(\$?[\d,]+)\)', s)
    if match_range:
        num = match_range.group(1).replace(",", "").replace("$", "")
        return int(num)

    # open-ended range like "($190,000 or more)"
    match_open = re.search(r'\((\$?[\d,]+)\s*or more\)', s, re.IGNORECASE)
    if match_open:
        num = match_open.group(1).replace(",", "").replace("$", "")
        return int(num)
    return None

df_household['hhinc_numeric'] = df_household['hhinc_group'].apply(extract_lower_inc)
df_household['hhinc_group'] = df_household['hhinc_numeric']
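The extractor can be exercised standalone against the two label shapes it handles plus a non-matching string (the labels below are illustrative, not taken from the dataset):

```python
import re

def extract_lower_inc(s):
    # Lower bound of "($65,000-$77,999)" or "($190,000 or more)"; None otherwise
    s = str(s)
    match_range = re.search(r'\((\$?[\d,]+)-(\$?[\d,]+)\)', s)
    if match_range:
        return int(match_range.group(1).replace(",", "").replace("$", ""))
    match_open = re.search(r'\((\$?[\d,]+)\s*or more\)', s, re.IGNORECASE)
    if match_open:
        return int(match_open.group(1).replace(",", "").replace("$", ""))
    return None

print(extract_lower_inc("($65,000-$77,999)"))   # 65000
print(extract_lower_inc("($190,000 or more)"))  # 190000
print(extract_lower_inc("Not stated"))          # None
```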

Mode grouping (public / private / active)

mode_public  = ['bus', 'train', 'tram']
mode_private = ['car', 'vehicle', 'taxi', 'driver', 'passenger']
mode_active  = ['walking', 'bicycle']

def categorise_mode(mode):
    if pd.isna(mode):
        return None
    m = str(mode).strip().lower()

    if any(p in m for p in mode_public):
        return "public"
    elif any(p in m for p in mode_private):
        return "private"
    elif any(p in m for p in mode_active):
        return "active"
    return None
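Because the raw mode text varies between datasets, matching is by substring rather than exact label. A few hypothetical labels show the behaviour (including the fall-through to None for unmapped modes):

```python
import pandas as pd

mode_public  = ['bus', 'train', 'tram']
mode_private = ['car', 'vehicle', 'taxi', 'driver', 'passenger']
mode_active  = ['walking', 'bicycle']

def categorise_mode(mode):
    # Case-insensitive substring match against the keyword lists above
    if pd.isna(mode):
        return None
    m = str(mode).strip().lower()
    if any(p in m for p in mode_public):
        return "public"
    if any(p in m for p in mode_private):
        return "private"
    if any(p in m for p in mode_active):
        return "active"
    return None

for raw in ["Train", "Car, as driver", "Walking only", "Ferry"]:
    print(raw, "->", categorise_mode(raw))
```

Note the check order matters: "Car, as driver" matches both the `car` and `driver` keywords, but both sit in the private list, so ambiguity only arises if a label mixes categories.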

Household mode aggregation

# apply across rows
df_merged["main_mode_group"] = df_merged.apply(most_common_mode, axis=1)

# export per-individual
output_df = df_merged[["hhid", "hhinc_group", "main_mode_group"]]
output_df.to_csv("individual_modes.csv", index=False)

# export per-household (mode by household)
df_household_modes = (
    df_merged.groupby("hhid")["main_mode_group"]
      .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
      .reset_index()
)
df_final = df_household_modes.merge(
    df_merged[["hhid", "hhinc_group"]].drop_duplicates(),
    on="hhid", how="left"
)
df_final = df_final[["hhid", "hhinc_group", "main_mode_group"]]
df_final.to_csv("household_modes.csv", index=False)
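The household roll-up can be sanity-checked on a toy frame (all values hypothetical): each household should collapse to its most frequent mode, with income re-attached from the per-person rows.

```python
import pandas as pd

# Toy stand-in for df_merged: two households, per-person rows
df_merged = pd.DataFrame({
    "hhid":            [1, 1, 1, 2, 2],
    "hhinc_group":     [65000, 65000, 65000, 104000, 104000],
    "main_mode_group": ["private", "private", "public", "active", "active"],
})

# Same groupby/agg/merge shape as the export above
df_household_modes = (
    df_merged.groupby("hhid")["main_mode_group"]
      .agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None)
      .reset_index()
)
df_final = df_household_modes.merge(
    df_merged[["hhid", "hhinc_group"]].drop_duplicates(),
    on="hhid", how="left"
)
print(df_final)
```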

2) Correlation Analysis (My Part) — correlaiton.py

This part focuses on the relationship between income and travel-mode choice using one-hot encoding + Pearson correlation, plus multiple visualisations (heatmaps, boxplots, scatter). It also produces summary statistics of household counts and shares by mode.

One-hot + Pearson correlation

encoded = pd.get_dummies(df["main_mode_group"], prefix="mode", dtype=int)  # dtype=int: pandas >= 2.0 returns bool dummies by default
corr_df = pd.concat([df["hhinc_group"], encoded], axis=1).corr(method="pearson")

print("pearson correlation of household income and travel modes:")
print(corr_df["hhinc_group"].sort_values(ascending=False))
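On a small synthetic frame (numbers invented, chosen so higher incomes skew private), the encoding and correlation step behaves as expected: the private dummy correlates positively with income, the public dummy negatively.

```python
import pandas as pd

# Synthetic data purely for illustration
df = pd.DataFrame({
    "hhinc_group":     [20000, 30000, 40000, 90000, 100000, 120000],
    "main_mode_group": ["public", "public", "active", "private", "private", "private"],
})

encoded = pd.get_dummies(df["main_mode_group"], prefix="mode", dtype=int)
corr_df = pd.concat([df["hhinc_group"], encoded], axis=1).corr(method="pearson")
print(corr_df["hhinc_group"].sort_values(ascending=False))
```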

Heatmap outputs

plot_heatmap_full(
    corr_df,
    "Correlation Heatmap of Income and Travel Modes (Pearson)",
    "correlation_heatmap_full.png"
)

subset_cols = ["hhinc_group"] + [c for c in corr_df.columns if c.startswith("mode_")]
subset = corr_df.loc[["hhinc_group"], subset_cols]
# saved to: figs/correlation_heatmap_income_vs_modes.png

Distribution view (boxplot + annotations)

plt.boxplot(groups, labels=modes, showfliers=True, patch_artist=True)  # matplotlib >= 3.9 renames labels= to tick_labels=
plt.title("Distribution of Household Income under Different Travel Modes")

# compute Q1 / median / Q3 / IQR, whiskers, outliers then annotate on plot
q1, q2, q3 = np.percentile(s, [25, 50, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5*iqr, q3 + 1.5*iqr
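The same Tukey fences can be verified numerically on a small sample (values hypothetical); anything outside `[q1 - 1.5*IQR, q3 + 1.5*IQR]` is flagged as an outlier, matching what the boxplot draws:

```python
import numpy as np

# Hypothetical income sample for one mode group
s = np.array([30000, 45000, 52000, 60000, 75000, 90000, 250000])

q1, q2, q3 = np.percentile(s, [25, 50, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lo) | (s > hi)]
print(q1, q2, q3, iqr, outliers)
```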

Household counts & shares (summary table + bar chart)

household_counts = (
    df.groupby("main_mode_group")["hhid"].nunique()
      .rename("households")
      .sort_values(ascending=False)
)

summary = pd.DataFrame(household_counts)
summary["share_households"] = (summary["households"] / summary["households"].sum()).round(3)
summary.to_csv("figs/mode_household_counts_v4.csv", encoding="utf-8-sig")

plt.bar(summary.index, summary["households"])
plt.title("Number of Households by Travel Mode")
plt.savefig("figs/households_by_mode_totals_v4.png", dpi=200)

Note: The filename correlaiton.py is kept as-is from the team submission.

3) Clustering — kMeans.py

This script explores whether simple clustering can separate travel-mode patterns from income. It normalises income, maps modes to numeric values, uses the elbow method to inspect K, and visualises the result for k = 3.

Mode mapping + elbow method

def toNum(mode):
    if mode == 'private':
        return 0
    elif mode == 'active':
        return 0.5
    elif mode == 'public':
        return 1
    return None  # unmapped modes fall through explicitly

def elbow(data):
    distortions = []
    for k in range(1, 10):
        kmean = KMeans(n_clusters=k)
        kmean.fit(data[['hhinc_group', 'main_mode_group']])
        distortions.append(kmean.inertia_)
    plt.plot(range(1, 10), distortions, 'bx-')
    plt.savefig('elbow_method.png')

KMeans (k=3) visualisation

KMEANS = KMeans(n_clusters=3)
KMEANS.fit(df[['hhinc_group', 'main_mode_group']])
plt.scatter(
  df['hhinc_group'], df['main_mode_group'],
  c=[colormap.get(x) for x in KMEANS.labels_]
)
plt.yticks([0, 0.5, 1], ['private', 'active', 'public'])
plt.title('k = 3')
plt.savefig('kmeans.png')
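The income normalisation mentioned above isn't shown in the excerpt; a minimal min-max sketch, assuming that's the approach used, scales income into [0, 1] so it sits on the same range as the mapped mode values:

```python
import pandas as pd

# Hypothetical incomes; the real script reads these from household_modes.csv
df = pd.DataFrame({"hhinc_group": [20000, 65000, 104000, 190000]})

# Min-max scale so income and the 0/0.5/1 mode coding share one range
lo, hi = df["hhinc_group"].min(), df["hhinc_group"].max()
df["hhinc_group"] = (df["hhinc_group"] - lo) / (hi - lo)
print(df["hhinc_group"].tolist())
```

Without some rescaling, raw incomes (tens of thousands) would dominate the Euclidean distances and KMeans would effectively cluster on income alone.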

4) Supervised Learning & Evaluation — main.py

This script treats travel-mode choice as a classification task using income as the feature. It evaluates Logistic Regression (with standard scaling) and a Decision Tree, runs a small cross-validation search, and outputs metrics and confusion matrices.

Tiny cross-validation grid search

def tiny_cv(model, grid, X, y, scoring="f1_macro", cv=5, seed=7):
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for params in grid:
        m = model.set_params(**params)
        scores = cross_val_score(m, X, y, cv=skf, scoring=scoring)
        mean = float(scores.mean())
        if mean > best_score:
            best_score, best_params = mean, params
    return best_params
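The helper can be exercised end-to-end; here the iris dataset stands in for the project data purely as an illustration, with a hypothetical three-point depth grid for the tree:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def tiny_cv(model, grid, X, y, scoring="f1_macro", cv=5, seed=7):
    # Same loop as above: keep the grid point with the best mean CV score
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    best_params, best_score = None, -np.inf
    for params in grid:
        m = model.set_params(**params)
        scores = cross_val_score(m, X, y, cv=skf, scoring=scoring)
        if scores.mean() > best_score:
            best_score, best_params = float(scores.mean()), params
    return best_params

X, y = load_iris(return_X_y=True)
grid = [{"max_depth": d} for d in (1, 3, 5)]
best = tiny_cv(DecisionTreeClassifier(random_state=0), grid, X, y)
print(best)
```

A depth-1 stump can only separate two of the three classes, so it scores poorly on macro-F1 and the search settles on a deeper tree.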

Model definitions + evaluation outputs

logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(multi_class="multinomial", max_iter=500, random_state=args.seed))
])
dtree = DecisionTreeClassifier(random_state=args.seed)

# evaluate and save outputs:
# results/logreg_metrics.json, results/logreg_cm.png
# results/dtree_metrics.json,  results/dtree_cm.png

The supervised models are intentionally kept simple to test whether income alone provides a useful predictive signal.
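The metrics/confusion-matrix outputs can be reproduced in miniature (iris again stands in for the project data; the JSON/PNG file writes are omitted):

```python
import json
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=7
)

# Same pipeline shape as above: scale, then multinomial logistic regression
logreg = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500, random_state=7)),
])
logreg.fit(X_tr, y_tr)
pred = logreg.predict(X_te)

metrics = {"f1_macro": float(f1_score(y_te, pred, average="macro"))}
cm = confusion_matrix(y_te, pred)  # rows = true class, columns = predicted
print(json.dumps(metrics), cm.shape)
```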

How to Run (Local)

The original team submission used local file paths. Below is a clean, repeatable way to run it on your machine. Adjust filenames/paths if your CSV names differ.

  1. Place raw datasets in a folder (example: data/).
  2. Run the four scripts in order; each step consumes the outputs of the previous one:
# 1) Preprocess
python dataset_processed.py

# 2) Correlation analysis (generates figs/)
python correlaiton.py

# 3) KMeans clustering (generates elbow_method.png and kmeans.png)
python kMeans.py

# 4) Supervised learning (generates results/)
python main.py

Dependencies: pandas, numpy, matplotlib, scikit-learn (and seaborn is imported in kMeans.py).

Download and View Full Source (Python)