Changes from 14 commits
30 commits
5b0c4ee
Add Real-Estate-Price-Prediction/requirements.txt
Naveen-Boddepalli May 30, 2026
ebb4acd
Add Real-Estate-Price-Prediction/README.md
Naveen-Boddepalli May 30, 2026
6a08c44
Add Real-Estate-Price-Prediction/Dataset/README.md
Naveen-Boddepalli May 30, 2026
dd9ea8a
Add Real-Estate-Price-Prediction/Model/Real_Estate_Price_Prediction.i…
Naveen-Boddepalli May 30, 2026
0dee70d
Add Real-Estate-Price-Prediction/Model/README.md
Naveen-Boddepalli May 30, 2026
8a5faef
Add Real-Estate-Price-Prediction/Dataset/Bengaluru_House_Data.csv
Naveen-Boddepalli May 30, 2026
28dea50
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.48…
Naveen-Boddepalli May 30, 2026
c522aee
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.48…
Naveen-Boddepalli May 30, 2026
1f16263
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.48…
Naveen-Boddepalli May 30, 2026
fd6ea6f
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.49…
Naveen-Boddepalli May 30, 2026
2ffaccc
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.50…
Naveen-Boddepalli May 30, 2026
d477025
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.53…
Naveen-Boddepalli May 30, 2026
9b0341d
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.53…
Naveen-Boddepalli May 30, 2026
678dcd0
Add Real-Estate-Price-Prediction/Images/Screenshot_2026-05-30_at_5.54…
Naveen-Boddepalli May 30, 2026
1cf4eab
Merge branch 'abhisheks008:main' into main
Naveen-Boddepalli May 31, 2026
8428b2f
Rename folder name
Naveen-Boddepalli May 31, 2026
5a43091
Delete Real Estate Price Prediction directory
Naveen-Boddepalli May 31, 2026
dd2a45b
Add Real Estate Price Prediction/requirements.txt
Naveen-Boddepalli May 31, 2026
5a976e7
Add Real Estate Price Prediction/README.md
Naveen-Boddepalli May 31, 2026
2c75ea8
Add Real Estate Price Prediction/Dataset/README.md
Naveen-Boddepalli May 31, 2026
c3c7e74
Add Real Estate Price Prediction/Dataset/Bengaluru_House_Data.csv
Naveen-Boddepalli May 31, 2026
c01f6d5
Add Real Estate Price Prediction/Model/README.md
Naveen-Boddepalli May 31, 2026
5739fd9
Add Real Estate Price Prediction/Model/Real_Estate_Price_Prediction.i…
Naveen-Boddepalli May 31, 2026
00803f8
Add Real Estate Price Prediction/Images/correlation_heatmap.png
Naveen-Boddepalli May 31, 2026
ed922eb
Add Real Estate Price Prediction/Images/eda_overview.png
Naveen-Boddepalli May 31, 2026
ae0e1c5
Add Real Estate Price Prediction/Images/model_coparision.png
Naveen-Boddepalli May 31, 2026
18b477f
Add Real Estate Price Prediction/Images/top_locations.png
Naveen-Boddepalli May 31, 2026
e641abd
Delete Real-Estate-Price-Prediction directory
Naveen-Boddepalli May 31, 2026
35212ef
Enhance README.md with comprehensive model details
Naveen-Boddepalli May 31, 2026
50512a2
Add notebook HTML
Naveen-Boddepalli May 31, 2026
13,321 changes: 13,321 additions & 0 deletions Real-Estate-Price-Prediction/Dataset/Bengaluru_House_Data.csv

Large diffs are not rendered by default.

32 changes: 32 additions & 0 deletions Real-Estate-Price-Prediction/Dataset/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Dataset: Bengaluru House Price Data

## Source
[Kaggle - Bengaluru House Price Data](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data)

## Description
This dataset contains real estate listing information from Bengaluru, India, with features that influence housing prices.

## Columns
| Column | Description |
|--------|-------------|
| `area_type` | Type of area (Super built-up, Built-up, Plot, Carpet) |
| `availability` | Possession status (Ready to Move / specific date) |
| `location` | Locality in Bengaluru |
| `size` | Number of BHK/Bedrooms (e.g., "2 BHK", "3 Bedroom") |
| `society` | Name of the housing society (if applicable) |
| `total_sqft` | Total square footage of the property |
| `bath` | Number of bathrooms |
| `balcony` | Number of balconies |
| `price` | Price in Lakhs (INR) |

## Preprocessing Notes
- `size` column needs parsing to extract numeric BHK count
- `total_sqft` may contain ranges (e.g., "2000-2500"); use the midpoint
- Missing values present in `bath`, `balcony`, `society`
- Outliers in `price_per_sqft` are removed per location: rows outside mean ± one standard deviation are dropped
- `location` has high cardinality; rare locations (fewer than 10 listings) are grouped as "other" before one-hot encoding
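
The two parsing notes above can be sketched without any dependencies. This is a minimal illustration rather than the notebook's exact code: the names `parse_bhk` and `parse_sqft` follow the notebook, while mapping every unparseable value to NaN is an assumption made here.

```python
import math


def parse_bhk(size):
    """Return the leading integer of a size string such as "2 BHK" or "3 Bedroom"."""
    try:
        return int(str(size).split()[0])
    except (ValueError, IndexError):
        return math.nan  # assumption: unparseable sizes become NaN


def parse_sqft(sqft):
    """Convert a total_sqft entry to a float, using the midpoint for ranges."""
    text = str(sqft).strip()
    try:
        if "-" in text:
            # Range such as "2000-2500": average the two bounds
            low, high = text.split("-", 1)
            return (float(low) + float(high)) / 2
        return float(text)
    except ValueError:
        return math.nan  # e.g. entries that carry units, such as "34.46Sq. Meter"


print(parse_bhk("2 BHK"))        # 2
print(parse_sqft("2000-2500"))   # 2250.0
```

Rows that come back as NaN can then be dropped or imputed, depending on how many there are.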

## Download Instructions
1. Visit the [Kaggle dataset page](https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data)
2. Download `Bengaluru_House_Data.csv`
3. Place it in this `Dataset/` folder
Empty file.
229 changes: 229 additions & 0 deletions Real-Estate-Price-Prediction/Model/Real_Estate_Price_Prediction.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,229 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cell-01",
"metadata": {},
"source": "# Real Estate Price Prediction\n**Dataset:** Bengaluru House Price Data (Kaggle) \n**Author:** Naveen-Boddepalli \n**Event:** GSSoC 2026"
},
{
"cell_type": "markdown",
"id": "cell-02",
"metadata": {},
"source": "## 1. Imports & Setup"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-03",
"metadata": {},
"outputs": [],
"source": "import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nimport warnings\nwarnings.filterwarnings('ignore')\n\n# ML\nfrom sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV\nfrom sklearn.linear_model import LinearRegression, Lasso\nfrom sklearn.ensemble import RandomForestRegressor\nfrom sklearn.preprocessing import StandardScaler, OneHotEncoder\nfrom sklearn.compose import ColumnTransformer\nfrom sklearn.pipeline import Pipeline\nfrom sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error\nfrom xgboost import XGBRegressor\nimport joblib\n\n# DL\nimport torch\nimport torch.nn as nn\nfrom torch.utils.data import DataLoader, TensorDataset\n\nprint(\"All imports successful!\")"
},
{
"cell_type": "markdown",
"id": "cell-04",
"metadata": {},
"source": "## 2. Load Dataset"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-05",
"metadata": {},
"outputs": [],
"source": "df = pd.read_csv('../Dataset/Bengaluru_House_Data.csv')\nprint(f\"Shape: {df.shape}\")\ndf.head()"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-06",
"metadata": {},
"outputs": [],
"source": "print(df.info())\nprint(\"\\nMissing values:\")\nprint(df.isnull().sum())"
},
{
"cell_type": "markdown",
"id": "cell-07",
"metadata": {},
"source": "## 3. Data Preprocessing"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-08",
"metadata": {},
"outputs": [],
"source": "# Drop rows where location or size is null (very few)\ndf.dropna(subset=['location', 'size'], inplace=True)\n\n# Fill missing bath with median\ndf['bath'] = df['bath'].fillna(df['bath'].median())\ndf['balcony'] = df['balcony'].fillna(df['balcony'].median())\n\n# Drop society (too many nulls, low signal)\ndf.drop(columns=['society'], inplace=True)\n\nprint(\"After basic cleaning:\", df.shape)"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-09",
"metadata": {},
"outputs": [],
"source": "# Parse 'size' -> BHK count\ndef parse_bhk(size):\n try:\n return int(str(size).split()[0])\n except:\n return np.nan\n\ndf['bhk'] = df['size'].apply(parse_bhk)\ndf.dropna(subset=['bhk'], inplace=True)\ndf['bhk'] = df['bhk'].astype(int)\ndf.drop(columns=['size'], inplace=True)\nprint(df['bhk'].value_counts().head(10))"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-10",
"metadata": {},
"outputs": [],
"source": "# Parse 'total_sqft' (handles ranges like \"2000-2500\")\ndef parse_sqft(sqft):\n try:\n if '-' in str(sqft):\n parts = sqft.split('-')\n return (float(parts[0]) + float(parts[1])) / 2\n return float(sqft)\n except:\n return np.nan\n\ndf['total_sqft'] = df['total_sqft'].apply(parse_sqft)\ndf.dropna(subset=['total_sqft'], inplace=True)\nprint(f\"Shape after sqft parsing: {df.shape}\")"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-11",
"metadata": {},
"outputs": [],
"source": "# Engineer price_per_sqft\ndf['price_per_sqft'] = df['price'] * 1e5 / df['total_sqft']\n\n# Group rare locations\nlocation_counts = df['location'].value_counts()\ndf['location'] = df['location'].apply(\n lambda x: x.strip() if location_counts[x.strip()] >= 10 else 'other'\n)\nprint(f\"Unique locations after grouping: {df['location'].nunique()}\")"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-12",
"metadata": {},
"outputs": [],
"source": "# Remove outliers: price_per_sqft outside mean +/- std per location\ndef remove_pps_outliers(df):\n df_out = pd.DataFrame()\n for loc, subdf in df.groupby('location'):\n m = subdf['price_per_sqft'].mean()\n s = subdf['price_per_sqft'].std()\n reduced = subdf[(subdf['price_per_sqft'] > (m - s)) & (subdf['price_per_sqft'] < (m + s))]\n df_out = pd.concat([df_out, reduced], ignore_index=True)\n return df_out\n\ndf = remove_pps_outliers(df)\n\n# Remove extreme bath counts (bath > bhk+2 is suspicious)\ndf = df[df['bath'] <= df['bhk'] + 2]\n\nprint(f\"Shape after outlier removal: {df.shape}\")"
},
{
"cell_type": "markdown",
"id": "cell-13",
"metadata": {},
"source": "## 4. Exploratory Data Analysis"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-14",
"metadata": {},
"outputs": [],
"source": "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n\n# Price distribution\naxes[0].hist(df['price'], bins=50, color='steelblue', edgecolor='white')\naxes[0].set_title('Price Distribution (Lakhs)', fontsize=13)\naxes[0].set_xlabel('Price'); axes[0].set_ylabel('Count')\n\n# Sqft vs Price scatter\naxes[1].scatter(df['total_sqft'], df['price'], alpha=0.3, color='coral', s=5)\naxes[1].set_title('Total Sqft vs Price', fontsize=13)\naxes[1].set_xlabel('Total Sqft'); axes[1].set_ylabel('Price (Lakhs)')\n\n# BHK distribution\ndf['bhk'].value_counts().sort_index().plot(kind='bar', ax=axes[2], color='mediumseagreen')\naxes[2].set_title('BHK Distribution', fontsize=13)\naxes[2].set_xlabel('BHK'); axes[2].set_ylabel('Count')\n\nplt.tight_layout()\nplt.savefig('../Images/eda_overview.png', dpi=150, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-15",
"metadata": {},
"outputs": [],
"source": "# Correlation heatmap\nplt.figure(figsize=(8, 6))\nnumeric_cols = ['total_sqft', 'bath', 'balcony', 'bhk', 'price_per_sqft', 'price']\ncorr = df[numeric_cols].corr()\nsns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', square=True)\nplt.title('Correlation Heatmap', fontsize=14)\nplt.tight_layout()\nplt.savefig('../Images/correlation_heatmap.png', dpi=150, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-16",
"metadata": {},
"outputs": [],
"source": "# Top 10 locations by median price\ntop10 = df.groupby('location')['price'].median().sort_values(ascending=False).head(10)\nplt.figure(figsize=(12, 5))\ntop10.plot(kind='bar', color='orchid', edgecolor='white')\nplt.title('Top 10 Locations by Median Price (Lakhs)', fontsize=14)\nplt.xticks(rotation=45, ha='right')\nplt.ylabel('Median Price (Lakhs)')\nplt.tight_layout()\nplt.savefig('../Images/top_locations.png', dpi=150, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"id": "cell-17",
"metadata": {},
"source": "## 5. Feature Engineering & Train/Test Split"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-18",
"metadata": {},
"outputs": [],
"source": "# Drop derived feature before modeling (avoid leakage)\ndf.drop(columns=['price_per_sqft'], inplace=True)\n\nX = df.drop(columns=['price'])\ny = df['price']\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\nprint(f\"Train: {X_train.shape}, Test: {X_test.shape}\")"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-19",
"metadata": {},
"outputs": [],
"source": "# Preprocessing pipeline\ncategorical_features = ['area_type', 'availability', 'location']\nnumerical_features = ['total_sqft', 'bath', 'balcony', 'bhk']\n\npreprocessor = ColumnTransformer(transformers=[\n ('num', StandardScaler(), numerical_features),\n ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_features)\n])"
},
{
"cell_type": "markdown",
"id": "cell-20",
"metadata": {},
"source": "## 6. ML Models"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-21",
"metadata": {},
"outputs": [],
"source": "results = {}\n\ndef evaluate(name, y_true, y_pred):\n r2 = r2_score(y_true, y_pred)\n mae = mean_absolute_error(y_true, y_pred)\n rmse = np.sqrt(mean_squared_error(y_true, y_pred))\n results[name] = {'R2': round(r2, 4), 'MAE': round(mae, 2), 'RMSE': round(rmse, 2)}\n print(f\"{name}: R2={r2:.4f} | MAE={mae:.2f} | RMSE={rmse:.2f}\")\n\n# Linear Regression\nlr_pipe = Pipeline([('pre', preprocessor), ('model', LinearRegression())])\nlr_pipe.fit(X_train, y_train)\nevaluate('Linear Regression', y_test, lr_pipe.predict(X_test))\n\n# Lasso\nlasso_pipe = Pipeline([('pre', preprocessor), ('model', Lasso(alpha=1.0))])\nlasso_pipe.fit(X_train, y_train)\nevaluate('Lasso', y_test, lasso_pipe.predict(X_test))\n\n# Random Forest\nrf_pipe = Pipeline([('pre', preprocessor), ('model', RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1))])\nrf_pipe.fit(X_train, y_train)\nevaluate('Random Forest', y_test, rf_pipe.predict(X_test))\n\n# XGBoost\nxgb_pipe = Pipeline([('pre', preprocessor), ('model', XGBRegressor(n_estimators=300, learning_rate=0.05, max_depth=6, random_state=42, n_jobs=-1))])\nxgb_pipe.fit(X_train, y_train)\nevaluate('XGBoost', y_test, xgb_pipe.predict(X_test))"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-22",
"metadata": {},
"outputs": [],
"source": "# Save best ML model\nbest_ml_model = xgb_pipe # Update if another model wins\njoblib.dump(best_ml_model, 'real_estate_best_model.pkl')\nprint(\"Saved: real_estate_best_model.pkl\")"
},
{
"cell_type": "markdown",
"id": "cell-23",
"metadata": {},
"source": "## 7. Deep Learning Models (PyTorch)"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-24",
"metadata": {},
"outputs": [],
"source": "# Preprocess for DL: use the same preprocessor fitted on ML data\nX_train_proc = preprocessor.fit_transform(X_train)\nX_test_proc = preprocessor.transform(X_test)\n\nX_train_t = torch.FloatTensor(X_train_proc)\ny_train_t = torch.FloatTensor(y_train.values).unsqueeze(1)\nX_test_t = torch.FloatTensor(X_test_proc)\ny_test_t = torch.FloatTensor(y_test.values).unsqueeze(1)\n\ninput_dim = X_train_t.shape[1]\nprint(f\"Input dimension: {input_dim}\")\n\ntrain_ds = TensorDataset(X_train_t, y_train_t)\ntrain_dl = DataLoader(train_ds, batch_size=64, shuffle=True)"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-25",
"metadata": {},
"outputs": [],
"source": "# \u2500\u2500 MLP \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nclass MLP(nn.Module):\n def __init__(self, input_dim):\n super().__init__()\n self.net = nn.Sequential(\n nn.Linear(input_dim, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),\n nn.Linear(256, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Dropout(0.2),\n nn.Linear(128, 64), nn.ReLU(),\n nn.Linear(64, 1)\n )\n def forward(self, x): return self.net(x)\n\n# \u2500\u2500 Wide & Deep \u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\nclass WideDeep(nn.Module):\n def __init__(self, input_dim):\n super().__init__()\n self.wide = nn.Linear(input_dim, 1)\n self.deep = nn.Sequential(\n nn.Linear(input_dim, 256), nn.ReLU(), nn.Dropout(0.3),\n nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.2),\n nn.Linear(128, 1)\n )\n def forward(self, x):\n return self.wide(x) + self.deep(x)\n\ndef train_model(model, epochs=50, lr=1e-3):\n opt = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-5)\n scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, patience=5, factor=0.5)\n loss_fn = nn.MSELoss()\n model.train()\n for epoch in range(epochs):\n total = 0\n for xb, yb in train_dl:\n opt.zero_grad()\n loss = loss_fn(model(xb), yb)\n loss.backward()\n opt.step()\n total += loss.item()\n avg = total / len(train_dl)\n scheduler.step(avg)\n if (epoch+1) % 10 == 0:\n print(f\" Epoch {epoch+1}/{epochs} \u2014 
loss: {avg:.4f}\")\n return model\n\ndef eval_dl(name, model):\n model.eval()\n with torch.no_grad():\n preds = model(X_test_t).squeeze().numpy()\n evaluate(name, y_test.values, preds)"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-26",
"metadata": {},
"outputs": [],
"source": "print(\"Training MLP...\")\nmlp = MLP(input_dim)\nmlp = train_model(mlp, epochs=50)\neval_dl(\"MLP\", mlp)\n\nprint(\"\\nTraining Wide & Deep...\")\nwd = WideDeep(input_dim)\nwd = train_model(wd, epochs=50)\neval_dl(\"Wide & Deep\", wd)"
},
{
"cell_type": "markdown",
"id": "cell-27",
"metadata": {},
"source": "## 8. Model Comparison"
},
{
"cell_type": "code",
"execution_count": null,
"id": "cell-28",
"metadata": {},
"outputs": [],
"source": "results_df = pd.DataFrame(results).T.sort_values('R2', ascending=False)\nprint(results_df.to_string())\n\nfig, axes = plt.subplots(1, 3, figsize=(16, 5))\nfor ax, metric in zip(axes, ['R2', 'MAE', 'RMSE']):\n colors = ['steelblue' if i > 0 else 'coral' for i in range(len(results_df))]\n results_df[metric].plot(kind='bar', ax=ax, color='steelblue', edgecolor='white')\n ax.set_title(f'{metric} by Model', fontsize=13)\n ax.set_xticklabels(results_df.index, rotation=30, ha='right', fontsize=9)\n ax.set_ylabel(metric)\n\nplt.tight_layout()\nplt.savefig('../Images/model_comparison.png', dpi=150, bbox_inches='tight')\nplt.show()"
},
{
"cell_type": "markdown",
"id": "cell-29",
"metadata": {},
"source": "## 9. Conclusions\n\n> *(Update after running \u2014 best model, key insights, feature importance)*\n\n- **Best model:** TBD\n- **Key features:** location, total_sqft, bhk\n- **Suggestions:** TabNet and XGBoost likely top performers on this tabular dataset"
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
93 changes: 93 additions & 0 deletions Real-Estate-Price-Prediction/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# Real Estate Price Prediction: Model Documentation

## Objective
Predict real estate prices in Bengaluru using ML and Deep Learning models on the Bengaluru House Price dataset.

---

## Approach

### 1. Data Preprocessing
- Handle missing values in `bath`, `balcony`, `society`
- Parse `size`: extract numeric BHK (e.g., "2 BHK" → 2)
- Handle `total_sqft` ranges (e.g., "2000-2500" → 2250)
- Engineer `price_per_sqft` feature
- Remove outliers using mean ± std deviation per location group
- Group rare locations (< 10 listings) as "other"
- One-Hot Encode the categorical columns (`location`, `area_type`, `availability`)
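
The rare-location grouping and the per-location outlier rule above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the notebook's pandas code: the list-of-dicts row layout is hypothetical, and locations with a single listing are dropped because a standard deviation is undefined for them. `price_per_sqft` is assumed to be precomputed as `price * 1e5 / total_sqft`, since `price` is in lakhs.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev


def group_rare_locations(locations, min_count=10):
    """Replace locations with fewer than min_count listings by "other"."""
    cleaned = [loc.strip() for loc in locations]  # strip before counting
    counts = Counter(cleaned)
    return [loc if counts[loc] >= min_count else "other" for loc in cleaned]


def remove_pps_outliers(rows):
    """Keep rows whose price_per_sqft lies strictly within mean +/- 1 std of its location.

    rows: list of dicts with "location" and "price_per_sqft" keys (hypothetical layout).
    """
    by_location = defaultdict(list)
    for row in rows:
        by_location[row["location"]].append(row)

    kept = []
    for group in by_location.values():
        if len(group) < 2:
            continue  # assumption: std is undefined for one listing, so it is dropped
        values = [r["price_per_sqft"] for r in group]
        m, s = mean(values), stdev(values)  # sample std (ddof=1), the pandas default
        kept.extend(r for r in group if m - s < r["price_per_sqft"] < m + s)
    return kept


rows = [{"location": "A", "price_per_sqft": v} for v in (5000, 5200, 5100, 20000)]
print([r["price_per_sqft"] for r in remove_pps_outliers(rows)])  # [5000, 5200, 5100]
```

A one-standard-deviation band is fairly aggressive, since it discards a sizable share of listings even in well-behaved locations; a wider band is a reasonable alternative if too much data is lost.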

### 2. Exploratory Data Analysis (EDA)
- Price distribution across top locations (bar/violin plots)
- Correlation heatmap of numerical features
- Scatter: `total_sqft` vs `price`
- Box plots for outlier visualization

### 3. Models Implemented

#### Machine Learning Baseline
| Model | Notes |
|-------|-------|
| Linear Regression | Baseline |
| Lasso Regression | L1 regularization; feature selection |
| Random Forest | Ensemble; handles non-linearity |
| XGBoost | Gradient boosting; top ML performer |

Tuning: GridSearchCV with 5-Fold Cross Validation
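
To make the tuning step concrete, here is a dependency-free sketch of what a grid search with 5-fold cross-validation does. The notebook in this diff imports `GridSearchCV` but does not call it, so everything below is an assumption for illustration: the one-feature ridge model, the `alphas` grid, and all function names are hypothetical stand-ins for a scikit-learn estimator and `param_grid`.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-overlapping folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def fit_ridge_1d(x, y, alpha):
    """Closed-form slope of a no-intercept, one-feature ridge regression (toy model)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def cv_mse(x, y, alpha, k=5):
    """Mean validation MSE of the toy model across k folds."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(x), k):
        w = fit_ridge_1d([x[i] for i in train_idx], [y[i] for i in train_idx], alpha)
        errors = [(y[i] - w * x[i]) ** 2 for i in val_idx]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


def grid_search(x, y, alphas, k=5):
    """Return the alpha with the lowest cross-validated MSE."""
    return min(alphas, key=lambda a: cv_mse(x, y, a, k))


x = list(range(1, 21))
y = [2 * v for v in x]  # noiseless data, so no regularization is needed
print(grid_search(x, y, alphas=[0.0, 1.0, 10.0]))  # 0.0
```

In the notebook itself the same idea would be expressed by wrapping one of the existing pipelines in `GridSearchCV(pipeline, param_grid, cv=5)`, with shuffled folds preferable to the contiguous ones used here if the rows are ordered.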

#### Deep Learning Models
| Model | Notes |
|-------|-------|
| MLP (Feedforward NN) | ReLU + Dropout + BatchNorm layers |
| Wide & Deep Network | Linear memorization + deep generalization (Google) |
| TabNet | Attention-based; interpretable; great for tabular data |
| Embedding-based DNN | Learned entity embeddings for `location` + DNN |

Stack: TensorFlow/Keras or PyTorch
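
For reference, the Wide & Deep variant in the notebook's `WideDeep` class adds a single linear layer to a small MLP and is trained with mean squared error. The notation below is introduced here for illustration only: $\mathbf{x}$ is the preprocessed feature vector, $f_\theta$ the deep branch, and $N$ the number of training rows.

```latex
\hat{y}(\mathbf{x}) = \underbrace{\mathbf{w}^{\top}\mathbf{x} + b}_{\text{wide}}
                    + \underbrace{f_{\theta}(\mathbf{x})}_{\text{deep}},
\qquad
\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat{y}(\mathbf{x}_i)\bigr)^{2}
```

Because both branches see the same one-hot encoded input, the wide term can memorize per-location offsets while the deep term models interactions between features.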

### 4. Evaluation Metrics
| Metric | Description |
|--------|-------------|
| R² Score | Proportion of variance explained |
| MAE | Mean Absolute Error (Lakhs INR) |
| RMSE | Root Mean Squared Error (Lakhs INR) |
| Cross-val Score | 5-fold CV for reliability |
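
The three point metrics in the table can be computed by hand, which is a useful check on the units: MAE and RMSE are in lakhs, while R² is unitless. The sketch below is dependency-free, and the helper name `regression_metrics` is hypothetical; the notebook uses `r2_score`, `mean_absolute_error`, and `mean_squared_error` from scikit-learn instead.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, and RMSE for two equal-length sequences of numbers."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"R2": r2, "MAE": mae, "RMSE": rmse}


print(regression_metrics([3, 5, 7], [2, 5, 8])["R2"])  # 0.75
```

RMSE penalizes large misses more heavily than MAE, so a wide gap between the two usually points to a few badly mispriced listings.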

---

## Results

> *(Update after running the notebook)*

| Model | R² | MAE | RMSE |
|-------|----|-----|------|
| Linear Regression | TBD | TBD | TBD |
| Lasso Regression | TBD | TBD | TBD |
| Random Forest | TBD | TBD | TBD |
| XGBoost | TBD | TBD | TBD |
| MLP | TBD | TBD | TBD |
| Wide & Deep | TBD | TBD | TBD |
| TabNet | TBD | TBD | TBD |
| Embedding DNN | TBD | TBD | TBD |

---

## Visualizations

> *(Add saved plots from the Images/ folder)*

- `price_distribution.png`: Price spread across locations
- `correlation_heatmap.png`: Feature correlation matrix
- `sqft_vs_price.png`: Scatter plot
- `model_comparison.png`: R² / RMSE comparison bar chart

---

## Conclusions

> *(Fill in after training)*

---

## Saved Artifacts
- `real_estate_best_model.pkl`: Best ML model (joblib)
- `real_estate_dl_model.h5`: Best DL model (Keras), or `.pt` (PyTorch)
28 changes: 28 additions & 0 deletions Real-Estate-Price-Prediction/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Core
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.4.2

# ML Models
xgboost==2.0.3

# Deep Learning
torch==2.3.0
torchvision==0.18.0
tensorflow==2.16.1
pytorch-tabnet==4.1.0

# Visualization
matplotlib==3.9.0
seaborn==0.13.2
plotly==5.22.0

# Notebook
jupyter==1.0.0
ipykernel==6.29.4

# Model persistence
joblib==1.4.2

# Utilities
tqdm==4.66.4