Automobile Data Analysis Report¶

Introduction¶

This report provides a structured data analysis of an automobile dataset, focusing on key characteristics, relationships, and statistical insights. The analysis covers data preparation, cleaning, and multiple levels of statistical exploration, culminating in actionable insights for business decision-making

Scenario¶

A dataset containing various attributes of automobiles—including technical specifications, fuel economy, and pricing—has been provided for analysis. The goal is to uncover patterns and correlations that can inform pricing strategies, product positioning, and market segmentation.

Step 1 : ASK¶

Business Task¶

  • To analyze the automobile dataset to identify key factors influencing car prices and to understand relationships among technical and categorical features.

Key Questions¶

  • What are the main drivers of automobile price?

  • How do technical specifications (e.g., horsepower, engine size) relate to price?

  • Are there significant differences in price and features across drive types or body styles?

Step 2: PREPARE & PROCESS¶

Data Selection¶

This analysis uses the Automobile Data Set from Kaggle, contributed by Toramky and made available for public use. Visit the Automobile Data Set on Kaggle.

The dataset contains detailed records of various automobiles, including technical specifications, fuel economy, and pricing information. It consists of a single CSV file with 205 entries and 26 attributes, covering aspects such as:

  • Technical Specifications: Engine size, horsepower, number of cylinders, dimensions (wheel-base, length, width, height), curb weight, and fuel system.

  • Categorical Features: Make, fuel type, aspiration, number of doors, body style, drive wheels, and engine location.

  • Performance & Economy: Compression ratio, city-mpg, highway-mpg, peak rpm.

  • Pricing: Manufacturer’s suggested retail price.

These features provide a comprehensive basis for analyzing factors that influence automobile pricing, technical performance, and market segmentation.

Limitations of the Data¶

  • Missing values in several columns (e.g., normalized-losses, price, horsepower).

  • Some categorical variables require encoding for statistical analysis.

  • The dataset may not represent the full diversity of the automobile market.

Environment¶

This analysis was conducted using Python in Visual Studio Code with Jupyter Notebook. The main libraries used include pandas for data processing, numpy for numerical operations, and matplotlib and seaborn for data visualization.

Import Libraries¶

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

Load the dataset¶

In [47]:
import os
print(os.getcwd())
d:\DATA_ANALYSIS_PROJECTS\GIT_PROJECTS\Automobile_Data_Analysis
In [48]:
df = pd.read_csv("D:\\DATA_ANALYSIS_PROJECTS\\GIT_PROJECTS\\Automobile_Data_Analysis\\Automobile_data.csv")

Preview and Understand Data Structure¶

In [49]:
# Display the first 5 rows
print(df.head())

# Display column names and data types
print(df.info())
   symboling normalized-losses         make fuel-type aspiration num-of-doors  \
0          3                 ?  alfa-romero       gas        std          two   
1          3                 ?  alfa-romero       gas        std          two   
2          1                 ?  alfa-romero       gas        std          two   
3          2               164         audi       gas        std         four   
4          2               164         audi       gas        std         four   

    body-style drive-wheels engine-location  wheel-base  ...  engine-size  \
0  convertible          rwd           front        88.6  ...          130   
1  convertible          rwd           front        88.6  ...          130   
2    hatchback          rwd           front        94.5  ...          152   
3        sedan          fwd           front        99.8  ...          109   
4        sedan          4wd           front        99.4  ...          136   

   fuel-system  bore  stroke compression-ratio horsepower  peak-rpm city-mpg  \
0         mpfi  3.47    2.68               9.0        111      5000       21   
1         mpfi  3.47    2.68               9.0        111      5000       21   
2         mpfi  2.68    3.47               9.0        154      5000       19   
3         mpfi  3.19     3.4              10.0        102      5500       24   
4         mpfi  3.19     3.4               8.0        115      5500       18   

  highway-mpg  price  
0          27  13495  
1          27  16500  
2          26  16500  
3          30  13950  
4          22  17450  

[5 rows x 26 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               205 non-null    object 
 19  stroke             205 non-null    object 
 20  compression-ratio  205 non-null    float64
 21  horsepower         205 non-null    object 
 22  peak-rpm           205 non-null    object 
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              205 non-null    object 
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
None

Problems Identified in the Data¶

  • Non-numeric placeholders ('?') required replacement.

  • Inconsistent data types (e.g., numeric columns stored as strings).

  • Potential outliers and skewed distributions in price and technical features.

Data Cleaning¶

Data Cleaning Pipeline¶

In [50]:
def clean_automobile_data(df):
    # Replace '?' with NaN
    df = df.replace('?', pd.NA)
    
    # Convert all columns to numeric where possible
    for col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='ignore')
    
    # Select only numeric columns
    numeric_df = df.select_dtypes(include='number')
    
    # Drop rows with missing numeric values
    numeric_df = numeric_df.dropna()
    
    return numeric_df

# Use the function and print the result
numeric_df = clean_automobile_data(df)
print(numeric_df.head())
    symboling  normalized-losses  wheel-base  length  width  height  \
3           2              164.0        99.8   176.6   66.2    54.3   
4           2              164.0        99.4   176.6   66.4    54.3   
6           1              158.0       105.8   192.7   71.4    55.7   
8           1              158.0       105.8   192.7   71.4    55.9   
10          2              192.0       101.2   176.8   64.8    54.3   

    curb-weight  engine-size  bore  stroke  compression-ratio  horsepower  \
3          2337          109  3.19     3.4               10.0       102.0   
4          2824          136  3.19     3.4                8.0       115.0   
6          2844          136  3.19     3.4                8.5       110.0   
8          3086          131  3.13     3.4                8.3       140.0   
10         2395          108  3.50     2.8                8.8       101.0   

    peak-rpm  city-mpg  highway-mpg    price  
3     5500.0        24           30  13950.0  
4     5500.0        18           22  17450.0  
6     5500.0        19           25  17710.0  
8     5500.0        17           20  23875.0  
10    5800.0        23           29  16430.0  
C:\Users\Win10.DESKTOP-S18D2M2\AppData\Local\Temp\ipykernel_16708\265821505.py:7: FutureWarning: errors='ignore' is deprecated and will raise in a future version. Use to_numeric without passing `errors` and catch exceptions explicitly instead
  df[col] = pd.to_numeric(df[col], errors='ignore')

Step 3: ANALYSE & SHARE¶

3.1 Univariate Analysis¶

Descriptive Statistics: Height¶

In [51]:
# Descriptors for Height
height_stats = numeric_df['height'].describe()
print(height_stats)
count    160.000000
mean      53.878750
std        2.276608
min       49.400000
25%       52.000000
50%       54.100000
75%       55.500000
max       59.800000
Name: height, dtype: float64

Distribution, Violin, and Box Plots for Height¶

In [52]:
# Ensure 'height' is numeric and drop NaNs
height_data = pd.to_numeric(df['height'], errors='coerce').dropna()

# Set up the figure with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# --- Distribution Plot ---
sns.histplot(height_data, kde=True, bins=30, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Distribution Plot of Height')
axes[0].set_xlabel('Height')
axes[0].set_ylabel('Frequency')

# --- Violin Plot ---
sns.violinplot(x=height_data, ax=axes[1], color='lightcoral')
axes[1].set_title('Violin Plot of Height')
axes[1].set_xlabel('Height')

# --- Box Plot ---
sns.boxplot(x=height_data, ax=axes[2], color='lightgreen')
axes[2].set_title('Box Plot of Height')
axes[2].set_xlabel('Height')

# Improve layout
plt.tight_layout()
plt.show()
No description has been provided for this image

Vehicle heights are fairly symmetrically distributed around the mean, with few outliers. This confirms that the dataset primarily includes standard-sized vehicles, and there is little representation of unusually tall or low-profile models.

Descriptive Statistics: Horsepower¶

In [53]:
# Descriptors for Height
height_stats = numeric_df['horsepower'].describe()
print(height_stats)
count    160.000000
mean      95.875000
std       30.625708
min       48.000000
25%       69.000000
50%       88.000000
75%      114.000000
max      200.000000
Name: horsepower, dtype: float64

Distribution, Violin, and Box Plots for Horsepower¶

In [54]:
# Ensure 'horsepower' is numeric and drop NaNs
horsepower_data = pd.to_numeric(df['horsepower'], errors='coerce').dropna()

# Set up the figure with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# --- Distribution Plot ---
sns.histplot(horsepower_data, kde=True, bins=30, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Distribution Plot of Horsepower')
axes[0].set_xlabel('Horsepower')
axes[0].set_ylabel('Frequency')

# --- Violin Plot ---
sns.violinplot(x=horsepower_data, ax=axes[1], color='lightcoral')
axes[1].set_title('Violin Plot of Horsepower')
axes[1].set_xlabel('Horsepower')

# --- Box Plot ---
sns.boxplot(x=horsepower_data, ax=axes[2], color='lightgreen')
axes[2].set_title('Box Plot of Horsepower')
axes[2].set_xlabel('Horsepower')

# Improve layout
plt.tight_layout()
plt.show()
No description has been provided for this image

The distribution of horsepower is right-skewed, with most vehicles having moderate horsepower and a few high-performance outliers. This pattern reflects a market largely composed of regular passenger cars, with performance vehicles being less common.

Descriptive Statistics: Price¶

In [55]:
# Descriptors for Height
price_stats = numeric_df['price'].describe()
print(price_stats)
count      160.000000
mean     11427.681250
std       5863.789011
min       5118.000000
25%       7383.500000
50%       9164.000000
75%      14559.250000
max      35056.000000
Name: price, dtype: float64

Distribution, Violin, and Box Plots for Power¶

In [56]:
# Ensure 'price' is numeric and drop NaNs
price_data = pd.to_numeric(df['price'], errors='coerce').dropna()

# Set up the figure with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# --- Distribution Plot ---
sns.histplot(price_data, kde=True, bins=30, ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Distribution Plot of Price')
axes[0].set_xlabel('Price')
axes[0].set_ylabel('Frequency')

# --- Violin Plot ---
sns.violinplot(x=price_data, ax=axes[1], color='lightcoral')
axes[1].set_title('Violin Plot of Price')
axes[1].set_xlabel('Price')

# --- Box Plot ---
sns.boxplot(x=price_data, ax=axes[2], color='lightgreen')
axes[2].set_title('Box Plot of Price')
axes[2].set_xlabel('Price')

# Improve layout
plt.tight_layout()
plt.show()
No description has been provided for this image

The distribution of horsepower is right-skewed, with most vehicles having moderate horsepower and a few high-performance outliers. This pattern reflects a market largely composed of regular passenger cars, with performance vehicles being less common.

3.2 Bivariate Analysis¶

Scatter Plot with Regression Line: Price vs Horsepower¶

In [57]:
# Ensure 'price' and 'horsepower' are numeric and clean the data
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')
scatter_df = df[['horsepower', 'price']].dropna()

# Create the scatter plot with trend line
plt.figure(figsize=(6, 4))
sns.regplot(data=scatter_df, x='horsepower', y='price', scatter_kws={'alpha':0.6}, line_kws={"color":"red"})

# Format the Y-axis with 'K' labels
plt.gca().yaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f'{int(x/1000)}K'))

# Customize the plot
plt.title('Price vs Horsepower with Trend Line')
plt.xlabel('Horsepower')
plt.ylabel('Price')
plt.grid(True)
plt.tight_layout()

# Show the plot
plt.show()
No description has been provided for this image

There is a clear positive correlation between horsepower and price, as shown by the upward trend in the scatter plot. Higher horsepower vehicles tend to be more expensive, highlighting the market value placed on performance.

Box Plot: Price by Drive Wheels¶

In [58]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Convert columns to appropriate types
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['drive-wheels'] = df['drive-wheels'].astype('category')

# Filter and drop missing values
box_df = df[['drive-wheels', 'price']].dropna()

# Define order
order = ['fwd', '4wd', 'rwd']

# Format the X-axis with 'K' labels
plt.gca().xaxis.set_major_formatter(mtick.FuncFormatter(lambda x, _: f'{int(x/1000)}K'))

# Create the box plot
sns.boxplot(data=box_df, y='drive-wheels', x='price', order=order)

plt.show()
No description has been provided for this image

Rear-wheel drive (rwd) vehicles generally have higher prices compared to front-wheel (fwd) and four-wheel drive (4wd) vehicles. This suggests that rwd cars in the dataset are more likely to be positioned as luxury or performance models.

3.3 Multivariate Analysis¶

Pair Plot (Scatter Matrix) with Categorical Hue¶

In [59]:
# Convert to numeric where needed
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['engine-size'] = pd.to_numeric(df['engine-size'], errors='coerce')
df['horsepower'] = pd.to_numeric(df['horsepower'], errors='coerce')

# Filter relevant columns and drop missing values
pair_df = df[['price', 'drive-wheels', 'num-of-cylinders', 'engine-size', 'horsepower']].dropna()

# Optional: encode categorical columns as ordered categories
# For num-of-cylinders, map text to numbers (e.g., 'four' → 4)
cylinder_map = {
    'two': 2, 'three': 3, 'four': 4, 'five': 5,
    'six': 6, 'eight': 8, 'twelve': 12
}
pair_df['num-of-cylinders'] = pair_df['num-of-cylinders'].map(cylinder_map)

# Create pair plot
sns.pairplot(pair_df, hue='drive-wheels')
plt.show()
No description has been provided for this image

The pair plot reveals strong positive relationships among price, engine size, and horsepower, especially for rwd vehicles. These patterns suggest that technical specifications are closely linked to vehicle pricing, with higher-end features concentrated in more expensive models.

Correlation Matrix Calculation¶

In [60]:
# Compute the correlation matrix
correlation_matrix = numeric_df.corr()

# Show correlation matrix
print(correlation_matrix)
                   symboling  normalized-losses  wheel-base    length  \
symboling           1.000000           0.518388   -0.520465 -0.336217   
normalized-losses   0.518388           1.000000   -0.064001  0.029114   
wheel-base         -0.520465          -0.064001    1.000000  0.871968   
length             -0.336217           0.029114    0.871968  1.000000   
width              -0.219850           0.104857    0.815935  0.839184   
height             -0.473994          -0.417081    0.558764  0.505156   
curb-weight        -0.252372           0.122860    0.810507  0.870355   
engine-size        -0.110238           0.203841    0.650488  0.726666   
bore               -0.257013          -0.036167    0.580484  0.649059   
stroke             -0.020539           0.065627    0.164012  0.116049   
compression-ratio  -0.139022          -0.129971    0.293968  0.188968   
horsepower         -0.003669           0.290906    0.514507  0.666726   
peak-rpm            0.199798           0.240676   -0.292491 -0.239104   
city-mpg            0.088912          -0.236934   -0.576635 -0.716877   
highway-mpg         0.149309          -0.189691   -0.608270 -0.717831   
price              -0.163329           0.199924    0.734789  0.760323   

                      width    height  curb-weight  engine-size      bore  \
symboling         -0.219850 -0.473994    -0.252372    -0.110238 -0.257013   
normalized-losses  0.104857 -0.417081     0.122860     0.203841 -0.036167   
wheel-base         0.815935  0.558764     0.810507     0.650488  0.580484   
length             0.839184  0.505156     0.870355     0.726666  0.649059   
width              1.000000  0.298403     0.870649     0.780018  0.575048   
height             0.298403  1.000000     0.369363     0.116505  0.261501   
curb-weight        0.870649  0.369363     1.000000     0.888847  0.646640   
engine-size        0.780018  0.116505     0.888847     1.000000  0.597336   
bore               0.575048  0.261501     0.646640     0.597336  1.000000   
stroke             0.192891 -0.095364     0.171691     0.296693 -0.105464   
compression-ratio  0.261530  0.237432     0.226513     0.143568  0.019216   
horsepower         0.678779  0.032264     0.788509     0.809855  0.557107   
peak-rpm          -0.235906 -0.251236    -0.262086    -0.287260 -0.315841   
city-mpg          -0.662123 -0.194559    -0.759538    -0.695890 -0.585618   
highway-mpg       -0.689367 -0.221646    -0.787167    -0.711364 -0.586729   
price              0.843316  0.247500     0.893810     0.841725  0.534891   

                     stroke  compression-ratio  horsepower  peak-rpm  \
symboling         -0.020539          -0.139022   -0.003669  0.199798   
normalized-losses  0.065627          -0.129971    0.290906  0.240676   
wheel-base         0.164012           0.293968    0.514507 -0.292491   
length             0.116049           0.188968    0.666726 -0.239104   
width              0.192891           0.261530    0.678779 -0.235906   
height            -0.095364           0.237432    0.032264 -0.251236   
curb-weight        0.171691           0.226513    0.788509 -0.262086   
engine-size        0.296693           0.143568    0.809855 -0.287260   
bore              -0.105464           0.019216    0.557107 -0.315841   
stroke             1.000000           0.240895    0.149315 -0.008569   
compression-ratio  0.240895           1.000000   -0.162894 -0.418726   
horsepower         0.149315          -0.162894    1.000000  0.074932   
peak-rpm          -0.008569          -0.418726    0.074932  1.000000   
city-mpg          -0.021381           0.279513   -0.837180 -0.054938   
highway-mpg       -0.013974           0.222442   -0.827973 -0.034372   
price              0.158798           0.210948    0.758582 -0.173970   

                   city-mpg  highway-mpg     price  
symboling          0.088912     0.149309 -0.163329  
normalized-losses -0.236934    -0.189691  0.199924  
wheel-base        -0.576635    -0.608270  0.734789  
length            -0.716877    -0.717831  0.760323  
width             -0.662123    -0.689367  0.843316  
height            -0.194559    -0.221646  0.247500  
curb-weight       -0.759538    -0.787167  0.893810  
engine-size       -0.695890    -0.711364  0.841725  
bore              -0.585618    -0.586729  0.534891  
stroke            -0.021381    -0.013974  0.158798  
compression-ratio  0.279513     0.222442  0.210948  
horsepower        -0.837180    -0.827973  0.758582  
peak-rpm          -0.054938    -0.034372 -0.173970  
city-mpg           1.000000     0.971997 -0.690100  
highway-mpg        0.971997     1.000000 -0.718314  
price             -0.690100    -0.718314  1.000000  

Correlation Heatmap¶

In [61]:
# Set the figure size
plt.figure(figsize=(12, 10))

# Plot heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', linewidths=0.5)

# Show plot
plt.title('Correlation Heatmap of Automobile Dataset')
plt.show()
No description has been provided for this image

The correlation heatmap shows that price is strongly correlated with technical features like curb weight, engine size, and horsepower, while negatively correlated with fuel efficiency metrics. This underscores that more powerful and larger vehicles tend to be costlier and less fuel-efficient.

Step 4: ACT¶

Key Insights¶

  • Price Correlates Strongly with Technical Features: Price shows high positive correlation with engine size, curb weight, horsepower, and width. Larger, more powerful cars tend to be more expensive.

  • Drive Type Influences Price: Rear-wheel drive (rwd) vehicles generally command higher prices compared to front-wheel drive (fwd) and four-wheel drive (4wd) models, as shown in box plots.

  • Distribution Insights: Horsepower and price distributions are right-skewed, indicating a few high-value, high-performance vehicles in the dataset.

  • Multivariate Relationships: Pair plots reveal clusters by drive-wheels and number of cylinders, suggesting market segmentation opportunities.

  • Data Limitations: Missing values and inconsistent data types could introduce bias or reduce the robustness of certain analyses. Further data enrichment and validation are recommended for production-grade modeling.

Recommendations¶

  • Focus on technical upgrades (engine size, horsepower) for premium pricing strategies.

  • Address missing data for more robust analytics.

  • Explore segmentation by drive type and cylinder count for targeted marketing.

  • Consider additional data collection to fill gaps and validate findings.