Fork me on GitHub

Working with the Auto MPG Data Set

Working with the Auto MPG Data Set

In this post we will look into the Auto MPG data set and clean it so that it is ready for further use.

This data set shows the mpg of a group of car models produced in the 1970s and the 1980s along with some characteristic information associated with each model. More information about the data set can be found here.


1. Download and Load Data File

Download the auto-mpg.data data file and save it to the local directory where you would run the code on this page.

In [22]:
import pandas as pd
filename = "auto-mpg.data"
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin', 'name']
df = pd.read_csv(filename, delim_whitespace=True, names=column_names)

print df.shape
df.head(5)
(398, 9)
Out[22]:
mpg cylinders displacement horsepower weight acceleration year origin name
0 18.0 8 307.0 130.0 3504.0 12.0 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693.0 11.5 70 1 buick skylark 320
2 18.0 8 318.0 150.0 3436.0 11.0 70 1 plymouth satellite
3 16.0 8 304.0 150.0 3433.0 12.0 70 1 amc rebel sst
4 17.0 8 302.0 140.0 3449.0 10.5 70 1 ford torino

The data has 9 columns (features) and 397 entries. Of the 9 columns, 8 are numerical. The labeling on the origin column is 1 for domestic, 2 for Europe and 3 for Asia. This data set is typically used to learn and predict the mpg column using the remaining columns.


2. Cleaning the Data

Upon close inspection we find that there 5 missing values in the horsepower column and that they have been entered as "?"s:

In [23]:
horsepower_missing_ind = df[df.horsepower=='?'].index
df.loc[horsepower_missing_ind]
Out[23]:
mpg cylinders displacement horsepower weight acceleration year origin name
32 25.0 4 98.0 ? 2046.0 19.0 71 1 ford pinto
126 21.0 6 200.0 ? 2875.0 17.0 74 1 ford maverick
330 40.9 4 85.0 ? 1835.0 17.3 80 2 renault lecar deluxe
336 23.6 4 140.0 ? 2905.0 14.3 80 1 ford mustang cobra
354 34.5 4 100.0 ? 2320.0 15.8 81 2 renault 18i
374 23.0 4 151.0 ? 3035.0 20.5 82 1 amc concord dl
In [24]:
df.dtypes
Out[24]:
mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight          float64
acceleration    float64
year              int64
origin            int64
name             object
dtype: object

We have another problem. The horsepower column appears as an object. We need to convert it to float or int.

One way to fix the missing value problem is to fill in those locations with the mean value of the horsepower data. To be able to do that we first assign NaN to the missing data locations, change the data type of the horsepower column to float and then assign each missing location the mean value of the column.

In [25]:
df.loc[horsepower_missing_ind, 'horsepower'] = float('nan')
df.horsepower = df.horsepower.apply(pd.to_numeric)
df.loc[horsepower_missing_ind, 'horsepower'] = int( df.horsepower.mean() )

Now, things are in order and the statistics should come out fine. We can use the dataframe.describe() function to get a quick idea of the statistics of the data.

In [26]:
pd.set_option('precision', 2)
# display stats of the features
df.describe()
Out[26]:
mpg cylinders displacement horsepower weight acceleration year origin
count 398.00 398.00 398.00 398.00 398.00 398.00 398.00 398.00
mean 23.51 5.45 193.43 104.46 2970.42 15.57 76.01 1.57
std 7.82 1.70 104.27 38.20 846.84 2.76 3.70 0.80
min 9.00 3.00 68.00 46.00 1613.00 8.00 70.00 1.00
25% 17.50 4.00 104.25 76.00 2223.75 13.83 73.00 1.00
50% 23.00 4.00 148.50 95.00 2803.50 15.50 76.00 1.00
75% 29.00 8.00 262.00 125.00 3608.00 17.17 79.00 2.00
max 46.60 8.00 455.00 230.00 5140.00 24.80 82.00 3.00

3. Saving the Data as a .CSV File

Now that the problems associated with the data are ironed out, we can save this version of the file to separate file. The data will be saved to a .csv due to its versatility.

In [27]:
df.to_csv('auto-mpg.csv', index=False)

When needed, this file can now be loaded using the command df = pd.read_csv('auto-mpg.csv') to variable df.

blogroll

social