
Predicting used car prices

Zhiyuan Xue, Austin Long, Helen Chen

Introduction

As one of the most common forms of transportation, a car is an important everyday commodity for many households. Buying a brand new car can be decidedly costly, so buying used is often the preferred choice. Unfortunately, customers can be taken advantage of by dealers who list used cars at unreasonably high prices. To help buyers avoid that scenario and get a fairly priced vehicle of the best possible quality, we present a model that predicts the price of a used car from its features.

Car price prediction is difficult because it traditionally relies on experience. Many factors come into play, including (but not limited to):

  • the year it was made
  • the brand that manufactured it
  • its mileage (how many miles the car has traveled)
  • its horsepower
  • the type of fuel it uses
  • how many cylinders it has
  • transmission type
  • size
  • paint color

Goal

We aim to find what features of a car are important to buyers and thus affect the price more, and what features are less important and are more likely gimmicks. Based on that, we hope to train a machine learning model to predict the price of a used car given its features.

Our goal is to help consumers who are browsing used cars determine what factors are typically important when buying a used car, as well as a fair price for a specific car based on the features it offers.

The raw data

We found a dataset on Kaggle containing roughly 500,000 used car listings scraped from Craigslist. To start off, we load this into a Pandas dataframe.

However, since the data consists of a CSV file totaling nearly 1.5 gigabytes, we weren't able to load it in one go. To remedy this, we read the CSV in chunks of 100,000 rows.

Upon an initial inspection of our data, we noticed occasional missing values in some of the columns. We decided to do a complete case analysis, since imputing the missing values would be hard given how wildly cars vary by brand and model. Thus, we drop any rows with missing data, keeping only listings for which we have complete information.

We also randomly sample roughly 1,000 car listings out of the total 500,000. Random sampling helps reduce bias in the sample, giving a better chance of capturing cars with a wide range of features in a representative subset.

In [1]:
import pandas

# empty dataframe to store our sample data
data = pandas.DataFrame()

# csv file is too large so process in chunks
chunksize = 10 ** 5
for chunk in pandas.read_csv("craigslistVehicles.csv", chunksize=chunksize):   
    # complete case analysis
    chunk = chunk.dropna()

    # sample roughly 1/60 of each chunk so the total comes out near 1,000 rows
    sample_size = int(len(chunk.index) / 60)
    sample = chunk.sample(n=sample_size)
    data = data.append(sample)

data = data.reset_index(drop=True)
data.head()
Out[1]:
url city city_url price year manufacturer make condition cylinders fuel ... transmission VIN drive size type paint_color image_url desc lat long
0 https://limaohio.craigslist.org/ctd/d/upper-sa... lima / findlay https://limaohio.craigslist.org 9900 2011.0 cadillac escalade esv luxury excellent 8 cylinders gas ... automatic 1GYS4HEF0BR120429 4wd full-size SUV white https://images.craigslist.org/00E0E_exOfDapynd... WWW.WMSOHIO.COM - Price Drop!!!\n\nNew these g... 40.832775 -83.250961
1 https://kalamazoo.craigslist.org/ctd/d/hudsonv... kalamazoo, MI https://kalamazoo.craigslist.org 39999 2014.0 ram 2500 like new 6 cylinders diesel ... automatic 3C6UR5GL9EG308721 4wd full-size truck grey https://images.craigslist.org/00f0f_5tpTMClWO4... 2014 RAM 2500 Laramie Limited Crew Cab 4WD - $... 42.868175 -85.862086
2 https://huntsville.craigslist.org/ctd/d/cullma... huntsville / decatur https://huntsville.craigslist.org 39995 2015.0 ford f-250 sd good 8 cylinders diesel ... automatic 1FT7W2BT6FEA67827 4wd full-size truck blue https://images.craigslist.org/00N0N_aBCENws3Dg... 2015 Ford F-250 SD Lariat Crew Cab 4WD - $39,9... 34.164361 -86.837410
3 https://louisville.craigslist.org/cto/d/brande... louisville, KY https://louisville.craigslist.org 2875 2007.0 ford taurus excellent 4 cylinders gas ... automatic 1FAHP34N57W169966 fwd compact sedan grey https://images.craigslist.org/00Q0Q_x8qa4DbdoD... I have a 2007 Ford Focus for sale. This vehicl... 37.965753 -86.134229
4 https://mansfield.craigslist.org/ctd/d/bucyrus... mansfield, OH https://mansfield.craigslist.org 11995 2011.0 chevrolet express g2500 excellent 8 cylinders gas ... automatic 1GCWGFCA9B1168536 rwd full-size van white https://images.craigslist.org/00303_gV2J6T34j1... CHECK OUT THIS VERY NICE 2011 CHEVY EXPRESS G2... 40.812476 -82.949602

5 rows × 22 columns

Tidying the data

After reading in the data, we begin tidying it by removing columns that are unlikely to be relevant to the car's price, such as the URL of the post, the car's VIN, and the description. Although the description does occasionally provide useful information, it typically repeats what is already in the other columns and is too noisy, since it lacks the structured format of the other columns.

For the remaining columns, we rename a few for brevity and better readability.

In [2]:
# dropping information columns
data = data.drop(columns=['url','city_url','title_status','VIN','image_url','desc'])
data = data.reset_index(drop=True)

# renaming columns for readability
data = data.rename(columns={'odometer' : 'mileage','paint_color' : 'color'})
data.head()
Out[2]:
city price year manufacturer make condition cylinders fuel mileage transmission drive size type color lat long
0 lima / findlay 9900 2011.0 cadillac escalade esv luxury excellent 8 cylinders gas 263000.0 automatic 4wd full-size SUV white 40.832775 -83.250961
1 kalamazoo, MI 39999 2014.0 ram 2500 like new 6 cylinders diesel 73800.0 automatic 4wd full-size truck grey 42.868175 -85.862086
2 huntsville / decatur 39995 2015.0 ford f-250 sd good 8 cylinders diesel 135082.0 automatic 4wd full-size truck blue 34.164361 -86.837410
3 louisville, KY 2875 2007.0 ford taurus excellent 4 cylinders gas 215000.0 automatic fwd compact sedan grey 37.965753 -86.134229
4 mansfield, OH 11995 2011.0 chevrolet express g2500 excellent 8 cylinders gas 112511.0 automatic rwd full-size van white 40.812476 -82.949602

Exploratory data analysis

We begin with some data visualization to explore which features matter most for the car's price. This also lets us see each feature's distribution and what kind of relationship it might have with the price.

Mileage vs. Price

We first compare how far the car has been driven with the price it is listed at, to investigate whether there is any relationship.

In [4]:
import matplotlib.pyplot as plt
import numpy as np

# create and label scatter plot
fig = plt.figure()
scatter_plot = fig.add_subplot(1, 1, 1)
scatter_plot.set_title("Mileage vs Price")
scatter_plot.set_xlabel("Mileage (miles)")
scatter_plot.set_ylabel("Price ($)")

# plot all points in a single scatter call
scatter_plot.scatter(data['mileage'], data['price'])
Out[4]:

From the graph, there seem to be a few outliers that sit very far from the rest of the points. We decided to remove them, since they are most likely unrepresentative cases where the car's value is wildly overestimated or the mileage is atypical.

We then look at mileage versus listed price again without the outliers for a closer look. We expect an inverse relationship: the more miles a car has been driven, the lower its price.

In [5]:
# delete outliers: keep only reasonably priced, reasonably driven listings
data = data[(data['price'] <= 100000) & (data['mileage'] <= 500000)]
data = data.reset_index(drop=True)

# create and label scatter plot
fig = plt.figure()
scatter_plot2 = fig.add_subplot(1, 1, 1)
scatter_plot2.set_title("Mileage vs Price")
scatter_plot2.set_xlabel("Mileage (miles)")
scatter_plot2.set_ylabel("Price ($)")

# plot all points in a single scatter call
scatter_plot2.scatter(data['mileage'], data['price'])
Out[5]:

With the outliers no longer distorting the scale, we can see a weak negative, somewhat linear relationship between mileage and price.

As a result, we cannot yet confidently conclude that the fewer miles a vehicle has been driven, the higher the price it is listed at.

Year vs. Price

To investigate if any of the other features had a stronger effect on the price, we decided to compare the year the car was made with the price it is being sold at.

We predict that newer vehicles will be listed at higher prices.

In [6]:
fig = plt.figure()
scatter_plot = fig.add_subplot(1, 1, 1)
scatter_plot.set_title("Year vs Price")
scatter_plot.set_xlabel("Year")
scatter_plot.set_ylabel("Price ($)")

# plot all points in a single scatter call
scatter_plot.scatter(data['year'], data['price'])
Out[6]:

Interestingly enough, the scatterplot appears to look somewhat quadratic, with many dots clustered around the later years (2010s) representing a sharp, nonlinear increase in price as compared to cars with earlier model years.

Although, intuitively, it makes sense that most of the newer used cars are being sold at a higher price than the older ones, there could be other external factors that make the scatterplot appear this way.

For instance, the depreciation rate of cars could very well be nonlinear, with prices decaying exponentially as the car ages. On the other hand, another factor could be the rise of the Internet: people selling cars from the mid-2000s and earlier are probably an older population more likely to sell through traditional dealerships rather than Craigslist, as opposed to the more Internet-savvy younger people who own newer cars. With fewer very old cars listed on Craigslist, the ones that do appear may be priced higher.
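
As a rough illustration of that first point (a textbook rule of thumb, not something measured from our data): if a car loses a constant fraction $d$ of its value every year, its price decays exponentially with age, $\text{price} \approx P_0\,(1 - d)^{\text{age}}$. At $d = 0.15$, for example, a car bought new for 30,000 dollars would be worth roughly 11,300 dollars after six years, since $0.85^6 \approx 0.38$.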

Location vs. Price

Next, we compared the location where the car is being sold with the price it is listed at.

To simplify visualization, we decided to display the average price of the cars listed in each area using a map of North America, in order to more easily see the relationship between them.

In [7]:
import json
import folium
from folium import Marker
from folium.plugins import MarkerCluster
from jinja2 import Template

# create a marker that holds its price
class MarkerWithProps(Marker):
    _template = Template(u"""
        {% macro script(this, kwargs) %}
        var {{this.get_name()}} = L.marker(
            [{{this.location[0]}}, {{this.location[1]}}],
            {
                icon: new L.Icon.Default(),
                {%- if this.draggable %}
                draggable: true,
                autoPan: true,
                {%- endif %}
                {%- if this.props %}
                props : {{ this.props }} 
                {%- endif %}
                }
            )
            .addTo({{this._parent.get_name()}});
        {% endmacro %}
        """)
    def __init__(self, location, popup=None, tooltip=None, icon=None,
                 draggable=False, props = None ):
        super(MarkerWithProps, self).__init__(location=location,popup=popup,tooltip=tooltip,icon=icon,draggable=draggable)
        self.props = json.loads(json.dumps(props))    

# function to average the price of all markers within the cluster
icon_create_function = '''
    function(cluster) {
        var markers = cluster.getAllChildMarkers();
        var sum = 0;
        for (var i = 0; i < markers.length; i++) {
            sum += markers[i].options.props.price;
        }
        var avg = Math.round(sum/cluster.getChildCount());

        return L.divIcon({
             html: '<b>' + avg + '</b>',
             className: 'marker-cluster marker-cluster-small',
             iconSize: new L.Point(20, 20)
        });
    }
'''

# create the map
map_osm = folium.Map(location=[39.8283, -98.5795], zoom_start=4)

# create cluster using previously defined function
cluster = MarkerCluster(icon_create_function=icon_create_function)

# create and add a marker for each data point based on its location and price
for index, row in data.iterrows():
    marker = MarkerWithProps(
        location=[row['lat'],row['long']],
        props = { 'price': row['price']}
    )
    marker.add_to(cluster)    

# add cluster to map
cluster.add_to(map_osm)   
map_osm
Out[7]:

Our map displays the average price of each region using marker clusters. Hovering over a specific cluster will highlight the exact region that the average price represents. You can also change the zoom level to get a more detailed view of average prices in a region. Note that each blue marker only represents the location of an individual listing and not its price as more than one listing is required to generate a cluster of the average prices in an area.

Looking at the map overall, average prices along the southern edge of the United States tend to be lower than in the northern parts. The coasts also differ noticeably: on the East Coast the northern regions show much higher average prices than the southern ones, while on the West Coast the pattern is reversed.

But when you zoom in to get more detailed average prices for each region, there doesn't seem to be as much of an overarching trend. However, more densely populated and wealthier cities do typically show higher average prices than smaller, more remote towns.

Machine Learning

Based on the features we explored, we begin preparing our data for machine learning.

Preparing the data

First, we normalize the year and mileage in our training data. We do this because year and mileage have very different ranges and fairly dissimilar distributions, which makes them hard to compare directly. Standardizing each column with z-scores rescales both to mean 0 and standard deviation 1, so the two features can be compared on a common scale. To read more about z-score normalization, click here.
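
For reference, the z-score of a value $x$ in a column with mean $\mu$ and standard deviation $\sigma$ is

$$z = \frac{x - \mu}{\sigma}$$

which is exactly what we compute for each column in the code below.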

Next, we do some feature engineering on the year. Since our EDA (Exploratory Data Analysis) phase suggested the relationship between year and price is roughly quadratic, we square the normalized year (keeping its sign) in the hope of making its relationship with price closer to linear.

In [8]:
# normalize mileage
mil_mean = data['mileage'].mean()
mil_std = data['mileage'].std()
data = data.apply(lambda x: ((x-mil_mean) / mil_std) if x.name == 'mileage' else x)

#normalize year
year_mean = data['year'].mean()
year_std = data['year'].std()
data = data.apply(lambda x: ((x-year_mean) / year_std) if x.name == 'year' else x)

# signed square (x * |x|) of the normalized year to linearize the quadratic trend
data = data.apply(lambda x: x*abs(x) if x.name == 'year' else x)

Training a linear regression model

We first attempt to train a linear regression model to fit the data, since most of the features we observed seem to exhibit a linear relationship.

We begin by dropping columns for which we could not find a clear relationship with price during our EDA phase. Although they may still factor into the price, we found no clear relationship, and regression analysis on subsets of these features didn't show any significant improvement.

We also apply one hot encoding to the city variable. Since city is categorical, one hot encoding converts it into numerical indicator columns that our ML algorithm can work with. For more information on how one hot encoding works, click here.
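
As a quick illustration of what one hot encoding does, here is a minimal sketch on a made-up toy column (not our actual data):

import pandas

# hypothetical toy column, purely for illustration
toy = pandas.DataFrame({'city': ['boston', 'austin, TX', 'boston']})

# get_dummies replaces the categorical column with one indicator column
# per distinct value (city_boston, "city_austin, TX"), set to 1/True on
# the rows where that city applies
print(pandas.get_dummies(toy))

Our actual call below does the same thing across every city in the sample, which is why the encoded dataframe ends up with hundreds of columns.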

Finally, we hold out part of our sample when training, both to help detect overfitting and to evaluate the model's performance: 80% of the sample goes into a training set used to fit the model, and the remaining 20% is held back as a test set to validate the model against later. For a more in-depth look at train/test splits and the related idea of cross-validation, see this tutorial.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# separate the price column
price = pandas.DataFrame(data['price'])

# drop unnecessary columns
data = data.drop(columns=['price','lat','long','manufacturer','make','condition','cylinders','fuel','transmission','drive','size','type','color'])

# one hot encoding for city variable
data_ohe = pandas.get_dummies(data)

# train the linear regression model using 80% of the sample as training data
data_train, data_test, y_train, y_test = train_test_split(data_ohe, price['price'].values.reshape(-1,1), test_size=0.2, random_state=0)
linearRegression = LinearRegression().fit(data_train, y_train)

# use model to predict the test data
y_pred = linearRegression.predict(data_test)

# show coefficients
print('Intercept:', linearRegression.intercept_)
coeff_df = pandas.DataFrame(linearRegression.coef_.reshape(-1,1), data_ohe.columns, columns=['Coefficient'])
coeff_df
Intercept: [-3.43968652e+13]
Out[9]:
Coefficient
year 2.933759e+00
mileage -3.764081e+03
city_SF bay area 3.439687e+13
city_abilene, TX 3.439687e+13
city_akron / canton 3.439687e+13
city_albany, GA 3.439687e+13
city_albany, NY 3.439687e+13
city_albuquerque 3.439687e+13
city_altoona-johnstown 3.439687e+13
city_amarillo, TX 3.439687e+13
city_anchorage / mat-su 3.439687e+13
city_ann arbor, MI 3.439687e+13
city_annapolis, MD 3.439687e+13
city_appleton-oshkosh-FDL 3.439687e+13
city_asheville, NC 3.439687e+13
city_atlanta, GA 3.439687e+13
city_augusta, GA 3.439687e+13
city_austin, TX 3.439687e+13
city_bakersfield, CA 3.439687e+13
city_baltimore, MD 3.439687e+13
city_baton rouge 3.439687e+13
city_battle creek, MI 3.439687e+13
city_bellingham, WA 3.439687e+13
city_bend, OR 3.439687e+13
city_billings, MT 3.439687e+13
city_binghamton, NY 3.439687e+13
city_birmingham, AL 3.439687e+13
city_bismarck, ND 2.325970e+15
city_boise, ID 3.439687e+13
city_boston 3.439687e+13
... ...
city_tulsa, OK 3.439687e+13
city_twin falls, ID 3.439687e+13
city_tyler / east TX 3.439687e+13
city_upper peninsula, MI 3.439687e+13
city_utica-rome-oneida 3.439687e+13
city_valdosta, GA 3.439687e+13
city_ventura county 3.439687e+13
city_vermont 3.439687e+13
city_victoria, TX 3.439687e+13
city_visalia-tulare 3.439687e+13
city_waco, TX 5.456968e-12
city_washington, DC 3.439687e+13
city_waterloo / cedar falls 3.439687e+13
city_watertown, NY 3.439687e+13
city_wausau, WI 3.439687e+13
city_wenatchee, WA 3.439687e+13
city_western massachusetts 3.439687e+13
city_western slope -7.094059e-11
city_wichita, KS 3.439687e+13
city_williamsport, PA 3.439687e+13
city_wilmington, NC 3.439687e+13
city_winchester, VA 3.439687e+13
city_winston-salem, NC 3.439687e+13
city_worcester / central MA 3.439687e+13
city_wyoming 3.439687e+13
city_yakima, WA 3.439687e+13
city_york, PA 3.439687e+13
city_youngstown, OH 3.439687e+13
city_yuba-sutter, CA 3.439687e+13
city_yuma, AZ 3.439687e+13

296 rows × 1 columns

Based on the coefficients from our trained model, the year the car was made seems to have little effect on the listed price, since its coefficient is several orders of magnitude smaller than the one for mileage.

The same appears to be true for most cities, since their coefficients essentially just cancel out the intercept. Thus, our model seems to lean on mileage as the most important factor in determining price.
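
One plausible explanation for these enormous, mutually canceling coefficients is that one hot encoding every single city alongside an intercept makes the columns perfectly collinear (the classic dummy variable trap), which leaves the individual coefficient values numerically unstable. A minimal sketch of one way to avoid this, had we wanted to, would be to drop one category during encoding (an alternative we did not run, not the code used above):

# dropping the first city column makes the remaining dummies relative to
# that baseline city, so they no longer have to cancel out the intercept
data_ohe_alt = pandas.get_dummies(data, drop_first=True)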

Evaluating the linear regression model

After fitting the model to our training data, we do some further statistical analysis below to evaluate how well it performs.

In [10]:
# print statistical data
mean = y_test.mean()

print('Mean of Test Set', mean) 
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean of Test Set 12253.439814814816
Mean Absolute Error: 290729947100698.9
Mean Squared Error: 4.936160234470563e+30
Root Mean Squared Error: 2221747113078030.5

From the results, the Root Mean Squared Error (RMSE) is very high compared to the mean, signifying there might be a problem with our model.

We go ahead and plot a graph of the residuals to check linearity and see where we might have gone wrong.

In [11]:
fig = plt.figure()
scatter_plot2 = fig.add_subplot(1, 1, 1)
scatter_plot2.set_title("Residual Plot")
scatter_plot2.set_xticks([])
scatter_plot2.set_ylabel("Residual")

# using mileage as the x-axis to spread out the residuals
scatter_plot2.scatter(data_test['mileage'], (y_pred - y_test).ravel())
Out[11]:

From the residuals plot, it appears that our model is moderately good at predicting most prices, as can be seen from the vast majority of the residuals being around zero.

However, we can also see that there are a few cases where our model is way off compared to the actual price. This may be due to a failure to remove all outliers, or perhaps there were other outside features that were more important for those cases.

We take a look at the scores below for more information.

In [12]:
print('Score for training data: ', linearRegression.score(data_train, y_train))
print('Score for test data: ', linearRegression.score(data_test, y_test))
Score for training data:  0.44200590253406674
Score for test data:  -4.528440979329621e+22

Based on these scores, the performance on the training data is okay but not exceptional, suggesting that a linear regression model is serviceable, but another model might offer better performance.

In addition, the performance for our test data is absolutely abysmal, perhaps indicating a severe problem with overfitting in our model. This potentially suggests that the features we chose don't offer enough information to accurately predict the price of the car, so it might be worth investigating the other features once again to see if they could improve our model.

Training a random forest model

Because the results for our linear regression model were so unfavorable, we wanted to try a different model to see if there would be any improvement.

We decided to use random forests as our second model in order to drop the assumption of linearity. Decision trees are the basic building blocks of random forests: each tree asks a sequence of yes or no questions about the features (for example, "is the mileage above some threshold?") to arrive at a predicted price. A random forest trains many such trees, each on a random sample of the data and with random subsets of the features considered at each split, and averages their predictions, which tends to reveal which features matter most for the price. If you want to read more about random forests, click here.
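
To make the "sequence of yes or no questions" idea concrete, here is a minimal, self-contained sketch on made-up numbers (not our Craigslist sample) that fits one shallow decision tree and prints the questions it learned; a random forest simply averages many such trees:

from sklearn.tree import DecisionTreeRegressor, export_text

# hypothetical toy data: [year, mileage] -> price, purely for illustration
X_toy = [[2005, 150000], [2012, 80000], [2016, 40000], [2018, 20000]]
y_toy = [3000, 9000, 18000, 24000]

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, y_toy)

# each printed line is a yes/no question on a feature threshold,
# ending in a predicted price at the leaves
print(export_text(tree, feature_names=['year', 'mileage']))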

We train our random forest model by first tuning the hyperparameters below using RandomizedSearchCV.

In [13]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(200, 2000, num=11)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 4, 8]
bootstrap = [True, False]

# Create the random grid using the most important hyperparameters
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(rf, random_grid, n_iter=10, cv=10, verbose=1, random_state=42, n_jobs=-1, scoring='r2')

rf_random.fit(data_train, y_train)
rf_random.best_params_
Fitting 10 folds for each of 10 candidates, totalling 100 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
/opt/conda/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 12.8min finished
/opt/conda/lib/python3.7/site-packages/sklearn/model_selection/_search.py:715: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  self.best_estimator_.fit(X, y, **fit_params)
Out[13]:
{'n_estimators': 560,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 90,
 'bootstrap': True}

Before fitting the random forest over our training set, we first need to tune the hyperparameters. Tuning hyperparameters is important because we want to try to optimize them for our specific problem in order to get the best possible prediction.

You can find a list explaining our chosen random forest hyperparameters in the following hyperparameter tuning guide.

We use RandomizedSearchCV to automatically tune the hyperparameters. It works by running many iterations of fitting our data, randomly picking different values for the hyperparameters each time from our specified lists. After each iteration is complete, it evaluates the model's performance by scoring it based on its $r^2$ value, and then returns the model with the best performance (which can be seen above).
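
For reference, the $r^2$ score compares the model's squared errors against those of a baseline that always predicts the mean price:

$$r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

so 1 means perfect predictions, 0 means no better than predicting the mean, and negative values (like our linear model's test score earlier) mean worse than predicting the mean.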

Although we could have used GridSearchCV to search for the best hyperparameters more exhaustively, we found that RandomizedSearchCV did a good enough job of optimizing our hyperparameters while taking much less time to run.
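
For completeness, a GridSearchCV version would look roughly like the sketch below, reusing the rf estimator and training split from above but with a deliberately smaller grid, since every combination gets fitted (we did not run this):

from sklearn.model_selection import GridSearchCV

# a much smaller grid than before, since GridSearchCV tries every combination
small_grid = {'n_estimators': [200, 600, 1000],
              'max_depth': [10, 50, 90],
              'max_features': ['sqrt']}

rf_grid = GridSearchCV(rf, small_grid, cv=5, scoring='r2', n_jobs=-1)
rf_grid.fit(data_train, y_train.ravel())
print(rf_grid.best_params_)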

Evaluating random forest model

Now, we evaluate the performance of our random forest model below, using the optimized hyperparameters we found.

In [14]:
y_rf_pred = rf_random.predict(data_test)

print('Mean of Test Set', mean)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_rf_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_rf_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_rf_pred)))

print('\nScore for training data: ', rf_random.score(data_train, y_train))
print('Score for test data: ', rf_random.score(data_test, y_test))
Mean of Test Set 12253.439814814816
Mean Absolute Error: 6485.614752763793
Mean Squared Error: 73521367.46126999
Root Mean Squared Error: 8574.460184832045

Score for training data:  0.8886672126613622
Score for test data:  0.32551384587766274

From the results, we can see that our RMSE is still somewhat large, but it is much better than before when we used linear regression.

In addition, the score for our training data is much better, at nearly a 90% $r^2$ value, indicating that our random forest model was a better fit for our data.

However, although the test score is much improved compared to before, it is still less than half the training score. Thus, overfitting still seems to be an issue despite changing models.

Conclusion

Predicting the price of a previously owned vehicle is still a very challenging process. Many characteristics would need to be taken into account for a truly accurate prediction, such as the reliability of the previous owner(s) or the vehicle history report. In addition, some factors simply cannot be predicted, such as outliers where someone prices their car absurdly high or low for reasons unknown to us. Nevertheless, we believe we were able to build a rough model for predicting the price of a car, albeit with some caveats and issues.

From our exploratory data analysis, we thought the three most important features would likely be the car's mileage, model year, and location. All of these showed visible relationships with price (mileage roughly linear, year roughly quadratic, location varying by region), so we believed they held the most promise for training our machine learning models.

We tried two different machine learning approaches: linear regression and random forests. We started with linear regression based on the findings from our exploratory data analysis. However, when training the model, we found that most of the features we chose didn't factor heavily into the predicted price, except for mileage. When evaluating the linear model, it turned out to have severe issues with overfitting, as well as simply being a poor fit overall.

Based on this, we decided to drop our assumption of linearity and try random forests. Although we were able to find a reasonable model with random forests, overfitting remained a consistent issue. To reduce it, we would likely have to investigate other features to find ones that yield a model more representative of our data. We could also extend our dataset to other websites with more secondhand car listings, like eBay Motors or CarFax.
