Who Has the Best Prices for Tech’s Top 100 Products of the Year? A Machine Learning Analysis.

Continuation Analysis

We would like to predict whether each search result from iprice matches one of the top 100 coolest electronic gadgets.

The features we considered are:

dist_jw : Jaro–Winkler distance

price_diff_ratio : (price - refer_price) / refer_price

discount : Discount percentage

Jaro-Winkler distance

“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.

Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro-Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length”

— Source:

Let's dive into the math so that it is easier to fully understand!

Jaro distance is defined as:

dj = (1/3) * ( m/|s1| + m/|s2| + (m - t)/m ), with dj = 0 when m = 0

Wow, this seems quite complicated…. I would prefer to sleep… Nah, I promise you will fully understand after a few examples.

But first, we need to understand what these terms mean.

dj: Jaro distance
m: the number of matching characters, i.e. equal characters that appear in both s1 and s2 no more than floor(max(|s1|, |s2|) / 2) - 1 positions apart
t: half the number of transpositions, i.e. the number of matching characters that appear in a different order in s1 and s2, divided by 2
|s1|: the length of the first string
|s2|: the length of the second string

Let's use an example to explain the math.

How do we calculate the Jaro distance between Facebook and Firebook?

matching characters      : Febook   -> 6 characters -> m = 6
no transposition needed  : t = 0
length of the 1st string : Facebook -> 8 characters -> |s1| = 8
length of the 2nd string : Firebook -> 8 characters -> |s2| = 8
dj = (1/3) * ( (6/8) + (6/8) + ((6-0)/6) )
dj ~= 0.83
Jaro distance = 83%

Note that a higher value means the two strings are more alike: identical strings score 100%.
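If you prefer to check the arithmetic with code, here is a minimal pure-Python sketch of the Jaro formula. It is our own illustration (the function name `jaro` is made up for this post) and not the implementation inside the library we use later, so treat it as a way to follow the steps rather than a reference implementation.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score: 1.0 for identical strings, 0.0 when nothing matches."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Equal characters only count as a match if their positions differ
    # by at most this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t = half the number of matched characters that are out of order
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


print(round(jaro("Facebook", "Firebook"), 2))  # 0.83
```

For Facebook vs Firebook the sketch finds the same six matching characters (F, e, b, o, o, k) and no transpositions, which reproduces the 0.83 from the hand calculation.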

Now that we know how to calculate the Jaro distance, it’s time to understand how to calculate the Jaro-Winkler distance:

dw = dj + l * p * (1 - dj)

l: Length of common prefix at the start of the string up to a maximum of 4 characters.

p: Constant scaling factor for how much the score is adjusted upwards for having common prefixes. Normally we use p = 0.1.

Continuing with the previous example of Firebook vs Facebook:

dj    : 0.83
prefix: character F -> 1 character -> l=1
p : 0.1
dw = 0.83 + 1 * 0.1 * (1-0.83)
dw = 0.847
Jaro-Winkler distance = 84.7%
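The prefix adjustment is small enough to sketch in a few lines. The helper below is hypothetical (the name `winkler_adjust` is ours) and takes an already computed Jaro score as input, so it only illustrates the bonus step described above.

```python
def winkler_adjust(dj: float, s1: str, s2: str, p: float = 0.1) -> float:
    """Apply the Winkler prefix bonus to an already computed Jaro score dj."""
    # l = length of the shared prefix, capped at 4 characters
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return dj + l * p * (1 - dj)


print(round(winkler_adjust(0.83, "Firebook", "Facebook"), 3))  # 0.847
```

Only the leading F is shared, so l = 1 and the score moves up by 0.1 * (1 - 0.83) = 0.017. Two names sharing a four-character prefix would get four times that bonus.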

Price Difference Ratio

The intuition behind this feature is that if the price of a product is much higher or lower than the price of the top 100 coolest product (keyword), then this product probably does not match the keyword we are looking for.

For instance, take one of our keywords: Apple Ipad Pro.

The refer_price of Apple Ipad Pro is around SGD 1081 (using the exchange rate 1 USD = 1.37 SGD). For a search result priced at SGD 51, the price difference ratio = (51 - 1081) / 1081 ~= -0.95.

In other words, the product is 95% cheaper than refer_price, so there is a high probability that it does not match our keyword, in this case Apple Ipad Pro.
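As a quick sanity check of the arithmetic, the ratio can be written as a tiny function. The function name is our own; the two prices are the ones from the example above.

```python
def price_diff_ratio(price: float, refer_price: float) -> float:
    """Relative gap between a listing price and the reference price."""
    return (price - refer_price) / refer_price


# A listing at SGD 51 compared with a reference price of SGD 1081
print(round(price_diff_ratio(51, 1081), 2))  # -0.95
```

A ratio near 0 means the listing costs about the same as the reference product, while a value near -1 means it is almost free by comparison, which is typical of accessories such as cases or cables.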

Calculate Jaro-Winkler distance and Price Difference Ratio

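import Levenshtein as L  # assumed: the python-Levenshtein package

# Jaro-Winkler score between each search result name and its reference name
for index, row in data.iterrows():
    data.loc[index, 'dist_jw'] = L.jaro_winkler(row['name'], row['refer_name'])

# Relative price gap between the search result and the reference product
data['price_diff_ratio'] = (data['price'] - data['refer_price']) / data['refer_price']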

Visualization of Jaro-Winkler distance and Price Difference Ratio

With the code below, you should be able to reproduce our result.

import seaborn as sns

sns.scatterplot(data=data, x='dist_jw', y='price_diff_ratio', hue='status').set_title("Relationship between Jaro-Winkler distance and price difference ratio")

Wow! It seems there is a boundary that separates the searched products which match the keyword (status = 1) from those which do not (status = 0).

Using the code below, we can find the horizontal line that best separates status = 0 from status = 1 and visualize it.

import matplotlib.pyplot as plt

count_dict = {}
x = data['price_diff_ratio'].min()
while x < 2:
    # Guess status = 1 when the ratio is at or above the candidate threshold x
    guess = (data['price_diff_ratio'] >= x).astype(int)
    # Number of rows where the guess agrees with the true status
    count_dict[x] = (guess == data['status']).sum()
    x = x + 0.001
# Threshold with the highest number of correct guesses
boundary_const = max(count_dict, key=count_dict.get)
ax = sns.scatterplot(
    data=data[data['price_diff_ratio'] <= 1],
    x='dist_jw', y='price_diff_ratio', hue='status')
plt.axhline(y=boundary_const, color='r', linestyle='-')
plt.show()

From the figure above, we can see that the best horizontal line separating the two statuses sits at around -0.55.

The classification rule is:

1. A price_diff_ratio below -0.55 will be classified as status = 0.
2. A price_diff_ratio of -0.55 or above will be classified as status = 1.

Based on the above classification rule, we get roughly 94% accuracy! We shouldn’t take this number too seriously, because the threshold was both chosen and evaluated on the same full dataset. To avoid data leakage, the rule should be fitted on training data only and then scored on held-out data. So treat this number as a rough idea of what to expect from the machine learning models later.
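To make the leakage point concrete, here is a small sketch that picks the threshold on a training part only and then scores it on a held-out part. Everything in it is an assumption for illustration: the data is randomly generated, and `best_threshold` and `accuracy` are helper names we made up, not functions from any library.

```python
import random


def best_threshold(ratios, labels, step=0.001, upper=2.0):
    """Cut-off with the most correct guesses (guess 1 if ratio >= cut-off)."""
    best_x, best_correct = None, -1
    x = min(ratios)
    while x < upper:
        correct = sum((r >= x) == bool(y) for r, y in zip(ratios, labels))
        if correct > best_correct:
            best_x, best_correct = x, correct
        x += step
    return best_x


def accuracy(ratios, labels, threshold):
    """Share of rows where the threshold rule agrees with the label."""
    hits = sum((r >= threshold) == bool(y) for r, y in zip(ratios, labels))
    return hits / len(labels)


# Made-up data: non-matches (0) have very negative ratios, matches (1) do not
rng = random.Random(0)
rows = [(rng.uniform(-1.0, -0.6), 0) for _ in range(50)]
rows += [(rng.uniform(-0.5, 0.5), 1) for _ in range(50)]
rng.shuffle(rows)
train, test = rows[:80], rows[80:]

# Fit the rule on the training rows only, then score it on unseen rows
threshold = best_threshold(*zip(*train))
print("threshold:", round(threshold, 3))
print("test accuracy:", accuracy(*zip(*test), threshold))
```

The test accuracy printed at the end is the honest estimate, because none of the test rows were used to pick the threshold.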

This observation gives us the intuition to build a machine learning model that finds the best boundary for us, which is where the predictive power comes from!

Select Feature For Machine Learning Model

There are several ways to select the features to include in our model. In our case, we use p-values to select suitable features.

What is p-value?

The probability of observing a result at least as extreme as the one obtained, given that the null hypothesis is true.

If the p-value of a variable is smaller than the significance level, the variable is statistically significant, and vice versa. We choose 0.05 as our significance level.
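For a feel of where such a number comes from: a logit summary typically reports a z statistic per coefficient, and the two-sided p-value is the probability that a standard normal variable lands at least that far from zero. The sketch below assumes that normal approximation and uses a function name of our own.

```python
import math


def two_sided_p_value(z: float) -> float:
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2))


alpha = 0.05  # significance level used in this post
for z in (0.5, 1.96, 3.0):
    p = two_sided_p_value(z)
    print(f"z = {z}: p = {p:.4f}, significant = {p < alpha}")
```

A z statistic of about 1.96 sits right at the 0.05 cut-off: anything further from zero counts as significant, anything closer does not.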

First, we run a logit model on the dist_jw, price_diff_ratio and discount variables. Do refer to the links below for a detailed explanation of the Logit model.

Run the code below to get started!

import statsmodels.api as sm

# Fit a logit model of status on the three candidate features
logit_model = sm.Logit(data['status'], data[["dist_jw", "price_diff_ratio", "discount"]])
result = logit_model.fit()
print(result.summary2())

We can see that the p-value of discount (0.8549) is larger than 0.05, so the discount variable is statistically insignificant. The p-values of dist_jw and price_diff_ratio are smaller than 0.05, so these two variables are statistically significant and will be included in our machine learning models.

Train Machine Learning Model

There are three basic machine learning models which we will be using:

1. Logistic regression
2. Support Vector Machine
3. Random Forest

Wait…. We still need to split our data into train data and test data before training our models. In our case, we use 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[["dist_jw", "price_diff_ratio"]], data['status'], test_size=0.2, random_state=0)
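If you are curious what an 80/20 split does under the hood, the idea can be sketched with the standard library alone. This is only an illustration with a made-up helper name; it does not reproduce the exact shuffling of the library call above, so the selected rows will differ.

```python
import math
import random


def split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row positions, then reserve the first chunk for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))   # 8 2
print(set(train_idx) & set(test_idx))  # set(), no row is in both parts
```

The important property is the last line: the model never sees a test row during training, and fixing the seed makes the split repeatable.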

Let us begin to train and predict our first machine learning model.

Logistic regression for classification

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Wow!!! With just two features we get 92% accuracy without fine-tuning our machine learning model. Good stuff! This 92% accuracy will act as the benchmark for the other models.

To see more, we plot the decision boundary calculated by logistic regression!

import numpy as np

# Evaluate P(y = 1) on a dense grid covering both features
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = logreg.predict_proba(grid)[:, 1].reshape(xx.shape)
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
                      vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])
# Overlay the training points: X_1 is dist_jw, X_2 is price_diff_ratio
ax.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2,
           edgecolor="white", linewidth=1)
ax.set(aspect="equal",
       xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")
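The boundary in such a plot is a straight line because logistic regression passes a linear combination of the features through a sigmoid. Here is a minimal sketch of that relationship. The coefficients are hypothetical, not the fitted values from our model (with scikit-learn you would read those from `logreg.intercept_` and `logreg.coef_`).

```python
import math


def predict_proba(x1, x2, b0, b1, b2):
    """P(y = 1) under a two-feature logistic regression."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))


def boundary_x2(x1, b0, b1, b2):
    """x2 on the P(y = 1) = 0.5 line for a given x1 (requires b2 != 0)."""
    return -(b0 + b1 * x1) / b2


# Hypothetical coefficients, NOT the values fitted in this post
b0, b1, b2 = 1.0, 2.0, 4.0
x1 = 0.8
x2 = boundary_x2(x1, b0, b1, b2)
print(x2, predict_proba(x1, x2, b0, b1, b2))
```

Any point on the line b0 + b1*x1 + b2*x2 = 0 gets a probability of exactly 0.5, and the probability moves towards 0 or 1 as you step away from the line on either side.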

After visualizing the boundary, we proceed to plot the confusion matrix. We referred to this link to plot our confusion matrix.

import itertools
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

confusion_mat = confusion_matrix(y_test, y_pred)
# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(confusion_mat, classes=['Incorrect', 'Correct'], title='Confusion matrix, without normalization')

To conclude, 220 out of 239 test samples are predicted correctly. Neither class has a noticeably higher rate of wrong predictions.
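If the four cells of a confusion matrix feel abstract, counting them by hand helps. The labels below are toy values for illustration only, not our test set, and `confusion_counts` is a helper name of our own.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    c = Counter(zip(y_true, y_pred))
    return c[(0, 0)], c[(0, 1)], c[(1, 0)], c[(1, 1)]


# Toy labels, NOT the test set from this post
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(tn, fp, fn, tp)             # 3 1 1 3
print((tn + tp) / len(y_true))    # 0.75

# Accuracy is the diagonal over the total; for the figures quoted above:
print(round(220 / 239, 2))        # 0.92
```

Accuracy only looks at the diagonal (tn and tp), so it is worth checking the off-diagonal cells as well to see whether the mistakes pile up in one class.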

Let us go further and visualize the ROC curve.

In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).

Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).

— by

https://www.medcalc.org/manual/roc-curves.php

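from sklearn.metrics import roc_auc_score, roc_curve

# Note: this AUC is computed from hard 0/1 predictions, which is how the value reported below was obtained.
# Passing logreg.predict_proba(X_test)[:, 1] instead gives the conventional AUC and may yield a different number.
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()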

We can observe that the area under the ROC curve (0.92) is close to 1 for logistic regression, meaning the model separates matching and non-matching products well!
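To see how an ROC curve and its area are built, here is a from-scratch sketch on toy data. The helper names `roc_points` and `auc` are ours and the labels and scores are made up. It also shows that the area depends on whether you feed in predicted probabilities or hard 0/1 predictions.

```python
def roc_points(y_true, scores):
    """(fpr, tpr) pairs from sweeping the threshold over every distinct score."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under (fpr, tpr) points ordered by fpr."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy labels and predicted probabilities, for illustration only
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard = [1 if p >= 0.5 else 0 for p in probs]

print(round(auc(roc_points(y_true, probs)), 3))  # area from probabilities
print(round(auc(roc_points(y_true, hard)), 3))   # area from hard 0/1 labels
```

With probabilities the curve has one point per distinct score. With hard labels it collapses to a single interior point, so the two areas differ on this toy data, and it is worth stating which input was used whenever an AUC is reported.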

Support Vector Machine for classification

from sklearn.svm import SVC

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

The Support Vector Machine classifier beats our 92% benchmark by 2 percentage points!

Random Forest for classification

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

The Random Forest classifier beats our 92% benchmark by 4 percentage points and is the best of the three models we have tested. We achieve 96% accuracy without any fine-tuning. This suggests that if we are able to create good features, we do not need to spend a lot of time fine-tuning the model to reach a desirable accuracy!

Further Improvements

1. Include more variables, for example the likes, comments and ratings of each of the keywords.
2. Perform more string manipulation on each keyword to obtain more search results for our analysis and modeling.

Happy Coding and feel free to comment below.

If you want us to fine tune our machine learning model, please let us know by commenting below!