# Who Has the Best Prices for Tech’s Top 100 Products of the Year? A Machine Learning Analysis.

### Continuation Analysis

We would like to predict whether each search result from iprice matches one of the top 100 coolest electronic gadgets.

The features we consider are:

`dist_jw` : Jaro–Winkler distance

`price_diff_ratio` : (`price` - `refer_price`) / `refer_price`

`discount` : Discount percentage

### Jaro-Winkler distance

“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.

Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro-Winkler distance uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length”

— Source:

Let's dive into the math so that it is easier to fully understand!

Jaro distance is defined as:
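In its standard form, using the symbols explained just below, the formula reads:

$$
d_j =
\begin{cases}
0 & \text{if } m = 0 \\
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise}
\end{cases}
$$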

Wow, this seems quite complicated…. I would prefer to sleep… Nah, I promise you will fully understand after these few examples.

But first, we need to understand what these terms mean.

dj: Jaro distance

m: Number of matching characters. Two characters match if they are the same and sit no further apart than floor(max(|s1|, |s2|) / 2) - 1 positions in s1 and s2.

t: Half the number of transpositions (count the matching characters that appear in a different order in s1 and s2, then divide by 2)

|s1|: Length of the first string

|s2|: Length of the second string

Let's use an example to explain the math.

How do we calculate the Jaro distance between Facebook and Firebook?

```
matching characters        : Febook   -> 6 characters -> m = 6
no transposition needed    : t = 0
length of the 1st string   : Facebook -> 8 characters -> |s1| = 8
length of the 2nd string   : Firebook -> 8 characters -> |s2| = 8

dj = (1/3) * ((6/8) + (6/8) + ((6-0)/6))
dj ~= 0.83

Jaro distance = 83%
```

After knowing how to calculate Jaro distance, it’s time to understand how to calculate Jaro-Winkler distance!

l: Length of common prefix at the start of the string up to a maximum of 4 characters.

p: Constant scaling factor for how much the score is adjusted upwards for having common prefixes. Normally we use p=0.1 .
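Putting these together, the usual form of the Jaro-Winkler score, with d_j being the Jaro value computed earlier, is:

$$
d_w = d_j + l \cdot p \cdot (1 - d_j)
$$

Since l is at most 4, p should not exceed 0.25, otherwise the score could rise above 1.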

Continuing the previous example of Facebook vs Firebook:

```
dj     : 0.83
prefix : character F -> 1 character -> l = 1
p      : 0.1

dw = 0.83 + 1 * 0.1 * (1 - 0.83)
dw = 0.847

Jaro-Winkler distance = 84.7%
```
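For readers who prefer code to arithmetic, the two calculations can be sketched in plain Python. This is a minimal illustration, not the implementation behind the `L.jaro_winkler` call used later in this article. The function names `jaro` and `jaro_winkler` are made up for this sketch, and the matching-window rule (characters only match if they are at most floor(max(|s1|, |s2|) / 2) - 1 positions apart) follows the standard definition rather than anything spelled out in the text.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score between two strings (1.0 means identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t = half the number of matched characters that are out of order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    dj = jaro(s1, s2)
    prefix_len = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix_len += 1
    return dj + prefix_len * p * (1 - dj)


print(round(jaro("Facebook", "Firebook"), 2))          # 0.83
print(round(jaro_winkler("Facebook", "Firebook"), 2))  # 0.85
```

Without intermediate rounding the Jaro-Winkler value comes out at 0.85; the 0.847 in the worked example results from plugging in the rounded 0.83.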

### Price Difference Ratio

The intuition behind this feature is that if a product's price is much higher or much lower than the price of the top 100 coolest product (the keyword), then the product probably does not match the keyword we are looking for.

For instance, take one of our keywords: Apple Ipad Pro.

The `refer_price` of Apple Ipad Pro is around SGD 1081 (using the exchange rate 1 USD = 1.37 SGD). For a search result priced at SGD 51, the price difference ratio is (51 - 1081) / 1081 ~= -0.95.

In other words, the search result is about 95% cheaper than the `refer_price`, so there is a high probability that it does not match our keyword, in this case Apple Ipad Pro.
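The same arithmetic as a small helper, as a sketch: the function name `price_diff_ratio` simply mirrors the feature name, and the guard against a non-positive reference price is an added assumption, not something the original analysis mentions.

```python
def price_diff_ratio(price: float, refer_price: float) -> float:
    """Relative difference between a listing price and the reference price."""
    if refer_price <= 0:
        # Assumption: a reference price must be positive to be meaningful.
        raise ValueError("refer_price must be positive")
    return (price - refer_price) / refer_price


# The Apple Ipad Pro example: a SGD 51 listing against a SGD 1081 reference.
ratio = price_diff_ratio(51, 1081)
print(f"{ratio:.2f}")  # -0.95
```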

### Calculate Jaro-Winkler distance and Price Difference Ratio

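```python
# Assumption: `L` is the python-Levenshtein module.
import Levenshtein as L

# Jaro-Winkler score between each search result name and its reference name.
for index, row in data.iterrows():
    data.loc[index, 'dist_jw'] = L.jaro_winkler(row['name'], row['refer_name'])

# Relative price difference against the reference price.
data['price_diff_ratio'] = (data['price'] - data['refer_price']) / data['refer_price']
```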

### Visualization of Jaro-Winkler distance and Price Difference Ratio

With the code below, you should be able to reproduce our result.

```python
import seaborn as sns

sns.scatterplot(data=data, x='dist_jw', y='price_diff_ratio', hue='status').set_title(
    "Relationship between Jaro-Winkler distance and price difference ratio")
```

Wow! There seems to be a boundary that separates search results that match the keyword (status = 1) from those that do not (status = 0).

Using the code below, we can find the best horizontal line separating status = 0 from status = 1 and visualize it.

```python
import matplotlib.pyplot as plt
import seaborn as sns

count_dict = {}
x = min(data['price_diff_ratio'])

# Sweep candidate boundaries and count how many rows each one classifies correctly.
while x < 2:
    guess = (data['price_diff_ratio'] >= x).astype(int)
    count_dict[x] = (guess == data['status']).sum()
    x = x + 0.001

# Boundary with the highest number of correct classifications.
boundary_const = max(count_dict, key=count_dict.get)

ax = sns.scatterplot(data=data[data['price_diff_ratio'] <= 1],
                     x='dist_jw', y='price_diff_ratio', hue='status')
plt.axhline(y=boundary_const, color='r', linestyle='-')
plt.show()
```

From the figure above, we can see that the best horizontal line separating the two statuses lies at around -0.55.

The classification rule is:

1. Below -0.55 will be classified as status = 0.
2. At or above -0.55 will be classified as status = 1.

Based on the above classification rule, we get roughly 94% accuracy! We shouldn't take this number too seriously, though: to avoid data leakage, the boundary should be chosen on training data only and then evaluated on held-out data. Treat it as a rough idea of what to expect from the machine learning models later.
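A minimal sketch of that leak-free procedure, using made-up synthetic ratios rather than the iprice data. The helpers `best_threshold` and `accuracy` are illustrative names, not part of the original analysis, and the 80/20 split is done by hand to keep the example dependency-free.

```python
import random


def accuracy(ratios, labels, cutoff):
    """Share of rows where the rule 'predict 1 if ratio >= cutoff' is right."""
    hits = sum((r >= cutoff) == bool(y) for r, y in zip(ratios, labels))
    return hits / len(labels)


def best_threshold(ratios, labels, step=0.001, upper=2.0):
    """Sweep cut-offs from the smallest ratio upwards and keep the most accurate one."""
    best_c, best_acc = None, -1.0
    c = min(ratios)
    while c < upper:
        acc = accuracy(ratios, labels, c)
        if acc > best_acc:
            best_c, best_acc = c, acc
        c += step
    return best_c


# Synthetic data (an assumption for illustration only):
# non-matches are much cheaper than the reference, matches are close to it.
rng = random.Random(0)
rows = [(rng.uniform(-1.0, -0.6), 0) for _ in range(100)]
rows += [(rng.uniform(-0.5, 0.5), 1) for _ in range(100)]
rng.shuffle(rows)

train, test = rows[:160], rows[160:]  # 80% / 20% split

# Choose the boundary on the training rows only...
c = best_threshold([r for r, _ in train], [y for _, y in train])

# ...and report accuracy on rows the boundary has never seen.
test_acc = accuracy([r for r, _ in test], [y for _, y in test], c)
print(f"boundary = {c:.3f}, test accuracy = {test_acc:.2f}")
```

The number printed for the test rows is the one that is fair to compare against the models trained later, because the boundary was not tuned on them.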

This observation gives us the intuition to build a machine learning model that finds the best boundary and therefore has predictive power!

### Select Feature For Machine Learning Model

There are several ways to select features for a model; in our case, we use p-values to select suitable features.

What is a p-value?

It is the probability of observing a result at least as extreme as the one obtained, given that the null hypothesis is true.

If the p-value of a variable is smaller than the significance level, the variable is statistically significant, and vice versa. We choose 0.05 as our significance level.

First, we run a logit model on the `dist_jw`, `price_diff_ratio` and `discount` variables. Do refer to the links below for a detailed explanation of the logit model.

Run the code below to get started!

```python
import statsmodels.api as sm

logit_model = sm.Logit(data['status'], data[["dist_jw", "price_diff_ratio", "discount"]])
result = logit_model.fit()
print(result.summary2())
```

We can see that the p-value of `discount` (0.8549) is larger than 0.05, so the discount variable is statistically insignificant. The p-values of `dist_jw` and `price_diff_ratio` are below 0.05, so both are statistically significant and will be included in our machine learning models.
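The selection step itself can be written as a tiny filter. In this sketch the helper name `select_significant` is made up, and the p-values for `dist_jw` and `price_diff_ratio` are placeholders: the article only states that they are below 0.05, so the 0.001 values stand in for "some small number". Only the 0.8549 for `discount` comes from the model summary.

```python
def select_significant(pvalues: dict, alpha: float = 0.05) -> list:
    """Keep the variables whose p-value is below the significance level."""
    return [name for name, p in pvalues.items() if p < alpha]


pvalues = {
    "dist_jw": 0.001,           # placeholder, reported only as < 0.05
    "price_diff_ratio": 0.001,  # placeholder, reported only as < 0.05
    "discount": 0.8549,         # value quoted from the logit summary
}

print(select_significant(pvalues))  # ['dist_jw', 'price_diff_ratio']
```

With a fitted statsmodels result, the mapping could plausibly be obtained from `result.pvalues.to_dict()` instead of being typed in by hand.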

### Train Machine Learning Model

There are three basic machine learning models which we will be using:

1. Logistic regression
2. Support Vector Machine
3. Random Forest

Wait… we still need to split our data into training and test sets before training our models. In our case, we use 80% of the data for training and 20% for testing.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[["dist_jw", "price_diff_ratio"]], data['status'],
    test_size=0.2, random_state=0)
```

Let us begin to train and predict our first machine learning model.

Logistic regression for classification

```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(
    logreg.score(X_test, y_test)))
```

Wow!!! Using just two features, we get 92% accuracy without fine-tuning our machine learning model. Good stuff! This 92% accuracy will act as the benchmark for our other models.

To visualize this further, we plot the decision boundary learned by logistic regression!

```python
import matplotlib.pyplot as plt
import numpy as np

# Evaluate the predicted probability of status = 1 on a dense grid.
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = logreg.predict_proba(grid)[:, 1].reshape(xx.shape)

# Shade the plane by predicted probability.
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu", vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])

# Overlay the training points, coloured by their true status.
ax.scatter(X_train.iloc[:, 0], X_train.iloc[:, 1], c=y_train, s=50,
           cmap="RdBu", vmin=-.2, vmax=1.2, edgecolor="white", linewidth=1)

ax.set(aspect="equal", xlim=(-5, 5), ylim=(-5, 5),
       xlabel="$X_1$", ylabel="$X_2$")
```

After visualizing the boundary, we proceed to plot the confusion matrix. I referred to this link for the plotting code.

```python
import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix


def plot_confusion_matrix(cm, classes, normalize=False,
                          title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


confusion_mat = confusion_matrix(y_test, y_pred)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(confusion_mat, classes=['Incorrect', 'Correct'],
                      title='Confusion matrix, without normalization')
```

To conclude, 220 out of 239 test samples are predicted correctly. Neither class has a noticeably higher rate of wrong predictions.
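To show how such totals translate into metrics, here is a small sketch that reads accuracy, precision and recall off a 2x2 confusion matrix laid out the scikit-learn way (rows are true labels, columns are predicted labels). The individual cell counts below are hypothetical: they were chosen only so that 220 of 239 predictions are correct, and they are not the article's actual matrix. The helper name `summarize` is also made up.

```python
def summarize(cm):
    """Accuracy, precision and recall from [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical cells: 100 + 120 = 220 correct out of 239 in total.
cm = [[100, 9],
      [10, 120]]

metrics = summarize(cm)
print(f"accuracy = {metrics['accuracy']:.2f}")  # accuracy = 0.92
```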

Let us go further and visualize the ROC curve.

In a Receiver Operating Characteristic (ROC) curve the true positive rate (Sensitivity) is plotted in function of the false positive rate (100-Specificity) for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

A test with perfect discrimination (no overlap in the two distributions) has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity).

Therefore the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test (Zweig & Campbell, 1993).

— by

https://www.medcalc.org/manual/roc-curves.php
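To make the quoted description concrete, here is a minimal sketch that builds the curve by hand: for each cut-off it computes one (false positive rate, true positive rate) pair, then measures the area under the resulting points with the trapezoid rule. The four labels and scores are made up, the helper names `roc_points` and `auc` are illustrative, and the sketch assumes both classes are present.

```python
def roc_points(y_true, scores):
    """One (fpr, tpr) point per distinct cut-off, from strictest to loosest."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= cutoff and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve via the trapezoid rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


# Toy example: two negatives and two positives with predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_true, scores)
print(list(zip(fpr, tpr)))
print(auc(fpr, tpr))  # 0.75
```

A curve hugging the upper left corner would push this area towards 1, while the diagonal of a random guess gives 0.5.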

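```python
from sklearn.metrics import roc_auc_score, roc_curve

# Note: the area here is computed from hard 0/1 predictions. Scoring
# logreg.predict_proba(X_test)[:, 1] instead is the more usual choice
# and may give a different value.
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:, 1])

plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1], 'r--')  # diagonal = random guessing
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
```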

We can observe that the area under the ROC curve (0.92) is close to 1 for logistic regression, meaning the model separates matching from non-matching results well!

Support Vector Machine for classification

```python
from sklearn.svm import SVC

clf = SVC(kernel='rbf')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(
    clf.score(X_test, y_test)))
```

The Support Vector Machine beats our 92% benchmark by 2 percentage points!

Random Forest for classification

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print('Accuracy of Random Forest classifier on test set: {:.2f}'.format(
    clf.score(X_test, y_test)))
```

Random Forest beats our 92% benchmark by 4 percentage points and is the best of the three models we have tested. We reach 96% without any fine-tuning, which suggests that if we are able to create good features, we do not need to spend a lot of time tuning the model to achieve a desirable accuracy!

### Further Improvements

1. Include more variables, for example likes, comments and ratings for each of the keywords.
2. Perform more string manipulation on each keyword to obtain more searched results for our analysis and modeling.