
Assignment A01_Business Analytics

Problem definition

A professor of machine learning is planning to marry his long-time girlfriend. He has never shopped for diamonds before. In the mall, he was confronted with a dizzying array of diamond characteristics, configurations, and prices. A quick search revealed that diamonds are primarily characterized by the 4Cs (Colour, Cut, Carat weight, and Clarity), as well as by Polish, Symmetry, and Certification. He scraped three wholesaler websites to collect data for a pricing model, so that he does not get cheated when purchasing the diamond ring. The task is to build a linear regression model that predicts the price of the diamond ring he is interested in.

Import data

## Rows: 440
## Columns: 9
## $ Carat         <dbl> 92, 92, 82, 81, 9, 87, 8, 84, 8, 8, 85, 83, 82, 82, 8, 9~
## $ Colour        <chr> "I", "I", "F", "G", "J", "F", "D", "F", "D", "D", "G", "~
## $ Clarity       <chr> "SI2", "SI2", "SI2", "SI1", "VS2", "SI2", "SI2", "SI1", ~
## $ Cut           <chr> "G", "V", "I", "I", "V", "I", "I", "G", "V", "V", "I", "~
## $ Certification <chr> "AGS", "AGS", "GIA", "GIA", "GIA", "AGS", "GIA", "GIA", ~
## $ Polish        <chr> "V", "G", "X", "X", "V", "G", "V", "V", "V", "V", "V", "~
## $ Symmetry      <chr> "V", "G", "X", "V", "V", "V", "V", "V", "V", "X", "V", "~
## $ Price         <dbl> 3000, 3000, 3004, 3004, 3006, 3007, 3008, 3010, 3012, 30~
## $ Wholesaler    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~

We use exploratory graphs in data analysis to understand the properties of the data, find patterns, and suggest modeling strategies.

Univariate Analysis of Metric Data

Univariate EDA for a quantitative variable is a way to make preliminary assessments about the population distribution of the variable using the data of the observed sample.

Price

The distribution of price shows two distinct ranges: the first between roughly $100 and $700, and the second between roughly $1800 and $3300. From this information alone we cannot tell why there are no prices between $800 and $1700. The median price is $2169 and the mean is $1717. The professor’s ring is priced at $3100, which is close to the maximum price in the data set ($3145). This means that either the ring is of particularly high quality or the professor is being overcharged. We need to analyze the other characteristics of the diamond to find out which is the case.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     160     520    2169    1717    3012    3145

Carat

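Let’s look at another characteristic of the diamond ring, Carat. Note that the summary below reports 51 missing values and a maximum of 92, so the values appear to have been recorded without a decimal point (92 where 0.92 carat would be expected).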

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    3.00    3.00   13.05    9.00   92.00      51

Univariate Analysis of Non-Metric Data

In this section, we carry out a univariate analysis of the non-metric variables.

Colour

According to the data, Colour takes the values D, E, F, G, H, I, J, K, and L. The colour grades are:
D-F: Colorless
G-I: Near Colorless
J-K: Faint Yellow
L-N: Very Light Yellow
O-S: Light Yellow
T-Z: Yellow

The mode is colour I, and the least frequent colour is L. The professor wants to buy a diamond ring of colour J (Faint Yellow), which is the second most frequent colour in the data. However, this univariate analysis cannot tell us how colour relates to price. We need a bivariate analysis to estimate the price of a Faint Yellow diamond.

##    Length     Class      Mode 
##       440 character character
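
The summary output above reports only the length and class of the column, so it does not show the frequencies discussed in the text. As a minimal sketch of a frequency table (in Python, purely for illustration), the snippet below counts the first eleven Colour values visible in the import listing. This is a partial sample, not the full 440 rows, so the counts will not match the frequencies described above.

```python
from collections import Counter

# First eleven Colour values visible in the import listing.
# This is only a partial sample of the 440 rows, used for illustration.
colour_sample = ["I", "I", "F", "G", "J", "F", "D", "F", "D", "D", "G"]

# Count how often each colour grade occurs in the sample.
colour_counts = Counter(colour_sample)

# Print the grades from most to least frequent within this sample.
for level, count in colour_counts.most_common():
    print(f"{level}: {count}")
```

The same pattern applies to Clarity, Cut, Certification, Polish, and Symmetry once the full columns are available.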

Clarity

I1: very few inclusions visible to naked eye
I2: few inclusions visible to naked eye
SI1: very very few inclusions at 10X
SI2: very few inclusions at 10X
SI3: several inclusions at 10X
VS1: few inclusions at 30X
VS2: several inclusions at 30X
VVS1: very very few inclusions at 30X
VVS2: very few inclusions at 30X

The figure shows that the most frequent clarity is SI1 (very very few inclusions at 10X). The professor wants to buy a ring with SI2 clarity (very few inclusions at 10X), which is the second most frequent clarity grade. Again, we need a bivariate analysis to see how the price of a diamond ring varies across clarity grades.

##    Length     Class      Mode 
##       440 character character

Cut

F: Fair; G: Good; I: Ideal; V: Very Good; X: Excellent

Cut type X, which denotes an excellent cut, is the most frequent. The professor is going to buy a Very Good cut, which is the second most frequent cut type.

##    Length     Class      Mode 
##       440 character character

Certification

The mode of the certification variable is GIA, and the professor also wants a diamond ring with GIA certification. As in the previous analyses, a univariate investigation cannot reveal the relationship between price and type of certification.

##    Length     Class      Mode 
##       440 character character

Polish

F: Fair; G: Good; I: Ideal; V: Very Good; X: Excellent

The Polish classification uses the same codes as Cut. As the bar chart shows, Very Good and Good are the most and second most frequent grades. The professor’s ring has Good polish, which seems to be a reasonable choice.

##    Length     Class      Mode 
##       440 character character

Symmetry

F: Fair; G: Good; I: Ideal; V: Very Good; X: Excellent

The Symmetry classification also uses the same codes as Cut and Polish. Very Good and Good symmetry are the most common. The professor’s choice is Very Good, which is the mode of the Symmetry variable.

##    Length     Class      Mode 
##       440 character character

Bivariate Analysis

In this part of the analysis, we examine the relationship between price and each of the other metric and non-metric variables using regression models and plots. Correlation and covariance are also considered.

Price vs Carat

The relationship between Price and Carat is shown in the following figure, together with the fitted linear regression model and its coefficients. For the professor’s choice of 0.9 carat, the asking price of $3100 seems fair according to this analysis. The p-values of both coefficients are far below 0.05, so the linear relationship between Price and Carat is statistically significant.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   1340.      64.1      20.9  1.48e-65
## 2 Carat           19.8      2.40      8.25 2.56e-15

\[\widehat{Price}_{i} = 1339.88 + 19.82 \times Carat_{i}\]
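
To make the "seems fair" judgement concrete, the fitted equation above can be evaluated at the professor's carat value. The sketch below (Python, for illustration only) assumes that 0.9 carat is entered as 90 on the data set's recorded scale, which is how the multiple-regression equation later in this report enters it. The function name `predict_price` is hypothetical.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 1339.88
CARAT_SLOPE = 19.82


def predict_price(carat_recorded: float) -> float:
    """Predicted price for a Carat value on the data set's recorded scale."""
    return INTERCEPT + CARAT_SLOPE * carat_recorded


# Assumption: 0.9 carat is recorded as 90 in this data set.
predicted = predict_price(90)
print(f"Predicted price: ${predicted:.2f}")
```

Under that assumption the simple model predicts about $3124, which is close to the $3100 asking price.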

Price vs Colour

The relationship between Price and Colour is shown in the following figure, together with the fitted linear regression model and its coefficients. For the professor’s choice, colour J, we see a wide range of prices, which implies that other variables need to be analyzed using multiple regression. The p-values for some colours (G, J, K, and L) are not significant at the 5% level, so Colour alone does not explain price well.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 9 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   2316.       254.     9.13  2.74e-18
## 2 ColourE       -764.       297.    -2.57  1.04e- 2
## 3 ColourF       -982.       294.    -3.34  9.17e- 4
## 4 ColourG       -148.       307.    -0.481 6.31e- 1
## 5 ColourH       -874.       287.    -3.04  2.50e- 3
## 6 ColourI       -766.       284.    -2.70  7.30e- 3
## 7 ColourJ       -535.       287.    -1.87  6.27e- 2
## 8 ColourK         42.1      326.     0.129 8.97e- 1
## 9 ColourL         52.3      414.     0.126 9.00e- 1
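
The coefficients in the table above can be read as differences from a baseline colour. The sketch below (Python, for illustration only) assumes treatment (dummy) coding with colour D as the baseline, since D is the only colour without its own row. Under that assumption, the intercept is the estimated mean price for D and each estimated group mean is the intercept plus the colour's coefficient.

```python
# Estimates copied from the Price vs Colour table above.
# Assumption: colour D is the baseline level, so the intercept is its mean price.
intercept = 2316.0
offsets = {
    "E": -764.0, "F": -982.0, "G": -148.0, "H": -874.0,
    "I": -766.0, "J": -535.0, "K": 42.1, "L": 52.3,
}

# Estimated mean price per colour = intercept + colour coefficient.
group_means = {"D": intercept}
for colour, offset in offsets.items():
    group_means[colour] = round(intercept + offset, 1)

for colour, mean_price in group_means.items():
    print(f"Colour {colour}: {mean_price:.1f}")
```

For colour J this gives an estimated mean price of about $1781, well below the $3100 asking price, which is consistent with the need to account for other variables.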

Price vs Clarity

The analysis of Price vs Clarity gives results similar to those for Colour. For instance, for SI2 clarity, which is the professor’s choice, we see a wide range of actual prices. This means that other variables also affect the price.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 9 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   2543.       108.    23.6   1.82e-79
## 2 ClarityI2     -201.       214.    -0.940 3.48e- 1
## 3 ClaritySI1   -1496.       141.   -10.6   1.69e-23
## 4 ClaritySI2    -569.       143.    -3.99  7.79e- 5
## 5 ClaritySI3      76.2      220.     0.347 7.29e- 1
## 6 ClarityVS1   -1405.       209.    -6.74  5.17e-11
## 7 ClarityVS2   -1655.       187.    -8.85  2.27e-17
## 8 ClarityVVS1  -1996.       700.    -2.85  4.54e- 3
## 9 ClarityVVS2  -1979.       450.    -4.39  1.40e- 5

Price vs Cut

The relationship between Price and Cut is shown in the following figure, together with the fitted linear regression model and its coefficients. Most of the regression coefficients are significant (only Cut G is not significant at the 5% level), although the plot shows a gap in the prices within each cut category. Therefore, we need more analysis.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    2455.      145.     16.9  1.07e-49
## 2 CutG           -409.      215.     -1.90 5.82e- 2
## 3 CutI           -723.      188.     -3.84 1.42e- 4
## 4 CutV          -1277.      184.     -6.94 1.44e-11
## 5 CutX           -797.      171.     -4.65 4.43e- 6
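
The `statistic` column in these tables is the estimate divided by its standard error. The sketch below (Python, for illustration only) recomputes it from the rounded values printed in the Cut table above, so small differences from the reported statistics are expected.

```python
# (estimate, std.error, reported statistic) copied from the Price vs Cut table above.
cut_rows = {
    "CutG": (-409.0, 215.0, -1.90),
    "CutI": (-723.0, 188.0, -3.84),
    "CutV": (-1277.0, 184.0, -6.94),
    "CutX": (-797.0, 171.0, -4.65),
}

# t statistic = estimate / standard error.
t_values = {term: est / se for term, (est, se, _) in cut_rows.items()}

for term, (_, _, reported) in cut_rows.items():
    print(f"{term}: recomputed t = {t_values[term]:.2f}, reported = {reported:.2f}")
```

The recomputed values agree with the table to within rounding of the printed estimates and standard errors.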

Price vs Certification

The relationship between Price and Certification is shown in the following figure, together with the fitted linear regression model and its coefficients. For the professor’s choice, GIA certification, the regression coefficient is significant, but the plot shows a wide range of prices, meaning that the effect of other variables should be considered.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 5 x 5
##   term             estimate std.error statistic  p.value
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)         3033.      265.     11.4  1.28e-26
## 2 CertificationDOW   -1002.      957.     -1.05 2.96e- 1
## 3 CertificationEGL    -356.      279.     -1.28 2.02e- 1
## 4 CertificationGIA   -1574.      271.     -5.80 1.30e- 8
## 5 CertificationIGI   -2768.      300.     -9.22 1.31e-18
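
One way to see how precisely the GIA coefficient is estimated is an approximate 95% confidence interval. The sketch below (Python, for illustration only) uses the normal approximation with a multiplier of 1.96, which is an assumption and not something computed in the original analysis. The comparison is against the baseline certification, which is assumed to be AGS because it has no row of its own.

```python
# Estimate and standard error copied from the CertificationGIA row above.
estimate = -1574.0
std_error = 271.0

# Assumption: normal approximation for a 95% interval (multiplier 1.96).
Z_95 = 1.96

margin = Z_95 * std_error
ci_low = estimate - margin
ci_high = estimate + margin

print(f"Approximate 95% CI for the GIA coefficient: ({ci_low:.2f}, {ci_high:.2f})")
```

The whole interval lies below zero, which agrees with the small p-value reported for this coefficient.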

Price vs Polish

The relationship between Price and Polish is shown in the following figure, together with the fitted linear regression model and its coefficients. None of the Polish coefficients is significant at the 5% level, including the one for the professor’s choice, Good polish.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 6 x 5
##   term        estimate std.error statistic    p.value
##   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)    2319.      516.     4.49  0.00000907
## 2 PolishG        -404.      524.    -0.771 0.441     
## 3 PolishI         729.      730.     0.998 0.319     
## 4 Polishv         762.     1264.     0.603 0.547     
## 5 PolishV        -715.      523.    -1.37  0.172     
## 6 PolishX        -940.      537.    -1.75  0.0808

Price vs Symmetry

The relationship between Price and Symmetry is shown in the following figure, together with the fitted linear regression model and its coefficients. Most of the regression coefficients are significant at the 5% level (only Symmetry I is not), including the one for Symmetry V, which is the professor’s diamond ring specification. However, the plot shows a wide range of prices, meaning that the effect of other variables should be considered.

## `geom_smooth()` using formula 'y ~ x'

## # A tibble: 5 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    2432.      250.      9.74 2.00e-20
## 2 SymmetryG      -538.      266.     -2.02 4.36e- 2
## 3 SymmetryI       615.      569.      1.08 2.80e- 1
## 4 SymmetryV      -967.      262.     -3.69 2.53e- 4
## 5 SymmetryX      -673.      297.     -2.27 2.38e- 2
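
Significance can be read off the table above by comparing each p-value with a threshold. The sketch below (Python, for illustration only) uses the conventional 5% level, which is an assumption because the report does not state a threshold.

```python
# Assumption: conventional 5% significance level.
ALPHA = 0.05

# p-values copied from the Price vs Symmetry table above.
p_values = {
    "SymmetryG": 4.36e-2,
    "SymmetryI": 2.80e-1,
    "SymmetryV": 2.53e-4,
    "SymmetryX": 2.38e-2,
}

# Split the terms by whether their p-value falls below the threshold.
significant = [term for term, p in p_values.items() if p < ALPHA]
not_significant = [term for term, p in p_values.items() if p >= ALPHA]

print("Significant at 5%:", significant)
print("Not significant at 5%:", not_significant)
```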

Multiple Regression Model

In the previous section, we found that several of the individual coefficients are not significant and that no single variable explains the wide range of prices. Therefore, we fit a multiple regression model to estimate the price of the diamond ring more precisely. The estimated price of the diamond ring is $2959.7, which is a little lower than the asking price of $3100.

## # A tibble: 35 x 5
##    term        estimate std.error statistic  p.value
##    <chr>          <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)   1902.     663.      2.87   4.39e- 3
##  2 Carat           22.8      1.81   12.6    3.33e-30
##  3 ColourE       -237.     204.     -1.16   2.47e- 1
##  4 ColourF       -281.     203.     -1.38   1.68e- 1
##  5 ColourG         10.9    206.      0.0530 9.58e- 1
##  6 ColourH       -417.     202.     -2.06   3.99e- 2
##  7 ColourI       -381.     203.     -1.88   6.13e- 2
##  8 ColourJ       -295.     209.     -1.41   1.59e- 1
##  9 ColourK        114.     239.      0.477  6.34e- 1
## 10 ColourL       -227.     304.     -0.747  4.55e- 1
## # ... with 25 more rows

\[\widehat{Price}_{i} = 1901.97 + 22.75 \times 90 - 295.49 \times 1 - 412.37 \times 1 + 584.78 \times 1 - 438.32 \times 1 - 428.37 \times 1 = 2959.7\]

Summary

First, we carried out a univariate analysis of the metric and non-metric variables independently and found that the effect of other variables needed to be evaluated. Then, we carried out a bivariate analysis of price against each of the other variables. Finally, we built a multiple regression model to obtain our best estimate of the price of the professor’s diamond ring. Our estimate is $2959.7, which is a little lower than the asking price of $3100. The estimate suggests that the ring is worth about $140 less than the asking price.
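
The final comparison can be reproduced by summing the terms of the multiple-regression equation and subtracting the result from the asking price. The sketch below (Python, for illustration only) copies the terms exactly as they appear in that equation. The first of the remaining terms matches the ColourJ estimate in the coefficient table; the others belong to rows the truncated table does not show, so they are deliberately left unlabelled.

```python
# Asking price quoted in the report.
ASKING_PRICE = 3100.0

# Terms copied from the multiple-regression equation.
intercept = 1901.97
carat_term = 22.75 * 90  # Carat entered as 90, as in the equation

# Remaining indicator terms, in the order they appear in the equation.
# Only the first (-295.49, ColourJ) can be identified from the visible table rows.
other_terms = [-295.49, -412.37, 584.78, -438.32, -428.37]

estimate = intercept + carat_term + sum(other_terms)
gap = ASKING_PRICE - estimate

print(f"Estimated price: ${estimate:.2f}")
print(f"Asking price minus estimate: ${gap:.2f}")
```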