The Different MANOVA Analysis (Math Problem Sample)
The aim of this exercise is to analyze a chosen data set using the techniques discussed in the course, with the ultimate objective of developing a good classification model. The report should include at least:
1. A brief summary of the data, including background information and attributes (which can be treated as categorical and which are potential classifiers).
2. Check necessary assumptions, e.g. normality, homogeneity of covariance matrices
3. Data transformation if necessary
4. Selection of optimal classification rule
5. An additional classification rule for further comparison
6. Evaluation of the classification rules proposed. You may choose to use APER and/or Ê(AER) and/or ROC.
7. Selection of classifier.
8. Conclusion.
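Item 6 refers to APER, the apparent error rate: the fraction of training observations that a classification rule assigns to the wrong class. A minimal Python sketch of that calculation follows. The function name and the confusion-matrix counts are hypothetical and serve only as an illustration; they are not results from this report.

```python
def apparent_error_rate(confusion):
    """Return APER = misclassified observations / total observations.

    confusion[i][j] is the number of training observations from true
    class i that the rule assigned to class j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# Hypothetical two-class counts, for illustration only.
example = [[40, 10],
           [5, 45]]
print(apparent_error_rate(example))
```

Because APER is computed on the same data used to build the rule, it tends to be optimistic, which is why a holdout-based estimate such as Ê(AER) is usually reported alongside it.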
Data sets:
• Multiple sclerosis data, AMSA, Johnson and Wichern (see exercise 1.14 and exercise 11.23)
• Crude oil data, AMSA, Johnson and Wichern (see exercise 11.30)
• Data on Brands of Cereal (see exercise 11.34 and Table 11.9)
• Real estate sales data provided in realestate.txt. For further information, see the screenshot provided in Figure 1.
• Breast tissue data, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/Breast+Tissue
Restrict the number of classes to four, as described on the above webpage.
To clarify, the following R packages are allowed to be used:
MASS
nnet
ROCR
st514
readxl
The Different MANOVA Analysis
Brief summary of the data.
We assume that the sample size reflects the importance of the treatment means. We group quality into two categories (1-2 and 3), bedrooms into three categories (0-2, 3-4 and 5-7), and style into two categories (1 or not 1). We also transform the response variables. We first conduct a principal component analysis to identify the important factors, then run a MANOVA test, and finally focus on the variables feet, price and beds.
Principal Component Analysis
We conducted a principal component analysis, shown below. The first two components account for 94.02% of the variation; the scree plot shows the contribution of each component graphically.
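The percentage of variation attributed to the leading components is the cumulative share of the eigenvalues of the covariance (or correlation) matrix. The Python sketch below shows that arithmetic. The eigenvalues are hypothetical, chosen only for illustration; they are not the values behind the 94.02% figure.

```python
def explained_variance(eigenvalues):
    """Return (proportions, cumulative proportions) for each component."""
    total = sum(eigenvalues)
    proportions = [ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative


# Hypothetical eigenvalues, sorted from largest to smallest.
eigenvalues = [6.0, 3.0, 0.6, 0.4]
proportions, cumulative = explained_variance(eigenvalues)
print(cumulative[1])  # share of variation captured by the first two components
```

A scree plot is simply these eigenvalues (or proportions) plotted against the component index.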
MANOVA
Using the manova function, the p-values are below the 5% alpha level.
All four MANOVA test statistics (Hotelling-Lawley, Pillai, Roy, Wilks) indicate that the 12 variables are significantly related to price. The str function shows that there are more than three categories, hence a linear discriminant analysis is needed.
The linear discriminant analysis is shown below.
Price versus Number of Bedrooms
ANOVA ANALYSIS
We run an ANOVA that breaks down the sales price by the number of bedrooms.
The p-value for the ANOVA is less than 0.05; therefore, we reject the null hypothesis that the population means are equal and conclude that there is a statistically significant difference in mean price among the bedroom groups.
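The one-way ANOVA compares between-group and within-group variation through an F statistic, and the decision rule rejects the null hypothesis of equal means when the p-value falls below alpha. The following Python sketch shows both steps under stated assumptions: the function names and the toy samples are hypothetical, and the p-value is taken as given because the standard library has no F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def reject_null(p_value, alpha=0.05):
    """True when the null hypothesis of equal means is rejected."""
    return p_value < alpha


# Hypothetical samples for three groups, for illustration only.
f_stat, df1, df2 = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_stat, df1, df2)
print(reject_null(0.01))
```

A small p-value therefore leads to rejecting the null hypothesis, not to failing to reject it.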
The correlation coefficient is 0.4133239, which indicates a positive but weak correlation between the two variables.
The fitted model is: Price = 82809 + 56200 × Beds
As the number of bedrooms increases, the price increases: each additional bedroom adds 56200 to the predicted price, and at 0 bedrooms the predicted price is 82809.
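To make the fitted line concrete, the Python sketch below shows a generic least-squares fit and a prediction using the coefficients reported above (82809 and 56200). The function names and the small data set passed to `fit_line` are hypothetical; only the two coefficients come from the report.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_price(beds, intercept=82809.0, slope=56200.0):
    """Predicted price from the reported model Price = 82809 + 56200 * Beds."""
    return intercept + slope * beds


print(fit_line([0, 1, 2], [1, 3, 5]))  # hypothetical data on an exact line
print(predict_price(0))  # 82809.0
print(predict_price(3))  # 251409.0
```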
The correlation plot is shown below.
Few houses have a very small or a very large number of bedrooms, so houses with 0, 1 and 2 bedrooms have been combined into one category, and houses with 5, 6 and 7 bedrooms have been put into another.
We plot a box plot of price by number of bedrooms. Based on this plot, it is reasonable to assume that the variance is homogeneous, although there is a suggestion that a variance-stabilizing transformation may help.
Data Transformation
We fit a log-log model that expresses the standard deviation as a function of the cell mean. Inference is made at alpha = 0.05.
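The log-log model regresses log(standard deviation) on log(cell mean). If the standard deviation is roughly proportional to mean^b, the slope estimates b, and a power transformation with exponent 1 - b (the logarithm when b is close to 1) stabilizes the variance. A minimal Python sketch follows, assuming hypothetical cells in which the spread grows in proportion to the mean; the function name is also hypothetical.

```python
import math
import statistics


def spread_level_slope(groups):
    """Slope of log(sd) regressed on log(mean) across the cells."""
    log_means = [math.log(statistics.mean(g)) for g in groups]
    log_sds = [math.log(statistics.stdev(g)) for g in groups]
    n = len(groups)
    mean_x = sum(log_means) / n
    mean_y = sum(log_sds) / n
    sxx = sum((x - mean_x) ** 2 for x in log_means)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log_means, log_sds))
    return sxy / sxx


# Hypothetical cells: each is the first cell multiplied by a constant,
# so the standard deviation is proportional to the mean.
cells = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
slope = spread_level_slope(cells)
print(slope)      # close to 1, pointing to a log transformation
print(1 - slope)  # suggested power; close to 0 corresponds to the log
```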
Because the sales price is transformed to a logarithmic scale, we perform a bootstrap Levene test and draw a QQ plot of the standardized residuals from the ANOVA fitted to log price. This choice is based on the comparative boxplots. These diagnostics check the assumptions of the one-factor ANOVA model.
ANOVA is robust to moderate departures from normality. By the central limit theorem, when the sample size is large, the ANOVA model can be used.
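The Levene test mentioned above checks homogeneity of variance by running a one-way ANOVA on the absolute deviations of each observation from its group centre. The Python sketch below computes only the test statistic, using median centring (the Brown-Forsythe variant), which is an assumption since the report does not say which centring was used. The bootstrap resampling step is not shown, and the function name and samples are hypothetical.

```python
import statistics


def levene_statistic(groups):
    """Levene's W statistic with median centring (Brown-Forsythe variant)."""
    # Absolute deviation of every observation from its group median.
    devs = [[abs(x - statistics.median(g)) for x in g] for g in groups]
    k = len(devs)
    n = sum(len(d) for d in devs)
    grand_mean = sum(sum(d) for d in devs) / n
    means = [sum(d) / len(d) for d in devs]
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(devs, means))
    within = sum((z - m) ** 2 for d, m in zip(devs, means) for z in d)
    return ((n - k) / (k - 1)) * between / within


# Hypothetical groups: same spread in the first pair, different in the second.
print(levene_statistic([[1, 2, 3], [11, 12, 13]]))
print(levene_statistic([[1, 2, 3], [10, 20, 30]]))
```

A large W relative to its reference distribution (an F distribution, or a bootstrap distribution as in the report) signals unequal variances.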
Price versus Finished Square Feet
First, we check the correlation coefficient between price and finished square feet. The correlation coefficient is 0.8194701, indicating a strong positive correlation between price and finished square feet.
Furthermore, in the correlation plot the points of the scatter plot cluster around a central trend, which also indicates a strong correlation between the variables.
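The correlation coefficient used here is Pearson's r, the covariance of the two variables divided by the product of their standard deviations. A minimal Python sketch of the calculation follows; the function name and the toy data are hypothetical and are not the real estate data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: an exact increasing line and an exact decreasing line.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.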
Fitting the model
The model parameters have p-values less than alpha = 0.05, which indicates that they are statistically significant.
The one-way ANOVA performed below has a p-value less than alpha = 0.05, which shows that there is a statistically significant difference between the group means.
The fitted model is of the form
Price = -81432.95 + 158.95 × Feet
At zero finished square feet the predicted price is negative, which is an extrapolation with no practical meaning, since no house has zero square feet. As the finished square feet increase, the price increases.
The corresponding scatter plot with the fitted line (abline) is shown below.
Most points lie along the red central line, meaning there is a strong relationship between the price and the finished square feet.
Data Transformation
Here we convert square feet to approximate square metres by dividing by 10 (one square metre is about 10.76 square feet), and express the price in thousands of dollars ($'000) by dividing by 1000.
Compared with the analysis before transformation, the data are simply rescaled by constant factors. The model parameters remain statistically significant at alpha = 0.05, since the p-values are less than alpha.
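Dividing the predictor and the response by constants changes the units of the regression coefficients but not the fitted relationship. The Python sketch below applies that rescaling to the coefficients reported in the Price versus Finished Square Feet section (-81432.95 and 158.95), with the divisors 10 and 1000 described above. The function name is hypothetical, and the rescaled values are derived by arithmetic, not taken from the report's output.

```python
def rescale_coefficients(intercept, slope, x_divisor, y_divisor):
    """Coefficients after replacing x by x / x_divisor and y by y / y_divisor.

    From y = a + b * x it follows that
    y / cy = a / cy + (b * cx / cy) * (x / cx).
    """
    new_intercept = intercept / y_divisor
    new_slope = slope * x_divisor / y_divisor
    return new_intercept, new_slope


new_intercept, new_slope = rescale_coefficients(-81432.95, 158.95, 10, 1000)
print(new_intercept, new_slope)

# The same house gives the same prediction once the units are converted back.
feet = 2000  # hypothetical house size
original = -81432.95 + 158.95 * feet
rescaled = new_intercept + new_slope * (feet / 10)
print(original, rescaled * 1000)
```

Because the change is purely a change of units, the test statistics and p-values are unaffected, which is why the inference matches the untransformed analysis.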
The ANOVA for the transformed data is shown below.
The p-value for the ANOVA is less than alpha = 0.05, meaning that at the 5% significance level the group means are statistically different from each other; therefore, we reject the null hypothesis that the groups have the same mean.
Conclusion
The analysis indicates a relationship between the independent variables and the dependent variable, strong for finished square feet and weaker for the number of bedrooms. The transformed data led to the same inference as the original data, which indicates that rescaling does not affect the conclusions of the analysis. The non-parametric analysis indicated that the data follow a normal distribution.