Data analysis & Business intelligence (Math Problem Sample)

Instructions:

Data analysis in R-Statistical program

Content:

STUDENT ID NUMBER:
ASSIGNMENT 2
WINTER TERM, 2015
DATA ANALYTICS & BUSINESS INTELLIGENCE
(8697)
DUE DATE: Friday 24 July by noon
WEIGHTING: 40%
PERMITTED MATERIALS: any materials
INSTRUCTIONS:
1. Complete the assignment individually.
2. Answer all questions.
3. Write your answers to the questions in this document, keeping your solution as a single Word document. You may cut and paste items from software into your document, or use Ctrl-Alt-PrtScn (on Windows) or Command+Shift+4 (on macOS) to create a screen shot for your document.
4. Submit as a single Word document. Late submissions will attract a penalty of 5% per day. Submissions will not be accepted after Friday 31 July by noon.
QUESTION 1 (45 marks)
Cluster analysis is a data mining technique used to divide data into meaningful groups.
* Describe (in your own words) what the k means algorithm does.
[4 marks]
K-means clustering is a data partitioning technique in which data points are grouped into a pre-determined number of clusters, k. The objective of the k-means process is to minimize the average squared Euclidean distance between each data point and the centre (centroid) of the cluster to which it is assigned. As such, the k-means algorithm partitions the data into k clusters, with the distance metric serving as a measure of the similarity or dissimilarity between data points.
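Stated as a formula, the objective described above can be sketched as follows. The notation is an assumption introduced here for illustration and is not part of the original answer: $C_1, \dots, C_k$ are the clusters, $\mu_j$ is the centroid of cluster $C_j$, and $x_i$ is a data point.

```latex
\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

Dividing $J$ by the number of data points gives the average squared distance; since that number is fixed, minimizing the sum and minimizing the average lead to the same clustering.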
* Describe (in your own words) what hierarchical cluster analysis does.
[4 marks]
Clustering covers a number of algorithms used to group similar objects into categories. In hierarchical clustering, the objects and the clusters formed from them are arranged in a tree-like structure (a dendrogram), in which the arrangement of items depends on their similarity or dissimilarity. Analysing the clusters on the basis of these similarities or dissimilarities is called hierarchical cluster analysis.
* Imagine you are performing a k means cluster analysis using Rattle. Describe the steps you would go through to determine the optimal number of clusters.
[4 marks]
To determine the optimal number of k-means clusters in Rattle, the following procedure is used. First, select k centroids, each being a randomly selected row of the data. Second, assign each data point to its closest centroid. Third, recalculate each centroid by averaging the data points within its cluster on each of the p variables. Fourth, reassign each data point to the centroid now closest to it. Repeat the third and fourth steps until no observations are reassigned or the maximum number of iterations is reached. Running this procedure for a range of values of k and comparing the resulting within-cluster sums of squares indicates the point beyond which adding clusters yields little improvement.
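The steps above can be sketched in a few lines of code. This is a minimal pure-Python illustration, not Rattle's or R's implementation: the function names `kmeans` and `wss_by_k`, the toy data and the choice of ten random starts are all assumptions made for the example. It runs the assign-and-recalculate loop for several values of k and records the within-cluster sum of squares (WSS) for each.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Basic k-means: random rows as starting centroids, then iterate."""
    rng = random.Random(seed)
    # Step 1: pick k randomly selected rows as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no observation was reassigned, so stop
        assignment = new_assignment
        # Step 3: recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return assignment, centroids, wss


def wss_by_k(points, k_values, n_starts=10):
    """Best (lowest) WSS over several random starts, for each k."""
    return {
        k: min(kmeans(points, k, seed=s)[2] for s in range(n_starts))
        for k in k_values
    }


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
wss = wss_by_k(points, range(1, 5))
for k in sorted(wss):
    print(k, round(wss[k], 4))
```

Plotting WSS against k and looking for the "elbow", the value of k after which the curve flattens, is one common way to read such output.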
* Describe the measures/characteristics you would use to evaluate a k means cluster analysis you have created in Rattle.
[4 marks]
Euclidean distance - This is the geometric distance in multidimensional space. It is the most common type of distance and is computed as $\text{distance}(x, y) = \left\{\sum_i (x_i - y_i)^2\right\}^{1/2}$. It is computed from the raw data.
Squared Euclidean distance - To place progressively greater weight on objects that are further apart, the standard Euclidean distance is squared: $\text{distance}(x, y) = \sum_i (x_i - y_i)^2$.
Percentage disagreement - This measure is useful when the dimensions included in the analysis are categorical in nature. It is computed as $\text{distance}(x, y) = (\text{number of } x_i \neq y_i) / n$, where $n$ is the number of dimensions.
Chebychev distance - This measure is appropriate when two objects should be treated as different if they differ on any one dimension. It is computed as $\text{distance}(x, y) = \max_i |x_i - y_i|$.
Similarity and dissimilarity - Similarity reflects the number of attributes an object shares with another object. Dissimilarity, on the other hand, reflects the attributes that the two objects do not share.
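The four distance measures listed above can be written out directly. The following is a small illustrative sketch; the function names and the sample vectors are assumptions chosen for the example, not values from the assignment data.

```python
import math


def euclidean(x, y):
    """Geometric distance in multidimensional space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Euclidean distance squared; weights distant objects more heavily."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def percent_disagreement(x, y):
    """Share of dimensions on which two (categorical) objects differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def chebyshev(x, y):
    """Largest absolute difference on any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


# Hypothetical numeric observations.
x = (1.0, 2.0, 3.0)
y = (4.0, 6.0, 3.0)
print(euclidean(x, y), squared_euclidean(x, y), chebyshev(x, y))

# Hypothetical categorical observations (sex, fbs, rest_ecg).
p = ("male", "t", "normal")
q = ("male", "f", "normal")
print(percent_disagreement(p, q))
```

The numeric measures apply to variables such as age or chol, while percentage disagreement is the natural choice for categorical variables such as sex or fbs.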
* Explain why the Ward's method is chosen under the Agglomerate check box when performing a hierarchical cluster analysis in Rattle and describe briefly how it functions.
[4 marks]
In hierarchical clustering, each sample plot starts as an individual cluster, and similar plots are successively fused into larger clusters. Ward's method is used because, at each step, it fuses the pair of clusters whose merger gives the smallest increase in the total within-cluster variance, so plots with similar features are joined first and dissimilar plots are joined only at the later stages. Agglomerative clustering algorithms differ in how they compute the similarity between clusters once several plots have been fused. In R, the clustering function takes several arguments, such as d, the method and the members arguments.
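To make the fusion rule concrete, here is a minimal pure-Python sketch of agglomerative clustering with Ward's criterion. It is an illustration under stated assumptions, not the routine Rattle calls: the names `ward_cost` and `ward_clustering` and the toy data are invented for the example. The cost of merging clusters A and B is taken as |A||B| / (|A| + |B|) times the squared distance between their centroids, which equals the increase in the total within-cluster sum of squares.

```python
def centroid(cluster):
    """Mean of the points in a cluster, one value per variable."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((p - q) ** 2 for p, q in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_clustering(points, n_clusters):
    """Start with one cluster per point and merge until n_clusters remain."""
    clusters = [[p] for p in points]
    merge_costs = []
    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Pick the pair whose fusion increases the variance the least.
        i, j = min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
        merge_costs.append(ward_cost(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merge_costs


# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
clusters, merge_costs = ward_clustering(points, 2)
for c in clusters:
    print(sorted(c))
```

The recorded merge costs play the role of the heights in a dendrogram: small costs correspond to fusions of similar plots, and a large jump in cost suggests a natural place to cut the tree.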
Data mining techniques have been widely applied in the medical domain to assist in the diagnosis of various medical conditions. One of the most common medical conditions across the globe is heart disease. Researchers wish to be able to classify patients as either prone (positive) or not prone (negative) to heart disease. The following dataset scores 286 patients on 10 characteristics. These characteristics have been established as differing between patients who are positive and negative for heart disease.
The dataset can be found in the file heart.csv on Moodle. The variables in the file are as follows:
Data Description

Variable Name   Values
age             age in years
sex             male; female
chest_pain      typical angina; atypical angina; non-anginal pain; asymptomatic
rest_bps        resting blood pressure in mm Hg (on admission to the hospital)
chol            serum cholesterol in mg/dl
fbs             fasting blood sugar > 120 mg/dl (t = true; f = false)
rest_ecg        resting electrocardiographic results (normal; having ST-T wave abnormality; showing probable or definite left ventricular hypertrophy)
max_hr          maximum heart rate achieved
ex_ang          exercise induced angina (yes; no)
disease
