Topic:
Data analysis & Business intelligence (Math Problem Sample)
Instructions:
Data analysis in the R statistical program
STUDENT ID NUMBER:
ASSIGNMENT 2
WINTER TERM, 2015
DATA ANALYTICS & BUSINESS INTELLIGENCE
(8697)
DUE DATE: Friday 24 July by noon
WEIGHTING: 40%
PERMITTED MATERIALS: any materials
INSTRUCTIONS:
1. Complete the assignment individually.
2. Answer all questions.
3. Write your answers to the questions in this document, keeping your solution as a single Word document. You may cut and paste items from software into your document, or use Ctrl-Alt-PrtScn (on Windows) or Command+Shift+4 (on macOS) to create a screen shot for your document.
4. Submit as a single Word document. Late submissions will attract a penalty of 5% per day. Submissions will not be accepted after Friday 31 July by noon.
QUESTION 1 (45 marks)
Cluster analysis is a data mining technique used to divide data into meaningful groups.
* Describe (in your own words) what the k means algorithm does.
[4 marks]
K-means clustering is a data-partitioning technique that groups data points into a pre-specified number, k, of clusters. Its objective is to minimize the average squared Euclidean distance between each data point and the centre (centroid) of the cluster to which it is assigned. The k-means algorithm therefore partitions the data into a predetermined number of clusters, with the distance metric serving as the measure of similarity or dissimilarity between data points.
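The objective described above can be made concrete with a small sketch. It is written in plain Python rather than R or Rattle, and the function name mean_squared_distance is a hypothetical one chosen for illustration. Given points and their cluster labels, it computes the average squared Euclidean distance from each point to the centre of its own cluster, which is the quantity k-means tries to make small.

```python
def mean_squared_distance(points, labels):
    """Average squared Euclidean distance from each point to its cluster centre.

    points: list of equal-length numeric tuples
    labels: list of cluster labels, one per point
    (Hypothetical helper for illustration; not part of Rattle or R.)
    """
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for members in clusters.values():
        # The centre is the coordinate-wise mean of the cluster's points.
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum(
            sum((v - c) ** 2 for v, c in zip(p, centre)) for p in members
        )
    return total / len(points)
```

A lower value for the same data and the same k indicates a tighter partition.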
* Describe (in your own words) what hierarchical cluster analysis does.
[4 marks]
Clustering comprises a number of algorithms for grouping similar objects into distinct categories. In hierarchical clustering, the resulting clusters are arranged in a tree-like structure (a dendrogram) in which the placement of each item depends on its similarity or dissimilarity to the others. Analysing clusters through this similarity-based hierarchy is called hierarchical cluster analysis.
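The tree-like structure can be illustrated with a minimal agglomerative sketch in plain Python (not R or Rattle). It assumes single linkage, where the distance between two clusters is the distance between their closest members. That choice is made only to keep the example short, and the function name agglomerate is hypothetical. Each recorded merge corresponds to one join in the tree.

```python
import math


def agglomerate(points):
    """Bottom-up clustering with single linkage (an assumption for brevity).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are sorted lists of point indices.
    """
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # Fuse cluster b into cluster a and drop b.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```

Reading the merge history from first to last entry moves from the leaves of the tree towards its root.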
* Imagine you are performing a k means cluster analysis using Rattle. Describe the steps you would go through to determine the optimal number of clusters.
[4 marks]
To determine the optimal number of k-means clusters in Rattle, the following procedure is used. First, select k centroids, each being a randomly chosen row of the data. Second, assign every data point to its closest centroid. Third, recalculate each centroid as the mean of the data points in its cluster across each of the p variables. Fourth, reassign every data point to the centroid now closest to it, and repeat the third and fourth steps until no observations are reassigned or the maximum number of iterations is reached. Repeating this procedure for a range of k values and comparing the within-cluster sum of squares then shows where adding further clusters yields little improvement.
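The iteration described above can be sketched as follows. This is a minimal plain-Python illustration, not the code Rattle or R actually runs. The function name kmeans and its parameters (init, max_iter, seed) are hypothetical choices for this example.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Minimal k-means sketch; returns (labels, centroids).

    init: optional list of k starting centroids; if omitted, k rows
          are drawn at random (step one in the text).
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centroids = [tuple(c) for c in start]
    labels = [None] * len(points)

    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the mean over each variable.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids
```

Running the function for several values of k and comparing the resulting within-cluster distances is one common way to look for an "elbow", the point beyond which extra clusters add little.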
* Describe the measures/characteristics you would use to evaluate a k means cluster analysis you have created in Rattle.
[4 marks]
Euclidean distance: This is the geometric distance in multidimensional space. It is the most common type of distance and is computed as $\mathrm{distance}(x, y) = \left\{\sum_i (x_i - y_i)^2\right\}^{1/2}$. This measure is computed from raw data.
Squared Euclidean distance: To place progressively greater weight on objects that are further apart, the standard Euclidean distance is squared. It is computed as $\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$.
Percentage disagreement: This measure is useful when the dimensions included in the analysis are categorical in nature. It is computed as $\mathrm{distance}(x, y) = (\text{number of } x_i \neq y_i) / n$, where $n$ is the number of dimensions.
Chebychev distance: This measure is appropriate when two objects should be defined as different if they differ on any one dimension. It is computed as $\mathrm{distance}(x, y) = \max_i |x_i - y_i|$.
Similarity and dissimilarity: Similarity reflects the number of attributes that an object shares with another object. Dissimilarity, on the other hand, reflects the attributes of an object that are not shared with the other object.
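The four distance formulas listed above translate directly into code. The sketch below uses plain Python rather than R, and the function names are hypothetical labels chosen to match the measure names in the text. Each function takes two equal-length sequences x and y.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Summed squared differences; large gaps weigh more heavily."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def percent_disagreement(x, y):
    """Share of dimensions on which x and y differ (categorical data)."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def chebychev(x, y):
    """Largest absolute difference on any single dimension."""
    return max(abs(a - b) for a, b in zip(x, y))
```

For example, for the points (0, 0) and (3, 4) the Euclidean distance is 5, the squared Euclidean distance is 25, and the Chebychev distance is 4.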
* Explain why the Ward’s method is chosen under the Agglomerate check box when performing a hierarchical cluster analysis in Rattle and describe briefly how it functions.
[4 marks]
In hierarchical clustering, each sample plot starts as an individual cluster, and similar plots are successively fused into larger clusters. Ward's method fuses, at each step, the pair of clusters with the most similar features, namely the pair whose merger produces the smallest increase in total within-cluster variance, so that dissimilar plots are fused only at later stages. Agglomerative clustering algorithms differ in how they compute similarity between clusters that contain several plots. In essence, the clustering function takes several arguments, such as d (the dissimilarities), method (the agglomeration method) and members.
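One step of Ward's criterion can be sketched in plain Python as below. This is a simplified illustration under stated assumptions, not the routine Rattle calls, and the names centroid, ward_cost and best_merge are hypothetical. The cost of fusing clusters A and B is taken as the increase in within-cluster sum of squares, which equals |A||B| / (|A| + |B|) times the squared distance between the two centroids.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of numeric tuples."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are fused."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((u - v) ** 2 for u, v in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def best_merge(clusters):
    """Indices (i, j) of the pair of clusters that is cheapest to fuse."""
    pairs = [
        (i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    ]
    return min(pairs, key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]))
```

Applying best_merge repeatedly, fusing the chosen pair each time, yields the agglomerative sequence that Ward's method produces.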
Data mining techniques have been widely applied in the medical domain to assist in the diagnosis of various medical conditions. One of the most common medical conditions across the globe is heart disease. Researchers wish to be able to classify patients as either being prone (positive) or not prone (negative) to heart disease. The following dataset scores 286 patients on 10 characteristics. These characteristics have been established as differing between positive and negative heart disease.
The dataset can be found in the file heart.csv on Moodle. The variables in the file are as follows:
Data Description
Variable Name | Values
age | age in years
sex | male; female
chest_pain | typical angina; atypical angina; non-anginal pain; asymptomatic
rest_bps | resting blood pressure in mm Hg (on admission to the hospital)
chol | serum cholesterol in mg/dl
fbs | fasting blood sugar > 120 mg/dl (t = true; f = false)
rest_ecg | resting electrocardiographic results (normal; having ST-T wave abnormality; showing probable or definite left ventricular hypertrophy)
max_hr | maximum heart rate achieved
ex_ang | exercise induced angina (yes; no)
disease