CP1407 Assignment 2
Note: This is an individual assignment. While it is expected that students will
discuss their ideas with one another, students need to be aware of their
responsibilities in ensuring that they do not deliberately or inadvertently
plagiarise the work of others.
Assignment 2 – Practice on various Machine Learning algorithms
1. [Data Pre-Processing, Clustering] [10 marks]
Why is attribute scaling of data important? The following table contains sample
records showing the number of employees and the total revenue generated by particular
stores of a supermarket. Use the table as an example to discuss the necessity of
normalisation in any proximity measurement for clustering purposes.
Supermarket ID Employee Count Revenue
001 38 $5,500,000
002 29 $5,000,000
003 24 $5,000,000
004 10 $8**,000
005 40 $2,500,000
006 31 $3,200,000
007 14 $678,000
008 35 $5,200,000
009 30 $5,300,000
010 22 $5,500,000
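As a rough illustration of the scaling issue, the following Python sketch compares Euclidean distances between stores before and after min-max normalisation. It is only a sketch under stated assumptions: store 004 is left out because its revenue figure is not legible in the table (so the minimum and maximum are taken over the remaining nine stores), and the names `stores`, `euclidean` and `min_max_scale` are hypothetical helpers introduced here, not part of Weka or of any prescribed method.

```python
import math

# (employee count, revenue in dollars) copied from the table above.
# Store 004 is omitted because its revenue is not legible in the table.
stores = {
    "001": (38, 5_500_000),
    "002": (29, 5_000_000),
    "003": (24, 5_000_000),
    "005": (40, 2_500_000),
    "006": (31, 3_200_000),
    "007": (14, 678_000),
    "008": (35, 5_200_000),
    "009": (30, 5_300_000),
    "010": (22, 5_500_000),
}


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(data):
    """Rescale every attribute to [0, 1] using that attribute's min and max."""
    columns = list(zip(*data.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        key: tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for key, row in data.items()
    }


scaled = min_max_scale(stores)

# Compare store 001 with two other stores, raw versus scaled.
for other in ("002", "010"):
    raw = euclidean(stores["001"], stores[other])
    norm = euclidean(scaled["001"], scaled[other])
    print(other, round(raw, 2), round(norm, 3))
```

On the raw values the revenue difference dominates, so store 010 (same revenue, 16 fewer employees) appears far closer to store 001 than store 002 does. After scaling, both attributes contribute on a comparable range and the order of the two neighbours is reversed.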
2. [Classification – Decision Tree algorithm] [20 marks]
Use the soybean dataset (soybean.arff) to perform decision tree induction in Weka
using three different decision tree induction algorithms: J48, REPTree, and
RandomTree. Investigate different options, particularly looking at differences between
pruned trees and unpruned trees. In discussing your results, consider the following:
a) What are the effects of pruning on the results for the soybean dataset?
b) Are there differences in the performances of the three decision tree algorithms?
c) What impacts do other parameters of the algorithms have on the results?
3. [Classification – Naïve Bayes algorithm] [30 marks]
Suppose we have data on a few individuals randomly examined for a basic health check.
The following table gives the data on these individuals’ health-related attributes.
Blood Sugar
Habit Class
Heavy Tall High 3 Smoker P
Heavy Short High 1 Nonsmoker P
Normal Tall Normal 3 Nonsmoker N
Heavy Tall Normal 2 Smoker N
Low Medium Normal 2 Nonsmoker N
Low Tall Normal 1 Nonsmoker P
Normal Medium High 3 Smoker P
Low Short High 2 Smoker P
Heavy Tall High 2 Nonsmoker P
Low Medium Normal 3 Smoker P
Heavy Medium Normal 3 Smoker N
Use the data together with the Naïve Bayes classifier to perform a new classification for
the following new instance. Create and use the classifier by hand, not with Weka, and
show all your working.
Blood Sugar
Habit Class
Low Tall High 2 Smoker ?
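The hand calculation can be cross-checked with a short Python sketch of a categorical Naïve Bayes classifier. This is an illustration under assumptions rather than the required working: because some column headings are missing from the table, attributes are addressed by position only; the fourth attribute (values 1, 2, 3) is treated as categorical; and no Laplace correction is applied. The names `rows` and `naive_bayes_scores` are hypothetical.

```python
from collections import Counter

# Training records copied from the table; the last field is the class label.
# Attributes are referred to by position because some headings are missing.
rows = [
    ("Heavy", "Tall", "High", "3", "Smoker", "P"),
    ("Heavy", "Short", "High", "1", "Nonsmoker", "P"),
    ("Normal", "Tall", "Normal", "3", "Nonsmoker", "N"),
    ("Heavy", "Tall", "Normal", "2", "Smoker", "N"),
    ("Low", "Medium", "Normal", "2", "Nonsmoker", "N"),
    ("Low", "Tall", "Normal", "1", "Nonsmoker", "P"),
    ("Normal", "Medium", "High", "3", "Smoker", "P"),
    ("Low", "Short", "High", "2", "Smoker", "P"),
    ("Heavy", "Tall", "High", "2", "Nonsmoker", "P"),
    ("Low", "Medium", "Normal", "3", "Smoker", "P"),
    ("Heavy", "Medium", "Normal", "3", "Smoker", "N"),
]


def naive_bayes_scores(rows, instance):
    """Return P(class) * product of P(value | class) for each class.

    Probabilities are plain relative frequencies (no smoothing), so a
    value never seen with a class drives that class's score to zero.
    """
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(class)
        for col, value in enumerate(instance):
            matches = sum(1 for r in rows if r[-1] == label and r[col] == value)
            score *= matches / n_label  # conditional P(value | class)
        scores[label] = score
    return scores


new_instance = ("Low", "Tall", "High", "2", "Smoker")
scores = naive_bayes_scores(rows, new_instance)
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

The scores are unnormalised; dividing each by their sum gives posterior probabilities. If a zero count is considered undesirable, a Laplace correction (adding one to every count) is a common remedy, but it changes the numbers and should be stated in the working.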
4. [Association Rules Mining] [20 marks]
The following table shows the film-watching histories of several viewers of an on-demand service.
User Id Items
001 Airplane!, Downfall, Evita, Idiocracy, Jurassic Park
002 Casablanca, Downfall, Evita, Flubber, Jurassic Park
003 Airplane!, Downfall, Half Baked, Jurassic Park
004 Airplane!, Downfall
005 Casablanca, Downfall, Flubber, Jurassic Park, Zoolander
006 Casablanca, Downfall, Half Baked, Idiocracy, Zoolander
007 Evita, Idiocracy, Jurassic Park
008 Downfall, Jurassic Park, Zoolander
009 Casablanca, Downfall, Evita, Half Baked, Jurassic Park, Zoolander
a) Follow the steps outlined in Practical 07 and conduct a mining task for Boolean
association rules using the Apriori algorithm in Weka.
b) Set different parameters and observe the association rules discovered.
c) Weka provides association evaluation parameters other than support and
confidence. Note the evaluation results by those evaluation parameters of example
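To make the support and confidence measures concrete, here is a small Python sketch of level-wise frequent itemset mining on the viewing table. It is a simplified illustration, not Weka's implementation: the helper names (`support`, `confidence`, `lift`, `apriori`) and the 0.5 minimum support are assumptions chosen for the example, and lift is included only as one instance of an additional evaluation measure.

```python
from itertools import combinations

# Viewing histories copied from the table, one set of films per user.
transactions = [
    {"Airplane!", "Downfall", "Evita", "Idiocracy", "Jurassic Park"},
    {"Casablanca", "Downfall", "Evita", "Flubber", "Jurassic Park"},
    {"Airplane!", "Downfall", "Half Baked", "Jurassic Park"},
    {"Airplane!", "Downfall"},
    {"Casablanca", "Downfall", "Flubber", "Jurassic Park", "Zoolander"},
    {"Casablanca", "Downfall", "Half Baked", "Idiocracy", "Zoolander"},
    {"Evita", "Idiocracy", "Jurassic Park"},
    {"Downfall", "Jurassic Park", "Zoolander"},
    {"Casablanca", "Downfall", "Evita", "Half Baked", "Jurassic Park", "Zoolander"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Lift of the rule lhs -> rhs (1.0 means lhs and rhs are independent)."""
    return confidence(lhs, rhs) / support(rhs)


def apriori(min_support):
    """Return {frequent itemset: support}, built one itemset size at a time."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if support([i]) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep candidates whose (k-1)-subsets are all frequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent


frequent = apriori(min_support=0.5)
for itemset, sup in frequent.items():
    print(sorted(itemset), round(sup, 3))
print(round(confidence({"Jurassic Park"}, {"Downfall"}), 3))
print(round(lift({"Jurassic Park"}, {"Downfall"}), 3))
```

Lowering `min_support` admits more itemsets and therefore more candidate rules, which mirrors the effect of changing the corresponding parameters in Weka.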
5. [Clustering] [20 marks]
Consider the following 2-dimensional point data set presented in (x,y) coordinates:
P1(1,1), P2(1,3), P3(4,3), P4(5,4), P5(9,4), P6(9,6).
Apply the hierarchical clustering method by hand (using Agglomerative algorithm) to
get final two clusters. Use the Manhattan distance function to measure the distance
between points and use the single-linkage scheme to do clustering. Show all your working.
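The procedure can be sketched in Python as a check on the hand working. This is a minimal illustration, assuming that ties between equally close cluster pairs are broken by list position (the question does not specify a tie-breaking rule); the names `manhattan`, `single_link` and `agglomerate` are hypothetical helpers.

```python
# The six points from the question, keyed by their labels.
points = {
    "P1": (1, 1), "P2": (1, 3), "P3": (4, 3),
    "P4": (5, 4), "P5": (9, 4), "P6": (9, 6),
}


def manhattan(a, b):
    """Manhattan (city-block) distance between two 2-D points."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def single_link(c1, c2):
    """Single-linkage distance: the closest pair across the two clusters."""
    return min(manhattan(points[p], points[q]) for p in c1 for q in c2)


def agglomerate(target):
    """Merge the two closest clusters repeatedly until `target` remain.

    Ties are broken by position in the cluster list (an assumption).
    Returns the final clusters and a log of (cluster, cluster, distance).
    """
    clusters = [frozenset([name]) for name in points]
    merges = []
    while len(clusters) > target:
        pairs = [
            (single_link(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        ]
        dist, i, j = min(pairs)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dist))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


clusters, merges = agglomerate(target=2)
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
print([sorted(c) for c in clusters])
```

The merge log corresponds to the heights at which a dendrogram would join clusters, so it can be compared step by step against the distance matrices produced by hand.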
Exemplary Good Satisfactory Limited Very Limited
**-100% 70-80% 50-60% 30-40% 0-20%