Category Archives: Data Mining

Data Mining: Exercise 8

Design of network topology


Number of input nodes
Too few nodes => misclassification
Too many nodes => overfitting


Problems with dollar sign:

Problem with tilde sign:



Unit 5 - Multiple Linear Regression


Including more than one independent variable in the regression model extends the simple linear regression model to a multiple linear regression model.

It models the relationship between the response variable and several predictors simultaneously.

Model building and interpretation become more difficult because of the added complexity.

Multiple linear regression with two predictors:

Y = beta0 + beta1*X1 + beta2*X2 + epsilon

where Y is the dependent variable,
X1, X2, ..., Xk are the predictors (independent variables; here k = 2),
epsilon is the random error,
beta0, beta1, beta2 are the unknown regression coefficients.

Example => oil consumption:

Y = oil consumption (per month)
X1 = outdoor temperature
X2 = size of house (in square metres)



Now beta1 is the expected change in Y (oil consumption) for a one-unit increase in X1 (outdoor temperature) when all other predictors are held constant, i.e. in this case the size of the house does not change.

beta1 is estimated as -27.2 per degree C, i.e. the expected oil consumption falls by 27.2 (in the units of Y) for each 1 degree C rise in outdoor temperature.
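To make the interpretation of beta1 concrete, here is a minimal R sketch that fits the two-predictor model on simulated data. Everything except the slope -27.2 is an assumption for illustration: the intercept 500, the house-size slope 2, the noise level and the variable names temp, size and oil are made up, not taken from the lecture.

```r
# Simulated oil-consumption data (illustrative assumptions, not real data)
set.seed(1)
n    <- 50
temp <- runif(n, -10, 20)    # X1: outdoor temperature (degree C)
size <- runif(n, 80, 250)    # X2: house size (square metres)
oil  <- 500 - 27.2 * temp + 2 * size + rnorm(n, sd = 20)  # Y with random error

# Fit Y = beta0 + beta1*X1 + beta2*X2 + epsilon by least squares
fit <- lm(oil ~ temp + size)
coef(fit)   # estimates of beta0, beta1 (temp) and beta2 (size)
```

The estimate for temp should land close to -27.2: it is the change in oil for one extra degree when size stays fixed.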



The random error term epsilon is normally distributed with mean zero, i.e. E(epsilon) = 0.

epsilon has an (unknown) variance sigma_epsilon^2, i.e. all random errors have the same variance.

Adjusted R^2
R^2adj = 1 - [SSE/(n-k-1)] / [SST/(n-1)]
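The adjusted R^2 formula can be checked by hand against what R reports. This is a small sketch on simulated data (the data and the names x1, x2, y are assumptions); n is the number of observations and k the number of predictors.

```r
set.seed(2)
n  <- 40
k  <- 2                       # number of predictors
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

sse <- sum(residuals(fit)^2)        # error sum of squares
sst <- sum((y - mean(y))^2)         # total sum of squares
r2_adj <- 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# Compare the manual value with the one computed by summary()
c(manual = r2_adj, lm = summary(fit)$adj.r.squared)
```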



As for simple linear regression:

plots of residuals against the predicted values y'
plots of residuals against xi
normal probability plot of residuals
plots of residuals in observation order
Cook’s distance
Studentized residuals
Standardized residuals
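The diagnostics in this list are all available in base R. A minimal sketch on simulated data (the data are an assumption); note that R's rstandard() gives internally studentized residuals, which many courses call standardized residuals, and rstudent() gives externally studentized ones.

```r
set.seed(3)
x1 <- rnorm(30)
x2 <- rnorm(30)
y  <- 1 + x1 + 0.5 * x2 + rnorm(30)
fit <- lm(y ~ x1 + x2)

res      <- residuals(fit)
res_std  <- rstandard(fit)        # standardized residuals
res_stud <- rstudent(fit)         # studentized residuals
cook     <- cooks.distance(fit)   # Cook's distance per observation

plot(fitted(fit), res)            # residuals against predicted values
plot(x1, res)                     # residuals against one predictor
qqnorm(res); qqline(res)          # normal probability plot
plot(res, type = "b")             # residuals in observation order
```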

Collinearity can only occur in multiple regression.
It means that predictors explain the same variation of the response variable.

Oil consumption continued:
One predictor measuring house size in cm^2 and another predictor in m^2
Variance inflation factor
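A sketch of the variance inflation factor for the cm^2 / m^2 example, using the standard definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data are simulated, and a little measurement noise is added to the cm^2 variable as an assumption, because two exactly proportional predictors would give an infinite VIF. The helper vif() is written here for illustration and is not a base R function.

```r
set.seed(4)
n        <- 50
temp     <- runif(n, -10, 20)
size_m2  <- runif(n, 80, 250)
size_cm2 <- size_m2 * 1e4 + rnorm(n, sd = 1e4)  # same size in cm^2, plus noise

# VIF of one predictor: regress it on all the other predictors
vif <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_temp <- vif(temp,    data.frame(size_m2, size_cm2))
vif_size <- vif(size_m2, data.frame(temp, size_cm2))
c(temp = vif_temp, size_m2 = vif_size)
```

The temperature VIF stays near 1, while the house-size VIF is very large because the other size variable explains almost all of its variation.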


Condition Index for collinearity:
between 10 and 30 => weak collinearity
between 30 and 100 => moderate collinearity
> 100 => strong collinearity
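One common way to get condition indices (an assumption here, since the lecture's exact recipe is not in these notes) is to scale every column of the design matrix, including the intercept column, to unit length and divide the largest singular value by each singular value. The sketch below reuses the simulated cm^2 / m^2 setup and maps the largest index onto the thresholds above; the label "none" for values below 10 is my own addition.

```r
set.seed(5)
n        <- 50
temp     <- runif(n, -10, 20)
size_m2  <- runif(n, 80, 250)
size_cm2 <- size_m2 * 1e4 + rnorm(n, sd = 1e4)

X  <- cbind(intercept = 1, temp, size_m2, size_cm2)
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")   # scale columns to unit length
d  <- svd(Xs)$d                              # singular values
cond_index <- max(d) / d                     # one index per dimension

# Classify the largest index with the thresholds from the notes
label <- cut(max(cond_index),
             breaks = c(0, 10, 30, 100, Inf),
             labels = c("none", "weak", "moderate", "strong"))
cond_index
label
```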

Example of Oil consumption continued:
Assume that we would like to use outdoor temperature X1 and house size X2 as predictors. Additionally, we want to use a third predictor:

X3 = 1 if extra-thick walls, 0 otherwise
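A sketch of how such a 0/1 predictor enters the model in R. The data are simulated, and the effect of -60 for thick walls, like the other coefficients apart from -27.2, is an invented value for illustration.

```r
set.seed(6)
n     <- 60
temp  <- runif(n, -10, 20)        # X1
size  <- runif(n, 80, 250)        # X2
thick <- rbinom(n, 1, 0.5)        # X3: 1 = extra-thick walls, 0 otherwise
oil   <- 500 - 27.2 * temp + 2 * size - 60 * thick + rnorm(n, sd = 20)

fit <- lm(oil ~ temp + size + thick)
coef(fit)["thick"]   # expected shift in Y for houses with extra-thick walls
```

The coefficient of X3 shifts the regression surface up or down for the houses with X3 = 1, while the slopes for X1 and X2 stay the same.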


Model Selection Strategies:
Models ranked using R^2, adjusted R^2 or Mallows' Cp
Stepwise selection methods:
Backward, forward, stepwise selection

R^2 Selection
In a data set with 7 possible predictors, there would be 2^7-1=127 possible regression models.
For every model size (k = 1, 2, ..., p), look at, let's say, m models, chosen

Mallows' Cp:
Large Cp => biased model
Cp is computed from a formula,
where MSEp = mean squared error for a model with p parameters
MSEfull = mean squared error for the full model
n = number of observations
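The formula itself is missing from these notes, so the sketch below assumes the usual textbook form, Cp = (n - p) * MSEp / MSEfull - (n - 2p), with p counting the intercept. The function name mallows_cp and the simulated data are made up for illustration.

```r
# Mallows' Cp for a candidate model, relative to the full model
mallows_cp <- function(fit_p, fit_full) {
  n        <- nobs(fit_full)
  p        <- length(coef(fit_p))                    # parameters incl. intercept
  mse_p    <- sum(residuals(fit_p)^2) / (n - p)
  mse_full <- sum(residuals(fit_full)^2) / df.residual(fit_full)
  (n - p) * mse_p / mse_full - (n - 2 * p)
}

set.seed(7)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)     # x3 is irrelevant by construction

full  <- lm(y ~ x1 + x2 + x3)
small <- lm(y ~ x1)                        # leaves out the relevant x2

cp_small <- mallows_cp(small, full)
cp_full  <- mallows_cp(full, full)
c(small = cp_small, full = cp_full)
```

For the full model Cp equals its number of parameters by construction. A model that leaves out an important predictor gets a Cp well above its own p, which is the "large Cp => biased model" rule.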

Exercise Sheet 5

From 1d onward it is not clear; I need to get this clear, In Sha Allah.

If needed, I will have to look at another tutorial or example.

Exercise Sheet 4

Data Mining Methods: Unit 4
Correlation and Simple Linear Regression

Interpretation of the correlation coefficient
Possible range: [-1, 1]
-1: perfect negative linear relationship
0: no linear relationship
1: perfect positive linear relationship
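The three reference points of the correlation coefficient can be reproduced with cor() in R. A small sketch; the vectors are made up for illustration.

```r
x <- 1:10
r_pos <- cor(x,  2 * x + 1)    # exact increasing line: perfect positive
r_neg <- cor(x, -3 * x + 5)    # exact decreasing line: perfect negative

set.seed(8)
r_none <- cor(rnorm(1000), rnorm(1000))   # independent samples: close to 0

c(positive = r_pos, negative = r_neg, none = r_none)
```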

Regression: Objective

To predict one variable from other variables.
To explain the variability of one variable using the other variables.

Regression predicts scores on one variable from the scores on a second variable.

Response variable: the variable being predicted (Y)
Predictor variable: the variable the predictions are based on (X)

Simple regression:
Only one predictor variable; otherwise multiple regression

Linear regression:

The prediction of the response variable (Y) is a linear function of the predictor variable (X)
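A minimal simple linear regression in R, to show that the prediction really is intercept + slope * X. The data are simulated, and the true values 3 and 2 are assumptions.

```r
set.seed(9)
x <- runif(40, 0, 10)            # predictor variable X
y <- 3 + 2 * x + rnorm(40)       # response variable Y
fit <- lm(y ~ x)                 # one predictor => simple regression

new_x  <- data.frame(x = c(2, 5))
pred   <- predict(fit, newdata = new_x)

# The same predictions computed by hand from the fitted line
by_hand <- coef(fit)[1] + coef(fit)[2] * new_x$x
cbind(predict = pred, by_hand = by_hand)
```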

Data Preprocessing/Exercise Sheet 2

Data Preprocessing in the Data Mining Process:

The data mining/KDD process
Why data preprocessing?

Issues in Data Preprocessing:

Data Cleaning
Data Transformation
Variable Construction
Data Reduction and Discretization
Data Integration

The data mining/KDD Process:
Understanding customer: 10%-20%
Understanding data: 20%-30%
Prepare data: 40%-70%
Build models: 10%-20%
Evaluate models: 10%-20%
Take action: 10%-20%

Why data preprocessing?

Real-world data is dirty
Low data quality is in any case a huge problem in data mining
Garbage in, garbage out
Different methods, different requirements

R Working Codes for data mining:

R code is case sensitive.
I am working from the professor's sheet.

dim means dimension


I could not make this line work:

hist(Ozone, breaks = 25, ylim = c(0, 45), main = "Original data")

Another open question is how the imputation works.
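A possible reason the hist() call fails is that R cannot find an object called Ozone (names are case sensitive, and a column has to be reached through its data frame); typographic quotes pasted from a web page also break it. The sketch below assumes the data are R's built-in airquality data set, which may not be what the professor's sheet uses. It also shows the simplest form of imputation, replacing each missing value by the mean of the observed values; the method on the sheet may well be different.

```r
data(airquality)
dim(airquality)                   # dimension: rows and columns

ozone <- airquality$Ozone         # reach the column through the data frame
n_missing <- sum(is.na(ozone))    # how many values are missing
hist(ozone, breaks = 25, ylim = c(0, 45), main = "Original data")

# Mean imputation: fill every NA with the mean of the observed values
ozone_imp <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)
hist(ozone_imp, breaks = 25, ylim = c(0, 45), main = "Mean-imputed data")
```

Mean imputation keeps the mean unchanged but piles the filled-in values into one bar of the histogram, so it shrinks the spread of the variable.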


Exercise 2 (K): I still have to find the answers


Exercise 3: Answer:



R programming


Manipulation of Vectors and Numbers
Vectors and Assignment
Extraction of Elements from Vectors
Matrices
Basic Manipulations
The Data Frame
Cumulative Distribution Function
Measures of Central Tendency
Measures of Spread
Correlation [need to look at this a bit more]



Training Set, Validation Set, Test Set

Training Set is a subset of the dataset used to build predictive models.
Validation Set is a subset of the dataset used to assess the performance of the model built in the training phase.
– It provides a test platform for fine-tuning the model’s parameters and selecting the best-performing model
– Not all modeling algorithms need a validation set
Test Set (or unseen examples) is a subset of the dataset used to assess the likely future performance of a model.
– If a model fits the training set much better than it fits the test set, overfitting is probably the cause
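A sketch of splitting one data set into the three subsets in R. The 60/20/20 proportions and the use of the built-in iris data are assumptions; other ratios are common too.

```r
set.seed(11)
n   <- nrow(iris)
idx <- sample(n)                       # random order of the row numbers

n_train <- round(0.6 * n)
n_valid <- round(0.2 * n)

train <- iris[idx[1:n_train], ]                          # build the model
valid <- iris[idx[(n_train + 1):(n_train + n_valid)], ]  # tune and select
test  <- iris[idx[(n_train + n_valid + 1):n], ]          # final assessment

c(train = nrow(train), valid = nrow(valid), test = nrow(test))
```

Shuffling the row numbers once and cutting the shuffled vector guarantees that no row ends up in two subsets.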


Binary Classification(two class classification)

true|false, 1|0, -1|+1, male|female

Multi-class classification problems can be seen as binary classification problems.
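One way to see a multi-class problem as several binary ones is the one-vs-rest recoding: for each class, the target becomes 1 for that class and 0 for all others. A sketch using the three species in the built-in iris data (the choice of data set is an assumption):

```r
classes <- levels(iris$Species)        # the three class labels

# One 0/1 target column per class: this class versus the rest
one_vs_rest <- sapply(classes, function(cl) as.integer(iris$Species == cl))

head(one_vs_rest)
colSums(one_vs_rest)                   # number of positives per binary problem
```

Each column can then be given to any binary classifier, and every row is positive in exactly one of the binary problems.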

Model Evaluation:

Data Science: Discrete vs Continuous

Making Predictions with WEKA

How to Save Your Machine Learning Model and Make Predictions in Weka

Decision Tree

Data Mining Deep Study – Confusion Matrix

A confusion matrix shows the number of correct and incorrect predictions made by a classification model compared with the actual outcomes (target values) in the data.
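In R, a confusion matrix is just a cross-table of predicted against actual labels. A small sketch with eight made-up yes/no cases:

```r
lv <- c("yes", "no")
actual    <- factor(c("yes", "yes", "no", "no", "yes", "no",  "yes", "no"), levels = lv)
predicted <- factor(c("yes", "no",  "no", "no", "yes", "yes", "yes", "no"), levels = lv)

# Rows are the predictions, columns are the actual outcomes
cm <- table(Predicted = predicted, Actual = actual)
cm

# Correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Here 3 "yes" and 3 "no" cases are predicted correctly, with one error of each kind, so the accuracy is 6 out of 8.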


I found a good lecture on the confusion matrix that explains it simply with an HIV/AIDS example. The video and my own drawing of it are given below:

WEKA Rushdi Shams Track

The 3rd video explains some details of the different result outputs. It's important.

In the 4th video, blue is yes and red is no.

The 5th video explains testing and training in detail, so it must be watched.

In the 7th video, 10-fold cross-validation means 10 different models for 10 different folds.

In the 8th video we tried the IRIS data.

The 9th video covers feature selection methods, where attributes can be selected for different algorithms and the results may vary (wrapper method).
Feature selection means attribute selection.

The 10th video covers ranker algorithms, which are used for ranking features or attributes. The wrapper method is for machine learning tasks, whereas the filter method is useful for data mining tasks.
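The point from the 7th video, that 10 folds give 10 different models, can be sketched in R. This is not what WEKA does internally, only the same idea by hand; the regression Sepal.Length ~ Petal.Length on the iris data is an arbitrary choice for illustration.

```r
set.seed(14)
k    <- 10
fold <- sample(rep(1:k, length.out = nrow(iris)))   # fold number of each row

models <- vector("list", k)
rmse   <- numeric(k)
for (i in 1:k) {
  train <- iris[fold != i, ]      # 9 folds for training
  test  <- iris[fold == i, ]      # 1 held-out fold for testing
  models[[i]] <- lm(Sepal.Length ~ Petal.Length, data = train)
  pred    <- predict(models[[i]], newdata = test)
  rmse[i] <- sqrt(mean((test$Sepal.Length - pred)^2))
}

mean(rmse)    # cross-validated error, averaged over the 10 models
```

Every row is used for testing exactly once, and each of the 10 models is trained on the other 9 folds.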


@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}

@data
sunny, 90, 77, TRUE, no
overcast, 88, 90, FALSE, no

Mission Data Mining


For predicting class from model

Some data mining tuts

How to Run Your First Classifier in Weka