How to deal with an imbalanced target variable? - r

I am trying to predict the number of goals scored (soccer) from imbalanced data (760 observations * 129 variables). The target variable is FTHG (full-time home goals), with the following distribution, where Count is the number of matches with that many goals:
FTHG Count
0 175
1 243
2 176
3 107
4 43
5 10
6 6
My objective is to convert it into a binary class, majority (1) and minority (0), so that I can apply sampling techniques and then use XGBoost.
Could anyone tell me how to convert the target into majority and minority classes so that I can apply sampling techniques such as SMOTE?
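One way the conversion might look is sketched below. It rebuilds a stand-in FTHG vector from the counts quoted above and splits it at a cut-off. The cut-off of 2 goals and the names goal_counts, threshold and target_binary are assumptions made for illustration only; the question does not say where the majority/minority boundary should lie.

```r
# Stand-in for the FTHG column, rebuilt from the counts quoted in the question
goal_counts <- c("0" = 175, "1" = 243, "2" = 176, "3" = 107,
                 "4" = 43, "5" = 10, "6" = 6)
FTHG <- rep(as.integer(names(goal_counts)), times = goal_counts)

# Hypothetical cut-off: 0 to 2 goals form the majority class (1),
# 3 or more goals form the minority class (0)
threshold <- 2
target_binary <- ifelse(FTHG <= threshold, 1L, 0L)

# Class sizes after the split
table(target_binary)
```

With this particular cut-off the split is 594 majority rows against 166 minority rows, so the binary target is still imbalanced and a resampling step would still be applied afterwards.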

Related

Possible forecast algorithms when time series is short with quarterly data spikes

I have a year's data with quarterly spikes, like below:
Sample code in R to create the data frame:
x <- data.frame("Month" = c(1:12), "Count" = c(110,220,2500,150,180,1800,300,550,5000,205,313,4218))
Here is how the data looks:
Month Count
1 110
2 220
3 2500
4 150
5 180
6 1800
7 300
8 550
9 5000
10 205
11 313
12 4218
We can see that the last month of every quarter has a spike. My goal is to forecast the next year based on this data. I tried linear regression with some feature engineering (such as how far a month is from the quarter end), but the results were unsatisfactory, since there does not appear to be a linear dependency.
I also tried other techniques such as seasonal naive and STLF (using R), and I am currently going through a few interpolation techniques (such as Lagrange or Newton interpolation), but there is a lot of material to study. Can anyone suggest a possible approach that I can explore further?
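For reference, the seasonal naive baseline mentioned above can be sketched in base R without extra packages. This assumes the pattern repeats every 3 months, so each forecast simply reuses the value observed at the same position in the last quarter; the names period, horizon and forecast_df are illustrative and not taken from the question.

```r
# Data frame from the question
x <- data.frame("Month" = c(1:12),
                "Count" = c(110, 220, 2500, 150, 180, 1800,
                            300, 550, 5000, 205, 313, 4218))

period <- 3    # assumed seasonal period: one quarter
horizon <- 12  # forecast the next 12 months

# Seasonal naive: repeat the last observed quarter across the horizon
last_cycle <- tail(x$Count, period)
forecast_counts <- rep(last_cycle, length.out = horizon)

forecast_df <- data.frame(Month = 13:24, Forecast = forecast_counts)
print(forecast_df)
```

This is only a baseline to compare other methods against. It ignores any trend across quarters, which 12 observations can barely support in any case.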

How to fix linear model fitting error in S-plus

I am trying to fit a model so that I can predict next month's number. I am getting a "No data for variable" error even though I have clearly defined the objects that I am putting into the equation.
I've tried placing them in vectors so that one vector can serve as a training data set for predicting the new values. The current script has worked for me on a different dataset but for some reason isn't working here.
The data is small so I was wondering if that has anything to do with it. The data is:
Month io obs Units Sold
12 in 1 114
1 in 2 29
2 in 3 105
3 in 4 30
4 in 5
I'm trying to predict Units Sold with the code below:
matt<-TEST1
isdf<-matt[matt$month<=3,]
isdf<-na.omit(isdf)
osdf<-matt[matt$Units.Sold==4,]
lmfit<-lm(Units.Sold~obs+Month,data=isdf,na.action=na.omit)
predict(lmFit,osdf[1,1])
I am expecting to be able to place lmfit in predict and get an output.
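A few details in the script above look like possible causes: the column is called Month but is referenced as matt$month, the model is stored as lmfit but predict is given lmFit, and predict receives a single cell (osdf[1,1]) rather than a data frame containing obs and Month. The sketch below shows one way the intended workflow might look. It is written for R, so S-PLUS behaviour may differ, and the TEST1 data frame is a reconstruction from the table above (with the missing Units Sold entered as NA), not the original object.

```r
# Hypothetical reconstruction of TEST1 from the table in the question
TEST1 <- data.frame(
  Month = c(12, 1, 2, 3, 4),
  io = rep("in", 5),
  obs = 1:5,
  Units.Sold = c(114, 29, 105, 30, NA)
)

# In-sample rows: Units.Sold is known
isdf <- TEST1[!is.na(TEST1$Units.Sold), ]
# Out-of-sample row: Units.Sold is missing and should be predicted
osdf <- TEST1[is.na(TEST1$Units.Sold), ]

# Same formula as in the question, with consistent object and column names
lmfit <- lm(Units.Sold ~ obs + Month, data = isdf)

# newdata must be a data frame that contains the predictor columns
pred <- predict(lmfit, newdata = osdf)
print(pred)
```

With four observations and three coefficients the fit has a single residual degree of freedom, so any prediction from it should be treated with caution.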

How to create a hierarchical cluster using categorical and numerical data in R?

I want to create a hierarchical cluster that shows types of careers and the bank account balance of the people in those careers.
I have a dataset with two variables, job and balance:
job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88
I want the result to look like this:
Where A, B ,C etc are the job categories.
Can anyone help me start this or give me some help?
I have no idea how to begin.
Thanks!
You can start by using the dist and hclust functions.
df <- read.table(text = " job balance
1 unemployed 1787
2 services 4789
3 management 1350
4 management 1476
5 blue-collar 0
6 management 747
7 self-employed 307
8 technician 147
9 entrepreneur 221
10 services -88")
dist computes the distance between each pair of elements (by default, the Euclidean distance):
distances <- dist(df$balance)
You can then cluster your values using the distance matrix generated above:
clusters <- hclust(distances)
By default, hclust applies complete-linkage clustering to your data.
Finally, you can plot your results as a tree:
plot(clusters, labels = df$job)
Here, we clustered every entry in your data frame, which is why some jobs appear more than once. If you want a single value per job, you can, for example, take the mean balance for each job using tapply:
means <- tapply(df$balance, df$job, mean)
And then cluster the jobs:
distances <- dist(means)
clusters <- hclust(distances)
plot(clusters)
You can then try to use other distance measures or other clustering algorithms (see help(dist) and help(hclust) for other methods).
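As a small illustration of trying another method, the sketch below reruns the per-job clustering with average linkage instead of the default complete linkage. The data frame is rebuilt with data.frame so the block runs on its own, and the name clusters_average is illustrative. Note that with a single numeric variable the Euclidean and Manhattan distances coincide, so here only the linkage choice changes the tree.

```r
# Data from the question, rebuilt so that this block is self-contained
df <- data.frame(
  job = c("unemployed", "services", "management", "management",
          "blue-collar", "management", "self-employed", "technician",
          "entrepreneur", "services"),
  balance = c(1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88)
)

# Mean balance per job, as in the answer above
means <- tapply(df$balance, df$job, mean)

# Same distance matrix, but average linkage instead of complete linkage
distances <- dist(means)
clusters_average <- hclust(distances, method = "average")

plot(clusters_average)
```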

Test performing on counts

In R I have a dataset data1 that contains game and times. There are 6 games, and times tells us how many times a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similarly, for data2 we get
game times
1 744
2 989
...
6 711
sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2, but I do not think that information is relevant.
I want to compare the two datasets to see whether there is a statistically significant difference and which game "causes" that difference.
What test should I use to compare these? I don't think Pearson's chisq.test is the right choice in this case. Maybe wilcox.test is the right one to choose?
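To make the comparison concrete, the sketch below shows how the two sets of counts could be arranged as a 2 x 6 table and passed to chisq.test, with the standardized residuals indicating which games deviate most from the expected counts. This is not a claim that chisq.test is the appropriate test here. The counts for games 3 to 5 are elided in the question, so the values used for them are made-up placeholders, and the names times1, times2 and tab are illustrative.

```r
# Counts per game; games 1, 2 and 6 are from the question,
# games 3 to 5 are PLACEHOLDER values because the question elides them
times1 <- c(850, 621, 500, 400, 300, 210)
times2 <- c(744, 989, 650, 420, 380, 711)

# 2 x 6 contingency table: one row per dataset, one column per game
tab <- rbind(data1 = times1, data2 = times2)
colnames(tab) <- paste0("game", 1:6)

# Pearson chi-squared test of homogeneity across the two datasets
result <- chisq.test(tab)
print(result)

# Standardized residuals: large absolute values flag the games
# that contribute most to any difference
print(result$stdres)
```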

Using KNN for pattern matching of time series

I want to implement a KNN algorithm for pattern matching (or pattern recognition) in my time series data. The data are consumption measurements. I have a table in which the first column is the datetime of the measurement and the other columns hold the measurements. Here is an example:
datetime mains stove kitchen microwave TV
2013-04-21 14:22:13 341.03 6 57 5 0
2013-04-21 14:22:16 342.36 6 57 5 0
2013-04-21 14:22:20 342.52 6 58 5 0
2013-04-21 14:22:23 342.07 6 57 5 0
2013-04-21 14:22:26 341.77 6 57 5 0
2013-04-21 14:22:30 341.66 6 55 5 0
I want to use the KNN algorithm to compare the pattern of the mains signal with the patterns of the other signals. So my training set would consist of labeled measurements of every appliance and the test data set would consist of the mains signal measurements. The aim is to detect changes in the signal, that is, which appliance was turned on at which time.
What I actually want to ask is:
How should I cope with the datetime format? In which format should it be passed to KNN? (I suppose some conversion to integer or normalization will be needed?)
Is the KNN algorithm suitable for this task?
How is pattern matching with KNN generally performed?
What I've already tried: I passed a single vector consisting of labeled patterns of data (from each appliance) to KNN as the training set and then passed the mains data as the test set. I omitted the datetime column entirely. I got bad results.
I'm implementing this in R language.
Any ideas?
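On the datetime question specifically, one possible preprocessing step is sketched below: the timestamps are parsed with as.POSIXct, converted to seconds elapsed since the first reading, and the measurement columns are standardized so that no column dominates the distance calculation. The UTC time zone and the names readings, secs and features are assumptions, and this covers only the preprocessing, not the KNN classification itself.

```r
# First rows of the example table from the question (mains and kitchen only)
readings <- data.frame(
  datetime = c("2013-04-21 14:22:13", "2013-04-21 14:22:16",
               "2013-04-21 14:22:20", "2013-04-21 14:22:23",
               "2013-04-21 14:22:26", "2013-04-21 14:22:30"),
  mains = c(341.03, 342.36, 342.52, 342.07, 341.77, 341.66),
  kitchen = c(57, 57, 58, 57, 57, 55)
)

# Parse the timestamps; the time zone is an assumption
readings$datetime <- as.POSIXct(readings$datetime, tz = "UTC",
                                format = "%Y-%m-%d %H:%M:%S")

# Numeric time axis: seconds elapsed since the first measurement
readings$secs <- as.numeric(difftime(readings$datetime,
                                     readings$datetime[1],
                                     units = "secs"))

# Standardize the measurement columns (mean 0, standard deviation 1)
features <- scale(readings[, c("mains", "kitchen")])

print(readings$secs)
print(features)
```

The elapsed seconds also show that the sampling interval is irregular (3 or 4 seconds), which may matter if fixed-length windows of the signal are compared.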
