Performing a test on counts - R

In R I have a dataset data1 that contains the columns game and times. There are 6 games, and times simply tells us how many times a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similarly, for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2, but I do not think that information is relevant.
I want to compare the two datasets and see if there is a statistically significant difference, and which game "causes" that difference.
What test should I use to compare these? I don't think Pearson's chisq.test is the right choice in this case; maybe wilcox.test is the right one to choose?
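For reference, a minimal sketch of how the two count vectors could be compared with a chi-squared test of homogeneity, assuming both datasets list the same 6 games in the same order (names follow the example above):

# Build a 2 x 6 contingency table of play counts per game.
tab <- rbind(data1 = data1$times, data2 = data2$times)
colnames(tab) <- data1$game

# Test whether the distribution of plays over games differs between the datasets.
res <- chisq.test(tab)
res

# Standardized residuals highlight which games contribute most to any difference.
res$stdres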

Related

Cluster analysis in R on large data set

I have a data set with rankings as the column names and about 15,000 contestants. My data looks like:
contestant    1    2    3    4
       101   13    0    5   12
        14    0    1   34    6
       ...  ...  ...  ...  ...
       500    0    2   23    3
I've been working on doing cluster analysis on this dataset. The dendrograms are obviously not very helpful with this dataset--they produce a thick block line because of the large number of entries.
I'm wondering if there is a better way to do cluster analysis with this type of data. I've tried
fviz_cluster()
and similar commands, and have gone through multiple tutorials. Many tutorials guided me through making dendrograms. The data in those tutorials all seems different from mine (comparing two variables, etc.) and much smaller. Essentially, I'm asking which types of cluster analysis may work well with this type of data.
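For reference, a minimal sketch of one alternative to dendrograms, assuming the data frame is called rankings_df with the contestant id in the first column (the name is a placeholder):

library(factoextra)   # provides fviz_nbclust() and fviz_cluster()

# Drop the id column and scale the ranking counts before clustering.
X <- scale(rankings_df[, -1])

# k-means handles ~15,000 rows easily; choose k from an elbow plot.
fviz_nbclust(X, kmeans, method = "wss")
km <- kmeans(X, centers = 4, nstart = 25)

# fviz_cluster() projects the points onto the first two principal components,
# which is usually far more readable than a dendrogram at this size.
fviz_cluster(km, data = X, geom = "point")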

Possible forecast algorithms when time series is short with quarterly data spikes

I have a year's data with quarterly spikes, like below:
Sample code in R to create the dataframe:
x <- data.frame("Month" = c(1:12), "Count" = c(110,220,2500,150,180,1800,300,550,5000,205,313,4218))
Here is how the data looks:
Month Count
1 110
2 220
3 2500
4 150
5 180
6 1800
7 300
8 550
9 5000
10 205
11 313
12 4218
We can see that the last month of every quarter has a spike. My target is to forecast the next year based on this data. I tried linear regression with some feature engineering (like how far a month is from the quarter end), and the results were obviously not satisfactory, as there doesn't appear to be a linear dependency.
I tried other techniques like seasonal naive and STLF (using R) and am currently going through a few interpolation techniques (like Lagrange or Newton interpolation); there appears to be a lot of material to study. Can anyone suggest a good possible solution so that I can explore further?
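For reference, a minimal sketch using the forecast package (package choice is an assumption; with only twelve observations any model is speculative), treating the series as having a seasonal period of one quarter:

library(forecast)

x <- data.frame(Month = 1:12,
                Count = c(110, 220, 2500, 150, 180, 1800,
                          300, 550, 5000, 205, 313, 4218))

# Treat the counts as a series whose seasonal period is 3 months (one quarter).
y <- ts(x$Count, frequency = 3)

# Seasonal naive: the forecast repeats the last observed quarter's pattern.
fc_snaive <- snaive(y, h = 12)

# Exponential smoothing on the log scale keeps forecasts positive and treats
# the end-of-quarter spike multiplicatively.
fc_ets <- forecast(ets(log(y)), h = 12)

autoplot(fc_snaive)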

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week based on whether or not a team lost. For example, let's say the Bills and Ravens each started Sunday's game with a rating of 100; after the Ravens win, their rating increases to 120 and the Bills' decreases to 80.
While running the simulation, I would like to update the teams' ratings throughout, in order to get a more accurate representation of the number of ways a season could play out, but am not sure how to include something like this within the loop.
My loop for the 2017 season:
full.sim <- NULL
for(i in 1:10000){
  # Simulate every game once: the home team wins with probability homewinpredict.
  nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
  nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
  # Tally wins per team for this simulated season and store them.
  winningteams <- table(nflpredictions$winner)
  projectedwins <- data.frame(Team = names(winningteams), Wins = as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by = list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017, where I already have the full season's worth of data, but I am having trouble adapting it to simulate 2018.
My first idea is to create another for loop within the loop that iterates through the rows and updates the ratings for each week, something along the lines of:
full.sim <- NULL
for(i in 1:10000){
  for(j in 1:nrow(nflpredictions)){
The idea being to update a team's rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire dataframe. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above. So what would be the easiest way to update the ratings each week based on the result of the last game that team played in?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!
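For reference, a minimal sketch of carrying ratings forward inside the loop with a named vector; the rating shift, the logistic win probability, and the startratings lookup table are placeholders for the asker's own GLM and data:

full.sim <- NULL
for (i in 1:10000) {
  # Reset each simulated season to the pre-season ratings (assumed lookup table).
  ratings <- setNames(startratings$Rating, startratings$Team)

  sim <- nflpredictions[order(nflpredictions$Week), ]
  sim$winner <- NA_character_
  for (j in seq_len(nrow(sim))) {
    home <- as.character(sim$HomeTeam[j])
    away <- as.character(sim$AwayTeam[j])

    # Win probability from the current ratings (logistic placeholder for the GLM).
    p_home <- 1 / (1 + exp(-(ratings[home] - ratings[away]) / 50))
    homewin <- rbinom(1, 1, p_home)
    sim$winner[j] <- if (homewin) home else away

    # Update the stored ratings so later weeks see the result of this game.
    shift <- 20
    ratings[home] <- ratings[home] + if (homewin)  shift else -shift
    ratings[away] <- ratings[away] + if (homewin) -shift else  shift
  }

  winningteams  <- table(sim$winner)
  projectedwins <- data.frame(Team = names(winningteams), Wins = as.numeric(winningteams))
  full.sim <- rbind(full.sim, projectedwins)
}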

R smbinning package: why 'Too many categories' for some variables?

I have a dataset in R containing many variables of different types and I am attempting to use the smbinning package to calculate Information Value.
I am using the following code:
smbinning.sumiv(Sample,y="flag")
This code produces IV for most of the variables, but for some the Process column states 'Too many categories' as shown in the output below:
Char IV Process
12 relationship NA Too many categories
15 nationality NA Too many categories
22 business_activity NA Too many categories
23 business_activity_group NA Too many categories
25 local_authority NA Too many categories
26 neighbourhood NA Too many categories
If I take a look at the values of business_activity_group for instance, I can see that there are not too many possible values it can take:
Affordable Rent Combined         2546
Commercial Community Combined       4
Freeholders Combined               23
Garages                             6
General Needs Combined          57140
Keyworker                         340
Leasehold Combined                 88
Market Rented Combined           1463
Older Persons Combined           4774
Rent To Homebuy                    76
Shared Ownership Combined         167
Staff Acommodation Combined         5
Supported Combined               2892
I thought this could be due to low volumes in some of the categories so I tried banding some of the groups together. This did not change the result.
Can anyone please explain why 'Too many categories' occurs, and what I can do to these variables in order to produce IV from the smbinning package?
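For reference, a minimal sketch (column and data frame names follow the question; the frequency threshold is arbitrary) that counts the distinct levels per categorical column and collapses rare levels before re-running, since smbinning appears to skip categorical predictors whose number of distinct levels exceeds its internal limit:

# Count distinct values for each character/factor column in Sample.
cat_cols <- sapply(Sample, function(col) is.factor(col) || is.character(col))
sapply(Sample[cat_cols], function(col) length(unique(col)))

# Collapse infrequent levels into "Other" to reduce the number of categories.
x <- as.character(Sample$business_activity_group)
rare <- names(which(table(x) < 100))
Sample$business_activity_group <- factor(ifelse(x %in% rare, "Other", x))

smbinning.sumiv(Sample, y = "flag")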

Using the correct test for independence

I have two groups (data.frames) in R called good and bad, which contain good users and bad users respectively.
The group good contains game_id, which is the id for a computer game, and number, which is how many times this game has been played.
For example, good$game_id gives 1 2 3 ... 20. We have 20 games.
Similarly, good$number gives 45214 1254 23 ... 8914, which is the number of times each game has been played. For example, game_id==1 has been played 45214 times in group good.
Similarly for bad.
We also have the same number of users in the two groups.
So for head(good,20) we get
game_id number
1 45214
2 1254
...
20 8914
I want to investigate whether the number of times a fixed computer game has been played depends on the group.
For game_id==1 I would try to use Pearson's chi-squared test for independence.
In R I type chisq.test(good[1,2], bad[1,2]) to see if there is independence between good and bad for game_id==1, but I get an error message: 'x' and 'y' must have same levels.
How can this problem be solved?
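For reference, a minimal sketch for a single game, assuming both data frames list the same 20 games: chisq.test() expects a contingency table of counts, not two single numbers, so one option is to compare plays of that game against plays of all other games in each group.

# 2 x 2 table: plays of game 1 vs. plays of every other game, by group.
g <- 1
tab <- rbind(good = c(good$number[good$game_id == g], sum(good$number[good$game_id != g])),
             bad  = c(bad$number[bad$game_id == g],  sum(bad$number[bad$game_id != g])))
colnames(tab) <- c("game_1", "other_games")

chisq.test(tab)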
