For Loop for Correlations in R

I want to get correlation values between two variables for each county.
I have subset my data as shown below, and I get the appropriate value for Adams county on its own, but now I want to do the same for the other counties:
CorrData<-read.csv("H://Correlation
Datasets/CorrelationData_Master_Regression.csv")
CorrData2<-subset(CorrData, CountyName=="Adams")
dzCases<-(cor.test(CorrData2$NumVisit, CorrData2$dzdx, method="kendall"))
dzCases
I would like to write a for loop or something similar that will make the process more efficient, so that I don't have to write 20 different variable correlations for each of the 93 counties.
When I run the following in R, it doesn't give an error, but it doesn't give me the response I was hoping for either. Rather than Spearman's correlation for each county, it seems to be ignoring the loop portion and just giving me the correlation between the two variables across ALL counties.
CorrData<-read.csv("H:\\CorrelationData_Master_Regression.csv")
for (i in CorrData$CountyName)
{
dzCasesYears<-cor.test(CorrData$NumVisit, CorrData$dzdx,
method="spearman")
}
A very small sample of my data looks similar to this:
CountyName Year NumVisits dzdx
Adams 2010 4.545454545 1.19
Adams 2011 20.83333333 0.20
Elmore 2010 26.92307692 0.24
Elmore 2011 0 0.61
Brown 2010 0 -1.16
Brown 2011 17.14285714 -1.28
Clark 2010 25 -1.02
Clark 2011 0 1.13
Cass 2010 17.85714286 0.50
Cass 2011 27.55102041 0.11
I have tried to find a similar example online, but am not having luck!
Thank you in advance for all your help!

You are looping but never using your iterator 'i' inside the loop, so every pass computes the correlation over all counties and you only see the last result. Based on the comments, you should also make sure you are working with numerics. I also noticed that you are not storing each cor.test result anywhere as you iterate. I'm not sure a loop is the most efficient way to do it, but it will work just fine, and since you started with a loop, you should have something of the kind:
dzCasesYears <- list()  # prep a list to store your cor.test results
counter <- 0            # index to store each cor.test result in the list
for (i in unique(CorrData$CountyName))
{
  counter <- counter + 1
  # creating new variables makes the code clearer
  x <- as.numeric(CorrData[CorrData$CountyName == i, ]$NumVisit)
  y <- as.numeric(CorrData[CorrData$CountyName == i, ]$dzdx)
  dzCasesYears[[counter]] <- cor.test(x, y, method = "spearman")
}
It's also good practice to wrap the values you iterate over in unique(), so each county is processed only once.
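If you would rather end up with one small table than a list of htest objects, here is a minimal sketch along the same lines (assuming the same CorrData columns as above; county_results is just an illustrative name):
county_results <- do.call(rbind, lapply(unique(CorrData$CountyName), function(cty) {
  sub <- CorrData[CorrData$CountyName == cty, ]
  ct  <- cor.test(as.numeric(sub$NumVisit), as.numeric(sub$dzdx), method = "spearman")
  data.frame(CountyName = cty, rho = unname(ct$estimate), p = ct$p.value)
}))
Each row then holds one county's Spearman estimate and p value.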

data.table makes operations like this very simple.
library('data.table')
CorrData <- as.data.table(read.csv("H:\\CorrelationData_Master_Regression.csv"))
CorrData[, cor(dzdx, NumVisits), CountyName]
With the sample data, it's all negative ones because there are only two points per county, so the correlation is perfect. The full dataset should be more interesting!
CountyName V1
1: Adams -1
2: Elmore -1
3: Brown -1
4: Clark -1
5: Cass -1
Edit to include p values from cor.test as OP asked in the comment
This is also quite simple!
CorrData[, .(cor=cor(dzdx, NumVisits),
p=cor.test(dzdx, NumVisits)$p.value),
CountyName]
...But it won't work with your sample data, as two points per county is not enough for cor.test to compute a p value. Perhaps you could take @smci's advice and dput a larger subset of the data to make your question truly reproducible.
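For instance, something like
dput(head(CorrData, 20))
prints the first rows in a form that can be pasted straight back into R.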

Related

Updating Values within a Simulation in R

I am working on building a model that can predict NFL games, and am looking to run full season simulations and generate expected wins and losses for each team.
Part of the model is based on a rating that changes each week based on whether or not a team lost. For example, let's say the Bills and Ravens each started Sunday's game with a rating of 100; after the Ravens win, their rating increases to 120 and the Bills' decreases to 80.
While running the simulation, I would like to update the teams' ratings throughout, in order to get a more accurate representation of the number of ways a season could play out, but am not sure how to include something like this within the loop.
My loop for the 2017 season:
full.sim <- NULL
for(i in 1:10000){
nflpredictions$sim.homewin <- with(nflpredictions, rbinom(nrow(nflpredictions), 1, homewinpredict))
nflpredictions$winner <- with(nflpredictions, ifelse(sim.homewin, as.character(HomeTeam), as.character(AwayTeam)))
winningteams <- table(nflpredictions$winner)
projectedwins <- data.frame(Team=names(winningteams), Wins=as.numeric(winningteams))
full.sim <- rbind(full.sim, projectedwins)
}
full.sim <- aggregate(full.sim$Wins, by= list(full.sim$Team), FUN = sum)
full.sim$expectedwins <- full.sim$x / 10000
full.sim$expectedlosses <- 16 - full.sim$expectedwins
This works great when running the simulation for 2017 where I already have the full seasons worth of data, but I am having trouble adapting for a model to simulate 2018.
My first idea is to create another for loop within the loop, one that iterates through the rows and updates the ratings for each week, something along the lines of:
full.sim <- NULL
for(i in 1:10000){
for(j in 1:nrow(nflpredictions)){
The idea being to update a team's rating, then generate the win probability for the week using the GLM I have built, simulate who wins, and then continue through the entire data frame. The only thing really holding me back is not knowing how to add a value to a row based on a row that is not directly above it. So what would be the easiest way to update the ratings each week based on the result of the last game that team played?
The dataframe is built like this, but obviously on a larger scale:
nflpredictions
Week HomeTeam AwayTeam HomeRating AwayRating HomeProb AwayProb
1 BAL BUF 105 85 .60 .40
1 NE HOU 120 90 .65 .35
2 BUF LAC NA NA NA NA
2 JAX NE NA NA NA NA
I hope I explained this well enough... Any input is greatly appreciated, thanks!
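A minimal sketch of that row-by-row idea (hedged: the logistic win probability and the K update constant below are placeholder assumptions standing in for the poster's GLM and rating rule; simulate_season and init_ratings are illustrative names):
simulate_season <- function(sched, init_ratings, K = 20) {
  # sched: one season's schedule with HomeTeam and AwayTeam columns, in week order
  # init_ratings: named numeric vector mapping team name -> starting rating
  ratings <- init_ratings
  sched$sim.homewin <- NA
  for (g in seq_len(nrow(sched))) {
    h <- as.character(sched$HomeTeam[g])
    a <- as.character(sched$AwayTeam[g])
    # placeholder win probability from the current ratings; swap in the GLM here
    p_home <- 1 / (1 + exp(-(ratings[h] - ratings[a]) / 25))
    win <- rbinom(1, 1, p_home)
    sched$sim.homewin[g] <- win
    # move rating points between the two teams based on the simulated result,
    # so later rows see ratings already updated by earlier games
    ratings[h] <- ratings[h] + K * (win - p_home)
    ratings[a] <- ratings[a] - K * (win - p_home)
  }
  sched
}
Because the ratings vector is updated inside the loop, each week 2 game is simulated with ratings that already reflect the simulated week 1 results.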

How to split data up using an if statement?

I have a data set in R that involves students and GPAs, for example
Student GPA
Jim 3.00
Tom 3.29
Ana 3.99
and so on.
I want a column that puts them in a bin, for example:
Student GPASplit
Jim 3.0-3.5
Tom 3.0-3.5
Ana 3.5-4.0
Because when I try to compute statistics on the GPAs, everything is separated by the actual GPA values. For example, I am trying to find the percentage of students with a GPA above 3.5, between 3.0 and 3.5, and so forth; but I get percentages in terms of the individual GPA values, and when you have 4000 data points, all with different GPAs, it is hard to figure out how many are above 3.5. Does this make sense? Sorry if it doesn't.
You can use the cut() function to split data into bins that you define. You have to be careful about values that fall exactly on the boundaries though, and make sure they're being treated how you want. With your example data:
> df$GPA_split = cut(df$GPA, breaks = c(3.0, 3.5, 4.0), include.lowest = TRUE)
> df
Student GPA GPA_split
1 Jim 3.00 [3,3.5]
2 Tom 3.29 [3,3.5]
3 Ana 3.99 (3.5,4]
# Count values in each bin
> table(df$GPA_split)
[3,3.5] (3.5,4]
2 1
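Since the goal was percentages, prop.table() converts those counts into proportions of all students:
> prop.table(table(df$GPA_split))
  [3,3.5]   (3.5,4]
0.6666667 0.3333333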

R predict function not using entire data in the test data set, only using partial data and predicting

I have a train data set which has 700 records. I prepared the model using the C5.0 function with this data.
library(C50)
abc_model <- C5.0(abc_train[-5], abc_train$resultval)
I have test data which has 5000 records.
I am using the predict function to do the prediction on these 5000 records.
abc_Test <- read.csv("FullData.csv", quote="")
abc_pred <- predict(abc_model, abc_test)
This is giving me the prediction for ONLY 700 records, not all 5000.
How do I make this predict all 5000?
When the train data is larger than the test data, the result is fine: I get all the data, and I am able to combine the test data with the results and write the output to .CSV. But when the train data is smaller than the test data, not all records get predicted.
x <- data.frame(abc_test, abc_pred)
Any inputs on how to overcome this problem? I am not an expert in R, so any suggestions will help me a lot.
Thanks Richard.
Below is my train data (a few records):
Id Value1 Value2 Country Result
20835 63 1 United States yes
3911156 60 12 Romania no
39321 10 3 United States no
29425 80 9 Australia no
Below is my test data (again, a few records):
Id Value1 Value2 Country
3942587 114 12 United States
3968314 25 13 Sweden
3973205 83 10 Russian Federation
17318 159 9 Russian Federation
I am trying to find the Result value and append it to my test data. But, like I described, I am getting the Result for only 700 records, not all 5000.
You should try this:
str(abc_train)
str(abc_test)
lapply(abc_train[names(abc_train) != "Result"], table)
lapply(abc_test, table)
Then you will probably find that some of the levels for some of the variables in abc_test were not in abc_train, so estimates could not be produced for those records. I'm guessing you thought that the numeric values would be handled as though a regression had been done, but that won't happen if those columns are factors; no prediction function will treat them that way, and perhaps none ever will, depending on the function's behavior. Looking at C50::C5.0.default, it appears there may be no regression option for variables.
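One common follow-up fix (a sketch, assuming Country is the only factor predictor, as in the sample data shown) is to re-level the test factors against the training data before predicting, so unseen levels become NA instead of shifting the factor coding:
abc_test$Country <- factor(abc_test$Country, levels = levels(abc_train$Country))
abc_pred <- predict(abc_model, abc_test)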

What's the smart way to aggregate data?

Suppose there is a dataset of different regions, each region a subset of a state, and some outcome variable:
regions <- c("Michigan, Eastern",
"Michigan, Western",
"Minnesota",
"Mississippi, Northern",
"Mississippi, Southern",
"Missouri, Eastern",
"Missouri, Western")
set.seed(123)
outcome <- rpois(7, 12)
testset <- data.frame(regions,outcome)
regions outcome
1 Michigan, Eastern 10
2 Michigan, Western 11
3 Minnesota 17
4 Mississippi, Northern 12
5 Mississippi, Southern 12
6 Missouri, Eastern 17
7 Missouri, Western 13
A useful tool would aggregate the regions by state and sum (or take the mean, maximum, etc. of) the outcome, generating a new data frame with one row per state. A sum, for example, would output this:
state outcome
1 Michigan 21
3 Minnesota 17
4 Mississippi 24
6 Missouri 30
The aggregate() function won't solve this problem on its own. Is there something else in R that is built for this? It seems like grep could be used to generate the new "states" column as part of an application-specific program, but it seems like this would already be out there somewhere.
The reason this isn't straightforward is that the structure of your data is not consistent, so a library couldn't be built around it directly.
Your regions column is basically an index column, and you want to aggregate across part of it. tapply is designed for this, but there's no reason for a built-in function to handle this specific scenario automatically. You can do it without creating a new column, though:
tapply(outcome, gsub(",.*$", "", testset$regions), sum)
The gsub() removes the comma and everything after it, leaving just the state name to index on.
PS: you have a slight typo in your example, your data.frame should be
testset <- data.frame(regions,outcome)
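By the way, if you would rather get a data frame back than a named vector, the same gsub() trick feeds straight into base R's aggregate():
testset$state <- gsub(",.*$", "", testset$regions)
aggregate(outcome ~ state, data = testset, FUN = sum)
        state outcome
1    Michigan      21
2   Minnesota      17
3 Mississippi      24
4    Missouri      30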

Summary statistics of retail prices grouped by categorical data

I need some help writing a function that takes three categorical inputs and returns a vector of summary statistics based on these inputs.
The data set contains information on retail goods that can be specified by their retail segment, brand name, and type of good along with its retail price and what it actually sold for.
Now I need to write a function that will take these inputs and average, count, and calculate whatever else is needed.
I have set the function up as follows (using made up data):
dataold = data.frame(segment=c("golf","tenis","football","tenis","golf","golf"),
brand=c("x","y","z","y","x","a"),
type=c("iron","ball","helmet","shoe","driver","iron"),
retail=c(124,.60,80,75,150,108),
actual=c(112,.60,72,75,135,100))
retailsum = funtion(segment,brand,type){
datanew = dataold[which(dataold$segment='segment' &
dataold$brand='brand' &
dataold$type='type'),c("retail","actaul")]
summary = c(dim(datanew)[1],colMeans(datanew))
return(summary)
}
The code inside the function braces works on its own, but once I wrap a function around it I start getting errors or it will just return 0 counts and NaN for the means.
Any help would be greatly appreciated. I have very little experience in R, so I apologize if this is a trivial question, but I have not been able to find a solution.
There are rather a lot of errors in your code, including:
misspelling of function
using single = (assignment) rather than == (equality test)
mistype of actual
hardcoding of segment, brand and type in your function, rather than referencing the arguments.
Here is how your function could look so that it produces valid results:
retailsum <- function(data, segment, brand, type, FUN=colMeans){
  # compare each data column against the matching function argument
  x <- data[data$segment == segment & data$brand == brand & data$type == type,
            c("retail", "actual")]
  match.fun(FUN)(x)
}
retailsum(dataold, "golf", "x", "iron", colMeans)
retail actual
   124    112
And here is a (possibly much more flexible) solution using the plyr package. This calculates your function for all combinations of segment, brand and type:
library(plyr)
ddply(dataold, .(segment, brand, type), colwise(mean))
segment brand type retail actual
1 football z helmet 80.0 72.0
2 golf a iron 108.0 100.0
3 golf x driver 150.0 135.0
4 golf x iron 124.0 112.0
5 tenis y ball 0.6 0.6
6 tenis y shoe 75.0 75.0
Andrie's solution is pretty complete already. (ddply is cool! Didn't know about that function...)
Just one addition, though: if you want to compute the summary values over all possible combinations, you can do this as a one-liner using R's built-in by() function:
by(dataold, list(dataold$segment, dataold$brand, dataold$type),
function(x) summary(x[,c('retail', 'actual')])
)
That is not strictly what you asked for, but may still be instructive.
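For completeness, base R's aggregate() can compute the same per-group means without any extra packages (using the same dataold as above):
aggregate(cbind(retail, actual) ~ segment + brand + type, data = dataold, FUN = mean)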
