I have a problem relating to subsetting.
Basically I have a dataset. This toy dataset is a good small example:
df <- data.frame(year = 1980:2019, randnorm = rnorm(40, 0, 1),
                 count1 = rpois(40, 18), lograndnorm = rlnorm(40, 3, 2))
For each year from 2000 to 2019, I want to remove that year's observations and output a subset of the full df data excluding that year. I then want to train a model on the remaining data and use it to predict the year that was removed.
For example, subset_ex2010 might exclude 2010. All data except the rows where year = 2010 go into subset_ex2010, and I can then use that data to predict 2010.
Once the model has run, the output is saved and the loop moves on to the next year, that is, removes 2009 from the full df data frame and subsets the remainder.
I've tried:
for(i in 2000:2019){
subset_excl_[i] <- subset(df, year<i | year>i] )
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
}
and,
for(i in 2000:2019){
subset_excl_[i] <- [df$year-i]
subset_of_[i] <- subset(df, year==i] )
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
}
but both fall over. Any assistance would be gratefully received.
I don't know linear modelling, but in both your blocks of code
lmmod[i] <- lm(count1 ~ randnorm + lograndnorm, data=subset_excl_[i])
distPred[i] <- predict(lmMod[i], subset_of_[i])
you're referring to both lmmod and lmMod. R is case-sensitive.
If that alone doesn't fix it, put a browser() call at the head of the loop and single-step until you find where it's blowing up.
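For what it's worth, here is a minimal working sketch of the leave-one-year-out loop. The other problem both blocks share is the subset_excl_[i] construction: R can't index an object by pasting a year onto its name, so store the fits and predictions in lists keyed by year instead (the names here are just illustrative):
lmmod    <- list()
distPred <- list()
for (i in 2000:2019) {
  train <- subset(df, year != i)   # all years except the held-out one
  test  <- subset(df, year == i)   # only the held-out year
  key <- as.character(i)
  lmmod[[key]]    <- lm(count1 ~ randnorm + lograndnorm, data = train)
  distPred[[key]] <- predict(lmmod[[key]], newdata = test)
}
distPred[["2010"]]  # prediction for the held-out year 2010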
I have the following code and I want to turn it into a for loop. I only have to change the year numbers on all the lines (years 1996-2019). The following is my code:
# loading health data
health_data_1996 <- read.csv("1996-Annual.csv")
#delete data which is not needed
health_data_1996 <- health_data_1996[!(health_data_1996$Measure.Name != "Unemployment Rate, Annual" &
health_data_1996$Measure.Name != "High School Graduation"),]
health_data_1996 <- health_data_1996[,-c(1,2,5,7:11)]
#rename value column
colnames(health_data_1996)[3] <- "1996"
Can somebody tell me how I could make a for loop out of this?
Thank you very much for your help.
Since you just want to read the datasets and not combine them, I suggest the following. I'm assuming here that all your CSV files have the same name structure.
# create a vector with all the years
years <- 1996:2019
# apply the desired function on every value in years consecutively
all_data <- lapply(years, function(y) {
df <- read.csv(paste0(y, "-Annual.csv"))
df <- df[df$Measure.Name == "Unemployment Rate, Annual" |
df$Measure.Name == "High School Graduation", ]
df <- df[, -c(1, 2, 5, 7:11)]
colnames(df)[3] <- y
df
})
# lapply over a numeric vector returns an unnamed list, so name it by year
names(all_data) <- years
This will give you a named list where every element is the dataset for a given year. So for example if you want the data from 2019 you should be able to retrieve it with all_data[["2019"]].
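If you would rather have a literal for loop, as the question asks, this sketch is equivalent (same assumption about the file names):
all_data <- list()
for (y in years) {
  df <- read.csv(paste0(y, "-Annual.csv"))
  df <- df[df$Measure.Name %in% c("Unemployment Rate, Annual",
                                  "High School Graduation"), ]
  df <- df[, -c(1, 2, 5, 7:11)]
  colnames(df)[3] <- y
  all_data[[as.character(y)]] <- df  # naming the element by year as we go
}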
I'm trying to store p-values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but my code is so long (and maybe even incorrect) that what works for other for loops doesn't seem to apply to mine.
The overview of what I'm trying to do: I'm comparing the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p-value. To do this, I'm writing a for loop that selects one year at a time from a huge data set, narrows the data down to the observed pairs, and adds a relatedness column whose values are transferred from another data set. Within this loop I'm running a second for loop that builds a data frame of all possible paired birds in that year, again adding and transferring a column of relatedness values. From these two data frames of pairs and relatedness within each year, I want to apply the Wilcoxon test to find the p-value for each given year. I then want to transfer these p-values into a separate data frame that I have created with a year column and a p-value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value
So with your dataset, perhaps putting this at the bottom of your for loop. Note that j runs over years (2000:2013), so index results by matching the year; pvalue[j] would write to positions 2000 and up:
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
    all_poss_df_geno$relatedness, alternative='greater')$p.value
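Putting it together, a minimal sketch of the storage pattern; the loop body that builds the two relatedness data frames is elided and stays exactly as in your code:
results <- data.frame(year = 2000:2013, pvalue = NA_real_)
for (j in results$year) {
  # ... your existing code building allbr_demo_noEPP_year_geno_obs
  # and all_poss_df_geno for year j goes here ...
  wt <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
                    all_poss_df_geno$relatedness, alternative = 'greater')
  results$pvalue[results$year == j] <- wt$p.value
}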
I'm working on calculating item statistics for a series of multiple choice exams. I've got a solution using mapply that technically works, but it takes a few hours to calculate one of the more complicated statistics. The first dataset that I have is one that includes a separate row for every question that every student answered for every assessment.
df <- data.frame(c(rep("s1", 5), rep("s2", 5), rep("s3", 5), rep("s4", 5)), "a1", c("i1", "i2", "i3", "i4", "i5"), c(1, 0), 1)
colnames(df) <- c("student", "assessment", "item", "score", "points.possible")
The first step that I do (and only do once) is to create a table of all unique items. In this case, that would be simple, as there's only one assessment and 5 items.
unique <- subset(df[,c("assessment", "item")], !duplicated(df[,c("assessment", "item")]))
I then need to calculate a statistic for each one of these items. However, the tricky part is that the calculation requires the overall scores that students got on the entire assessment. Here's the function I wrote to do that.
fun1 <- function(a.id, i.id) {
# subset original dataframe for just one assessment
subsetdf <- df[df$assessment == a.id,]
# generate list of students that got the item right and wrong
correct <- subsetdf$student[subsetdf$item==i.id & subsetdf$score==1]
wrong <- subsetdf$student[subsetdf$item==i.id & subsetdf$score==0]
# scores by student
scores <- aggregate(score ~ student, data=subsetdf,sum)/aggregate(points.possible ~ student, data=subsetdf, sum)
# average scores for students that got item right/wrong
x.1 <- sum(subsetdf$score[subsetdf$student %in% correct])/sum(subsetdf$points.possible[subsetdf$student %in% correct])
x.0 <- sum(subsetdf$score[subsetdf$student %in% wrong])/sum(subsetdf$points.possible[subsetdf$student %in% wrong])
# percent of students that got item right
p <- length(correct)/(length(correct)+length(wrong))
# final stat calculation
r <- ((x.1-x.0)*sqrt(p*(1-p)))/sd(scores[,2])
print(r)
}
I then used mapply to loop this function over the entire original dataset while using the smaller dataset for the inputs.
unique$r <- mapply(fun1, unique$assessment, unique$item)
I was happy that I was able to get it to work, but when I do it with the larger datasets (~7 million rows for "df", ~2000 rows for "unique"), it takes quite a while (several hours). Any tips on other ways to tackle this problem that are more efficient? I've learned that one problem is that my function creates a copy of the original large dataset every time it loops through, but I don't know how to do the problem without that.
I still consider myself a beginner for this kind of usage for R, so any advice would be much appreciated!
When you perform
scores <- aggregate(score ~ student, data=subsetdf,sum)/aggregate(points.possible ~ student, data=subsetdf, sum)
the result is not strictly numeric: each aggregate() call returns a data frame whose first column is the student ID (for example, for a.id = 'a1', i.id = 'i1'):
> aggregate(score ~ student, data=subsetdf,sum)
student score
1 s1 3
2 s2 2
3 s3 3
4 s4 2
So when you divide the two, R tries to compute 's1'/'s1', which is not numeric and throws a warning.
There is no need to create correct and wrong. Treat the value of the score column as an indicator that tells you whether the student got the item right or wrong.
Instead, do the following:
scores <- aggregate(subsetdf[,c('score', 'points.possible')], by = list(subsetdf$student), sum)
names(scores) <- c('student', 'score','points.possible')
scores$avg.score <- scores$score/scores$points.possible
I would do the same for x.0 and x.1. If you create a subset by i.id and then aggregate that subset of the dataframe, this should also save you a few steps.
The fact that you are checking for each student whether or not they are in correct or wrong twice (for score and points.possible) is also pretty expensive.
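To illustrate, here is a sketch of how the function might look with those changes. It keeps your statistic, aggregates per student once, and uses the item score itself as the right/wrong indicator; fun2 and the merge step are my additions, not tested against your full data:
fun2 <- function(a.id, i.id) {
  subsetdf <- df[df$assessment == a.id, ]
  # per-student totals, then a numeric average score
  scores <- aggregate(subsetdf[, c("score", "points.possible")],
                      by = list(student = subsetdf$student), sum)
  scores$avg.score <- scores$score / scores$points.possible
  # the item score says who got item i.id right (1) or wrong (0)
  item <- subsetdf[subsetdf$item == i.id, c("student", "score")]
  m <- merge(scores, item, by = "student", suffixes = c("", ".item"))
  # pooled scores of the right/wrong groups, as in your x.1 and x.0
  x.1 <- sum(m$score[m$score.item == 1]) / sum(m$points.possible[m$score.item == 1])
  x.0 <- sum(m$score[m$score.item == 0]) / sum(m$points.possible[m$score.item == 0])
  p <- mean(m$score.item == 1)
  ((x.1 - x.0) * sqrt(p * (1 - p))) / sd(m$avg.score)
}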
I am trying to create two vectors of the 20th and 80th percentiles of monthly return data for companies from 1927 to 2013. The issue I have encountered is that in my nested for loop I don't know how to reference both the month and the year (i.e. the returns across all companies in April 1945). Right now the code looks like this:
qunatile<-function(r){
vec20<-c(rep(0,1038))
vec80<-c(rep(0,1038))
for(i in 1927:2013){
for(j in 1:12){
vec20[j+12(i-1927)]<-quantile(r$(i, j),20)
vec80[j+12(i-1927)]<-quantile(r$(i, j),80)
}
}
data1decilest<-rbind(ps1NYSE,vec20,vec80)
}
But I know that the r$(i, j) notation is not correct. I was wondering if anyone knew how to do what I am attempting with that clearly incorrect code (i.e. reference all returns from a given month in a given year).
Thank you!
One option that would eliminate the nested loops is to create a column in your dataframe that contains a month/year combo (e.g. "Jan1955", "Apr1999", etc.), then split your dataframe on that variable and apply the quantile functions. It's hard to say whether this solves your problem since there is no reproducible example. I assume here your data is called df and contains a date column and a value column.
library(lubridate)
library(plyr)
df$newtime <- paste0(month(df$date, label = T, abbr = T), year(df$date))
# quantile() expects probabilities in [0, 1], so 0.20/0.80 rather than 20/80
q20 <- function(df){ quantile(df$value, 0.20) }
q80 <- function(df){ quantile(df$value, 0.80) }
vec20 <- ddply(df, .(newtime), FUN=q20)
vec80 <- ddply(df, .(newtime), FUN=q80)
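If you would rather have both percentiles side by side, the same idea works in one pass (still assuming the date and value columns):
both <- ddply(df, .(newtime), summarise,
              q20 = quantile(value, 0.20),
              q80 = quantile(value, 0.80))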
I have this code, mostly written by one of the members here, which exports all the graphs I need from my data set under the condition that the trendline coefficient is positive (an increasing trendline).
lung <- read.csv("LAC.csv")
attach(lung) #data
age <- lung$Age
mirna <- data.frame(lung)
stuff <- data.frame(matrix(ncol = 500, nrow = 40))
pdf("test.pdf") # exports to pdf all the graphs
lapply(colnames(mirna)[-1],function(col){ #function for plotting
form <- formula(paste(col, "age", sep = "~"))
fit <- lm(form, mirna)
stuff_want <- stuff
if (coef(fit)[2] >0) { #plotting with condition
plot(form, mirna, xlab = "Age", main = "miRNA expression with increasing age") # data frame is mirna, not an undefined df
abline(fit, col = 4)
}
})
dev.off()
This gives me a pdf file which I was hoping to use later to check which of the miRNA in the dataset are required, and then isolate those columns manually. However, I severely underestimated the number of miRNA that meet the condition, and now face a new conundrum: how to export the data from the columns with an increasing trendline into a separate data frame, which I would later save as a .csv file and use for further analysis.
Please keep in mind my knowledge of R is very limited, although I am spending days in Rhelp and books. My idea was to create a separate data frame (stuff_want) to which the columns that satisfy the condition (coef(lm()) > 0) would be transferred. My initial thought was to use the append() function and, under the if condition, write append(stuff_want, mirna, after = length(mirna)), followed by write.csv(). The output of this is just an NA-filled .csv file.
Anyone able to explain to me why this is not working?
All the best,
Paulius
So here is one way (similar to #agstudy's comment), using the same made-up data as in my previous answer:
# make up some data
x <- seq(1,10,len=100)
set.seed(1) # for reproducible example
df <- data.frame(x,y1=1+2*x+rnorm(100),
y2=3-4*x+rnorm(100),
y3=2+0.001*x+rnorm(100))
# you start here...
result <- sapply(colnames(df)[-1],function(col){
form <- formula(paste(col,"x",sep="~"))
fit <- lm(form,df)
if(coef(fit)[2] > 0) TRUE else FALSE
})
cols <- names(result)[result]
cols
# [1] "y1" "y3"
This creates a named vector, result, whose elements have the same names as your response variables, with values TRUE if that variable has a positive slope and FALSE otherwise. Then
cols <- names(result)[result]
is a vector of the variable names with slope > 0. Finally, to extract the actual data (from df here, or mirna in your case), you would use:
stuff_want <- df[, cols]
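From there, writing the result out for further analysis is one more line (the file name is just an example):
write.csv(stuff_want, "increasing_trend_mirna.csv", row.names = FALSE)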