R mapply function running slowly

I'm working on calculating item statistics for a series of multiple-choice exams. I've got a solution using mapply that technically works, but it takes a few hours to calculate one of the more complicated statistics. The first dataset includes a separate row for every question that every student answered on every assessment.
df <- data.frame(c(rep("s1", 5), rep("s2", 5), rep("s3", 5), rep("s4", 5)),
                 "a1", c("i1", "i2", "i3", "i4", "i5"), c(1, 0), 1)
colnames(df) <- c("student", "assessment", "item", "score", "points.possible")
The first step that I do (and only do once) is to create a table of all unique items. In this case, that would be simple, as there's only one assessment and 5 items.
unique <- subset(df[,c("assessment", "item")], !duplicated(df[,c("assessment", "item")]))
I then need to calculate a statistic for each one of these items. However, the tricky part is that the calculation requires the overall scores that students got on the entire assessment. Here's the function I wrote to do that.
fun1 <- function(a.id, i.id) {
  # subset original dataframe for just one assessment
  subsetdf <- df[df$assessment == a.id, ]
  # generate list of students that got the item right and wrong
  correct <- subsetdf$student[subsetdf$item == i.id & subsetdf$score == 1]
  wrong <- subsetdf$student[subsetdf$item == i.id & subsetdf$score == 0]
  # scores by student
  scores <- aggregate(score ~ student, data = subsetdf, sum) /
    aggregate(points.possible ~ student, data = subsetdf, sum)
  # average scores for students that got item right/wrong
  x.1 <- sum(subsetdf$score[subsetdf$student %in% correct]) /
    sum(subsetdf$points.possible[subsetdf$student %in% correct])
  x.0 <- sum(subsetdf$score[subsetdf$student %in% wrong]) /
    sum(subsetdf$points.possible[subsetdf$student %in% wrong])
  # percent of students that got item right
  p <- length(correct) / (length(correct) + length(wrong))
  # final stat calculation
  r <- ((x.1 - x.0) * sqrt(p * (1 - p))) / sd(scores[, 2])
  print(r)
}
I then used mapply to loop this function over the entire original dataset while using the smaller dataset for the inputs.
unique$r <- mapply(fun1, unique$assessment, unique$item)
I was happy that I was able to get it to work, but when I do it with the larger datasets (~7 million rows for "df", ~2000 rows for "unique"), it takes quite a while (several hours). Any tips on other ways to tackle this problem that are more efficient? I've learned that one problem is that my function creates a copy of the original large dataset every time it loops, but I don't know how to do the problem without that.
I still consider myself a beginner for this kind of usage for R, so any advice would be much appreciated!

When you perform
scores <- aggregate(score ~ student, data=subsetdf,sum)/aggregate(points.possible ~ student, data=subsetdf, sum)
the result is not strictly numeric; it is a data frame (for example, for a.id = 'a1', i.id = 'i1'):
> aggregate(score ~ student, data = subsetdf, sum)
  student score
1      s1     3
2      s2     2
3      s3     3
4      s4     2
So when you divide the two data frames, the division 's1'/'s1' on the student column is not numeric and throws a warning.
There is no need to create correct and wrong. Treat the value of that column as an indicator to tell you whether the student was right or wrong.
Instead, do the following:
scores <- aggregate(subsetdf[,c('score', 'points.possible')], by = list(subsetdf$student), sum)
names(scores) <- c('student', 'score','points.possible')
scores$avg.score <- scores$score/scores$points.possible
I would do the same for x.0 and x.1. If you create a subset by i.id and then aggregate that subset of the dataframe, this should also save you a few steps.
The fact that you check twice for each student whether or not they are in correct or wrong (once for score and once for points.possible) is also pretty expensive.
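To make these suggestions concrete, here is a minimal sketch of what the rewrite could look like (my own code, not from the original answer; it assumes every student answered every item of an assessment, so that "not correct" is the same as "wrong"). The per-student totals are computed once per assessment and reused for every item, so the large data frame is subset once per assessment instead of once per item:
# Build the per-student score table once per assessment (the aggregate
# suggested above)
make.scores <- function(subsetdf) {
  scores <- aggregate(subsetdf[, c("score", "points.possible")],
                      by = list(student = subsetdf$student), sum)
  scores$avg.score <- scores$score / scores$points.possible
  scores
}

# Item statistic computed from the small per-student table; assumes every
# student answered every item, so "not correct" means "wrong"
fun2 <- function(i.id, subsetdf, scores) {
  right <- scores$student %in%
    subsetdf$student[subsetdf$item == i.id & subsetdf$score == 1]
  x.1 <- sum(scores$score[right])  / sum(scores$points.possible[right])
  x.0 <- sum(scores$score[!right]) / sum(scores$points.possible[!right])
  p <- mean(right)
  ((x.1 - x.0) * sqrt(p * (1 - p))) / sd(scores$avg.score)
}

# One subset per assessment instead of one per item
unique$r <- NA
subsetdf <- df[df$assessment == "a1", ]
scores <- make.scores(subsetdf)
idx <- unique$assessment == "a1"
unique$r[idx] <- sapply(unique$item[idx], fun2, subsetdf, scores)
The per-item work then only touches the small scores table, which should cut the per-item cost dramatically.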

Related

Is there an R function that can create an adjacency list / edge list / adjacency dataframe from a csv, to then use in iGraph?

I am trying to perform a Social Network Analysis of Congressional Roll Call data. The data I have comes as a csv, from voteview.com, and has the following format:
[screenshot of the csv format]
There are a high number of unique bills (represented by roll number) that I need to loop through to see how often politicians (represented by icpsr) agree in their vote (represented by cast_code).
However, I am really unsure how I would loop through this data frame, check whether two politicians voted the same way on a unique bill, and then add that to a new data frame with three columns [politician 1 | politician 2 | weight (how many times they voted the same on unique bills)].
I have produced the following code when there was just a single bill being considered, which was able to get me a network map:
#0. packages used below
library(dplyr)
library(igraph)
#1. creating a dataframe with all the yayers and one with all the nayers
yay_list <- S117 %>% filter(cast_code == '1')
nay_list <- S117 %>% filter(cast_code == '6')
#2. a list of the icpsr numbers who agree for yay and nay
y_list <- list(yay_list$icpsr)
n_list <- list(nay_list$icpsr)
#3. trying to use this list to make an igraph graph - BUT it does not recognise it
# I am not sure where to go next
make_ring(yay_list)
a1 <- as_adj_list(y_list)
#4. Alternative method - using only columns for icpsr & cast_code
# this will make an edge/adjency style data frame
foo <- S117[, c("icpsr", "cast_code")]
library(plyr)
# define a function returning the edges for a single group
group.edges <- function(x) {
  edges.matrix <- t(combn(x, 2))
  colnames(edges.matrix) <- c("Sen_A", "Sen_B")
  edges.df <- as.data.frame(edges.matrix)
  return(edges.df)
}
# apply the function above to each group and bind altogether
all.edges <- do.call(rbind, lapply(unstack(foo), group.edges))
# add weights if needed
#all.edges$weight <- 1
#all.edges <- aggregate(weight ~ Sen_A + Sen_B, all.edges, sum)
all.edges
#convert to a dataframe for igraph
df <- data.frame(all.edges)
df
# use igraph function on new datafame and plot
g <- graph_from_data_frame(df)
print(g, e=TRUE, v=TRUE)
plot(g)
# a plot is produced, which is good, but I do not know how to do this for
# a situation where there are multiple bills - it seems very complicated
Does anyone have any advice on how I would create a similar style edge list data frame, ideally with weights (as there are many bills in the data frame not just 1)?
The weight should show how many times politicians vote the same way (either yay or nay) on unique bills.
Thanks!
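For what it's worth, here is a minimal sketch of one way to extend the single-bill combn approach above to the whole dataset (my own code, not from the original thread, and it assumes the standard voteview column names rollnumber, icpsr, and cast_code): split the voters by bill and vote value, build each agreeing pair once per group, then sum a weight over all groups.
library(igraph)

# keep only yea (1) and nay (6) votes; compared numerically here, use
# '1'/'6' if your cast_code column is character as in the snippet above
agree <- subset(S117, cast_code %in% c(1, 6))

# voters who cast the same vote on the same bill form one group
groups <- split(agree$icpsr,
                interaction(agree$rollnumber, agree$cast_code, drop = TRUE))
groups <- groups[sapply(groups, function(x) length(unique(x)) >= 2)]

# one row per agreeing pair per group; sorting makes (A,B) match (B,A)
edge_list <- do.call(rbind, lapply(groups, function(x) {
  m <- t(combn(sort(unique(x)), 2))
  data.frame(Sen_A = m[, 1], Sen_B = m[, 2])
}))

# weight = number of (bill, vote) groups in which the pair agreed
edge_list$weight <- 1
all_edges <- aggregate(weight ~ Sen_A + Sen_B, edge_list, sum)

g <- graph_from_data_frame(all_edges, directed = FALSE)
plot(g, edge.width = E(g)$weight)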

Print multiple Outputs stored in a vector with a Loop

I'm new to R and coding in general...
I have computed multiple ANOVA analyses on multiple columns (16 in total).
For that purpose, the purrr package helped me:
anova_results_5sector <- purrr::map(df_anova_ch[,3:18], ~aov(.x ~ df_anova_ch$Own_5sector))
summary(anova_results_5sector[[1]])
So the dumbest way to retrieve the output (p-value, etc.) is the following:
summary(anova_results_5sector$Env_Pillar)
summary(anova_results_5sector$Gov_Pillar)
summary(anova_results_5sector$Soc_Pillar)
summary(anova_results_5sector$CSR_Strat)
summary(anova_results_5sector$Comm)
summary(anova_results_5sector$ESG_Comb)
summary(anova_results_5sector$ESG_Contro)
summary(anova_results_5sector$ESG_Score)
summary(anova_results_5sector$Env_Innov)
summary(anova_results_5sector$Human_Ri)
summary(anova_results_5sector$Management)
summary(anova_results_5sector$Prod_Resp)
I've tried to use a loop :
for(i in 1:length(anova_results_5sector)){
  summary(anova_results_5sector$[i])
}
It didn't work; I don't know how to deal with $ in a for loop, and couldn't find an answer.
Here is a look at the structure of the output vector:
[screenshot of the output structure]
I have tried several times with other methods, more or less complicated. Often the examples found online are too simple and do not allow me to adapt them to my data.
Any tips?
Thank you, and sorry for such a noobie question
Whenever I use a loop for an analysis I like to store the results in a data.frame; it allows me to keep a good overview. Since you did not provide a reproducible example, I used the iris dataset:
data("iris")
#make a data frame to store the results with as many columns and rows as you need
anova_results <- data.frame(matrix(ncol = 3, nrow = 3))
#one column per value you want to store and one row per anova you want to run
x <- c("number", "Mean_Sq", "p_value") #assign all values you want to store as column names
colnames(anova_results) <- x
anova_results$number <- 1:3 #assign numbers for each anova you want to run, e.g. 3
In the loop you can now extract the results of the anova that you are interested in; I use mean squares and the p-value as examples, but you can of course add others. Don't forget to add a column for any other values you want to store.
for (i in 2:4){
  my_anova <- aov(iris[[1]] ~ iris[[i]])
  p <- summary(my_anova)[[1]][["Pr(>F)"]][1] #extract the p value
  anova_results$p_value[anova_results$number == i-1] <- p
  mean <- summary(my_anova)[[1]][["Mean Sq"]][1] #extract the mean squares
  anova_results$Mean_Sq[anova_results$number == i-1] <- mean
}
View(anova_results)
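Two side notes (my additions, not part of the answer above). First, the loop in the question fails because $ cannot take a computed index, while [[ can, so summary(anova_results_5sector[[i]]) works. Second, since the models are already stored in a list built with purrr, the p values can also be pulled out in a single purrr call:
library(purrr)
# one p value per stored aov object; anova_results_5sector is the list
# created with purrr::map in the question
p_values <- map_dbl(anova_results_5sector,
                    ~ summary(.x)[[1]][["Pr(>F)"]][1])
p_values # named numeric vector, one entry per response column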

Creating a multidimensional list to replace subsetting - is it worth it?

Basic idea:
As the title says, is it a good idea to substitute subsetting a data frame with a multidimensional list?
I have a function that needs to generate a subset from a quite big data frame close to 30 thousand times. Thus, creating a 4-dimensional list would give me instant access to the subset, without losing time generating it.
However, I don't know how R treats these objects, so I would like your opinion on it.
More concrete example if needed:
What I was trying to do is to use the KNN imputation method. Basically, the algorithm says that a value flagged as an outlier has to be replaced using its K (K is a number; it could be 1, 2, 3...) closest neighbors. The neighbors in this example are the rows with the same attributes in the first 4 columns, and the closest neighbors are the ones with the smallest difference in the fifth column. If what I said is not clear, please still consider reading the code, because I found it hard to describe in words.
These are the objects:
#create a vector with random values
values <- floor(runif(5e7, 0, 50))
possible.outliers <- floor(runif(5e7, 0, 10000))
#use these values, mixed up, to create a data frame
df <- data.frame(sample(values), sample(values), sample(values),
                 sample(values), sample(values), sample(possible.outliers))
#all the values greater than 800 will be marked as outliers
df$isOutlier = df[,6] > 800
This is the function which will be used to replace the outliers
#with the generated data frame, do this function
#Parameters:
# *df: the entire data frame from above
# *vector.row: the row that was marked as containing an outlier. The
#              outlier will be replaced with the return value of this function
# *numberK: the number of neighbors to take into account.
# !Very important: consider that, for the last column, the higher the
#                  difference between the values, the less attractive
#                  they are for imputation.
foo <- function(df, vector.row, numberK){
  #find the neighbors
  subset = df[ vector.row[1] == df[,1] & vector.row[2] == df[,2] &
               vector.row[3] == df[,3] & vector.row[4] == df[,4] , ]
  #taking the "distance" from the rows, so it can find which are the
  #closest neighbors
  subset$distance = subset[,5] - vector.row[5]
  #no need to implement
  "function that finds the closest neighbors from the distance on subset"
  return(mean(ClosestNeighbors))
}
So, the function's runtime is quite long. For this reason, I am searching for alternatives, and I thought that maybe I could replace the subsetting with something like this:
list[["Levels Col1"]][["Levels Col2"]][["Levels Col3"]][["Levels Col4"]]
What this should do is give instant access to the subset, instead of generating it every time inside the function.
Is it a reasonable idea? I'm a noob in R.
If you did not understand what is written, or would like something to be explained in more detail or in other words, please tell me, because I know it is not the most direct question.
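For what it's worth, here is a minimal sketch of the lookup idea using split() rather than a hand-built four-level list (my own code, and it assumes vector.row is a plain vector whose first four entries match the grouping columns exactly):
# build the lookup once; interaction() keys each group as "v1.v2.v3.v4"
lookup <- split(df, interaction(df[, 1], df[, 2], df[, 3], df[, 4],
                                drop = TRUE))

foo2 <- function(lookup, vector.row, numberK) {
  # reconstruct the key in the same format interaction() used
  key <- paste(vector.row[1:4], collapse = ".")
  subset <- lookup[[key]]  # named-list lookup instead of scanning 5e7 rows
  subset$distance <- abs(subset[, 5] - vector.row[5])
  # mean of the numberK closest neighbors in the fifth column
  mean(head(subset[order(subset$distance), 5], numberK))
}
The trade-off is memory: split() materializes every group up front, which can be substantial for a data frame of this size.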

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do: I want to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop that selects one year at a time from a huge dataset and applies a series of functions to narrow the data down to the observed pairs, adding a relatedness column whose values are transferred from another dataset. Within this, I apply a second for loop that builds a data frame of all possible paired birds in that year, again with a transferred relatedness column. From these two data frames of pairs and relatedness within each year, I want to apply the Wilcoxon test to find the p value for each given year, and transfer those p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
  allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
  allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1, genome$pair2),]
  allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS, allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
  breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
  breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
  unq_breeder_list_males <- unique(breeder_list_males)
  unq_breeder_list_females <- unique(breeder_list_females)
  all_poss_combo <- list()
  for(i in unq_breeder_list_males){
    print(i)
    all_poss_combo[[i]] <- paste0(i, ",", unq_breeder_list_females)
  }
  lapply(X = all_poss_combo, FUN = function(x) length(unique(x)))
  all_poss_df <- unlist(all_poss_combo, use.names = F)
  all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
  all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
  all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
  all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1, genome$pair2),]
  all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
  wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')
}
To be honest, I'm not even sure this for loop will work (it seems pretty complex to me, but I am a beginner), though I was told a for loop should work for this situation. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it works and how to store its results in a data frame.
Thank you so much for any help given!
If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # example from running ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value
So with your dataset, perhaps putting this at the bottom of your for loop (indexing the results row by year, since j runs over 2000:2013 rather than 1:14):
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
                                                 all_poss_df_geno$relatedness, alternative='greater')$p.value

Find values of DataFrame A in DataFrame B and replace them with other column's values (create unique identifier for panel study)

I want to find values from one DataFrame A in another
DataFrame B and replace them with values from another column of B.
My problem in detail:
I have ten datasets that together compose one panel study. That means that some people have been interviewed more than once and, accordingly, have rows in more than one of those datasets. Each dataset also contains new study members who had not been interviewed before.
Unfortunately, the unique identifier to represent study members has to be deleted, for some reason. I have to replace it with another unique identifier without losing the panel quality (that is, the same person has to be associated with the same identifier in all datasets they appear in).
My idea was to:
a) load all datasets,
b) create a DataFrame as an allocation table mapping each old identifier to a newly created identifier, and then
c) "search and replace" identifiers in the original datasets.
The last step doesn't work.
I have two questions:
1) Does anyone know how to do step c)?
2) My way seems cumbersome. Does anyone know a simpler solution?
a) Loading
library(foreign)
#df1 <- read.spss("D1.sav", to.data.frame=TRUE, use.value.labels=FALSE)
#df2 <- read.spss("D2.sav", to.data.frame=TRUE, use.value.labels=FALSE)
#... (for all 10 datasets)
#For the sake of the example two random datasets that also
#include some NA and "overlap"
df1 <- c(NA,NA,NA,seq(100,200))
df2 <- c(NA,NA,NA,seq(150,250))
b) Creating the allocation table (only for unique ids, because people who were interviewed more than once should receive the same allocation in all datasets (panel study))
df <- data.frame(id=c(df1,df2),
alloc=c(df1,df2))
df <- subset(df, !duplicated(df$id))
df$alloc <- 1:dim(df)[1]
c) Overwriting the old identifier with a new one (doesn't work)
Here for the example datasets:
df1 <- ifelse(df1 %in% df[,1], df[,2], df1)
df2 <- ifelse(df2 %in% df[,1], df[,2], df2)
#With the real datasets in this form:
#df1$identifier <- ifelse(df1$identifier %in% df[,1],
#                         df[,2], df1$identifier)
#df2$identifier <- ifelse(df2$identifier %in% df[,1],
#                         df[,2], df2$identifier)
#... (for all 10 datasets)
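A minimal sketch of how step c) could work with match() instead of ifelse() (my suggestion, not from the original post): match() returns the position of each old id in the allocation table, so the new ids can be looked up in one vectorized step.
# look up each old id's row in the allocation table df and take its alloc;
# whole vectors are replaced at once, with no elementwise ifelse()
df1 <- df$alloc[match(df1, df$id)]
df2 <- df$alloc[match(df2, df$id)]

# with the real datasets, per identifier column:
# df1$identifier <- df$alloc[match(df1$identifier, df$id)]
Note that the example's NA entries also receive an alloc number, because the allocation table keeps one NA row; if NAs should stay NA, drop them from df first.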
