R - Trying to avoid a loop here

First post ever here, but I've been reading a lot so thanks!
I have a huge data frame with many columns, but only 4 matter here:
dates/classes/names/grades.
For each date, I have several classes (with students), each with several people (names; always the same people in their respective classes), each one having ONE grade per date.
On the first date, I retrieve the best student per class based on his grade, using max().
However, for the next dates, I want to do the following:
If the previous best student is still in the top 3 of his class, then we consider him to still be the best one.
Else, we consider the new 1st student to be the best one.
Hence, every date depends on the previous one.
Is it possible to do this without a loop?
I can't find out how, as every iteration depends on the previous one.
This is my code below.
Apologies if it's not optimized!
Thanks a lot :)
for (i in (1:(length(horizon)-1))) #horizon is the vector of dates
{
  uni3 <- dataaf[dataaf[,1] == as.numeric(horizon[i]),] #dataaf contains all the data, we only keep the rows for the considered date i
  if (i == 1) #we take the best student per class
  {
    selecdate <- data.frame() #selecdate is the dataframe containing the best people for this date
    for (z in (1:15)) #15 classes
    {
      selecsec <- na.omit(uni3[uni3[,14] == z,]) #classes are column 14
      ligneselec <- max(selecsec[,13]) #grades are column 13
      selecsec <- data.frame(uni3[match(ligneselec,uni3[,13]),])
      selecdate <- rbind(selecdate,selecsec)
    }
  }
  else { #we keep a student if he was in the previous top 3, else we take the best one
    selecdate <- data.frame()
    for (z in (1:15))
    {
      lastsec <- na.omit(lastdate[lastdate[,14] == z,]) #last results
      #retrieving the top 3 people this date
      selecsec <- na.omit(uni3[uni3[,14] == z,])
      newligneselec <- tail(sort(selecsec[,13]),3)
      selecsec <- data.frame(selecsec[rev(match(newligneselec,selecsec[,13])),])
      if((length(match(selecsec[,3],lastsec[,3])[!is.na(match(selecsec[,3],lastsec[,3]))]) == 0))
      {
        ligneselec <- max(selecsec[,13])
        selecsec <- data.frame(uni3[match(ligneselec,uni3[,13]),])
      }
      else
      {
        selecsec <- lastsec
      }
      selecdate <- rbind(selecdate,selecsec)
    }
  }
  lastdate <- selecdate #recording the last results
}
EDIT : Here is an example.
On date 1, John and Audrey are selected in classes 1 and 2, respectively.
On date 2, John is still among the best 3, so he remains selected,
while Audrey is only 4th, so Jim (ranked 1st on date 2) replaces
her.
On date 3, John is still among the best 3, so he remains selected (there are no ties in the data I work on). Jim is now 4th, so Sandra takes his place.
structure(list(Dates = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("12/02", "13/02", "14/02"
), class = "factor"), Classes = c(1, 1, 1, 1, 1, 2, 2, 2, 2,
2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2
), Names = structure(c(6L, 3L, 9L, 7L, 1L, 8L, 4L, 10L, 5L, 2L,
6L, 3L, 9L, 7L, 1L, 8L, 4L, 10L, 5L, 2L, 6L, 3L, 9L, 7L, 1L,
8L, 4L, 10L, 5L, 2L), .Label = c("Ashley", "Audrey", "Bob", "Denis",
"Jim", "John", "Kim", "Sandra", "Terry", "Tim"), class = "factor"),
Grades = c(10, 5, 3, 2, 1, 3, 4, 5, 6, 7, 8, 2, 10, 9, 1,
7, 5, 1, 8, 2, 5, 1, 4, 8, 8, 7, 6, 5, 4, 3)), .Names = c("Dates",
"Classes", "Names", "Grades"), row.names = c(NA, -30L), class = "data.frame")

Edited to reflect clarified request in the comments.
###---------- CREATING THE DATA (may be different from what you had in mind)
# Classes and Students
Classes <- c("U.S. History", "English", "NonLinear Optimization")
Students <- c("James", "Jamie", "John", "Jim", "Jane", "Jordan", "Jose")
df.1 <- expand.grid(Classes = Classes, Students = Students, stringsAsFactors = T)
# Generate Dates
Dates.seq <- seq(as.Date("2017/2/10"), as.Date("2017/3/27"), "days")
df.2 <- merge(Dates.seq, df.1)
# Generate Grades
grading <- c(4.0, 3.7, 3.3, 3.0, 2.7, 2.3, 2.0, 1.7)
Grades <- sample(grading, size = dim(df.2)[1], replace = T, prob = grading/sum(grading)) # smart students
df <- data.frame(df.2, Grades)
colnames(df) <- c("Dates","Classes","Students","Grades")
# Works assuming your df has the following labeled and formatted columns
str(df)
#'data.frame': 966 obs. of 4 variables:
# $ Dates : Date, format: "2017-02-10" "2017-02-11" "2017-02-12" ...
# $ Classes : Factor w/ 3 levels "U.S. History",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Students: Factor w/ 7 levels "James","Jamie",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ Grades : num 2.3 3.3 2.3 3.3 2.7 4 4 1.7 2.3 4 ...
# No aggregation, just splitting by classes
df.split1 <- split(df, df[,"Classes"])
# Then splitting each of those lists by Dates
df.split2 <- lapply(df.split1, function(x) split(x, x[,"Dates"]))
# double the lapply because now we have lists within lists
top1 <- lapply(df.split2, function(i) lapply(i, function(j) j[order(-j[,"Grades"])[1], "Students"]))
top3 <- lapply(df.split2, function(i) lapply(i, function(j) j[order(-j[,"Grades"])[1:3], "Students"]))
# Easier to read
AllClasses <- levels(df[,"Classes"])
AllDates <- unique(df[,"Dates"])
# Initialize a matrix to keep track of changes in the Top1 and Top3
superstar <- matrix(NA, nrow = length(AllDates), ncol = length(AllClasses),
                    dimnames = list(as.character(AllDates), AllClasses))
# Looping
for(date in 1:length(AllDates)){
  for(class in AllClasses){
    if(date == 1){
      # First NewTop1 = First Top1
      superstar[date, class] <- unlist(top1[[class]][date])
    } else {
      # If superstar in date-1 is in the Top3 of date now,
      if(superstar[date-1, class] %in% as.numeric(unlist(top3[[class]][date]))){
        # still superstar
        superstar[date,class] <- superstar[date-1, class]
      } else{
        # new superstar is highest scorer of date now
        superstar[date,class] <- unlist(top1[[class]][date])
      }
    }
  }
}
# painful for me trying to figure out how to convert superstar numbers to names but this worked
superstar.char <- as.data.frame(matrix(levels(df[,"Students"])[superstar], ncol = length(AllClasses)))
dimnames(superstar.char) <- dimnames(superstar)
superstar.char # superstar with Students as characters
Let me know if you have any difficulties!

Anything you would otherwise solve with a loop can also be solved with a recursive function (a function that calls itself). Since the behavior of the function depends on i, you'll need to pass i as a parameter to the function. The function also needs to recognize when it is done and return the result set.
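A minimal sketch of that recursive idea, applied to the example data from the question. `pick_best` is a hypothetical helper name, not something from the original post, and the sketch assumes the date labels sort chronologically when the data is split. Each call handles one date and hands the winners found so far to the next call:

```r
# Example data in the same shape as the question's dput() output
dat <- data.frame(
  Dates   = rep(c("12/02", "13/02", "14/02"), each = 10),
  Classes = rep(rep(c(1, 2), each = 5), times = 3),
  Names   = rep(c("John", "Bob", "Terry", "Kim", "Ashley",
                  "Sandra", "Denis", "Tim", "Jim", "Audrey"), times = 3),
  Grades  = c(10, 5, 3, 2, 1, 3, 4, 5, 6, 7,
              8, 2, 10, 9, 1, 7, 5, 1, 8, 2,
              5, 1, 4, 8, 8, 7, 6, 5, 4, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 'days' is a list of per-date data frames for ONE class,
# in chronological order; 'best' holds the winners found so far.
pick_best <- function(days, best = character(0)) {
  if (length(days) == 0) return(best)           # base case: no dates left
  today  <- days[[1]]
  ranked <- today$Names[order(-today$Grades)]   # highest grade first
  prev   <- tail(best, 1)                       # previous winner, if any
  keep   <- length(prev) == 1 && prev %in% head(ranked, 3)
  # recurse on the remaining dates with the updated list of winners
  pick_best(days[-1], c(best, if (keep) prev else ranked[1]))
}

# One recursion per class
result <- lapply(split(dat, dat$Classes), function(cl) {
  pick_best(split(cl, cl$Dates))
})
result
```

For the example this keeps John on all three dates in class 1 and gives Audrey, Jim, Sandra in class 2, which matches the description in the question. R caps how deeply calls can nest, so for a long vector of dates the plain loop is probably the safer choice.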

Related

R Convert categorical data to dummy set by other variable

I have this data set. I have included a screenshot of the real data instead of code; sorry for the mess, I am new to R.
[screenshot of the data set]
I want to convert the categorical column "13 Source" into a set of dummy variables, summarized by "HH No". The result should look like this:
[screenshot of the desired output]
I've tried to.dummy from varhandle and model.matrix, but both gave me a messy dataset.
Could anybody help me with this?
Thanks a million in advance
There are a number of ways to make dummy variables from factors - here is one way to create a summary presence table.
Assume df is your data frame. You can use xtabs to start with, which will create a frequency table from your 2 columns.
By comparing to see if your values are > 0, you will get TRUE if > 0, and FALSE otherwise. Adding 0 at the end will make TRUE the number 1 and FALSE the number 0.
(xtabs(~ HH_No + Source, df) > 0) + 0
Output
     Source
HH_No Deep_well Rainwater
    1         1         1
    3         1         1
    4         0         1
Data
df <- structure(list(HH_No = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3,
3, 3, 4, 4), Source = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L), .Label = c("Deep_well",
"Rainwater"), class = "factor")), class = "data.frame", row.names = c(NA,
-16L))
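To see why the one-liner works, here is the same computation spelled out step by step on the answer's data. `table()` is swapped in for `xtabs()` (both give the same counts here), and `as.data.frame.matrix()` is used to get an ordinary data frame back; both are additions for illustration rather than part of the answer.

```r
# Same values as the answer's dput() data, written out by hand
df <- data.frame(
  HH_No  = c(1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 4, 4),
  Source = c(rep("Rainwater", 4), rep("Deep_well", 3),
             rep("Rainwater", 6), "Deep_well",
             rep("Rainwater", 2))
)

counts   <- table(df$HH_No, df$Source)  # how often each source occurs per household
present  <- counts > 0                  # TRUE if the source occurs at least once
presence <- present + 0                 # arithmetic coerces TRUE/FALSE to 1/0

# Ordinary data frame with one row per household (row names hold HH_No)
dummies <- as.data.frame.matrix(presence)
dummies
```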

How to plot data from a 3 columns dataframe as a heatmap plot in R?

I'm new to R and I would appreciate your help.
I have a 3 columns df that looks like this:
> head(data)
          V.hit    J.hit  frequency
1 IGHV1-62-3*00 IGHJ2*00 0.51937442
2   IGHV5-17*00 IGHJ3*00 0.18853542
3    IGHV3-5*00 IGHJ1*00 0.09777304
4    IGHV2-9*00 IGHJ3*00 0.03040866
5   IGHV5-12*00 IGHJ4*00 0.02900040
6   IGHV5-12*00 IGHJ2*00 0.00910554
This is just part of the data, as an example. I want to create a heatmap so that the x-axis is "V.hit", the y-axis is "J.hit", and the values of the heatmap are the frequency (I'm interested in the frequency for each combination of V and J). I tried to use this code for the interpolation:
library(akima)
newData <- with(data, interp(x = `V hit`, y = `J hit`, z = frequency))
but I'm getting this error:
Error in interp.old(x, y, z, xo, yo, ncp = 0, extrap = FALSE, duplicate = duplicate, :
missing values and Infs not allowed
so I don't know how to deal with it. I want to achieve this final output:
> head(fld)
# A tibble: 6 x 5
  ...1        `IGHJ1*00` `IGHJ2*00` `IGHJ3*00` `IGHJ4*00`
  <chr>            <dbl>      <dbl>      <dbl>      <dbl>
1 IGHV10-1*00  0.00233     0.00192   NA          0.000512
2 IGHV1-14*00 NA          NA          0.00104   NA
3 IGHV1-18*00 NA           0.000914  NA         NA
4 IGHV1-18*00 NA          NA          0.000131  NA
5 IGHV1-19*00  0.0000131  NA         NA         NA
6 IGHV1-26*00 NA           0.000214  NA         NA
where cells that are NA would be set to 0.
I assume I will then be able to use the heatmap function to create my heatmap. Any help would be really appreciated!
Here is an idea using geom_tile(). Your data is called foo. I created all possible combinations of V.hit and J.hit using complete(), and asked complete() to fill missing values with 0. Then, I used geom_tile() to produce the graphic. You may want to consider the order of the levels, if necessary.
library(tidyverse)
complete(foo, V.hit, nesting(J.hit), fill = list(frequency = 0)) %>%
ggplot(aes(x = J.hit, y = V.hit, fill = frequency)) +
geom_tile()
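If you specifically want the wide V-by-J table from the question (with 0 for missing combinations) and a base R heatmap, a sketch along these lines might work. `foo` here is a hypothetical stand-in built from the six rows shown in the question, and `xtabs()` plus `heatmap()` are swapped in for the `complete()` and `geom_tile()` approach above.

```r
# Hypothetical stand-in: the six rows printed in the question
foo <- data.frame(
  V.hit = c("IGHV1-62-3*00", "IGHV5-17*00", "IGHV3-5*00",
            "IGHV2-9*00", "IGHV5-12*00", "IGHV5-12*00"),
  J.hit = c("IGHJ2*00", "IGHJ3*00", "IGHJ1*00",
            "IGHJ3*00", "IGHJ4*00", "IGHJ2*00"),
  frequency = c(0.51937442, 0.18853542, 0.09777304,
                0.03040866, 0.02900040, 0.00910554)
)

# Rows = V.hit, columns = J.hit; combinations absent from the data become 0
wide <- xtabs(frequency ~ V.hit + J.hit, data = foo)

# Plain numeric matrix with the same row and column names
m <- matrix(as.numeric(wide), nrow = nrow(wide), dimnames = dimnames(wide))

# No clustering or reordering, no row scaling: colors reflect raw frequencies
heatmap(m, Rowv = NA, Colv = NA, scale = "none")
```

Note that `xtabs()` sums the frequencies if the same V/J combination appears more than once, which may or may not be what you want for your full data.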
In base R we could adapt @GregSnow's solution for a correlation matrix to a frequency heatmap.
First, we cut the vector, say into quartiles (the default in quantile) and get factor values.
dat$freq.fac <- cut(dat$frequency, quantile(dat$frequency, na.rm=TRUE), include.lowest=T)
Second to prepare the colors, we just copy the factor column and relevel them with builtin heat.colors and a white color for the zero values.
dat <- within(dat, {
freq.col <- freq.fac
levels(freq.col) <- c(heat.colors(length(levels(dat$freq.fac)), rev=T), "#FFFFFF")
})
Third, apply white color to NAs or zero value respectively.
dat$freq.col[is.na(dat$freq.col)] <- "#FFFFFF"
dat$frequency[is.na(dat$frequency)] <- 0
Fourth, apply xtabs and create a color matrix and match colors and levels after.
dat.x <- xtabs(frequency ~ v.hit + j.hit, dat)
col.m <- matrix(dat$freq.col[match(dat$frequency, as.vector(dat.x))], nrow=nrow(dat.x))
Finally plot using rasterImage function.
op <- par(mar=c(.5, 4, 4, 3)+.1) ## adapt outer margins
plot.new()
plot.window(xlim=c(0, 5), ylim=c(0, 5))
rasterImage(col.m, 0, 1, 5, 5, interpolate=FALSE)
rect(0, 1, 5, 5) ## frame it with a box
## numbers in the cells
text(col(round(dat.x, 3)) - .5, 5.45 - row(round(dat.x, 3))*.8, round(dat.x, 3))
mtext("Frequency heatmap", 3, 2, font=2, cex=1.2) ## title
mtext(rownames(dat.x), 2, at=5.45 -(1:5)*.8, las=2) ## y-axis
mtext(colnames(dat.x), 3, at=(1:5)-.5) ## x-axis (upper)
## a legend
legend(-.15, .75, legend=c("Frequency:\t", 0, paste("<", seq(.25, 1, .25))), horiz=TRUE,
pch=c(NA, rep(22, 5)), col=1, pt.bg=c(NA, levels(dat$freq.col)[c(5, 1:4)]),
bty="n", xpd=TRUE, cex=.75, text.font=2)
par(op) ## reset margins
Yields: [image: the resulting frequency heatmap]
Toy data:
dat <- structure(list(v.hit = structure(c(1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L), .Label = c("A", "B", "C", "D", "E"), class = "factor"),
j.hit = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L
), .Label = c("F", "G", "H", "I", "J"), class = "factor"),
frequency = c(NA, NA, 0.717618508264422, NA, NA, 0.777445221319795,
NA, 0.212142521282658, 0.651673766085878, 0.125555095961317,
NA, 0.386114092543721, 0.0133903331588954, NA, 0.86969084572047,
0.34034899668768, 0.482080115471035, NA, 0.493541307048872,
0.186217601411045, 0.827373318606988, NA, 0.79423986072652,
0.107943625887856, NA)), row.names = c(NA, -25L), class = "data.frame")
You can interpolate with a linear model if the variables correlate.
library(plotly)
# df is assumed to have numeric columns x, y and z
mdl <- lm(z ~ ., df)
out <- NULL
# step = 1/100 of the range (note the parentheses around max - min)
for(x in seq(min(df$x), max(df$x), (max(df$x) - min(df$x))/100)){
  tmp <- c()
  for(y in seq(min(df$y), max(df$y), (max(df$y) - min(df$y))/100)){
    h <- predict(
      mdl,
      data.frame(x = x, y = y)
    )
    tmp <- c(tmp, h)
  }
  # each x value contributes one column of predictions
  if(is.null(out)){
    out <- as.matrix(tmp)
  }else{
    out <- cbind(out, tmp)
  }
}
fig <- plot_ly(z = out, colorscale = "Hot", type = "heatmap")
fig <- fig %>% layout(
  title = "Interpolated Heatmap of Z Given x, y",
  xaxis = list(
    title = "x"
  ),
  yaxis = list(
    title = "y"
  )
)
fig
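The nested loop calls predict() once per grid cell. A possibly faster sketch builds the whole grid with expand.grid() and predicts in a single call. The data frame here is made up for illustration (the answer does not show its `df`), and the names `xs`, `ys` and `grid` are hypothetical:

```r
# Made-up data with numeric columns x, y, z
set.seed(1)
df <- data.frame(x = runif(50), y = runif(50))
df$z <- 2 * df$x - df$y + rnorm(50, sd = 0.1)

mdl <- lm(z ~ ., df)

# 101 evenly spaced points per axis, i.e. steps of 1/100 of the range
xs <- seq(min(df$x), max(df$x), length.out = 101)
ys <- seq(min(df$y), max(df$y), length.out = 101)

# y varies fastest, so filling a matrix column by column gives
# rows = y values and columns = x values, like the loop version
grid <- expand.grid(y = ys, x = xs)
out  <- matrix(predict(mdl, grid), nrow = length(ys))
dim(out)
```

The resulting `out` matrix can be passed to `plot_ly(z = out, type = "heatmap")` in the same way as before.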

Add characters to existing cells with condition R

I have the following table:
structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = c(1,
1, 1, 1, 1, 2, 2, 2)), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
I would like to add to $stops new characters when the stop did not change but the $Id did.
For example, I would like to get:
structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = structure(c(1L,
1L, 2L, 2L, 3L, 4L, 4L, 4L), .Label = c("1", "1-1", "1-2", "2"
), class = "factor")), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
I would only like to do this if the Id is different from the previous one and the stop is the same as the previous one...
I tried with mutate(), but I seem to be quite far from getting something that works...
Here's a loopless attempt using data.table
library(data.table)
setDT(df)[, `:=`(stops = as.character(stops), Idindx = rleid(Id))]
indx <- unique(df, by = "Idindx")[, counter := (1:.N) - 1L, by = rleid(stops)]
df[indx[counter > 0], stops := paste(stops, i.counter, sep = "-"), on = "Idindx"]
# Id stops Idindx
# 1: a 1 1
# 2: a 1 1
# 3: b 1-1 2
# 4: b 1-1 2
# 5: a 1-2 3
# 6: c 2 4
# 7: c 2 4
# 8: c 2 4
The first step is to create a unique index for each run of Id (as the Ids themselves aren't unique) and to convert stops to character (per your desired output).
Then, operating on the unique indexes, count the repeated runs of the same stops value and join back to the original data.
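The same two steps can be sketched in base R without a loop. This is only an illustration under the assumption that stops never changes in the middle of a run of identical Ids; the names `id_run`, `stop_run` and `counter` are made up for the example:

```r
dat <- data.frame(Id    = c("a", "a", "b", "b", "a", "c", "c", "c"),
                  stops = c(1, 1, 1, 1, 1, 2, 2, 2))
n <- nrow(dat)

# Step 1: run index that increases whenever Id (or stops) changes
id_run   <- cumsum(c(TRUE, dat$Id[-1]    != dat$Id[-n]))
stop_run <- cumsum(c(TRUE, dat$stops[-1] != dat$stops[-n]))

# Step 2: one row per Id run; count its position within the run of
# constant stops, starting at 0 for the first Id seen at that stop
first   <- !duplicated(id_run)
counter <- ave(seq_len(sum(first)), stop_run[first], FUN = seq_along) - 1L

# Map the per-run counter back to every row and append it where it is > 0
k <- counter[id_run]
new_stops <- ifelse(k > 0, paste(dat$stops, k, sep = "-"), as.character(dat$stops))
new_stops
```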
You could write a loop to solve your problem:
# Original data
data <- structure(list(Id = structure(c(1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L
), .Label = c("a", "b", "c"), class = "factor"), stops = c(1,
1, 1, 1, 1, 2, 2, 2)), .Names = c("Id", "stops"), row.names = c(NA,
-8L), class = "data.frame")
# Add new column, which will be converted in the following loop
data$stops_new <- as.character(data$stops)
new <- 1
for(i in 2:nrow(data)) {
  # Convert values of stops_new if the specified conditions are met
  if(data$Id[i] != data$Id[i - 1] & data$stops[i] == data$stops[i - 1]) {
    data$stops_new[i] <- paste(data$stops_new[i], "-", new, sep = "")
    # Repeat the conversion for all following rows with the same Id and stop value
    j <- i + 1
    # j <= nrow(data) guards against running past the last row
    while(j <= nrow(data) && data$Id[i] == data$Id[j] && data$stops[i] == data$stops[j]) {
      data$stops_new[j] <- paste(data$stops[i], "-", new, sep = "")
      j <- j + 1
    }
    new <- new + 1
  }
}
data
This is a base R solution (dat is the data frame from the question).
Create indicators showing whether Id has changed (id.ind) and whether stops has changed (stops.ind) from the previous row (by convention, these indicators are set to 0, i.e. no change, for the first row):
stops.ind <- c(0, diff(dat$stops))
id.ind <- c(0, diff(as.numeric(dat$Id)))
Create the new stops vector:
stops <- new.stops <- dat$stops
Row by row, check whether a) there is a change in Id and no change in stops, or b) there is no change in either from the previous row. In case a), increase k by one and append "-k" to the stops value; in case b), reuse the previous value of new.stops:
k <- 0
for(i in 2 : nrow(dat)){
  if(id.ind[i] != 0 & stops.ind[i] == 0){
    k <- k + 1
    new.stops[i] <- paste0(stops[i], "-", k)
  }
  if(id.ind[i] == 0 & stops.ind[i] == 0)
    new.stops[i] <- new.stops[i - 1]
}
new.stops
# [1] "1" "1" "1-1" "1-1" "1-2" "2" "2" "2"
new.dat <- data.frame(Id = dat$Id, stops = new.stops)

Dataframe manipulation: Convert certain columns of a dataframe into a list based on a key value column

I have a DF like the example created by the code below.
a = data.frame( name = c(rep("Tim",5),rep("John",3)),id = c(rep(1,5),rep(2,3)), value = 1:8)
And I want to transform it into a result that looks like this.
b = data.frame( name = c("Tim","John"), ID = c(1:2), b = NA)
b$value = list(c(1:5),c(6:8))
How would I go about doing this transformation?
For the actual data frame, I will have many columns to the left of the ID column, which I will want to perform calculations on with the columns of lists that will be created on the right side of the ID field.
For example, on the DF b above, I might want to perform a function call with "Tim" as an argument and loop through each individual element in the list = {1,2,3,4,5} and the output of that loop is another list with the same number of elements.
Try
aggregate(value~.,a, FUN=c)
# name id value
#1 Tim 1 1, 2, 3, 4, 5
#2 John 2 6, 7, 8
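If you want a true list column that you can then loop over together with the name, one possible sketch uses split() and Map(). The per-element calculation (multiplying each value by the length of the name) is a made-up placeholder for whatever function you actually need to call:

```r
a <- data.frame(name  = c(rep("Tim", 5), rep("John", 3)),
                id    = c(rep(1, 5), rep(2, 3)),
                value = 1:8)

# One row per name/id pair
b <- unique(a[c("name", "id")])

# List column: all values belonging to each id, in the row order of b
b$value <- unname(split(a$value, a$id)[as.character(b$id)])

# Placeholder calculation: walk over name and value list in parallel and
# return a list with the same number of elements as the input list
b$scaled <- unname(Map(function(nm, v) v * nchar(nm),
                       as.character(b$name), b$value))
b
```

Map() hands each name and its matching vector to the function together, so the output list lines up element by element with the input list.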
data
a <- structure(list(name = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L), .Label = c("John", "Tim"), class = "factor"), id = c(1,
1, 1, 1, 1, 2, 2, 2), value = 1:8), .Names = c("name", "id",
"value"), row.names = c(NA, -8L), class = "data.frame")

Converting this ugly for-loop to something more R-friendly

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.
I'm trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that's frowned upon in R)... I have records of three million-odd course enrollments: student UserID's paired with CourseID's. In each row, there's a bunch of data including start/end dates and scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she's taken before the course in the enrollment.
The code I'm using for the for-loop follows:
data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
sum <- 0
count <- 0
for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
if (count != 0)
data$Mean.Prior.Score[i] <- sum / count
}
I'm pretty sure this would work, but it runs incredibly slowly... my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.
Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn't be too terribly slow...
There's got to be a better way to do this... any suggestions?
Edit: As per mnel's request... finally got dput to play nicely. Had to add control = NULL. Here 'tis:
structure(list(Username = structure(1:20, .Label = c("100225",
"100226", "100228", "1013170", "102876", "105796", "106753",
"106755", "108568", "109038", "110150", "110200", "110350", "111873",
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"),
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L,
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L,
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L,
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L,
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L,
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L,
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L,
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A",
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A",
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B",
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W",
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"),
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400,
1309737600, 1099267200, 1098662400, 1099267200, 1099267200,
1098662400, 1098662400, 1099267200, 1099267200, 1099267200,
1098662400, 1104105600, 1098662400, 1098662400, 1104105600,
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L,
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L,
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L,
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A",
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"),
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21",
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"),
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125,
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L,
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L,
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27",
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L,
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L,
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27",
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L,
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L,
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200",
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438",
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432",
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"),
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19",
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username",
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID",
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date",
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")
I found that data.table worked well.
# Create some data.
library(data.table)
set.seed(1)
n=3e6
numCourses=5 # Average courses per student
data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
# test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
# I vastly reduced the number of comparisons with a better "test" function.
test2=function(CourseEndDate,Score,CourseStartDate) {
o.end = order(CourseEndDate)
run.avg = cumsum(Score[o.end])/seq_along(CourseEndDate)
idx=findInterval(CourseStartDate,CourseEndDate[o.end])
idx=ifelse(idx==0,NA,idx)
run.avg[idx]
}
system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1)
# For three million courses, at an average of 5 courses per student:
# user system elapsed
# 122.06 0.22 122.45
Running a test to see if it looks the same as your code:
set.seed(1)
n=1e2
data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
data["246"]
# UserID course Score CourseStartDate CourseEndDate MeanPriorScore
#1: 246 54 0.4531314 2000-08-09 2000-09-20 0.9437248
#2: 246 89 0.9437248 2000-02-19 2000-03-02 NA
# A comparison with your for loop (slightly modified)
data$MeanPriorScore.old<-NA # Set to NA instead of zero for easy comparison.
# I think you forgot a bracket here. Also, There is no need to work with the rownames.
for (i in seq(nrow(data))) {
sum <- 0
count <- 0
# I reduced the complexity of figuring out the vector to loop through.
# It will result in the exact same thing if there are no rownames.
for (j in which(data$UserID == data$UserID[i])) {
if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
# I had to add "[i]" here. I think that is what you meant.
if (count != 0) data$MeanPriorScore.old[i] <- sum / count
}
identical(data$MeanPriorScore,data$MeanPriorScore.old)
# [1] TRUE
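To make the trick inside test2 easier to follow, here is the same idea worked through for a single hypothetical student with three courses (the dates and scores are made up). Sorting by end date and taking a running mean gives "the mean of the first k finished courses"; findInterval() then tells each course how many courses had ended by its start date:

```r
# Three made-up courses for one student, deliberately not in date order
start <- as.Date(c("2020-02-20", "2020-01-01", "2020-01-20"))
end   <- as.Date(c("2020-03-01", "2020-01-15", "2020-02-10"))
score <- c(0.9, 0.5, 0.7)

o       <- order(end)                          # courses sorted by end date
run_avg <- cumsum(score[o]) / seq_along(o)     # mean of the first k finished courses

# For each start date: number of courses with end date <= start date
idx <- findInterval(as.numeric(start), as.numeric(end[o]))
idx[idx == 0] <- NA                            # no finished course yet

prior_mean <- run_avg[idx]
prior_mean
```

The course starting on 2020-02-20 gets the mean of the two earlier courses, the first course gets NA, and the course starting on 2020-01-20 gets the score of the single course that ended before it. Like test2, this counts a course that ends exactly on the start date as prior (<= rather than <).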
This seems to be what you want
library(data.table)
# create a data.table object
DT <- data.table(data)
# key by userID
setkeyv(DT, 'userID')
# for each userID, where the Course.End.Date < Course.Start.Date
# return the mean score
# This is too simplistic
# DT[Course.End.Date < Course.Start.Date,
# list(Mean.Prior.Score = mean(Score)) ,
# by = list(userID)]
As per @joran's comment, this will be more complex than the code above.
This is only an outline of what I think a solution might entail. I'm going to use plyr just to illustrate the steps needed, for simplicity.
Let's just restrict ourselves to the case of one student. If we can calculate this for one student, extending it with some sort of split-apply will be trivial.
So let's suppose we have scores for a particular student, sorted by course end date:
d <- sample(seq(as.Date("2011-01-01"),as.Date("2011-01-31"),by = 1),100,replace = TRUE)
dat <- data.frame(date = sort(d),val = rnorm(100))
First, I think you'd need to summarise this by date and then calculate the cumulative running mean:
library(plyr)
dat_sum <- ddply(dat,.(date),summarise,valsum = sum(val),n = length(val))
dat_sum$mn <- with(dat_sum,cumsum(valsum) / cumsum(n))
Finally, you'd merge these values back into the original data with the duplicate dates:
dat_merge <- merge(dat,dat_sum[,c("date","mn")])
I could probably write something that does this in data.table using an anonymous function to do all those steps, but I suspect others may be better able to do something that will be concise and fast. (In particular, I don't recommend actually tackling this with plyr, as I suspect it will still be extremely slow.)
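For reference, the outlined steps can also be sketched without plyr, using only base R. This is just an illustration of the same outline for one student; like the outline, the running mean includes the scores from the same date:

```r
set.seed(42)  # made-up data for one student, as in the outline
d   <- sample(seq(as.Date("2011-01-01"), as.Date("2011-01-31"), by = 1),
              100, replace = TRUE)
dat <- data.frame(date = sort(d), val = rnorm(100))

# Summarise by date: one element per date, in date order
groups <- split(dat$val, dat$date)
valsum <- sapply(groups, sum)
n      <- lengths(groups)

# Cumulative running mean up to and including each date (named by date)
mn <- cumsum(valsum) / cumsum(n)

# Merge back into the original rows, duplicates included
dat$mn <- unname(mn[as.character(dat$date)])
head(dat)
```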
I think something like this should work though it'd be helpful to have test data with multiple courses per user. Also might need +1 on the start dates in findInterval to make condition be End.Date < Start.Date instead of <=.
# in the test data, one is POSIXct and the other a factor
data$Course.Start.Date = as.Date(data$Course.Start.Date)
data$Course.End.Date = as.Date(data$Course.End.Date)
data = data[order(data$Course.End.Date), ]
data$Mean.Prior.Score = ave(seq_along(data$User.ID), data$User.ID, FUN=function(i)
c(NA, cumsum(data$Score[i]) / seq_along(i))[1L + findInterval(data$Course.Start.Date[i], data$Course.End.Date[i])])
With three million rows, maybe a database is helpful. Here an sqlite example which I believe creates something similar to your for loop:
# data.frame for testing
user <- sample.int(10000, 100)
course <- sample.int(10000, 100)
c_start <- sample(
seq(as.Date("2004-01-01"), by="3 months", length.out=12),
100, replace=TRUE
)
c_end <- c_start + as.difftime(11, units="weeks")
c_idx <- sample.int(100, 1000, replace=TRUE)
enroll <- data.frame(
user=sample(user, 1000, replace=TRUE),
course=course[c_idx],
c_start=as.character(c_start[c_idx]),
c_end=as.character(c_end[c_idx]),
score=runif(1000),
stringsAsFactors=FALSE
)
#variant 1: for-loop
system.time({
enroll$avg.p.score <- NA
for (i in 1:nrow(enroll)) {
sum <- 0
count <- 0
for (j in which(enroll$user==enroll$user[[i]]))
if (enroll$c_end[[j]] < enroll$c_start[[i]]) {
sum <- sum + enroll$score[[j]]
count <- count + 1
}
if(count !=0) enroll$avg.p.score[[i]] <- sum / count
}
})
#variant 2: sqlite
system.time({
library(DBI)
library(RSQLite)
# current RSQLite expects a driver object rather than the string "SQLite"
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "enroll", enroll, overwrite=TRUE)
sql <- paste("Select e.user, e.course, Avg(p.score)",
"from enroll as e",
"cross join enroll as p",
"where e.user=p.user and p.c_end < e.c_start",
"group by e.user, e.course;")
# dbGetQuery sends the query and fetches all rows in one step
dat <- dbGetQuery(con, sql)
dbDisconnect(con)
})
On my machine, sqlite is ten times faster. If that is not enough, it would be possible to index the database.
I can't really test this, as your data doesn't appear to satisfy the inequality in any combination, but I'd try something like this:
library(plyr)
res <- ddply(data, .(User.ID), function(d) {
with(subset(merge(d, d, by=NULL, suffixes=c(".i", ".j")),
Course.End.Date.j < Course.Start.Date.i),
c(Mean.Prior.Score = mean(Score.j)))
})
res$Mean.Prior.Score[is.nan(res$Mean.Prior.Score)] = 0
Here is how it works:
ddply: Group data by User.ID and execute function for each subset d of rows for one User.ID
merge: Create two copies of the data for one user, one with columns suffixed .i the other .j
subset: From this outer join, only select those matching the given inequality
mean: Compute the mean for the matched rows
c(…): Give a name to the resulting column
res: Will be a data.frame with columns User.ID and Mean.Prior.Score
is.nan: mean will return NaN for zero-length vectors, change these to zeros
I guess this might be reasonably fast if there are not too many rows for each User.ID. If this isn't fast enough, the data.tables mentioned in other answers might help.
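Since the question's sample data never satisfies the inequality, here is the merge/subset step on a small made-up data frame for a single user. Unlike the ddply call above, which returns one mean per user, this sketch groups by the enrollment (`Course.i`), which seems closer to what the question asks for:

```r
# Made-up enrollments for one user
d <- data.frame(
  Course            = c("A", "B", "C"),
  Course.Start.Date = as.Date(c("2020-01-01", "2020-01-20", "2020-02-20")),
  Course.End.Date   = as.Date(c("2020-01-15", "2020-02-10", "2020-03-01")),
  Score             = c(0.5, 0.7, 0.9)
)

# by = NULL gives the Cartesian product: every course i paired with every course j
pairs <- merge(d, d, by = NULL, suffixes = c(".i", ".j"))

# Keep only pairs where course j ended before course i started
prior <- subset(pairs, Course.End.Date.j < Course.Start.Date.i)

# Mean prior score per enrollment i
res <- sapply(split(prior$Score.j, prior$Course.i), mean)
res
```

Course A has no earlier course, so it does not appear in `res` at all; you would have to fill in NA or 0 for such rows when merging the result back.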
Your code is a bit fuzzy on the desired output: you treat data$Mean.Prior.Score like a length-one variable, but assign to it in every iteration of the loop. I assume that this assignment is meant only for one row. Do you need means for every row of the data frame, or only one mean per user?
