R left join by multiple columns resulting in NAs

I have two data frames alpha and beta.
dput(alpha)
structure(list(ID = c(29503L, 29507L, 29508L, 29510L),
Q_ID = structure(1:4, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458"),
class = "factor"),
L_Atmpt = c(0L, 0L, 0L, 0L),
Q_Atmpt = c(0L, 1L, 0L, 1L),
Q_Result = c(1L, 1L, 1L, 0L),
Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L),
Start_Time = structure(1:4, .Label = c("2017-10-31 11:51:20", "2017-10-31 11:54:26", "2017-10-31 11:59:09", "2017-10-31 12:35:34"),
class = "factor"),
End_Time = structure(1:4, .Label = c("2017-10-31 11:51:33", "2017-10-31 11:57:31", "2017-10-31 12:03:33", "2017-10-31 13:11:57"),
class = "factor"),
Duration = c(173L, 55L, 98L, 1921L)),
class = "data.frame", row.names = c(NA, -4L))
dput(beta)
structure(list(ID = c(29503L, 29507L, 29508L, 29510L, 29515L, 30160L),
Q_ID = structure(1:6, .Label = c("q:1392763916495:441", "q:1392763916495:445", "q:1392763916495:449", "q:1392763920794:458", "q:1392763920794:462", "q:1392763925803:530"),
class = "factor"),
L_Atmpt = c(0L, 0L, 0L, 0L, 0L, 1L),
Q_Atmpt = c(0L, 1L, 0L, 1L, 0L, 0L),
Q_Result = c(1L, 1L, 1L, 0L, 0L, 0L),
Time_on_Screen = c(13839L, 185162L, 264418L, 2183464L, 768470L, 885800L),
Start_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25 00:19:08", "2017-10-31 11:51:20", "2017-10-31 11:54:26", "2017-10-31 11:59:09", "2017-10-31 12:35:34", "2017-10-31 13:16:09"),
class = "factor"),
End_Time = structure(c(2L, 3L, 4L, 5L, 6L, 1L), .Label = c("2017-10-25 00:33:53", "2017-10-31 11:51:33", "2017-10-31 11:57:31", "2017-10-31 12:03:33", "2017-10-31 13:11:57", "2017-10-31 13:28:57"),
class = "factor")),
class = "data.frame", row.names = c(NA,-6L))
I want to merge them to get a final data frame gamma. The data frame alpha has one extra column, alpha$Duration, which I need to append at the end of the data frame beta.
beta has more rows than alpha, and I want to perform a left join so that all the rows of beta are retained. This means that some entries of the column gamma$Duration will be NULL or NA.
I expect the NULLs or NAs to be exactly those rows where the ID of beta finds no match in alpha. However, for my original data (which has more than 10K rows and around 20 variables), I get something like the following, with NA in every row:
ID Q_ID L_Atmpt Q_Atmpt Q_Result Time_on_Screen Start_Time End_Time Duration
29503 q:1392763916495:441 0 0 1 13839 2017-10-31 11:51:20 2017-10-31 11:51:33 NA
29507 q:1392763916495:445 0 1 1 185162 2017-10-31 11:54:26 2017-10-31 11:57:31 NA
29508 q:1392763916495:449 0 0 1 264418 2017-10-31 11:59:09 2017-10-31 12:03:33 NA
29510 q:1392763920794:458 0 1 0 2183464 2017-10-31 12:35:34 2017-10-31 13:11:57 NA
29515 q:1392763920794:462 0 0 0 768470 2017-10-31 13:16:09 2017-10-31 13:28:57 NA
30160 q:1392763925803:530 1 0 0 885800 2017-10-25 00:19:08 2017-10-25 00:33:53 NA
Unfortunately, the toy example I have shared does not reproduce the problem. I understand it may be hard to guess why I am getting NAs in my original data, but any thoughts or advice would be highly appreciated.
For reference, here are the different scripts I have used; they all produced the same output:
library(plyr)
gamma = join(beta, alpha, type = "left")
library(dplyr)
gamma = left_join(beta, alpha)
library(sqldf)
gamma = sqldf('SELECT beta.*, alpha.duration
FROM beta LEFT JOIN alpha
on beta.ID == alpha.ID AND
beta.Q_ID == alpha.Q_ID AND
beta.L_Atmpt == alpha.L_Atmpt AND
beta.Q_Atmpt == alpha.Q_Atmpt AND
beta.Start_Time == alpha.Start_Time')
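One thing worth checking: when by is omitted, left_join() (and plyr's join()) match on every column the two data frames share, which here includes the factor timestamp columns; any invisible difference in one of them (differing factor levels, stray whitespace, a slightly different End_Time) leaves every row unmatched. A minimal sketch that restricts the join to explicit keys, assuming ID and Q_ID are enough to identify a row:
library(dplyr)
# Join on an explicit, minimal key instead of every shared column;
# Duration is the column name from the dput above.
gamma <- left_join(beta, select(alpha, ID, Q_ID, Duration),
                   by = c("ID", "Q_ID"))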
I would like to mention that the column alpha$duration in my original data frame was created by pre-processing steps such as the following:
#Step 1: Ordering the data by ID, Q_ID, Q_Atmpt and Start_Time
beta = beta[with(beta, order(ID, Q_ID, Q_Atmpt, Start_Time)), ]
#Step 2: End_Time lagging
library(Hmisc)
# to calculate the time difference we lag the End_Time
beta$End_Time_forward = Lag(beta$End_Time, +1)
# for comparisons, we also lag the IDs
beta$ID_forward = Lag(beta$ID, +1)
#Step 3: Now calculate the required time differences
library(sqldf)
alpha = sqldf('SELECT beta.*,
(Start_Time - End_Time_forward),
(End_Time - End_Time_forward)
FROM beta
WHERE ID_forward == ID')
#Step 4: Columns renaming
names(alpha)[names(alpha) == "(Start_Time - End_Time_forward)"] = "duration"
names(alpha)[names(alpha) == "(End_Time - End_Time_forward)"] = "end_duration"
#Step 5:Few instances have negative duration, so replace the gap between
# (last end time and current start time) with the (last end time and current
# end time) difference
alpha = alpha %>%
mutate(duration = if_else(duration < 0, end_duration, duration))
#Step 6: Convert the remaining negatives with NAs
alpha$duration[alpha$duration < 0] <- NA
#Step 7: Now replace those NAs by using the imputeTS function
library(imputeTS)
alpha$duration = na_locf(alpha$duration, option = 'locf',
na_remaining = 'rev', maxgap = Inf)
I suspect that the last two steps, where I manipulated the alpha$duration variable, might have something to do with these unexpected results.

I have not been able to determine the actual cause of this issue; however, I have found a workaround:
beta$duration = as.integer(0)
test2 = merge(x = beta, y = alpha,
by = c("ID", "Q_ID", "L_Atmpt", "Q_Atmpt", "Q_Result", "Time_on_Screen", "Start_Time", "End_Time"),
all.x = TRUE)
Through this, I can access/retain the duration column of the data frame alpha and then use it as I want to.
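To narrow down the actual cause, an anti-join shows which rows of beta fail to find a partner in alpha under a given key set; dropping keys one at a time until the result is empty points at the offending column. A sketch with the same keys as the sqldf query above:
library(dplyr)
# Rows of beta with no match in alpha under these keys; if this is
# non-empty, one of the listed columns differs between the frames.
unmatched <- anti_join(beta, alpha,
                       by = c("ID", "Q_ID", "L_Atmpt", "Q_Atmpt", "Start_Time"))
nrow(unmatched)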

Calculating distance between multiple points at the same time of the day

I have two dataframes, one with my boat's GPS positions (5512 records) and one with fishing boats' positions (35381 records). I want to calculate the distance between my boat and all fishing boats that were present in the area at the same time of day (to the minute).
I created an IDdatecode (yyyymmddhhmm) for all the positions, then merged the two dataframes on IDdatecode. I did this:
merged_table<- merge(myboat,fishboats,by="IDdatecode",all.y=TRUE)
To calculate the distance I used the formula:
merged_table$distance_between_vessels=distm(c("lon1","lat1"),c("lon2","lat2"),fun=distGeo)
where lon1, lat1 are my boat positions and lon2, lat2 are fishing boats.
But I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "distance_between_vessels", value = NA_real_) :
replacement has 1 row, data has 35652
In addition: Warning messages:
1: In .pointsToMatrix(x) : NAs introduced by coercion
2: In .pointsToMatrix(y) : NAs introduced by coercion
What I tried so far is:
use this other formula: merged_table$distance_between_vessels=distGeo(c("lon1","lat1"),c("lon2","lat2"))
put all the columns of lat and lon "as.numeric"
use only interval times where both my boat and fishing boats were present
ignore the warning and keep going
But I still get only a list of NAs.
I used the function "distGeo" on a much simpler dataset (only my boat's positions), where I manually calculated the distance between the first and second point, then between the second and third point, and so on. The function works perfectly: it gives exactly the right distance between two points (I checked it in ArcGIS). This is what I did:
distGeo(mydata[1, ], mydata[2, ])
distGeo(mydata[2, ], mydata[3, ])
distGeo(mydata[3, ], mydata[4, ])
So, I want to calculate 'one-to-many' distances based on a unique time of the day, but I get the above error. Any ideas on why? Thanks :)
Here, my first 10 rows of the merged table:
structure(list(Record = 1:10, IDdatecode = structure(c(1L, 2L,
3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("d201805081203",
"d201805081204", "d201805081205", "d201805081206", "d201805081207",
"d201805081208"), class = "factor"), lon1 = c(12.40203333, 12.4071,
12.41165, 12.41165, 12.41485, 12.41485, 12.41663333, 12.41663333,
12.41841667, 12.41841667), lat1 = c(45.1067, 45.10921667, 45.11218333,
45.11218333, 45.11303333, 45.11303333, 45.11313333, 45.11313333,
45.11348333, 45.11348333), boat1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = "RB", class = "factor"), lon2 = c(13.02718,
13.02585827, 13.02453654, 13.02173, 13.02321482, 13.02052301,
13.02189309, 13.01931602, 13.02057136, 13.01810904), lat2 = c(44.98946,
44.99031749, 44.99117498, 44.98792, 44.99203246, 44.98868065,
44.99288995, 44.98944129, 44.99374744, 44.99020194), boat2 = structure(c(1L,
1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("IMPERO II",
"MISTRAL"), class = "factor")), .Names = c("Record", "IDdatecode",
"lon1", "lat1", "boat1", "lon2", "lat2", "boat2"), row.names = c(NA,
-10L), class = "data.frame")
V2, Update (January 17, 2022)
Glad it works for you. If you are willing to avoid for-loops you could consider a dplyr approach. Have a look.
library(dplyr)
df <- silvia %>%
rowwise() %>%
mutate(distance = geosphere::distGeo(c(lon1, lat1), c(lon2, lat2)))
df
The base R *apply family would be another option; in fact, distGeo() itself is vectorised over coordinate matrices, as sketched below.
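A sketch of the vectorised call, using the same silvia data frame and column names:
library(geosphere)
# distGeo() accepts two-column (lon, lat) matrices and returns one
# distance per row, so neither a loop nor rowwise() is needed.
silvia$distance <- distGeo(cbind(silvia$lon1, silvia$lat1),
                           cbind(silvia$lon2, silvia$lat2))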
V1 (January 16, 2022)
Hopefully this approach helps you. It is often recommended not to use for-loops in R; however, I used one here, since it is easy to understand.
I made the following assumptions:
boat1 is your boat; lat1 and lon1 represent the position of boat1 for any IDdatecode;
as I did not fully understand what you mean by "based on a unique time of the day", I assumed that looping over each row is sufficient;
the function distGeo() is from the geosphere package.
# loading your dataframe as "silvia"
silvia <- structure(list(Record = 1:10, IDdatecode = structure(c(1L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L),
.Label = c("d201805081203","d201805081204", "d201805081205", "d201805081206", "d201805081207", "d201805081208"),
class = "factor"), lon1 = c(12.40203333, 12.4071, 12.41165, 12.41165, 12.41485, 12.41485, 12.41663333,
12.41663333, 12.41841667, 12.41841667), lat1 = c(45.1067, 45.10921667, 45.11218333, 45.11218333, 45.11303333,
45.11303333, 45.11313333, 45.11313333, 45.11348333, 45.11348333), boat1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = "RB", class = "factor"), lon2 = c(13.02718, 13.02585827, 13.02453654, 13.02173, 13.02321482,
13.02052301, 13.02189309, 13.01931602, 13.02057136, 13.01810904), lat2 = c(44.98946, 44.99031749, 44.99117498, 44.98792,
44.99203246, 44.98868065, 44.99288995, 44.98944129, 44.99374744, 44.99020194), boat2 = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 2L), .Label = c("IMPERO II", "MISTRAL"), class = "factor")), .Names = c("Record", "IDdatecode",
"lon1", "lat1", "boat1", "lon2", "lat2", "boat2"), row.names = c(NA, -10L), class = "data.frame")
# for EACH ROW in "silvia" calculate the distance between c("lon1", "lat1") and c("lon2", "lat2")
for (i in 1:nrow(silvia)){
silvia$distance[i] <- geosphere::distGeo(c(silvia[i, "lon1"], silvia[i,"lat1"]),
c(silvia[i, "lon2"], silvia[i,"lat2"]))
}
# here you see the first 5 entries of the df "silvia"
# the distances are calculated in metres
# the parameters a and f are set to WGS84 by default.
head(silvia, n=5)
#> Record IDdatecode lon1 lat1 boat1 lon2 lat2 boat2
#> 1 1 d201805081203 12.40203 45.10670 RB 13.02718 44.98946 IMPERO II
#> 2 2 d201805081204 12.40710 45.10922 RB 13.02586 44.99032 IMPERO II
#> 3 3 d201805081205 12.41165 45.11218 RB 13.02454 44.99117 IMPERO II
#> 4 4 d201805081205 12.41165 45.11218 RB 13.02173 44.98792 MISTRAL
#> 5 5 d201805081206 12.41485 45.11303 RB 13.02321 44.99203 IMPERO II
#> distance
#> 1 50943.77
#> 2 50503.93
#> 3 50118.46
#> 4 50005.52
#> 5 49774.51
Note. Created on 2022-01-16 by the reprex package (v2.0.1)
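As an aside on the original error: distm(c("lon1","lat1"), c("lon2","lat2"), fun = distGeo) passes literal character strings rather than data frame columns. geosphere coerces those strings to numeric, which produces the two "NAs introduced by coercion" warnings, and the call returns a single NA; assigning that one value to a column of 35652 rows then raises the "replacement has 1 row" error. A minimal sketch of the failure:
library(geosphere)
# "lon1" etc. are strings, not coordinates; coercion produces NA,
# so the result is one NA distance instead of a vector of distances.
distGeo(c("lon1", "lat1"), c("lon2", "lat2"))
#> [1] NA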

Elegant way to write function

I have an input column (symbols) which has more than 10000 rows, containing operator symbols and text values like ("", ">", "<", "-", "****", "inv", "MOD", "seen") as shown in the code below. This column doesn't contain any numbers; it only contains the values stated in the code.
What I would like to do is map those operator symbols ('<', '>', etc.) to two kinds of codes, 1) Operator_codes and 2) Value_codes, and have these two codes as separate columns.
I already have working code, but it is not very efficient: as you can see, I repeat the same operation twice, once for Operator_codes and then for Value_codes. I am sure there must be a more efficient way to write this. I am new to R and not very familiar with other approaches.
library(purrr)    # for map2()
library(magrittr) # for %>%
oper_val_concepts = function(DF){
  # pull out the symbols column (named Symbols in the dput below); the
  # original str_extract(.$symbols) call was incomplete
  operators_source = as.data.frame(DF$Symbols)
  colnames(operators_source) <- c("Symbol")
  operator_list = c("", ">", "<", "-", "****", "inv", "MOD", "seen")
  operator_codes = c(123L, 14L, 16L, 13L, 0L, 0L, 0L, 0L)
  value_codes = c(14L, 12L, 32L, 123L, 16L, 41L, 116L, 186L)
  operator_code_map = map2(operator_list, operator_codes, function(x, y) c(x, y)) %>%
    data.frame()
  value_code_map = map2(operator_list, value_codes, function(x, y) c(x, y)) %>%
    data.frame()
  operator_code_map = t(operator_code_map)
  value_code_map = t(value_code_map)
  colnames(operator_code_map) <- c("Symbol", "Code")
  colnames(value_code_map) <- c("Symbol", "Code")
  rownames(operator_code_map) = NULL
  rownames(value_code_map) = NULL
  dfm <- merge(x = operators_source, y = operator_code_map, by = "Symbol", all.x = TRUE)
  dfm1 <- merge(x = operators_source, y = value_code_map, by = "Symbol", all.x = TRUE)
}
t1 = oper_val_concepts(test)
The dput command output is:
structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA,-9L), class =
"data.frame")
I am expecting the output to be a dataframe with the two code columns added.
Based on what I am understanding, it seems like you want to create a dataframe that will act as a key (see key below). Once you have this, you can just join the dataframe that just contains symbols with this key dataframe.
df <- structure(list(Symbols = structure(c(2L, 3L, 1L, 4L, 2L, 3L,
5L, 4L, 6L), .Label = c("****", "<", ">", "inv", "mod", "seen"
), class = "factor")), .Names = "Symbols", row.names = c(NA, -9L), class = "data.frame")
key <- data.frame(Symbols = c("",">","<","-","****","inv","mod","seen"),
Operator_code_map = c(123L,14L,16L,13L,0L,0L,0L,0L),
value_code_map = c(14L,12L,32L,123L,16L,41L,116L,186L))
df %>% left_join(key, by = "Symbols")
output
Symbols Operator_code_map value_code_map
1 < 16 32
2 > 14 12
3 **** 0 16
4 inv 0 41
5 < 16 32
6 > 14 12
7 mod 0 116
8 inv 0 41
9 seen 0 186
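For reference, a base R sketch of the same lookup, using match() against the key data frame instead of a join:
# match() finds the row of key for each symbol; indexing the code
# columns by it replicates the left join (non-matches become NA).
df$Operator_code_map <- key$Operator_code_map[match(df$Symbols, key$Symbols)]
df$value_code_map <- key$value_code_map[match(df$Symbols, key$Symbols)]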

In R, subtract value from previous row in data frame based on hierarchical data structure

I need to calculate the following equation in a new column within my data frame.
Relative displacement index:
RDI = |(net displacement_l − net displacement_{l+1}) / net displacement_l * 100|,
where l indexes successive GPS locations; the index is the absolute relative change in net displacement between location l and location l + 1, expressed as a percentage.
The problem I have is that I cannot figure out a simple way to apply the formula above, calculating the relative change in amplitude of net displacement from row 1 to row 2, row 2 to row 3, row 3 to row 4, and so on. I've included an example data frame for your reference, along with a column that contains the desired output. The data is hierarchically structured (GPS locations nested within COLLAR_ID, DATETIME, MONTH, DAY, YEAR, and HOUR).
dput(droplevels(head(example, 4)))
structure(list(COLLAR_ID = c(33827L, 33827L, 33827L, 33827L),
DATETIME = structure(1:4, .Label = c("10/1/2013 10:00", "10/1/2013 10:30",
"10/1/2013 17:00", "10/1/2013 17:30"), class = "factor"),
WEEK = c(1L, 1L, 1L, 1L), YEAR = c(2013L, 2013L, 2013L, 2013L
), MONTH = c(10L, 10L, 10L, 10L), DAY = c(1L, 1L, 1L, 1L),
HOUR = c(10L, 10L, 17L, 17L), X = c(384349L, 431753L, 242501L,
448158L), Y = c(8864608L, 8757741L, 7306632L, 1159880L),
MOVEMENT_DISTANCE = c(78.1, 99.82, 35.9, 3), NET_DISPLACEMENT = c(135.35,
205.65, 403.79, 434.83)), .Names = c("COLLAR_ID", "DATETIME","WEEK", "YEAR","MONTH","DAY", "HOUR", "X", "Y", "MOVEMENT_DISTANCE","NET_DISPLACEMENT"), row.names = c(NA, 4L), class = "data.frame")
We can use the lead function from the dplyr package. Assuming that your data frame is called dat, dat2 is the final output.
library(dplyr)
dat2 <- dat %>%
mutate(RDI = abs((NET_DISPLACEMENT - lead(NET_DISPLACEMENT))/NET_DISPLACEMENT * 100))
head(dat2$RDI)
# [1] 51.9394163 96.3481644 7.6871641 0.5128441 1.2020342 0.3243490
You can also use base R as follows.
dat3 <- dat
dat3$RDI <- with(dat, abs((NET_DISPLACEMENT - c(NET_DISPLACEMENT[-1], NA))/NET_DISPLACEMENT * 100))
head(dat3$RDI)
# [1] 51.9394163 96.3481644 7.6871641 0.5128441 1.2020342 0.3243490
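Since the GPS locations are nested within COLLAR_ID, a grouped variant may be needed so that lead() does not read across collars; a sketch, assuming the displacement series restarts with each collar:
library(dplyr)
dat2 <- dat %>%
  group_by(COLLAR_ID) %>%
  mutate(RDI = abs((NET_DISPLACEMENT - lead(NET_DISPLACEMENT)) /
                     NET_DISPLACEMENT * 100)) %>%
  ungroup()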

group variables depending on defined circular area with center of circle having variable radius

I have a data.table object:
> dput(head(trackdatacompvar))
structure(list(wellvid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("A4-009",
"B3-006", "B4-015", "C2-009", "C2-034", "C3-017", "C4-014", "C4-016",
"C4-026", "C4-036"), class = "factor"), TRACK_ID = c(0L, 0L,
0L, 0L, 0L, 0L), treatment = structure(c(2L, 2L, 2L, 2L, 2L,
2L), .Label = c("Si_induced", "Si_notinduced"), class = "factor"),
A = c(0L, 0L, 0L, 0L, 0L, 0L), X = c(50.216, 50.216, 50.091,
50.091, 50.216, 50.216), Y = c(295.609, 295.609, 295.477,
295.477, 295.609, 295.609), T = 0:5, V = c(0, 0, 0.181793839279557,
0, 0.181793839279557, 0), x_grpA = c(641.67, 641.67, 641.67,
641.67, 641.67, 641.67), y_grpA = c(625, 625, 625, 625, 625,
625), rad_grpA = c(50L, 50L, 50L, 50L, 50L, 50L), x_grpB = c(889.58,
889.58, 889.58, 889.58, 889.58, 889.58), y_grpB = c(377.08,
377.08, 377.08, 377.08, 377.08, 377.08), rad_grpB = c(20L,
20L, 20L, 20L, 20L, 20L)), .Names = c("wellvid", "TRACK_ID",
"treatment", "A", "X", "Y", "T", "V", "x_grpA", "y_grpA", "rad_grpA",
"x_grpB", "y_grpB", "rad_grpB"), sorted = "wellvid", class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000210788>)
I want to define 4 groups of data depending on circular areas. Groups A and B are defined by the x,y origins of 2 beads (labelled as x_grpA, y_grpA and x_grpB, y_grpB), group C is the outside area, and group D is the area where groups A and B overlap (though this overlap is sometimes absent). The 2 circular groups should be inside a circular area with a radius of 115 µm. This 115 µm depends on the size of the bead, so I also have 2 radii in my data (rad_grpA and rad_grpB). To understand it visually, here are 2 pictures:
My original idea is to reuse the awesome script I was given before. So, I tried defining the center of each circle and the corresponding extent of the whole area of group A as:
center_grpA <- c(trackdatacompvar$x_grpA, trackdatacompvar$y_grpA)
circle_grpA <- (trackdatacompvar$rad_grpA)*2 + 115
But after this I am lost.
In the end I want to add the grouping to my dataframe as one variable.
Would appreciate any help! Thanks :)
We can use a little convenience function from a package of mine here:
check_if_in_circle <- function(points, x, y, r) {
(points[, 1] - x) ^ 2 + (points[, 2] - y) ^ 2 < r ^ 2
}
Now we check, for each point, whether it's in circle A and whether it's in circle B, and then use ifelse to figure out whether to assign A, B, C or D. I use within to avoid typing that long data name.
trackdatacompvar <- within(trackdatacompvar,
{
grpA <- check_if_in_circle(points = cbind(X, Y),
x_grpA, y_grpA, rad_grpA + 115)
grpB <- check_if_in_circle(points = cbind(X, Y),
x_grpB, y_grpB, rad_grpB + 115)
grp <- ifelse(grpA, ifelse(grpB, 'D', 'A'),
ifelse(grpB, 'B', 'C'))
} )
For the few rows you gave us, all are in group C.
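A quick way to inspect the assignment (a sketch):
# Tabulate the new grouping variable; for the six rows shown above,
# everything falls in group C.
table(trackdatacompvar$grp)
#> C
#> 6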

Converting this ugly for-loop to something more R-friendly

Been using SO as a resource constantly for my work. Thanks for holding together such a great community.
I'm trying to do something kinda complex, and the only way I can think to do it right now is with a pair of nested for-loops (I know that's frowned upon in R)... I have records of three million-odd course enrollments: student UserID's paired with CourseID's. In each row, there's a bunch of data including start/end dates and scores and so forth. What I need to do is, for each enrollment, calculate the average score for that user across the courses she's taken before the course in the enrollment.
The code I'm using for the for-loop follows:
data$Mean.Prior.Score <- 0
for (i in as.numeric(rownames(data)) {
sum <- 0
count <- 0
for (j in as.numeric(rownames(data[data$UserID == data$UserID[i],]))) {
if (data$Course.End.Date[j] < data$Course.Start.Date[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
if (count != 0)
data$Mean.Prior.Score[i] <- sum / count
}
I'm pretty sure this would work, but it runs incredibly slowly... my data frame has over three million rows, but after a good 10 minutes of chugging, the outer loop has only run through 850 of the records. That seems way slower than the time complexity would suggest, especially given that each user has only 5 or 6 courses to her name on average.
Oh, and I should mention that I converted the date strings with as.POSIXct() before running the loop, so the date comparison step shouldn't be too terribly slow...
There's got to be a better way to do this... any suggestions?
Edit: As per mnel's request... finally got dput to play nicely. Had to add control = NULL. Here 'tis:
structure(list(Username = structure(1:20, .Label = c("100225",
"100226", "100228", "1013170", "102876", "105796", "106753",
"106755", "108568", "109038", "110150", "110200", "110350", "111873",
"111935", "113579", "113670", "117562", "117869", "118329"), class = "factor"),
User.ID = c(2313737L, 2314278L, 2314920L, 9708829L, 2325896L,
2315617L, 2314644L, 2314977L, 2330148L, 2315081L, 2314145L,
2316213L, 2317734L, 2314363L, 2361187L, 2315374L, 2314250L,
2361507L, 2325592L, 2360182L), Course.ID = c(2106468L, 2106578L,
2106493L, 5426406L, 2115455L, 2107320L, 2110286L, 2110101L,
2118574L, 2106876L, 2110108L, 2110058L, 2109958L, 2108222L,
2127976L, 2106638L, 2107020L, 2127451L, 2117022L, 2126506L
), Course = structure(c(1L, 7L, 10L, 15L, 11L, 19L, 4L, 6L,
3L, 12L, 2L, 9L, 17L, 8L, 20L, 18L, 13L, 16L, 5L, 14L), .Label = c("ACCT212_A",
"BIOS200_N", "BIS220_T", "BUSN115_A", "BUSN115_T", "CARD205_A",
"CIS211_A", "CIS275_X", "CIS438_S", "ENGL112_A", "ENGL112_B",
"ENGL227_K", "GM400_A", "GM410_A", "HUMN232_M", "HUMN432_W",
"HUMN445_A", "MATH100_X", "MM575_A", "PSYC110_Y"), class = "factor"),
Course.Start.Date = structure(c(1098662400, 1098662400, 1098662400,
1309737600, 1099267200, 1098662400, 1099267200, 1099267200,
1098662400, 1098662400, 1099267200, 1099267200, 1099267200,
1098662400, 1104105600, 1098662400, 1098662400, 1104105600,
1098662400, 1104105600), class = c("POSIXct", "POSIXt"), tzone = "GMT"),
Term.ID = c(12056L, 12056L, 12056L, 66282L, 12057L, 12056L,
12057L, 12057L, 12056L, 12056L, 12057L, 12057L, 12057L, 12056L,
13469L, 12056L, 12056L, 13469L, 12056L, 13469L), Term.Name = structure(c(2L,
2L, 2L, 4L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 3L, 2L,
2L, 3L, 2L, 3L), .Label = c("Fall 2004", "Fall 2004 Session A",
"Fall 2004 Session B", "Summer Session A 2011"), class = "factor"),
Term.Start.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-10-21",
"2004-10-28", "2004-12-27", "2011-06-26"), class = "factor"),
Score = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.125,
0, 0, 0, 0, 0), First.Course.Date = structure(c(1L, 1L, 1L,
4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L,
1L, 3L), .Label = c("2004-10-25", "2004-11-01", "2004-12-27",
"2011-07-04"), class = "factor"), First.Term.Date = structure(c(1L,
1L, 1L, 4L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L,
1L, 3L, 1L, 3L), .Label = c("2004-10-21", "2004-10-28", "2004-12-27",
"2011-06-26"), class = "factor"), First.Timer = c(TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE), Course.Code = structure(c(1L,
6L, 9L, 13L, 9L, 17L, 4L, 5L, 3L, 10L, 2L, 8L, 15L, 7L, 18L,
16L, 11L, 14L, 4L, 12L), .Label = c("ACCT212", "BIOS200",
"BIS220", "BUSN115", "CARD205", "CIS211", "CIS275", "CIS438",
"ENGL112", "ENGL227", "GM400", "GM410", "HUMN232", "HUMN432",
"HUMN445", "MATH100", "MM575", "PSYC110"), class = "factor"),
Course.End.Date = structure(c(1L, 1L, 1L, 4L, 2L, 1L, 2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 3L, 1L, 3L), .Label = c("2004-12-19",
"2005-02-27", "2005-03-26", "2011-08-28"), class = "factor")), .Names = c("Username",
"User.ID", "Course.ID", "Course", "Course.Start.Date", "Term.ID",
"Term.Name", "Term.Start.Date", "Score", "First.Course.Date",
"First.Term.Date", "First.Timer", "Course.Code", "Course.End.Date"
), row.names = c(NA, 20L), class = "data.frame")
I found that data.table worked well.
# Create some data.
library(data.table)
set.seed(1)
n=3e6
numCourses=5 # Average courses per student
data=data.table(UserID=as.character(round(runif(n,1,round(n/numCourses)))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
# test=function(CourseEndDate,Score,CourseStartDate) sapply(CourseStartDate, function(y) mean(Score[y>CourseEndDate]))
# I vastly reduced the number of comparisons with a better "test" function.
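# In words: sort each user's courses by end date, build the running mean
# of Score in that order, then use findInterval() to count, for each start
# date, how many courses had already ended; the running mean at that count
# is the mean prior score (NA when no course ended earlier).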
test2=function(CourseEndDate,Score,CourseStartDate) {
o.end = order(CourseEndDate)
run.avg = cumsum(Score[o.end])/seq_along(CourseEndDate)
idx=findInterval(CourseStartDate,CourseEndDate[o.end])
idx=ifelse(idx==0,NA,idx)
run.avg[idx]
}
system.time(data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1)
# For three million courses, at an average of 5 courses per student:
# user system elapsed
# 122.06 0.22 122.45
Running a test to see if it looks the same as your code:
set.seed(1)
n=1e2
data=data.table(UserID=as.character(round(runif(n,1,1000))),course=1:n,Score=runif(n),CourseStartDate=as.Date('2000-01-01')+round(runif(n,1,365)))
data$CourseEndDate=data$CourseStartDate+round(runif(n,1,100))
setkey(data,UserID)
data$MeanPriorScore<-data[,test2(CourseEndDate,Score,CourseStartDate),by=UserID]$V1
data["246"]
# UserID course Score CourseStartDate CourseEndDate MeanPriorScore
#1: 246 54 0.4531314 2000-08-09 2000-09-20 0.9437248
#2: 246 89 0.9437248 2000-02-19 2000-03-02 NA
# A comparison with your for loop (slightly modified)
data$MeanPriorScore.old<-NA # Set to NaN instead of zero for easy comparison.
# I think you forgot a bracket here. Also, there is no need to work with the rownames.
for (i in seq(nrow(data))) {
sum <- 0
count <- 0
# I reduced the complexity of figuring out the vector to loop through.
# It will result in the exact same thing if there are no rownames.
for (j in which(data$UserID == data$UserID[i])) {
if (data$CourseEndDate[j] <= data$CourseStartDate[i]) {
sum <- sum + data$Score[j]
count <- count + 1
}
}
# I had to add "[i]" here. I think that is what you meant.
if (count != 0) data$MeanPriorScore.old[i] <- sum / count
}
identical(data$MeanPriorScore,data$MeanPriorScore.old)
# [1] TRUE
This seems to be what you want
library(data.table)
# create a data.table object
DT <- data.table(data)
# key by userID
setkeyv(DT, 'userID')
# for each userID, where the Course.End.Date < Course.Start.Date
# return the mean score
# This is too simplistic
# DT[Course.End.Date < Course.Start.Date,
# list(Mean.Prior.Score = mean(Score)) ,
# by = list(userID)]
As per @joran's comment, this will be more complex than the code above.
This is only an outline of what I think a solution might entail. I'm going to use plyr just to illustrate the steps needed, for simplicity.
Let's just restrict ourselves to the case of one student. If we can calculate this for one student, extending it with some sort of split-apply will be trivial.
So let's suppose we have scores for a particular student, sorted by course end date:
d <- sample(seq(as.Date("2011-01-01"),as.Date("2011-01-31"),by = 1),100,replace = TRUE)
dat <- data.frame(date = sort(d),val = rnorm(100))
First, I think you'd need to summarise this by date and then calculate the cumulative running mean:
dat_sum <- ddply(dat,.(date),summarise,valsum = sum(val),n = length(val))
dat_sum$mn <- with(dat_sum,cumsum(valsum) / cumsum(n))
Finally, you'd merge these values back into the original data with the duplicate dates:
dat_merge <- merge(dat,dat_sum[,c("date","mn")])
I could probably write something that does this in data.table using an anonymous function to do all those steps, but I suspect others may be better able to do something that will be concise and fast. (In particular, I don't recommend actually tackling this with plyr, as I suspect it will still be extremely slow.)
I think something like this should work, though it'd be helpful to have test data with multiple courses per user. You might also need +1 on the start dates in findInterval to make the condition End.Date < Start.Date instead of <=.
# in the test data, one is POSIXct and the other a factor
data$Course.Start.Date = as.Date(data$Course.Start.Date)
data$Course.End.Date = as.Date(data$Course.End.Date)
data = data[order(data$Course.End.Date), ]
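# For each user (rows already sorted by Course.End.Date above): build the
# running mean of Score in end-date order, then for each course pick the
# running mean over courses that ended on or before its start date; the
# leading NA covers courses with no earlier-ending course.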
data$Mean.Prior.Score = ave(seq_along(data$User.ID), data$User.ID, FUN=function(i)
c(NA, cumsum(data$Score[i]) / seq_along(i))[1L + findInterval(data$Course.Start.Date[i], data$Course.End.Date[i])])
With three million rows, maybe a database is helpful. Here is an SQLite example which I believe does something similar to your for loop:
# data.frame for testing
user <- sample.int(10000, 100)
course <- sample.int(10000, 100)
c_start <- sample(
seq(as.Date("2004-01-01"), by="3 months", length.out=12),
100, replace=TRUE
)
c_end <- c_start + as.difftime(11, units="weeks")
c_idx <- sample.int(100, 1000, replace=TRUE)
enroll <- data.frame(
user=sample(user, 1000, replace=TRUE),
course=course[c_idx],
c_start=as.character(c_start[c_idx]),
c_end=as.character(c_end[c_idx]),
score=runif(1000),
stringsAsFactors=FALSE
)
#variant 1: for-loop
system.time({
enroll$avg.p.score <- NA
for (i in 1:nrow(enroll)) {
sum <- 0
count <- 0
for (j in which(enroll$user==enroll$user[[i]]))
if (enroll$c_end[[j]] < enroll$c_start[[i]]) {
sum <- sum + enroll$score[[j]]
count <- count + 1
}
if(count !=0) enroll$avg.p.score[[i]] <- sum / count
}
})
#variant 2: sqlite
system.time({
library(RSQLite)
con <- dbConnect("SQLite", ":memory:")
dbWriteTable(con, "enroll", enroll, overwrite=TRUE)
sql <- paste("Select e.user, e.course, Avg(p.score)",
"from enroll as e",
"cross join enroll as p",
"where e.user=p.user and p.c_end < e.c_start",
"group by e.user, e.course;")
res <- dbSendQuery(con, sql)
dat <- fetch(res, n=-1)
})
On my machine, SQLite is ten times faster. If that is not enough, it would be possible to index the database; a sketch follows.
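A sketch of such an index on the columns used in the self-join and the date comparison (dbSendQuery mirrors the answer's style; current RSQLite would use dbExecute()):
# An index on (user, c_end) lets SQLite find each user's earlier-ending
# courses without scanning the whole table for every outer row.
dbSendQuery(con, "CREATE INDEX idx_user_cend ON enroll (user, c_end);")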
I can't really test this, as your data doesn't appear to satisfy the inequality in any combination, but I'd try something like this:
library(plyr)
res <- ddply(data, .(User.ID), function(d) {
with(subset(merge(d, d, by=NULL, suffixes=c(".i", ".j")),
Course.End.Date.j < Course.Start.Date.i),
c(Mean.Prior.Score = mean(Score.j)))
})
res$Mean.Prior.Score[is.nan(res$Mean.Prior.Score)] = 0
Here is how it works:
ddply: Group data by User.ID and execute function for each subset d of rows for one User.ID
merge: Create two copies of the data for one user, one with columns suffixed .i the other .j
subset: From this outer join, only select those matching the given inequality
mean: Compute the mean for the matched rows
c(…): Give a name to the resulting column
res: Will be a data.frame with columns User.ID and Mean.Prior.Score
is.nan: mean will return NaN for zero-length vectors, change these to zeros
I guess this might be reasonably fast if there are not too many rows per User.ID. If it isn't fast enough, the data.table approaches mentioned in other answers might help.
Your code is a bit fuzzy on the desired output: you treat data$Mean.Prior.Score like a length-one variable, but assign to it in every iteration of the loop. I assume that this assignment is meant only for one row. Do you need means for every row of the data frame, or only one mean per user?
