So I have a large data set of students at a school that looks like this:
library(data.table)
set.seed(1)
school <- data.table("id" = rep(1:10, 10), "year" = rep(2000:2009, each = 10),
"grade" = sample(c(9:11, rep(NA, 5)), 100, replace = TRUE))
What I want to do is create a column that indicates if a student has previously been in the same grade as he is now.
The desired output for this example can be found here (I created a link to save space).
This may sound simple, but it is not, since students can go back a grade or be absent in prior years.
I would like a way to do this using data.table, as the dataset is very large. So far I've tried the following:
library(dplyr)
library(scales)
school[, repetition := any(school[censor((.I - 10):(.I + 10),
range = c(0, NROW(school))) %>% na.omit
][school[.I, id] == id] == grade)]
However, this doesn't work, as I don't know how to distinguish "upper level" symbols like .I and id (belonging to the first school[...] call) from those inside the second school[...] call.
P.S.: I'll accept suggestions for a better title. Thanks!
We can use duplicated to get a logical value for grades that have already occurred within each id, taking rows in year order.
library(data.table)
school[order(year), repetition := duplicated(grade, incomparables = NA), by = id]
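The incomparables = NA part matters here: values listed in incomparables are never marked as duplicates, so a missing grade is not treated as a repeat of an earlier missing grade. A quick illustration on a plain vector:
duplicated(c(9, NA, 9, NA), incomparables = NA)
[1] FALSE FALSE  TRUE FALSE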
In my problem, I have to apply a function to a subset of individual time series, based on a set of dates extracted from the original data.
So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series) and a sample of individual-date pairs (sample_events), ideally extracted from the same data.
With these, in my example I attempt to calculate an average over a subset of the time-series values exp, conditional on the individual and date in sample_events.
I did this in 2 different ways:
1: a simple but effective approach that gets the job done very quickly
I simply ask the user to input the data for a specific individual and define a time lag and a window width (like a rolling average). The function exp_summary then outputs the requested average.
To repeat the operation for each row in sample_events I decided to nest the individual series by ID of the individuals and then attach the sample of dates. Eventually, I just run a loop that applies the function to each individual nested dataframe.
library(data.table) # exp_summary() below uses as.data.table() and %between%
library(dplyr)      # for %>%, group_by(), mutate()
library(tidyr)      # for nest()
library(runner)     # for method 2
#Sample data
set.seed(111)
exp_series <- data.frame(
id = as.character(rep(1:10000, each=2191)),
date = rep(seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),times=10000),
exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)
sample_dates <- data.frame(
Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))),
Event_date = sample(
seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),
size =10000,replace = TRUE)
)
#This function, given a dataframe with dates and exposure series (df)
#an event_date
#a lag value
#a width of the window
#Outputs the average for a user-defined time window
exp_summary<- function(df, event_date, lag=0,width=0){
df<-as.data.table(df)
end<-as.character(as.Date(event_date)-lag)
start<-as.character(max(as.Date(end)-width, min(df$date)))# I need this in case the time window goes beyond the time limits (earliest date)
return(mean(df[date %between% c(start,end)]$exp))
}
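For instance, applied to a single individual's series from the sample data above (a usage sketch; the event date is arbitrary):
exp_summary(exp_series[exp_series$id == "1", ], "2006-06-01", lag = 1, width = 365)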
#Nest dataframes
exp_series_nest <- exp_series %>%
group_by(id) %>%
nest()
#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)
#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id
#Loop over each row; each element of full_data$data is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})
2: using the highly flexible package runner
With the same data, I need to specify the arguments properly. I have also opened an issue on the GitHub repository about speeding this code up with parallelization.
system.time(summaries2 <- sample_dates %>%
group_by(Event_id) %>%
mutate(
mean = runner(
x = exp_series[exp_series$id == Event_id[1],],
k = "365 days",
lag = "1 days",
idx =exp_series$date[exp_series$id == Event_id[1]],
at = Event_date,
f = function(x) {mean(x$exp)},
na_pad=FALSE
)
)
)
They give the same results up to the second decimal, but method 1 is much faster than method 2, and you can see the difference when you use very large datasets.
My question is: for method 1, how can I write the last loop more concisely within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns embedded in the same dataframe work together.
Also, if you have any other recommendations I am open to hearing them! I am here more out of curiosity than need, as my problem is already solved acceptably by method 1.
With data.table, you could do a non-equi join of exp_series onto the date ranges you want in sample_dates and calculate the mean by = .EACHI:
library(data.table)
setDT(exp_series)
setDT(sample_dates)
lag <- 1
width <- 365
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
,.(id,mean)]
Note that this returns the same results as summaries1 only for Event_id values without duplicates in sample_dates.
The results differ in the case of duplicates, for instance Event_id == 1002:
sample_dates[Event_id==1002]
Event_id Event_date begin end
<char> <Date> <Date> <Date>
1: 1002 2010-08-17 2009-08-16 2010-08-16
2: 1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.
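If duplicates do matter, one option is to carry a unique row identifier through the join so that each event keeps its own row. A sketch on the same objects (row_id is a hypothetical helper column added here):
sample_dates[, row_id := .I]
summariesDT2 <- exp_series[sample_dates, .(row_id, id, mean = mean(exp))
                           , on = .(id = Event_id, date >= begin, date <= end), by = .EACHI][
                           , .(row_id, id, mean)]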
I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LOD's contained within another dataframe to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,-2.3,7.6,0.1,45.6,12.2,-0.1,22.2,0.6),
"TN" = c(100.3,56.2,-10.5,0.4,-0.3,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,-10.5,0.2,14.6,489.3,0.3,14.4,54.6,88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
'LOD' = c(0.6, 11, 0.3)) %>%
mutate(halfLOD=LOD/2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it out into a list of dataframes, one per analyte:
dat %>%
pivot_longer(cols = c('TP','TN','DOC')) %>%
arrange(name) %>%
split(.$name)
I have tried to apply a function using map(), however I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process, to achieve a final product dat2 that should look like this, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1,10,1)),
"Tributary" = c('sawmill','paint', 'herring', 'water',
'paint', 'sawmill', 'bolt', 'water',
'herring', 'sawmill'),
"date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
"TP" = c(1.5,15.7,0.3,7.6,0.3,45.6,12.2,0.3,22.2,0.6),
"TN" = c(100.3,56.2,5.5,5.5,5.5,11.0,45.8,256.0,12.2,144.0),
"DOC" = c(56.0,120.3,0.15,0.15,14.6,489.3,0.3,14.4,54.6,88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code that I have adapted from that question. However, if you run it you will see that an NA is returned wherever a value does not appear in detect_level: the match() call looks the data values up against the LOD values themselves rather than comparing against them. It also does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
mutate(across(all_of(unique(detect_level$Parameter)),
~ {i1 <- detect_level$Parameter == cur_column()
detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not at all comfortable with the purrr-style syntax here and have only adapted this code from the linked question, so if this is the direction an answerer chooses, I would appreciate commented code that briefly explains what is happening "under the hood".
Thank you in advance!
Perhaps this helps
library(dplyr)
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # for each analyte column, match() looks up that column's LOD
                # by name, and pmax() raises any value below the LOD to the LOD
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case (replacing values below the LOD with half the LOD):
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # replace(x, index, values): cur_column() gives the name of the
                # column being processed, match() finds its row in detect_level,
                # and values below that column's LOD are overwritten with halfLOD
                ~ replace(., . < detect_level$LOD[match(cur_column(),
                          detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(),
                          detect_level$Parameter)])))
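As an aside, the pivot_longer route started in the question can also be finished with a join against the lookup table. This is only a sketch built from the objects already defined above: pivot to long format, attach LOD and halfLOD by Parameter, replace, and pivot back:
dat %>%
  pivot_longer(all_of(detect_level$Parameter), names_to = "Parameter") %>%
  left_join(detect_level, by = "Parameter") %>%
  mutate(value = if_else(value < LOD, halfLOD, value)) %>%
  select(-LOD, -halfLOD) %>%
  pivot_wider(names_from = Parameter, values_from = value)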
I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[,logRet3], x[,logMarRet3]), fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Being a newbie, I tried adjusting the code by deleting the comma:
test <- test %>%
group_by(PERMNO) %>%
mutate(CorSecMar = zoo::rollapply(test, width = 1255, function(x) cor(x[logRet3], x[logMarRet3]), fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.
There are several problems:
rollapply applies to each column separately unless by.column = FALSE is used.
using test within group_by will not cause test to be subsetted. It will refer to the entire dataset. Use individual column names instead.
the column names in the code in the question must have quotes around them; otherwise, it is saying there are variables of those names containing the column names.
when posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
group_by(stock) %>%
mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
ungroup
or this code which only uses zoo. mycor is from above.
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)
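For completeness, the same grouped rolling correlation also drops straight into data.table, if that ecosystem is preferred; a sketch reusing mycor and DF from above:
library(data.table)
setDT(DF)
DF[, Cor := rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA), by = stock]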
I am trying to do multiple aggregation steps using data.table. First I want to find the median value at each concentration for a specific type of sample by plate, then I want to find the maximum of the medians for each plate.
library(data.table)
set.seed(1)
DT <- data.table(plate = rep(paste0("plate",1:3),each=11),
type = rep(c(rep(1,9),2,2),3),
value = sample(1:25,33,replace=TRUE),
conc = rep(c(rep(1:3,each=3),4,4),3)
)
I got the following to work:
DT[,med := median(value[type==1]),by=list(plate,conc)]
DT[,max := max(med,na.rm=TRUE),by=plate]
Is it possible to do a multiple-step aggregation without adding the intermediate med column?
You could e.g. do the following:
DT[, max := max(.SD[, median(value[type == 1]), by = conc]$V1, na.rm = T),
by = plate]
but I'm pretty sure your two-line way is much faster.
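If a standalone table of results is acceptable instead of new columns on DT, a chained aggregation avoids the intermediate column entirely; a sketch assuming the names above:
maxmed <- DT[type == 1, .(med = median(value)), by = .(plate, conc)][
  , .(max = max(med)), by = plate]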
I have a data frame with annual exports of firms to different countries. My problem is I need to create a variable that says, for each year, how many firms there are in each country. I can do this perfectly with a "tapply" command, like
incumbents <- tapply(id, `destination-year`, function(x) length(unique(x)))
and it works just fine. My problem is that incumbents has one element per destination-year combination, and I need it to have length length(id) (there are many firms serving each destination each year) to use it in a subsequent regression (of course, in a way that matches the year and the destination). A "for" loop can do this, but it is very time-consuming since the database is kind of huge.
Any suggestions?
You don't provide a reproducible example, so I can't test this, but you should be able to use ave:
incumbents <- ave(id, `destination-year`, FUN=function(x) length(unique(x)))
Just "merge" the tapply summary back in with the original data frame with merge.
Since you didn't provide example data, I made some. Modify accordingly.
n = 1000
id = sample(1:10, n, replace=T)
year = sample(2000:2011, n, replace=T)
destination = sample(LETTERS[1:6], n, replace=T)
`destination-year` = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, `destination-year`)
Now tabulate your summaries. Note how I reformatted to a data frame and made the names match the original data.
incumbents = tapply(id, `destination-year`, function(x) length(unique(x)))
incumbents = data.frame(`destination-year`=names(incumbents), incumbents)
Finally, merge back in with the original data:
merge(dat, incumbents)
By the way, instead of combining destination and year into a third variable, like it seems you've done, tapply can handle both variables directly as a list:
library(reshape2) # melt() turns the tapply matrix into a long data frame
incumbents = melt(tapply(id, list(destination=destination, year=year), function(x) length(unique(x))))
Using #JohnColby's excellent example data, I was thinking of something more along the lines of this:
#I prefer not to deal with the pesky '-' in a variable name
destinationYear = paste(destination, year, sep='-')
dat = data.frame(id, year, destination, destinationYear)
require(plyr) # needed for ddply() below
dat <- ddply(dat,.(destinationYear),transform,newCol = length(unique(id)))
#Or if more speed is required, use data.table
require(data.table)
datTable <- data.table(dat)
datTable <- datTable[,transform(.SD,newCol = length(unique(id))),by = destinationYear]
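For what it's worth, newer versions of data.table also provide uniqueN(), which turns this into a single update by reference; a sketch on datTable from above:
datTable[, newCol := uniqueN(id), by = .(destination, year)]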