Using data.table in R to eliminate inner for loop

I have an inner for-loop in R that I have identified as a significant bottleneck in my code. The script simulates the effect of a time-varying policy on individuals prior to adulthood. The outer loop runs over the list of birth cohorts (yob = 1910, ..., 1930, etc.) that I would like to study. The inner loop runs over ages from a = 5 to a = 17. CSL.details is a data.table that contains the details of each law that I am studying, in the form of the variables I grab, which vary by year = birthyear + a. To understand the overall effects of the policy by birth cohort, I need to track ca_years1, ca_years2, ca_years3, and ca_years4 for each a.
ages = seq.int(5,17)
state = "Massachusetts"
yob = seq.int(1910, 1930)
for (birthyear in yob){
ca_years1 = 0; ca_years2 = 0; ca_years3 = 0; ca_years4 = 0;
for (a in ages){
thisyear = birthyear + a
# Grab each law for given state and year and implement exemption permit
thislaw <- CSL.details[statename == state & yob == birthyear & thisyear == year]
if (nrow(thislaw) == 0) next
exempt_workpermit = (ca_years2 >= thislaw$workyrs & a >= thislaw$workage & thislaw$workage > 0)
exempt_yearstodropout = (ca_years3 >= thislaw$earlyyrs & a >= thislaw$earlyyrs_condition & thislaw$earlyyrs > 0)
exempt_cont = ((ca_years2 + ca_years4) >= thislaw$contyrs & thislaw$contyrs > 0)
# Increment each law when school is required
if(thislaw$entryage <= a & a < thislaw$exitage){
ca_years1 = ca_years1 + 1
if(!exempt_workpermit){ca_years2 = ca_years2 + 1}
if(!exempt_yearstodropout){ca_years3 = ca_years3 + 1}
}
if(thislaw$contage > a &
a >= thislaw$workage &
!exempt_cont &
thislaw$workage > 0 &
!(thislaw$entryage <= a & a < thislaw$exitage & !exempt_workpermit)
){ca_years4 = ca_years4 + 1}
}
CSL.exposures[statename == state & yob == birthyear]$ca_years1 = ca_years1
CSL.exposures[statename == state & yob == birthyear]$ca_years2 = ca_years2
CSL.exposures[statename == state & yob == birthyear]$ca_years3 = ca_years3
CSL.exposures[statename == state & yob == birthyear]$ca_years4 = ca_years4
}
Is there a data.table solution for replacing the inner loop? I am an intermediate R coder and find it a bit difficult to think of how to get started. Although I would prefer data.table exclusively, I am open to dplyr-type solutions if they significantly speed up the code.
Edit: here is an example of what CSL.details looks like, as a copy-pasted data.table.
statename year yob statefip entryage exitage earlyyrs earlyyrs_condition workage workyrs contage contyrs statecompschoolyr
1: Massachusetts 1913 1800 25 7 16 4 14 14 4 16 0 1852
2: Massachusetts 1913 1801 25 7 16 4 14 14 4 16 0 1852
3: Massachusetts 1913 1802 25 7 16 4 14 14 4 16 0 1852
4: Massachusetts 1913 1803 25 7 16 4 14 14 4 16 0 1852
5: Massachusetts 1913 1804 25 7 16 4 14 14 4 16 0 1852

I managed to refactor the code to solve the problem. The key idea is to exploit state and yob as grouping variables (since all calculations happen within a state and yob pair). This completely eliminates the outer loop and requires only a single loop, iterating by age. I am just saving this answer here for reference, but I am not sure that there is a broader lesson for the stackoverflow.com community, so feel free to delete. The time savings are on the order of 95%, primarily because far fewer data.table calls are made, which cuts the per-call overhead. Note that the %>% pipe used below comes from magrittr.
for(a in ages){
# grab running total of years of education compelled by state and year of birth
CSL.details[CSL.exposures, on = .(statename, yob),
`:=` (ca_years1 = i.ca_years1,
ca_years2 = i.ca_years2,
ca_years3 = i.ca_years3,
ca_years4 = i.ca_years4)] %>%
.[year == a + yob,
`:=`(
# create exemptions by age based on number of years of schooling completed
exempt_workpermit = (ca_years2 >= workyrs & a >= workage & workage > 0),
exempt_yearstodropout = (ca_years3 >= earlyyrs & a >= earlyyrs_condition & earlyyrs > 0),
exempt_cont = ((ca_years2 + ca_years4) >= contyrs & contyrs > 0)
), by = .(statename, yob)]
CSL.exposures[
CSL.details[year == a + yob], on = .(yob, statename),
`:=` (exempt_workpermit = i.exempt_workpermit, exempt_yearstodropout = i.exempt_yearstodropout,
exempt_cont = i.exempt_cont, entryage = i.entryage,
exitage = i.exitage, contage = i.contage, workage = i.workage) ] %>%
.[ ,
`:=` (
ca_years1 =
fifelse(entryage <= a & a < exitage,
ca_years1 + 1, ca_years1, na = as.numeric(ca_years1)),
ca_years2 =
fifelse(entryage <= a & a < exitage & !exempt_workpermit,
ca_years2 + 1, ca_years2, na = as.numeric(ca_years2)),
ca_years3 =
fifelse(entryage <= a & a < exitage & !exempt_yearstodropout,
ca_years3 + 1, ca_years3, na = as.numeric(ca_years3)),
ca_years4 =
fifelse(contage > a & a >= workage & !exempt_cont &
workage > 0 &
!(entryage <= a & a < exitage & !exempt_workpermit),
ca_years4 + 1, ca_years4, na = as.numeric(ca_years4))),
by = .(statename, yob)
]
}
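The pattern in the answer above (an update join, followed by a vectorised increment across all cohorts at once) can be sketched on toy data. The tables exposures and details and their values are hypothetical stand-ins, not the asker's CSL.exposures and CSL.details. Only one counter is tracked, and fifelse() assumes data.table 1.12.4 or later.

```r
library(data.table)

# Hypothetical toy tables (not the original data)
exposures <- data.table(statename = "Massachusetts",
                        yob = c(1910L, 1911L),
                        ca_years1 = 0)
details <- data.table(statename = "Massachusetts",
                      yob  = c(1910L, 1910L, 1911L, 1911L),
                      year = c(1916L, 1917L, 1917L, 1918L),
                      entryage = 7L, exitage = 16L)

for (a in 6:7) {
  # Update join: copy the law parameters in force at age a onto each cohort row
  exposures[details[year == yob + a], on = .(statename, yob),
            `:=`(entryage = i.entryage, exitage = i.exitage)]
  # Increment the counter for every cohort in one vectorised step
  exposures[, ca_years1 := fifelse(entryage <= a & a < exitage,
                                   ca_years1 + 1, ca_years1)]
}

print(exposures)
```

Each pass through the loop touches every cohort at once, so the number of data.table calls scales with the number of ages rather than with ages times cohorts.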

Related

Translating STATA syntax into R syntax [closed]

I'm having some trouble translating some Stata code to R:
Stata code:
gen joint_gpw = sbud_jpw * q44 if sbud_jpw < 888 & q44 < 888
gen sbud_gpw_all = sbud_gpw if sbud_gpw < 888
replace sbud_gpw_all = q31 if sbud_gpw_all ==. & q31 < 888
replace sbud_gpw_all = joint_gpw if sbud_gpw_all ==. & joint_gpw !=.
replace sbud_gpw_all = 888 if q16_1 == 0 & sbud_gpw_all ==.
replace sbud_gpw_all = 888 if (sbud_gpw == 888 & q31 == 888 & sbud_jpw == 888 & q44 == 888) & sbud_gpw_all ==.
replace sbud_gpw_all = 999 if (sbud_gpw == 999 | q31 == 999 | sbud_jpw == 999 | q44 == 999 | (q44 !=. & sbud_jpw == 888)) & sbud_gpw_all ==.
Here is the R code I tried:
dat%>%
dplyr::mutate(joint_gpw = ifelse((sbud_jpw<888 & q44<888),sbud_jpw * q44,NA))%>%
dplyr::mutate(sbud_gpw_all = ifelse(sbud_gpw < 888,sbud_gpw,NA))%>%
dplyr::mutate(sbud_gpw_all = ifelse((sbud_gpw_all= NA & q31<888),q31,NA))%>%
dplyr::mutate(sbud_gpw_all = ifelse((sbud_gpw_all = NA & joint_gpw != NA),joint_gpw,NA))%>%
dplyr::mutate(sbud_gpw_all) = ifelse((q16_1 = 0 & sbud_gpw_all = NA),888,NA)%>%
dplyr::mutate(sbud_gpw_all) = ifelse((sbud_gpw = 888 & q31 = 888 & sbud_jpw = 888 & q44 = 888) & sbud_gpw_all = NA,888,NA)%>%
dplyr::mutate(sbud_gpw_all) = ifelse(((sbud_gpw = 999 | q31 = 999 | sbud_jpw = 999 | q44 = 999 | (q44 != NA & sbud_jpw == 888)) & sbud_gpw_all = NA)),999,NA)
Errors that showed up before:
Error: unexpected '=' in:
" dplyr::mutate(sbud_gpw_all) = ifelse((q16_1 = 0 & sbud_gpw_all = NA),888,NA)%>%
dplyr::mutate(sbud_gpw_all) = ifelse((sbud_gpw = 888 & q31 = 888 & sbud_jpw = 888 & q44 = 888) & sbud_gpw_all ="
I would like to know whether these two sets of code are equivalent. I greatly appreciate any help. Thanks!
The error originates from the closing parenthesis ) after sbud_gpw_all in the last three lines: dplyr::mutate(sbud_gpw_all) closes the call before the assignment.
Also, although it is not what throws the error, you're overwriting sbud_gpw_all with every mutate. I don't know Stata and you didn't provide a minimal reproducible example, but I have a feeling your code could work like this:
library(dplyr)
dat %>%
mutate(
joint_gpw = if_else(sbud_jpw < 888 & q44 < 888, sbud_jpw * q44, NA_real_),
sbud_gpw_all = case_when(
sbud_gpw < 888 ~ sbud_gpw,
q31 < 888 ~ q31,
!is.na(joint_gpw) ~ joint_gpw,
q16_1 == 0 ~ 888,
sbud_gpw == 888 & q31 == 888 & sbud_jpw == 888 & q44 == 888 ~ 888,
sbud_gpw == 999 | q31 == 999 | sbud_jpw == 999 | q44 == 999 | (!is.na(q44) & sbud_jpw == 888) ~ 999
)
)
This will first create the column joint_gpw using dplyr::if_else(): it holds sbud_jpw * q44 where sbud_jpw < 888 & q44 < 888, and NA otherwise. Afterwards, a set of conditions (to the left of each ~) is checked sequentially. The first condition that matches a row provides its value (to the right of the ~ operator).
Note that, as Sotos pointed out in a comment, NAs in R are checked with is.na(x), not with ==/!=, as those will always return NA. I omitted the NA check for most lines because it is implied by the sequential nature of case_when(): as soon as one condition matches a row, the later ones no longer apply to that row. The NA_real_ is a numeric NA value. Using if_else() and case_when(), you have to be explicit about the data type.
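A small sketch of the two points above, on made-up values: comparing with NA yields NA, while is.na() gives a usable logical, and case_when() takes the first matching condition per row, treating an NA condition as not matched. The data frame dat here is a hypothetical two-column stand-in for the asker's data.

```r
library(dplyr)

x <- c(5, NA, 888)
x == NA      # always NA, so it cannot be used as a test
is.na(x)     # FALSE TRUE FALSE

# Hypothetical stand-in for the survey data
dat <- data.frame(sbud_gpw = c(3, NA, 888),
                  q31      = c(2, 4, 888))

out <- dat %>%
  mutate(sbud_gpw_all = case_when(
    sbud_gpw < 888 ~ sbud_gpw,  # first match wins
    q31 < 888      ~ q31,       # only reached if the line above did not match
    TRUE           ~ 888        # fallback for everything else
  ))
print(out)
```

In the second row sbud_gpw is NA, so the first condition evaluates to NA, is skipped, and the value comes from q31.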

Kusto - Group by duration value to show numbers

I use the query below to calculate the time difference between two events, but I am not sure how to group the durations. I tried the case function but it does not seem to work. Is there a way to group the durations? For example, a pie or column chart showing the number of items with durations of more than 2 hours, more than 5 hours, and more than 10 hours. Thanks
| where EventName in ('Handligrequest','Requestcomplete')
| summarize Time_diff = anyif(Timestamp,EventName == "SlackMessagePosted") - anyif(Timestamp,EventName == "ReceivedSlackMessage") by CorrelationId
| where isnotnull(Time_diff)
| extend Duration = format_timespan(Time_diff, 's')
| sort by Duration desc
// Generate data sample. Not part of the solution
let t = materialize (range i from 1 to 1000 step 1 | extend Time_diff = 24h*rand());
// Solution Starts here
t
| summarize count() by time_diff_range = case(Time_diff >= 10h, "10h <= x", Time_diff >= 5h, "5h <= x < 10h", Time_diff >= 2h, "2h <= x < 5h", "x < 2h")
| render piechart
time_diff_range    count_
10h <= x           590
5h <= x < 10h      209
x < 2h             89
2h <= x < 5h       112

Create new variable in R with assumptions from SPSS file

I've read my SPSS file into R and want to create a new variable when certain conditions hold. To be specific:
I want to turn my spssdata_sub$gest variable into a new variable if the following conditions are met:
spssdata_sub$indusert != 2 & spssdata_sub$ivf != 1 & spssdata_sub$leie != 3 & spssdata_sub$svkompl_II != 7 & spssdata_sub$svkompl_II != 2 & spssdata_sub$svkompl_II != 1
Can anyone here help me with the code?
Does one of the following snippets work for you?
Either this adapted version of Renu's solution
spssdata_sub$gest <- ifelse(spssdata_sub$indusert != 2 & spssdata_sub$ivf != 1 & spssdata_sub$leie != 3 & spssdata_sub$svkompl_II != 7 & spssdata_sub$svkompl_II != 2 & spssdata_sub$svkompl_II != 1, spssdata_sub$gest, NA)
or this code for filtering observations:
library(dplyr)
spssdata_sub_new <- spssdata_sub %>%
filter(indusert != 2 & ivf != 1 & leie != 3 & svkompl_II != 7 & svkompl_II != 2 & svkompl_II != 1)
One way is the following, if you really mean that any one of the conditions should exclude an entry:
Mynewdata <- dplyr::filter(spssdata, indusert != 2, ivf != 1, leie != 3,
svkompl_II != 7 & svkompl_II != 2 & svkompl_II != 1)
This only keeps entries that match none of the listed values. Put the other way, it excludes entries that have indusert = 2 or ivf = 1, etc.; a single matching condition is enough to exclude an entry.
Add-on: or something like this:
Mynewdata <- dplyr::filter(spssdata, indusert != 2, ivf != 1, leie != 3,
!(svkompl_II %in% c(7,2,1)))
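One caveat when choosing between the chained != form and the %in% form: they agree on non-missing values but differ on NA. The vector below is made up for illustration; it is not taken from the SPSS file.

```r
# Made-up values, including one missing entry
svkompl <- c(1, 2, 3, 7, NA)

chained <- svkompl != 7 & svkompl != 2 & svkompl != 1
negated <- !(svkompl %in% c(7, 2, 1))

print(chained)  # NA for the missing entry
print(negated)  # TRUE for the missing entry
```

Since dplyr::filter() drops rows where the condition is NA, the chained form removes rows with a missing value, while the %in% form keeps them. Which behaviour is wanted depends on how missing values should be treated in the analysis.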

Exporting Summary Data to CSV in R

Hello everyone, I am working on a script whose results I would like to export to a CSV file.
Everything is working well, except that I would like to add column names and headers for the data below.
For instance, variable A is the summary data of fixed income trades in 2017. I would like Row 1 in the output file to read as such.
Any help would be greatly appreciated. My code is written below. Thanks in advance!
#SENDS THE RESULTS TO FILE CALLED OUTFILE.CSV WHICH IS OVERWRITTEN EACH TIME SCRIPT IS RUN
sink("outfile.csv")
#SHORT-TERM PRE-REFUNDED TRADE DATA
A = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2017 & MSRB$Par.Traded >=500 & MSRB$Class == "PRE-REFUNDED"),]
B = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2017 & MSRB$Par.Traded >=1000 & MSRB$Class == "PRE-REFUNDED"),]
C = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2018 & MSRB$Par.Traded >=500 & MSRB$Class == "PRE-REFUNDED"),]
D = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2018 & MSRB$Par.Traded >=1000 & MSRB$Class == "PRE-REFUNDED"),]
E = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2019 & MSRB$Par.Traded >=500 & MSRB$Class == "PRE-REFUNDED"),]
F = MSRB[which(MSRB$Coupon.Rate >= 2 & MSRB$Year == 2019 & MSRB$Par.Traded >=1000 & MSRB$Class == "PRE-REFUNDED"),]
#SUMMARY OF PRE-REFUNDED DATA
summary(A$Yield)
summary(B$Yield)
summary(C$Yield)
summary(D$Yield)
summary(E$Yield)
summary(F$Yield)
#END OF OUTPUT FILE
sink()
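One possible way to get labelled rows, sketched under assumptions: collect each summary() result as one row of a data frame and write it with write.csv() instead of sink(). The MSRB values, the subset labels, and the temporary output path below are made up for illustration; only two subsets are shown.

```r
# Hypothetical stand-in for the MSRB data frame
MSRB <- data.frame(Year       = c(2017, 2017, 2018, 2018),
                   Par.Traded = c(500, 1500, 700, 2000),
                   Yield      = c(1.2, 1.5, 1.8, 2.1))

# Named list: the names become the row labels in the output file
subsets <- list(
  "2017, par >= 500" = MSRB[MSRB$Year == 2017 & MSRB$Par.Traded >= 500, ],
  "2018, par >= 500" = MSRB[MSRB$Year == 2018 & MSRB$Par.Traded >= 500, ]
)

# summary() on a numeric vector gives named values (Min., Median, Mean, ...)
rows <- lapply(subsets, function(d) unclass(summary(d$Yield)))
out  <- data.frame(Label = names(rows), do.call(rbind, rows),
                   check.names = FALSE)

csv_path <- tempfile(fileext = ".csv")  # illustrative output location
write.csv(out, csv_path, row.names = FALSE)
```

The resulting file has a header row (Label, Min., 1st Qu., and so on) and one labelled row per subset.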

R: recommendation on how to compute new columns on multiple condition of others for every row in data.frame

For every row I need to compute two variables as new columns in a data.frame, conditional on more than 60 other columns. I would like your recommendation on how to do that elegantly (while and for, with, ifelse, foreach, by or ddply?). I don't want to do it manually as I did for the first cases in the example code, and I don't care about performance.
Further: I probably would not need to ask if I understood how to use functions like transform (with ddply or by) and what they do. So I hope you can recommend good tutorials on that, ideally relating to my case. I found a lot, but in different contexts, and was not able to fully comprehend them or adapt them to my case.
My case: I have three columns for each of 20 events, representing the kind and date of that event. For each row I need to compute (and save to that data.frame) the difference in time between one particular event (depending on whether an event of a particular kind happened before or after another) and a date that is fixed for each row. Furthermore, I need to save the date of that event.
This is how I did it (it works, but it only covers the first cases):
#event.2 (1. event month), event.3 (1. event year), event.4 (1. event kind), event.5 (2. event month), event.6 (2. event year), ...
df$dit[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(df$event.4 == 3 & ((1/12*df$event.2)+df$event.3) > df$fixdate) & (df$event.7 == 1 | df$event.7 == 2)
)] = ((1/12*df$event.2)+df$event.3) - df$fixdate
df$date[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(df$event.4 == 3 & ((1/12*df$event.2)+df$event.3) > df$fixdate) & (df$event.7 == 1 | df$event.7 == 2)
)] = ((1/12*df$event.2)+df$event.3)
df$dit[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(df$event.4 == 1 & ((1/12*df$event.2)+df$event.3) > df$fixdate)
| (df$event.4 == 2 & ((1/12*df$event.2)+df$event.3) > df$fixdate)
)] = 0
df$date[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(df$event.4 == 1 & ((1/12*df$event.2)+df$event.3) > df$fixdate)
| (df$event.4 == 2 & ((1/12*df$event.2)+df$event.3) > df$fixdate)
)] = df$fixdate
df$dit[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(
(df$event.4 == 1 & ((1/12*df$event.2)+df$event.3) < df$fixdate)
& (
(df$event.7 == 1 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
| (df$event.7 == 2 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
)
)
|
(
(df$event.4 == 2 & ((1/12*df$event.2)+df$event.3) < df$fixdate)
& (
(df$event.7 == 1 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
| (df$event.7 == 2 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
)
)
)] = ((1/12*df$event.5)+df$event.6) - df$fixdate
df$date[(!is.na(df$event.2) & !is.na(df$event.3) & !is.na(df$event.4) & !is.na(df$event.5) & !is.na(df$event.6) & !is.na(df$event.7))
& (
(
(df$event.4 == 1 & ((1/12*df$event.2)+df$event.3) < df$fixdate)
& (
(df$event.7 == 1 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
| (df$event.7 == 2 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
)
)
|
(
(df$event.4 == 2 & ((1/12*df$event.2)+df$event.3) < df$fixdate)
& (
(df$event.7 == 1 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
| (df$event.7 == 2 & ((1/12*df$event.5)+df$event.6) > df$fixdate)
)
)
)] = ((1/12*df$event.5)+df$event.6)
You can define your conditions as expressions and use them within transform. The idea is to factor out your conditions as much as possible.
COND1 <- expression(!is.na(event.2) & !is.na(event.3) &
!is.na(event.4) & !is.na(event.5) &
!is.na(event.6) & !is.na(event.7))
COND2 <- expression((event.4 == 3 & ((1/12*event.2)+event.3) > fixdate) &
(event.7 == 1 | event.7 == 2))
COND3 <- expression(event.4 == 1 & ((1/12*event.2)+event.3) > fixdate)
COND4 <- expression(event.4 == 2 & ((1/12*event.2)+event.3) > fixdate)
### you continue here with the rest of conditions....
Then using them within transform you can do something like:
df <- transform(df, date = ifelse(eval(COND1) & eval(COND2), ((1/12*event.2)+event.3),
ifelse(eval(COND1) & (eval(COND3)|eval(COND4)), fixdate, NA)))
## Note also that the second "dit" variable is deduced from "date"
df <- transform(df, dit = date - fixdate)
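The expression-plus-transform idea can be tried end to end on a tiny data frame. The rows below are invented, only the first event and the kind of the second event are included, and the helper column first_date and the names COND_A and COND_B are introduced here for illustration; they are not part of the original answer.

```r
# Invented rows using the column layout described in the question
df <- data.frame(event.2 = c(6, 6, NA),          # month of first event
                 event.3 = c(2001, 1999, 2000),  # year of first event
                 event.4 = c(3, 1, 3),           # kind of first event
                 event.7 = c(1, 2, 1),           # kind of second event
                 fixdate = c(2000, 1998, 2000))

# Compute the decimal date once instead of repeating the formula
df <- transform(df, first_date = event.2 / 12 + event.3)

# Conditions stored as unevaluated expressions
COND_A <- expression(!is.na(first_date) & event.4 == 3 &
                       first_date > fixdate & event.7 %in% c(1, 2))
COND_B <- expression(!is.na(first_date) & event.4 %in% c(1, 2) &
                       first_date > fixdate)

# eval() inside transform() resolves the column names against df
df <- transform(df, date = ifelse(eval(COND_A), first_date,
                           ifelse(eval(COND_B), fixdate, NA)))
df <- transform(df, dit = date - fixdate)
print(df)
```

Nesting the ifelse() calls keeps a later condition from overwriting an earlier result with NA, which is why the two cases are combined in one transform() call here.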
