Grouping and Running a For Loop in R - r

I'm looking to include a group statement within my for loop and I'm having difficulty finding any details on how to do this properly.
The example below calculates the Extra, Outstanding and current columns within my loop statement. I'm trying to group by id so that the loop restarts with every id. My current code:
library(dplyr)

dat <- tibble(
  id = c("A","A","A","A","A","A","B","B"),
  rn = c(1,2,3,4,5,6,1,2),
  current = c(100,0,0,0,0,0,500,0),
  paid = c(10,12,12,13,13,13,20,20),
  pct_extra = c(.02,.05,.05,.07, .03, .01, .09,.01),
  Extra = NA,
  Outstanding = NA)
for (i in 1:nrow(dat)) {
  dat$Extra[i] <- dat$current[i] * dat$pct_extra[i]
  dat$Outstanding[i] <- dat$current[i] - dat$paid[i] - dat$Extra[i]
  if (i < nrow(dat)) {
    dat$current[(i+1)] <- dat$Outstanding[i]
  }
}
I saw other posts with this same question and they seem to revert to using dplyr. So my first attempt was:
for (i in 1:nrow(dat)) {
  dat %>%
    group_by(id) %>%
    mutate(Extra = pct_extra*(current-paid),
           Outstanding = current-paid-Extra,
           current = if_else(rn==1, current, lag(Outstanding)))
}
But this attempt didn't actually calculate the Extra, Outstanding and current columns, which I'm guessing is because I'm not using the loop statement properly.
Does anyone have any suggestions/references on how I can include a group statement into my for loop?
Thanks!

A few things.
First, for loops (surrounding dplyr pipes) are generally not necessary with dplyr grouping, and this is no exception (though we will use your for loop in a "single group at a time" way).
Second, even if a loop were needed, you loop over i but never use i, so you're doing the same calculation on all rows, nrow(dat) times.
Third, you aren't storing the results anywhere.
My first attempt (after realizing the rolling nature of this) was to try to adapt slider::slide to it, but unfortunately I couldn't get it to work.
In older dplyr, I would have used dat %>% group_by(id) %>% do({...}), but do has been superseded in favor of across and multi-row summarize (which I could not figure out how to make do this).
So then I realized that your for loop works fine; it just needs to be applied one group at a time.
func <- function(z) {
  for (i in seq_len(nrow(z))) {
    z$Extra[i] <- z$current[i] * z$pct_extra[i]
    z$Outstanding[i] <- z$current[i] - z$paid[i] - z$Extra[i]
    if (i < nrow(z)) {
      z$current[(i+1)] <- z$Outstanding[i]
    }
  }
  z
}
library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # map, can be done with base::Map as well
dat %>%
  group_by(id) %>%
  nest(quux = -id) %>%
  mutate(quux = map(quux, func)) %>%
  unnest(quux) %>%
  ungroup()
# # A tibble: 8 x 7
# id rn current paid pct_extra Extra Outstanding
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 100 10 0.02 2 88
# 2 A 2 88 12 0.05 4.4 71.6
# 3 A 3 71.6 12 0.05 3.58 56.0
# 4 A 4 56.0 13 0.07 3.92 39.1
# 5 A 5 39.1 13 0.03 1.17 24.9
# 6 A 6 24.9 13 0.01 0.249 11.7
# 7 B 1 500 20 0.09 45 435
# 8 B 2 435 20 0.01 4.35 411.
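For reference, the nest/map step is not the only way to apply func() one group at a time; a minimal base-R sketch using split() and lapply() (assuming dat as defined above) would be:
# apply the per-group for loop to each id, then stack the results
res <- do.call(rbind, lapply(split(dat, dat$id), func))
res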

Related

R: For a list of dfs, filter for a value in Column1, to extract mean and SD of another Column2 (only of rows with the filtered value in Column1)

I have a big dataset ('links_cl'; each participant of a study has several hundred rows), which I need to subset into dfs, one for each participant.
For those 42 dfs, I then need to do the same operation again and again. After spending half a day trying to write my own function and trying to find a solution online, I now have to ask here.
So, I am looking for a way to subset the huge dataset several times and have one data frame in my environment for every participant, without repeating the same code 42 times. What I did so far 'by hand' is:
Subj01 <- subset(links_cl, Subj == 01, select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE"))
filter for Column 'Condition' (either == 1,2,3 or 4), and describe/get the mean and sd of 'RT_LINK', which I so far also did 'manually'.
Subj01 %>% filter(Condition == 01) %>% describe(Subj01$RT_LINK)
But here I just get the description of the whole df of Subj01, so I would have to find 4x41 means by hand. It would be great to just have an output with the means and SDs of every participant, but I have no idea where to start and how to tell R to do this.
I tried this, but it won't work:
subsetsubj <- function(x, y) {
  Subj_x <- links_cl %>%
    subset(links_cl,
           Subj == x,
           select = c("Condition", "ACC_LINK", "RT_LINK", "CONF_LINK", "ACC_SOURCE", "RT_SOURCE", "CONF_SOURCE")) %>%
    filter(Condition == y) %>%
    describe(Subj_x$RT_LINK)
}
I also tried putting all dfs into a List and work with that, but it lead to nowhere.
If there is a solution without the subsetting, that would also work. This just seemed a logical step to me. Any idea, any help how to solve it?
You don't really need to split the dataset up into one dataframe for each patient. I would recommend a standard group_by()/summarize() approach, like this:
links_cl %>%
  group_by(Subj, Condition) %>%
  summarize(mean_val = mean(RT_LINK),
            sd_val = sd(RT_LINK))
Output:
Subj Condition mean_val sd_val
<int> <int> <dbl> <dbl>
1 1 1 0.0375 0.873
2 1 2 0.103 1.05
3 1 3 0.184 0.764
4 1 4 0.0375 0.988
5 2 1 -0.0229 0.962
6 2 2 -0.156 0.820
7 2 3 -0.175 0.999
8 2 4 -0.0763 1.12
9 3 1 0.272 1.02
10 3 2 0.0172 0.835
# … with 158 more rows
Input:
set.seed(123)
links_cl <- data.frame(
  Subj = rep(1:42, each = 100),
  Condition = rep(1:4, times = 4200/4),
  RT_LINK = rnorm(4200)
)
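That said, if you really do want one data frame per participant, split() gives you a named list without repeating the subset() call 42 times. A minimal sketch using the simulated links_cl above (the column selection and describe() steps from your code are omitted):
# one data frame per participant, e.g. per_subj[["1"]]
per_subj <- split(links_cl, links_cl$Subj)

# per-condition means of RT_LINK for each participant
means_by_subj <- lapply(per_subj, function(d) tapply(d$RT_LINK, d$Condition, mean))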

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I can't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters() might work. However, I cannot even think of a starting point, because changing numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library; you can pipe the res results from the output of dunnTest() into it and create a table that specifies the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running e.g. dunnTest if you plan on using a cld, so that the cld comes out in a sensible order.
library(tidyverse)
ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res
### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data=Result)
### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
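If the goal is the Month / Mean / Group table from the question, one possible way (my addition, not part of the original answer) is to join the cldList() output, which has Group and Letter columns, back onto ozone_summary:
library(dplyr)
cld <- cldList(P.adj ~ Comparison, data = Result)
ozone_summary %>%
  mutate(Month = as.character(Month)) %>%   # Month as character, to match cldList()'s Group labels
  left_join(select(cld, Group, Letter), by = c("Month" = "Group"))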

Alternative to nested loops with filtered calculations and data export

I have a large data file (11 million observations) with columns for ID, year, month, and time period (and variables, like speed, that I'm interested in). I'd like to perform calculations for each of these and summarize the results in a new CSV, with rows for each unique ID/year/month/hour.
I was able to accomplish this with a series of nested loops, which worked fine when the file was smaller (a few thousand observations). I've been trying to find a better method with apply functions, but can't get the same structure. I'm using group_by to create a few new columns before the loops, which runs quickly, but doesn't give me a summary output CSV.
results = NULL
data.calc = NULL
tmp = NULL

PERIOD = 5:9
YEAR = 2014:2017
LINK = 1:5
MONTH = 1:12

for (link in LINK){
  for (year in YEAR){
    for (month in MONTH){
      for (period in PERIOD){
        data.calc = filter(data,
                           LinkID_Int == link,
                           Year == year,
                           MONTH == month,
                           Period == period)
        #Speed
        spd.5 = quantile(data.calc$speed, 0.05)
        spd.20 = quantile(data.calc$speed, 0.20)
        spd.50 = quantile(data.calc$speed, 0.50)
        spd.85 = quantile(data.calc$speed, 0.85)
        spd.SD = sd(data.calc$speed)
        tmp = tibble(link,
                     year,
                     month,
                     period,
                     spd.5, spd.20, spd.50, spd.85,
                     spd.SD)
        results = rbind(results, tmp)
      }
    }
  }
}
write.csv(results, file="C:/Users/...", row.names = FALSE)
This code works, but runs for hours with few results. I like the logic of for loops, meaning it's easy for me to read and understand what's happening, but I've seen plenty of posts saying there are faster ways to go about this. I have about 30 actual calculations running in the loops, across several different variables.
I really appreciate any guidance on this.
A lot of your slow-down, I think, is because you repeatedly filter your data (time-consuming with 11M rows). Since you're already using dplyr (for dplyr::filter), I suggest a "tidy" way of doing this. Since we don't have your data, I'll demonstrate with mtcars:
library(dplyr)
mtcars %>%
  group_by(gear, vs, am) %>%
  summarize_at(vars(disp), .funs = list(~n(), ~mean(.), ~sd(.), q50 = ~quantile(., 0.5)))
# # A tibble: 7 x 7
# # Groups: gear, vs [6]
# gear vs am n mean sd q50
# <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 3 0 0 12 358. 71.8 355
# 2 3 1 0 3 201. 72.0 225
# 3 4 0 1 2 160 0 160
# 4 4 1 0 4 156. 14.0 157.
# 5 4 1 1 6 88.9 20.4 78.8
# 6 5 0 1 4 229. 114. 223
# 7 5 1 1 1 95.1 NaN 95.1
You can see how some columns are automatically named after the function, and one I overrode (q50). This is "just another frame" that can be exported (e.g., to CSV).
If you have more than one variable over which you want summary statistics, just include them in your call to vars, and the column names break out appropriately:
mtcars %>%
  group_by(gear, vs, am) %>%
  summarize_at(vars(mpg, disp), .funs = list(~n(), ~mean(.), ~sd(.), q50 = ~quantile(., 0.5)))
# # A tibble: 7 x 11
# # Groups: gear, vs [6]
# gear vs am mpg_n disp_n mpg_mean disp_mean mpg_sd disp_sd mpg_q50 disp_q50
# <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 3 0 0 12 12 15.0 358. 2.77 71.8 15.2 355
# 2 3 1 0 3 3 20.3 201. 1.93 72.0 21.4 225
# 3 4 0 1 2 2 21 160 0 0 21 160
# 4 4 1 0 4 4 21.0 156. 3.07 14.0 21 157.
# 5 4 1 1 6 6 28.0 88.9 5.12 20.4 28.8 78.8
# 6 5 0 1 4 4 19.1 229. 5.02 114. 17.8 223
# 7 5 1 1 1 1 30.4 95.1 NaN NaN 30.4 95.1
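As an aside, summarize_at() has since been superseded in dplyr; on dplyr 1.0 or later the same summary can be written with across(). A rough equivalent (my translation, not part of the original answer):
mtcars %>%
  group_by(gear, vs, am) %>%
  summarize(across(c(mpg, disp),
                   list(n = length, mean = mean, sd = sd,
                        q50 = ~ quantile(.x, 0.5)),
                   .names = "{.col}_{.fn}"))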
One more "BTW": iteratively building results using rbind(results, tmp) works fine for a few iterations, but it gets really slow. Because: each time you rbind, it makes a complete copy of all of the data in both. If results is 1M rows before the call to rbind, then while the row-binding is going on, you have (at least) 2M rows (1M rows, two copies) in memory at a time. While doing this once or twice is generally not a problem, you can see how doing this hundreds or thousands of times (depending on the number of factors you have) can be problematic.
Better practices include:
pre-allocating your output list with something like:
out <- vector("list", prod(length(LINK), length(YEAR), length(MONTH), length(PERIOD))
ind <- 0L
for (...) {
  for (...) {
    for (...) {
      for (...) {
        tmp <- (do-stuff-here)
        ind <- ind + 1L
        out[[ind]] <- tmp
      }
    }
  }
}
out <- do.call(rbind, out)
or do it within lapply and assign the output to out, though it is a little harder to turn a four-deep nested for into a single lapply (see the sketch after this list).
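A minimal sketch of the lapply idea, assuming the data frame and column names from the question and showing only a few of the summary statistics:
library(dplyr)

# one row per group combination, in place of the four nested loops
grps <- expand.grid(link = LINK, year = YEAR, month = MONTH, period = PERIOD)

out <- lapply(seq_len(nrow(grps)), function(i) {
  data.calc <- filter(data,
                      LinkID_Int == grps$link[i],
                      Year       == grps$year[i],
                      MONTH      == grps$month[i],
                      Period     == grps$period[i])
  tibble(link   = grps$link[i],
         year   = grps$year[i],
         month  = grps$month[i],
         period = grps$period[i],
         spd.50 = quantile(data.calc$speed, 0.50),
         spd.SD = sd(data.calc$speed))
})
results <- do.call(rbind, out)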
I still argue that trying to do nested for loops and filtering the data on each pass is a bad place to start. Even though you can remove the inefficiency of copying the data on every iterative rbind, you'll still have the unnecessary overhead of filtering.
But if you must, then consider filtering once per for level:
out <- vector("list", prod(...)) # as above
ind <- 0L
for (lk in LINK) {
dat_l <- mydat[LinkID_Int == lk,,drop=FALSE]
for (yr in YEAR) {
dat_y <- dat_l[Year == yr,,drop=FALSE]
for (mh in MONTH) {
dat_m <- dat_y[Month == mh,,drop=FALSE]
for (pd in PERIOD) {
data.calc <- dat_m[Period == pd,,drop=FALSE]
tmp <- {do-stuff-here}
ind <- ind + 1L
out[[ ind ]] <- tmp
}
}
}
}
In this case, at least each inner-loop is filtering on much less data. This is still inefficient, but is slightly less so.
(I still think the dplyr solution above is more readable, likely much faster, more maintainable, and more extensible.)
Always avoid running rbind in a loop as it leads to excessive copying in memory. See Patrick Burns' Circle 2, "Growing Objects", of R Inferno.
Since you require inline grouped aggregation, consider base R's ave, which returns a vector the same length as its input and so can be assigned to new columns.
results <- transform(data,
  spd.5  = ave(speed, LinkID_Int, Year, MONTH, Period, FUN = function(x) quantile(x, 0.05)),
  spd.20 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN = function(x) quantile(x, 0.2)),
  spd.50 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN = function(x) quantile(x, 0.5)),
  spd.85 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN = function(x) quantile(x, 0.85)),
  spd.SD = ave(speed, LinkID_Int, Year, MONTH, Period, FUN = sd)
)
For full grouped aggregation of your data, consider base R's aggregate:
agg_raw <- aggregate(speed ~ LinkID_Int + Year + MONTH + Period, data = data,
                     FUN = function(x) c(spd.5 = unname(quantile(x, 0.05)),
                                         spd.20 = unname(quantile(x, 0.2)),
                                         spd.50 = unname(quantile(x, 0.5)),
                                         spd.85 = unname(quantile(x, 0.85)),
                                         spd.SD = sd(x)))

# aggregate() returns the statistics as a matrix column; flatten it into ordinary columns
results <- do.call(data.frame, agg_raw)
colnames(results) <- gsub("speed.", "", colnames(results))

R loop through multiple sub groups with using functions

Hi I am trying to learn how to loop through multiple groups within a data frame and apply certain arithmetic operations. I do not have a programming background and am struggling to loop through the multiple conditions.
My data looks like the following:
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
What I would like to do is to look through each event and each category and take the average value of Money in cases where Play_together == 1. When Play_together==0, then I would like to apply Money/100.
I understand that the loop would look something like the following:
for (i in 1:nrow(z)){
  #loop for event{
    #loop for Category{
      #Define avg or division function
    }
  }
}
However, I cannot seem to implement this using a nested loop. I saw another post (link: apply function for each subgroup) which uses dplyr package. I was wondering if someone could help me to implement this without using any packages (I know this might take longer as compared to using R packages). I am trying to learn R and this is the first time I am working with nested loops.
The final output will look like this:
where for event 1, the following holds:
a) For cateory 1:
Play_together ==1 in row 1; we take the avg of Money value and hence final output = 23/1= 23
Play_together==0 in row 2; we take Money/100= 0.11
b) For category 2:
Play_together == 1 for all observations. We take avg Money for all three observations.
This holds similarly for Event 2. In my actual dataset, I have 600 events and the number of categories ranges from 1 to 10. Some events may have only 1 category and others up to 10 categories, so any function needs to be extremely flexible. The total number of observations in my dataset is around 1.5 million, so any changes to the looping process that reduce the time taken to carry out the operation are going to be extremely helpful (although at this stage my priority is the looping process itself).
Once again it would be a great help if you can show me how to use nested looping and explain the steps in brief. Much appreciated.
Will something like this do?
I know it's using dplyr, but that package is made for this kind of job ;-)
Event = c(1,1,1,1,1,2,2,2,2,2)
Indiv1=c(4,5,6,11,45,66,8,9,32,45)
Indiv2=c(7,81,91,67,12,34,56,78,90,12)
Category=c(1,1,2,2,2,1,2,2,1,1)
Play_together=c(1,0,1,1,1,1,1,1,0,1)
Money=c(23,11,78,-9,-12,345,09,43,21,90)
z = data.frame(Event,Indiv1,Indiv2,Category,Play_together,Money)
library(dplyr)
df_temp <- z %>%
  group_by( Event, Category, Play_together ) %>%
  summarise( money_mean = mean( Money ) ) %>%
  mutate( final_output = ifelse( Play_together == 0, money_mean / 100, money_mean )) %>%
  select( -money_mean )

df <- z %>%
  left_join(df_temp, by = c("Event", "Category", "Play_together" )) %>%
  arrange(Event, Category)
Consider base R's by, the object-oriented wrapper to tapply, which is designed to subset data frames by factor(s) and, unlike split, can pass the subsets into a defined function. Then run conditional logic with ifelse for the Final_Output field. Finally, stack all subsetted data frames for the final object.
# LIST OF DATAFRAMES
by_list <- by(z, z[c("Event", "Category")], function(sub) {
  tmp <- subset(sub, Play_together == 1)
  sub$Final_Output <- ifelse(sub$Play_together == 1, mean(tmp$Money), sub$Money/100)
  return(sub)
})
# APPEND ALL DATAFRAMES
final_df <- do.call(rbind, by_list)
row.names(final_df) <- NULL
final_df
# Event Indiv1 Indiv2 Category Play_together Money Final_Output
# 1 1 4 7 1 1 23 23.00
# 2 1 5 81 1 0 11 0.11
# 3 2 66 34 1 1 345 217.50
# 4 2 32 90 1 0 21 0.21
# 5 2 45 12 1 1 90 217.50
# 6 1 6 91 2 1 78 19.00
# 7 1 11 67 2 1 -9 19.00
# 8 1 45 12 2 1 -12 19.00
# 9 2 8 56 2 1 9 26.00
# 10 2 9 78 2 1 43 26.00
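For completeness, the same result can be had without a loop or by(), using base ave() for the grouped mean. A compact sketch (my own, assuming the z data frame above):
# group mean of Money over rows where Play_together == 1, within Event/Category
grp_mean <- ave(ifelse(z$Play_together == 1, z$Money, NA),
                z$Event, z$Category,
                FUN = function(x) mean(x, na.rm = TRUE))
z$Final_Output <- ifelse(z$Play_together == 1, grp_mean, z$Money / 100)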

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this but more lab-values variables:
df <- data.frame(id = c(1,1,1,1,2,2,2,2,2),
                 time = c(0,3,7,35,0,7,14,28,42),
                 labvalue1 = c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
                 labvalue2 = c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df2
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (with unique ID), the decrease or slope per day between the first and last measurement, to compare the slopes between patients. Time is in days. So eventually I want a new variable, e.g. diff_labvalues, for each value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last-first); etc. for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Thereafter, I grouped by 'id' and used mutate_all (because there are more lab-value columns in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find HOW to extract the corresponding time values (the delta-time value), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
  first(na.omit(x))
}
last1 <- function(x) {
  last(na.omit(x))
}

df2 = df %>%
  group_by(id) %>%
  mutate_all(funs(diff = (last1(.) - first1(.)) /                                      # it works until here
                         (time[position of last1(.)] - time[position of first1(.)])))  # something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try
df %>%
  group_by(id) %>%
  filter(!is.na(labvalue1)) %>%
  summarise(diff_labs = (last(labvalue1) - first(labvalue1)) / (last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])), id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
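Since the question mentions several lab-value columns, a small generalization (my own sketch, not part of either answer) is to wrap the calculation in a helper that also receives the matching time values, so the correct delta-time is used for each column:
library(dplyr)

# slope per day between the first and last non-missing value of x
slope <- function(x, time) {
  ok <- !is.na(x)
  (last(x[ok]) - first(x[ok])) / (last(time[ok]) - first(time[ok]))
}

df %>%
  group_by(id) %>%
  summarise(diff_labvalue1 = slope(labvalue1, time),
            diff_labvalue2 = slope(labvalue2, time))
With many lab-value columns, it should also be possible to wrap this in across(), e.g. across(starts_with("labvalue"), ~ slope(.x, time)).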
