R: Loop through data.frame in row-pairs


I would like to process some GPS-Data rows, pairwise.
For now, I am doing it in a normal for-loop but I'm sure there is a better and faster way.
library(sp)  # provides spDists()
n = 100
testdata <- as.data.frame(cbind(runif(n,1,10), runif(n,0,360), runif(n,14,16), runif(n, 46,49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
  diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
                             longlat = TRUE, segments = TRUE)*1000
  diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
  diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do that calculation in parallel?
Thank you very much!

I'm not sure what you want to do with the end condition but with dplyr you can do all of this without using a for loop.
library(dplyr)
library(sp)  # provides spDists()
# Each %>% must end a line; a line that starts with %>% is a syntax error in R
testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)),
                          longlat = TRUE, segments = TRUE) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
#   heading_diff speed_diff distance
#          <dbl>      <dbl>    <dbl>
# 1         15.9      0.107   326496
# 2       -345       -4.64     55184
# 3        124       -1.16     25256
# 4         85.6      5.24    221885
# 5         53.1     -2.23     17599
# 6       -184        2.33    225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the result of one operation into the next. So we start with your test data and send it to the mutate function.
The first mutate creates four new columns holding the differences from one row to the next. A 0 is appended for the last row because no measurement follows the last data point. (You could use NA instead.)
Once you have the differences, rowwise lets you apply the spDists function to each row separately.
The second mutate then creates the distance column from long, lat and the two coordinate differences (longdiff, latdiff) created earlier.
To keep only the three columns you were interested in, I added a select statement at the end. You can leave this out if you want the entire data frame.
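The row-wise call can also be avoided entirely by computing the distance between consecutive rows with vectorised arithmetic. The following is a minimal sketch in base R, assuming a spherical Earth of radius 6371 km (the haversine formula), so the distances will not match spDists exactly. The helper names haversine_m and pairwise_diffs are made up for this illustration.

```r
# Haversine great-circle distance in metres between paired points (degrees).
# Assumption: spherical Earth with radius 6371 km.
haversine_m <- function(long1, lat1, long2, lat2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlong <- (long2 - long1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlong / 2)^2
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Differences between each row and the next one; returns nrow(d) - 1 rows.
pairwise_diffs <- function(d) {
  from <- seq_len(nrow(d) - 1)
  to <- from + 1
  data.frame(
    distance = haversine_m(d$long[from], d$lat[from], d$long[to], d$lat[to]),
    heading_diff = diff(d$heading),
    speed_diff = diff(d$speed)
  )
}

set.seed(1)
n <- 100
testdata <- data.frame(speed = runif(n, 1, 10), heading = runif(n, 0, 360),
                       long = runif(n, 14, 16), lat = runif(n, 46, 49))
head(pairwise_diffs(testdata))
```

Because every step works on whole columns, there is no loop and no rowwise(), and the result has one row per pair, like the diffmatrix in the question.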

Related

select top n values by group with n depending on other value in data frame

I'm quite new to r and coding in general. Your help would be highly appreciated :)
I'm trying to select the top n values by group, with n depending on another value (in the following called factor) from my data frame. Then, the selected values should be summarised by group to calculate the mean (d100). My goal is to get one value for d100 per group.
(Background: In forestry there is an indicator called d100 which is the mean diameter of the 100 thickest trees per hectare. If the size of the sampling area is smaller than 1 ha you need to select accordingly fewer trees to calculate d100. That's what the factor is for.)
First I tried to put the factor inside my dataframe as an own column. Then I thought maybe it would help to have something like a "lookup-table", because R said, that n must be a single number. But I don't know how to create a lookup-function. (See last part of the sample code.) Or maybe summarising df$factor before using it would do the trick?
Sample data:
(I indicated expressions where I'm not sure how to code them in R like this: 'I dont know how')
# creating sample data
library(tidyverse)
df <- data.frame(group = c(rep(1, each = 5), rep(2, each = 8), rep(3, each = 10)),
BHD = c(rnorm(23, mean = 30, sd = 5)),
factor = c(rep(pi*(15/100)^2, each = 5), rep(pi*(20/100)^2, each = 8), rep(pi*(25/100)^2, each = 10))
)
# group by ID, then select top_n values of df$BHD with n depending on value of df$factor
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*df$factor,
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
# other thought: having a "lookup-table" for the factor like this:
lt <- data.frame(group = c(1, 2, 3),
factor = c(pi*(15/100)^2, pi*(20/100)^2, pi*(25/100)^2))
# then
df %>%
group_by(group) %>%
slice_max(
BHD,
n = 100*lt$factor 'where lt$group == df$group',
with_ties = F) %>%
summarise(d100 = mean('sliced values per group'))
I already found this answer to a problem which seems similar to mine, but it didn't quite help.

Since all the factor values are the same within each group, you can select any one factor value.
library(dplyr)
df %>%
group_by(group) %>%
top_n(BHD, n = 100* first(factor)) %>%
ungroup
#    group   BHD factor
#    <dbl> <dbl>  <dbl>
#  1     1  25.8 0.0707
#  2     1  24.6 0.0707
#  3     1  27.6 0.0707
#  4     1  28.3 0.0707
#  5     1  29.2 0.0707
#  6     2  28.8 0.126
#  7     2  39.5 0.126
#  8     2  23.1 0.126
#  9     2  27.9 0.126
# 10     2  31.7 0.126
# … with 13 more rows
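To get from the selected rows to one d100 value per group, the selection can be followed by summarise. The following is a minimal sketch that uses filter with a within-group rank instead of top_n; the helper name d100_by_group and the extra n_trees column are made up for this illustration.

```r
library(dplyr)

# Hypothetical helper: mean BHD of the thickest 100 * factor trees per group.
d100_by_group <- function(df) {
  df %>%
    group_by(group) %>%
    # rank trees from thickest (1) to thinnest within each group and
    # keep those whose rank does not exceed the group's own n
    filter(row_number(desc(BHD)) <= 100 * first(factor)) %>%
    summarise(d100 = mean(BHD), n_trees = n())
}

set.seed(42)
df <- data.frame(
  group = rep(1:3, times = c(5, 8, 10)),
  BHD = rnorm(23, mean = 30, sd = 5),
  factor = rep(pi * (c(15, 20, 25) / 100)^2, times = c(5, 8, 10))
)
d100_by_group(df)
```

With the sample data from the question, 100 * factor is about 7.07, 12.6 and 19.6, which is more than the 5, 8 and 10 trees available, so every tree ends up in the mean. That is also why the output above still has all 23 rows.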

How can I iterate a function over specific columns of a series of dataframes where I can set the order?

I work for an insurance company and I am trying to improve something that I built. I have about 150 data frames that look like this:
library(data.table)
dt_Premium<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense<-data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
I take the base premium and then I multiply by factors, and then add a fixed expense at the very end. My code is currently something like:
dt_Final_Premium<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
What I hate about this:
-The 2:4 stuff (I would like to be able to use a named range)
-The typing is monstrous considering all of the tables and policies I actually have
-It is very confusing for anybody except me (the author) to understand and edit/adjust the code
-I would like to be able to have each rating step as part of a list, and then just iterate over that list (or a similar process).
-Ideally I would be able to get the values at each step. For example :
step2_answer<-cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4])
There just has to be a way where I can take a dataframe/datatable and then just multiply or add to the next dataframe/datatable in the series. Thanks for taking a look at this.

How about something like this using dplyr?!
Here I am using the same calculation that you mentioned, but after joining the tables by Policy and with dplyr's mutate function, so each column's calculation is written out step by step and is easy for anyone to follow.
library(data.table)
library(dplyr)
dt_Premium <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Base_Premium_Fire= c(45,55,105,92),
Base_Premium_Water= c(20,21,24,29),
Base_Premium_Theft= c(3,5,6,7))
dt_Discount_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Discount_Factor_Fire= c(.9,.95,.99,.97),
Discount_Factor_Water= c(.8,.85,.9,.96),
Discount_Factor_Theft= c(1,1,1,1))
dt_Territory_Factors <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Territory_Factor_Fire= c(1.9,1.2,.91,1.03),
Territory_Factor_Water= c(1.03,1.3,1.25,1.01),
Territory_Factor_Theft= c(1,1.5,1,.5))
dt_Fixed_Expense <- data.table(Policy = c("Pol123","Pol333","Pol555","Pol999"),
Fixed_Expense_Fire= c(5,5,5,5),
Fixed_Expense_Water= c(7,7,7,7),
Fixed_Expense_Theft= c(9,9,9,9))
dt_Final_Premium <- cbind(dt_Premium[,1],dt_Premium[,2:4]*
dt_Discount_Factors[,2:4]*
dt_Territory_Factors[,2:4]+
dt_Fixed_Expense[,2:4])
new_dt_final_premium <-
dt_Premium %>%
# Joining all tables together
left_join(dt_Discount_Factors, by = "Policy") %>%
left_join(dt_Territory_Factors, by = "Policy") %>%
left_join(dt_Fixed_Expense, by = "Policy") %>%
# Calculating required calculation
mutate(
Base_Premium_Fire =
Base_Premium_Fire * Discount_Factor_Fire * Territory_Factor_Fire + Fixed_Expense_Fire,
Base_Premium_Water =
Base_Premium_Water * Discount_Factor_Water * Territory_Factor_Water + Fixed_Expense_Water,
Base_Premium_Theft =
Base_Premium_Theft * Discount_Factor_Theft * Territory_Factor_Theft + Fixed_Expense_Theft) %>%
select(Policy, Base_Premium_Fire, Base_Premium_Water, Base_Premium_Theft)

Since your columns have a clean naming, some pivoting may do the work:
library(tidyverse) #to be run after library(data.table)
dt_Premium %>%
left_join(dt_Discount_Factors, by="Policy") %>%
left_join(dt_Territory_Factors, by="Policy") %>%
left_join(dt_Fixed_Expense, by="Policy") %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #or calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")
After joining all the columns, you can pivot to a long format and turn columns to rows. There, you can separate the name into the real name (Base, Discount, Fixed...) and the object (Fire, Water, ...) and return to the wide format. The tricky part is to get a good regular expression, as your names use the underscore twice. Mine can be vastly improved but will do the work for now.
After this, you can calculate whatever you want, select only the result and pivot to wide one last time. If you want to get all the results, you may tweak this last pivot with prefixes.
Pivoting is quite a gymnastics, but it has proven to be very effective once you get used to it.
As you have a lot of tables, if you can get them as a list, you can also use purrr::reduce to join them all at once and simplify the first lines of code:
list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense) %>%
reduce(left_join, by='Policy') %>%
pivot_longer(cols=-Policy)%>%
separate(name, into=c("name", "object"), sep="_.*_") %>%
pivot_wider() %>%
mutate(total=Base*Discount*Territory+Fixed) %>% #or calculate the value for a specific step
select(Policy, object, total) %>%
pivot_wider(names_from = "object", values_from = "total")

Another option is to reorganize the data by converting into a long format, merge and then perform the calculations:
DT <- Reduce(merge, lapply(dtList, function(d) {
vn <- sub('_([^_]*)$', '', names(d)[2L]) #see reference [1]
melt(d, id.vars="Policy", value.name=vn)[,
variable := gsub("(.*)_(.*)_(.*)", "\\3", variable)]
}))
DT
DT[, disc_prem := Base_Premium * Discount_Factor][,
disc_prem_loc := disc_prem * Territory_Factor][,
Final_Premium := disc_prem_loc + Fixed_Expense]
output:
Policy variable Base_Premium Discount_Factor Territory_Factor Fixed_Expense disc_prem disc_prem_loc Final_Premium
1: Pol123 Fire 45 0.90 1.90 5 40.50 76.9500 81.9500
2: Pol123 Theft 3 1.00 1.00 9 3.00 3.0000 12.0000
3: Pol123 Water 20 0.80 1.03 7 16.00 16.4800 23.4800
4: Pol333 Fire 55 0.95 1.20 5 52.25 62.7000 67.7000
5: Pol333 Theft 5 1.00 1.50 9 5.00 7.5000 16.5000
6: Pol333 Water 21 0.85 1.30 7 17.85 23.2050 30.2050
7: Pol555 Fire 105 0.99 0.91 5 103.95 94.5945 99.5945
8: Pol555 Theft 6 1.00 1.00 9 6.00 6.0000 15.0000
9: Pol555 Water 24 0.90 1.25 7 21.60 27.0000 34.0000
10: Pol999 Fire 92 0.97 1.03 5 89.24 91.9172 96.9172
11: Pol999 Theft 7 1.00 0.50 9 7.00 3.5000 12.5000
12: Pol999 Water 29 0.96 1.01 7 27.84 28.1184 35.1184
data:
dtList <- list(dt_Premium, dt_Discount_Factors, dt_Territory_Factors, dt_Fixed_Expense)
Reference:
regex-return-all-before-the-second-occurrence
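The question also asks for the rating steps to live in a list that can be iterated, with the value after each step available. The following is a minimal sketch of that idea in base R, using Reduce with accumulate = TRUE. It assumes every table lists the same policies in the same row order and that each value column ends in the peril name (_Fire, _Water, _Theft); the names perils, peril_values and steps are made up for this illustration.

```r
dt_Premium <- data.frame(Policy = c("Pol123", "Pol333", "Pol555", "Pol999"),
                         Base_Premium_Fire = c(45, 55, 105, 92),
                         Base_Premium_Water = c(20, 21, 24, 29),
                         Base_Premium_Theft = c(3, 5, 6, 7))
dt_Discount_Factors <- data.frame(Policy = c("Pol123", "Pol333", "Pol555", "Pol999"),
                                  Discount_Factor_Fire = c(.9, .95, .99, .97),
                                  Discount_Factor_Water = c(.8, .85, .9, .96),
                                  Discount_Factor_Theft = c(1, 1, 1, 1))
dt_Territory_Factors <- data.frame(Policy = c("Pol123", "Pol333", "Pol555", "Pol999"),
                                   Territory_Factor_Fire = c(1.9, 1.2, .91, 1.03),
                                   Territory_Factor_Water = c(1.03, 1.3, 1.25, 1.01),
                                   Territory_Factor_Theft = c(1, 1.5, 1, .5))
dt_Fixed_Expense <- data.frame(Policy = c("Pol123", "Pol333", "Pol555", "Pol999"),
                               Fixed_Expense_Fire = c(5, 5, 5, 5),
                               Fixed_Expense_Water = c(7, 7, 7, 7),
                               Fixed_Expense_Theft = c(9, 9, 9, 9))

perils <- c("Fire", "Water", "Theft")

# Extract the value columns as a matrix, named and ordered by peril,
# so columns are matched by name rather than by position (no 2:4).
peril_values <- function(tbl) {
  tbl <- as.data.frame(tbl)  # also accepts a data.table
  m <- as.matrix(tbl[, setdiff(names(tbl), "Policy")])
  colnames(m) <- sub(".*_", "", colnames(m))
  rownames(m) <- tbl$Policy
  m[, perils, drop = FALSE]
}

# Each rating step is an operator plus the table it applies, in order.
steps <- list(
  discount  = list(op = `*`, tbl = dt_Discount_Factors),
  territory = list(op = `*`, tbl = dt_Territory_Factors),
  expense   = list(op = `+`, tbl = dt_Fixed_Expense)
)

# accumulate = TRUE keeps the running result after every step.
results <- Reduce(function(acc, step) step$op(acc, peril_values(step$tbl)),
                  steps, accumulate = TRUE, init = peril_values(dt_Premium))
names(results) <- c("base", names(steps))

results$discount  # premium after the discount step
results$expense   # final premium
```

Adding, removing or reordering a rating step then only means editing the steps list, and every intermediate result stays available by name.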

I am guessing that reading some of the data.table vignettes would help you tighten up the syntax and make it more terse. Some of us think terse = 'more readable' in numeric programming. Others think that represents some level of insanity:
vignette(package="data.table")
Understanding Map, Reduce, mget and other functional notation in R and data.table may help. Here are some things I have done from a data.table mindset:
Excluding columns can be done tersely with .SDcols, given a vector of column names to drop:
dt[is.na(dt)] <- 0 # replace NA with 0
drop_col_list <- c('dropcol1','dropcol2','dropcol3') # drop col list
keep_cols <- setdiff(names(dt), drop_col_list) # every column not in the drop list
# dt[, (keep_cols) := lapply(.SD, as.numeric), .SDcols = keep_cols] # make selected dt cols numeric type
dt[, SumCol := Reduce(`+`, .SD), .SDcols = keep_cols] # adds Sum col with 'functional programming' iteration
The lapply(.SD, func) format is very powerful:
fsum <- function(x) {sum(x,na.rm=TRUE)}
dt[,lapply(.SD,fsum),.SDcols=c("col1","col2","col3","col4")]
# or
dt[,lapply(.SD,fsum),.SDcols=keep_cols]
This shows applying data.table's assignment-by-reference operator (':=') and mget to create cols derived from operations with functional programming on two data.tables. The data.table(s) may need to have the same nrow():
nm1 <- names(dt1)[1:4]
nm2 <- names(dt2)[1:4]
dt[, SumCol := Reduce(`+`, Map(`*`, mget(nm1), mget(nm2)))]
The loop below isn't really data.table-esque programming, but it outputs a data.table. It is probably not as fast as more data.table-like syntax:
seqXpi <- function(x) {x * pi}
seqXexp <- function(x) {x * exp(1)}
l <- {};
for(x in seq(1,10,1)) l <- as.data.table(rbind(l,cbind(seq=x,seqXpi=seqXpi(x),seqXexp=seqXexp(x))))

how to use bracket notation (or an alternative) while programming with dplyr

I'm trying to write a function to calculate toplines (as commonly used in polling data). It needs to include both a "percent" and "valid percent" column.
Here's an example
library(tidyverse)
# prepare some data
d <- gss_cat %>%
mutate(tvhours2 = tvhours,
tvhours2 = replace(tvhours2, tvhours > 5 , "6-8"),
tvhours2 = replace(tvhours2, tvhours > 8 , "9+"),
tvhours2 = fct_explicit_na(tvhours2),
# make a weight variable
fakeweight = rnorm(n(), mean = 1, sd = .25))
The following function works as far as it goes:
make.topline <- function(variable, data, weight){
variable <- enquo(variable)
weight <- enquo(weight)
table <- data %>%
# calculate denominator
mutate(total = sum(!!weight)) %>%
# calculate proportions
group_by(!!variable) %>%
summarise(pct = (sum(!!weight)/first(total))*100,
n = sum(!!weight))
table
}
make.topline(variable = tvhours2, data = d, weight = fakeweight)
I'm struggling to implement the valid percent column. Here is the syntax I tried.
make.topline2 <- function(variable, data, weight){
variable <- enquo(variable)
weight <- enquo(weight)
table <- data %>%
# calculate denominator
mutate(total = sum(!!weight),
valid.total = sum(!!weight[!!variable != "(Missing)"])) %>%
# calculate proportions
group_by(!!variable) %>%
summarise(pct = (sum(!!weight)/first(total))*100,
valid.pct = (sum(!!weight)/first(valid.total))*100,
n = sum(!!weight))
table
}
make.topline2(variable = tvhours2, data = d, weight = fakeweight)
This yields the following error:
Error: Base operators are not defined for quosures.
Do you need to unquote the quosure?
# Bad:
myquosure != rhs
# Good:
!!myquosure != rhs
Call `rlang::last_error()` to see a backtrace
I know the problem is in this line, but I don't know how to fix it:
mutate(valid.total = sum(!!weight[!!variable != "(Missing)"]))

You can put parentheses around the !!weight. I think of this as making sure we are using the extract brackets only after weight is unquoted (so an order of operations thing).
That line would then look like:
valid.total = sum((!!weight)[!!variable != "(Missing)"])
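The order-of-operations point can be checked in base R without dplyr, because quote() shows how R parses an expression. The following is a small sketch; w and i are placeholder symbols, not names from the question.

```r
# `[` binds more tightly than `!`, so without parentheses the brackets are
# applied to `w` first and the result is what gets unquoted.
unparenthesised <- quote(!!w[i])
parenthesised <- quote((!!w)[i])

# Without parentheses the outermost call is `!`, wrapped around `w[i]`.
as.character(unparenthesised[[1]])
unparenthesised[[2]][[2]]

# With parentheses the outermost call is `[`, applied to `(!!w)`.
as.character(parenthesised[[1]])
parenthesised[[2]]
```

In the failing line, this means the subsetting was attempted on the quosure object itself, before it was unquoted, and the comparison inside the brackets is what the error message complains about.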
Alternatively, you could use the new curly-curly operator ({{), which takes the place of the enquo()/!! combination for relatively simple cases like yours. Then your function would look something like
make.topline <- function(variable, data, weight){
table <- data %>%
# calculate denominator
mutate(total = sum({{ weight }}),
valid.total = sum({{ weight }}[{{ variable }} != "(Missing)"])) %>%
# calculate proportions
group_by({{ variable }}) %>%
summarise(pct = (sum({{ weight }})/first(total))*100,
valid.pct = (sum({{ weight }})/first(valid.total))*100,
n = sum({{ weight }}))
table
}
Like the parentheses solution, this runs without error.
make.topline(variable = tvhours2, data = d, weight = fakeweight)
# A tibble: 9 x 4
tvhours2 pct valid.pct n
<fct> <dbl> <dbl> <dbl>
1 0 3.16 5.98 679.
2 1 10.9 20.6 2342.
3 2 14.1 26.6 3022.
4 3 9.10 17.2 1957.
5 4 6.67 12.6 1432.
6 5 3.24 6.13 696.
7 6-8 4.02 7.61 864.
8 9+ 1.67 3.16 358.
9 (Missing) 47.2 89.3 10140.

loess regression on each group with dplyr::group_by()

Alright, I'm waving my white flag.
I'm trying to compute a loess regression on my dataset.
I want loess to compute a different set of points that plots as a smooth line for each group.
The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.
Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.
I just can't figure out how to make this work on a per-group basis.
Here are some examples of my failed attempts.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))
> test2
# A tibble: 136 x 2
# Groups: CpG [4]
CpG x
<chr> <dbl>
1 cg01003813 0.781
2 cg01003813 0.793
3 cg01003813 0.805
4 cg01003813 0.816
5 cg01003813 0.829
6 cg01003813 0.841
7 cg01003813 0.854
8 cg01003813 0.866
9 cg01003813 0.878
10 cg01003813 0.893
This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.
test2 <- test %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::do({
predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
})
This one simply fails with the following error.
"Error: Results 1, 2, 3, 4 must be data frames, not numeric"
Also it still isn't applied as a new column with dplyr::mutate
fems <- fems %>%
group_by(CpG) %>%
dplyr::arrange(AVGMOrder) %>%
dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))
This was my first attempt and most closely resembles what I want to do. The problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.
I am really stuck here. I read online that the purrr package might help, but I'm having trouble figuring it out.
data looks like this:
> head(test)
X geneID CpG CellLine Meth AVGMOrder neworder Group SmoothMeth
1 40 XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210 1 1 5 0.7808767
2 94 XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120 1 1 5 0.7927130
3 148 XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448 1 1 5 0.8045080
4 202 XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890 1 1 5 0.8163997
5 27 XG cg25296477 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880 2 2 3 0.8285259
6 81 XG cg01003813 iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240 2 2 3 0.8409501
unique(test$CpG)
[1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"
So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).
My actual dataset has a few thousand of those CpG's, not just the four.
https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing

This is a neat Tidyverse way to make it work:
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
models <- fems %>%
  # tidyr >= 1.0 syntax: name the list column and say which columns go into it
  tidyr::nest(data = -CpG) %>%
  dplyr::mutate(
    # Perform loess calculation on each CpG group
    m = purrr::map(data, loess,
                   formula = Meth ~ AVGMOrder, span = .5),
    # Retrieve the fitted values from each model
    fitted = purrr::map(m, `[[`, "fitted")
  )
# Apply fitted y's as a new column
results <- models %>%
  dplyr::select(-m) %>%
  tidyr::unnest(cols = c(data, fitted))
# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
geom_point() +
geom_line(aes(y = fitted))

You may have already figured this out -- but if not, here's some help.
Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.
So for your case:
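fems <- fems %>%
  group_by(CpG) %>%
  arrange(CpG, AVGMOrder) %>%
  # No data = . here: inside mutate(), . would be the whole data frame, not the current group.
  # Predicting at the observed AVGMOrder values keeps the result as long as the group.
  mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5),
                         data.frame(AVGMOrder = AVGMOrder)))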
Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.

Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.
# Define important variables
span_60 <- 60/365 # 60 days of a year
span_365 <- 365/365 # a whole year
# Group and order the data set
df <- as.data.frame(
  df %>%
    group_by(profile) %>%
    arrange(profile, day)
)
# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
  # Declare that the loess columns exist, but are blank
  x$loess_60 <- NA
  x$loess_365 <- NA
  # Identify all unique profile IDs
  all_ids <- unique(x$profile)
  # Iterate through the unique profile IDs, determine the length of each vector (which should correspond to 365 days)
  # and isolate the rows that belong to the profile ID.
  for (i in all_ids) {
    len_entries <- length(which(x$profile == i))
    queried_rows <- x[which(x$profile == i), ]
    # Run the loess fit and write the result to the corresponding column
    fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
    fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
    x[which(x$profile == i), "loess_60"] <- fit_60
    x[which(x$profile == i), "loess_365"] <- fit_365
  }
  # Return the data frame with the two new columns
  return(x)
}
# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
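The same per-group loop can be written more compactly against the column names from the question (CpG, Meth, AVGMOrder). The following is a minimal sketch in base R; the helper name add_group_loess and the synthetic data are made up for this illustration, and the default span of 0.85 is simply taken from the question's first attempt.

```r
# Hypothetical helper: fit one loess curve per group and store the fitted
# values in a new column, keeping the original row order.
add_group_loess <- function(df, span = 0.85) {
  df$Loess <- NA_real_
  # split() on row positions gives one index vector per CpG
  for (rows in split(seq_len(nrow(df)), df$CpG)) {
    fit <- loess(Meth ~ AVGMOrder, data = df[rows, ], span = span)
    df$Loess[rows] <- predict(fit)  # fitted values, in the order of `rows`
  }
  df
}

# Made-up data that only borrows the question's column names
set.seed(1)
test <- data.frame(
  CpG = rep(c("cgA", "cgB"), each = 30),
  AVGMOrder = rep(1:30, times = 2)
)
test$Meth <- ifelse(test$CpG == "cgA", 0.8, 0.2) + rnorm(60, sd = 0.02)
test <- test[sample(nrow(test)), ]  # the rows do not need to be sorted
head(add_group_loess(test))
```

Since the fitted values are written back by row position, no arrange() is needed and the new column lines up with the original Meth values.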

Simulating a timeseries in dplyr instead of using a for loop

So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:
tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
tdf$pop[i] = 1.1*tdf$pop[i-1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.

Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = accumulate(pop, ~.x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
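Note that both versions above ignore the second argument of the accumulating function, so the growth rate is fixed at 1.1. For the time-varying case mentioned in the question, that second argument can carry the rate for each step. The following is a minimal sketch in base R; the rates vector is made up for this illustration.

```r
# Time-varying growth: rates[i] is the (made-up) multiplier applied when
# going from step i to step i + 1.
rates <- c(1.1, 1.2, 0.9, 1.05)

# Reduce() passes the running population as the first argument and the next
# rate as the second; init is the starting population.
pop <- Reduce(function(pop, rate) pop * rate, rates,
              init = 50, accumulate = TRUE)

tdf <- data.frame(time = seq_along(pop), pop = pop)
tdf
```

The result has one more entry than there are rates, because the starting population is kept as the first value.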

If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) will give you the starting value followed by the next four values. With your code, you could do:
data.frame(time=1:5, pop=50) %>%
mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time=1:5) %>%
mutate(pop = base * 1.1^(1:n()-1))

Purrr's accumulate function can handle time-varying parameters if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list column containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dplyr+purrr+tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the state returned for the previous time step) and .y
# (the state for the current time step, which carries that step's parameters).
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y) {
  pop = .x$pop[1]
  growth = .y$growth[1]
  comp = .y$comp[1]
  # Note: this uses the population from .x, and the parameter values from .y.
  # The first time step is returned unchanged, so it keeps its initial population.
  new_pop = pop*exp(growth - pop*comp)
  .y$pop[1] = new_pop
  return(.y)
}
# Starting parameters: the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
# First test: fixed growth rates
test1 = tibble(time = 1:n_steps, pop = pop_init,
               growth = growth, comp = comp)
# Here, the combination of group_by() and nest() splits the data into individual
# time points and then groups all parameters into a new list column called state.
# ungroup() removes the grouping structure, then accumulate runs the function
# on the list of states. Finally unnest transforms it all back to a
# data frame. (nest(state = c(...)) and unnest(state) are the tidyr >= 1.0 syntax.)
out1 = test1 %>%
  group_by(time) %>%
  nest(state = c(pop, growth, comp)) %>%
  ungroup() %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  unnest(state)
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = tibble(time = 1:n_steps, pop = pop_init,
               growth = rnorm(n_steps, growth, 0.1), comp = comp)
out2 = test2 %>%
  group_by(time) %>%
  nest(state = c(pop, growth, comp)) %>%
  ungroup() %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  unnest(state)
# This demonstrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps, pop = pop_init, comp = comp) %>%
  mutate(growth = rnorm(n_steps*10, growth, 0.1))
out3 = test3 %>%
  group_by(rep, time) %>%
  nest(state = c(pop, growth, comp)) %>%
  # regroup by replicate so that accumulate restarts for every replicate
  group_by(rep) %>%
  mutate(
    state = accumulate(state, logistic_growth)) %>%
  unnest(state)
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))

The problem here is that dplyr is running this as a set of vector operations rather than evaluating the term one row at a time. Here, 1.1*lag(pop) is being interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop=50 for every row, the lagged values for all the steps were 50 (and NA for the first row, which has no predecessor).
dplyr does have some helper functions for sequential evaluation; the standard function cumsum, cumprod, etc. work, and a few new ones (see ?cummean) all work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1,times=4))) %>%
  mutate(pop = pop*cumprod(growth_rate))
time pop growth_rate
1 50.000 1.0
2 55.000 1.1
3 60.500 1.1
4 66.550 1.1
5 73.205 1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
  # lag(), not lead(): the first time step has no previous step, so its multiplier is 1
  mutate(pop = pop*cumprod(lag(growth_rate,default=1)))
This makes it explicit that the growth rate column refers to the rate of growth in the current time step from the previous one.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.

What about the map functions, i.e.
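library(dplyr)
library(purrr)
tdf <- tibble(time=1:5)
# time - 1 in the exponent so that the series starts at 50 for time = 1
tdf %>% mutate(pop = map_dbl(.x = time, .f = (function(x) 50*1.1^(x - 1))))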
