Normally I would handle this kind of issue with a loop (probably not the best solution anyway), but I am working with an extremely large dataset (7.8 million observations) and have been trying to program it more efficiently. Here is a very small subset of my dataset:
df = data.frame(STATE = c("PA", "PA", "MD","MD", "MO", "MO"),
DIVISION = c("Middle_Atlantic", "Middle_Atlantic","South_Atlantic","South_Atlantic","West_North_Central","West_North_Central"),
Middle_Atlantic_NSA = c(117.77, 119.43, 119.43, 120.72, 119.11, 117.77),
Middle_Atlantic_SA = c(118.45, 119.65, 119.65, 120.73, 119, 118.45),
South_Atlantic_NSA = c(134.45, 135.2, 135.2, 136.69, 134.07, 134.45),
South_Atlantic_SA = c(134.25, 134.83, 134.83, 135.97, 133.86, 134.25),
West_North_Central_NSA=c(152.24, 153.61, 153.61, 155.19, 151.08, 152.24),
West_North_Central_SA=c(152.77, 153.19, 153.19, 154.44, 151.63, 152.77),
DIV_HPI_NSA = c(117.77, 119.43, 135.2, 136.69, 151.08, 152.24),
DIV_HPI_SA = c(118.45, 119.65, 134.83, 135.97, 151.63, 152.77))
I have included my desired output for the variables "DIV_HPI_NSA" and "DIV_HPI_SA". What I am attempting to accomplish is to look up the value in "DIVISION" (e.g. "Middle_Atlantic"), attach the suffix "_NSA" to it, and return the corresponding value of that column (in this case "Middle_Atlantic_NSA") into the new variable "DIV_HPI_NSA". I am doing the same thing for the "DIV_HPI_SA" variable. Currently, I am trying to use either the get() function or the eval(parse(text = "text_here")) method to evaluate the strings as column names and produce the correct values, however neither is working as desired for me. Ideally I would prefer a dplyr solution, as it has been processing relatively quickly as opposed to loops. I am not sure why this is not working in dplyr, and would like to understand why and how I could execute it successfully.
Here is my current code:
comb.df = df %>%
mutate(DIV_HPI_NSA = get(paste0(DIVISION,"_NSA")),
DIV_HPI_SA = eval(parse(text = (paste0(DIVISION,"_SA")))))
This is how I would do it through a loop - which produces the correct result but it takes a ridiculous amount of time:
for(i in 1:dim(comb.df)[1]){
comb.df$DIV_HPI_NSA[i] = comb.df[i, paste0(comb.df$DIVISION[i],"_NSA")]
comb.df$DIV_HPI_SA[i] = comb.df[i, paste0(comb.df$DIVISION[i],"_SA")]
}
My current output (i.e. DIV_HPI_NSA) keeps returning values from the column that corresponds to the first element of the "DIVISION" column. For example, the dplyr method fills "DIV_HPI_NSA" only with values from the "Middle_Atlantic_NSA" column, since "Middle_Atlantic" is the first element in "DIVISION". The eval() approach has the same issue and does not generate the correct output for each row.
Is there a better/faster method than dplyr, and/or how can I fix my dplyr code for it to work properly?
Please let me know if you may need additional information.
Thanks in advance!
The answer may depend on the number of values DIVISION can take.
Here is a little benchmark with only "_NSA", but obviously you can do the same with "_SA" later.
library(dplyr)
library(rlang)
library(microbenchmark)

#your base function in a for loop
x1 = function(db){
for(i in 1:dim(db)[1]){
db$DIV_HPI_NSA[i] = db[i, paste0(db$DIVISION[i],"_NSA")]
db$DIV_HPI_SA[i] = db[i, paste0(db$DIVISION[i],"_SA")]
}
db}
#the very same function using 'apply', which is supposed to be much faster than base loop
x2= function(db){
db %>% apply(1, function(x){
x["DIV_HPI_NSA2"] = x[paste0(x["DIVISION"],"_NSA")]
x["DIV_HPI_SA2"] = x[paste0(x["DIVISION"],"_SA")]
x
}) %>% t %>% as.data.frame
}
#if DIVISION has only a few values, you can use 'dplyr::case_when' this way
x3= function(db){
db %>% mutate(output2 = case_when(
DIVISION=="Middle_Atlantic" ~ Middle_Atlantic_NSA,
DIVISION=="South_Atlantic" ~ South_Atlantic_NSA,
DIVISION=="West_North_Central" ~ West_North_Central_NSA
))
}
#but if DIVISION can take a lot of values, you may have to rlang the function a bit
x4= function(db){
db = db %>% mutate(output2 = -999) #start with dummy value
xx=data.frame(A=db$DIVISION, B=paste0(db$DIVISION,"_NSA"), stringsAsFactors = F) %>%
unique %>%
split(seq(nrow(.))) #turns xx into a list of its rows
for(i in xx){
db = db %>% mutate(output2 = case_when(DIVISION==i$A ~ !!sym(i$B), T~output2))
}
db
}
#here are some replicates of your dataset to increase the number of lines
df60 = df[rep(seq_len(nrow(df)), 10),]
df600 = df[rep(seq_len(nrow(df)), 100),]
df6k = df[rep(seq_len(nrow(df)), 1000),]
df60k = df[rep(seq_len(nrow(df)), 10000),]
df600k = df[rep(seq_len(nrow(df)), 100000),]
#the benchmark of every function with every dataset
(mbm=microbenchmark(
base = x1(df),
base60 = df60 %>% x1,
base600 = df600 %>% x1,
base6k = df6k %>% x1,
apply = x2(df),
apply60 = df60 %>% x2,
apply600 = df600 %>% x2,
apply6k = df6k %>% x2,
dplyr = x3(df),
dplyr60 = x3(df60),
dplyr600 = x3(df600),
dplyr6k = x3(df6k),
dplyr60k = x3(df60k),
dplyr600k = x3(df600k),
dplyrcw = x4(df),
dplyrcw60 = x4(df60),
dplyrcw600 = x4(df600),
dplyrcw6k = x4(df6k),
dplyrcw60k = x4(df60k),
dplyrcw600k = x4(df600k),
times=3
))
# Unit: microseconds
# expr min lq mean median uq max neval cld
# base 515.283 599.3395 664.6767 683.396 739.3735 795.351 3 a
# base60 5125.835 5209.1620 5515.3047 5292.489 5710.0395 6127.590 3 a
# base600 53225.746 53300.1395 66678.0210 53374.533 73404.1585 93433.784 3 b
# base6k 587666.127 618005.9505 629841.8157 648345.774 650929.6600 653513.546 3 d
# apply 1220.559 1272.8895 1342.4810 1325.220 1403.4420 1481.664 3 a
# apply60 2265.710 2384.9575 2497.3980 2504.205 2613.2420 2722.279 3 a
# apply600 10852.649 11579.6225 12047.9227 12306.596 12645.5595 12984.523 3 a
# apply6k 114463.342 125155.8980 137072.6593 135848.454 148377.3180 160906.182 3 c
# dplyr 1298.964 1352.9355 1433.0417 1406.907 1500.0805 1593.254 3 a
# dplyr60 1604.559 1647.0435 1713.2313 1689.528 1767.5675 1845.607 3 a
# dplyr600 1357.676 1456.6845 1556.4223 1555.693 1655.7955 1755.898 3 a
# dplyr6k 1954.644 1970.1425 2025.0260 1985.641 2060.2170 2134.793 3 a
# dplyr60k 6366.085 6584.1590 6809.2833 6802.233 7030.8825 7259.532 3 a
# dplyr600k 46893.576 53406.6235 58086.0983 59919.671 63682.3595 67445.048 3 b
# dplyrcw 5824.182 5834.0285 5999.5897 5843.875 6087.2935 6330.712 3 a
# dplyrcw60 5591.885 5683.0535 6032.4097 5774.222 6252.6720 6731.122 3 a
# dplyrcw600 5664.820 5811.2360 5900.6413 5957.652 6018.5520 6079.452 3 a
# dplyrcw6k 6390.883 6522.7120 9003.2733 6654.541 10309.4685 13964.396 3 a
# dplyrcw60k 14379.395 14936.6140 15179.6070 15493.833 15579.7130 15665.593 3 a
# dplyrcw600k 85238.503 86607.3005 92601.6017 87976.098 96283.1510 104590.204 3 b
Conclusion
For a 6k line dataset,
apply (~137 ms) is roughly 4.5x faster than base (~630 ms)
vanilla dplyr is even much faster (~2 ms)
rlanged dplyr is a bit slower than vanilla (~9 ms)
Times seem to grow roughly linearly for base and apply, at about 0.1 ms per line for the base loop, so 7.8M lines would take on the order of 15 minutes with the loop (and likely longer in practice).
dplyr times grow much more slowly (about 58 ms for 600k lines), so extrapolating from the 600k case, the case_when approach should remain fast on your full dataset.
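For completeness, a fully vectorized base-R lookup is another option worth benchmarking (a sketch only, not timed above; hpi and idx_cols are just illustrative names). Matrix indexing picks one cell per row in a single call, so it avoids both the row loop and one case_when branch per DIVISION level:
#columns holding the division-level index values
idx_cols <- c(paste0(unique(df$DIVISION), "_NSA"), paste0(unique(df$DIVISION), "_SA"))
hpi <- as.matrix(df[, idx_cols])
#for every row, pick the cell in the column named after its own DIVISION
df$DIV_HPI_NSA <- hpi[cbind(seq_len(nrow(df)), match(paste0(df$DIVISION, "_NSA"), colnames(hpi)))]
df$DIV_HPI_SA <- hpi[cbind(seq_len(nrow(df)), match(paste0(df$DIVISION, "_SA"), colnames(hpi)))]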
Related
Consider data created here:
data <- data.frame(ID = sample(10000,100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100))
#Note that not all IDs are the same length
We have data for 100 IDs, where each individual has a unique ID number. Columns not.imp1:3 are only relevant to show the structure of the dataframe.
We want to add a leading zero to the first 95 ID numbers. I am trying to do this using dplyr pipes, but can't figure out how to add the zeros.
Here is how I subset the data that I want to add the zeros to:
library(dplyr)
data%>%
select(ID)%>%
slice(1:95)
I have tried several things like adding %>% mutate(paste0("0", .)) to the pipe, but haven't gotten anything to work. What is the best way to do this?
Using sprintf() to pad to 3 digits:
data %>% mutate(ID = sprintf("%03d", ID))
You can change %03d to control how many leading zeros are added; e.g. %05d will ensure all IDs are at least 5 digits long.
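For example, values shorter than the width are zero-padded while longer ones are left unchanged:
sprintf("%03d", c(7, 42, 123, 1234))
# [1] "007" "042" "123" "1234"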
You can try this approach
data2 <- data %>%
mutate(ID = ifelse(row_number()<= 95, paste0("0", ID), ID))
head(data2)
# ID not.imp1 not.imp2 not.imp3
# 1 09449 -1.4297317 -2.2210106 0.1923912
# 2 07423 1.9010681 1.0825734 -0.8855694
# 3 06283 0.2508254 -0.5307967 2.1645044
# 4 05593 -2.2451267 0.1281156 -1.8528800
# 5 09194 -0.1677409 -0.7422480 -0.4237452
# 6 07270 -0.2536918 1.2289698 1.0083092
tail(data2)
# ID not.imp1 not.imp2 not.imp3
# 95 06538 1.0071791 0.1596557 -0.7099883
# 96 4829 0.2444440 0.8869954 -1.2938356
# 97 2571 -1.1012023 0.8343393 -0.6264487
# 98 150 0.2116460 -0.2146265 -1.8281045
# 99 3107 -1.2379193 0.3491078 1.4531531
# 100 9953 -0.9326725 1.1146032 -1.5542687
Using str_pad from the stringr package is helpful:
library(stringr)
data <- data.frame(ID = sample(10000,100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100)) %>%
mutate(ID = str_pad(string = ID, width = 4, side = 'left', pad = '0'))
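For example, the padding behaves like this on IDs of different lengths:
str_pad(c("7", "42", "9449"), width = 4, side = 'left', pad = '0')
# [1] "0007" "0042" "9449"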
This would be a quick and easy way to do it. I didn't use dplyr or piping, but you could merge this idea with the code you already tried if you want to.
data[which(nchar(data[,"ID"])==3),"ID"]<-paste0("0",data[which(nchar(data[,"ID"])==3),"ID"])
data[which(nchar(data[,"ID"])==2),"ID"]<-paste0("00",data[which(nchar(data[,"ID"])==2),"ID"])
I'm progressively transitioning from SAS to R, and at the moment I am trying to replicate what I used to do with macros.
I have a table that contains all my data (let's call it IDF_pop) and from this table I create two others: YVE_pop and EPCI_pop, which are subsets of the main table. I prefer creating separate tables, but I guess this might not be optimal. Here's how I proceed:
library(dplyr)
library(reshape2)
library(stringi)

## Let's say the main table contains 10 lines.
## codgeo is the city's postal code, epci is the area, and I have three
## variables that describe different parts of the population
codgeo <- c("75014","75020","78300","78520","78650","91200","91600","92500","93100","95230")
epci <- c("001","001","002","002","003","004","004","005","006","007")
pop0_15 <- c(10000*runif(10))
pop15_64 <- c(10000*runif(10))
pop65p <- c(10000*runif(10))
IDF_pop <- data.frame(codgeo,epci,pop0_15,pop15_64,pop65p)
## I'd like my population to be in one single column, for this I'll use melt
IDF_pop_line <- melt(IDF_pop,c("codgeo","epci"))
## Now I want to create separate tables for the Yvelines department (codgeo starts with 78) and for EPCI 002
## I could do it in two lines but I wanted to train using functions so here goes
localisation <- function(code_dep, lib_dep, code_epci, lib_epci){
do.call("<<-",
list(paste0(eval(lib_dep),"_pop_ligne"),
IDF_pop_line %>% filter(stri_sub(codgeo,from=1,length=2)==code_dep)
)
)
do.call("<<-",
list(paste0(eval(lib_epci),"_pop_ligne"),
IDF_pop_line %>% filter(epci==code_epci)
)
)
}
do.call("localisation",list("78","YVE","002","GPSO"))
With this, I have my 3 tables (IDF_, YVE_, GPSO_) and can now get to the main problem.
What I want to do next is summarise my tables. I'm trying to write a function that would work for all 3 tables.
I'd like it to be fully dependent on the parameter, but it seems that do.call won't accept a paste0 in its second argument.
## Aggregating the tables. I'll call the function 3 times, one for each level.
agregation <- function(lib){
# This doesn't work:
do.call("<<-",
list(paste0(eval(lib),"_pop_agr"),
paste0(eval(lib),"_pop_line") %>%
group_by(variable) %>%
summarise(pop = sum(value))
)
)
}
do.call("agregation",list("IDF")) # This one doesn't work
agregation2 <- function(lib){
do.call("<<-",
list(paste0(eval(lib),"_pop_agr"),
IDF_pop_line %>%
group_by(variable) %>%
summarise(pop = sum(value))
)
)
}
do.call("agregation2",list("IDF")) # This one does
As you can see, the only working way I've found so far is to write out the full name of the table I'm using for aggregation. But this goes against the initial idea of having something that can be freely parameterized.
How can I modify the first version of my function so that it works for all three possible parameters?
Lastly, I am aware that a simple workaround would have been to keep my IDF_pop_line table and filter at the last moment to create the 3 aggregated tables, but I prefer having separate tables from the get-go.
Thanks in advance for your help!
In your agregation function, the string paste0(eval(lib), "_pop_line") returns the name of a data frame, not the data frame itself.
Try get
agregation <- function(lib){
do.call("<<-",
list(paste0(eval(lib),"_pop_agr"),
get(paste0(eval(lib),"_pop_line")) %>%
group_by(variable) %>%
summarise(pop = sum(value))
)
)
}
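Usage sketch (assuming dplyr is loaded and IDF_pop_line exists as in the question): the call below should now create IDF_pop_agr in the global environment.
do.call("agregation", list("IDF"))
head(IDF_pop_agr)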
Here is a suggestion using data.table.
You can use the IDF_pop you create before entering all functions.
library(data.table)
#make a data.table out of IDF_pop
setDT( IDF_pop )
#create groups to summarise by
IDF_pop[ epci == "002", GSPO := TRUE][]
IDF_pop[ grepl("^78", codgeo) , YVE := TRUE][]
#melt and filter only values where a filter is TRUE
dt <- data.table::melt( IDF_pop,
id.vars = c("codgeo", "epci", "pop0_15", "pop15_64", "pop65p"),
measure.vars = c("GSPO", "YVE"))[ value == TRUE,][]
intermediate result (dt)
# codgeo epci pop0_15 pop15_64 pop65p variable value
# 1: 78300 002 6692.394 5441.225 4008.875 GSPO TRUE
# 2: 78520 002 2128.604 6808.004 1889.822 GSPO TRUE
# 3: 78300 002 6692.394 5441.225 4008.875 YVE TRUE
# 4: 78520 002 2128.604 6808.004 1889.822 YVE TRUE
# 5: 78650 003 8482.971 6556.482 5098.929 YVE TRUE
code
#now summarising is easy: sum all pop-columns by variable-group
dt[, lapply( .SD, sum), by = variable, .SDcols = names(dt)[grepl("^pop", names(dt) )] ]
final output
# variable pop0_15 pop15_64 pop65p
# 1: GSPO 7171.683 5855.894 11866.55
# 2: YVE 12602.153 8028.948 14364.21
I am trying to arrange my data. The csv file that I load contains the results of 15 precincts for one locality. The number of rows is 150 because the names of the 10 candidates repeat for each of the 15 precincts.
My goal is to turn the names of the 10 candidates into columns, without repeating their names, with each candidate's results as the values. I use the code below, but I have to repeat it 15 times because I cut my data into intervals of 10 to extract the results of one precinct at a time. It's the same for "binondov", except there I have to cut the data into intervals of 8 because there are 8 candidates for each precinct.
Is there a way to write my code as a loop? Thanks!
binondop1 <- binondop[1:10,]
binondop1a <- binondop1[order(binondop1[,2]),]
binondov1 <- binondov[1:8,]
binondov1a <- binondov1[order(binondov1[,2]),]
colnames(binondop1a) = colnames(binondov1a) =
c('X', 'Candidate', 'Party', 'Vote', 'Percentage')
binondo1 <- rbind(binondop1a, binondov1a)
binondo <- rbind(t(binondo1$Vote), t(binondo2$Vote),
t(binondo3$Vote), t(binondo4$Vote),
t(binondo5$Vote), t(binondo6$Vote),
t(binondo7$Vote), t(binondo8$Vote),
t(binondo9$Vote), t(binondo10$Vote),
t(binondo11$Vote), t(binondo12$Vote),
t(binondo13$Vote),t(binondo14$Vote),
t(binondo15$Vote))
colnames(binondo) <- c('Acosta', 'Aquino', 'DLReyes', 'EEjercito',
'Gordon', 'Madrigal', 'Perlas', 'Teodoro',
'Villanueva', 'Villar', 'Binay', 'Chipeco',
'Fernando', 'Legarda', 'Manzano', 'Roxas',
'Sonza', 'Yasay')
It's hard to say exactly without seeing a sample data set, but perhaps something like this will help get you where you need to go.
library(dplyr)
library(tidyr)
df <- data.frame(Candidate = c(rep('Acosta',3), rep('Aquino',3), rep('DLReyes',3)),
Party = c('R','R','R','L','L','L','D','D','D'),
Vote = rep(c('A','B','C'),3),
Percentage = c(5,4,2,6,8,3,1,3,2))
df2 <- df %>%
mutate(Candidate = paste0(Candidate, ' (', Party, ')')) %>%
select(-Party) %>%
spread(Candidate, Percentage)
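If you are on tidyr 1.0.0 or later, pivot_wider() has superseded spread(); a sketch of the equivalent call:
df2 <- df %>%
mutate(Candidate = paste0(Candidate, ' (', Party, ')')) %>%
select(-Party) %>%
pivot_wider(names_from = Candidate, values_from = Percentage)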
I would like to process some GPS data rows pairwise.
For now, I am doing it in a normal for-loop but I'm sure there is a better and faster way.
library(sp)

n = 100
testdata <- as.data.frame(cbind(runif(n,1,10), runif(n,0,360), runif(n,14,16), runif(n, 46,49)))
colnames(testdata) <- c("speed", "heading", "long", "lat")
head(testdata)
diffmatrix <- as.data.frame(matrix(ncol = 3, nrow = dim(testdata)[1] - 1))
colnames(diffmatrix) <- c("distance","heading_diff","speed_diff")
for (i in 1:(dim(testdata)[1] - 1)) {
diffmatrix[i,1] <- spDists(as.matrix(testdata[i:(i+1),c('long','lat')]),
longlat = T, segments = T)*1000
diffmatrix[i,2] <- testdata[i+1,]$heading - testdata[i,]$heading
diffmatrix[i,3] <- testdata[i+1,]$speed - testdata[i,]$speed
}
head(diffmatrix)
How would I do that with an apply function?
Or is it even possible to do that calculation in parallel?
Thank you very much!
I'm not sure what you want to do with the end condition but with dplyr you can do all of this without using a for loop.
library(dplyr)
testdata %>%
  mutate(heading_diff = c(diff(heading), 0),
         speed_diff = c(diff(speed), 0),
         longdiff = c(diff(long), 0),
         latdiff = c(diff(lat), 0)) %>%
  rowwise() %>%
  mutate(spdist = spDists(cbind(c(long, long + longdiff), c(lat, lat + latdiff)),
                          longlat = T, segments = T) * 1000) %>%
  select(heading_diff, speed_diff, distance = spdist)
# heading_diff speed_diff distance
# <dbl> <dbl> <dbl>
# 1 15.9 0.107 326496
# 2 -345 -4.64 55184
# 3 124 -1.16 25256
# 4 85.6 5.24 221885
# 5 53.1 -2.23 17599
# 6 -184 2.33 225746
I will explain each part below:
The pipe operator %>% is essentially a chain that sends the results from one operation into the next. So we start with your test data and send it to the mutate function.
Use mutate to create 4 new columns that are the difference measurements from one row to the next. Adding in 0 at the last row because there is no measurement following the last datapoint. (Could do something like NA instead)
Next once you have the differences you want to use rowwise so you can apply the spDists function to each row.
Last, we create another column with mutate that uses the four difference columns we created earlier.
To get only the 3 columns that you were concerned with I used a select statement at the end. You can leave this out if you want the entire dataframe.
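As an aside, a fully vectorized sketch without rowwise() is also possible, since spDists() with segments = TRUE already returns all n-1 consecutive distances in one call (diffs is just an illustrative name):
library(sp)
diffs <- data.frame(
  distance = spDists(as.matrix(testdata[, c("long", "lat")]), longlat = TRUE, segments = TRUE) * 1000,
  heading_diff = diff(testdata$heading),
  speed_diff = diff(testdata$speed))
head(diffs)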
So, while lag and lead in dplyr are great, I want to simulate a timeseries of something like population growth. My old school code would look something like:
tdf <- data.frame(time=1:5, pop=50)
for(i in 2:5){
tdf$pop[i] = 1.1*tdf$pop[i-1]
}
which produces
time pop
1 1 50.000
2 2 55.000
3 3 60.500
4 4 66.550
5 5 73.205
I feel like there has to be a dplyr or tidyverse way to do this (as much as I love my for loop).
But, something like
tdf <- data.frame(time=1:5, pop=50) %>%
mutate(pop = 1.1*lag(pop))
which would have been my first guess just produces
time pop
1 1 NA
2 2 55
3 3 55
4 4 55
5 5 55
I feel like I'm missing something obvious.... what is it?
Note - this is a trivial example - my real examples use multiple parameters, many of which are time-varying (I'm simulating forecasts under different GCM scenarios), so, the tidyverse is proving to be a powerful tool in bringing my simulations together.
Reduce (or its purrr variants, if you like) is what you want for cumulative functions that don't already have a cum* version written:
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
or with purrr,
data.frame(time = 1:5, pop = 50) %>%
mutate(pop = accumulate(pop, ~.x * 1.1))
## time pop
## 1 1 50.000
## 2 2 55.000
## 3 3 60.500
## 4 4 66.550
## 5 5 73.205
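If the rate itself varies through time (closer to your real forecasts), accumulate() can also thread a per-step rate in through its second argument. A sketch with made-up rates, assuming dplyr and purrr are loaded:
rates <- c(1.1, 1.2, 0.9, 1.05) # one made-up rate per transition
data.frame(time = 1:5, pop = 50) %>%
  mutate(pop = accumulate(rates, ~ .x * .y, .init = first(pop)))
## pop: 50, 55, 66, 59.4, 62.37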
If the starting value of pop is, say, 50, then pop = 50 * 1.1^(0:4) will give you the starting value and the next four values. With your code, you could do:
data.frame(time=1:5, pop=50) %>%
mutate(pop = pop * 1.1^(1:n() - 1))
Or,
base = 50
data.frame(time=1:5) %>%
mutate(pop = base * 1.1^(1:n()-1))
purrr's accumulate() function can handle time-varying indices, if you pass them
to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to get this working correctly. The trick here is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list column containing the current population state and parameters, then use accumulate() on the resulting list column. This is a bit complicated to explain, so I've included a demo, simulating logistic growth with either a constant growth rate or a time-varying stochastic growth rate. I also included an example of how to use this to simulate multiple replicates for a given model using dplyr+purrr+tidyr.
library(dplyr)
library(purrr)
library(ggplot2)
library(tidyr)
# Declare the population growth function. Note: the first two arguments
# have to be .x (the prior vector of populations and parameters) and .y,
# the current parameter value and population vector.
# This example function is a Ricker population growth model.
logistic_growth = function(.x, .y, growth, comp) {
pop = .x$pop[1]
growth = .y$growth[1]
comp = .y$comp[1]
# Note: this uses the state from .x, and the parameter values from .y.
# The first observation will use the first entry in the vector for .x and .y
new_pop = pop*exp(growth - pop*comp)
.y$pop[1] = new_pop
return(.y)
}
# Starting parameters the number of time steps to simulate, initial population size,
# and ecological parameters (growth rate and intraspecific competition rate)
n_steps = 100
pop_init = 1
growth = 0.5
comp = 0.05
#First test: fixed growth rates
test1 = data_frame(time = 1:n_steps,pop = pop_init,
growth=growth,comp =comp)
# here, the combination of nest() and group_by() split the data into individual
# time points and then groups all parameters into a new vector called state.
# ungroup() removes the grouping structure, then accumulate runs the function
#on the vector of states. Finally unnest transforms it all back to a
#data frame
out1 = test1 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This is the same example, except I drew the growth rates from a normal distribution
# with a mean equal to the mean growth rate and a std. dev. of 0.1
test2 = data_frame(time = 1:n_steps,pop = pop_init,
growth=rnorm(n_steps, growth,0.1),comp=comp)
out2 = test2 %>%
group_by(time)%>%
nest(pop, growth, comp,.key = state)%>%
ungroup()%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
# This demonstrates how to use this approach to simulate replicates using dplyr
# Note the crossing function creates all combinations of its input values
test3 = crossing(rep = 1:10, time = 1:n_steps,pop = pop_init, comp=comp) %>%
mutate(growth=rnorm(n_steps*10, growth,0.1))
out3 = test3 %>%
group_by(rep)%>%
group_by(rep,time)%>%
nest(pop, growth, comp,.key = state)%>%
group_by(rep)%>%
mutate(
state = accumulate(state,logistic_growth))%>%
unnest()
print(qplot(time, pop, data=out1)+
geom_line() +
geom_point(data= out2, col="red")+
geom_line(data=out2, col="red")+
geom_point(data=out3, col="red", alpha=0.1)+
geom_line(data=out3, col="red", alpha=0.1,aes(group=rep)))
The problem here is that dplyr is running this as a set of vector operations rather than evaluating the terms one at a time. Here, 1.1*lag(pop) is being interpreted as "calculate the lagged values for all of pop, then multiply them all by 1.1". Since you set pop=50, the lagged values for all the steps were 50.
dplyr does have some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc., and a few newer ones (see ?cummean) all work within dplyr. In your example, you could simulate the model with:
tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1, times=4))) %>%
mutate(pop = pop*cumprod(growth_rate))
time pop growth_rate
1 50.000 1.0
2 55.000 1.1
3 60.500 1.1
4 66.550 1.1
5 73.205 1.1
Note that I added growth rate as a column here, and I set the first growth rate to 1. You could also specify it like this:
tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
mutate(pop = pop*cumprod(lag(growth_rate, default=1)))
This makes it explicit that the growth rate column refers to the rate of growth in the current time step from the previous one.
There are limits to how many different simulations you can do this way, but it should be feasible to construct a lot of discrete-time ecological models using some combination of the cumulative functions and parameters specified in columns.
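For instance (a sketch with made-up rates), a time-varying rate stored in a column compounds the same way:
data.frame(time = 1:5, pop = 50, growth_rate = c(1, 1.10, 1.05, 0.95, 1.20)) %>%
  mutate(pop = pop * cumprod(growth_rate))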
What about the map functions, i.e.
tdf <- data_frame(time=1:5)
tdf %>% mutate(pop = map_dbl(.x = tdf$time, .f = (function(x) 50*1.1^(x-1))))