I am simulating the number of events from a Poisson distribution (with parameter 9). For each event, I am simulating the price of the event using a lognormal distribution with parameters 2 (meanlog) and 1.1 (sdlog).
I am running 100 simulations (each simulation represents a year). The code (which I am happy with) is:
simul <- list()
for(i in 1:100) {simul[[i]] <- rlnorm (rpois(1, 9), meanlog = 2, sdlog = 1.1) }
My issue is that the output "simul" is a list of lists and I don't know how to apply basic operations to it.
I want to be able to:
1. cap each individual simulated value (due to budget constraints)
2. obtain the total of all the simulated values, separately for each year (with and without capping)
3. obtain the mean value for each individual year (with and without capping)
4. calculate the 95th percentile for each year (with and without capping)
5. output the results into a dataframe (so that one column represents the total, one the mean, one the percentile etc) and each row represents a year
Something that seems to work is me pulling out individual lists:
sim1 <- simul[1]
I can now use "unlist" to flatten the list and apply any operations I want.
sims1 <- data.frame(unlist(sim1), nrow=length(sim1), byrow=F)
sims1 <- subset(sims1, select = c(unlist.sim1.))
colnames(sims1) <- "sim1"
quantile <- data.frame(quantile(sims1$sim1, probs = c(0.95)))
But I don't want to write the above logic 100 times, once for each sublist in the list. Is there a way around it?
Any help on this would be much appreciated.
You actually don't have a list of lists, you have a list of vectors. With lists, a single bracket subset will return a list, e.g. simul[1:3] will return a list of the first 3 items of simul, and for consistency simul[1] returns a list of the first item of simul.
To extract an element, rather than a length-1 list, use [[. simul[[1]] is a vector you don't need to run unlist() on.
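To make the difference concrete, here is a minimal sketch on a toy list (the values are made up and only stand in for simul):

```r
# Toy list of numeric vectors, standing in for simul
simul <- list(c(1.5, 2.5), c(10, 20, 30))

one_bracket <- simul[2]    # a list of length 1 that contains the vector
two_bracket <- simul[[2]]  # the numeric vector itself

class(one_bracket)  # "list"
class(two_bracket)  # "numeric"
mean(two_bracket)   # works directly, no unlist() needed
```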
The nice thing about lists is you can work on them with for loops or with lapply/sapply functions. For example
raw_means = sapply(simul, mean)
raw_sums = sapply(simul, sum)
raw_95 = sapply(simul, quantile, probs = 0.95)
result = data.frame(raw_means, raw_sums, raw_95)
Or with a loop,
raw_means = raw_sums = raw_95 = numeric(length(simul))
for (i in seq_along(simul)) {
raw_means[i] = mean(simul[[i]])
raw_sums[i] = sum(simul[[i]])
raw_95[i] = quantile(simul[[i]], probs = 0.95)
}
result = data.frame(raw_means, raw_sums, raw_95)
When you say "cap", I'm not sure if you mean subset or reduce values (e.g., with pmin), so I'll leave that to you. But I'd recommend making a new list, like simul_cap = lapply(simul, pmin, 9) (if that's the operation you want) and running the same code on your capped list. You could even make the summary statistics a function, so instead of copy-pasting a bunch you end up doing raw_result = foo(simul) and cap_result = foo(simul_cap).
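As a sketch of that idea, assuming "cap" means reducing each value with pmin: the cap of 20, the seed, and the name summarise_years are placeholders I made up, so swap in your own.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
simul <- lapply(1:100, function(i) rlnorm(rpois(1, 9), meanlog = 2, sdlog = 1.1))

cap <- 20  # hypothetical cap; substitute the real budget limit
simul_cap <- lapply(simul, pmin, cap)

# Summarise a list of numeric vectors: one row per list element (year)
summarise_years <- function(sims) {
  data.frame(
    total = sapply(sims, sum),
    mean  = sapply(sims, mean),
    p95   = sapply(sims, quantile, probs = 0.95, names = FALSE)
  )
}

raw_result <- summarise_years(simul)
cap_result <- summarise_years(simul_cap)

# Side-by-side table, one row per year; columns come out as raw.total, cap.total, ...
result <- data.frame(year = seq_along(simul), raw = raw_result, cap = cap_result)
head(result)
```

Keeping the summary in one function means the raw and capped lists are guaranteed to be summarised the same way.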
I created the following function that takes 3 numeric parameters: size of longitude (in degrees), size of latitude (in degrees) and year. The function creates squares (grids) of the size given by the first two parameters and then allocates the observations in the dataset over those grids, separated by year (the third parameter). The function is working as intended.
To use the function to construct a 2x2 Assemblage (the grid with the all the observations in it) for the year 2009, I call:
assemblage_2009 <- CreateAssembleage(2, 2, 2009)
However, I would like to create assembleages iteratively from the year 2009 to 2018.
I tried to do a for loop with i in 2009:2018 without much success. I also tried lapply but also without much success.
Any ideas from more experienced R users?
The function:
CreateAssembleage <- function(size_long, size_lat, year){
# create a dataset to hold only values with the chosen year
data_grid_year <- dplyr::filter(data_grid, Year == year)
# Create vectors to hold the columns (easier to work with)
Longitude <- data_grid_year$Longitude
Latitude <- data_grid_year$Latitude
dx <- size_long # set up the dimensions (easier to change here than inside the code)
dy <- size_lat
# construct the grids
gridx <- seq(min(Longitude), max(Longitude), by = dx) # the values we discussed for the big square
gridy <- seq(min(Latitude), max(Latitude), by = dy)
# take the data and create 3 new columns (x, y, cell) by finding the specified data inside the constructed grids
grid_year <- data_grid_year %>%
mutate(
x = findInterval(Longitude, gridx),
y = findInterval(Latitude, gridy),
cell = paste(x, y, sep = ",")) %>%
relocate(Sample_Id, Latitude, Longitude, x, y, cell) # bring forward the new columns
### Create the assemblage
data_temp <- grid_year %>%
group_by(cell) %>% # group by the same route id
select(-c(Sample_Id, Latitude, Longitude, Midpoint_Date_Local,
Year, Month, Chlorophyll_Index, x, y)) %>% # remove unneeded columns
summarise(across(everything(), sum)) # calculate the sum
return(data_temp) #return the result
}
Thank you all for any ideas.
I cannot check whether your function works since I don't have any data from you. This said, there are multiple possibilities to call a function n times and save the output.
Since you didn't specify the problem, I have to assume you struggle to run the function in a loop and save the output.
Also, I'll have to assume that 1st: your function works, 2nd: size_long and size_lat are always set to 2. If you have different ideas, you'll have to make more clear what you want.
Some options:
Create a list with the output using lapply. Note that for the call below to work, you'll have to set size_long = 2 and size_lat = 2 as default values when you define the function, and make year the first argument.
years <- 2009:2018
results <- lapply(years, CreateAssembleage)
Create a list with the output using a for loop:
results <- list()
for(i in 2009:2018){
results[[paste0("assemblage_", i)]] <- CreateAssembleage(size_long = 2, size_lat = 2, year = i)
}
If need be, create multiple variables, one for each year:
for(i in 2009:2018){
do.call("<-", list(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2,
year = i)))
}
Same as the previous option, but using assign:
for(i in 2009:2018){
assign(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2, year = i))
}
Note that if you want to alter not only year but also the other variables each time, e.g., change size_lat for each iteration, you'll have to use mapply instead of lapply, or, in case of the loops, you'll have to create vectors (or a dataframe) with the other variables as well and adjust your loop.
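A minimal sketch of both cases. Since I don't have the data, create_stub is a hypothetical stand-in with the same argument names as CreateAssembleage, and the size_lat values are made up. It also shows that extra named arguments can be passed through lapply, which avoids redefining the function with defaults.

```r
# Hypothetical stand-in with the same signature as CreateAssembleage;
# it only records its arguments so the sketch runs without data_grid.
create_stub <- function(size_long, size_lat, year) {
  data.frame(size_long = size_long, size_lat = size_lat, year = year)
}

years <- 2009:2018

# Fixed grid size: named arguments after the function are passed on by lapply,
# so each element of years fills the remaining argument (year).
fixed <- lapply(years, create_stub, size_long = 2, size_lat = 2)
names(fixed) <- paste0("assemblage_", years)

# Varying grid size per year: Map (a wrapper around mapply) walks the vectors in parallel.
size_lat <- rep(c(1, 2), length.out = length(years))  # made-up values
varied <- Map(create_stub, size_long = 2, size_lat = size_lat, year = years)

fixed[["assemblage_2009"]]
varied[[2]]
```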
Edit: As suggested by MrFlick, I changed the order of the options and added the assign-option. Loops are easier to understand for most beginners, but they can be annoyingly slow for large datasets. So it is probably best to get used to lapply.
I am converting my for-loops in R for a model that has multiple input datasets. In the for-loop I use the current loop value to retrieve values from other datasets. I am looking to replicate this using an apply function (over columns in a dataset), but I'm struggling to get the index of the current column inside the apply function so that I can retrieve the matching values from the other data.
The apply function refers to the column through the function's argument, which is fine. I've tried to use colname (after having named my various columns by number) but have not had any joy. Below is an example dataset and for loop with what I'd like to achieve (simplified somewhat). The length of the vectors and the number of columns in the tabular dataset will always be equal.
iteration<-1:3
df <- data.frame("column1" = 6:10, "column2" = 12:16, "column3" = 31:35)
variable1<-rnorm(3,mean = 25)
variable2<-rnorm(3, mean = 0.21)
outcome<-numeric()
for (i in iteration) {
intermediate<-(mean(df[,i])*variable1[i])^variable2[i]
outcome<-c(outcome,intermediate)
}
outcome
The expected results are outcome above...trying this in apply
What I imagine it to be is this:
apply(df, 2, function(x) (mean(x)*variable1[colnumber(x)])^variable2[colnumber(x)]
or perhaps
apply(df, 2, function(x) (mean(x)*variable1[x])^variable2[x])
but these two obviously do not work.
First-time user, so apologies for any etiquette issues, but I found the answer to my own problem using the purrr package. Maybe this helps someone else:
pmap(list(df, variable1, variable2), function(df, variable1, variable2) (mean(df)*variable1)^variable2)
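For anyone who prefers to stay in base R, here is a sketch of the same idea with mapply, plus an sapply over the column index, checked against the loop from the question. The seed is arbitrary and only there so the comparison is reproducible.

```r
set.seed(42)  # arbitrary seed so the comparison below is reproducible
df <- data.frame(column1 = 6:10, column2 = 12:16, column3 = 31:35)
variable1 <- rnorm(3, mean = 25)
variable2 <- rnorm(3, mean = 0.21)

# Loop from the question, kept as the reference result
outcome <- numeric(ncol(df))
for (i in seq_along(df)) {
  outcome[i] <- (mean(df[, i]) * variable1[i])^variable2[i]
}

# mapply walks the columns of df and both vectors in parallel
outcome_mapply <- mapply(function(col, v1, v2) (mean(col) * v1)^v2,
                         df, variable1, variable2)

# sapply over the column index is another option
outcome_sapply <- sapply(seq_along(df),
                         function(i) (mean(df[[i]]) * variable1[i])^variable2[i])

all.equal(unname(outcome_mapply), outcome)
```

This works because a data frame is a list of columns, so mapply pairs column i with variable1[i] and variable2[i]. Unlike pmap, which returns a list, mapply simplifies the result to a (named) numeric vector here.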
I'm new to the forum and to r, so please forgive the sloppy code.
In short, I am trying to get a normal distribution to iteratively use the parameters drawn from two lists for use in a For Loop that generates a 30x10000 matrix of random samples using these parameters.
The first list (List1) is a collection of numeric vectors. The second list (List2) has corresponding values I would like to use for the standard deviation argument in rnorm: i.e. vector 1 from List1's standard deviation is Value1 in List2.
set.seed(1500) #set up random gen
var1 = rnorm(1:1000, mean = #mean of vector(i) from list1, sd = #value(i) from List2)
sample(var1,size=1)
X = matrix(ncol = 30, nrow = 10000)
for(j in 1:length(var1)){ #simulates data using parameters set by rnorm var1 function
for(i in 1:10000){
X[i, j] = sample(var1,1)
}
}
Here's the original post where this code is inspired from.
Cheers!
It seems mapply() would help you:
# First let's turn the list1 into means.
dist.means = lapply(list1,mean)
lapply is a way to execute a function for every element in a list. mapply works in a very similar way but iterates over multiple lists in parallel.
samples = mapply(rnorm, 30*10000, dist.means, list2,SIMPLIFY=F)
A little bit more explanation: mapply() runs rnorm() multiple times. In the first attempt, it runs using the first element of first list as the first argument, the first element of second list as second argument, etc. So in our case it will run rnorm( 30*10000, dist.means[[1]], list2[[1]] ) then rnorm( 30*10000, dist.means[[2]], list2[[2]] ) and store the output in a list.
Note that I use a small trick here. The first "list" is a single number, 30*10000. When you give mapply lists of different lengths it recycles the shorter ones, i.e. it repeats them until they have the same length as the longest list.
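A small runnable sketch of the whole approach. The contents of list1 and list2 are made up, and I use 100 rows instead of 10000 to keep it quick. The last step, reshaping each sample into a matrix, is my assumption about what you want to end up with.

```r
set.seed(1500)
# Made-up inputs standing in for list1 (numeric vectors) and list2 (standard deviations)
list1 <- list(c(1, 2, 3), c(10, 20, 30), c(100, 200, 300))
list2 <- list(0.5, 2, 10)

dist.means <- lapply(list1, mean)

n_row <- 100  # smaller than 10000 to keep the sketch quick
n_col <- 30

# n is a single value and gets recycled across the three (mean, sd) pairs
samples <- mapply(rnorm, n_row * n_col, dist.means, list2, SIMPLIFY = FALSE)

# One n_row x n_col matrix per (mean, sd) pair
mats <- lapply(samples, matrix, nrow = n_row, ncol = n_col)

length(mats)
dim(mats[[1]])
```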
Hope that helps
I have a large data set I am attempting to sample rows from. Each row has a family ID, and there may be one or multiple rows for each family ID. I want to parse the data set by randomly sampling one row for each family ID. I have attempted to accomplish this by using both tapply() and split() + lapply() functions, but to no avail. Below is code that reproduces my issue - the size and scope of the factor levels and data entries mirror the data set I am working with.
set.seed(63)
f1 <- factor(c(rep(30000:32000, times=1),
rep(30500:31700, times = 2),
rep(30900:31900, times = 3)))
f2 <- factor(rep(sample(1:7, replace = TRUE), times = length(f1)/7))
x1 <- round(matrix(rnorm(length(f1)*300), nrow = length(f1), ncol = 300),3)
df <- data.frame(f1, f2, x1)
Next, I used tapply to sample one row per factor from f1, and then check for repeats. (f2 is a secondary factor that indexes another aspect of the observations, but is [hopefully] irrelevant here; I only include it for full disclosure of the structure of my data set).
s1 <- tapply(1:nrow(df), df$f1, sample, size=1)
any(duplicated(s1))
The output for the second line of code using duplicated is TRUE, which means there are repeats. Stumped, I tried split to see if that was the problem.
df.split <- split(1:nrow(df), df$f1)
any(duplicated(df.split))
The output here for duplicated is FALSE, so the problem is not split. I then used the output df.split with lapply and sample to see if the problem was with tapply.
df.unique <- unlist(lapply(df.split, sample, size = 1, replace = FALSE,
prob = NULL))
any(duplicated(df.unique))
In the first line, I sampled one value from each element of df.split which outputs a list, then I used unlist to convert into a vector. The output for duplicated here is also TRUE.
Somewhere within sample and lapply there is funky stuff going on (since tapply merely calls lapply). I'm not sure how to fix the issue (I searched SO and Google and found nothing related to my issue), so any help would be greatly appreciated!
EDIT: I'm hoping someone could tell me why the above code using tapply and lapply is not working as intended. Arthur has provided a nice answer, and I have coded a loop for sample as well. I'm wondering why the above code is misbehaving.
I would do that:
library(data.table)
data.table(df)[,.SD[sample(.N,1)],by='f1']
... but actually your original approach with tapply is faster if you just want an index and not the actual subset table. However, note that sample(x) samples from 1:x when x is a single number (length(x) == 1), so for a family with only one row you get a random index between 1 and that row number instead of the row itself. See ?sample. This version is error-proof:
s1 <- tapply(1:nrow(df), list(df$f1), function(v) v[sample(1:length(v), 1)])
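To see the pitfall in isolation, here is a minimal sketch with a single-row group (the index 7 is made up). In the question's data the f1 levels that occur only once would hit exactly this case, which is presumably where the duplicates come from.

```r
set.seed(63)
v <- 7L  # a group whose only row index is 7

# sample() treats a single number n as 1:n, so this draws from 1:7
draws_naive <- replicate(1000, sample(v, size = 1))

# Indexing by position always returns an element of v itself
draws_safe <- replicate(1000, v[sample.int(length(v), 1)])

table(draws_naive)     # spread over 1..7, not only 7
all(draws_safe == 7L)  # TRUE
```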
I have a data.frame where each column represents a different individual and each row represents different food items eaten.
My goal is to resample each column via bootstrapping and then calculate a metric score and C.I.s for each individual (data column) using a defined function.
I have done this successfully on a single vector but cannot figure out how to apply the bootstrapping and metric function to individual columns in a data frame. Below is the code I have to apply it to a single vector:
data.1 <- c(10, 50, 200, 54, 6) ## example vector
## create function
metric.function <- function(x){
p <- x/sum(x)
dap <- 1/sum(p^2)
return(dap)
}
vect <- c() ## empty vector for bootstrap data
for (i in 1:1000){
data.2 <- sample(data.1, replace = TRUE) ##bootstrap sample ##
vect[i] <- metric.function (data.2) ## apply metric.function ##
}
summary(vect) ## summary
quantile(vect, probs = c(0.025, 0.975)) ## C.I.
This works fine for a single vector but I want to apply it independently to multiple columns in a data frame, for example in the example.df below I want to apply it to x1:x10 independently resulting in 10 metric scores and 10 C.I.s
example.df<-data.frame(replicate(10,sample(0:50,10,rep=TRUE)))
I have tried changing the vector item to a data.frame and messing around with apply and dply but cannot figure it out, can anyone suggest how to do it or point me in the direction of useful guide/website etc?
This is a perfect chance to use replicate and sapply.
replicate(1000, sapply(example.df, function(x)
metric.function(sample(x, replace = TRUE))))
sapply will operate column-wise (given that a data.frame is in a sense a list of columns); once we've isolated a column within sapply, we need only resample it & apply our metric.
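To get from the replicates to one metric score and C.I. per individual, something along these lines should work. It is a sketch using the same percentile approach as the question's single-vector code. The seed and the names boot, ci and summary_df are mine, and na.rm = TRUE only guards against the rare resample that is all zeros.

```r
set.seed(123)  # arbitrary seed for a reproducible sketch
example.df <- data.frame(replicate(10, sample(0:50, 10, replace = TRUE)))

metric.function <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

# Rows = columns of example.df (X1..X10), columns = bootstrap replicates
boot <- replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))

# Percentile intervals per individual; t() puts one individual per row
ci <- t(apply(boot, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE))

summary_df <- data.frame(individual  = rownames(boot),
                         mean_metric = rowMeans(boot, na.rm = TRUE),
                         lower       = ci[, 1],
                         upper       = ci[, 2])
summary_df
```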