Creating multiple datasets with R function

I created the following function that takes 3 numeric parameters: size of longitude (in degrees), size of latitude (in degrees), and year. The function creates squares (grids) of the size given by the first two parameters and then allocates the observations in the dataset over those grids, separated by year (the third parameter). The function is working as intended.
To use the function to construct a 2x2 assemblage (the grid with all the observations in it) for the year 2009, I call:
assemblage_2009 <- CreateAssembleage(2, 2, 2009)
However, I would like to create assemblages iteratively for the years 2009 to 2018.
I tried a for loop with i in 2009:2018, and I also tried lapply, both without much success.
Any ideas from more experienced R users?
The function:
CreateAssembleage <- function(size_long, size_lat, year){
# create a dataset to hold only values with the chosen year
data_grid_year <- dplyr::filter(data_grid, Year == year)
# Create vectors to hold the columns (easier to work with)
Longitude <- data_grid_year$Longitude
Latitude <- data_grid_year$Latitude
dx <- size_long # set up the dimensions (easier to change here than inside the code)
dy <- size_lat
# construct the grids
gridx <- seq(min(Longitude), max(Longitude), by = dx) # the values we discussed for the big square
gridy <- seq(min(Latitude), max(Latitude), by = dy)
# take the data and create 3 new columns (x, y, cell) by finding the specified data inside the constructed grids
grid_year <- data_grid_year %>%
mutate(
x = findInterval(Longitude, gridx),
y = findInterval(Latitude, gridy),
cell = paste(x, y, sep = ",")) %>%
relocate(Sample_Id, Latitude, Longitude, x, y, cell) # bring forward the new columns
### Create the assemblage
data_temp <- grid_year %>%
group_by(cell) %>% # group by the same route id
select(-c(Sample_Id, Latitude, Longitude, Midpoint_Date_Local,
Year, Month, Chlorophyll_Index, x, y)) %>% # remove unneeded columns
summarise(across(everything(), sum)) # calculate the sum
return(data_temp) #return the result
}
Thank you all for any ideas.

I cannot check whether your function works since I don't have any data from you. This said, there are multiple possibilities to call a function n times and save the output.
Since you didn't specify the problem, I have to assume you struggle to run the function in a loop and save the output.
Also, I'll have to assume that 1st: your function works, 2nd: size_long and size_lat are always set to 2. If you have different ideas, you'll have to make more clear what you want.
Some options:
Create a list with the output using lapply. Note that for this call to work, size_long = 2 and size_lat = 2 must be set as default values in the function definition, and year must be the first argument.
years <- 2009:2018
results <- lapply(years, CreateAssembleage)
Create a list with the output using a for loop:
results <- list()
for(i in 2009:2018){
results[[paste0("assemblage_", i)]] <- CreateAssembleage(size_long = 2, size_lat = 2, year = i)
}
If need be, create multiple variables, one for each year:
for(i in 2009:2018){
do.call("<-", list(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2,
year = i)))
}
Same as the previous option, but using assign:
for(i in 2009:2018){
assign(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2, year = i))
}
Note that if you want to alter not only year but also the other variables each time, e.g., change size_lat for each iteration, you'll have to use mapply instead of lapply, or, in case of the loops, you'll have to create vectors (or a dataframe) with the other variables as well and adjust your loop.
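As a rough sketch of the mapply route, the example below uses Map (which is mapply with SIMPLIFY = FALSE) to vary all three arguments at once. Note that make_assemblage is a hypothetical stand-in for CreateAssembleage so the code runs without data_grid, and the parameter values are made up for illustration.

```r
# Hypothetical stand-in for CreateAssembleage(): it only echoes its arguments,
# so the example runs without the original data_grid.
make_assemblage <- function(size_long, size_lat, year) {
  data.frame(size_long = size_long, size_lat = size_lat, year = year)
}

# One row per call; the three parameters vary together, element by element.
params <- data.frame(
  size_long = c(2, 2, 4),
  size_lat  = c(2, 4, 4),
  year      = c(2009, 2010, 2011)
)

# Map() always returns a list, one element per parameter combination.
results <- Map(make_assemblage, params$size_long, params$size_lat, params$year)
names(results) <- paste0("assemblage_", params$year)

results[["assemblage_2010"]]
```

Keeping the parameters in one data frame makes it easy to see which settings produced which list element.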
Edit: As suggested by MrFlick, I changed the order of the options and added the assign-option. Loops are easier to understand for most beginners, but they can be annoyingly slow for large datasets. So it is probably best to get used to lapply.
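If you would rather not change the function definition, one possible variation is to pass the fixed arguments by name through lapply, which leaves each year to fill the remaining argument. This is only a sketch: make_assemblage below is a hypothetical stand-in for CreateAssembleage with made-up output.

```r
# Hypothetical stand-in for CreateAssembleage(); it returns a small data frame
# so the example runs without the original data.
make_assemblage <- function(size_long, size_lat, year) {
  data.frame(cell = c("1,1", "1,2"), total = c(year, year + 1))
}

years <- 2009:2018

# size_long and size_lat are matched by name, so each element of `years`
# is matched to the remaining argument, `year`.
results <- lapply(years, make_assemblage, size_long = 2, size_lat = 2)
names(results) <- paste0("assemblage_", years)

# Optionally stack everything into one data frame with a year column.
all_years <- do.call(rbind, Map(cbind, year = years, results))
head(all_years)
```

A single named list (or one stacked data frame) is usually easier to work with afterwards than ten separate variables.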

Related

Loop in R to extract list of lists

I am simulating the number of events from a Poisson distribution (with parameter 9). For each event, I am simulating the price of the event using a lognormal distribution with parameters 2 and 1.1
I am running 100 simulations (each simulation represents a year). The code (which I am happy with) is:
simul <- list()
for(i in 1:100) {simul[[i]] <- rlnorm (rpois(1, 9), meanlog = 2, sdlog = 1.1) }
My issue is that the output "simul" is a list of lists and I don't know how to apply basic operations to it.
I want to be able to:
1. cap each individual simulated value (due to budget constraints)
2. obtain the total of all the simulated values, separately for each year (with and without capping)
3. obtain the mean value for each individual year (with and without capping)
4. calculate the 95th percentile for each year (with and without capping)
5. output the results into a dataframe (so that one column represents the total, one the mean, one the percentile etc) and each row represents a year
Something that seems to work is me pulling out individual lists:
sim1 <- simul[1]
I can now use "unlist" to flatten the list and apply any operations I want.
sims1 <- data.frame(unlist(sim1), nrow=length(sim1), byrow=F)
sims1 <- subset(sims1, select = c(unlist.sim1.))
colnames(sims1) <- "sim1"
quantile <- data.frame(quantile(sims1$sim1, probs = c(0.95)))
But I don't want to write the above logic 100 times, once for each sublist in the list... is there a way around it?
Any help on this would be much appreciated.

You actually don't have a list of lists, you have a list of vectors. With lists, a single bracket subset will return a list, e.g. simul[1:3] will return a list of the first 3 items of simul, and for consistency simul[1] returns a list of the first item of simul.
To extract an element, rather than a length-1 list, use [[. simul[[1]] is a vector you don't need to run unlist() on.
The nice thing about lists is you can work on them with for loops or with lapply/sapply functions. For example
raw_means = sapply(simul, mean)
raw_sums = sapply(simul, sum)
raw_95 = sapply(simul, quantile, probs = 0.95)
result = data.frame(raw_means, raw_sums, raw_95)
Or with a loop,
raw_means = raw_sums = raw_95 = numeric(length(simul))
for (i in seq_along(simul)) {
raw_means[i] = mean(simul[[i]])
raw_sums[i] = sum(simul[[i]])
raw_95[i] = quantile(simul[[i]], probs = 0.95)
}
result = data.frame(raw_means, raw_sums, raw_95)
When you say "cap", I'm not sure if you mean subset or reduce values (e.g., with pmin), so I'll leave that to you. But I'd recommend making a new list, like simul_cap = lapply(simul, pmin, 9) (if that's the operation you want) and running the same code on your capped list. You could even make the summary statistics a function, so instead of copy-pasting a bunch you end up doing raw_result = foo(simul) and cap_result = foo(simul_cap).
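Putting those pieces together, a minimal sketch might look like the following. It assumes that "cap" means limiting each value with pmin; the cap of 20 and the helper name summarise_sims are made up for illustration.

```r
set.seed(1)  # make the simulation reproducible

# 100 simulated years; each element is a numeric vector of event prices.
simul <- lapply(1:100, function(i) rlnorm(rpois(1, 9), meanlog = 2, sdlog = 1.1))

# Hypothetical helper: one row of summary statistics per simulated year.
summarise_sims <- function(sims) {
  data.frame(
    total = vapply(sims, sum, numeric(1)),
    mean  = vapply(sims, mean, numeric(1)),
    p95   = vapply(sims, quantile, numeric(1), probs = 0.95, names = FALSE)
  )
}

cap <- 20                              # assumed cap, chosen arbitrarily
simul_cap <- lapply(simul, pmin, cap)  # cap every individual value

raw_result <- summarise_sims(simul)
cap_result <- summarise_sims(simul_cap)

# Columns come out as raw.total, raw.mean, raw.p95, cap.total, ...
result <- data.frame(year = seq_along(simul), raw = raw_result, cap = cap_result)
head(result)
```

Because the summary lives in one function, the capped and uncapped results are guaranteed to be computed the same way.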

Applying function to different combinations of the arguments in R

I have two variables (one independent and one dependent), each containing 5 data points, and I have written a function of (x, y) that fits different models to them. This works well. However, I also need to apply the same function to different combinations of these data points, that is, to every subset of 4, 3, and 2 data points. In total, there are 25 possible combinations. What would be the most efficient way of doing this?
Please, see below an example of my data:
tte <- c(100,172,434,857,1361) #dependent variable
po <- c(446,385,324,290,280) #independent variable
Results <- myFunction (tte=tte, po=po) # customized function
Below is an example of how I am getting all the possible combinations using 4 data points:
tte4 <- combn(tte,4)
po4 <- combn(po,4)
Please, note that the first column of tte4 has always to be analyzed with the first column of po4. Then, the second column of tte4 with the second column of po4 and so on. What I need to do is to use myFunction on all these combinations.
I have tried to implement it through a for loop and through mapply without much success.
Any thoughts?

Consider using the simplify=FALSE argument of combn, then pass the list of vectors with mapply (or its wrapper Map).
tte_list <- combn(tte,4, simplify = FALSE)
po_list <- combn(po, 4, simplify = FALSE)
# MATRIX OR VECTOR RETURN
res_matrix <- mapply(myFunction, tte_list, po_list)
# LIST RETURN
res_list <- Map(myFunction, tte_list, po_list)
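To cover all 25 combinations (subsets of size 2, 3, and 4) in one pass, one option is to enumerate index subsets once and use them to subset both vectors. This is a sketch only: the myFunction defined here is a hypothetical stand-in (the slope of a straight-line fit), since the original function was not shown.

```r
tte <- c(100, 172, 434, 857, 1361)  # dependent variable
po  <- c(446, 385, 324, 290, 280)   # independent variable

# Hypothetical stand-in for the real myFunction(): slope of tte on po.
myFunction <- function(tte, po) unname(coef(lm(tte ~ po))[2])

# All index subsets of size 2, 3 and 4, flattened into one list.
idx_list <- unlist(
  lapply(2:4, function(k) combn(seq_along(tte), k, simplify = FALSE)),
  recursive = FALSE
)

# The same indices are applied to both vectors, so the pairs stay aligned.
res <- sapply(idx_list, function(idx) myFunction(tte = tte[idx], po = po[idx]))
length(res)
```

Subsetting by index rather than calling combn on each vector separately keeps the pairing explicit, which also helps if either vector contains repeated values.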

Since I don't know what function you want to perform, I just summed the columns. This function takes three arguments:
index = A sequence of 1 to how many columns there are in tte4 (should be same as po4)
x = tte4
y = po4.
Then it should use that index on both matrices to ID the columns you want. And in this case, I summed them.
tte <- c(100,172,434,857,1361) #dependent variable
po <- c(446,385,324,290,280) #independent variable
results <- function(index, x, y){
i.x <- x[,index]
i.y <- y[,index]
sum(i.x) + sum(i.y)
}
tte4 <- combn(tte, 4)
po4 <- combn(po,4)
index <- 1:ncol(tte4)
sapply(index, results, x = tte4, y = po4)
#[1] 3008 3502 3891 4092 4103

How do I perform this code using apply statements instead of this for loop?

I have a list of dataframes where each column in the df corresponds to the evaluation of a function values from different numeric vectors of the same length.
Each list object(dataframe) is generated with a different function
I would like to iterate through each list object (dataframe) to
1. Generate a plot for each list object(dataframe), with columns as data series.
2. Generate a new list of new dataframes which contains a column for each column mean from the original dataframe
The below code is functional, but is there a better way to use apply statements and avoid the for loop?
plots <- list()
trait.estimate <- list()
for(i in 1:length(component.estimation)) { #outer loop start
component.estimation[[i]]$hr <- hr #add hr vector to end of dataframe
temporary.df <- melt(component.estimation[[i]] , id.vars = 'hr', variable.name = 'treatment')
#Store a plot of each df
plots[[i]] <- ggplot(temporary.df, aes(hr , value), group = treatment, colour = treatment, fill = treatment) +
geom_point(aes(colour = treatment, fill = treatment))+
geom_line(aes(colour= treatment, linetype = treatment))+
ggtitle( names(component.estimation)[i])+ #title to correspond to trait
theme_classic()
#Generate column averages for each df
trait.estimate[[i]] <- apply(component.estimation[[i]] ,2, mean)
trait.estimate[[i]] <- as.data.frame(trait.estimate[[i]])
trait.estimate[[i]]$treatment <- row.names(trait.estimate[[i]])
} #outer loop close

Your for loop looks fine to me, I wouldn't worry about transitioning to lapply. Personally, I think lapply is great when you want to do something simple, but when you want something more complicated, a for loop can be just as readable.
The only real change I'd make is to use colMeans rather than apply(., 2, mean). I also might break apart the trait.estimate part and the plotting part as they seem wholly separate operations. Seems nicer organizationally.
As an example, pulling out the trait.estimate calculations would look like this:
# inside for loop version
trait.estimate[[i]] <- colMeans(component.estimation[[i]])
trait.estimate[[i]] <- as.data.frame(trait.estimate[[i]])
trait.estimate[[i]]$treatment <- row.names(trait.estimate[[i]])
# outside for loop lapply version
trait.estimate = lapply(component.estimation, colMeans)
trait.estimate = lapply(trait.estimate, as.data.frame)
trait.estimate = lapply(trait.estimate, function(x) {x$treatment = row.names(x); x}) # return the data frame, not just the row names
# all in one lapply version with anonymous function
trait.estimate = lapply(component.estimation, function(x) {
means = colMeans(x)
means = as.data.frame(means)
means$treatment = row.names(means)
return(means)
})
Which is better? I'll leave that to you to decide. Use whichever you prefer.

How to match and store results from a long nested for loop into an empty column in a data frame in R

I'm trying to store p values from a long nested for loop into an empty column in a data frame. I've tried looking up examples close to my code, but I feel as though my code is really long (and maybe even incorrect) that the same things that can be applied to other for loops can't be applied to mine.
The overview of what I'm trying to do is I'm trying to compare the relatedness of observed paired birds to the relatedness of all possible paired birds in a given year by finding a p value. To do this, I'm writing a for loop where I am selecting a range of years from a huge data set, and then I am applying a bunch of functions to those given years where I'm trying to narrow down the data for observed pairs and then I'm adding a column for relatedness and transferring those relatedness values for the pairs from another data set. I am then applying another for loop function within this in order to create a data frame with all possible paired birds in that given year and also adding and transferring a column of relatedness values for the pairs. From these two data frames of pairs and relatedness within each year, I want to apply the wilcox test to find the p value for each given year. I want to transfer over these p values into a separate data frame that I have created with a year column and a p value column.
Here is my (crazy looking) code:
year <- c(2000:2013)
pvalue <- c(NA)
results <- data.frame(year, pvalue)
for(j in c(2000:2013)) {
allbr_demo_noEPP_year <- subset(allbr_demo_noEPP, Year == j)
allbr_demo_noEPP_year_geno_obs <- allbr_demo_noEPP_year[allbr_demo_noEPP_year$Pairs %in% c(genome$pair1,genome$pair2),]
allbr_demo_noEPP_year_geno_obs$relatedness <- laply(allbr_demo_noEPP_year_geno_obs$Pairs, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
allbr_demo_noEPP_year_geno <- allbr_demo_noEPP_year[c(allbr_demo_noEPP_year$MB_USFWS,allbr_demo_noEPP_year$FB_USFWS) %in% genotyped$V2,]
breeder_list_males <- allbr_demo_noEPP_year_geno_obs[,8]
breeder_list_females <- allbr_demo_noEPP_year_geno_obs[,10]
unq_breeder_list_males <- unique(breeder_list_males)
unq_breeder_list_females <- unique(breeder_list_females)
all_poss_combo <-list()
for(i in unq_breeder_list_males){
print(i)
all_poss_combo[[i]]<-paste0(i, ",", unq_breeder_list_females)}
lapply(X = all_poss_combo, FUN= function(x) length(unique(x)))
all_poss_df<-unlist(all_poss_combo, use.names = F)
all_poss_df <- data.frame("combo"=all_poss_df, "M"=NA, "F"=NA)
all_poss_df$M <- substr(all_poss_df$combo, start = 1, stop = 10)
all_poss_df$F <- substr(all_poss_df$combo, start = 12, stop = 22)
all_poss_df_geno <- all_poss_df[all_poss_df$combo %in% c(genome$pair1,genome$pair2),]
all_poss_df_geno$relatedness <- laply(all_poss_df_geno$combo, function(x) genome[genome$pair1==x|genome$pair2==x,'PI_HAT'])
wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness, all_poss_df_geno$relatedness, alternative='greater')}
To be honest, I'm not even sure if this for loop will work (it seems pretty complex to me, but I am a beginner), but I was told that doing a for loop for this situation should work. I understand there are probably easier or faster ways to do what I am trying to do, which I also welcome, but I would also like to see how I could fix this for loop so it would work and how I could store the results from it into a data frame.
Thank you so much for any help given!

If you are simply looking to save the p value:
str(wilcox.test(rnorm(10), rnorm(10, 2))) # inspect the structure of the result; see ?wilcox.test
wilcox.test(rnorm(10), rnorm(10, 2))$p.value # extract only the p value
So with your dataset, perhaps putting this in the bottom of your for loop:
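# index by matching the year; pvalue[j] would write to position 2000 and beyond
results$pvalue[results$year == j] <- wilcox.test(allbr_demo_noEPP_year_geno_obs$relatedness,
all_poss_df_geno$relatedness, alternative='greater')$p.value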
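The storing pattern can be sketched on its own with simulated data. The relatedness values below are random numbers made up for illustration, not the real bird data; only the way the p value is written into the results data frame matters here.

```r
set.seed(42)  # reproducible toy data

years <- 2000:2013
results <- data.frame(year = years, pvalue = NA_real_)

for (j in years) {
  # Hypothetical stand-ins for the observed and all-possible relatedness values.
  observed <- rnorm(15, mean = 0.1)
  possible <- rnorm(60, mean = 0)

  test <- wilcox.test(observed, possible, alternative = "greater")

  # Write into the row whose year matches j, not into position j.
  results$pvalue[results$year == j] <- test$p.value
}

results
```

Matching on the year column means the loop still works if the years are not consecutive or do not start at 1.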

How to create multiple different lags

I am trying to create new variables in a dataframe that represent multiple lags. I have one time series in it right now "series" and I would like to create 10 different variables, each representing a certain lag of "series". So the resulting data frame would have the original variable "series," plus 10 variables named (1, 2, 3, 4, ... 10) that would represent that number of lags. I am currently trying this on a for loop:
for (i in 1:max.lag){
lag.death$"i" <- lag(tscampos, i)
}
But after reading here, I suspect I might want to use one of the apply functions? Any ideas?

There you go: this function returns a lagged version of your series whenever you need it. (I find this better than storing each lagged copy of the same series in 10 different columns.)
lag.death = data.frame(series = floor(runif(10,0,100)));
lag.death$series
lagit4me = function(serie,lag){
n = length(serie);
pad = rep(0,lag); # pad the start of the series with zeros
return(c(pad,serie)[1:n]);
}
lagit4me(lag.death$series,1);
lagit4me(lag.death$series,3);
You can then tweak it to allow negative lags, etc.
(But if you really need all the lags at once:)
allIn1 = lapply(0:10,lagit4me,serie=lag.death$series);
allIn1 = data.frame(allIn1);
names(allIn1) = 0:10;
allIn1
Enjoy :)

You can also use purrr::map(), similar to lapply() above. This uses dplyr::lag(), instead of lagit4me()
library(dplyr)
library(purrr)
num.lags <- 0:10
list.lags <-
purrr::map(
.x = num.lags,
.f = ~ dplyr::lag(series, .x)
)
Note that you need to name the list elements before coercing the list to a data frame (tibble)
chr.lags <- paste0("lag_", num.lags)
names(list.lags) <- chr.lags
tbl.model.subset.lags <-
dplyr::bind_rows(list.lags)
This produces a tbl with 11 variables, the input variable (lag_0) and 10 lagged variables (with NAs)
print(tbl.model.subset.lags)
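For completeness, the loop from the question can also be made to work directly. As written, lag.death$"i" always refers to a column literally named "i", so each iteration overwrites the previous one; building the column name with paste0 and assigning with [[ avoids that. This is a minimal base-R sketch: shift_by is a hypothetical helper that pads with NA, and the data are made up.

```r
# Hypothetical helper: shift x down by k positions, padding the start with NA.
shift_by <- function(x, k) c(rep(NA, k), x)[seq_along(x)]

lag.death <- data.frame(series = c(5, 3, 8, 1, 9, 2))  # toy data
max.lag <- 3

for (i in 1:max.lag) {
  # [[ ]] accepts a computed column name, unlike $.
  lag.death[[paste0("lag_", i)]] <- shift_by(lag.death$series, i)
}

lag.death
```

Prefixing the names with "lag_" also sidesteps the awkwardness of columns named 1, 2, 3, which need backticks to reference.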
