Applying a function to different combinations of the arguments in R

I have two variables (one independent and one dependent), each containing 5 data points, and I have created a function (x, y) that fits different models to them. This works quite nicely. However, I also need to apply this same function to different combinations of these data points. In other words, I need to apply the function to all the combinations that use only 4, 3, and 2 of the data points. In total, there are 25 possible combinations. What would be the most efficient way of doing this?
Please, see below an example of my data:
tte <- c(100,172,434,857,1361) #dependent variable
po <- c(446,385,324,290,280) #independent variable
Results <- myFunction (tte=tte, po=po) # customized function
Below is an example of how I am getting all the possible combinations using 4 data points:
tte4 <- combn(tte,4)
po4 <- combn(po,4)
Please note that the first column of tte4 always has to be analyzed with the first column of po4, the second column of tte4 with the second column of po4, and so on. What I need to do is use myFunction on all these combinations.
I have tried to implement it through a for loop and through mapply without much success.
Any thoughts?

Consider using the simplify=FALSE argument of combn, then pass the resulting lists of vectors to mapply (or its wrapper Map).
tte_list <- combn(tte,4, simplify = FALSE)
po_list <- combn(po, 4, simplify = FALSE)
# MATRIX OR VECTOR RETURN
res_matrix <- mapply(myFunction, tte_list, po_list)
# LIST RETURN
res_list <- Map(myFunction, tte_list, po_list)
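If you also need the 3-point and 2-point combinations (25 in total), one possible extension of the same idea is to loop over the subset sizes as well. A minimal sketch, assuming myFunction accepts tte and po vectors of any length:
# build the paired combinations for sizes 4, 3 and 2, then run myFunction on each pair
# (5 + 10 + 10 = 25 combinations in total)
sizes <- 4:2
all_results <- lapply(sizes, function(k) {
  tte_k <- combn(tte, k, simplify = FALSE)
  po_k  <- combn(po,  k, simplify = FALSE)
  Map(myFunction, tte_k, po_k)   # list of results for this subset size
})
names(all_results) <- paste0("n", sizes)   # all_results$n4, all_results$n3, all_results$n2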

Since I don't know what function you want to apply, I just summed the columns. This function takes three arguments:
index = a sequence from 1 to the number of columns in tte4 (which should be the same as in po4)
x = tte4
y = po4
It then uses that index on both matrices to pick out the columns you want; in this case, I summed them.
tte <- c(100,172,434,857,1361) #dependent variable
po <- c(446,385,324,290,280) #independent variable
results <- function(index, x, y){
  i.x <- x[, index]
  i.y <- y[, index]
  sum(i.x) + sum(i.y)
}
tte4 <- combn(tte, 4)
po4 <- combn(po,4)
index <- 1:ncol(tte4)
sapply(index, results, x = tte4, y = po4)
#[1] 3008 3502 3891 4092 4103

Related

Creating multiple datasets with R function

I created the following function that takes 3 numeric parameters: size of longitude (in degrees), size of latitude (in degrees), and year. The function creates squares (grids) of the size denoted by the first two parameters and then allocates the observations in the dataset to those grids, separated by year (the third parameter). The function is working as intended.
To use the function to construct a 2x2 assemblage (the grid with all the observations in it) for the year 2009, I call:
assemblage_2009 <- CreateAssembleage(2, 2, 2009)
However, I would like to create assemblages iteratively from 2009 to 2018.
I tried a for loop with i in 2009:2018 without much success. I also tried lapply, again without success.
Any ideas from more experienced R users?
The function:
CreateAssembleage <- function(size_long, size_lat, year){
  # create a dataset to hold only values with the chosen year
  data_grid_year <- dplyr::filter(data_grid, Year == year)
  # create vectors to hold the columns (easier to work with)
  Longitude <- data_grid_year$Longitude
  Latitude <- data_grid_year$Latitude
  dx <- size_long # set up the dimensions (easier to change here than inside the code)
  dy <- size_lat
  # construct the grids
  gridx <- seq(min(Longitude), max(Longitude), by = dx) # the values we discussed for the big square
  gridy <- seq(min(Latitude), max(Latitude), by = dy)
  # take the data and create 3 new columns (x, y, cell) by finding the specified data inside the constructed grids
  grid_year <- data_grid_year %>%
    mutate(
      x = findInterval(Longitude, gridx),
      y = findInterval(Latitude, gridy),
      cell = paste(x, y, sep = ",")) %>%
    relocate(Sample_Id, Latitude, Longitude, x, y, cell) # bring the new columns forward
  ### Create the assemblage
  data_temp <- grid_year %>%
    group_by(cell) %>% # group by the same route id
    select(-c(Sample_Id, Latitude, Longitude, Midpoint_Date_Local,
              Year, Month, Chlorophyll_Index, x, y)) %>% # remove unneeded columns
    summarise(across(everything(), sum)) # calculate the sum
  return(data_temp) # return the result
}
Thank you all for any ideas.
I cannot check whether your function works since I don't have any data from you. That said, there are multiple ways to call a function n times and save the output.
Since you didn't specify the problem, I have to assume you are struggling to run the function in a loop and save the output.
Also, I'll have to assume that (1) your function works and (2) size_long and size_lat are always set to 2. If that's not the case, you'll have to make clearer what you want.
Some options:
Create a list with the output using lapply. Note that for this to work directly, you'll have to set size_long = 2 and size_lat = 2 as defaults when you define the function and make year the first argument (or wrap the call in an anonymous function, as shown below).
years <- 2009:2018
results <- lapply(years, CreateAssembleage)
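If you'd rather not change the function's signature or defaults, a minimal alternative sketch (assuming CreateAssembleage as defined above) is to wrap the call in an anonymous function:
years <- 2009:2018
results <- lapply(years, function(y) CreateAssembleage(size_long = 2, size_lat = 2, year = y))
names(results) <- paste0("assemblage_", years)  # label each list element by year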
Create a list with the output using a for loop:
results <- list()
for(i in 2009:2018){
  results[[paste0("assemblage_", i)]] <- CreateAssembleage(size_long = 2, size_lat = 2, year = i)
}
If need be, create multiple variables, one for each year:
for(i in 2009:2018){
  do.call("<-", list(paste0("assemblage_", i),
                     CreateAssembleage(size_long = 2, size_lat = 2, year = i)))
}
Same as option 3, but using assign:
for(i in 2009:2018){
  assign(paste0("assemblage_", i), CreateAssembleage(size_long = 2, size_lat = 2, year = i))
}
Note that if you want to vary not only year but also the other arguments each time, e.g., change size_lat for each iteration, you'll have to use mapply (or Map) instead of lapply; in the case of the loops, you'll have to create vectors (or a data frame) with the other variables as well and adjust the loop accordingly.
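For example, a minimal sketch of varying both year and size_lat together with Map (the size_lat values here are made up purely for illustration):
years <- 2009:2018
lat_sizes <- rep(c(2, 4), each = 5)  # hypothetical sizes, one per year
results <- Map(function(y, s) CreateAssembleage(size_long = 2, size_lat = s, year = y),
               years, lat_sizes)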
Edit: As suggested by MrFlick, I changed the order of the options and added the assign option. Loops are easier for most beginners to understand, but they can be annoyingly slow for large datasets, so it is probably best to get used to lapply.

Random sampling in R

I am new to R and trying to accomplish a fairly simple task. I have a dataset composed of 20 observations of 19 variables, and I want to generate three non-overlapping groups of 5 observations. I am using the slice_sample function from the dplyr package, but how do I iterate while excluding the observations already picked in the first round?
library( "dplyr")
set.seed(123)
NF_1 <- slice_sample(NF, n = 5)
You can use the sample function from base R.
All you have to do is sample the row indices with replace = FALSE, which means there won't be any overlap. You can also define the number of samples.
n_groups <- 3
observations_per_group <- 5
size <- n_groups * observations_per_group
selected_samples <- sample(seq_len(nrow(NF)), size = size, replace = FALSE)
# Now index those selected rows
NF_1 <- NF[selected_samples, ]
Now, according to your comment, if you want to generate N data frames, each with a number of samples, and also label them accordingly, you can use lapply (a function that "applies" a function to a set of values). The "l" in "lapply" means that it returns a list. There are other types of apply functions, and I highly recommend reading up on them.
This code should solve your problem, or at least give you a good idea of where to go.
n_groups <- 3
observations_per_group <- 5
size <- observations_per_group * n_groups
# First we'll get the row samples.
selected_samples <- sample(
  seq_len(nrow(NF)),
  size = size,
  replace = FALSE
)
# Now we split them between the number of groups
split_samples <- split(
  selected_samples,
  rep(1:n_groups, observations_per_group)
)
# For each group (1 to n_groups) we'll define a dataframe with samples
# and store them sequentially in a list.
my_dataframes <- lapply(1:n_groups, function(x) {
  # our subset df will be the original df with the samples
  # for the group at position "x" (1, 2, 3, ..., n_groups)
  subset_df <- NF[split_samples[[x]], ]  # note [[ ]] to extract the vector of row indices
  return(subset_df)
})
# now, if you need to access the results, you can simply do:
first_df <- my_dataframes[[1]] # use double brackets to access list elements
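Since you also wanted to label the groups, one small follow-up (the group names here are just an illustration):
names(my_dataframes) <- paste0("NF_", 1:n_groups)  # e.g. NF_1, NF_2, NF_3
my_dataframes$NF_2                                 # access a group by its label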

How to apply (or sapply) a function for double range of data?

I have this function
fun_2 <- function(x, L_inf, y){
  L_inf - ((L_inf - y) /
           (exp(-B * (x - c(1:12)) / 12)))
}
B <- 0.5
The problem is similar to this previous post: R: How to create a loop for, for a range of data in a function?
In this case I would like to apply fun_2 over these two ranges of data:
L_inf_range <- seq(17,20,by=0.1) #31 values
y_range <- seq(4,22,by=0.1) # 181 values
I tried with:
sapply(L_inf_range, function(L) fun_2(12, L_inf=L,y_range))
but that is not the expected output. My expected output is a new matrix generated by sapply (or another kind of function) where fun_2 is applied to every value in L_inf_range, each time for all values of y_range.
Essentially, it will be a matrix where fun_2 is applied for each of the 31 values of L_inf_range minus y_range (L_inf - y in fun_2) each time.
You could also run sapply twice, changing the order of your arguments to make it a little easier.
fun_2_mod <- function(y, L_inf, x = 12){ fun_2(x, L_inf, y) }  # default x = 12, as in your example call
sapply(y_range, function(Y){ sapply(L_inf_range, fun_2_mod, y = Y) })
You can bind the resulting list of lists together into a data frame.
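Another option, as a minimal sketch (assuming fun_2 and B as defined above): build every (L_inf, y) pair explicitly with expand.grid and call fun_2 once per pair with mapply, keeping x fixed at 12.
pairs <- expand.grid(L_inf = L_inf_range, y = y_range)
res <- mapply(function(L, Y) fun_2(12, L_inf = L, y = Y), pairs$L_inf, pairs$y)
# res is a 12-row matrix with one column per (L_inf, y) combination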

Method in [R] for arrays of data frames

I am looking for a best practice to store multiple vector results of an evaluation performed at several different values. Currently, my working code does this:
q <- 55
value <- c(0.95, 0.99, 0.995)
a <- rep(0, q) # just initialize the vector
b <- rep(0, q) # just initialize the vector
for(j in 1:length(value)){
  for(i in 1:q){
    a[i] <- rnorm(1, i, value[j]) # just as an example function
    b[i] <- rnorm(1, i, value[j]) # just as an example function
  }
  df[j] <- data.frame(a, b)
}
I am trying to find the best way to store the individual a and b vectors for each level of value, so that I can iterate through the variable "value" later for graphing, and so that the value of the variable "value" (and/or a description of it) remains available.
I'm not exactly sure what you're trying to do, so let me know if this is what you're looking for.
q = 55
value <- c(sd95=0.95, sd99=0.99, sd995=0.995)
a = sapply(value, function(v) {
  rnorm(q, 1:q, v)
})
In the code above, we avoid the inner loop by vectorizing. For example, rnorm(55, 1:55, 0.95) will give you 55 random normal deviates, the first drawn from a distribution with mean=1, the second from a distribution with mean=2, etc. Also, you don't need to initialize a.
sapply takes the place of the outer loop. It applies a function to each element of value and returns the three vectors of random draws as the columns of the matrix a. I've added names to the values in value, and sapply uses those as the column names of the resulting matrix a. (It would be more standard to make value a list rather than a vector with named elements. You can do that with value <- list(sd95=0.95, sd99=0.99, sd995=0.995) and the code will otherwise run the same.)
You can create multiple data frames and store them in a list as follows:
q <- list(a=10, b=20)
value <- list(sd95=0.95, sd99=0.99, sd995=0.995)
df.list = sapply(q, function(i) {
  sapply(value, function(v) {
    rnorm(i, 1:i, v)
  })
})
This time we have two different values for q, and we wrap the sapply code from above inside another call to sapply. The inner sapply does the same thing as before, but now it gets the value of q from the outer sapply (via the dummy variable i). We're creating two matrices, one called a and the other called b; a has 10 rows and b has 20 (due to the values we set in q). Both are stored in a list called df.list.
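If you do want proper data frames rather than matrices inside the list, one small follow-up sketch:
head(df.list$a)                               # the 10-row result for q = 10
df.frames <- lapply(df.list, as.data.frame)   # convert each stored matrix to a data frame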

How to vectorize a for loop in R

I'm trying to clean this code up and was wondering if anybody has any suggestions on how to run this in R without a loop. I have a dataset called data with 100 variables and 200,000 observations. What I want to do is essentially expand the dataset by multiplying each observation by a specific scalar and then combine the data together. In the end, I need a data set with 800,000 observations (I have four categories to create) and 101 variables. Here's a loop I wrote that does this, but it is very inefficient and I'd like something quicker.
datanew <- c()
for (i in 1:51){
  for (k in 1:6){
    for (m in 1:4){
      sub <- subset(data, data$var1==i & data$var2==k)
      sub[, 4:(ncol(sub)-1)] <- filingstat0711[i,k,m] * sub[, 4:(ncol(sub)-1)]
      sub$newvar <- m
      datanew <- rbind(datanew, sub)
    }
  }
}
Please let me know what you think and thanks for the help.
Below is some sample data with 2K observations instead of 200K
# SAMPLE DATA
#------------------------------------------------#
mydf <- as.data.frame(matrix(rnorm(100 * 20e2), nrow=20e2, ncol=100)) # 2000 observations, 100 variables
var1 <- c(sapply(seq(41), function(x) sample(1:51)))[1:20e2]
var2 <- c(sapply(seq(2 + 20e2/6), function(x) sample(1:6)))[1:20e2]
#----------------------------------#
mydf <- cbind(var1, var2, round(mydf[3:100]*2.5, 2))
filingstat0711 <- array(round(rnorm(51*6*4)*1.5 + abs(rnorm(2)*10)), dim=c(51,6,4))
#------------------------------------------------#
You can try the following. Notice that we replaced the first two for loops with a call to mapply and the third for loop with a call to lapply.
Also, we are creating two vectors that we will combine for vectorized multiplication.
# create a table of the i-k index combinations using `expand.grid`
ixk <- expand.grid(i=1:51, k=1:6)
# Take a look at what expand.grid does
head(ixk, 60)
# create two vectors for multiplying against our dataframe subset
multpVec <- c(rep(c(0, 1), times=c(4, ncol(mydf)-4-1)), 0)
invVec <- !multpVec
# example of how we will use the vectors
(multpVec * filingstat0711[1, 2, 1] + invVec)
# Instead of for loops, we can use mapply.
newdf <-
  mapply(function(i, k)
    # The function that we are `mapply`ing:
    # rbind'ing a list of dataframes, which were subsetted by matching var1 & var2
    # and then multiplied by a value in filingstat0711
    do.call(rbind,
            # iterating over m
            lapply(1:4, function(m)
              # the cbind adds newvar=m at the end of the subtable
              cbind(
                # we transpose twice: first the subset, to multiply by our vector;
                # then the result, to get back our original form
                t( t(subset(mydf, var1==i & var2==k)) *
                   (multpVec * filingstat0711[i,k,m] + invVec)),
                # this is an argument to cbind
                "newvar"=m)
            )),
    # the two vectors you are passing as arguments are the columns of the expanded grid
    ixk$i, ixk$k, SIMPLIFY=FALSE
  )
# flatten the data frame
newdf <- do.call(rbind, newdf)
Two points to note:
Try not to use names like data, table, df, sub, etc., which are also the names of commonly used functions.
In the above code I used mydf in place of data.
You can use apply(ixk, 1, fu..) instead of the mapply that I used, but I think mapply makes for cleaner code in this situation.
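As a quick sanity check on the result (using the sample data above), the flattened newdf should have four times as many rows as mydf and one extra column:
dim(mydf)   # 2000 rows, 100 columns with the sample data above
dim(newdf)  # expect 8000 rows and 101 columns (the extra column is newvar)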
