How do I apply a function to each element across two data frames?
In the following example I want to avoid the double for-loop:
for(m in 1:nrow(DF1)) {
for(n in 1:ncol(DF1)) {
mySeq <- seq(DF1[m,n], DF2[m,n], 0.01)
# Do the rest with mySeq ...
}
}
My purpose is to create sequences between each element of two dataframes with the same index.
Probably the fastest solution, and my first thought, was mySeq <- seq(DF1, DF2, 0.01). But this doesn't work because the from and to arguments of seq() have to be of length 1.
My second try was to use apply(). This doesn't work because it only operates on one data frame. Then I searched for an appropriate apply solution and found mapply(). With mapply() it is possible to operate on two data frames, but it does not work on each element of the data frames, only on their rows. And I don't want to use nested apply() calls.
So my question is how to code the example shown above without using a double for-loop nor a nested apply?
I'm not sure what function you are trying to apply on the elements but I have used the sweep() function for something similar in the past. For example:
df = data.frame(x = 1:10, y = 1:10, z = 1:10)
sweep(df, 1:2, 1)
Here sweep goes through every element of df and subtracts 1 but you can specify your own function to operate on the elements. Then you can either tie your 2 data frames together and use sweep() or apply it separately.
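For the sequence case in the question, one option that avoids both the double loop and nested apply() calls is to flatten both data frames and walk them in parallel with Map(). This is only a sketch: DF1 and DF2 below are made-up stand-ins with identical dimensions, and the result is a flat list in column-major order rather than a data frame.

```r
# Toy stand-ins for DF1 and DF2 (assumed: same shape, all numeric)
DF1 <- data.frame(a = c(0, 1), b = c(2, 3))
DF2 <- data.frame(a = c(0.05, 1.05), b = c(2.05, 3.05))

# unlist() flattens each data frame column by column, so element k of
# one vector lines up with element k of the other.
seqs <- Map(function(from, to) seq(from, to, by = 0.01),
            unlist(DF1), unlist(DF2))

length(seqs)  # one sequence per cell: nrow * ncol
names(seqs)   # names such as "a1", "a2", "b1", "b2" identify column and row
```

Each list element can then be processed with a further lapply() call, which takes the place of the "Do the rest with mySeq" step.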
Related
I have a matrix (11x42) and I would like to apply a function for each column one by one and put the result back in a new 11x42 matrix with a new name and modified column names.
I am not used to loops, so I am struggling a bit. Here is what I have so far, but it is not working.
for (i in 1:ncol(matrix))
{
res[[i]] <-residuals(lm(matrix[,i]~HW))
}
I would like to also use the function paste0("new_", i) to change the names of each column.
Here I was trying to create 42 vectors (res1 to res42) that I would cbind into a new matrix. But it's not working. And I am pretty sure that could be done within the loop as well.
Thanks in advance!
Since it's a matrix, you should use apply with margin 2, i.e.
new_mat <- apply(your_mat, 2, function(i) residuals(lm(i~HW)))
colnames(new_mat) <- paste0('new_', colnames(your_mat))
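As a runnable illustration, the sketch below uses simulated data (HW and your_mat are random stand-ins, not the asker's data). It also shows that lm() accepts a matrix on the left-hand side, which fits every column in one call and gives the same residuals as the column-wise apply().

```r
set.seed(42)
HW <- rnorm(11)                                   # made-up predictor
your_mat <- matrix(rnorm(11 * 42), nrow = 11,     # made-up 11 x 42 matrix
                   dimnames = list(NULL, paste0("v", 1:42)))

# Column by column, as in the answer above
by_column <- apply(your_mat, 2, function(col) residuals(lm(col ~ HW)))

# All columns at once: a matrix response yields a matrix of residuals
all_at_once <- residuals(lm(your_mat ~ HW))
colnames(all_at_once) <- paste0("new_", colnames(your_mat))

dim(all_at_once)
```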
I am creating 15 rows in a dataframe, like this. I cannot show my real code, but the create row function involves complex calculations that can be put in a function. Any ideas on how I can do this using lapply, apply, etc. to create all 15 in parallel and then concatenate all the rows into a dataframe? I think using lapply will work (i.e. put all rows in a list, then unlist and concatenate, but not exactly sure how to do it).
for( i in 1:15 ) {
row <- create_row()
# row is essentially a dataframe with 1 row
rbind(my_df,row)
}
Something like this should work for you,
create_row <- function(){
rnorm(10, 0,1)
}
my_list <- vector(100, mode = "list")
my_list_2 <- lapply(my_list, function(x) create_row())
data.frame(t(sapply(my_list_2,c)))
The create_row function is just there to make the example reproducible. We predefine an empty list, fill it with the results of the create_row() function, and then convert the resulting list to a data frame.
Alternatively, predefine a matrix and use apply over the row margin, then use t (transpose) to get the output in the right orientation,
df <- data.frame(matrix(ncol = 10, nrow = 100))
t(apply(df, 1, function(x) create_row())) # create_row() takes no arguments, so x is ignored
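If each call really returns a one-row data frame, as the question describes, a common base R pattern is to collect the rows with lapply() and bind them once with do.call(rbind, ...). The create_row() below is a hypothetical stand-in that takes the row index as its argument; the real function would go in its place.

```r
# Hypothetical stand-in: returns a data frame with exactly one row
create_row <- function(i) {
  data.frame(id = i, value = i^2)
}

rows <- lapply(1:15, create_row)  # list of 15 one-row data frames
my_df <- do.call(rbind, rows)     # bind them in a single step

nrow(my_df)
```

Binding once at the end also avoids growing my_df inside the loop. On Unix-like systems, parallel::mclapply() can be swapped in for lapply() if the rows are expensive to compute.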
I am attempting to write a function which accepts a dataframe, and then generates subset dataframes within a for() loop. As a first step, I tried the following:
dfcreator<-function(X,Z){
for(i in 1:Z){
df<-subset(X,Stratum==Z) #build dataframe from observations where index=value
assign(paste0("pop", Z),df) #name dataframe
}
}
This, however, does not save anything into memory, and when I try to specify a return() I am still not getting what I need. For reference, I am using the Sweden data set (which is native to RStudio).
EDIT Per Melissa's Advice!
I tried to implement the following code:
sampler <- function(df, n,...) {
return(df[sample(nrow(df),n),])
}
sample_list<-map2(data_list, stratumSizeVec, sampler)
where stratumSizeVec is a 1x7 df and data_list is a list of seven dfs. When I do this, I get seven samples in sample_list, all of the same size, equal to stratumSizeVec[1]. Why is map2 not passing the inputs in the following manner?
sampler(data_list$pop0,stratumSizeVec[1])
sampler(data_list$pop1,stratumSizeVec[2])
...
sampler(data_list$pop6,stratumSizeVec[7])
Furthermore, is there a way to "nest" the map2 function within lapply?
I'm confused as to why you never actually utilize i anywhere in your loop. It looks like you're creating Z copies of the data set where Stratum == Z - is that what you are after?
As for your code, I would use the following:
data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", sort(unique(df$Stratum)))
This doesn't define a function; we are calling a base R function (namely split) which splits up a data frame based on some vector (here, we use df$Stratum). The result is a list of data frames, each with a single value of Stratum.
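A small self-contained illustration of the split() approach follows. The data frame here is made up; only the Stratum column name is taken from the question.

```r
# Toy data with three strata (values are illustrative only)
df <- data.frame(Stratum = c(0, 0, 1, 1, 2), value = 1:5)

# One data frame per distinct Stratum value, returned as a named list
data_list <- split(df, df$Stratum)
names(data_list) <- paste0("pop", names(data_list))

names(data_list)
nrow(data_list$pop0)
```

Because everything lives in one named list, no assign() calls are needed, and the list can be returned from a function like any other value.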
Random sampling from rows
sampled_data <- lapply(data_list, function(df, n,...) { # n is the number of rows to take, the dots let you send other information to the `sample` function.
df[sample(nrow(df), n, ...),]
},
n = 5,
replace = FALSE # this is default, but the purpose of using the ... notation is to allow this (and any other options in the `sample` function) to be changed.
)
You can also define the function separately:
sampler <- function(df, n,...) {
df[sample(nrow(df), n, ...),]
}
sampled_data <- lapply(data_list, sampler, n = 10) # replace 10 with however many samples you want.
purrr::map2 method
As defined, the sampler function does not need to be modified: each element of the first list (data_list) is passed as the first argument of sampler, and the corresponding element of the second "list" (sampleSizeVec) is passed as the second argument.
library(purrr)
map2(data_list, sampleSizeVec, sampler, replace = FALSE) # replace = FALSE not needed, there as an example only.
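The same pairing can be checked in base R with Map(), which matches elements positionally in the same way. In this sketch the two populations and the sizes vector are invented, and the sizes are held in a plain numeric vector; if they are stored in a data frame, converting them with unlist() first is one way to make sure there is one size per list element.

```r
sampler <- function(df, n, ...) {
  # drop = FALSE keeps a data frame even when there is only one column
  df[sample(nrow(df), n, ...), , drop = FALSE]
}

# Made-up populations and per-population sample sizes
data_list <- list(pop0 = data.frame(x = 1:10),
                  pop1 = data.frame(x = 11:30))
sizes <- c(3, 5)

# Element k of data_list is paired with element k of sizes
sample_list <- Map(sampler, data_list, sizes)
sapply(sample_list, nrow)
```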
I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest converting the data set into the 'long' format, grouping it by sample, and then calculating the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use the apply function and select the columns you need with c(i1, i2, ...).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity (so long as it takes a vector as an input and returns a vector of the same length). If it doesn't, the result won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data and define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not strictly necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))
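When there are many columns, listing indices by hand gets tedious. One possible variation, sketched here on the example data, is to select every column except the ones to be left out by name and then use the vectorized rowSums() form.

```r
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)

# Every column except the ones to exclude (here only "sample")
cols <- setdiff(names(x), "sample")

# Square the selected columns and sum across each row
x$result <- rowSums(x[cols]^2)
x$result
```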
I want to perform a function - that uses multiple columns - on certain rows of a data frame based on the contents in one column. I can, of course, accomplish this task using a simple for loop, but I am sure that it must be possible to do so more elegantly using one of the apply functions. I just can't quite figure it out.
(data <- data.frame(a = sample(10), b = sample(10), c=NA))
# for every value of b that is greater than 5,
# set c to be equal to a function of a and b, say: 3 * a + b
# otherwise, c = a
for(i in 1:nrow(data)){
if(data$b[i] > 5) {
data$c[i] <- 3*data$a[i]+data$b[i]
} else {
data$c[i] <- data$a[i]
}
}
data
I realize that there are three things going on here: (1) figuring out which rows to perform the function on, (2) performing the function on those rows and (3) performing the alternate function on the other rows. If I could figure out how to apply a function using multiple columns to every row, I could subset the data before I did that.
I thought that code like this would allow me to perform a function using multiple columns:
sapply(data$b, function(b, a) 3*a+b, a=data$a)
#or
lapply(data$b, function(b, a) 3*a+b, a=data$a)
But it returns an nxn matrix of numbers (or n lists that are n long), and I can't figure out how it calculated them.
I also suspect it's possible to do the selection and the function at the same time, maybe with code like this:
data$c <- sapply(data$b, function(b, table) 3*table$a[b>5] + b[b>5], table=data)
But that code results in similar output problems.
I think most of my problems stem from the fact that I am not quite comfortable with the apply functions, especially with multiple arguments, but none of my fiddling has enlightened me.
Thank you!
You can use plyr::ddply (easiest for me) if you need to run functions row-wise.
In this example, as Blue Magister describes, it is probably easier to do it directly:
data$c<-ifelse(data$b > 5, 3 * data$a + data$b, data$a)
But here's a ddply example:
require(plyr)
ddply(data, c("a","b"), function(df) ifelse(df$b > 5, 3 * df$a + df$b, df$a))
or
data <- adply(data, 1, transform, c = ifelse(b > 5, 3 * a + b, a))
Or, obviously, in this case you can just use apply:
data$c <- apply(data, 1, function(x) ifelse(x["b"] > 5, 3 * x["a"] + x["b"], x["a"]))
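On the n x n output mentioned in the question: sapply() iterates over data$b one value at a time, while a is passed in whole on every call, so each call returns a vector of length n. The sketch below uses a small made-up data frame to show that, and how mapply() pairs the two columns element by element instead.

```r
data <- data.frame(a = 1:4, b = c(2, 6, 8, 3))  # small made-up example

# The call from the question: for each single b, 3 * a + b is computed
# against the whole vector a, giving one column of length n per b.
res <- sapply(data$b, function(b, a) 3 * a + b, a = data$a)
dim(res)  # n x n, here 4 x 4

# mapply() pairs a[i] with b[i], so the result has one value per row.
paired <- mapply(function(a, b) 3 * a + b, data$a, data$b)

# Keep the paired value only where b > 5, otherwise fall back to a.
data$c <- ifelse(data$b > 5, paired, data$a)
data$c
```

Since 3 * a + b is already vectorized, mapply() is not strictly needed here; it is shown only to illustrate passing multiple arguments in parallel.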