lapply set of functions across multiple dataframes - r

I have a set of functions I need to apply to several dataframes. I want to use the lapply function instead of for() loops.
#sample data frame
id lastpage attribute_2
 1       20         232
 2        8         232
 3        6         129
 4       20        1271
 5       20         129
 6       20          74
The functions work when I apply them to one dataframe at a time. They basically remove duplicates (based on attribute_2), dropping the rows with the lower values of 'lastpage':
df <- df[order(df$attribute_2, -df$lastpage),]
df <- df[!duplicated(df$attribute_2),]
When I try to (l)apply this function to several dataframes, nothing seems to have changed when I inspect the dataframes afterwards. Intuitively I think I am messing something up when calling df, but I am not sure what:
df.list <- list(df0, df1, df2, df3)
myFunc <- function(df) {
  df <- df[order(df$attribute_2, -df$lastpage), ]
  df <- df[!duplicated(df$attribute_2), ]
  return(df)
}
df.list <- lapply(df.list, FUN = myFunc)
Your help is much appreciated!
I have looked at all similar previous questions on lapply functions, specifically this one: Applying a set of operations across several data frames in r
I am probably making a very obvious mistake, but I just can't find it.
EDIT: thanks everyone for the help
For anyone wondering what code I exactly use now:
df.list <- list(df0, df1, df2, df3)
myFunc <- function(x) {
  x <- x[order(x$attribute_2, -x$lastpage), ]
  x <- x[!duplicated(x$attribute_2), ]
}
df.list2 <- lapply(df.list, myFunc)
df2_c <- df.list2[[3]]

Your code probably works as expected but you’re assigning its result to df.list, not to the original data.frames. The list contains copies of these, so they would never get modified. This is intentional, and the desired behaviour in R.
In fact, just keep working with your list of data.frames.
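If you really do want the standalone objects df0 to df3 themselves overwritten, a minimal sketch (reusing the same myFunc) is to name the list after the originals and write the results back with list2env():
df.list <- list(df0 = df0, df1 = df1, df2 = df2, df3 = df3)
df.list <- lapply(df.list, myFunc)
# overwrite the original objects in the global environment with the cleaned copies
list2env(df.list, envir = .GlobalEnv)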

This example does what you intend to do:
set.seed(314)
df <- data.frame(x = sample(1:10, size = 50, replace = TRUE),
                 y = sample(1:10, size = 50, replace = TRUE))
df.list <- list(df,df,df,df)
lapply(df.list,nrow)
testfunction <- function(data) {
  data[!duplicated(data$x), ]
}
lapply(df.list, testfunction)
I think there is something wrong with your function. I noticed that you reference column email which is not in your dataframe.
It is also advisable to rename the variables that are used inside the function, so you don't reference global variables.
And as Konrad said in the other answer, your original dataframes stayed the same, so call them for example as follows:
df.list2 <- lapply(df.list, testfunction)
df.list2[[1]]

Related

Compute 15 rows in parallel (through vectorization) and create df with them

I am creating 15 rows in a dataframe, like this. I cannot show my real code, but the create_row function involves complex calculations that can be put in a function. Any ideas on how I can do this using lapply, apply, etc. to create all 15 rows in parallel and then concatenate them into a dataframe? I think using lapply will work (i.e. put all rows in a list, then unlist and concatenate), but I am not exactly sure how to do it.
for (i in 1:15) {
  row <- create_row()
  # row is essentially a dataframe with 1 row
  my_df <- rbind(my_df, row)
}
Something like this should work for you,
create_row <- function() {
  rnorm(10, 0, 1)
}
my_list <- vector(mode = "list", length = 100)
my_list_2 <- lapply(my_list, function(x) create_row())
data.frame(t(sapply(my_list_2, c)))
The create_row function is just there to make the example reproducible. We predefine an empty list, fill it with the result of create_row() for each element, and then convert the resulting list to a data frame.
Alternatively, predefine a matrix and use apply() over the row margin, then transpose with t() to get the output into the correct shape:
df <- data.frame(matrix(ncol = 10, nrow = 100))
t(apply(df, 1, function(x) create_row()))
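Another common idiom for the same "build each row, then concatenate" pattern is do.call() with rbind; a small sketch, assuming create_row() returns something rbind-able such as the numeric vector above or a one-row data frame:
rows <- lapply(1:15, function(i) create_row())  # build the 15 rows as a list
my_df <- as.data.frame(do.call(rbind, rows))    # stack them and convert to a data frame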

Remove duplicate rows for multiple dataframes

I have over 100 dataframes (df1, df2, df3, ....) each contains the same variables. I want to loop through all of them and remove duplicates by id. For df1, I can do:
df1 <- df1[!duplicated(df1$id), ]
How can I do this in an efficient way?
If you're dealing with 100 similarly structured data.frames, I suggest that, instead of naming them uniquely, you put them in a list.
Assuming they are all named df followed by a number, you can easily collect them into a list with something like:
df_varnames <- ls()[ grep("^df[0-9]+$", ls()) ]
or, as #MatteoCastagna suggested in a comment:
df_varnames <- ls(pattern = "^df[0-9]+$")
(which is both faster and cleaner). Then:
dflist <- sapply(df_varnames, get, simplify = FALSE)
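As an aside, base R's mget() does that get-every-name step in a single call, if you prefer:
dflist <- mget(df_varnames)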
And from here, your question is simply:
dflist2 <- lapply(dflist, function(z) z[!duplicated(z$id),])
If you must deal with them as individual data.frames (again, discouraged, almost always slows down processing while not adding any functionality), you can try a hack like this (using df_varnames from above):
for (dfname in df_varnames) {
  df <- get(dfname)
  assign(dfname, df[!duplicated(df$id), ])
}
I cringe when I consider using this, but I admit I may not understand your workflow.
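That said, if you really need the deduplicated versions available again under their original names, a less error-prone sketch is to push the named list dflist2 from above back into the global environment with list2env():
# overwrites df1, df2, ... with their deduplicated counterparts
list2env(dflist2, envir = .GlobalEnv)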

How to generalize union() to take N arguments?

How can I append/push data into union() dynamically?
For instance, I have 4 data sets to merge,
mydata <- union(data1, data2, data3, data4)
But sometimes I have fewer than 4 and sometimes more than that.
Any ideas how can I solve this problem?
Make some reproducible data:
#dummy data
data1 <- data.frame(x=letters[1:3])
data2 <- data.frame(x=letters[2:4])
data3 <- data.frame(x=letters[5:7])
We can build the rbind-plus-unique call as a string, then evaluate it:
#get list of data frames to merge, update pattern as needed
data_names <- ls()[grepl("data\\d",ls())]
data_names <- paste(data_names,collapse=",")
#make command string
myUnion <- paste0("unique(rbind(",data_names,"))")
#evaluate
eval(parse(text=myUnion))
EDIT:
Here is another better/simpler way, using do.call:
unique(do.call("rbind",lapply(objects(pattern="data\\d"),get)))
You could roll your own function like vunion defined below. Not sure if this actually works, my [R] got a bit stale ;)
Basically, you accept any number of arguments (hence ...) and treat them as if they were packed in a list. Take the first 2 items off that list, compute their union, append the result to the end of the list, and repeat.
vunion <- function(...) {
  data <- list(...)
  n <- length(data)
  if (n > 2) {
    # union the first two items, append the result to the remainder, and recurse
    u <- list(union(data[[1]], data[[2]]))
    return(do.call(vunion, c(tail(data, -2), u)))
  } else {
    return(union(data[[1]], data[[2]]))
  }
}
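The same "fold a two-argument union over a list" idea can also be written without explicit recursion using base R's Reduce(); a minimal sketch (vunion2 is just an illustrative name, and union can be base union for vectors or dplyr::union for data frames):
vunion2 <- function(...) {
  # Reduce() applies the two-argument union() pairwise, left to right
  Reduce(union, list(...))
}
mydata <- vunion2(data1, data2, data3)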

How to apply a single argument function to all columns within all dataframes in a list

Say I have a list dflist which contains dataframes df1 and df2.
df1 <- data.frame(VAR1 = letters[1:10], VAR2 = 1:10)
df2 <- data.frame(VAR3 = letters[11:20], VAR4 = 11:20)
dflist <- list(df1 = df1, df2 = df2)
In general, I want to apply a single argument function to each of the variables in each dataframe in the list. To make the question more concrete, say I'm interested in setting the variable names to lowercase. Using a dataframe paradigm, I'd just do this:
colnames(df1) <- tolower(colnames(df1))
colnames(df2) <- tolower(colnames(df2))
However, this becomes prohibitive when I have dozens of variables in each of the 20 or 30 dataframes I'm working on, hence the shift to using lists.
I'm aware that this question stems from my fundamental misunderstanding of the *apply family of functions, but I've been unable to locate examples of functions applied to deeper than the first sublevel of a list. Thanks for any input.
As #akrun suggested, the answer is simply:
lapply(dflist, function(x) {colnames(x) <- tolower(colnames(x)); x })
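For the more general case stated in the question, applying a single-argument function to every variable in every dataframe, the same nesting works with an inner lapply over the columns; a small sketch using as.character purely as a placeholder function:
dflist <- lapply(dflist, function(d) {
  # apply the function column by column while keeping the data.frame shape
  d[] <- lapply(d, as.character)
  d
})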

Correct implementation of lapply

As far as I understand it, in R it can be more elegant to use functions such as lapply rather than the for loops that are used more often than not in other object-oriented languages. However, I cannot get my head around the syntax and am making foolish errors when trying to implement simple tasks with the function. For example:
I have a series of dataframes loaded from CSV files using a for loop. The following dummy dataframes adequately describe the data:
x <- c(0,10,11,12,13)
y <- c(1,NA,NA,NA,NA)
z <- c(2,20,21,22,23)
a <- c(0,6,5,4,3)
b <- c(1,7,8,9,10)
c <- c(2,NA,NA,NA,NA)
df1 <- data.frame(x,y,z)
df2 <- data.frame(a,b,c)
I first generate a list of dataframe names (data_names; I do this when loading the CSV files) and then simply want to sum the columns. My attempt of course does not work:
lapply(data_names, function(df) {
  counts <- colSums(!is.na(data_names))
})
I could of course use lists (and I realise in the long run this maybe better) however from a pedagogical point of view I would like to understand lapply better.
Many thanks for any pointers
It's really just your use of is.na (you apply it to the vector of names, not to the dataframes themselves) and the fact that you don't need the assignment operator <- inside the function. lapply returns a list which is the result of applying FUN to each element of the input list. You assign the output of lapply to a variable, e.g. res <- lapply( .... , FUN ).
I'm also not too sure how you made the list initially, but the below should suffice. You also don't need an anonymous function in this case; you can use the named function colSums and provide the na.rm = TRUE argument to take care of pesky NAs in your data:
lapply( list( df1, df2 ) , colSums , na.rm = TRUE )
[[1]]
 x  y  z
46  1 88

[[2]]
 a  b  c
18 35  2
So you can read this as:
For each df in the list:
apply colSums with the argument na.rm = TRUE
The result is a list, each element of which is the result of applying colSums to each df in the list.
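And if the original intention was to count the non-NA values per column (which is what colSums(!is.na(.)) computes), the anonymous-function version of the attempt would look something like this sketch:
lapply(list(df1, df2), function(df) colSums(!is.na(df)))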
