Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R performs to replace the columns of df with the vectors in the list returned by lapply? I see that the result from lapply is a list of vectors with the same names as the columns of df. I assume some mapping is being done from those vectors to df[], which I take to be the collection of columns in df. Does it just work? I'm trying to understand this better so that I remember what to use next time.
A data.frame is merely a list of vectors having the same length. You can see it using is.list(a_data_frame). It will return TRUE.
[] can have a different meaning or action depending on the object it is applied to. It can even be redefined, since it is in fact a function.
On a data.frame, [] lets you subset or replace column vectors.
df[1] gets the first column
df[1] <- 2 replaces the first column with 2 (recycled so it has the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1, c2, c3) sets the content of the data.frame, replacing its current content
Plus a wide range of other ways to access or set data in a data.frame (by column name, by subset of rows or columns, ...).
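To see the behaviour the question asks about, here is a small sketch with a toy data frame (column names made up for illustration); assigning the list from lapply into df[] keeps the data.frame class and the column names:
df <- data.frame(a = 1:3, b = 4:6)
str(lapply(df, function(x) x * 2))    # a plain list of vectors
df[] <- lapply(df, function(x) x * 2) # assign the list back into the columns
class(df)                             # still "data.frame"
df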
Related
Say I have 10 dataframes. I would like to check if all have same column names irrespective of their cases.
I can do this in multiple steps, but I was wondering if there is a shortcut way to do this?
We place the datasets in a list, loop over the list with lapply, get the column names, convert them to a single case, sort them, take the unique set across the list, and check whether its length is 1:
length(unique(lapply(lst1, function(x) sort(toupper(names(x)))))) == 1
#[1] TRUE
data
lst1 <- list(mtcars, mtcars, mtcars)
You can use Reduce + intersect to get all the common column names in the list of dataframes and compare it with the names of any single dataframe in the list.
all(sort(Reduce(intersect, lapply(list_df, names))) == sort(names(list_df[[1]])))
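A quick sanity check with column names that differ only in case (toy data frames, assumed for illustration); as in the first approach, case is folded with toupper() before comparing:
d1 <- data.frame(MPG = 1:3, Cyl = 4:6)
d2 <- data.frame(mpg = 1:3, cyl = 4:6)
list_df <- list(d1, d2)
all(sort(Reduce(intersect, lapply(list_df, function(x) toupper(names(x))))) ==
      sort(toupper(names(list_df[[1]]))))
#[1] TRUE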
I have a dataframe with cases that repeat across rows. Some rows have more complete data than others. I would like to group the cases and then assign the first non-missing value to all NA cells in that column for that group.
This seems like a simple enough task, but I'm stuck. I have working syntax, but when I try to use apply to apply the code to all columns in the dataframe, I get a list back instead of a dataframe. Using do.call(rbind), rbindlist, or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid <- group_indices(df, id1, id2) # creates a group id from the combination of two variables
df %<>% group_by(id1, id2) # groups the dataframe by these variables
df <- summarise(df, xvar1 = xvar1[which(!is.na(xvar1))[1]]) # assigns the first non-missing value to all missing values, but it only works on one column at a time (xvar1)
I have many columns, so I tried using apply to make this a manageable task:
df <- apply(df, MARGIN = 2, FUN = function(x) {
  summarise(df, x = x[which(!is.na(x))[1]])
})
This gets me a list for each variable; I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind), and these result in a long dataframe with only 3 columns: the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))
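If the data is already grouped with dplyr, another way to get the first non-missing value per group for every column at once is summarise() with across(). This is a sketch assuming dplyr >= 1.0 and made-up column names, not the answer above:
library(dplyr)
df <- data.frame(
  id1   = c(1, 1, 2, 2),
  id2   = c("a", "a", "b", "b"),
  xvar1 = c(NA, 5, 7, NA),
  xvar2 = c(3, NA, NA, 9)
)
# first non-missing value of every non-grouping column, per group
df %>%
  group_by(id1, id2) %>%
  summarise(across(everything(), ~ .x[which(!is.na(.x))[1]]), .groups = "drop")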
I have a dataframe df1 which I wish to loop over by row, such that I can use it to update another dataframe, df2.
I take each row of df1 and use a user-defined function to update df2:
updateDF2 <- function (row_of_df1, df2) {
# do something to df2 conditional on df1's columns
assign('df2',df2,envir=.GlobalEnv)
}
Note the "assign" above updates df2.
To test the user-defined function updateDF2, I took out a random row from df1 and assigned it to a new vector. I then call updateDF2 with the new vector and df2 as arguments. This has consistently worked with no issue.
It's the looping that I have a problem with. I get the error message
Error in row_of_df1$Column_of_condition: $ operator is invalid for atomic vectors
when I use
apply(df1, 1, function(x) updateDF2(row_of_df1=x, df2=df2))
The same error occurs when I use
apply(df1[1,], 1, function(x) updateDF2(row_of_df1=x, df2=df2))
But if I use
new_vector <- df1[1,]
updateDF2(new_vector, df2)
there would be no error. What's the difference here?
Since individual rows of df1 work with the user-defined function, do I need to explicitly write a loop over rows of df1, or can I use one of the apply family commands to make it work?
Since you don't provide any data, or any meaningful code, this is just a guess.
The apply(...) function coerces its first argument to a matrix and processes that row-wise (if the second argument is 1). So the rows that are passed to FUN are atomic vectors, not rows of a data frame. You can see this as follows:
df <- data.frame(x=1:10, y=rnorm(10), z=rpois(10,4))
class(df[1,])
#[1] "data.frame"
apply(df[1,],1,class)
# 1
# "numeric"
In your function updateDF2(...), you are probably referring to the elements of row_of_df1 as, e.g., row_of_df1$A, etc., where A is the name of a column in df1. This will not work with an atomic vector. You could use row_of_df1["A"], or row_of_df1[1] for example, but you cannot use the $ operator.
You should also be aware that there are other problems with using apply(...). Since it coerces the first argument to a matrix, and by definition all elements in a matrix must have the same data type, if df1 has any columns of type character, the whole matrix will be coerced to character.
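For example, a single character column is enough to make apply() hand every row to FUN as a character vector; iterating over row indices instead keeps each row as a one-row data frame, so $ still works. A sketch with made-up data:
df1 <- data.frame(a = 1:3, b = c("x", "y", "z"), stringsAsFactors = FALSE)
apply(df1, 1, class)          # "character" for every row, because of column b
for (i in seq_len(nrow(df1))) {
  row_of_df1 <- df1[i, ]
  print(class(row_of_df1))    # "data.frame"
  print(row_of_df1$b)         # the $ operator works here
}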
I'm trying to convert a large list (220559 elements) into a data frame. Each element is either chr (RT) or chr(0)
I tried:
data.frame(t(sapply(my.list, c)))
I got the data frame, but it turned out to be one observation with 220559 variables instead of one variable with 220559 observations.
Is there an easy way to switch the observations with the variables? Or do I have to create the data frame differently? I'm new to R and really looking forward to your help.
So you have a giant list where each element is either the character "RT" or an empty character vector (character(0)). And you want to turn this into a data frame with one row and one column for each item in the list (220559 columns).
The problem is that data.frames like all columns to have the same number of observations (rows). And length("RT")==1 while length(character(0))==0. So you can either drop those columns, or convert those values to NA. I'm going to assume the latter for my example.
# "large" list
xx<-sample(list(character(), "RT"), 1000, replace=T)
#make into data.frame
df<-data.frame(lapply(xx, function(x) if(length(x)==0) NA else x))
#add nicer names
names(df)<-paste0("V",seq_along(df))
That's it. Normally to turn a list into a data.frame you just call data.frame(). It was just a bit trickier because of your zero-length vectors.
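Since the question also mentions wanting the transposed shape (one variable with one observation per list element), here is a sketch of that orientation under the same assumptions as above:
xx <- sample(list(character(), "RT"), 1000, replace = TRUE)
# one column, one row per list element; empty elements become NA
df_long <- data.frame(V1 = vapply(xx, function(x) if (length(x) == 0) NA_character_ else x,
                                  character(1)))
str(df_long)   # 1000 rows, 1 column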
I am trying to name the columns of a matrix using values from a vector.
suppose I have the following matrix:
A <- matrix(1:110, ncol=11)
and also a vector with 11 values from read.table:
code <- data1$code
I would like to do something like:
colnames(A)=data.frame(code)
to put the names of the columns using the values from the vector code
It will be far simpler just to pass code (or perhaps as.character(code), if it is a factor variable):
colnames(A) <- as.character(code)
Passing a data.frame with one column will not work, as it has length 1 (the one column).
A data.frame is a list, so if you passed a list with two elements of the correct lengths to dimnames, you could set both rownames and colnames at the same time.
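For example, a small sketch (the row names and the stand-in for data1$code are made up for illustration):
A <- matrix(1:110, ncol = 11)
code <- letters[1:11]                     # stand-in for data1$code
colnames(A) <- as.character(code)         # set column names only
dimnames(A) <- list(paste0("row", 1:10),  # or set row and column names together
                    as.character(code))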