I have a dataframe with cases that repeat on the rows. Some rows have more complete data than others. I would like to group cases and then assign the first non-missing value to all NA cells in that column for that group. This seems like a simple enough task but I'm stuck. I have working syntax but when I try to use apply to apply the code to all columns in the dataframe I get a list back instead of a dataframe. Using do.call(rbind) or rbindlist or unlist doesn't quite fix things either.
Here's the syntax.
df$groupid <- group_indices(df, id1, id2) # creates a group id from the combination of two variables
df %<>% group_by(id1, id2) # actually groups the dataframe by these variables (%<>% comes from magrittr)
df <- summarise(df, xvar1 = xvar1[which(!is.na(xvar1))[1]]) # this works great to pick the first non-missing value in each group, but it only handles one column at a time (xvar1)
I have many columns, so I tried using apply to make this a manageable task.
df<-apply(df, MARGIN=2, FUN=function(x) {summarise(df, x=x[which(!is.na(x))[1]])
}
)
This gets me a list with one element per variable, but I wanted a dataframe (which I would then de-duplicate). I tried rbindlist and do.call(rbind), and these result in a long dataframe with only 3 columns: the two group_by variables and 'x'.
I know the problem is simply how I'm using apply, probably the indexing with 'which', but I'm stumped.
What about using lapply with do.call and cbind, like the following:
df <- do.call(cbind, lapply(df, function(x) {summarise(df, x=x[which(!is.na(x))[1]])}))
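For comparison, here is a minimal base R sketch of the task described in the question: filling each NA with the first non-missing value of its group, for every non-grouping column at once. The toy data and the helper names first_non_na and fill_first are made up for illustration; only id1, id2 and xvar1 come from the question. With dplyr, something like summarise(across(everything(), ...)) on the grouped data would probably be the closer equivalent.

```r
# Toy data (hypothetical): two grouping columns and two value columns with gaps
df <- data.frame(
  id1   = c(1, 1, 2, 2, 2),
  id2   = c("a", "a", "b", "b", "b"),
  xvar1 = c(NA, 5, 7, NA, 8),
  xvar2 = c("u", NA, NA, NA, "w"),
  stringsAsFactors = FALSE
)

# First non-missing value of a vector (NA if every value is missing)
first_non_na <- function(x) x[!is.na(x)][1]

# Replace NAs in x with the first non-missing value of the same group g
fill_first <- function(x, g) {
  firsts <- ave(x, g, FUN = function(v) rep(first_non_na(v), length(v)))
  x[is.na(x)] <- firsts[is.na(x)]
  x
}

value_cols <- setdiff(names(df), c("id1", "id2"))
g <- interaction(df$id1, df$id2, drop = TRUE)

# Fill every value column; assigning into filled[value_cols] keeps a data frame
filled <- df
filled[value_cols] <- lapply(df[value_cols], fill_first, g = g)

# Drop rows that became exact duplicates after filling
deduped <- unique(filled)

# Alternative: collapse straight to one row per group
collapsed <- aggregate(df[value_cols], by = df[c("id1", "id2")], FUN = first_non_na)
```

Note that filling only touches the NA cells, so groups with conflicting non-missing values (7 and 8 above) still keep more than one row after unique(); the aggregate() variant keeps just the first non-missing value per group.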
Related
I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# The double brackets let you refer to a column by its name (stored in j) and extract or replace it within the data frame
Here are a few ways. We use CO2 which comes with R and has several factor columns. This unclasses those columns.
If you need some other criterion, then:
in the base R solution, set ix to the names or positions or a logical vector defining the columns to be transformed;
in the collapse solution, replace is.factor with a vector of names or positions or a logical vector denoting the columns to convert;
in the dplyr solution, replace where(...) with the same names, positions or logical vector.
Code follows. In all of these the input is not overwritten, so you still have the input available unchanged if you want to rerun it from scratch. In general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
mutate(across(where(is.factor), unclass))
Depending on what you want, this might be sufficient. Omit the as.data.frame if a matrix result is OK.
as.data.frame(data.matrix(CO2))
Given a dataframe df and a function f which is applied to df:
df[] <- lapply(df, f)
What is the magic R is performing to replace columns in df with collection of vectors in the list from lapply? I see that the result from lapply is a list of vectors having the same names as the dataframe df. I assume some magic mapping is being done to map the vectors to df[], which is the collection of columns in df (methinks). Just works? Trying to better understand so that I remember what to use the next time.
A data.frame is essentially a list of vectors that all have the same length. You can check this with is.list(a_data_frame), which returns TRUE.
[] can have a different meaning or action depending on the object it is applied to. It can even be redefined, since it is in fact a function.
For a data.frame, [] lets you subset or insert columns:
df[1] gets the first column (as a one-column data.frame)
df[1] <- 2 replaces the first column with 2 (recycled to the same length as the other columns)
df[] returns the whole data.frame
df[] <- list(c1,c2,c3) replaces the contents of the data.frame while keeping its attributes (class, column names, row names)
There are also many other ways to access or set data in a data.frame (by column name, by subset of rows or columns, ...).
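A small sketch of the difference between the two assignment forms may help it stick. The data frame and the function f below are made up for illustration.

```r
# Toy data frame and function (both hypothetical)
df <- data.frame(a = 1:3, b = c(10, 20, 30))
f <- function(x) x * 2

# A data frame is a list of equal-length columns
is.list(df)

# Plain assignment would replace the data frame with the bare list from lapply
as_list <- lapply(df, f)
class(as_list)   # "list"

# Assigning into df2[] keeps the data.frame "shell" and swaps in the new columns
df2 <- df
df2[] <- lapply(df2, f)
class(df2)       # "data.frame"
```

So df[] <- lapply(df, f) is less magic than it looks: the empty brackets select all columns, and the replacement form fills those columns from the list while the class and names of df stay in place.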
I've been trying to create a new dataframe from my original dataframe, where each row in the new dataframe is the mean of 20 consecutive rows of the old one. I discovered a function called colMeans, which does the job pretty well. The only remaining problem is how to turn the resulting vector back into a dataframe that can be analysed further.
My code for colMeans (matrix1 is my original dataframe converted to a matrix; this was the only way I managed to get it to work):
a <- colMeans(matrix(matrix1, nrow=20))
But here I get a numeric vector with all the results concatenated in one single column (if I try, for example, as.data.frame(a)). How am I supposed to get this result back into a dataframe where each column holds only the averages for that column's name, rather than all the averages?
I hope my question is clear, thanks for help.
Based on the methods('as.data.frame'), as.data.frame.list is an option to convert each element of a vector to columns of a data.frame
as.data.frame.list(a)
data
m1 <- matrix(1:20, ncol=4, dimnames=list(NULL, paste0('V', 1:4)))
a <- colMeans(m1)
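As a hedged sketch of both parts of the question: the named vector from colMeans can be turned into a one-row data frame through the generic as.data.frame() on a list, and per-column means over blocks of consecutive rows can be computed with aggregate(). The second matrix m2, the block size of 10, and the variable names are made up for illustration; the question itself uses blocks of 20 rows.

```r
# Named vector of column means -> one-row data frame, one column per name
m1 <- matrix(1:20, ncol = 4, dimnames = list(NULL, paste0("V", 1:4)))
a <- colMeans(m1)
one_row <- as.data.frame(as.list(a))

# Means of every `block_size` consecutive rows, computed per column
m2 <- matrix(1:40, ncol = 2)          # hypothetical data: 20 rows, 2 columns
block_size <- 10                       # the question would use 20
block <- (seq_len(nrow(m2)) - 1) %/% block_size   # 0,0,...,1,1,...

block_means <- aggregate(as.data.frame(m2), by = list(block = block), FUN = mean)
```

Here block_means has one row per block and one column per original column (plus the block index), so the column structure survives instead of being flattened into a single vector.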
I am trying to generate summary statistics by using sapply on 5 columns with numerical data. There is however 1 column with sex F/M (which is the second column of my dataframe), which I do not need to apply this to. I have tried removing the column by using
data_2 <- data_2[,2]
and a bunch of other methods but they do not seem to remove the column.
I have to work out the mean, sd, min, max and sample size with the sapply function.
In cases like this, I find it easier to use indices, instead of the data itself:
sapply((1:ncol(data_2))[-2], function(i) {
c(mean(data_2[,i]), sd(data_2[,i])) # add other functions
})
Use
data_2 <- data_2[, -2]
The minus removes the column; without the minus you are just returning the second column.
However, overwriting data_2 with data_2[, -2] is not ideal, so it is better to just run sapply on data_2[, -2].
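Put together, a minimal sketch could look like this. The columns of data_2 below are invented for illustration; only the name data_2 and the position of the sex column come from the question.

```r
# Hypothetical data: sex is the second column, the rest are numeric
data_2 <- data.frame(
  age    = c(21, 35, 48, 52),
  sex    = c("F", "M", "F", "M"),
  height = c(160, 175, 168, 181),
  weight = c(55, 80, 62, 90)
)

# Run sapply on everything except column 2, without modifying data_2
summary_stats <- sapply(data_2[, -2], function(x) {
  c(mean = mean(x),
    sd   = sd(x),
    min  = min(x),
    max  = max(x),
    n    = length(x))   # use sum(!is.na(x)) instead if the data contain NAs
})

summary_stats
```

Because the function returns a named vector of the same length for every column, sapply simplifies the result to a matrix with one row per statistic and one column per variable.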
I want to split a large dataframe into a list of dataframes according to the values in two columns. I then want to apply a common data transformation on all dataframes (lag transformation) in the resulting list. I'm aware of the split command but can only get it to work on one column of data at a time.
You need to put all the factors you want to split by in a list, e.g.:
split(mtcars,list(mtcars$cyl,mtcars$gear))
Then you can use lapply on this to do what else you want to do.
If you want to avoid having zero row dataframes in the results, there is a drop parameter whose default is the opposite of the drop parameter in the "[" function.
split(mtcars,list(mtcars$cyl,mtcars$gear), drop=TRUE)
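As a sketch of the "split, then lapply" step for a lag transformation: the helper lag1 and the new column mpg_lag are made-up names for illustration. The helper is written out by hand because base R's stats::lag() only shifts the time attributes of a series and does not shift the values of a plain vector.

```r
# Shift a vector down by one position, padding with NA (hypothetical helper)
lag1 <- function(x) c(NA, head(x, -1))

# Split by two columns, dropping empty combinations
pieces <- split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)

# Apply the same transformation to every piece
lagged <- lapply(pieces, function(d) {
  d$mpg_lag <- lag1(d$mpg)   # lag of mpg within this cyl/gear group
  d
})

# Optionally stack the pieces back into one data frame
result <- do.call(rbind, lagged)
```

The first row of each group gets NA for mpg_lag, since there is no earlier row within that group to take a value from.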
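How about this one:
library(plyr)
ddply(df, .(category1, category2), summarize, value1 = lag(value1), value2=lag(value2))
This seems like an excellent job for the plyr package and its ddply() function. Note that lag() here needs to be a function that actually shifts values, such as dplyr::lag(); base R's stats::lag() does not shift a plain vector. If there are still open questions, please provide some sample data. Splitting should work on several columns as well: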
df<- data.frame(value=rnorm(100), class1=factor(rep(c('a','b'), each=50)), class2=factor(rep(c('1','2'), 50)))
g <- list(df$class1, df$class2) # a list of grouping factors; c() would concatenate them into one long vector
split(df$value, g)
You can also do the following:
split(x = df, f = ~ var1 + var2...)
This way, you can achieve the same split of the dataframe by several variables without passing a list to the f parameter. The formula form of f needs a recent version of R (4.1.0 or later, as far as I know).