Change factor levels for variable over multiple data frames - r

I have 4 data sets, and I would like to get the percentage of each group within each data set. This works fine using prop.table(table(df1$group)), repeated for df2$group and so on, but I would like labels on my tables. So I converted the column to a factor and assigned appropriate levels; however, this means assigning the levels separately for each data set.
I have tried using lapply but I end up with NAs for the factor levels.
Here is some data:
library(data.table)
df1 <- data.table(id = 1:100, group = sample(5, 100, replace = TRUE))
df2 <- data.table(id = 1:100, group = sample(5, 100, replace = TRUE))
df3 <- data.table(id = 1:100, group = sample(5, 100, replace = TRUE))
df4 <- data.table(id = 1:100, group = sample(5, 100, replace = TRUE))
df1$group <- as.factor(df1$group)
df2$group <- as.factor(df2$group)
df3$group <- as.factor(df3$group)
df4$group <- as.factor(df4$group)
What I have tried:
df <- list(df1,df2,df3,df4)
df <- lapply(df,function(x) x[,group:=factor(group, levels = c("A","B","C","D","E"))])
but this changes the levels and results in NAs for every value.
The data are all in data.tables and I am interested in 5 factors per data.table. I would also be interested in changing the class of multiple variables across multiple data.tables but for simplicity this could be another question.

The levels argument must match the values actually present in the original data (here 1 to 5); the new names go in labels:
lapply(df, function(x) x[, group := factor(group, levels = 1:5, labels = LETTERS[1:5])])
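To make the difference between levels and labels concrete, here is a minimal sketch. It assumes plain data.frames instead of data.tables so that it runs without extra packages, and the names dfs, wrong and props are made up for illustration.

```r
set.seed(1)

# Two small example data sets held in a named list (hypothetical names)
dfs <- list(
  df1 = data.frame(id = 1:100, group = sample(5, 100, replace = TRUE)),
  df2 = data.frame(id = 1:100, group = sample(5, 100, replace = TRUE))
)

# levels alone must match the existing values; "A".."E" match none of 1..5,
# so every element becomes NA
wrong <- factor(dfs$df1$group, levels = c("A", "B", "C", "D", "E"))
all(is.na(wrong))

# levels = existing values, labels = the names to display
dfs <- lapply(dfs, function(d) {
  d$group <- factor(d$group, levels = 1:5, labels = LETTERS[1:5])
  d
})

# Labelled proportion tables, one per data set
props <- lapply(dfs, function(d) prop.table(table(d$group)))
props$df1
```

With data.tables, the := form in the answer above does the same relabelling by reference, so the tables in the list are modified in place.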

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both dataframes; however, the numeric values differ slightly. The correct version is in df2 but needs to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could help with a more efficient way to code this (possibly without using column references).
Here is some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their orders are already matching, then you can achieve this really easily by the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c and you want to replace b and c with two columns of df2, whose columns are x, y, z.
df1[, c("b","c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop = FALSE argument on the right-hand side is just added protection to ensure the subset stays a data.frame rather than collapsing to a vector. It belongs only on the right-hand side: the data.frame assignment method does not accept a drop argument.
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
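As a minimal sketch of both cases in base R: the first part assumes the rows already line up and copies every column whose name appears in both data frames; the second part assumes a unique key (a in one table, x in the other) and uses match() to align rows. The names common, key_df1 and key_df2 are made up for illustration.

```r
# Case 1: same row order, same column names in both data frames
df1 <- data.frame(id = 1:5, var1 = 1:5, var2 = 1:5)
df2 <- data.frame(var1 = 11:15, var2 = 11:15)

common <- intersect(names(df1), names(df2))  # columns present in both
df1[common] <- df2[common]                   # overwrite them in one step

# Case 2: rows are not in the same order, but a unique key exists
key_df1 <- data.frame(a = c(3, 1, 2), b = c(0, 0, 0))
key_df2 <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# match() gives, for each key in key_df1, the row position in key_df2
key_df1$b <- key_df2$y[match(key_df1$a, key_df2$x)]
```

Keys in key_df1 that have no partner in key_df2 would receive NA, which makes unmatched rows easy to spot.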

R Aggregate with a yet undefined range of columns (including factors)

I am probably missing the right words to find my answer using the search function. I will have a dataset with a yet unknown number of columns, because they are produced by work within another program, and later changes there will change the number of variables in the dataset. However, the dataset has a clear structure: 6 variables at the beginning (including the code mentioned below, a factor variable, and year), and, starting at the 7th column, all the other variables that result from the work in the other program (MaxQDA).
So I wish to have a flexible call covering columns 7 to N for an aggregate function, to replace the dot in the following code, which to my understanding calls for all columns.
dataset2 <- aggregate(. ~ code+jahr,
data = dataset,
sum,
na.action=na.pass
)
Suggestions from here do not help, as I don't know how to carry the code+jahr grouping over into the other suggested ways of writing the aggregate call.
Addendum: put differently, I wish to exempt a few columns from the aggregate function while summing up a range of other columns.
Since there was confusion about vector types: I have some factor data like ID and Name. The data would look like this:
set.seed(42)
test2 <- as.data.frame(matrix(sample(16 * 4, replace=TRUE), ncol=16, nrow=4))
code <-c("aaa", "bbb","aaa", "ddd")
jahr <- c("1990", "1993", "2007", "2020")
id <- c("id1", "id2", "id3", "id4")
Name <- c("bla", "bla2", "bla3", "bla4")
test <- data.frame(code, jahr, id, Name)
dataset <- data.frame(test, test2)
dataset[1:4] <- lapply(dataset[, 1:4], as.factor)
Using the dataset above, we want to remove id and Name from the aggregation, since they are factors that are not used to define groups. The simplest way to do that is to drop those columns from the data:
dataset2 <- aggregate(. ~ code+jahr, data = dataset[ , -(3:4)], sum, na.action=na.pass)
A slightly more complicated method is to build a logical vector that keeps the grouping columns and the numeric columns, which excludes any factors not used for grouping. The main advantage is that you do not have to figure out column numbers, and it is relatively simple to change the grouping variables:
keep <- colnames(dataset) %in% c("code", "jahr") | sapply(dataset, is.numeric)
dataset2 <- aggregate(. ~ code+jahr, data = dataset[, keep], sum, na.action=na.pass)
Both produce the same results.
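For the "columns 7 to N" wording of the question, one possible sketch is the non-formula interface of aggregate(), which takes the value columns and the grouping columns separately. The small data frame and the name first_value_col are assumptions made for illustration; in the real data first_value_col would be 7.

```r
dataset <- data.frame(
  code = c("aaa", "bbb", "aaa", "ddd"),
  jahr = c("1990", "1993", "1990", "2020"),
  id   = c("id1", "id2", "id3", "id4"),
  V1   = c(1, 2, 3, 4),
  V2   = c(10, 20, 30, 40)
)

first_value_col <- 4                          # would be 7 in the real data
value_cols <- first_value_col:ncol(dataset)   # adapts when columns are added

# Sum every value column within each code/jahr combination;
# id is neither summed nor used for grouping
dataset2 <- aggregate(dataset[value_cols],
                      by  = dataset[c("code", "jahr")],
                      FUN = sum)
dataset2
```

Because value_cols is computed from ncol(dataset), the call does not need to change when the other program adds or removes variables.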

how to replace the value of a column with factor level based on defined factors in R

I have a dataset, where I defined the factors for rows gene.fac and columns cell.fac.
load('Analysis.RData')
top200_groups <- data.frame (cluster = cell.fac, t(top200))
melted <- melt(top200_groups, id.vars=c("cluster"))
After the application with melt function, I can see
Then I want to replace the gene name in melted$variable with the factors defined in gene.fac.
Is there an easy way to transform this? Thanks.
Here is a solution with dplyr:
library(dplyr)
library(reshape2) # assumption: melt() comes from reshape2
load('Analysis.RData')
top200_groups <- data.frame (cluster = cell.fac, t(top200))
melted <- melt(top200_groups, id.vars=c("cluster"))
df2 <- as.data.frame(gene.fac)
df2$variable <- factor(rownames(df2))
df_new <- full_join(melted, df2, by = "variable")
The new data.frame df_new has both columns: the old one (variable) and the new one (gene.fac). You can drop the old one.
Running the code also gives a warning, but no error, that dplyr coerced the join column to character.
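Since Analysis.RData is not available, here is a minimal sketch of the same lookup on made-up data, using base R match() instead of a join. It assumes gene.fac is a factor whose names are the gene names; the toy values and the column name gene_group are hypothetical.

```r
# Hypothetical stand-ins for the objects in Analysis.RData
gene.fac <- factor(c(g1 = "groupA", g2 = "groupB", g3 = "groupA"))
melted <- data.frame(
  cluster  = c("c1", "c1", "c2", "c2"),
  variable = factor(c("g1", "g2", "g3", "g1")),
  value    = c(0.5, 1.2, 0.1, 0.9)
)

# Look up each gene name in names(gene.fac) and take the matching factor value
idx <- match(as.character(melted$variable), names(gene.fac))
melted$gene_group <- unname(gene.fac[idx])
melted
```

Comparing names as character avoids the factor-level mismatch that triggers the dplyr warning.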

R assign levels to factor variable

I was given an Excel table similar to this:
datos <- data.frame(op= 1:4, var1= c(4, 2, 3, 2))
Now, there are other tables with the keys to op and var1, which happen to be categorical variables. Suppose that after loading them, they become:
set.seed(1)
op <- paste("op",c(1:4),sep="")
var1 <- sample(LETTERS, 19, replace= FALSE)
As you can see, there are unused levels in the data frame. I want to replace the numbers for the proper associated levels. This is what I've tried:
datos[] <- lapply(datos, factor)
levels(datos$op) <- op
levels(datos$var1) <- var1
This fails, because it reorders the factors alphabetically and gives a wrong output. I then tried:
datos$var1 <- factor(datos$var1, levels= var1, ordered= TRUE)
but this puts everything in datos$var1 as NA (I guess that's because of unmatching lengths).
What would be the right way to do this?
Following the kind advice of @docendoDiscimus, I post this answer for future reference:
For the data provided in the question:
datos$var1 <- factor(var1[datos$var1], levels= unique(var1))
datos
## op
Please notice that this solution should be applied without converting datos$var1 to a factor first (that is, without applying the code datos[] <- lapply(datos, factor)).
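The idea is to use the numeric codes as positions in the key vector. Here is a minimal sketch with a fixed, made-up key (var1_key and op_key are hypothetical names, and the letters are arbitrary) so the result does not depend on random sampling:

```r
datos <- data.frame(op = 1:4, var1 = c(4, 2, 3, 2))

# Hypothetical keys: position i holds the label for code i
op_key   <- paste0("op", 1:4)
var1_key <- c("Q", "K", "Z", "B", "M")

# Index the key by the numeric code, then fix the level order to the key order
datos$op   <- factor(op_key[datos$op],     levels = op_key)
datos$var1 <- factor(var1_key[datos$var1], levels = var1_key)

levels(datos$var1)  # key order is kept, including the unused levels
```

Passing the key as levels keeps the levels in key order rather than alphabetical order, and keeps labels that do not occur in the data as unused levels.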

R: mapply function returning error: level sets of factors are different

I have two dataframes (DfA and DfB). Each dataframe has three factor variables: species, type and region. DfA also has a numeric value column, and I want to use it to estimate numeric values in a new column of DfB, based on shared attributes.
I have a function which asks for the species, type and region, then creates a subset of DfA with those attributes and runs an algorithm on the subset to estimate the new value. When I run the function and specify the values manually as a test, it works fine.
If all of the factor levels and combinations in DfB have matching factors in DfA, the function works fine with mapply. But if any row in DfB contains a factor level that is not present in DfA, I get an error (level sets of factors are different). Example: if DfA includes data for regions A,B and C, and DfB contains data for regions A,B,C and D, mapply returns the error; if I remove the rows with region D, the mapply function works.
How can I make the function skip a row, or put NA in instead, when that row contains a factor level that makes the calculation impossible, and then carry on with the rows for which the function works?
You can drop/add levels to your data.frames to make sure your function works rather than cater for a special case:
# dropping and setting levels
Z = as.factor(sample(LETTERS[1:5],20,replace=TRUE))
levels(Z)
Y = as.factor(Z[!Z %in% LETTERS[4:5]]) # logical index; -which() would return nothing if no element matched
levels(Y)
Y=droplevels(Y) # drop the levels
levels(Y)
levels(Y) = levels(Z) # bring them back
levels(Y)
Y = factor(Y,levels=LETTERS[1:7]) # expand them
levels(Y)
attr(Y,"levels")
attr(Y,"levels") = LETTERS[1:8] # keep expanding them
levels(Y)
require(plyr)
Y = mapvalues(Y,levels(Y),letters[1:length(levels(Y))]) # change the labels of the levels
levels(Y)
x<-factor(Y, labels=LETTERS[(length(unique(Y))+1):(2*length(unique(Y)))]) # change the labels of the levels on another variable
In your case:
dfa = data.frame("LVL1"=as.factor(sample(LETTERS[1:2],20,replace=TRUE)))
dfb = data.frame("LVL2"=as.factor(sample(LETTERS[2:5],20,replace=TRUE)))
newLevels = sort(union(levels(dfa$LVL1),levels(dfb$LVL2))) # union() already removes duplicates
dfa$LVL1 = factor(dfa$LVL1,levels=newLevels)
dfb$LVL2 = factor(dfb$LVL2,levels=newLevels)
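To address the "put NA in and move on" part of the question directly, one possible sketch is to compare the values as character inside the function and return NA when the subset is empty. The data, the function name estimate_value, and the use of mean() as the "algorithm" are all placeholders for illustration.

```r
DfA <- data.frame(
  species = c("s1", "s1", "s2", "s2"),
  region  = c("A", "B", "A", "C"),
  value   = c(1, 3, 5, 7),
  stringsAsFactors = TRUE
)
DfB <- data.frame(
  species = c("s1", "s2", "s1"),
  region  = c("A", "C", "D"),   # region D does not occur in DfA
  stringsAsFactors = TRUE
)

# Placeholder estimator: mean of the matching DfA rows, NA if there are none.
# Comparing as character avoids the "level sets of factors are different" error.
estimate_value <- function(sp, rg) {
  sub <- DfA[as.character(DfA$species) == sp & as.character(DfA$region) == rg, ]
  if (nrow(sub) == 0) return(NA_real_)
  mean(sub$value)
}

DfB$estimate <- mapply(estimate_value,
                       as.character(DfB$species),
                       as.character(DfB$region),
                       USE.NAMES = FALSE)
DfB
```

Harmonising the levels as shown above and guarding against empty subsets are complementary: the first removes the comparison error, the second decides what to return when there is nothing to estimate from.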
