For extracting levels from a data.table, is the standard way to lapply over the data.table as a list or do it inside the brackets somehow?
For example, using the npk built-in data, I know that the first 4 columns are factors and I want to extract the levels.
dat <- as.data.table(npk)
This is what I want, a list of the levels:
levs <- lapply(dat[,1:4,with=FALSE], levels)
But am I missing the data.table way, which would be something like this? (This isn't right, though, because it repeats the levels to match the longest one.)
levs2 <- dat[, lapply(.SD, levels), .SDcols=names(dat)[1:4]]
P.S. Sorry if this seems dumb; I am just trying to pick up the proper data.table idioms.
Your first example seems reasonable to me, and I don't think you can do it within the brackets of the data.table, since the result you want is a list rather than a data.table.
Another option is Filter(Negate(is.null), lapply(DT, levels)), which has the added benefit of not needing to know beforehand which columns are factors: levels() returns NULL for non-factor columns, and Filter() drops those entries.
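For example, a minimal sketch on the dat from the question:
library(data.table)
dat <- as.data.table(npk)
# levels() is NULL for the numeric yield column, so only the four factor columns remain.
levs3 <- Filter(Negate(is.null), lapply(dat, levels))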
I want to unclass several factor variables in R. I need this for a lot of variables, and at the moment I repeat the code for each variable, which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family, but I do not even know if this is the correct approach. I have also read about for loops, but every example only loops over simple integers, not over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
# The double brackets let you pick out a column by its name within the data frame.
for (j in block) {
  myd[[j]] <- unclass(myd[[j]])
}
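An equivalent vectorised version (a sketch, reusing the same block vector of column names) replaces the loop with lapply:
# unclass() every column named in block in a single assignment.
myd[block] <- lapply(myd[block], unclass)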
Here are a few ways. We use CO2 which comes with R and has several factor columns. This unclasses those columns.
If you need some other criterion, then:
- in the base R solution, set ix to the names, positions, or a logical vector defining the columns to be transformed;
- in the collapse solution, replace is.factor with a vector of names or positions, or a logical vector denoting the columns to convert;
- in the dplyr solution, replace where(...) with the same names, positions, or logical vector.
Code follows. In all of these the input is not overwritten, so the original data is still available unchanged if you want to rerun from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
  mutate(across(where(is.factor), unclass))
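To apply the same base R pattern to specific columns by name rather than by class, a sketch (the Plant and Type columns here are chosen only for illustration) would be:
# ix can equally be column positions or a logical vector.
ix <- c("Plant", "Type")
CO2_new <- replace(CO2, ix, lapply(CO2[ix], unclass))
sapply(CO2_new, class)  # Plant and Type now hold the underlying integer codes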
Depending on what you want, the following alone might be sufficient; omit the as.data.frame if a matrix result is OK.
as.data.frame(data.matrix(CO2))
I would like to know if there is an "easy/quick" way to convert character variables to factor.
I am aware that one could make a vector with the column names and then use lapply. However, I am working with a large data frame with more than 200 variables, so it would be preferable not to have to write out the 200+ names in the vector.
I am also aware that I can coerce the entire data frame by using lapply, type.convert, or sapply, but as I am working with time series data where some columns are categorical and some are numerical, I am not interested in that either.
Is there any way to use the column numbers for this, i.e. [, 2:200]? I tried the following, but without any luck:
df[ ,2:30] <- lapply(df[ ,2:30], type.convert)
sapply(df, factor)
With the solution above I would still have to run it a few times, but it would still be quicker than writing out all the variable names.
I also have a feeling a loop might be usable here, but I am not sure how to write it out, or whether it is even a sensible way to do it.
df[, 2:30] <- lapply(df[, 2:30], as.factor)
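A quick way to check that this does what you expect, on a small made-up data frame (names and size are only for illustration):
# Toy data: one id column plus three character columns.
toy <- data.frame(id = 1:5, a = letters[1:5], b = LETTERS[1:5], c = month.name[1:5],
                  stringsAsFactors = FALSE)
toy[, 2:4] <- lapply(toy[, 2:4], as.factor)
sapply(toy, class)  # columns 2:4 are now factors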
Since you write that you need to convert (all?) character variables to factors, you could use mutate_if from dplyr:
library(dplyr)
mutate_if(df, is.character, as.factor)
With this you only operate on columns for which is.character returns TRUE, so you don't need to worry about the column positions or names.
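In more recent dplyr versions (1.0.0 and later), the same operation is usually written with across() and where(); a sketch:
library(dplyr)
# Equivalent to mutate_if(df, is.character, as.factor) in current dplyr.
df <- df %>%
  mutate(across(where(is.character), as.factor))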
I have a simple problem. I have a data frame with 121 columns. Columns 9:121 need to be numeric, but when imported into R they are a mixture of numeric, integer, and factor columns. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following. The apply function lets you loop over the rows, the columns, or both of a data frame and apply any function, so to make sure all your columns from 9:121 are numeric you can do the following:
table[, 9:121] <- apply(table[, 9:121], 2, function(x) as.numeric(as.character(x)))
table[, 1:8] <- apply(table[, 1:8], 2, as.character)
where table is the data frame you read into R.
Briefly: in the apply() call I specify the table to loop over (in this case the subset of your table we want to change), then the number 2 to indicate columns, and finally the function to apply (as.numeric or as.character). The assignment then replaces the old values in your table with new ones in the correct format.
EDIT: I just changed the first line. I recalled that if you convert a factor directly to a number, you get the integer code of the factor level rather than the number you think you are getting. Factors first need to be converted to characters and then to numbers, which we can do by wrapping as.character inside as.numeric.
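A minimal illustration of that pitfall (made-up values, just to show the difference):
f <- factor(c("10", "20", "30"))
as.numeric(f)                # 1 2 3   -- the underlying level codes
as.numeric(as.character(f))  # 10 20 30 -- the values you actually want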
When you read in the table, use stringsAsFactors = FALSE; then there will not be any factors in the first place.
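For example with read.csv or read.table (the file name here is only a placeholder):
# Character columns stay character instead of being converted to factors on import.
table <- read.csv("mydata.csv", stringsAsFactors = FALSE)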
This is a simple question, and I am sure it is easily solvable with tapply, apply, by, etc. However, I am still relatively new to this, and I would like to ask for advice.
The problem:
I have a data frame with, say, 5 columns, where columns 4 and 5 are factors. For each level of the factor in column 4, I want to execute a function over columns 1:3, and I want to do this separately for each group defined by column 5. This is, in principle, easily doable. However, I want the output as a nice table, and I want to learn how to do this in an elegant way, which is why I am asking here.
Example:
df <- data.frame(x1=1:6, x2=12:17, x3=3:8, y=1:2, f=1:3)
Now, the command
by(df[,1:3], df$y, sum)
would give me the sum for each factor level in y, which is almost what I want. Two additional things are needed. One is to do this for each factor level in f; that is almost trivial, since I could wrap lapply around the above command. The other is this: I want to collect the results in a table, and maybe even use it to generate a heatmap.
Hence: is there an easy and more elegant way to do this and to generate a matrix with corresponding output? This seems like an everyday-task for data scientists, which is why I suspect that there is an existing built-in solution...
Thanks for any help or any hint, no matter how small!
You can use the reshape2 and plyr packages to accomplish this.
library(plyr)
df2 <- ddply(df, .(y, f), function(d) sum(d[, 1:3]))
and then to turn it into an f-by-y matrix:
library(reshape2)
acast(df2, f ~ y, value.var = "V1")
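A base R alternative (a sketch, assuming the sum over columns 1:3 is what you want in each cell) builds the f-by-y matrix directly with tapply:
# Sum x1 + x2 + x3 within each (f, y) group; the result is a matrix
# that can be passed straight to heatmap() or image().
tapply(rowSums(df[, 1:3]), list(f = df$f, y = df$y), sum)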
I have a large data.table (9 M lines) with two columns: fcombined and value
fcombined is a factor, but it is actually the result of interacting two factors.
The question now is: what is the most efficient way to split that one factor column back into two?
I have already come up with a solution that works OK, but maybe there is a more straightforward way that I have missed. A working example is:
library(data.table)
library(stringr)
f1 <- 1:20
f2 <- 1:20
g <- expand.grid(f1, f2)
combinedfactor <- as.factor(paste(g$Var1, g$Var2, sep = "_"))
largedata <- 1:10^6
DT <- data.table(fcombined = combinedfactor, value = largedata)
splitfactorcol <- function(res, colname, splitby = "_", namesofnewcols) {
  # The number of columns retained is length(namesofnewcols).
  # Split the (few) levels rather than the (many) rows, then merge back in.
  helptable <- data.table(.factid = seq_along(levels(res[[colname]])),
                          str_split_fixed(levels(res[[colname]]), splitby,
                                          length(namesofnewcols)))
  setnames(helptable, colnames(helptable), c(".factid", namesofnewcols))
  setkey(helptable, .factid)
  res$.factid <- unclass(res[[colname]])  # the factor's integer codes
  setkey(res, .factid)
  m <- merge(res, helptable)
  m$.factid <- NULL
  m
}
splitfactorcol(DT, "fcombined", splitby = "_", c("f1", "f2"))
I think this does the trick and is about 5x faster.
setkey(DT, fcombined)
DT[DT[, data.table(fcombined = levels(fcombined),
                   do.call(rbind, strsplit(levels(fcombined), "_")))]]
I split the levels and then simply merged that result back into the original data.table.
Btw, in my tests strsplit was about 2x faster (for this task) than the stringr function.
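In more recent data.table versions (1.9.6 and later), a sketch of the same levels-based idea can be written with tstrsplit, splitting each level once and then indexing by the factor's integer codes (the column names f1 and f2 follow the question):
# Split each level once, then pick the pieces row-wise via the factor's integer code.
parts <- tstrsplit(levels(DT$fcombined), "_", fixed = TRUE)
DT[, c("f1", "f2") := lapply(parts, `[`, as.integer(fcombined))]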