Coercing multiple time-series columns to factors in large dataframe - r

I would like to know if there is an easy, quick way to convert character variables to factors.
I am aware that one could make a vector of the column names and then use lapply. However, I am working with a large data frame with more than 200 variables, so I would prefer not to write out 200+ names in a vector.
I am also aware that I can coerce the entire data frame using lapply, type.convert and sapply, but since I am working with time series data where some columns are categorical and some are numerical, that does not suit me either.
Is there a way to use column numbers for this, e.g. [ ,2:200]? I tried the following, without any luck:
df[ ,2:30] <- lapply(df[ ,2:30], type.convert)
sapply(df, factor)
With the approach above I would still have to repeat the call for several column ranges, but it would be quicker than writing out all the variable names.
I also have a feeling a loop might be usable here, but I am not sure how to write it, or whether it is even a sensible way to do it.

df[ ,2:30] <- lapply(df[ ,2:30], as.factor)

Since you write that you need to convert (all?) character variables to factors, you could use mutate_if from dplyr:
library(dplyr)
mutate_if(df, is.character, as.factor)
With this you only operate on columns for which is.character returns TRUE, so you don't need to worry about the column positions or names.
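The same predicate-based selection can be sketched in base R, without dplyr. The data frame and its column names below are invented for illustration; only the character columns are converted and the numeric ones are left alone.

```r
# Toy data frame; the column names are made up for illustration.
df <- data.frame(
  id    = 1:4,
  group = c("a", "b", "a", "c"),
  site  = c("x", "x", "y", "y"),
  value = c(1.5, 2.0, 3.5, 4.0),
  stringsAsFactors = FALSE
)

# Logical vector marking the character columns.
is_chr <- vapply(df, is.character, logical(1))

# Convert only those columns; numeric columns are left untouched.
df[is_chr] <- lapply(df[is_chr], as.factor)

str(df)
```

Because the columns are picked by type, the call does not need to be repeated for separate position ranges such as 2:30.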

Related

Function in R for several variables

I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <-unclass(myd$ati_1)
myd$ati_2 <-unclass(myd$ati_2)
myd$ati_3 <-unclass(myd$ati_3)
myd$ati_4 <-unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# The double brackets let you select a column by a name stored in a variable (here j)
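The loop can also be written as a single assignment over the selected columns. This is a minimal sketch on invented data: myd here is a hypothetical stand-in for the asker's data frame, and the names are built with paste0 rather than typed out.

```r
# Hypothetical data: four factor columns named ati_1..ati_4 plus an id column.
myd <- data.frame(
  id    = 1:3,
  ati_1 = factor(c("low", "high", "low")),
  ati_2 = factor(c("yes", "no", "yes")),
  ati_3 = factor(c("a", "b", "c")),
  ati_4 = factor(c("x", "x", "y"))
)

# Build the column names instead of typing each one.
block <- paste0("ati_", 1:4)

# Replace all selected columns at once; each factor becomes its integer codes.
myd[block] <- lapply(myd[block], unclass)

str(myd)
```

Note that unclass keeps the "levels" attribute on each integer vector, so the original labels remain recoverable.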
Here are a few ways. We use CO2, which comes with R and has several factor columns. The code below unclasses those columns.
If you need some other criterion, then:
- in the base R solution, set ix to the names, positions or a logical vector defining the columns to be transformed;
- in the collapse solution, replace is.factor with a vector of names or positions, or a logical vector denoting the columns to convert;
- in the dplyr solution, replace where(...) with the same names, positions or logical vector.
Code follows. In all of these the input is not overwritten, so you still have it available unchanged if you want to rerun from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
mutate(across(where(is.factor), unclass))
Depending on what you want, the following might be sufficient; omit the as.data.frame if a matrix result is OK.
as.data.frame(data.matrix(CO2))

Extracting levels from data.table

For extracting levels from a data.table, is the standard way to lapply over the data.table as a list or do it inside the brackets somehow?
For example, using the npk builtin data, I know that the first 4 columns are factors and I want to extract the levels.
dat <- as.data.table(npk)
This is what I want, a list of the levels
levs <- lapply(dat[,1:4,with=FALSE], levels)
But am I missing the data.table way, which would be something like this? (This isn't right, though, because it repeats the levels to match the longest one.)
levs2 <- dat[, lapply(.SD, levels), .SDcols=names(dat)[1:4]]
ps. sorry if this seems dumb, I am just trying to pick up the proper data.table idioms.
Your first example seems reasonable to me, and I don't think that you can do it within the brackets of the data.table, as the return type should be a list.
Another option is Filter(Negate(is.null),lapply(DT,levels)), which has the added benefit of not needing to know which columns are factors beforehand
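As a small illustration of the Filter idea, here it is run on the plain npk data frame that ships with R (not on a data.table, so treat it as a base R sketch only). levels() returns NULL for the non-factor column, and Filter drops that entry.

```r
# npk ships with R; block, N, P and K are factors, yield is numeric.
all_levels <- lapply(npk, levels)   # yield gives NULL here

# Keep only the non-NULL entries, i.e. the factor columns.
levs <- Filter(Negate(is.null), all_levels)

str(levs)
```

The result is a named list whose elements have different lengths, which is why it cannot be returned as columns of a single table without recycling.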

Identifying character variables and changing them to numeric in R

I have a dataset with nearly 30,000 rows and 1935 variables(columns). Among these many are character variables (around 350). Now I can change data type of an individual column using as.numeric on it, but it is painful to search for columns which are character type and then apply this individually on them. I have tried writing a function using a loop but since the data size is huge, laptop is crashing.
Please help.
Something like
take <- sapply(data, is.numeric)
which(take == FALSE)
identifies which variables are not numeric, but I don't know how to extract them automatically, so
apply(data[, c(putcolumnsnumbershere)], 1, as.character)
Use
sapply(your.data, typeof)
to create a vector of variable types, then use this vector to identify the character vector columns to be converted.
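Putting the two suggestions together, a minimal sketch might look like this. The data frame is invented, and the conversion works column by column with lapply rather than with an element-wise loop; whether that is enough to avoid the crash on the asker's data is an assumption, not something tested here.

```r
# Made-up data: two numeric-looking character columns and one numeric column.
dat <- data.frame(
  x = c("1.5", "2", "3"),
  y = c("10", "20", "oops"),   # "oops" cannot be parsed and becomes NA
  z = c(0.1, 0.2, 0.3),
  stringsAsFactors = FALSE
)

# Names of the character columns, found automatically.
chr_cols <- names(dat)[vapply(dat, is.character, logical(1))]

# as.numeric() warns when it has to introduce NAs; the warning is
# suppressed here only to keep the example quiet.
dat[chr_cols] <- suppressWarnings(lapply(dat[chr_cols], as.numeric))

# Count the NAs per column to spot values that failed to parse.
colSums(is.na(dat))
```

Checking the NA counts afterwards is a cheap way to see which columns contained text that was not really numeric.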

r: convert multiple factors to numeric simultaneously

I know how to convert one factor of a dataframe to numeric:
rds$fcv12afa3num <- as.numeric(levels(rds$fcv12afa3))[rds$fcv12afa3]
My two questions:
But how can I convert all dataframe-columns simultaneously, if the df consists only of factors?
How can I convert several factors simultaneously, based on a pattern of the column name?
I have many NA's, if that matters.
Thanks for your answer, Christian
Without example data, I can't give a completely exact answer, but this should get you started.
factorVars <- names(YourData)[vapply(YourData, is.factor, logical(1))]
YourData[, factorVars] <- lapply(YourData[, factorVars, drop = FALSE],
as.numeric)
Some notes:
Use drop = FALSE to handle the case of there only being one factor in your data frame.
If all of the columns are factors, you may get a list object in return. You'd have to run that list through as.data.frame to get your data frame back.
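One caveat worth illustrating: as.numeric applied directly to a factor returns the integer level codes, not the numbers spelled out by the labels. The sketch below, on invented data with a hypothetical "fcv" name prefix, uses the asker's own as.numeric(levels(x))[x] idiom and selects the columns by a name pattern; NA values pass through unchanged.

```r
f <- factor(c("10", "20", NA, "10"))
codes  <- as.numeric(f)              # integer codes of the levels
values <- as.numeric(levels(f))[f]   # the numbers the labels spell out

# Invented data frame; the "fcv" prefix is only an example pattern.
YourData <- data.frame(
  fcv1  = factor(c("10", "20", NA)),
  fcv2  = factor(c("3.5", "3.5", "7")),
  other = factor(c("a", "b", "c"))
)

# Select the factor columns whose names match the pattern.
pat <- grep("^fcv", names(YourData), value = TRUE)

# Convert via the level labels so the real values are recovered.
YourData[pat] <- lapply(YourData[pat], function(x) as.numeric(levels(x))[x])

str(YourData)
```

To convert every factor column instead, the name vector from vapply(YourData, is.factor, logical(1)) can be used in place of pat.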

Change part of data frame to character, then to numeric in R

I have a simple problem. I have a data frame with 121 columns. columns 9:121 need to be numeric, but when imported into R, they are a mixture of numeric and integers and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over either rows, cols, or both, of a dataframe and apply any function, so to make sure all your columns from 9:121 are numeric, you can do the following:
table[,9:121] <- apply(table[,9:121],2, function(x) as.numeric(as.character(x)))
table[,1:8] <- apply(table[,1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly I specify in the apply function the table I want to loop over - in this case the subset of your table we want to make changes to, then we specify the number 2 to indicate columns, and finally give the name of the as.numeric or as.character functions. The assignment operator then replaces the old values in your table with the new ones of correct format.
-EDIT: Just changed the first line, as I recalled that if you convert from a factor to a number, what you get is the integer code of the factor level and not the number you think you are getting. Factors therefore first need to be converted to characters, then to numbers, which we can do just by wrapping as.character inside as.numeric.
When you read in the table, use stringsAsFactors=FALSE; then there will not be any factors.
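A small sketch of the read-time option, assuming the data come from a CSV file. The inline text and column names are invented, and read.csv(text = ...) stands in for reading a real file. From R 4.0.0 onwards stringsAsFactors already defaults to FALSE, so the argument mainly matters on older versions.

```r
# Inline CSV text used in place of a file, for illustration only.
csv <- "name,score\nann,1.5\nbob,2.5"

# Text columns stay character instead of being turned into factors.
d <- read.csv(text = csv, stringsAsFactors = FALSE)

vapply(d, class, character(1))
```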
