I have a dataset with nearly 30,000 rows and 1935 variables (columns). Around 350 of these are character variables. I can change the data type of an individual column by calling as.numeric on it, but it is painful to search for the columns that are character type and then apply this to each one individually. I tried writing a function with a loop, but because the data is so large, my laptop crashes.
Please help.
Something like
take <- sapply(data, is.numeric)
which(take == FALSE)
identifies which variables are not numeric, but I don't know how to extract them automatically, so
data[c(putcolumnsnumbershere)] <- lapply(data[c(putcolumnsnumbershere)], as.numeric)
Use
sapply(your.data, typeof)
to create a vector of variable types, then use this vector to identify the character vector columns to be converted.
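A minimal sketch of that approach, assuming a small made-up data frame (the values in `your.data` below are hypothetical, not the asker's data). It builds the vector of types with sapply, picks out the character columns, and converts only those with lapply, so no explicit loop over all columns is needed:

```r
# Toy stand-in for the real data (hypothetical values)
your.data <- data.frame(
  id    = 1:3,
  score = c("1.5", "2.5", "3.5"),
  count = c("10", "20", "30"),
  stringsAsFactors = FALSE
)

# Named vector holding the storage type of every column
types <- sapply(your.data, typeof)

# TRUE for each column stored as character
is_chr <- types == "character"

# Convert only those columns; lapply works one column at a time
your.data[is_chr] <- lapply(your.data[is_chr], as.numeric)

# Check the result
sapply(your.data, typeof)
```

Any value that cannot be parsed as a number becomes NA with a warning, so it is worth checking for unexpected NAs afterwards.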
I want to unclass several factor variables in R. I need this functionality for a lot of variables. At the moment I repeat the code for each variable which is not convenient:
unclass:
myd$ati_1 <- unclass(myd$ati_1)
myd$ati_2 <- unclass(myd$ati_2)
myd$ati_3 <- unclass(myd$ati_3)
myd$ati_4 <- unclass(myd$ati_4)
I've looked into the apply() function family but I do not even know if this is the correct approach. I also read about for loops but every example is only about simple integers, not when you need to loop over several variables.
Would be glad if someone could help me out.
You can use a loop:
block <- c("ati_1", "ati_2", "ati_3", "ati_4")
for (j in block) {myd[[j]] <- unclass(myd[[j]])}
# The double brackets let you select a column of the data frame by a name stored in a variable (here j)
Here are a few ways. We use CO2 which comes with R and has several factor columns. This unclasses those columns.
If you need some other criterion, then:
- in the base R solution, set ix to the names or positions or a logical vector defining the columns to be transformed
- in the collapse solution, replace is.factor with a vector of names or positions or a logical vector denoting the columns to convert
- in the dplyr solution, replace where(...) with the same names, positions or logical.
Code follows. In all of these the input is not overwritten, so the original is still available unchanged if you want to rerun from scratch; in general, overwriting objects is error prone.
# Base R
ix <- sapply(CO2, is.factor)
replace(CO2, ix, lapply(CO2[ix], unclass))
# collapse
library(collapse)
ftransformv(CO2, is.factor, unclass)
# dplyr
library(dplyr)
CO2 %>%
  mutate(across(where(is.factor), unclass))
Depending on what you want, this might be sufficient; omit the as.data.frame if a matrix result is OK.
as.data.frame(data.matrix(CO2))
I would like to know if there is an "easy/quick" way to convert character variables to factor.
I am aware that one could make a vector with the column names and then use lapply. However, I am working with a large data frame with more than 200 variables, so I would prefer not to write out 200+ names in the vector.
I am also aware that I can coerce the entire data frame by using lapply, type.convert and sapply, but as I am working with time series data where some is categorical, and some is numerical, I am not interested in that either.
Is there any way to use the column number in this? I.e. [ ,2:200]? I tried the following, but without any luck:
df[ ,2:30] <- lapply(df[ ,2:30], type.convert)
sapply(df, factor)
With the solution above, I would still have to do multiple of them, but it would still be quicker than writing all the variable names.
I also have a feeling a loop might be usable here, but I would not be sure of how to write it out, or if it is even a way to do it.
df[ ,2:30] <- lapply(df[ ,2:30], as.factor)
Since you write that you need to convert (all?) character variables to factors, you could use mutate_if from dplyr:
library(dplyr)
mutate_if(df, is.character, as.factor)
With this you only operate on columns for which is.character returns TRUE, so you don't need to worry about the column positions or names.
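For comparison, a minimal base R sketch of the same idea, assuming a made-up data frame (the columns of `df` below are hypothetical). It selects columns by type rather than by name or position, so numeric columns are left alone:

```r
# Toy data frame mixing numeric and character columns (hypothetical values)
df <- data.frame(
  time  = 1:4,
  group = c("a", "b", "a", "b"),
  value = c(0.1, 0.2, 0.3, 0.4),
  stringsAsFactors = FALSE
)

# TRUE for each character column
chr_cols <- sapply(df, is.character)

# Convert only the character columns to factor
df[chr_cols] <- lapply(df[chr_cols], as.factor)

str(df)
```

In dplyr 1.0.0 and later, mutate_if is superseded by across, so mutate(df, across(where(is.character), as.factor)) is the equivalent call there.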
I have a data frame consisting of five character variables which represent specific bacteria. I then have thousands of observations of each variable that all begin with the letter K. eg
x <- c("K0001", "K0001", "K0003", "K0006")
y <- c("K0001", "K0001", "K0002", "K0003")
z <- c("K0001", "K0002", "K0007", "K0008")
r <- c("K0001", "K0001", "K0001", "K0001")
o <- c("K0003", "K0009", "K0009", "K0009")
I need to identify unique observations in the first column that don't appear in any of the remaining four columns. I have tried the approach suggested here which I think would work if I could create individual vectors using select ...
How to tell what is in one vector and not another?
but when I try to create a vector for analysis using the code ...
x <- select(data$x)
I get the error
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "character"
I have tried to mutate the vectors using as.factor and as.numeric but neither of these approaches work as the first gives an equivalent error as above, and as.numeric returns NAs.
Thanks in advance
The reference that you cited recommended using setdiff. The only thing that you need to do to apply that solution is to convert the four columns into one, so that it can be treated as a set. You can do that with unlist
setdiff(data$x, unlist(data[,2:5]))
"K0006"
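As a self-contained sketch, assuming the five vectors from the question are the columns of a data frame called data (the question does not show how it was built), the call can be reproduced like this:

```r
# Rebuild the example from the question as a data frame
data <- data.frame(
  x = c("K0001", "K0001", "K0003", "K0006"),
  y = c("K0001", "K0001", "K0002", "K0003"),
  z = c("K0001", "K0002", "K0007", "K0008"),
  r = c("K0001", "K0001", "K0001", "K0001"),
  o = c("K0003", "K0009", "K0009", "K0009"),
  stringsAsFactors = FALSE
)

# Flatten columns 2 to 5 into one vector and treat it as a set
others <- unlist(data[, 2:5])

# Unique values of x that appear in none of the other columns
only_in_x <- setdiff(data$x, others)
only_in_x
```

No select call is needed: data$x is already a plain character vector, which is why select failed on it.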
I am reading a txt file into R and have several columns that should be numeric, but everything is interpreted as character. I would like to convert only a few columns within that matrix (I converted it to a matrix in a first step) to numeric. So far I have only managed to extract single columns, and that way I lost the matrix type...
data <- as.numeric(data[,1])
Now, I've found similar questions here but none of the answers worked in the way that it conserved the type matrix.
For example, I've tried to store the affected columns in a vector and then perform the action on that vector with lapply
cols <- c("a", "b", "d")
data <- as.matrix(lapply(cols, as.numeric))
But this gives me only empty fields, and of course it only shows the columns I selected and not the rest of the matrix. I also got the error message
NAs introduced by coercion
As a last step I tried the following, but I ended up having a list and not a matrix anymore
data[1:25] <- as.matrix(lapply(data[1:25], as.numeric))
What I would like to have, is a matrix where several columns (not just 1:25 as in my example above but rather, say, columns 1,3 and 6) are converted to numeric and the rest stays the same.
Does someone have an answer and maybe even an explanation for why the things I've tried didn't work?
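One likely explanation, offered as a sketch rather than a definitive answer: a matrix in R holds a single atomic type, so numbers assigned into a character matrix are coerced straight back to character. In the attempts above, lapply(cols, as.numeric) runs as.numeric on the names "a", "b" and "d" themselves (hence the NAs), and data[1:25] on a matrix picks the first 25 elements, not 25 columns. A data frame can hold a different type per column. The matrix m and its values below are made up for illustration:

```r
# Toy character matrix standing in for the imported data (hypothetical values)
m <- matrix(c("1", "2", "a", "b", "3.5", "4.5"), nrow = 2,
            dimnames = list(NULL, c("a", "b", "d")))

# Assigning numbers into one column does not change the matrix type:
# the values are coerced back to character
m[, "a"] <- as.numeric(m[, "a"])
typeof(m)

# A data frame allows one type per column
df <- as.data.frame(m, stringsAsFactors = FALSE)

# Convert only the chosen columns, leave the rest as character
num_cols <- c("a", "d")
df[num_cols] <- lapply(df[num_cols], as.numeric)

sapply(df, class)
```

If a matrix is really required and every column can be numeric, data.matrix(df) or as.matrix on an all-numeric data frame gives a numeric matrix; mixed types need a data frame.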
I have a simple problem. I have a data frame with 121 columns. Columns 9:121 need to be numeric, but when imported into R they are a mixture of numerics, integers and factors. Columns 1:8 need to remain characters.
I’ve seen some people use loops, and others use apply(). What do you think is the most elegant way of doing this?
Thanks very much,
Paul M
Try the following... The apply function allows you to loop over either rows, cols, or both, of a dataframe and apply any function, so to make sure all your columns from 9:121 are numeric, you can do the following:
table[, 9:121] <- apply(table[, 9:121], 2, function(x) as.numeric(as.character(x)))
table[, 1:8] <- apply(table[, 1:8], 2, as.character)
Where table is the dataframe you read into R.
Briefly I specify in the apply function the table I want to loop over - in this case the subset of your table we want to make changes to, then we specify the number 2 to indicate columns, and finally give the name of the as.numeric or as.character functions. The assignment operator then replaces the old values in your table with the new ones of correct format.
-EDIT: Just changed the first line, as I recalled that if you convert from a factor to a number, what you get is the integer code of the factor level and not the number you think you are getting. Factors first need to be converted to characters, then to numbers, which we can do just by wrapping as.character inside as.numeric.
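A small illustration of that pitfall, using a made-up factor f (not taken from the asker's data):

```r
# Factor whose labels look like numbers (hypothetical values)
f <- factor(c("10", "20", "10", "30"))

# Direct conversion returns the internal level codes
codes <- as.numeric(f)
codes    # 1 2 1 3

# Going through character first recovers the printed values
values <- as.numeric(as.character(f))
values   # 10 20 10 30
```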
When you read in the table, use stringsAsFactors = FALSE; then there will not be any factors.