As always, apologies for the simple Q.
I've got a large dataset and want to convert a specified list of columns to numeric. I can do it, but it's not very elegant, and unless I raise the memory limit it won't run because the merge exhausts the vector memory!
library(tidyverse)
#Extract column names I want to turn into numeric from data
make_numeric <- data[252:321] %>% select(-c(contains("UNITS"))) %>% colnames()
Here I want to turn columns that are contained in make_numeric into as.numeric and insert straight back into data. I can't do this in one go, so instead I extract the data, convert, and then merge.
tmp <- data %>% select(record_id, make_numeric)
tmp <- lapply(tmp[2:56], as.numeric)
tmp <- as.data.frame(tmp)
tmp2 <- data %>% select(-make_numeric)
tmp3 <- merge(tmp, tmp2)
I'm certain there must be a better way...
There is a dplyr solution:
library(tidyverse)
library(dplyr)
#Extract column names I want to turn into numeric from data
make_numeric <- data[252:321] %>% select(-c(contains("UNITS"))) %>% colnames()
#Mutate desired columns to numeric
data <- data %>% mutate_at(vars(make_numeric), as.numeric)
Does this work?
library(data.table)
#convert to data.table
dt <- as.data.table(data)
#cols is assumed to hold the names of the columns to convert, e.g. make_numeric
cols <- intersect(names(dt), make_numeric)
#convert those columns to numeric by reference
dt[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
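For comparison, the same in-place conversion can be sketched in base R, with no extract-and-merge step at all. The data frame below is a small hypothetical stand-in for data, and make_numeric is built with grep() here rather than from columns 252:321, so treat the names as assumptions.

```r
# Hypothetical stand-in for the real data set
data <- data.frame(
  record_id = 1:3,
  GLUCOSE = c("5.1", "6.3", "4.8"),
  GLUCOSE_UNITS = c("mmol/L", "mmol/L", "mmol/L"),
  SODIUM = c("140", "138", "141"),
  stringsAsFactors = FALSE
)

# Names of the columns to convert: everything except the id and the *UNITS columns
non_units <- grep("UNITS", names(data), value = TRUE, invert = TRUE)
make_numeric <- setdiff(non_units, "record_id")

# Assigning into data[make_numeric] replaces only those columns, in place
data[make_numeric] <- lapply(data[make_numeric], as.numeric)

str(data)
```

Because only the selected columns are overwritten, row order and the remaining columns are untouched, which is what the merge was trying to reconstruct.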
I have the following Tibbles.
tmp <- tibble()
tmp2 <- tibble()
tmp <- tmp %>% rbind( colSums( y_matrix) )
tmp2 <- tmp2 %>% rbind( proportions( colSums( y_matrix )))
data <- bind_cols(tmp,tmp2)
I want to add column names for "data" accordingly. The number of columns in tmp and tmp2 will change from time to time. So how can I add column names without defining them one by one?
The expected column names in the output look like this:
c1 c2 c1_prop c2_prop
Is there any method to create this?
I don't have enough reputation to comment, so here is a data.table solution, whose result you could always pass to as_tibble(). If this isn't what you were after, could you post an explicit example of the data and the expected output?
library(data.table)
setDT(data)
setnames(data, ncol(tmp)+(1:ncol(tmp2)), paste0(names(tmp),"_prop"))
However, wouldn't it just be better to name the columns correctly before merging?
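A minimal base R sketch of that idea, naming both halves before they are combined. The y_matrix below is hypothetical, and sums / sum(sums) is used in place of proportions() only so that the sketch also runs on older R versions.

```r
# Hypothetical input; in the question y_matrix comes from elsewhere
y_matrix <- matrix(1:6, nrow = 3)

sums <- colSums(y_matrix)
base_names <- paste0("c", seq_along(sums))  # c1, c2, ... for any number of columns

# Name both halves first, then combine them into a one-row data frame
named <- c(
  setNames(sums, base_names),
  setNames(sums / sum(sums), paste0(base_names, "_prop"))  # same values as proportions(sums)
)
data <- as.data.frame(as.list(named))

names(data)
```

The result can still be passed to tibble::as_tibble() if a tibble is needed.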
I would like to change the format (class) of some columns of my data.frame object (mydf) from character to factor.
I don't want to do this while reading the text file with the read.table() function.
Any help would be appreciated.
Hi, welcome to the world of R.
mtcars #look at this built in data set
str(mtcars) #allows you to see the classes of the variables (all numeric)
#one approach is to index with the $ sign and use the as.factor function
mtcars$am <- as.factor(mtcars$am)
#another approach
mtcars[, 'cyl'] <- as.factor(mtcars[, 'cyl'])
str(mtcars) # now look at the classes
This also works for character, dates, integers and other classes
Since you're new to R I'd suggest you have a look at these two websites:
R reference manuals:
http://cran.r-project.org/manuals.html
R Reference card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
# To do it for all names
df[] <- lapply( df, factor) # the "[]" keeps the dataframe structure
# to do it for some names in a vector named 'col_names'
col_names <- names(df)
df[col_names] <- lapply(df[col_names] , factor)
Explanation: all data frames are lists, and the result of [ used with multiple-valued arguments is likewise a list, so looping over the columns is a task for lapply. The above assignment creates a list of factors that the method `[<-.data.frame` should successfully put back into the data frame, df.
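The explanation can be checked step by step on a small example. The data frame and column names here are made up for illustration.

```r
# Hypothetical data frame with two character columns and one integer column
df <- data.frame(a = c("x", "y"), b = c("u", "v"), n = 1:2,
                 stringsAsFactors = FALSE)

is.list(df)                            # a data frame is a list of columns

cols <- c("a", "b")
converted <- lapply(df[cols], factor)  # a plain list holding one factor per column
class(converted)

df[cols] <- converted                  # the assignment puts the list back column by column
class(df)                              # still a data frame
sapply(df, class)                      # a and b are factors, n is untouched
```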
Another strategy would be to convert only those columns where the number of unique items is below some criterion, say fewer than the base-10 log of the number of rows:
cols.to.factor <- sapply( df, function(col) length(unique(col)) < log10(length(col)) )
df[ cols.to.factor] <- lapply(df[ cols.to.factor] , factor)
You could use dplyr::mutate_if() to convert all character columns or dplyr::mutate_at() for select named character columns to factors:
library(dplyr)
# all character columns to factor:
df <- mutate_if(df, is.character, as.factor)
# select character columns 'char1', 'char2', etc. to factor:
df <- mutate_at(df, vars(char1, char2), as.factor)
If you want to change all character variables in your data.frame to factors after you've already loaded your data, you can do it like this, for a data.frame called dat:
character_vars <- sapply(dat, is.character)
dat[character_vars] <- lapply(dat[character_vars], as.factor)
This creates a logical vector identifying which columns are of class character, then applies as.factor to those columns. Indexing with dat[character_vars] rather than dat[, character_vars] keeps the result a data frame even when only one column matches.
Sample data:
dat <- data.frame(var1 = c("a", "b"),
var2 = c("hi", "low"),
var3 = c(0, 0.1),
stringsAsFactors = FALSE
)
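Applied to the sample data, the conversion could be checked as follows. This sketch uses sapply(dat, is.character) with list-style indexing, which is one way of writing the same idea and not necessarily the exact form the answer intended.

```r
dat <- data.frame(var1 = c("a", "b"),
                  var2 = c("hi", "low"),
                  var3 = c(0, 0.1),
                  stringsAsFactors = FALSE)

# Logical vector with one entry per column
character_vars <- sapply(dat, is.character)

# Convert only the character columns; var3 is left alone
dat[character_vars] <- lapply(dat[character_vars], as.factor)

sapply(dat, class)
```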
Another short way is the assignment pipe (%<>%) from the magrittr package. The example below converts the character column mycolumn to a factor.
library(magrittr)
mydf$mycolumn %<>% factor
I've been doing it with a loop. In this case only character variables are converted to factors:
for (i in seq_along(data)) {
  if (is.character(data[[i]])) {
    data[[i]] <- factor(data[[i]])
  }
}
Unless you need to identify the columns automatically, I found this to be the simplest solution:
df$name <- as.factor(df$name)
This converts the column called name in the data frame df to a factor.
You can use across() with dplyr 1.0.0 or later:
library(dplyr)
df <- mtcars
#To turn 1 column to factor
df <- df %>% mutate(cyl = factor(cyl))
#Turn columns to factor based on their type.
df <- df %>% mutate(across(where(is.character), factor))
#Based on the position
df <- df %>% mutate(across(c(2, 4), factor))
#Change specific columns by their name
df <- df %>% mutate(across(c(cyl, am), factor))
I need your help to simplify the following code.
I need to name the columns of a matrix and format each of them as a factor.
How can I do that for 100 columns without doing it one by one?
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
train.data <- data.frame(x1=factor(z[,1]),x2=factor(z[,2]),....,x100=factor(z[,52]))
Here's one option
setNames(data.frame(lapply(split(z, col(z)), factor)), paste0("x", 1:p))
or use magrittr piping syntax
library(magrittr)
split(z, col(z)) %>%
lapply(factor) %>%
data.frame %>%
setNames(paste0("x", 1:p))
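An equivalent base R route, sketched with small made-up values for n and p (the question leaves them undefined), converts the matrix to a data frame first and then turns every column into a factor.

```r
set.seed(1)
n <- 5  # hypothetical number of rows
p <- 4  # hypothetical number of columns

z <- matrix(sample(seq(3), n * p, replace = TRUE), nrow = n)

train.data <- as.data.frame(z)
names(train.data) <- paste0("x", seq_len(p))  # x1, x2, ..., xp
train.data[] <- lapply(train.data, factor)    # "[]" keeps the data frame structure

str(train.data)
```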
Suppose, you have a data.frame like this:
x <- data.frame(v1=1:20,v2=1:20,v3=1:20,v4=letters[1:20])
How would you select only those columns in x that are numeric?
EDIT: updated to avoid use of ill-advised sapply.
Since a data frame is a list we can use the list-apply functions:
nums <- unlist(lapply(x, is.numeric), use.names = FALSE)
Then standard subsetting
x[ , nums]
## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)
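The answer does not say why sapply is ill-advised. One commonly cited reason is that its return type depends on the input, which the sketch below illustrates with vapply() as a type-stable base R alternative. This is an assumption about the motivation, not something stated in the answer.

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])

# vapply() must return exactly one logical per column, or it raises an error
nums <- vapply(x, is.numeric, logical(1))
numeric_part <- x[, nums, drop = FALSE]  # drop = FALSE keeps a data frame even for one column
names(numeric_part)

# With zero columns, sapply() gives an empty list while vapply() gives an empty logical vector
empty <- data.frame()
class(sapply(empty, is.numeric))
class(vapply(empty, is.numeric, logical(1)))
```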
For a more idiomatic modern R I'd now recommend
x[ , purrr::map_lgl(x, is.numeric)]
Less code, less dependent on R's particular quirks, more straightforward, and robust on database-backed tibbles:
dplyr::select_if(x, is.numeric)
Newer versions of dplyr also support the following syntax:
x %>% dplyr::select(where(is.numeric))
The dplyr package's select_if() function is an elegant solution:
library("dplyr")
select_if(x, is.numeric)
Filter() from the base package is the perfect function for that use-case:
You simply have to code:
Filter(is.numeric, x)
It is also much faster than select_if():
library(microbenchmark)
microbenchmark(
dplyr::select_if(mtcars, is.numeric),
Filter(is.numeric, mtcars)
)
This returns (on my computer) a median of 60 microseconds for Filter and 21,000 microseconds for select_if, so Filter is about 350x faster.
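Filter() also combines with Negate() when the complementary set of columns is wanted. A short sketch on the data frame from the question:

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])

numeric_cols <- Filter(is.numeric, x)        # columns for which is.numeric() is TRUE
other_cols <- Filter(Negate(is.numeric), x)  # everything else

names(numeric_cols)
names(other_cols)
```

Both results are still data frames, because Filter() subsets its input with single brackets.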
If you are only interested in the column names, use this:
names(dplyr::select_if(train,is.numeric))
iris %>% dplyr::select(where(is.numeric)) #as per most recent updates
Another option with purrr would be to call discard() with a negated predicate:
iris %>% purrr::discard(~!is.numeric(.))
If you want the names of the numeric columns, you can add names or colnames:
iris %>% purrr::discard(~!is.numeric(.)) %>% names
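This is an alternative to the code in the other answers:
# comparing class() to "numeric" would miss integer columns such as 1:20, so test with is.numeric
x[, sapply(x, is.numeric)]
With a data.table: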
x[, lapply(x, is.numeric) == TRUE, with = FALSE]
library(purrr)
x <- x %>% keep(is.numeric)
The PCAmixdata package has a function, splitmix, that splits a given data frame "YourDataframe" into its quantitative (numerical) and qualitative (categorical) parts, as shown below:
install.packages("PCAmixdata")
library(PCAmixdata)
split <- splitmix(YourDataframe)
X1 <- split$X.quanti  # numerical columns in the dataset
X2 <- split$X.quali   # categorical columns in the dataset
If you have many factor variables, you can use the select_if function.
Install the dplyr package first. It has several functions that select data satisfying a condition, and you can set the condition yourself.
Use it like this:
categorical<-select_if(df,is.factor)
str(categorical)
Another way could be as follows:
#extracting numeric columns from the iris dataset
(iris[sapply(iris, is.numeric)])
Numerical_variables <- which(sapply(df, is.numeric))
# then extract column names
Names <- names(Numerical_variables)
This doesn't directly answer the question but can be very useful, especially if you want something like all the numeric columns except for your id column and dependent variable.
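library(magrittr)  # provides %>% and %<>%
#your_function is a placeholder for whatever transformation you need
numeric_cols <- sapply(dataframe, is.numeric) %>% which %>%
  names %>% setdiff(., c("id_variable", "dep_var"))
dataframe %<>% dplyr::mutate_at(numeric_cols, function(x) your_function(x))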
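The same idea can be sketched in base R on a small made-up data frame. Here mean-centering stands in for your_function, and the column names id_variable, dep_var, height and group are all hypothetical.

```r
# Hypothetical data: an id, a dependent variable, one numeric predictor, one character column
dataframe <- data.frame(
  id_variable = 1:3,
  dep_var = c(0.5, 1.5, 2.5),
  height = c(150, 160, 170),
  group = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Numeric columns, minus the ones that must not be transformed
is_num <- vapply(dataframe, is.numeric, logical(1))
numeric_cols <- setdiff(names(dataframe)[is_num], c("id_variable", "dep_var"))

# Mean-centering stands in for your_function
dataframe[numeric_cols] <- lapply(dataframe[numeric_cols], function(col) col - mean(col))

dataframe
```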