I have a data frame with two columns: one is strings, the other one is integers.
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> df
x rnames
1 5 item.1
2 3 item.2
3 5 item.3
4 3 item.4
5 1 item.5
6 3 item.6
7 4 item.7
8 5 item.8
9 4 item.9
10 5 item.10
11 5 item.11
12 2 item.12
13 2 item.13
14 1 item.14
15 3 item.15
16 4 item.16
17 5 item.17
18 4 item.18
19 1 item.19
20 1 item.20
I'm trying to aggregate the strings into lists or vectors of strings (characters) with the 'c' or the 'list' function, but I'm getting weird results:
> aggregate(rnames ~ x, df, c)
x rnames
1 1 16, 6, 11, 13
2 2 4, 5
3 3 12, 15, 17, 7
4 4 18, 20, 8, 10
5 5 1, 14, 19, 2, 3, 9
When I use 'paste' instead of 'c', I can see that the aggregate is working correctly - but the result is not what I'm looking for.
> aggregate(rnames ~ x, df, paste)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
What I'm looking for is for every aggregated group to be presented as a vector or a list (hence the use of c), as opposed to the single string I'm getting with 'paste'. Something along the lines of the following (which in reality doesn't work):
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
Any help would be appreciated.
You fell into the usual data.frame trap: your character column is not a character column, it is a factor column (before R 4.0.0, data.frame() converted strings to factors by default). Hence the numbers instead of the characters in your result:
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 2 5 5 5 5 4 3 3 2 4 ...
$ rnames: Factor w/ 20 levels "item.1","item.10",..: 1 12 14 15 16 17 18 19 20 2 ...
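The numbers seen in the aggregate() output are most likely the factor's internal integer codes. A minimal sketch, using a made-up three-element factor rather than the question's data, shows the difference between those codes and the labels:

```r
# Toy factor (hypothetical values); levels are sorted alphabetically: "a", "b", "c"
f <- factor(c("b", "a", "c"))

codes  <- as.integer(f)    # internal integer codes: positions in levels(f)
labels <- as.character(f)  # the original strings

print(codes)
print(labels)
```

Anything that works on the underlying integer vector sees the codes, not the strings.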
To prevent the conversion to factors, use the argument stringsAsFactors=FALSE in your call to data.frame:
> df <- data.frame(x, rnames,stringsAsFactors=FALSE)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 5 5 3 5 5 3 2 5 1 5 ...
$ rnames: chr "item.1" "item.2" "item.3" "item.4" ...
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.9, item.13, item.17
2 2 item.7
3 3 item.3, item.6, item.19
4 4 item.12, item.15, item.16
5 5 item.1, item.2, item.4, item.5, item.8, item.10, item.11, item.14, item.18, item.20
Another way to avoid the conversion to factor is the function I():
> df <- data.frame(x, I(rnames))
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 3 5 4 5 4 5 3 3 1 1 ...
$ rnames:Class 'AsIs' chr [1:20] "item.1" "item.2" "item.3" "item.4" ...
Excerpt from ?I:
In function data.frame. Protecting an object by enclosing it in I() in
a call to data.frame inhibits the conversion of character vectors to
factors and the dropping of names, and ensures that matrices are
inserted as single columns. I can also be used to protect objects
which are to be added to a data frame, or converted to a data frame
via as.data.frame.
It achieves this by prepending the class "AsIs" to the object's
classes. Class "AsIs" has a few of its own methods, including for [,
as.data.frame, print and format.
I'm not sure exactly what you are looking for, so perhaps some reference output would give us an idea of what we are aiming at?
But, since your last bit of code seems to be close to what you are after, maybe a solution like the following would work:
> library(plyr)
> ddply(df, .(x), summarize, rnames = paste(rnames, collapse = "|"))
x rnames
1 1 item.9|item.11|item.20
2 2 item.1|item.2|item.15|item.16
3 3 item.7|item.8
4 4 item.4|item.5|item.6|item.12|item.13
5 5 item.3|item.10|item.14|item.17|item.18|item.19
You can vary how the individual elements are stuck together by changing the collapse argument to paste().
Alternatively, if you just want each of the groups as a vector, then you could use this:
> df$rnames = as.character(df$rnames)
> L = dlply(df, .(x), function(df) {df$rnames})
> L
$`1`
[1] "item.9" "item.11" "item.20"
$`2`
[1] "item.1" "item.2" "item.15" "item.16"
$`3`
[1] "item.7" "item.8"
$`4`
[1] "item.4" "item.5" "item.6" "item.12" "item.13"
$`5`
[1] "item.3" "item.10" "item.14" "item.17" "item.18" "item.19"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
x
1 1
2 2
3 3
4 4
5 5
This gives you a list of vectors, which is what you were after. And each group can be indexed out of the resulting list:
> L[[1]]
[1] "item.9" "item.11" "item.20"
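If plyr is not available, base R's split() gives a similar list of vectors. This is only a sketch under assumptions: the data frame below uses small fixed values instead of the question's random sample, and rnames is kept as character.

```r
# Deterministic toy data (hypothetical values, not the question's random sample)
df <- data.frame(
  x = c(1L, 2L, 1L, 3L, 2L, 1L),
  rnames = paste("item", 1:6, sep = "."),
  stringsAsFactors = FALSE
)

# One character vector per distinct value of x, named by that value
groups <- split(df$rnames, df$x)

print(groups[["1"]])
```

As with the dlply() result, each group can be indexed out of the list by name or position.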
I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how I can select multiple columns and convert them to factors together. I usually do it like data$A = as.factor(data$A), but when the data frame is very large and contains lots of columns, this is very time consuming. Does anyone know of a better way to do it?
Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"
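The same lapply() idea can be wrapped in a small function that also fails early on misspelled column names. This is a sketch only: cols_to_factor is a hypothetical helper name, and the seed is arbitrary.

```r
# Hypothetical helper (name and checks are illustrative, not from any package)
cols_to_factor <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # Replace only the chosen columns; the others keep their class
  df[cols] <- lapply(df[cols], factor)
  df
}

set.seed(1)  # arbitrary seed, only to make the example repeatable
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
converted <- cols_to_factor(data, c("A", "C", "D", "H"))
classes <- sapply(converted, class)
print(classes)
```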
Here is an option using dplyr. The %<>% operator from magrittr updates the lhs object with the resulting value. (mutate_each_() and funs() have since been deprecated in dplyr.)
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
library(data.table)
setDT(data)  # convert the data.frame to a data.table by reference
for(j in cols){
  set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]
The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
$ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
$ B: int 36 35 2 26
$ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
$ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
$ E: int 3 14 30 38
$ F: int 27 15 28 37
$ G: int 19 11 6 21
$ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
$ I: int 23 24 13 8
$ J: int 10 25 4 33
As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(<tidy-select>, <function>))
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A to C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)
You can use mutate_if (dplyr):
For example, to coerce integer columns to factors:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
And, for completeness, since this question also covers converting only the string columns, there's mutate_if (funs() is deprecated in current dplyr, so the function is passed directly):
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character, factor)
Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]
A simple and updated solution
data <- data %>%
mutate_at(cols, list(~factor(.)))
If you instead want to pick the columns from the table itself (here, every character column) and then convert them, you can try the following with data.table (bigm.train is an existing data.table):
### pre-processing: names of the character columns
ind <- names(bigm.train)[sapply(bigm.train, is.character)]
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects columns which are specifically character based and then converts them to factor.
Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9
Using sapply() on a data.frame to convert all variables to factors at once does not work, because it produces a matrix/array. My approach is to use lapply() instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- data.frame(lapply(people, as.factor))
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.
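The contrast between the two functions can be checked directly. The sketch below rebuilds the same people data and, as one possible alternative to wrapping the result in data.frame(), assigns into people[] so that the data.frame structure is kept:

```r
people <- data.frame(
  class  = c("7", "6", "5", "3"),
  cash   = c(100, 200, 300, 150),
  height = c(170, 180, 150, 165)
)

# sapply() simplifies its result, so the factors are lost in a matrix
as_matrix <- sapply(people, as.factor)
print(is.matrix(as_matrix))

# lapply() returns a list; assigning into people[] keeps the data.frame
people[] <- lapply(people, as.factor)
print(sapply(people, class))
```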
I am using the following code, which works fine (improvement suggestions very much welcome):
library(sqldf)  # sqldf()
library(broom)  # tidy(), assumed to come from broom
WeeklySlopes <- function(Year, Week){
  DynamicQuery <- paste('select DayOfYear, Week, Year, Close from SourceData where year =', Year, 'and week =', Week, 'order by DayOfYear')
  SubData = sqldf(DynamicQuery)
  SubData$X <- as.numeric(rownames(SubData))
  lmfit <- lm(Close ~ X, data = SubData)
  lmfit <- tidy(lmfit)
  Slope <- as.numeric(sqldf("select estimate from lmfit where term = 'X'"))
  e <- globalenv()
  e$WeeklySlopesDf[nrow(e$WeeklySlopesDf) + 1,] = c(Year,Week, Slope)
}
WeeklySlopesDf <- data.frame(Year = integer(), Week = integer(), Slope = double())
WeeklySlopes(2017, 15)
WeeklySlopes(2017, 14)
head(WeeklySlopesDf)
Is there really no other way to append a row to my existing data frame? I seem to need to access the globalenv. On the other hand, why can sqldf 'see' the 'global' data frame SourceData?
dfrm <- data.frame(a=1:10, b=letters[1:10]) # reproducible example
myfunc <- function(new_a=20){ g <- globalenv(); g$dfrm[3,1] <- new_a; cat(dfrm[3,1])}
myfunc()
20
dfrm
a b
1 1 a
2 2 b
3 20 c # so your strategy might work, although it's unconventional.
Now try to extend dataframe outside a function:
dfrm[11, ] <- c(a=20,b="c")
An occult disaster (conversion of numeric column to character):
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: chr "1" "2" "20" "4" ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So use a list to avoid occult coercion:
dfrm <- data.frame(a=1:10, b=letters[1:10]) # start over
dfrm[11, ] <- list(a=20,b="c")
str(dfrm)
'data.frame': 11 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
Now try within a function:
myfunc <- function(new_a=20, new_b="ZZ"){ g <- globalenv(); g$dfrm[nrow(dfrm)+1, ] <- list(a=new_a,b=new_b)}
myfunc()
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = "ZZ") :
invalid factor level, NA generated
str(dfrm)
'data.frame': 12 obs. of 2 variables:
$ a: num 1 2 3 4 5 6 7 8 9 10 ...
$ b: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
So it succeeds, but if there are any factor columns, non-existent levels will get turned into NA values (with a warning). Your method of using named access to objects in the global environment is rather unconventional, but there is a set of tested methods that you might want to examine. Look at ?R6. Other options are <<- and assign(); the latter lets you specify the environment in which the assignment occurs.
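A more conventional pattern avoids the global environment altogether: let the function return a one-row data frame and bind the rows afterwards. The sketch below is an assumption-laden simplification, not the poster's code: weekly_slope is a hypothetical name, and it takes the closing prices as a plain numeric vector instead of querying SourceData with sqldf.

```r
# Hypothetical variant: returns one row instead of modifying a global object
weekly_slope <- function(year, week, close) {
  x <- seq_along(close)              # 1, 2, ..., n as the regressor
  fit <- lm(close ~ x)
  data.frame(Year = year, Week = week, Slope = unname(coef(fit)["x"]))
}

# Made-up closing prices for two weeks
results <- list(
  weekly_slope(2017L, 14L, c(10, 12, 14, 16, 18)),
  weekly_slope(2017L, 15L, c(20, 19, 18, 17, 16))
)

# Bind all one-row results into a single data frame
WeeklySlopesDf <- do.call(rbind, results)
print(WeeklySlopesDf)
```

Because each call builds its row with data.frame(), the column types stay as given, which sidesteps the coercion problem shown above with c().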
What is the difference between as.data.frame(x) and data.frame(x)?
In the following example, the results are the same except for the column names.
x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
> as.data.frame(x)
V1 V2 V3
1 1 1 1
2 1 1 1
3 1 1 1
As mentioned by Jaap, data.frame() calls as.data.frame() but there's a reason for it:
as.data.frame() is a method to coerce other objects to class data.frame. If you're writing your own package, you would store your method to convert an object of your_class under as.data.frame.your_class(). Here are just a few examples.
methods(as.data.frame)
[1] as.data.frame.AsIs as.data.frame.Date
[3] as.data.frame.POSIXct as.data.frame.POSIXlt
[5] as.data.frame.aovproj* as.data.frame.array
[7] as.data.frame.character as.data.frame.complex
[9] as.data.frame.data.frame as.data.frame.default
[11] as.data.frame.difftime as.data.frame.factor
[13] as.data.frame.ftable* as.data.frame.integer
[15] as.data.frame.list as.data.frame.logLik*
[17] as.data.frame.logical as.data.frame.matrix
[19] as.data.frame.model.matrix as.data.frame.numeric
[21] as.data.frame.numeric_version as.data.frame.ordered
[23] as.data.frame.raw as.data.frame.table
[25] as.data.frame.ts as.data.frame.vector
Non-visible functions are asterisked
data.frame() can be used to build a data frame, while as.data.frame() can only be used to coerce another object to a data frame.
For example:
# data.frame()
df1 <- data.frame(matrix(1:12,3,4),1:3)
# as.data.frame()
df2 <- as.data.frame(matrix(1:12,3,4),1:3)
df1
# X1 X2 X3 X4 X1.3
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
df2
# V1 V2 V3 V4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
As you noted, the result does differ slightly, and this means that they are not exactly equal:
identical(data.frame(x),as.data.frame(x))
[1] FALSE
So you might need to take care to be consistent in which one you use.
But it is also worth noting that as.data.frame is faster:
library(microbenchmark)
microbenchmark(data.frame(x),as.data.frame(x))
Unit: microseconds
expr min lq median uq max neval
data.frame(x) 71.446 73.616 74.80 78.9445 146.442 100
as.data.frame(x) 25.657 27.631 28.42 29.2100 93.155 100
y <- matrix(1:1e6,1000,1000)
microbenchmark(data.frame(y),as.data.frame(y))
Unit: milliseconds
expr min lq median uq max neval
data.frame(y) 17.23943 19.63163 23.60193 41.07898 130.66005 100
as.data.frame(y) 10.83469 12.56357 14.04929 34.68608 38.37435 100
The difference becomes clearer when you look at their main arguments:
as.data.frame(x, ...): check if object is a data frame, or coerce if possible. Here, "x" can be any R object.
data.frame(...): build a data frame. Here, "..." allows specifying all the components (i.e. the variables of the data frame).
So, the results by Ophelia are similar since both functions received a single matrix as argument: however, when these functions receive 2 (or more) vectors, the distinction becomes clearer:
> # Set seed for reproducibility
> set.seed(3)
> # Create one int vector
> IDs <- seq(1:10)
> IDs
[1] 1 2 3 4 5 6 7 8 9 10
> # Create one char vector
> types <- sample(c("A", "B"), 10, replace = TRUE)
> types
[1] "A" "B" "A" "A" "B" "B" "A" "A" "B" "B"
> # Try to use "as.data.frame" to coerce components into a dataframe
> dataframe_1 <- as.data.frame(IDs, types)
> # Look at the result
> dataframe_1
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
duplicate row.names: A, B
> # Inspect result with head
> head(dataframe_1, n = 10)
IDs
A 1
B 2
A.1 3
A.2 4
B.1 5
B.2 6
A.3 7
A.4 8
B.3 9
B.4 10
> # Check the structure
> str(dataframe_1)
'data.frame': 10 obs. of 1 variable:
$ IDs: int 1 2 3 4 5 6 7 8 9 10
> # Use instead "data.frame" to build a data frame starting from two components
> dataframe_2 <- data.frame(IDs, types)
> # Look at the result
> dataframe_2
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Inspect result with head
> head(dataframe_2, n = 10)
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Check the structure
> str(dataframe_2)
'data.frame': 10 obs. of 2 variables:
$ IDs : int 1 2 3 4 5 6 7 8 9 10
$ types: Factor w/ 2 levels "A","B": 1 2 1 1 2 2 1 1 2 2
As you can see, "data.frame()" works fine, while "as.data.frame()" produces an error: it treats only the first argument as the object to be checked and coerced, and interprets the second argument as row.names.
To sum up, "as.data.frame()" should be used to convert/coerce one single R object into a data frame (as you correctly did using a matrix), while "data.frame()" to build a data frame from scratch.
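If as.data.frame() is still preferred for several vectors, they can first be wrapped in one object, for instance a named list. A minimal sketch, assuming fixed values for types instead of the random sample used above:

```r
IDs   <- 1:10
types <- rep(c("A", "B"), times = 5)  # fixed values instead of a random sample

# A named list is a single object, so as.data.frame() can coerce it
from_list <- as.data.frame(list(IDs = IDs, types = types))

# data.frame() builds the same two columns from separate components
built <- data.frame(IDs, types)

print(dim(from_list))
print(names(built))
```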
Try
colnames(x) <- c("C1","C2","C3")
and then both will give the same result
identical(data.frame(x), as.data.frame(x))
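The naming difference can be inspected on its own. This sketch compares only the column names before and after setting colnames(x); it makes no claim about the other attributes of the two results:

```r
x <- matrix(1, nrow = 3, ncol = 3)

# Without column names, each function invents its own defaults
before_df  <- names(data.frame(x))     # X-prefixed names
before_adf <- names(as.data.frame(x))  # V-prefixed names

# With explicit column names, both functions reuse them
colnames(x) <- c("C1", "C2", "C3")
after_df  <- names(data.frame(x))
after_adf <- names(as.data.frame(x))

print(before_df)
print(before_adf)
print(identical(after_df, after_adf))
```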
What is more startling are things like the following:
list(x)
provides a one-element list, the element being the matrix x; whereas
as.list(x)
gives a list with 9 elements, one for each matrix entry
MM
Looking at the code, as.data.frame fails faster. data.frame will issue warnings, and do things like remove rownames if there are duplicates:
> x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> rownames(x) <- c("a", "b", "b")
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
Warning message:
In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 3 --> row.names NOT used
> as.data.frame(x)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
duplicate row.names: b