Using dplyr's rename() including variable names not in data set - r

I am trying to transition some plyr code to dplyr, and getting stuck with the new functionality of rename() in dplyr. I'd like to be able to reuse a single rename() expression for a set of datasets with overlapping but not identical original names. For example,
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
And then,
rename(sample1, var1 = A, var2 = B, var3 = C)
I would like the result to be that variable A is renamed var1, and B is renamed var2, not adding a var3 in this case. Instead, I get
Error: Unknown variables: C.
In contrast, the plyr syntax would let me use
rename(sample1, c("A" = "var1", "B" = "var2", "C" = "var3"))
rename(sample2, c("A" = "var1", "B" = "var2", "C" = "var3"))
and not throw an error. Is there a way to get the same result in dplyr without getting the Unknown variables error?

Completely ignoring your actual request on how to do this with dplyr, I would like to suggest a different approach using a lookup table:
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
rename_map <- c("A"="var1",
"B"="var2",
"C"="var3")
names(sample1) <- rename_map[names(sample1)]
str(sample1)
names(sample2) <- rename_map[names(sample2)]
str(sample2)
Fundamentally the algorithm is simple:
1. Build a lookup table that maps the current variable names to the desired names.
2. Use names() to look up each current column name in the map and assign the mapped names back to the columns.
EDIT: As per Hadley's suggestion, I used a named vector instead of a list, makes life much easier. I always forget about named vectors :(

#no need to use rename
oldnames<-unique(c(names(sample1),names(sample2)))
newnames<-c("var1","var2","var3")
name_df<-data.frame(oldnames,newnames,stringsAsFactors=FALSE) # keep the names as character, not factor
mydata<-list(sample1,sample2) # combined two datasets as a list
#one liner
finaldata <- lapply(mydata, function(i) {colnames(i)<-name_df[name_df[,1] %in% colnames(i),2]
return(i)})
> finaldata
[[1]]
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
[[2]]
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t

With dplyr, we can use a named vector with the old names as values and the new names as names, then unquote only the elements of name_vec whose values match names in your dataset. rename() supports unquoting character vectors, so there is no need to convert them with sym() beforehand:
library(dplyr)
name_vec <- c(var1 = "A", var2 = "B", var3 = "C")
sample1 %>%
rename(!!name_vec[name_vec %in% names(.)])
sample2 %>%
rename(!!name_vec[name_vec %in% names(.)])
Also, with setNames:
name_vec <- c(A = "var1", B = "var2", C = "var3")
sample1 %>%
setNames(name_vec[names(.)])
sample2 %>%
setNames(name_vec[names(.)])
Output:
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
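In a reasonably recent dplyr, the same idea can be sketched with the tidyselect helper any_of(), which skips old names that are absent from the data. This is a minimal sketch that assumes a dplyr version whose rename() accepts tidyselect helpers; the vector name lookup and the result names renamed1 and renamed2 are only illustrative.

```r
library(dplyr)

sample1 <- data.frame(A = 1:10, B = letters[1:10])
sample2 <- data.frame(B = 11:20, C = letters[11:20])

# New names are the vector's names, old names are its values
lookup <- c(var1 = "A", var2 = "B", var3 = "C")

# any_of() ignores old names that are not columns of the data,
# so the same expression can be reused for both data frames
renamed1 <- rename(sample1, any_of(lookup))
renamed2 <- rename(sample2, any_of(lookup))

names(renamed1)
names(renamed2)
```

Using all_of() instead of any_of() would turn a missing column back into an error, which may be preferable when every name is expected to be present.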

I’ve used @earino’s answer before myself, but discovered that it can be unsafe. If column names of the data frame are missing from the names of the named vector, those column names are silently replaced with NA, which is certainly not what you want.
d1 <- data.frame(A = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
rename_vec <- c("B" = "var2", "C" = "var3")
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
The same can happen if you run names(d1) <- rename_vec[names(d1)] twice by accident, because on the second run none of the colnames(d1) are in names(rename_vec).
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA: int 1 2 3 4 5 6 7 8 9 10
#> $ NA: chr "a" "b" "c" "d" ...
We just need to select those columns that are in the data frame and in the rename vector.
d2 <- data.frame(B1 = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
sel <- is.element(colnames(d2), names(rename_vec))
names(d2)[sel] <- rename_vec[names(d2)][sel]
str(d2)
#> 'data.frame': 10 obs. of 2 variables:
#> $ B1 : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
UPDATE: I initially had a solution here that involved string replacement, which turned out to be unsafe as well, because it allowed for partial matching. This one is better, I think.
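The safe selection above can be wrapped in a small function so that one lookup is reused across several data frames. This is only a sketch in base R: rename_known() is a hypothetical helper name, and the third data frame (extra) is made up to show a column that is not in the lookup.

```r
# Hypothetical helper: rename only the columns that appear in the lookup
rename_known <- function(df, map) {
  hits <- names(df) %in% names(map)   # columns that have a new name
  names(df)[hits] <- unname(map[names(df)[hits]])
  df
}

rename_map <- c(A = "var1", B = "var2", C = "var3")
sample1 <- data.frame(A = 1:10, B = letters[1:10])
sample2 <- data.frame(B = 11:20, C = letters[11:20])
extra   <- data.frame(B1 = 1:3, B = letters[1:3])   # B1 is not in the lookup

renamed <- lapply(list(sample1, sample2, extra), rename_known, map = rename_map)
lapply(renamed, names)

# With this lookup, running the helper a second time changes nothing
twice <- rename_known(renamed[[1]], rename_map)
names(twice)
```

Columns without an entry in the lookup keep their original names instead of becoming NA.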

Related

Convert many variables in a dataframe from numeric to factor [duplicate]

I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how I can select multiple columns and convert them to factors together. I usually do it like data$A = as.factor(data$A), but when the data frame is very large and contains lots of columns, this is very time consuming. Does anyone know of a better way to do it?
Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"
Here is an option using dplyr. The %<>% operator from magrittr updates the lhs object with the resulting value. (Note that mutate_each_() and funs() have since been deprecated in dplyr.)
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
setDT(data)
for(j in cols){
set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]
The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
$ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
$ B: int 36 35 2 26
$ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
$ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
$ E: int 3 14 30 38
$ F: int 27 15 28 37
$ G: int 19 11 6 21
$ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
$ I: int 23 24 13 8
$ J: int 10 25 4 33
As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(<tidy-select>, <function>)) # template, not runnable as written
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A, B, C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)
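When the column names are stored in a character vector, as in the earlier answers, across() can be combined with the tidyselect helper all_of(). The following is a minimal sketch assuming dplyr 1.0 or later; the seed and the result name converted are only there for illustration.

```r
library(dplyr)

set.seed(1)  # only to make the illustrative data reproducible
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))

cols <- c("A", "C", "D", "H")

# all_of() turns the character vector into a column selection
converted <- data %>% mutate(across(all_of(cols), factor))

sapply(converted, class)
```

all_of() raises an error if any name in cols is missing from the data, which makes typos in the vector easier to spot.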
You can use mutate_if() from dplyr.
For example, to coerce integer columns to factors:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
And, for completeness, with regard to this question about changing string columns only, there's mutate_if:
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character,funs(factor(.)))
Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]
A simple and updated solution:
data <- data %>%
mutate_at(cols, list(~factor(.)))
If instead you want to pick the columns based on the table's own contents and then convert them, you can try the following (data.table syntax):
### pre processing
ind <- sapply(bigm.train, is.character) # TRUE for character columns
ind <- names(bigm.train)[ind] # keep only the names of those columns
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects columns which are specifically character based and then converts them to factor.
Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9
It appears that using sapply() on a data.frame to convert variables to factors all at once does not work, as it produces a matrix/array. My approach is to use lapply() instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- data.frame(lapply(people, as.factor)) ## base R, so no pipe package is needed
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.
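The difference between the two functions can be checked directly. This sketch rebuilds the same people data and also shows one possible variation, assigning into people_f[] so that the object stays a data.frame without a separate conversion step; the names as_matrix and people_f are illustrative.

```r
people <- data.frame(
  class  = c("7", "6", "5", "3"),
  cash   = c(100, 200, 300, 150),
  height = c(170, 180, 150, 165)
)

# sapply() simplifies its result to a matrix, so the factor class is lost
as_matrix <- sapply(people, as.factor)
is.matrix(as_matrix)

# Assigning into people_f[] keeps the data.frame structure and column names
people_f <- people
people_f[] <- lapply(people_f, as.factor)
sapply(people_f, class)
```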

How to Transpose Data in R

I used the RNCEP package to get reanalysis data for temperature. My data looks something like this:
(DD1 <- array(1:12, dim = c(2, 3, 2),
dimnames = list(c("A", "B"),
c("a", "b", "c"),
c("First", "Second"))))
# , , First
#
# a b c
# A 1 3 5
# B 2 4 6
#
# , , Second
#
# a b c
# A 7 9 11
# B 8 10 12
str(DD1)
# int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "a" "b" "c"
# ..$ : chr [1:2] "First" "Second"
I think this is tabular data?
I need to write the data to a csv file where I have something like this:
y  a  a  b  b   c   c
x  A  B  A  B   A   B
   1  2  3  4   5   6
   7  8  9  10  11  12
But when I used write.csv I got this:
write.csv(DD1)
# "","a.First","b.First","c.First","a.Second","b.Second","c.Second"
# "A",1,3,5,7,9,11
# "B",2,4,6,8,10,12
I thought I had to transpose the data first. So I used this:
DD2 <- as.data.frame.table(DD1)
I also used t() but that also did not work.
The transpose function in R is t(), so hopefully this will work on the data frame you are trying to transpose.
DD3 <- t(DD2)
You were on the right track with as.data.frame.table(DD1). That would give you a "long" dataset, that can then be converted to a "wide" form that you can use write.csv on.
Note, however, that R only allows one row of headers, so you will have to combine what you show as "x" and "y" into a single header row.
Here's the approach I would suggest:
library(data.table)
(DD2 <- dcast(data.table(as.data.frame.table(DD1)),
Var3 ~ Var1 + Var2, value.var = "Freq"))
# Var3 A_a A_b A_c B_a B_b B_c
# 1: First 1 3 5 2 4 6
# 2: Second 7 9 11 8 10 12
You can then easily use write.csv on the "DD2" object.
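For comparison, the same wide layout can be sketched in base R without data.table, by refilling the array values row by row. This is only one possible alternative; the object name wide and the commented-out file name are assumptions for illustration.

```r
DD1 <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(c("A", "B"),
                             c("a", "b", "c"),
                             c("First", "Second")))

# One row per slice of the third dimension; within a row the values
# run A/a, B/a, A/b, B/b, A/c, B/c
wide <- matrix(as.vector(DD1), nrow = dim(DD1)[3], byrow = TRUE)

# Combine the first two sets of dimnames into a single header row
rownames(wide) <- dimnames(DD1)[[3]]
colnames(wide) <- as.vector(outer(dimnames(DD1)[[1]], dimnames(DD1)[[2]],
                                  paste, sep = "_"))
wide

# write.csv(wide, "DD1_wide.csv")  # hypothetical file name
```

The column order here follows the order asked for in the question (A and B alternating within a, b, c), whereas dcast() groups the columns by A and then B.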

Extract and sort specific values from an R data frame

I'm quite new to R, and there might be an easy answer to this, but still:
I have a data frame of the form
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
My dataset is way bigger than this, lots more rows and columns. But the basic idea is the same: I want to sort the n highest numeric values, independent of which column they appear in, and return a list with the values in decreasing order and their corresponding string in column "en".
Like this:
e 11
d 10
c 9
b 8
a 7
e 5
and so on.
How could I go about accomplishing this?
You can use the reshape2 package to melt your data and sort the value column, like this:
require(reshape2)
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
df2 <- melt(df, id = "en")
str(df2)
## 'data.frame': 10 obs. of 3 variables:
## $ en : chr "a" "b" "c" "d" ...
## $ variable: Factor w/ 2 levels "to","tre": 1 1 1 1 1 2 2 2 2 2
## $ value : int 1 2 3 4 5 7 8 9 10 11
df2[order(df2$value, decreasing = TRUE), c("en", "value")]
## en value
## 10 e 11
## 9 d 10
## 8 c 9
## 7 b 8
## 6 a 7
## 5 e 5
## 4 d 4
## 3 c 3
## 2 b 2
## 1 a 1
But I'm sure there are other ways to do that!
Less elegant but without extra packages (it will work with any number of columns):
col1<-rep(df[,1],ncol(df)-1)
col2<-c()
for(i in 2:ncol(df)) {
col2<-c(col2,df[,i])
}
newdf<-data.frame(en=col1,value=col2)
newdf<-newdf[order(as.numeric(newdf[,2]),decreasing=TRUE),]
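The loop above can also be written without growing a vector, by stacking the value columns with unlist(). This is a sketch in base R; value_cols, long, top_n and the choice n = 6 are illustrative names and values, not part of the original question.

```r
df <- data.frame(en = c("a", "b", "c", "d", "e"), to = 1:5, tre = 7:11,
                 stringsAsFactors = FALSE)

value_cols <- setdiff(names(df), "en")   # every column except the labels

# Stack the value columns and repeat the labels once per column
long <- data.frame(en    = rep(df$en, times = length(value_cols)),
                   value = unlist(df[value_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

n <- 6  # illustrative choice of how many values to keep
top_n <- head(long[order(long$value, decreasing = TRUE), ], n)
top_n
```

As with the loop version, this works for any number of value columns, provided they are all numeric.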
