Convert many variables in a dataframe from numeric to factor [duplicate] - r

I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how can I select multiple columns and convert them together to factors. I usually do it in the way like data$A = as.factor(data$A). But when the data frame is very large and contains lots of columns, this way will be very time consuming. Does anyone know of a better way to do it?

Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"

Here is an option using dplyr. The %<>% operator from magrittr update the lhs object with the resulting value.
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
setDT(data)
for(j in cols){
set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]

The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
$ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
$ B: int 36 35 2 26
$ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
$ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
$ E: int 3 14 30 38
$ F: int 27 15 28 37
$ G: int 19 11 6 21
$ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
$ I: int 23 24 13 8
$ J: int 10 25 4 33

As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(*<tidy-select>*, *function*))
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A to C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)

You can use mutate_if (dplyr):
For example, coerce integer in factor:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c

and, for completeness and with regards to this question asking about changing string columns only, there's mutate_if:
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character,funs(factor(.)))

Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]

A simple and updated solution
data <- data %>%
mutate_at(cols, list(~factor(.)))

If you have another objective of getting in values from the table then using them to be converted, you can try the following way
### pre processing
ind <- bigm.train[,lapply(.SD,is.character)]
ind <- names(ind[,.SD[T]])
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects columns which are specifically character based and then converts them to factor.

Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9

It appears that the use of SAPPLY on a data.frame to convert variables to factors at once does not work as it produces a matrix/ array. My approach is to use LAPPLY instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- lapply(people, as.factor) %>% data.frame()
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.

Related

r looping through dataframe to change to factors [duplicate]

I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how can I select multiple columns and convert them together to factors. I usually do it in the way like data$A = as.factor(data$A). But when the data frame is very large and contains lots of columns, this way will be very time consuming. Does anyone know of a better way to do it?
Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"
Here is an option using dplyr. The %<>% operator from magrittr update the lhs object with the resulting value.
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
setDT(data)
for(j in cols){
set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]
The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
$ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
$ B: int 36 35 2 26
$ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
$ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
$ E: int 3 14 30 38
$ F: int 27 15 28 37
$ G: int 19 11 6 21
$ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
$ I: int 23 24 13 8
$ J: int 10 25 4 33
As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(*<tidy-select>*, *function*))
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A to C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)
You can use mutate_if (dplyr):
For example, coerce integer in factor:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
and, for completeness and with regards to this question asking about changing string columns only, there's mutate_if:
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character,funs(factor(.)))
Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]
A simple and updated solution
data <- data %>%
mutate_at(cols, list(~factor(.)))
If you have another objective of getting in values from the table then using them to be converted, you can try the following way
### pre processing
ind <- bigm.train[,lapply(.SD,is.character)]
ind <- names(ind[,.SD[T]])
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects columns which are specifically character based and then converts them to factor.
Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9
It appears that the use of SAPPLY on a data.frame to convert variables to factors at once does not work as it produces a matrix/ array. My approach is to use LAPPLY instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- lapply(people, as.factor) %>% data.frame()
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.

Coerce multiple columns to factors at once

I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how can I select multiple columns and convert them together to factors. I usually do it in the way like data$A = as.factor(data$A). But when the data frame is very large and contains lots of columns, this way will be very time consuming. Does anyone know of a better way to do it?
Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"
Here is an option using dplyr. The %<>% operator from magrittr update the lhs object with the resulting value.
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
setDT(data)
for(j in cols){
set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]
The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
$ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
$ B: int 36 35 2 26
$ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
$ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
$ E: int 3 14 30 38
$ F: int 27 15 28 37
$ G: int 19 11 6 21
$ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
$ I: int 23 24 13 8
$ J: int 10 25 4 33
As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(*<tidy-select>*, *function*))
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A to C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)
You can use mutate_if (dplyr):
For example, coerce integer in factor:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
and, for completeness and with regards to this question asking about changing string columns only, there's mutate_if:
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character,funs(factor(.)))
Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]
A simple and updated solution
data <- data %>%
mutate_at(cols, list(~factor(.)))
If you have another objective of getting in values from the table then using them to be converted, you can try the following way
### pre processing
ind <- bigm.train[,lapply(.SD,is.character)]
ind <- names(ind[,.SD[T]])
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects columns which are specifically character based and then converts them to factor.
Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9
It appears that the use of SAPPLY on a data.frame to convert variables to factors at once does not work as it produces a matrix/ array. My approach is to use LAPPLY instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- lapply(people, as.factor) %>% data.frame()
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.

Using dplyr's rename() including variable names not in data set

I am trying to transition some plyr code to dplyr, and getting stuck with the new functionality of rename() in dplyr. I'd like to be able to reuse a single rename() expression for a set of datasets with overlapping but not identical original names. For example,
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
And then,
rename(sample1, var1 = A, var2 = B, var3 = C)
I would like the result to be that variable A is renamed var1, and B is renamed var2, not adding a var3 in this case. Instead, I get
Error: Unknown variables: C.
In contrast, the plyr syntax would let me use
rename(sample1, c("A" = "var1", "B" = "var2", "C" = "var3"))
rename(sample2, c("A" = "var1", "B" = "var2", "C" = "var3"))
and not throw an error. Is there a way to get the same result in dplyr without getting the Unknown variables error?
Completely ignoring your actual request on how to do this with dplyr, I would like suggest a different approach using a lookup table:
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
rename_map <- c("A"="var1",
"B"="var2",
"C"="var3")
names(sample1) <- rename_map[names(sample1)]
str(sample1)
names(sample2) <- rename_map[names(sample2)]
str(sample2)
Fundamentally the algorithm is simple:
Build a lookup table of current variable names to desired names
Using the names() function, do a lookup into the map with the mapping indexes and assign those mapped variables to the appropriate columns.
EDIT: As per Hadley's suggestion, I used a named vector instead of a list, makes life much easier. I always forget about named vectors :(
#no need to use rename
oldnames<-unique(c(names(sample1),names(sample2)))
newnames<-c("var1","var2","var3")
name_df<-data.frame(oldnames,newnames)
mydata<-list(sample1,sample2) # combined two datasets as a list
#one liner
finaldata <- lapply(mydata, function(i) {colnames(i)<-name_df[name_df[,1] %in% colnames(i),2]
return(i)})
> finaldata
[[1]]
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
[[2]]
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
With dplyr, we can use a named vector with old names as values and new names as names, then unquote only the values in name_vec that matches names in your dataset. rename supports unquoting characters, so there is no need to convert them to sym beforehand:
library(dplyr)
name_vec <- c(var1 = "A", var2 = "B", var3 = "C")
sample1 %>%
rename(!!name_vec[name_vec %in% names(.)])
sample2 %>%
rename(!!name_vec[name_vec %in% names(.)])
Also, with setNames:
name_vec <- c(A = "var1", B = "var2", C = "var3")
sample1 %>%
setNames(name_vec[names(.)])
sample2 %>%
setNames(name_vec[names(.)])
Output:
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
I’ve used #earino’s answer before
myself, but discovered that it can be unsafe. If column names of the data
frame are missing in the (names of the) named vector, those column names are silently replaced with NA and that is certainly not what you want.
d1 <- data.frame(A = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
rename_vec <- c("B" = "var2", "C" = "var3")
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
The same can happen, if you run names(d1) <- rename_vec[names(d1)]
twice by accident, because when you run it the second time, none of the
colnames(d1) are in names(rename_vec).
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA: int 1 2 3 4 5 6 7 8 9 10
#> $ NA: chr "a" "b" "c" "d" ...
We just need to select those columns that are in the data frame and in the rename vector.
d2 <- data.frame(B1 = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
sel <- is.element(colnames(d2), names(rename_vec))
names(d2)[sel] <- rename_vec[names(d2)][sel]
str(d2)
#> 'data.frame': 10 obs. of 2 variables:
#> $ B1 : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
UPDATE: I initially had a solution here that involved string replacement, which turned out to be unsafe as well, because it allowed for partial matching. This one is better, I think.

R: Aggregate character strings with c

I have a data frame with two columns: one is strings, the other one is integers.
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> df
x rnames
1 5 item.1
2 3 item.2
3 5 item.3
4 3 item.4
5 1 item.5
6 3 item.6
7 4 item.7
8 5 item.8
9 4 item.9
10 5 item.10
11 5 item.11
12 2 item.12
13 2 item.13
14 1 item.14
15 3 item.15
16 4 item.16
17 5 item.17
18 4 item.18
19 1 item.19
20 1 item.20
I'm trying to aggregate the strings into list or vectors of strings (characters) with the 'c' or the 'list' function, but getting weird results:
> aggregate(rnames ~ x, df, c)
x rnames
1 1 16, 6, 11, 13
2 2 4, 5
3 3 12, 15, 17, 7
4 4 18, 20, 8, 10
5 5 1, 14, 19, 2, 3, 9
When I use 'paste' instead of 'c', I can see that the aggregate is working correctly - but the result is not what I'm looking for.
> aggregate(rnames ~ x, df, paste)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
What I'm looking for is that every aggregated group would be presented as a vector or a lit (hence the use of c) as opposed to the single string I'm getting with 'paste'. Something along the lines of the following (which in reality doesn't work):
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.5, item.14, item.19, item.20
2 2 item.12, item.13
3 3 item.2, item.4, item.6, item.15
4 4 item.7, item.9, item.16, item.18
5 5 item.1, item.3, item.8, item.10, item.11, item.17
Any help would be appreciated.
You fell in the usual trap of data.frame: your character column is not a character column, it is a factor column! Hence the numbers instead of the characters in your result:
> rnames = sapply(1:20, FUN=function(x) paste("item", x, sep="."))
> x <- sample(c(1:5), 20, replace = TRUE)
> df <- data.frame(x, rnames)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 2 5 5 5 5 4 3 3 2 4 ...
$ rnames: Factor w/ 20 levels "item.1","item.10",..: 1 12 14 15 16 17 18 19 20 2 ...
To prevent the conversion to factors, use argument stringAsFactors=FALSE in your call to data.frame:
> df <- data.frame(x, rnames,stringsAsFactors=FALSE)
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 5 5 3 5 5 3 2 5 1 5 ...
$ rnames: chr "item.1" "item.2" "item.3" "item.4" ...
> aggregate(rnames ~ x, df, c)
x rnames
1 1 item.9, item.13, item.17
2 2 item.7
3 3 item.3, item.6, item.19
4 4 item.12, item.15, item.16
5 5 item.1, item.2, item.4, item.5, item.8, item.10, item.11, item.14, item.18, item.20
Another solution to avoid the conversion to factor is function I:
> df <- data.frame(x, I(rnames))
> str(df)
'data.frame': 20 obs. of 2 variables:
$ x : int 3 5 4 5 4 5 3 3 1 1 ...
$ rnames:Class 'AsIs' chr [1:20] "item.1" "item.2" "item.3" "item.4" ...
Excerpt from ?I:
In function data.frame. Protecting an object by enclosing it in I() in
a call to data.frame inhibits the conversion of character vectors to
factors and the dropping of names, and ensures that matrices are
inserted as single columns. I can also be used to protect objects
which are to be added to a data frame, or converted to a data frame
via as.data.frame.
It achieves this by prepending the class "AsIs" to the object's
classes. Class "AsIs" has a few of its own methods, including for [,
as.data.frame, print and format.
'm not sure just exactly what it is that you are looking for... so perhaps some reference output would be good to give us an idea of what we are aiming at?
But, since your last bit of code seems to be close to what you are after, maybe a solution like the following would work:
> library(plyr)
> ddply(df, .(x), summarize, rnames = paste(rnames, collapse = "|"))
x rnames
1 1 item.9|item.11|item.20
2 2 item.1|item.2|item.15|item.16
3 3 item.7|item.8
4 4 item.4|item.5|item.6|item.12|item.13
5 5 item.3|item.10|item.14|item.17|item.18|item.19
You can vary how the individual elements are stuck together by changing the collapse argument to paste().
Alternatively, if you want to just have each of the groups as a vetor then you could use this:
> df$rnames = as.character(df$rnames)
> L = dlply(df, .(x), function(df) {df$rnames})
> L
$`1`
[1] "item.9" "item.11" "item.20"
$`2`
[1] "item.1" "item.2" "item.15" "item.16"
$`3`
[1] "item.7" "item.8"
$`4`
[1] "item.4" "item.5" "item.6" "item.12" "item.13"
$`5`
[1] "item.3" "item.10" "item.14" "item.17" "item.18" "item.19"
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
x
1 1
2 2
3 3
4 4
5 5
This gives you a list of vectors, which is what you were after. And each group can be indexed out of the resulting list:
> L[[1]]
[1] "item.9" "item.11" "item.20"

Extract and sort specific values from an R data frame

I´m quite new to R, and this might there be an easy answer to, but still:
I have a data frame on the form
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
My dataset is way bigger than this, lots more rows and columns. But the basic idea is the same: I want to sort the n highest numeric values, independent of which column they appear in, and return a list with the values in decreasing order and their corresponding string in column "en".
Like this:
e 11
d 10
c 9
b 8
a 7
e 5
and so on.
How could I go about accomplishing this?
You can use the package reshape2 to melt your data and sort the value column, like this :
require(reshape2)
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
df2 <- melt(df, id = "en")
## 'data.frame': 10 obs. of 3 variables:
## $ en : chr "a" "b" "c" "d" ...
## $ variable: Factor w/ 2 levels "to","tre": 1 1 1 1 1 2 2 2 2 2
## $ value : int 1 2 3 4 5 7 8 9 10 11
df2[order(df2$value, decreasing = TRUE), c("en", "value")]
## en value
## 10 e 11
## 9 d 10
## 8 c 9
## 7 b 8
## 6 a 7
## 5 e 5
## 4 d 4
## 3 c 3
## 2 b 2
## 1 a 1
But I'm sure there are other ways to do that !!
Less elegant but without extra packages (it will work with any number of columns):
col1<-rep(df[,1],ncol(df)-1)
col2<-c()
for(i in 2:ncol(df)) {
col2<-c(col2,df[,i])
}
newdf<-data.frame(en=col1,value=col2)
newdf<-newdf[order(as.numeric(newdf[,2]),decreasing=TRUE),]

Resources