Issue with user defined function in R - r

I am trying to change the data type of my variables in data frame to 'factor' if they are 'character'. I have tried to replicate the problem using sample data as below
a <- c("AB","BC","AB","BC","AB","BC")
b <- c(12,23,34,45,54,65)
df <- data.frame(a,b)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
I wrote the below function to achieve that
abc <- function(x) {
for(i in names(x)){
if(is.character(x[[i]])) {
x[[i]] <- as.factor(x[[i]])
}
}
}
The function runs without error when I pass the data frame (df), but it still doesn't change the 'character' column to 'factor'.
abc(df)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
NOTE: It works perfectly with for loop and if condition. When I tried to generalize it by writing a function around it, there's a problem.
Please help. What am I missing ?

Besides the comment from @Roland, you should make use of R's nice indexing possibilities and learn about the *apply family. With that you can rewrite your code as
change_to_factor <- function(df_in) {
  chr_ind <- vapply(df_in, is.character, logical(1))
  df_in[, chr_ind] <- lapply(df_in[, chr_ind, drop = FALSE], as.factor)
  df_in
}
Explanation
vapply loops over all elements of a list, applies a function to each element and returns a value of the given type (here a boolean logical(1)). Since in R data frames are in fact lists where each (list) element is required to be of the same length, you can conveniently loop over all the columns of the data frame and apply the function is.character to each column. vapply then returns a boolean (logical) vector with TRUE/FALSE values depending on whether the column was a character column or not.
You can then use this boolean vector to subset your data frame to look only at columns which are character columns.
lapply is yet another member of the *apply family; it loops through list elements and returns a list. We now loop over the character columns, apply as.factor to each of them, and get back a list, which we conveniently store in the original positions in the data frame.
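To make the two steps above concrete, here is a minimal sketch that prints the intermediate values. The data frame df_demo is made up for illustration, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
# Hypothetical two-column data frame, one character and one numeric column
df_demo <- data.frame(a = c("AB", "BC"), b = c(12, 23),
                      stringsAsFactors = FALSE)

# Step 1: one TRUE/FALSE per column, named after the columns
chr_ind <- vapply(df_demo, is.character, logical(1))
print(chr_ind)

# Step 2: convert only the selected columns; lapply returns a plain list
converted <- lapply(df_demo[, chr_ind, drop = FALSE], as.factor)
str(converted)
```

The drop = FALSE matters when only one column is selected: without it, the subset would collapse to a vector and lapply would loop over its elements instead of over columns.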
By the way, in R versions before 4.0.0, str(df) shows that column a is already a factor. This is because data.frame automatically converted character columns to factors by default. To avoid that you need to pass stringsAsFactors = FALSE to data.frame:
a <- c("AB", "BC", "AB", "BC", "AB", "BC")
b <- c(12, 23, 34, 45, 54, 65)
df <- data.frame(a, b)
str(df) # column a is a factor (R < 4.0.0)
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
str(df2 <- data.frame(a, b, stringsAsFactors = FALSE))
# 'data.frame': 6 obs. of 2 variables:
# $ a: chr "AB" "BC" "AB" "BC" ...
# $ b: num 12 23 34 45 54 65
str(change_to_factor(df2))
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
It may also be worth learning the tidyverse syntax, with which you can simply do
library(tidyverse)
df2 %>%
mutate_if(is.character, as.factor) %>%
str()
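As for why the original abc() seemed to do nothing: R functions work on a copy of their argument, so the loop only changes the local x, and since a for loop evaluates to NULL, the function returns nothing useful. A minimal sketch of the fix, keeping the original loop, is to return x and assign the result back (the sample data is the one from the question, with stringsAsFactors = FALSE set explicitly so it behaves the same on any R version):

```r
abc <- function(x) {
  for (i in names(x)) {
    if (is.character(x[[i]])) {
      x[[i]] <- as.factor(x[[i]])
    }
  }
  x  # return the modified copy; without this line the function returns NULL
}

a <- c("AB", "BC", "AB", "BC", "AB", "BC")
b <- c(12, 23, 34, 45, 54, 65)
df <- data.frame(a, b, stringsAsFactors = FALSE)

df <- abc(df)  # the caller has to store the returned data frame
str(df)
```

Calling abc(df) without the assignment would still leave df untouched, even with the return value in place.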

Related

Why does R convert numbers and characters to factors when coercing to data frame?

Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.character(paste(x)) or as.numeric(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't cbind, as it converts the data to a matrix, and a matrix can hold data of only one type, so the numbers are converted to characters.
2) Use data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in data.frame the default value of stringsAsFactors is TRUE, which converts characters to factors. The numbers also change to factors because in 1) they have been changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames don't automatically turn chr to fct. In a way, data frames are now more similar to tibbles.
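To check the cbind coercion yourself, here is a small sketch using the vectors from the question. It passes stringsAsFactors = FALSE explicitly, so it should give the same result on old and new R versions.

```r
nums  <- c(1, 2, 3, 4, 5)
chars <- c("A", "B", "C,", "D", "E")

# cbind on atomic vectors builds a matrix, which has a single type
m <- cbind(a = nums, b = chars)
typeof(m)          # the numbers have been turned into strings

# data.frame keeps each column's own type
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
sapply(df, class)
```

Inspecting typeof() on the intermediate object is a quick way to spot this kind of silent coercion before it reaches the data frame.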

How can I convert this list to a dataframe with same names in R?

I have the following data in R:
list0 <- list(ff = 45,gg = 23)
list1 <- list(a = 2, b=list0)
LIST <- list(mylist = list1)
I want to convert this list to a dataframe and get an output dataframe as follows, which has the following column header naming conventions:
a b.ff b.gg
1 2 45 23
any help is appreciated.
The LIST step was unnecessary:
> data.frame(list1)
a b.ff b.gg
1 2 45 23
vec <- unlist(LIST)
names(vec) <- sub("^mylist\\.", "", names(vec)) # escape the dot so it matches a literal "."
dt <- data.frame(as.list(vec))
dt
a b.ff b.gg
1 2 45 23
You can also use do.call with data.frame to construct the data.frame and include unname to drop the name of the first list level.
mydf <-do.call(data.frame, unname(LIST))
mydf
a b.ff b.gg
1 2 45 23
Make sure that the object has the desired structure.
str(mydf)
'data.frame': 1 obs. of 3 variables:
$ a : num 2
$ b.ff: num 45
$ b.gg: num 23

An Elegant way to change columns type in dataframe in R

I have a data.frame which contains columns of different types, such as integer, character, numeric, and factor.
I need to convert the integer columns to numeric for use in the next step of analysis.
Example: test.data includes 4 columns (though there are thousands in my real data set): age, gender, work.years, and name; age and work.years are integer, gender is factor, and name is character. What I need to do is change age and work.years into a numeric type. And I wrote one piece of code to do this.
test.data[sapply(test.data, is.integer)] <-lapply(test.data[sapply(test.data, is.integer)], as.numeric)
It works, but it does not look good enough. Is there a more elegant way to do this? Any creative method will be appreciated.
I think elegant code is sometimes subjective. For me, this is elegant but it may be less efficient compared to the OP's code. However, as the question is about elegant code, this can be used.
test.data[] <- lapply(test.data, function(x) if(is.integer(x)) as.numeric(x) else x)
Also, another elegant option is dplyr
library(dplyr)
library(magrittr)
test.data %<>%
mutate_each(funs(if(is.integer(.)) as.numeric(.) else .))
Now very elegant in dplyr (with magrittr %<>% operator)
test.data %<>% mutate_if(is.integer,as.numeric)
It's tasks like this that I think are best accomplished with explicit loops. You don't buy anything here by replacing a straightforward for-loop with the hidden loop of a function like lapply(). Example:
## generate data
set.seed(1L);
N <- 3L;
test.data <- data.frame(
  age = sample(20:90, N, TRUE),
  gender = factor(sample(c('M', 'F'), N, TRUE)),
  work.years = sample(1:5, N, TRUE),
  name = sample(letters, N, TRUE),
  stringsAsFactors = FALSE
);
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : int 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: int 5 4 4
## $ name : chr "b" "f" "e"
## solution
for (cn in names(test.data)[sapply(test.data,is.integer)])
test.data[[cn]] <- as.double(test.data[[cn]]);
## result
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : num 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: num 5 4 4
## $ name : chr "b" "f" "e"
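If the conversion is needed in more than one place, the approaches in this thread can be wrapped in a small helper. This is only a sketch: the name int_to_num is made up, and the sample data is typed in by hand rather than sampled so that it does not depend on the random number generator.

```r
# Hypothetical helper: convert every integer column to double, leave the rest
int_to_num <- function(df) {
  is_int <- vapply(df, is.integer, logical(1))
  df[is_int] <- lapply(df[is_int], as.numeric)
  df
}

test.data <- data.frame(
  age = c(38L, 46L, 60L),
  gender = factor(c("F", "M", "F")),
  work.years = c(5L, 4L, 4L),
  name = c("b", "f", "e"),
  stringsAsFactors = FALSE
)

out <- int_to_num(test.data)
sapply(out, class)
```

Factors are stored as integers internally, but is.integer() returns FALSE for them, so the gender column is left alone.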

R assign variable types to large data.frame from vector

I have a wide data.frame that is all character vectors (df1). I have a separate vector(vec1) that contains the column classes I'd like to assign to each of the columns in df1.
If I was using read.csv(), I'd use the colClasses argument and set it equal to vec1, but there doesn't appear to be a similar option for an existing data.frame.
Any suggestions for a fast way to do this besides a loop?
I have run into the same need many times, so I wrote a function for it, in case it helps:
reclass <- function(df, vec){
  df[] <- Map(function(x, f){
    #switch below shows the accepted values in the vector
    #you can modify it and/or add more
    f <- switch(f,
                as.is = 'force',
                factor = 'as.factor',
                num = 'as.numeric',
                char = 'as.character')
    #takes the name of the function and fetches the function
    f <- get(f)
    #apply the function
    f(x)
  },
  df,
  vec)
  df
}
It uses Map to pass in a vector of classes to the data.frame. Each element corresponds to the class of the column. The length of both the dataframe and the vector need to be the same.
I am using switch as well to make the corresponding classes shorter to type. Use as.is to keep the class the same, the rest are self explanatory I think.
Small example:
df1 <- data.frame(1:10, letters[1:10], runif(50))
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : int 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : num 0.0969 0.1957 0.8283 0.1768 0.9821 ...
And after the function:
df1 <- reclass(df1, c('num','as.is','char'))
> str(df1)
'data.frame': 50 obs. of 3 variables:
$ X1.10 : num [1:50] 1 2 3 4 5 6 7 8 9 10 ...
$ letters.1.10.: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ runif.50. : chr [1:50] "0.0968757788650692" "0.19566105119884" "0.828283685725182" "0.176784737734124" ...
I guess Map internally is a loop but it is written in C so it should be fast enough.
Maybe you could try this function, which does the same job.
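reclass <- function (df, vec_types) {
  # seq_len() is safe even for a data frame with zero columns
  for (i in seq_len(ncol(df))) {
    type <- vec_types[i]
    # assigning a basic class such as 'integer' or 'character' coerces the column
    class(df[[i]]) <- type
  }
  return(df)
}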
and this is an example of vec_types (vector of types):
vec_types <- c('character', rep('integer', 3), rep('character', 2))
You can test the function (reclass) with this data frame (table):
table <- data.frame(matrix(sample(1:10,30, replace = T), nrow = 5, ncol = 6))
str(table) # original column types
# apply the function
table <- reclass(table, vec_types)
str(table) # new column types
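Another possible variation, if the vector already holds real class names such as 'integer' or 'factor', is to look up the matching as.* function directly instead of maintaining a switch. This is a sketch under that assumption; the name reclass_by_name is made up, and it only works for classes that have an as.<class> function.

```r
# Hypothetical variant: classes must be names for which as.<name> exists
reclass_by_name <- function(df, classes) {
  stopifnot(length(classes) == ncol(df))
  df[] <- Map(function(col, cls) {
    convert <- match.fun(paste0("as.", cls))  # e.g. "integer" -> as.integer
    convert(col)
  }, df, classes)
  df
}

df_chr <- data.frame(
  x = c("1", "2"),
  y = c("a", "b"),
  z = c("1.5", "2.5"),
  stringsAsFactors = FALSE
)

out <- reclass_by_name(df_chr, c("integer", "factor", "numeric"))
str(out)
```

The df[] <- ... form keeps the object a data frame with its original column names, which is why both functions in this thread use it.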

re-convert data types in R

I have a subset of data within a large dataset that does not conform to the original data types assigned when the data was read into R. How can I re-convert the data types for the subset of data, just as R would do if only that subset was read?
Example: imagine that there is one stack of data consisting of variables 1-4 (v1 to v4) and a different set of data starting with column names v5 to v8.
V1 V2 V3 V4
1 32 a 11 a
2 12 b 32 b
3 3 c 42 c
4 v5 v6 v7 v8
5 a 43 a 35
6 b 33 b 64
7 c 55 c 32
If I create a new df with v5-v8, how can I automatically "re-convert" the entire data to appropriate types? (Just as R would do if I were to re-read the data from a csv)
You could try type.convert
df1 <- df[1:3,]
str(df1)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: chr "32" "12" "3"
# $ V2: chr "a" "b" "c"
# $ V3: chr "11" "32" "42"
# $ V4: chr "a" "b" "c"
df1[] <- lapply(df1, type.convert)
str(df1)
#'data.frame': 3 obs. of 4 variables:
#$ V1: int 32 12 3
#$ V2: Factor w/ 3 levels "a","b","c": 1 2 3
#$ V3: int 11 32 42
#$ V4: Factor w/ 3 levels "a","b","c": 1 2 3
To subset the dataset, you could use grep (as @Richard Scriven mentioned in the comments)
indx <- grep('^v', df[,1])
df2 <- df[(indx+1):nrow(df),]
df2[] <- lapply(df2, type.convert)
If your dataset has many instances where this occurs, split it on a grouping index (indx1) created with grepl, after removing the header rows (indx), and do the type.convert within the resulting list.
indx1 <- cumsum(grepl('^v', df[,1]))+1
lst <- lapply(split(df[-indx,],indx1[-indx]), function(x) {
x[] <- lapply(x, type.convert)
x})
Then, if you need to cbind the columns (assuming the number of rows is the same for all the list elements):
dat <- do.call(cbind, lst)
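Since df itself is never defined in the thread, here is a self-contained sketch of the split step. The data frame is typed in from the table in the question, which is an assumption about how the data looks after being read in as text. as.is = TRUE is passed to type.convert explicitly, because newer R versions expect the caller to supply it; with that setting text columns stay character, and as.is = FALSE would give factors as in the output shown earlier.

```r
# The question's table, read in as all-character columns (assumed layout)
df <- data.frame(
  V1 = c("32", "12", "3", "v5", "a", "b", "c"),
  V2 = c("a", "b", "c", "v6", "43", "33", "55"),
  V3 = c("11", "32", "42", "v7", "a", "b", "c"),
  V4 = c("a", "b", "c", "v8", "35", "64", "32"),
  stringsAsFactors = FALSE
)

indx  <- grep("^v", df[, 1])                  # rows holding embedded headers
indx1 <- cumsum(grepl("^v", df[, 1])) + 1     # block number for every row

# Drop the header rows, split by block, re-detect types inside each block
lst <- lapply(split(df[-indx, ], indx1[-indx]), function(x) {
  x[] <- lapply(x, type.convert, as.is = TRUE)
  x
})
str(lst)
```

Each list element is then a data frame whose columns have been typed independently of the other blocks.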

Resources