I have a subset of data within a large dataset that does not conform to the original data types assigned when the data was read into R. How can I re-convert the data types for the subset of data, just as R would do if only that subset was read?
Example: imagine that there is one stack of data consisting of variables 1-4 (v1 to v4) and a different set of data starting with column names v5 to v8.
V1 V2 V3 V4
1 32 a 11 a
2 12 b 32 b
3 3 c 42 c
4 v5 v6 v7 v8
5 a 43 a 35
6 b 33 b 64
7 c 55 c 32
If I create a new df with v5-v8, how can I automatically "re-convert" the entire data to appropriate types? (Just as R would do if I were to re-read the data from a csv)
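You could try type.convert. On recent versions of R, pass as.is explicitly: as.is = TRUE keeps strings as character, while as.is = FALSE converts them to factors as in the output below.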
df1 <- df[1:3,]
str(df1)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: chr "32" "12" "3"
# $ V2: chr "a" "b" "c"
# $ V3: chr "11" "32" "42"
# $ V4: chr "a" "b" "c"
df1[] <- lapply(df1, type.convert)
str(df1)
#'data.frame': 3 obs. of 4 variables:
#$ V1: int 32 12 3
#$ V2: Factor w/ 3 levels "a","b","c": 1 2 3
#$ V3: int 11 32 42
#$ V4: Factor w/ 3 levels "a","b","c": 1 2 3
To subset the dataset, you could use grep (as @Richard Scriven mentioned in the comments):
indx <- grep('^v', df[,1])
df2 <- df[(indx+1):nrow(df),]
df2[] <- lapply(df2, type.convert)
If your dataset has many instances where this occurs, split the dataset based on a grouping index (indx1) created with grepl, after removing the header rows (indx), and apply type.convert to each element of the resulting list.
indx1 <- cumsum(grepl('^v', df[,1]))+1
lst <- lapply(split(df[-indx,],indx1[-indx]), function(x) {
x[] <- lapply(x, type.convert)
x})
Then, if you need to cbind the columns (assuming that the number of rows is the same for all the list elements):
dat <- do.call(cbind, lst)
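The steps above can be put together in one self-contained sketch. The data frame below is a hypothetical reconstruction of the example in the question, and the names is_header, block and blocks are illustrative only. It passes as.is = TRUE so that strings stay character instead of becoming factors, and it reuses each embedded header row as the column names of the block that follows it, which the original answer does not do.

```r
# Hypothetical reconstruction of the stacked data from the question:
# every column is character because the two blocks have mixed types.
df <- data.frame(
  V1 = c("32", "12", "3", "v5", "a", "b", "c"),
  V2 = c("a", "b", "c", "v6", "43", "33", "55"),
  V3 = c("11", "32", "42", "v7", "a", "b", "c"),
  V4 = c("a", "b", "c", "v8", "35", "64", "32"),
  stringsAsFactors = FALSE
)

# Rows whose first column looks like an embedded header ("v5", ...).
is_header <- grepl("^v", df[[1]])

# Block id: increases by one at each embedded header row.
block <- cumsum(is_header) + 1

# Drop the header rows, split by block, and re-infer the types per block.
blocks <- lapply(split(df[!is_header, ], block[!is_header]), function(x) {
  x[] <- lapply(x, type.convert, as.is = TRUE)
  x
})

# Reuse each embedded header row as the column names of the next block.
header_rows <- which(is_header)
for (k in seq_along(header_rows)) {
  names(blocks[[k + 1]]) <- unlist(df[header_rows[k], ], use.names = FALSE)
}

str(blocks)
```

Each element of blocks is then a data frame whose columns have the types R would have inferred had that block been read on its own.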
Related
Recently I have come across a problem where my data has been converted to factors.
This is a large nuisance, as it's not (always) easily picked up on.
I am aware that I can convert them back with solutions such as as.character(paste(x)), but that seems really unnecessary.
Example code:
nums <- c(1,2,3,4,5)
chars <- c("A","B","C,","D","E")
str(nums)
#> num [1:5] 1 2 3 4 5
str(chars)
#> chr [1:5] "A" "B" "C," "D" "E"
df <- as.data.frame(cbind(a = nums, b = chars))
str(df)
#> 'data.frame': 5 obs. of 2 variables:
#> $ a: Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5
#> $ b: Factor w/ 5 levels "A","B","C,","D",..: 1 2 3 4 5
1) Don't use cbind: it converts the data to a matrix, and a matrix can hold data of only one type, so the numbers are converted to characters.
2) Use data.frame rather than as.data.frame, because as.data.frame(a = nums, b = chars) returns an error.
3) Use stringsAsFactors = FALSE, because in data.frame the default value of stringsAsFactors is TRUE, which converts characters to factors. The numbers also change to factors because in 1) they have been changed to characters.
df <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
str(df)
#'data.frame': 5 obs. of 2 variables:
# $ a: num 1 2 3 4 5
# $ b: chr "A" "B" "C," "D" ...
EDIT: As of R 4.0.0, the default value of stringsAsFactors has changed to FALSE.
This should no longer happen if you have updated R: data frames no longer turn chr into fct automatically. In this respect, data frames are now more similar to tibbles.
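The three points above can be checked with a short sketch that behaves the same on old and new versions of R, because stringsAsFactors and as.is are passed explicitly. The names m, df_ok and df_bad are illustrative only. The last step is an addition to the answer: it shows one possible way to repair a data frame that has already gone through cbind.

```r
nums  <- c(1, 2, 3, 4, 5)
chars <- c("A", "B", "C", "D", "E")

# cbind() on atomic vectors builds a matrix, which holds a single type,
# so the numbers are coerced to character before any data frame exists.
m <- cbind(a = nums, b = chars)
typeof(m)

# Building the data frame directly keeps each column's own type.
df_ok <- data.frame(a = nums, b = chars, stringsAsFactors = FALSE)
sapply(df_ok, class)

# If the damage is already done, type.convert() can re-infer the types.
df_bad <- as.data.frame(m, stringsAsFactors = FALSE)
df_bad[] <- lapply(df_bad, type.convert, as.is = TRUE)
sapply(df_bad, class)
```

Note that type.convert infers integer, not double, for whole numbers stored as text, so the repaired column is not identical to the original nums.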
I am trying to change the data type of my variables in data frame to 'factor' if they are 'character'. I have tried to replicate the problem using sample data as below
a <- c("AB","BC","AB","BC","AB","BC")
b <- c(12,23,34,45,54,65)
df <- data.frame(a,b)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
I wrote the below function to achieve that
abc <- function(x) {
for(i in names(x)){
if(is.character(x[[i]])) {
x[[i]] <- as.factor(x[[i]])
}
}
}
The function runs without error when I pass the data frame (df), but it still doesn't change the 'character' columns to 'factor'.
abc(df)
str(df)
'data.frame': 6 obs. of 2 variables:
$ a: chr "AB" "BC" "AB" "BC" ...
$ b: num 12 23 34 45 54 65
NOTE: It works perfectly with a plain for loop and if condition. The problem only appears when I try to generalize it by wrapping it in a function.
Please help. What am I missing?
Besides the comment from @Roland, you should make use of R's nice indexing possibilities and learn about the *apply family. With that you can rewrite your code to
change_to_factor <- function(df_in) {
chr_ind <- vapply(df_in, is.character, logical(1))
df_in[, chr_ind] <- lapply(df_in[, chr_ind, drop = FALSE], as.factor)
df_in
}
Explanation
vapply loops over all elements of a list, applies a function to each element and returns a value of the given type (here a boolean logical(1)). Since in R data frames are in fact lists where each (list) element is required to be of the same length, you can conveniently loop over all the columns of the data frame and apply the function is.character to each column. vapply then returns a boolean (logical) vector with TRUE/FALSE values depending on whether the column was a character column or not.
You can then use this boolean vector to subset your data frame to look only at columns which are character columns.
lapply is yet another member of the *apply family; it loops through list elements and returns a list. We now loop over the character columns, apply as.factor to each of them, and get back a list, which we conveniently store in the original positions in the data frame.
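To make the explanation concrete, here is a small sketch of what the intermediate logical vector looks like. The two-row data frame is made up for illustration, and stringsAsFactors = FALSE is passed explicitly so the result does not depend on the R version.

```r
df <- data.frame(a = c("AB", "BC"), b = c(12, 23), stringsAsFactors = FALSE)

# One TRUE/FALSE per column, named after the columns.
chr_ind <- vapply(df, is.character, logical(1))
chr_ind

# The logical vector can index the columns directly.
names(df)[chr_ind]

# lapply() over the selected columns returns a list of converted columns.
converted <- lapply(df[, chr_ind, drop = FALSE], as.factor)
str(converted)
```

Unlike sapply, vapply fails with an error if the function ever returns something other than a single logical value, which makes it the safer choice inside functions.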
By the way, if you look at str(df) you will see that column a is already a factor. This is because data.frame (in R versions before 4.0.0) automatically converts character columns to factors. To avoid that you need to pass stringsAsFactors = FALSE to data.frame:
a <- c("AB", "BC", "AB", "BC", "AB", "BC")
b <- c(12, 23, 34, 45, 54, 65)
df <- data.frame(a, b)
str(df) # column a is a factor
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
str(df2 <- data.frame(a, b, stringsAsFactors = FALSE))
# 'data.frame': 6 obs. of 2 variables:
# $ a: chr "AB" "BC" "AB" "BC" ...
# $ b: num 12 23 34 45 54 65
str(change_to_factor(df2))
# 'data.frame': 6 obs. of 2 variables:
# $ a: Factor w/ 2 levels "AB","BC": 1 2 1 2 1 2
# $ b: num 12 23 34 45 54 65
It may also be worth learning the tidyverse syntax, with which you can simply do
library(tidyverse)
df2 %>%
mutate_if(is.character, as.factor) %>%
str()
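For comparison, the loop from the question also works once two things are fixed: the function has to return the modified data frame, and the caller has to assign the result. This is presumably what the comment referenced above pointed out, though that is an assumption since the comment is not quoted here. The function name chr_to_factor_loop is illustrative only.

```r
# Same loop as in the question, but the modified copy is returned.
chr_to_factor_loop <- function(x) {
  for (i in names(x)) {
    if (is.character(x[[i]])) {
      x[[i]] <- as.factor(x[[i]])
    }
  }
  x  # without this line the function returns NULL, the value of the for loop
}

df <- data.frame(
  a = c("AB", "BC", "AB", "BC", "AB", "BC"),
  b = c(12, 23, 34, 45, 54, 65),
  stringsAsFactors = FALSE
)

# The function works on a copy of df, so the result must be assigned back.
df <- chr_to_factor_loop(df)
str(df)
```

Calling chr_to_factor_loop(df) without the assignment leaves df untouched, which is exactly the behaviour observed in the question.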
E.g.
chr <- c("a", "b", "c")
intgr <- c(1, 2, 3)
str(chr)
str(base::merge(chr,intgr, stringsAsFactors = FALSE))
gives:
> str(base::merge(chr,intgr, stringsAsFactors = FALSE))
'data.frame': 9 obs. of 2 variables:
$ x: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3
$ y: num 1 1 1 2 2 2 3 3 3
I originally thought it has something to do with how merge coerces arguments into data frames. However, I thought that adding the argument stringsAsFactors = FALSE would override the default coercion behaviour of char -> factor, yet this is not working.
EDIT: Doing the following gives me expected behaviour:
options(stringsAsFactors = FALSE)
str(base::merge(chr,intgr))
that is:
> str(base::merge(chr,intgr))
'data.frame': 9 obs. of 2 variables:
$ x: chr "a" "b" "c" "a" ...
$ y: num 1 1 1 2 2 2 3 3 3
but this is not ideal as it changes the global stringsAsFactors setting.
You can accomplish this particular "merge" using expand.grid(), since you're really just taking the cartesian product. This allows you to pass the stringsAsFactors argument:
sapply(expand.grid(x = chr, y = intgr, stringsAsFactors = FALSE), class)
## x y
## "character" "numeric"
Here's a way of working around this limitation of merge():
sapply(merge(data.frame(x = chr, stringsAsFactors = FALSE), intgr), class)
## x y
## "character" "numeric"
I would argue that it never makes sense to pass an atomic vector to merge(), since it is only really designed for merging data.frames.
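Following that advice, a minimal sketch is to wrap both vectors in data frames yourself, so the conversion happens under your control and not inside merge(). As far as I can tell, merge() converts atomic vectors with as.data.frame() and does not forward stringsAsFactors to that step, which would explain why the argument had no effect in the question; treat that as an assumption. The names left, right and res are illustrative.

```r
chr   <- c("a", "b", "c")
intgr <- c(1, 2, 3)

# Build both inputs explicitly so the column types are decided here.
left  <- data.frame(x = chr, stringsAsFactors = FALSE)
right <- data.frame(y = intgr)

# With no columns in common, merge() returns the Cartesian product.
res <- merge(left, right, by = NULL)
str(res)
```

This keeps the global stringsAsFactors option untouched and gives the same nine combinations as expand.grid().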
We can use CJ from data.table as well
library(data.table)
str(CJ(chr, intgr))
#Classes ‘data.table’ and 'data.frame': 9 obs. of 2 variables:
#$ V1: chr "a" "a" "a" "b" ...
#$ V2: num 1 2 3 1 2 3 1 2 3
I have a data.frame which contains columns of different types, such as integer, character, numeric, and factor.
I need to convert the integer columns to numeric for use in the next step of analysis.
Example: test.data includes 4 columns (though there are thousands in my real data set): age, gender, work.years, and name; age and work.years are integer, gender is factor, and name is character. What I need to do is change age and work.years into a numeric type. And I wrote one piece of code to do this.
test.data[sapply(test.data, is.integer)] <-lapply(test.data[sapply(test.data, is.integer)], as.numeric)
It works, but it does not look good enough. So I am wondering whether there is a more elegant way to do this. Any creative method will be appreciated.
I think elegant code is sometimes subjective. For me, this is elegant but it may be less efficient compared to the OP's code. However, as the question is about elegant code, this can be used.
test.data[] <- lapply(test.data, function(x) if(is.integer(x)) as.numeric(x) else x)
Another elegant option is dplyr (note that mutate_each() and funs() have since been deprecated in dplyr):
library(dplyr)
library(magrittr)
test.data %<>%
mutate_each(funs(if(is.integer(.)) as.numeric(.) else .))
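Now very elegant in dplyr (with the magrittr %<>% operator). In dplyr 1.0.0 and later, mutate(across(where(is.integer), as.numeric)) is the recommended form of the same operation.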
test.data %<>% mutate_if(is.integer,as.numeric)
It's tasks like this that I think are best accomplished with explicit loops. You don't buy anything here by replacing a straightforward for-loop with the hidden loop of a function like lapply(). Example:
## generate data
set.seed(1L);
N <- 3L
test.data <- data.frame(
  age = sample(20:90, N, TRUE),
  gender = factor(sample(c('M', 'F'), N, TRUE)),
  work.years = sample(1:5, N, TRUE),
  name = sample(letters, N, TRUE),
  stringsAsFactors = FALSE
)
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : int 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: int 5 4 4
## $ name : chr "b" "f" "e"
## solution
for (cn in names(test.data)[sapply(test.data,is.integer)])
test.data[[cn]] <- as.double(test.data[[cn]]);
## result
test.data;
## age gender work.years name
## 1 38 F 5 b
## 2 46 M 4 f
## 3 60 F 4 e
str(test.data);
## 'data.frame': 3 obs. of 4 variables:
## $ age : num 38 46 60
## $ gender : Factor w/ 2 levels "F","M": 1 2 1
## $ work.years: num 5 4 4
## $ name : chr "b" "f" "e"
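The answers above all follow one pattern: find the columns that satisfy a predicate, then apply a conversion function to just those columns. A minimal base R sketch of that pattern as a reusable helper follows. The helper name convert_cols and its arguments are hypothetical, and the example data is typed in by hand so it does not depend on random sampling.

```r
# Hypothetical helper: apply `fun` to every column for which `pred` is TRUE.
convert_cols <- function(df, pred, fun) {
  hit <- vapply(df, pred, logical(1))
  if (any(hit)) {
    df[hit] <- lapply(df[hit], fun)
  }
  df
}

test.data <- data.frame(
  age = c(38L, 46L, 60L),
  gender = factor(c("F", "M", "F")),
  work.years = c(5L, 4L, 4L),
  name = c("b", "f", "e"),
  stringsAsFactors = FALSE
)

# Integer columns become double; factor and character columns are untouched.
test.data <- convert_cols(test.data, is.integer, as.numeric)
str(test.data)
```

The same helper covers the character-to-factor case with convert_cols(df, is.character, as.factor). Factors are not affected by the integer conversion because is.integer() returns FALSE for them.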
I have a list of four data frames. Each data frame has the same first column, person.id (a unique key within each data frame), which I want to pad with leading zeros.
ISSUE:
The code runs but outputs to the Console and doesn't change the actual data frames in the list.
EXAMPLE DATA:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
lapply(dataList, function(i){
i$person.id <- str_pad(i$person.id, 6, pad = "0")
})
# Console output pads the zeros (not expected):
[[1]]
[1] "003200" "003201" "003202" "003203" "003204" "003205" "003206" "003207" "003208"
[10] "003209" "003210" "003211" "003212" "003213" "003214"
# Data Frames in list return with no change:
> dataList[[1]]$person.id
[1] 3200 3201 3202 3203 3204 3205 3206 3207 3208 3209 3210 3211 3212 3213 3214
How do I apply the change to the column named person.id in every data frame in my list?
What I want is padded zeros in every data frame in my list:
> dataList[[1]]$person.id
[1] 003200 003201 003202 003203 003204 003205 003206 003207 003208
[10] 003209 003210 003211 003212 003213 003214
The function you lapply needs to return the full data frame. The function you used just returns the result of the assignment, which is only the values for the column, not the entire data frame. You also need to save the result. Here we use transform as the function as it modifies a data frame, and use the person.id argument to modify the person.id column (see ?transform):
library(stringr)
df.pad <- lapply(dataList, transform, person.id=str_pad(person.id, 6, pad = "0"))
Then df.pad[[1]] produces:
person.id letter
1 003200 a
2 003201 b
3 003202 c
4 003203 d
5 003204 e
6 003205 f
7 003206 g
8 003207 h
9 003208 i
10 003209 j
11 003210 k
12 003211 l
13 003212 m
14 003213 n
15 003214 o
You need to return the data frame because R is not an assign-by-reference language. Your assignments to i in lapply just modify the local copy of i, not the data frames in dataList in the global environment. If you want dataList to be modified you can substitute dataList for df.pad in the above expression, which will result in dataList being overwritten with a new version of it containing the modified data frames.
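The same idea can be sketched in base R alone, overwriting dataList as described above. This version swaps str_pad for sprintf(), which is not what the answer uses; it assumes person.id is an integer column, as it is in the example data. The list is shortened to two data frames for brevity.

```r
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
dataList <- list(df1, df2)

# Each call modifies a local copy and returns it; assigning the result
# back to dataList is what makes the change visible outside lapply().
dataList <- lapply(dataList, function(d) {
  d$person.id <- sprintf("%06d", d$person.id)  # zero-pad to width 6
  d
})

dataList[[1]]$person.id[1:3]
```

Note that the padded ids are character strings; converting them back to numbers would drop the leading zeros again.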
You made the assignment to a column but a) did not return the data frames, and b) did not assign the results to a new name. (Welcome to functional programming. Running a function on an object does not change the original object.) All you got back were the padded values of that column, not the data frames. A corrected version:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
library(stringr)
newList <- lapply(dataList, function(i){
i$person.id <- str_pad(i$person.id, 6, pad = "0"); return(i)
})
> str(newList)
List of 4
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "003200" "003201" "003202" "003203" ...
..$ letter : Factor w/ 15 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "004100" "004101" "004102" "004103" ...
..$ letter : Factor w/ 15 levels "h","i","j","k",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "004300" "004301" "004302" "004303" ...
..$ letter : Factor w/ 15 levels "j","k","l","m",..: 1 2 3 4 5 6 7 8 9 10 ...
$ :'data.frame': 15 obs. of 2 variables:
..$ person.id: chr [1:15] "005500" "005501" "005502" "005503" ...
..$ letter : Factor w/ 15 levels "e","f","g","h",..: 1 2 3 4 5 6 7 8 9 10 ...
The pad function in the package qdapTools can do this:
df1 <- data.frame(person.id = 3200:3214, letter = letters[1:15])
df2 <- data.frame(person.id = 4100:4114, letter = letters[8:22])
df3 <- data.frame(person.id = 4300:4314, letter = letters[10:24])
df4 <- data.frame(person.id = 5500:5514, letter = letters[5:19])
dataList <- list(df1, df2, df3, df4)
library(qdapTools)
lapply(dataList, function(x) {x[["person.id"]] <- pad(x[["person.id"]], 6);x})