The title is not quite cool - I apologise that I was not able to summarise the question better. I am conceptually a bit lost and wondered if there is a better approach for the following:
What I have:
I have two columns, ID and eye. Eyes can be coded as "r", "l" or "b" (right/ left/ both eyes). It does not have to contain all values, and it can include NA.
What I want:
I want to count the number of distinct eyes by ID. If "b" occurs, "r" and "l" for the same ID should not be counted separately (because the right and the left eye are already part of "both eyes").
Ideally base R only:
My approach uses base R only, and I would much prefer a base R solution, because this is intended for a package. (Actually, the core of this function is already part of a package, but I wonder if this can be improved).
Other solutions very welcome:
The final function will also be applied to data frames with 10^6 rows and thousands of IDs, so the computation by group needs to be fast. My solution already seems fairly fast (although I have not benchmarked it formally). I therefore suspect that a dplyr::group_by solution would not be an option (at least not in my attempts).
# sample data
set.seed(42)
id <- letters[sample(11, replace = TRUE)]
foo1 <- data.frame(id, eye = sample(c("r", "l", "b"), 11, replace = TRUE))
foo2 <- data.frame(id, eye = "r")
foo3 <- foo2
foo3$eye[1:5] <- NA
foo4 <- data.frame(id, eye = "b")
count_eyes <- function(x, pat_col, eye) {
  # reduce to unique combinations of patient and eye, then count occurrence of
  # "eye" by patient. Results in matrix of 0/1
  eye_tab <- table(unique(x[, c(pat_col, eye)]))
  # cases where "b" does not exist must also work (foo2 and foo3)
  if (any(grepl("b", colnames(eye_tab)))) {
    # whenever "b" is present, "r" and "l" will be set to 0,
    # so it will not be counted in the next step
    # "r" and "l" might not occur
    if (any(grepl("r|l", colnames(eye_tab)))) {
      eye_tab[, c("r","l")][eye_tab[, "b"] == 1] <- 0
    }
  }
  # I chose the programmatic approach because the column names might not be present
  # I add all 1 for each column. Because r is set to 0 previously, I have to
  # add the count for b again to get the real number of right eyes.
  n_b <- unname(colSums(eye_tab[, colnames(eye_tab) == "b", drop = FALSE]))
  n_right <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "r", drop = FALSE])), n_b)
  n_left <- sum(unname(colSums(eye_tab[, colnames(eye_tab) == "l", drop = FALSE])), n_b)
  c(r = n_right, l = n_left)
}
expected result
lapply(mget(c("foo1", "foo2", "foo3", "foo4")), count_eyes, pat_col = "id", eye = "eye")
#> $foo1
#> r l
#> 7 6
#>
#> $foo2
#> r l
#> 8 0
#>
#> $foo3
#> r l
#> 6 0
#>
#> $foo4
#> r l
#> 8 8
The code could be shortened if we convert the column to factor with levels specified
count_eyes <- function(x, pat_col, eye) {
  nm1 <- c('r', 'l')
  x$eye <- factor(x$eye, levels = c("b", nm1)) # convert to factor
  # reduce to unique combinations of patient and eye, then count occurrence of
  # "eye" by patient. Results in matrix of 0/1
  eye_tab <- table(unique(x[, c(pat_col, eye)]))
  # cases where "b" does not exist must also work (foo2 and foo3)
  if (any(grepl("b", colnames(eye_tab)))) {
    # whenever "b" is present, "r" and "l" will be set to 0,
    # so it will not be counted in the next step
    # "r" and "l" might not occur
    if (any(grepl(paste(nm1, collapse="|"), colnames(eye_tab)))) {
      eye_tab[, nm1][eye_tab[, "b"] == 1] <- 0
    }
  }
  out <- colSums(eye_tab)
  out[nm1] + out['b']
}
-testing
lapply(mget(paste0('foo', 1:4)), count_eyes, pat_col = "id", eye = "eye")
#$foo1
#r l
#7 6
#$foo2
#r l
#8 0
#$foo3
#r l
#6 0
#$foo4
#r l
#8 8
Here's another approach with split and rowSums:
count_eyes <- function(x , pat_col, eye){
rowSums(sapply(split(subset(x,select = eye),
subset(x,select = pat_col)),
function(y){c(r = any(y %in% c("b", "r")),
l = any(y %in% c("b", "l")))
}))}
lapply(mget(ls(pattern="foo")),count_eyes, "id", "eye")
$foo1
r l
5 4
$foo2
r l
6 0
$foo3
r l
4 0
$foo4
r l
6 6
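For comparison, here is a minimal base R sketch of the same counting rule without building a table. It rests on the observation that "number of right eyes" equals the number of distinct IDs that have at least one "r" or "b" row (and likewise "l" or "b" for left eyes). The function name count_eyes_sketch and the toy data frame are hypothetical and only for illustration; it has not been benchmarked against the versions above.

```r
# Hypothetical helper: counts distinct IDs with a right / left eye.
# Assumes `pat_col` and `eye` are column names given as strings.
count_eyes_sketch <- function(x, pat_col, eye) {
  ids <- x[[pat_col]]
  e <- x[[eye]]
  # %in% returns FALSE for NA, so rows with a missing eye are ignored
  c(r = length(unique(ids[e %in% c("r", "b")])),
    l = length(unique(ids[e %in% c("l", "b")])))
}

# Toy data (made up for this sketch)
toy <- data.frame(id  = c("a", "a", "b", "c", "c"),
                  eye = c("r", "b", "l", NA, "r"))
res <- count_eyes_sketch(toy, "id", "eye")
res
```

Note that a missing ID would be counted as its own group here, so rows with NA in the ID column may need to be dropped first.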
I have a data frame (called hp) that contains several columns with NAs. These columns are factors. I want to convert them to character, replace the NAs with "none", and convert them back to factor. I have 14 columns, so I would like to do this in a loop, but it doesn't work.
Thanks for your help.
The columns:
miss_names<-c("Alley","MasVnrType","FireplaceQu","PoolQC","Fence","MiscFeature","GarageFinish", "GarageQual","GarageCond","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1",
"BsmtFinType2","Electrical")
The loop:
for (i in miss_names){
hp[i]<-as.character(hp[i])
hp[i][is.na(hp[i])]<-"NONE"
hp[i]<-as.factor(hp[i])
print(hp[i])
}
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?
Use addNA() to add NA as a factor level and then replace that level with whatever you want. You don't have to turn the factors into a character vector first. You can loop over all the factors in the data frame and replace them one by one.
# Sample data
dd <- data.frame(
x = sample(c(NA, letters[1:3]), 20, replace = TRUE),
y = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors = TRUE # needed in R >= 4.0.0, where strings are no longer factors by default
)
# Loop over the columns
for (i in seq_along(dd)) {
xx <- addNA(dd[, i])
levels(xx) <- c(levels(dd[, i]), "none")
dd[, i] <- xx
}
This gives us
> str(dd)
'data.frame': 20 obs. of 2 variables:
$ x: Factor w/ 4 levels "a","b","c","none": 1 4 1 4 4 1 4 3 3 3 ...
$ y: Factor w/ 4 levels "A","B","C","none": 1 1 2 2 1 3 3 3 4 1 ...
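The same idea can be restricted to a named subset of columns, which is closer to what the question asks for. The following is only a sketch: the small hp data frame and its values are made up, and miss_names is shortened to two of the asker's column names. It uses [[ ]] to extract each column as a vector, which is what the loop in the question was missing.

```r
# Hypothetical stand-in for the asker's `hp` data frame
hp <- data.frame(
  Alley   = factor(c("Grvl", NA, "Pave")),
  Fence   = factor(c(NA, "MnPrv", NA)),
  LotArea = c(8450, 9600, 11250)
)
miss_names <- c("Alley", "Fence")

for (nm in miss_names) {
  f <- addNA(hp[[nm]])                    # NA becomes an explicit level
  levels(f)[is.na(levels(f))] <- "none"   # rename that level
  hp[[nm]] <- f
}
str(hp)
```

Columns not listed in miss_names (here LotArea) are left untouched.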
An alternative solution using the purrr library, with the same data as @Johan Larsson:
library(purrr)
set.seed(15)
dd <- data.frame(
x = sample(c(NA, letters[1:3]), 20, replace = TRUE),
y = sample(c(NA, LETTERS[1:3]), 20, replace = TRUE),
stringsAsFactors = TRUE) # needed in R >= 4.0.0
# Create a function to convert NA to none
convert.to.none <- function(x){
y <- addNA(x)
levels(y) <- c(levels(x), "none")
x <- y
return(x) }
# use the map function to cycle through dd's columns
map_df(dd, convert.to.none)
Allows for scaling of your work.
In Python, scikit-learn has a great class called LabelEncoder that maps categorical levels (strings) to an integer representation.
Is there anything in R to do this? For example if there is a variable called color with values {'Blue','Red','Green'} the encoder would translate:
Blue => 1
Green => 2
Red => 3
and create an object with this mapping to then use for transforming new data in a similar fashion.
Add:
It doesn't seem like just factors will work because there is no persisting of the mapping. If the new data has an unseen level from the training data, the entire structure changes. Ideally I would like the new levels labeled missing or 'other' somehow.
sample_dat <- data.frame(a_str=c('Red','Blue','Blue','Red','Green'))
sample_dat$a_int<-as.integer(as.factor(sample_dat$a_str))
sample_dat$a_int
#[1] 3 1 1 3 2
sample_dat2 <- data.frame(a_str=c('Red','Blue','Blue','Red','Green','Azure'))
sample_dat2$a_int<-as.integer(as.factor(sample_dat2$a_str))
sample_dat2$a_int
# [1] 4 2 2 4 3 1
Create your vector of data:
colors <- c("red", "red", "blue", "green")
Create a factor:
factors <- factor(colors)
Convert the factor to numbers:
as.numeric(factors)
Output: (note that this is in alphabetical order)
# [1] 3 3 1 2
You can also set a custom numbering system: (note that the output now follows the "rainbow color order" that I defined)
rainbow <- c("red","orange","yellow","green","blue","purple")
ordered <- factor(colors, levels = rainbow)
as.numeric(ordered)
# [1] 1 1 5 4
See ?factor.
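To address the "unseen level" part of the question with plain factors, one option is to store the levels learned from the training data and reuse them when encoding new data. The sketch below assumes that unseen values should all share one extra "other" code; the variable names are made up for illustration.

```r
train <- c("Red", "Blue", "Blue", "Red", "Green")
lev <- sort(unique(train))        # persisted mapping: position = code

new_dat <- c("Red", "Blue", "Azure")
codes <- as.integer(factor(new_dat, levels = lev))  # unseen values become NA
codes[is.na(codes)] <- length(lev) + 1L             # assumed code for "other"
codes
```

Because the levels are fixed up front, adding "Azure" no longer shifts the codes of the known colors. Genuinely missing values (NA) in the new data would also end up in the "other" code with this approach.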
Try CatEncoders package. It replicates the Python sklearn.preprocessing functionality.
library(CatEncoders)
# variable to encode values
colors = c("red", "red", "blue", "green")
lab_enc = LabelEncoder.fit(colors)
# new values are transformed to NA
values = transform(lab_enc, c('red', 'red', 'yellow'))
values
# [1] 3 3 NA
# doing the inverse: given the encoded numbers return the labels
inverse.transform(lab_enc, values)
# [1] "red" "red" NA
I would add the functionality of reporting the non-matching labels with a warning.
PS: It also has the OneHotEncoder function.
If I understand correctly what you want:
# function which returns function which will encode vectors with values of 'vec'
label_encoder = function(vec){
levels = sort(unique(vec))
function(x){
match(x, levels)
}
}
colors = c("red", "red", "blue", "green")
color_encoder = label_encoder(colors) # create encoder
encoded_colors = color_encoder(colors) # encode colors
encoded_colors
new_colors = c("blue", "green", "green") # new vector
encoded_new_colors = color_encoder(new_colors)
encoded_new_colors
other_colors = c("blue", "green", "green", "yellow")
color_encoder(other_colors) # NA's are introduced
# save and restore to disk
saveRDS(color_encoder, "color_encoder.RDS")
c_encoder = readRDS("color_encoder.RDS")
c_encoder(colors) # same result
# dealing with multiple columns
# create data.frame
set.seed(123) # make result reproducible
color_dataframe = as.data.frame(
matrix(
sample(c("red", "blue", "green", "yellow"), 12, replace = TRUE),
ncol = 3)
)
color_dataframe
# encode each column
for (column in colnames(color_dataframe)){
color_dataframe[[column]] = color_encoder(color_dataframe[[column]])
}
color_dataframe
I wrote the following which I think works, the efficiency of which and/or how it will scale is not yet tested
str2Int.fit_transform<-function(df, plug_missing=TRUE){
list_of_levels=list() #empty list
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels<-unique(df[,i]) #distinct levels
list_of_levels[[colnames(df)[i]]] <- levels #set list with name of the columns to the levels
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}#end if character/factor IF
}#end loop
return (list(list_of_levels,df)) #return the list of levels and the new DF
}#end of function
str2Int.transform<-function(df,list_of_levels,plug_missing=TRUE)
{
#loop through the columns
for (i in 1: ncol(df))
{
#only
if (is.character(df[,i]) || is.factor(df[,i]) ){
#deal with missing
if(plug_missing){
#if factor
if (is.factor(df[,i])){
df[,i] = factor(df[,i], levels=c(levels(df[,i]), 'MISSING'))
df[,i][is.na(df[,i])] = 'MISSING'
}else{ #if character
df[,i][is.na(df[,i])] = 'MISSING'
}
}#end missing IF
levels=list_of_levels[[colnames(df)[i]]]
if (! is.null(levels)){
df[,i] <- as.numeric(factor(df[,i], levels = levels))
}
}# character or factor
}#end of loop
return(df)
}#end of function
######################################################
# Test the functions
######################################################
###Test fit transform
# as strings
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=FALSE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
#as factors
sample_dat <- data.frame(a_fact=c('Red','Blue','Blue',NA,'Green'), a_int=c(1,2,3,4,5), a_str=c('a','b','c','a','v'),stringsAsFactors=TRUE)
result<-str2Int.fit_transform(sample_dat)
result[[1]] #list of levels
result[[2]] #transformed df
###Test transform
str2Int.transform(sample_dat,result[[1]])
It's hard to believe that no one has mentioned caret's dummyVars function.
This is a widely searched question, and people don't want to write their own methods or copy and paste other users' methods; they want a package, and caret is the closest thing to sklearn in R.
EDIT: I now realize that what the user actually wants is to turn strings into integer codes, which is just as.numeric(as.factor(x)), but I'm going to leave this here because one-hot encoding is the more accurate method of encoding categorical data.
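As a rough illustration of what one-hot encoding produces, here is a sketch that uses base R's model.matrix rather than caret's dummyVars, so that it runs without extra packages. The data frame is made up for the example.

```r
sample_dat <- data.frame(
  a_str = factor(c("Red", "Blue", "Blue", "Red", "Green"))
)

# Dropping the intercept (- 1) yields one indicator column per level
one_hot <- model.matrix(~ a_str - 1, data = sample_dat)
colnames(one_hot)
one_hot
```

Each row contains exactly one 1, in the column of its level.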
# input P to the function below is a dataframe containing only categorical variables
numlevel <- function(P) {
n <- dim(P)[2]
for(i in 1: n) {
m <- length(unique(P[[i]]))
levels(P[[i]]) <- c(1:m)
}
return(P)
}
Q <- numlevel(P)
df<- mtcars
head(df)
df$cyl <- factor(df$cyl)
df$carb <- factor(df$carb)
vec <- sapply(df, is.factor)
catlevels <- sapply(df[vec], levels)
#store the levels for each category
#level appearing first is coded as 1, second as 2 so on
df <- sapply(df, as.numeric)
class(df) #matrix
df <- data.frame(df)
#converting back to dataframe
head(df)
# Data
Country <- c("France", "Spain", "Germany", "Spain", "Germany", "France")
Age <- c(34, 27, 30, 32, 42, 30)
Purchased <- c("No", "Yes", "No", "No", "Yes", "Yes")
df <- data.frame(Country, Age, Purchased)
df
# Output
Country Age Purchased
1 France 34 No
2 Spain 27 Yes
3 Germany 30 No
4 Spain 32 No
5 Germany 42 Yes
6 France 30 Yes
Using CatEncoders package : Encoders for Categorical Variables
library(CatEncoders)
# Saving names of categorical variables
factors <- names(which(sapply(df, is.factor)))
# Label Encoder
for (i in factors){
encode <- LabelEncoder.fit(df[, i])
df[, i] <- transform(encode, df[, i])
}
df
# Output
Country Age Purchased
1 1 34 1
2 3 27 2
3 2 30 1
4 3 32 1
5 2 42 2
6 1 30 2
Using R base : factor function
# Label Encoder
levels <- c("France", "Spain", "Germany", "No", "Yes")
labels <- c(1, 2, 3, 1, 2)
for (i in factors){
df[, i] <- factor(df[, i], levels = levels, labels = labels, ordered = TRUE)
}
df
# Output
Country Age Purchased
1 1 34 1
2 2 27 2
3 3 30 1
4 2 32 1
5 3 42 2
6 1 30 2
Here is an easy and neat solution:
From the superml package:
https://www.rdocumentation.org/packages/superml/versions/0.5.3
There is a LabelEncoder class:
https://www.rdocumentation.org/packages/superml/versions/0.5.3/topics/LabelEncoder
install.packages("superml")
library(superml)
lbl <- LabelEncoder$new()
lbl$fit(sample_dat$column)
sample_dat$column <- lbl$fit_transform(sample_dat$column)
decode_names <- lbl$inverse_transform(sample_dat$column)
I'm having trouble assigning a dataframe to a subset of another. In the example below, the line
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
only modifies one column instead of two. I would expect it either to modify no columns or both, not only one. I wrote the function to fill in the PrefName and CountyID columns in dataframe ds where they are NA by joining ds to another dataframe cs.
As you can see if you run it, the test is failing because PrefName is not getting filled in. After doing a bit of debugging, I realized that join() is doing exactly what it is expected to do, but the actual assignment of the result of that join somehow drops the PrefName back to a NA.
# fully copy-paste-run-able (but broken) code
suppressMessages({
library("plyr")
library("methods")
library("testthat")
})
# Fill in the missing PrefName/CountyIDs in delstat
# - Find the missing values in Delstat
# - Grab the CityState Primary Record values
# - Match on zipcode to fill in the holes in the delstat data
# - Remove any codes that could not be fixed
# - @param ds: delstat dataframe with 6 columns (see test case)
# - @param cs: citystate dataframe with 6 columns (see test case)
getMissingCounties <- function(ds, cs) {
if (length(is.na(ds$CountyID))) {
cavities <- which(is.na(ds$CountyID))
fillings <- cs[cs$PrimRec==TRUE, c(1,3,4)]
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
ds <- ds[!is.na(ds$CountyID),]
}
return(ds)
}
test_getMissingCounties <- function() {
ds <- data.frame(
CityStateKey = c(1, 2, 3, 4 ),
ZipCode = c(11, 22, 33, 44 ),
Business = c(1, 1, 1, 1 ),
Residential = c(1, 1, 1, 1 ),
PrefName = c("One", NA , NA, NA),
CountyID = c(111, NA, NA, NA))
cs <- data.frame(
ZipCode = c(11, 22, 22, 33, 55 ),
Name = c("eh", "eh?", "eh?", "eh!?", "ah." ),
PrefName = c("One", "To", "Two", "Three", "Five"),
CountyID = c(111, 222, 222, 333, 555 ),
PrimRec = c(TRUE, FALSE, TRUE, TRUE, TRUE ),
CityStateKey = c(1, 2, 2, 3, 5 ))
expected <- data.frame(
CityStateKey = c(1, 2, 3 ),
ZipCode = c(11, 22, 33 ),
Business = c(1, 1, 1 ),
Residential = c(1, 1, 1 ),
PrefName = c("One", "Two", "Three"),
CountyID = c(111, 222, 333 ))
expect_equal(getMissingCounties(ds, cs), expected)
}
# run the test
test_getMissingCounties()
The results are:
CityStateKey ZipCode Business Residential PrefName CountyID
1 11 1 1 One 111
2 22 1 1 <NA> 222
3 33 1 1 <NA> 333
Any ideas why PrefName is getting set to NA by the assignment or how to do the assignment so I don't lose data?
The short answer is that you can avoid this problem by making sure that there are no factors in your data frames. You do this by using stringsAsFactors=FALSE in the call(s) to data.frame(...). Note that many of the data import functions, including read.table(...) and read.csv(...), also convert character to factor by default in R versions before 4.0.0. You can defeat this behavior the same way.
This problem is actually quite subtle, and is also a good example of how R's "silent coercion" between data types creates all sorts of problems.
In R versions before 4.0.0, the data.frame(...) function converts any character vectors to factors by default. So in your code ds$PrefName is a factor with one level, and cs$PrefName is a factor with 5 levels. So in your assignment statement:
ds[cavities,] <- join(ds[cavities,1:4], fillings, by="ZipCode", "left")
the 5th column on the LHS is a factor with 1 level, and the 5th column on the RHS is a factor with 5 levels.
Under some circumstances, when you assign a factor with more levels to a factor with fewer levels, the missing levels are set to NA. Consider this:
x <- c("A","B",NA,NA,NA) # character vector
y <- LETTERS[1:5] # character vector
class(x); class(y)
# [1] "character"
# [1] "character"
df <- data.frame(x,y) # x and y coerced to factor
sapply(df,class) # df$x and df$y are factors
# x y
# "factor" "factor"
# assign rows 3:5 of col 2 to col 1
df[3:5,1] <- df[3:5,2] # fails with a warning
# Warning message:
# In `[<-.factor`(`*tmp*`, iseq, value = 3:5) :
# invalid factor level, NA generated
df # missing levels set to NA
# x y
# 1 A A
# 2 B B
# 3 <NA> C
# 4 <NA> D
# 5 <NA> E
The example above is equivalent to your assignment statement. However, notice what happens if you assign all of column 2 to column 1.
# assign all of col 2 to col 1
df <- data.frame(x,y)
df[,1] <- df[,2] # succeeds!!
df
# x y
# 1 A A
# 2 B B
# 3 C C
# 4 D D
# 5 E E
This works.
Finally, a note on debugging: if you are debugging a function, sometimes it is useful to run through the statements line by line at the command line (e.g., in the global environment). If you did that, you would have gotten the warning above, whereas inside a function call the warnings are suppressed.
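If the columns have to stay factors, one possible workaround is to extend the levels of the target column before the partial assignment. This is only a sketch built on the toy data from this answer; stringsAsFactors = TRUE is set explicitly so it behaves the same in current R versions.

```r
df <- data.frame(x = c("A", "B", NA, NA, NA),
                 y = LETTERS[1:5],
                 stringsAsFactors = TRUE)

# Give x every level it may need, then assign the labels (not the codes)
levels(df$x) <- union(levels(df$x), levels(df$y))
df[3:5, "x"] <- as.character(df[3:5, "y"])
df
```

With the levels in place, rows 3 to 5 are filled in and no "invalid factor level" warning is raised.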
The constraints of the test can be satisfied by reimplementing getMissingCounties with:
merge(ds[1:4], subset(subset(cs, PrimRec)[c(1, 3, 4)]), by="ZipCode")
Caveat: the ZipCode column is always emitted first, which differs from your expected result.
But to answer the subassignment question: it breaks, because the level sets of PrefName are incompatible between ds and cs. Either avoid using a factor or relevel them. You might have missed R's warning about this, because testthat was somehow suppressing warnings.
What is the quickest/best way to change a large number of columns to numeric from factor?
I used the following code but it appears to have re-ordered my data.
> head(stats[,1:2])
rk team
1 1 Washington Capitals*
2 2 San Jose Sharks*
3 3 Chicago Blackhawks*
4 4 Phoenix Coyotes*
5 5 New Jersey Devils*
6 6 Vancouver Canucks*
for(i in c(1,3:ncol(stats))) {
stats[,i] <- as.numeric(stats[,i])
}
> head(stats[,1:2])
rk team
1 2 Washington Capitals*
2 13 San Jose Sharks*
3 24 Chicago Blackhawks*
4 26 Phoenix Coyotes*
5 27 New Jersey Devils*
6 28 Vancouver Canucks*
What is the best way, short of naming every column as in:
df$colname <- as.numeric(df$colname)
You have to be careful while changing factors to numeric. Here is a line of code that would change a set of columns from factor to numeric. I am assuming here that the columns to be changed to numeric are 1, 3, 4 and 5 respectively. You could change it accordingly
cols = c(1, 3, 4, 5);
df[,cols] = apply(df[,cols], 2, function(x) as.numeric(as.character(x)));
Further to Ramnath's answer, the behaviour you are experiencing is due to as.numeric(x) returning the internal, numeric representation of the factor x at the R level. If you want to preserve the numbers that are the levels of the factor (rather than their internal representation), you need to convert to character via as.character() first, as per Ramnath's example.
Your for loop is just as reasonable as an apply call and might be slightly more readable as to what the intention of the code is. Just change this line:
stats[,i] <- as.numeric(stats[,i])
to read
stats[,i] <- as.numeric(as.character(stats[,i]))
This is FAQ 7.10 in the R FAQ.
HTH
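A small made-up example shows the difference between the internal codes and the numbers stored as level labels:

```r
f <- factor(c("10", "2", "33", "2"))

codes  <- as.numeric(f)                # internal codes: positions in levels(f)
values <- as.numeric(as.character(f))  # the numbers the labels stand for
faq    <- as.numeric(levels(f))[f]     # equivalent form given in FAQ 7.10

levels(f)
codes
values
```

The last form converts only the levels and then indexes them by the factor, which can be a little faster on long vectors with few levels.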
This can be done in one line, there's no need for a loop, be it a for-loop or an apply. Use unlist() instead :
# testdata
Df <- data.frame(
x = as.factor(sample(1:5,30,r=TRUE)),
y = as.factor(sample(1:5,30,r=TRUE)),
z = as.factor(sample(1:5,30,r=TRUE)),
w = as.factor(sample(1:5,30,r=TRUE))
)
##
Df[,c("y","w")] <- as.numeric(as.character(unlist(Df[,c("y","w")])))
str(Df)
Edit : for your code, this becomes :
id <- c(1,3:ncol(stats))
stats[,id] <- as.numeric(as.character(unlist(stats[,id])))
Obviously, if you have a one-column data frame and you don't want the automatic dimension reduction of R to convert it to a vector, you'll have to add the drop=FALSE argument.
I know this question is long resolved, but I recently had a similar issue and think I've found a little more elegant and functional solution, although it requires the magrittr package.
library(magrittr)
cols = c(1, 3, 4, 5)
df[,cols] %<>% lapply(function(x) as.numeric(as.character(x)))
The %<>% operator pipes and reassigns, which is very useful for keeping data cleaning and transformation simple. Now the list apply function is much easier to read, by only specifying the function you wish to apply.
Here are some dplyr options:
# by column type:
df %>%
mutate_if(is.factor, ~as.numeric(as.character(.)))
# by specific columns:
df %>%
mutate_at(vars(x, y, z), ~as.numeric(as.character(.)))
# all columns:
df %>%
mutate_all(~as.numeric(as.character(.)))
I think that ucfagls found why your loop is not working.
In case you still don't want to use a loop, here is a solution with lapply:
factorToNumeric <- function(f) as.numeric(levels(f))[as.integer(f)]
cols <- c(1, 3:ncol(stats))
stats[cols] <- lapply(stats[cols], factorToNumeric)
Edit: I found a simpler solution. It seems that as.matrix converts to character. So
stats[cols] <- as.numeric(as.matrix(stats[cols]))
should do what you want.
lapply is pretty much designed for this
unfactorize<-c("colA","colB")
df[,unfactorize]<-lapply(unfactorize, function(x) as.numeric(as.character(df[,x])))
I found this function on a couple of other duplicate threads and have found it an elegant and general way to solve this problem. This thread shows up first on most searches on this topic, so I am sharing it here to save folks some time. I take no credit for this; see the original posts for details.
df <- data.frame(x = 1:10,
y = rep(1:2, 5),
k = rnorm(10, 5,2),
z = rep(c(2010, 2012, 2011, 2010, 1999), 2),
j = c(rep(c("a", "b", "c"), 3), "d"))
convert.magic <- function(obj, type){
FUN1 <- switch(type,
character = as.character,
numeric = as.numeric,
factor = as.factor)
out <- lapply(obj, FUN1)
as.data.frame(out)
}
str(df)
str(convert.magic(df, "character"))
str(convert.magic(df, "factor"))
df[, c("x", "y")] <- convert.magic(df[, c("x", "y")], "factor")
I would like to point out that if you have NAs in any column, simply using subscripts will not work. If there are NAs in the factor, you must use the apply script provided by Ramnath.
E.g.
Df <- data.frame(
x = c(NA,as.factor(sample(1:5,30,r=T))),
y = c(NA,as.factor(sample(1:5,30,r=T))),
z = c(NA,as.factor(sample(1:5,30,r=T))),
w = c(NA,as.factor(sample(1:5,30,r=T)))
)
Df[,c(1:4)] <- as.numeric(as.character(Df[,c(1:4)]))
Returns the following:
Warning message:
NAs introduced by coercion
> head(Df)
x y z w
1 NA NA NA NA
2 NA NA NA NA
3 NA NA NA NA
4 NA NA NA NA
5 NA NA NA NA
6 NA NA NA NA
But:
Df[,c(1:4)]= apply(Df[,c(1:4)], 2, function(x) as.numeric(as.character(x)))
Returns:
> head(Df)
x y z w
1 NA NA NA NA
2 2 3 4 1
3 1 5 3 4
4 2 3 4 1
5 5 3 5 5
6 4 2 4 4
You can use the unfactor() function from the "varhandle" package from CRAN:
library("varhandle")
my_iris <- data.frame(Sepal.Length = factor(iris$Sepal.Length),
sample_id = factor(1:nrow(iris)))
my_iris <- unfactor(my_iris)
I like this code because it's pretty handy:
data[] <- lapply(data, function(x) type.convert(as.character(x), as.is = TRUE)) #change all vars to their best fitting data type
It is not exactly what was asked for (convert to numeric), but in many cases even more appropriate.
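To make the "best fitting data type" behaviour concrete, here is a short sketch on a made-up data frame with one purely numeric factor and one mixed factor:

```r
data <- data.frame(
  a = factor(c("1", "2", "3")),     # only whole numbers
  b = factor(c("1.5", "x", "2"))    # contains a non-numeric label
)

data[] <- lapply(data, function(x) type.convert(as.character(x), as.is = TRUE))
sapply(data, class)
```

Column a becomes integer, while column b stays character because "x" cannot be parsed as a number; as.is = TRUE keeps it from being turned back into a factor.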
Based on @SDahm's answer, this was an "optimal" solution for my tibble:
data %<>% lapply(type.convert) %>% as.data.table()
This requires magrittr (for %<>% and %>%) and data.table (for as.data.table()).
I tried a bunch of these on a similar problem and kept getting NAs. Base R has some really irritating coercion behaviors, which are generally fixed in Tidyverse packages. I used to avoid them because I didn't want to create dependencies, but they make life so much easier that now I don't even bother trying to figure out the Base R solution most of the time.
Here's the Tidyverse solution, which is extremely simple and elegant:
library(purrr)
mydf <- data.frame(
x1 = factor(c(3, 5, 4, 2, 1)),
x2 = factor(c("A", "C", "B", "D", "E")),
x3 = c(10, 8, 6, 4, 2))
map_df(mydf, as.numeric)
df$colname <- as.numeric(df$colname)
I tried this way for changing one column type and I think it is better than many other versions, if you are not going to change all column types
df$colname <- as.character(df$colname)
for the vice versa.
I had problems converting all columns to numeric with an apply() call:
apply(data, 2, as.numeric)
The problem turns out to be because some of the strings had a comma in them -- e.g. "1,024.63" instead of "1024.63" -- and R does not like this way of formatting numbers. So I removed them and then ran as.numeric():
data = as.data.frame(apply(data, 2, function(x) {
y = str_replace_all(x, ",", "") #remove commas
return(as.numeric(y)) #then convert
}))
Note that this requires the stringr package to be loaded.
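If you would rather avoid the stringr dependency, the same cleanup can be sketched in base R with gsub. The input vector is made up, and the sketch assumes the comma is always a thousands separator and never a decimal mark.

```r
x <- c("1,024.63", "12", "3,000")

# fixed = TRUE treats "," as a literal string rather than a regex
cleaned <- as.numeric(gsub(",", "", x, fixed = TRUE))
cleaned
```

For a whole data frame this can be wrapped in lapply, e.g. data[] <- lapply(data, function(col) as.numeric(gsub(",", "", col, fixed = TRUE))).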
This is what worked for me. The apply() function tries to coerce df to a matrix, and it returns NAs. sapply() works column by column instead:
numeric.df <- as.data.frame(sapply(df, as.numeric))