Converting from a character to a numeric data frame - r

I have a character data frame in R which has NaNs in it. I need to remove any row with a NaN and then convert it to a numeric data frame.
If I just do as.numeric on the data frame, I run into the following
Error: (list) object cannot be coerced to type 'double'

As @thijs van den bergh points out,
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
dat <- as.data.frame(sapply(dat, as.numeric)) #<- sapply is here
dat[complete.cases(dat), ]
# x y
#2 2 3
is one way to do this.
Your error comes from calling as.numeric on the whole data.frame, which is a list. The sapply call I show instead converts each column vector to numeric.

Note that data.frames are not numeric or character, but rather are a list which can be all numeric columns, all character columns, or a mix of these or other types (e.g.: Date/logical).
dat <- data.frame(x=c("NaN","2"),y=c("NaN","3"),stringsAsFactors=FALSE)
is.list(dat)
# [1] TRUE
The example data just has two character columns:
> str(dat)
'data.frame': 2 obs. of 2 variables:
$ x: chr "NaN" "2"
$ y: chr "NaN" "3"
...which you could add a numeric column to like so:
> dat$num.example <- c(6.2,3.8)
> dat
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
> str(dat)
'data.frame': 2 obs. of 3 variables:
$ x : chr "NaN" "2"
$ y : chr "NaN" "3"
$ num.example: num 6.2 3.8
So when you call as.numeric on the whole data.frame, R fails because it cannot coerce a list, which may hold columns of several types, to a single numeric vector. user1317221_G's answer uses the ?sapply function, which applies a function to the individual items of an object. You could alternatively use ?lapply, which is very similar but always returns a list (read more on the *apply functions in "R Grouping functions: sapply vs. lapply vs. apply. vs. tapply vs. by vs. aggregate").
That is, in this case you can apply the as.numeric function to each column of your data.frame, like so:
data.frame(lapply(dat,as.numeric))
The lapply call is wrapped in a data.frame to make sure the output is a data.frame and not a list. That is, running:
lapply(dat,as.numeric)
will give you:
> lapply(dat,as.numeric)
$x
[1] NaN 2
$y
[1] NaN 3
$num.example
[1] 6.2 3.8
While:
data.frame(lapply(dat,as.numeric))
will give you:
> data.frame(lapply(dat,as.numeric))
x y num.example
1 NaN NaN 6.2
2 2 3 3.8
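A compact variant of the same idea, offered as a sketch rather than the answerers' exact method: assigning into dat[] keeps the data.frame structure, so no data.frame() wrapper is needed, and complete.cases then drops the NaN rows as in the first answer. The name clean is introduced here purely for illustration.

```r
# Example data from the question: two character columns containing "NaN"
dat <- data.frame(x = c("NaN", "2"), y = c("NaN", "3"),
                  stringsAsFactors = FALSE)

# dat[] <- ... replaces the columns but keeps the data.frame container
dat[] <- lapply(dat, as.numeric)

# complete.cases() treats NaN as missing, so row 1 is dropped
clean <- dat[complete.cases(dat), ]
str(clean)
```

If only some columns are character, the same assignment can be restricted to those columns instead of the whole data.frame.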

Related

#R #Non-numeric argument for binary operator #xts object * integer

I have an xts object: a time series of a company's "outstanding shares", ordered by date.
I want to multiply the time series of "outstanding shares" by a factor of 7 in order to account for a stock split.
> outstanding_shares_xts <- shares_xts1[,1]
> adjusted <- outstanding_shares_xts*7
Error: Non-numeric argument for binary operator.
The ts "outstanding_shares_xts" is a column of integers.
Does anyone have an idea?
My guess is that they may look like integers but are in fact not.
Sleuthing:
I initially thought it could be [-vs-[[ column subsetting, since tibble(a=1:2)[,1] does not produce an integer vector (it produces a single-column tibble), but tibble(a=1:2)[,1] * 7 still works.
Then I thought it could be due to factors, but it's a different error:
data.frame(a=factor(1:2))[,1]*7
# Warning in Ops.factor(data.frame(a = factor(1:2))[, 1], 7) :
# '*' not meaningful for factors
# [1] NA NA
One possibility is that you have character values that look like integers.
dat <- data.frame(a=as.character(1:2))
dat
# a
# 1 1
# 2 2
dat[,1]*7
# Error in dat[, 1] * 7 : non-numeric argument to binary operator
Try converting that column to integer, something like
str(dat)
# 'data.frame': 2 obs. of 1 variable:
# $ a: chr "1" "2"
dat$a <- as.integer(dat$a)
str(dat)
# 'data.frame': 2 obs. of 1 variable:
# $ a: int 1 2
dat[,1]*7
# [1] 7 14
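The answer above works on a data.frame column. An xts object keeps its data in a matrix, so the sketch below uses a plain one-column character matrix as a stand-in (an assumption, since the asker's real object is not shown) and converts it in place with storage.mode, which keeps the dimensions and column names. The names shares_chr and shares_num are hypothetical.

```r
# Stand-in for the asker's data: a one-column matrix whose values are
# character strings that merely look like integers
shares_chr <- matrix(c("100", "200", "300"), ncol = 1,
                     dimnames = list(NULL, "shares"))

# Multiplying a character matrix fails; capture the message instead of stopping
err <- tryCatch(shares_chr * 7, error = function(e) conditionMessage(e))

# Change the storage type in place; dim and dimnames are preserved
shares_num <- shares_chr
storage.mode(shares_num) <- "double"

adjusted <- shares_num * 7
adjusted
```

Checking str() on the real object first would confirm whether character storage is actually the cause.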

R: Why am I not getting type or class "factor" after converting columns to factor?

I have the following setup.
df <- data.frame(aa = rnorm(1000), bb = rnorm(1000))
apply(df, 2, typeof)
# aa bb
#"double" "double"
apply(df, 2, class)
# aa bb
#"numeric" "numeric"
Then I try to convert one of the columns to "factor". But as you can see below, I am not getting any "factor" type or classes. Am I doing anything wrong ?
df[, 1] <- as.factor(df[, 1])
apply(df, 2, typeof)
# aa bb
#"character" "character"
apply(df, 2, class)
# aa bb
#"character" "character"
Sorry, I felt my original answer was badly written. Why did I put that "matrix of factors" at the very beginning? Here is a better try.
From ?apply:
If ‘X’ is not an array but an object of a class with a non-null
‘dim’ value (such as a data frame), ‘apply’ attempts to coerce it
to an array via ‘as.matrix’ if it is two-dimensional (e.g., a data
frame) or via ‘as.array’.
So a data frame is converted to a matrix by as.matrix, before FUN is applied row-wise or column-wise.
From ?as.matrix:
‘as.matrix’ is a generic function. The method for data frames
will return a character matrix if there is only atomic columns and
any non-(numeric/logical/complex) column, applying ‘as.vector’ to
factors and ‘format’ to other non-character columns. Otherwise,
the usual coercion hierarchy (logical < integer < double <
complex) will be used, e.g., all-logical data frames will be
coerced to a logical matrix, mixed logical-integer will give a
integer matrix, etc.
The default method for ‘as.matrix’ calls ‘as.vector(x)’, and hence
e.g. coerces factors to character vectors.
I am not a native English speaker and I can't read the following (which looks rather important!). Can someone clarify it?
The method for data frames will return a character matrix if there is only atomic columns and any non-(numeric/logical/complex) column, applying ‘as.vector’ to factors and ‘format’ to other non-character columns.
From ?as.vector:
Note that factors are _not_ vectors; ‘is.vector’ returns ‘FALSE’
and ‘as.vector’ converts a factor to a character vector for ‘mode
= "any"’.
Simply put, as long as you have a factor column in a data frame, as.matrix gives you a character matrix.
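The coercion can be checked directly by calling as.matrix yourself, which is what apply does internally according to the ?apply excerpt quoted above. This is a minimal sketch with a small made-up data frame; df_num and df_fac are illustrative names.

```r
# All-numeric data frame: as.matrix() yields a double matrix
df_num <- data.frame(aa = c(1.5, 2.5), bb = c(3, 4))
type_before <- typeof(as.matrix(df_num))

# Turn one column into a factor: the whole matrix becomes character
df_fac <- df_num
df_fac$aa <- as.factor(df_fac$aa)
type_after <- typeof(as.matrix(df_fac))

c(before = type_before, after = type_after)
```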
I believe this problem of apply with data frames has been raised many times, and the above just adds another duplicate answer. Really sorry. I failed to read OP's question carefully. What struck me at first is that R cannot build a true matrix of factors.
f <- factor(letters[1:4])
matrix(f, 2, 2)
# [,1] [,2]
#[1,] "a" "c"
#[2,] "b" "d"
## a sneaky way to get a matrix of factors by setting `dim` attribute
dim(f) <- c(2, 2)
f
# [,1] [,2]
#[1,] a c
#[2,] b d
#Levels: a b c d
is.matrix(f)
#[1] TRUE
class(f)
#[1] "factor" ## not a true matrix with "matrix" class
While this is interesting, it is less relevant to OP's question.
Sorry again for making a mess here. So bad!!
So if I do sapply would it help? Because I have many columns that need to be converted to factor.
Use lapply, actually. sapply would simplify the result to an array, which is a matrix in the 2D case. Here is an example:
dat <- head(trees)
sapply(dat, as.factor)
# Girth Height Volume
#[1,] "8.3" "70" "10.3"
#[2,] "8.6" "65" "10.3"
#[3,] "8.8" "63" "10.2"
#[4,] "10.5" "72" "16.4"
#[5,] "10.7" "81" "18.8"
#[6,] "10.8" "83" "19.7"
new_dat <- data.frame(lapply(dat, as.factor))
str(new_dat)
#'data.frame': 6 obs. of 3 variables:
# $ Girth : Factor w/ 6 levels "8.3","8.6","8.8",..: 1 2 3 4 5 6
# $ Height: Factor w/ 6 levels "63","65","70",..: 3 2 1 4 5 6
# $ Volume: Factor w/ 5 levels "10.2","10.3",..: 2 2 1 3 4 5
sapply(new_dat, class)
# Girth Height Volume
#"factor" "factor" "factor"
apply(new_dat, 2, class)
# Girth Height Volume
#"character" "character" "character"
Regarding typeof, factors are actually stored as integers.
sapply(new_dat, typeof)
# Girth Height Volume
#"integer" "integer" "integer"
When you dput a factor you can see this. For example:
dput(new_dat[[1]])
#structure(1:6, .Label = c("8.3", "8.6", "8.8", "10.5", "10.7",
#"10.8"), class = "factor")
The real values are 1:6. Character levels are just an attribute.
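When only some of many columns should become factors, the same lapply idea can be applied to a subset and assigned back in place. This is a sketch using the built-in trees data from the answer above; the choice of columns in cols is arbitrary and only for illustration.

```r
dat <- head(trees)

# Hypothetical selection of columns to convert
cols <- c("Girth", "Height")

# Assigning the list back into dat[cols] converts only those columns
dat[cols] <- lapply(dat[cols], as.factor)

# Inspect column classes with sapply, not apply (apply would coerce to a matrix)
col_classes <- sapply(dat, class)
col_classes
```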

Converting an R list with NULL sub-elements to a data frame

Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer yet for now in case there is a better solution.
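For comparison, here is a base-R sketch that avoids plyr by replacing each NULL with NA before building the one-row data frames, so every row has the same columns and rbind can stack them. The helper null_to_na is a hypothetical name introduced here, and the approach assumes every sub-list has the same element names.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

# Hypothetical helper: swap NULL elements for NA so no column goes missing
null_to_na <- function(x) {
  x[vapply(x, is.null, logical(1))] <- NA
  x
}

# One single-row data frame per sub-list, then stack them
rows <- lapply(lll, function(x) {
  as.data.frame(null_to_na(x), stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)
str(df)
```

Unlike the ldply result, the columns should stay in their original order because no row is missing a column.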
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA
na_if("NULL") %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())
There are several tricks to note:
map_df cannot merge bare character vectors into a data frame, so each one is first converted into a tibble row with as_tibble_row(). In theory you could name the vectors instead, but as.character() drops the names attribute, so you would need to convert the result back into a named vector.
as_tibble_row() needs a .name_repair argument so that map_df can merge the unnamed tibble rows.
I'm truly grateful for the dplyr::na_if() function; you should be too!
lll[[1]] %>% names() is just one way to get the names of the first list entry. It assumes the other list entries have the same names in the same order; you should check that first.
Details:
When you use na_if(), you elegantly replace this code by Ricky (which is totally fine but hard to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful in that it accepts a list of arguments. Here it calls rbind, which combines the sub-lists observation by observation, and data.frame structures the output as needed. The weakness is that, because data frame columns can themselves be lists, the new object keeps list columns, which makes it difficult to perform calculations on the elements. Below is another option, which is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto each existing object. Next, unlist allows the list elements to be treated as a named vector. col.length records how many variables there are, for use in the new data frame. The last function call creates the data frame, using the col.length value to split the vector into rows.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.

Getting different results for 'class()' method

Here's the smallest piece of code that shows how I am getting different results for class() when it is called directly on columns versus when it is called using apply.
The data.frame looks like this.
> df
A B C
1 rlm 4.047317e-03 0.0040111713
2 rlm -6.474359e-02 -0.0657461598
3 rlm 1.464302e-01 0.1451224214
4 rlm 3.508878e-01 0.3477540761
5 lm 2.701757e-01 0.2769367280
6 lm 2.580785e-03 0.0025815525
7 rlm 1.638077e-05 0.0000160895
> str(df)
'data.frame': 7 obs. of 3 variables:
$ A: chr "rlm" "rlm" "rlm" "rlm" ...
$ B: num 0.00405 -0.06474 0.14643 0.35089 0.27018 ...
$ C: num 0.00401 -0.06575 0.14512 0.34775 0.27694 ...
> class(df$A)
[1] "character"
> class(df$B)
[1] "numeric"
> apply(df, 2, class)
A B C
"character" "character" "character"
So, when called directly, the class of B is 'numeric', but when called using apply, it is reported as 'character'.
Am I missing anything here?
Apply coerces data.frames to matrices before applying the function. Since in a matrix each element must have the same class you end up with a character matrix (since you can convert numeric to character without information loss but not the other way). The reason for this is probably that you can apply functions by-row as well, which would be messy with data.frames since your function would need to operate on a list.
For what you want check out the lapply and sapply functions, since data.frames are basically lists with each element of the list being one of the columns.
> x <- data.frame(a = "Entry", b = 5)
> sapply(x, class)
a b
"factor" "numeric"
I get the same result. I think it might be the same behavior you see in this example:
number_m <- matrix(1:6)
mode(number_m) # "numeric"
number_m[2,1] <- "b"
mode(number_m) # "character"
number_m
Converting one element of a matrix or vector to character changes the data type of all the elements.
I get the correct result using a loop:
df <- read.table(header=TRUE, text="
A B C
1 rlm 4.047317e-03 0.0040111713
2 rlm -6.474359e-02 -0.0657461598
3 rlm 1.464302e-01 0.1451224214
4 rlm 3.508878e-01 0.3477540761
5 lm 2.701757e-01 0.2769367280
6 lm 2.580785e-03 0.0025815525
7 rlm 1.638077e-05 0.0000160895")
sapply(1:3, function(i) class(df[,i]))
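The index loop above can also be written by iterating over the columns themselves. This sketch uses vapply, which additionally checks that each result is a single string, and contrasts it with apply on a cut-down two-row version of the data (the values are taken from the question; the shortened frame is only for illustration).

```r
df <- data.frame(A = c("rlm", "lm"),
                 B = c(4.047317e-03, 2.701757e-01),
                 stringsAsFactors = FALSE)

# Column-wise: each column keeps its own type
col_classes <- vapply(df, function(col) class(col)[1], character(1))

# apply() converts df to a character matrix first, so every column looks the same
mat_classes <- apply(df, 2, class)

col_classes
mat_classes
```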

R: numeric vector becoming non-numeric after cbind of dates

I have a numeric object (future_prices) in my case. I use a date vector (here: pred_commodity_prices$futuredays) to create numbers for the months. After that I use cbind to bind the months to the numeric data. However, what happens is that the numeric data becomes non-numeric. Do you know what the reason for this is? When I use as.numeric(future_prices) I get strange values. What could be an alternative? Thanks
head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a
1 68.33907 62.37888
2 68.08553 62.32658
is.numeric(future_prices)
[1] TRUE
> month = format(as.POSIXlt.date(pred_commodity_prices$futuredays), "%m")
> future_prices <- cbind (future_prices, month)
> head(future_prices)
pred_peak_month_3a pred_peak_quarter_3a month
1 "68.3390747063745" "62.3788824938719" "01"
is.numeric(future_prices)
[1] FALSE
The reason is that cbind returns a matrix, and a matrix can only hold one data type. You could use a data.frame instead:
n <- 1:10
b <- LETTERS[1:10]
m <- cbind(n,b)
str(m)
chr [1:10, 1:2] "1" "2" "3" "4" "5" "6" "7" "8" "9" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:2] "n" "b"
d <- data.frame(n,b)
str(d)
'data.frame': 10 obs. of 2 variables:
$ n: int 1 2 3 4 5 6 7 8 9 10
$ b: Factor w/ 10 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10
See ?format. The format function returns:
An object of similar structure to ‘x’ containing character
representations of the elements of the first argument ‘x’ in a
common format, and in the current locale's encoding.
from ?cbind, cbind returns
... a matrix combining the ‘...’ arguments
column-wise or row-wise. (Exception: if there are no inputs or
all the inputs are ‘NULL’, the value is ‘NULL’.)
and all elements of a matrix must be of the same class, so everything is coerced to character.
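Putting the two quoted help pages together: format returns character, and cbind on a matrix then coerces everything to character. The sketch below shows that effect and two ways around it, either converting the month to a number before binding or using a data.frame. The dates in days are made up, since the asker's pred_commodity_prices is not shown, and as.Date is used here in place of the asker's as.POSIXlt.date call.

```r
# Numeric matrix resembling the asker's future_prices
prices <- cbind(pred_peak_month_3a   = c(68.33907, 68.08553),
                pred_peak_quarter_3a = c(62.37888, 62.32658))

# Hypothetical dates standing in for pred_commodity_prices$futuredays
days  <- as.Date(c("2015-01-01", "2015-02-01"))
month <- format(days, "%m")          # character: "01", "02"

as_char <- cbind(prices, month)                     # whole matrix becomes character
as_num  <- cbind(prices, month = as.numeric(month)) # stays a numeric matrix
as_df   <- data.frame(prices, month = month,
                      stringsAsFactors = FALSE)     # columns keep their own types

str(as_df)
```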
F.Y.I.
When a column is a factor, calling as.numeric on it directly returns the factor's internal integer codes rather than the original values. The proper way is:
data.frame[,2] <- as.numeric(as.character(data.frame[,2]))
Find more details: Converting values to numeric, stack overflow
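A small illustration of why the extra as.character step matters, using a made-up factor of price-like strings (the values are illustrative only):

```r
f <- factor(c("68.3", "62.4", "68.3"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through character recovers the original numbers
values <- as.numeric(as.character(f))

# Equivalent alternative: convert the levels once, then index by the factor
values_alt <- as.numeric(levels(f))[f]

codes
values
```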
