Changing df of integers to num doesn't work - r

I'm a newbie in R and don't have much experience with solving errors, so one more time I need help. I have a data frame named n_occur that has 2 columns - number and freq. Both values are integers. I wanted to get the histogram, but I got the error: argument x must be numeric, so I wanted to change both columns into num.
Firstly I tried with the simplest way:
n_occur[,] = as.numeric(as.character(n_occur[,]))
but as a result all values changed into NA. So after searching on stack, I decided to use this formula:
indx <- sapply(n_occur, is.factor)
n_occur[indx] <- lapply(n_occur[indx], function(x) as.numeric(as.character(x)))
and nothing changed, I still have integers and hist still doesn't work. Any ideas how to do that?

If anyone needs it in future, I solved the problem with mutate from dplyr:
n_occur <- mutate(n_occur, freq=as.numeric(freq))
for both columns separately. It worked!

I don't think you really have to do that, just supply the function with the actual columns instead of the entire dataframe. For example:
n_occur = data.frame(
number = sample(as.integer(1:10), 10, TRUE),
freq = sample(as.integer(0:10), 10, TRUE)
)
str(n_occur)
'data.frame': 10 obs. of 2 variables:
$ number: int 9 8 8 5 6 7 8 10 3 4
$ freq : int 0 9 2 0 4 10 7 2 7 9
hist(n_occur$number) # works
hist(n_occur$freq) # works
plot(n_occur$number, n_occur$freq, type = 'h') # works
hist(n_occur) # fails since it is the whole dataframe
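Also, if you still want to convert: as.numeric() on an integer column gives you doubles directly. A factor has to go through as.character() first, otherwise you get the internal level codes instead of the values:
as.numeric(n_occur$freq)
as.numeric(as.character(factor(1:10)))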
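To make the point of this answer concrete, here is a minimal sketch. It assumes a data frame shaped like the asker's n_occur, with made-up values. Integer columns already count as numeric for hist(), and lapply() can convert every column to double in one step if that is still wanted:

```r
# Toy stand-in for the asker's data frame (values are illustrative only)
n_occur <- data.frame(
  number = c(9L, 8L, 8L, 5L, 6L),
  freq   = c(0L, 9L, 2L, 0L, 4L)
)

# Integer vectors already pass the "must be numeric" check
is.numeric(n_occur$freq)

# hist() wants a vector, so pass one column; plot = FALSE skips the drawing
h <- hist(n_occur$freq, plot = FALSE)

# Optional: convert every column to double while keeping the data.frame
n_occur_num <- n_occur
n_occur_num[] <- lapply(n_occur_num, as.numeric)
str(n_occur_num)
```

The empty brackets in n_occur_num[] keep the data frame structure; without them the assignment would replace the data frame with a plain list.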

Related

Strange behaviour of as.numeric() with factor variable - gives completely different numbers to those supplied [duplicate]

I have a dataset where I am trying to convert a factor into a numeric variable. It appeared to work fine the first time I ran it, but now that I have changed the vector contents, as.numeric() is returning different (possibly previous) values rather than the values now in the vector, even though these do not appear to be stored anywhere. It works fine if I convert to a character first, however. The code I am using is:
rm(reprex) # ensure does not exist from previously
reprex <- data.frame(rbind(c("BT",8),c("BL", 1), c("TS",1), c("SA", 7), c("S", 5), c("LS",5), c("M",3), c("CV",3), c("CF",3), c("PE",3)))
names(reprex) <-c("Post Area", "Count")
reprex$Countnum <- as.numeric(reprex$Count) # should be same as Count
reprex$Countnum_char <- as.numeric(as.character(reprex$Count)) # is same as Count
head(reprex)
gives:
Post Area Count Countnum Countnum_char
1 BT 8 5 8
2 BL 1 1 1
3 TS 1 1 1
4 SA 7 4 7
5 S 5 3 5
6 LS 5 3 5
Why is this? It seems to work if I convert it to a character before converting to numeric so I can avoid it, but I am confused about why this happens at all and where the strangely-mapped (I suspect from a previous version of the dataframe) factor levels are being stored such that they persist after I remove the object.
This comes down to how R stores factors. The levels of Count are the distinct values sorted as character strings ("1", "3", "5", "7", "8"), and each observation is stored as the integer position of its level. Count = 1 is the smallest value, so its code is 1 and Countnum = 1. Count = 3 is the second smallest, so its code is 2 and Countnum = 2, and so on. In effect, your first as.numeric() returns the level code, not the label. Countnum_char first takes the label as a character value (for example "8" for the level with code 5, or "5" for the level with code 3) and converts that label to a number. Nothing is left over from an earlier version of the data frame.
Take a look here to shed some light on the why this is happening: https://www.dummies.com/programming/r/how-to-convert-a-factor-in-r/
The Dummies website has a lot of good free resources on R.
> numbers <- factor(c(9, 8, 10, 8, 9))
If you run str() on the above code snippet you get this output:
> str(numbers)
Factor w/ 3 levels "8","9","10": 2 1 3 1 2
R stores the values as c(2, 1, 3, 1, 2) with associated levels of c("8", "9", "10")
When converting numbers to character vectors you receive the expected output:
> as.character(numbers)
[1] "9" "8" "10" "8" "9"
However when you use as.numeric() you will get the output of the internal level representation of the vector, and not the original values.
Doing what you did
> as.numeric(as.character(numbers))
[1] 9 8 10 8 9
Is exactly how you fix this! This is normal behavior for R when doing what you are doing; you've not made any mistakes here that I can see.
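As a small self-contained check of the explanation above, the sketch below compares the internal codes with two ways of recovering the original values. The second, as.numeric(levels(f))[f], is the form mentioned in ?factor. The variable names are only for illustration:

```r
f <- factor(c(9, 8, 10, 8, 9))

# Internal integer codes: positions in levels(f), not the values themselves
codes <- as.integer(f)

# Two ways to get the original numbers back
vals_via_char   <- as.numeric(as.character(f))
vals_via_levels <- as.numeric(levels(f))[f]

levels(f)
codes
vals_via_char
vals_via_levels
```

Note that from R 4.0.0 on, data.frame() no longer turns strings into factors by default, so the reprex in the question only produces factor columns on older versions or with stringsAsFactors = TRUE.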

How to pass dynamic column name to h2o arrange function

Given a h2o dataframe df with a numeric column col, the sort of df by col works if the column is defined specifically:
h2o.arrange(df, "col")
But the sort doesn't work when I pass the column name in a variable:
var <- "A"
h2o.arrange(df, var)
I do not want to hard-code the column name. Is there any way to solve it? Thanks.
added an example per Darren's request
library(h2o)
h2o.init()
df <- as.h2o(cars)
var <- "dist"
h2o.arrange(df, var) # got error
h2o.arrange(df, "dist") # works
It turns out to be quite tricky, but you can get the dynamic column name to be evaluated by using call(). So, to follow on from your example:
var <- "dist"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 7 4
3 4 10
4 9 10
Then:
var <- "speed"
eval(call("h2o.arrange",df,var))
Gives:
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
(I'd love to say that was the first thing I thought of, but it was more like experiment number 54! I was about halfway down http://adv-r.had.co.nz/Expressions.html at the time. There might be other, better ways to achieve the same thing.)
By the way, another approach to achieve the same result is:
var = 1
h2o:::.newExpr("sort", df, var)
and
var = 0
h2o:::.newExpr("sort", df, var)
respectively. That is, the 3rd argument is the zero-based index of the column. You can get that with match(var, names(df)) - 1. By this point you've implemented 75% of h2o.arrange().
(Remember that any time you end up using h2o::: you are taking the risk that it will not work in some future version of H2O.)
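Why the eval(call(...)) trick helps can be shown without H2O. The sketch below uses arrange_nse(), a hypothetical base-R stand-in that captures its column arguments unevaluated. It is an assumption about the kind of behaviour involved, not H2O's actual code. call() and do.call() both evaluate var before the call is built, so the function receives the string "dist" instead of the symbol var:

```r
# Hypothetical stand-in for a function that captures its column arguments
# unevaluated. This is NOT h2o code, only an illustration of the mechanism.
arrange_nse <- function(df, ...) {
  # Turn the unevaluated arguments into column names
  cols <- vapply(as.list(substitute(list(...)))[-1], as.character, character(1))
  df[do.call(order, unname(as.list(df[cols]))), , drop = FALSE]
}

var <- "dist"

# Direct call: the function sees the symbol `var`, looks for a column
# called "var", and fails
direct <- tryCatch(arrange_nse(cars, var), error = function(e) "error")

# call() evaluates `var` first, so the constructed call contains "dist"
via_call <- eval(call("arrange_nse", cars, var))

# do.call() builds the call from already evaluated arguments in the same way
via_docall <- do.call(arrange_nse, list(cars, var))

head(via_call, 4)
```

If the same holds for h2o.arrange(), do.call(h2o.arrange, list(df, var)) may be a slightly more readable spelling of the same idea, but that is untested here.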

Why does the subsetting of a data.frame() in R behave differently when it has one contrary to multiple columns? [duplicate]

Say I have a data.frame:
df <- data.frame(A=c(10,20,30),B=c(11,22,33), C=c(111,222,333))
A B C
1 10 11 111
2 20 22 222
3 30 33 333
If I select two (or more) columns I get a data.frame:
x <- df[,1:2]
A B
1 10 11
2 20 22
3 30 33
This is what I want. However, if I select only one column I get a numeric vector:
x <- df[,1]
[1] 10 20 30
I have tried to use as.data.frame(), which does not change the results for two or more columns. It does return a data.frame in the case of one column, but does not retain the column name:
x <- as.data.frame(df[,1])
df[, 1]
1 10
2 20
3 30
I don't understand why it behaves like this. In my mind it should not make a difference whether I extract one, two or ten columns. It should either always return a vector (or matrix) or always return a data.frame (with the correct names). What am I missing? Thanks!
Note: This is not a duplicate of the question about matrices, as matrix and data.frame are fundamentally different data types in R, and can work differently with dplyr. There are several answers that work with data.frame but not matrix.
Use drop=FALSE
> x <- df[,1, drop=FALSE]
> x
A
1 10
2 20
3 30
From the documentation (see ?"[") you can find:
If drop=TRUE the result is coerced to the lowest possible dimension.
Omit the ,:
x <- df[1]
A
1 10
2 20
3 30
From the help page of ?"[":
Indexing by [ is similar to atomic vectors and selects a list of the specified element(s).
A data frame is a list. The columns are its elements.
You can also use subset:
subset(df, select = 1) # by index
subset(df, select = A) # by name
As mentioned in the comments you can also use dplyr::select, but you do not need to quote the variable name:
library(dplyr)
# by name
df %>%
select(A)
# by index
df %>%
select(1)
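The base-R variants from the answers above can be compared side by side. This is a minimal sketch using the data frame from the question; the result names are only for illustration:

```r
df <- data.frame(A = c(10, 20, 30), B = c(11, 22, 33), C = c(111, 222, 333))

v  <- df[, 1]                # matrix-style, drop = TRUE by default: numeric vector
d1 <- df[, 1, drop = FALSE]  # matrix-style, dimensions kept: data.frame
d2 <- df[1]                  # list-style subsetting: always a data.frame
v2 <- df[[1]]                # list-style extraction of one element: numeric vector

class(v)
class(d1)
names(d1)
```

Both d1 and d2 keep the column name A, which is what as.data.frame(df[, 1]) loses, because by then the name has already been dropped along with the dimension.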

R: A column in a dataframe from numeric to factor with paste0 (and vice versa)

Preface:
I have seen this post:How to convert a factor to an integer\numeric without a loss of information? , but it does not really apply to the issue I am having. It addresses the issue of converting a vector in the form of factor to a numeric, but the issue I am having is larger than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this, I think what is messing it up is the paste0 function.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R. (Like the named lists @joran mentioned in the comment.)
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
d <- get(dfname)
d[, colnum] <- factor(d[, colnum])
assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
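For comparison, the named-list approach mentioned at the start of this answer might look like the sketch below. The list name frames is hypothetical. Because [[ accepts a computed string, the paste0() result can be used directly on the left-hand side of the assignment, with no get() or assign():

```r
set.seed(42)  # only so the example is reproducible

# Keep the data frames in a named list instead of as separate variables
frames <- list(dd = data.frame(aa = 1:10, bb = rnorm(10)))

# The name can be built dynamically, as in the question
nm <- paste0("d", "d")

# Replacement works because frames[[nm]] is a valid assignment target
frames[[nm]][, 2] <- factor(frames[[nm]][, 2])

str(frames[[nm]])
```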

How do I handle multiple kinds of missingness in R?

Many surveys have codes for different kinds of missingness. For instance, a codebook might indicate:
0-99 Data
-1 Question not asked
-5 Do not know
-7 Refused to respond
-9 Module not asked
Stata has a beautiful facility for handling these multiple kinds of missingness, in that it allows you to assign a generic . to missing data, but more specific kinds of missingness (.a, .b, .c, ..., .z) are allowed as well. All the commands which look at missingness report answers for all the missing entries however specified, but you can sort out the various kinds of missingness later on as well. This is particularly helpful when you believe that refusal to respond has different implications for the imputation strategy than does question not asked.
I have never run across such a facility in R, but I would really like to have this capability. Are there any ways of marking several different types of NA? I could imagine creating more data (either a vector of length nrow(my.data.frame) containing the types of missingness, or a more compact index of which rows had what types of missingness), but that seems pretty unwieldy.
I know what you are looking for, and it is not implemented in R. I don't know of a package that implements it, but it's not too difficult to code it yourself.
A workable way is to add a dataframe containing the codes to the attributes. To avoid duplicating the whole dataframe and to save space, I'd store only the indices of the coded values in that dataframe instead of reconstructing a complete dataframe.
For example:
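NACode <- function(x, code){
  # Replace every coded value by NA, column by column
  Df <- sapply(x, function(i){
    i[i %in% code] <- NA
    i
  })
  # Column-major positions of the NAs, then their row and column.
  # Subtracting 1 first keeps the last row from ending up as row 0.
  id <- which(is.na(Df))
  rowid <- (id - 1L) %% nrow(x) + 1L
  colid <- (id - 1L) %/% nrow(x) + 1
  # Keep the original codes next to their positions
  NAdf <- data.frame(
    id, rowid, colid,
    value = as.matrix(x)[id]
  )
  Df <- as.data.frame(Df)
  attr(Df, "NAcode") <- NAdf
  Df
}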
This lets you do:
> Df <- data.frame(A = 1:10,B=c(1:5,-1,-2,-3,9,10) )
> code <- list("Missing"=-1,"Not Answered"=-2,"Don't know"=-3)
> DfwithNA <- NACode(Df,code)
> str(DfwithNA)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA NA NA 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
The function can also be adjusted to add an extra attribute that gives you the label for the different values, see also this question. You could backtransform by :
ChangeNAToCode <- function(x,code){
NAval <- attr(x,"NAcode")
for(i in which(NAval$value %in% code))
x[NAval$rowid[i],NAval$colid[i]] <- NAval$value[i]
x
}
> Dfback <- ChangeNAToCode(DfwithNA,c(-2,-3))
> str(Dfback)
'data.frame': 10 obs. of 2 variables:
$ A: num 1 2 3 4 5 6 7 8 9 10
$ B: num 1 2 3 4 5 NA -2 -3 9 10
- attr(*, "NAcode")='data.frame': 3 obs. of 4 variables:
..$ id : int 16 17 18
..$ rowid: int 6 7 8
..$ colid: num 2 2 2
..$ value: num -1 -2 -3
This lets you change back only the codes you want, if that is ever necessary. The function can be adapted to return all codes when no argument is given. Similar functions can be constructed to extract data based on the code; I guess you can figure that one out yourself.
In short: using attributes and indices might be a nice way of doing it.
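One possible shape for such an extraction function is sketched below. rows_with_code() is a hypothetical helper, not part of the answer's code, and the "NAcode" attribute is built by hand here so that the example runs on its own. It mirrors the structure shown in the str() output above:

```r
# Hypothetical helper: rows in which one of the given codes occurred
rows_with_code <- function(x, code) {
  na_info <- attr(x, "NAcode")
  sort(unique(na_info$rowid[na_info$value %in% code]))
}

# Hand-built stand-in for the result of NACode(), same layout as above
df_na <- data.frame(A = 1:10, B = c(1:5, NA, NA, NA, 9, 10))
attr(df_na, "NAcode") <- data.frame(
  id    = 16:18,
  rowid = 6:8,
  colid = c(2, 2, 2),
  value = c(-1, -2, -3)
)

# Rows where the code -2 ("Not Answered" in the example) was recorded
not_answered <- rows_with_code(df_na, -2)
df_na[not_answered, ]
```

Keep in mind that attributes on a data frame are easily lost: many operations that build a new data frame do not carry them over, so it is worth checking that "NAcode" is still there after each transformation.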
The most obvious way seems to use two vectors:
Vector 1: a data vector, where all missing values are represented using NA. For example, c(2, 50, NA, NA)
Vector 2: a factor indicating the type of data. For example, factor(c(1, 1, -1, -7)), where level 1 indicates a correctly answered question.
Having this structure would give you a great deal of flexibility, since all the standard na.rm arguments still work with your data vector, but you can use more complex concepts with the factor vector.
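A minimal sketch of this two-vector layout, held together in one data frame. The column names and level labels are made up for illustration; the codes follow the codebook in the question:

```r
survey <- data.frame(
  # Data vector: every kind of missingness is a plain NA
  value = c(2, 50, NA, NA),
  # Companion factor: records why each entry is or is not missing
  type  = factor(c(1, 1, -1, -7),
                 levels = c(1, -1, -7),
                 labels = c("answered", "not asked", "refused"))
)

# Standard NA handling works on the data vector as usual
avg <- mean(survey$value, na.rm = TRUE)

# The factor supports the finer distinctions
table(survey$type)
asked <- subset(survey, type != "not asked")
asked
```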
Update following questions from @gsk3
"Data storage will dramatically increase": The data storage will double. However, if doubling the size causes a real problem, it may be worth thinking about other strategies.
"Programs don't automatically deal with it": That's a strange comment. Some functions handle NAs in a sensible way by default. However, you want to treat the NAs differently, which implies that you will have to do something bespoke. If you want to analyse only the data where the NAs are "Question not asked", then just use a data frame subset.
"Now you have to manipulate two vectors together every time you want to conceptually manipulate a variable": I suppose I envisaged a data frame of the two vectors. I would subset the data frame based on the second vector.
"There's no standard implementation, so my solution might differ from someone else's": True. However, if an off-the-shelf package doesn't meet your needs, then (almost) by definition you want to do something different.
I should state that I have never analysed survey data (although I have analysed large biological data sets). My answers above appear quite defensive, but that's not my intention. I think your question is a good one, and I'm interested in other responses.
This is more than just a "technical" issue. You should have a thorough statistical background in missing value analysis and imputation. One solution requires playing with R and ggobi. You can assign extremely negative values to several types of NA (put NAs into the margin), and do some diagnostics "manually". You should bear in mind that there are three types of missingness:
MCAR - missing completely at random, where P(missing|observed,unobserved) = P(missing)
MAR - missing at random, where P(missing|observed,unobserved) = P(missing|observed)
MNAR - missing not at random (or non-ignorable), where P(missing|observed,unobserved) depends on the unobserved values and cannot be simplified further.
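The three definitions can be illustrated with a small simulation. This is only a sketch under made-up assumptions: x is a fully observed variable, y is the variable that gets gaps, and the logistic missingness probabilities are arbitrary choices.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)        # always observed
y <- x + rnorm(n)    # the variable that will have gaps

# MCAR: the chance of being missing ignores both x and y
miss_mcar <- runif(n) < 0.3
# MAR: the chance of being missing depends only on the observed x
miss_mar  <- runif(n) < plogis(2 * x)
# MNAR: the chance of being missing depends on the value of y itself
miss_mnar <- runif(n) < plogis(2 * y)

# Mean of y among the values that remain observed under each mechanism
c(full = mean(y),
  mcar = mean(y[!miss_mcar]),
  mar  = mean(y[!miss_mar]),
  mnar = mean(y[!miss_mnar]))
```

Under MCAR the observed mean stays close to the full mean. Under MAR and MNAR it is pulled downward, but only in the MAR case can the distortion be explained from the observed x alone.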
IMHO this question is more suitable for CrossValidated.
But here's a link from SO that you may find useful:
Handling missing/incomplete data in R--is there function to mask but not remove NAs?
You can dispense with NA entirely and just use the coded values. You can then also roll them up to a global missing value. I often prefer to code without NA, since NA can cause problems in coding and I like to be able to control exactly what is going into the analysis. I have also used the string "NA" to represent NA, which often makes things easier.
-Ralph Winters
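A minimal sketch of that workflow, using the missingness codes from the codebook in the question and made-up responses: keep the codes as values while they matter, then roll them up into NA just before the analysis.

```r
missing_codes <- c(-1, -5, -7, -9)      # codes from the codebook in the question
x <- c(12, -7, 45, -1, 99, -5)          # made-up responses

# While the codes are still values, they can be tabulated like any other data
table(x[x %in% missing_codes])

# Roll all codes up into a single global missing value for the analysis
x_na <- replace(x, x %in% missing_codes, NA)
avg <- mean(x_na, na.rm = TRUE)
avg
```

The original vector x is left untouched, so the specific codes remain available for later checks.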
I usually use them as values, as Ralph already suggested, since the type of missing value seems to be data, but on one or two occasions where I mainly wanted it for documentation I have used an attribute on the value, e.g.
> a <- NA
> attr(a, 'na.type') <- -1
> print(a)
[1] NA
attr(,"na.type")
[1] -1
That way my analysis is clean but I still keep the documentation. But as I said: usually I keep the values.
Allan.
I'd like to add to the "statistical background component" here. "Statistical Analysis with Missing Data" is a very good read on this.

Resources