Warning message: In mean.default(., lead_time) : argument is not numeric or logical: returning NA - r

I run into this problem.
The thing is, the column I am using is actually a numeric column and I get correct results using:
mean(hotel_bookings_city$lead_time).
but when I try this:
hotel_bookings_city %>% mean(lead_time)
I receive the warning that the column 'lead_time' is not numeric/logical even though it is.
What am I not doing right?

As @Ritchi Sacramento already stated, in your second approach the whole data.frame is passed as the first argument to mean() because of the pipe operator %>% (from magrittr, re-exported by dplyr). You can see this in the warning as well: the dot in mean.default(., lead_time) stands for the data.frame as it arrives at that step of the pipe (possibly modified by earlier %>% steps), not for the column.
You can solve the problem with the code @Ritchi Sacramento already suggested, by (1) extracting the column before applying mean() or (2) using the summarise() function.
Another approach, which may be closer to your original attempt, is to use the pull() function:
hotel_bookings_city %>%
pull(lead_time) %>%
mean()
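To make the difference concrete, here is a minimal sketch that compares the approaches mentioned above. The small `bookings` data frame is a hypothetical stand-in for `hotel_bookings_city`; only the column name `lead_time` is taken from the question.

```r
library(dplyr)

# Hypothetical stand-in for hotel_bookings_city (values are made up)
bookings <- data.frame(lead_time = c(10, 20, 30))

# Base R: the column is handed to mean() directly
m_base <- mean(bookings$lead_time)

# pull() extracts the column as a plain vector before mean() sees it
m_pull <- bookings %>%
  pull(lead_time) %>%
  mean()

# summarise() evaluates mean(lead_time) inside the data frame
m_summarise <- bookings %>%
  summarise(avg = mean(lead_time))
```

All three give the same number; the difference is only whether mean() receives a vector or the whole data frame.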

Related

Converting chr to numeric and still not able to take mean

I am working with a dataframe from NYC opendata. On the information page it claims that a column, ACRES, is numeric, but when I download it is chr. I've tried the following:
parks$ACRES <- as.numeric(as.character(parks$ACRES))
which turned the column info type into dbl, but I was unable to take the mean, so I tried:
parks$ACRES <- as.integer(as.numeric(parks$ACRES))
I've also tried sapply(), and I get a message that NAs were introduced by coercion. I tried convert() too, but R didn't recognize it even though it is supposed to be part of dplyr.
Either way I get NA as a result for the mean.
I've tried taking the mean a few different ways:
mean(parks[["ACRES"]])
mean(parks$ACRES)
Neither of these worked. Is it the dataframe? Since it comes from the government, I wonder whether there are limits on it.
I'd appreciate any help.
You have NAs in your data. Either they were there before you converted or some of the data can't be converted to numeric directly (do you have comma separators for the 1000s in your input? Those need to be removed before converting to numeric).
Identifying why you have NAs and fixing if necessary is the first step you'll need to do. If the NAs are valid then what you want to do is to add the na.rm = TRUE parameter to the mean function which ignores NAs while calculating the mean.
Check to see how ACRES is being loaded in (i.e., what data type is it?). If it's being loaded in as a factor, you will have trouble changing a factor to a numerical value. The way to solve this is to use the 'stringsAsFactors = FALSE' argument in your read.csv or whatever function you're using to read in the data.
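The steps suggested in these answers could look roughly like the sketch below. The vector `acres_raw` is made-up sample data, not the actual NYC file, and the assumption that the problem values are comma separators, blanks, or text placeholders is just that, an assumption.

```r
# Hypothetical sample of what the ACRES column might contain as text
acres_raw <- c("1,250.5", "3.2", "", "N/A", "10")

# Remove thousands separators first, then convert;
# anything that still cannot be parsed becomes NA
acres <- suppressWarnings(as.numeric(gsub(",", "", acres_raw)))

# Inspect which original values failed to convert
bad_values <- acres_raw[is.na(acres)]

# mean() returns NA as long as any NA is present ...
mean_with_na <- mean(acres)

# ... unless the NAs are explicitly dropped
mean_clean <- mean(acres, na.rm = TRUE)
```

Looking at `bad_values` first helps decide whether the NAs are legitimate missing data or a parsing problem that should be fixed before averaging.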

Why is one_of() called that?

Why is dplyr::one_of() called that? All the other select_helpers names make sense to me, so I'm wondering if there's an aspect of one_of() that I don't understand.
My understanding of one_of() is that it just lets you select variables using a character vector of their names instead of putting their names into the select() call, but then you get all of the variables whose names are in the vector, not just one of them. Is that wrong, and if it's correct, where does the name one_of() come from?
one_of allows for guessing or subset-matching
Let's say I know in general my column names will come from c("mpg","cyl","garbage") but I don't know which columns will be present because of interactivity/reactivity
mtcars %>% select(one_of(c("mpg","cyl","garbage")))
evaluates but provides a message
Warning message:
Unknown variables: `garbage`
In contrast
mtcars %>% select(mpg, cyl, garbage)
does not evaluate and gives the error
Error in overscope_eval_next(overscope, expr) :
object 'garbage' not found
The way I think about it is that select() eventually evaluates to a logical vector. So if you use starts_with it goes through the variables in the dataframe and asks whether the variable name starts with the right set of characters. one_of does the same thing but asks whether the variable name is one of the names listed in the character vector. But as they say, naming things is hard!
The reason for its name seems to be that it allows you to look for, at least, one of the variables that are contained in the vector.
For example:
select(flights, dep, arr_delay, sched_dep_time) won't work because the variable "dep" does not exist. It stops with an error and produces no result.
select(flights, one_of(c("dep", "arr_delay", "sched_dep_time"))) will work, even though the variable "dep" does not exist. In this case, "arr_delay" and "sched_dep_time" will be shown.
The helper should be read as: at least one_of() the variables will be shown :)
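A small sketch of the behaviour described in these answers, using mtcars and the same deliberately missing name "garbage". It also shows any_of(), which newer tidyselect/dplyr releases offer as the replacement for the superseded one_of(); whether your installed version has it is an assumption here.

```r
library(dplyr)

wanted <- c("mpg", "cyl", "garbage")

# one_of() keeps the columns that exist and warns about the rest
# (the warning is suppressed here only to keep the output quiet)
res_one_of <- suppressWarnings(select(mtcars, one_of(wanted)))

# any_of() selects whichever of the names exist, without a warning
res_any_of <- select(mtcars, any_of(wanted))

names(res_one_of)
names(res_any_of)
```

In both cases the result contains only the columns that are actually present in the data.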

Converting Data Type from data.table package in R

this might be a dumb/obvious question but unfortunately I haven't had much luck finding information about it online so I thought I'd ask it here. Basically, I'm working with the data.table package in R and I have imported a data set into R where, in a particular column, the values can be both numeric values and character values (and even blank/empty values), and I want to be able to obtain a value from that column and use it for calculations.
The thing about the data.table package though is that when you import a file using the fread() function it automatically sets all values in that file as a character data type, so this can cause a few issues since this means that all numbers are automatically character types as well. I have worked around this slightly by using the as.numeric() function so that if a value obtained from that column is a number then it can be easily converted to numeric type and used in calculations. However, since the column also contains other characters (specifically, it can also have \N or N as values) and since it can also contain blank/empty values, then this means the as.numeric() function will show up with an error. For example, I initially wrote an IF loop to detect whether a column cell had a character value or a numeric value as follows:
if( as.numeric(..{Reference to column cell from file here}...) == NA ) {
x <- 0
}
(where x is just some variable), but it did not work and instead gave the output:
Error in if ((as.numeric(.... :
missing value where TRUE/FALSE needed
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
(I should note that is.numeric() also did not work since all values in a data.table data set are automatically character values, so this function always gives FALSE regardless of its actual data type).
So clearly I need a better function or method to work around this. Is there a function capable of reading a 'character' value from a column and being able to detect whether that value is truly a numeric type or character type (or even neither, in the case of an empty cell)? Thanks in advance
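One possible way to approach this, sketched under assumptions: the vector `vals` below is invented sample data standing in for the column, and treating non-numeric cells as 0 simply mirrors the `x <- 0` in the question. The key point is that comparing with `== NA` always yields NA, which is why `if()` complains; `is.na()` is the usual test.

```r
# Hypothetical column contents: numbers, "\N", "N" and an empty cell
vals <- c("12.5", "\\N", "N", "", "7")

# Try the conversion once; cells that are not numbers become NA
nums <- suppressWarnings(as.numeric(vals))

# TRUE where the cell really held a number
is_num <- !is.na(nums)

# Use 0 for everything that was not numeric
x <- ifelse(is_num, nums, 0)

# For a single cell, test with is.na() instead of == NA
cell <- suppressWarnings(as.numeric(vals[2]))
cell_value <- if (is.na(cell)) 0 else cell
```

The logical vector `is_num` can also be used to tell the cases apart, for example `vals[!is_num]` lists the cells that were text or empty.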

R: Error in .Primitive, non-numeric argument to binary operator

I did some reading on similar SO questions, but couldn't figure out how to resolve my error.
I have written the following string of code:
points[paste0(score.avail,"_pts")] <-
Map('*', points[score.avail], mget(paste0(score.avail,'_m')) )
Essentially, I have a list of columns in the 'points' data frame, defined by 'score.avail'. I am multiplying each of the columns by a respective constant, defined as the paste0(score.avail, '_m') expression. It appends new fields based on the multiplication, given by paste0(score.avail, "_pts") expression.
I have used this function before in a similar setup with no issues. However, I am now getting the following error:
Error in .Primitive("*")(dots[[1L]][[1L]], dots[[2L]][[1L]]) :
non-numeric argument to binary operator
I'm pretty sure R is telling me that one of the fields I'm trying to multiply is not numeric. However, I have checked all my fields, and they are numeric. I have even tried running a line as.numeric(score.avail) but that doesn't help. I also ran the following to remove NA's in the fields (before the Map function above).
for(col in score.avail){
points[is.na(get(col)) & (data.source == "average" |
data.source == "averageWeighted"), (col) := 0]}
The thing that stumps me is that this expression has worked with no issues before.
Update
I did some more digging by separating out each component of my original function. I'm getting odd output when running points[score.avail]. Previously when I ran this, it would return just the columns for all of my rows. Now, however, I'm getting none of the rows in my original data frame. Instead, it is treating the column names in the 'score.avail' list as rows and filling in NA's everywhere (this is clearly the source of my problem).
I think this is because the object I'm pointing to is a data.table with key variables set. Previously with this function, I had been pointing to a data frame.
Off to try a few more things.
Another Update
I was able to solve my problem by copying the 'points' object using as.data.frame(). However, I will leave the question open to see if anyone knows how to reset the data table key vars so that the function I specified above will work.
I was able to solve my problem by copying the 'points' object using as.data.frame(). Apparently classifying the object as a data.table was causing my headaches.
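For anyone who wants to stay with data.table instead of converting: on a keyed data.table, a character vector in the first position of `[` is looked up against the key (rows), not used to pick columns. The sketch below shows that effect and one way to select and multiply the columns anyway. The table contents, the key column `id`, and the named list `multipliers` (used here in place of the `mget()` call from the question) are all hypothetical.

```r
library(data.table)

# Hypothetical stand-in for the 'points' table, keyed on id
points <- data.table(id = c("a", "b"), pass = c(1, 2), rush = c(3, 4))
setkey(points, id)

score.avail <- c("pass", "rush")

# Hypothetical multipliers, standing in for the *_m constants
multipliers <- list(pass = 2, rush = 10)

# Character i on a keyed table is a row lookup on the key:
# no id equals "pass" or "rush", so the other columns come back as NA
looked_up <- points[score.avail]

# Selecting columns by name needs j together with with = FALSE
cols <- points[, score.avail, with = FALSE]

# Multiply each column by its constant and add the *_pts columns by reference
new_cols <- paste0(score.avail, "_pts")
points[, (new_cols) := Map(`*`, .SD, multipliers[score.avail]),
       .SDcols = score.avail]
```

Using `.SDcols` keeps the column selection inside data.table's own syntax, so the key on the table no longer matters for this step.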

ggplot iterate several columns

lapply(7:12, function(x) ggplot(mydf)+geom_histogram(aes(mydf[,x])))
gives the error: Error in `[.data.frame`(mydf, , x) : undefined columns selected.
I have used several SO questions (e.g. this) as guidance, but can't figure out my error.
The code below works with the mtcars dataset. Just replace mtcars with mydf.
library(ggplot2)
lapply(1:3,function(i) {
ggplot(data.frame(x=mtcars[,i]))+
geom_histogram(aes(x=x))+
ggtitle(names(mtcars)[i])
})
Notice how the reference to i (the column index) was moved from the mapping argument (the call to aes(...)), to the data argument.
Your problem is actually quite subtle. ggplot evaluates the arguments to aes(...) first in the context of your data - e.g. it looks for column names in mydf. If that fails it jumps to the global environment. It does not look in the function's environment. See this post for another example of this behavior and some discussion.
The bottom line is that it is a really bad idea to use external variables in a call to aes(...). However, the data=... argument does not suffer from this. If you must refer to a column number, etc., do it in the call to ggplot(data=...).
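As a possible alternative, assuming a reasonably recent ggplot2 (3.0.0 or later, which is an assumption about your installation), the `.data` pronoun lets you refer to a column by its name as a string inside aes(). The sketch below iterates over column names instead of positions; `bins = 10` is an arbitrary choice.

```r
library(ggplot2)

# Column names to plot (first three columns of mtcars)
cols <- names(mtcars)[1:3]

# One histogram per column; .data[[col]] looks the column up by name
plots <- lapply(cols, function(col) {
  ggplot(mtcars, aes(x = .data[[col]])) +
    geom_histogram(bins = 10) +
    ggtitle(col)
})
names(plots) <- cols
```

Printing `plots[["mpg"]]` (or the whole list) draws the histograms. Iterating over names rather than indices also keeps the titles and axis labels readable.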

Resources