Why is na.rm not working for this case only?

I have been working with weather data that contains some NA values. Usually, to sum up values for one day, I use colSums like the following: colSums(df, na.rm = TRUE). This has never caused an issue before.
However, using the same call for a different analysis now returns the following error:
colSums(I_2011, na.rm = TRUE)
Error in colSums(I_2011, na.rm = TRUE) : 'x' must be numeric
I don't understand why. The only difference is that I_2011 is imported from a CSV:
I_2011 <- read.csv("2011_IMD.csv", check.names = FALSE)
Does the latter require something different?
I'm lost on what to do next. I don't need to remove the columns containing NA, only to disregard them while doing colSums.
I tried:
I_2011 %>%
  mutate(avg = rowSums(., na.rm = TRUE)) %>%
  bind_cols(I_2011[setdiff(names(I_2011), names(.))], .)
This returns the same error.
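A likely culprit (a guess, since the CSV itself isn't shown) is that read.csv imported one or more columns as character, e.g. because of a stray non-numeric entry. na.rm = TRUE only skips NA values inside numeric columns; it cannot rescue a column that isn't numeric at all. A minimal sketch of the diagnosis and a workaround, using hypothetical data:

```r
# Hypothetical data: column 'b' arrived as character because of a stray entry
df <- data.frame(a = c(1, NA, 3), b = c("2", "x", "4"))

sapply(df, class)   # reveals which columns are not numeric

# Sum only the numeric columns; na.rm = TRUE then skips the NAs as usual
num_cols <- sapply(df, is.numeric)
colSums(df[, num_cols, drop = FALSE], na.rm = TRUE)
```

If the offending entries should have been numbers, it is worth inspecting them with something like unique(df$b[is.na(suppressWarnings(as.numeric(df$b)))]) before coercing.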

Related

Peculiar behavior of NA values. Can anyone provide insight into understanding the mechanics?

So, I have a dataframe with 90,797 rows and 29 columns, according to nrow(df) and ncol(df).
When using summary(df), I note that my dataframe contains NA values in 8 variables. The number of NA values across all columns is less than 15,000.
Why, then, when I run the following code, do I get far fewer rows than 90,797 - 15,000?
nrow(df2[complete.cases(df2),]) # this returns 14,186 rows
I double-checked using na.omit(df) as well, and it also returned 14,186 rows. How is this possible?
BUT, the mystery deepens. When I use a variable with 5,597 NA values in a dplyr summary, I do not get an NA value returned, despite leaving the NAs as is (not using na.rm = TRUE).
For example, where the variable Channel_1_qty contains over 5,000 NA values:
df3 <- df2 %>%
  select(Season, Article, Channel_1_qty) %>%
  filter(Season %in% c("FW2022", "SS2023") & (Channel_1_qty > 0)) %>%
  group_by(Season) %>%
  summarise(
    article_count = n_distinct(Article),
    tot_qty = sum(Channel_1_qty)
  )
This gives me the output that I want. However, if I filter the initial dataframe using complete.cases() or na.omit() first, I get far fewer rows than I do leaving them in. What is happening under the hood? It was my understanding that dplyr could not return a summary statistic if NA values were included, or, if it did, that they would be removed from the dataframe.
Can anyone provide insight into what's happening? Sorry, I cannot post the original data file or a reproducible example; it's more theoretical / understanding what's happening "under the hood" in R.
Thanks!
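Both observations come down to how NA propagates. complete.cases() keeps only rows with no NA in any column, so NAs scattered across different rows of 8 variables can knock out far more rows than any single column's NA count suggests. And a comparison like Channel_1_qty > 0 evaluates to NA for missing values, so those rows are dropped by the filter before sum() ever sees them. A toy sketch of both effects with made-up data (base subset() is shown; dplyr::filter() drops NA-condition rows the same way):

```r
df <- data.frame(a   = c(1, NA, 3, 4),
                 b   = c(NA, 2, 3, NA),
                 qty = c(5, NA, 2, 7))

colSums(is.na(df))               # no column has more than 2 NAs
nrow(df[complete.cases(df), ])   # yet only 1 row is complete: the NAs hit different rows

# NA > 0 is NA, not TRUE, so the row with a missing qty silently disappears
nrow(subset(df, qty > 0))        # 3 rows survive, with no na.rm needed downstream
```

So the summarise() worked without na.rm = TRUE simply because the filter had already discarded every row where Channel_1_qty was NA.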

How to calculate the mean in R excluding NA values

I am a newbie in R, so this is probably a simple question. I am working with a large data frame to find the average review score (1-5) of items when a service is being used. A lot of items have "NoReview" in the review column. Is there a way I can exclude the items that say "NoReview"? I tried using na.rm = TRUE, but I am pretty sure that only works for data coded as NA.
Attached link is the code I tried and the error I received.
You need to transform your review column to numeric.
You can achieve that by transforming the "NoReview" values to NA and then converting the column to numeric.
Try this:
odat %>%
  mutate(review = case_when(
    review == "NoReview" ~ NA_character_,
    TRUE ~ review)) %>%
  group_by(if_cainiao) %>%
  summarise(avgReview = mean(as.numeric(review), na.rm = TRUE))
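An equivalent, slightly shorter route uses dplyr's na_if() to map the sentinel string straight to NA before coercing. A runnable sketch with a made-up stand-in for odat (the real data isn't shown):

```r
library(dplyr)

# Hypothetical stand-in for odat
odat <- data.frame(if_cainiao = c(1, 1, 0),
                   review     = c("4", "NoReview", "5"))

out <- odat %>%
  mutate(review = as.numeric(na_if(review, "NoReview"))) %>%
  group_by(if_cainiao) %>%
  summarise(avgReview = mean(review, na.rm = TRUE))

out
```

Converting "NoReview" to NA first also keeps as.numeric() from emitting "NAs introduced by coercion" warnings.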


How to interpret column length error from dplyr::mutate?

I'm trying to apply a function (more complex than the one used below, but I've simplified it here) across two vectors. However, I receive the following error:
Error in mutate_impl(.data, dots) :
  Column `diff` must be length 2 (the group size) or one, not 777
I think I may be getting this error because the row-to-row difference yields one fewer value than the original dataframe has rows, per some posts that I read. However, when I followed that advice and tried to pad the final position with 0/NA, I received another error. Did I at least identify the source of the error correctly? Ideas? Thank you.
Original code:
diff_df <- DF %>%
  group_by(DF$var1, DF$var2) %>%
  mutate(diff = map2(DF$duration, lead(DF$duration), `-`)) %>%
  as.data.frame()
We don't need map2 to get the difference between 'duration' and the lead of 'duration'; subtraction is vectorized. map2 would loop through each element of 'duration' with the corresponding element of lead(duration), which is unnecessary.
DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration))
NOTE: When we extract a column with DF$duration after the group_by, it breaks the grouping and returns the full dataset's column. Also, inside a pipe there is no need for dataset$columnname; it should be just columnname. (However, in certain situations, when we want the full column for some comparison, it can be used.)
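A reproducible toy version of the fix (made-up data, since the real DF isn't shown), confirming that the vectorized form returns one value per row, with NA at the end of each group where lead() runs out of values:

```r
library(dplyr)

DF <- data.frame(var1     = c("a", "a", "b", "b"),
                 var2     = 1,
                 duration = c(10, 7, 5, 1))

diff_df <- DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration)) %>%  # last row of each group gets NA
  as.data.frame()

diff_df$diff
```

Because the result is the same length as the input (with NA padding), mutate() no longer complains about the column length.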

Getting "NA" when I run a standard deviation

Quick question. I read my csv file into the variable data. It has a column labelled var, which has numerical values.
When I run the command
sd(data$var)
I get
[1] NA
instead of my standard deviation.
Could you please help me figure out what I am doing wrong?
Try sd(data$var, na.rm = TRUE) and then any NAs in the column var will be ignored. It will also pay to check your data to make sure the NAs should be NAs and there haven't been read-in errors; commands like head(data), tail(data), and str(data) should help with that.
I've made the mistake a time or two of reusing variable names in dplyr chains, which has caused issues.
mtcars %>%
  group_by(gear) %>%
  mutate(ave = mean(hp)) %>%
  ungroup() %>%
  group_by(cyl) %>%
  summarise(med = median(ave),
            ave = mean(ave), # should've named this variable something different
            sd = sd(ave))    # this is the sd of my newly created variable "ave", not the original one
You probably have missing values in var, or the column is not numeric, or there's only one row.
Try removing missing values which will help for the first case:
sd(dat$var, na.rm = TRUE)
If that doesn't work, check that
class(dat$var)
is "numeric" (the second case) and that
nrow(dat)
is greater than 1 (the third case).
Finally, data is a function in R, so it's best to use a different name, which I've done here.
There may be Inf or -Inf as values in the data.
Try
is.finite(data)
or
min(data, na.rm = TRUE)
max(data, na.rm = TRUE)
to check if that is indeed the case.
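A quick sketch tying the cases from the answers together, showing how each one surfaces and how the NA case is fixed:

```r
x <- c(2, 4, NA, 6)

sd(x)                 # NA: a single missing value poisons the result
sd(x, na.rm = TRUE)   # the sd of 2, 4, 6 once the NA is dropped
class(x)              # "numeric", ruling out the non-numeric case
sd(42)                # NA: a single observation has no spread
sd(c(1, Inf))         # non-finite input is another way to lose the answer
```

If class() reports "character" or "factor" instead, the column needs cleaning and coercion (as in the read-in-errors advice above) before sd() can work at all.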
