How to calculate the mean in R excluding NA values

I am a newbie in R, so this is probably a simple question. I am working with a large data frame to find the average review score (1-5) of items when a service is being used. A lot of items have "NoReview" in the review column. Is there a way I can exclude the items that say "NoReview"? I tried using na.rm = TRUE, but I am pretty sure that only applies to values that are actually NA.
Attached link is the code I tried and the error I received.

You need to transform your review column to numeric.
You can achieve that by converting the "NoReview" values to NA and then coercing the column to numeric.
Try this:
odat %>%
  mutate(review = case_when(
    review == "NoReview" ~ NA_character_,  # keep types consistent inside case_when
    TRUE ~ review)) %>%
  group_by(if_cainiao) %>%
  summarise(avgReview = mean(as.numeric(review), na.rm = TRUE))
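An equivalent, slightly shorter route is dplyr's na_if(), which turns a sentinel string into NA in one step. A minimal sketch, assuming the same odat, review, and if_cainiao names as above:
library(dplyr)
odat %>%
  mutate(review = as.numeric(na_if(review, "NoReview"))) %>%  # "NoReview" becomes NA, then coerce
  group_by(if_cainiao) %>%
  summarise(avgReview = mean(review, na.rm = TRUE))
Coercing after na_if() also means any remaining non-numeric strings surface as NAs with a warning, which is a useful data-quality check.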

Related

Why is na.rm not working for this case only?

I have been working with weather data that contains some NA values. Usually, to sum up values for one day, I use colSums() like the following: colSums(df, na.rm = TRUE). This of course never created any issue.
However, using the same approach now for a different analysis returns the following error:
colSums(I_2011, na.rm = T)
Error in colSums(I_2011, na.rm = T) : 'x' must be numeric
I don't understand why; the only difference is that I_2011 is imported from a CSV:
I_2011 <- read.csv("2011_IMD.csv", check.names = FALSE)
Does the latter require something different?
I am lost on what to do next. I don't need to remove the columns containing NA, only to disregard them while doing colSums().
I tried:
I_2011 %>%
  mutate(avg = rowSums(., na.rm = TRUE)) %>%
  bind_cols(I_2011[setdiff(names(I_2011), names(.))], .)
which returns the same error.
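The 'x' must be numeric error means at least one column of I_2011 is non-numeric, typically because read.csv() found stray text (units, footnotes, a "trace" marker) in a column and imported it as character. A diagnostic sketch, under that assumption:
# which columns did read.csv() import as something other than numeric?
sapply(I_2011, class)

# sum only the numeric columns, ignoring NAs
num_cols <- sapply(I_2011, is.numeric)
colSums(I_2011[, num_cols, drop = FALSE], na.rm = TRUE)

# or coerce character columns that should be numeric;
# entries that fail to parse become NA (with a warning)
I_2011[] <- lapply(I_2011, function(x) if (is.character(x)) as.numeric(x) else x)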

Peculiar behavior of NA values. Can anyone provide insight into understanding the mechanics?

So, I have a data frame with 90,797 rows and 29 columns, according to nrow(df) and ncol(df).
When using summary(df), I note that my data frame contains NA values in 8 variables. The total number of NA values across all columns is less than 15,000.
Why, then, when I run the following code, do I get far fewer rows than 90,797 - 15,000?
nrow(df2[complete.cases(df2), ]) # this returns 14,186 rows
I double-checked using na.omit(df) as well, and it also returned 14,186 rows. How is this possible?
BUT, the mystery deepens. When I use a variable with 5,597 NA values in a dplyr summary, I do not get an NA value back, in spite of leaving the NAs as they are (not using na.rm = TRUE).
Example, where variable Channel_1_qty contains over 5,000 NA values:
df3 <- df2 %>%
  select(Season, Article, Channel_1_qty) %>%
  filter(Season %in% c("FW2022", "SS2023") & (Channel_1_qty > 0)) %>%
  group_by(Season) %>%
  summarise(
    article_count = n_distinct(Article),
    tot_qty = sum(Channel_1_qty)
  )
This gives me the output that I want. However, if I filter the initial data frame using complete.cases() or na.omit() first, I get far fewer rows than I do leaving the NAs in. What is happening under the hood? It was my understanding that dplyr could not return a summary statistic if NA values were included, or that if it did, they would first have to be removed from the data frame.
Can anyone provide insight into what's happening? Sorry, I cannot post the original data file or provide a reproducible example; it's more theoretical, understanding what's happening "under the hood" in R.
Thanks!
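Both observations have a simple mechanical explanation, reproducible with a toy example. complete.cases() keeps a row only if every column is non-NA, so NAs scattered across 8 variables can eliminate far more rows than their total count suggests. And the pipeline above never sees an NA because dplyr's filter() silently drops rows where the condition evaluates to NA: NA > 0 is NA, and filter() treats NA like FALSE. A minimal sketch:
library(dplyr)

df <- data.frame(a   = c(1, NA, 3, 4),
                 b   = c(NA, 2, 3, 4),
                 qty = c(5, 6, NA, 8))

# three NAs, each in a different row, leave only one complete row
nrow(df[complete.cases(df), ])                        # 1

# NA > 0 evaluates to NA and filter() drops those rows,
# so the later sum() only ever sees non-NA values
df %>% filter(qty > 0) %>% summarise(tot = sum(qty))  # 19, no NA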

R- How do I use a lookup table containing threshold values that vary for different variables (columns) to replace values below those thresholds?

I am trying to streamline the process of auditing chemistry laboratory data. When we encounter data where an analyte is not detected, I need to change the recorded result to a value equal to 1/2 of the level of detection (LOD) for the analytical method. I have LODs contained within another data frame to be used as a lookup table.
I have multiple columns representing data from different analytical tests, each with its own unique LOD. Here's an example of the type of data I am working with:
library(tidyverse)
dat <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
              "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                              'paint', 'sawmill', 'bolt', 'water',
                              'herring', 'sawmill'),
              "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
              "TP" = c(1.5, 15.7, -2.3, 7.6, 0.1, 45.6, 12.2, -0.1, 22.2, 0.6),
              "TN" = c(100.3, 56.2, -10.5, 0.4, -0.3, 11.0, 45.8, 256.0, 12.2, 144.0),
              "DOC" = c(56.0, 120.3, -10.5, 0.2, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat
detect_level <- tibble("Parameter" = c('TP', 'TN', 'DOC'),
                       "LOD" = c(0.6, 11, 0.3)) %>%
  mutate(halfLOD = LOD / 2)
detect_level
I have pored over multiple other questions with a similar theme:
Change values in multiple columns of a dataframe using a lookup table
R - Match values from multiple columns in a data.frame to a lookup table.
Replace values in multiple columns using different thresholds
and gotten to a point where I have pivoted the data and split it into a list of data frames, one per analyte:
dat %>%
  pivot_longer(cols = c('TP', 'TN', 'DOC')) %>%
  arrange(name) %>%
  split(.$name)
I have tried to apply a function using map(); however, I cannot figure out how to integrate the values from the lookup table (detect_level) into my code. If someone could help me continue this pipe, or finish the process to achieve a final product dat2 that looks like the following, I would appreciate it:
dat2 <- tibble("Lab_ID" = as.character(seq(1, 10, 1)),
               "Tributary" = c('sawmill', 'paint', 'herring', 'water',
                               'paint', 'sawmill', 'bolt', 'water',
                               'herring', 'sawmill'),
               "date" = rep(as.POSIXct("2021-10-01 12:00:00"), 10),
               "TP" = c(1.5, 15.7, 0.3, 7.6, 0.3, 45.6, 12.2, 0.3, 22.2, 0.6),
               "TN" = c(100.3, 56.2, 5.5, 5.5, 5.5, 11.0, 45.8, 256.0, 12.2, 144.0),
               "DOC" = c(56.0, 120.3, 0.15, 0.15, 14.6, 489.3, 0.3, 14.4, 54.6, 88.8))
dat2
Another possibility comes from the closest similar question I have found:
Lookup multiple column from a single table
Here's a snippet of code adapted from that question; however, if you run it you will see that an NA is returned wherever a value is not found in detect_level. Additionally, it does not appear to have worked for $TN or $DOC, even in cases where the $LOD value from detect_level was present.
dat %>%
  mutate(across(all_of(unique(detect_level$Parameter)),
                ~ {i1 <- detect_level$Parameter == cur_column()
                   detect_level$LOD[i1][match(., detect_level$LOD)]}))
I am not comfortable at all with the purrr-style lambda syntax here and have only adapted this code from the question linked, so if this is the direction an answerer chooses, I would appreciate comments in the code briefly explaining what is happening "under the hood".
Thank you in advance!
Perhaps this helps:
library(dplyr)
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # for each analyte column, match(cur_column(), ...) finds the
                # matching row of detect_level; pmax() then raises any value
                # below that LOD up to the LOD itself
                ~ pmax(., detect_level$LOD[match(cur_column(), detect_level$Parameter)])))
For the updated case (replace values below the LOD with half of the LOD):
dat %>%
  mutate(across(all_of(detect_level$Parameter),
                # replace() swaps every value below the column's LOD
                # for that column's halfLOD from the lookup table
                ~ replace(., . < detect_level$LOD[match(cur_column(), detect_level$Parameter)],
                          detect_level$halfLOD[match(cur_column(), detect_level$Parameter)])))
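If the long format from the question feels more natural, a join does the lookup without split() + map(). A sketch, assuming dat and detect_level as defined above:
library(tidyverse)

dat %>%
  pivot_longer(cols = all_of(detect_level$Parameter),
               names_to = "Parameter", values_to = "result") %>%
  left_join(detect_level, by = "Parameter") %>%              # attach LOD and halfLOD to each row
  mutate(result = if_else(result < LOD, halfLOD, result)) %>%
  select(-LOD, -halfLOD) %>%
  pivot_wider(names_from = Parameter, values_from = result)
This reproduces dat2 above, and the lookup logic stays visible as an ordinary join rather than being buried inside across().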

Getting "NA" when I run a standard deviation

Quick question: I read my CSV file into the variable data. It has a column labeled var, which has numerical values.
When I run the command
sd(data$var)
I get
[1] NA
instead of my standard deviation.
Could you please help me figure out what I am doing wrong?
Try sd(data$var, na.rm = TRUE); any NAs in the column var will then be ignored. It will also pay to check your data to make sure the NAs should really be NAs and that there haven't been read-in errors; commands like head(data), tail(data), and str(data) should help with that.
I've made the mistake a time or two of reusing variable names in dplyr pipelines, which has caused issues.
mtcars %>%
  group_by(gear) %>%
  mutate(ave = mean(hp)) %>%
  ungroup() %>%
  group_by(cyl) %>%
  summarise(med = median(ave),
            ave = mean(ave), # should've named this variable something different
            sd = sd(ave))    # this is the sd of the newly created variable "ave", not the original one
You probably have missing values in var, or the column is not numeric, or there's only one row.
Try removing missing values, which will help for the first case:
sd(dat$var, na.rm = TRUE)
If that doesn't work, check that
class(dat$var)
is "numeric" (the second case) and that
nrow(dat)
is greater than 1 (the third case).
Finally, data is a function in R so best to use a different name, which I've done here.
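A quick illustration of the first and third cases:
sd(c(1, 2, NA))                # NA: a single missing value poisons the result
sd(c(1, 2, NA), na.rm = TRUE)  # 0.7071068, the NA is dropped
sd(42)                         # NA: one observation has no spread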
There may be Inf or -Inf values in the data.
Try
is.finite(data$var)
or
min(data$var, na.rm = TRUE)
max(data$var, na.rm = TRUE)
to check if that is indeed the case.
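If the column really does contain an infinite value, sd() returns NaN rather than a number, and the min/max check exposes it immediately:
x <- c(1, 2, Inf)
range(x, na.rm = TRUE)  # 1 Inf: the Inf shows up in the max
sd(x, na.rm = TRUE)     # NaN, not a usable standard deviation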
