Dropping NA values of factors within a function - r

Toy data:
Say I have this df
df <- structure(list(x = structure(c(NA, 7L, NA, NA, 4L, 6L, 6L, 2L,
3L, 5L, 8L, 4L, 7L, 3L, 5L, 1L, 5L, 5L, 5L, NA), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), y = structure(c(NA,
2L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, NA, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L, 2L, 2L), .Label = c("1", "2", "3"), class = "factor"), z = structure(c(NA,
4L, 4L, 4L, 5L, 4L, 5L, 5L, 2L, NA, 4L, 1L, 1L, 3L, 2L, 5L, 2L,
2L, 4L, NA), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
a = c(-32L, -51L, -22L, 44L, 55L, -24L, -50L, 67L, 1L, -47L,
66L, -98L, -91L, -42L, -89L, -31L, -8L, -33L, 38L, 61L),
b = c(46L, -19L, -37L, 47L, -28L, -48L, 14L, -10L, -13L,
-31L, 32L, 21L, -21L, 25L, -8L, 42L, -26L, -24L, 36L, -39L
)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))
df
# A tibble: 20 × 5
x y z a b
<fct> <fct> <fct> <int> <int>
1 NA NA NA -32 46
2 7 2 4 -51 -19
3 NA 3 4 -22 -37
4 NA 2 4 44 47
5 4 2 5 55 -28
6 6 2 4 -24 -48
7 6 2 5 -50 14
8 2 1 5 67 -10
9 3 3 2 1 -13
10 5 NA NA -47 -31
11 8 2 4 66 32
12 4 3 1 -98 21
13 7 1 1 -91 -21
14 3 1 3 -42 25
15 5 3 2 -89 -8
16 1 2 5 -31 42
17 5 2 2 -8 -26
18 5 3 2 -33 -24
19 5 2 4 38 36
20 NA 2 NA 61 -39
I want to normalize variables x, y, and z on a 0-1 scale, and then produce some summary stats on them. I can produce the summary stats just fine using the code below
Code that works:
library(tidyverse)
vars <- c('x', 'y', 'z')
names(vars) <- vars
summary_stats <- function(data){
tibble(
n = sum(!is.na(data)),
mean = round(mean(as.numeric(data), na.rm = T), digits = 3),
sd = round(sd(as.numeric(data), na.rm = T), digits = 3),
se = round(sd/sqrt(n), digits = 3)
)
}
table <- map_df(
df %>%
dplyr::select(vars),
summary_stats,
.id = "covariate")
table
# A tibble: 3 × 5
covariate n mean sd se
<chr> <int> <dbl> <dbl> <dbl>
1 x 16 4.75 1.88 0.47
2 y 18 2.11 0.676 0.159
3 z 17 3.35 1.41 0.342
Code that doesn't work:
But i'm struggling to figure out how to normalize the variables. My latest attempt is to try this
summary_stats <- function(data){
data_norm <- drop_na(data) %>% dplyr::summarize(
(as.numeric(data) - min(as.numeric(data))) /
(max(as.numeric(data)) - min(as.numeric(data)))
)
tibble(
n = sum(!is.na(data_norm)),
mean = round(mean(as.numeric(data_norm), na.rm = T), digits = 3),
sd = round(sd(as.numeric(data_norm), na.rm = T), digits = 3),
se = round(sd/sqrt(n), digits = 3)
)
}
table <- map_df(
df %>%
dplyr::select(vars),
summary_stats,
.id = "covariate")
Errors:
But this returns the error
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "factor"
If I convert it to a numeric on the fly, so I have data_norm <- drop_na(as.numeric(data)) etc., I then get a very similar error saying
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "c('double', 'numeric')"
However, if I do this outside of the function it works fine
df %>% drop_na(x) %>% summarise(std_mean = (as.numeric(x) - min(as.numeric(x))) / (max(as.numeric(x)) - min(as.numeric(x))))
# A tibble: 16 × 1
std_mean
<dbl>
1 0.857
2 0.429
3 0.714
4 0.714
5 0.143
6 0.286
7 0.571
....
I need to remove the NA values or when I try and normalize the returned variable will have all NAs if there is at least 1 NA in that column. And if I apply drop_na() outside the function (to the master tibble i feed in to the map_dfr function), it will drop any row that has at least 1 NA value in any variable from the df, rather than just the NA values from that column.
Can anyone help here?
Update:
If I remove the drop_na() call from the function i get the following error
Error in UseMethod("summarise") :
no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"
This makes zero sense to me (i'm probably not understanding it) as summarise definitely works with numeric variables...

Looks like what's happening is that you're trying to write a function to take an entire data frame as an argument, but when you go to map it, you're actually only passing a single vector (e.g. df$x) as the argument to the function. This works fine for the first version of your function, but in the second version drop_na fails to work because it takes an entire data frame for an argument. Same goes for summarize, which is why you were getting a similar error. It also works outside of your function because you're able to specify a single vector.
So, what I did was swap out drop_na for na_omit, and also reorganized your code a bit.
First, let's just define a separate std_mean function so we don't have to deal with summarize:
std_mean <- function(x){
x <- na.omit(x)
(as.numeric(x) - min(as.numeric(x)))/(max(as.numeric(x)) - min(as.numeric(x)))
}
Now we can go back and fix your original function:
summary_stats <- function(vec){
data_norm <- std_mean(vec)
n = length(data_norm)
sd = round(sd(as.numeric(data_norm), na.rm = T), digits = 3)
data.frame(
n = n,
mean = round(mean(as.numeric(data_norm), na.rm = T), digits = 3),
sd = sd,
se = round(sd/sqrt(n), digits = 3)
)
}
We have to define n and sd beforehand because they were being used as arguments in other columns of the data frame. While it would be cool for data.frame to calculate the first column to then allow you to feed into later columns, that isn't the case.
And now we're ready to map:
map(df[vars],summary_stats)
$x
n mean sd se
1 16 0.536 0.269 0.067
$y
n mean sd se
1 18 0.556 0.338 0.08
$z
n mean sd se
1 17 0.588 0.353 0.086

Related

Find the "top N" in a group and find the average of the "top N" in R

Rank Laps Average Time
1 1 1 30
2 2 1 34
3 3 1 35
4 1 2 32
5 2 2 33
6 3 2 56
7 4 1 43
8 5 1 23
9 6 1 31
10 4 2 23
11 5 2 88
12 6 2 54
I would like to know how I can group ranks 1-3 and ranks 4-6 and get an average of the "average time" for each lap. Also, I would like this to extend if I have groups 7-9, 10-13, etc.
One option is to use cut to put the different ranks into groups, and add Laps as a grouping variable. Then, you can summarize the data to get the mean.
library(tidyverse)
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3)), Laps) %>%
summarize(avg = mean(Average_Time))
Output
gr Laps avg
<fct> <int> <dbl>
1 (0,3] 1 33
2 (0,3] 2 40.3
3 (3,6] 1 32.3
4 (3,6] 2 55
Or another option if you want the range of ranks displayed for the group:
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3))) %>%
mutate(Rank_gr = paste0(min(Rank), "-", max(Rank))) %>%
group_by(Rank_gr, Laps) %>%
summarize(avg = mean(Average_Time))
Output
Rank_gr Laps avg
<chr> <int> <dbl>
1 1-3 1 33
2 1-3 2 40.3
3 4-6 1 32.3
4 4-6 2 55
Since you will have uneven groups, then you might want to use case_when to make the groups:
df %>%
group_by(gr=case_when(Rank %in% 1:3 ~ "1-3",
Rank %in% 4:6 ~ "4-6",
Rank %in% 7:9 ~ "7-9",
Rank %in% 10:13 ~ "10-13"),
Laps) %>%
summarize(avg = mean(Average_Time))
Data
df <- structure(list(Rank = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 4L,
5L, 6L), Laps = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L), Average_Time = c(30L, 34L, 35L, 32L, 33L, 56L, 43L, 23L,
31L, 23L, 88L, 54L)), class = "data.frame", row.names = c(NA,
-12L))

How to fix NA error for mean in R ? and I'm trying to dotchart the given data how will I get it?

I'm trying to fix the NA problem and trying to dot plot the data frame from ".CSV" file.
I'm trying to get the mean median and 10% trimmed mean of the given data frame, somehow I'm getting an error. I have already tried previous suggestions and still not helping me out. I have data and I can't plot the dot chart from it.
code for mean median and 10% trimmed mean
data_val <- read.csv(file =
"~/502_repos_2019/502_Problems/health_regiment.csv", head=TRUE, sep
= " ")
as.numeric(unlist(data_val))
print(ncol(data_val))
print(nrow(data_val))
# I have used several logics but it's not helping to solve the problem
mean(data_val,data_val$cholesterol_level[data_val$Treatment_type ==
'Control_group'])
mean(data_val$cholesterol_level[data_val$Treatment_type ==
'Treatment_group'])
code for dot chart & dot plot
data_val <- read.csv(file =
"~/502_repos_2019/502_Problems/health_regiment.csv", head=TRUE, sep
= " ")
data_val
plot(data_val$Treatment_type ~ data_val$cholestrol_level, xlab =
"Health Unit Range", ylab = " ",
main = "Regiment_Health", type="p") #p for point chart
#dotchart(data_val, data_val$Treatment_type ~
data_val$cholestrol_leve, labels = row.names(data_val),
#cex = 0.6,xlab = "units")
Following is the error message
[2] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 7 3
-4 14 2 5 22 -7 9 5 -6 5 9 4 4 12 37 [38] 5 3 3 [2] 2 [2] 20 argument is not numeric or logical: returning NA[2] NA argument is
not numeric or logical: returning NA[2] NA
and instead of point plot, I'm getting bar chart and dot chart syntax
is not working though I have given the proper syntax.
.csv data
Treatment_type cholestrol_level
Control_group 7
Control_group 3
Control_group -4
Control_group 14
Control_group 2
Control_group 5
Control_group 22
Control_group -7
Control_group 9
Control_group 5
Treatment_group -6
Treatment_group 5
Treatment_group 9
Treatment_group 4
Treatment_group 4
Treatment_group 12
Treatment_group 37
Treatment_group 5
Treatment_group 3
Treatment_group 3
Data in dput format.
data_val <-
structure(list(Treatment_type = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Control_group", "Treatment_group"),
class = "factor"), cholestrol_level = c(7L, 3L, -4L, 14L,
2L, 5L, 22L, -7L, 9L, 5L, -6L, 5L, 9L, 4L, 4L, 12L, 37L,
5L, 3L, 3L)), class = "data.frame", row.names = c(NA, -20L))

Valid observations based on conditions [duplicate]

I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
-The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included
- Fred does not have a Jan_18_weight, therefore both his Jan_18_score and Jan_18_weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70

Calculation with multiple column conditional on data being there

I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
-The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included
- Fred does not have a Jan_18_weight, therefore both his Jan_18_score and Jan_18_weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70

How do I use approx() inside mutate_at() with a conditional statement in dplyr?

I want to interpolate missing values using dplyr, piping, and approx().
Data:
test <- structure(list(site = structure(c(3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L), .Label = c("lake", "stream", "wetland"), class = "factor"),
depth = c(0L, -3L, -4L, -8L, 0L, -1L, -3L, -5L, 0L, -2L,
-4L, -6L), var1 = c(1L, NA, 3L, 4L, 1L, 2L, NA, 4L, 1L, NA,
NA, 4L), var2 = c(1L, NA, 3L, 4L, NA, NA, NA, NA, NA, 2L,
NA, NA)), .Names = c("site", "depth", "var1", "var2"), class = "data.frame", row.names = c(NA,
-12L))
This code works:
library(tidyverse)
# interpolate missing var1 values for each site using approx()
test_int <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1)),
funs("i" = approx(depth, ., depth, rule=1, method="linear")[["y"]]))
But the code no longer works if it encounters a grouping (site & var) that doesn't have at least 2 non-NA values, e.g.,
# here I'm trying to interpolate missing values for var1 & var2
test_int2 <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs("i" = approx(depth, ., depth, rule=1, method="linear")[["y"]]))
R appropriately throws this error:
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.
How do I include a conditional statement or filter so that it only tries to interpolate cases where the site has at least 2 non-NA values and skips the rest or returns NA for those?
This will do what you are looking for...
test_int2 <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs("i"=if(sum(!is.na(.))>1)
approx(depth, ., depth, rule=1, method="linear")[["y"]]
else
NA))
test_int2
# A tibble: 12 x 6
# Groups: site [3]
site depth var1 var2 var1_i var2_i
<fctr> <int> <int> <int> <dbl> <dbl>
1 wetland 0 1 1 1.0 1.0
2 wetland -3 NA NA 2.5 2.5
3 wetland -4 3 3 3.0 3.0
4 wetland -8 4 4 4.0 4.0
5 lake 0 1 NA 1.0 NA
6 lake -1 2 NA 2.0 NA
7 lake -3 NA NA 3.0 NA
8 lake -5 4 NA 4.0 NA
9 stream 0 1 NA 1.0 NA
10 stream -2 NA 2 2.0 NA
11 stream -4 NA NA 3.0 NA
12 stream -6 4 NA 4.0 NA

Resources