Calculation with multiple column conditional on data being there - r

I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
-The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included
- Fred does not have a Jan_18_weight, therefore both his Jan_18_score and Jan_18_weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.

Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))

Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70

Related

Find the "top N" in a group and find the average of the "top N" in R

Rank Laps Average Time
1 1 1 30
2 2 1 34
3 3 1 35
4 1 2 32
5 2 2 33
6 3 2 56
7 4 1 43
8 5 1 23
9 6 1 31
10 4 2 23
11 5 2 88
12 6 2 54
I would like to know how I can group ranks 1-3 and ranks 4-6 and get an average of the "average time" for each lap. Also, I would like this to extend if I have groups 7-9, 10-13, etc.
One option is to use cut to put the different ranks into groups, and add Laps as a grouping variable. Then, you can summarize the data to get the mean.
library(tidyverse)
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3)), Laps) %>%
summarize(avg = mean(Average_Time))
Output
gr Laps avg
<fct> <int> <dbl>
1 (0,3] 1 33
2 (0,3] 2 40.3
3 (3,6] 1 32.3
4 (3,6] 2 55
Or another option if you want the range of ranks displayed for the group:
df %>%
group_by(gr = cut(Rank, breaks = seq(0, 6, by = 3))) %>%
mutate(Rank_gr = paste0(min(Rank), "-", max(Rank))) %>%
group_by(Rank_gr, Laps) %>%
summarize(avg = mean(Average_Time))
Output
Rank_gr Laps avg
<chr> <int> <dbl>
1 1-3 1 33
2 1-3 2 40.3
3 4-6 1 32.3
4 4-6 2 55
Since you will have uneven groups, then you might want to use case_when to make the groups:
df %>%
group_by(gr=case_when(Rank %in% 1:3 ~ "1-3",
Rank %in% 4:6 ~ "4-6",
Rank %in% 7:9 ~ "7-9",
Rank %in% 10:13 ~ "10-13"),
Laps) %>%
summarize(avg = mean(Average_Time))
Data
df <- structure(list(Rank = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 5L, 6L, 4L,
5L, 6L), Laps = c(1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L,
2L), Average_Time = c(30L, 34L, 35L, 32L, 33L, 56L, 43L, 23L,
31L, 23L, 88L, 54L)), class = "data.frame", row.names = c(NA,
-12L))

Dropping NA values of factors within a function

Toy data:
Say I have this df
df <- structure(list(x = structure(c(NA, 7L, NA, NA, 4L, 6L, 6L, 2L,
3L, 5L, 8L, 4L, 7L, 3L, 5L, 1L, 5L, 5L, 5L, NA), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), y = structure(c(NA,
2L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, NA, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L, 2L, 2L), .Label = c("1", "2", "3"), class = "factor"), z = structure(c(NA,
4L, 4L, 4L, 5L, 4L, 5L, 5L, 2L, NA, 4L, 1L, 1L, 3L, 2L, 5L, 2L,
2L, 4L, NA), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
a = c(-32L, -51L, -22L, 44L, 55L, -24L, -50L, 67L, 1L, -47L,
66L, -98L, -91L, -42L, -89L, -31L, -8L, -33L, 38L, 61L),
b = c(46L, -19L, -37L, 47L, -28L, -48L, 14L, -10L, -13L,
-31L, 32L, 21L, -21L, 25L, -8L, 42L, -26L, -24L, 36L, -39L
)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))
df
# A tibble: 20 × 5
x y z a b
<fct> <fct> <fct> <int> <int>
1 NA NA NA -32 46
2 7 2 4 -51 -19
3 NA 3 4 -22 -37
4 NA 2 4 44 47
5 4 2 5 55 -28
6 6 2 4 -24 -48
7 6 2 5 -50 14
8 2 1 5 67 -10
9 3 3 2 1 -13
10 5 NA NA -47 -31
11 8 2 4 66 32
12 4 3 1 -98 21
13 7 1 1 -91 -21
14 3 1 3 -42 25
15 5 3 2 -89 -8
16 1 2 5 -31 42
17 5 2 2 -8 -26
18 5 3 2 -33 -24
19 5 2 4 38 36
20 NA 2 NA 61 -39
I want to normalize variables x, y, and z on a 0-1 scale, and then produce some summary stats on them. I can produce the summary stats just fine using the code below
Code that works:
library(tidyverse)
vars <- c('x', 'y', 'z')
names(vars) <- vars
summary_stats <- function(data){
tibble(
n = sum(!is.na(data)),
mean = round(mean(as.numeric(data), na.rm = T), digits = 3),
sd = round(sd(as.numeric(data), na.rm = T), digits = 3),
se = round(sd/sqrt(n), digits = 3)
)
}
table <- map_df(
df %>%
dplyr::select(vars),
summary_stats,
.id = "covariate")
table
# A tibble: 3 × 5
covariate n mean sd se
<chr> <int> <dbl> <dbl> <dbl>
1 x 16 4.75 1.88 0.47
2 y 18 2.11 0.676 0.159
3 z 17 3.35 1.41 0.342
Code that doesn't work:
But i'm struggling to figure out how to normalize the variables. My latest attempt is to try this
summary_stats <- function(data){
data_norm <- drop_na(data) %>% dplyr::summarize(
(as.numeric(data) - min(as.numeric(data))) /
(max(as.numeric(data)) - min(as.numeric(data)))
)
tibble(
n = sum(!is.na(data_norm)),
mean = round(mean(as.numeric(data_norm), na.rm = T), digits = 3),
sd = round(sd(as.numeric(data_norm), na.rm = T), digits = 3),
se = round(sd/sqrt(n), digits = 3)
)
}
table <- map_df(
df %>%
dplyr::select(vars),
summary_stats,
.id = "covariate")
Errors:
But this returns the error
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "factor"
If I convert it to a numeric on the fly, so I have data_norm <- drop_na(as.numeric(data)) etc., I then get a very similar error saying
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "c('double', 'numeric')"
However, if I do this outside of the function it works fine
df %>% drop_na(x) %>% summarise(std_mean = (as.numeric(x) - min(as.numeric(x))) / (max(as.numeric(x)) - min(as.numeric(x))))
# A tibble: 16 × 1
std_mean
<dbl>
1 0.857
2 0.429
3 0.714
4 0.714
5 0.143
6 0.286
7 0.571
....
I need to remove the NA values or when I try and normalize the returned variable will have all NAs if there is at least 1 NA in that column. And if I apply drop_na() outside the function (to the master tibble i feed in to the map_dfr function), it will drop any row that has at least 1 NA value in any variable from the df, rather than just the NA values from that column.
Can anyone help here?
Update:
If I remove the drop_na() call from the function i get the following error
Error in UseMethod("summarise") :
no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"
This makes zero sense to me (i'm probably not understanding it) as summarise definitely works with numeric variables...
Looks like what's happening is that you're trying to write a function to take an entire data frame as an argument, but when you go to map it, you're actually only passing a single vector (e.g. df$x) as the argument to the function. This works fine for the first version of your function, but in the second version drop_na fails to work because it takes an entire data frame for an argument. Same goes for summarize, which is why you were getting a similar error. It also works outside of your function because you're able to specify a single vector.
So, what I did was swap out drop_na for na_omit, and also reorganized your code a bit.
First, let's just define a separate std_mean function so we don't have to deal with summarize:
std_mean <- function(x){
x <- na.omit(x)
(as.numeric(x) - min(as.numeric(x)))/(max(as.numeric(x)) - min(as.numeric(x)))
}
Now we can go back and fix your original function:
summary_stats <- function(vec){
data_norm <- std_mean(vec)
n = length(data_norm)
sd = round(sd(as.numeric(data_norm), na.rm = T), digits = 3)
data.frame(
n = n,
mean = round(mean(as.numeric(data_norm), na.rm = T), digits = 3),
sd = sd,
se = round(sd/sqrt(n), digits = 3)
)
}
We have to define n and sd beforehand because they were being used as arguments in other columns of the data frame. While it would be cool for data.frame to calculate the first column to then allow you to feed into later columns, that isn't the case.
And now we're ready to map:
map(df[vars],summary_stats)
$x
n mean sd se
1 16 0.536 0.269 0.067
$y
n mean sd se
1 18 0.556 0.338 0.08
$z
n mean sd se
1 17 0.588 0.353 0.086

Choose rows in which the absolute value of subtraction is less a specified value

Let's say I have this dataframe:
ID X1 X2
1 1 2
2 2 1
3 3 1
4 4 1
5 5 5
6 6 20
7 7 20
8 9 20
9 10 20
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))
And I want to select rows in which the absolute value of the subtraction of rows are more or equal to 2 (based on columns X1 and X2).
For example, row 4 value is 4-1, which is 3 and should be selected.
Row 9 value is 10-20, which is -10. Absolute value is 10 and should be selected.
In this case it would be rows 3, 4, 6, 7, 8 and 9
I tried:
dataset2 = dataset[,abs(dataset- c(dataset[,2])) > 2]
But I get an error.
The operation:
abs(dataset- c(dataset[,2])) > 2
Does give me rows that the sum are more than 2, but the result only works for my second column and does not select properly
We can get the difference between the 'X1', 'X2' columns, create a logical expression in subset to subset the rows
subset(dataset, abs(X1 - X2) >= 2)
# ID X1 X2
#3 3 3 1
#4 4 4 1
#6 6 6 20
#7 7 7 20
#8 8 9 20
#9 9 10 20
Or using index
subset(dataset, abs(dataset[[2]] - dataset[[3]]) >= 2)
data
dataset <- structure(list(ID = 1:9, X1 = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 9L,
10L), X2 = c(2L, 1L, 1L, 1L, 5L, 20L, 20L, 20L, 20L)),
class = "data.frame", row.names = c(NA,
-9L))

Valid observations based on conditions [duplicate]

I am trying to solve is how to calculate the weighted score for each class each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
-The observations of Adam, Char and Greg are valid since they have both scores and weights. Their scores and weights should be included
- Fred does not have a Jan_18_weight, therefore both his Jan_18_score and Jan_18_weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
This calculation would be repeated for each class and each month.
A new dataframe something like the following should be the output
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
df %>%
mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
gather(month, value, -Student, -Class) %>%
group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
summarize(
weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
) %>%
ungroup() %>%
spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70

Removing rows based on criteron from other column in R

I am trying to filter some data that I have in R. It is formatted like this:
id config_id alpha begin end day
1 1 1 5 138 139 6
2 1 2 5 137 138 6
3 1 3 5 47 48 2
4 1 3 3 46 47 2
5 1 4 3 45 46 2
6 1 4 3 43 44 2
...
id config_id alpha begin end day
1 2 1 5 138 139 6
2 2 2 5 137 138 6
3 2 2 5 136 137 6
4 2 3 3 45 46 2
5 2 3 3 44 45 2
6 2 4 3 43 44 2
My goal is to remove any configuration which results in having beginnings and endings on the same day. For example, in the top example config_id 3 is not acceptable because both instances of config_id occur on day 2. Same story for config_id 4. In the bottom example config_id 2 and config_id 3 are unacceptable for the same reason.
Basically, if I have a repeated config_id AND any day (from the day) column shows up more than once for that config_id, then I want to remove that config_id from the list.
Right now I'm using something of a fairly complex lapply algorithm but there must be an easier way.
Thanks!
You can do this several ways, assuming your data is stored in a data frame called my_data.
base R
same_day <- aggregate(my_data$day, my_data["config_id"], function(x) any(table(x) > 1))
names(same_day)[2] <- "same_day"
my_data <- merge(my_data, same_day, by = "config_id")
my_data <- same_day[!same_day$repeated_id, ]
dplyr
library(dplyr)
my_data %<>% group_by(config_id) %>%
mutate(same_day = any(table(day) > 1)) %>%
filter(!same_day)
data.table
library(data.table)
my_data <- data.table(my_data, key = "config_id")
same_day <- my_data[, .(same_day = any(table(day) > 1)), by = "config_id"]
my_data[!my_data[same_day]$same_day, ]
We can also use n_distinct from dplyr. Here, I am grouping by 'id' and 'config_id', then remove the rows using filter. If the number of elements within the group is greater than 1 (n()>1) and (&) the number of distinct elements in 'day' is equal to 1 (n_distinct==1), we remove it.
library(dplyr)
df1 %>%
group_by(id, config_id) %>%
filter(!(n()>1 & n_distinct(day)==1))
#Source: local data frame [4 x 6]
#Groups: id, config_id [4]
# id config_id alpha begin end day
# (int) (int) (int) (int) (int) (int)
#1 1 1 5 138 139 6
#2 1 2 5 137 138 6
#3 2 1 5 138 139 6
#4 2 4 3 43 44 2
This should also work if we have different 'day' for the same 'config_id'.
df1$day[4] <- 3
A similar option using data.table is uniqueN. We convert the 'data.frame' to 'data.table' (setDT(df1)), grouped by 'id' and 'config_id', we subset the dataset (.SD) using the logical condition.
library(data.table)#v1.9.6+
setDT(df1)[, if(!(.N>1 & uniqueN(day) == 1L)) .SD, by = .(id, config_id)]
data
df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L), config_id = c(1L, 2L, 3L, 3L, 4L, 4L, 1L, 2L, 2L, 3L,
3L, 4L), alpha = c(5L, 5L, 5L, 3L, 3L, 3L, 5L, 5L, 5L, 3L, 3L,
3L), begin = c(138L, 137L, 47L, 46L, 45L, 43L, 138L, 137L, 136L,
45L, 44L, 43L), end = c(139L, 138L, 48L, 47L, 46L, 44L, 139L,
138L, 137L, 46L, 45L, 44L), day = c(6L, 6L, 2L, 2L, 2L, 2L, 6L,
6L, 6L, 2L, 2L, 2L)), .Names = c("id", "config_id", "alpha",
"begin", "end", "day"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Resources