R: calculate the median of certain rows and the last row in each group

I'm grouping a data.frame and, for each group, I'd like the median of certain rows (not all of them) together with the last value.
My data are something like this:
test <- data.frame(
id = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9))
> test
id value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 B 3
7 B 4
8 B 5
9 B 1
10 B 8
11 C 3
12 C 4
13 C 2
14 C 9
For each id, I need the median of the three central rows (the number may vary; here it is three), then the last value.
I've tried first of all with only one id.
test_a <- test[which(test$id == 'A'),]
> test_a
id value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
For this single id, the desired values are given by:
median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value) # median of three central values
tail(test_a,1)$value # last value
I used this:
library(tidyverse)
test_a %>% group_by(id) %>%
summarise(m = median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value),
last = tail(test_a,1)$value) %>%
data.frame()
id m last
1 A 3 5
But when I tried to generalize to all id:
test %>% group_by(id) %>%
summarise(m = median(test[(nrow(test)-3):(nrow(test)-1),]$value),
last = tail(test,1)$value) %>%
data.frame()
id m last
1 A 3 9
2 B 3 9
3 C 3 9
I think the formulas use the full dataset to calculate the last value and the median, but I can't work out how to make them operate per group. Thanks in advance.

This works:
test %>%
group_by(id) %>%
summarise(m = median(value[(length(value)-3):(length(value)-1)]),
last = value[length(value)])
# A tibble: 3 x 3
id m last
<fctr> <dbl> <dbl>
1 A 3 5
2 B 4 8
3 C 3 9
Inside summarise, refer to the column value rather than to the whole data frame test; value is then evaluated separately for each group.
Edit: Here's a generalized version.
test %>%
group_by(id) %>%
  # one row: the value itself; two rows: their median; otherwise the median of the middle three
  summarise(m = ifelse(length(value) == 1, value,
                ifelse(length(value) == 2, median(value),
                       median(value[(ceiling(length(value)/2)-1):(ceiling(length(value)/2)+1)]))),
            last = value[length(value)])
If a group has only one row, the value itself will be stored in m. If it has only two rows, the median of these two rows will be stored in m. If it has three or more rows, the middle three rows will be chosen dynamically and the median of those will be stored in m.
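The same rule can also be written as a small function, which keeps the summarise call short. This is only a sketch: middle_median is a hypothetical helper name, not part of dplyr, and for groups with an even number of rows it takes the three rows around the left-of-centre position, as the ceiling() expression above does.

```r
library(dplyr)

# Hypothetical helper: median of the (up to) three middle values of x
middle_median <- function(x) {
  n <- length(x)
  if (n <= 2) return(median(x))   # one or two rows: use all of them
  mid <- ceiling(n / 2)           # centre position (left of centre for even n)
  median(x[(mid - 1):(mid + 1)])
}

test <- data.frame(
  id = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
  value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9))

res <- test %>%
  group_by(id) %>%
  summarise(m = middle_median(value),
            last = value[length(value)])
```

Because the helper uses a plain if instead of nested ifelse calls, each branch is easy to read and to adjust if a different number of central rows is needed.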

Related

R data imputation from group_by table [duplicate]

library(dplyr)
group = c(1,1,4,4,4,5,5,6,1,4,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c')
sleep = c(14,NA,22,15,NA,96,100,NA,50,2,1)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(group, animal) %>% summarise(mean_sleep = mean(sleep, na.rm = TRUE))
I would like to replace the NA values in the sleep column with the mean sleep value for the same group and animal.
Is there any way that I can perform some sort of lookup like Excel that matches group and animal from the test dataframe to the group_animal dataframe and replaces the NA value in the sleep column from the test df with the sleep value in the group_animal df?
We could use mutate instead of summarise, since summarise returns a single row per group:
library(dplyr)
library(tidyr)
test <- test %>%
group_by(group, animal) %>%
mutate(sleep = replace_na(sleep, mean(sleep, na.rm = TRUE))) %>%
ungroup()
Output:
test
# A tibble: 11 × 3
group animal sleep
<dbl> <chr> <dbl>
1 1 a 14
2 1 b 50
3 4 c 22
4 4 c 15
5 4 d 2
6 5 a 96
7 5 b 100
8 6 c 1
9 1 b 50
10 4 d 2
11 6 c 1
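The Excel-style lookup described in the question can also be done literally, by joining the summary table back onto the data. The sketch below assumes dplyr 1.0.0 or later (for the .groups argument); filled is a name introduced here for illustration.

```r
library(dplyr)

group  <- c(1,1,4,4,4,5,5,6,1,4,6)
animal <- c('a','b','c','c','d','a','b','c','b','d','c')
sleep  <- c(14,NA,22,15,NA,96,100,NA,50,2,1)
test   <- data.frame(group, animal, sleep)

# Lookup table: one mean per group/animal combination
group_animal <- test %>%
  group_by(group, animal) %>%
  summarise(mean_sleep = mean(sleep, na.rm = TRUE), .groups = "drop")

# Match on group and animal, then fill NA from the looked-up mean
filled <- test %>%
  left_join(group_animal, by = c("group", "animal")) %>%
  mutate(sleep = coalesce(sleep, mean_sleep)) %>%
  select(-mean_sleep)
```

The mutate approach above is shorter; the join is mainly useful when the lookup table comes from somewhere else.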

How to filter rows according to the bigger value in another column?

I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
Which looks like the data table in this picture
My goal is to filter the rows based on which value of d2 in every 3 rows is biggest. So it would look like this:
Thank you!
We may use rollmax from zoo to filter the rows
library(dplyr)
library(zoo)
df1 %>%
filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5
You can create a grouping variable that puts the observations into groups of 3. First create a sequence from 1 to the total number of rows in steps of 3. Then repeat each number of this sequence 3 times and subset the result to the length of the data, in case the number of observations is not divisible by 3. Finally, keep the rows where d2 equals the largest value in each group.
library(dplyr)
df1 %>%
mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
group_by(group) %>%
filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10
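To see what the grouping vector looks like when the row count is not a multiple of 3, here is the same expression evaluated outside the pipeline, with a plain n standing in for n() (7 rows is an arbitrary choice for illustration):

```r
n <- 7  # stand-in for n(): a row count that is not divisible by 3

# seq gives 1, 4, 7; each is repeated 3 times, then the vector is cut to n
group <- rep(seq(1, n, by = 3), each = 3)[1:n]
print(group)  # 1 1 1 4 4 4 7
```

The last group simply contains fewer than three rows.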
Yet another solution:
library(tidyverse)
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
df1 %>%
mutate(id = rep(1:(n()/3), each=3)) %>%
group_by(id) %>%
slice_max(d2) %>%
ungroup %>% select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5
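Note that rep(1:(n()/3), each=3) only yields a vector of the right length when the number of rows is a multiple of 3. A variant that tolerates a shorter final block might use integer division on the row number instead. This is a sketch assuming dplyr 1.0.0 or later (for slice_max); block is a column name introduced here, and the data is cut to 11 rows on purpose.

```r
library(dplyr)

# 11 rows on purpose: the last block has only two rows
df1 <- data.frame(d1 = letters[1:11],
                  d2 = c(1, 5, 1, 2, 13, 2, 32, 2, 1, 2, 4))

res <- df1 %>%
  group_by(block = (row_number() - 1) %/% 3) %>%   # 0,0,0,1,1,1,...
  slice_max(d2, n = 1, with_ties = FALSE) %>%      # one row per block
  ungroup() %>%
  select(-block)
```

Setting with_ties = FALSE keeps exactly one row per block even when the maximum occurs more than once; drop it to keep all tied rows.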

Get max col value and make it a new variable

df = data.frame(group=c(1,1,1,2,2,2,3,3,3),
score=c(11,NA,7,NA,NA,4,6,9,15),
MAKE=c(11,11,11,4,4,4,15,15,15))
Say you have data as above, with group and score, and the objective is to create a new variable MAKE that holds the maximum value of score for each group, repeated on every row of the group.
This is my attempt, but it does not work:
df %>%
group_by(group) %>%
summarise(Value = max(is.na(score)))
For that you need
df %>% group_by(group) %>% mutate(MAKE = max(score, na.rm = TRUE))
# A tibble: 9 x 3
# Groups: group [3]
# group score MAKE
# <dbl> <dbl> <dbl>
# 1 1 11 11
# 2 1 NA 11
# 3 1 7 11
# 4 2 NA 4
# 5 2 NA 4
# 6 2 4 4
# 7 3 6 15
# 8 3 9 15
# 9 3 15 15
The issue with max(is.na(score)) is that is.na(score) is a logical vector, and when max is applied it gets coerced to a binary vector with 1 for TRUE and 0 for FALSE. The result is therefore 1 if the group contains any NA and 0 otherwise, not the largest score. A somewhat less natural solution, but closer to what you tried, would be
df %>% group_by(group) %>% mutate(MAKE = max(score[!is.na(score)]))
which finds the maximal value among all those values of score that are not NA.
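The coercion is easy to check on a single group. The following sketch uses the scores of group 2 from the question; the variable names are introduced here for illustration.

```r
score <- c(NA, NA, 4)                  # scores of group 2

na_flags <- is.na(score)               # logical: TRUE TRUE FALSE
flag_max <- max(na_flags)              # TRUE counts as 1, so this is 1
true_max <- max(score, na.rm = TRUE)   # ignores NA and returns 4

print(c(flag_max, true_max))
```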

Subtracting the last value in a group from previous values in dplyr

I have the following data
data = tribble(~t,~key,~value,
1,"a",10,
2,"a",20,
3,"a",30,
1,"b",100,
2,"b",200,
3,"b",300,
1,"c",1000,
2,"c",2000,
3,"c",3000)
and would like to get the following result
result = tribble(~t,~key,~value,
1,"a",-20,
2,"a",-10,
3,"a",0,
1,"b",-200,
2,"b",-100,
3,"b",0,
1,"c",-2000,
2,"c",-1000,
3,"c",0)
The idea is that I would like to subtract the 3rd value from all of the other values in that group. I tried to group_by the key, but struggled with the row-wise subtraction within the group.
We can use the last function from dplyr. The arrange call makes sure the rows are in the right order within each key, so that last picks the row with the largest t.
library(dplyr)
data2 <- data %>%
arrange(key, t) %>%
group_by(key) %>%
mutate(value = value - last(value)) %>%
ungroup()
data2
# # A tibble: 9 x 3
# t key value
# <dbl> <chr> <dbl>
# 1 1 a -20
# 2 2 a -10
# 3 3 a 0
# 4 1 b -200
# 5 2 b -100
# 6 3 b 0
# 7 1 c -2000
# 8 2 c -1000
# 9 3 c 0
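If re-sorting the whole table is undesirable, dplyr's last also accepts an order_by argument, so the "last" row can be chosen by t without calling arrange. A minimal sketch on deliberately shuffled data (the data frame below is made up for illustration and is not the tribble from the question):

```r
library(dplyr)

# Rows are deliberately out of order within each key
data <- data.frame(
  t     = c(3, 1, 2, 2, 3, 1),
  key   = c("a", "a", "a", "b", "b", "b"),
  value = c(30, 10, 20, 200, 300, 100))

res <- data %>%
  group_by(key) %>%
  # last(value, order_by = t) is the value at the largest t in the group
  mutate(value = value - last(value, order_by = t)) %>%
  ungroup()
```

The row order of the input is preserved in res; only the reference value is chosen according to t.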

For each group find observations with max value of several columns

Assume I have a data frame like so:
set.seed(4)
df<-data.frame(
group = rep(1:10, each=3),
id = rep(sample(1:3), 10),
x = sample(c(rep(0, 15), runif(15))),
y = sample(c(rep(0, 15), runif(15))),
z = sample(c(rep(0, 15), runif(15)))
)
As seen above, some elements of x, y, z vectors take value of zero, the rest being drawn from the uniform distribution between 0 and 1.
For each group, determined by the first column, I want to find three ids from the second column, one for each of x, y and z, pointing to the row with the highest value of that variable in the group. Assume there are no ties, except when a variable is 0 in all observations of a given group; in that case I don't want to return any id for that variable.
The output would look like so:
group x y z
1 2 2 1
2 2 3 1
... .........
My first thought is to select rows with maximum values separately for each variable and then use merge to put it in one table. However, I'm wondering if it can be done without merge, for example with standard dplyr functions.
Here is my proposed solution using plyr:
library(plyr)
ddply(df,.variables = c("group"),
.fun = function(t){apply(X = t[,c(-1,-2)],MARGIN = 2,
function(z){ifelse(sum(abs(z))==0,yes = NA,no = t$id[which.max(z)])})})
# group x y z
#1 1 2 2 1
#2 2 2 3 1
#3 3 1 3 2
#4 4 3 3 1
#5 5 2 3 NA
#6 6 3 1 3
#7 7 1 1 2
#8 8 NA 2 3
#9 9 2 1 3
#10 10 2 NA 2
Here is a solution using dplyr and tidyr. Notice that if all numbers in a group are the same, we cannot decide which id should be selected, so filter(n_distinct(Value) > 1) is added to remove those records. In the final output df2, NA indicates a group and column where all numbers are the same. We can decide later whether to impute those NA values. This solution should work for any number of ids or columns (x, y, z, ...).
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -group, -id) %>%
arrange(group, Column, desc(Value)) %>%
group_by(group, Column) %>%
# If all values from a group-Column are all the same, remove that group-Column
filter(n_distinct(Value) > 1) %>%
slice(1) %>%
select(-Value) %>%
spread(Column, id)
If you want to stick with just dplyr, you can use the multiple-column summarize/mutate functions. This should work regardless of the form of id; my initial attempt was slightly cleaner but assumed that an id of zero was invalid.
df %>%
group_by(group) %>%
mutate_at(vars(-id),
# If the row is the max within the group, set the value
# to the id and use NA otherwise
funs(ifelse(max(.) != 0 & . == max(.),
id,
NA))) %>%
select(-id) %>%
summarize_all(funs(
# There are zero or one non-NA values per group, so handle both cases
if(any(!is.na(.)))
na.omit(.) else NA))
## # A tibble: 10 x 4
## group x y z
## <int> <int> <int> <int>
## 1 1 2 2 1
## 2 2 2 3 1
## 3 3 1 3 2
## 4 4 3 3 1
## 5 5 2 3 NA
## 6 6 3 1 3
## 7 7 1 1 2
## 8 8 NA 2 3
## 9 9 2 1 3
## 10 10 2 NA 2
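funs() has been deprecated since dplyr 0.8.0, and mutate_at/summarize_all have been superseded by across() in dplyr 1.0.0. A possible rewrite in that style is sketched below. It borrows the which.max idea from the plyr answer, relies on the question's assumption that there are no ties, and uses a small hand-made data frame rather than the random data above so that the result is easy to check.

```r
library(dplyr)

# Small hand-made example (not the random data from the question)
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 2),
  id    = c(1, 2, 3, 1, 2, 3),
  x     = c(0.2, 0.9, 0.0, 0.0, 0.0, 0.0),
  y     = c(0.0, 0.1, 0.5, 0.7, 0.0, 0.3)
)

res <- df %>%
  group_by(group) %>%
  summarise(across(c(x, y),
                   # all zeros in the group: no id; otherwise id of the max row
                   ~ if (max(.x) == 0) NA_real_ else id[which.max(.x)]))
```

In group 2 every x is 0, so the x entry for that group is NA, while y points to the id of its largest value.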
