R data imputation from group_by table [duplicate]

This question already has answers here:
How to replace NA with mean by group / subset?
(5 answers)
Closed 7 months ago.
library(dplyr)
group = c(1,1,4,4,4,5,5,6,1,4,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c')
sleep = c(14,NA,22,15,NA,96,100,NA,50,2,1)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(group, animal) %>% summarise(mean_sleep = mean(sleep, na.rm = TRUE))
I would like to replace the NA values in the sleep column with the mean sleep value for the matching group and animal.
Is there any way that I can perform some sort of lookup like Excel that matches group and animal from the test dataframe to the group_animal dataframe and replaces the NA value in the sleep column from the test df with the sleep value in the group_animal df?

We could use mutate instead of summarise, because summarise returns a single row per group whereas mutate keeps every row:
library(dplyr)
library(tidyr)
test <- test %>%
  group_by(group, animal) %>%
  mutate(sleep = replace_na(sleep, mean(sleep, na.rm = TRUE))) %>%
  ungroup()
-output
test
# A tibble: 11 × 3
group animal sleep
<dbl> <chr> <dbl>
1 1 a 14
2 1 b 50
3 4 c 22
4 4 c 15
5 4 d 2
6 5 a 96
7 5 b 100
8 6 c 1
9 1 b 50
10 4 d 2
11 6 c 1
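
The question also asks for an Excel-style lookup against the group_animal summary table. A minimal sketch of that approach, assuming dplyr is available: join the summary back onto test by group and animal, then fill each NA from the joined mean. The name imputed is only illustrative.

```r
library(dplyr)

test <- data.frame(
  group  = c(1, 1, 4, 4, 4, 5, 5, 6, 1, 4, 6),
  animal = c("a", "b", "c", "c", "d", "a", "b", "c", "b", "d", "c"),
  sleep  = c(14, NA, 22, 15, NA, 96, 100, NA, 50, 2, 1)
)

# Lookup table: one row per (group, animal) with the mean of the non-missing values
group_animal <- test %>%
  group_by(group, animal) %>%
  summarise(mean_sleep = mean(sleep, na.rm = TRUE), .groups = "drop")

# Join the lookup table back on and take mean_sleep wherever sleep is NA
imputed <- test %>%
  left_join(group_animal, by = c("group", "animal")) %>%
  mutate(sleep = coalesce(sleep, mean_sleep)) %>%
  select(-mean_sleep)
```

The join keeps the original row order of test, so imputed can replace it directly. The mutate approach above reaches the same values without the intermediate table.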

Related

where function in sum for row wise calculation

I am trying the code below to get the row-wise sum of the a, b and c team columns, which are all numeric. The team_league column is character, so I would like to exclude it and store the sum of the numeric variables in a new variable, league_points.
To select the numeric variables I am using where(is.numeric), but it is not working. Any thoughts?
vital1 <- data.frame(a_team=c(1:3), b_team=c(2:4),team_league=c('dd','ee','ff'),c_team=c(5,9,1)) %>%
rowwise() %>%
mutate(league_points=sum(where(is.numeric))
)

We can use where within c_across
library(dplyr)
data.frame(a_team=c(1:3), b_team=c(2:4),
           team_league=c('dd','ee','ff'),c_team=c(5,9,1)) %>%
  rowwise() %>%
  mutate(league_points = sum(c_across(where(is.numeric)), na.rm = TRUE)) %>%
  ungroup()
-output
# A tibble: 3 × 5
a_team b_team team_league c_team league_points
<int> <int> <chr> <dbl> <dbl>
1 1 2 dd 5 8
2 2 3 ee 9 14
3 3 4 ff 1 8
rowwise would be slow on larger data. Here a vectorized function, rowSums, is already available:
data.frame(a_team=c(1:3), b_team=c(2:4),
team_league=c('dd','ee','ff'),c_team=c(5,9,1)) %>%
mutate(league_points = rowSums(across(where(is.numeric)), na.rm = TRUE))
-output
a_team b_team team_league c_team league_points
1 1 2 dd 5 8
2 2 3 ee 9 14
3 3 4 ff 1 8

How to filter rows according to the bigger value in another column?

I have a data frame like below
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
My goal is to keep, from every block of 3 consecutive rows, the row with the largest value of d2.
Thank you!

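We may use rollmax from zoo to filter the rows. Note that rollmax uses a sliding (overlapping) window of width 3 rather than disjoint blocks of 3 rows, so it matches the blockwise maximum on this data but not necessarily on other data.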
library(dplyr)
library(zoo)
df1 %>%
filter(d2 == na.locf0(rollmax(d2, k = 3, fill = NA)))
d1 d2
1 b 5
2 e 13
3 g 32
4 l 5

You can create a grouping variable that puts observations into groups of 3. I first create a sequence from 1 to the total number of rows in steps of 3, then repeat each number of this sequence 3 times and subset the result to the length of the data, in case the number of observations is not perfectly divisible by 3. Then simply keep the rows that hold the largest d2 value in each group.
library(dplyr)
df1 %>%
mutate(group = rep(seq(1, n(), by = 3), each = 3)[1:n()]) %>%
group_by(group) %>%
filter(d2 == max(d2))
# A tibble: 4 x 3
# Groups: group [4]
# d1 d2 group
# <chr> <dbl> <dbl>
# 1 b 5 1
# 2 e 13 4
# 3 g 32 7
# 4 l 5 10

Yet another solution:
library(tidyverse)
d1<-c('a','b','c','d','e','f','g','h','i','j','k','l')
d2<-c(1,5,1,2,13,2,32,2,1,2,4,5)
df1<-data.frame(d1,d2)
df1 %>%
  mutate(id = rep(1:(n()/3), each=3)) %>%
  group_by(id) %>%
  slice_max(d2) %>%
  ungroup() %>%
  select(-id)
#> # A tibble: 4 × 2
#> d1 d2
#> <chr> <dbl>
#> 1 b 5
#> 2 e 13
#> 3 g 32
#> 4 l 5
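
The rep(1:(n()/3), each=3) grouping above relies on the number of rows being a multiple of 3. A minimal sketch of an alternative that does not, assuming dplyr 1.0.0 or later: derive the block index by integer division of the row number. The column name block and the shortened 7-row data are illustrative only.

```r
library(dplyr)

# Shortened example data: 7 rows, so the last block has a single row
df1 <- data.frame(
  d1 = c("a", "b", "c", "d", "e", "f", "g"),
  d2 = c(1, 5, 1, 2, 13, 2, 32)
)

result <- df1 %>%
  # Rows 1-3 get block 0, rows 4-6 get block 1, and so on
  mutate(block = (row_number() - 1) %/% 3) %>%
  group_by(block) %>%
  # Keep exactly one row per block, even if the maximum is tied
  slice_max(d2, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-block)
```

Setting with_ties = FALSE is a design choice: without it, slice_max returns every row that shares the block maximum.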

I want summarise a data frame [duplicate]

This question already has answers here:
count number of rows in a data frame in R based on group [duplicate]
(8 answers)
Closed 1 year ago.
I want to summarize the following data frame into a summary table.
plot <- c(rep(1,2), rep(2,4), rep(3,3))
bird <- c('a','b', 'a','b', 'c', 'd', 'a', 'b', 'c')
area <- c(rep(10,2), rep(5,4), rep(15,3))
birdlist <- data.frame(plot,bird,area)
birdlist
plot bird area
1 1 a 10
2 1 b 10
3 2 a 5
4 2 b 5
5 2 c 5
6 2 d 5
7 3 a 15
8 3 b 15
9 3 c 15
I tried the following
birdlist %>%
group_by(plot, area) %>%
mutate(count(bird))
I am trying to get a data frame as result that looks like the following
plot bird area
1 2 10
2 4 5
3 3 15
Please advise on how to count the birds in each plot together with the respective area of the plot. Thanks.

You were very close. You want summarize instead of mutate, though, and you can use n() to count the number of rows within the group you're specifying.
library(tidyverse)
birdlist %>%
group_by(plot, area) %>%
summarize(bird = n(),
.groups = "drop")
#> # A tibble: 3 x 3
#> plot area bird
#> <dbl> <dbl> <int>
#> 1 1 10 2
#> 2 2 5 4
#> 3 3 15 3
If you're set on count, you would use it without group_by.
birdlist %>%
count(plot, area, name = "bird")

We could group_by plot and summarise using unique():
birdlist %>%
group_by(plot) %>%
summarise(bird = n(), area = unique(area))
plot bird area
<dbl> <int> <dbl>
1 1 2 10
2 2 4 5
3 3 3 15

Get max col value and make it a new variable

df = data.frame(group=c(1,1,1,2,2,2,3,3,3),
score=c(11,NA,7,NA,NA,4,6,9,15),
MAKE=c(11,11,11,4,4,4,15,15,15))
Say you have data as above with group and score, and the objective is to make a new variable MAKE that is just the maximum value of score for each group, repeated on every row of that group.
This is my attempt, but it does not work.
df %>%
group_by(group) %>%
summarise(Value = max(is.na(score)))

For that you need
df %>% group_by(group) %>% mutate(MAKE = max(score, na.rm = TRUE))
# A tibble: 9 x 3
# Groups: group [3]
# group score MAKE
# <dbl> <dbl> <dbl>
# 1 1 11 11
# 2 1 NA 11
# 3 1 7 11
# 4 2 NA 4
# 5 2 NA 4
# 6 2 4 4
# 7 3 6 15
# 8 3 9 15
# 9 3 15 15
The issue with max(is.na(score)) is that is.na(score) is a logical vector, and when max is applied it gets coerced to a binary vector with 1 for TRUE and 0 for FALSE. The result is therefore 1 if the group contains any NA and 0 otherwise, not the largest score. A somewhat less natural solution, but closer to what you tried, would be
df %>% group_by(group) %>% mutate(MAKE = max(score[!is.na(score)]))
which finds the maximal value among all those values of score that are not NA.
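
The coercion described above can be checked on a single group in base R. This is a small illustrative sketch; the variable names are made up for the example, and the vector mirrors group 2 of the data.

```r
score <- c(NA, NA, 4)

# is.na(score) is TRUE, TRUE, FALSE; max() treats these as 1, 1, 0
na_flag_max <- max(is.na(score))

# Dropping the NAs first gives the largest actual score
real_max <- max(score, na.rm = TRUE)

# Same result via explicit subsetting, as in the alternative above
subset_max <- max(score[!is.na(score)])
```

Here na_flag_max is 1, while real_max and subset_max are both 4.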

R calculate median and last row in groups for certain rows

I'm working with grouping and medians. I'd like to group a data.frame and get, for each group, the median of certain rows (not all) and the last value.
My data are something like this:
test <- data.frame(
id = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9))
> test
id value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 B 3
7 B 4
8 B 5
9 B 1
10 B 8
11 C 3
12 C 4
13 C 2
14 C 9
For each id, I need the median of the three (number may vary, in this case three) central rows, then the last value.
I've tried first of all with only one id.
test_a <- test[which(test$id == 'A'),]
> test_a
id value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
For this single id, the desired output comes from these two expressions:
median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value) # median of three central values
tail(test_a,1)$value # last value
I used this:
library(tidyverse)
test_a %>% group_by(id) %>%
summarise(m = median(test_a[(nrow(test_a)-3):(nrow(test_a)-1),]$value),
last = tail(test_a,1)$value) %>%
data.frame()
id m last
1 A 3 5
But when I tried to generalize to all id:
test %>% group_by(id) %>%
summarise(m = median(test[(nrow(test)-3):(nrow(test)-1),]$value),
last = tail(test,1)$value) %>%
data.frame()
id m last
1 A 3 9
2 B 3 9
3 C 3 9
I think the formulas use the full dataset to calculate the last value and the median, but I cannot work out how to make it work per group. Thanks in advance.

This works:
test %>%
group_by(id) %>%
summarise(m = median(value[(length(value)-3):(length(value)-1)]),
last = value[length(value)])
# A tibble: 3 x 3
id m last
<fctr> <dbl> <dbl>
1 A 3 5
2 B 4 8
3 C 3 9
You just refer to variable value instead of the whole dataset within summarise.
Edit: Here's a generalized version.
test %>%
  group_by(id) %>%
  summarise(m = ifelse(length(value) == 1, value,
                ifelse(length(value) == 2, median(value),
                  median(value[(ceiling(length(value)/2)-1):(ceiling(length(value)/2)+1)]))),
            last = value[length(value)])
If a group has only one row, the value itself will be stored in m. If it has only two rows, the median of these two rows will be stored in m. If it has three or more rows, the middle three rows will be chosen dynamically and the median of those will be stored in m.
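
The rule described above can also be written as a small helper, which may be easier to read than nested ifelse calls. This is a sketch under the same rule, assuming dplyr is loaded; middle_median is a hypothetical name, not a dplyr function.

```r
library(dplyr)

test <- data.frame(
  id    = c('A','A','A','A','A','B','B','B','B','B','C','C','C','C'),
  value = c(1,2,3,4,5,3,4,5,1,8,3,4,2,9)
)

# Hypothetical helper: median of the three middle elements of x.
# With three or fewer elements, the median of all of them is returned,
# which matches the one-row and two-row cases described above.
middle_median <- function(x) {
  n <- length(x)
  if (n <= 3) {
    return(median(x))
  }
  mid <- ceiling(n / 2)
  median(x[(mid - 1):(mid + 1)])
}

res <- test %>%
  group_by(id) %>%
  summarise(m = middle_median(value),
            last = value[n()])
```

For an even number of rows there is no single middle row, so the helper centres on the lower of the two middle rows, as the ceiling-based indexing above does.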
