grouping to aggregate values, but tripping up on NA's - r

I have long data, and I am trying to make a new variable (consistent) that holds, for each person (ID), the value of a given column (VALUE) at TIME = 2. I used the code below to do this, but I am getting tripped up on NAs. If the VALUE at TIME = 2 is NA, I want it to grab the VALUE at TIME = 1 instead, and that is the part I'm not sure how to do. So, in the example below, the new variable (consistent) should be 10 instead of NA for ID B.
library(dplyr)
ID = c("A", "A", "B", "B", "C", "C", "D", "D")
TIME = c(1, 2, 1, 2, 1, 2, 1, 2)
VALUE = c(8, 9, 10, NA, 12, 13, 14, 9)
df = data.frame(ID, TIME, VALUE)
df <- df %>%
group_by(ID) %>%
mutate(consistent = VALUE[TIME == 2]) %>% ungroup
df

If we want to keep the same code, wrap it in coalesce() so that the 'VALUE' where 'TIME' is 1 is used whenever the 'TIME' 2 value is NA (assuming each 'ID' has exactly one row per 'TIME')
library(dplyr)
df %>%
group_by(ID) %>%
mutate(consistent = coalesce(VALUE[TIME == 2], VALUE[TIME == 1])) %>%
ungroup
Output:
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 9
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 13
6 C 2 13 13
7 D 1 14 9
8 D 2 9 9
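Another option is to arrange before the group_by so that non-NA values come first and, among those, the highest 'TIME', and then take the first element of 'VALUE' (assuming no duplicated 'TIME' within an 'ID'). Note that this also reorders the rows.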
df %>%
arrange(ID, is.na(VALUE), desc(TIME)) %>%
group_by(ID) %>%
mutate(consistent = first(VALUE)) %>%
ungroup
Output:
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 2 9 9
2 A 1 8 9
3 B 1 10 10
4 B 2 NA 10
5 C 2 13 13
6 C 1 12 13
7 D 2 9 9
8 D 1 14 9
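Both variants above assume every ID has a TIME == 2 row. As a minimal sketch of one way to relax that assumption, match() can be used to look up the row position, since it returns NA (rather than a zero-length result) when the TIME is absent. The extra ID "E" below is hypothetical and only added to exercise that case.

```r
library(dplyr)

# Same data as the question, plus a hypothetical ID "E" that has no TIME == 2 row
df <- data.frame(
  ID    = c("A", "A", "B", "B", "C", "C", "D", "D", "E"),
  TIME  = c(1, 2, 1, 2, 1, 2, 1, 2, 1),
  VALUE = c(8, 9, 10, NA, 12, 13, 14, 9, 7)
)

res <- df %>%
  group_by(ID) %>%
  # match(2, TIME) is the position of the first TIME == 2 row, or NA if none exists;
  # indexing VALUE with NA yields NA, which coalesce() then replaces with the TIME == 1 value
  mutate(consistent = coalesce(VALUE[match(2, TIME)], VALUE[match(1, TIME)])) %>%
  ungroup()

res
```

With VALUE[TIME == 2] an ID without a TIME == 2 row would produce a zero-length vector inside mutate(), so the match() form is a little more forgiving on incomplete data.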

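Another possible solution, using tidyr::fill. Note that this fills each NA with the previous value within the same ID, so consistent varies by row rather than holding a single value per ID: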
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(consistent = VALUE) %>% fill(consistent) %>% ungroup
#> # A tibble: 8 × 4
#> ID TIME VALUE consistent
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 1 8 8
#> 2 A 2 9 9
#> 3 B 1 10 10
#> 4 B 2 NA 10
#> 5 C 1 12 12
#> 6 C 2 13 13
#> 7 D 1 14 14
#> 8 D 2 9 9

You can also use ifelse with your condition. After sorting by TIME within each ID, lag(VALUE) is guaranteed to be the TIME = 1 value, provided each group has exactly two rows, one with TIME 1 and one with TIME 2.
df %>%
group_by(ID) %>%
arrange(TIME, .by_group=T) %>%
mutate(consistent=ifelse(is.na(VALUE)&TIME==2, lag(VALUE), VALUE)) %>%
ungroup()
# A tibble: 8 × 4
ID TIME VALUE consistent
<chr> <dbl> <dbl> <dbl>
1 A 1 8 8
2 A 2 9 9
3 B 1 10 10
4 B 2 NA 10
5 C 1 12 12
6 C 2 13 13
7 D 1 14 14
8 D 2 9 9

Related

I'm not writing my pivot_wider function correctly [duplicate]

I am trying to run pivot_wider so that the output switches the rows and column fields. This does not accomplish that - what am I missing here?
Current:
model jan feb mar
a 1 2 3
b 2 4 6
c 3 6 9
d 4 8 12
e 5 10 14
Attempting:
month a b c d e
jan 1 2 3 4 5
feb 2 4 6 8 10
mar 3 6 9 12 14
df <- data.frame(model = c('a', 'b', 'c', 'd', 'e'),
jan = c(1, 2, 3, 4, 5),
feb = c(2, 4, 6, 8, 10),
mar = c(3, 6, 9, 12, 14)
)
df %>%
pivot_wider(names_from = model, values_from = c(2:4), values_fill = 0)
My real set has around 15 columns and I'm just trying to flip the values and keep them tied to model and month fields. I am trying to get the values to play nicer with PowerBI so I can sort/filter/group in visuals by date values. Thank you
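Here is an alternative approach without pivoting (note that t() turns a data frame with a character column into a character matrix, so the resulting columns are character unless converted back):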
library(janitor)
library(dplyr)
df %>%
t() %>%
row_to_names(row_number = 1) %>%
as.data.frame()
a b c d e
jan 1 2 3 4 5
feb 2 4 6 8 10
mar 3 6 9 12 14
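If the values need to stay numeric after transposing, one possible base R sketch is to transpose only the numeric columns and attach the model names afterwards. The object name wide is just an illustrative choice, not something from the original answer.

```r
df <- data.frame(model = c('a', 'b', 'c', 'd', 'e'),
                 jan = c(1, 2, 3, 4, 5),
                 feb = c(2, 4, 6, 8, 10),
                 mar = c(3, 6, 9, 12, 14))

# Drop the character column before t() so the matrix stays numeric
wide <- as.data.frame(t(df[-1]))
names(wide) <- df$model                      # model values become column names

# Move the month labels from row names into a regular column
wide <- cbind(month = rownames(wide), wide)
rownames(wide) <- NULL

str(wide)
```

Having month as an ordinary column (instead of row names) tends to be easier to work with when exporting to tools such as Power BI.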
You could try:
library(tidyr)
df %>%
pivot_longer(cols = -model, names_to = 'month') %>%
pivot_wider(names_from = model)
Output:
# A tibble: 3 × 6
month a b c d e
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 jan 1 2 3 4 5
2 feb 2 4 6 8 10
3 mar 3 6 9 12 14
Please check this
df %>% pivot_longer(c(-1), names_to = 'name', values_to = 'value') %>%
pivot_wider(names_from = model, values_from = value) %>%
rename(month=name)
Created on 2023-01-29 with reprex v2.0.2
# A tibble: 3 × 6
month a b c d e
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 jan 1 2 3 4 5
2 feb 2 4 6 8 10
3 mar 3 6 9 12 14
Thank you -
I ended up going with
df %>%
pivot_longer(cols = jan:mar) %>%
pivot_wider(names_from = model, values_from = value)
name a b c d e
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 jan 1 2 3 4 5
2 feb 2 4 6 8 10
3 mar 3 6 9 12 14

Get mean column values every n rows grouped by value in other column

I have a dataframe df that looks like this
time object
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 B
8 8 B
9 9 B
10 10 B
11 11 B
12 12 C
13 13 C
14 14 C
15 15 C
16 16 C
17 17 C
18 18 C
I would like to get the mean of the time column for every 3 rows, within each value of the object column:
df_mean
time object
1 2 A
2 5 A
3 8 B
4 13 C
5 16 C
I thought about using dplyr
df%>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(grp) %>%
summarise(across(c("time"), mean, na.rm = TRUE)) %>%
select(-grp)
but I do not know how to integrate the control for the object.
Another option would be to use aggregate
aggregate(.~object, data=df, mean)
but in this case I do not know how to get the mean every 3 rows.
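Your dplyr attempt is on the right track. With a few modifications it will work. The row index has to be computed within each object, so group by object before calling row_number().
library(dplyr)
# Sample data with 6 rows per object (the question's data has 6, 5 and 7)
df <- tibble(time = 1:18, object = rep(c('A', 'B', 'C'), each = 6))
df %>%
group_by(object) %>%
mutate(grp = (row_number() - 1) %/% 3) %>% # chunk index within each object
group_by(object, grp) %>%
summarise(time = mean(time, na.rm = TRUE), .groups = 'drop') %>%
select(-grp)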
#> # A tibble: 6 × 2
#> object time
#> <chr> <dbl>
#> 1 A 2
#> 2 A 5
#> 3 B 8
#> 4 B 11
#> 5 C 14
#> 6 C 17
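Here is an alternative dplyr way. Note that row_number() and n() inside this group_by() are evaluated on the ungrouped data, so the blocks of n rows are counted over the whole data frame rather than within each object: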
library(dplyr)
n = 3
df %>%
group_by(object, Col2 = rep(row_number(), each=n, length.out = n())) %>%
summarise(time = mean(time, na.rm = TRUE)) %>%
select(-Col2)
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 B 11
5 C 14
6 C 17
You can do a slight modification, creating your grp variable within object group first, and then filtering where the size of the joint grouping is >=3, and then summarize:
df%>%
group_by(object) %>%
mutate(grp = 1+ (row_number()-1) %/% 3) %>%
group_by(object,grp) %>%
filter(n()>=3) %>%
summarize(time=mean(time), .groups="drop") %>%
select(-grp)
Output:
object time
<chr> <dbl>
1 A 2
2 A 5
3 B 8
4 C 13
5 C 16
data.table option:
library(data.table)
setDT(df)
df[, n3:=gl(.N, 3, length=.N), by=object]
df[, .(time=mean(time)), by=.(object, n3)][, !"n3"]
Output:
object time
1: A 2
2: A 5
3: B 8
4: B 11
5: C 14
6: C 17
in base R:
aggregate(time~., cbind(df, gr=gl(nrow(df),3, nrow(df))), mean)
object gr time
1 A 1 2
2 A 2 5
3 B 3 8
4 B 4 11
5 C 5 14
6 C 6 17
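The base R one-liner above numbers the chunks over the whole data frame, which only lines up with object when every group size is a multiple of 3. A possible base R sketch that chunks within each object and drops incomplete chunks, using the group sizes from the question (6, 5 and 7 rows), could look like this. The helper columns grp and size are illustrative names.

```r
# Data shaped like the question: 6 rows of A, 5 of B, 7 of C
df <- data.frame(time = 1:18,
                 object = rep(c("A", "B", "C"), times = c(6, 5, 7)))

# Chunk index (0, 0, 0, 1, 1, 1, ...) restarted within each object
df$grp <- ave(df$time, df$object,
              FUN = function(x) (seq_along(x) - 1) %/% 3)

# Number of rows in each object/chunk combination
df$size <- ave(df$time, df$object, df$grp, FUN = length)

# Keep only complete chunks of 3 rows, then average them
full <- df[df$size == 3, ]
res <- aggregate(time ~ object + grp, data = full, FUN = mean)
res <- res[order(res$object, res$grp), c("object", "time")]
rownames(res) <- NULL

res
```

Dropping the size filter would keep the partial chunks (rows 10 to 11 for B and row 18 for C) as their own means instead.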

How to flag the last row of a data frame group?

Suppose we start with the below dataframe df:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
ID Period Value
1 1 1 10
2 1 2 12
3 1 3 11
4 5 1 4
5 5 2 6
Now using dplyr I add a "Calculate" column that multiplies Period and Value of each row, giving me the following:
> df %>% mutate(Calculate = Period * Value)
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 33
4 5 1 4 4
5 5 2 6 12
I'd like to modify the above "Calculate" to give me a value of 0, when reaching the last row for a given ID, so that the data frame output looks like:
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 0
4 5 1 4 4
5 5 2 6 0
I was going to use the lead() function to peer at the next row to see if the ID changes, but wasn't sure what happens when reaching the end of the data frame.
How could this be accomplished using dplyr?
You can group_by ID and replace the last row for each ID with 0.
library(dplyr)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = replace(Calculate, n(), 0)) %>%
ungroup
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
Yet another possibility:
library(tidyverse)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df %>%
mutate(Calculate = Period * Value) %>%
group_by(ID) %>%
mutate(Calculate = if_else(row_number() == n(), 0, Calculate)) %>%
ungroup
#> # A tibble: 5 × 4
#> ID Period Value Calculate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Another option multiplies by duplicated(ID, fromLast = TRUE), which is FALSE only for the last row of each ID:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(tidyverse)
df %>%
mutate(Calculate = Period * Value * duplicated(ID, fromLast = TRUE))
#> ID Period Value Calculate
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Created on 2022-01-09 by the reprex package (v2.0.1)
This should work. You can also replace rownum with Period (most likely)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df = df %>% mutate(Calculate = Period * Value)
df$rownum = seq_len(nrow(df)) # integer row index, so max() compares numerically rather than as strings
df = df %>%
group_by(ID) %>%
mutate(Calculate = ifelse(rownum == max(rownum), 0, Calculate)) %>%
ungroup()
# A tibble: 5 × 5
ID Period Value Calculate rownum
<dbl> <dbl> <dbl> <dbl> <int>
1 1 1 10 10 1
2 1 2 12 24 2
3 1 3 11 0 3
4 5 1 4 4 4
5 5 2 6 0 5
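On the lead() idea raised in the question: lead() pads with NA past the end of the vector, so the final row of the data frame can be detected with is.na(). A rough sketch without grouping, assuming the rows of each ID are contiguous and ID itself contains no NA, might be the following. The helper columns next_id and is_last are illustrative names.

```r
library(dplyr)

df <- data.frame(
  ID     = c(1, 1, 1, 5, 5),
  Period = c(1, 2, 3, 1, 2),
  Value  = c(10, 12, 11, 4, 6)
)

res <- df %>%
  mutate(
    next_id   = lead(ID),                        # NA on the very last row
    is_last   = is.na(next_id) | next_id != ID,  # TRUE | NA is TRUE, so the NA is harmless
    Calculate = if_else(is_last, 0, Period * Value)
  ) %>%
  select(-next_id, -is_last)

res
```

The grouped versions above do not depend on row order across IDs, which makes them the safer choice when the data may not be sorted.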

How to call columns implicitly in certain R functions

There are some functions in R where you have to call columns explicitly by name, such as pmin. My question is how to get around this, preferably using tidyverse.
Here's some sample data.
library(tidyverse)
df <- tibble(a = c(1:5),
b = c(6:10),
d = c(11:15),
e = c(16:20))
# A tibble: 5 x 4
a b d e
<int> <int> <int> <int>
1 1 6 11 16
2 2 7 12 17
3 3 8 13 18
4 4 9 14 19
5 5 10 15 20
Now I'd like to find the minimum of all the columns except for "e". I can do this:
df %>%
mutate(min = pmin(a, b, d))
# A tibble: 5 x 5
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
But what if I have many columns and would like to call every column except "e" without having to type out each column's name? I've made several attempts but none successful. I used the column index in my examples but I'd prefer excluding "e" by name. See below.
df %>%
mutate(min = pmin(-e))
df %>%
mutate(min = pmin(names(. %>% select(.))[-4]))
df %>%
mutate(min = pmin(names(.)[-4]))
df %>%
mutate(min = pmin(noquote(paste0(names(.)[-4], collapse = ","))))
df %>%
mutate(min = pmin(!!ensyms(names(.)[-4])))
None of these worked and I'm a bit at a loss.
One option using dplyr and purrr could be:
df %>%
mutate(min = exec(pmin, !!!select(., -e)))
a b d e min
<int> <int> <int> <int> <int>
1 1 6 11 16 1
2 2 7 12 17 2
3 3 8 13 18 3
4 4 9 14 19 4
5 5 10 15 20 5
For those not reading the comments, a nice option proposed by @IceCreamToucan and involving only dplyr could be:
df %>%
mutate(min = do.call(pmin, select(., -e)))
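The do.call() idea works because a data frame is a list of columns, so each column is passed to pmin as a separate argument. As a small base R sketch of the same technique, excluding "e" by name with setdiff() (the variable cols is an illustrative name):

```r
df <- data.frame(a = 1:5, b = 6:10, d = 11:15, e = 16:20)

# Every column except "e", excluded by name rather than by position
cols <- setdiff(names(df), "e")

# Equivalent to pmin(df$a, df$b, df$d), without typing the names
df$min <- do.call(pmin, df[cols])

df
```

Because the exclusion is by name, the code keeps working if columns are added or reordered.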
We can also use reduce with pmin
library(dplyr)
library(purrr)
df %>%
mutate(min = select(., -e) %>%
reduce(pmin))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
Or with syms and !!!. Note that the en- prefixed variants (such as ensyms) are meant for use inside a function, where they capture the function's arguments
df %>%
mutate(min = pmin(!!! syms(names(.)[-4])))
# A tibble: 5 x 5
# a b d e min
# <int> <int> <int> <int> <int>
#1 1 6 11 16 1
#2 2 7 12 17 2
#3 3 8 13 18 3
#4 4 9 14 19 4
#5 5 10 15 20 5
I would do this by reshaping long, calculating the min by group, and then reshaping back to wide:
df %>%
rowid_to_column() %>%
pivot_longer(cols = -c(e, rowid)) %>%
group_by(rowid) %>%
mutate(min = min(value)) %>%
ungroup() %>%
pivot_wider() %>%
select(-rowid, -min, min)

How to use group_by with summarise and summarise_all?

x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12
The above is part of the input.
Let's suppose that it also has a bunch of other columns
I want to:
group_by x
summarise y by sum
And for all other columns, I want to summarise_all by just taking the first value
Here's an approach that breaks it into two problems and combines them:
library(dplyr)
left_join(
# Here we want to treat column y specially
df %>%
group_by(x) %>%
summarize(sum_y = sum(y)),
# Here we exclude y and use a different summation for all the remaining columns
df %>%
group_by(x) %>%
select(-y) %>%
summarise_all(first)
)
# A tibble: 5 x 3
x sum_y z
<int> <int> <int>
1 1 20 1
2 2 16 3
3 3 17 2
4 4 18 2
5 5 7 3
Sample data:
df <- read.table(
header = T,
stringsAsFactors = F,
text="x y z
1 1 1
3 2 2
2 3 3
3 4 4
2 5 1
4 6 2
5 7 3
2 8 4
1 9 1
1 10 2
3 11 3
4 12 4")
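library(dplyr)
# Mean of every other column per x, plus the sum of y
# (df1 is defined under "Data:" below; across() needs dplyr >= 1.0.0
# and replaces the deprecated summarise_each() and funs())
df1 %>%
group_by(x) %>%
summarise(across(-y, list(avg = mean)), y = sum(y))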
#> # A tibble: 5 x 4
#> x r_avg r.1_avg y
#> <int> <dbl> <dbl> <int>
#> 1 1 6.67 6.67 20
#> 2 2 5.33 5.33 16
#> 3 3 5.67 5.67 17
#> 4 4 9 9 18
#> 5 5 7 7 7
Created on 2019-06-20 by the reprex package (v0.3.0)
Data:
df1 <- read.table(text="
r x y
1 1 1
2 3 2
3 2 3
4 3 4
5 2 5
6 4 6
7 5 7
8 2 8
9 1 9
10 1 10
11 3 11
12 4 12", header=T)
df1 <- df1[,c(2,3,1,1)]
library(tidyverse)
df <- tribble(~x, ~y, # making a sample data frame
1, 1,
3, 2,
2, 3,
3, 4,
2, 5,
4, 6,
5, 7,
2, 8,
1, 9,
1, 10,
3, 11,
4, 12)
df <- df %>%
add_column(z = sample(1:nrow(df))) #add another column for the example
df
# If there is only one additional column and you need the first value
df %>%
group_by(x) %>%
summarise(sum_y = sum(y), z_1st = z[1])
# otherwise use summarise_at to address all the other columns
f <- function(x){x[1]} # function to extract the first value
df %>%
group_by(x) %>%
summarise_at(.vars = vars(-c('y')), .funs = f) # exclude column y from the calculations
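With dplyr 1.0.0 or later, the two summaries can probably be combined in a single summarise() call using across(). The sketch below reuses the x/y/z sample data from the first answer. across() is placed first because columns created earlier in the same summarise() call can be picked up by a later across().

```r
library(dplyr)

df <- read.table(header = TRUE, text = "x y z
1 1 1
3 2 2
2 3 3
3 4 4
2 5 1
4 6 2
5 7 3
2 8 4
1 9 1
1 10 2
3 11 3
4 12 4")

res <- df %>%
  group_by(x) %>%
  summarise(
    across(-y, first),  # first value of every non-grouping column except y
    sum_y = sum(y)      # y is untouched by across(), so this sums the original values
  )

res
```

Any extra columns in the real data would be handled by the across(-y, first) line without being named individually.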

Resources