How to group by while keeping other columns? - r

Let's say I have the following data frame:
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3),
col1 = c("a","a", "b", "c", "d", "e", "f", "g", "h", "g"),
start_day = c(NA,1,15, NA, 4, 22, 5, 11, 14, 18),
end_day = c(NA,2, 15, NA, 6, 22, 6, 12, 16, 21))
Output:
id col1 start_day end_day
1 1 a NA NA
2 1 a 1 2
3 1 b 15 15
4 2 c NA NA
5 2 d 4 6
6 2 e 22 22
7 3 f 5 6
8 3 g 11 12
9 3 h 14 16
10 3 g 18 21
I want to create a data frame such that for each unique id I get the minimum of start_day column and the maximum of the end_day column. Also I want to keep the other columns. One solution could be using group_by:
df %>% group_by(id) %>% summarise(start_day = min(start_day, na.rm = TRUE),
end_day = max(end_day, na.rm = TRUE))
Output:
id start_day end_day
1 1 1 15
2 2 4 22
3 3 5 21
But I lose the other columns (in this example, col1). How can I keep them? The desired outcome would look as follows:
id start_day end_day col1_start col1_end
1 1 1 15 a b
2 2 4 22 d e
3 3 5 21 f g
Is there any way I can get the data frame I need?

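Look up the col1 values first, using the row positions of the minimum start_day and the maximum end_day, and only then summarise start_day and end_day. The order matters because summarise() evaluates its arguments in sequence: once start_day has been overwritten with the summarised value, later expressions see that single value rather than the original column.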
library(dplyr)
df %>%
group_by(id) %>%
summarise(col1_start = col1[which.min(start_day)],
col1_end = col1[which.max(end_day)],
start_day = min(start_day, na.rm = TRUE),
end_day = max(end_day, na.rm = TRUE))
Output:
# A tibble: 3 × 5
id col1_start col1_end start_day end_day
<dbl> <chr> <chr> <dbl> <dbl>
1 1 a b 1 15
2 2 d e 4 22
3 3 f g 5 21
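The ordering point can be checked directly. The sketch below runs the same summary twice, once with the col1 lookups placed before the min/max columns and once after. The names `right` and `wrong` are illustrative only and do not appear in the original answer; dplyr is assumed to be installed.

```r
library(dplyr)

df <- data.frame(
  id        = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  col1      = c("a", "a", "b", "c", "d", "e", "f", "g", "h", "g"),
  start_day = c(NA, 1, 15, NA, 4, 22, 5, 11, 14, 18),
  end_day   = c(NA, 2, 15, NA, 6, 22, 6, 12, 16, 21),
  stringsAsFactors = FALSE
)

# Lookups first: which.min()/which.max() still see the full per-group columns
right <- df %>%
  group_by(id) %>%
  summarise(col1_start = col1[which.min(start_day)],
            col1_end   = col1[which.max(end_day)],
            start_day  = min(start_day, na.rm = TRUE),
            end_day    = max(end_day, na.rm = TRUE))

# Lookups last: start_day and end_day are already single values per group,
# so which.min()/which.max() return 1 and col1[1] is picked every time
wrong <- df %>%
  group_by(id) %>%
  summarise(start_day  = min(start_day, na.rm = TRUE),
            end_day    = max(end_day, na.rm = TRUE),
            col1_start = col1[which.min(start_day)],
            col1_end   = col1[which.max(end_day)])

right$col1_start  # "a" "d" "f"
wrong$col1_start  # "a" "c" "f"
```

Note that which.min() and which.max() skip NA values and return the first position in case of ties, so a group with several rows sharing the minimum start_day reports the col1 of the first such row.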

Related

Arrange a tibble based on 2 columns in R?

A similar question was asked here, but I can't get it to work in my case and I'm not sure why.
I am trying to arrange a tibble based on 2 columns. For example, in my data, I am trying to arrange by the value and count columns. To begin, I show a working example:
library(dplyr)
dat <- tibble(
value = c("B", "D", "D", "E", "A", "A", "B", "C", "B", "E"),
ids = c(1:10),
count = c(3, 2, 1, 2, 2, 1, 2, 1, 1, 1)
)
dat %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
Looking at the output:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 B 1 3 1
2 B 7 2 1
3 B 9 1 1
4 D 2 2 2
5 D 3 1 2
6 E 4 2 4
7 E 10 1 4
8 A 5 2 5
9 A 6 1 5
10 C 8 1 8
We can see that the code worked: the tibble is arranged by the value column, and the order is based on how many times each element appears in the tibble (i.e., the count).
However, when I try the following example, the same code doesn't work:
dat_1 <- tibble(
value = c("x2....", "x5...." , "x5....", "x3...." , "x3....", "x4....", "x3....", "x3....", "x4....", "x2...." ),
ids = c(1:10),
count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)
dat_1 %>%
group_by(value) %>%
mutate(valrank = min(ids)) %>%
ungroup() %>%
arrange(valrank, value, desc(count))
Looking at this output, we get:
# A tibble: 10 × 4
value ids count valrank
<chr> <int> <dbl> <int>
1 x2.... 1 2 1
2 x2.... 10 1 1
3 x5.... 2 2 2
4 x5.... 3 1 2
5 x3.... 4 4 4
6 x3.... 5 3 4
7 x3.... 7 2 4
8 x3.... 8 1 4
9 x4.... 6 2 6
10 x4.... 9 1 6
So we can see that this has failed to reorder the tibble based on the count. In the second example, x3 appears the most (i.e., has the highest count), so it should appear at the top of the tibble.
I'm not sure what I'm doing wrong here!
UPDATE:
I think I may have solved this problem with:
dat_1 %>%
group_by(value) %>%
mutate(valrank = max(count)) %>%
ungroup() %>%
arrange(-valrank, value, -count)
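The first example only appeared to work because valrank = min(ids) orders the groups by first appearance, which happened to be consistent with the counts in that data. Below is a sketch of the approach from the update, written with desc() instead of unary minus (desc() also works on character columns). It assumes that "highest count" means the largest count value within each value group, and that ties between groups fall back to alphabetical order of value; the name `ordered` is illustrative.

```r
library(dplyr)

dat_1 <- tibble(
  value = c("x2....", "x5....", "x5....", "x3....", "x3....",
            "x4....", "x3....", "x3....", "x4....", "x2...."),
  ids   = 1:10,
  count = c(2, 2, 1, 4, 3, 2, 2, 1, 1, 1)
)

ordered <- dat_1 %>%
  group_by(value) %>%
  mutate(valrank = max(count)) %>%   # rank each group by its largest count
  ungroup() %>%
  arrange(desc(valrank), value, desc(count))

ordered$value[1:4]  # the four x3 rows come first
```

If "appears the most" should instead mean the number of rows per value, replacing max(count) with n() would be the corresponding change under that reading.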

R dplyr calculating group and column percentages

I am new to R and have a simple 'how to' question: what is the best way to calculate group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
The test data is:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(resp_rate = round(no_resp/all*100))
This gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
This is fine, but I wondered how I could make it simpler. Also, how would I do an overall percentage, e.g. an overall count of resp divided by the distinct count of id, without grouping?
Many thanks
You can put multiple statements in summarise, so you don't have to create the temporary objects a and b. To calculate each group's share of the overall total, divide its no_resp by the sum of that column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
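Neither answer covers the ungrouped rate that the question also asks about. One possible reading, sketched here under the assumption that "overall" means the total number of responses divided by the number of distinct ids across all groups, is to drop group_by() and summarise the whole table. The name `overall` is illustrative.

```r
library(dplyr)

test <- tibble(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf",
            "gdf", "hfg", "hfg", "trw", "jyt", "ghj")
)

# No group_by(): the summary is computed over all rows at once
overall <- test %>%
  summarise(no_resp   = sum(resp, na.rm = TRUE),
            all       = n_distinct(id),
            resp_rate = round(no_resp / all * 100))

overall  # no_resp = 8, all = 11, resp_rate = 73
```

The ids "gdf" and "hfg" each occur twice, so the distinct count is 11 rather than 13. If every row should count, n() in place of n_distinct(id) would be the alternative.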

How to replace NA with set of values

I have the following data frame:
library(dplyr)
library(tibble)
df <- tibble(
source = c("a", "b", "c", "d", "e"),
score = c(10, 5, NA, 3, NA ) )
df
It looks like this:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10 # current max value
2 b 5
3 c NA
4 d 3
5 e NA
What I want to do is replace each NA in the score column with the existing max + n, where n starts at 1 and increases by one for each successive NA (so it can range up to the total number of rows of the df).
Resulting in this (hand-coded) :
source score
a 10
b 5
c 11 # obtained from 10 + 1
d 3
e 12 # obtained from 10 + 2
How can I achieve that?
Another option :
transform(df, score = pmin(max(score, na.rm = TRUE) +
cumsum(is.na(score)), score, na.rm = TRUE))
# source score
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
If you want to do this in dplyr
library(dplyr)
df %>% mutate(score = pmin(max(score, na.rm = TRUE) +
cumsum(is.na(score)), score, na.rm = TRUE))
A base R solution
df$score[is.na(df$score)] <- seq_along(which(is.na(df$score))) + max(df$score,na.rm = TRUE)
such that
> df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
Here is a dplyr approach,
df %>%
mutate(score = replace(score,
is.na(score),
(max(score, na.rm = TRUE) + (cumsum(is.na(score))))[is.na(score)])
)
which gives,
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
With dplyr:
library(dplyr)
df %>%
mutate_at("score", ~ ifelse(is.na(.), max(., na.rm = TRUE) + cumsum(is.na(.)), .))
Result:
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
A dplyr solution.
df %>%
mutate(na_count = cumsum(is.na(score)),
score = ifelse(is.na(score), max(score, na.rm = TRUE) + na_count, score)) %>%
select(-na_count)
## A tibble: 5 x 2
# source score
# <chr> <dbl>
#1 a 10
#2 b 5
#3 c 11
#4 d 3
#5 e 12
Another one, quite similar to ThomasIsCoding's solution:
> df$score[is.na(df$score)] <- max(df$score, na.rm = TRUE) + seq_len(sum(is.na(df$score)))
> df
# A tibble: 5 x 2
source score
<chr> <dbl>
1 a 10
2 b 5
3 c 11
4 d 3
5 e 12
Not quite as elegant as the base R solutions, but still possible with data.table:
library(data.table)
setDT(df)
max.score = df[, max(score, na.rm = TRUE)]
df[is.na(score), score := (1:.N) + max.score]
Or in one line but a bit slower:
df[is.na(score), score := (1:.N) + df[, max(score, na.rm = TRUE)]]
df
source score
1: a 10
2: b 5
3: c 11
4: d 3
5: e 12
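Most of the answers above rely on the same cumsum(is.na()) idea without spelling it out. As a step-by-step sketch in base R on a plain vector (the intermediate names `missing`, `offset`, `filled` and `result` are illustrative):

```r
score <- c(10, 5, NA, 3, NA)

missing <- is.na(score)                       # FALSE FALSE TRUE FALSE TRUE
offset  <- cumsum(missing)                    # 0 0 1 1 2: running count of NAs
filled  <- max(score, na.rm = TRUE) + offset  # 10 10 11 11 12

# Keep the original value where it exists, use the filled value where it is NA
result <- ifelse(missing, filled, score)
result  # 10 5 11 3 12
```

The pmin(..., na.rm = TRUE) variants produce the same result because every element of `filled` is at least the existing maximum, so pmin() returns the original score wherever one exists and falls back to `filled` only where score is NA.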

How to subtract values by group (subtract blank stored as one group) using dplyr?

I have some tidy data, and one of the groups is a blank:
df <- data.frame(Group = c(rep(LETTERS[1:3], 3), "Blank", "Blank", "Blank"),
ID = rep(1:3, 4),
Value = c(10, 11, 12, 21, 22, 23, 31, 32, 33, 1, 2, 3))
df
Group ID Value
1 A 1 10
2 B 2 11
3 C 3 12
4 A 1 21
5 B 2 22
6 C 3 23
7 A 1 31
8 B 2 32
9 C 3 33
10 Blank 1 1
11 Blank 2 2
12 Blank 3 3
I want to subtract the Blank value from each group (A, B, C), so that the normalized data looks like this:
df_normalized<- data.frame(Group = rep(LETTERS[1:3], 3),
ID = rep(1:3, 3),
Value = c(9, 9, 9, 20, 20, 20, 30, 30, 30))
df_normalized
Group ID Value
1 A 1 9
2 B 2 9
3 C 3 9
4 A 1 20
5 B 2 20
6 C 3 20
7 A 1 30
8 B 2 30
9 C 3 30
How can I do this nicely using dplyr?
EDIT:
How can I do that for multiple groups? E.g.:
df <- data.frame(Cluster = c(rep("C1", 12), rep("C2", 12)),
Group = rep(c(rep(LETTERS[1:3], 3), "Blank", "Blank", "Blank"), 2),
ID = rep(1:3, 8),
Value = sample(24))
Assuming you'll have only one "Blank" value per ID as shown in the example, you can do
library(dplyr)
df %>%
group_by(ID) %>%
mutate(Value = Value - Value[Group == "Blank"]) %>%
filter(Group != "Blank")
# Group ID Value
# <fct> <int> <dbl>
#1 A 1 9
#2 B 2 9
#3 C 3 9
#4 A 1 20
#5 B 2 20
#6 C 3 20
#7 A 1 30
#8 B 2 30
#9 C 3 30
If you have more than one "Blank" per ID, you can use match, which ensures that only the first value is selected.
df %>%
group_by(ID) %>%
mutate(Value = Value - Value[match("Blank", Group)]) %>%
filter(Group != "Blank")
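For the multi-cluster case raised in the EDIT, one possible extension is to add Cluster to the grouping so that each blank is subtracted only within its own cluster. The sketch below assumes one "Blank" row per Cluster and ID, and uses fixed illustrative values in place of sample(24) so the result is reproducible; the name `normalized` is illustrative.

```r
library(dplyr)

df <- data.frame(
  Cluster = c(rep("C1", 12), rep("C2", 12)),
  Group   = rep(c(rep(LETTERS[1:3], 3), "Blank", "Blank", "Blank"), 2),
  ID      = rep(1:3, 8),
  Value   = c(10, 11, 12, 21, 22, 23, 31, 32, 33, 1, 2, 3,
              15, 16, 17, 25, 26, 27, 35, 36, 37, 5, 5, 5),
  stringsAsFactors = FALSE
)

normalized <- df %>%
  group_by(Cluster, ID) %>%                                 # blank is matched per cluster and ID
  mutate(Value = Value - Value[match("Blank", Group)]) %>%  # subtract the first blank in the group
  filter(Group != "Blank") %>%
  ungroup()

normalized$Value
# C1 rows: 9 9 9 20 20 20 30 30 30
# C2 rows: 10 11 12 20 21 22 30 31 32
```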

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize this by grouping df by name and year, then computing the average and the number of entries in each group:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now, for all entries of asd$average where asd$ave_number < 2, I would like to substitute a value from the following lookup table, matched on year:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") # rename to avoid a duplicate variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
"I guess I have to use the %>%-operator"
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
# index by name: a numeric year would be treated as a position in replacer
mutate(average = if_else(ave_number < 2, unname(replacer[as.character(year)]), average))
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
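One caveat with the named-vector lookup: a numeric index selects by position, not by name, so it only lines up with year when the years happen to be 1, 2, 3. A small base R sketch with hypothetical years (2001 to 2003 are made up for illustration) shows the difference:

```r
# Hypothetical years that are not 1, 2, 3
replacer <- setNames(c(100, 200, 300), c(2001, 2002, 2003))
year <- c(2003, 2001)

by_position <- replacer[year]                # positions 2003 and 2001 do not exist
by_name     <- replacer[as.character(year)]  # matches the names "2003" and "2001"

unname(by_position)  # NA NA
unname(by_name)      # 300 100
```

Converting the key with as.character() makes the lookup independent of the actual year values; the join-based answer avoids the issue altogether.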
