So, my data take the general shape of:
library(tidyverse)
id <- c(1, 1, 2, 2, 3, 3)
group <- c("A", "B", "A", "A", "B", "B")
value <- c(34, 12, 56, 78, 90, 91)
df <- tibble(id, group, value)
df
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 1 B 12
3 2 A 56
4 2 A 78
5 3 B 90
6 3 B 91
What I want to do can be described as "for each id, take the maximum value of group A. But, if A is not there, take the maximum value of group B." So my desired output would look something like:
id group value
<dbl> <chr> <dbl>
1 1 A 34
4 2 A 78
6 3 B 91
I tried to do this using the code...
desired <- df %>%
group_by(id) %>%
filter(if (exists(group == "A")) max(value) else if (exists(group == "B")) (max(value)))
...but I received an error. Help?
One option could be:
df %>%
group_by(id) %>%
arrange(group, desc(value), .by_group = TRUE) %>%
slice(which.max(group == "A"))
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
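The slice() step leans on how which.max() behaves on a logical vector: it returns the position of the first TRUE, and position 1 when there is no TRUE at all. A small sketch of that behaviour (the vectors below are made up for illustration, they are not taken from the data):

```r
# which.max() treats FALSE as 0 and TRUE as 1 and returns the first maximum
has_a <- c(TRUE, TRUE)    # e.g. an id whose rows are all in group "A"
mixed <- c(TRUE, FALSE)   # group "A" rows sorted ahead of group "B" rows
no_a  <- c(FALSE, FALSE)  # an id with no "A" rows at all

first_has_a <- which.max(has_a)
first_mixed <- which.max(mixed)
first_no_a  <- which.max(no_a)   # no TRUE present, so position 1 is returned
```

Because the rows are already sorted by group and then by descending value, position 1 should be the largest A value when A exists and the largest B value otherwise.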
Here is a base R option
subset(
df[order(id, group, -value), ],
ave(rep(TRUE, nrow(df)), id, FUN = function(x) seq_along(x) == 1)
)
which gives
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
The basic idea is:
We reorder the rows of df via df[order(id, group, -value), ]
Then we take the first value in the reordered df by id
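The two steps above can also be sketched with duplicated() instead of ave(). This is only an illustrative variant, not the code from the answer; it rebuilds the example data as a plain data frame so that it runs on its own:

```r
# Example data from the question, as a base R data frame
df <- data.frame(
  id    = c(1, 1, 2, 2, 3, 3),
  group = c("A", "B", "A", "A", "B", "B"),
  value = c(34, 12, 56, 78, 90, 91)
)

# Step 1: sort by id, then group ("A" before "B"), then value descending
ordered <- df[order(df$id, df$group, -df$value), ]

# Step 2: keep the first row of each id in the sorted data
result <- ordered[!duplicated(ordered$id), ]
result
```

Here the grouping column is read from the reordered data frame itself, so the approach does not depend on the rows already being sorted by id.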
Using data.table
library(data.table)
setDT(df)[order(id, group, -value), .SD[1], id]
# id group value
#1: 1 A 34
#2: 2 A 78
#3: 3 B 91
I have a data frame with a grouping variable ID, a factor F and a value V that looks something like this:
df <- data.frame(ID = c(rep(1, 3), rep(2, 3)),
F = factor(c("A","B","X","C","D","X")),
V = c(30, 32, 25, 31, 37, 24)
)
> df
ID F V
1 1 A 30
2 1 B 32
3 1 X 25
4 2 C 31
5 2 D 37
6 2 X 24
Now, I would like to add a new column New, which has the same value within each group (by ID) based on the value for V in the row where F==X using the tidyverse environment. Ideally, those rows would be removed afterwards so that the new data frame looks like this:
> df
ID F V New
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
I know that I have to use the group_by() function and probably also mutate(), but I couldn't quite manage to get my desired result.
df %>%
group_by(ID) %>%
mutate(New = V[F =='X']) %>%
filter(F != 'X')
# A tibble: 4 × 4
# Groups: ID [2]
ID F V New
<dbl> <fct> <dbl> <dbl>
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
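The mutate() call above assumes every ID has exactly one row with F == "X". If some ID had no such row, V[F == "X"] would be empty and mutate() would likely fail. A minimal sketch of one way to guard against that, using hypothetical data (df2, with an extra ID 3 that has no "X" row):

```r
library(dplyr)

# Hypothetical data: ID 3 has no row with F == "X"
df2 <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 3, 3),
  F  = c("A", "B", "X", "C", "D", "X", "E", "G"),
  V  = c(30, 32, 25, 31, 37, 24, 40, 41)
)

res <- df2 %>%
  group_by(ID) %>%
  mutate(New = V[F == "X"][1]) %>%  # [1] yields NA when nothing matches
  filter(F != "X") %>%
  ungroup()
res
```

Taking the first element also means that, if an ID had several "X" rows, only the first one would be used.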
library(dplyr)
df %>%
group_by(ID) %>% # grouping variables by ID
mutate(New = ifelse(F == "X",
V,
NA)) %>% # adding New column
summarise(New = max(New, na.rm = TRUE)) %>% # collapsing to one row per ID that holds the X value
right_join(df %>% filter(F != "X"), by = "ID") %>% # SQL-like join
select(ID, F, V, New) # reordering the columns to the desired order
And you get this output:
# A tibble: 4 × 4
ID F V New
<dbl> <fct> <dbl> <dbl>
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
Or even simpler:
df %>% filter(F == "X") %>% # filtering the rows with "X" in F column
right_join(df %>% filter(F != "X"), by = "ID") %>% # joining to the same dataset without "X" rows
select(ID, F= F.y, V = V.y, New = V.x) #reordering and renaming of columns
I am new to R and have a simple 'how to' question, specifically, what is the best way to calculate Group and overall percentages on data frame columns? My data looks like this:
# A tibble: 13 x 3
group resp id
<chr> <dbl> <chr>
1 A 1 ssa
2 A 1 das
3 A NA fdsf
4 B NA gfd
5 B 1 dfg
6 B 1 dg
7 C 1 gdf
8 C NA gdf
9 C NA hfg
10 D 1 hfg
11 D 1 trw
12 D 1 jyt
13 D NA ghj
the test data is this:
test <- structure(list(group = c("A", "A", "A", "B", "B", "B", "C", "C",
"C", "D", "D", "D", "D"), resp = c(1, 1, NA, NA, 1, 1, 1, NA,
NA, 1, 1, 1, NA), id = c("ssa", "das", "fdsf", "gfd", "dfg",
"dg", "gdf", "gdf", "hfg", "hfg", "trw", "jyt", "ghj")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -13L))
I managed to do the group percentages by doing the following (which seems overcomplicated):
a <- test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE))
b <- test %>%
group_by(group) %>%
summarise(all = n_distinct(id, na.rm = TRUE))
result <- a %>%
left_join(b) %>%
mutate(a,resp_rate = round(no_resp/all*100))
this gives me:
# A tibble: 4 x 4
group no_resp all resp_rate
<chr> <dbl> <int> <dbl>
1 A 2 3 67
2 B 2 3 67
3 C 1 2 50
4 D 3 4 75
which is fine, but I wondered how I could make this simpler? Also, how would I do an overall percentage? E.g. an overall distinct count of resp/distinct count of id, without grouping.
Many thanks
You can add multiple statements in summarise so you don't have to create temporary objects a and b. To calculate overall percentage you can divide the number by the sum of the column.
library(dplyr)
test %>%
group_by(group) %>%
summarise(no_resp = sum(resp, na.rm = TRUE),
all = n_distinct(id),
resp_rate = round(no_resp/all*100)) %>%
mutate(no_resp_perc = no_resp/sum(no_resp) * 100)
# group no_resp all resp_rate no_resp_perc
# <chr> <int> <int> <dbl> <dbl>
#1 A 2 3 67 25
#2 B 2 3 67 25
#3 C 1 2 50 12.5
#4 D 3 4 75 37.5
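For the overall figure the question describes (responses divided by the distinct count of id, without grouping), one possible reading is to run the same summarise() without group_by(). A minimal sketch under that assumption, with the test data rebuilt as a plain data frame:

```r
library(dplyr)

# Test data from the question, as a plain data frame
test <- data.frame(
  group = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "D"),
  resp  = c(1, 1, NA, NA, 1, 1, 1, NA, NA, 1, 1, 1, NA),
  id    = c("ssa", "das", "fdsf", "gfd", "dfg", "dg", "gdf", "gdf",
            "hfg", "hfg", "trw", "jyt", "ghj")
)

# Same summary as above, but over the whole table instead of per group
overall <- test %>%
  summarise(no_resp   = sum(resp, na.rm = TRUE),
            all       = n_distinct(id),
            resp_rate = round(no_resp / all * 100))
overall
```

Since n_distinct() is taken over the whole column here, an id that appears in two groups is counted once, so the overall denominator can be smaller than the sum of the per-group counts.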
Using base R we may apply tapply and table functions.
res <- transform(with(test, data.frame(no_resp=tapply(resp, group, sum, na.rm=TRUE),
all=colSums(table(id, group) > 0))),
resp_rate=round(no_resp/all*100),
overall_perc=prop.table(no_resp)*100
)
res
# no_resp all resp_rate overall_perc
# A 2 3 67 25.0
# B 2 3 67 25.0
# C 1 2 50 12.5
# D 3 4 75 37.5
I would like to duplicate a certain row based on information in a data frame. I'd prefer a tidyverse solution, and I'd like to accomplish this without explicitly calling the original data frame in a function.
Here's a toy example.
data.frame(var1 = c("A", "A", "A", "B", "B"),
var2 = c(1, 2, 3, 4, 5),
val = c(21, 31, 54, 65, 76))
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
All the solutions I've found so far require the user to input the desired row index. I'd like to find a way of doing it programmatically. In this case, I would like to duplicate the row where var1 is "A" with the highest value of var2 for "A" and append to the original data frame. The expected output is
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
A variation using dplyr. Find the max by group, filter for var1 and append.
library(dplyr)
df %>%
group_by(var1) %>%
filter(var2 == max(var2),
var1 == "A") %>%
bind_rows(df, .)
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54
You could select the row that you want to duplicate and add it to the original data frame:
library(dplyr)
var1_variable <- 'A'
df %>%
filter(var1 == var1_variable) %>%
slice_max(var2, n = 1) %>%
#For dplyr < 1.0.0
#slice(which.max(var2)) %>%
bind_rows(df, .)
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#6 A 3 54
In base R, that can be done as:
df1 <- subset(df, var1 == var1_variable)
rbind(df, df1[which.max(df1$var2), ])
From this post we can save the previous work in a temporary variable and then bind rows so that we don't break the chain and don't bind the original dataframe df.
df %>%
#Previous list of commands
{
{. -> temp} %>%
filter(var1 == var1_variable) %>%
slice_max(var2, n = 1) %>%
bind_rows(temp)
}
In base you can use rbind and subset to append the row(s) where var1 == "A" with the highest value of var2 to the original data frame.
rbind(x, subset(x[x$var1 == "A",], var2 == max(var2)))
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#31 A 3 54
Data:
x <- data.frame(var1 = c("A", "A", "A", "B", "B"),
var2 = c(1, 2, 3, 4, 5),
val = c(21, 31, 54, 65, 76))
An option with uncount
library(dplyr)
library(tidyr)
df1 %>%
uncount(replace(rep(1, n()), match(max(val[var1 == 'A']), val), 2)) %>%
as_tibble
# A tibble: 6 x 3
# var1 var2 val
# <chr> <dbl> <dbl>
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 A 3 54
#5 B 4 65
#6 B 5 76
I am trying to remove rows that have offsetting values.
library(dplyr)
a <- c(1, 1, 1, 1, 2, 2, 2, 2,2,2)
b <- c("a", "b", "b", "b", "c", "c","c", "d", "d", "d")
d <- c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30)
o <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
df <- tibble(ID = a, SEQ = b, VALUE = d, OTHER = o)
Generates this ordered table that is grouped by ID and SEQ.
> df
# A tibble: 10 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 B
3 1 b -10 C
4 1 b 10 D
5 2 c 20 E
6 2 c -20 F
7 2 c 20 G
8 2 d 30 H
9 2 d -30 I
10 2 d 30 J
I want to drop the row pairs (2,3), (5,6), (8,9) because VALUE negates the VALUE in the matching previous row.
I want the resulting table to be
> df2
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 10 D
3 2 c 20 G
4 2 d 30 J
I know that I can't use group_by %>% summarize, because I need to keep the value that is in OTHER. I've looked at the dplyr::lag() function but I don't see how that can help. I believe that I could loop through the table with some type of for each loop and generate a logical vector that can be used to drop the rows, but I was hoping for a more elegant solution.
What about:
vec <- cbind(
c(head(df$VALUE,-1) + df$VALUE[-1], 9999) ,
df$VALUE + c(9999, head(df$VALUE,-1))
)
vec <- apply(vec,1,prod)
vec <- vec!=0
df[vec,]
# A tibble: 4 x 4
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
The idea is to add the VALUE column to a copy of itself shifted by one row, once in each direction (the 9999 pads the ends). When either sum is 0, the row is removed.
Here's another solution with dplyr. Not sure about the edge case you mentioned in the comments, but feel free to test it with my solution:
library(dplyr)
df %>%
group_by(ID, SEQ) %>%
mutate(diff = VALUE + lag(VALUE),
diff2 = VALUE + lead(VALUE)) %>%
mutate_at(vars(diff:diff2), ~ coalesce(., 1)) %>% # funs() is deprecated; a lambda formula does the same
filter((diff != 0 & diff2 != 0)) %>%
select(-diff, -diff2)
Result:
# A tibble: 4 x 4
# Groups: ID, SEQ [4]
ID SEQ VALUE OTHER
<dbl> <chr> <dbl> <chr>
1 1 a 10 A
2 1 b 50 D
3 2 c 60 G
4 2 d 70 J
Note:
This solution first creates two diff columns, one adding the lag and the other adding the lead of VALUE to each VALUE. Only the offsetting rows have a zero in either diff or diff2, so I filtered out those rows, resulting in the desired output.
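Both approaches above compare each row only with its immediate neighbours, so a row that follows a cancelled pair and has the same magnitude (for example 10, -10, 10 within one group) would also be treated as offsetting. A sketch of an alternative that cancels pairs one at a time with a stack is shown below. The helper name keep_unmatched is made up for illustration, and the sketch assumes a row is cancelled only by the most recent row still kept in the same ID/SEQ group:

```r
# Toy data in the shape used in the question
df <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  SEQ   = c("a", "b", "b", "b", "c", "c", "c", "d", "d", "d"),
  VALUE = c(10, 10, -10, 10, 20, -20, 20, 30, -30, 30),
  OTHER = LETTERS[1:10]
)

# Hypothetical helper: returns TRUE for rows that are not cancelled
keep_unmatched <- function(value) {
  kept <- integer(0)                 # stack of row positions still kept
  for (i in seq_along(value)) {
    top <- length(kept)
    if (top > 0 && value[kept[top]] + value[i] == 0) {
      kept <- kept[-top]             # row i negates the last kept row: drop both
    } else {
      kept <- c(kept, i)             # otherwise keep row i for now
    }
  }
  seq_along(value) %in% kept
}

# Apply the helper within each ID/SEQ group; ave() returns 0/1 here
keep <- as.logical(ave(df$VALUE, df$ID, df$SEQ, FUN = keep_unmatched))
df2 <- df[keep, ]
df2
```

With dplyr, the same helper could be used as group_by(ID, SEQ) %>% filter(keep_unmatched(VALUE)). Note that two consecutive zeros would also cancel each other under this rule.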
I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize this by grouping df by name and year, then computing the average and the number of entries in each group:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now I would like to replace every entry of asd$average where asd$ave_number < 2 with the value for the matching year from the following lookup table:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") #To avoid duplicate of variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
mutate(average = if_else(ave_number < 2, unname(replacer[as.character(year)]), average)) # look up by name, not by position
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
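One detail of the named-vector approach is worth spelling out: indexing a named vector with a number selects by position, while indexing with a character string selects by name. With years 1, 2, 3 the two coincide, but they would not for real calendar years. A small sketch with hypothetical years (2019 to 2021 are made up for illustration):

```r
# Hypothetical lookup keyed by calendar year instead of 1, 2, 3
replacer <- setNames(c(100, 200, 300), c(2019, 2020, 2021))
year <- c(2021, 2019)

# Lookup by name: convert the year to character first
by_name <- unname(replacer[as.character(year)])

# Lookup by position: asks for elements number 2021 and 2019, which do not exist
by_position <- unname(replacer[year])

by_name
by_position
```

Here by_name holds the intended replacements, whereas by_position is all NA because the vector has only three elements.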