Summarizing columns using a vector with dplyr - r

I want to calculate the mean of certain columns (names stored in a vector), while grouping against a column. Here is a reproducible example:
Cities <- c("London","New_York")
df <- data.frame(Grade = c(rep("Bad",2),rep("Average",4),rep("Good",4)),
London = seq(1,10,1),
New_York = seq(11,20,1),
Shanghai = seq(21,30,1))
> df
Grade London New_York Shanghai
1 Bad 1 11 21
2 Bad 2 12 22
3 Average 3 13 23
4 Average 4 14 24
5 Average 5 15 25
6 Average 6 16 26
7 Good 7 17 27
8 Good 8 18 28
9 Good 9 19 29
10 Good 10 20 30
The output I want:
> df %>% group_by(Grade) %>% summarise(London = mean(London), New_York = mean(New_York))
# A tibble: 3 x 3
Grade London New_York
<fct> <dbl> <dbl>
1 Average 4.5 14.5
2 Bad 1.5 11.5
3 Good 8.5 18.5
I would like to select the columns named in the vector Cities (without typing their names out) inside summarise, while keeping the original names from the vector.

You can do:
df %>%
group_by(Grade) %>%
summarise_at(vars(one_of(Cities)), mean)
Grade London New_York
<fct> <dbl> <dbl>
1 Average 4.5 14.5
2 Bad 1.5 11.5
3 Good 8.5 18.5
From documentation:
one_of(): Matches variable names in a character vector.

vars() can take a character vector of column names directly. The select helpers (matches(), starts_with(), ends_with()) are meant for cases where there is some kind of pattern to match. In the current implementation vars() is more general: it can select columns as well as deselect them (with -).
library(dplyr)
df %>%
group_by(Grade) %>%
summarise_at(vars(Cities), mean)
# A tibble: 3 x 3
# Grade London New_York
# <fct> <dbl> <dbl>
#1 Average 4.5 14.5
#2 Bad 1.5 11.5
#3 Good 8.5 18.5
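For readers on dplyr 1.0.0 or later, where the summarise_at() family is superseded, the same result can be sketched with across() and all_of(). This is a minimal sketch and not part of the original answers; it rebuilds the question's df and Cities so that it runs on its own.

```r
library(dplyr)

# Data and column vector copied from the question
Cities <- c("London", "New_York")
df <- data.frame(Grade = c(rep("Bad", 2), rep("Average", 4), rep("Good", 4)),
                 London = seq(1, 10, 1),
                 New_York = seq(11, 20, 1),
                 Shanghai = seq(21, 30, 1))

# all_of() states explicitly that Cities is an external character vector,
# and errors if any of the named columns is missing from df
res <- df %>%
  group_by(Grade) %>%
  summarise(across(all_of(Cities), mean))

res
# London means by Grade: Average 4.5, Bad 1.5, Good 8.5
```

The output columns keep the names stored in Cities, and Shanghai is left out because it is not in the vector.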

Related

Filter a dataset having duplicated rows in R

I need to filter a dataset based on two conditions.
Here is what my dataset looks like:
df <- data.frame(
id = c(1,2,2,3,3,4,5,5),
district = c(10,10,11,12,12,13,14,15),
value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0))
> df
id district value
1 1 10 10.2
2 2 10 10.8
3 2 11 10.8
4 3 12 7.5
5 3 12 9.3
6 4 13 6.0
7 5 14 7.0
8 5 15 7.0
I have duplicated rows based on id. To choose which row to keep, there are two rules:
First, for ids that have multiple districts but the same value, I need to keep the first row.
Second, for ids that have multiple values within the same district, I need to keep the row with the max value.
So the desired filtered dataset is:
> df
id district value
1 1 10 10.2
2 2 10 10.8
3 3 12 9.3
4 4 13 6.0
5 5 14 7.0
So far I have only been able to locate the duplicated ids:
df[duplicated(df$id),]
Does anyone have any ideas?
Thanks
With dplyr:
df %>%
group_by(id) %>%
arrange(desc(value)) %>%
slice(1)
# # A tibble: 5 x 3
# # Groups: id [5]
# id district value
# <dbl> <dbl> <dbl>
# 1 1 10 10.2
# 2 2 10 10.8
# 3 3 12 9.3
# 4 4 13 6
# 5 5 14 7
There's no real need to treat the two cases separately (taking the max when there are multiple values, keeping the first row when the values are duplicated). If we order the data descending by value and keep the first row in each id group, both cases are handled by the same step.
library(dplyr)
df %>%
arrange(id, -value) %>%
distinct(id, district, .keep_all = TRUE) %>%
distinct(id, value, .keep_all = TRUE)
id district value
1 1 10 10.2
2 2 10 10.8
3 3 12 9.3
4 4 13 6.0
5 5 14 7.0
First we sort by id and then descending by value, then we use the distinct function to keep the first row of each unique combination.
In base R, we can use duplicated after ordering the rows
df1 <- df[order(df$id, -df$value),]
df1[!duplicated(df1$id),]
# id district value
#1 1 10 10.2
#2 2 10 10.8
#5 3 12 9.3
#6 4 13 6.0
#7 5 14 7.0
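The approaches above depend on tied values keeping their original row order after sorting, which is why district 10 (not 11) survives for id 2. The sketch below only illustrates that point with base R's order(), whose documentation states that unresolved ties are left in their original ordering; the variable names ord and kept are made up for this example.

```r
# Data copied from the question
df <- data.frame(
  id = c(1, 2, 2, 3, 3, 4, 5, 5),
  district = c(10, 10, 11, 12, 12, 13, 14, 15),
  value = c(10.2, 10.8, 10.8, 7.5, 9.3, 6, 7.0, 7.0))

# Sort by id, then by value descending; rows 2/3 and 7/8 are tied on both keys
ord <- order(df$id, -df$value)
ord
# 1 2 3 5 4 6 7 8  (tied rows keep their original relative order)

# Keep the first row per id in the sorted data
sorted <- df[ord, ]
kept <- sorted[!duplicated(sorted$id), ]
kept
```

Rows 2 and 7 come before rows 3 and 8 in ord, so the first district is the one retained when values are equal, while row 5 (value 9.3) moves ahead of row 4 (value 7.5) for id 3.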

How to create a percentage column based on the values present in every third row?

I have a data frame containing weight values. I have to create a new column, the percentage change of weight, where the denominator is the value in every third row.
df <- data.frame(weight = c(30,30,109,30,309,10,20,20,14))
# expected output
change_of_weight = c(30/109, 30/109, 109/109, 30/10,309/10,10/10,20/14,20/14,14/14)
Subset the weight column where its position %% 3 is zero and repeat each value three times.
df <- transform(df, change_of_weight=weight / rep(weight[1:nrow(df) %% 3 == 0], each=3))
df
weight change_of_weight
1 30 0.2752294
2 30 0.2752294
3 109 1.0000000
4 30 3.0000000
5 309 30.9000000
6 10 1.0000000
7 20 1.4285714
8 20 1.4285714
9 14 1.0000000
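To see what the one-liner above is doing, it can help to build the denominator in separate steps. This is only an illustrative breakdown; is_third and denominators are names introduced here, and it assumes the number of rows is a multiple of 3, as in the example data.

```r
# Weights copied from the question
weight <- c(30, 30, 109, 30, 309, 10, 20, 20, 14)

# TRUE at positions 3, 6 and 9
is_third <- seq_along(weight) %% 3 == 0

# Every third value, each repeated three times so it lines up with its block
denominators <- rep(weight[is_third], each = 3)
denominators
# 109 109 109 10 10 10 14 14 14

change_of_weight <- weight / denominators
change_of_weight
```

If the row count were not a multiple of 3, denominators would be shorter than weight and the division would recycle values, so the grouped versions that use the last value of each block are safer in that case.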
You can create a group of every 3 rows and divide weight column by the last value in the group.
df$change <- with(df, ave(weight, ceiling(seq_along(weight)/3),
FUN = function(x) x/x[length(x)]))
Or using dplyr :
library(dplyr)
df %>%
group_by(grp = ceiling(row_number()/3)) %>%
mutate(change = weight/last(weight))
# weight grp change
# <dbl> <dbl> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
We can also use gl to create a grouping column
library(dplyr)
df %>%
group_by(grp = as.integer(gl(n(), 3, n()))) %>%
mutate(change = weight/last(weight))
# A tibble: 9 x 3
# Groups: grp [3]
# weight grp change
# <dbl> <int> <dbl>
#1 30 1 0.275
#2 30 1 0.275
#3 109 1 1
#4 30 2 3
#5 309 2 30.9
#6 10 2 1
#7 20 3 1.43
#8 20 3 1.43
#9 14 3 1
Or using data.table
library(data.table)
setDT(df)[, change := weight/last(weight), .(as.integer(gl(nrow(df), 3, nrow(df))))]

Loop in R to find previous match

I need some help writing a loop in R. I am having trouble selecting the previous match when the same id occurs, and then filling in an OLD_RANK column and a NEW_RANK column.
OLD_RANK must be the NEW_RANK of the previous match found.
NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
Here my data for this example
JUNK<- matrix(c(1,1,10,20,3,2,30,40,1,3,60,4,3,
4,5,40,1,5,10,30,7,6,20,20),ncol=4,byrow=TRUE)
colnames(JUNK) <- c("ID1","DAY","S1","S2")
JUNK<- as.data.frame(JUNK)
What I thought could be a good start:
# Subset to find the previous match: keep matches from earlier days and, if more
# than one match is found, choose the row with the highest value in DAY
# (pseudocode, repeated for each row, where `day` is that row's DAY)
s1 <- subset(s1, DAY < day)
s1 <- subset(s1, DAY == max(DAY))
# if no match is found: OLD_RANK <- 35 and NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
# if a previous match is found: NEW_RANK <- OLD_RANK + 0.05 * (S1 - S2)
expected result:
ID1 DAY S1 S2 OLD_RANK NEW_RANK
1 1 10 20 35 34.5
3 2 30 40 35 34.5
1 3 60 4 34.5 37.3
3 4 5 40 34.5 32.75
1 5 10 30 37.3 36.3
7 6 20 20 35 35
Any help is appreciated.
Here's one approach:
library(dplyr)
JUNK2 <- JUNK %>%
group_by(ID1) %>%
mutate(change = 0.05*(S1-S2),
NEW_RANK = 35 + cumsum(change),
OLD_RANK = lag(NEW_RANK) %>% if_else(is.na(.), 35, .)) %>%
ungroup() # drop the grouping so the result is an ungrouped table
Result:
JUNK2
# A tibble: 6 x 7
ID1 DAY S1 S2 change NEW_RANK OLD_RANK
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 10 20 -0.5 34.5 35
2 3 2 30 40 -0.5 34.5 35
3 1 3 60 4 2.8 37.3 34.5
4 3 4 5 40 -1.75 32.8 34.5
5 1 5 10 30 -1 36.3 37.3
6 7 6 20 20 0 35 35
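Since the question asked for a loop, here is a rough base R sketch of the row-by-row logic it describes: look up the most recent earlier row with the same ID1, take its NEW_RANK as OLD_RANK, and fall back to 35 when there is no earlier match. The names start_rank, prev and latest are introduced here for illustration, and the sketch assumes each ID1 appears at most once per DAY.

```r
# Data copied from the question
JUNK <- matrix(c(1, 1, 10, 20,
                 3, 2, 30, 40,
                 1, 3, 60, 4,
                 3, 4, 5, 40,
                 1, 5, 10, 30,
                 7, 6, 20, 20), ncol = 4, byrow = TRUE)
colnames(JUNK) <- c("ID1", "DAY", "S1", "S2")
JUNK <- as.data.frame(JUNK)

# Process rows in day order so earlier ranks exist before they are needed
JUNK <- JUNK[order(JUNK$DAY), ]
start_rank <- 35  # rank used when no previous match exists (from the question)
JUNK$OLD_RANK <- NA_real_
JUNK$NEW_RANK <- NA_real_

for (i in seq_len(nrow(JUNK))) {
  # Earlier rows with the same ID1
  prev <- which(JUNK$ID1 == JUNK$ID1[i] & JUNK$DAY < JUNK$DAY[i])
  if (length(prev) == 0) {
    JUNK$OLD_RANK[i] <- start_rank
  } else {
    # Among the earlier matches, take the one with the latest DAY
    latest <- prev[which.max(JUNK$DAY[prev])]
    JUNK$OLD_RANK[i] <- JUNK$NEW_RANK[latest]
  }
  JUNK$NEW_RANK[i] <- JUNK$OLD_RANK[i] + 0.05 * (JUNK$S1[i] - JUNK$S2[i])
}

JUNK
```

On this data the loop gives the same OLD_RANK and NEW_RANK values as the cumsum() approach, which will usually be faster on large tables because it avoids the per-row lookup.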

Trying to keep values of a column based on the unique values of two other columns

I want to keep only the 2 largest values in a column of a df according to the unique pair of values in two other columns. e.g., I have this df:
df <- data.frame('ID' = c(1,1,1,2,2,3,4,4,4,5),
'YEAR' = c(2002,2002,2003,2002,2003,2005,2010,2011,2012,2008),
'WAGES' = c(100,98,60,120,80,300,50,40,30,500));
I want to drop the 3rd and 9th rows, that is, keep only the two largest values in the WAGES column for each ID. The df has roughly 300,000 rows.
You can use dplyr's top_n:
library(dplyr)
df %>%
group_by(ID) %>%
top_n(n = 2, wt = WAGES)
## A tibble: 8 x 3
## Groups: ID [5]
# ID YEAR WAGES
# <dbl> <dbl> <dbl>
#1 1 2002 100
#2 1 2002 98
#3 2 2002 120
#4 2 2003 80
#5 3 2005 300
#6 4 2010 50
#7 4 2011 40
#8 5 2008 500
If I understood your question correctly, using base R:
for (i in 1:2) {
max_row <- which.max(df$WAGES)
df <- df[-c(max_row), ]
}
df
# ID YEAR WAGES
# 1 1 2002 100
# 2 1 2002 98
# 3 1 2003 60
# 4 2 2002 120
# 5 2 2003 80
# 7 4 2010 50
# 8 4 2011 40
# 9 4 2012 30
Note the - (which drops the row instead of selecting it) and the , (which keeps all columns) in df <- df[-c(max_row), ].
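The loop above removes the two largest WAGES values in the whole table. If the goal is instead to keep the two largest WAGES within each ID, as the question's "drop the 3rd and 9th rows" suggests, a base R sketch could rank the values inside each ID with ave(). The names rank_in_id and top2 are introduced here; ties are broken by row order, which differs from top_n(), which keeps all tied rows.

```r
# Data copied from the question
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3, 4, 4, 4, 5),
                 YEAR = c(2002, 2002, 2003, 2002, 2003, 2005, 2010, 2011, 2012, 2008),
                 WAGES = c(100, 98, 60, 120, 80, 300, 50, 40, 30, 500))

# Rank WAGES within each ID, with 1 for the largest value
# (negating WAGES turns an ascending rank into a descending one)
rank_in_id <- ave(-df$WAGES, df$ID,
                  FUN = function(x) rank(x, ties.method = "first"))

# Keep the top two rows per ID; rows 3 and 9 are dropped
top2 <- df[rank_in_id <= 2, ]
top2
```

Because ave() and rank() are vectorised per group, this should scale reasonably to a few hundred thousand rows without an explicit loop.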

How to mimic ROW_NUMBER() OVER(...) in R

To manipulate/summarize data over time, I usually use SQL ROW_NUMBER() OVER(PARTITION by ...). I'm new to R, so I'm trying to recreate tables I otherwise would create in SQL. The package sqldf does not allow OVER clauses. Example table:
ID Day Person Cost
1 1 A 50
2 1 B 25
3 2 A 30
4 3 B 75
5 4 A 35
6 4 B 100
7 6 B 65
8 7 A 20
I want my final table to include, for each person, the average cost of their previous 2 instances, for every row after that person's 2nd instance (from day 4 onward for both):
ID Day Person Cost Prev2
5 4 A 35 40
6 4 B 100 50
7 6 B 65 87.5
8 7 A 20 32.5
I've been trying to play around with aggregate, but I'm not really sure how to partition or qualify the function. Ideally, I'd prefer not to use the fact that id is sequential with the date to form my answer (i.e. original table could be rearranged with random date order and code would still work). Let me know if you need more details, thanks for your help!
You could lag zoo::rollapplyr with a width of 2. In dplyr,
library(dplyr)
df %>% arrange(Day) %>% # sort
group_by(Person) %>% # set grouping
mutate(Prev2 = lag(zoo::rollapplyr(Cost, width = 2, FUN = mean, fill = NA)))
#> Source: local data frame [8 x 5]
#> Groups: Person [2]
#>
#> ID Day Person Cost Prev2
#> <int> <int> <fctr> <int> <dbl>
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or all in dplyr,
df %>% arrange(Day) %>% group_by(Person) %>% mutate(Prev2 = (lag(Cost) + lag(Cost, 2)) / 2)
which returns the same thing. In base,
df <- df[order(df$Day), ]
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
c(NA, zoo::rollapplyr(x, width = 2, FUN = mean, fill = NA)[-length(x)])
})
df
#> ID Day Person Cost Prev2
#> 1 1 1 A 50 NA
#> 2 2 1 B 25 NA
#> 3 3 2 A 30 NA
#> 4 4 3 B 75 NA
#> 5 5 4 A 35 40.0
#> 6 6 4 B 100 50.0
#> 7 7 6 B 65 87.5
#> 8 8 7 A 20 32.5
or without zoo,
df$Prev2 <- ave(df$Cost, df$Person, FUN = function(x){
(c(NA, x[-length(x)]) + c(NA, NA, x[-(length(x) - 1):-length(x)])) / 2
})
which does the same thing. If you want to remove the NA rows, tack on tidyr::drop_na(Prev2) or na.omit.
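None of the snippets above build the row number itself, so here is a base R sketch of something close to ROW_NUMBER() OVER(PARTITION BY Person ORDER BY Day). The data frame is retyped from the example table in the question, and the column name rn is an assumption made for this example.

```r
# Example table from the question
df <- data.frame(ID = 1:8,
                 Day = c(1, 1, 2, 3, 4, 4, 6, 7),
                 Person = c("A", "B", "A", "B", "A", "B", "B", "A"),
                 Cost = c(50, 25, 30, 75, 35, 100, 65, 20))

# ORDER BY Day within each Person (this does not rely on ID being sequential)
df <- df[order(df$Person, df$Day), ]

# PARTITION BY Person: number the rows 1, 2, 3, ... inside each Person
df$rn <- ave(seq_len(nrow(df)), df$Person, FUN = seq_along)

# Rows after each person's 2nd instance, back in Day order
later <- df[df$rn > 2, ]
later <- later[order(later$Day, later$Person), ]
later
```

In dplyr the same numbering can be written as group_by(Person) followed by mutate(rn = row_number()) after arrange(Day), and filtering on rn > 2 is an alternative to dropping the NA rows of Prev2.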

Resources