How to select the last N observations from each group in a dplyr data frame?

Given a dataframe:
df <- structure(list(a = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4), b = c(34,
343, 54, 11, 55, 62, 59, -9, 0, -0.5)), row.names = c(NA, -10L
), class = c("tbl_df", "tbl", "data.frame"))
I want to take the last N observations (rows) from each group:
df %>%
dplyr::group_by(a) %>%
dplyr::last(2)
Gives me wrong results.
I want it to be:
a b
1 343
1 54
2 55
2 62
3 59
3 -9
4 0
4 -0.5
Please advise what is wrong here?
The error I get is:
Error in order(order_by)[[n]] : subscript out of bounds

Since this is a dplyr-specific question:
1) After the group_by, use slice on the row_number()
library(tidyverse)
df %>%
group_by(a) %>%
slice(tail(row_number(), 2))
# A tibble: 8 x 2
# Groups: a [4]
# a b
# <dbl> <dbl>
#1 1 343
#2 1 54
#3 2 55
#4 2 62
#5 3 59
#6 3 -9
#7 4 0
#8 4 -0.5
2) Or use filter from dplyr
df %>%
group_by(a) %>%
filter(row_number() >= (n() - 1))
3) or with do and tail
df %>%
group_by(a) %>%
do(tail(., 2))
4) In addition to the tidyverse methods, we can also use the compact data.table syntax
library(data.table)
setDT(df)[df[, .I[tail(seq_len(.N), 2)], a]$V1]
5) Or by from base R
by(df, df$a, FUN = tail, 2)
6) or with aggregate from base R
df[aggregate(c ~ a, transform(df, c = seq_len(nrow(df))), FUN = tail, 2)$c,]
7) or with split from base R
do.call(rbind, lapply(split(df, df$a), tail, 2))

dplyr 1.0.0 introduced slice_tail that makes this simple:
library(dplyr)
df %>%
group_by(a) %>%
slice_tail(n = 2)
Similarly, there is slice_head to get the first n rows.

A base R option using tapply is to subset the last two rows for every group.
df[unlist(tapply(1:nrow(df), df$a, tail, 2)), ]
# a b
# <dbl> <dbl>
#1 1 343
#2 1 54
#3 2 55
#4 2 62
#5 3 59
#6 3 -9
#7 4 0
#8 4 -0.5
Or another option using ave
df[as.logical(with(df, ave(1:nrow(df), a, FUN = function(x) x %in% tail(x, 2)))), ]

Also a tidyverse possibility:
df %>%
group_by(a) %>%
top_n(2, row_number())
a b
<dbl> <dbl>
1 1. 343.
2 1. 54.0
3 2. 55.0
4 2. 62.0
5 3. 59.0
6 3. -9.00
7 4. 0.
8 4. -0.500
It is taking the top two rows given the row numbers per groups.

Try tail(). In R, the head function lets you preview the first n rows, while tail lets you preview the last n rows.
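To make the tail() suggestion concrete: called on the whole data frame, tail() ignores the grouping, so it has to be applied to each group separately. A minimal base R sketch, assuming the question's data rebuilt as a plain data.frame (the name last_two is only for illustration):

```r
# Same values as the question's df, as a plain data frame
df <- data.frame(a = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
                 b = c(34, 343, 54, 11, 55, 62, 59, -9, 0, -0.5))

# tail() alone returns the last 2 rows of the whole frame, not of each group
tail(df, 2)

# Split by group, take the last 2 rows of each piece, and bind them back together
last_two <- do.call(rbind, lapply(split(df, df$a), tail, 2))
last_two
```

Groups with fewer than 2 rows are returned whole, since tail() never asks for more rows than exist.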

Related

R: Count number of times B follows A using dplyr

I have a data.frame of monthly averages of radon measured over a few months. I have labeled each value either "below" or "above" a threshold and would like to count the number of times the average value does: "below to above", "above to below", "above to above" or "below to below".
df <- data.frame(value = c(130, 200, 240, 230, 130),
level = c("below", "above","above","above", "below"))
A bit of digging into a MATLAB answer on here suggests that we could use the Matrix package:
require(Matrix)
sparseMatrix(i=c(2,2,2,1), j=c(2,2,2))
Produces this result which I can't yet interpret.
[1,] | |
[2,] | .
Any thoughts about a tidyverse method?
Sure, just use group by and count the values
library(dplyr)
df <- data.frame(value = c(130, 200, 240, 230, 130),
level = c("below", "above","above","above", "below"))
df %>%
group_by(grp = paste(level, lead(level))) %>%
summarise(n = n()) %>%
# drop the observation that does not have a "next" value
filter(!grepl(pattern = "NA", x = grp))
#> # A tibble: 3 × 2
#> grp n
#> <chr> <int>
#> 1 above above 2
#> 2 above below 1
#> 3 below above 1
You could use table from base R:
table(df$level[-1], df$level[-nrow(df)])
above below
above 2 1
below 1 0
EDIT in response to @HCAI's comment: applying table to multiple columns:
First, generate some data:
set.seed(1)
U = matrix(runif(4*20),nrow = 20)
dfU=data.frame(round(U))
library(plyr) # for mapvalues
df2 = data.frame(apply(dfU,
FUN = function(x) mapvalues(x, from=0:1, to=c('below','above')),
MARGIN=2))
so that df2 contains random 'above' and 'below':
X1 X2 X3 X4
1 below above above above
2 below below above below
3 above above above below
4 above below above below
5 below below above above
6 above below above below
7 above below below below
8 above below below above
9 above above above below
10 below below above above
11 below below below below
12 below above above above
13 above below below below
14 below below below below
15 above above below below
16 below above below above
17 above above below above
18 above below above below
19 below above above above
20 above below below above
Now apply table to each column and vectorize the output:
apply(df2,
FUN=function(x) as.vector(table(x[-1],
x[-nrow(df2)])),
MARGIN=2)
which gives us
X1 X2 X3 X4
[1,] 5 2 7 2
[2,] 5 6 4 6
[3,] 6 5 3 6
[4,] 3 6 5 5
All that's left is a bit of care in labeling the rows of the output. Maybe someone can come up with a clever way to merge/join the data frames resulting from apply(df2, FUN=function(x) melt(table(x[-1],x[-nrow(df2)])),2), which would maintain the row names. (I spent some time looking into it but couldn't work out how to do it easily.)
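One possible way to keep the labels, sketched in base R: build the transition strings explicitly and count them against a fixed set of levels, so every column yields the same named vector. The helper count_transitions, the " -> " separator, and the use of sample() to generate data are illustrative choices, not part of the answer above.

```r
set.seed(1)
# Random 'above'/'below' data, 20 rows by 4 columns (stand-in for df2)
df2 <- as.data.frame(matrix(sample(c("below", "above"), 80, replace = TRUE),
                            nrow = 20),
                     stringsAsFactors = FALSE)

states <- c("above", "below")
# All four possible transitions, in a fixed order
all_pairs <- paste(rep(states, each = 2), rep(states, times = 2), sep = " -> ")

# Hypothetical helper: count "current -> next" pairs in one vector
count_transitions <- function(x) {
  pairs <- paste(head(x, -1), tail(x, -1), sep = " -> ")
  tab <- table(factor(pairs, levels = all_pairs))
  setNames(as.vector(tab), names(tab))
}

# One column of counts per input column; row names carry the transition labels
res <- sapply(df2, count_transitions)
res
```

Because the factor levels are fixed, a transition that never occurs in a column shows up as 0 instead of shifting the rows.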
Not run, so there may be a typo, but you get the idea. I'll leave it to you to deal with NA and the first observation. This makes a single pass through the vector.
library(dplyr)
df %>%
summarize(increase = sum(case_when(value > lag(value) ~ 1, TRUE ~ 0)),
decrease = sum(case_when(value < lag(value) ~ 1, TRUE ~ 0)),
constant = sum(case_when(value == lag(value) ~ 1, TRUE ~ 0))
)
A slightly different version:
library(dplyr)
library(stringr)
df %>%
group_by(level = str_c(level, lead(level), sep = " ")) %>%
count(level) %>%
na.omit()
level n
<chr> <int>
1 above above 2
2 above below 1
3 below above 1
Another possible solution, based on tidyverse:
library(tidyverse)
df<-data.frame(value=c(130,200, 240, 230, 130),level=c("below", "above","above","above", "below"))
df %>%
mutate(changes = str_c(lag(level), level, sep = "_")) %>%
count(changes) %>% drop_na(changes)
#> changes n
#> 1 above_above 2
#> 2 above_below 1
#> 3 below_above 1
Yet another solution, based on data.table:
library(data.table)
dt<-data.table(value=c(130,200, 240, 230, 130),level=c("below", "above","above","above", "below"))
dt[, changes := paste(shift(level), level, sep = "_")
][2:.N][,.(n = .N), keyby = .(changes)]
#> changes n
#> 1: above_above 2
#> 2: above_below 1
#> 3: below_above 1

Is there a way in R to combine the functions slice_max (dplyr) and fct_other(forcats)?

I'm trying to combine the functions slice_max from dplyr and fct_other from forcats to get a top n slice of a dataframe, based on a numeric variable, but I don't want to lose the non-top-n factors. I want those other factors to be designated as "Others" so I can summarise or count them afterwards if I need to.
For example, with a dataframe similar to this:
df <- data.frame(acron = c("AA", "BB", "CC", "DD", "EE", "FF", "GG"), value = c(6, 4, 1, 10, 3, 1, 1))
If I want the top 3 subjects by their "value", I can use the following code:
df %>%
slice_max(value, n = 3)
Getting the next result:
acron value
DD 10
AA 6
BB 4
But I would like to assign the dropped "acron"s the level "Others", similar to the results obtained using the function fct_other from forcats. I've tried this code but it doesn't work:
df %>%
mutate(acron = fct_other(acron, keep = slice_max(value, n = 3), other_level = "Others"))
Any suggestion to get something like this?:
acron value
DD 10
AA 6
BB 4
Others 3
Others 1
Others 1
Others 1
Or even like this:
acron value
DD 10
AA 6
BB 4
Others 6
One option could be using fct_lump_n():
df %>%
mutate(acron = fct_lump_n(acron, n = 3, w = value))
acron value
1 AA 6
2 BB 4
3 Other 1
4 DD 10
5 Other 3
6 Other 1
7 Other 1
To use the slice_max approach, we need to extract the 'acron' column as a vector, which pull does:
library(dplyr)
library(forcats)
df %>%
mutate(acron = fct_other(acron, keep = {.} %>%
slice_max(value, n = 3) %>%
pull(acron), other_level = "Others"))
# acron value
#1 AA 6
#2 BB 4
#3 Others 1
#4 DD 10
#5 Others 3
#6 Others 1
#7 Others 1
Another option is to use order and head:
df %>%
mutate(acron = fct_other(acron, keep = head(acron[order(-value)], 3),
other_level = "Others")) %>%
arrange(desc(value))
# acron value
#1 DD 10
#2 AA 6
#3 BB 4
#4 Others 3
#5 Others 1
#6 Others 1
#7 Others 1
Or arrange first and then use head:
df %>%
arrange(desc(value)) %>%
mutate(acron = fct_other(acron, keep = head(acron, 3), other_level = "Others"))
# acron value
#1 DD 10
#2 AA 6
#3 BB 4
#4 Others 3
#5 Others 1
#6 Others 1
#7 Others 1
To get the summarised output, do a group by sum
df %>%
arrange(desc(value)) %>%
group_by(acron = fct_other(acron, keep = head(acron, 3),
other_level = "Others")) %>%
summarise(value = sum(value))
# A tibble: 4 x 2
# acron value
# <fct> <dbl>
#1 AA 6
#2 BB 4
#3 DD 10
#4 Others 6
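For comparison, the same lumping and summing can be sketched without forcats. This is only an illustrative base R alternative under the question's data; the column name group and the variables top and totals are made up for the example.

```r
df <- data.frame(acron = c("AA", "BB", "CC", "DD", "EE", "FF", "GG"),
                 value = c(6, 4, 1, 10, 3, 1, 1),
                 stringsAsFactors = FALSE)

# The three acronyms with the largest values
top <- df$acron[order(-df$value)][1:3]

# Keep the top acronyms, relabel everything else as "Others"
df$group <- ifelse(df$acron %in% top, df$acron, "Others")

# Sum the values within each label
totals <- aggregate(value ~ group, data = df, FUN = sum)
totals
```

Unlike slice_max, order() breaks ties by position, so with tied values at the cutoff the choice of which acronym stays in the top 3 is arbitrary.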

How can I do operations among rows in a tibble?

I have a tibble that stores variables measured at different points in the sea and at different depths. I need to condense all the depths of the same point into a single row using a specific formula: for each pair of consecutive depths X and X+1, add the two values and multiply by the difference in depth (depth of X+1 minus depth of X), then sum these products. I wrote it out in Excel to better explain what I'm trying to do.
Here is a small sample of the (edited) data I'm working with:
long lat station depth no3
<dbl> <dbl> <dbl> <dbl> <dbl>
1 -71.1 -32 1 0 9
2 -71.1 -32 1 5 14
3 -71.1 -32 1 10 10
4 -71.1 -32 1 20 11
5 -71.6 -32 2 0 13
6 -71.6 -32 2 5 8
7 -71.6 -32 2 10 2
8 -71.6 -32 2 20 6
9 -71.6 -32 2 50 4
10 -71.6 -32 2 75 9
# ... with 942 more rows
From what I read here in similar questions, I could use aggregate or merge, but those only do the summation, and I don't know how to get them to do the entire equation. I'd appreciate any suggestions. I'm new to R, so I'm sorry if I haven't been very clear (or if the solution is actually quite simple).
You can use the lead function to create the sums based on the following row's data (use lag for the previous row), and then sum this new column in a summarize:
df <- data.frame(
depth = c(0, 5, 10, 20, 50),
NO3 = c(3, 5, 6, 2, 3)) %>%
mutate(a = (lead(NO3) + NO3)*(lead(depth) - depth))
df
depth NO3 a
1 0 3 40
2 5 5 55
3 10 6 80
4 20 2 150
5 50 3 NA
df %>%
summarize(b = sum(a, na.rm = TRUE))
b
1 325
Note that the na.rm in the sum is key here since the lead functions create NA values in the final row. These can be filled using the default argument.
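The question asks for one value per station, which the example above does not show. A sketch of how the same lead() expression might be applied per group, assuming the ten sample rows from the question (the names dat, res, and total are illustrative):

```r
library(dplyr)

# The ten sample rows from the question (long/lat omitted)
dat <- data.frame(
  station = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  depth   = c(0, 5, 10, 20, 0, 5, 10, 20, 50, 75),
  no3     = c(9, 14, 10, 11, 13, 8, 2, 6, 4, 9))

res <- dat %>%
  group_by(station) %>%
  # make sure depths are in increasing order within each station
  arrange(depth, .by_group = TRUE) %>%
  # lead() is evaluated within each group, so it never crosses stations
  summarize(total = sum((lead(no3) + no3) * (lead(depth) - depth),
                        na.rm = TRUE))
res
```

For these rows the totals work out to 445 for station 1 and 860 for station 2.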
EDIT:
If you'd like to apply this to more than just one column, you can use the "scoped" variants of mutate and summarize, by adding _at or _if to the end of these functions.
df2 <- data.frame(
depth = c(0, 5, 10, 20, 50),
NO3 = c(3, 5, 6, 2, 3),
NO4 = c(1, 2, 3, 4, 5),
NO5 = c(5, 4, 3, 2, 1))
_at functions require either a names vector or an index vector to determine which columns to operate on. All three of these will return the same thing, where .x refers to the column being modified:
df2 %>%
mutate_at(c("NO3", "NO4", "NO5"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(c("NO3", "NO4", "NO5"), sum, na.rm = T)
df2 %>%
mutate_at(vars(NO3:NO5), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(vars(NO3:NO5), sum, na.rm = T)
df2 %>%
mutate_at(2:4, ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_at(2:4, sum, na.rm = T)
NO3 NO4 NO5
1 325 380 220
_if functions need a "predicate function" (or, as here, a logical vector with one entry per column) that determines whether a column will be operated on. Either of these, which check the name of the column using str_detect from stringr, would work:
df2 %>%
mutate_if(str_detect(colnames(.), "NO"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_if(str_detect(colnames(.), "NO"), sum, na.rm = T)
df2 %>%
mutate_if(!str_detect(colnames(.), "depth"), ~(lead(.x) + .x)*(lead(depth)-depth)) %>%
summarize_if(!str_detect(colnames(.), "depth"), sum, na.rm = T)
NO3 NO4 NO5
1 325 380 220
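Since dplyr 1.0.0 the scoped variants (_at, _if, _all) are superseded by across(). A sketch of the same computation in that style, assuming the df2 defined above (res is an illustrative name):

```r
library(dplyr)

df2 <- data.frame(
  depth = c(0, 5, 10, 20, 50),
  NO3 = c(3, 5, 6, 2, 3),
  NO4 = c(1, 2, 3, 4, 5),
  NO5 = c(5, 4, 3, 2, 1))

res <- df2 %>%
  # .x is the column being transformed; depth is looked up in the data
  mutate(across(NO3:NO5, ~ (lead(.x) + .x) * (lead(depth) - depth))) %>%
  summarize(across(NO3:NO5, ~ sum(.x, na.rm = TRUE)))
res
```

The column selection inside across() uses the same tidyselect syntax as select(), so helpers such as starts_with("NO") would also work here.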

Interpolation of values from list

I have a dataframe containing the results of a competition. In this example competitors b and c have tied for second place. The actual dataframe is very large and could contain multiple ties.
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
I also have point values for the respective places, where first place gets 4 points, 2nd gets 3, 3rd gets 1 and 4th gets 0.
points <- c(4, 3, 1, 0)
names(points) <- 1:4
I can match points to place to get each competitor's score
df %>%
mutate(score = points[place])
name place score
1 a 1 4
2 b 2 3
3 c 2 3
4 d 4 0
What I would like to do though is award points to b and c that are the mean of the point values for 2nd and 3rd, such that each receives 2 points like this:
name place score
1 a 1 4
2 b 2 2
3 c 2 2
4 d 4 0
How can I accomplish this programmatically?
A solution using nested data frames and purrr.
library(dplyr)
library(tidyr)
library(purrr)
df <- data.frame(name = letters[1:4],
place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)
names(points) <- 1:4
# a function to help expand the dataframe based on the number of ties
expand_all <- function(x,n){
x:(x+n-1)
}
df %>%
group_by(place) %>%
tally() %>%
mutate(new_place = purrr::map2(place,n, expand_all)) %>%
unnest(new_place) %>%
mutate(score = points[new_place]) %>%
group_by(place) %>%
summarize(score = mean(score)) %>%
inner_join(df)
Robert Wilson's answer gave me an idea. Rather than mapping over nested dataframes, the rank function from base R can get to the same result:
df %>%
mutate(new_place = rank(place, ties.method = "first")) %>%
mutate(score = points[new_place]) %>%
group_by(place) %>%
summarize(score = mean(score)) %>%
inner_join(df)
place score name
<dbl> <dbl> <chr>
1 1 4 a
2 2 2 b
3 2 2 c
4 4 0 d
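This can be accomplished in a few lines with an ifelse() statement inside a mutate(). Note that the 1 added for ties is the point value of 3rd place, so as written this handles only the two-way tie for 2nd place in this example: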
df %>%
group_by(place) %>%
mutate(n_ties = n()) %>%
ungroup %>%
mutate(score = (points[place] + ifelse(n_ties > 1, 1, 0))/ n_ties)
# A tibble: 4 x 4
name place n_ties score
<chr> <dbl> <int> <dbl>
1 a 1 1 4
2 b 2 2 2
3 c 2 2 2
4 d 4 1 0
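A more general version of the tie handling can be sketched in base R by combining the rank idea from the earlier answer with ave(): give each tied competitor a distinct slot, look up the points for that slot, then average within each place. The variable name slot is illustrative.

```r
df <- data.frame(name = letters[1:4],
                 place = c(1, 2, 2, 4))
points <- c(4, 3, 1, 0)

# Break ties by position so tied competitors occupy consecutive slots
slot <- rank(df$place, ties.method = "first")

# Average the points of the occupied slots within each tied place
df$score <- ave(points[slot], df$place, FUN = mean)
df
```

This does not depend on the size of the tie: a three-way tie for first place would average the points for 1st, 2nd, and 3rd.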

Summarize with conditions based on ranges in dplyr

Here is an illustration of my problem.
Sample data:
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5), A = c("foo", "bar", "foo", "foo", "bar", "bar"),
B = c(1, 5, 7, 23, 54, 202))
df
ID A B
1 1 foo 1
2 1 bar 5
3 2 foo 7
4 2 foo 23
5 3 bar 54
6 5 bar 202
What I want to do is summarize by ID and count the rows for each ID. I also want the frequencies of each ID in subgroups based on the value of B in different numeric ranges (the number of observations with B>=0 & B<5, B>=5 & B<10, B>=10 & B<15, B>=15 & B<20, etc., for every ID).
I want this result:
ID count count_0_5 count_5_10 etc
1 1 2 1 1 etc
2 2 2 NA 1 etc
3 3 1 NA NA etc
4 5 1 NA NA etc
I tried this code using package dplyr:
df %>%
group_by(ID) %>%
summarize(count=n(), count_0_5 = n(B>=0 & B<5))
However, it returns this error:
`Error in n(B>=0 & B<5) :
unused argument (B>=0 & B<5)`
Perhaps replace n(B>=0 & B<5) with sum(B>=0 & B<5)?
This counts the cases where both conditions are met.
However, you'll get 0s instead of NAs. This can be fixed with:
ifelse(sum(B>=0 & B<5)>0, sum(B>=0 & B<5), NA)
I'm pretty sure there is a better (clearer and more efficient) solution, but this should work!
library(dplyr)
library(tidyr)
df %>% group_by(ID) %>%
# right = FALSE makes the bins [0,5), [5,10), ... so they match B>=0 & B<5 etc.
mutate(B_cut = cut(B, c(0,5,10,15,20,1000), right = FALSE, labels = c('count_0_5','count_5_10','count_10_15','count_15_20','count_20_1000')), count=n()) %>%
group_by(ID,B_cut) %>% mutate(n=n()) %>% slice(1) %>% select(-A,-B) %>%
spread(B_cut, n)
#2nd option
left_join(df %>% group_by(ID) %>% summarise(n=n()),
df %>% mutate(B_cut = cut(B, c(0,5,10,15,20,1000), right = FALSE, labels = c('count_0_5','count_5_10','count_10_15','count_15_20','count_20_1000'))) %>%
count(ID,B_cut) %>% spread(B_cut,n),
by='ID')
# A tibble: 4 x 5
# Groups: ID [4]
ID count count_0_5 count_5_10 count_20_1000
<dbl> <int> <int> <int> <int>
1 1 2 1 1 NA
2 2 2 NA 1 1
3 3 1 NA NA 1
4 5 1 NA NA 1
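The binning step can also be sketched in base R with cut() and table(). Here right = FALSE gives half-open bins [0,5), [5,10), and so on, matching the conditions in the question. The Inf upper bound and the label count_20_plus are illustrative choices, not taken from the answers above.

```r
df <- data.frame(ID = c(1, 1, 2, 2, 3, 5),
                 B  = c(1, 5, 7, 23, 54, 202))

breaks <- c(0, 5, 10, 15, 20, Inf)
labels <- c("count_0_5", "count_5_10", "count_10_15",
            "count_15_20", "count_20_plus")

# Half-open intervals [lower, upper): B = 5 falls in count_5_10
bins <- cut(df$B, breaks = breaks, labels = labels, right = FALSE)

# Cross-tabulate ID against bin; empty combinations are 0 rather than NA
counts <- table(ID = df$ID, bins)
counts

# Row totals give the overall count per ID
rowSums(counts)
```

Unlike the spread() output, empty bins are kept as columns of zeros; as.data.frame.matrix(counts) converts the result to a wide data frame if one is needed.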
