R - Identify a sequence of row elements by groups in a dataframe

Consider the following sample dataframe:
> df
   id name time
1   1    b   10
2   1    b   12
3   1    a    0
4   2    a    5
5   2    b   11
6   2    a    9
7   2    b    7
8   1    a   15
9   2    b    1
10  1    a    3
df = structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L),
name = c("b", "b", "a", "a", "b", "a", "b", "a", "b", "a"
), time = c(10L, 12L, 0L, 5L, 11L, 9L, 7L, 15L, 1L, 3L)), .Names = c("id",
"name", "time"), row.names = c(NA, -10L), class = "data.frame")
I need to identify and record all sequences seq <- c("a", "b"), where "a" precedes "b" based on the "time" column, for each id. No other names are permitted between "a" and "b". (In my real data the sequence length is at least 5.)
The expected result for the sample data is
  a  b
1 3 10
2 5  7
3 9 11
There is a similar question, Finding rows in R dataframe where a column value follows a sequence. However, it is not clear to me how to deal with the "id" column in my case. Is there a way to solve the problem using "dplyr"?

library(dplyr); library(tidyr)
# sort the data frame by id and time
df %>% arrange(id, time) %>% group_by(id) %>%
  # flag rows where "a" is immediately followed by "b"; cumsum then
  # gives each flagged pair its own group number g
  mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>%
  # keep the flagged rows and the rows immediately after them
  filter(ab | lag(ab)) %>%
  # reshape the data frame to wide format
  select(-ab) %>% spread(name, time)
#Source: local data frame [3 x 4]
#Groups: id [2]
# id g a b
#* <int> <int> <int> <int>
#1 1 1 3 10
#2 2 1 5 7
#3 2 2 9 11
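To see why cumsum separates the pairs, it helps to inspect the helper columns before filtering (values below computed on the sample data, id == 2 only):
df %>% arrange(id, time) %>% group_by(id) %>%
  mutate(ab = name == "a" & lead(name) == "b", g = cumsum(ab)) %>%
  filter(id == 2)
# id name time ab    g
#  2 b       1 FALSE 0  <- dropped later: neither ab nor lag(ab) is TRUE
#  2 a       5 TRUE  1  <- start of pair 1
#  2 b       7 FALSE 1  <- kept because lag(ab) is TRUE
#  2 a       9 TRUE  2  <- start of pair 2
#  2 b      11 FALSE 2  <- kept because lag(ab) is TRUE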
If the length of the sequence is greater than two, you will need to check multiple lags. One option is the shift function from data.table (which accepts a vector of lag/lead steps) combined with Reduce. Say we need to check the pattern abb:
library(dplyr); library(tidyr); library(data.table)
pattern <- c("a", "b", "b")
len_pattern <- length(pattern)
df %>% arrange(id, time) %>% group_by(id) %>%
  # same logic as before, but use Reduce to combine the multiple lead conditions
  mutate(ab = Reduce("&", Map("==", shift(name, n = 0:(len_pattern - 1), type = "lead"), pattern)),
         g = cumsum(ab)) %>%
  # use Reduce with "|" to keep all rows of a matched sequence (same length as the pattern)
  filter(Reduce("|", shift(ab, n = 0:(len_pattern - 1), type = "lag"))) %>%
  # make unique names
  group_by(g, add = TRUE) %>% mutate(name = paste(name, 1:n(), sep = "_")) %>%
  # pivot the table to wide format
  select(-ab) %>% spread(name, time)
#Source: local data frame [1 x 5]
#Groups: id, g [1]
# id g a_1 b_2 b_3
#* <int> <int> <int> <int> <int>
#1 1 1 3 10 12
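As an aside, shift() with a vector n returns a list of led copies of the input, which is exactly what the Map/Reduce pattern above consumes; a tiny standalone illustration:
library(data.table)
shift(c("a", "b", "b", "a"), n = 0:2, type = "lead")
# [[1]] "a" "b" "b" "a"   (n = 0: the vector itself)
# [[2]] "b" "b" "a" NA    (n = 1: led by one position)
# [[3]] "b" "a" NA  NA    (n = 2: led by two positions)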

It's somewhat convoluted, but how about a rolling join?
library(data.table)
setorder(setDT(df), id, time)
df[name == "b"][
  # for each id, take the last time of every run of consecutive "a" rows
  df[, if (name[1] == "a") .(time = last(time)), by = .(id, name, r = rleid(id, name))],
  on = .(id, time),
  roll = -Inf,   # roll each "a" time forward to the next "b" time
  nomatch = 0,
  .(a = i.time, b = x.time)
]
a b
1: 3 10
2: 5 7
3: 9 11
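If rolling joins are unfamiliar: roll = -Inf means each unmatched time in i gets matched to the next available time in x (within id). A minimal sketch with toy tables (the names here are hypothetical, not from the question):
library(data.table)
x <- data.table(id = 1L, time = c(10L, 12L), what = c("b1", "b2"))
i <- data.table(id = 1L, time = 3L)
# time 3 has no exact match in x, so it rolls forward to time 10
x[i, on = .(id, time), roll = -Inf]
#    id time what
# 1:  1    3   b1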

You can use an ifelse in filter with lag and lead, and then tidyr::spread to reshape to wide:
library(tidyverse)
df %>% arrange(id, time) %>% group_by(id) %>%
  filter(ifelse(name == 'b',            # if name is b...
                lag(name) == 'a',       # ...is the previous name a?
                lead(name) == 'b')) %>% # else if name is not b, is the next name b?
  ungroup() %>% mutate(i = rep(seq(n() / 2), each = 2)) %>% # create indices to spread by
  spread(name, time) %>% select(a, b) # spread to wide and clean up
## # A tibble: 3 × 2
## a b
## * <int> <int>
## 1 3 10
## 2 5 7
## 3 9 11
Based on the comment below, here's a version that uses gregexpr to find the first index of a matched pattern, which, while more complicated, scales more easily to longer patterns like "aabb":
df %>% group_by(pattern = 'aabb', id) %>% # add pattern as column, group
arrange(time) %>%
# collapse each group to a string for name and a list column for time
summarise(name = paste(name, collapse = ''), time = list(time)) %>%
# group and add list-column of start indices for each match
rowwise() %>% mutate(i = gregexpr(pattern, name)) %>%
unnest(i, .drop = FALSE) %>% # expand, keeping other list columns
filter(i != -1) %>% # chop out rows with no match from gregexpr
rowwise() %>% # regroup
# subset with sequence from index through pattern length
mutate(time = list(time[i + 0:(nchar(pattern) - 1)]),
pattern = strsplit(pattern, '')) %>% # expand pattern to list column
rownames_to_column('match') %>% # add rownames as match index column
unnest(pattern, time) %>% # expand matches in parallel
# paste sequence onto each letter (important for spreading if repeated letters)
group_by(match) %>% mutate(pattern = paste0(pattern, seq(n()))) %>%
spread(pattern, time) # spread to wide form
## Source: local data frame [1 x 8]
## Groups: match [1]
##
## match id name i a1 a2 b3 b4
## * <chr> <int> <chr> <int> <int> <int> <int> <int>
## 1 1 1 aabba 1 0 3 10 12
Note that if the pattern doesn't happen to be in alphabetical order, the resulting columns will not be ordered by their indices. Since indices are preserved, though, you can sort with something like select(1:4, parse_number(names(.)[-1:-4]) + 4).
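At its core this answer is just string matching on the collapsed name column; a standalone look at what gregexpr returns (first match position, or -1 for no match):
x <- paste(c("a", "a", "b", "b", "a"), collapse = "")  # "aabba"
gregexpr("aabb", x)[[1]]   # 1  (plus a match.length attribute of 4)
gregexpr("abab", x)[[1]]   # -1 (no match)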

Related

Sum column based on variable name in other column that contains x following similar letters

I have a table that is somewhat like this:
var          RC
distance50    2
distance20    4
precMax       5
precMin       1
total_prec    8
travelTime    5
travelTime    2
I want to sum all similar type variables, resulting in something like this:
var    sum
dist     6
prec    14
trav     7
Using 4 letters is enough to separate the different types. I have tried and tried but not figured it out. Could anyone please assist? I generally try to work with dplyr, so that would be preferred. The datasets are small (n<100) so speed is not required.
Base R solution (note that total_prec must count towards prec to match the expected output, so we strip the total_ prefix instead of dropping that row):
aggregate(
  RC ~ var,
  data = transform(
    df,
    var = gsub("^(\\w+)([A-Z0-9]\\w+$)", "\\1", sub("^total_", "", var))
  ),
  FUN = sum
)
Data:
df <- structure(list(var = c("distance50", "distance20", "precMax",
"precMin", "total_prec", "travelTime", "travelTime"), RC = c(2L,
4L, 5L, 1L, 8L, 5L, 2L)), class = "data.frame", row.names = c(NA,
-7L))
library(dplyr)
library(tidyr)
df %>%
separate(var, c("var", "b"), sep = "[_A-Z0-9]", extra = "merge") %>%
group_by(var = ifelse(b %in% var, b, var)) %>%
summarize(RC = sum(RC), .groups = "drop")
separate var into two columns by splitting on underscores (_), capital letters (A-Z) or digits (0-9).
In the group_by statement, if the value in the second column also appears in the first column (as "prec" does for "total_prec"), group by it instead of the first column.
Lastly, sum RC by group.
Output
var RC
<chr> <int>
1 distance 6
2 prec 14
3 travel 7
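If only the lowercase prefix matters, a shorter sketch gets there too (assuming "total_" is the only such prefix in the real data; df as in the Data block above):
library(dplyr)
library(stringr)
df %>%
  group_by(var = str_extract(str_remove(var, "^total_"), "^[a-z]+")) %>%
  summarise(RC = sum(RC))
# var        RC
# distance    6
# prec       14
# travel      7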
library(dplyr); library(stringr)
tibble(
  var = c('dista', 'distb', 'travelTime'),
  rc = 2:4) %>%
print() %>%
# A tibble: 3 x 2
# var rc
# <chr> <int>
#1 dista 2
#2 distb 3
#3 travelTime 4
group_by(var=str_sub(var, end=4)) %>%
print() %>%
# A tibble: 3 x 2
# Groups: var [2]
# var rc
# <chr> <int>
#1 dist 2
#2 dist 3
#3 trav 4
summarise(sum=sum(rc))
# A tibble: 2 x 2
# var sum
# <chr> <int>
#1 dist 5
#2 trav 4

Repeating the value in a df column by a specified amount, and concatenating integer count to repeated values

I would like to use R to create an expanded_df from a template_df, where each row is repeated the number of times specified in a separate column of the template_df, and an integer count is appended to the ID column of the expanded_df, recording that row's repeat number.
I would like this count to start at 600 for each ID class.
E.g., template_df:
Initial_ID Count
a 2
b 3
c 1
d 4
expanded_df:
Expanded_ID
a-600
a-601
b-600
b-601
b-602
c-600
d-600
d-601
d-602
d-603
Anyone have any ideas? Thanks!
We may use uncount to expand the rows, then get the rowid of 'Initial_ID' to paste, after adding 599:
library(dplyr)
library(tidyr)
library(data.table)
library(stringr)
template_df %>%
  uncount(Count) %>%
  transmute(Expanded_ID = str_c(Initial_ID, 599 + rowid(Initial_ID), sep = '-'))
-output
Expanded_ID
1 a-600
2 a-601
3 b-600
4 b-601
5 b-602
6 c-600
7 d-600
8 d-601
9 d-602
10 d-603
Or using base R with rep and paste
data.frame(Expanded_ID = with(template_df, paste0(rep(Initial_ID, Count), "-",
                                                  599 + sequence(Count))))
-output
Expanded_ID
1 a-600
2 a-601
3 b-600
4 b-601
5 b-602
6 c-600
7 d-600
8 d-601
9 d-602
10 d-603
data
template_df <- structure(list(Initial_ID = c("a", "b", "c", "d"), Count = c(2L,
3L, 1L, 4L)), class = "data.frame", row.names = c(NA, -4L))
An alternative dplyr solution:
library(dplyr)
template_df %>%
  group_by(Initial_ID) %>%
  slice(rep(1:n(), each = Count)) %>%
  mutate(row = 600 + row_number() - 1) %>%
  ungroup() %>%
  transmute(Expanded_ID = paste(Initial_ID, row, sep = "-"))
Expanded_ID
<chr>
1 a-600
2 a-601
3 b-600
4 b-601
5 b-602
6 c-600
7 d-600
8 d-601
9 d-602
10 d-603
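For completeness, a data.table sketch of the same expand-then-number idea (using template_df from the data block above):
library(data.table)
dt <- as.data.table(template_df)
# expand rows by Count, then number within each Initial_ID starting at 600
dt[rep(seq_len(.N), Count),
   .(Expanded_ID = paste(Initial_ID, 599 + seq_len(.N), sep = "-")),
   by = Initial_ID][, .(Expanded_ID)]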

Extract all row.names in a data.frame that match a value in another data.frame

I have a data.frame with 150 columns. For each column, I want to extract the maximum and minimum values (which may be repeated) and the row names of each maximum/minimum value. I have extracted the min and max values into another data.frame, but don't know how to match them to the row names.
I have found functions that are very close for this, like for minimum values:
head(cars)
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
sapply(cars,which.min)
speed dist
1 1
Here, it only gives the first index for minimum speed.
And I've tried with loops like:
for (i in (colnames(cars))){
print(min(cars[[i]]))
}
[1] 4
[1] 2
But that just gives me the minimum values, and not if they are repeated and the rowname of each repeated value.
I want something like:
min.value column rowname freq.times
4 speed 1,2 2
2 dist 1 1
Thanks, and sorry if I have spelling mistakes; I'm not a native speaker.
One option is to use tidyverse. It was a little unclear whether you want min and max in the same dataframe, so I included both. First, I create an index column with row numbers. Then, I pivot to long format and determine which values are minima and maxima (using case_when), dropping the rows that are neither (i.e., NA in category). Finally, I use summarise to collapse the row names into a single character string and to count how often a given minimum or maximum value occurs.
library(tidyverse)
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "value") %>%
  group_by(column) %>%
  mutate(category = case_when(value == min(value) ~ "min",
                              value == max(value) ~ "max")) %>%
  drop_na(category) %>%
  group_by(column, value, category) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2:3, 1, 4, 5)
Output
# A tibble: 4 × 5
# Groups: column, value [4]
value category column rowname freq.times
<dbl> <chr> <chr> <chr> <int>
1 2 min dist 1 1
2 120 max dist 49 1
3 4 min speed 1, 2 2
4 25 max speed 50 1
However, if you want to produce the dataframes separately, you could adjust the code like this. Here, I don't use category and instead use filter to drop all rows that are not the minimum for a group/column. Then, we can summarise as we did above. You can do the same thing for max as well.
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "min.value") %>%
  group_by(column) %>%
  filter(min.value == min(min.value)) %>%
  group_by(column, min.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)
Output
# A tibble: 2 × 4
# Groups: column [2]
min.value column rowname freq.times
<dbl> <chr> <chr> <int>
1 2 dist 1 1
2 4 speed 1, 2 2
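The max counterpart is the same pipeline with max() swapped in:
cars %>%
  mutate(rowname = row_number()) %>%
  pivot_longer(-rowname, names_to = "column", values_to = "max.value") %>%
  group_by(column) %>%
  filter(max.value == max(max.value)) %>%
  group_by(column, max.value) %>%
  summarise(rowname = toString(rowname), freq.times = n()) %>%
  select(2, 1, 3, 4)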
Here is another tidyverse approach:
which.min(.) gives only the first index, whereas which(. == min(.)) gives all indices that satisfy the condition.
Analogously, we can get the frequency with length(which(. == min(.))).
We summarise across all columns to build min.value, rowname and freq.times, then pivot to bring the column names into position.
library(tidyverse)
cars %>%
  summarise(across(dplyr::everything(),
                   list(min.value = min,
                        rowname = ~ list(which(. == min(.))),
                        freq.times = ~ length(which(. == min(.)))))) %>%
  pivot_longer(
    cols = contains("_"),
    names_to = "key",
    values_to = "val",
    values_transform = list(val = as.character)
  ) %>%
  separate(key, c("column", "name"), sep = "_") %>%
  pivot_wider(
    names_from = name,
    values_from = val
  ) %>%
  mutate(rowname = str_replace(rowname, '\\:', '\\,'))
column min.value rowname freq.times
<chr> <chr> <chr> <chr>
1 speed 4 1,2 2
2 dist 2 1 1
min.value <- sapply(cars, min)
columns <- names(min.value)
row.values <- sapply(columns, \(x) which(cars[[x]] == min.value[which(names(min.value) == x)]))
freq.times <- sapply(row.values, length)
row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
names(min.value) <- names(row.values) <- names(freq.times) <- NULL
data.frame(min.value = min.value,
columns = columns,
row.values = row.values,
freq.times = freq.times)
min.value columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
Here it is wrapped in function, so that you can use it across whatever data frame and function you need:
create_table <- function(df, FUN) {
  values <- sapply(df, FUN)
  columns <- names(values)
  row.values <- sapply(columns, \(x) which(df[[x]] == values[which(names(values) == x)]))
  freq.times <- sapply(row.values, length)
  row.values <- sapply(row.values, \(x) paste(x, collapse = ","))
  names(values) <- names(row.values) <- names(freq.times) <- NULL
  data.frame(values = values,
             columns = columns,
             row.values = row.values,
             freq.times = freq.times)
}
create_table(cars, min)
values columns row.values freq.times
1 4 speed 1,2 2
2 2 dist 1 1
create_table(cars, max)
values columns row.values freq.times
1 25 speed 50 1
2 120 dist 49 1
You can use which to obtain the positions, and sapply should work. Since you need multiple summary statistics for each column, you just have to wrap them up in a list. Something like this:
as.data.frame(sapply(cars, \(x) {
  extrema <- range(x)
  min.row <- which(x == extrema[[1L]])
  max.row <- which(x == extrema[[2L]])
  list(
    min.value = extrema[[1L]], max.value = extrema[[2L]],
    min.row = min.row, max.row = max.row,
    freq.min = length(min.row), freq.max = length(max.row)
  )
}))
Output
speed dist
min.value 4 2
max.value 25 120
min.row 1, 2 1
max.row 50 49
freq.min 2 1
freq.max 1 1

remove character for all column names in a data frame

I have a large data frame. I'm trying to remove the character "v" from its variable names.
df <- tibble(q_ve5 = 1:2,
q_f_1v = 3:4,
q_vf_2 = 3:4,
q_e6 = 5:6,
q_ev8 = 5:6)
I tried this, but it seems my regular expression pattern is not correct:
df %>%
rename_all(~ str_remove(., "\\v\\d+$"))
My desired col names:
q_e5 q_f_1 q_f_2 q_e6 q_e8
If we need to remove only 'v', then the one or more digits (\\d+) at the end ($) are not needed, as the expected output also removes 'v' from the first column 'q_ve5':
library(dplyr)
library(stringr)
df %>%
rename_with(~ str_remove(., "v"), everything())
-output
# A tibble: 2 × 5
q_e5 q_f_1 q_f_2 q_e6 q_e8
<int> <int> <int> <int> <int>
1 1 3 3 5 5
2 2 4 4 6 6
Or without any packages
names(df) <- sub("v", "", names(df))
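Since "v" here is a literal character rather than a regular expression, fixed = TRUE makes that explicit (and skips the regex engine):
names(df) <- sub("v", "", names(df), fixed = TRUE)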

How to check multiple values using if condition [duplicate]

This question already has answers here:
Idiom for ifelse-style recoding for multiple categories
(13 answers)
Closed 4 years ago.
I have a dataframe like the one below:
Records:
ID Remarks Value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14
Using the above dataframe, I want to add a new column Status to it.
If Remarks is ABC, AAB or ABB, then Status should be TRUE; for XYZ and ZZX it should be FALSE.
I am using the method below, but it didn't work:
Records$Status<-ifelse(Records$Remarks %in% ("ABC","AAB","ABB"),"TRUE",
ifelse(Records$Remarks %in%
("XYZ","ZZX"),"FALSE"))
And, based on the Status, I want to derive the following output:
ID TRUE FALSE Sum
1 2 1 37
2 1 1 26
Records$Status <- ifelse(Records$Remarks %in% c("ABC", "AAB", "ABB"), TRUE,
                         ifelse(Records$Remarks %in% c("XYZ", "ZZX"), FALSE, NA))
You need to enclose your lists of strings with c(), and add an "else" condition for the second ifelse (but see Roman's answer below for a better way of doing this with case_when). Also note that I changed "TRUE" and "FALSE" (character class) into TRUE and FALSE (logical class).
For the summary (using dplyr):
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Status), falses=sum(!Status), sum=sum(Value))
# A tibble: 2 x 4
ID trues falses sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Of course, if you don't really need the intermediate Status column but just want the summary table, you can skip the first step altogether:
Records %>% group_by(ID) %>%
dplyr::summarise(trues=sum(Remarks %in% c("ABC","AAB","ABB")),
falses=sum(Remarks %in% c("XYZ","ZZX")),
sum=sum(Value))
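For the sample data this one-step version returns the same summary:
# A tibble: 2 x 4
     ID trues falses   sum
  <int> <int>  <int> <int>
1     1     2      1    37
2     2     1      1    26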
Since it makes sense to use dplyr for your second question (see #iod's answer) it is also a good opportunity to use the package's very straightforward case_when() function for the first part.
Records %>%
  mutate(Status = case_when(Remarks %in% c("ABC", "AAB", "ABB") ~ TRUE,
                            Remarks %in% c("XYZ", "ZZX") ~ FALSE,
                            TRUE ~ NA))
ID Remarks Value Status
1 1 ABC 10 TRUE
2 1 AAB 12 TRUE
3 1 ZZX 15 FALSE
4 2 XYZ 12 FALSE
5 2 ABB 14 TRUE
This approach will scale to a large number of remarks.
Load the data and prepare a matching data frame
The second data frame maps each remark to its TRUE or FALSE value.
library(readr)
library(dplyr)
library(tidyr)
dtf <- read_table("id remarks value
1 ABC 10
1 AAB 12
1 ZZX 15
2 XYZ 12
2 ABB 14")
truefalse <- tibble(remarks = c("ABC", "AAB", "ABB", "ZZX", "XYZ"),
                    tf = c(TRUE, TRUE, TRUE, FALSE, FALSE))
Group by id and summarise
This is the format as asked in the question
dtf %>%
  left_join(truefalse, by = "remarks") %>%
  group_by(id) %>%
  summarise(true = sum(tf),
            false = sum(!tf),
            value = sum(value))
# A tibble: 2 x 4
id true false value
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Alternative proposal: group by id, tf and summarise
This option retains more details on the spread of value along the grouping variables id and tf.
dtf %>%
  left_join(truefalse, by = "remarks") %>%
  group_by(id, tf) %>%
  summarise(n = n(),
            value = sum(value))
# A tibble: 4 x 4
# Groups: id [?]
id tf n value
<int> <lgl> <int> <int>
1 1 FALSE 1 15
2 1 TRUE 2 22
3 2 FALSE 1 12
4 2 TRUE 1 14
In most cases, life is easier and lines are shorter without ifelse:
# short version
df$Status <- df$Remarks %in% c("ABC","AAB","ABB")
This version is OK for most purposes, but it has a shortcoming: Status will be FALSE if Remarks is NA or, say, "garbage", whereas one might want NA in those cases and FALSE only if Remarks %in% c("XYZ", "ZZX"). One can add the two conditions (which yields 0, 1 or 2) and use that to index into c(FALSE, NA, TRUE):
df$Status <- with(df, c(FALSE, NA, TRUE)[
  (Remarks %in% c("ABC", "AAB", "ABB")) + (!Remarks %in% c("XYZ", "ZZX")) + 1])
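A quick sanity check of the 0/1/2 coding ("QQQ" here is a hypothetical unknown remark, added to show the NA case):
r <- c("ABC", "XYZ", "QQQ")
score <- (r %in% c("ABC", "AAB", "ABB")) + (!r %in% c("XYZ", "ZZX"))
score                            # 2 0 1
c(FALSE, NA, TRUE)[score + 1]    # TRUE FALSE NA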
And the summary table with base R:
aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
Umm... perhaps some formatting would be useful:
t1 <- aggregate(df[,-(1:2)], df["ID"], function(x) if(is.numeric(x)) sum(x) else table(x))
t1 <- t1[, c(1,3,2)]
colnames(t1) <- c("ID", "", "Sum")
t1
# ID FALSE TRUE Sum
# 1 1 1 2 37
# 2 2 1 1 26
This one returns the correct result only if there are just the two mentioned groups ("ABC", "AAB", "ABB" vs "XYZ", "ZZX", ...). To me, #iod's solution is more R-like, but I've tried to avoid ifelse and do it another way:
Code:
library(tidyverse)
dt %>%
  group_by(ID, Status = grepl("^A[AB][CB]$", Remarks)) %>%
  summarise(N = n(), Sum = sum(Value)) %>%
  spread(Status, N) %>%
  summarize_all(sum, na.rm = TRUE) %>% # data still grouped by ID
  select("ID", "TRUE", "FALSE", "Sum")
# A tibble: 2 x 4
ID `TRUE` `FALSE` Sum
<int> <int> <int> <int>
1 1 2 1 37
2 2 1 1 26
Data:
dt <- structure(
list(ID = c(1L, 1L, 1L, 2L, 2L),
Remarks = c("ABC", "AAB", "ZZX", "XYZ", "ABB"),
Value = c(10L, 12L, 15L, 12L, 14L)),
.Names = c("ID", "Remarks", "Value"), class = "data.frame", row.names = c(NA, -5L)
)
