I need to detect a sequence by group in a data.frame and compute a new variable.
Consider the following data.frame:
df1 <- data.frame(ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
product = c("A", "B", "C", "C", "A,B", "A,B,C", "D", "A", "B", "A", "A", "A,B,C", "D", "D"),
stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D", "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"))
> df1
   ID seqs count product   stock
1   1    1     2       A       A
2   1    2     1       B     A,B
3   1    3     3       C   A,B,C
4   1    4     1       C   A,B,C
5   1    5     1     A,B   A,B,C
6   1    6     2   A,B,C   A,B,C
7   1    7     3       D A,B,C,D
8   2    1     1       A       A
9   2    2     2       B     A,B
10  2    3     1       A     A,B
11  3    1     3       A       A
12  3    2     1   A,B,C   A,B,C
13  3    3     4       D A,B,C,D
14  3    4     1       D A,B,C,D
I want to compute a measure for each ID whose consecutive rows follow this sequence:
- count == 1
- count > 1
- count == 1
In the example this is true for:
- rows 2, 3, 4 for `ID==1`
- rows 8, 9, 10 for `ID==2`
- rows 12, 13, 14 for `ID==3`
For these IDs and rows, I need to compute a measure called new. It takes the value of the product in the last row of the sequence if that product also appears in the product of the second row of the sequence and is NOT in the stock of the first row of the sequence. Otherwise it is left blank.
The desired outcome is shown below:
> output
  ID seq1 seq2 seq3 new
1  1    2    3    4   C
2  2    1    2    3
3  3    2    3    4   D
Note:
In the sequence detected for ID no new products are added to the stock.
In the original data, many IDs do not have any qualifying sequence.
Some IDs have multiple qualifying sequences. All of them should be recorded.
count is always 1 or greater.
The original data holds millions of IDs with up to 1500 sequences.
How would you write an efficient piece of code to get this output?
Here's a data.table option:
library(data.table)
char_cols <- c("product", "stock")
setDT(df1)[,
           (char_cols) := lapply(.SD, as.character),
           .SDcols = char_cols] # in case they're factors
df1[, c1 := (count == 1) &
            (shift(count) > 1) &
            (shift(count, 2L) == 1),
    by = ID] # condition1
df1[, pat := paste0("(", gsub(",", "|", product), ")")] # pattern
df1[, c2 := mapply(grepl, pat, shift(product)) &
            !mapply(grepl, pat, shift(stock, 2L)),
    by = ID] # condition2
df1[(c1), new := ifelse(c2, product, "")] # create new column
df1[, paste0("seq", 1:3) := shift(seqs, 2:0)] # create seq columns
df1[(c1), .(ID, seq1, seq2, seq3, new)] # result
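For checking the result on a small sample, the same window test can be sketched in base R. This is only an illustration and not tuned for millions of IDs. The helper names `items` and `find_windows` are hypothetical, and the sketch assumes that, when the last row holds several products, `new` keeps each one that appears in the middle row's product and not in the first row's stock.

```r
df1 <- data.frame(
  ID = c(1,1,1,1,1,1,1,2,2,2,3,3,3,3),
  seqs = c(1,2,3,4,5,6,7,1,2,3,1,2,3,4),
  count = c(2,1,3,1,1,2,3,1,2,1,3,1,4,1),
  product = c("A", "B", "C", "C", "A,B", "A,B,C", "D",
              "A", "B", "A", "A", "A,B,C", "D", "D"),
  stock = c("A", "A,B", "A,B,C", "A,B,C", "A,B,C", "A,B,C", "A,B,C,D",
            "A", "A,B", "A,B", "A", "A,B,C", "A,B,C,D", "A,B,C,D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a comma-separated string into its items
items <- function(s) strsplit(s, ",", fixed = TRUE)[[1]]

# Hypothetical helper: find all 1, >1, 1 windows within one ID
find_windows <- function(d) {
  n <- nrow(d)
  if (n < 3) return(NULL)
  # i indexes the first row of every window of three consecutive rows
  i <- which(d$count[1:(n - 2)] == 1 &
             d$count[2:(n - 1)] > 1 &
             d$count[3:n] == 1)
  if (length(i) == 0) return(NULL)
  new <- vapply(i, function(k) {
    last <- items(d$product[k + 2])
    keep <- last[last %in% items(d$product[k + 1]) &
                 !(last %in% items(d$stock[k]))]
    paste(keep, collapse = ",")  # "" when nothing qualifies
  }, character(1))
  data.frame(ID = d$ID[i], seq1 = d$seqs[i], seq2 = d$seqs[i + 1],
             seq3 = d$seqs[i + 2], new = new, stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(split(df1, df1$ID), find_windows))
print(res)
```

Because every window is tested independently, an ID with several qualifying sequences contributes one row per sequence.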
Here's another approach using tidyverse; however, I think lag and lead have made this solution a bit slow. I included comments within the code to make it more legible.
I spent enough time on it that I am posting it anyway.
library(tidyverse)
df1 %>% group_by(ID) %>%
# this finds the row with count > 1 which ...
#... the counts of the row before and the one of after it equals to 1
mutate(test = (count > 1 & c(F, lag(count==1)[-1]) & c(lead(count==1)[-n()],F))) %>%
# this makes a column which has value of True for each chunk...
#that meets desired condition to later filter based on it
mutate(test2 = test | c(F,lag(test)[-1]) | c(lead(test)[-n()], F)) %>%
filter(test2) %>% ungroup() %>%
# group each three occurrences in case of having multiple ones within each ID
group_by(G=trunc(3:(n()+2)/3)) %>% group_by(ID,G) %>%
# creating new column with string extracting techniques ...
#... (assuming those columns are characters)
mutate(new=
str_remove_all(
as.character(regmatches(stock[2], gregexpr(product[3], stock[2]))),
stock[1])) %>%
# selecting desired columns and adding times for long to wide conversion
select(ID,G,seqs,new) %>% mutate(times = 1:n()) %>% ungroup() %>%
# long to wide conversion using tidyr (part of tidyverse)
gather(key, value, -ID, -G, -new, -times) %>%
unite(col, key, times) %>% spread(col, value) %>%
# making the desired order of columns
select(-G,-new,new) %>% as.data.frame()
# ID seqs_1 seqs_2 seqs_3 new
# 1 1 2 3 4 C
# 2 2 1 2 3
# 3 3 2 3 4 D
I had an earlier post about how to delete an ID if any of the rows within that ID contain certain strings (e.g., A or D) in the following data frame, which is in a longitudinal format. These are the R code examples that I received in the earlier post (from r2evans, akrun, and ThomasIsCoding, in order):
d %>% group_by(id) %>% filter(!any(dx %in% c("A", "D"))) %>%
ungroup()
filter(d, !id %in% id[dx %in% c("A", "D")])
subset(d, !ave(dx %in% c("A", "D"), id, FUN = any))
While these all worked well, I realized that I need to remove more than 600 strings (e.g., A, D, E2, F112, G203, etc.), so I created a CSV file listing these strings, without a column name. (1) Is making such a list the right approach? (2) How should I modify the R code above to use the file of strings? I reviewed other posts and Google search results but could not figure out what to do in my case. I would appreciate any suggestions!
Data frame:
id time dx
1 1 C
1 2 B
2 1 A
2 2 B
3 1 D
4 1 G203
4 2 E1
The results I want:
id time dx
1 1 C
1 2 B
UPDATE: Tarjae's answer below resolved the issue. The following R code is the solution.
my_list <- read.csv("my_list.csv")
columnname
A
D
E2
F112
G203
d %>% group_by(id) %>% filter(!any(dx%in%my_list$columnname)) %>% ungroup()
filter(d, !id %in% id[dx %in% my_list$columnname])
subset(d, !ave(dx %in% my_list$columnname, id, FUN = any))
This is a good strategy:
Put your values in a vector or list (here my_list), then
filter the dx column using the %in% operator, negated with !:
library(dplyr)
my_list <- c("A", "D")
df %>%
filter(!dx %in% my_list)
id time dx
1 1 1 C
2 1 2 B
3 2 3 B
4 4 1 G203
5 4 1 E1
Expanding the list of values: my_list <- c("A", "D", "G203", "E1")
gives with the same code:
library(dplyr)
df %>%
filter(!dx %in% my_list)
id time dx
1 1 1 C
2 1 2 B
3 2 3 B
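If the goal is to drop every ID that has at least one listed code, reading the list from a file could look roughly like the sketch below. It assumes the CSV has a single column and no header line, which is why `header = FALSE` is used. The file is written to a temporary path only to keep the example self-contained, and the names `codes` and `bad_ids` are hypothetical.

```r
d <- data.frame(
  id   = c(1, 1, 2, 2, 3, 4, 4),
  time = c(1, 2, 1, 2, 1, 1, 2),
  dx   = c("C", "B", "A", "B", "D", "G203", "E1"),
  stringsAsFactors = FALSE
)

# Stand-in for the real file: one code per line, no header
csv_path <- tempfile(fileext = ".csv")
writeLines(c("A", "D", "E2", "F112", "G203"), csv_path)

# header = FALSE keeps the first code from being read as a column name
codes <- read.csv(csv_path, header = FALSE, stringsAsFactors = FALSE)[[1]]

# IDs with at least one listed code, then every row of every other ID
bad_ids <- unique(d$id[d$dx %in% codes])
kept <- d[!d$id %in% bad_ids, ]
print(kept)
```

With a plain character vector such as `codes`, the earlier one-liners work unchanged, e.g. `filter(d, !id %in% id[dx %in% codes])`.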
I have the following dataset:
a b
1 a
1 a
1 a
1 none
2 none
2 none
2 b
3 a
3 c
3 c
3 d
4 a
I want to get the most frequent value in b for each a, and the second most frequent value of b for each a. If two values in b have the same frequency, I am indifferent about which of the two is considered the "first" or the "second".
In this case the expected output would be:
d2:
a first second
1 a none
2 none b
3 c a(or d, doesn't matter)
4 a NA
As you can see, a=4 has just one value in b, so I expect an NA in the output column "second", as there is no second most frequent value.
data:
a <- c(1,1,1,1,2,2,2,3,3,3,3,4)
b<- c("a","a", "a", "none", "none", "none", "b", "a", "c" , "c", "d","a")
d <- data.frame(a,b)
What I have tried so far is the following:
d1 <- d %>% group_by(a) %>% summarize ( first =names(which.max(table(b))) , second= names(which.max(table(b)[-which.max(table(b))] )))
But it doesn't work properly. Any idea on how to do this?
You can count the number of rows for each combination of a and b and then, for each value of a, select the 1st and 2nd values in summarise.
library(dplyr)
d %>%
count(a, b, sort = TRUE) %>%
group_by(a) %>%
summarise(first = b[1],second = b[2])
# A tibble: 4 x 3
# a first second
# <dbl> <chr> <chr>
#1 1 a none
#2 2 none b
#3 3 c a
#4 4 a NA
Here is one option with data.table
library(data.table)
setDT(d)[, .N, .(a, b)][order(-N), .(first = first(b), second = b[2]), keyby = a] # order(-N): most frequent first
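The same idea (count, sort by decreasing frequency, take the first two) can also be sketched in base R without any package. The helper name `top2` is hypothetical. Indexing with `[1:2]` pads with NA when a group has only one distinct value.

```r
a <- c(1,1,1,1,2,2,2,3,3,3,3,4)
b <- c("a","a","a","none","none","none","b","a","c","c","d","a")
d <- data.frame(a, b, stringsAsFactors = FALSE)

# Hypothetical helper: names of the two most frequent values of x
top2 <- function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  names(tab)[1:2]  # second element is NA if only one value exists
}

# One column per group from sapply, transposed to one row per group
res <- t(sapply(split(d$b, d$a), top2))
colnames(res) <- c("first", "second")
print(res)
```

Ties are broken arbitrarily here, which matches the stated indifference about which tied value comes first.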
I'm trying to group my data by a number of variables before providing a summary table showing the sum of the values within each group.
I have created the below data as an example.
Value <- c(21000,10000,50000,60000,2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)
Matrix <- cbind(Value, Group, Type)
I want to group the above data first by the 'Group' variable and then by the 'Type' variable, and then sum the values to get an output similar to the attached example I worked out in Excel. I would usually use the aggregate function if I just wanted to group by one variable, but I am not sure whether it can be used with multiple variables.
Further to this I then need to provide an identical table but with the values being calculated with a "count" function rather than a "sum".
Many thanks in advance!
You can supply multiple groupings to aggregate:
df <- data.frame(Value, Group, Type)
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), sum)
Type Group x
1 1 A 26500
2 2 A 20000
3 1 B 50000
4 2 B 95000
5 1 C 46000
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), length)
Type Group x
1 1 A 2
2 2 A 2
3 1 B 1
4 2 B 2
5 1 C 3
There are other packages which may be easier to use such as data.table:
> library(data.table)
> dt <- as.data.table(df)
> dt[, .(Count = length(Value), Sum = sum(Value)),
+    by = .(Type, Group)]
Type Group Count Sum
1: 1 A 2 26500
2: 2 A 2 20000
3: 1 B 1 50000
4: 2 B 2 95000
5: 1 C 3 46000
dplyr is another option, and @waskuf has a good example of that.
Using dplyr (note that "Matrix" needs to be a data.frame):
library(dplyr)
Matrix <- data.frame(Value, Group, Type)
Matrix %>% group_by(Group, Type) %>% summarise(Sum = sum(Value),
Count = n()) %>% ungroup()
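To see why the cbind() version needs converting, and to get a Group by Type layout closer to a spreadsheet pivot, a base R sketch along these lines may help. Using xtabs() and table() here is a suggestion of mine rather than something taken from the answers above. Combinations with no rows show up as 0.

```r
Value <- c(21000, 10000, 50000, 60000, 2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type  <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)

# cbind() builds a matrix, and a matrix holds a single type,
# so mixing numbers with letters turns everything into character
m <- cbind(Value, Group, Type)
print(is.character(m))

# A data.frame keeps Value numeric
df <- data.frame(Value, Group, Type)

# Cross-tabulated sums: rows are Group, columns are Type
sums <- xtabs(Value ~ Group + Type, data = df)
print(sums)

# Matching table of counts
counts <- table(Group = df$Group, Type = df$Type)
print(counts)
```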
I didn't find a solution for this common grouping problem in R:
This is my original dataset
ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C
This should be my grouped resulting dataset
State min(ID) max(ID)
A 1 2
B 3 5
A 6 8
C 9 10
So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.
Any ideas?
You could try:
library(dplyr)
df %>%
mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
group_by(rleid) %>%
summarise(State = first(State), min = min(ID), max = max(ID)) %>%
select(-rleid)
Or, as mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Borrowing data.table::rleid() and using summarise_all() to simplify:
df %>%
group_by(State, rleid = data.table::rleid(State)) %>%
summarise_all(list(min = min, max = max)) %>% # funs() is deprecated in current dplyr
select(-rleid)
Which gives:
## A tibble: 4 × 3
# State min max
# <fctr> <int> <int>
#1 A 1 2
#2 B 3 5
#3 A 6 8
#4 C 9 10
Here is a method that uses the rle function in base R for the data set you provided.
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=c(1, head(cumsum(temp$lengths) + 1, -1)),
max.ID=cumsum(temp$lengths))
which returns
newDF
State min.ID max.ID
1 A 1 2
2 B 3 5
3 A 6 8
4 C 9 10
Note that rle requires a character vector rather than a factor, so I use the as.is argument below.
As @cryo111 notes in the comments below, the data set might contain unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would first need to convert the timestamps to a date-time format with a function like as.POSIXct, sort with df <- df[order(df$ID),], and then use a slight alteration of the method above:
# get the run length encoding
temp <- rle(df$State)
# construct the data.frame
newDF <- data.frame(State=temp$values,
min.ID=df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
max.ID=df$ID[cumsum(temp$lengths)])
data
df <- read.table(header=TRUE, as.is=TRUE, text="ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
An idea with data.table:
require(data.table)
dt <- fread("ID State
1 A
2 A
3 B
4 B
5 B
6 A
7 A
8 A
9 C
10 C")
dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]
which gives:
rle State min max
1: 1 A 1 2
2: 2 B 3 5
3: 3 A 6 8
4: 4 C 9 10
The idea is to identify sequences with rleid and then get the min and max of ID by the tuple rle and State.
You can remove the rle column with
dt2[,rle:=NULL]
Chained:
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]
You can shorten the above code even more by using rleid inside by directly:
dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
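What rleid computes can be reproduced in a few lines of base R, which may make the grouping step easier to follow. This is a minimal sketch that assumes the data is already sorted by ID. The name `run_id` is hypothetical.

```r
df <- data.frame(
  ID = 1:10,
  State = c("A", "A", "B", "B", "B", "A", "A", "A", "C", "C"),
  stringsAsFactors = FALSE
)

# A new run starts at row 1 and wherever State differs from the previous row;
# the cumulative sum of those starts numbers the runs 1, 2, 3, ...
run_id <- cumsum(c(TRUE, head(df$State, -1) != tail(df$State, -1)))

res <- data.frame(
  State = df$State[!duplicated(run_id)],      # State at the start of each run
  min = as.vector(tapply(df$ID, run_id, min)),
  max = as.vector(tapply(df$ID, run_id, max)),
  stringsAsFactors = FALSE
)
print(res)
```

Grouping by `run_id` instead of by State alone is what keeps the two separate runs of A apart.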
Here is another attempt using rle and aggregate from base R:
rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values
# State ID.minID ID.maxID
# 1 A 1 2
# 2 B 3 5
# 3 A 6 8
# 4 C 9 10
data
df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B",
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame",
row.names = c(NA,
-10L))
I have, for example, a vector with 1000 obs and 3 levels (A, B, C). I want to count how many times level A occurs in every 5 rows and produce another vector of the count values, i.e. with 200 obs. Is anyone able to help? I've found how to count based on another variable but not based on the number of rows. Thank you!
df <- data.frame(test=factor(sample(c("A","B", "C" ),1000,replace=TRUE)))
head(df, 10)
test
1 A
2 A
3 B
4 C
5 B
6 A
7 C
8 B
9 C
10 C
Here are a couple of options you might find useful:
a) count all entries per 5 rows and return a list:
head(lapply(split(df$test, rep(1:200, each = 5)), table), 2)
# $`1` # <- result for rows 1:5
#
# A B C
# 1 0 4
#
# $`2` # <- result for rows 6:10
#
# A B C
# 3 0 2
b) count all entries per 5 rows and return a matrix:
head(t(sapply(split(df$test, rep(1:200, each = 5)), table)), 2)
# A B C
# 1 1 0 4
# 2 3 0 2
c) count number of As per 5 rows and return a list:
head(lapply(split(df$test == "A", rep(1:200, each = 5)), sum), 2)
# $`1`
# [1] 1
#
# $`2`
# [1] 3
d) count number of As per 5 rows and return a vector:
head(sapply(split(df$test == "A", rep(1:200, each = 5)), sum), 2)
#1 2
#1 3
Each of the results will be 200 entries long / have 200 rows.
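When the number of rows is an exact multiple of 5, another way to get the vector of counts is to fold the logical vector into a matrix with 5 rows, so that each column holds one block. This is a sketch under that assumption. The seed is arbitrary and only makes the example reproducible.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
test <- factor(sample(c("A", "B", "C"), 1000, replace = TRUE))

# matrix() fills column by column, so column j holds rows 5*(j-1)+1 .. 5*j
a_counts <- colSums(matrix(test == "A", nrow = 5))

print(length(a_counts))
print(head(a_counts))
```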
Here is a solution with dplyr and tidyr
library(dplyr)
library(tidyr)
df %>%
mutate(Set = (seq_along(test) - 1) %/% 5) %>%
group_by(Set, test) %>%
summarise(N = n()) %>%
spread(key = test, value = N, fill = 0)
We can use data.table
library(data.table)
setDT(df)[, .N , .(grp= gl(nrow(df), 5, nrow(df)), test)]
If you prefer dplyr, you could use
c1 <- df %>%
mutate(group = rep(paste0("G", seq(1, 200)), each = 5)) %>%
# count each level
count(group, test)
Note that this method doesn't include levels with no values for a certain group (i.e. there are no rows with a count of 0).
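If the zero counts are needed, one option is to cross-tabulate the block index against the factor with table(), which keeps every level of the factor in every block. The small vector below is made up for illustration, and `grp` is a hypothetical name.

```r
# Made-up data: 10 observations, i.e. two blocks of 5
test <- factor(c("A", "A", "B", "C", "B", "C", "C", "B", "C", "C"),
               levels = c("A", "B", "C"))

# Block index: rows 1-5 -> 1, rows 6-10 -> 2
grp <- (seq_along(test) - 1) %/% 5 + 1

# table() reports every block/level combination, including empty ones
counts <- table(group = grp, level = test)
print(counts)
```

Here block 2 contains no A, and the table still reports it with a count of 0.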