Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id = 1:25,
                col1 = c('a','a','a',
                         'b','b','b',
                         'c','c','c',
                         'd','d',
                         'b','b','b',
                         'e',
                         'c','c','c',
                         'e','e',
                         'a','a','a',
                         'e','e'))
From the snapshot above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id = 1:25,
                col1 = c('a','a','a',
                         'b','b','b',
                         'c','c','c',
                         'd','d',
                         'b','b','b',
                         'e',
                         'c','c','c',
                         'e','e',
                         'a','a','a',
                         'e','e'),
                grup = c(1,1,1,
                         1,1,1,
                         1,1,1,
                         1,1,
                         2,2,2,
                         1,
                         2,2,2,
                         2,2,
                         2,2,2,
                         3,3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
  # Keeps a tally of how many times each unique group was "found" before.
  group_counter = list()
  # Dealing with the corner case of the first row
  group_counter[[input_df[[ref_column]][1]]] = 1
  output_groups = rep(-1, nrow(input_df))
  output_groups[1] = 1
  # The for loop starts at the second row because I've already "dealt" with the
  # first row in the corner cases above
  for(i in 2:nrow(input_df)){
    prev_group = input_df[[ref_column]][i-1]
    this_group = input_df[[ref_column]][i]
    if(is.null(group_counter[[this_group]])){
      this_counter = 0
    } else {
      this_counter = group_counter[[this_group]]
    }
    if(prev_group != this_group){
      this_counter = this_counter + 1
    }
    output_groups[i] = this_counter
    group_counter[[this_group]] = this_counter
  }
  return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem, perhaps one that relies on vectorized functions?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row of a particular group, while other times the "block" will have several rows where col1 share the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".

You can use data.table::rleid() twice, like this:
library(data.table)
setDT(df)[,grp:=rleid(col1)][, grp:=rleid(grp), by=col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
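To see why two passes are needed, here's a minimal sketch of what each rleid() call contributes (using a shortened vector rather than the full df):
library(data.table)
x <- c('a','a','b','b','a')
rleid(x)
# [1] 1 1 2 2 3  -- one global id per consecutive run
The second rleid(), applied by=col1, then renumbers those run ids 1, 2, 3, ... within each value, which is exactly the desired label.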

Here is a possible base R solution:
change <- with(rle(df$col1), rep(seq_along(values), lengths))
cbind(df, grp = with(df, ave(
  change,
  col1,
  FUN = function(x)
    inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
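The inner expression may look cryptic; the idea is that rle() compresses x into runs, the run values are replaced by their run index, and inverse.rle() expands that back to full length. A small illustration (my own sketch, not part of the original answer):
r <- rle(c(1, 1, 5, 5, 5, 9))
r$values <- seq_along(r$values)  # relabel the runs 1, 2, 3, ...
inverse.rle(r)
# [1] 1 1 2 2 2 3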
Or another option using a combination of rle and dplyr, with a small run-length helper function:
rle_new <- function(x) {
  x <- rle(x)$lengths
  rep(seq_along(x), times = x)
}
library(dplyr)
df %>%
  mutate(grp = rle_new(col1)) %>%
  group_by(col1) %>%
  mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3
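As a side note, if you're on dplyr 1.1.0 or later, consecutive_id() does what the rle_new() helper does, so the pipeline can be written without it (a sketch, assuming a recent dplyr):
df %>%
  mutate(grp = consecutive_id(col1)) %>%
  group_by(col1) %>%
  mutate(grp = consecutive_id(grp)) %>%
  ungroup()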

Related

Strsplit on a column of a data frame [duplicate]

I have a data.frame where one of the variables is a vector (or a list), like this:
MyColumn <- c("A, B,C", "D,E", "F","G")
MyDF <- data.frame(group_id=1:4, val=11:14, cat=MyColumn)
# group_id val cat
# 1 1 11 A, B,C
# 2 2 12 D,E
# 3 3 13 F
# 4 4 14 G
I'd like to have a new data frame with as many rows as the vector
FlatColumn <- unlist(strsplit(MyColumn,split=","))
which looks like this:
MyNewDF <- data.frame(group_id=c(rep(1,3),rep(2,2),3,4), val=c(rep(11,3),rep(12,2),13,14), cat=FlatColumn)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
In essence, for every element obtained by splitting MyColumn (the letters A to G), I want to assign the corresponding group_id and val. Every letter appears only once in MyColumn.
Is there a neat way for this kind of reshaping/unlisting/merging? I've come up with a very cumbersome for-loop over the rows of MyDF and the length of the corresponding element of strsplit(MyColumn,split=","). I'm very sure that there has to be a more elegant way.
You can use separate_rows from tidyr:
tidyr::separate_rows(MyDF, cat)
# group_id val cat
# 1 1 11 A
# 2 1 11 B
# 3 1 11 C
# 4 2 12 D
# 5 2 12 E
# 6 3 13 F
# 7 4 14 G
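Note that separate_rows splits on any run of non-alphanumeric characters by default, which is why the stray space in "A, B" is absorbed here. If you prefer to be explicit, passing a separator regex should behave the same way (a sketch):
tidyr::separate_rows(MyDF, cat, sep = ",\\s*")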
How about
lst <- strsplit(MyColumn, split = ",\\s*")  # ",\\s*" also drops the stray space in "A, B"
k <- lengths(lst)  ## expansion size
FlatColumn <- unlist(lst, use.names = FALSE)
MyNewDF <- data.frame(group_id = rep.int(MyDF$group_id, k),
                      val = rep.int(MyDF$val, k),
                      cat = FlatColumn)
# group_id val cat
#1 1 11 A
#2 1 11 B
#3 1 11 C
#4 2 12 D
#5 2 12 E
#6 3 13 F
#7 4 14 G
We can use cSplit from splitstackshape
library(splitstackshape)
cSplit(MyDF, "cat", ",", "long")
# group_id val cat
#1: 1 11 A
#2: 1 11 B
#3: 1 11 C
#4: 2 12 D
#5: 2 12 E
#6: 3 13 F
#7: 4 14 G
We can also do this in base R: strsplit the 'cat' column into a list, replicate the rows of 'MyDF' according to lengths(lst), and create the 'cat' column by unlisting 'lst'.
lst <- strsplit(as.character(MyDF$cat), ",\\s*")  # again trimming the space after the comma
transform(MyDF[rep(1:nrow(MyDF), lengths(lst)), -3], cat = unlist(lst))

How to remove individuals with fewer than 5 observations from a data frame [duplicate]

To clarify the question I'll briefly describe the data.
Each row in the data.frame is an observation, and the columns represent variables pertinent to that observation including: what individual was observed, when it was observed, where it was observed, etc. I want to exclude/filter individuals for which there are fewer than 5 observations.
In other words, if there are fewer than 5 rows where individual = x, then I want to remove all rows that contain individual x and reassign the result to a new data.frame. I'm aware of some brute force techniques using something like names == unique(df$individualname) and then subsetting out those names individually and applying nrow to determine whether or not to exclude them...but there has to be a better way. Any help is appreciated, I'm still pretty new to R.
An example using group_by and filter from dplyr package:
library(dplyr)
df <- data.frame(id = c(rep("a", 2), rep("b", 5), rep("c", 8)),
                 foo = runif(15))
> df
id foo
1 a 0.8717067
2 a 0.9086262
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
df %>% group_by(id) %>% filter(n()>= 5) %>% ungroup()
Source: local data frame [13 x 2]
id foo
(fctr) (dbl)
1 b 0.9962453
2 b 0.8980123
3 b 0.1535324
4 b 0.2802848
5 b 0.9366375
6 c 0.8109557
7 c 0.6945285
8 c 0.1012925
9 c 0.6822955
10 c 0.3757085
11 c 0.7348635
12 c 0.3026395
13 c 0.9707223
or with base R:
> df[df$id %in% names(which(table(df$id)>=5)), ]
id foo
3 b 0.9962453
4 b 0.8980123
5 b 0.1535324
6 b 0.2802848
7 b 0.9366375
8 c 0.8109557
9 c 0.6945285
10 c 0.1012925
11 c 0.6822955
12 c 0.3757085
13 c 0.7348635
14 c 0.3026395
15 c 0.9707223
Still in base R, using with is a more elegant way to do the very same thing:
df[with(df, id %in% names(which(table(id)>=5))), ]
or:
subset(df, with(df, id %in% names(which(table(id)>=5))))
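Another common base R idiom (my own sketch, not from the original answer) uses ave() to attach each group's size to its rows and subsets on that:
df[ave(seq_along(df$id), df$id, FUN = length) >= 5, ]
This avoids building the table()/names() lookup and keeps the filter fully vectorized.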
Another way to do the same thing using the data.table package.
library(data.table)
set.seed(1)
dt <- data.table(id=sample(1:4,20,replace=TRUE),var=sample(1:100,20))
dt1 <- dt[, count := .N, by = id][count >= 5]
dt2 <- dt[, count := .N, by = id][count < 5]
dt1
id var count
1: 2 94 5
2: 2 22 5
3: 3 64 5
4: 4 13 6
5: 4 37 6
6: 4 2 6
7: 3 36 5
8: 3 81 5
9: 3 90 5
10: 2 17 5
11: 4 72 6
12: 2 57 5
13: 3 67 5
14: 4 9 6
15: 2 60 5
16: 4 34 6
dt2
id var count
1: 1 26 4
2: 1 31 4
3: 1 44 4
4: 1 54 4
It can also be done with data.table, using a logical condition with if after grouping by 'id':
library(data.table)
setDT(df)[, if(.N >=5) .SD, id]
# id foo
# 1: b 0.9962453
# 2: b 0.8980123
# 3: b 0.1535324
# 4: b 0.2802848
# 5: b 0.9366375
# 6: c 0.8109557
# 7: c 0.6945285
# 8: c 0.1012925
# 9: c 0.6822955
#10: c 0.3757085
#11: c 0.7348635
#12: c 0.3026395
#13: c 0.9707223
data
df <- structure(list(id = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "c"), foo = c(0.8717067, 0.9086262,
0.9962453, 0.8980123, 0.1535324, 0.2802848, 0.9366375, 0.8109557,
0.6945285, 0.1012925, 0.6822955, 0.3757085, 0.7348635, 0.3026395,
0.9707223)), .Names = c("id", "foo"), class = "data.frame",
row.names = c(NA, -15L))
You can also use table. Take, for instance, the data.frame mtcars:
table(mtcars$cyl)
##  4  6  8
## 11  7 14
You will see that cyl takes three values: 4, 6 and 8, and that there are 7 cars with 6 cylinders. If you want to exclude groups with fewer than 10 observations, you can drop the 6-cylinder cars like this:
mtcars[!mtcars$cyl %in% names(table(mtcars$cyl)[table(mtcars$cyl) < 10]), ]
This excludes the observations using only %in%, names and table.

Compare values in the same column and add the result in a second column in R

I am comparing values in the same column for different rows and depending on the value I want to give a result.
Below is the table. I have the IDs and I want to derive the values of the column "Result". The column "What I have" shows what my current function returns.
Conditions for the results:
If the ID (n) is different from the ID of the row before (n-1), then Result (n) is A.
If the ID (n) is the same as the ID of the row before (n-1) and different from the ID of the next row (n+1), then Result (n) is C.
All other cases are B.
l <- ifelse((df$ID[1:(nrow(df)-1)] != df$ID[2:(nrow(df)+0)]), print("A"),
            ifelse(((df$ID[1:(nrow(df)-1)] == df$ID[2:(nrow(df)+0)]) &
                    (df$ID[2:(nrow(df)+0)] != df$ID[3:(nrow(df)+1)])),
                   print("C"), print("B")))
df
ID Result What I have
1 12 A A
2 13 A A
3 14 A A
4 15 A B
5 15 B B
6 15 B B
7 15 B B
8 15 B C
9 15 C A
10 16 A NA
11 17 A NA
Thanks a lot in advance
Using lag and lead from dplyr together with ifelse, you could do it this way:
library(dplyr)
df$Result <- ifelse(df$ID != lag(df$ID) | is.na(lag(df$ID)), 'A',
                    ifelse(df$ID == lag(df$ID) & df$ID != lead(df$ID), 'C', 'B'))
Output:
> df
ID Result
1 12 A
2 13 A
3 14 A
4 15 A
5 15 B
6 15 B
7 15 B
8 15 B
9 15 C
10 16 A
11 17 A
A few words of clarification: lag shifts the column down by one row, so each row sees the previous row's value, whereas lead does exactly the opposite, giving each row the next row's value. Check lag(df$ID) and lead(df$ID) to visualize it.
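For instance, a quick sketch of the two shifts on a toy vector (my own illustration):
library(dplyr)
x <- c(12, 13, 14, 15, 15)
lag(x)   # NA 12 13 14 15  -- each row sees the previous value
lead(x)  # 13 14 15 15 NA  -- each row sees the next value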
We could also use rleid from data.table. We group by the run-length id of the 'ID' variable and use a logical condition to create the 'Result' variable: if a run has more than one row (.N > 1), we concatenate 'A', the appropriate number of 'B's and a final 'C'; otherwise we return just 'A'.
library(data.table)#v1.9.6+
setDT(df)[, Result := if (.N > 1) c('A', rep('B', .N - 2), 'C') else 'A',
          by = rleid(ID)]
df
# ID Result
# 1: 12 A
# 2: 13 A
# 3: 14 A
# 4: 15 A
# 5: 15 B
# 6: 15 B
# 7: 15 B
# 8: 15 B
# 9: 15 C
#10: 16 A
#11: 17 A
data
df <- data.frame(ID= c(12:14, rep(15, 6), 16:17))
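For intuition, rleid() collapses each run of identical IDs into a single group id, so on the example data it yields (a quick check):
data.table::rleid(df$ID)
# [1] 1 2 3 4 4 4 4 4 4 5 6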

Selecting rows by offsetting

I have this data frame, lets call it my_df.
It looks like this:
my_df <- data.frame(rnorm(n = 30, sd = .5), rep(c("a","b","c"), each = 10))
names(my_df) <- c("num","let")
head(my_df)
num let
1 0.01202600 a
2 1.09025768 a
3 -0.08656178 a
4 -0.04847073 a
5 -0.63750258 a
6 0.58846135 a
What I want to do is select all of the rows where my_df$let == "b", as well as the five rows before the first row where my_df$let == "b" and the five rows after the last row where my_df$let == "b". So basically my_df[6:25,].
The data I'm actually working with is hundreds of thousands of lines long, I don't know which rows are which, and the data sets don't line up row-wise, so I can't go through each one individually. I've been using a subset to select the data I want, but I don't know how to select the additional rows outside of the subset (1000 rows before and after).
Here's my subset for what I'm doing:
# The following lines separate pXX_NoNegative into individual field sections
p04_HighWeeds <- subset(p04_NoNegative, subset = p04_NoNegative$GS_Field == "High Weeds")
I want to select all of the rows that the above code selects, but I also want 100 rows before that, and 1000 rows after that.
If you need any additional information that may help you please ask.
Here's another idea using dplyr:
library(dplyr)
my_df %>% filter(lead(let == "b", 5) | lag(let == "b", 5))
Or as per #akrun suggestion using the devel version of data.table:
setDT(my_df)[shift(let == "b", 5) | shift(let == "b", type = "lead", 5)]
Which gives:
# num let
#1 0.36723709 a
#2 0.24743170 a
#3 -0.33339924 a
#4 -0.57024317 a
#5 0.03390278 a
#6 -0.43495096 b
#7 -0.85107347 b
#8 0.53048931 b
#9 -0.26739611 b
#10 -0.96029355 b
#11 -0.71737408 b
#12 0.34324685 b
#13 0.12319646 b
#14 0.75207703 b
#15 0.18134006 b
#16 -0.02230777 c
#17 0.42646106 c
#18 -0.11055478 c
#19 0.06013187 c
#20 0.50782158 c
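One caveat worth noting (my observation, not part of the original answer): the lead/lag filter keeps a row only when the row exactly 5 positions away is "b", so it covers the whole 6:25 span here only because the "b" run is 10 rows long, i.e. at least twice the offset. For shorter runs, rows in the middle (including some of the "b" rows themselves) would be dropped. A run-length-robust sketch along the lines of the lapply answer below:
idx <- which(my_df$let == "b")
keep <- max(1, min(idx) - 5):min(nrow(my_df), max(idx) + 5)
my_df[keep, ]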
Normally splitting a data frame into a list of data frames based on some categorization is straightforward -- you would use split(my_df, my_df$let) in your case. However with the added complication that you want some number of rows before or after I would operate over the set of unique categorizations, selecting the rows you want in each case:
before <- 5
after <- 5
ret <- setNames(lapply(unique(my_df$let), function(x) {
  positions <- which(my_df$let == x)
  start.pos <- max(1, min(positions) - before)
  end.pos <- min(nrow(my_df), max(positions) + after)
  my_df[start.pos:end.pos, ]
}), unique(my_df$let))
You can grab the observations for any category you want out of the returned list:
ret$b # Also works: ret[["b"]]
# num let
# 6 -0.197901427 a
# 7 0.194607192 a
# 8 -0.107318203 a
# 9 -0.365313233 a
# 10 -0.188926562 a
# 11 0.636272295 b
# 12 -0.058791973 b
# 13 -0.231029510 b
# 14 0.519441716 b
# 15 0.239510912 b
# 16 0.107025658 b
# 17 -0.446644081 b
# 18 0.145052077 b
# 19 -0.426090749 b
# 20 -0.356062993 b
# 21 -0.155012203 c
# 22 -0.007968255 c
# 23 -0.504253089 c
# 24 0.081624303 c
# 25 -0.657008233 c
I recently answered a nearly identical question: Select n rows after specific number. Adapting the single-segment solution to your data:
set.seed(1); my_df <- data.frame(rnorm(n = 30, sd = .5), rep(c("a","b","c"), each = 10));
names(my_df) <- c("num","let");
brange <- range(which(my_df$let=='b')); ## first and last rows of the "b" block
## offset from the "b" block: negative before it, 0 inside it, positive after it
my_df$offb <- c((1-brange[1]):-1, rep(0,diff(brange)+1), 1:(nrow(my_df)-brange[2]));
my_df;
## num let offb
## 1 -0.313226905 a -10
## 2 0.091821662 a -9
## 3 -0.417814306 a -8
## 4 0.797640401 a -7
## 5 0.164753886 a -6
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5
## 26 -0.028064370 c 6
## 27 -0.077897753 c 7
## 28 -0.735376192 c 8
## 29 -0.239075028 c 9
## 30 0.208970780 c 10
subset(my_df,offb>=-5&offb<=5);
## num let offb
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows whose df$time value is not present for every level of df$site?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way :
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, where the final result should be empty (as Ben pointed out in the comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
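For example, a quick sketch of the empty case (my addition; assuming site is stored as character, as in R >= 4.0):
DT2 <- rbind(DT, data.table(site = "D", time = 44, value = 0))
tt2 <- DT2[, length(unique(site)), by = time]
tt2 <- tt2[V1 == length(unique(DT2$site))]  # empty: no time appears in all 4 sites
DT2[time %in% tt2$time]                     # zero rows, as desired
Note that the max(V1) version would wrongly keep times 11 and 22 here.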
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution I put above. Rather than looking for a run length of "1", it deletes rows that do not satisfy certain conditions on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
                 time = c(11,11,11,22,22,22,33),
                 value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
                  time = c(14,14,15,15,16,16,16),
                  value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
                               >= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping timestamps present in at least two sites, we could change the condition >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site)) - 1 (that is, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
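As a quick sketch of that looser threshold (my addition, reusing the extended df built above):
temp2 <- as.numeric(names(which(colSums(with(df, table(site, time))) >= 2)))
df[df$time %in% temp2, ]  # keeps times 11, 14, 15, 16 and 22; drops 33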
