R: splitting dataframe into distinct subgroups containing sequence of groups

This question is similar to one already answered: R: Splitting dataframe into subgroups consisting of every consecutive 2 groups
However, rather than splitting into subgroups that have a type in common, I need to split into subgroups that contain two consecutive types and are distinct. The groups in my actual data have differing numbers of rows as well.
df <- data.frame(ID=c('1','1','1','1','1','1','1'), Type=c('a','a','b','c','c','d','d'), value=c(10,2,5,3,7,3,9))
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
So subgroup 1 would be Type a and b:
ID Type value
1 1 a 10
2 1 a 2
3 1 b 5
And subgroup 2 would be Type c and d:
ID Type value
4 1 c 3
5 1 c 7
6 1 d 3
7 1 d 9
I have tried manipulating the code from this previous example, but I can't figure out how to make this happen without having overlapping Types in each group. Any help would be greatly appreciated - thanks!
EDIT: thanks for pointing out I didn't actually include the correct link.

We can do a little manipulation of a dense_rank of the Type variable to make an appropriate grouping variable:
library(dplyr)
df %>%
group_by(g = (dense_rank(match(Type, Type)) - 1) %/% 2) %>%
group_split()
# [[1]]
# # A tibble: 3 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 a 10 0
# 2 1 a 2 0
# 3 1 b 5 0
#
# [[2]]
# # A tibble: 4 × 4
# ID Type value g
# <chr> <chr> <dbl> <dbl>
# 1 1 c 3 1
# 2 1 c 7 1
# 3 1 d 3 1
# 4 1 d 9 1
Explanation: match(Type, Type) converts Type into integers giving the position at which each value first appears, so the numbering has gaps (here 1 1 3 4 4 6 6). dense_rank() closes the gaps (1 1 2 3 3 4 4). We then subtract 1 so the index starts at 0 and apply %/% 2 (integer division by 2), so that each consecutive pair of types gets the same group number.
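To make the explanation concrete, here is a small base R sketch that prints each intermediate step for the example Type column. It assumes match(Type, unique(Type)) as a stand-in for dense_rank(match(Type, Type)), which should give the same dense index; the variable names are only for illustration.

```r
Type <- c("a", "a", "b", "c", "c", "d", "d")

first_pos <- match(Type, Type)        # position of first appearance, with gaps
dense <- match(Type, unique(Type))    # same ordering, no gaps
g <- (dense - 1) %/% 2                # each consecutive pair of types shares an index

print(first_pos)  # 1 1 3 4 4 6 6
print(dense)      # 1 1 2 3 3 4 4
print(g)          # 0 0 0 1 1 1 1
```

split(df, g) would then give the same two subgroups without loading dplyr.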

Here is an rle approach, written as a function. Pass the data.frame and the name of the split column as a character string.
df <- data.frame(ID=c('1','1','1','1','1','1','1'),
Type=c('a','a','b','c','c','d','d'),
value=c(10,2,5,3,7,3,9))
split_two <- function(x, col) {
r <- rle(x[[col]])
r$values[c(FALSE, TRUE)] <- r$values[c(TRUE, FALSE)]
split(x, inverse.rle(r))
}
split_two(df, "Type")
#> $a
#> ID Type value
#> 1 1 a 10
#> 2 1 a 2
#> 3 1 b 5
#>
#> $c
#> ID Type value
#> 4 1 c 3
#> 5 1 c 7
#> 6 1 d 3
#> 7 1 d 9
Created on 2023-02-09 with reprex v2.0.2
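For readers unfamiliar with rle, the following sketch shows what happens inside a function like split_two for the example column. It is an illustrative variant, not the answer's exact code: it relabels the even-numbered runs by position (the name even is made up here), which I believe also avoids the recycling warning the logical-index version would give when the number of runs is odd.

```r
Type <- c("a", "a", "b", "c", "c", "d", "d")

r <- rle(Type)
print(r$lengths)  # 2 1 2 2
print(r$values)   # "a" "b" "c" "d"

# Relabel every second run with the label of the run before it
even <- which(seq_along(r$values) %% 2 == 0)
r$values[even] <- r$values[even - 1]

# Expand the runs back to one label per row
grp <- inverse.rle(r)
print(grp)        # "a" "a" "a" "c" "c" "c" "c"
```

The vector grp is what split() uses as the grouping factor, which is why the list elements are named after the first type of each pair.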

Related

Is there an R function to sequentially assign a code to each value in a dataframe, in the order it appears within the dataset?

I have a table with a long list of aliased values like this:
> head(transmission9, 50)
# A tibble: 50 x 2
In_Node End_Node
<chr> <chr>
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a
I would like to have R go through both columns and assign a number sequentially to each value, in the order that an aliased value appears in the dataset. I would like R to read across rows first, then down columns. For example, for the dataset above:
In_Node End_Node
<chr> <chr>
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
Is this possible? Ideally, I'd also love to be able to generate a "key" which would match each sequential code to each aliased value, like so:
Code Value
1 c4ca4238
2 2838023a
3 d82c8d16
4 a684ecee
5 fc490ca4
Thank you in advance for the help!
You could do:
df1 <- df
df1[]<-as.numeric(factor(unlist(df), unique(c(t(df)))))
df1
In_Node End_Node
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
You can match against the unique values. For a single vector, the code is straightforward:
match(vec, unique(vec))
The requirement to go across columns before rows makes this slightly tricky: you need to transpose the values first. After that, match them.
Finally, use [<- to assign the result back to a data.frame of the same shape as your original data (here x):
y = x
y[] = match(unlist(x), unique(c(t(x))))
y
V2 V3
1 1 2
2 1 3
3 1 4
4 1 5
5 6 1
6 7 8
c(t(x)) is a bit of a hack:
t first converts the tibble to a matrix and then transposes it. If your tibble contains multiple data types, these will be coerced to a common type.
c(…) discards attributes. In particular, it drops the dimensions of the transposed matrix, i.e. it converts the matrix into a vector, with the values now in the correct order.
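A short sketch of the difference between the two orderings, using the six rows from the question. It also builds the "key" table the question asks for, which falls out of the same unique() call. The names by_column, by_row, lookup and key are my own and only illustrative.

```r
x <- data.frame(
  In_Node  = c("c4ca4238", "c4ca4238", "c4ca4238", "c4ca4238", "28dd2c79", "f899139d"),
  End_Node = c("2838023a", "d82c8d16", "a684ecee", "fc490ca4", "c4ca4238", "3def184a"),
  stringsAsFactors = FALSE
)

by_column <- unlist(x, use.names = FALSE)  # all of In_Node, then all of End_Node
by_row    <- c(t(x))                       # row 1 left to right, then row 2, ...

lookup <- unique(by_row)                   # values in order of first appearance by row

y <- x
y[] <- match(by_column, lookup)            # codes written back column by column

key <- data.frame(Code = seq_along(lookup), Value = lookup)
print(y)
print(key)
```

The codes are looked up in row order but assigned in column order, which is why unlist(x) and c(t(x)) both appear in the one-liner.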
A dplyr version
Let's first re-create the sample data
library(tidyverse)
transmission9 <- read.table(header = T, text = " In_Node End_Node
1 c4ca4238 2838023a
2 c4ca4238 d82c8d16
3 c4ca4238 a684ecee
4 c4ca4238 fc490ca4
5 28dd2c79 c4ca4238
6 f899139d 3def184a")
Do this simply
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data()))))))
#> In_Node End_Node
#> 1 1 2
#> 2 1 3
#> 3 1 4
#> 4 1 5
#> 5 6 1
#> 6 7 8
Use the .names argument if you want to create new columns instead of overwriting the originals
transmission9 %>%
mutate(across(everything(), ~ match(., unique(c(t(cur_data())))),
.names = '{.col}_code'))
In_Node End_Node In_Node_code End_Node_code
1 c4ca4238 2838023a 1 2
2 c4ca4238 d82c8d16 1 3
3 c4ca4238 a684ecee 1 4
4 c4ca4238 fc490ca4 1 5
5 28dd2c79 c4ca4238 6 1
6 f899139d 3def184a 7 8

R how to get a result like expand.grid, but control the order of the expansion?

The expand.grid gives the results ordered by the last entered set, but I need it based on the first set.
Given the following code:
expand.grid(a=(1:2),b=c("a","b","c"))
a b
1 1 a
2 2 a
3 1 b
4 2 b
5 1 c
6 2 c
Notice how column a changes most often and b less often.
The algorithm seems to hold the last (Nth) variable b fixed and cycle through the preceding variable until every combination in the grid has been produced.
I need expand.grid, or a similar function, to instead hold the 1st variable fixed and cycle through the 2nd variable, and so on through all N.
The desired result for the example is:
a b
1 1 a
2 1 b
3 1 c
4 2 a
5 2 b
6 2 c
One approach that works for the example is simply to order by column a, but that does not generalize: I would need to order by N columns in sequence, and I have not found a way to do so.
It seems so trivial, but I cannot find a way to get expand.grid to behave like I need.
Any solution must work on any arbitrary number of entries to expand.grid and of any arbitrary size. Thank you.
Try this:
library(tidyverse)
df <- expand.grid(a=(1:2),b=c("a","b","c"))
df %>%
arrange_all()
We can use crossing from tidyr
library(tidyr)
crossing(a = 1:2, b = c('a', 'b', 'c'))
# A tibble: 6 x 2
# a b
# <int> <chr>
#1 1 a
#2 1 b
#3 1 c
#4 2 a
#5 2 b
#6 2 c
Here is a base-R solution that works with any number of variables, without knowing their content beforehand.
Gather all the variables in a list, with the desired order in which you want to expand. Apply a reverse function rev first on the list in expand.grid and a second time on the output to get the desired expanding result.
Your example:
l <- list(a=(1:2),b=c("a","b","c"))
rev(expand.grid(rev(l)))
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
An example with 3 variables:
var1 <- c("SR", "PL")
var2 <- c(1,2,3)
var3 <- c("A",'B')
l <- list(var1,var2,var3)
rev(expand.grid(rev(l)))
#> Var3 Var2 Var1
#> 1 SR 1 A
#> 2 SR 1 B
#> 3 SR 2 A
#> 4 SR 2 B
#> 5 SR 3 A
#> 6 SR 3 B
#> 7 PL 1 A
#> 8 PL 1 B
#> 9 PL 2 A
#> 10 PL 2 B
#> 11 PL 3 A
#> 12 PL 3 B
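If this is needed in several places, the double rev can be wrapped in a small helper. This is only a sketch: the name expand_grid_first is made up, and it assumes you want character columns rather than factors (hence stringsAsFactors = FALSE).

```r
# Hypothetical helper: like expand.grid, but the first argument varies slowest
expand_grid_first <- function(...) {
  args <- list(...)
  # Reverse the inputs so the last one varies fastest,
  # then reverse the columns back into the original order
  rev(expand.grid(rev(args), stringsAsFactors = FALSE))
}

res <- expand_grid_first(a = 1:2, b = c("a", "b", "c"))
print(res)
```

Naming the arguments, as in the call above, keeps the column names meaningful; with unnamed inputs expand.grid falls back to Var1, Var2, ... numbered in the reversed order.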
Try this:
expand.grid(b=c("a","b","c"), a=(1:2))[, c("a", "b")]
#> a b
#> 1 1 a
#> 2 1 b
#> 3 1 c
#> 4 2 a
#> 5 2 b
#> 6 2 c
Created on 2020-03-19 by the reprex package (v0.3.0)

how to subset a data frame up until a point R

I want to subset a data frame and take all observations for each id until the first observation that doesn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops at the first observation that doesn't meet the condition, moves on to the next id, returns all observations for that id up to its first failure, and so on.
The only way I can see is to use a lot of loops, and that's usually not a good thing in R.
Hope you guys have an idea
Assume that your condition is to return rows where v < 5 :
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
group_by(id) %>% # for each id
mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>% # check if v < 5 and fill with NA all rows when condition is FALSE and after that
filter(!is.na(flag)) %>% # keep only rows with no NA flags
ungroup() %>% # forget the grouping
select(-flag) # remove flag column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
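The same idea can be sketched in base R, assuming the same example data and the same v < 5 condition. Within each id, cumprod() of the 0/1 condition stays at 1 until the first failure and is 0 from then on; the column name keep is only illustrative.

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
                 v  = c(2, 4, 3, 5, 4, 5, 6, 7, 5, 4, 1))

# 1 while the condition has held for every row so far in this id, 0 afterwards
keep <- ave(as.integer(df$v < 5), df$id, FUN = cumprod) == 1

res <- df[keep, ]
print(res)
```

This keeps rows 1 to 3 for id 1 and row 5 for id 2, and drops id 3 entirely because its first row already fails the condition.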
Easy way:
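Find the position of the first FALSE in the condition and keep only the rows before it:
Patientdays <- data.frame(treatmentDate = c(1:5, 4, 6:10), date = c(2:5, 3, 6:10, 10), goodThings = 1:11, badThings = 0:10)
# with() evaluates the condition inside the data frame, so attach() is not needed
condition <- with(Patientdays, treatmentDate < date & goodThings > badThings)
# position of the first FALSE; if there is none, keep every row
first_fail <- match(FALSE, condition, nomatch = length(condition) + 1)
Patientdays[seq_len(first_fail - 1), ]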
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3

Subset data frame that include a variable

I have a list of events and sequences. I would like to print the sequences in a separate table if event = x is included somewhere in the sequence. See table below:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4
In this case I would like a new table that includes only the sequences where Event=x was included:
Event Sequence
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
Base R solution:
d[d$Sequence %in% d$Sequence[d$Event == "x"], ]
Event Sequence
1: a 1
2: a 1
3: x 1
4: a 3
5: a 3
6: x 3
data.table solution:
library(data.table)
setDT(d)[Sequence %in% Sequence[Event == "x"]]
As you can see, the syntax and logic are quite similar between these two solutions:
Find events that are equal to x
Extract their Sequence
Subset table according to specified Sequence
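Spelled out one step at a time in base R, the three steps might look like the sketch below. The data frame d is rebuilt from the question's table, and the intermediate names is_x and seqs_with_x are only illustrative.

```r
d <- data.frame(
  Event    = c("a", "a", "x", "a", "a", "a", "a", "x", "a", "a"),
  Sequence = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4),
  stringsAsFactors = FALSE
)

is_x <- d$Event == "x"                     # step 1: rows where Event is x
seqs_with_x <- unique(d$Sequence[is_x])    # step 2: their Sequence values
res <- d[d$Sequence %in% seqs_with_x, ]    # step 3: all rows of those sequences

print(seqs_with_x)  # 1 3
print(res)
```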
We can use dplyr to group the data and filter the sequence with any "x" in it.
library(dplyr)
df2 <- df %>%
group_by(Sequence) %>%
filter(any(Event %in% "x")) %>%
ungroup()
df2
# A tibble: 6 x 2
Event Sequence
<chr> <int>
1 a 1
2 a 1
3 x 1
4 a 3
5 a 3
6 x 3
DATA
df <- read.table(text = " Event Sequence
1 a 1
2 a 1
3 x 1
4 a 2
5 a 2
6 a 3
7 a 3
8 x 3
9 a 4
10 a 4",
header = TRUE, stringsAsFactors = FALSE)

Computing Change from Baseline in R

I have a dataset in R, which contains observations by time. For each subject, I have up to 4 rows, and a variable of ID along with a variable of Time and a variable called X, which is numerical (but can also be categorical for the sake of the question). I wish to compute the change from baseline for each row, by ID. Until now, I did this in SAS, and this was my SAS code:
data want;
retain baseline;
set have;
if (first.ID) then baseline = .;
if (first.ID) then baseline = X;
else baseline = baseline;
by ID;
Change = X-baseline;
run;
My question is: How do I do this in R ?
Thank you in advance.
Dataset Example (in SAS, I don't know how to do it in R).
data have;
input ID Time X;
datalines;
1 1 5
1 2 6
1 3 8
1 4 9
2 1 2
2 2 2
2 3 7
2 4 0
3 1 1
3 2 4
3 3 5
;
run;
Generate some example data:
dta <- data.frame(id = rep(1:3, each=4), time = rep(1:4, 3), x = rnorm(12))
# > dta
# id time x
# 1 1 1 -0.232313499
# 2 1 2 1.116983376
# 3 1 3 -0.682125947
# 4 1 4 -0.398029820
# 5 2 1 0.440525082
# 6 2 2 0.952058966
# 7 2 3 0.690180586
# 8 2 4 -0.995872696
# 9 3 1 0.009735667
# 10 3 2 0.556254340
# 11 3 3 -0.064571775
# 12 3 4 -1.003582676
I use the package dplyr for this. This package is not installed by default, so, you'll have to install it first if it isn't already.
The steps are: group the data by id (following operations are done per group), sort the data to make sure it is ordered on time (that the first record is the baseline), then calculate a new column which is the difference between x and the first value of x. The result is stored in a new data.frame, but can of course also be assigned back to dta.
library(dplyr)
dta_new <- dta %>% group_by(id) %>% arrange(id, time) %>%
mutate(change = x - first(x))
# > dta_new
# Source: local data frame [12 x 4]
# Groups: id [3]
#
# id time x change
# <int> <int> <dbl> <dbl>
# 1 1 1 -0.232313499 0.00000000
# 2 1 2 1.116983376 1.34929688
# 3 1 3 -0.682125947 -0.44981245
# 4 1 4 -0.398029820 -0.16571632
# 5 2 1 0.440525082 0.00000000
# 6 2 2 0.952058966 0.51153388
# 7 2 3 0.690180586 0.24965550
# 8 2 4 -0.995872696 -1.43639778
# 9 3 1 0.009735667 0.00000000
# 10 3 2 0.556254340 0.54651867
# 11 3 3 -0.064571775 -0.07430744
# 12 3 4 -1.003582676 -1.01331834
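For comparison, here is a base R sketch of the same calculation applied to the data from the question, so the result can be checked by hand. It assumes the baseline is the row with the smallest Time within each ID; the data frame name have mirrors the SAS example.

```r
have <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3),
  X    = c(5, 6, 8, 9, 2, 2, 7, 0, 1, 4, 5)
)

# Sort so that the first row of each ID is its baseline
have <- have[order(have$ID, have$Time), ]

# ave() repeats each ID's first X across all rows of that ID
baseline <- ave(have$X, have$ID, FUN = function(v) v[1])
have$Change <- have$X - baseline

print(have)
```

For ID 1 this gives changes of 0, 1, 3 and 4, matching what the retained baseline in the SAS data step would produce.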
