R - Subset dataframe based on a repeated sequence

I am trying to subset a data frame based on a specific sequence occurring in column v3.
A sample data frame:
v1 <- c(1:20)
v2 <- c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0)
v3 <- c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
my_df <- data.frame(v1,v2,v3) # creating a dataframe
Printed, my_df looks like this:
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
4 4 0 3
5 5 1 2
6 6 0 3
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
11 11 1 3
12 12 1 2
13 13 0 3
14 14 0 3
15 15 0 3
16 16 1 4
17 17 1 4
18 18 0 2
19 19 0 3
20 20 0 3
The output I am trying to achieve should look like this
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
So I want to subset my data frame according to the sequence 4 4 2 in column v3. What I have tried so far is:
my_df[which(c(diff(v3))==-2),]
but this only extracts the middle 4 of each 4 4 2 sequence, like this:
v1 v2 v3
2 2 1 4
9 9 1 4
17 17 1 4
Another option I tried:
m = match(v3, c(4,4,2))
> m
[1] 1 1 3 NA 3 NA 3 1 1 3 NA 3 NA NA NA 1 1 3 NA NA
> my_df[!is.na(m),]
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
5 5 1 2
7 7 1 2
8 8 1 4
9 9 1 4
10 10 0 2
12 12 1 2
16 16 1 4
17 17 1 4
18 18 0 2
This output gives me every 4 and 2, not just the sequence 4 4 2 that I want. Any help would be appreciated.
I have already achieved this in MATLAB with a for loop and if statements, but I am wondering how to solve it in R without a loop.

We can do this with data.table. Convert the 'data.frame' to a 'data.table' (setDT(my_df)). Using shift from data.table with type = "lead", we get the following elements of 'v3'. As shift accepts a vector for n, we specify n = 0:2, which gives three vectors: n = 0 corresponds to the original 'v3' column, and the other two hold the 1st and 2nd next values. We then paste the elements together row-wise (do.call(paste0, ...)), check whether the result equals 442, and get the indices of the TRUE values (which). These are the rows where the pattern starts. Finally, rep replicates each index three times and adding 0:2 gives the indices of the three rows of each match. This can be used to subset the rows of the original dataset.
library(data.table)
setDT(my_df)[my_df[, rep(which(do.call(paste0, shift(v3, 0:2,
type= "lead")) == 442), each = 3) + 0:2]]
# v1 v2 v3
#1: 1 1 4
#2: 2 1 4
#3: 3 0 2
#4: 8 1 4
#5: 9 1 4
#6: 10 0 2
#7: 16 1 4
#8: 17 1 4
#9: 18 0 2
data
my_df <- structure(list(v1 = 1:20, v2 = c(1L, 1L, 0L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L), v3 = c(4L,
4L, 2L, 3L, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 2L, 3L, 3L, 3L, 4L, 4L,
2L, 3L, 3L)), .Names = c("v1", "v2", "v3"), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20"))

As long as v3 does not have any missing values and every value of v3 is a single character (here, a single digit), you can also use gregexpr to accomplish this, as follows:
# get the row indices where the pattern 442 starts: c(1, 8, 16)
rowstarts <- unlist(gregexpr("442", paste(my_df$v3, collapse="")))
# extract rows from the data frame
dfNew <- my_df[sort(c(outer(rowstarts, (0:2), "+"))), ]
which returns
dfNew
v1 v2 v3
1 1 1 4
2 2 1 4
3 3 0 2
8 8 1 4
9 9 1 4
10 10 0 2
16 16 1 4
17 17 1 4
18 18 0 2
paste with the collapse argument turns the vector v3 into a single character string. gregexpr then finds the starting position in this string of every "442" substring.
The final step subsets the data.frame using the outer function, as suggested by @alexis-laz in the comments above.
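Both answers above rely on pasting the values into strings. As a complementary sketch (not taken from either answer), the same window matching can be done numerically in base R with embed(), which does not require the values to be single characters. The names pattern, windows, starts and rows are illustrative only.

```r
# Sample data from the question
my_df <- data.frame(
  v1 = 1:20,
  v2 = c(1,1,0,0,1,0,1,1,1,0,1,1,0,0,0,1,1,0,0,0),
  v3 = c(4,4,2,3,2,3,2,4,4,2,3,2,3,3,3,4,4,2,3,3)
)
pattern <- c(4, 4, 2)
k <- length(pattern)

# embed() builds every window of length k with the newest value first,
# so the columns are reversed to restore the original order
windows <- embed(my_df$v3, k)[, k:1, drop = FALSE]

# a window matches when all of its elements equal the pattern
starts <- which(apply(windows, 1, function(w) all(w == pattern)))

# expand each start into its k rows; unique() guards against overlapping matches
rows <- sort(unique(c(outer(starts, seq_len(k) - 1, "+"))))
result <- my_df[rows, ]
result
```

If the pattern never occurs, starts is empty and the subset simply has zero rows, so no special case is needed for "no match".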

Related

How to merge datasets with repeated measures

I have three datasets that I want to merge/join.
These examples only include the first participants; I have 25 in total.
df1
ID Grup pretest
1 1 A 2
2 1 A 1
3 1 A 3
4 2 B NA
5 2 B 1
6 2 B 3
7 3 A 2
8 3 A 1
9 3 A NA
10 4 B 2
11 4 B 1
12 4 B 3
df2 (this is missing one ID (5))
ID Grup posttest
1 1 A NA
2 1 A 5
3 1 A 4
4 2 B 2
5 2 B 4
6 2 B 3
7 3 A 5
8 3 A 6
9 3 A 3
10 6 B 4
11 6 B 2
12 6 B NA
Updated:
df3 (this has 5 measurements per ID)
ID Grup traning
1 1 A 2
2 1 A 6
3 1 A 3
4 1 A NA
5 1 A 1
6 2 B 3
7 2 B 4
8 2 B 1
9 2 B NA
10 2 B 2
11 3 A 1
12 3 A 3
I've been trying merge() and full_join(), but both end up creating duplicates that I don't want.
Neither treats each row of an ID as a separate observation; instead I get 9 rows for every ID value.
New <- merge(df1, df2, by = 'ID')
New <- full_join(df1, df2, by = "ID")
Setting all = TRUE doesn't help.
I need the dataset to look like this:
ID Grup pretest posttest traning
1 1 A 2 NA 3
2 1 A 1 5 4
3 1 A 3 4 4
4 1 A NA NA 4
5 1 A NA NA 3
6 2 B 3 3 NA
7 2 B 2 5 3
8 2 B NA 6 2
9 2 B NA NA 5
10 2 B NA NA 4
11 3 A 1 2 3
12 3 A 3 3 4
Since you are relying on the order of the rows in the two frames, you can simply use cbind():
cbind(df1, df2[, 3, drop = FALSE])
Output:
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
Alternatively, you can add a helper column iid (the row number) so that every entry has a unique key.
df1 <- cbind(iid = 1:nrow(df1), df1)
df2 <- cbind(iid = 1:nrow(df2), df2)
With dplyr
library(dplyr)
left_join(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
With base R merge
merge(df1, df2, c("iid", "ID", "Grup"))[,-1]
ID Grup pretest posttest
1 1 A 2 NA
2 4 B 2 4
3 4 B 1 2
4 4 B 3 NA
5 1 A 1 5
6 1 A 3 4
7 2 B NA 2
8 2 B 1 4
9 2 B 3 3
10 3 A 2 5
11 3 A 1 6
12 3 A NA 3
Data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), pretest = c(2L, 1L, 3L, NA, 1L, 3L, 2L, 1L, NA,
2L, 1L, 3L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12"))
df2 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L), Grup = c("A", "A", "A", "B", "B", "B", "A", "A", "A",
"B", "B", "B"), posttest = c(NA, 5L, 4L, 2L, 4L, 3L, 5L, 6L,
3L, 4L, 2L, NA)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Another option is joining by row names, i.e. row numbers:
library(tibble)
library(dplyr)
left_join(rownames_to_column(df1), df2 %>% rownames_to_column() , by="rowname") %>%
select(ID = ID.x, Grup = Grup.x, pretest, posttest)
ID Grup pretest posttest
1 1 A 2 NA
2 1 A 1 5
3 1 A 3 4
4 2 B NA 2
5 2 B 1 4
6 2 B 3 3
7 3 A 2 5
8 3 A 1 6
9 3 A NA 3
10 4 B 2 4
11 4 B 1 2
12 4 B 3 NA
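The answers above pair rows purely by position, which only works when the frames have the same number of rows in the same order. For the case in the question where df3 has more measurements per ID, one possible sketch is to number the measurements within each ID and join on that counter as well. The helper add_obs and the column obs are hypothetical names introduced here, and the data is a shortened version of the question's tables.

```r
df1 <- data.frame(ID = c(1, 1, 1, 2, 2, 2),
                  Grup = c("A", "A", "A", "B", "B", "B"),
                  pretest = c(2, 1, 3, NA, 1, 3))
df3 <- data.frame(ID = rep(c(1, 2), each = 5),
                  Grup = rep(c("A", "B"), each = 5),
                  traning = c(2, 6, 3, NA, 1, 3, 4, 1, NA, 2))

# hypothetical helper: number the measurements 1, 2, 3, ... within each ID
add_obs <- function(d) {
  d$obs <- ave(seq_along(d$ID), d$ID, FUN = seq_along)
  d
}

# joining on ID, Grup and the counter pairs the k-th row of an ID in one
# frame with the k-th row of the same ID in the other; all = TRUE keeps
# measurements that exist in only one frame and fills the rest with NA
merged <- merge(add_obs(df1), add_obs(df3),
                by = c("ID", "Grup", "obs"), all = TRUE)
merged
```

Because the key is now unique per row, the join no longer multiplies the rows of each ID, which was the source of the duplicates described in the question.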

Pasting values from a vector to a new column in a for loop with nested data

I have a dataframe that currently looks like this:
subjectID Trial
1 3
1 3
1 3
1 4
1 4
1 5
1 5
1 5
2 1
2 1
2 3
2 3
2 3
2 5
2 5
2 6
3 1
Etc., where trial number is nested under subject ID. I need to make a new column, "NewTrial", that simply gives the order in which the trials now appear within each subject. For example:
subjectID Trial NewTrial
1 3 1
1 3 1
1 3 1
1 4 2
1 4 2
1 5 3
1 5 3
1 5 3
2 1 1
2 1 1
2 3 2
2 3 2
2 3 2
2 5 3
2 5 3
2 6 4
3 1 1
So far, I have a for-loop written that looks like this:
for (myperson in unique(data$subjectID)){
# This line creates a vector numbering the unique trials per subject: for subject 1, c(1, 2, 3)
triallength=1:length(unique(data$Trial[data$subjectID==myperson]))
}
I'm having trouble now finding a way to paste the numbers from the created triallength vector as a column in the dataframe. Does anyone know of a way to accomplish this? I am lacking some experience with for-loops and hoping to gain more. If anyone has a tidyverse/dplyr solution, however, I am open to that as well as an alternative to a for-loop. Thanks in advance, and let me know if any clarification is needed!
Converting Trial to a factor whose levels are its unique values in order of appearance, then applying as.numeric within ave, works well:
transform(dat, NewTrial=ave(Trial, subjectID, FUN=\(x) as.numeric(factor(x, levels=unique(x)))))
# subjectID Trial NewTrial
# 1 1 3 1
# 2 1 3 1
# 3 1 3 1
# 4 1 4 2
# 5 1 4 2
# 6 1 5 3
# 7 1 5 3
# 8 1 5 3
# 9 2 1 1
# 10 2 1 1
# 11 2 3 2
# 12 2 3 2
# 13 2 3 2
# 14 2 5 3
# 15 2 5 3
# 16 2 6 4
# 17 3 1 1
Data:
dat <- structure(list(subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L), Trial = c(3L, 3L, 3L, 4L,
4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)), class = "data.frame", row.names = c(NA,
-17L))
We could use match on the unique values after grouping by 'subjectID':
library(dplyr)
dat <- dat %>%
group_by(subjectID) %>%
mutate(NewTrial = match(Trial, unique(Trial))) %>%
ungroup()
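We could use rleid from data.table. It numbers consecutive runs, so it agrees with the approaches above only when each trial's rows are contiguous within a subject: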
library(dplyr)
library(data.table)
df %>%
group_by(subjectID) %>%
mutate(NewTrial = rleid(subjectID, Trial))
subjectID Trial NewTrial
<int> <int> <int>
1 1 3 1
2 1 3 1
3 1 3 1
4 1 4 2
5 1 4 2
6 1 5 3
7 1 5 3
8 1 5 3
9 2 1 1
10 2 1 1
11 2 3 2
12 2 3 2
13 2 3 2
14 2 5 3
15 2 5 3
16 2 6 4
17 3 1 1
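Since the question also asks how to finish the for-loop, here is one possible sketch of a loop-based version. It reuses the match() idea from the dplyr answer inside the loop, so it is an illustration rather than the asker's original code; idx is a name introduced here.

```r
dat <- data.frame(
  subjectID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L),
  Trial     = c(3L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 1L, 1L, 3L, 3L, 3L, 5L, 5L, 6L, 1L)
)

# create the column first so the loop can fill it in piece by piece
dat$NewTrial <- NA_integer_

for (myperson in unique(dat$subjectID)) {
  # logical index of the rows that belong to this subject
  idx <- dat$subjectID == myperson
  # position of each trial among this subject's unique trials
  dat$NewTrial[idx] <- match(dat$Trial[idx], unique(dat$Trial[idx]))
}
dat
```

The key step is assigning into dat$NewTrial[idx], which writes the per-subject result back into exactly the rows it was computed from.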

Adding a new column to the dataframe in R that contains the most frequent value in each row

For example, consider the following data frame:
X[[i]] X[[i]] X[[i]]
1 1 1 1
2 1 1 2
3 1 2 2
4 0 3 0
5 3 3 3
6 0 3 0
7 4 3 4
8 4 4 4
The result should be:
X[[i]] X[[i]] X[[i]] output
1 1 1 1 1
2 1 1 2 1
3 1 2 2 2
4 0 3 0 0
5 3 3 3 3
6 0 3 0 0
7 4 3 4 4
8 4 4 4 4
The data frames vary in their number of rows and columns in each execution, and the output column values are numeric.
Thanks in advance.
We can loop over the rows with apply and use a Mode helper function (defined below):
cbind(df1, output = apply(df1, 1, FUN = Mode))
# X[[i]] X[[i]] X[[i]] output
#1 1 1 1 1
#2 1 1 2 1
#3 1 2 2 2
#4 0 3 0 0
#5 3 3 3 3
#6 0 3 0 0
#7 4 3 4 4
#8 4 4 4 4
where
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
data
df1 <- structure(list(`X[[i]]` = c(1L, 1L, 1L, 0L, 3L, 0L, 4L, 4L),
`X[[i]]` = c(1L, 1L, 2L, 3L, 3L, 3L, 3L, 4L), `X[[i]]` = c(1L,
2L, 2L, 0L, 3L, 0L, 4L, 4L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
What you're calculating is the mode of each row. The following will work for any number of rows and columns:
mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df$output = apply(df, 1, mode)
This will produce:
V1 V2 V3 output
1 1 1 1 1
2 1 1 2 1
3 1 2 2 2
4 0 3 0 0
5 3 3 3 3
6 0 3 0 0
7 4 3 4 4
8 4 4 4 4
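One detail worth checking before relying on this helper is how it behaves on ties. The sketch below repeats the Mode function from the answers and runs it on a few small inputs; the data frame m is made up for illustration.

```r
Mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of first appearance
  ux[which.max(tabulate(match(x, ux)))]    # which.max picks the first maximum
}

tie     <- Mode(c(2, 1, 1, 2))   # 2 and 1 both occur twice; the first seen (2) wins
regular <- Mode(c(3, 0, 0))      # 0 occurs most often

# row-wise use on a small, made-up data frame
m <- data.frame(a = c(1, 0), b = c(1, 3), c = c(2, 0))
m$output <- unname(apply(m, 1, Mode))
m
```

Note that apply converts the data frame to a matrix first, so this row-wise approach is best suited to frames whose columns all share one type, as in the question.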

How to custom arrange such that no group dimension contains the same index twice?

I have the following tibble containing all the permutations of some indexes:
bb <- as_tibble(expand.grid(v1=0:2, v2=0:2)) %>%
arrange(v1, v2)
bb
# A tibble: 9 x 2
v1 v2
<int> <int>
1 0 0
2 0 1
3 0 2
4 1 0
5 1 1
6 1 2
7 2 0
8 2 1
9 2 2
How can it be arranged in such a way that it generates this output instead:
v1 v2
<int> <int>
1 0 0
2 1 1
3 2 2
4 0 1
5 1 2
6 2 0
7 0 2
8 1 0
9 2 1
The output consists of three groups/sets such that, within each set, no index is repeated within either variable. Note that only so many rows per group/set can fulfil this criterion ...
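I am not very familiar with tibble, so here is a solution using a data.frame in base R:
# rotate x to the left by n positions
shifter <- function(x, n) if (n == 0) x else c(tail(x, -n), head(x, n))
# split the rows cyclically into three sets (rows 1,4,7 / 2,5,8 / 3,6,9)
dfs <- split(df, rep(0:2, 3))
# in set k, replace v2 with v1 rotated by k - 1 positions, then stack the sets
res <- do.call(rbind, lapply(seq_along(dfs), function(k) {
dfs[[k]][, 2] <- shifter(dfs[[k]][, 1], k - 1)
dfs[[k]]
}))
rownames(res) <- seq_len(nrow(res))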
which gives:
> res
v1 v2
1 0 0
2 1 1
3 2 2
4 0 1
5 1 2
6 2 0
7 0 2
8 1 0
9 2 1
DATA
df <- structure(list(v1 = c(0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), v2 = c(0L,
1L, 2L, 0L, 1L, 2L, 0L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-9L))
Update: a more efficient generator for all combinations in the desired format is given below:
genAllCombn <- function(n) {
v1 <- rep(0:(n-1),n)
v2 <- (v1 + rep(0:(n-1),1,each = n)) %% n
return(data.frame(v1,v2))
}
> genAllCombn(4)
v1 v2
1 0 0
2 1 1
3 2 2
4 3 3
5 0 1
6 1 2
7 2 3
8 3 0
9 0 2
10 1 3
11 2 0
12 3 1
13 0 3
14 1 0
15 2 1
16 3 2
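To confirm that a generated arrangement really satisfies the requirement, a short check can be run on the output. The sketch below restates the generator (with the redundant times argument of rep dropped) and then tests each block of n rows; block, ok_v1 and ok_v2 are names introduced here.

```r
genAllCombn <- function(n) {
  v1 <- rep(0:(n - 1), n)
  # block j (0-based) pairs each v1 with (v1 + j) modulo n
  v2 <- (v1 + rep(0:(n - 1), each = n)) %% n
  data.frame(v1, v2)
}

n <- 4
res <- genAllCombn(n)

# consecutive blocks of n rows form the groups/sets
block <- rep(seq_len(n), each = n)

# anyDuplicated() returns 0 when a vector has no repeated value
ok_v1 <- all(tapply(res$v1, block, anyDuplicated) == 0)
ok_v2 <- all(tapply(res$v2, block, anyDuplicated) == 0)

# every (v1, v2) pair should appear exactly once overall
all_pairs <- nrow(unique(res)) == n^2
c(ok_v1, ok_v2, all_pairs)
```

The same three checks can be applied to the res object produced by the split-and-shift solution with n = 3.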

How to find the last occurrence of a certain observation in grouped data in R?

I have data that is grouped using dplyr in R. I would like to find the last occurrence of observations ('B') equal to or greater than 1 (1, 2, 3 or 4) in each group ('A'), in terms of the 'day' they occurred. I would like the value of 'day' for each group to be given in a new column.
For example, given the following sample of data, grouped by A (this has been simplified, my data is actually grouped by 3 variables):
A B day
a 2 1
a 2 2
a 1 5
a 0 8
b 3 1
b 3 4
b 3 6
b 0 7
b 0 9
c 1 2
c 1 3
c 1 4
I would like to achieve the following:
A B day last
a 2 1 5
a 2 2 5
a 1 5 5
a 0 8 5
b 3 1 6
b 3 4 6
b 3 6 6
b 0 7 6
b 0 9 6
c 1 2 4
c 1 3 4
c 1 4 4
I hope this makes sense, thank you all very much for your help! I have thoroughly searched for my answer online but couldn't find anything. However, if I have accidentally duplicated a question then I apologise.
We can try
library(data.table)
setDT(df1)[, last := day[tail(which(B>=1),1)] , A]
df1
# A B day last
# 1: a 2 1 5
# 2: a 2 2 5
# 3: a 1 5 5
# 4: a 0 8 5
# 5: b 3 1 6
# 6: b 3 4 6
# 7: b 3 6 6
# 8: b 0 7 6
# 9: b 0 9 6
#10: c 1 2 4
#11: c 1 3 4
#12: c 1 4 4
Or using dplyr
library(dplyr)
df1 %>%
group_by(A) %>%
mutate(last = day[max(which(B>=1))])
Or use the last function from dplyr (as @docendo discimus suggested)
df1 %>%
group_by(A) %>%
mutate(last= last(day[B>=1]))
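For the second question (the day of the row immediately after the last nonzero B in each group, or NA when B is never 0 in that group),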
setDT(df1)[, dayafter:= if(all(!!B)) NA_integer_ else
day[max(which(B!=0))+1L] , A]
# A B day dayafter
# 1: a 2 1 8
# 2: a 2 2 8
# 3: a 1 5 8
# 4: a 0 8 8
# 5: b 3 1 7
# 6: b 3 4 7
# 7: b 3 6 7
# 8: b 0 7 7
# 9: b 0 9 7
#10: c 1 2 NA
#11: c 1 3 NA
#12: c 1 4 NA
Here is a solution that does not require loading external packages:
df <- structure(list(A = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
B = c(2L, 2L, 1L, 0L, 3L, 3L, 3L, 0L, 0L, 1L, 1L, 1L), day = c(1L,
2L, 5L, 8L, 1L, 4L, 6L, 7L, 9L, 2L, 3L, 4L)), .Names = c("A",
"B", "day"), class = "data.frame", row.names = c(NA, -12L))
x <- split(df, df$A, drop = TRUE)
tp <- lapply(x, function(k) {
tmp <- k[k$B >0,]
k$last <- tmp$day[length(tmp$day)]
k
})
do.call(rbind, tp)
# A B day last
#a.1 a 2 1 5
#a.2 a 2 2 5
#a.3 a 1 5 5
#a.4 a 0 8 5
#b.5 b 3 1 6
#b.6 b 3 4 6
#b.7 b 3 6 6
#b.8 b 0 7 6
#b.9 b 0 9 6
#c.10 c 1 2 4
#c.11 c 1 3 4
#c.12 c 1 4 4
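The split/lapply solution can also be written more compactly with ave. The following is a sketch under the assumption that a group with no B >= 1 should get NA, a case the question does not cover. Row indices are passed to ave so that both B and day can be looked up inside the function.

```r
df <- data.frame(
  A   = rep(c("a", "b", "c"), times = c(4, 5, 3)),
  B   = c(2, 2, 1, 0, 3, 3, 3, 0, 0, 1, 1, 1),
  day = c(1, 2, 5, 8, 1, 4, 6, 7, 9, 2, 3, 4)
)

# ave() calls the function once per group with that group's row indices
df$last <- ave(seq_len(nrow(df)), df$A, FUN = function(i) {
  hit <- i[df$B[i] >= 1]                     # rows in this group with B >= 1
  if (length(hit) > 0) df$day[max(hit)] else NA_real_
})
df
```

Unlike do.call(rbind, ...), this keeps the original row order and row names of df.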
