Find an element following a sequence across rows in a data frame - r

I have a data set with the structure shown below.
# example data set
a <- "a"
b <- "b"
d <- "d"
id1 <- c(a,a,a,a,b,b,d,d,a,a,d)
id2 <- c(b,d,d,d,a,a,a,a,b,b,d)
id3 <- c(b,d,d,a,a,a,a,d,b,d,d)
dat <- rbind(id1,id2,id3)
dat <- data.frame(dat)
I need to find, in each row, the first run of repeated "a" values and identify the element that immediately follows that run.
# desired results
dat$s3 <- c("b","b","d")
dat
I was able to break the problem in 3 steps and solve the first one but as my programming skills are quite limited, I would appreciate any advice on how to approach steps 2 and 3. If you have an idea that solves the problem in another way that would be extremely helpful as well.
Here is what I have so far:
# Step 1: find the first occurrence of "a" in the first sequence
dat$s1 <- apply(dat, 1, function(x) match(a,x))
# Step 2: find the last occurrence of "a" in the first sequence
# Step 3: find the element following the last occurrence in the first sequence
Thanks in advance!
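For reference, here is a minimal base-R sketch of steps 2 and 3 using rle() (my own addition, not from the original post; the helper name find_next is only illustrative, and it assumes the first run of "a" values is followed by at least one more element):
find_next <- function(x) {
  x <- as.character(x)
  r <- rle(x == "a")
  first_run <- which(r$values)[1]          # index of the first run of "a"s
  last_pos  <- sum(r$lengths[1:first_run]) # position of the last "a" in that run
  x[last_pos + 1]                          # element immediately after the run
}
apply(dat, 1, find_next)
# id1 id2 id3
# "b" "b" "d"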

I'd use filter:
fun <- function(x) {
  x <- as.character(x)
  isa <- (x == "a") # find "a" values
  # find sequences with two TRUE values and the last value FALSE
  ids <- stats::filter(isa, c(1, 1, 1), sides = 1) == 2L & !isa
  na.omit(x[ids])[1] # subset
}
apply(dat, 1, fun)
#id1 id2 id3
#"b" "b" "d"

Try this (assuming that you have repeated "a" values in each row):
library(stringr)
dat$s3 <- apply(dat, 1, function(x) str_match(paste(x, collapse = ''), 'aa([^a])')[,2])
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 s3
id1 a a a a b b d d a a d b
id2 b d d d a a a a b b d b
id3 b d d a a a a d b d d d
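What the intermediate string and the match look like for the first row (a quick check added here):
paste(as.matrix(dat)[1, ], collapse = '')
# [1] "aaaabbddaad"
str_match("aaaabbddaad", 'aa([^a])')
#      [,1]  [,2]
# [1,] "aab" "b"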

Well, here is one attempt, which is a bit messy:
l1 <- lapply(apply(dat, 1, function(i) as.integer(which(i == a))),
             function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
dat$s3 <- diag(as.matrix(dat[ind]))
dat$s3
#[1] "b" "b" "d"
or wrap it in a function,
fun1 <- function(df){
  l1 <- lapply(apply(df, 1, function(i) as.integer(which(i == a))),
               function(j) j[cumsum(c(1, diff(j) != 1)) == 1])
  ind <- unname(sapply(l1, function(i) tail(i, 1) + 1))
  return(diag(as.matrix(df[ind])))
}
fun1(dat)
#[1] "b" "b" "d"

Related

Row name of the smallest element

I have the following dataframe:
d <- data.frame(a=c(1,2,3,4), b=c(20,19,18,17))
row.names(d) <- c("A", "B", "C", "D")
I want another data.frame, with the same columns and 2 rows, which contain the row names of the 2 smallest elements in that column.
In the example the expected result would be:
# Expected results
exp <- data.frame(a=c("A", "B"), b=c("C","D"))
We loop over the columns with lapply, order the values, use that index to subset the first n corresponding row.names of 'd', and wrap with data.frame.
n <- 2
data.frame(lapply(d, function(x) sort(head(row.names(d)[order(x)], n))))
-output
# a b
#1 A C
#2 B D
With R 4.1.0, we can also use the |> operator for chaining the functions (applied in order for easier understanding), along with \(x), the lambda-function shorthand in base R:
lapply(d, \(x) row.names(d)[order(x)] |> # // order the column values and get the corresponding row names
    head(n) |>                           # // get the first n values
    sort()) |>                           # // sort them
  data.frame()                           # // convert the list to data.frame
# a b
#1 A C
#2 B D
Or using dplyr
library(dplyr)
d %>%
  summarise(across(everything(),
                   ~ sort(head(row.names(d)[order(.)], n))))
# a b
#1 A C
#2 B D
Using sapply in base R -
rn <- rownames(d)
sapply(d, function(x) rn[rank(x) %in% 1:2]) # rank(x) %in% 1:2 flags the rows holding the two smallest values
# a b
#[1,] "A" "C"
#[2,] "B" "D"

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
  w1 = c(sample(LETTERS, 10)),
  w2 = c(sample(LETTERS, 10)),
  w3 = c(sample(LETTERS, 10)),
  w4 = c(sample(LETTERS, 10))
)
df
w1 w2 w3 w4
1 U R A Y
2 G X P M
3 Q B S R
4 E O V T
5 V D G W
6 T A Q E
7 C K L U
8 D F O Z
9 R I M G
10 O T T I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but that doesn't look nice and is highly repetitive and error-prone if the dataframe is larger:
extract <- c(df$w1[df$w1 %in% w],
             df$w2[df$w2 %in% w],
             df$w3[df$w3 %in% w],
             df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
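A quick check of that point (added here; same df and w as above):
df[paste0("w", 1:4)] %in% w
# [1] FALSE FALSE FALSE FALSE
df[c(FALSE, FALSE, FALSE, FALSE)]
# data frame with 0 columns and 10 rows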
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist or unclass on the returned list if you want a vector.
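For example, to collapse the list into one vector (same result as the matrix approach above):
unlist(lapply(df, function(x) x[x %in% w]), use.names = FALSE)
# [1] "B" "D" "E" "E" "A" "B" "E" "B"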

Multiple observations per ID, create column for last observation [duplicate]

I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kind of a typical max.col situation, but I don't want to coerce needlessly, as I am trying to optimize this code using data.table. The real data has other types of columns as well.
Here is the example,
## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA # the real data isn't nice and upper-triangular
dat <- data.frame(dat, stringsAsFactors = TRUE) # factor columns
## So, it looks like this
setDT(dat)[]
# X1 X2 X3 X4 X5
# 1: u NA NA NA NA
# 2: f q NA NA NA
# 3: f b w NA NA
# 4: k g h NA NA
# 5: u b r NA NA
# 6: f q w x t
# 7: u g h i e
# 8: u q r n t
## I just want to get the labels of the factors
## that are 'rightmost' in each row. I tried a number of things
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]
This is the goal though, to extract these labels, here using regular base functions.
## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties="first")-1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
Here's another way:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]
X1 X2 X3 X4 X5 res
1: u NA NA NA NA u
2: f q NA NA NA q
3: f b w NA NA w
4: k g h NA NA h
5: u b r NA NA r
6: f q w x t t
7: u g h i e e
8: u q r n t t
Benchmarks: Using the same data as #alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.
Functions:
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
  if(!length(wh)) return(ans)
  ans[wh] = as.character(x[[length(x)]])[wh]
  Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex2 = function(x){
  x[, res := NA_character_]
  wh = x[, .I]
  for (v in (length(x)-1):1){
    if (!length(wh)) break
    set(x, j="res", i=wh, v = x[[v]][wh])
    wh = wh[is.na(x$res[wh])]
  }
  x$res
}
frank = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
  return(x$res)
}
frank2 = function(x){
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
  x$res
}
Example data and benchmark:
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
                            function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)
library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)
Unit: milliseconds
         expr       min        lq      mean    median         uq        max neval
  frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
 frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
   alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
  alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30
Another idea, similar to Frank's, that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a length == nrow(dat) vector for NAs in every iteration.
alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
  if(!length(wh)) return(ans)
  ans[wh] = as.character(x[[length(x)]])[wh]
  Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"
And to compare with Frank's:
frank = function(x)
{
  x[, res := NA_character_]
  for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
  return(x$res)
}
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
                            function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)),
                               { frank(DAT2); DAT2[, res := NULL] },
                               times = 30)
#Unit: milliseconds
# expr min lq median uq max neval
# alex(as.list(DAT1)) 102.9767 108.5134 117.6595 133.1849 166.9594 30
# { frank(DAT2) DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589 30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE
Here is a one liner base R approach:
sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
# 1 2 3 4 5 6 7 8
#"u" "q" "w" "h" "r" "t" "e" "t"
We convert the 'data.frame' to 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)). We reshape the 'wide' to 'long' format with melt. Grouped by 'rn', if there is no NA element in 'value' column, we get the last element of 'value' (value[.N]) or else, we get the element before the first NA in the 'value' to get the 'V1' column, which we extract ($V1).
melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
    if(!any(is.na(value))) value[.N]
    else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
In case, the data is already a data.table
dat[, rn := 1:.N] # create the 'rn' column
melt(dat, id.var='rn')[, # melt from wide to long format
    if(!any(is.na(value))) value[.N]
    else value[which(is.na(value))[1]-1], by = rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"
Here is another option
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
, as.character(.SD[[.BY[[1]]]]), by=colInd]
Or as #Frank mentioned in the comments, we can use na.rm=TRUE from melt and make it more compact
melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]
I'm not sure how to improve upon #alexis's answer beyond what #Frank has already done, but your original approach with base R wasn't too far off of something that is reasonably performant.
Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:
as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))]
The most expensive part of this seems to be the as.matrix(dat) part, but otherwise, it seems to be faster than the melt approach that #akrun shared.
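The index vector it builds, for the sample data (my own check; this assumes dat still holds only the X1..X5 columns):
max.col(!is.na(dat), "last")
# [1] 1 2 3 3 3 5 5 5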

Subsetting a data frame using another data frame

I'm having trouble with something that shouldn't be that hard to work out. What I would like to do is subset a data.frame using another data.frame, and more precisely, using a certain column as a filter.
Here goes the example:
df1<- t(data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("0.5","3","0","0","5","0","15"), C=c("0","0","3","15","15","0","0"), D=c("0.5","0.5","0.5","0","0","0","0"), E=c("37.5","37.5","0.5","62.5","0.5","0.5","1")))
df2<- data.frame(A=c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"), B=c("vasc", "vasc","vasc","spha", "moss","moss","moss"), C=c("a", "a", "b", "a", "c","d","a"))
Now, let's say that I want to keep in my df1 only the objects A (here they are species) that are "vasc" in df2.
For that I've tried a few things such as:
df3 <- subset(df2, B=="vasc")
df4 <- df1[,c(df1, as.vector(df2))]
But in doing so, I have an error of type:
Error in df1[, c(df1, as.vector(df2))] : invalid subscript type 'list'
Therefore, I've tried to unlist my dataframes, but nothing seems to work. I've been on this problem for a while now and I did explore the forum to see if anyone had an elegant solution to my problem, but it looks like no one has.
Another way of doing this subsetting was to do the following bit of code, but it didn't work either even though I felt closer to the solution:
try11 <- list(df2, df1) %>% rbindlist(., fill = T) # with df1 not transposed
df11 <- try11[try11=="vasc",]
I hope the code is good enough and my explanation clear enough.
Thank you!
You might try:
library(data.table)
setDT(df1)
setDT(df2)
dtPruned <- df1[A %in% df2[B == "vasc", A]]
Make sure to remove the t() call in your df1 definition for this to work, however. Basically, what it's doing is selecting the A column in df2 where B = "vasc". It then selects the rows from df1 where A is in those A's from df2.
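The inner step returns the "vasc" species first (a quick illustration added here; it assumes df1 was built without the t() call, and the exact printing depends on whether A is a factor or character):
df2[B == "vasc", A]
# ABI, BET, ALN
df1[A %in% c("ABI", "BET", "ALN")] # equivalent to dtPruned above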
You can do it with dplyr
library(dplyr)
species <- as.character(df2[df2$B == "vasc",1])
df1 %>%
  filter(A %in% species) # keep the rows whose A value occurs in species
## A tibble: 3 x 5
#  A     B     C     D     E
#  <fct> <fct> <fct> <fct> <fct>
#1 ABI   0.5   0     0.5   37.5
#2 BET   3     0     0.5   37.5
#3 ALN   0     3     0.5   0.5
PS: Your data contains only factors. Maybe you want to store the numbers as the numeric class.
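For instance, assuming df1 is kept as a regular (non-transposed) data frame, the numeric-looking columns can be converted like this (just a sketch):
df1[] <- lapply(df1, function(x) type.convert(as.character(x), as.is = TRUE))
str(df1) # A stays character, B to E become numeric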
This should do it. First we create a character vector (x) of all A values where B == vasc in df2. Then we select the columns of df1 whose A value is in x:
# Create a character vector of all A values when B == vasc
x <- as.character(df2[df2$B == "vasc", 1])
# Select columns where row A == x
df1[, which(df1[1, ] %in% x)]
[,1] [,2] [,3]
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
If we avoid the t call, we can do:
df1[df1$A %in% df2[df2$B == "vasc", 1], ]
A B C D E
1 ABI 0.5 0 0.5 37.5
2 BET 3 0 0.5 37.5
3 ALN 0 3 0.5 0.5
We could transpose the data frame to retain the same format as above:
t(df1[df1$A %in% df2[df2$B == "vasc", 1], ])
1 2 3
A "ABI" "BET" "ALN"
B "0.5" "3" "0"
C "0" "0" "3"
D "0.5" "0.5" "0.5"
E "37.5" "37.5" "0.5"
Data:
df1 <- t(data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("0.5","3","0","0","5","0","15"),
C = c("0","0","3","15","15","0","0"),
D = c("0.5","0.5","0.5","0","0","0","0"),
E = c("37.5","37.5","0.5","62.5","0.5","0.5","1")
)
)
df2 <- data.frame(
A = c("ABI", "BET", "ALN", "SPH", "PTI", "DIC", "PTD"),
B = c("vasc", "vasc","vasc","spha", "moss","moss","moss"),
C = c("a", "a", "b", "a", "c","d","a")
)

More efficient than nested loop in R

I am trying to break my habit of for loops by using apply, but I've gotten stumped on this one. I have a for loop that collapses every two rows into one row for an object, obj.tmp (366 by 34343), but it is slow.
Here's a much shortened example:
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20], stringsAsFactors = FALSE)
Thus:
> df
X1 X2
a k
b l
c m
d n
e o
f p
g q
h r
i s
j t
df2 <- data.frame(X1 = character(), X2 = character(), stringsAsFactors = FALSE) # initialise df2 so the loop runs as posted
for(i in 1:(nrow(df)/2)){
  df2[i,] <- apply(df[(i*2-1):(i*2),], 2, paste, collapse = "")
}
Output:
> df2
X1 X2
ab kl
cd mn
ef op
gh qr
ij st
Suggestions on a better method?
Based on your sample data, here is one possibility:
# Sample data
df <- data.frame(X1 = letters[1:10], X2 = letters[11:20], stringsAsFactors = FALSE);
do.call(rbind, lapply(split(df, gl(nrow(df) / 2, 2, nrow(df))),
                      function(x) sapply(x, paste0, collapse = "")))
# X1 X2
#1 "ab" "kl"
#2 "cd" "mn"
#3 "ef" "op"
#4 "gh" "qr"
#5 "ij" "st"
Explanation: Split df every two rows and store in list, paste entries by column, and rbind into final object.
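For reference, the grouping factor produced by gl() for this 10-row example (shown only to make the split step concrete):
gl(nrow(df) / 2, 2, nrow(df))
# [1] 1 1 2 2 3 3 4 4 5 5
# Levels: 1 2 3 4 5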
If you want to avoid rbinding the list element, you can also do:
t(sapply(split(df, gl(nrow(df) / 2, 2, nrow(df))),
         function(x) sapply(x, paste0, collapse = "")));
# X1 X2
#1 "ab" "kl"
#2 "cd" "mn"
#3 "ef" "op"
#4 "gh" "qr"
#5 "ij" "st"
We can use the aggregate function:
df1 <- cbind(df, id = rep(1:(nrow(df)/2), each = 2)) # create a new df with an id that shows the rows to be combined
aggregate(. ~ id, df1, paste0, collapse = "")[-1]    # combine the rows
X1 X2
1 ab kl
2 cd mn
3 ef op
4 gh qr
5 ij st
You can do all this in one line:
aggregate(. ~ id, cbind(df, id = rep(1:(nrow(df)/2), each = 2)), paste0, collapse = "")[-1]
You can also try:
matrix(do.call(paste0, data.frame(matrix(unlist(df), ncol = 2, byrow = TRUE))), ncol = 2)
[,1] [,2]
[1,] "ab" "kl"
[2,] "cd" "mn"
[3,] "ef" "op"
[4,] "gh" "qr"
[5,] "ij" "st"
Something like this? If not, can you be more clear and post the code to replicate what you are doing? But I hope this solves your problem.
df <- data.frame(X1 = letters[1:10], stringsAsFactors = FALSE)
df2 <- data.frame(X1 = character(), stringsAsFactors = FALSE)
sapply(1:round(nrow(df)/2), FUN = function(x) {
  df2[x,] <<- paste(df[(x*2-1):(x*2),], collapse = "")
})
df2
