R: Match specific elements from a list of data frames and create new data frame

Let's have a list of data frames:
df1 <- data.frame(V1=c("a", "b", "c"),V2=c("d", "e","f"), V3=c("g","h","i"),V4=c("j","k","l"))
df2 <- data.frame(V1=c("m","n"), V2=c("o","p"), V3=c("q","r"))
l <-list(df1, df2)
> l
[[1]]
V1 V2 V3 V4
1 a d g j
2 b e h k
3 c f i l
[[2]]
V1 V2 V3
1 m o q
2 n p r
Moreover, we have a vector:
ele <- c("a","b","e","g","i","m","p","s","t")
I want to obtain a new data frame constructed by matching elements of the vector ele against the list l. The data frame should take its column names from the matched elements of the vector, and its values from the elements immediately to the right of the matched elements in the list.
For instance:
df3 <-data.frame(a="d",b='e',e="h",g="j",i="l",m="o",p="r")
> df3
a b e g i m p
1 d e h j l o r
As you may notice, there is no specific matching pattern.

There are probably better solutions somewhere, but this is one possibility:
library(tidyverse)
library(magrittr)
l %<>%
map(~ t(.x) %>%
as_tibble() %>%
flatten_chr())
ele %>%
map(~ map(l, equals, .x)) %>%
map_chr(~ {
lgl <- map_lgl(.x, any)
if (!any(lgl)) {
NA
} else {
lgl_idx <- min(which(lgl))
lgl <- l[[lgl_idx]]
lgl[min(which(.x[[lgl_idx]])) + 1]
}
}) %>%
set_names(ele) %>%
na.omit()
It needs more exception handling (for example, when an element of the vector matches a value in the last column, which has no right-hand neighbour), but it works on the example you've given.
a b e g i m p
"d" "e" "h" "j" "l" "o" "r"

You can find the position of the element that matches an argument using which with arr.ind = TRUE, and then add a vector to it (in this case c(0,1), which shifts the column index one to the right).
ele_list = as.list(ele)
names(ele_list) = ele
unlist(lapply(ele_list, function(e) df1[which(df1 == e, arr.ind = TRUE) + c(0, 1)]))
a b e g i
"d" "e" "h" "j" "l"
I only did it for df1; you could run the third line for both data frames, then combine the vectors and convert the result to a data frame.
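A minimal base R sketch of that idea, extended to every data frame in the list. The helper name right_neighbours() is hypothetical (it is not from either answer), and the sketch assumes each element of ele matches at most one cell that has a right-hand neighbour; values in the last column are skipped.

```r
df1 <- data.frame(V1 = c("a", "b", "c"), V2 = c("d", "e", "f"),
                  V3 = c("g", "h", "i"), V4 = c("j", "k", "l"))
df2 <- data.frame(V1 = c("m", "n"), V2 = c("o", "p"), V3 = c("q", "r"))
l <- list(df1, df2)
ele <- c("a", "b", "e", "g", "i", "m", "p", "s", "t")

# Hypothetical helper: for one data frame, return the value to the right
# of every cell found in `ele`, named by the matched cell.
right_neighbours <- function(df, ele) {
  m <- as.matrix(df)                     # character matrix
  if (ncol(m) < 2) return(character(0))  # nothing can have a neighbour
  left  <- m[, -ncol(m), drop = FALSE]   # cells that have a right neighbour
  right <- m[, -1, drop = FALSE]         # the neighbour of each of those cells
  hit <- left %in% ele                   # column-major logical vector
  setNames(right[hit], left[hit])
}

res <- unlist(lapply(l, right_neighbours, ele = ele))
res <- res[intersect(ele, names(res))]   # keep the order given in ele
df3 <- as.data.frame(as.list(res))       # one row, one column per match
res
#   a   b   e   g   i   m   p
# "d" "e" "h" "j" "l" "o" "r"
```

Elements of ele with no match ("s" and "t") simply drop out, so no NA handling is needed.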

Related

Row name of the smallest element

I have the following dataframe:
d <- data.frame(a=c(1,2,3,4), b=c(20,19,18,17))
row.names(d) <- c("A", "B", "C", "D")
I want another data.frame, with the same columns and 2 rows, which contains the row names of the 2 smallest elements in each column.
In the example the expected result would be:
# Expected results
exp <- data.frame(a=c("A", "B"), b=c("C","D"))
We loop over the columns with lapply, order the values, use that index to subset the n corresponding row.names of 'd', and wrap with data.frame
n <- 2
data.frame(lapply(d, function(x) sort(head(row.names(d)[order(x)], n))))
-output
# a b
#1 A C
#2 B D
With R 4.1.0 or later, we can also chain the functions with the native pipe |> (applied in reading order, for easier understanding), along with the \(x) shorthand for anonymous functions in base R
# // ordered the column values
# // get corresponding row names
lapply(d, \(x) row.names(d)[order(x)] |>
head(n) |> # // get the first n values
sort()) |> # // sort them
data.frame() # // convert the list to data.frame
# a b
#1 A C
#2 B D
Or using dplyr
library(dplyr)
d %>%
summarise(across(everything(),
~ sort(head(row.names(d)[order(.)], n))))
# a b
#1 A C
#2 B D
Using sapply in base R -
rn <- rownames(d)
sapply(d, function(x) rn[rank(x, ties.method = "first") %in% 1:2])
# a b
#[1,] "A" "C"
#[2,] "B" "D"
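As a hedged aside on the index functions used in these answers: order(x) returns the positions that would sort x, while rank(x) returns each element's place in the sorted order. On the already sorted columns of d the two are easy to confuse, so this small sketch uses an unsorted vector (x and rn here are made-up illustration values, not the question's data).

```r
x  <- c(3, 1, 2)
rn <- c("A", "B", "C")

ord <- order(x)                        # positions that sort x: 2 3 1
rnk <- rank(x, ties.method = "first")  # place of each element: 3 1 2

# Two equivalent ways to get the names of the 2 smallest values
by_order <- rn[ord[1:2]]               # "B" "C"
by_rank  <- rn[rnk %in% 1:2]           # "B" "C"
```

Testing order(x) %in% 1:2 instead would select "A" and "C" here, which is not the pair of smallest values.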

Matching across datasets and columns

I have a vector with words, e.g., like this:
w <- LETTERS[1:5]
and a dataframe with tokens of these words but also tokens of other words in different columns, e.g., like this:
set.seed(21)
df <- data.frame(
w1 = c(sample(LETTERS, 10)),
w2 = c(sample(LETTERS, 10)),
w3 = c(sample(LETTERS, 10)),
w4 = c(sample(LETTERS, 10))
)
df
w1 w2 w3 w4
1 U R A Y
2 G X P M
3 Q B S R
4 E O V T
5 V D G W
6 T A Q E
7 C K L U
8 D F O Z
9 R I M G
10 O T T I
# convert factor to character:
df[] <- lapply(df[], as.character)
I'd like to extract from df all the tokens of those words that are contained in the vector w. I can do it like this, but it doesn't look nice, and it is highly repetitive and error-prone if the dataframe is larger:
extract <- c(df$w1[df$w1 %in% w],
df$w2[df$w2 %in% w],
df$w3[df$w3 %in% w],
df$w4[df$w4 %in% w])
I tried this, using paste0 to avoid addressing each column separately but that doesn't work:
extract <- df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w]
extract
data frame with 0 columns and 10 rows
What's wrong with this code? Or which other code would work?
To answer your question, "What's wrong with this code?": The code df[paste0("w", 1:4)][df[paste0("w", 1:4)] %in% w] is the equivalent of df[df %in% w] because df[paste0("w", 1:4)], which you use twice, simply returns the entirety of df. That means df %in% w will return FALSE FALSE FALSE FALSE because none of the variables in df are in w (w contains strings but not vectors of strings), and df[c(F, F, F, F)] returns an empty data frame.
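A small sketch of that behaviour, using a made-up two-column data frame (small is a hypothetical name, chosen so the result does not depend on the random seed): %in% treats a data frame as a list of columns and tests each whole column, whereas a matrix is tested cell by cell.

```r
w <- LETTERS[1:5]
small <- data.frame(w1 = c("A", "X"), w2 = c("Y", "B"),
                    stringsAsFactors = FALSE)

by_column <- small %in% w             # one result per column: FALSE FALSE
by_cell   <- as.matrix(small) %in% w  # one result per cell:   TRUE FALSE FALSE TRUE

small[by_column]                      # data frame with 0 columns and 2 rows
```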
If you're dealing with a single data type (strings), and the output can be a character vector, then use a matrix instead of a data frame, which is faster and is, in this case, a little easier to subset:
mat <- as.matrix(df)
mat[mat %in% w]
#[1] "B" "D" "E" "E" "A" "B" "E" "B"
This produces the same output as your attempt above with extract <- ….
If you want to keep some semblance of the original data frame structure then you can try the following, which outputs a list (necessary as the returned vectors for each variable might have different lengths):
lapply(df, function(x) x[x %in% w])
#### OUTPUT ####
$w1
[1] "B" "D" "E"
$w2
[1] "E" "A"
$w3
[1] "B"
$w4
[1] "E" "B"
Just call unlist on the returned list if you want a vector.
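For illustration, a sketch of flattening the per-column list on a made-up data frame (small is a hypothetical name). By default unlist builds names from the column names; use.names = FALSE gives a plain character vector.

```r
w <- LETTERS[1:5]
small <- data.frame(w1 = c("A", "X", "C"), w2 = c("Y", "B", "Z"),
                    stringsAsFactors = FALSE)

per_column <- lapply(small, function(x) x[x %in% w])  # list: w1 = A, C; w2 = B
flat       <- unlist(per_column, use.names = FALSE)   # "A" "C" "B"
named      <- unlist(per_column)                      # names: w11, w12, w2
```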

how to subset data in R using conditional operation booleans [duplicate]

I would like to subset (filter) a dataframe by specifying which rows not (!) to keep in the new dataframe. Here is a simplified sample dataframe:
data
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c
For example, if a row of column v1 has a "b", "d", or "e", I want to get rid of that row of observations, producing the following dataframe:
v1 v2 v3 v4
a v d c
a v d d
c k d c
c r p g
I have been successful at subsetting based on one condition at a time. For example, here I remove rows where v1 contains a "b":
sub.data <- data[data[ , 1] != "b", ]
However, I have many, many such conditions, so doing it one at a time is not desirable. I have not been successful with the following:
sub.data <- data[data[ , 1] != c("b", "d", "e"), ]
or
sub.data <- subset(data, data[ , 1] != c("b", "d", "e"))
I've tried some other things as well, like !%in%, but that doesn't seem to exist.
Any ideas?
Try this
subset(data, !(v1 %in% c("b","d","e")))
The ! should be around the outside of the statement:
data[!(data$v1 %in% c("b", "d", "e")), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
You can also accomplish this by breaking things up into separate logical statements by including & to separate the statements.
subset(my.df, my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e")
This is not elegant and takes more code but might be more readable to newer R users. As pointed out in a comment above, subset is a "convenience" function that is best used when working interactively.
This answer is meant to explain why rather than how. The '==' operator in R is vectorized in the same way as the '+' operator. It matches the elements of whatever is on the left side to the elements of whatever is on the right side, element by element. For example:
> 1:3 == 1:3
[1] TRUE TRUE TRUE
Here the first test is 1==1, which is TRUE, the second is 2==2, and the third is 3==3. Notice that the following returns FALSE for the first and third elements because the order differs:
> 3:1 == 1:3
[1] FALSE TRUE FALSE
Now, if one object is shorter than the other, the shorter object is repeated as often as needed to match the longer one. If the length of the longer object is not a multiple of the length of the shorter one, you get a warning. For example:
> 1:2 == 1:3
[1] TRUE TRUE FALSE
Warning message:
In 1:2 == 1:3 :
longer object length is not a multiple of shorter object length
Here the first match is 1==1, then 2==2, and finally 1==3 (FALSE) because the left side is smaller. If one of the sides is only one element then that is repeated:
> 1:3 == 1
[1] TRUE FALSE FALSE
The correct operator to test if an element is in a vector is indeed '%in%' which is vectorized only to the left element (for each element in the left vector it is tested if it is part of any object in the right element).
Alternatively, you can use '&' to combine two logical statements. '&' takes two elements and checks elementwise if both are TRUE:
> 1:3 == 1 & 1:3 != 2
[1] TRUE FALSE FALSE
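Applied to the question's data, the recycling described above can be sketched as follows. The vector v1 reproduces the first column of the sample data, drop is a name introduced here for illustration, and suppressWarnings() only hides the length-mismatch warning.

```r
v1   <- c("a", "a", "b", "b", "c", "c", "d", "d", "e", "e")
drop <- c("b", "d", "e")

# != compares position by position, recycling drop as b d e b d e b d e b
recycled <- suppressWarnings(v1 != drop)
v1[recycled]            # "a" "a" "b" "c" "c" "d" "e"  (wrong: b, d, e survive)

# %in% tests each element of v1 against the whole of drop
v1[!(v1 %in% drop)]     # "a" "a" "c" "c"
```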
data <- data[-which(data[,1] %in% c("b","d","e")),]
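One caveat with the -which(...) form, sketched on a made-up data frame (d and none are hypothetical names): when nothing matches, which() returns integer(0), and a negative empty index selects no rows at all rather than keeping every row. Logical negation does not have this problem.

```r
d <- data.frame(v1 = c("a", "c"), v2 = c("v", "k"),
                stringsAsFactors = FALSE)

none <- which(d$v1 %in% c("b", "d", "e"))        # integer(0): nothing to drop

n_which   <- nrow(d[-none, ])                          # 0 rows: everything is lost
n_logical <- nrow(d[!(d$v1 %in% c("b", "d", "e")), ])  # 2 rows: all rows kept
```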
my.df <- read.table(textConnection("
v1 v2 v3 v4
a v d c
a v d d
b n p g
b d d h
c k d c
c r p g
d v d x
d v d c
e v d b
e v d c"), header = TRUE)
my.df[which(my.df$v1 != "b" & my.df$v1 != "d" & my.df$v1 != "e" ), ]
v1 v2 v3 v4
1 a v d c
2 a v d d
5 c k d c
6 c r p g
sub.data<-data[ data[,1] != "b" & data[,1] != "d" & data[,1] != "e" , ]
Longer, but simple to understand (I guess), and it can be used with multiple columns, even with !is.na(data[,1]).
And also
library(dplyr)
data %>% filter(!v1 %in% c("b", "d", "e"))
or
data %>% filter(v1 != "b" & v1 != "d" & v1 != "e")
or
data %>% filter(v1 != "b", v1 != "d", v1 != "e")
The last form works because filter() combines comma-separated conditions with &.

Using row-wise column indices in a vector to extract values from data frame [duplicate]

This question already has an answer here: Get the vector of values from different columns of a matrix
Using a vector of column positional indexes such as:
> i <- c(3,1,2)
How can I use the index to extract the 3rd value from the first row of a data frame, the 1st value from the second row, the 2nd value from the third row, etc.?
For example, using the above index and:
> dframe <- data.frame(x=c("a","b","c"), y=c("d","e","f"), z=c("g","h","i"))
> dframe
x y z
1 a d g
2 b e h
3 c f i
I would like to return:
> [1] "g", "b", "f"
Just use matrix indexing, like this:
dframe[cbind(seq_along(i), i)]
# [1] "g" "b" "f"
The cbind(seq_along(i), i) part creates a two column matrix of the relevant row and column that you want to extract.
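To make the index matrix visible, here is a short sketch with the question's data; idx and m are names introduced here for illustration. Each row of idx is one (row, column) pair, and indexing with a two-column matrix picks exactly those cells.

```r
dframe <- data.frame(x = c("a", "b", "c"), y = c("d", "e", "f"),
                     z = c("g", "h", "i"), stringsAsFactors = FALSE)
i <- c(3, 1, 2)

idx <- cbind(seq_along(i), i)  # rows: (1, 3), (2, 1), (3, 2)
m   <- as.matrix(dframe)       # same cells as a character matrix

from_matrix <- m[idx]          # "g" "b" "f"
from_frame  <- dframe[idx]     # "g" "b" "f"
```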
How about this:
Df <- data.frame(
x=c("a","b","c"),
y=c("d","e","f"),
z=c("g","h","i"))
##
i <- c(3,1,2)
##
index2D <- function(v = i, DF = Df){
sapply(1:length(v), function(X){
DF[X,v[X]]
})
}
##
> index2D()
[1] "g" "b" "f"
