R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!

Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
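Be aware that match returns only the position of the first match, and NA where there is no match, so duplicate values or elements of vec that are missing from rownames(df) may give results you do not expect.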
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
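To make the difference concrete, here is a minimal sketch using the question's own df and vec. The second vector, vec2, is made up for illustration and shows what happens with a repeated name and a name that is not in the data frame.

```r
df  <- data.frame(row.names = letters[1:5], values = 1:5)
vec <- c("b", "a", "c")

# %in% answers "is this row name in vec?" row by row, so the data frame order wins
in_order <- df[rownames(df) %in% vec, 1]

# match() answers "where is each element of vec among the row names?", so vec's order wins
match_order <- df[match(vec, rownames(df)), 1]

# vec2 is a hypothetical input: "b" appears twice and "z" is not a row name
vec2 <- c("b", "z", "b")
idx  <- match(vec2, rownames(df))      # NA marks the element with no match
kept <- df[idx[!is.na(idx)], 1]        # drop the NA before indexing
```

Dropping the NA positions before indexing avoids getting an NA row back for names that do not exist.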

Use match, and drop the NA values produced by elements of vec that have no match among the row names. Note the argument order: match(vec, rownames(df)) returns row positions in the order of vec.
Filter(function(x) !is.na(x), match(vec, rownames(df)))

Since indexing by name also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
  xSeq <- seq_along(x)
  names(xSeq) <- x
  Out <- xSeq[as.character(table)]
  Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() automatically converts its input to character and table is converted with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
As with %in%, values that have no match are dropped:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subset. This is not the case with %ino%, which returns integer positions:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A
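For comparison, the same idea can be sketched with match() instead of name indexing. The operator name %ino2% is made up here and is not part of the answer above; treat it as one possible variant rather than a drop-in replacement.

```r
# Hypothetical variant of %ino%, built on match() instead of name indexing
`%ino2%` <- function(x, table) {
  pos <- match(table, x)   # position in x of each element of table, in table order
  pos[!is.na(pos)]         # drop elements of table that do not occur in x
}

res_letters <- LETTERS[1:26 %ino2% 4:1]

df <- data.frame(row.names = letters[1:5], values = 1:5)
res_df <- df[rownames(df) %ino2% c("b", "a", "c"), 1]
```

Unlike %ino%, this version returns an unnamed integer vector, which makes no difference when the result is only used as an index.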

Related

Storing unique values of each column (of a df) in list

It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a dataframe and store the results in a list, all using base R. Importantly, it is not combinations I need but simply the unique values of each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4], b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols)
{
x = unique(i)
unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col, as it shows as empty. I believe the problem is that i is being passed to the loop as text, not as a variable.
Any help would be greatly appreciated. Thank you.

Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
> $a
> [1] A B C D
> Levels: A B C D
> $b
> [1] 1 2 3 4

Alternatively, there is apply, which is designed to run a function over the rows or columns of a matrix:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
Though if you want a list, lapply returns one, so it may be the better choice. Note also that apply has converted the numbers in column b to character strings.
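The difference between the two is easier to see when the columns do not all have the same number of unique values. A small sketch, with a made-up data frame (df2 is not the asker's data):

```r
# Hypothetical data: column a has 3 unique values, column b has 2
df2 <- data.frame(a = c("A", "B", "B", "C"),
                  b = c(1L, 1L, 2L, 2L),
                  stringsAsFactors = FALSE)

by_lapply <- lapply(df2, unique)    # always a list; each column keeps its own type
by_apply  <- apply(df2, 2, unique)  # df2 is first converted to a character matrix

str(by_lapply)
str(by_apply)
```

Here apply cannot build a matrix because the results differ in length, so it falls back to a list, but the values in b are still character strings. lapply gives the same shape every time, which is usually what you want for this task.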

Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column in df, so unique(i) returns that name rather than the column's unique values.
Anyhow, the most standard way for this task is lapply() as shown by demirev.

Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

Function to create a new variable not working in R

I am creating a function to help me quickly recode variables into numerical values, as a form of practice. The idea behind creating the function is to quickly recode several values into numerical form, for any length. If a dataset is really long for instance, the function in theory should recode all of these values without having to manually type out each condition in which to recode it into a specific value.
For instance:
levels(d$letters)
[1] a b c d
The general form of the function is to:
d$letters.recode[d$letters == "a"] <- 1
d$letters.recode[d$letters == "b"] <- 2
d$letters.recode[d$letters == "c"] <- 3
And using this function:
rc.f <- function(a, b){
x <- levels(a)
y <- length(a)
b <- NA
for (i in 1:y){
z <- b[a==x[i]] <- i
}
}
In theory, the idea is that this function should create another variable, where a is recoded as 1, b is recoded as 2 and so on.
However when I run rc.f(d$letters, d$letters.recode), no new variables are created in the dataset, and the function does not return an error.
Any ideas?
Thanks.
Another example dataset d:
Say for a list of respondents they are assigned a category depending on their region:
Respondent Region
1 d
2 b
3 g
4 c
5 e
6 c
7 f
8 a
I am looking for a way to recode d$Region into a numerical value, to d$Region.R.
Using the same function as above, I am wondering whether I can use the function to create another variable in the dataframe, by inputting d$Region and d$Region.R into the function. So recoding a,b,c,[...],g into 1,2,3,[...],7.

If you want a, b, f, d recoded as 1, 2, 4, 3 (numbered by position among the values actually present), use the following.
I have updated your rc.f function a little:
Removed the second argument b: since b is set to NA inside the function, it does not need to be passed in.
Removed z: no extra variable is needed to hold the value of b.
Not every argument will be a factor, so the input is coerced with as.factor.
Removed y: length(a) can be used directly in the for loop.
Last but not least, a function returns the value of its last expression unless return is used, so b is placed on the last line.
The code is
rc.f <- function(a)
{
  a <- as.factor(a)
  x <- levels(a)
  b <- NA
  for (i in 1:length(a))
  {
    b[a == x[i]] <- i
  }
  b
}
Let us take an example:
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f"
[14] "d" "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 5 6 10 4 9 6 7 4 3 1 8 8 8
If you want a, b, f, d recoded as 1, 2, 6, 4 (their positions in the alphabet), use the following:
rc.f <- function(a)
{
  a <- as.factor(a)
  b <- NA
  for (i in 1:26)
  {
    b[a == letters[i]] <- i
  }
  b
}
Let's take an example:
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f" "d"
[15] "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 7 8 25 6 22 8 10 6 4 1 19 19 19
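Both recodings can also be done without a loop. The sketch below uses the Region example from the question; the data frame is retyped by hand here, so treat it as an assumption about what d looks like. It also shows the step the original attempt was missing: a function works on its own copy of the data, so the result has to be returned and assigned back to the data frame.

```r
# Region data retyped from the question (assumed layout)
d <- data.frame(Respondent = 1:8,
                Region = c("d", "b", "g", "c", "e", "c", "f", "a"))

# Position among the sorted values actually present (same idea as the first rc.f)
d$Region.R <- as.integer(factor(d$Region))

# Position in the alphabet (same idea as the second rc.f)
d$Region.alpha <- match(d$Region, letters)

# The two differ as soon as some letters are missing from the data
by_levels   <- as.integer(factor(c("a", "b", "f", "d")))
by_alphabet <- match(c("a", "b", "f", "d"), letters)
```

In this particular data every letter from a to g occurs, so Region.R and Region.alpha happen to be identical.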

Can't match character value to multiple columns

I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.

Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))

You probably want to use grepl here, which returns a logical vector. Then you can count the number of occurrences with sum.
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
To count the total number of Cs in every column, you can use lapply (note: apply is better suited for matrices, not data frames, which are lists):
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
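One caveat with grepl is that it matches patterns anywhere in the string, so 'C' would also count a value such as "C#". If exact matches are what you need, a comparison with == is a possible alternative. The helper name count_value below is made up for this sketch, and the frame is retyped from the question.

```r
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: count exact occurrences of `value` in the chosen columns
count_value <- function(data, value, cols = names(data)) {
  per_column <- vapply(data[cols],
                       function(col) sum(col == value, na.rm = TRUE),
                       integer(1))
  sum(per_column)
}

total_c  <- count_value(frame, "C")          # all columns
only_x1  <- count_value(frame, "C", "X1")    # a single named column
partial  <- grepl("C", "C#")                 # pattern matching also hits "C#"
```

As for the original FALSE: as far as I can tell, %in% treats a multi-column data frame as a list of columns, so "C" is compared against each column as a whole rather than against the individual cells.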

Find indices of vector elements in a list

I have this toy character vector:
a = c("a","b","c","d","e","d,e","f")
in which some elements are concatenated with a comma (e.g. "d,e")
and a list that contains the unique elements of that vector, where in case of comma concatenated elements I do not keep their individual components.
So this is the list:
l = list("a","b","c","d,e","f")
I am looking for an efficient way to obtain the indices of the elements of a in the l list. For elements of a that are represented by comma concatenated elements in l, it should return the indices of these comma concatenated elements in l.
So the output of this function would be:
c(1,2,3,4,4,4,5)
As you can see it returns index 4 for a elements: "d", "e", and "d,e"

I would make your search vector into a set of regular expressions, by substituting the comma with a pipe. Add names to the search vector too, according to its position in the list.
L <- setNames(lapply(l, gsub, pattern = ",", replacement = "|"), seq_along(l))
Then you can do:
lapply(L, function(x) grep(x, a, value = TRUE))
# $`1`
# [1] "a"
#
# $`2`
# [1] "b"
#
# $`3`
# [1] "c"
#
# $`4`
# [1] "d" "e" "d,e"
#
# $`5`
# [1] "f"
The names are important, because you can now use stack to get what you are looking for.
stack(lapply(L, function(x) grep(x, a, value = TRUE)))
# values ind
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 4
# 6 d,e 4
# 7 f 5

You could use a strategy with factors. First, find the index for each element in your list with
l <- list("a","b","c","d,e","f")
idxtr <- Map(function(x) unique(c(x, strsplit(x, ",")[[1]])), unlist(l))
This builds a list entry for each item in l, holding all possible matches for that item. Then we take the vector a, create a factor with those levels, and reassign the levels based on the list we just built:
a <- c("a","b","c","d","e","d,e","f")
a <- factor(a, levels=unlist(idxtr));
levels(a) <- idxtr
as.numeric(a)
# [1] 1 2 3 4 4 4 5
Finally, as shown in the last line of the code, as.numeric on the factor gives the index.
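Another way to get the same indices, sketched here as an alternative rather than as either answerer's method, is to build a named lookup vector once and index it by name. This uses exact string matching, so it does not depend on regular expressions behaving well with unusual characters.

```r
a <- c("a", "b", "c", "d", "e", "d,e", "f")
l <- list("a", "b", "c", "d,e", "f")

full  <- unlist(l)                           # the list entries as they are
parts <- strsplit(full, ",", fixed = TRUE)   # each entry split into its components

# Every full entry and every component points at the entry's position in l
lookup <- c(setNames(seq_along(full), full),
            setNames(rep(seq_along(parts), lengths(parts)), unlist(parts)))

idx <- unname(lookup[a])   # name indexing uses the first matching name
```

Names that appear both as a full entry and as a component (such as "a") occur twice in lookup, but both copies carry the same index, so taking the first match is safe.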

Compare two character vectors in R

I have two character vectors of IDs.
I would like to compare the two character vectors, in particular I am interested in the following figures:
How many IDs are both in A and B
How many IDs are in A but not in B
How many IDs are in B but not in A
I would also love to draw a Venn diagram.

Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
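One thing to keep in mind is that intersect and setdiff treat their inputs as sets, so duplicates are counted once, while %in% works element by element. A small sketch with a duplicated "Cat" added to A (A2 is made up for illustration):

```r
A2 <- c("Dog", "Cat", "Mouse", "Cat")   # hypothetical: "Cat" appears twice
B  <- c("Tiger", "Lion", "Cat")

# Set-based counts: each distinct ID is counted once
counts <- c(both   = length(intersect(A2, B)),
            only_A = length(setdiff(A2, B)),
            only_B = length(setdiff(B, A2)))

# Element-based count: every occurrence in A2 that is found in B
n_elements_in_B <- sum(A2 %in% B)
```

If your IDs are unique within each vector the two approaches agree; if not, decide first whether you want to count distinct IDs or occurrences.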

I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:
xtab_set <- function(A,B){
  both <- union(A,B)
  inA <- both %in% A
  inB <- both %in% B
  return(table(inA,inB))
}
set.seed(1)
A <- sample(letters[1:20],10,replace=TRUE)
B <- sample(letters[1:20],10,replace=TRUE)
xtab_set(A,B)
# inB
# inA FALSE TRUE
# FALSE 0 5
# TRUE 6 3

Yet another way, using %in% and logical vectors of common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain elements of any type, while a vector always contains elements of a single type, which makes it easier to decide what is truly equal. Here c() coerces all elements to character strings, since that is the most general type present.
first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")
both <- first[first %in% second] # in both, same as call: intersect(first, second)
onlyfirst <- first[!first %in% second] # only in 'first', same as: setdiff(first, second)
onlysecond <- second[!second %in% first] # only in 'second', same as: setdiff(second, first)
length(both)
length(onlyfirst)
length(onlysecond)
#> both
#[1] "2" "3" "e" "f" "bar"
#> onlyfirst
#[1] "1" "a" "b" "c" "d" "foo"
#> onlysecond
#[1] "4" "g" "h" "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4
# If you don't have the 'gplots' package, type: install.packages("gplots")
require("gplots")
venn(list(first.vector = first, second.vector = second))
As mentioned, there are multiple choices for plotting Venn diagrams in R. The call above produces one using gplots.

With sqldf: Slower but very suitable for data frames with mixed types:
library(sqldf)
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2
# sqldf1 (column X1_10): 1 2 3 4
# sqldf2 (column X5_15): 11 12 13 14 15
# sqldf3 (column X1_10): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Using the same example data as one of the answers above.
A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")
match(A,B)
[1] NA 3 NA
The match function returns, for each value in A, the position of its first match in B (NA if there is none). So "Cat", the second element in A, is the third element in B. There are no other matches.
To get the matching values in A and B, you can do:
m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"
To get the non-matching values in A and B:
A[is.na(m)]
"Dog" "Mouse"
B[is.na(match(B, A))]
"Tiger" "Lion"
Further, you can use length() to get the total number of matching and non-matching values.

If A is a data.table with field a of type list, with entries themselves as vectors of a primitive type, e.g. created as follows
library(data.table)
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
and B is a list with vector of primitive entries, e.g. created as follows
B<-list(c("ghi","zyx"))
and you're attempting to find which (if any) element of A$a matches B
A[sapply(a,identical,unlist(B))]
if you just want the entry in a
A[sapply(a,identical,unlist(B)),a]
if you want the matching indices of a
A[,which(sapply(a,identical,unlist(B)))]
if instead B is itself a data.table with the same structure as A, e.g.
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.
# give the entry in A for in which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)))
][res==TRUE
][,res:=NULL][]
# get T/F for each index of A
A[,sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)]
Note that you can't do something as easy as
setkey(A,a)
setkey(B,b)
A[B]
to join A&B because you cannot key on a field of type list in data.table 1.12.2
similarly, you cannot ask
A[a==B[,b]]
even if A and B are identical, as the == operator hasn't been implemented in R for type list
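Outside data.table, a plain base R way to compare two list columns element by element is mapply. This is a minimal sketch with made-up lists x and y standing in for A$a and B$b. It shows both an order-sensitive comparison and an order-insensitive one.

```r
# Hypothetical list columns of equal length
x <- list(c("abc", "def"), c("ghi", "zyx"))
y <- list(c("abc", "def"), c("zyx", "ghi"))

same_order <- mapply(identical, x, y)   # TRUE only if values and order both agree
same_set   <- mapply(setequal, x, y)    # TRUE if the same values occur, in any order

which(same_order)   # positions where the entries are exactly equal
```

Both calls compare the i-th entry of x with the i-th entry of y, so they answer a row-by-row question rather than a join.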
