Can't match character value to multiple columns - r

I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.

Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))
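A possible explanation for the FALSE result, offered as a sketch rather than a definitive account: a data frame is a list of columns, so %in% compares "C" against each whole column instead of the individual cells. Flattening the selected columns with unlist() gives the cell-wise test. The frame below is a hypothetical reconstruction of the four rows shown in the question.

```r
# Hypothetical reconstruction of the four rows shown in the question
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Tested against the data frame itself (a list of columns)
in_frame <- "C" %in% frame[, 1:2]

# Flatten the selected columns into one character vector, then test the cells
in_cells <- "C" %in% unlist(frame[, 1:2])

# One result per column, as a named logical vector
per_column <- vapply(frame[, 1:2], function(col) "C" %in% col, logical(1))
```

vapply is used here instead of apply so that the data frame is not first converted to a matrix.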

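You could also use grepl here, which returns a logical vector; you can then count the number of occurrences with sum. Note that grepl matches a pattern anywhere in the string, so 'C' would also match a value such as "C#"; use an anchored pattern like '^C$' (or ==) if you need exact matches.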
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
To count the total number of Cs across every column, you can use lapply (note: apply is better suited to matrices than to data frames, which are lists):
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
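The goal stated in the question, counting a value across specified columns, could be wrapped up roughly as follows. This is a minimal sketch: count_value is a hypothetical helper name, the frame is a reconstruction of the rows shown above, and exact matching with == is assumed (unlike grepl, which would also count "C#").

```r
# Hypothetical reconstruction of the example data
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Count exact occurrences of `value` in the chosen columns
# (`cols` may be column names or positions; defaults to all columns)
count_value <- function(df, value, cols = names(df)) {
  cells <- unlist(df[cols], use.names = FALSE)
  sum(cells == value, na.rm = TRUE)
}

n_all <- count_value(frame, "C")        # both columns
n_x1  <- count_value(frame, "C", "X1")  # a single column

# Difference between pattern matching and exact matching
notes <- c("C", "C#")
n_pattern <- sum(grepl("C", notes))  # counts both entries
n_exact   <- sum(notes == "C")       # counts only "C"
```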

Related

I use as.complex() to convert a string column to a numeric column in r

I have three columns which contain the characters A, B, and C respectively. I am using as.numeric to convert them to numeric so that I can then assign them values, e.g. 1, 2 and 3, but as.numeric() returns NAs. In different data frames these orders vary, e.g. ABC or ACB, but A=i+0i, B=2+3i and C is also a complex number. I want to first convert the string to a complex number and then assign values to them.
LV$phase1 <- as.numeric(LV$phase1)
class(phase1)
A=1
print(phase1)
This is the error:
"Warning message:
NAs introduced by coercion "
It does not usually make sense to convert character data to numeric, but if the letters refer to an ordered sequence of events/phases/periods, then it may be useful. R uses factors for this purpose. For example:
set.seed(42)
phase <- sample(LETTERS[1:4], 10, replace=TRUE)
phase
# [1] "A" "A" "A" "A" "B" "D" "B" "B" "A" "D"
factor(phase)
# [1] A A A A B D B B A D
# Levels: A B D
as.numeric(factor(phase))
# [1] 1 1 1 1 2 3 2 2 1 3
If this is what you are trying to do
LV$phase1 <- as.numeric(factor(LV$phase1))
will convert the letters to an ordered sequence and assign numbers to represent those categories.
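One caveat with the default levels is that the codes depend on which letters happen to be present (in the example above "D" became 3 because "C" was absent). A sketch of two ways to keep the mapping stable across data frames follows; the phase data and all three complex values in lookup are placeholders, not values taken from the question.

```r
phase <- c("B", "A", "D", "A")   # hypothetical data; "C" happens to be absent

# Default levels are the sorted unique values, so "D" gets code 3 here
default_codes <- as.integer(factor(phase))

# Fixing the levels keeps A=1, B=2, C=3, D=4 in every data frame
fixed_codes <- as.integer(factor(phase, levels = c("A", "B", "C", "D")))

# A named vector maps letters to arbitrary values, complex numbers included
# (placeholder values, for illustration only)
lookup <- c(A = 1 + 0i, B = 2 + 3i, C = 0 + 1i)
phase_complex <- unname(lookup[c("A", "C", "B")])
```

The named-vector lookup avoids as.numeric on the letters altogether, which is where the coercion warning comes from.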

Compare two character vectors in R

I have two character vectors of IDs.
I would like to compare the two character vectors, in particular I am interested in the following figures:
How many IDs are both in A and B
How many IDs are in A but not in B
How many IDs are in B but not in A
I would also love to draw a Venn diagram.
Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
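The three counts can also be collected in one call. A minimal sketch, where compare_ids is a hypothetical helper name; note that intersect and setdiff work on distinct values, so duplicated IDs are counted once.

```r
# Return the three counts asked for in the question as a named vector
compare_ids <- function(A, B) {
  c(both   = length(intersect(A, B)),
    only_A = length(setdiff(A, B)),
    only_B = length(setdiff(B, A)))
}

counts <- compare_ids(c("Dog", "Cat", "Mouse"), c("Tiger", "Lion", "Cat"))
```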
I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:
xtab_set <- function(A, B) {
  both <- union(A, B)
  inA <- both %in% A
  inB <- both %in% B
  return(table(inA, inB))
}
set.seed(1)
A <- sample(letters[1:20], 10, replace = TRUE)
B <- sample(letters[1:20], 10, replace = TRUE)
xtab_set(A, B)
#        inB
# inA     FALSE TRUE
#   FALSE     0    5
#   TRUE      6    3
Yet another way is to use %in% and logical vectors of common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain any type of element, while atomic vectors always contain elements of just one type, which makes it easier to compare what is truly equal. Here c() coerces the elements to character strings, as that is the most general element type present.
first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")
both <- first[first %in% second] # in both, same as call: intersect(first, second)
onlyfirst <- first[!first %in% second] # only in 'first', same as: setdiff(first, second)
onlysecond <- second[!second %in% first] # only in 'second', same as: setdiff(second, first)
length(both)
length(onlyfirst)
length(onlysecond)
#> both
#[1] "2" "3" "e" "f" "bar"
#> onlyfirst
#[1] "1" "a" "b" "c" "d" "foo"
#> onlysecond
#[1] "4" "g" "h" "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4
# If you don't have the 'gplots' package, type: install.packages("gplots")
require("gplots")
venn(list(first.vector = first, second.vector = second))
As was mentioned, there are multiple choices for plotting Venn diagrams in R; the code above uses gplots.
With sqldf (slower, but well suited to data frames with mixed types):
library(sqldf)
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2
sqldf1 (column X1_10): 1 2 3 4
sqldf2 (column X5_15): 11 12 13 14 15
sqldf3 (column X1_10): 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Using the same example data as one of the answers above.
A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")
match(A,B)
[1] NA 3 NA
The match function returns a vector giving, for each value in A, its position in B (or NA if it is absent). So "Cat", the second element in A, is the third element in B. There are no other matches.
To get the matching values in A and B, you can do:
m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"
To get the non-matching values in A and B (m is indexed by the elements of A, so B needs its own lookup against A):
A[is.na(m)]
"Dog" "Mouse"
B[is.na(match(B, A))]
"Tiger" "Lion"
Further, you can use length() to get the total number of matching and non-matching values.
If A is a data.table with a field a of type list, whose entries are themselves vectors of a primitive type, e.g. created as follows
library(data.table)
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
and B is a list with vector of primitive entries, e.g. created as follows
B<-list(c("ghi","zyx"))
and you're attempting to find which (if any) element of A$a matches B
A[sapply(a,identical,unlist(B))]
if you just want the entry in a
A[sapply(a,identical,unlist(B)),a]
if you want the matching indices of a
A[,which(sapply(a,identical,unlist(B)))]
if instead B is itself a data.table with the same structure as A, e.g.
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.
# give the entry in A for in which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)))
][res==TRUE
][,res:=NULL][]
# get T/F for each index of A
A[,sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)]
Note that you can't do something as easy as
setkey(A,a)
setkey(B,b)
A[B]
to join A&B because you cannot key on a field of type list in data.table 1.12.2
similarly, you cannot ask
A[a==B[,b]]
even if A and B are identical, as the == operator hasn't been implemented in R for type list
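Outside data.table, list elements can be compared pairwise with identical, which is one way around the lack of a usable == for list columns like these. The following is a base R sketch on two plain lists (a and b are hypothetical stand-ins for the list columns above), not the data.table approach from this answer.

```r
# Two lists of character vectors, standing in for the list columns
a <- list(c("abc", "def", "123"), c("ghi", "zyx"))
b <- list(c("abc", "def", "123"), c("zyx", "ghi"))

# Pairwise comparison where element order matters
same <- mapply(identical, a, b)

# Pairwise comparison ignoring the order within each vector
same_set <- mapply(setequal, a, b)

# Does each element of a appear anywhere in b (order-sensitive)?
in_b <- vapply(a, function(x) any(vapply(b, identical, logical(1), x)),
               logical(1))
```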

"replace" function examples

I don't find the help page for the replace function from the base package to be very helpful. Worst part, it has no examples which could help understand how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well advised to look at that page for further details. Especially important is learning that the index argument (the second argument in replace) can be logical, numeric or character classed values. Recycling will occur when the second and third arguments differ in length.
You should "read" the function call as: "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
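To make the connection with [<- concrete, here is a small sketch of a function that behaves like replace for the cases above. my_replace is a hypothetical name used only for illustration.

```r
# Assign by index into a copy of x and return the copy
my_replace <- function(x, list, values) {
  x[list] <- values
  x
}

x <- c(a = 1, b = 2, c = 3, d = 4)

same_chr <- identical(my_replace(x, "b", 10), replace(x, "b", 10))
same_lgl <- identical(my_replace(x, x > 2, 10), replace(x, x > 2, 10))

# The original vector is untouched; the result must be assigned to keep it
unchanged <- identical(x, c(a = 1, b = 2, c = 3, d = 4))
```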
You can also use logical tests
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples:
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above the third parameter (value) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame, which has since been deprecated in favour of tibble::tibble):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=TRUE))
This will create something like this:
a b
(int) (chr)
1 1 X
2 2 Y
3 3 Y
4 4 X
5 5 Z
..etc
Now suppose you wanted to multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 is evaluated once, as a fixed vector, rather than as a reference to the matching rows of column 'a'. The vector 'a*2' will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the 'replace' operation. Thus, in the first row where 'b' equals "X", the value in 'a' will be replaced by 2. The second time, it will be replaced by 4, and so on; it will not be replaced by two times the value of 'a' in that particular row.
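One way to get the intended result is to subset the replacement values with the same condition, so that they line up with the positions being replaced; ifelse is an alternative. The sketch below uses a small hypothetical data frame with fixed values instead of the random tmp above.

```r
# Hypothetical data with a known pattern of "X" rows
tmp <- data.frame(a = 1:6,
                  b = c("X", "Y", "Y", "X", "Z", "X"),
                  stringsAsFactors = FALSE)

# Subset the values with the same condition used for the index
doubled <- with(tmp, replace(a, b == "X", a[b == "X"] * 2))

# Equivalent element-wise formulation
doubled_ifelse <- with(tmp, ifelse(b == "X", a * 2, a))
```

Here rows 1, 4 and 6 are doubled and the remaining rows keep their original values.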
Here's an example where I found the replace( ) function helpful for giving me insight. The problem required a long integer vector be changed into a character vector and with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for (i in 1:3) {
  test <- replace(test, test == i, letts[i])
}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C
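When the goal is only the recoding and not the factor, plain indexing into a lookup vector is another option that avoids the loop. A small sketch; codes and lookup are hypothetical names added for the case where the keys are not simply 1, 2, 3.

```r
test  <- c(rep(1, 3), rep(2, 2), rep(3, 1))
letts <- c("A", "B", "C")

# The integers 1..3 can index letts directly
by_index <- letts[test]

# For arbitrary keys, use a named lookup vector instead
codes   <- c(10, 30, 10, 20)
lookup  <- c("10" = "A", "20" = "B", "30" = "C")
by_name <- unname(lookup[as.character(codes)])
```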

R - preserve order when using matching operators (%in%)

I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces > [1] 1 2 3 which is the order "a" "b" "c" appears in the data frame. However, I would like to generate >[1] 2 1 3 which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
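Be aware that match returns only the first matching position for each element: values in vec that are missing from rownames(df) give NA (a row of NAs), duplicates in vec repeat the same row, and if the lookup table contains duplicates only the first is used.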
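The caveat about match can be seen on a small variation of the example, sketched here with a hypothetical vec that contains a missing name ("z") and a repeated one ("b").

```r
df  <- data.frame(row.names = letters[1:5], values = 1:5)
vec <- c("b", "z", "a", "b")

# Position of each element of vec among the row names; NA where absent
idx <- match(vec, rownames(df))

# Drop the NAs before indexing to avoid NA entries in the result
values <- df$values[idx[!is.na(idx)]]

# With duplicates in the lookup table, only the first position is returned
first_only <- match("a", c("a", "b", "a"))
```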
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
Use match, dropping the NA values for elements of vec that have no matching row name:
idx <- match(vec, rownames(df))
df[idx[!is.na(idx)], ]
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
  xSeq <- seq_along(x)
  names(xSeq) <- x
  Out <- xSeq[as.character(table)]
  Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() does an auto convert to character and table is changed with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
Following %in%, missing values are removed:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted; this is not the case with %ino%, which returns positions:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A
