Get a single value from a data frame in R

Say I have a data frame df such as :
col1 col2
x1 y1
x2 y2
with arbitrary values in each "cell".
How do I get a single value for a given cell ?
For instance to get the value of the cell in the first row and second column, doing this :
df[1,2]
works with numeric values, but with strings it returns the levels as well.
What is the proper way of getting a single value (for instance for use in a condition for a subset of another data frame) ?
EDIT
More details about what I need this for. Say I need to use values from df to subset another data frame df2 :
subset(df2, (id == SomeCommand(df[1,1])) & (name == SomeCommand(df[1,2])))
Is there any such "SomeCommand" that would reliably return a single value (w/o levels) of the appropriate type regardless of the type of the columns in df ?

R will go out of its way to figure out what you want, coercing types when it compares values. If you coerce to character, it should work. Here's a quick example.
> xy <- data.frame(a = c(0.1, 0.2, 0.3), b = factor(1:3), c = letters[1:3])
>
> xy$a == 0.1
[1] TRUE FALSE FALSE
> xy$a == "0.1"
[1] TRUE FALSE FALSE
> xy$b == "2"
[1] FALSE TRUE FALSE
> xy$b == 2
[1] FALSE TRUE FALSE
> xy$c == "a"
[1] TRUE FALSE FALSE
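As a minimal sketch of the "SomeCommand" the question asks about: wrapping the cell in as.character() drops the levels of a factor column, and plain numeric cells can be used as they are. The data frames and column names below are made up for illustration, and as.character() is one reasonable choice rather than the only one.

```r
# Hypothetical data: 'name' is stored as a factor, as in the question
df  <- data.frame(id = c(1, 2), name = factor(c("y1", "y2")))
df2 <- data.frame(id = c(1, 1, 2),
                  name = c("y1", "zz", "y2"),
                  score = c(5, 6, 7),
                  stringsAsFactors = FALSE)

# as.character() returns the label only, without the levels
cell <- as.character(df[1, 2])

# Use the single values from df to subset df2
hit <- df2[df2$id == df[1, 1] & df2$name == as.character(df[1, 2]), ]
hit
```

Only the row of df2 whose id and name both match the first row of df is kept.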

A common task is to obtain the value of one variable in a data frame given the value of one or more other columns in the same record. dplyr's filter() can be used for this. It may look clunky, but it works well on large data frames.
library(dplyr)
df
rnames col1 col2 col3
1 row1 1 3 a
2 row2 2 6 b
3 row3 3 9 c
4 row4 4 12 d
5 row5 5 15 e
To find the value of col1 given col3 == 'c':
a <- filter(df, col3=='c') # can specify multiple known column values
a #produces a data-frame with the record(s)
rnames col1 col2 col3
1 row3 3 9 c # which contains Col1 = 3
class(a)
[1] "data.frame"
But you can get the value of col1 in one line:
b <- filter(df, col3=='c')$col1
b
[1] 3
class(b)
[1] "numeric"
For a result with multiple values:
c <- filter(df, col1 > 3)$col3
c
[1] "d" "e" # a vector if there is more than one result
class(c)
[1] "character"
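For comparison, the same lookups can be sketched in base R with logical indexing, which avoids the dplyr dependency. The data frame below is rebuilt from the printed example above; the variable names b and d are only illustrative.

```r
# Rebuild the example data frame shown above
df <- data.frame(rnames = paste0("row", 1:5),
                 col1 = 1:5,
                 col2 = seq(3, 15, by = 3),
                 col3 = letters[1:5],
                 stringsAsFactors = FALSE)

# Value of col1 where col3 is "c"
b <- df$col1[df$col3 == "c"]

# Several values come back as an ordinary vector
d <- df$col3[df$col1 > 3]

b
d
```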

One way that works is to define the colClasses of your data frame when creating it, so that string columns are read as character rather than factor. For example:
my_table = read.table("myfile.txt", sep=" ", colClasses = c("character", "character", "numeric"))
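A small self-contained sketch of the same idea, reading from an inline string through the text argument of read.table instead of from "myfile.txt" (the sample contents are invented for illustration):

```r
# Two rows, three space-separated fields; contents are made up
my_table <- read.table(text = "x1 y1 1\nx2 y2 2",
                       sep = " ",
                       colClasses = c("character", "character", "numeric"))

# A string cell now comes back as a plain character value, with no levels
cell <- my_table[1, 2]
cell
class(cell)
```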

Related

compare dataframes and extract matching rows [duplicate]

Storing unique values of each column (of a df) in list

It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a data frame and store the results in a list, all using base R. Importantly, it is not combinations I need but simply the unique values of each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4],
                b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols)
{
x = unique(i)
unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col, as it shows as empty. I believe the problem is that i is being passed to the loop as text, not as a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
$a
[1] A B C D
Levels: A B C D

$b
[1] 1 2 3 4
Alternatively, there is apply, which is designed to run over the rows or columns:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
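Though if you want a list, lapply returns one, so it may be the better choice. Note also that apply first converts the data frame to a matrix, which is why the numbers above come back as strings.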
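The difference between the two calls can be checked directly. This is a small sketch with invented data, chosen so that both columns have the same number of unique values and apply returns a matrix:

```r
# Invented data: one character column, one integer column
df <- data.frame(a = c("A", "B", "B"), b = c(1L, 1L, 2L))

by_lapply <- lapply(df, unique)    # list, one element per column
by_apply  <- apply(df, 2, unique)  # df is converted to a character matrix first

str(by_lapply)
str(by_apply)
```

lapply keeps each column's own type, while the apply result is character throughout.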
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column in df, so unique(i) returns that name rather than the column's values.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4

data.frame rows to list named elements

I have a list of mixed types, vector and data frames of 2 columns.
> my.list
$a
[1] 1
$df1
key value
1 b 2
2 c 3
$df2
key value
1 d 4
2 e 5
I would like to end up with a list of vectors only, each data frame row would become a list element with column value as value and column key as element name.
So the result in this example would be :
$a
[1] 1
$b
[1] 2
$c
[1] 3
$d
[1] 4
$e
[1] 5
Here is how I currently achieve this:
my.list <- list(a = 1,
df1 = data.frame(key = c("b", "c"), value = 2:3),
df2 = data.frame(key = c("d", "e"), value = 4:5))
unlist(lapply(seq_along(my.list), function(i) {
if (is.data.frame(my.list[[i]])) {
with(my.list[[i]], as.list(setNames(value, key), all.names = TRUE))
} else {
setNames(my.list[i], names(my.list[i]))
}
}), recursive = FALSE)
But I don't really like this solution. Do you have a smarter way to achieve this? Thanks
You can do it in two steps in base R:
x = do.call(rbind, Filter(is.data.frame, my.list))
c(Filter(Negate(is.data.frame), my.list), as.list(setNames(x$value, x$key)))
$a
[1] 1
$b
[1] 2
$c
[1] 3
$d
[1] 4
$e
[1] 5
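The two steps above can be wrapped in a small helper. This is only a sketch: the name flatten_kv is hypothetical, and it assumes every data frame in the list has columns named key and value, as in the question.

```r
# Hypothetical helper: data frame rows become named list elements
flatten_kv <- function(lst) {
  is_df <- vapply(lst, is.data.frame, logical(1))
  rows  <- do.call(rbind, lst[is_df])          # stack all key/value frames
  c(lst[!is_df],                               # elements kept as they are
    as.list(setNames(rows$value, as.character(rows$key))))
}

my.list <- list(a = 1,
                df1 = data.frame(key = c("b", "c"), value = 2:3),
                df2 = data.frame(key = c("d", "e"), value = 4:5))

res <- flatten_kv(my.list)
str(res)
```

Note that the non-data-frame elements come first in the result, regardless of where they sat in the input list.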
One option (based on the example) would be to melt (from reshape2) to 'long' format, convert to data.table (setDT), replace the NA elements in 'key' with the corresponding values from 'L1', and split the 'value' based on 'key'.
library(data.table)
library(reshape2)
with(setDT(melt(my.list))[is.na(key), key := L1], split(value, key))

Can't match character value to multiple columns

I'm trying to match a character value, "C", to multiple columns in a dataframe. Here's part of the frame:
X1 X2
1 F F
2 C C
3 D D
4 A# A#
Here's what happens when I try to match the value "C":
> "C" %in% frame[, 1]
[1] TRUE
> "C" %in% frame[, 1:2]
[1] FALSE
Considering that "C" is in both columns, I can't figure out why it's returning false. Is there a function or operator that can test to see if a value is present in multiple columns? My goal is to create a function that can sum the number of times a character value like "C" is found in specified columns.
Try:
apply(frame, 2, function(u) "C" %in% u)
You can also use is.element:
apply(frame, 2, function(u) is.element("C",u))
You probably want to use grepl here, which returns a logical vector. Then you can count the number of occurrences with sum.
> frame
X1 X2
1 F F
2 C C
3 D D
4 A# A#
> grepl('C', frame$X1)
[1] FALSE TRUE FALSE FALSE
> sum(grepl('C', frame$X1))
[1] 1
and to count the total number of Cs across all columns you can use lapply (note: apply is better suited to matrices than to data frames, which are lists):
> sum(unlist(lapply(frame, function(col) grepl('C', col))))
[1] 2
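For the stated goal, a function that sums how often a value occurs in specified columns, a sketch along the following lines may help. The name count_value and its arguments are hypothetical. It uses == for exact matches, whereas grepl('C', ...) would also count cells that merely contain "C", such as "C#". The original "C" %in% frame[, 1:2] returns FALSE because a data frame is a list, so %in% compares "C" against whole columns rather than against individual cells.

```r
frame <- data.frame(X1 = c("F", "C", "D", "A#"),
                    X2 = c("F", "C", "D", "A#"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: count exact occurrences of 'value' in the given columns
count_value <- function(data, value, cols = names(data)) {
  sum(vapply(data[cols],
             function(col) sum(col == value, na.rm = TRUE),
             integer(1)))
}

n_all <- count_value(frame, "C")         # both columns
n_x1  <- count_value(frame, "C", "X1")   # first column only
n_all
n_x1
```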

Compare two character vectors in R

I have two character vectors of IDs.
I would like to compare the two character vectors, in particular I am interested in the following figures:
How many IDs are both in A and B
How many IDs are in A but not in B
How many IDs are in B but not in A
I would also love to draw a Venn diagram.
Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
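The three counts can also be collected in one call. The helper name compare_ids is made up for this sketch; note that intersect and setdiff drop duplicates, so the counts refer to distinct IDs.

```r
# Hypothetical helper returning the three counts asked for in the question
compare_ids <- function(a, b) {
  c(both   = length(intersect(a, b)),
    only_a = length(setdiff(a, b)),
    only_b = length(setdiff(b, a)))
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
counts <- compare_ids(A, B)
counts
```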
I'm usually dealing with large-ish sets, so I use a table instead of a Venn diagram:
xtab_set <- function(A,B){
both <- union(A,B)
inA <- both %in% A
inB <- both %in% B
return(table(inA,inB))
}
set.seed(1)
A <- sample(letters[1:20],10,replace=TRUE)
B <- sample(letters[1:20],10,replace=TRUE)
xtab_set(A,B)
# inB
# inA FALSE TRUE
# FALSE 0 5
# TRUE 6 3
Yet another way uses %in% and logical vectors of the common elements instead of intersect and setdiff. I take it you actually want to compare two vectors, not two lists: a list is an R class that may contain elements of any type, while a vector always contains elements of a single type, which makes it easier to compare what is truly equal. Here the elements are coerced to character strings, because character is the most general type present.
first <- c(1:3, letters[1:6], "foo", "bar")
second <- c(2:4, letters[5:8], "bar", "asd")
both <- first[first %in% second] # in both, same as call: intersect(first, second)
onlyfirst <- first[!first %in% second] # only in 'first', same as: setdiff(first, second)
onlysecond <- second[!second %in% first] # only in 'second', same as: setdiff(second, first)
length(both)
length(onlyfirst)
length(onlysecond)
#> both
#[1] "2" "3" "e" "f" "bar"
#> onlyfirst
#[1] "1" "a" "b" "c" "d" "foo"
#> onlysecond
#[1] "4" "g" "h" "asd"
#> length(both)
#[1] 5
#> length(onlyfirst)
#[1] 6
#> length(onlysecond)
#[1] 4
# If you don't have the 'gplots' package, type: install.packages("gplots")
require("gplots")
venn(list(first.vector = first, second.vector = second))
As was mentioned, there are multiple options for plotting Venn diagrams in R. Here is the output using gplots.
With sqldf: slower, but very suitable for data frames with mixed types:
library(sqldf)
t1 <- as.data.frame(1:10)
t2 <- as.data.frame(5:15)
sqldf1 <- sqldf('SELECT * FROM t1 EXCEPT SELECT * FROM t2') # subset from t1 not in t2
sqldf2 <- sqldf('SELECT * FROM t2 EXCEPT SELECT * FROM t1') # subset from t2 not in t1
sqldf3 <- sqldf('SELECT * FROM t1 UNION SELECT * FROM t2') # UNION t1 and t2
sqldf1
X1_10
1
2
3
4
sqldf2
X5_15
11
12
13
14
15
sqldf3
X1_10
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Using the same example data as one of the answers above.
A = c("Dog", "Cat", "Mouse")
B = c("Tiger","Lion","Cat")
match(A,B)
[1] NA 3 NA
The match function returns a vector with the location in B of all values in A. So, cat, the second element in A, is the third element in B. There are no other matches.
To get the matching values in A and B, you can do:
m <- match(A,B)
A[!is.na(m)]
"Cat"
B[m[!is.na(m)]]
"Cat"
To get the non-matching values in A and B (m is aligned with A, so it cannot be used to index B directly):
A[is.na(m)]
"Dog" "Mouse"
B[!(B %in% A)]
"Tiger" "Lion"
Further, you can use length() to get the total number of matching and non-matching values.
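As a sketch of the counting step, the match results can be summarised like this. Because match(A, B) is aligned with A, the values of B that have no partner are found by matching in the other direction; the variable names are illustrative only.

```r
A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")

m <- match(A, B)                       # position in B of each element of A

matched        <- A[!is.na(m)]         # present in both
unmatched_in_A <- A[is.na(m)]          # in A only
unmatched_in_B <- B[is.na(match(B, A))]  # in B only, matched the other way round

c(matched = length(matched),
  only_A  = length(unmatched_in_A),
  only_B  = length(unmatched_in_B))
```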
If A is a data.table with a list column a whose entries are themselves vectors of a primitive type, e.g. created as follows
library(data.table)
A<-data.table(a=c(list(c("abc","def","123")),list(c("ghi","zyx"))),d=c(9,8))
and B is a list with vector of primitive entries, e.g. created as follows
B<-list(c("ghi","zyx"))
and you're attempting to find which (if any) element of A$a matches B
A[sapply(a,identical,unlist(B))]
if you just want the entry in a
A[sapply(a,identical,unlist(B)),a]
if you want the matching indices of a
A[,which(sapply(a,identical,unlist(B)))]
if instead B is itself a data.table with the same structure as A, e.g.
B<-data.table(b=c(list(c("zyx","ghi")),list(c("abc","def",123))),z=c(5,7))
and you're looking for the intersection of the two lists by one column, where you require the same order of vector elements.
# give the entry in A for in which A$a matches B$b
A[,`:=`(res=unlist(sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)))
][res==TRUE
][,res:=NULL][]
# get T/F for each index of A
A[,sapply(list(a),function(x,y){
x %in% unlist(lapply(y,as.vector,mode="character"))
},list(B[,b]),simplify=FALSE)]
Note that you can't do something as easy as
setkey(A,a)
setkey(B,b)
A[B]
to join A&B because you cannot key on a field of type list in data.table 1.12.2
similarly, you cannot ask
A[a==B[,b]]
even if A and B are identical, as the == operator hasn't been implemented in R for type list
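Since == is not available for lists, an element-by-element comparison can be sketched with mapply, shown here on plain base R lists rather than data.table columns. The lists a and b are invented for illustration; identical requires the same order within each vector, while setequal ignores it.

```r
a <- list(c("abc", "def", "123"), c("ghi", "zyx"))
b <- list(c("abc", "def", "123"), c("zyx", "ghi"))

# Pairwise comparison of corresponding list entries
same_exact <- mapply(identical, a, b)  # order within each vector matters
same_set   <- mapply(setequal, a, b)   # order within each vector is ignored

same_exact
same_set
```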
