Compare two characters based on subset - R

I have a simple dataframe with two columns:
df <- data.frame(x = c(1, 1, 2, 2, 3),
                 y = c(rep(1:2, 2), 1),
                 target = c('a', 'a', 'a', 'b', 'a'))
I would like to compare the strings in the target column within every level of x (rows sharing the same x value) and find out whether they are equal or not, i.e., TRUE or FALSE.
First I would like to compare rows 1 and 2, then rows 3 and 4, and so on.
My problem is that some comparisons are incomplete; for example, row 5 has only one case instead of two, so it should turn out to be FALSE.
Variable y indicates the first and second case within x.
I played around with ddply doing something like:
ddply(df, .(x), summarise,
      ifelse(as.character(df[df$y == '1', ]$target),
             as.character(df[df$y == '2', ]$target), 0, 1))
which is ugly and does not work.
Any insights on how I could achieve this comparison?
Thanks

ddply(df, .(x), function(d) NROW(d) == 2 & d$target[1] == d$target[2])
This assumes you want the value to be TRUE only if there are exactly 2 rows with that 'x' value. If it is possible for there to be 3 or more, and you want it to be TRUE if all target values are identical, you could do:
ddply(df, .(x), function(d) NROW(d) > 1 & length(unique(d$target)) == 1)
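If you prefer dplyr over plyr, a roughly equivalent sketch (assuming dplyr is installed; object and column names are taken from the example above) would be:
library(dplyr)
# one logical per level of x: TRUE only when the group has exactly two rows
# and both target values match
df %>%
  group_by(x) %>%
  summarise(same = n() == 2 & first(target) == last(target))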

Here is a base R solution, assuming I have followed what you wanted correctly. foo() is a function that compares the two target values in each subset; we split() the data on df$x and apply foo() to each of the subsets with lapply()/sapply().
foo <- function(x) {
  with(x, {
    if (length(target) < 2) {
      FALSE
    } else {
      isTRUE(all.equal(target[1], target[2]))
    }
  })
}
lapply(split(df, df$x), foo)
sapply(split(df, df$x), foo)
This produces the following output:
> lapply(split(df, df$x), foo)
$`1`
[1] TRUE
$`2`
[1] FALSE
$`3`
[1] FALSE
>
> sapply(split(df, df$x), foo)
1 2 3
TRUE FALSE FALSE
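If you are worried about sapply() simplifying its result in unexpected ways, vapply() declares the expected return type explicitly; a minimal sketch reusing the same foo():
# logical(1) states that foo() must return a single logical per subset
vapply(split(df, df$x), foo, logical(1))
#     1     2     3
#  TRUE FALSE FALSE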

ave(as.character(df$target), df$x,
    FUN = function(z) length(z) == 2 && length(unique(z)) == 1)
[1] "TRUE" "TRUE" "FALSE" "FALSE" "FALSE"
Or, if you only want the results by group, use aggregate:
> aggregate(as.character(df$target), list(df$x),
+           FUN = function(z) length(z) == 2 && length(unique(z)) == 1)
Group.1 x
1 1 TRUE
2 2 FALSE
3 3 FALSE

Related

Determine which elements of a vector partially match a second vector, and which elements don't (in R)

I have a vector A containing a list of genera, which I want to use to subset a second vector, B. I have successfully used grepl to extract anything from B that has a partial match to the genera in A. Below is a reproducible example of what I have done.
But now I would like to get a list of which genera in A matched with something in B, and which genera did not, i.e. the "matched" list would contain Cortinarius and Russula, and the "unmatched" list would contain Laccaria and Inocybe. Any ideas on how to do this? In reality my vectors are very long, and the genus names in B are not all in the same position amongst the other info.
# create some dummy vectors
A <- c("Cortinarius","Laccaria","Inocybe","Russula")
B <- c("fafsdf_Cortinarius_sdfsdf","sdfsdf_Russula_sdfsdf_fdf","Tomentella_sdfsdf","sdfas_Sebacina","sdfsf_Clavulina_sdfdsf")
# extract the elements of B that have a partial match to anything in A.
new.B <- B[grepl(paste(A,collapse="|"), B)]
# But now how do I tell which elements of A were present in B, and which ones were not?
We could use lapply or sapply to loop over the patterns and then get a named output:
out <- setNames(lapply(A, function(x) grep(x, B, value = TRUE)), A)
Then it is easy to check which elements are empty and which are not:
> out[lengths(out) > 0]
$Cortinarius
[1] "fafsdf_Cortinarius_sdfsdf"
$Russula
[1] "sdfsdf_Russula_sdfsdf_fdf"
> out[lengths(out) == 0]
$Laccaria
character(0)
$Inocybe
character(0)
and get the names of those:
> names(out[lengths(out) > 0])
[1] "Cortinarius" "Russula"
> names(out[lengths(out) == 0])
[1] "Laccaria" "Inocybe"
You can use sapply with grepl to check each value of A against every value of B.
sapply(A, grepl, B)
# Cortinarius Laccaria Inocybe Russula
#[1,] TRUE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE TRUE
#[3,] FALSE FALSE FALSE FALSE
#[4,] FALSE FALSE FALSE FALSE
#[5,] FALSE FALSE FALSE FALSE
You can take the column-wise sums of these values to get the count of matches for each element of A.
result <- colSums(sapply(A, grepl, B))
result
#Cortinarius Laccaria Inocybe Russula
# 1 0 0 1
#values with at least one match
names(Filter(function(x) x > 0, result))
#[1] "Cortinarius" "Russula"
#values with no match
names(Filter(function(x) x == 0, result))
#[1] "Laccaria" "Inocybe"

How to delete from vector using ifelse condition in R

I have a vector a with values (1,2,3,4) and another vector b with values (1,1,0,1). Using the elements of b as flags, I want to remove the elements of a at the positions where b contains a 0.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for (i in 1:length(b)) {
  if (b[i] == 0) {
    a <- a[-i]
  }
}
I get the desired output
a
[1] 1 2 4
But using ifelse, I do not get the output as required.
a <- c(1,2,3,4)
b <- c(1,1,0,1)
for (i in 1:length(b)) {
  a <- ifelse(b[i] == 0, a[-i], a)
}
Output:
a
[1] 1
How to use ifelse in such situations?
I think ifelse isn't the correct function here: ifelse returns a result the same length as its test, which is length 1 inside your loop, so a collapses to a single value, while what we actually want is to subset the vector. You don't need a loop at all. You can directly do
a[b != 0]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)
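As a side note, the original for loop only happens to give the right answer because b contains a single 0; removing elements inside the loop shifts the remaining positions out of sync with b. A small sketch with a hypothetical b containing two zeros shows the problem:
a <- c(1, 2, 3, 4)
b <- c(0, 0, 1, 1)
for (i in 1:length(b)) {
  if (b[i] == 0) {
    a <- a[-i]  # after the first removal, position i no longer matches b
  }
}
a
#[1] 2 4   # the expected result is 3 4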
Another option could be:
a[as.logical(b)]
[1] 1 2 4
If you want to use ifelse, you can use the following code:
na.omit(ifelse(b == 0, NA, a))
which gives
> na.omit(ifelse(b == 0, NA, a))
[1] 1 2 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"
We can also use double negation:
a[!!b]
#[1] 1 2 4
data
a <- 1:4
b <- c(1, 1, 0, 1)

Compare 2 dataframes for equality in R

I have 2 dataframes with the same 2 columns. I want to check whether the datasets are identical. The original datasets have some 700K records, but I'm trying to figure out a way to do it using dummy datasets.
I tried using compare, identical, all, all_equal etc. None of them returns TRUE.
The dummy datasets are -
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
all(a==c)
[1] FALSE
compare(a,c)
FALSE [FALSE, FALSE]
identical(a,c)
[1] FALSE
all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"
The datasets are exactly the same except for the order of the records. If these functions only work when the datasets are mirror images of each other, then I must try something else. If that is the case, can someone help with how I get TRUE for these 2 (unordered) datasets?
dplyr's setdiff works on data frames, so I would suggest:
library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE
Note that this will not account for number of duplicate rows. (i.e., if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...
If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.
(a) adding an ID column
library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE
(b) sorting
a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE
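If data.table is an option, its fsetequal() offers (as far as I know) a single call that treats the rows of the two tables as multisets, so row order is ignored but duplicate counts still matter; a hedged sketch:
library(data.table)
fsetequal(as.data.table(a), as.data.table(c))
# [1] TRUE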
Maybe a function to sort the columns before comparison is what you need. But it will be slow on large dataframes.
unordered_equal <- function(X, Y, exact = FALSE) {
  X[] <- lapply(X, sort)
  Y[] <- lapply(Y, sort)
  if (exact) identical(X, Y) else all.equal(X, Y)
}
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE
a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE
Basically what you want may be to compare the ordered underlying matrices.
all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
You could wrap this into a function for more convenience:
om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))
all.equal(om(a), om(c))
# [1] TRUE
You can use the new package called waldo
library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
compare(a,c)
And you get:
`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`: 10 ...
`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`:

Simplify code in R

I have written code that runs through the columns of a dataframe and returns TRUE for each row that has the number 1 in any of those columns, storing the result in a vector the same length as a column of the dataframe. I would like to know if there is a way to simplify the code snippet below, since I will have to repeat it for several numbers.
n1 <- (tab[, 2] == 1 | tab[, 3] == 1 | tab[, 4] == 1 | tab[, 5] == 1 |
       tab[, 6] == 1 | tab[, 7] == 1 | tab[, 8] == 1 | tab[, 9] == 1 |
       tab[, 10] == 1 | tab[, 11] == 1 | tab[, 12] == 1 | tab[, 13] == 1 |
       tab[, 14] == 1 | tab[, 15] == 1 | tab[, 16] == 1)
One possible solution is the following: you test the dataframe for == 1 and then Reduce each row of the result with the | operator:
tab <- data.frame(a = 1:10, b = 2:11)
apply(tab == 1, 1, function(x) {
Reduce("|", x)
})
For this example it will give you the output:
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or one even simpler solution is:
apply(tab, 1, function(x) {
  any(x == 1)
})
The other comments and answers can work, but I suggest they are encouraging bad behavior when dealing with a data.frame. First and foremost, apply and rowSums expect a matrix as the data, and will happily coerce a data.frame to one. If any of the data.frame columns are character, then all columns will be converted to character. Some operations may still work as expected (e.g., == 1, since it will effectively be == "1", though number formatting or rounding during the conversion may cause undesired effects), but anything mathematical will not work.
As an example,
n <- 20
set.seed(2)
tab <- data.frame(
  a = as.character(sample(n, replace = FALSE)),
  b1 = sample(5, size = n, replace = TRUE),
  b2 = sample(5, size = n, replace = TRUE),
  stringsAsFactors = FALSE
)
str(tab)
# 'data.frame': 20 obs. of 3 variables:
# $ a : chr "4" "14" "11" "3" ...
# $ b1: int 4 2 5 1 2 3 1 2 5 1 ...
# $ b2: int 5 2 1 1 5 4 5 2 3 5 ...
apply(tab, 1, function(y) any(y == 1))
# [1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
apply(tab, 1, sum)
# Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
rowSums(tab == 1)
# [1] 0 0 1 2 0 0 1 0 0 1 2 2 0 0 0 0 0 1 0 1
rowSums(tab)
# Error in rowSums(tab) : 'x' must be numeric
There are some easy ways to deal with this. Given your example, it appears that columns 2:16 are numeric and the ones you are concerned about. If that's the case, then you can safely use either one of:
rowSums(tab[,2:16] == 1) # Frank's comment
apply(tab[,2:16], 1, function(y) any(y == 1)) # suggested by You-leee's answer
(the former being fairly specific, the latter can be extended to other functionality). If there's only one non-numeric column, one can always do
rowSums(tab[,-1,drop=FALSE] == 1)
apply(tab[,-1,drop=FALSE], 1, function(y) any(y == 1))
A third technique is to determine at run time which columns to choose:
isnum <- sapply(tab, is.numeric)
Reduce(`|`, lapply(tab[isnum], function(y) y == 1))
This was a little more complex, because the return from lapply is a list, but it still works fine. Realize that the use of isnum could be based on column names as well, using something like grepl. This method is fairly robust, too, in that it does not error if none of the columns match.
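Since the question mentions repeating this check for several numbers, the run-time column selection above can be wrapped in a small helper (has_value is a hypothetical name, not part of the original code):
# row-wise test: does any numeric column of d equal k?
has_value <- function(d, k) {
  isnum <- sapply(d, is.numeric)
  Reduce(`|`, lapply(d[isnum], function(y) y == k))
}
n1 <- has_value(tab, 1)
n2 <- has_value(tab, 2)  # and so on for the other numbers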

R: Appending values from row values specified by other row values to a list

I have a data frame with two columns - a group number and a name:
Group Name
1 A
4 B
2 C
3 D
4 E
I now want to make a list in which each element contains all the names that share a group.
I have tried this for loop:
myfun <- function(x, g1, g2, g3, g4) {
  for (j in 1:nrow(x)) {
    if (x[1, j] == 1) {
      list(g1, list(c = x[2, j]))
    } else if (x[1, j] == 2) {
      list(g2, list(c = x[2, j]))
    } else if (x[1, j] == 3) {
      list(g3, list(c = x[2, j]))
    } else if (x[1, j] == 4) {
      list(g4, list(c = x[2, j]))
    }
  }
}
where g1, g2, g3 and g4 are empty lists.
I get this error: Error in if (x[1, i] == 1) { : argument is of length zero.
Do I have the right approach?
Edit:
How can I search for and extract a list element by a value it contains (let's say I want the group with the name B in it)?
You can simplify your code (avoiding the loop entirely) by using an apply function (dat is the data):
res <- lapply(unique(dat$Group), function(g) unique(dat[dat$Group==g, "Name"]))
names(res) <- unique(dat$Group)
res[["4"]]
# [1] B E
# Levels: A B C D E
This creates a list whose names correspond to unique(dat$Group) and whose elements contain the unique "Name"s in each group.
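A base R shortcut worth noting is split(), which builds the same kind of named list in a single call; a short sketch on the same dat:
res2 <- lapply(split(dat$Name, dat$Group), unique)
res2[["4"]]
# [1] B E
# Levels: A B C D E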
Another solution, using plyr
library(plyr)
res <- dlply(dat, .(Group), function(x) unique(x$Name))
res[["4"]]
# [1] B E
# Levels: A B C D E
## If you want to extract all the groups with a "B" Name
inds <- unlist(lapply(res, function(x) "B" %in% x))
inds
# 1 2 3 4
# FALSE FALSE FALSE TRUE
## and to extract that Group
names(inds)[inds]
# [1] "4"
