Simplify code in R

I have written code that runs through the columns of a data frame and returns TRUE for each row that has the number 1 in any of them, storing the result in a vector with one element per row of the data frame. Is there a way to simplify the snippet below? I will have to repeat it for several numbers.
n1 <- (tab[, 2]==1| tab[, 3]==1 | tab[, 4]==1 | tab[, 5]==1 |
tab[, 6]==1 | tab[, 7]==1 | tab[, 8]==1 | tab[, 9]==1 |
tab[, 10]==1 | tab[, 11]==1 | tab[, 12]==1 | tab[, 13]==1 |
tab[, 14]==1 | tab[, 15]==1 | tab[, 16]==1)

One possible solution is the following: compare the whole data frame with 1 and then Reduce each row of the result with the | operator:
tab <- data.frame(a = 1:10, b = 2:11)
apply(tab == 1, 1, function(x) {
Reduce("|", x)
})
For this example it will give you the output of:
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Or one even simpler solution is:
apply(tab, 1, function(x) {
any(x == 1)
})
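Since the same test has to be repeated for several numbers, it may help to wrap it in a small function. The sketch below is only an illustration: has_value(), the toy tab and the column range 2:3 are made up for the example, and rowSums(...) > 0 is used in place of apply() on the assumption that the selected columns are all numeric.

```r
# Hypothetical helper (not from the answers above): TRUE for every row in which
# at least one of the chosen columns equals `value`.
has_value <- function(d, cols, value) {
  rowSums(d[, cols, drop = FALSE] == value, na.rm = TRUE) > 0
}

# Toy data; columns 2:3 stand in for columns 2:16 of the question.
tab <- data.frame(id = 1:4, a = c(1, 2, 3, 2), b = c(5, 1, 3, 2))

n1 <- has_value(tab, 2:3, 1)   # TRUE TRUE FALSE FALSE
n2 <- has_value(tab, 2:3, 2)   # FALSE TRUE FALSE TRUE

# One logical column per number of interest.
flags <- sapply(c(`1` = 1, `2` = 2, `3` = 3),
                function(v) has_value(tab, 2:3, v))
```

Each column of flags then plays the role of n1, n2, and so on.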

The other comments and answers can work, but I suggest they encourage bad habits when dealing with a data.frame. First and foremost, apply and rowSums expect a matrix as the data, and will happily coerce a data.frame to one. If any of the data.frame columns is character, then all columns will be converted to character. Some operations may still work as expected (e.g., == 1, since it will effectively be == "1" ... though some rounding errors may cause undesired effects), but anything mathematical will not work.
As an example,
n <- 20
set.seed(2)
tab <- data.frame(
a = as.character(sample(n, replace = FALSE)),
b1 = sample(5, size = n, replace = TRUE),
b2 = sample(5, size = n, replace = TRUE),
stringsAsFactors = FALSE
)
str(tab)
# 'data.frame': 20 obs. of 3 variables:
# $ a : chr "4" "14" "11" "3" ...
# $ b1: int 4 2 5 1 2 3 1 2 5 1 ...
# $ b2: int 5 2 1 1 5 4 5 2 3 5 ...
apply(tab, 1, function(y) any(y == 1))
# [1] FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
apply(tab, 1, sum)
# Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
rowSums(tab == 1)
# [1] 0 0 1 2 0 0 1 0 0 1 2 2 0 0 0 0 0 1 0 1
rowSums(tab)
# Error in rowSums(tab) : 'x' must be numeric
There are some easy ways to deal with this. Given your example, it appears that columns 2:16 are numeric and the ones you are concerned about. If that's the case, then you can safely use either one of:
rowSums(tab[,2:16] == 1) # Frank's comment
apply(tab[,2:16], 1, function(y) any(y == 1)) # suggested by You-leee's answer
(the former being fairly specific, while the latter can be extended to other functionality). If there's only one non-numeric column, one can always do
rowSums(tab[,-1,drop=FALSE] == 1)
apply(tab[,-1,drop=FALSE], 1, function(y) any(y == 1))
A third technique is to determine at run time which columns to choose:
isnum <- sapply(tab, is.numeric)
Reduce(`|`, lapply(tab[isnum], function(y) y == 1))
This is a little more complex because the return from lapply is a list (one logical vector per column, which Reduce then combines element-wise with |), but it still works fine. Note that isnum could be based on column names as well, using something like grepl. This method is fairly robust, too, in that it does not error if none of the columns match.
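The run-time column selection can be sketched on a small example. The data and the "^b" name pattern below are assumptions made for illustration; the character column a deliberately contains "1" to show that it is left out of the test.

```r
# Toy data: one character column and two integer columns.
tab <- data.frame(
  a  = c("x", "1", "y"),
  b1 = c(4L, 2L, 1L),
  b2 = c(5L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Select columns by type; vapply guarantees a logical vector.
isnum <- vapply(tab, is.numeric, logical(1))
any_one <- Reduce(`|`, lapply(tab[isnum], function(y) y == 1))

# Select columns by name instead (pattern is hypothetical).
isb <- grepl("^b", names(tab))
any_one_b <- rowSums(tab[isb] == 1) > 0
```

Both any_one and any_one_b have one element per row and ignore column a entirely.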

Related

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.
Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE
A solution similar to IceCreamToucan's, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
For any row number i:
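f <- function(i) {
  # The first row has nothing above it, so the condition holds trivially
  if (i == 1) return(TRUE)
  # For each column of the rows above row i, flag the names that also occur in row i
  r <- vapply(df[seq_len(i - 1), ], '%in%', unlist(df[i, ]), FUN.VALUE = logical(i - 1))
  # vapply returns a plain vector when i == 2, so rebuild one matrix row per earlier row
  out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4
  all(out_lgl)
}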
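If checking one row at a time is still too slow, one alternative (not taken from the answers above) is to compute every pairwise overlap in a single matrix product. This is only a sketch: it assumes no name is repeated within a row, and the names inc, overlap and ok are made up for the example.

```r
df <- as.data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)

m <- as.matrix(df)
lev <- unique(as.vector(m))          # every distinct name

# Incidence matrix: inc[i, k] is 1 when row i contains name lev[k].
inc <- t(apply(m, 1, function(r) lev %in% r)) * 1

# overlap[i, j] is the number of names shared by rows i and j.
overlap <- tcrossprod(inc)

# For each row: is the overlap with every earlier row at most 4?
ok <- vapply(seq_len(nrow(m)), function(i) {
  all(overlap[i, seq_len(i - 1)] <= 4)
}, logical(1))
```

Here ok[3] is FALSE because row 3 shares 5 names with row 2, matching the toy example in the question.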

Compare 2 dataframes for equality in R

I have 2 data frames with the same 2 columns. I want to check whether the datasets are identical. The original datasets have some 700K records, but I'm trying to figure out a way to do it using dummy datasets.
I tried using compare, identical, all, all_equal etc. None of them returns TRUE.
The dummy datasets are -
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
all(a==c)
[1] FALSE
compare(a,c)
FALSE [FALSE, FALSE]
identical(a,c)
[1] FALSE
all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"
The datasets are entirely the same, except for the order of the records. If these functions only work when the datasets match row for row, then I must try something else. If that is the case, can someone help me get TRUE for these 2 (unordered) datasets?
dplyr's setdiff works on data frames, so I would suggest
library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE
Note that this will not account for number of duplicate rows. (i.e., if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...
If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.
(a) adding an ID column
library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE
(b) sorting
a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE
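The sorting approach can be wrapped in a small base R function. This is a sketch under assumptions: same_rows() is a made-up name, it treats differing dimensions or column names as "not equal", and it uses all.equal rather than identical so that integer and double columns with equal values still match.

```r
# Hypothetical helper: TRUE when x and y contain the same rows,
# duplicates included, regardless of row order.
same_rows <- function(x, y) {
  if (!identical(dim(x), dim(y)) || !identical(names(x), names(y))) {
    return(FALSE)
  }
  sort_rows <- function(d) {
    # unname() keeps column names from being read as arguments of order()
    d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
    row.names(d) <- NULL
    d
  }
  isTRUE(all.equal(sort_rows(x), sort_rows(y)))
}

a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)

res_same <- same_rows(a, c)                              # TRUE
res_dups <- same_rows(a[c(1, 1, 2), ], a[c(1, 2, 2), ])  # FALSE
```

The second call shows that a different number of duplicates is detected, which the plain setdiff check does not do.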
Maybe a function that sorts each column independently before comparison is what you need. But it will be slow on large data frames.
unordered_equal <- function(X, Y, exact = FALSE){
X[] <- lapply(X, sort)
Y[] <- lapply(Y, sort)
if(exact) identical(X, Y) else all.equal(X, Y)
}
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE
a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE
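One caveat of sorting every column on its own is that the pairing of values within a row is lost. The sketch below repeats unordered_equal() from above so that it runs by itself; the data frames p and q are made up for the illustration.

```r
unordered_equal <- function(X, Y, exact = FALSE) {
  X[] <- lapply(X, sort)
  Y[] <- lapply(Y, sort)
  if (exact) identical(X, Y) else all.equal(X, Y)
}

# Same values in each column, but paired differently across rows.
p <- data.frame(x = 1:2, b = c(10, 20))   # rows (1, 10) and (2, 20)
q <- data.frame(x = 1:2, b = c(20, 10))   # rows (1, 20) and (2, 10)

res <- unordered_equal(p, q)   # TRUE, although no row of p appears in q
```

If rows must match as whole records, sorting the rows (as in the order-based answers) is the safer choice.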
Basically what you want may be to compare the ordered underlying matrices.
all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
You could wrap this into a function for more convenience:
om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))
all.equal(om(a), om(c))
# [1] TRUE
You can use the new package called waldo
library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
compare(a,c)
And you get:
`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`: 10 ...
`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`:

How to work with the rows of a data frame without coercing it into a character vector?

I have this data frame:
df <- data.frame(
a = c(0, 1, 0, 1),
b = c("a", "b", "c", "d")
)
# a b
# 1 0 a
# 2 1 b
# 3 0 c
# 4 1 d
Let's say I want to test each row for a condition and return either "ok" or "not ok". This should work:
apply(df, 1, function(row){
if (is.numeric(row[1]) & row[2] != "b") {
"ok"
} else {
"not ok"
}
})
# It should return: "ok" "not ok" "ok" "ok"
Unfortunately apply coerces the dataframe to a single type, so everything is seen as a character, so this is the output I get:
# "not ok" "not ok" "not ok" "not ok"
Is there a way to go through the rows of a dataframe preserving the data types? Maybe using dplyr::do or purrr::map?
Update
I know the conditions in the example don't make a lot of sense, but I was trying to simplify a more complex condition. I want to avoid using nested ifelse statements because they are not very readable.
A solution with ifelse() has been suggested in the comments and this is of course fine in your case:
df$c <- ifelse(is.numeric(df$a) & df$b != "b", "ok", "not ok")
df
## a b c
## 1 0 a ok
## 2 1 b not ok
## 3 0 c ok
## 4 1 d ok
But your more general question is how to apply a function over the rows of a data frame without converting it to a matrix. A possible way to do this is to use lapply (or one of its relatives, here vapply) over the row indices:
df$c <- vapply(seq_len(nrow(df)), function(i) {
  # df[i, 1] and df[i, 2] keep the types of their columns
  if (is.numeric(df[i, 1]) && df[i, 2] != "b") {
    "ok"
  } else {
    "not ok"
  }
}, character(1))
df
## a b c
## 1 0 a ok
## 2 1 b not ok
## 3 0 c ok
## 4 1 d ok
Again, in your situation, ifelse() is just fine. But if you want to do something more complicated with the rows of your data frame, applying over row indices might be the way to go.
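Another way to visit the rows without going through a matrix is to iterate over the columns in parallel with mapply, so that each value keeps its own type. This is a sketch, not part of the answer above; check_row() is a made-up name and stringsAsFactors = FALSE is an assumption about how df was built.

```r
df <- data.frame(
  a = c(0, 1, 0, 1),
  b = c("a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Called once per row with the row's values, each in its original type.
check_row <- function(a, b) {
  if (is.numeric(a) && b != "b") "ok" else "not ok"
}

df$c <- mapply(check_row, df$a, df$b, USE.NAMES = FALSE)
```

Because check_row() takes ordinary arguments, a more complex condition can be written with plain if/else instead of nested ifelse calls.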
The first half of this answer expands on and tries to explain @Joran's excellent comment/answer. It is mainly an exercise for me and my understanding, but hopefully it helps someone else too (and I'm happy to have my understanding corrected).
The second half shows a couple of other non-base solutions that could be used in more complex situations.
Joran's answer
c('not ok','ok')[(is.numeric(df[[1]]) & (df[[2]] != 'b')) + 1]
From ?data.frame
A data frame is a list of variables
so each column/variable in the data.frame is an element of that list
From ?[ and this question on the difference between [ and [[ we note that
For lists, one generally uses [[ to select any single element, whereas [ returns a list of the selected elements.
Therefore, using [[ in this solution selects a single element of the list
df[[1]] ## select the 1st column as a single element (which is a vector)
# [1] 0 1 0 1
df[[2]] ## select the 2nd column as a single element (which is a vector)
# [1] a b c d
## note that df[1] would return the first column as a data.frame (which is a list), not a vector
## we can see that by
# > str(df[1])
# 'data.frame': 4 obs. of 1 variable:
# $ a: num 0 1 0 1
# > str(df[[1]])
# num [1:4] 0 1 0 1
With these two vectors now selected we can perform the vectorised logical check on each element within them
is.numeric(df[[1]]) & (df[[2]] != 'b')
# TRUE FALSE TRUE TRUE
From ?logical we have
...with TRUE being mapped to 1L, FALSE to 0L...
so essentially TRUE == 1L and FALSE == 0L, which we can see by
sum(c(TRUE, TRUE, FALSE, TRUE))
# [1] 3
Now, taking a vector of our choices
c("not ok", "ok")
# [1] "not ok" "ok"
we can use [ again to select each element
c("not ok", "ok")[1]
# [1] "not ok"
c("not ok", "ok")[2]
# [1] "ok"
c("not ok", "ok")[3]
# [1] NA
## Because there isn't a 3rd element
c("not ok", "ok")[0]
# character(0) ## empty
## and we can use a vector to select each element
c("not ok", "ok")[c(1,2,1,3)]
# [1] "not ok" "ok" "not ok" NA
Which also means we can use our logical comparison from earlier to subset the choices. However, as FALSE is mapped to 0L, we need to add 1 to it so it will be able to select from the vector
c(TRUE, FALSE, TRUE, TRUE) + 1
# [1] 2 1 2 2
which gives
c("not ok", "ok")[c(2,1,2,2)]
# [1] "ok" "not ok" "ok" "ok"
Which now gives us the information we want to include in our original data.frame
df$c <- c("not ok", "ok")[c(2,1,2,2)]
# a b c
# 1 0 a ok
# 2 1 b not ok
# 3 0 c ok
# 4 1 d ok
Non-base solutions
## a dplyr version, still using ifelse construct
library(dplyr)
df %>%
mutate(c = ifelse(is.numeric(a) & b != "b", "ok", "not ok"))
## a couple of data.table versions using by-reference updates (:=)
library(data.table)
## using an ifelse
setDT(df)[, c := ifelse(is.numeric(a) & b != "b", "ok", "not ok")]
## using filters in i
setDT(df)[is.numeric(a) & b != "b", c := "ok"][is.na(c), c := "not ok"]

Unexpected behavior in subsetting aggregate function in R

I have a data frame with the following format:
manufacturer pricegroup leads
harley <2500 #
honda <5000 #
... ... ..
I am using the aggregate function to pull out data in the following way:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer==c("honda","harley")))
I noticed this is not returning the correct totals. The numbers for each manufacturer get smaller and smaller the more manufacturers I add to the subset group. However, if I use:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer=="honda" | manufacturer=="harley"))
It returns the correct numbers. For the life of me, I can't figure out why. I would just use the OR operator, except I will be passing a list of manufacturers in dynamically. Any thoughts as to why the first construct is not working? Better, any thoughts on how to make it work? Thanks!
The problem is that == is alternating between the values of "honda" and "harley" and comparing with the value in the relevant position of your "manufacturer" variable. On the other hand, %in% (as suggested by MrFlick) and | are checking across the entire "manufacturer" variable before deciding which values to mark.
== will recycle values to the length of what is being compared.
This might be easier to see with an example:
set.seed(1)
v1 <- sample(letters[1:5], 10, TRUE)
v2 <- c("a", "b") ## Will be recycled to rep(c("a", "b"), 5) when comparing with v1
data.frame(v1, v2,
`==` = v1 == v2,
`%in%` = v1 %in% v2,
`|` = v1 == "a" | v1 == "b",
check.names = FALSE)
# v1 v2 == %in% |
# 1 b a FALSE TRUE TRUE
# 2 b b TRUE TRUE TRUE
# 3 c a FALSE FALSE FALSE
# 4 e b FALSE FALSE FALSE
# 5 b a FALSE TRUE TRUE
# 6 e b FALSE FALSE FALSE
# 7 e a FALSE FALSE FALSE
# 8 d b FALSE FALSE FALSE
# 9 d a FALSE FALSE FALSE
# 10 a b FALSE TRUE TRUE
Notice that in the == column, the only TRUE value was where "v1" and the recycled values of "v2" were the same.
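Applied to the original question, a dynamically built list of manufacturers can be passed through %in% in the subset argument. The sketch below uses made-up data in the same layout as leaddata; wanted is a hypothetical name for the dynamic list.

```r
# Made-up data with the columns described in the question.
leaddata <- data.frame(
  manufacturer = c("honda", "harley", "honda", "yamaha", "harley", "honda"),
  pricegroup   = c("<2500", "<2500", "<5000", "<2500", "<2500", "<2500"),
  leads        = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

wanted <- c("honda", "harley")   # could be built at run time

res <- aggregate(leads ~ manufacturer + pricegroup, data = leaddata,
                 FUN = sum, subset = manufacturer %in% wanted)

# For comparison: == recycles `wanted` along the column and misses rows.
kept_in <- sum(leaddata$leads[leaddata$manufacturer %in% wanted])  # 17
kept_eq <- sum(leaddata$leads[leaddata$manufacturer == wanted])    # 6
```

The gap between kept_in and kept_eq is the undercounting described in the question.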

compare two characters based on subset

I have a simple dataframe with two columns:
df <- data.frame(x = c(1,1,2,2,3),
y = c(rep(1:2,2),1),
target = c('a','a','a','b','a'))
I would like to compare the strings in the target column (find out whether they are equal or not, i.e., TRUE or FALSE) within every level of x (same number for x).
First I would like to compare lines 1 and 2, then 3 and 4 ...
My problem is that I am missing some comparisons, for example, line 5 has only one case instead of two - so it should turn out to be FALSE.
Variable y indicates the first and second case within x.
I played around with ddply doing something like:
ddply(df, .(x), summarise,
ifelse(as.character(df[df$y == '1',]$target),
as.character(df[df$y == '2',]$target),0,1))
which is ugly ...
and does not work ...
Any insights how I could achieve this comparison?
Thanks
ddply(df, .(x), function(d) NROW(d) == 2 & d$target[1] == d$target[2])
This assumes you want the value to be TRUE only if there are exactly 2 rows with that 'x' value. If it is possible for there to be 3 or more, and you want it to be TRUE if all target values are identical, you could do:
ddply(df, .(x), function(d) NROW(d) > 1 & length(unique(d$target)) == 1)
Here is a base R solution, assuming I have understood what you want correctly. foo() is a function that compares the two target values in each subset; we split() the data on df$x and then lapply() or sapply() foo() over the subsets.
foo <- function(x) {
with(x, {if(length(target) < 2) {
FALSE
} else {
isTRUE(all.equal(target[1], target[2]))
}})
}
lapply(split(df, df$x), foo)
sapply(split(df, df$x), foo)
Which produces this output
> lapply(split(df, df$x), foo)
$`1`
[1] TRUE
$`2`
[1] FALSE
$`3`
[1] FALSE
>
> sapply(split(df, df$x), foo)
1 2 3
TRUE FALSE FALSE
ave(as.character(df$target), df$x,
FUN = function(z) length(z) == 2 && length(unique(z)) == 1)
[1] "TRUE" "TRUE" "FALSE" "FALSE" "FALSE"
Or ... if you only want the results by group ...., use aggregate:
> aggregate(as.character(df$target), list(df$x),
+ FUN = function(z) length(z) == 2 && length(unique(z)) == 1)
Group.1 x
1 1 TRUE
2 2 FALSE
3 3 FALSE
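A compact base R variant of the same per-group test uses tapply, which returns one value per level of x. This is a sketch rather than one of the answers above; stringsAsFactors = FALSE is an assumption so that target is compared as plain character.

```r
df <- data.frame(
  x = c(1, 1, 2, 2, 3),
  y = c(rep(1:2, 2), 1),
  target = c("a", "a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# TRUE only for groups with exactly two rows whose targets are equal.
res <- tapply(df$target, df$x, function(z) {
  length(z) == 2 && length(unique(z)) == 1
})
```

The names of res are the values of x, so res["3"] is FALSE because that group has a single row.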

Resources