Unexpected behavior in subsetting aggregate function in R

I have a data frame with the following format:
manufacturer pricegroup leads
harley <2500 #
honda <5000 #
... ... ..
I am using the aggregate function to pull out data in the following way:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer==c("honda","harley")))
I noticed this is not returning the correct totals. The numbers for each manufacturer get smaller and smaller the more manufacturers I add to the subset group. However, if I use:
aggregate( leads ~ manufacturer + pricegroup, data=leaddata,
FUN=sum, subset=(manufacturer=="honda" | manufacturer=="harley"))
It returns the correct numbers. For the life of me, I can't figure out why. I would just use the OR operator, except I will be passing a list of manufacturers in dynamically. Any thoughts as to why the first construct is not working? Better, any thoughts on how to make it work? Thanks!

The problem is that == is alternating between the values of "honda" and "harley" and comparing with the value in the relevant position of your "manufacturer" variable. On the other hand, %in% (as suggested by MrFlick) and | are checking across the entire "manufacturer" variable before deciding which values to mark.
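== recycles the shorter vector to the length of the longer one, so the comparison is positional (element 1 against "honda", element 2 against "harley", element 3 against "honda" again, and so on) rather than a membership test.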
This might be easier to see with an example:
set.seed(1)
v1 <- sample(letters[1:5], 10, TRUE)
v2 <- c("a", "b") ## Will be recycled to rep(c("a", "b"), 5) when comparing with v1
data.frame(v1, v2,
`==` = v1 == v2,
`%in%` = v1 %in% v2,
`|` = v1 == "a" | v1 == "b",
check.names = FALSE)
# v1 v2 == %in% |
# 1 b a FALSE TRUE TRUE
# 2 b b TRUE TRUE TRUE
# 3 c a FALSE FALSE FALSE
# 4 e b FALSE FALSE FALSE
# 5 b a FALSE TRUE TRUE
# 6 e b FALSE FALSE FALSE
# 7 e a FALSE FALSE FALSE
# 8 d b FALSE FALSE FALSE
# 9 d a FALSE FALSE FALSE
# 10 a b FALSE TRUE TRUE
Notice that in the == column, the only TRUE value was where "v1" and the recycled values of "v2" were the same.
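Applied to the original aggregate call, the %in% fix can be sketched as follows. The leaddata values here are made up for illustration, and wanted is a hypothetical name for the character vector of manufacturers that would be passed in dynamically:

```r
# Made-up data with the three columns described in the question
leaddata <- data.frame(
  manufacturer = c("honda", "harley", "honda", "harley", "bmw", "honda"),
  pricegroup   = c("<5000", "<2500", "<5000", "<2500", "<5000", "<2500"),
  leads        = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical vector of manufacturers supplied at run time
wanted <- c("honda", "harley")

# %in% tests each row for membership in `wanted`, so no recycling takes place
res <- aggregate(leads ~ manufacturer + pricegroup, data = leaddata,
                 FUN = sum, subset = manufacturer %in% wanted)
res
```

Because %in% returns one TRUE/FALSE per row regardless of how long wanted is, the totals should no longer shrink as more manufacturers are added.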

Related

How do I use the across function across all columns of a dataframe rather than specify certain columns?

I have the following script so far that successfully creates a new column in my dataframe and populates it with a sum of how many times the value "TRUE" appears for each row of the dataframe:
data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))
You will notice after I bring in the across function, I specify that I want to drop a column from the function. However, I actually don't want to drop any columns from my function. I tried writing something like
across(data_1 %in%, TRUE) to indicate I want to go across the whole dataframe/all columns, but this is not the correct syntax.
Also, I tried to do this a much simpler way using just rowSums and no mutate as follows:
data_1$True_Count <- rowSums(df == TRUE) but all this did was create an empty column called True_Count and did not count the occurrences of TRUE logical values in each row. I also tried the same thing using a random string value that I know occurs exactly one time in my dataset: data_1$True_Count <- rowSums(df == "banana") but this did the same thing -- it created an empty column and did not count the instance of banana in my dataset.
Lastly there was one more behavior that I did not understand. If I run the first code, data_1 <- data_1 %>% mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE))) more than once, the counts in the True_Count column cease to be correct.

It is really helpful if you share data in a reproducible format with the expected output so that everyone is on the same page regarding understanding of the question.
Since you did not share an example, I created one myself to explain the answer here. I have added 4 random columns with TRUE/FALSE values since it seems this is what your dataset contains.
data_1 <- data.frame(Community.name = c(T, F, T, F, F),
Community.code = c(T, F, F, T, T),
col1 = T,
col2 = c(F, F, T, F, F))
data_1
# Community.name Community.code col1 col2
#1 TRUE TRUE TRUE FALSE
#2 FALSE FALSE TRUE FALSE
#3 TRUE FALSE TRUE TRUE
#4 FALSE TRUE TRUE FALSE
#5 FALSE TRUE TRUE FALSE
Note that TRUE (logical) is different from "TRUE" (character). So first verify if your dataset contains logical values or character values before trying out the answers below.
This is your current code where you are dropping Community.name and calculating number of TRUE values in the dataset.
library(dplyr)
data_2 <- data_1 %>%
mutate(True_Count = rowSums(across(-c(Community.name), `%in%`, TRUE)))
data_2
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 2
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 2
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
This seems to work as expected: Community.name is ignored and the number of TRUE values in the remaining columns is counted for each row.
Now your question,
I actually don't want to drop any columns from my function.
For that you can use everything() in across to include all the columns.
data_3 <- data_1 %>%
mutate(True_Count = rowSums(across(everything(), `%in%`, TRUE)))
data_3
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 3
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 3
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
Also note that everything() is default in ?across.
Also, I tried to do this a much simpler way using just rowSums and no mutate
Yes, using rowSums without mutate is a much simpler way that gives the same answer.
data_1$True_Count <- rowSums(data_1)
data_1
# Community.name Community.code col1 col2 True_Count
#1 TRUE TRUE TRUE FALSE 3
#2 FALSE FALSE TRUE FALSE 1
#3 TRUE FALSE TRUE TRUE 3
#4 FALSE TRUE TRUE FALSE 2
#5 FALSE TRUE TRUE FALSE 2
Lastly there was one more behavior that I did not understand. If I run the first code, more than once, the counts in the True_Count column cease to be correct.
That might be because initially there is no True_Count column in the dataset. The first time you run the code, True_Count is added to data_1. When you run the code a second time, True_Count itself is included in the calculation, which is not what you want.
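One way to make the count safe to re-run is to leave True_Count out of the columns being counted. The following is a minimal base R sketch on the example data above; value_cols is a hypothetical helper name. It matters because a numeric 1 matches TRUE under %in%, so a previous count of 1 would itself be counted as a TRUE value:

```r
data_1 <- data.frame(Community.name = c(TRUE, FALSE, TRUE, FALSE, FALSE),
                     Community.code = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                     col1 = TRUE,
                     col2 = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# Count TRUE values per row, always excluding the result column itself
count_true <- function(d) {
  value_cols <- setdiff(names(d), "True_Count")
  rowSums(d[value_cols] == TRUE)
}

data_1$True_Count <- count_true(data_1)
first_run <- data_1$True_Count

# Running it again gives the same counts, because True_Count is skipped
data_1$True_Count <- count_true(data_1)
second_run <- data_1$True_Count

# Why the exclusion matters: a numeric 1 is treated as matching TRUE
1 %in% TRUE
```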

Is there an R function for checking membership in a vector?

I often run into a situation where I have two vectors and I want to check if each element of vector 1 is in vector 2. I typically do it with sapply() but wanted to know if there is a more concise way to do it or a single built-in function for this. For example:
v1 = c(1,1,3,4,5,7)
v2 = c(1,5)
# desired output: [1] TRUE TRUE FALSE FALSE TRUE FALSE
# my solution
sapply(v1, function(x) x %in% v2)

We can just use %in%, as it is already vectorized over its left-hand argument:
v1 %in% v2
#[1] TRUE TRUE FALSE FALSE TRUE FALSE
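As a small illustration of what %in% does, the sketch below compares it with the equivalent match() call and shows the negated form. The variable names in_v2, via_match and not_in_v2 are only for illustration:

```r
v1 <- c(1, 1, 3, 4, 5, 7)
v2 <- c(1, 5)

# One logical per element of v1: is it found anywhere in v2?
in_v2 <- v1 %in% v2

# The same test written with match(); unmatched elements get 0
via_match <- match(v1, v2, nomatch = 0L) > 0L
identical(in_v2, via_match)

# Negate the whole expression to get "not in"
not_in_v2 <- !(v1 %in% v2)

# The logical vector can be used directly for subsetting
v1[in_v2]
```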

R: Test for overlap of name values in dataframe

I have a dataframe filled with names.
For a given row in the dataframe, I'd like to compare that row to every row above it in the df and determine if the number of matching names is less than or equal to 4 for every row.
Toy Example where row 3 is the row of interest
"Jim","Dwight","Michael","Andy","Stanley","Creed"
"Jim","Dwight","Angela","Pam","Ryan","Jan"
"Jim","Dwight","Angela","Pam","Creed","Ryan" <--- row of interest
So first we'd compare row 3 to row 1 and see that the name overlap is 3, which meets the <= 4 criteria.
Then we'd compare row 3 to row 2 and see that the name overlap is 5 which fails the <= 4 criteria, ultimately returning a failed condition for being <=4 for every row above it.
Right now I am doing this operation using a for loop but the speed is much too slow for the dataframe size I am working with.

Example data
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
df
# V1 V2 V3 V4 V5 V6
# 1 Jim Dwight Michael Andy Stanley Creed
# 2 Jim Dwight Angela Pam Ryan Jan
# 3 Jim Dwight Angela Pam Creed Ryan
Operation and output (sapply over columns with %in% and take rowSums)
out_lgl <- rowSums(sapply(df, '%in%', unlist(df[3,]))) <= 4
out_lgl
# [1] TRUE FALSE FALSE
which(out_lgl)
# [1] 1
Explanation:
For each column, each element is compared to the third row (the vector unlist(df[3,])). The output is a matrix of logical values with the same dimensions as df, TRUE if there is a match.
sapply(df, '%in%', unlist(df[3,]))
# V1 V2 V3 V4 V5 V6
# [1,] TRUE TRUE FALSE FALSE FALSE TRUE
# [2,] TRUE TRUE TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE
Then we can sum the TRUEs to see the number of matches for each row
rowSums(sapply(df, '%in%', unlist(df[3,])))
# [1] 3 5 6
Edit:
I have added the stringsAsFactors = FALSE option to the creation of df above. However, as far as I can tell the output of %in% is the same whether comparing factors with different levels or characters, so I don't believe this could change the results in any way. See example below
x <- c('b', 'c', 'z')
y <- c('a', 'b', 'g')
all.equal(x %in% y, factor(x) %in% factor(y))
# [1] TRUE

Similar solution as IceCreamToucan, but for any row.
For the data.frame:
df <- as.data.frame(rbind(
c("Jim","Dwight","Michael","Andy","Stanley","Creed"),
c("Jim","Dwight","Angela","Pam","Ryan","Jan"),
c("Jim","Dwight","Angela","Pam","Creed","Ryan")
), stringsAsFactors = FALSE)
For any row number i:
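f <- function(i) {
  # The first row has no rows above it, so the condition holds trivially
  if (i == 1) return(TRUE)
  # For each column, flag which of the rows above row i hold a name found in row i
  r <- vapply(df[1:(i-1), ], '%in%', unlist(df[i, ]), FUN.VALUE = logical(i-1))
  # vapply returns a plain vector when i == 2, so rebuild an (i-1)-row matrix before summing
  out_lgl <- rowSums(matrix(r, nrow = i - 1)) <= 4
  all(out_lgl)
}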
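To run this kind of check for every row at once, one possible sketch works on a character matrix instead of the data frame. The helper name overlap_ok and its max_overlap argument are hypothetical, and like the answers above it counts how many cells of an earlier row appear in the row of interest:

```r
df <- as.data.frame(rbind(
  c("Jim", "Dwight", "Michael", "Andy", "Stanley", "Creed"),
  c("Jim", "Dwight", "Angela", "Pam", "Ryan", "Jan"),
  c("Jim", "Dwight", "Angela", "Pam", "Creed", "Ryan")
), stringsAsFactors = FALSE)
m <- as.matrix(df)

# TRUE if row i shares at most max_overlap names with every row above it
overlap_ok <- function(i, m, max_overlap = 4) {
  if (i == 1) return(TRUE)                       # nothing above the first row
  target <- m[i, ]
  prev <- m[seq_len(i - 1), , drop = FALSE]      # keep matrix shape even for one row
  # %in% flattens prev column by column; matrix() restores the (i-1)-row layout
  overlaps <- rowSums(matrix(prev %in% target, nrow = i - 1))
  all(overlaps <= max_overlap)
}

res <- vapply(seq_len(nrow(m)), overlap_ok, logical(1), m = m)
res
```

On the toy data, row 3 fails because it shares five names with row 2, while rows 1 and 2 pass.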

Identifying duplicate rows only with respect to some columns

I would like to create a variable (e.g. reap) taking the value TRUE only if the elements of some columns are duplicates of those of another row BUT the values on other columns are different.
The sample data will probably clarify my question:
V1 V2 V3
1. a b c
2. a b d
3. e f g
4. e f g
For example, if we want a variable taking value TRUE when rows have same V1 and V2 but different V3, then this variable should look like the following:
V1 V2 V3 reap
1. a b c TRUE
2. a b d TRUE
3. e f g FALSE
4. e f g FALSE
Thanks a lot for your help.

One idea would be to identify all duplicates of each column, and create a logical vector using rowSums and setting the condition != ncol(df)
rowSums(sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))) != ncol(df)
#[1] TRUE TRUE FALSE FALSE
To only consider the third column
m1 <- sapply(df, function(i) duplicated(i)|duplicated(i, fromLast = TRUE))
rowSums(m1) == 2 & !m1[,3]
#[1] TRUE TRUE FALSE FALSE
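The approach above checks each column separately. An alternative sketch, assuming the goal is "same V1 and V2 but more than one distinct V3 within that group", groups on the key columns with ave(). The name n_distinct is only for illustration:

```r
df <- data.frame(V1 = c("a", "a", "e", "e"),
                 V2 = c("b", "b", "f", "f"),
                 V3 = c("c", "d", "g", "g"),
                 stringsAsFactors = FALSE)

# For every row, count the distinct V3 values within its (V1, V2) group.
# V3 is converted to integer codes so that ave() returns a numeric result.
n_distinct <- ave(as.integer(factor(df$V3)), df$V1, df$V2,
                  FUN = function(x) length(unique(x)))

# TRUE when the group shares V1 and V2 but holds more than one V3 value
df$reap <- n_distinct > 1
df
```

This version compares the key columns as a combination rather than one column at a time, which may matter when a value repeats in V1 without the matching V2.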

R: Choosing specific number of combinations from all possible combinations

Let's say we have the following dataset
set.seed(144)
dat <- matrix(rnorm(100), ncol=5)
The following function creates all possible combinations of columns and removes the first
(cols <- do.call(expand.grid, rep(list(c(F, T)), ncol(dat)))[-1,])
# Var1 Var2 Var3 Var4 Var5
# 2 TRUE FALSE FALSE FALSE FALSE
# 3 FALSE TRUE FALSE FALSE FALSE
# 4 TRUE TRUE FALSE FALSE FALSE
# ...
# 31 FALSE TRUE TRUE TRUE TRUE
# 32 TRUE TRUE TRUE TRUE TRUE
My question is: how can I calculate only the single, pairwise and triple combinations?
Choosing the rows including no more than 3 TRUE values using the following function works for this vector: cols[rowSums(cols)<4L, ]
However, it gives following error for larger vectors mainly because of the error in expand.grid with long vectors:
Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) :
invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow
Any suggestions that would allow me to compute only the single, pairwise and triple combinations?

You could try either
cols[rowSums(cols) < 4L, ]
Or
cols[Reduce(`+`, cols) < 4L, ]

You can use this solution:
col.i <- do.call(c,lapply(1:3,combn,x=5,simplify=F))
# [[1]]
# [1] 1
#
# [[2]]
# [1] 2
#
# <...skipped...>
#
# [[24]]
# [1] 2 4 5
#
# [[25]]
# [1] 3 4 5
Here, col.i is a list every element of which contains column indices.
How it works: combn generates all combinations of the numbers from 1 to 5 (requested by x=5) taken m at a time (simplify=FALSE ensures that the result is a list). lapply loops m over 1 to 3 and returns a list of lists. do.call(c,...) flattens that list of lists into a plain list.
You can use col.i to get certain columns from dat using e.g. dat[,col.i[[1]],drop=F] (1 is an index of the column combination, so you could use any number from 1 to 25; drop=F makes sure that when you pick just one column from dat, the result is not simplified to a vector, which might cause unexpected program behavior). Another option is to use lapply, e.g.
lapply(col.i, function(cols) dat[,cols])
which will return a list of matrices, each containing a certain subset of the columns of dat (a single-column combination comes back as a plain vector unless drop=F is added).
In case you want to get column indices as a boolean matrix, you can use:
col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
# [,1] [,2] [,3] [,4] [,5]
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE TRUE FALSE FALSE
# ...
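The same idea can be written for an arbitrary number of columns without calling expand.grid at all. This is a minimal base R sketch; n, max_size, col_idx and col_lgl are illustrative names, not part of the original answer:

```r
n <- 5          # number of columns in the data
max_size <- 3   # largest combination size to keep

# All combinations of 1, 2, ..., max_size column indices, as one flat list
col_idx <- do.call(c, lapply(seq_len(max_size),
                             function(m) combn(n, m, simplify = FALSE)))

# Logical matrix with one row per combination and one column per data column
col_lgl <- t(vapply(col_idx, function(z) seq_len(n) %in% z, logical(n)))

length(col_idx)                        # number of combinations generated
sum(choose(n, seq_len(max_size)))      # expected count from the binomial coefficients
dim(col_lgl)
```

Only the combinations that are actually wanted are generated, so the number of rows grows with choose(n, 1) + choose(n, 2) + choose(n, 3) rather than with 2^n.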
[UPDATE]
A more efficient implementation:
library("gRbase")
coli <- function(x=5,m=3) {
col.i <- do.call(c,lapply(1:m,combnPrim,x=x,simplify=F))
z <- lapply(seq_along(col.i), function(i) x*(i-1)+col.i[[i]])
v.b <- rep(F,x*length(col.i))
v.b[unlist(z)] <- TRUE
matrix(v.b,ncol=x,byrow = TRUE)
}
coli(70,5) # takes about 30 sec on my desktop
