Select and count the number of duplicate items with two different outcome values? - r

Long-time follower, thanks so much for all your help over the years! I have a question that might have an easy answer, but I failed in googling it, and trying various subsetting and bracket notation also fell short. I'm betting someone here has encountered a similar problem.
I have a long-form data set with a set of duplicate ids. I also have a third variable that might differ between the duplicates. For example, you can recreate my data set with:
x <- c("a", "a", "b", "c", "c", "d", "d", "d")
y <- c("z", "z", "z", "y", "y", "y", "x", "x")
z <- c(10, 20, 10, 10, 10, 10, 10, 20)
df <- cbind(x, y, z)
df <- as.data.frame(df)
names(df) <- c("id1", "id2", "var1")
df
I want to select the rows in which an id2 value has BOTH a 10 and a 20 connected to the same id1. For example, 'z' has two observations connected to id1 'a' with two different var1 values (a '10' and a '20').
I want to select these cases, as well as count how many cases like this are in the overall data set. Thanks in advance!

One way is with ddply from the plyr package. Something like this:
> library(plyr)
> ddply(df, c('id2', 'id1'), function(x) if(length(unique(x$var1))==2) x)
id1 id2 var1
1 d x 10
2 d x 20
3 a z 10
4 a z 20
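The ddply call selects the rows but does not count the cases. Here is a minimal base R sketch of both steps. It assumes the data frame is built with data.frame() so that var1 stays numeric (cbind() would turn it into character). The names n_values, selected and n_cases are only illustrative.

```r
# Rebuild the example data, keeping var1 numeric
df <- data.frame(
  id1  = c("a", "a", "b", "c", "c", "d", "d", "d"),
  id2  = c("z", "z", "z", "y", "y", "y", "x", "x"),
  var1 = c(10, 20, 10, 10, 10, 10, 10, 20)
)

# For every row, the number of distinct var1 values in its (id1, id2) group
n_values <- ave(df$var1, df$id1, df$id2,
                FUN = function(v) length(unique(v)))

# Rows belonging to groups that carry two different var1 values
selected <- df[n_values == 2, ]
selected

# Number of (id1, id2) groups of that kind
n_cases <- nrow(unique(selected[c("id1", "id2")]))
n_cases
```

If the condition has to be exactly "a 10 and a 20" rather than "any two different values", the function passed to ave() could instead test all(c(10, 20) %in% v).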

Related

How do I Identify by row id the values in a data frame column not in another data frame column?

How do I identify, by row id, the values in data frame d2 column c3 that are not in data frame d1 column c1? My which call returns all records when subsetting as shown. I need to keep this bracket-style subsetting rather than the value$field form, which works:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the ids of the rows in y where the specified column does not include values of x.
I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
Output:
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"
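As for why the original which() call returns every row: d1["c1"] is a one-column data frame rather than a vector, so %in% treats the whole column as a single element and none of the values in y match it. Here is a small sketch, assuming the bracket style has to stay, that extracts a plain vector with [, "c1"] instead:

```r
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = FALSE)
d2 <- data.frame(c3, c4, stringsAsFactors = FALSE)

# [, "c1"] returns a character vector; ["c1"] would return a data frame
x <- unique(d1[, "c1"])
y <- d2[, "c3"]

# Row ids of d2 whose c3 value does not occur in d1's c1
id <- which(!(y %in% x))
id
```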
I had the same problem, and this code worked for me. However, the bracket indexing did not work for me. With a slight change it worked perfectly:
includes <- setdiff(d2$c3, d1$c1)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you, @jpsmith

R add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A" for example contains "a", "b", "c", and "d". What I am looking for is a three dimensional array with dimensions of length(unique(V2)) (and the names "a" to "e" as dimnames).
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" these include c("a", "b", "c"), c("a", "b", "d"), and c("b", "c", "d")).
Each of these "three-item-co-occurrences" should be regarded as a coordinate in the three-dimensional array and therefore be added to the frequency count which the values in the array should display. The outcome should be the following array
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
dim = c(5, 5, 5),
dimnames = list(c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e")))
I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) # for the permutations
foo <- function(x) {
  # all combinations:
  combs <- combn(x, 3, simplify = FALSE)
  # all permutations for each of the combinations:
  combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
  # tabulate:
  do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
# apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
# check
all((res - ar)^2 == 0)
#[1] TRUE
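For readers without gtools, the same counting idea can be sketched in base R. This is an illustrative alternative, not the answer's method: it assumes every V2 item occurs at most once per V1 group, and it builds all ordered triples of distinct items with expand.grid() instead of combining combn() with permutations().

```r
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a", "b", "c", "d", "a", "c", "d", "a", "b", "d", "e")

items <- sort(unique(V2))
res <- array(0, dim = rep(length(items), 3),
             dimnames = list(items, items, items))

for (grp in split(V2, V1)) {
  # every ordered triple of items found on this object
  trip <- expand.grid(i = grp, j = grp, k = grp, stringsAsFactors = FALSE)
  # keep only triples made of three different items
  trip <- trip[trip$i != trip$j & trip$i != trip$k & trip$j != trip$k, ]
  # a character matrix indexes the array by its dimnames
  idx <- cbind(trip$i, trip$j, trip$k)
  res[idx] <- res[idx] + 1
}

res["a", "b", "d"]  # "a", "b" and "d" occur together on objects "A" and "C"
```

Because items are unique within a group, each triple appears once per group, so the in-place increment never hits the same cell twice in one iteration.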
I used to use the cross join CJ() from data.table to obtain the pairwise count of all combinations of two different V2 items:
library(data.table)
res <- setDT(df)[,CJ(unique(V2), unique(V2)), V1][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times V1 and V2 appear with the same value of V1 from the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not stored. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large array of 600 x 600 x 600.

Subsetting data from a dataframe and taking specific values from the subsetted values

I want to check whether values (in the example below, "letters") in one data frame appear in another data frame. If they do, I want a value (in the example below, "ranking") that is specific to that value in the first data frame to be added to the second data frame. What I have now is the following:
Df1 <- data.frame(c("A", "C", "E"), c(1:3))
colnames(Df1) <- c("letters", "ranking")
Df2 <- data.frame(c("A", "B", "C", "D", "E"))
colnames(Df2) <- c("letters")
Df2$rank <- ifelse(Df2$letters %in% Df1$letters, 1, 0)
However, instead of getting a '1' when the letters overlap, I want to get the specific 'ranking' number from Df1.
Thanks!
What you're looking for is called a merge:
merge(Df2, Df1, by="letters", all.x=TRUE)
Also, fun fact, you can create a dataframe and name the columns at the same time (and you'll usually want to "turn off" strings as factors):
df1 <- data.frame(
letters = c("a", "b", "c"),
ranking = 1:3,
stringsAsFactors = FALSE)
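If you would rather keep Df2 as it is and only add a column, a lookup with match() is a common base R alternative to merge. The sketch below rebuilds the question's data with named columns; the rank column name follows the question's own attempt.

```r
Df1 <- data.frame(letters = c("A", "C", "E"), ranking = 1:3,
                  stringsAsFactors = FALSE)
Df2 <- data.frame(letters = c("A", "B", "C", "D", "E"),
                  stringsAsFactors = FALSE)

# match() gives, for each letter in Df2, its row position in Df1 (NA if absent)
Df2$rank <- Df1$ranking[match(Df2$letters, Df1$letters)]
Df2
```

Unlike merge, this keeps the original row order of Df2, and letters without a match get NA.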
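The dplyr package is best for this.
library(dplyr)
Df2 <- Df2 %>%
left_join(Df1, by = "letters")
This keeps every row of Df2 and shows NA for the letters without a match ("B" and "D").
Otherwise you can use semi_join:
DF2 <- Df2 %>%
semi_join(Df1, by = "letters")
This keeps only the rows whose letters the two data frames have in common (the intersection). Note that semi_join only filters Df2 and does not add the ranking column.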

Preserving zero length groups with aggregate

I just noticed that aggregate drops empty groups from the result. How can I solve this? E.g.:
xx <- c("a", "b", "d", "a", "d", "a")
xx <- factor(xx, levels = c("a", "b", "c", "d"))
y <- rnorm(60, 5, 1)
z <- matrix(y, 6, 10)
aggregate(z, by = list(groups = xx), sum)
xx is a factor variable with 4 levels, but the result gives just 3 rows, and I would like a row for the "c" level with zeros. I would like the same behavior as table(xx), which gives frequencies even for levels with no observations.
We can create another data.frame with just the levels of 'xx' and then merge with the aggregate. The output will have all the 'groups' while the row corresponding to the missing level for the other columns will be NA.
merge(data.frame(groups=levels(xx)),
aggregate(z, by = list(groups = xx), sum), all.x=TRUE)
Another option might be to convert to 'long' format with melt and then use dcast with fun.aggregate as 'sum' and drop=FALSE
library(data.table)
dcast(melt(data.table(groups=xx, z), id.var='groups'),
groups~variable, value.var='value', sum, drop=FALSE)
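Since the question asks for zeros rather than NA in the row for the empty level, the merge approach can be followed by one replacement step. This is a minimal sketch; set.seed() is added only to make the example reproducible, and the names agg and out are illustrative.

```r
set.seed(1)
xx <- factor(c("a", "b", "d", "a", "d", "a"), levels = c("a", "b", "c", "d"))
z <- matrix(rnorm(60, 5, 1), 6, 10)

# Aggregate as before, then merge against the full set of levels
agg <- aggregate(z, by = list(groups = xx), sum)
out <- merge(data.frame(groups = levels(xx)), agg, all.x = TRUE)

# The row for the empty level "c" is all NA after the merge; turn it into zeros
out[is.na(out)] <- 0
out
```

Replacing every NA with 0 is only safe here because z itself contains no missing values. With real data it may be better to restrict the replacement to the rows for the empty levels.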

Order data frame by two columns in R

I'm trying to reorder the rows of a data frame by two factors. For the first factor I'm happy with the default ordering. For the second factor I'd like to impose my own custom order on the rows. Here's some dummy data:
dat <- data.frame(apple=rep(LETTERS[1:10], 3),
orange=c(rep("agg", 10), rep("org", 10), rep("fut", 10)),
pear=rnorm(30, 10),
grape=rnorm(30, 10))
I'd like to order "apple" in a specific way:
appleOrdered <- c("E", "D", "J", "A", "F", "G", "I", "B", "H", "C")
I've tried this:
dat <- dat[with(dat, order(orange, rep(appleOrdered, 3))), ]
But it seems to put "apple" into a random order. Any suggestions? Thanks.
Reordering the factor levels:
dat[with(dat, order(orange, as.integer(factor(apple, appleOrdered)))), ]
Try using a factor with the levels in the desired order and the arrange function from plyr:
library(plyr)
dat$apple <- factor(dat$apple, levels = appleOrdered)
arrange(dat, orange, apple)
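The original attempt goes wrong because rep(appleOrdered, 3) is used as a sort key tied to row positions, so the rows are sorted alphabetically by a key that has nothing to do with their own apple value. A sketch of an equivalent base R fix that leaves the apple column untouched uses match() to turn each apple value into its position in the desired order (the name sorted is illustrative):

```r
dat <- data.frame(apple = rep(LETTERS[1:10], 3),
                  orange = c(rep("agg", 10), rep("org", 10), rep("fut", 10)),
                  pear = rnorm(30, 10),
                  grape = rnorm(30, 10))
appleOrdered <- c("E", "D", "J", "A", "F", "G", "I", "B", "H", "C")

# match() returns the rank of each apple value within appleOrdered
sorted <- dat[order(dat$orange, match(dat$apple, appleOrdered)), ]
head(sorted, 10)
```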
