R: how to subset duplicate rows of a data.frame

set.seed(3)
mydata <- data.frame(id = c(1:5),
                     score = rnorm(5, 0, 1))
ids <- c(1, 2, 3, 3)
> subset(mydata, id %in% ids)
id score
1 1 -0.9619334
2 2 -0.2925257
3 3 0.2587882
I would like to subset all rows of mydata whose id matches an entry in ids. The catch is that ids contains the number 3 twice, but subset seems to extract only the unique matching rows, I'm guessing because of the %in% operator. However, my desired output is
> subset(mydata, id %in% ids)
id score
1 1 -0.9619334
2 2 -0.2925257
3 3 0.2587882
4 3 0.2587882
I've also tried to use the == operator instead. However, that didn't seem to do the trick.

Rather than using %in%, try its sister function match():
mydata[match(ids, mydata$id), ]
This returns one row per entry in ids, duplicates included. (match() returns the position of the first match only, which is exactly right here because mydata$id itself has no duplicates.)
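To see the difference concretely on the example data above: %in% returns one logical per row of mydata, so duplicates in ids can never add rows, whereas match() returns one row index per element of ids, duplicates included. A quick check (output as I'd expect under the same seed):
# %in%: one TRUE/FALSE per row of mydata
mydata$id %in% ids
# [1]  TRUE  TRUE  TRUE FALSE FALSE
# match: one index per element of ids
match(ids, mydata$id)
# [1] 1 2 3 3
mydata[match(ids, mydata$id), ]
#     id      score
# 1    1 -0.9619334
# 2    2 -0.2925257
# 3    3  0.2587882
# 3.1  3  0.2587882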

Related

R - Compare column values in data frames of differing lengths by unique ID

I'm sure I can figure out a straightforward solution to this problem, but I didn't see a comparable question, so I thought I'd post one.
I have a longitudinal dataset with thousands of respondents over several time intervals. Everything from the questions to the data types can differ between the waves, and constructing indicators or dummy variables often requires long series of bools; but each respondent has a unique ID, and no additional respondents are added to the surveys after the first wave, so that much is easy enough.
The issue is that while the early waves consist of one (Stata) file each, the later waves contain lots of addendum files, structured differently. So, for example, in constructing indicators for the sex of previous partners, one wave had columns called partnerNum and sex, with up to 16 rows for each unique ID (respondent). It was easy enough to spread (or cast) that data into a single row per unique ID, with columns partnerNum_1 ... partnerNum_16 holding the values from the sex column, in partnerDF. Then it's easy to construct indicators like:
sexuality$newIndicator[mainDF$bioSex == "Male" & apply(partnerDF[1:16] == "Male", 1, any)] <- 1
For other addendum files in the last two waves, the data is structured long like the partner data, with multiple rows per unique ID; but rather than just one variable like sex, there are hundreds I need to test against to construct indicators, all coded with different types, so it's impractical to spread (or cast) the data wide (never mind writing those bools). There are actually several of these files for each wave, and the way they are structured, some respondents (unique IDs) occupy just one row, some a few dozen. (I've left_join'ed the addendum files together for each wave.)
What I'd like to be able to do is test something like:
newDF$indicator[any(waveIIIAdds$var1 == 1) & any(waveIIIAdds$var2 == 1)] <- 1
or
newDF$indicator[mainDF$var1 == 1 & any(waveIIIAdds$var2 == 1)] <- 1
where newDF is the same length as mainDF (one row per unique ID).
So, for example, if I had two dfs.
df1 <- data.frame(ID = c(1:4), A = rep("a"))
df2 <- data.frame(ID = rep(1:4, each=2), B = rep(1:2, 2), stringsAsFactors = FALSE)
df1$A[1] <- "b"
df1$A[3] <- "b"
df2$B[8] <- 3
> df1
  ID A
1  1 b
2  2 a
3  3 b
4  4 a
> df2
  ID B
1  1 1
2  1 2
3  2 1
4  2 2
5  3 1
6  3 2
7  4 1
8  4 3
I'd like to test something like this (assuming df3 has one column, just the unique IDs from df1):
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] & df2$ID[df2$B == 2]] <- 1
So df3 would have one unique ID per row; since there is an "a" in df1$A for all IDs but df1$A[1], and a 2 in at least one row of df2$B for every ID except the last (df2$B[7:8]), the result would be:
> df3
ID new
1 0
2 1
3 1
4 0
and
df3$new <- 0
df3$new[df1$ID[df1$A == "a"] | df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 1
4 0
This does it...
df3 <- data.frame(ID = unique(df1$ID),
                  new = sapply(unique(df1$ID), function(x)
                    as.numeric(x %in% df1$ID[df1$A == "a"] &
                                 x %in% df2$ID[df2$B == 2])))
df3
ID new
1 1 1
2 2 1
3 3 1
4 4 0
I came up with a parsimonious solution after thinking about the problem for a few minutes on returning to it (rather than in the wee hours of the morning of the post).
I wanted something that a graduate student, who will likely construct thousands of indicators or dummy variables this way and may learn R first, or even only ever learn R, could use. The following provides a solution for both the example and the actual data using the same schema.
If the data frame was already created with the IDs and the dummy-indicator column initialized to zero, as assumed in the example:
df3 <- data.frame(ID = df1$ID)
df3$new <- 0
My solution was:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] & df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 0
2 1
3 0
4 1
Using | (or) instead:
df3$new[df1$ID %in% df1$ID[df1$A == "a"] | df1$ID %in% df2$ID[df2$B == 2]] <- 1
> df3
ID new
1 1
2 1
3 0
4 1
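For what it's worth, a grouped-aggregation variant of the same idea (a sketch of mine, not from the answers above; it assumes df1 has one row per ID and that A is a character column, i.e. stringsAsFactors = FALSE) collapses df2 to one logical per ID first:
# Does any row for this ID in df2 have B == 2? tapply() gives one named
# logical per ID, which we align to df1's IDs by name.
hasA <- df1$A == "a"
has2 <- tapply(df2$B == 2, df2$ID, any)[as.character(df1$ID)]
df3  <- data.frame(ID = df1$ID, new = as.numeric(hasA & has2))
Swapping & for | gives the or version.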

Assigning unique variable from a data.frame

This is a similar question to this one, but my output results are different.
Take the data:
example <- data.frame(var1 = c(2,3,3,2,4,5),
                      var2 = c(2,3,5,4,2,5),
                      var3 = c(3,3,4,3,4,5))
Now I want to create example$Identity, which takes a value from 1:x for each unique var1 value.
I have used
example$Identity <- apply(example[,1], 2, function(x)(unique(x)))
but I am not familiar with the correct way to set up function().
The output of example$Identity should be 1,2,2,1,3,4
This:
example$Identity <- as.numeric(as.factor(example$var1))
will give you the desired result:
> example$Identity
[1] 1 2 2 1 3 4
Wrapping as.factor in as.numeric converts each value to the integer code of its factor level, counting from 1 upwards.
Or you can use match
example$Identity <- with(example, match(var1, unique(var1)))
If the unique values appear in increasing order, as they do here, findInterval can also be used:
findInterval(example$var1, unique(example$var1))
#[1] 1 2 2 1 3 4
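One caveat worth knowing (my addition, not from the answers): as.numeric(as.factor(...)) numbers values by sorted level, while match(var1, unique(var1)) numbers them by order of first appearance. They agree above only because var1's values happen to first appear in increasing order:
v <- c(5, 2, 2, 5, 3)
as.numeric(as.factor(v))  # 3 1 1 3 2 -- codes follow the sorted levels
match(v, unique(v))       # 1 2 2 1 3 -- codes follow order of appearance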

Finding unique tuples in R but ignoring order

Since my data is much more complicated, I made a smaller sample dataset (I left the reshape in to show how I generated the data).
set.seed(7)
x = rep(seq(2010,2014,1), each=4)
y = rep(seq(1,4,1), 5)
z = matrix(replicate(5, sample(c("A", "B", "C", "D"))))
temp_df = cbind.data.frame(x,y,z)
colnames(temp_df) = c("Year", "Rank", "ID")
head(temp_df)
require(reshape2)
dcast(temp_df, Year ~ Rank)
which results in...
> dcast(temp_df, Year ~ Rank)
Using ID as value column: use value.var to override.
Year 1 2 3 4
1 2010 D B A C
2 2011 A C D B
3 2012 A B D C
4 2013 D A C B
5 2014 C A B D
Now I essentially want to use a function like unique, but ignoring order, to find where the first 3 elements are unique.
Thus in this case:
I would have A,B,C in row 5
I would have A,B,D in rows 1 & 3
I would have A,C,D in rows 2 & 4
I also need counts of these "unique" events.
Two more things. First, my values are strings, and I need to leave them as strings.
Second, if possible, I would have a column between Year and 1 called Weighting, and when counting these unique combinations I would include each row's weighting. This isn't as important, because all weightings will be small positive integer values, so I could duplicate rows earlier to account for weighting and then tabulate unique pairs.
You could do something like this:
df <- dcast(temp_df, Year ~ Rank)
combos <- apply(df[, 2:4], 1, function(x) paste0(sort(x), collapse = ""))
combos
# 1 2 3 4 5
# "BCD" "ABC" "ACD" "BCD" "ABC"
For each row of the data frame, the values in columns 1, 2, and 3 (as labeled in the post) are sorted using sort, then concatenated using paste0. Since order doesn't matter, this ensures that identical cases are labeled consistently.
Note that the paste0 function is equivalent to paste(..., sep = ""). The collapse argument says to concatenate the values of a vector into a single string, with vector values separated by the value passed to collapse. In this case, we're setting collapse = "", which means there will be no separation between values, resulting in "ABC", "ACD", etc.
Then you can get the count of each combination using table:
table(combos)
# ABC ACD BCD
# 2 1 2
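The question also mentions a Weighting column. That column doesn't exist in the sample data, but if it did, the counts could be weight-summed per combination with tapply (a sketch under that assumption, using stand-in weights):
df$Weighting <- 1                  # hypothetical weights, all 1 for illustration
tapply(df$Weighting, combos, sum)  # weighted count per combination
# ABC ACD BCD
#   2   1   2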
This is the same solution as @Alex_A's, but using tidyverse functions:
library(purrr)
library(dplyr)
df <- dcast(temp_df, Year ~ Rank)
distinct(df, ID = pmap_chr(select(df, num_range("", 1:3)),
                           ~ paste0(sort(c(...)), collapse = "")))
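To get the counts as well (as the question asks), a small extension of the same idea, not part of the original answer, swaps distinct() for count():
df$combo <- pmap_chr(select(df, num_range("", 1:3)),
                     ~ paste0(sort(c(...)), collapse = ""))
count(df, combo)
# This should tally each combination, e.g. combo = ABC, ACD, BCD with
# n = 2, 1, 2 for the data above.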

How to label ties when creating a variable capturing the most frequent occurrence of a group?

In the following example, how do I ask R to identify a tie as "tie" when I want to determine the most frequent value within a group?
I am basically following on from a previous question that used which.max or which.is.max and a custom function (Create a variable capturing the most frequent occurence by group), but I want to acknowledge ties as a tie. Any ideas?
df1 <- data.frame(
  id = c(rep(1,3), rep(2,3)),
  v1 = as.character(c("a","b","b", rep("c",3)))
)
I want to create a third variable freq that contains the most frequent observation in v1 by id, but also identifies ties as "tie".
From previous answers, this code works to create the freq variable, but just doesn't deal with the ties:
library(plyr)

myFun <- function(x){
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
You could slightly modify your function by testing whether the maximum count occurs more than once, which is what sum(tbl == max(tbl)) computes, and then proceed accordingly.
df1 <- data.frame(
  id = rep(1:2, each = 4),
  v1 = rep(letters[1:4], c(2,2,3,1))
)

myFun <- function(x){
  tbl  <- table(x$v1)
  nmax <- sum(tbl == max(tbl))
  if (nmax == 1)
    x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  else
    x$freq <- "tie"
  x
}
ddply(df1, .(id), .fun = myFun)
id v1 freq
1 1 a tie
2 1 a tie
3 1 b tie
4 1 b tie
5 2 c c
6 2 c c
7 2 c c
8 2 d c
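If you'd rather avoid plyr, the same tie-aware rule can be applied per group with base R's ave(), since it recycles a length-one result across each group (a sketch of mine, not from the original answer):
# For each id, return "tie" if the maximum count is shared, else the modal value.
df1$freq <- ave(as.character(df1$v1), df1$id, FUN = function(v) {
  tbl <- table(v)
  if (sum(tbl == max(tbl)) > 1) "tie" else names(tbl)[which.max(tbl)]
})
For the example data, df1$freq should then match the ddply output above.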

The rules of subsetting

Having df1 and df2 as follows:
df1 <- read.table(text =" x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2",header=TRUE)
df2 <- read.table(text =" a b c
1 1 1
1 2 8
1 1 2
2 6 2",header=TRUE)
I can ask of the data a bunch of things like:
df2[df2$b == 6 | df2$c == 8, ]  # rows where b == 6 or c == 8 in df2
# and additive conditions:
df2[df2$b == 6 & df2$c == 8, ]  # zero rows
and between data.frames:
df1[df1$z %in% df2$c, ]  # rows in df1 where values in z appear in c (all rows)
This gives me all rows:
df1[ (df1$x %in% df2$a) &
(df1$y %in% df2$b) &
(df1$z %in% df2$c) ,]
but shouldn't this give me all rows of df1 too:
df1[ df1$z %in% df2$c | df1$b == 9,]
What I am really hoping to do is to subset df1 and df2 on three column conditions, so that I only get rows in df1 where a, b, c all equal x, y, z at the same time within a row. In real data I will have more than 3 columns, but I will still want to subset on 3 additive column conditions.
So subsetting my example df1 on df2, my result would be:
  x y z
  1 1 1
  1 1 2
Playing with the syntax has confused me more, and the SO posts are all variations on what I want that actually led to more confusion for me.
I figured out I can do this:
merge(df1,df2, by.x=c("x","y","z"),by.y=c("a","b","c"))
which gives me what I want, but I would like to understand why I am wrong in my [ attempts.
In addition to your nice solution using merge (thanks for that; I always forget merge), this can be achieved in base R using ?interaction as follows. There may be other variations of this, but this is the one I am familiar with:
> df1[interaction(df1) %in% interaction(df2), ]
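For the example data, this should return exactly the two rows requested:
#   x y z
# 1 1 1 1
# 3 1 1 2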
Now to answer your question: First, I think there's a typo (corrected) in:
df1[ df1$z %in% df2$c | df2$b == 9,] # second part should be df2$b == 9
With that correction, the first part evaluates to
[1] TRUE TRUE TRUE TRUE TRUE
and the second evaluates to:
[1] FALSE FALSE FALSE FALSE
You then apply | to vectors of unequal length (5 and 4), which triggers the warning
longer object length is not a multiple of shorter object length
and R recycles the shorter vector, which is rarely what you want.
Edit: If you have multiple columns, you can restrict the interaction to just those columns. For example, to get from df1 the rows where the first two columns match those of df2, you could simply do:
> df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ]
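To demystify interaction() a little (my illustration, not part of the answer): it builds a single factor whose labels encode each row's combination of column values, so the row-wise comparison reduces to %in% on those labels. Pasting the columns together does essentially the same thing:
# Combine each row's values into one string, then test membership row-wise.
do.call(paste, df1) %in% do.call(paste, df2)
# [1]  TRUE FALSE  TRUE FALSE FALSE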
