Merging two vectors, but keeping unique elements of only one vector - r

Maybe an easy question, but I failed nevertheless.
I have two vectors:
v1 <- c("A", "B", "C", "F", "G")
v2 <- c( "B", "C", "F", "G", "H","I")
I want to merge v1 and v2 to obtain a vector which contains all common elements and all unique elements of v2, but does not include any unique elements of v1.
Essentially, remove all "FALSE" of
> v1 %in% v2
[1] FALSE TRUE TRUE TRUE TRUE
but keep all "FALSE" of
> v2 %in% v1
[1] TRUE TRUE TRUE TRUE FALSE FALSE
plus any common element.
Desired output:
c("B", "C", "F", "G", "H","I")
Thank you very much!
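One possible approach, sketched with base R set operations: keep what the two vectors share, then append what only v2 has. It assumes v2 holds no repeated values that need to be preserved, since union() and intersect() drop duplicates.

```r
v1 <- c("A", "B", "C", "F", "G")
v2 <- c("B", "C", "F", "G", "H", "I")

# elements common to both vectors, followed by elements found only in v2
res <- union(intersect(v1, v2), setdiff(v2, v1))
res
#[1] "B" "C" "F" "G" "H" "I"
```

Every element of the result comes from v2, so unique(v2) holds the same set of values; for other inputs the order may differ.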

Related

Fastest validation of sorting vector of pairs of elements until they are unorderly paired

I have an unsorted vector of length N. Each element of the vector is present precisely twice (the vector length is an even number). I have a custom sorting algorithm, and the goal is to iterate until the vector achieves a state in which each element is adjacent to its copy.
Unsorted vector = {A,F,J,E,F,A,J,E}
A valid sorted state = {A,A,J,J,E,E,F,F}
Another valid sorted state = {J,J,A,A,F,F,E,E}
So my question lies in what is the fastest way to check if a sorted state is valid so that I can speed up my iterations? For long vectors, this will dictate most of my scaling ability.
Something quick and dirty but I'm not sure it will always work:
all(duplicated(x) == c(FALSE,TRUE))
This is relying on the fact that the two same values will always be next to each other, one not-duplicated, the next duplicated. Seems to work with the test sets:
x <- c("A", "F", "J", "E", "F", "A", "J", "E")
s1 <- c("A", "A", "J", "J", "E", "E", "F", "F")
s2 <- c("J", "J", "A", "A", "F", "F", "E", "E")
all(duplicated(x) == c(FALSE,TRUE))
#[1] FALSE
all(duplicated(s1) == c(FALSE,TRUE))
#[1] TRUE
all(duplicated(s2) == c(FALSE,TRUE))
#[1] TRUE
And it is pretty quick, looking through a vector of a million pairs (two million elements) in 5 hundredths of a second on my machine:
x <- rep(1:1e6, each=2)
system.time(all(duplicated(x) == c(FALSE,TRUE)))
# user system elapsed
# 0.04 0.00 0.05
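The same idea can also be written as a direct comparison of each odd-position element with its right-hand neighbour. This is only a sketch of an alternative, not one of the benchmarked answers; it assumes an even-length vector without NA values, and the name is_paired is made up for illustration.

```r
is_paired <- function(x) {
  # logical indices are recycled: c(TRUE, FALSE) picks positions 1, 3, 5, ...
  # and c(FALSE, TRUE) picks positions 2, 4, 6, ...
  all(x[c(TRUE, FALSE)] == x[c(FALSE, TRUE)])
}

is_paired(c("A", "F", "J", "E", "F", "A", "J", "E"))
#[1] FALSE
is_paired(c("A", "A", "J", "J", "E", "E", "F", "F"))
#[1] TRUE
```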
Another option is to convert the vector (its length is even and each element is present exactly twice) to a two-row matrix, take the unique rows, and test whether the number of rows is 1. Because matrix() fills column by column, each adjacent pair lands in one column, so if duplicated values are adjacent, the second row will be exactly the same as the first.
f1 <- function(x)
{
nrow(unique(matrix(x, nrow = 2))) == 1
}
-testing
> v1 <- c("A", "F", "J", "E", "F", "A", "J", "E")
> v2 <- c("A", "A", "J", "J", "E", "E", "F", "F")
> v3 <- c("J", "J", "A", "A", "F", "F", "E", "E")
> f1(v1)
[1] FALSE
> f1(v2)
[1] TRUE
> f1(v3)
[1] TRUE
Or slightly faster
f2 <- function(x)
{
sum(duplicated(matrix(x, nrow = 2))) == 1
}
-testing
> f2(v1)
[1] FALSE
> f2(v2)
[1] TRUE
> f2(v3)
[1] TRUE
-benchmarks
#thelatemail
> f3 <- function(x) all(duplicated(x) == c(FALSE,TRUE))
#TarJae
> f4 <- function(x) {rle_obj <- rle(x); all(rle_obj$lengths > 1)}
> x1 <- rep(1:1e8, each = 2)
> system.time(f1(x1))
user system elapsed
2.649 0.456 3.111
> system.time(f2(x1))
user system elapsed
2.258 0.433 2.694
> system.time(f3(x1))
user system elapsed
9.972 1.272 11.233
> system.time(f4(x1))
user system elapsed
7.051 3.281 10.333
Another option is to use the rle function:
v1 <- c("A", "F", "J", "E", "F", "A", "J", "E")
v2 <- c("A", "A", "J", "J", "E", "E", "F", "F")
v3 <- c("J", "J", "A", "A", "F", "F", "E", "E")
rle_obj <- rle(v3)
all(rle_obj$lengths > 1)
test:
> rle_obj <- rle(v1)
> all(rle_obj$lengths > 1)
[1] FALSE
> rle_obj <- rle(v2)
> all(rle_obj$lengths > 1)
[1] TRUE
> rle_obj <- rle(v3)
> all(rle_obj$lengths > 1)
[1] TRUE

R add all combinations of three values of a vector to a three-dimensional array

I have a data frame with two columns. The first one "V1" indicates the objects on which the different items of the second column "V2" are found, e.g.:
V1 <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C")
V2 <- c("a","b","c","d","a","c","d","a","b","d","e")
df <- data.frame(V1, V2)
"A" for example contains "a", "b", "c", and "d". What I am looking for is a three dimensional array with dimensions of length(unique(V2)) (and the names "a" to "e" as dimnames).
For each unique value of V1 I want all possible combinations of three V2 items (e.g. for "A" it would be c("a", "b", "c"), c("a", "b", "d"), c("a", "c", "d"), and c("b", "c", "d")).
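As a quick illustration using base R's combn() (the variable names below exist only for this sketch), the four items found on "A" give choose(4, 3) = 4 such combinations:

```r
items_A <- c("a", "b", "c", "d")   # the V2 items found on "A"
combs_A <- combn(items_A, 3, simplify = FALSE)
length(combs_A)
#[1] 4
sapply(combs_A, paste, collapse = "")
#[1] "abc" "abd" "acd" "bcd"
```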
Each of these "three-item-co-occurrences" should be regarded as a coordinate in the three-dimensional array and therefore be added to the frequency count which the values in the array should display. The outcome should be the following array
ar <- array(data = c(0,0,0,0,0,0,0,1,2,1,0,1,0,2,0,0,2,2,0,1,0,1,0,1,0,
0,0,1,2,1,0,0,0,0,0,1,0,0,1,0,2,0,1,0,1,1,0,0,1,0,
0,1,0,2,0,1,0,0,1,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,
0,2,2,0,1,2,0,1,0,1,2,1,0,0,0,0,0,0,0,0,1,1,0,0,0,
0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0),
dim = c(5, 5, 5),
dimnames = list(c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e"),
c("a", "b", "c", "d", "e")))
I was wondering about the 3D symmetry of your result. It took me a while to understand that you want to have all permutations of all combinations.
library(gtools) #for the permutations
foo <- function(x) {
#all combinations:
combs <- combn(x, 3, simplify = FALSE)
#all permutations for each of the combinations:
combs <- do.call(rbind, lapply(combs, permutations, n = 3, r = 3))
#tabulate:
do.call(table, lapply(asplit(combs, 2), factor, levels = letters[1:5]))
}
#apply grouped by V1, then sum the results
res <- Reduce("+", tapply(df$V2, df$V1, foo))
#check
all((res - ar)^2 == 0)
#[1] TRUE
I used to use the data.table cross join CJ() to obtain the pairwise count of all combinations of two different V2 items:
res <- setDT(df)[,CJ(unique(V2), unique(V2)), V1][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
This code creates a data frame res with three columns. V1 and V2 contain the respective items of V2 from the original data frame df, and N contains the count (how many times V1 and V2 appear with the same value of V1 from the original data frame df).
Now, I found that I could perform this crossjoin with three 'dimensions' as well by just adding another unique(V2) and adapting the rest of the code accordingly.
The result is a data frame with four columns. V1, V2, and V3 indicate the original V2 items and N again shows the number of mutual appearances with the same original V1 objects.
res <- setDT(df)[,CJ(unique(V2), unique(V2), unique(V2)), V1][V1!=V2 & V1 != V3 & V2 != V3,
.N, .(V1,V2,V3)][order(V1,V2,V3)]
The advantage of this code is that empty combinations (those which do not appear at all) are not considered. It worked with 1,000,000 unique values in V1 and over 600 unique items in V2, which would otherwise have required an extremely large array of 600 x 600 x 600.

Get changing patterns with grep -- R

I want to grade several students' exams using an answer key with grep. So for example, the student's answers were
A B B C E D D
and the key is
A D B C E CD ABD
I want to check to see if the student's answers are found in the corresponding position in the answer key (multiple letters indicate "or" not "and". So "C" or "D"). How would I go about that using grep?
Or we can use Map/mapply from base R
unname(mapply(grepl, answer, key))
#[1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE
data
answer <- c("A", "B", "B", "C", "E", "D", "D")
key <- c("A", "D", "B", "C", "E", "CD", "ABD")
We can use the map2_lgl function from the purrr package with grepl. TRUE means the answer matches the key at that position; FALSE means no match.
# Create example of answer and key
answer <- c("A", "B", "B", "C", "E", "D", "D")
key <- c("A", "D", "B", "C", "E", "CD", "ABD")
# Load packages
library(purrr)
# Check if answer is in key
map2_lgl(answer, key, ~grepl(.x, .y))
[1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE
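Because grepl() treats its first argument as a regular expression, an answer containing a metacharacter such as "." would match any key. A hedged variant, assuming the answers should be matched literally, passes fixed = TRUE and then counts the correct answers; the name correct is only illustrative.

```r
answer <- c("A", "B", "B", "C", "E", "D", "D")
key <- c("A", "D", "B", "C", "E", "CD", "ABD")

# literal (non-regex) match of each answer against the key at the same position
correct <- unname(mapply(grepl, answer, key, MoreArgs = list(fixed = TRUE)))
correct
#[1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
sum(correct)   # number of correct answers
#[1] 6
```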

R - combining columns by specific conditions

I currently have a data frame as follows:
groups <- data.frame(name=paste("person",c(1:27),sep=""),
assignment1 = c("F","A","B","H", "A", "E", "D", "G", "I", "I", "E", "A", "D", "C", "F", "C", "D", "H", "F", "H", "G", "I", "G", "C", "B", "E", "B"),
assignment2 = c("H", "F", "F", "D", "E", "G", "A", "E", "I", "C", "A", "H", "G", "B", "I", "C", "E", "I", "C", "A", "B", "B", "G", "D", "H", "F", "D"),stringsAsFactors = FALSE)
I would like to create a list for each person that only contains the people he has already worked with. For example, person1 is in groups F and H for the 1st and 2nd assignment respectively, and
The members of group F on the 1st assignment are {"person1","person15", "person19"}.
The members of group H on the 2nd assignment are {"person1","person12", "person25"}.
I would like to create a vector for person1 like
{"person15", "person19", "person12", "person25"}.
Does anyone know a convenient way to do this in R?
Any help will be appreciated. Thanks in advance.
You could do this:
teammates <- lapply(1:nrow(groups), function(i) {
assig1 <- subset(groups, assignment1 == groups$assignment1[i])$name
assig2 <- subset(groups, assignment2 == groups$assignment2[i])$name
unq_set <- unique(c(assig1, assig2))
return(setdiff(unq_set, groups$name[i]))
})
This takes a vector of row indices and, for each one, applies a function that a) gets the names of those whose assignment 1 group or assignment 2 group matches that of the given row, b) takes the unique union of these, and c) returns that, less the name of the person around whom the group is built.
The output is a list like this:
[[1]]
[1] "person15" "person19" "person12" "person25"
[[2]]
[1] "person5" "person12" "person3" "person26"
[[3]]
[1] "person25" "person27" "person2" "person26"
...and so on
For more brevity, the following is equivalent (though the order inside list items may be different). It uses the same logic as @user5219763's answer for subsetting, but the setdiff part is important because it removes the person from their own list:
teammates <- lapply(1:nrow(groups), function(i) {
setdiff(
with(groups, name[assignment1 == assignment1[i] |
assignment2 == assignment2[i] ]),
groups$name[i])
})
Here's a solution using dplyr and tidyr:
library(dplyr)
library(tidyr)
groups %>%
gather(var, val, -name) %>%
unite(comb, var, val) %>%
left_join(.,., by = 'comb') %>%
group_by(name.x) %>%
summarise(out = list(name.y))
The heavy lifting is done by the left_join. Before that, we combine the columns so that we can merge on e.g. assignment1_F. The output for each person contains that person as well and is not corrected for dupes; that is up to you.
However, as @akrun says, if you are doing a lot of this stuff, use igraph.
You could use is.element()
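workedWith <- function(index, data = groups) {
  # names (column 1) of everyone sharing the assignment1 (column 2) or
  # assignment2 (column 3) group of row `index`; this includes the person itself
  data[is.element(data[, 2], data[index, 2]) | is.element(data[, 3], data[index, 3]), 1]
}
lapply(X = seq_len(nrow(groups)), FUN = workedWith)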

Can R display how many changes were made to a variable like Stata does

When one is, e.g., replacing a variable in Stata, the Stata output will say that x real changes were made to the variable. This is very useful to know. Is there any similar functionality in R?
I think you could achieve the desired results by simply comparing newly created vectors and tabulating the results:
A <- c("A", "B", "C", "D")
B <- c("A", "C", "C", "E")
A == B
# OR
table(A == B)
In effect, you should be able to save your transformations as a new column/vector and then compare it with the original object. Summarising the TRUE/FALSE values then tells you how many values were changed: the FALSE count is the number of changed values, and the TRUE count is the number left unchanged.
Full output
> A <- c("A", "B", "C", "D")
> B <- c("A", "C", "C", "E")
> A == B
[1] TRUE FALSE TRUE FALSE
> table(A == B)["TRUE"]
TRUE
2
> table(A == B)
FALSE TRUE
2 2
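To get closer to Stata's message, the comparison can be wrapped in a small helper. This is only a sketch: report_changes is a hypothetical name, not an existing R function. It assumes both vectors have the same length, and positions where either value is NA are not counted.

```r
report_changes <- function(old, new) {
  stopifnot(length(old) == length(new))
  # a position counts as changed when the old and new values differ
  n_changed <- sum(old != new, na.rm = TRUE)
  message(sprintf("(%d real changes made)", n_changed))
  invisible(n_changed)
}

A <- c("A", "B", "C", "D")
B <- c("A", "C", "C", "E")
n <- report_changes(A, B)
# prints: (2 real changes made)
```

The count is returned invisibly, so it can be stored for later checks without cluttering the console.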
