How to inverse subset in R?

I am trying to make non-overlapping subsets of a totally inclusive group in R. The first subset contains pairs of elements from the totally inclusive group. The other subset should be all of the elements in the totally inclusive group, but not in the first subset.
poplength <- 10
samples <- 7
numpair <- 2
totallyinclusivegroup <- sample(1:poplength, samples)
Subset1 <- sample(totallyinclusivegroup, size = numpair*2)
I don't know how to get a "Subset2" that includes everything in "totallyinclusivegroup" that is not in Subset1. I've tried using the "-" operator, with no success. For example,
Subset2 <- totallyinclusivegroup[-Subset1]
does not work, and includes elements from Subset1. Any advice/help is appreciated.

We can negate the logical vector returned by %in% with ! so that TRUE becomes FALSE and vice versa
out <- totallyinclusivegroup[!totallyinclusivegroup %in% Subset1]
Output:
Subset1
#[1] 2 6 9 7
totallyinclusivegroup
#[1] 3 2 6 1 9 7 8
out
#[1] 3 1 8
Or an easier option is setdiff
setdiff(totallyinclusivegroup, Subset1)
#[1] 3 1 8
If there are duplicate elements that need to be preserved (setdiff returns only unique values), it is better to use vsetdiff from vecsets
library(vecsets)
vsetdiff(totallyinclusivegroup, Subset1)
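To see why the "-" attempt in the question fails, here is a small sketch with hand-picked values (the names group and subset1 are illustrative, not from the question). Negative indices drop positions, not values. The last lines show that setdiff also drops repeated values, which the %in% form does not.

```r
# Illustrative values (not the random ones from the question)
group   <- c(3, 2, 6, 1, 9, 7, 8)
subset1 <- c(2, 6, 9, 7)

# "-" removes the elements at positions 2, 6 and 7 (position 9 does not
# exist and is silently ignored), so values from subset1 survive
by_position <- group[-subset1]

# %in% compares values, so this gives the complement we actually want
by_value <- group[!group %in% subset1]

# With repeated values, %in% keeps the repeats; setdiff() returns unique values
with_dups <- c(3, 3, 1, 2)
kept_dups <- with_dups[!with_dups %in% 2]
no_dups   <- setdiff(with_dups, 2)
```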

Try:
#Code
Subset2 <- totallyinclusivegroup[-which(totallyinclusivegroup%in% Subset1 )]
Output:
totallyinclusivegroup
[1] 8 5 10 2 9 1 3
Subset1
[1] 5 10 3 9
Subset2
[1] 8 2 1
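One caveat with the -which() form, sketched here with made-up values: if nothing matches, which() returns an empty index, and the negative subset then comes back empty rather than unchanged. The logical form with ! does not have this problem.

```r
# Made-up vectors with no elements in common
x    <- c(8, 2, 1)
none <- c(5, 10)

# which() finds no matches, -integer(0) is still integer(0),
# and indexing with an empty index selects nothing
via_which <- x[-which(x %in% none)]

# Logical negation keeps every element, as intended
via_not <- x[!x %in% none]
```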

Related

Duplicating R dataframe vector values using another vector as a guide

I have the following R dataframe: df = data.frame(value=c(5,4,3,2,1), a=c(2,0,1,6,9), b=c(7,0,0,3,4)). I would like to duplicate the values of a and b by the number of times given at the corresponding position in value. For example, expanding b would look like b_ex = c(7,7,7,7,7,2,2,2,4). No values of three or four would be in b_ex because values of zero are in b[2] and b[3]. The expanded vectors would be assigned names and be stand-alone.
Thanks!
Maybe you are looking for:
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))
#$a
#[1] 2 2 2 2 2 1 1 1 6 6 9
#$b
#[1] 7 7 7 7 7 3 3 4
To have them as separate vectors in the global environment, use list2env:
list2env(result, .GlobalEnv)
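The core of the answer above is rep() with a vector of counts. A minimal sketch for a single column, using the data frame from the question (the names keep, b_ex and a_ex are only illustrative):

```r
df <- data.frame(value = c(5, 4, 3, 2, 1),
                 a = c(2, 0, 1, 6, 9),
                 b = c(7, 0, 0, 3, 4))

# Positions where b is non-zero
keep <- df$b != 0

# Repeat each remaining b by the matching entry of value
b_ex <- rep(df$b[keep], times = df$value[keep])

# The same idea for every column except the first, kept in a list
result <- lapply(df[-1], function(x) rep(x[x != 0], df$value[x != 0]))
a_ex <- result$a
```

Pulling elements out of the list with $ is an alternative to list2env when you would rather not create objects in the global environment.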

Return a vector free of duplicates [without using unique() or duplicated()]

I am trying to get rid of duplicates within a vector without using the unique function (as this one doesn't work in that instance).
My loop looks as follows:
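# finding and deleting adjacent duplicates (sort x first to catch all of them)
dupes <- function(x) {
  i <- 1
  while (i < length(x)) {
    if (isTRUE(all.equal(x[i], x[i + 1]))) {
      x <- x[-(i + 1)]  # drop the repeated neighbour and re-check position i
    } else {
      i <- i + 1
    }
  }
  x  # return the vector instead of only printing it
}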
I want to run a vector through the function and get a vector (free of dupes) returned.
Here's one simple way to do it -
# for numeric vector
x <- c(1:8, 4:10)
# [1] 1 2 3 4 5 6 7 8 4 5 6 7 8 9 10
x[ave(x, x, FUN = seq_along) == 1]
# [1] 1 2 3 4 5 6 7 8 9 10
# for character vector
x <- as.character(iris$Species)
x[ave(x, x, FUN = seq_along) == 1]
# [1] "setosa" "versicolor" "virginica"
Here are a couple of ways to do it, assuming that your vector is NOT numeric (i.e. it is integer or character). Both return the distinct values in sorted order rather than in order of first appearance:
set.seed(666)
v1 <- sample(15:20, 10, replace = TRUE)
as.integer(names(table(v1)))
#[1] 15 16 17 19 20
rle(sort(v1))$values
#[1] 15 16 17 19 20
dplyr's distinct() function works on data frames, so wrap the vector in one first.
library(dplyr)
df_new <- distinct(data.frame(value = your_vector))
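Another base R route that avoids both unique() and duplicated(), offered as a sketch: match(x, x) returns the position of the first occurrence of each value, so an element is a first occurrence exactly when that position equals its own index. The helper name first_only is hypothetical.

```r
# Keep only the first occurrence of each value, preserving the original order
first_only <- function(x) {
  x[match(x, x) == seq_along(x)]
}

nums  <- first_only(c(1:8, 4:10))
chars <- first_only(c("b", "a", "b", "c", "a"))
```

Unlike the table() and rle(sort()) approaches, this keeps values in order of first appearance.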

choosing vector elements in a loop based on another vector [duplicate]

A question from a relative n00b: I’d like to split a vector into three vectors of different lengths, with the values assigned to each vector at random. For example, I’d like to split the vector of length 12 below into vectors of length 2, 3, and 7.
I can get three equal sized vectors using this:
test<-1:12
split(test,sample(1:3))
Any suggestions on how to split test into vectors of 2,3, and 7 instead of three vectors of length 4?
You could use rep to create the indices for each group and then split based on that
split(1:12, rep(1:3, c(2, 3, 7)))
If you wanted the items to be randomly assigned, so that it's not just the first 2 items in the first vector, the next 3 items in the second vector, and so on, you could just add a call to sample
split(1:12, sample(rep(1:3, c(2, 3, 7))))
If you don't have the specific lengths (2,3,7) in mind but just don't want it to be equal length vectors every time then SimonO101's answer is the way to go.
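As a quick check of the rep/sample approach, here is a sketch that builds the group labels from a vector of sizes and confirms the result. The names sizes, labels and parts are illustrative, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the run is repeatable

sizes <- c(2, 3, 7)

# One label per element: two 1s, three 2s, seven 3s, then shuffled
labels <- sample(rep(seq_along(sizes), times = sizes))

parts <- split(1:12, labels)

# Group sizes are fixed by `sizes`; only the membership is random
part_sizes <- lengths(parts)
```

The sizes always come out as 2, 3 and 7; shuffling the labels changes only which elements land in which group.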
How about using sample slightly differently...
set.seed(123)
test<-1:12
split( test , sample(3, 12 , replace = TRUE) )
#$`1`
#[1] 1 6
#$`2`
#[1] 3 7 9 10 12
#$`3`
#[1] 2 4 5 8 11
set.seed(1234)
test<-1:12
split( test , sample(3, 12 , replace = TRUE) )
#$`1`
#[1] 1 7 8
#$`2`
#[1] 2 3 4 6 9 10 12
#$`3`
#[1] 5 11
The first argument in sample is the number of groups to split the vector into. The second argument is the number of elements in the vector. This will randomly assign each successive element into one of 3 vectors, so the group sizes vary from run to run. For 4 vectors just do split( test , sample(4, 12 , replace = TRUE) ).
It is easier than you think. To split the vector in three new randomly chosen sets run the following code:
test <- 1:12
split(sample(test), 1:3)
By doing so, any time you run this code you get a new random distribution across three different sets (perfect for k-fold cross validation).
You get:
> split(sample(test), 1:3)
$`1`
[1] 5 8 7 3
$`2`
[1] 4 1 10 9
$`3`
[1] 2 11 12 6
> split(sample(test), 1:3)
$`1`
[1] 12 6 4 1
$`2`
[1] 3 8 7 5
$`3`
[1] 9 2 10 11
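The call above relies on split() recycling the grouping vector 1:3 along the shuffled data. A sketch that spells the recycling out with rep_len(), which also covers lengths that are not a multiple of the number of groups (k, fold_id and folds are illustrative names):

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

values <- 1:10
k <- 3

# Labels 1, 2, 3, 1, 2, 3, ... cut to the length of the data
fold_id <- rep_len(seq_len(k), length(values))

# Shuffle the data, then deal it out to the folds in turn
folds <- split(sample(values), fold_id)
fold_sizes <- lengths(folds)
```

With 10 elements and 3 folds the sizes are 4, 3 and 3, so the folds are as equal as the length allows.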
You could use an auxiliary vector to format the way you want to split your data. Example:
Data <- c(1,2,3,4,5,6)
Format <- c("X","Y","X","Y","Z","Z")
output <- split(Data,Format)
Will generate the output:
$X
[1] 1 3
$Y
[1] 2 4
$Z
[1] 5 6

Apply function to dataframe with changing argument

I have 2 objects:
A data frame with 3 variables:
v1 <- 1:10
v2 <- 11:20
v3 <- 21:30
df <- data.frame(v1,v2,v3)
A numeric vector with 3 elements:
nv <- c(6,11,28)
I would like to compare the first variable to the first number, the second variable to the second number and so on.
which(df$v1 > nv[1])
which(df$v2 > nv[2])
which(df$v3 > nv[3])
Of course in reality my data frame has a lot more variables so manually typing each variable is not an option.
I encounter these kinds of problems quite frequently. What kind of documentation would I need to read to be fluent in these matters?
One option is to compare against a vector of matching size. We can replicate each element of 'nv' by the number of rows of 'df' (rep(nv, each=nrow(df))) and compare the result with df, or index 'nv' with the col function, which gives the same result as rep.
which(df > nv[col(df)], arr.ind=TRUE)
If you need a logical matrix that corresponds to comparison of each column with each element of 'nv'
sweep(df, 2, nv, FUN='>')
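A short sketch of what the logical matrix is useful for, using the data from the question. Converting to a matrix first with as.matrix() is a choice made here for clarity, and the names hits and counts are illustrative.

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# TRUE where a value exceeds the threshold for its own column
hits <- sweep(as.matrix(df), 2, nv, FUN = ">")

# Number of rows above the threshold, per column
counts <- colSums(hits)
```

Here counts holds 4, 9 and 2 for v1, v2 and v3, matching the lengths of the index vectors that which() returns for each column.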
You could also use mapply:
mapply(FUN=function(x, y)which(x > y), x=df, y=nv)
#$v1
#[1] 7 8 9 10
#
#$v2
#[1] 2 3 4 5 6 7 8 9 10
#
#$v3
#[1] 9 10
I think these sorts of situations are tricky because normal looping solutions (e.g. the apply function) only loop through one object, but you need to loop both through df and nv simultaneously. One approach is to loop through the indices and to use them to grab the appropriate information from both df and nv. A convenient way to loop through indices is the sapply function:
sapply(seq_along(nv), function(x) which(df[,x] > nv[x]))
# [[1]]
# [1] 7 8 9 10
#
# [[2]]
# [1] 2 3 4 5 6 7 8 9 10
#
# [[3]]
# [1] 9 10
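If you would rather iterate over both objects directly instead of over indices, Map() pairs each column of df with the matching element of nv. A sketch using the data from the question (the argument names column and cutoff are illustrative):

```r
df <- data.frame(v1 = 1:10, v2 = 11:20, v3 = 21:30)
nv <- c(6, 11, 28)

# Map() walks df (column by column) and nv in parallel and always returns a list
above <- Map(function(column, cutoff) which(column > cutoff), df, nv)
```

Because Map() always returns a list, the result has the same shape whether or not the index vectors happen to have equal lengths, and it carries the column names of df.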
