Split vector randomly into two sets - r

I have a vector t of length 100 and want to divide it into two sub-vectors of 30 and 70 values. The values should be chosen randomly and without replacement, so none of the 30 values may appear in the sub-vector of 70 values and vice versa.
I know the R function sample, which I can use to randomly choose values from a vector with or without replacement. However, even with replace = FALSE I have to run sample twice, once to choose 30 values and once to choose 70. That means some of the 30 values might also be among the 70 values and vice versa.
Any ideas?

How about this:
t <- 1:100 # or whatever your original set is
a <- sample(t, 70)
b <- setdiff(t, a)

Regarding my comment, what is wrong with:
vec <- 1:100
set.seed(2)
samp <- sample(length(vec), 30)
a <- vec[samp]
b <- vec[-samp]
?
To show these are separate sets with no duplicates:
R> intersect(a, b)
integer(0)
If you have duplicate values in your vector that is a different matter, but your question is unclear.
With duplicates in vec things are a bit more complicated and it depends what result you wanted to achieve.
R> set.seed(4)
R> vec <- sample(100, 100, replace = TRUE)
R> set.seed(6)
R> samp <- sample(100, 30)
R> a <- vec[samp]
R> b <- vec[-samp]
R> length(a)
[1] 30
R> length(b)
[1] 70
R> length(setdiff(vec, a))
[1] 41
So setdiff() "fails" here because it doesn't get the length right. The index-based a and b do have values in common, but no observation from the sample appears in both:
R> intersect(a, b)
[1] 57 35 91 27 71 63 8 92 49 77
The duplicates (the intersection) arise because the values above occurred more than once in the original sample vec.
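The index-based approach above can be wrapped in a small helper. This is only a minimal sketch: the name split_random and its arguments are hypothetical and do not come from the answers. Because it samples positions rather than values, every observation should land in exactly one part even when the vector contains repeated values.

```r
# Hypothetical helper (not from the original answers): split a vector into
# two disjoint parts by sampling positions rather than values.
split_random <- function(vec, n) {
  stopifnot(n >= 1, n <= length(vec))   # vec[-integer(0)] would be empty, so require n >= 1
  idx <- sample(length(vec), n)         # n distinct positions, drawn without replacement
  list(first = vec[idx], rest = vec[-idx])
}

set.seed(1)
vec <- c(5, 5, 7, 7, 9, 9, 11, 13, 13, 15)  # contains repeated values
parts <- split_random(vec, 3)

# Each observation is used exactly once across the two parts.
stopifnot(length(parts$first) == 3, length(parts$rest) == 7)
stopifnot(identical(sort(c(parts$first, parts$rest)), sort(vec)))
```

Comparing the two parts with intersect() can still report shared values here, for the same reason as in the answer above: the values repeat in vec, but the positions do not overlap.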

What about something like this?
x <- 1:100
s70 <- sample(x, 70, replace = FALSE)
s30 <- sample(setdiff(x, s70), 30, replace = FALSE)
s30 contains the same numbers as setdiff(x, s70). The difference is that s30 is an unordered vector of length 30, whereas setdiff(x, s70) returns a vector of length 30 in ascending order. You said you want random subsamples of length 70 and 30, so s30 is better than plain setdiff(x, s70). If order does not matter, the better alternative is to use setdiff without sample, as in @seancarmody's answer.
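If both parts should be in random order, one possible alternative is to shuffle the whole vector once and cut the permutation in two. The following is a sketch under that assumption, not the method from the answers; the name perm is illustrative.

```r
set.seed(123)
x <- 1:100

# x has length 100, so sample(x) returns a random permutation of its elements.
perm <- sample(x)
s70 <- perm[1:70]     # first 70 shuffled values
s30 <- perm[71:100]   # remaining 30 shuffled values

# The two parts are disjoint and together cover x.
stopifnot(length(intersect(s70, s30)) == 0)
stopifnot(setequal(c(s70, s30), x))
```

This needs only one call to sample and leaves neither part sorted.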

As you've mentioned "split", you can also try something like this:
set.seed(1)
t <- sample(20:40, 100, replace=TRUE)
groups <- rep("A", 100)
groups[sample(100, 30)] <- "B"
table(groups)
# groups
# A B
# 70 30
split(t, groups)
# $A
# [1] 25 32 39 24 38 39 33 21 24 23 36 40 27 36 24 33 22 25 28 28 38 27 30 30 23
# [26] 34 35 37 33 31 36 20 30 35 34 30 29 25 22 26 33 28 26 29 26 33 30 36 21 38
# [51] 27 37 27 27 30 38 38 36 29 34 28 26 35 25 23 25 21 33 36 28
#
# $B
# [1] 27 33 34 28 30 35 39 20 32 37 36 22 28 36 31 38 21 30 39 25 28 40 24 34 22
# [26] 38 36 29 37 32
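A closely related variant builds the label vector with the exact group sizes first and then shuffles it. This is a sketch of the same idea, not the answer's own code; the group names and sizes simply mirror the example above.

```r
set.seed(1)
t <- sample(20:40, 100, replace = TRUE)

# Build 70 "A" labels and 30 "B" labels, then shuffle their order.
groups <- sample(rep(c("A", "B"), times = c(70, 30)))

parts <- split(t, groups)
lengths(parts)   # sizes of the two groups
```

Because the labels are assigned by position, repeated values in t are not a problem.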

Related

How to apply the function to each row?

I want to generate 4 new columns from an existing variable total by random sampling. The results for each row should meet the condition s1 + s2 + s3 + s4 == total. For example,
> tabulate(sample.int(4, 100, replace = TRUE))
[1] 22 21 27 30
The following code does not work since the function appears to recycle the first row and applies it column-wise.
DT <- data.table(total = c(100, 110, 90, 92))
DT[, c(paste0("s", 1:4)) := tabulate(sample.int(4, total, replace = TRUE))]
> DT
total s1 s2 s3 s4
1: 100 31 31 31 31
2: 110 25 25 25 25
3: 90 22 22 22 22
4: 92 22 22 22 22
How can I get around this? I am clearly missing some basic understanding of how R vectors and lists work. Your help will be much appreciated.
Edited following edited question:
data.table expects a list internally when you want to assign to several columns at once. To get a separate draw for each row, group by row using by:
DT <- data.table(total = c(100, 110, 90, 102, 92))
DT[, c(paste0("s", 1:4)) := {
as.list(tabulate(sample.int(4, total, replace = TRUE)))
}, by = seq(NROW(DT))]
Which outputs the following, satisfying the OP's criteria:
> DT
total s1 s2 s3 s4
1: 100 27 28 28 17
2: 110 25 23 36 26
3: 90 26 19 26 19
4: 102 28 24 21 29
5: 92 17 27 22 26
> apply(DT[, 2:5],1, sum)
[1] 100 110 90 102 92
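One detail worth guarding against: by default tabulate() sizes its result by the largest value it sees, so for a small total it may return fewer than 4 counts. Passing nbins = 4 fixes the length. The following base R sketch (no data.table, with illustrative names total and counts) shows the same row-wise idea under that assumption.

```r
set.seed(42)
total <- c(100, 110, 90, 102, 92)

# One independent draw per row; nbins = 4 guarantees four counts per row.
counts <- t(vapply(
  total,
  function(n) tabulate(sample.int(4, n, replace = TRUE), nbins = 4),
  integer(4)
))
colnames(counts) <- paste0("s", 1:4)

# Each row of counts sums to the corresponding total.
stopifnot(all(rowSums(counts) == total))
```

The same nbins = 4 argument can be added inside the data.table call above.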
Maybe you can try the code below
DTout <- cbind(
DT,
do.call(
rbind,
lapply(DT$total, function(x) diff(sort(c(0, sample(x - 1, 3), x))))
)
)
which gives
total V1 V2 V3 V4
1: 100 51 5 17 27
2: 110 41 1 40 28
3: 90 32 34 14 10
4: 102 5 73 13 11
5: 92 17 13 17 45
Test
> rowSums(DTout[,-1])
[1] 100 110 90 102 92
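Note that the two answers sample from different distributions: the first tabulates total draws from four equally likely categories (multinomial counts), while the second places three random cut points between 1 and total - 1. If the multinomial version is what you want, it could also be drawn directly with rmultinom from the stats package. A minimal sketch, with illustrative names total and parts:

```r
set.seed(7)
total <- c(100, 110, 90, 102, 92)

# One multinomial draw per row: n items spread over 4 equally likely cells.
parts <- t(vapply(
  total,
  function(n) rmultinom(1, size = n, prob = rep(0.25, 4))[, 1],
  integer(4)
))
colnames(parts) <- paste0("s", 1:4)

# Each row sums to its total by construction.
stopifnot(all(rowSums(parts) == total))
```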

How can I create unique random numbers in R?

I hope to generate random numbers between 1:100 and then test their divisibility by 3. I have created a loop.
v <- c(0)
for(i in 1:100){
r <- floor(runif(1, min=1, max=100))
if(r %% 3 == 0){
v <- append(v,r)
}
}
print(v)
However, the numbers keep repeating, as you can see in the following output. Is there any way to generate only unique multiples of 3 between 1 and 100? I am aware that I could use the seq function to generate the same numbers, but I still want to know how to obtain unique random numbers.
Output:
[1] 0 18 87 30 45 90 12 72 75 60 27 84 90 27 42 54 63 15 63 30 72 69 57 30 3 6 15 30 3
[30] 60 72 6 6 18 75 96 84 78 24
sample(1:33)*3 is all the multiples of 3 in your range in a random order.
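If only some of the multiples are needed, sample can draw a fixed number of them without replacement, which rules out repeats. A small sketch; the count of 10 and the names multiples and picked are arbitrary choices for illustration.

```r
set.seed(10)

multiples <- seq(3, 99, by = 3)   # the 33 multiples of 3 between 1 and 100
picked <- sample(multiples, 10)   # 10 distinct values, no replacement

stopifnot(anyDuplicated(picked) == 0)
stopifnot(all(picked %% 3 == 0))
```

In the original loop, calling unique(v) at the end would also drop the repeats, but then the number of values returned would vary from run to run.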

Can not convert values from factor into only numeric

I want to convert this variable into numeric, as you can see:
> class(DATA$estimate)
[1] "factor"
> head(DATA$estimate)
[1] 0,253001909 0,006235543 0,005285019 0,009080499 6,580140903 0,603060006
57 Levels: 0,000263863 0,000634365 0,004405696 0,005285019 0,006235543 0,009080499 0,009700147 0,018568434 0,253001909 ... 7,790580873
>
But when I want to convert, look what I have got
> DATA$estimate<-as.numeric(DATA$estimate)
> DATA$estimate
[1] 9 5 4 6 51 12 3 53 11 8 1 7 15 27 30 29 28 31 21 23 22 39 38 37 33 26 34 52 57 50 24 18 20 10 2 55 54 56 36 32 35 44 46
[44] 48 19 25 16 43 41 40 49 42 47 14 17 13 45
It's not numeric and I don't understand how the program gives these numbers!
data:
fac <- factor(c("0,253001909" ,"0,006235543" ,"0,005285019" ,"0,009080499" ,"6,580140903" ,"0,603060006"))
I convert to character, then turn the "," into ".", then convert to numeric.
as.numeric(sub(",",".",as.character(fac)))
In your case it's:
DATA$estimate<-as.numeric(sub(",",".",as.character(DATA$estimate)))
You can also scan() your factor variable and specify "," as the decimal separator:
fac <- factor(c("0,253001909" ,"0,006235543" ,"0,005285019" ,"0,009080499" ,
"6,580140903" ,"0,603060006"))
scan(text = as.character(fac), dec = ",")
#output
[1] 0.253001909 0.006235543 0.005285019 0.009080499 6.580140903
[6] 0.603060006
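The numbers in the question are the factor's internal integer codes: as.numeric() on a factor returns the position of each value among the levels, not the printed label. The sketch below shows this and a variant that converts the levels once and then indexes by the codes; the names codes and num are illustrative.

```r
fac <- factor(c("0,253001909", "0,006235543", "0,005285019",
                "0,009080499", "6,580140903", "0,603060006"))

# Integer level codes, i.e. positions within levels(fac), not the values.
codes <- as.numeric(fac)

# Convert each distinct level once, then look the values up by code.
lev_num <- as.numeric(sub(",", ".", levels(fac), fixed = TRUE))
num <- lev_num[as.integer(fac)]
num
```

Converting the levels rather than every element can save work when a long factor has few distinct levels.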

Calculate number of values in vector that exceed values in column of data.frame

I have a long list of numbers, e.g.
set.seed(123)
y<-round(runif(100, 0, 200))
And I would like to store in column y the number of values that exceed each value in column x of a data frame:
df <- data.frame(x=seq(0,200,20))
I can compute the numbers manually, like this:
length(which(y>=20)) #93 values exceed 20
length(which(y>=40)) #81 values exceed 40
etc. I know I can use a for-loop with all values of x, but is there a more elegant way?
I tried this:
df$y <- length(which(y>=df$x))
But this gives a warning and does not give me the desired output.
The data frame should look like this:
df
x y
1 0 100
2 20 93
3 40 81
4 60 70
5 80 61
6 100 47
7 120 40
8 140 29
9 160 19
10 180 8
11 200 0
You can compare each value of df$x against all values of y using sapply:
sapply(df$x, function(a) sum(y>a))
#[1] 99 93 81 70 61 47 40 29 18 6 0
#Looking at your output, maybe you want
sapply(df$x, function(a) sum(y>=a))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Here's another approach using outer, which allows for element-wise comparison of two vectors:
rowSums(outer(df$x,y, "<="))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Yet one more (from alexis_laz's comment)
length(y) - findInterval(df$x, sort(y), left.open = TRUE)
# [1] 100 93 81 70 61 47 40 29 19 8 0
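The original attempt fails because y >= df$x compares a length-100 vector with a length-11 vector element by element, recycling the shorter one, and length(which(...)) then collapses the result to a single number. As a rough check, the three approaches above can be run side by side on the question's data; the names by_vapply, by_outer and by_interval are illustrative.

```r
set.seed(123)
y <- round(runif(100, 0, 200))
df <- data.frame(x = seq(0, 200, 20))

# Count, for each threshold in df$x, how many values of y are >= it.
by_vapply   <- vapply(df$x, function(a) sum(y >= a), integer(1))
by_outer    <- rowSums(outer(df$x, y, "<="))
by_interval <- length(y) - findInterval(df$x, sort(y), left.open = TRUE)

# All three approaches should agree.
stopifnot(all(by_vapply == by_outer), all(by_vapply == by_interval))

df$y <- by_vapply
```

vapply is used here in place of sapply only to fix the return type.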

Plot a list of variable length vectors in R

I have a list which has multiple vectors (total 80) of various lengths. On the x-axis I want the names of these vectors. On the y-axis I want to plot the values corresponding to each vector. How can I do it in R?
One way to do this is to reshape the data using reshape2::melt or some other method. Please try and make a reproducible example. I think this is the gist of what you are after:
set.seed(4)
mylist <- list(a = sample(1:50, 10, replace = TRUE),
               b = sample(25:40, 15, replace = TRUE),
               c = sample(51:75, 20, replace = TRUE))
mylist
# $a
# [1] 30 1 15 14 41 14 37 46 48 4
#
# $b
# [1] 37 29 26 40 31 32 40 34 40 37 36 40 33 32 35
#
# $c
# [1] 71 63 72 63 64 65 56 72 67 63 75 62 66 60 51 74 57 65 55 73
library(ggplot2)
library(reshape2)
df <- melt(mylist)
head(df)
# value L1
# 1 30 a
# 2 1 a
# 3 15 a
# 4 14 a
# 5 41 a
# 6 14 a
ggplot(df, aes(x = factor(L1), y = value)) + geom_point()
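If you would rather avoid extra packages, a similar plot can probably be made in base R. This sketch assumes a named list of numeric vectors (a small made-up one here) and uses stack() to reshape it and stripchart() to draw one column of points per vector.

```r
# Small made-up list of vectors with different lengths.
mylist <- list(a = c(30, 1, 15), b = c(37, 29, 26, 40), c = c(71, 63))

# stack() returns a data frame with columns `values` and `ind`,
# where `ind` is a factor holding the list names.
long <- stack(mylist)

# One vertical strip of points per vector name.
stripchart(values ~ ind, data = long, vertical = TRUE, pch = 16,
           xlab = "vector", ylab = "value")
```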
