Given are two vectors:
vec_nums <- 1:20
vec_ltrs <- letters[1:10]
I would like to write a function that merges them so that each element from the second vector appears at a precisely defined position within the first vector. For example, running:
vec_mrg <- funMergeVectsByPlace(x = vec_nums, y = vec_ltrs, position = 3)
Should return vec_mrg of the following content:
[1] "a" "b" "1" "c" "d" "2" "f" "g" "3" "i" "j" "4" "l" "m" "5" ...
Desired characteristics:
The function places elements from the vector passed via y = at the position given in position =, counting from the left-hand side. So position = 3 should be understood as "every third place", i.e. positions 3, 6, ...
The function should work on numeric, string and factor vectors and return an ordered factor.
If vector y is shorter than the number of inserts required by x, the function should return the remaining part of x without any additions.
Suggested structure
I would envisage for the function to be of this structure:
funMergeVectsByPlace <- function(x, y, position = 3) {
# Convert
vec_a <- as.character(x)
vec_b <- as.character(y)
# Missing part
# Combine two vectors
# Create ordered factor
vec_fac <- factor(vec_mrg,
# levels =
# I want the levels to reflect the order of elements in vec_mrg
)
# Return
return(vec_fac)
}
Samples
Simplest
Concerning attempts, simplest approach:
vec_mrg <- c(vec_nums, vec_ltrs)
vec_mrg <- order(vec_mrg)
But this does not produce the desired order: order() returns a sorting permutation (indices), not the interleaved vector.
Loop
for (i in 1:length(vec_nums)) {
pos <- position
vec_nums[pos] <- vec_ltrs[i]
pos <- pos + pos
# i will be out of bounds and the way to move the other vector is missing
}
vec_mrg <- function(x,y,pos) {
res <- y
counter <- seq(floor(length(y)/(pos-1)))
for(i in counter) {
res <- append(res, x[i], seq(pos-1,by=pos, length.out=length(counter))[i])
}
res
}
vec_mrg(vec_nums, vec_ltrs, 3)
#[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j"
#[15] "5"
A loop-free solution:
funMergeVectsByPlace <- function( x, y, position )
{
n <- min( length(y)%/%(position-1), length(x) )
A <- rbind( matrix(head(y,n*(position-1)),position-1), head(x,n) )
rest <- c( x[-(1:n)], y[-(1:(n*(position-1)))] )
c(c(A),rest)
}
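The interleaving step above relies on R storing matrices column by column: once the letters fill a matrix with position - 1 rows and the numbers are bound on as an extra row, flattening with c() reads one group at a time. A minimal sketch on toy data (the names y_toy and x_toy are illustrative only, not part of the answer):

```r
# Toy inputs, purely illustrative
y_toy <- letters[1:6]
x_toy <- 1:3

# Two letters per column, then the numbers as a third row;
# rbind() coerces everything to character
A <- rbind(matrix(y_toy, nrow = 2), x_toy)

# Flattening a matrix walks down each column in turn
interleaved <- c(A)
print(interleaved)
```

Each column of A holds one complete group ("a", "b", "1"), which is why no loop is needed.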
Speed comparison with Lafortune's solution:
> library(microbenchmark)
> vec_nums <- 1:20
> vec_ltrs <- letters[1:10]
> microbenchmark(Lafortune = vec_mrg(vec_nums,vec_ltrs,3),
+ mra68 = funMergeVectsByPlace(vec_nums,vec_ltrs,3),
+ times .... [TRUNCATED]
Unit: microseconds
expr min lq mean median uq max neval
Lafortune 137.677 143.112 161.12006 146.734 153.980 2931.512 10000
mra68 77.443 81.067 92.13208 83.331 86.954 2718.204 10000
Larger vectors:
> vec_nums <- 1:2000
> vec_ltrs <- letters[rep(1:10,100)]
> microbenchmark(Lafortune = vec_mrg(vec_nums,vec_ltrs,3),
+ mra68 = funMergeVectsByPlace(vec_nums,vec_ltrs,3),
+ times .... [TRUNCATED]
Unit: milliseconds
expr min lq mean median uq max neval
Lafortune 32.993883 40.991796 63.758011 51.171020 90.122351 456.9748 1000
mra68 1.101865 1.489533 2.468496 1.751299 3.338881 230.0460 1000
> v1 <- vec_mrg(vec_nums,vec_ltrs,3)
> v2 <- funMergeVectsByPlace(vec_nums,vec_ltrs,3)
>
Notice that the vec_mrg function does not append the rest of the x vector to the result, but funMergeVectsByPlace does. Otherwise the results are the same:
> v1 <- vec_mrg(1:20,letters[1:10],3)
> v2 <- funMergeVectsByPlace(1:20,letters[1:10],3)
> v1
[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j" "5"
> v2
[1] "a" "b" "1" "c" "d" "2" "e" "f" "3" "g" "h" "4" "i" "j" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
> identical(v1,v2[1:length(v1)])
[1] TRUE
>
Neither vec_mrg nor funMergeVectsByPlace returns a factor. If one wraps the result in factor(...), both functions get slower, but funMergeVectsByPlace is still faster than vec_mrg.
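For the factor step requested in the question, one possible sketch is to take the levels from the order of first appearance in the merged vector. This assumes that "order of elements" means first appearance; vec_mrg_toy is a made-up input, not output of either function:

```r
# Hypothetical merged vector with a repeated element
vec_mrg_toy <- c("a", "b", "1", "c", "d", "2", "a")

# unique() keeps first-appearance order, so the levels follow the merge order
vec_fac <- factor(vec_mrg_toy, levels = unique(vec_mrg_toy), ordered = TRUE)
print(levels(vec_fac))
```

Using unique() matters when y contains repeats, because factor() rejects duplicated levels.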
Related
I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).
Consider the following example vector in R:
x <- rep(LETTERS[1:5], 3) # Create example vector
x
# [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears directly after each other at positions 5 and 6:
set.seed(53135)
sample(x) # sample() function puts same elements too close
# [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
How could I ensure that identical elements have a minimum distance of three?
So basically we need to conditionally sample one element of x that has not been selected in the last min.dist draws. Using purrr's reduce we can achieve this:
library(purrr)
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Bundled in a function
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
After the comment by 27ϕ9:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E
3 3 3 3 3
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
UPDATE
After some trial and error, and given that there are not always enough elements left to keep the min.dist spacing, I came up with the solution below. It is also the most thoroughly commented of the versions above:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), function(.x, ...){
# whether the value is in the tail of the aggregated vector
in.tail <- x %in% tail(.x, min.dist)
# whether a value still hasn't reached the max frequency
freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
# whether a value isn't in the aggregated vector
yet <- !x %in% .x
# the if is there basically to account for the cases when we don't have enough vars to space out the vectors
c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else x[which(freq.got)[1]] )
}, .init=sample(x,1))
}
Now running table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for every value, and we can say with some certainty that identical values in the result are spaced with a minimum distance of 2. The only way to guarantee that spacing everywhere is to use min.dist=length(unique(x))-1. Otherwise there can be up to r < min.dist elements that are closer than min.dist to their previous occurrence, and if such elements exist they sit at the end of the result, in the length(x) + 1 - 1:min.dist subset.
To be completely certain, we can loop until the tail of the output vector contains only unique values (remove the print call, it is there for demonstration purposes only):
shuffler <- function(x, min.dist=2){
while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E
3 3 3 3 3
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E
3 3 3 3 3
Update:
shuffler <- function(x, min.dist=2){
while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
l
}
This new version rigorously tests whether the elements in the tail of the vector are at least min.dist apart. The previous version works for min.dist=2, but this one tests more thoroughly.
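To check any of the shufflers independently, a small helper can report the smallest position difference between two identical elements. This is only a sketch: min_gap is a hypothetical name, and it assumes that min.dist = 2 in the code above corresponds to a position difference of at least 3.

```r
# Smallest distance (in positions) between two occurrences of the same value;
# returns Inf when no value occurs twice
min_gap <- function(v) {
  # Positions of each distinct value, then differences of consecutive positions
  gaps <- unlist(lapply(split(seq_along(v), v), diff), use.names = FALSE)
  if (length(gaps) == 0L) Inf else min(gaps)
}

ok  <- c("A", "B", "C", "A", "B", "C")
bad <- c("A", "B", "A", "C", "B", "C")
print(min_gap(ok) > 2)   # TRUE: at least two other elements in between
print(min_gap(bad) > 2)  # FALSE: "A" reappears after a single element
```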
If your data is large, then it may be (way) faster to rely on probability to do that kind of task.
Here's an example:
prob_shuffler = function(x, min.dist = 2){
n = length(x)
res = sample(x)
OK = FALSE
# We loop until we have a solution
while(!OK){
OK = TRUE
for(i in 1:min.dist){
# We check if identical elements are 'i' steps away
pblm = res[1:(n-i)] == res[-(1:i)]
if(any(pblm)){
if(sum(pblm) >= (n - i)/2){
# back to square 1
res = sample(x)
} else {
# we pair each identical element with
# an extra one
extra = sample(which(!pblm), sum(pblm))
id_reshuffle = c(which(pblm), extra)
res[id_reshuffle] = sample(res[id_reshuffle])
}
# We recheck from the beginning
OK = FALSE
break
}
}
}
res
}
Even though the while loop looks scary, in practice convergence is fast. Of course, the lower the probability to have two characters at min.dist away, the faster the convergence.
The current solutions by Abdessabour Mtk and Carles Sans Fuentes work but, depending on the size of the input data, quickly become prohibitively slow. Here's a benchmark:
library(microbenchmark)
x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520
microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 1) 87.001 111.501 155.071 131.801 192.401 264.401 10
#> shuffler_am(x, 1) 17218.100 18041.900 20324.301 18740.351 22296.301 26495.200 10
#> shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701 10
microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 2) 140.1 195.201 236.3312 245.252 263.202 354.101 10
#> shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800 29133.400 10
#> shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401 10
microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 3) 318.001 450.402 631.5312 573.352 782.2 1070.401 10
#> shuffler_am(x, 3) 19003.501 19622.300 23314.4808 20784.551 28281.5 32885.101 10
#> shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901 10
Two things stand out: a) as expected, the speed of prob_shuffler depends on min.dist while the other methods are much less sensitive to it, b) prob_shuffler is about 100-fold faster for just 520 observations (and it scales).
Of course if the probability to have two identical characters at min.dist away is extremely high, then the recursive methods should be faster. But in most practical cases, the probability method is faster.
I hope this answer works for you. It uses only base R. I left the print statements in as comments in case you want to check it line by line:
x <- rep(LETTERS[1:5], 3) # Create example vector
shuffle <- function(x, min_dist=3){
#init variables
result<-c() # result vector
count<-0
vec_use<-x
vec_keep<-c()
for(i in 1:length(x)){
# print(paste0("iteration =", i))
if (count>min_dist){
valback<-vec_keep[1]
# print(paste0("value to be returned:", valback))
ntimes_valback<-(table(vec_keep)[valback])
vec_use<- c(vec_use,rep(valback,ntimes_valback))
# print(paste0("vec_use after giving back valbak =", valback))
# print(paste0(vec_use,","))
vec_keep <- vec_keep[!vec_keep %in% valback]
# print(paste0("vec_keep after removing valback =", valback))
# print(paste0(vec_keep,","))
}
val<-sample(vec_use,1)
# print(paste0("val = ",val))#remove value
vec_keep<- c(vec_keep,x[x %in% val])
vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
# print(paste0("vec_keep ="))
# print(paste0(vec_keep,","))
vec_use <- vec_use[!vec_use %in% val]
# print(paste0("vec_use ="))
# print(paste0(vec_use,","))
result[i]<-val
count<-count+1
}
return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
I would like to name the elements of a list on the fly using the content of a variable. How should I do this?
DT <- data.table(A=LETTERS[1:3], B=letters[1:3], C= 1:9)
lapply(unique(DT$A), function(xA){
RTN <-
lapply(unique(DT$B), function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output)== 0L) {
}else{
c(xA, xB, output)
}
})
})
the result is
[[1]]
[[1]][[1]]
[1] "A" "a" "1" "4" "7"
[[1]][[2]]
NULL
[[1]][[3]]
NULL
[[2]]
[[2]][[1]]
NULL
[[2]][[2]]
[1] "B" "b" "2" "5" "8"
[[2]][[3]]
NULL
I would like to make it as following
[[A]]
[[A]][[a]]
[1] "A" "a" "1" "4" "7"
[[A]][[b]]
NULL
[[A]][[c]]
NULL
[[B]]
[[B]][[a]]
NULL
[[B]][[b]]
[1] "B" "b" "2" "5" "8"
[[B]][[c]]
NULL
Besides, how can I remove the NULL, and make it a complete case matrix? Many thanks.
Here are two solutions:
1) Use sapply and set USE.NAMES = TRUE
2) Capture the names before each lapply and set them after.
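For option 1, a minimal sketch of how sapply names its result after a character input. The keys vector and the toy function are illustrative only; simplify = FALSE is added here so the result stays a list, as with lapply:

```r
# Illustrative keys, standing in for unique(DT$A)
keys <- c("A", "B", "C")

# With a character input and USE.NAMES = TRUE, the input values become the names
res <- sapply(keys, function(k) tolower(k), simplify = FALSE, USE.NAMES = TRUE)
print(names(res))
```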
DT <- data.table(A=LETTERS[1:3], B=letters[1:3], C= 1:9)
outer_list_names <- unique(DT$A)
outer_list <- lapply(unique(DT$A), function(xA){
RTN_names = unique(DT$B)
RTN <-
lapply(unique(DT$B), function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output)== 0L) {
}else{
c(xA, xB, output)
}
})
names(RTN) <- RTN_names
RTN # return the named list; the assignment alone would return only the names
})
names(outer_list) <- outer_list_names
outer_list
We could create a named vector to name the list
A_vec <- setNames(unique(DT$A), unique(DT$A))
B_vec <- setNames(unique(DT$B), unique(DT$B))
lapply(A_vec, function(xA){
RTN <- lapply(B_vec, function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output) > 0L) {
c(xA, xB, output)
}
})
})
#$A
#$A$a
#[1] "A" "a" "1" "4" "7"
#$A$b
#NULL
#$A$c
#NULL
#$B
#$B$a
#NULL
#$B$b
#[1] "B" "b" "2" "5" "8"
#$B$c
#NULL
If you want to remove the NULL values we could have a Filter to remove them
lapply(A_vec, function(xA){
RTN <- lapply(B_vec, function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output) > 0L) {
c(xA, xB, output)
}
})
Filter(Negate(is.null), RTN)
})
#$A
#$A$a
#[1] "A" "a" "1" "4" "7"
#$B
#$B$b
#[1] "B" "b" "2" "5" "8"
#$C
#$C$c
#[1] "C" "c" "3" "6" "9"
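For the matrix part of the question, one possible follow-up is to flatten the filtered list by one level and row-bind the vectors. This is a sketch under the assumption that all remaining vectors have the same length; res_toy mimics the filtered output above rather than being computed from DT:

```r
# Hand-written stand-in for the filtered nested list
res_toy <- list(
  A = list(a = c("A", "a", "1", "4", "7")),
  B = list(b = c("B", "b", "2", "5", "8"))
)

# Drop one level of nesting, then stack the vectors as rows
flat <- unlist(res_toy, recursive = FALSE)
m <- do.call(rbind, flat)
print(m)
```

The row names ("A.a", "B.b") come from combining the outer and inner list names.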
I have written this loop to extract the names of each element of a vector that occurs within a time interval (bin). I was wondering if I am missing a faster way to do this... I want to implement a randomization aspect to vectors that are 1000s in length and as such do not want to rely on a loop.
mydata <- structure(c(1199.91666666667, 1200.5, 1204.63333333333, 1205.5,
1206.3, 1208.73333333333, 1209.06666666667, 1209.93333333333,
1210.98333333333, 1214.56666666667, 1216.06666666667, 1216.63333333333,
1216.91666666667, 1219.13333333333, 1221.35, 1221.51666666667,
1225.35, 1225.53333333333, 1225.96666666667, 1227.61666666667,
1228.91666666667, 1230.31666666667, 1233.53333333333, 1235.8,
1237.51666666667, 1239.41666666667, 1241.6, 1247.08333333333,
1247.45, 1252.7, 1253.26666666667), .Names = c("B", "A", "B",
"E", "A", "A", "B", "G", "G", "C", "A", "D", "E", "B", "B", "E",
"E", "G", "F", "A", "C", "A", "F", "B", "A", "F", "F", "G", "F",
"G", "F"))
mydata
B A B E A A B G G C A D E B B E E
1199.917 1200.500 1204.633 1205.500 1206.300 1208.733 1209.067 1209.933 1210.983 1214.567 1216.067 1216.633 1216.917 1219.133 1221.350 1221.517 1225.350
G F A C A F B A F F G F G F
1225.533 1225.967 1227.617 1228.917 1230.317 1233.533 1235.800 1237.517 1239.417 1241.600 1247.083 1247.450 1252.700 1253.267
These represent consecutive times in seconds of events. Say we want to make our intervals 5s long. My approach is to make a vector of the beginning of each interval and then use a loop to find the names of elements occurring within that interval:
N=5
ints <- seq(mydata[1], mydata[length(mydata)], N)
out<-list()
for(i in 1:length(ints)){
out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])
}
out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
This is fine for small samples - but I can see this would get slow when dealing with very large samples that are permuted 1000s of times.
My suggestion is to use findInterval (based on an answer to this earlier question of mine):
mydata2 = c(-Inf, mydata)
ints <- seq(mydata[1], mydata[length(mydata)]+5, N)
idx = findInterval(ints-1e-10, mydata2)
out<-list()
for(i in 1:(length(ints)-1)){
out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])
}
As you can see I have to do a little tinkering with the beginning (adding a first value that is smaller than the first breakpoint, adding an epsilon). Here's the result, it is identical to yours:
> out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
In terms of speed for the example there is some improvement:
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 67.177 76.9725 85.73347 82.8035 95.866 119.890 100
mts 43.851 52.7150 62.72116 58.3130 73.007 96.099 100
alexis 75.573 86.5360 95.72593 91.4340 100.531 234.649 100
R_Yoda 2032.066 2158.4870 2303.68887 2191.3750 2281.409 8719.314 100
For larger vectors (I chose length 2000) this is clearer:
set.seed(123)
mydata = sort(runif(n = 2000, min = 0, max = 100))
names(mydata) = sample(LETTERS[1:7], size = 2000, replace = T)
mydata2 = c(-Inf, mydata)
ints2 <- seq(mydata[1], mydata[length(mydata)]+5, N)
dt <- data.table(data=mydata, names=names(mydata) )
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 804.243 846.9275 993.9957 862.0890 883.3140 7140.218 100
mts 77.439 88.8685 100.6148 100.0640 106.5955 188.466 100
alexis 187.066 204.7930 220.1689 215.5225 225.3190 299.026 100
R_Yoda 3831.348 4066.4640 4366.5382 4140.1700 4248.8635 11829.923 100
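The alexis entry in the benchmarks combines findInterval with split and needs no explicit loop. A minimal sketch of that idea on toy data (the times and breaks vectors are made up for illustration):

```r
# Toy named event times and 5-unit bins starting at 0
times  <- c(B = 0.5, A = 1.2, B = 5.1, E = 6.0, A = 11.3)
breaks <- seq(0, 10, by = 5)

# findInterval() gives the bin index of each time; split() groups the names by bin
bin <- findInterval(times, breaks)
out <- split(names(times), bin)
print(out)
```

Note that split drops bins that contain no events, whereas the loop versions above keep them as empty entries.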
For performance reasons I am using data.table:
Edit: This solution works, but is NOT very fast (as proved by the answer of mts)
library(Hmisc)
library(data.table)
# assuming that your mydata vector from the question is loaded
N=5 # code from your question...
ints <- seq(mydata[1], mydata[length(mydata)], N) # code from your question...
dt <- data.table(data=mydata, names=names(mydata) )
dt[, groups := cut2(data,ints)] # attention: shall the interval ends be included in the group or not?
groups <- dt[ , .(result=list(names)), by=groups] # the elements of a data.table can be a list itself!
# to get the result as list:
out <- groups[,result]
out
Edit: You could replace cut2 by findInterval and do it all in one line, but it is still slower:
out <- dt[, .(result=list(names)), by = findInterval(data,ints) ]
This is the result:
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
Here's a little code to illustrate my problem:
x <- 1:10
# > x
# [1] 1 2 3 4 5 6 7 8 9 10
y <- rep(letters[1:2], 5)
# > y
# [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"
z <- rep(c(5,4), 5)
# > z
# [1] 5 4 5 4 5 4 5 4 5 4
Now, depending on the order in which I issue the next two commands, I get different subassignments:
x first, y second:
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > x
# [1] "1" "a" "3" "4" "5" "6" "7" "8" "9" "10"
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > y
# [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"
y first, x second:
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > y
# [1] "a" "a" "a" "b" "a" "b" "a" "b" "a" "b"
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > x
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
The assignment to the second vector depends on the assignment already made to the first. Hence I need to make sure that the relevant indices are still available for the second assignment. My first idea is:
x[ind <- ((x == 2) & (y != "a") & (z == 4))] <- "a"
y[ind] <- "a"
rm(ind)
I want to avoid a separate call to do the assignment of the ind vector given that I might be doing a lot of this. Would that still be considered good coding in R or can it lead to any devious behaviour I haven't thought of?
Your solution seems fine. However, I would still regard your code as somewhat bad practice. Consider your first bullet:
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
At line 1, your numeric variable x is converted to character when you assign "a" to the TRUE indices (or maybe not, if no index is TRUE). Hence your output type is not really clear. That's somewhat bad practice and can lead to all sorts of problems downstream. You should stay within one type.
This also means that the x == 2 in your second line in the above is somewhat unclear though R correctly interprets the comparison. Again however, it could cause problems in a more elaborate example. But maybe you don't have these type issues in your application.
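One way to follow both points is to compute the index once, before either vector changes, and to assign a value of the vector's own type. This is only a sketch; using NA_integer_ as the marker for x is an assumption, since the right replacement depends on the application:

```r
x <- 1:10
y <- rep(letters[1:2], 5)
z <- rep(c(5, 4), 5)

# Evaluate the condition once, on the unmodified vectors
ind <- (x == 2) & (y != "a") & (z == 4)

# Type-preserving assignments: x stays integer, y stays character
x[ind] <- NA_integer_
y[ind] <- "a"
print(typeof(x))
```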
I have a vector of characters consisting of only 'a' or 'g', I want to convert them to integers based on frequency, i.e. the more frequent one should be coded to 0, and the other to 1, for example:
set.seed(17)
x = sample(c('g', 'a'), 10, replace=T)
x
# [1] "g" "a" "g" "a" "g" "a" "g" "g" "a" "g"
x[x == names(which.max(table(x)))] = 0
x[x != 0] = 1
x
# [1] "0" "1" "0" "1" "0" "1" "0" "0" "1" "0"
This works, but I wonder if there is a more efficient way to do it.
(We don't have to consider the 50%-50% case here, because it should never happen in our study.)
Use this:
ag.encode <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 ) 1-result else as.numeric(result)
}
If you want to keep the labels in a factor structure, use this instead:
ag.encode2factor <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 )
{
factor(2-result, labels=c("a","g"))
}
else
{
factor(result+1, labels=c("g","a"))
}
}
You can convert your character vector to a factor one. This solution is more general in the sense you don't need to know the name of the 2 characters used to create x.
y <- as.integer(factor(x))-1
if(sum(y)>length(y)/2) y <- as.integer(!y)
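A compact variant of the approach in the question that returns integers directly, sketched under the question's own assumptions (exactly two distinct values, no tie); most_common and encoded are illustrative names:

```r
# Example vector from the question
x <- c("g", "a", "g", "a", "g", "a", "g", "g", "a", "g")

# The most frequent value maps to 0, the other one to 1
most_common <- names(which.max(table(x)))
encoded <- as.integer(x != most_common)
print(encoded)
```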