Here's a little code to illustrate my problem:
x <- 1:10
# > x
# [1] 1 2 3 4 5 6 7 8 9 10
y <- rep(letters[1:2], 5)
# > y
# [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"
z <- rep(c(5,4), 5)
# > z
# [1] 5 4 5 4 5 4 5 4 5 4
Now, depending on the order in which I issue the next two commands, I get different subassignments:
x first, y second:
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > x
# [1] "1" "a" "3" "4" "5" "6" "7" "8" "9" "10"
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > y
# [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b"
y first, x second:
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > y
# [1] "a" "a" "a" "b" "a" "b" "a" "b" "a" "b"
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
# > x
# [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
The assignment to the second vector depends on the assignment already made to the first. Hence I need to make sure that the relevant indices are still available for the second assignment. My first idea is:
x[ind <- ((x == 2) & (y != "a") & (z == 4))] <- "a"
y[ind] <- "a"
rm(ind)
I want to avoid a separate call to do the assignment of the ind vector given that I might be doing a lot of this. Would that still be considered good coding in R or can it lead to any devious behaviour I haven't thought of?
Your solution seems fine. However, I would still regard your code as somewhat bad practice. Consider your first bullet:
x[(x == 2) & (y != "a") & (z == 4)] <- "a"
y[(x == 2) & (y != "a") & (z == 4)] <- "a"
In the first line, your numeric variable x is converted to character if any index is TRUE, because you assign "a" to those positions; if no index is TRUE, x stays numeric. Hence the output type is not really clear. That's somewhat bad practice and can lead to all sorts of problems downstream. You should stay within one type.
This also means that the x == 2 in the second line is somewhat unclear, although R interprets the comparison correctly. It could, however, cause problems in a more elaborate example. But maybe you don't have these type issues in your application.
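One way to follow this advice, sketched under the assumption that a character result is acceptable for x: compute the index once, before either vector is modified, and make the type change explicit. The names ind and x_chr are only illustrative and do not come from the question.

```r
x <- 1:10
y <- rep(letters[1:2], 5)
z <- rep(c(5, 4), 5)

# Build the logical index once, while x and y are still unmodified
ind <- (x == 2) & (y != "a") & (z == 4)

# Make the type change explicit instead of relying on silent coercion
x_chr <- as.character(x)
x_chr[ind] <- "a"
y[ind] <- "a"

which(ind)  # position(s) that were changed in both vectors
x_chr
y
```

Because ind is computed before any assignment, the order of the two assignments no longer matters, and x itself keeps its original integer type.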
I created this small example. I want to print some values, for example, B for NA values using the if else statement.
x = c(1,7,NA, 3, NA, NA)
for(i in 1: length(x)){
y = x[i]
if(y == 1){
print("A")
}
else if(y == 'NA'){
print("B")
}
else{
print("C")
}
}
I am getting the error message: Error in if (y == 1) { : missing value where TRUE/FALSE needed. Why can't I print B for NA values? How do I use NA within the if else statement?
The issue is that == with NA returns NA, not FALSE, and if() cannot handle an NA condition. Add a condition that checks for NA as well, and replace y == 'NA' with is.na(y):
for(i in 1:length(x)){
y = x[i]
if(!is.na(y) && y == 1){
print("A")
}
else if(is.na(y)){
print("B")
}
else{
print("C")
}
}
-output
[1] "A"
[1] "C"
[1] "B"
[1] "C"
[1] "B"
[1] "B"
Or this can be done in a vectorized way
c("C", "B", "A")[1 + is.na(x) + 2 *(x %in% 1)]
#[1] "A" "C" "B" "C" "B" "B"
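To see why the original if fails, it can help to look at what the comparisons return on their own. This is a small sketch using only base R; isTRUE() is shown as one possible guard, not as the only fix.

```r
y <- NA

y == 1            # NA, not FALSE: this is what if() cannot handle
is.na(y)          # TRUE
isTRUE(y == 1)    # FALSE: isTRUE() turns NA into FALSE
y %in% 1          # FALSE: %in% never returns NA

# "NA" in quotes is an ordinary string, not a missing value
is.na("NA")       # FALSE
```

This is also why the vectorized version above uses x %in% 1 rather than x == 1: the former can be used in arithmetic without propagating NA.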
To avoid repetition, ensure that the first block checks for NA:
x = c(1,7,NA, 3, NA, NA)
for(i in 1: length(x)){
y = x[i]
if(is.na(y)){
print("B")
}
else if(y == 1){
print("A")
}
else{
print("C")
}
}
[1] "A"
[1] "C"
[1] "B"
[1] "C"
[1] "B"
[1] "B"
You can do this in a vectorized way using case_when or nested ifelse:
dplyr::case_when(is.na(x) ~ 'B',
x == 1 ~ 'A',
TRUE ~ 'C')
#[1] "A" "C" "B" "C" "B" "B"
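The nested ifelse variant mentioned above is not spelled out, so here is a minimal base R sketch of it. The NA check has to come first; the variable name res is only illustrative.

```r
x <- c(1, 7, NA, 3, NA, NA)

# Outer test handles NA; the inner ifelse is only used where x is not NA
res <- ifelse(is.na(x), "B", ifelse(x == 1, "A", "C"))
res
```

The inner ifelse still returns NA for the missing positions, but those positions are never picked because the outer test is TRUE there.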
I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).
Consider the following example vector in R:
x <- rep(LETTERS[1:5], 3) # Create example vector
x
# [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears twice in a row at positions 5 and 6:
set.seed(53135)
sample(x) # sample() function puts same elements too close
# [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
How could I ensure that identical elements have a minimum distance of three?
So basically we need to sample, at each step, one element from the x vector that has not been selected in the preceding draws (the last min.dist of them in the code below). Using purrr's reduce we can achieve this:
library(purrr)
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Bundled in a function
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
After @27ϕ9's comment:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E
3 3 3 3 3
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
UPDATE
After some trial and error, and given that there will not always be enough elements left to keep them min.dist apart, I came up with the following solution. This code is the most thoroughly commented of the versions above:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), function(.x, ...){
# whether the value is in the tail of the aggregated vector
in.tail <- x %in% tail(.x, min.dist)
# whether a value still hasn't reached the max frequency
freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
# whether a value isn't in the aggregated vector
yet <- !x %in% .x
# the if is there basically to account for the cases when we don't have enough vars to space out the vectors
c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else x[which(freq.got)[1]] )
}, .init=sample(x,1))
}
Now table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for every value, and we can say with some certainty that the values in the vector are spaced with a minimum distance of 2. The only way to guarantee that no element repeats too early is to use min.dist=length(unique(x))-1. Otherwise there will be instances where at most r < min.dist elements are not min.dist away from their previous occurrence; if such elements exist, they will be in the length(x) + 1 - 1:min.dist subset of the resulting vector, i.e. its last min.dist positions.
Just to be completely certain, we can use a loop to check whether the tail of the output vector has unique values (remove the print call; I used it just for demonstration purposes):
shuffler <- function(x, min.dist=2){
while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E
3 3 3 3 3
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E
3 3 3 3 3
Update:
shuffler <- function(x, min.dist=2){
while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
l
}
This new version rigorously tests whether the elements in the tail of the vector are at least min.dist apart. The previous version works for min.dist=2, but this new version does better testing.
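To check any shuffled vector as a whole, not just its tail, a small helper along these lines could be used. This is a sketch, not part of the answer above: check_min_dist is a hypothetical name, and it assumes the same convention as shuffle, namely that min.dist other elements must sit between two identical ones.

```r
check_min_dist <- function(v, min.dist = 2) {
  # Position gaps between consecutive occurrences of each distinct value
  gaps <- unlist(lapply(unique(v), function(val) diff(which(v == val))))
  # min.dist other elements in between means a gap of at least min.dist + 1
  all(gaps > min.dist)
}

# The sample() result from the question: "C" sits at positions 5 and 6
bad  <- c("B", "A", "E", "D", "C", "C", "E", "A", "B", "C", "D", "E", "A", "D", "B")
# The unshuffled example vector: identical elements are 5 positions apart
good <- rep(LETTERS[1:5], 3)

check_min_dist(bad, 2)
check_min_dist(good, 2)
```

The first call fails the check because of the adjacent "C" values, while the second passes.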
If your data is large, then it may be (way) faster to rely on probability to do that kind of task.
Here's an example:
prob_shuffler = function(x, min.dist = 2){
n = length(x)
res = sample(x)
OK = FALSE
# We loop until we have a solution
while(!OK){
OK = TRUE
for(i in 1:min.dist){
# We check if identical elements are 'i' steps away
pblm = res[1:(n-i)] == res[-(1:i)]
if(any(pblm)){
if(sum(pblm) >= (n - i)/2){
# back to square 1
res = sample(x)
} else {
# we pair each identical element with
# an extra one
extra = sample(which(!pblm), sum(pblm))
id_reshuffle = c(which(pblm), extra)
res[id_reshuffle] = sample(res[id_reshuffle])
}
# We recheck from the beginning
OK = FALSE
break
}
}
}
res
}
Even though the while loop looks scary, in practice convergence is fast. Of course, the lower the probability to have two characters at min.dist away, the faster the convergence.
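The core of the check is the comparison res[1:(n-i)] == res[-(1:i)], which lines the vector up against a copy of itself shifted by i positions. A tiny sketch on a made-up vector v (not from the answer) shows what it returns:

```r
v <- c("A", "B", "A", "C", "C")
n <- length(v)

# i = 1: compares each element with its direct successor
i <- 1
lag1 <- v[1:(n - i)] == v[-(1:i)]
lag1

# i = 2: compares each element with the one two positions later
i <- 2
lag2 <- v[1:(n - i)] == v[-(1:i)]
lag2
```

A TRUE at position k means v[k] and v[k + i] are identical, which is exactly the kind of position prob_shuffler reshuffles.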
The current solutions by @Abdessabour Mtk and @Carles Sans Fuentes work but, depending on the size of the input data, quickly become prohibitively slow. Here's a benchmark:
library(microbenchmark)
x = rep(c(letters, LETTERS), 10)
length(x)
#> [1] 520
microbenchmark(prob_shuffler(x, 1), shuffler_am(x, 1), shuffler_csf(x, 1), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 1) 87.001 111.501 155.071 131.801 192.401 264.401 10
#> shuffler_am(x, 1) 17218.100 18041.900 20324.301 18740.351 22296.301 26495.200 10
#> shuffler_csf(x, 1) 86771.401 88550.501 118185.581 95582.001 98781.601 341826.701 10
microbenchmark(prob_shuffler(x, 2), shuffler_am(x, 2), shuffler_csf(x, 2), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 2) 140.1 195.201 236.3312 245.252 263.202 354.101 10
#> shuffler_am(x, 2) 18886.2 19526.901 22967.6409 21021.151 26758.800 29133.400 10
#> shuffler_csf(x, 2) 86078.1 92209.901 97151.0609 97612.251 99850.101 107981.401 10
microbenchmark(prob_shuffler(x, 3), shuffler_am(x, 3), shuffler_csf(x, 3), times = 10)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> prob_shuffler(x, 3) 318.001 450.402 631.5312 573.352 782.2 1070.401 10
#> shuffler_am(x, 3) 19003.501 19622.300 23314.4808 20784.551 28281.5 32885.101 10
#> shuffler_csf(x, 3) 87692.701 96152.202 101233.5411 100925.201 108034.7 113814.901 10
Two things stand out: a) as expected, the speed of prob_shuffler depends on min.dist, while the other methods are largely unaffected by it; b) prob_shuffler is on the order of 100-fold faster for just 520 observations (and it scales).
Of course, if the probability of having two identical characters within min.dist of each other is extremely high, then the recursive methods should be faster. But in most practical cases, the probability method is faster.
I hope this answer works for you. It uses only base R. I have left the print statements in (commented out) in case you want to check it line by line:
x <- rep(LETTERS[1:5], 3) # Create example vector
shuffle <- function(x, min_dist=3){
#init variables
result<-c() # result vector
count<-0
vec_use<-x
vec_keep<-c()
for(i in 1:length(x)){
# print(paste0("iteration =", i))
if (count>min_dist){
valback<-vec_keep[1]
# print(paste0("value to be returned:", valback))
ntimes_valback<-(table(vec_keep)[valback])
vec_use<- c(vec_use,rep(valback,ntimes_valback))
# print(paste0("vec_use after giving back valbak =", valback))
# print(paste0(vec_use,","))
vec_keep <- vec_keep[!vec_keep %in% valback]
# print(paste0("vec_keep after removing valback =", valback))
# print(paste0(vec_keep,","))
}
val<-sample(vec_use,1)
# print(paste0("val = ",val))#remove value
vec_keep<- c(vec_keep,x[x %in% val])
vec_keep<-vec_keep[1:(length(vec_keep)-1)]#removing 1 letter
# print(paste0("vec_keep ="))
# print(paste0(vec_keep,","))
vec_use <- vec_use[!vec_use %in% val]
# print(paste0("vec_use ="))
# print(paste0(vec_use,","))
result[i]<-val
count<-count+1
}
return(result)
}
shuffle(x)
"C" "D" "B" "E" "C" "A" "B" "D" "E" "A" "C" "D" "B" "E" "C"
I can't get my head around this problem regarding ifelse:
Say I have two vectors:
x <- c(0, 1:4, 1:4)
y <- letters[1:3]
When I do
ifelse(x==2, y[x], x)
I get
"0" "1" "c" "3" "4" "1" "c" "3" "4"
However, it should return "b" at position 2 of vector y.
Why is ifelse doing that?
To explain this strange behaviour the source code of ifelse is helpful (see below).
As soon as you call ifelse the expressions passed as the arguments test, yes and no are evaluated resulting in:
Browse[2]> test
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
Browse[2]> yes
[1] "a" "b" "c" NA "a" "b" "c" NA
Browse[2]> no
[1] 0 1 2 3 4 1 2 3 4
Observe that y[x] uses the values of x to pick values from y.
The index 0 selects nothing (it is dropped), and indices above 3 give NA,
which is why the yes argument becomes
[1] "a" "b" "c" NA "a" "b" "c" NA
The code line
ans[test & ok] <- rep(yes, length.out = length(ans))[test & ok]
is then applied at the end and effectively updates all TRUE elements using the test logical vector:
yes[test]
which results in:
[1] "c" "c"
being stored in the result indices 3 and 7
ans[test & ok]
So the problem is the combination of using y[x] as the second argument to ifelse and the non-intuitive ifelse behaviour of using a logical index to pick the "TRUE" results from y[x]...
Lesson learned: Avoid complicated ifelse logic; it has a lot of side effects (e.g. you may lose the correct data type or attributes).
# ifelse function
function (test, yes, no)
{
if (is.atomic(test)) {
if (typeof(test) != "logical")
storage.mode(test) <- "logical"
if (length(test) == 1 && is.null(attributes(test))) {
if (is.na(test))
return(NA)
else if (test) {
if (length(yes) == 1) {
yat <- attributes(yes)
if (is.null(yat) || (is.function(yes) && identical(names(yat),
"srcref")))
return(yes)
}
}
else if (length(no) == 1) {
nat <- attributes(no)
if (is.null(nat) || (is.function(no) && identical(names(nat),
"srcref")))
return(no)
}
}
}
else test <- if (isS4(test))
methods::as(test, "logical")
else as.logical(test)
ans <- test
ok <- !(nas <- is.na(test))
if (any(test[ok]))
ans[test & ok] <- rep(yes, length.out = length(ans))[test &
ok]
if (any(!test[ok]))
ans[!test & ok] <- rep(no, length.out = length(ans))[!test &
ok]
ans[nas] <- NA
ans
}
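In the spirit of the lesson above, one possible way to avoid ifelse here is plain subassignment. This is a sketch that assumes a character result is wanted; res and hit are illustrative names.

```r
x <- c(0, 1:4, 1:4)
y <- letters[1:3]

res <- as.character(x)   # decide the output type up front
hit <- x == 2            # logical index, same length as x
res[hit] <- y[x[hit]]    # look up y only at the selected positions
res
```

Since y is indexed only with the values of x at the selected positions, the dropped zero index and the NA values for indices above 3 never come into play.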
You are using 0 as an index in the first element so that is why the alignment is messed up.
y[x]
[1] "a" "b" "c" NA "a" "b" "c" NA
So
> y[0]
character(0)
> y[1]
[1] "a"
> y[2]
[1] "b"
> y[3]
[1] "c"
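So the length of y[x] is different from the length of x.
Shifting the index by one restores the length, but it also shifts every lookup, so the result still contains "c" rather than "b":
> ifelse(x==2, y[x+1], x)
[1] "0" "1" "c" "3" "4" "1" "c" "3" "4"
and the lengths only match this way if the first element is always 0.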
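If the goal is to get "b" out of ifelse itself, one option is to turn the zero into NA, because an NA index keeps its slot whereas a 0 index is dropped. This is a sketch and not part of the original answer; idx is an illustrative name.

```r
x <- c(0, 1:4, 1:4)
y <- letters[1:3]

length(x)      # 9
length(y[x])   # 8, because the zero index is dropped

idx <- replace(x, x == 0, NA)   # NA keeps its slot, unlike 0
length(y[idx])                  # 9, aligned with x again

res <- ifelse(x == 2, y[idx], x)
res
```

With the yes argument aligned to x, the TRUE positions 3 and 7 now pick up y[2].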
Old answer
Because
x <- c(0, 1:4, 1:4)
returns
[1] 0 1 2 3 4 1 2 3 4
so x==2
returns
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
so for y = letters[1:3]
ifelse(x==2, y[x], x)
You are going to get the letters in the third and seventh positions.
The documentation for ifelse says that if one vector is too short it will be recycled which you would expect to be
c("a","b","c","a","b","c","a").
However when I try
ifelse(x==3, y[x], x)
I get
[1] "0" "1" "2" NA "4" "1" "2" NA "4"
Which tells me that the recycling is not working the way I would expect.
So that's the nominal reason you are getting the result. The reason it works like that is something I don't know now, but if I figure it out I will add to this answer. I suspect it has to do with the conversion to a string.
Just looking at y[x] I get
[1] "a" "b" "c" NA "a" "b" "c" NA
Which, by the way is only length 8 even though x is length 9.
So this really doesn't have to do with ifelse() at all, it is really about a different issue with recycling.
From Comment: It returns c because: which(x==2) returns 3 and 7. I don't know why it doesn't recycle 7 but chooses only 3. Perhaps because y is less than length 7
Try:
ind<-which(x==2)
ind1<-ind[1]-1
ifelse(x==2,y[ind1],x)
[1] "0" "1" "b" "3" "4" "1" "b" "3" "4"
Here's an attempt to make a function:
dynamic_index<-function(ind,x,y){
x<-x
y<-y
ind1<-which(x==ind)
ind2<-ind1[1]-1
ifelse(x==ind,y[ind2],x)
}
dynamic_index(2,x,y)
The result occurs that way because the == function returns a vector of logicals:
x <- c(0, 1:4, 1:4)
y <- letters[1:3]
ifelse(x==2, y[x], x)
#look at x==2
x==2
[1] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
It's a logical vector that has true in the third position, not the second so it is the third value of y that is selected. This also shows why the answer that references the behavior of which is incorrect.
x <- c(0, 1:4, 1:4)
y <- letters[1:3]
ifelse(x==2, y[x], x)
ifelse checks every position of the test x == 2. Wherever the test is TRUE, it returns the element of y[x] at that same position, so the position that was checked in x is the position that is read from y[x]. This is repeated for all values in x.
I would like to name the elements of a list on the fly with the content of a variable. How should I do this?
library(data.table)
DT <- data.table(A=LETTERS[1:3], B=letters[1:3], C= 1:9)
lapply(unique(DT$A), function(xA){
RTN <-
lapply(unique(DT$B), function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output)== 0L) {
}else{
c(xA, xB, output)
}
})
})
the result is
[[1]]
[[1]][[1]]
[1] "A" "a" "1" "4" "7"
[[1]][[2]]
NULL
[[1]][[3]]
NULL
[[2]]
[[2]][[1]]
NULL
[[2]][[2]]
[1] "B" "b" "2" "5" "8"
[[2]][[3]]
NULL
I would like to make it as following
[[A]]
[[A]][[a]]
[1] "A" "a" "1" "4" "7"
[[A]][[b]]
NULL
[[A]][[c]]
NULL
[[B]]
[[B]][[a]]
NULL
[[B]][[b]]
[1] "B" "b" "2" "5" "8"
[[B]][[c]]
NULL
Besides, how can I remove the NULL, and make it a complete case matrix? Many thanks.
Here are two solutions:
1) Use sapply and set USE.NAMES = TRUE (together with simplify = FALSE, so that a list is returned)
2) Capture the names before each lapply and set them after.
DT <- data.table(A=LETTERS[1:3], B=letters[1:3], C= 1:9)
outer_list_names <- unique(DT$A)
outer_list <- lapply(unique(DT$A), function(xA){
RTN_names = unique(DT$B)
RTN <-
lapply(unique(DT$B), function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output)== 0L) {
}else{
c(xA, xB, output)
}
})
names(RTN) <- RTN_names
RTN # return the named list, not the value of the assignment
})
names(outer_list) <- outer_list_names
outer_list
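The first option is not shown above, so here is a minimal sketch of it. To keep it runnable without extra packages it assumes a plain data.frame DF standing in for the data.table DT; res is an illustrative name.

```r
# Stand-in for DT, built with base R only (an assumption of this sketch)
DF <- data.frame(A = LETTERS[1:3], B = letters[1:3], C = 1:9,
                 stringsAsFactors = FALSE)

# With a character input and USE.NAMES = TRUE, sapply names the result by
# the input values; simplify = FALSE keeps the result a list
res <- sapply(unique(DF$A), function(xA) {
  sapply(unique(DF$B), function(xB) {
    output <- DF$C[DF$A == xA & DF$B == xB]
    if (length(output) > 0L) c(xA, xB, output)
  }, simplify = FALSE, USE.NAMES = TRUE)
}, simplify = FALSE, USE.NAMES = TRUE)

names(res)
res$A$a
```

Empty combinations such as res$A$b are still NULL here; only the naming changes.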
We could create a named vector to name the list
A_vec <- setNames(unique(DT$A), unique(DT$A))
B_vec <- setNames(unique(DT$B), unique(DT$B))
lapply(A_vec, function(xA){
RTN <- lapply(B_vec, function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output) > 0L) {
c(xA, xB, output)
}
})
})
#$A
#$A$a
#[1] "A" "a" "1" "4" "7"
#$A$b
#NULL
#$A$c
#NULL
#$B
#$B$a
#NULL
#$B$b
#[1] "B" "b" "2" "5" "8"
#$B$c
#NULL
If you want to remove the NULL values, we can use Filter to drop them
lapply(A_vec, function(xA){
RTN <- lapply(B_vec, function(xB){
output <- DT[A == xA & B == xB]$C
if(length(output) > 0L) {
c(xA, xB, output)
}
})
Filter(Negate(is.null), RTN)
})
#$A
#$A$a
#[1] "A" "a" "1" "4" "7"
#$B
#$B$b
#[1] "B" "b" "2" "5" "8"
#$C
#$C$c
#[1] "C" "c" "3" "6" "9"
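For the matrix part of the question, one possible follow-up is to flatten the filtered list and row-bind it. The sketch below hard-codes the filtered result shown above as nested so that it runs on its own; nested, flat and m are illustrative names.

```r
# The filtered result from above, written out by hand for this sketch
nested <- list(
  A = list(a = c("A", "a", "1", "4", "7")),
  B = list(b = c("B", "b", "2", "5", "8")),
  C = list(c = c("C", "c", "3", "6", "9"))
)

# Remove one level of nesting; names become "A.a", "B.b", "C.c"
flat <- unlist(nested, recursive = FALSE)

# Stack the vectors as rows of a character matrix
m <- do.call(rbind, flat)
m
```

This relies on all remaining vectors having the same length, which holds in this example.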
I have a character vector consisting only of 'a' or 'g'. I want to convert the values to integers based on frequency, i.e. the more frequent one should be coded as 0 and the other as 1. For example:
set.seed(17)
x = sample(c('g', 'a'), 10, replace=T)
x
# [1] "g" "a" "g" "a" "g" "a" "g" "g" "a" "g"
x[x == names(which.max(table(x)))] = 0
x[x != 0] = 1
x
# [1] "0" "1" "0" "1" "0" "1" "0" "0" "1" "0"
This works, but I wonder if there is a more efficient way to do it.
(We don't have to consider the 50%-50% case here, because it should never happen in our study.)
Use this:
ag.encode <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 ) 1-result else as.numeric(result)
}
If you want to keep the labels in a factor structure, use this instead:
ag.encode2factor <- function(x)
{
result <- x == "a"
if( sum(result) > length(result) %/% 2 )
{
factor(2-result, labels=c("a","g"))
}
else
{
factor(result+1, labels=c("g","a"))
}
}
You can convert your character vector to a factor. This solution is more general in the sense that you don't need to know the names of the 2 characters used to create x.
y <- as.integer(factor(x))-1
if(sum(y)>length(y)/2) y <- as.integer(!y)
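Another compact variant, sketched here with the example vector from the question typed in by hand (so it does not depend on the random seed): compare every element against the most frequent value and convert the logical result to integer. The names major and code are illustrative.

```r
x <- c("g", "a", "g", "a", "g", "a", "g", "g", "a", "g")

# Most frequent value (ties are assumed not to occur, as in the question)
major <- names(which.max(table(x)))

# FALSE/TRUE becomes 0/1: 0 for the majority value, 1 for the other
code <- as.integer(x != major)
code
```

Unlike the original approach, this leaves x untouched and returns integers rather than the strings "0" and "1".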