Make R consider negative position searches as being out of bounds

vec<-c("a", "b", "c", "d")
My task is to extract the second element to the right and to the left of a key string.
If the key string is "d" and we do
i<-c("d")
vec.1 <- append(vec.1, vec[which(vec == i) + 2])
we get NA. But if we do
i<-c("a")
vec.1 <- append(vec.1, vec[which(vec == i) - 2])
we get "b", "c", "d", because a negative subscript drops elements instead of addressing a position. Is it possible to treat negative subscripts as positions outside the vector, like a positive subscript that exceeds the length of the vector? That way the result would be NA.

library(Hmisc)
Lag(vec,2)[vec=="d"]
#[1] "b"
Lag(vec,2)[vec=="a"]
#[1] ""
Lag(vec,-2)[vec=="a"]
#[1] "c"
Lag(vec,-2)[vec=="c"]
#[1] ""

I'm sure I could do better, but it's late here. Why not write a small function to do what you want?
myVec <- function(input, match, change) {
  temp <- which(input == match)          # position of the key string
  if ((temp + change) <= 0) {
    append(NA, input)                    # target lies before the first element
  } else {
    append(input, input[temp + change])  # past the end, R returns NA by itself
  }
}
vec <- c("a", "b", "c", "d")
myVec(vec, "a", -1)
# [1] NA "a" "b" "c" "d"
myVec(vec, "c", -1)
# [1] "a" "b" "c" "d" "b"
myVec(vec, "c", -3)
# [1] NA "a" "b" "c" "d"
myVec(vec, "d", 1)
# [1] "a" "b" "c" "d" NA
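
If only the looked-up element is needed, one way to get NA in both directions is to compute the target index first and set anything outside 1..length(x) to NA before subsetting. The following is a minimal sketch, not taken from either answer above; the helper name safe_offset and its arguments are made up for illustration.

```r
# Hypothetical helper: return the element `offset` positions away from `key`,
# or NA when that position lies outside the vector.
safe_offset <- function(x, key, offset) {
  idx <- which(x == key) + offset
  # Indices below 1 or above length(x) are treated as out of bounds
  idx[idx < 1 | idx > length(x)] <- NA_integer_
  x[idx]
}

vec <- c("a", "b", "c", "d")
safe_offset(vec, "d", 2)   # NA
safe_offset(vec, "a", -2)  # NA
safe_offset(vec, "c", -2)  # "a"
```

Because the index is replaced by NA before it reaches `[`, a negative value never gets the chance to act as a "drop this element" subscript.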

Related

Subset a vector and retrieve first elements if exceed the length in R

Imagine I have a vector like this one:
c("A", "B", "C", "D")
There are 4 positions. If I draw a sample of size 1, I can get 1, 2, 3 or 4. What I want is to take a subset of length 3 of that vector, in order, starting at the sampled position. For example, if I get 2:
c("B", "C", "D")
If I get 3:
c("C", "D", "A")
If I get 4:
c("D","A", "B")
So that's the logic: the vector is ordered, and the last element connects back to the first one when I subset.

Using seq and the modulo operator, the function f below returns l elements of a vector v, starting at position n and wrapping around to the beginning when the end of v is reached.
f <- function(v, n, l) v[seq(n - 1, length.out = l) %% length(v) + 1]
Output (with v <- c("A", "B", "C", "D")):
f(v, n = 4, l = 3)
#[1] "D" "A" "B"
f(v, n = 3, l = 4)
#[1] "C" "D" "A" "B"
f(v, n = 2, l = 5)
#[1] "B" "C" "D" "A" "B"
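
To see why the one-liner wraps around, it may help to spell out the index arithmetic for a single call. This is only an unrolled version of the expression inside f for n = 4 and l = 3; the intermediate variable names are introduced here for illustration.

```r
v <- c("A", "B", "C", "D")
n <- 4
l <- 3

zero_based <- seq(n - 1, length.out = l)  # 3 4 5: positions counted from 0
wrapped    <- zero_based %% length(v)     # 3 0 1: anything past the end wraps to 0
idx        <- wrapped + 1                 # 4 1 2: back to R's 1-based positions
v[idx]                                    # "D" "A" "B"
```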

I think I got it!
v <- c("A", "B", "C", "D")
p <- sample(1:length(v), 1)
r <- c(v[p:length(v)])
c(r, v[!(v %in% r)])[1:3]
And the outputs:
v <- c("A", "B", "C", "D") # your vector
r <- c(v[2:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "B" "C" "D"
r <- c(v[3:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "C" "D" "A"
r <- c(v[4:length(v)])
c(r, v[!(v %in% r)])[1:3]
#> [1] "D" "A" "B"
Created on 2022-05-16 by the reprex package (v2.0.1)
Wrapped in a function:
f <- function(v, nth) {
r <- c(v[nth:length(v)])
return(c(r, v[!(v %in% r)])[1:3])
}

Reconstructing Markov chain from figure in R

I am trying to reconstruct a Markov process from Shannon's paper "A Mathematical Theory of Communication". My question concerns figure 3 on page 8 and a corresponding sequence (message) from that Markov chain on page 5, section (B). I just wanted to check whether I coded the right Markov chain for this figure from Shannon's paper:
Here is my attempt:
install.packages("markovchain")
library(markovchain)
MessageABCDE = c("A", "B", "C", "D", "E")
MessageTransitionMatrix = matrix(c(.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1,
.4,.1,.2,.2,.1),
nrow = 5,
byrow = TRUE,
dimnames = list(MessageABCDE, MessageABCDE))
MCmessage = new("markovchain", states = MessageABCDE,
byrow = TRUE,
transitionMatrix = MessageTransitionMatrix,
name = "WritingMessage")
steadyStates(MCmessage)
markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
My goal was also to create a sequence (message) from that chain. I am mostly uncertain about the transition matrix, where I inferred that the probabilities had to be the same in every row. I am happy with the output of markovchainSequence, but I am not 100% sure that I am doing it right.
Here is my console output for markovchainSequence:
> markovchainSequence(n = 20, markovchain = MCmessage, t0 = "A")
[1] "D" "E" "A" "D" "A" "A" "B" "D" "E" "C" "A" "A" "E" "C" "C" "D" "D" "D"
[19] "A" "C"

Looks fine. It may seem odd, though, because with fully independent states like these there is no need for a Markov chain. One could equally well use
tokens <- c("A", "B", "C", "D", "E")
probs <- c(0.4, 0.1, 0.2, 0.2, 0.1)
sample(tokens, size=20, replace=TRUE, prob=probs)
## [1] "A" "B" "A" "B" "D" "B" "C" "D" "A" "D" "C" "E" "A" "A" "C" "E" "C" "D" "C" "C"
It will likely make more sense once the rows contain a variety of conditional probabilities.
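
To illustrate that last point, here is a small base-R sketch of a chain whose rows differ, so that the next symbol really depends on the current one. The transition matrix is made up for this example and is not taken from Shannon's paper; the function name simulate_chain is likewise hypothetical.

```r
# Hypothetical 3-state transition matrix: rows are the current state,
# columns the next state. The numbers are invented for illustration.
states <- c("A", "B", "C")
P <- matrix(c(0.8, 0.1, 0.1,
              0.2, 0.6, 0.2,
              0.0, 0.5, 0.5),
            nrow = 3, byrow = TRUE,
            dimnames = list(states, states))

simulate_chain <- function(P, n, t0) {
  states <- rownames(P)
  out <- character(n)
  current <- t0
  for (k in seq_len(n)) {
    # The next state is drawn from the row that belongs to the current state
    current <- sample(states, size = 1, prob = P[current, ])
    out[k] <- current
  }
  out
}

set.seed(1)
msg <- simulate_chain(P, n = 20, t0 = "A")
msg
```

With this matrix a "C" can never be followed directly by an "A", which is something a plain call to sample with fixed probabilities cannot express.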

Why second ifelse not evaluated in R and why if else does not vectorize?

Consider the following data frame, dfn:
dfn <- structure(list(GID7173723 = c("A", "T", "G", "A", "G"), GID4878677 = c("G",
"C", "G", "A", "G"), GID88208 = c("A", "T", "G", "A", "G"), GID346403 = c("A",
"T", "G", "A", "G"), GID268825 = c("G", "C", "G", "A", "G")), row.names = c(NA,
5L), class = "data.frame")
Here is how it looks:
  GID7173723 GID4878677 GID88208 GID346403 GID268825
1          A          G        A         A         G
2          T          C        T         T         C
3          G          G        G         G         G
4          A          A        A         A         A
5          G          G        G         G         G
Now consider two vectors:
ref <- c("A", "T", "G", "A", "G")
alt <- c("G", "C", "T", "C", "A")
And the function:
f = function(x){
ifelse(x==ref,2,x)
ifelse(x==alt,0,x)
}
When I run sapply, only the second ifelse seems to take effect:
sapply(dfn,f)
     GID7173723 GID4878677 GID88208 GID346403 GID268825
[1,] "A"        "0"        "A"      "A"       "0"
[2,] "T"        "0"        "T"      "T"       "0"
[3,] "G"        "G"        "G"      "G"       "G"
[4,] "A"        "A"        "A"      "A"       "A"
[5,] "G"        "G"        "G"      "G"       "G"
If I run something like this instead:
f = function(x){
  if (x==ref) {
    return(2)
  } else if (x==alt) {
    return(0)
  } else {
    return(x)
  }
}
I get the warning message:
sapply(dfn,f)
Warning messages:
1: In if (x == ref) { :
the condition has length > 1 and only the first element will be used
2: In if (x == ref) { :
the condition has length > 1 and only the first element will be used
3: In if (x == alt) { :
the condition has length > 1 and only the first element will be used
4: In if (x == ref) { :
the condition has length > 1 and only the first element will be used
5: In if (x == ref) { :
the condition has length > 1 and only the first element will be used
6: In if (x == ref) { :
the condition has length > 1 and only the first element will be used
7: In if (x == alt) { :
the condition has length > 1 and only the first element will be used
I believe the warnings from the latter function arise because if/else is not vectorized. I would really like to solve this problem without for loops or sweep, using only if/else statements together with the apply family of functions.

You may try making an assignment with the first call to ifelse:
f <- function(x){
x <- ifelse(x == ref, 2, x)
return(ifelse(x == alt, 0, x))
}
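The main problem with your current approach is that the first ifelse is not "sticking" without an assignment on the LHS: a function returns the value of its last expression, so the result of the first ifelse is computed and then discarded.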

When you are comparing values against a vector, use %in% rather than ==, because == does an element-wise comparison.
You also need to save the result of the first ifelse statement before executing the second one.
Change your function to:
f = function(x){
x <- ifelse(x %in% ref,2,x)
x <- ifelse(x %in% alt,0,x)
return(x)
}
You can then use lapply :
dfn[] <- lapply(dfn,f)
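
The two answers above do not produce the same result, so it is worth checking which comparison is actually wanted. The sketch below applies both variants to one column from the question; the names f_eq and f_in are introduced here only to tell them apart. With ==, element i of the column is compared with element i of ref or alt; with %in%, each element is looked up anywhere in ref or alt.

```r
ref <- c("A", "T", "G", "A", "G")
alt <- c("G", "C", "T", "C", "A")
x   <- c("G", "C", "G", "A", "G")   # column GID4878677 from the question

# Position-wise comparison (first answer)
f_eq <- function(x) {
  x <- ifelse(x == ref, 2, x)
  ifelse(x == alt, 0, x)
}

# Set membership (second answer)
f_in <- function(x) {
  x <- ifelse(x %in% ref, 2, x)
  ifelse(x %in% alt, 0, x)
}

f_eq(x)  # "0" "0" "2" "2" "2"
f_in(x)  # "2" "0" "2" "2" "2"
```

If ref and alt describe one row each, as the layout of the question suggests, the position-wise version is probably the intended one, but that is an assumption about the asker's data.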

While loop results not getting printed

Could anyone help me understand why the results of the while loop below are not displayed?
col_name <- c("A", "B", "C", "D")
i <- 1
while (i < length(col_name)) {
col_name[i]
i = i+1
}
Expected output
"A"
"B"
"C"

It needs a print(). Inside a loop, the value of an expression is not printed automatically:
while (i < length(col_name)) {
print(col_name[i])
i = i+1
}
#[1] "A"
#[1] "B"
#[1] "C"
If we need to store the output, initialize an object and update
out <- c()
i <- 1
while (i < length(col_name)) { out <- c(out, col_name[i]); i = i+1}
out
#[1] "A" "B" "C"
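
If the loop is only there to collect every element except the last one, the same vector can be obtained without a loop. A minimal sketch, assuming that "all but the last element" is really what the condition i < length(col_name) was meant to express:

```r
col_name <- c("A", "B", "C", "D")
out <- head(col_name, -1)  # a negative n drops that many elements from the end
print(out)                 # "A" "B" "C"
```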

Extracting names of vector by time bin

I have written this loop to extract the names of the elements of a vector that occur within each time interval (bin). I was wondering whether I am missing a faster way to do this. I want to add a randomization step for vectors that are thousands of elements long, so I would rather not rely on a loop.
mydata <- structure(c(1199.91666666667, 1200.5, 1204.63333333333, 1205.5,
1206.3, 1208.73333333333, 1209.06666666667, 1209.93333333333,
1210.98333333333, 1214.56666666667, 1216.06666666667, 1216.63333333333,
1216.91666666667, 1219.13333333333, 1221.35, 1221.51666666667,
1225.35, 1225.53333333333, 1225.96666666667, 1227.61666666667,
1228.91666666667, 1230.31666666667, 1233.53333333333, 1235.8,
1237.51666666667, 1239.41666666667, 1241.6, 1247.08333333333,
1247.45, 1252.7, 1253.26666666667), .Names = c("B", "A", "B",
"E", "A", "A", "B", "G", "G", "C", "A", "D", "E", "B", "B", "E",
"E", "G", "F", "A", "C", "A", "F", "B", "A", "F", "F", "G", "F",
"G", "F"))
mydata
       B        A        B        E        A        A        B        G
1199.917 1200.500 1204.633 1205.500 1206.300 1208.733 1209.067 1209.933
       G        C        A        D        E        B        B        E
1210.983 1214.567 1216.067 1216.633 1216.917 1219.133 1221.350 1221.517
       E        G        F        A        C        A        F        B
1225.350 1225.533 1225.967 1227.617 1228.917 1230.317 1233.533 1235.800
       A        F        F        G        F        G        F
1237.517 1239.417 1241.600 1247.083 1247.450 1252.700 1253.267
These represent consecutive times in seconds of events. Say we want to make our intervals 5s long. My approach is to make a vector of the beginning of each interval and then use a loop to find the names of elements occurring within that interval:
N=5
ints <- seq(mydata[1], mydata[length(mydata)], N)
out<-list()
for(i in 1:length(ints)){
out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])
}
out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
This is fine for small samples - but I can see this would get slow when dealing with very large samples that are permuted 1000s of times.

My suggestion is to use findInterval (based on an answer to this earlier question of mine):
mydata2 = c(-Inf, mydata)
ints2 <- seq(mydata[1], mydata[length(mydata)] + N, N)  # breakpoints, extended by one bin
idx <- findInterval(ints2 - 1e-10, mydata2)
out <- list()
for (i in 1:(length(ints2) - 1)) {
  out[[i]] <- names(mydata2[(idx[i] + 1):(idx[i + 1])])
}
As you can see I have to do a little tinkering with the beginning (adding a first value that is smaller than the first breakpoint, adding an epsilon). Here's the result, it is identical to yours:
> out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
In terms of speed for the example there is some improvement:
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
   expr      min        lq       mean    median       uq      max neval
jalapic   67.177   76.9725   85.73347   82.8035   95.866  119.890   100
    mts   43.851   52.7150   62.72116   58.3130   73.007   96.099   100
 alexis   75.573   86.5360   95.72593   91.4340  100.531  234.649   100
 R_Yoda 2032.066 2158.4870 2303.68887 2191.3750 2281.409 8719.314   100
For larger vectors (I chose length 2000) this is clearer:
set.seed(123)
mydata = sort(runif(n = 2000, min = 0, max = 100))
names(mydata) = sample(LETTERS[1:7], size = 2000, replace = T)
mydata2 = c(-Inf, mydata)
ints2 <- seq(mydata[1], mydata[length(mydata)]+5, N)
dt <- data.table(data=mydata, names=names(mydata) )
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
   expr      min        lq      mean    median        uq       max neval
jalapic  804.243  846.9275  993.9957  862.0890  883.3140  7140.218   100
    mts   77.439   88.8685  100.6148  100.0640  106.5955   188.466   100
 alexis  187.066  204.7930  220.1689  215.5225  225.3190   299.026   100
 R_Yoda 3831.348 4066.4640 4366.5382 4140.1700 4248.8635 11829.923   100
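
The alexis entry in the benchmark avoids the explicit loop altogether by combining split with findInterval. The sketch below shows that idea on the first nine values of mydata so that it runs on its own; the variable names small and loop_out are introduced here for illustration, and the loop from the question is included only to check that both give the same bins.

```r
small <- c(B = 1199.91666666667, A = 1200.5, B = 1204.63333333333,
           E = 1205.5, A = 1206.3, A = 1208.73333333333,
           B = 1209.06666666667, G = 1209.93333333333, G = 1210.98333333333)
N <- 5
ints <- seq(small[1], small[length(small)], N)

# findInterval() returns the bin number of every time stamp in one call;
# split() then groups the names by that bin number
bins <- findInterval(small, ints)
out <- split(names(small), bins)

# Loop from the question, for comparison
loop_out <- list()
for (i in seq_along(ints)) {
  loop_out[[i]] <- names(small[small >= ints[i] & small < ints[i] + N])
}

identical(unname(out), loop_out)
```

One difference to keep in mind: split only returns bins that contain at least one event, whereas the loop yields an empty entry for an empty bin.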

For performance reasons I am using data.table:
Edit: This solution works, but it is NOT very fast (as shown by the benchmark in mts's answer).
library(Hmisc)
library(data.table)
# assuming that your mydata vector from the question is loaded
N=5 # code from your question...
ints <- seq(mydata[1], mydata[length(mydata)], N) # code from your question...
dt <- data.table(data=mydata, names=names(mydata) )
dt[, groups := cut2(data,ints)] # attention: shall the interval ends be included in the group or not?
groups <- dt[ , .(result=list(names)), by=groups] # the elements of a data.table can be a list itself!
# to get the result as list:
out <- groups[,result]
out
Edit: You could replace cut2 by findInterval and do it all in one line, but it is still slower:
out <- dt[, .(result=list(names)), by = findInterval(data,ints) ]
This is the result:
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
