Generating Multiple Subsets in R - r

I have a large sequence of bytes, and I would like to generate a list containing an arbitrary number of subsets of that sequence. I suspect I need to use one of the apply functions, but the trick is that I need to iterate over the vector of starting positions, not the sequence itself.
Here's an example of how I want it to work --
extrct_by_mod <- function(x, startpos, endpos, lrecl)
{
x[1:length(x) %% lrecl %in% startpos:endpos]
}
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5
list_one <- extrct_by_mod(x=tmp_seq, startpos=startpos[1], endpos=endpos[1], lrecl=lrecl)
list_two <- extrct_by_mod(x=tmp_seq, startpos=startpos[2], endpos=endpos[2], lrecl=lrecl)
what_i_want <- list(list_one, list_two)
Ideally, I'd like to be able to just add more values to startpos and endpos, and thereby automatically generate more subsets to add to my list. Note that the subsets will not be the same length and, in some cases, not even the same type.
My datasets are fairly large, so something that scales well would be ideal. I realize that this could be done with a loop, but my understanding is that you generally want to avoid looping in R.
Thank you!

Saving some time by pre-calculating the modulo-selection index:
> cats <- 1:length(tmp_seq) %% lrecl
> mapply(function(start,end) { tmp_seq[cats %in% start:end]} , startpos, endpos)
[[1]]
[1] "a" "e" "f" "j" "k" "o" "p" "t" "u" "y"
[[2]]
[1] "b" "c" "d" "g" "h" "i" "l" "m" "n" "q" "r" "s" "v" "w" "x"
(It is not correct that R apply functions are any faster than equivalent loops.)
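One hedged aside on the call above: mapply() simplifies its result to a matrix when every subset happens to have the same length, so a sketch that should always return a list could use Map() instead. The inputs are copied from the question; the name subsets is introduced here only for illustration.

```r
# Inputs copied from the question
tmp_seq  <- letters[1:25]
startpos <- c(0, 2)
endpos   <- c(1, 5)
lrecl    <- 5

# Compute the position-modulo-record-length index once
cats <- seq_along(tmp_seq) %% lrecl

# Map() never simplifies, so the result is always a list,
# with one element per (start, end) pair
subsets <- Map(function(start, end) tmp_seq[cats %in% start:end],
               startpos, endpos)
```

Adding more values to startpos and endpos then yields more list elements without any other change.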

Related

Why subtracting an empty vector in R deletes everything?

Could someone please enlighten me why subtracting an empty vector in R results in the whole content of a data frame being deleted? Just to give an example
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
This gives me an empty JointProcedures3 if WhichInstances2 is empty (that is, when no JointID matches), but it should simply give me what JointProcedures3 was before those lines of code.
This is not the first time it has happened to me. I asked my supervisor, and it has happened to him as well; he just thinks it is a quirk of R.
Rewriting the code as
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
if(length(WhichInstances2)>0)
{
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
}
fixes the issue. But in principle it should not have made a scooby of a difference whether that conditional was there or not, since if length(WhichInstances2) were equal to 0, I would simply be subtracting nothing from the original JointProcedures3...
Thanks all for your input.
Let's try a simpler example to see what's happening.
x <- 1:5
y <- LETTERS[1:5]
which(x>4)
## [1] 5
y[which(x>4)]
## [1] "E"
So far so good ...
which(x>5)
## integer(0)
y[which(x>5)]
## character(0)
This is also fine. Now what if we negate? The problem is that integer(0) is a zero-length vector, so -integer(0) is also a zero-length vector, so y[-which(x>5)] is also a zero-length vector.
What can you do about it? Don't use which(); instead use logical indexing directly, and use ! to negate the condition:
y[!(x>5)]
## [1] "A" "B" "C" "D" "E"
In your case:
JointID_OK <- (JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[!JointID_OK,]
For what it's worth, this is section 8.1.13 in the R Inferno, "negative nothing is something"
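The same effect can be reproduced on a data frame. This is a minimal sketch using a hypothetical data frame df (not the asker's data) to show both behaviours side by side:

```r
# Hypothetical data frame, for illustration only
df <- data.frame(id = 1:3, val = c("a", "b", "c"))

idx <- which(df$id > 10)      # no matches, so idx is integer(0)
dropped <- df[-idx, ]         # -integer(0) is still integer(0): selects zero rows
nrow(dropped)                 # 0

kept <- df[!(df$id > 10), ]   # logical negation keeps every non-matching row
nrow(kept)                    # 3
```

One caveat, stated as general R behaviour rather than anything from the answer: if the condition can be NA, logical indexing returns NA rows where which() would silently skip them. %in% never returns NA, so the version suggested for the question's data is not affected.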
It seems you are checking for ids in a vector and you intend to remove them from another; probably setdiff is what you are looking for.
Consider a vector of the lowercase letters of the alphabet (letters, an R built-in) from which we want to remove an entry that is not actually in it ("ab"). As programmers, we would expect nothing to be removed, leaving all 26 letters.
# won't work: returns character(0)
letters[ - which(letters=="ab")]
# works: setdiff() compares values, not positions
setdiff(letters, "ab")
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"

R: Check if strings in a vector are present in other vectors, and return name of the match

I need a tool more selective than %in% or match(). I need a code that matches a vector of string with another vector, and that returns the names of the matches.
Currently I have the following,
test <- c("country_A", "country_B", "country_C", "country_D", "country_E", "country_F")
rating_3 <- c("country_B", "country_D", "country_G", "country_K")
rating_4 <- c("country_C", "country_E", "country_M", "country_F")
i <- 1
while (i <= 33) {
print(i)
print(test[[i]])
if (grepl(test[[i]], rating_3) == TRUE) {
print(grepl(test[[i]], rating_3)) }
i <- i+1
}
This should check each element of test present in rating_3, but for some reason, it returns only the position, the name of the string, and a warning;
[1]
[country_A]
There were 6 warnings (use warnings() to see them)
I need to know why this piece of code fails, but I'd like to eventually have it return the name only when it's inside another vector and, if possible, test it against several vectors at once, printing the name of the vector in which it fits, something like
[1]
[String]
[rating_3]
How could I get something like that?
Without a reproducible example, it is hard to determine what exactly you need, but I think this could be done using %in%:
# create reprex
test <- sample(letters,10)
rating_3 <- sample(letters, 20)
print(rating_3[rating_3 %in% test])
[1] "r" "z" "l" "e" "m" "c" "p" "t" "f" "x" "n" "h" "b" "o" "s" "v" "k" "w" "a"
[20] "i"
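As for why the loop in the question misbehaves: grepl(test[[i]], rating_3) returns one logical value per element of rating_3, while if() expects a single value. A condition of length 4 produced a warning in older R versions and is an error from R 4.2.0 onward. The loop also runs to 33 although test has only 6 elements. A minimal sketch of one way to get the matching strings together with the name of the vector they were found in is below; the names ratings, hits, and matches are introduced here for illustration and are not from the question.

```r
test <- c("country_A", "country_B", "country_C",
          "country_D", "country_E", "country_F")

# Keep the rating vectors in a named list so the names travel with the results
ratings <- list(
  rating_3 = c("country_B", "country_D", "country_G", "country_K"),
  rating_4 = c("country_C", "country_E", "country_M", "country_F")
)

# For each rating vector, the elements of `test` found in it (exact matches)
hits <- lapply(ratings, function(r) test[test %in% r])

# One row per match: `values` is the string, `ind` is the vector's name
matches <- stack(hits)
```

Here %in% does exact string matching, which differs from grepl(), where the first argument is treated as a regular expression and partial matches count.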

return number of specific element of vector based of its name [duplicate]

I need to return the position of an element in a vector based on the element's name. Let's say I have a vector of letters:
myLetters=letters[1:26]
> myLetters
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and what I intend to do is create or find a function that returns the position of the element when called, for example:
myFunction(myLetters["b"])
[1] 2
myFunction(myLetters["z"])
[1] 26
In summary I need a way to refer to excel columns by writing letters of a column (A,B,C later maybe even AA or further) and to get the number.
If you want to refer to Excel column names, you could create a reference vector with all possible Excel column names of up to three letters:
eg1 <- expand.grid(LETTERS, LETTERS)
eg2 <- expand.grid(LETTERS, LETTERS, LETTERS)
excelcols <- c(LETTERS, paste0(eg1[[2]], eg1[[1]]), paste0(paste0(eg2[[3]], eg2[[2]], eg2[[1]])))
After which you can use which:
> which(excelcols == 'A')
[1] 1
> which(excelcols == 'AB')
[1] 28
> which(excelcols == 'ABC')
[1] 731
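As an alternative to the lookup vector, the column number can be computed directly, since Excel column names behave like base-26 numbers with digits A = 1 to Z = 26. The function name excel_col_to_num below is hypothetical and this is only a sketch; it assumes the input is a single string containing letters only.

```r
# Hypothetical helper: convert one Excel column name to its column number
excel_col_to_num <- function(col) {
  chars  <- strsplit(toupper(col), "")[[1]]   # split into single letters
  digits <- match(chars, LETTERS)             # A = 1, ..., Z = 26
  # Weight each digit by its power of 26, most significant letter first
  sum(digits * 26^(rev(seq_along(digits)) - 1))
}

excel_col_to_num("A")    # 1
excel_col_to_num("AB")   # 28
excel_col_to_num("ABC")  # 731
```

This avoids building all 18,278 names up front and also works for names longer than three letters.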
If you need to find the number of times specific letter occurs then the following should work:
myLetters = c("a","a", "b")
myFunction = function(myLetters, findLetter){
length(which(myLetters==findLetter))
}
Let's find how many times "a" occurs in myLetters:
myFunction(myLetters, "a")
# [1] 2

Use sample() without replacement multiple times with increasing sample size in R

I want to take "random" samples from a vector called data but with increasing size and without replacement.
To illustrate my point data looks for example like:
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
What I need is to get different sampling vectors with increasing sampling size (starting with size=2) for example by 2 but without duplicates between the different vectors and store everything into a list so that the result would look something like this:
sample_1<-c("s","d")
sample_2<-c("s","d","a","f")
sample_3<-c("s","d","a","f","m","n")
sample_4<-c("s","d","a","f","m","n","l","c")
sample_5<-c("s","d","a","f","m","n","l","c","j","x")
sample_6<-c("s","d","a","f","m","n","l","c","j","x","v","k")
sample_7<-c("s","d","a","f","m","n","l","c","j","x","v","k","g","b")
sample_8<-c("s","d","a","f","m","n","l","c","j","x","v","k","g","b","h")
samples<-list(sample_1,sample_2,sample_3,sample_4,sample_5,sample_6,sample_7,sample_8)
What I have so far is:
samples<-sapply(seq(from=2, to=length(data), by=2), function(i) sample(data,size=i,replace=F),simplify=F,USE.NAMES=T )
What does not work is to have the increasing sample size but keeping the samples of the previous steps and to have a last list element with all observations.
Is something like this possible?
I'm not sure whether I understood you correctly, but perhaps you only need to scramble the data once:
data = letters
data_random = sample(data)
sapply(seq(from=2, to=length(data), by=2),
function (x) data_random[1:x],
simplify = FALSE)
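With the asker's 15-element vector, stepping by 2 stops at 14, so the last list element would be missing one observation. A hedged sketch of the same shuffle-once idea that also appends the full length as a final size is shown below; the names shuffled, sizes, and samples are introduced here, and set.seed() is only there to make the sketch repeatable.

```r
data <- c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")

set.seed(1)                 # only to make the sketch reproducible
shuffled <- sample(data)    # one permutation, so no duplicates

# Sizes 2, 4, ..., plus the full length if the step does not land on it
sizes <- unique(c(seq(from = 2, to = length(data), by = 2), length(data)))

# Each sample is a prefix of the same permutation,
# so it always contains the previous sample
samples <- lapply(sizes, function(n) shuffled[seq_len(n)])
```

Because every element is a prefix of one permutation, each sample extends the previous one and the last contains all observations.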
After your comments on the other answer, I think I understand what you want to achieve. Extending my previous code, I end up with:
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
set.seed(123)
nbitems=ceiling(length(data)/2) # number of results when adding 2 items per step, rounded up
results=vector("list",nbitems)
results[[1]] <- sample(data,2) # get first sample
for (i in 2:nbitems) { # Loop for each result
samplesavail <- data[!data %in% results[[i-1]]] # Reduce the samples available
results[[i]] <- c(results[[i-1]], sample( samplesavail, min( length(samplesavail), 2) ) ) # concatenate a new sample, size depends on step and remaining samples available.
}
I hope this matches your intended use:
> results
[[1]]
[1] "n" "f"
[[2]]
[1] "n" "f" "a" "g"
[[3]]
[1] "n" "f" "a" "g" "m" "v"
[[4]]
[1] "n" "f" "a" "g" "m" "v" "x" "l"
[[5]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j"
[[6]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h"
[[7]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h" "d" "s"
[[8]]
[1] "n" "f" "a" "g" "m" "v" "x" "l" "b" "j" "k" "h" "d" "s" "c"
Previous approach:
If I understood you correctly (though I am far from sure):
data<-c("a","s","d","f","g","h","j","k","l","x","c","v","b","n","m")
set.seed(123) # fix the seed for repro of answer, remove in real case
nbitems=ceiling(length(data)/2) # number of entries when stepping by 2, rounded up
results=vector("list",nbitems) # preallocate the list (we fill it starting from the end)
results[[nbitems]] = sample(data,length(data)) # shuffle the data once
for (i in nbitems:2) {
results[[i-1]] <- results[[i]][1:(length(results[[i]]) - 2)] # for each iteration, drop the last 2 entries
}
This gives a single entry as the first result.
I just noticed this is the same idea as @sbstn's answer but with a more complicated backward approach; I am posting it in case it has some value.

Determine all characters present in a vector of strings

Say I have the following dataframe consisting of two vectors containing character strings:
df <- data.frame(
"ID"= c("1a", "1b", "1c", "1d"),
"Codes" = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
stringsAsFactors = FALSE)
I'd like a simple way to determine which characters have been used in a given vector. In other words, the output of such a function would reveal:
find.characters(df$Codes) # hypothetical function
[1] "B" "G" "M" "W" "X" "R" "Y" "O" "|" "."
find.characters(df$ID) # hypothetical function
[1] "1" "a" "b" "c" "d"
You can create a custom function to do this. The idea is to split the strings into individual characters with strsplit(v1, ''), which returns a list. We unlist it to get a vector, then take the unique elements. The result is not sorted yet. Based on the example shown, you may want to sort the letters and the other characters separately, so we use grepl to index the uppercase letters, sort the two subsets separately, and concatenate them with c().
find.characters <- function(v1){
x1 <- unique(unlist(strsplit(v1, '')))
indx <- grepl('[A-Z]', x1)
c(sort(x1[indx]), sort(x1[!indx]))
}
find.characters(df$Codes)
#[1] "B" "G" "M" "O" "R" "W" "X" "Y" "|" "."
find.characters(df$ID)
#[1] "1" "a" "b" "c" "d"
NOTE: Generally, I would use grepl('[A-Za-z]', x1), but I didn't do that because the expected result for the 'ID' column is different.
find.characters<-function(x){
unique(c(strsplit(split="",x),recursive = T))
}
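This second version relies on c(..., recursive = TRUE) to flatten the list returned by strsplit(). A small sketch of how it behaves on the question's data follows, under the hypothetical name find_chars_unsorted to keep it apart from the sorted version above:

```r
df <- data.frame(
  ID = c("1a", "1b", "1c", "1d"),
  Codes = c("BX.MX|GX.WX", "MX.RX|BX.YX", "MX.OX|GX.GX", "MX.OX|YX.OX"),
  stringsAsFactors = FALSE)

# Same logic as the function above, renamed for this sketch
find_chars_unsorted <- function(x) {
  unique(c(strsplit(x, split = ""), recursive = TRUE))
}

find_chars_unsorted(df$ID)
# [1] "1" "a" "b" "c" "d"

codes_chars <- find_chars_unsorted(df$Codes)
```

Characters come back in order of first appearance rather than sorted, so wrap the call in sort() if a fixed order matters.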
