Why does subtracting an empty vector in R delete everything?

Could someone please enlighten me why subtracting an empty vector in R results in the whole content of a data frame being deleted? Just to give an example
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
This gives me an empty JointProcedures3 (all rows gone) if every value of the %in% test is FALSE, so that WhichInstances2 is empty, but it should simply give me what JointProcedures3 was before those lines of code.
This is not the first time it has happened to me. I have asked my supervisor; it has happened to him as well, and he just thinks it is a quirk of R.
Rewriting the code as
WhichInstances2 <- which(JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
if(length(WhichInstances2)>0)
{
JointProcedures3 <-JointProcedures3[-WhichInstances2,]
}
fixes the issue. But in principle it should not have made a scooby of a difference whether that conditional was there or not, since if length(WhichInstances2) was equal to 0, I would simply be subtracting nothing from the original JointProcedures3...
Thanks all for your input.

Let's try a simpler example to see what's happening.
x <- 1:5
y <- LETTERS[1:5]
which(x>4)
## [1] 5
y[which(x>4)]
## [1] "E"
So far so good ...
which(x>5)
## integer(0)
y[which(x>5)]
## character(0)
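This is also fine. Now what if we negate? The problem is that integer(0) is a zero-length vector, so -integer(0) is also a zero-length vector, so y[-which(x>5)] is also a zero-length vector: R sees an empty index and selects nothing, rather than reading it as "drop nothing" and keeping everything.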
What can you do about it? Don't use which(); instead use logical indexing directly, and use ! to negate the condition:
y[!(x>5)]
## [1] "A" "B" "C" "D" "E"
In your case:
JointID_OK <- (JointProcedures3$JointID %in% KneeIDcount$Var1[KneeIDcount$Freq >1])
JointProcedures3 <-JointProcedures3[!JointID_OK,]
For what it's worth, this is section 8.1.13 in the R Inferno, "negative nothing is something"
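A minimal sketch of the behaviour described above, together with one possible guard. The names drop, keep and df are illustrative and not from the original question; using setdiff() on seq_along() is just one way to say "all positions except these".

```r
x <- 1:5
y <- LETTERS[1:5]

drop  <- which(x > 5)   # integer(0): nothing matches
neg   <- -drop          # negating a zero-length vector is still zero-length
empty <- y[neg]         # an empty index selects nothing

# Guard: compute the positions to keep instead of negating the ones to drop
keep <- setdiff(seq_along(y), drop)
kept <- y[keep]

# The same applies to data frame rows (df is a hypothetical example)
df <- data.frame(id = 1:3, value = c("a", "b", "c"))
rows_neg  <- df[-drop, ]                              # zero rows
rows_kept <- df[setdiff(seq_len(nrow(df)), drop), ]   # all three rows
```

Logical indexing with ! avoids the issue altogether, which is why it is usually the simpler choice when the condition is already available as a logical vector.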

It seems you are checking for ids in a vector and you intend to remove them from another; probably setdiff is what you are looking for.
Consider a vector of the lowercase letters of the alphabet (letters, an R built-in) from which we want to remove any entry matching something that is not in there ("ab"). As programmers, we would want nothing to be removed, keeping all 26 letters.
# won't work: returns character(0)
letters[-which(letters == "ab")]
# works: setdiff() compares the values themselves
setdiff(letters, "ab")
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u"
[22] "v" "w" "x" "y" "z"

Related

Unexpected behaviour when indexing vector with y[length(x) + 1:length(y)]

I have a very specific problem, where I need to use the length of some shorter array, to subset a longer array. I have presented a toy example below. I do not understand why it does not work to just add 1 when indexing, and I don't understand why it returns the long array filled with NA.
x <- letters[1:5]
x
# [1] "a" "b" "c" "d" "e"
y <- letters[1:10]
y
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
y[length(x): length(y)]
# [1] "e" "f" "g" "h" "i" "j"
y[length(x) + 1: length(y)]
# [1] "f" "g" "h" "i" "j" NA NA NA NA NA
y[(length(x) + 1): length(y)]
# [1] "f" "g" "h" "i" "j"
Using y[length(x): length(y)] almost solves my problem, but the resulting array is too long: I don't want to return 'e', so I have to start one index further to the right. I thought I could solve this by using y[length(x) + 1: length(y)], but for some reason that gives me a vector of the same length as y, padded with NA at the end. I found that adding parentheses solved the problem, but again, I don't understand why, or what happens when I leave them out. Could someone help me?
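The colon operator has higher precedence than addition, so length(x) + 1:length(y) is evaluated as length(x) + (1:length(y)), i.e. 5 + 1:10, which is 6:15. Indices 11 to 15 lie beyond the end of y, hence the NAs. The parentheses tell R to use the value of length(x) + 1 as the first number in the sequence.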
So, as you mention, the following should work:
y[(length(x) + 1): length(y)]
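The precedence rule can be checked directly on the index vectors. This is a small sketch using the vectors from the question; the names idx_no_parens, idx_parens, idx_seq and rest are illustrative, and seq() and tail() are shown only as possible alternatives.

```r
x <- letters[1:5]
y <- letters[1:10]

idx_no_parens <- length(x) + 1:length(y)    # parsed as length(x) + (1:length(y))
idx_parens    <- (length(x) + 1):length(y)  # the intended range

# Alternatives that sidestep the precedence question
idx_seq <- seq(length(x) + 1, length(y))
rest    <- tail(y, -length(x))              # drop the first length(x) elements
```

One difference worth knowing: when x is as long as y, both the colon form and seq() count downwards (for example 11:10), whereas tail() with a negative n returns an empty vector.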

R: Check if strings in a vector are present in other vectors, and return name of the match

I need a tool more selective than %in% or match(). I need a code that matches a vector of string with another vector, and that returns the names of the matches.
Currently I have the following,
test <- c("country_A", "country_B", "country_C", "country_D", "country_E", "country_F")
rating_3 <- c("country_B", "country_D", "country_G", "country_K")
rating_4 <- c("country_C", "country_E", "country_M", "country_F")
i <- 1
while (i <= 33) {
print(i)
print(test[[i]])
if (grepl(test[[i]], rating_3) == TRUE) {
print(grepl(test[[i]], rating_3)) }
i <- i+1
}
This should check each element of test present in rating_3, but for some reason, it returns only the position, the name of the string, and a warning;
[1]
[country_A]
There were 6 warnings (use warnings() to see them)
I need to know why this piece of code fails, but I'd like to eventually have it return the name only when it's inside another vector and, if possible, test it against several vectors at once, printing the name of the vector in which it fits. Something like
[1]
[String]
[rating_3]
How could I get something like that?
Without a reproducible example, it is hard to determine what exactly you need, but I think this could be done using %in%:
# create reprex
test <- sample(letters, 10)
rating_3 <- sample(letters, 20)
# elements of rating_3 that also occur in test (at most 10; varies with the random sample)
print(rating_3[rating_3 %in% test])
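As for why the loop in the question misbehaves: grepl(test[[i]], rating_3) returns one logical value per element of rating_3, so if() receives a condition of length greater than one (a warning in older versions of R, an error from R 4.2.0 onwards), and the bound of 33 is larger than length(test). The following is a rough sketch of one way to get the requested output, assuming exact matches are wanted; the named list ratings and the names found_in and hits are hypothetical helpers, not part of the original code.

```r
test <- c("country_A", "country_B", "country_C", "country_D", "country_E", "country_F")

# Hypothetical container: a named list, so each match can report its vector's name
ratings <- list(
  rating_3 = c("country_B", "country_D", "country_G", "country_K"),
  rating_4 = c("country_C", "country_E", "country_M", "country_F")
)

# For each rating vector, the elements of test found in it (exact matches)
found_in <- lapply(ratings, function(r) test[test %in% r])

# Print position, string, and the name of the vector it was found in
for (i in seq_along(test)) {
  hits <- names(ratings)[vapply(ratings, function(r) test[i] %in% r, logical(1))]
  if (length(hits) > 0) {
    cat(i, test[i], hits, "\n")
  }
}
```

Adding another rating vector then only means adding one more named entry to ratings.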

How to retain character strings using positional indexing?

What I need to do is very similar to what the function below does
library(data.table)  # provides tstrsplit()
x = c("abcde", "ghij", "klmnopq")
tstrsplit(x, "", fixed=TRUE, keep=c(1,3,5), names=c('first','second','third'))
However, I would like to be able to return strings using ranges of values. For example, I would like to specify that in first I want to have the first two letters for each element.
Thus instead of having:
$first
[1] "a" "g" "k"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
The output should look like
$first
[1] "ab" "gh" "kl"
$second
[1] "c" "i" "m"
$third
[1] "e" NA "o"
Background:
I have a large .txt file of records and a lookup table that gives, for each attribute, its start position and its expected maximum width. The txt file looks like:
James Brown M 01-01-1970
And then in a separate file I have a lookup table that says:
Field    Start  Width
Name     1      7
FamilyN  9      7
Gender   11     1
Incidentally, I would appreciate any feedback on the best way to import this type of large .txt file. I feel like read.table is inappropriate since it tries to reduce to a dataframe format which is not what these files really are.
Something like this maybe:
x = c("abcde", "ghij", "klmnopq")
library(tidyverse)
list(c(1,3,5), c(2,1,1)) %>%
pmap(~ substr(x, .x, .x + .y - 1) %>% replace(., .=="", NA))
[[1]]
[1] "ab" "gh" "kl"
[[2]]
[1] "c" "i" "m"
[[3]]
[1] "e" NA "o"
I've hardcoded the positions. Per @MrFlick's comment, if you have a large number of strings, you'll need some strategy for deciding on the character positions so that you can automate it, rather than hardcoding it.
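One possible way to automate the positions is to drive substr() from the lookup table itself. This is a base R sketch, not the answerer's method: the data frame lookup and its column names are assumptions modelled on the table in the question, filled with the toy positions from the example above.

```r
x <- c("abcde", "ghij", "klmnopq")

# Hypothetical lookup table; in practice it would be read from the separate file
lookup <- data.frame(
  Field = c("first", "second", "third"),
  Start = c(1, 3, 5),
  Width = c(2, 1, 1),
  stringsAsFactors = FALSE
)

# One substr() call per field, vectorised over all records in x
fields <- Map(function(start, width) substr(x, start, start + width - 1),
              lookup$Start, lookup$Width)
names(fields) <- lookup$Field

# substr() returns "" when a record is too short; mark those as missing
fields <- lapply(fields, function(v) { v[v == ""] <- NA; v })
```

For reading the file itself, read.fwf() from the utils package accepts a vector of field widths and may be worth a look, although whether it is fast enough for a very large file is something to test.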

Generating Multiple Subsets in R

I have a large sequence of bytes, and I would like to generate a list containing an arbitrary number of subsets of that sequence. I suspect I need to use one of the apply functions, but the trick is that I need to iterate over the vector of starting positions, not the sequence itself.
Here's an example of how I want it to work --
extrct_by_mod <- function(x, startpos, endpos, lrecl)
{
x[1:length(x) %% lrecl %in% startpos:endpos]
}
tmp_seq <- letters[1:25]
startpos <- c(0, 2)
endpos <- c(1, 5)
lrecl <- 5
list_one <- extrct_by_mod(x=tmp_seq, startpos=startpos[1], endpos=endpos[1], lrecl=lrecl)
list_two <- extrct_by_mod(x=tmp_seq, startpos=startpos[2], endpos=endpos[2], lrecl=lrecl)
what_i_want <- list(list_one, list_two)
Ideally, I'd like to be able to just add more values to startpos and endpos, thus automatically generate more subsets to add to my list. Note that the subsets will not be the same length, and in some cases, not even the same type.
My datasets are fairly large, so something that scales well would be ideal. I realize that this could be done with a loop, but I understand that you generally want to avoid looping in R.
Thank you!
Saving some time by pre-calculating the modulo-selection index:
cats <- 1:length(tmp_seq) %% lrecl
mapply(function(start, end) { tmp_seq[cats %in% start:end] }, startpos, endpos)
[[1]]
[1] "a" "e" "f" "j" "k" "o" "p" "t" "u" "y"
[[2]]
[1] "b" "c" "d" "g" "h" "i" "l" "m" "n" "q" "r" "s" "v" "w" "x"
(It is not correct that R apply functions are any faster than equivalent loops.)
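One caveat with mapply(): when all subsets happen to have the same length, it simplifies the result to a matrix rather than a list. A variant using Map(), which always returns a list, is sketched below with the data from the question; the name subsets is illustrative.

```r
tmp_seq  <- letters[1:25]
startpos <- c(0, 2)
endpos   <- c(1, 5)
lrecl    <- 5

# Pre-compute the position of each element within its record
cats <- seq_along(tmp_seq) %% lrecl

# Map() never simplifies, so the result is a list whatever the subset lengths
subsets <- Map(function(start, end) tmp_seq[cats %in% start:end],
               startpos, endpos)
```

Adding more subsets only requires appending values to startpos and endpos. Passing SIMPLIFY = FALSE to mapply() has the same effect as switching to Map().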

Selecting and matching multiple vectors in a list in R

I have a list of vectors like this:
>list
[[1]]
[1] "a" "m" "l" "s" "t" "o"
[[2]]
[1] "m" "y" "o" "t" "e"
[[3]]
[1] "n" "a" "s"
[[4]]
[1] "b" "u" "z" "u" "l" "a"
[[5]]
[1] "c" "m" "u" "s" "r" "i" "x" "t"
1-First, I want to select the vector in the list with the highest number of elements (in this case the 5th vector, with 8 elements). This is easy.
2-Second, I want to select all vectors in the list with length equal to or immediately lower than the previous one, and intersect them with the previous vector.
Another possibility I have is selecting by the first character. In this case this would be equivalent to selecting the vectors starting with "a" or "b", the first and fourth in the list. What I do not know is how to select multiple vectors in a list knowing their first element.
3-Finally, I want to keep just the intersection with the minimum number of matches.
In this case that is the fourth vector in the list, starting with "b". Then the process starts again for the rest of the vectors, but already taking the 4th and 5th vectors into account when intersecting. Here that would mean picking up the second element and intersecting it with a "unique() combination" of the 4th and 5th.
I hope I have explained myself! Is there a way to do this in R without 3-4 "for" and "if" loops? In other words, is there a clever way to do it using lapply or similar?
This should do it?
# strsplit() needs a character vector, so build it with c() rather than list()
list <- strsplit(c("amlsto", "myote", "nas", "buzula", "cmusrixt"), "")
# find the longest vector
lens <- sapply(list, length)
which.max(lens)
# which are same or 1 shorter than previous
inds <- which(lens == c(-1, head(lens, -1)) | lens == c(-1, head(lens, -1)) - 1)
# get the intersections
inters <- mapply(intersect, list[inds], list[inds-1], SIMPLIFY=FALSE)
# get items where first in vector is in target set
target <- c("a","b")
isTarget <- sapply(list, "[[", 1) %in% target
# minimum number of overlaps (sapply() returns the numeric vector which.min() expects)
which.min(sapply(inters, length))
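The "select by first element" variant from the question can be sketched as follows. This is one reading of the question rather than a definitive solution; the names vecs, longest, selected and overlaps are illustrative, and vecs is used instead of list to avoid masking the built-in list() function.

```r
vecs <- strsplit(c("amlsto", "myote", "nas", "buzula", "cmusrixt"), "")

# Step 1: the vector with the most elements
lens    <- lengths(vecs)
longest <- vecs[[which.max(lens)]]

# Step 2 (alternative form): select the vectors whose first element is in target
target   <- c("a", "b")
selected <- vecs[vapply(vecs, function(v) v[1] %in% target, logical(1))]

# Intersect each selected vector with the longest one
overlaps <- lapply(selected, intersect, longest)

# Step 3: the selected vector with the fewest matches
fewest <- which.min(lengths(overlaps))
```

With the example data, fewest points at the vector starting with "b", which agrees with the outcome described in the question.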
