delete vector entries based on another vector - r

I have two vectors
a <- c(1:20)
b <- c(2,11,14)
I want to delete the entries in the a vector based on the vector entries in b (I want the 2nd, 11th, and 14th entries deleted).
I've tried several methods, including:
c <- a[!a %in% b]
but that doesn't work.
Any suggestions? I've tried searching SO, but can only find deleting based on values.

You can simply index into a and remove the elements at indices in b as follows:
a <- c(1:20)
b <- c(2,11,14)
a[-b]
[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
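Because the example vector holds the values 1 to 20, position and value coincide, which hides the difference between the two kinds of removal. A small sketch with made-up data (v and idx are hypothetical) shows negative indexing dropping by position, while the %in% filter from the question drops by value:

```r
# Hypothetical data: values deliberately differ from their positions
v   <- c(10, 2, 30, 11, 50)
idx <- c(2, 4)                 # positions to drop

by_position <- v[-idx]         # drops the 2nd and 4th elements
by_value    <- v[!v %in% idx]  # drops elements whose VALUE is 2 or 4

by_position
by_value
```

Here by_position keeps 10, 30 and 50, whereas by_value only loses the element equal to 2.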
I created 3.1 million entries and am randomly sampling 100,000 to remove. As can be seen, it is blazing fast.
a <- 1:3100000
b <- sample(a, 100000)
system.time(a[-b])
user system elapsed
0.024 0.003 0.027
Edited: Adding this extra check option based on comment below by akrun and thelatemail to handle the case where b might be null.
a[if(length(b)) -b else TRUE]

The approach by @Gopala works in most cases, except when the 'b' vector is NULL or empty. To make it a bit more general, we can build the logical condition using seq_along(a) with %in%:
a[!seq_along(a) %in% b]
#[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
Now, if we change 'b' to
b <- vector('integer')
a[-b]
#integer(0)
a[!seq_along(a) %in% b]
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The former returns a vector of length 0, while the %in% approach returns the whole vector 'a'.
The other method is clearly more efficient, but if we need an approach that also handles the empty case mentioned above, this one can be used.
system.time(a[-b])
# user system elapsed
# 0.07 0.00 0.08
system.time(a[!seq_along(a) %in% b])
# user system elapsed
# 0.17 0.01 0.18
The approach posted by @thelatemail to make the first approach general:
system.time(a[if(length(b)==0) TRUE else -b])
# user system elapsed
# 0.05 0.00 0.05
NOTE: Benchmark data from @Gopala's post.
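If the empty-index case comes up often, the length check can be wrapped in a small function. This is only a sketch; drop_at is a hypothetical name, not something from base R:

```r
# Hypothetical helper: drop the elements of x at the positions in idx.
# An empty or NULL idx returns x unchanged instead of a zero-length vector.
drop_at <- function(x, idx) {
  if (length(idx) == 0) return(x)
  x[-idx]
}

a <- 1:20
res_some  <- drop_at(a, c(2, 11, 14))  # 17 elements remain
res_empty <- drop_at(a, integer(0))    # all 20 elements remain
res_null  <- drop_at(a, NULL)          # all 20 elements remain
```

This keeps the speed of negative indexing for the usual case and only pays for a length() check.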

Related

Is there a way to create a permutation of a vector without using the sample() function in R?

I hope you are having a nice day. I would like to know if there is a way to create a permutation (rearrangement) of the values in a vector in R?
My professor gave us an assignment in which we are supposed to create functions for a randomization test: one using sample() to create a permutation and one not using the sample() function. So far all of my efforts have been fruitless, as any answer that I can find always resorts to the sample() function. I have tried several other methods, such as indexing with runif() and writing my own functions, but to no avail. Alas, I have accepted defeat and come here for salvation.
While using the sample() function, the code looks like:
#create the groups
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
#create a permutation of the combined vector without replacement using the sample function()
permsample <-sample(c(a,b),replace=FALSE)
permsample
[1] 2 5 6 1 7 7 3 8 6 3 5 9 2 7 4 8 1 5
And, for reference, the entire code of my function looks like:
PermutationTtest <- function(a, b, P){
sample.t.value <- t.test(a, b)$statistic
perm.t.values<-matrix(rep(0,P),P,1)
N <-length(a)
M <-length(b)
for (i in 1:P)
{
permsample <-sample(c(a,b),replace=FALSE)
pgroup1 <- permsample[1:N]
pgroup2 <- permsample[(N+1) : (N+M)]
perm.t.values[i]<- t.test(pgroup1, pgroup2)$statistic
}
return(mean(perm.t.values))
}
How would I achieve the same thing, but without using the sample() function and within the confines of base R? The only hint my professor gave was "use indices." Thank you very much for your help and have a nice day.
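You can use runif() to generate a value between 1 and one more than the number of elements still remaining. The floor() function returns the integer part of that number, which gives a random index from 1 to the number of remaining elements. At each iteration, I append the element in the rn'th position of the remaining values to the new vector, remove it, and decrease the range of the random number by one.
a <- c(2,5,5,6,6,7,8,9)
b <- c(1,1,2,3,3,4,5,7,7,8)
pool <- c(a, b)   # values not yet placed (named pool rather than c to avoid masking c())
index <- length(pool)
perm <- c()
for (i in seq_along(pool)) {
  # max = index + 1 so that floor() can return index itself;
  # with max = index the last remaining element could never be drawn
  rn <- floor(runif(1, min = 1, max = index + 1))
  perm <- append(perm, pool[rn])
  pool <- pool[-rn]
  index <- index - 1
}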
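The draw-and-remove loop above can also be written as an in-place Fisher-Yates shuffle, which swaps elements instead of growing one vector and shrinking another. This is a sketch of that alternative technique, not the answerer's code; shuffle_fy is a hypothetical name, and it assumes runif() does not return its upper bound:

```r
# Hypothetical helper: shuffle x without sample(), using only runif() and indices
shuffle_fy <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  for (i in n:2) {
    # pick j from 1..i: floor() of a draw from [1, i + 1) is an integer in 1..i
    j <- floor(runif(1, min = 1, max = i + 1))
    # swap positions i and j
    tmp  <- x[i]
    x[i] <- x[j]
    x[j] <- tmp
  }
  x
}

set.seed(1)
ab   <- c(c(2, 5, 5, 6, 6, 7, 8, 9), c(1, 1, 2, 3, 3, 4, 5, 7, 7, 8))
perm <- shuffle_fy(ab)
```

The result contains exactly the same values as ab, only in a different order.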
It is easier to see what is going on if we use consecutive numbers:
a <- 1:8
b <- 9:17
ab <- c(a, b)
ab
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Now draw 17 (length(ab)) random numbers and use them to order ab:
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 5 13 11 12 6 1 17 3 10 2 8 16 7 4 9 15 14
rnd <- runif(length(ab))
ab[order(rnd)]
# [1] 14 11 5 15 10 7 13 9 17 8 2 6 1 4 16 12 3
For each permutation just draw another 17 random numbers.
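As a sketch of how this could slot into the permutation test from the question, the order(runif()) step can be wrapped in a function and the result split by index. The name permute is hypothetical, and the seed is fixed only to make the example repeatable:

```r
# Hypothetical helper: one random permutation of x via order(runif())
permute <- function(x) x[order(runif(length(x)))]

a <- c(2, 5, 5, 6, 6, 7, 8, 9)
b <- c(1, 1, 2, 3, 3, 4, 5, 7, 7, 8)

set.seed(42)
p       <- permute(c(a, b))
pgroup1 <- p[seq_along(a)]              # first length(a) values
pgroup2 <- p[length(a) + seq_along(b)]  # remaining length(b) values
```

Inside the loop of PermutationTtest, the sample() call would then be replaced by permute(c(a, b)).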

word_stats function from qdap package application on a dataframe

I have a dataframe, where one column contains strings.
q = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
I want to use the word_stats function for each individual record.
Is it possible?
text_statistic <- apply(q,1,word_stats)
This will apply word_stats() row by row and return a list with the results of word_stats() for every row.
You can do it in many ways; lapply and sapply apply a function over a list or vector. Note that word_stats here is a simple hand-written word counter, not the qdap function:
word_stats <- function(x) {length(unlist(strsplit(as.character(x), ' ')))}
sapply(q$text, word_stats)
Sure, have a look at the grouping.var argument:
dat = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
with(dat, qdap::word_stats(text, number))
## number n.sent n.words n.char n.syl n.poly wps cps sps psps cpw spw pspw n.state p.state n.hapax grow.rate
## 1 2 1 10 38 14 2 10 38 14 2 3.800 1.400 .200 1 1 10 1
## 2 1 1 8 35 12 1 8 35 12 1 4.375 1.500 .125 1 1 8 1
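For readers without qdap installed, two of the columns above can be reproduced per row in base R. This is only a rough stand-in, not qdap::word_stats; it assumes words are separated by single spaces and counts only letters and digits as characters:

```r
# Hypothetical base-R stand-in, NOT qdap::word_stats
dat <- data.frame(number = 1:2,
                  text = c("The surcingle hung in ribands from my body.",
                           "But a glance will show the fallacy of this idea."),
                  stringsAsFactors = FALSE)

# words per row: split on single spaces and count the pieces
dat$n.words <- lengths(strsplit(dat$text, " ", fixed = TRUE))
# characters per row: strip everything that is not a letter or digit
dat$n.char <- nchar(gsub("[^[:alnum:]]", "", dat$text))

dat[, c("number", "n.words", "n.char")]
```

For these two sentences the counts agree with the n.words and n.char columns in the qdap output.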

Identify first match position in a string

I have a character string ("00010000") and need to identify the position at which the first "1" appears. (This tells me in which month a customer first became active.)
I have a dataset that looks like this:
id <- c(1:5)
seq <- c("00010000","00001000","01000000","10000000","00010000")
df <- data.frame(id,seq)
I would like to create a new field identifying the first_month_active for each id.
I can do this manually with a nested ifelse function:
df$first_month_active <-
ifelse(substr(df$seq,1,1)=="1",1,
ifelse(substr(df$seq,2,2)=="1",2,
ifelse(substr(df$seq,3,3)=="1",3,
ifelse(substr(df$seq,4,4)=="1",4,
ifelse(substr(df$seq,5,5)=="1",5,99 )))))
Which gives me the desired result:
id seq first_month_active
1 00010000 4
2 00001000 5
3 01000000 2
4 10000000 1
5 00010000 4
However, this is not an ideal solution for my data, which contains 36 months.
I would like to use a loop with an ifelse statement; however, I am really struggling with the syntax:
for (i in 1:36) {
ifelse(substr(df$seq,0+i,0+i)=="1",0+i,
}
Any ideas would be greatly appreciated
Or try the stringi package
library(stringi)
stri_locate_first_fixed(df$seq, "1")[, 1]
## [1] 4 5 2 1 4
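Skip the loop and the ifelse. This assumes every string is exactly 8 characters long (hence the 9). Note also that nchar() on a number goes through as.character(), which can switch to scientific notation such as "1e+06" under the default scipen setting, so check the result on your own data: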
9 - nchar(as.numeric(seq))
## [1] 4 5 2 1 4
This won't work the same in your data.frame because you coerced seq to factor implicitly, so just do:
9 - nchar(as.numeric(as.character(df$seq)))
## [1] 4 5 2 1 4
Edit: Just for fun, since Frank didn't convert his comment into an answer, here's a strsplit solution:
# from original vector
sapply(strsplit(seq, "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4
# from data.frame
sapply(strsplit(as.character(df$seq), "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4
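You can use gregexpr. This assumes each string contains a single "1"; if there can be several, regexpr returns only the first match.
unlist(gregexpr(pattern = "1", seq, fixed = TRUE))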
[1] 4 5 2 1 4
The following could do this job:
library(stringr)
str_locate(pattern ='1',seq)
Some comparisons:
library(stringi)
library(stringr)
seq <- c("00010010","00001000","10000010","10000000","00010000")
seq2 <- rep(seq, 5e6)
system.time(regexpr("1", seq2))
user system elapsed
4.78 0.03 4.82
system.time(9-nchar(as.numeric(as.character(seq2))))
user system elapsed
34.89 0.18 35.52
system.time(str_locate(pattern ='1',seq2))
user system elapsed
6.17 0.21 6.53
system.time(stri_locate_first_fixed(seq2, "1")[, 1])
user system elapsed
1.68 0.15 1.84
system.time(nchar(seq2)-round(log10(as.numeric(seq2))))
user system elapsed
7.67 0.09 7.86
system.time(nchar(sub('1.*', '', seq2))+1)
user system elapsed
14.61 0.11 14.93
Another one, using log:
nchar(seq)-round(log10(as.numeric(seq)))
Another option using sub
nchar(sub('1.*', '', seq))+1
#[1] 4 5 2 1 4
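None of the snippets above says what should happen when a string contains no "1" at all. A minimal base-R sketch built on regexpr, which returns -1 for no match, could map that case to NA. The name first_one and the sample strings are made up for illustration:

```r
# Hypothetical helper: position of the first "1" in each string, NA when there is none
first_one <- function(s) {
  # as.character() also covers the case where the column was read in as a factor
  pos <- as.integer(regexpr("1", as.character(s), fixed = TRUE))
  pos[pos == -1L] <- NA_integer_
  pos
}

months <- c("00010000", "00001000", "01000000", "10000000", "00000000")
first_one(months)

# The same call works unchanged for longer strings, e.g. 36 months
long_seq <- paste0(strrep("0", 20), "1", strrep("0", 15))
first_one(long_seq)
```

Because it does not depend on the string length, no constant such as 9 has to be adjusted for 36-month data.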

Is there a way to refer to the end of a vector?

I don't want to save the huge intermediate results for some of my calculations, and hence want to run some tests without saving these memory expensive vectors.
Say, during the computation I have a vector of arbitrary length l.
But I don't know what l is, and I can't save the vector in the memory.
Is there a way I can refer to the length of the vector, something like
vec[100:END] or vec[100:-1] or vec[100:last]
Please note that vec here is not a variable, and it only refers to an intermediate expression which will output a vector.
I know length, head and tail functions, and that vec[-(1:99)] is an equivalent expression.
But, I actually want to know if there is some reference that will run an iteration from a specified number to the 'end' of the vector.
Thanks!!
I'm probably not understanding your question. If this isn't useful let me know and I'll delete it.
I gather you want to extract the elements from a vector of arbitrary length, from element N to the end, without explicitly storing the vector (which is required if you want to use, e.g. length(vec)). Here are two ways:
N <- 5 # grab element 5 to the end.
set.seed(12)
(1:sample(N:100,1))[-(1:(N-1))]
# [1] 5 6 7 8 9 10 11
set.seed(12)
tail(1:sample(N:100,1),-(N-1))
# [1] 5 6 7 8 9 10 11
Both of these create (temporarily) a sequence of integers of random length (>=5), and extract the elements from 5 to the end without self-referencing.
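Another way to avoid naming the intermediate vector is to pass the expression to a small function, where it exists only as a local argument for the duration of the call. This is a sketch; from_n is a hypothetical name, and the seq() call merely stands in for whatever expression produces the vector:

```r
# Hypothetical helper: keep elements n..end of whatever expression is passed in.
# The argument lives only inside the function call, so nothing is stored globally.
from_n <- function(v, n) {
  if (n > length(v)) return(v[0])  # nothing left; avoids a descending n:length(v)
  v[n:length(v)]
}

res <- from_n(seq(2, 40, by = 2), 5)  # elements 5 to the end of the intermediate result
res
```

The guard matters because n:length(v) counts downwards when n exceeds the length, which would silently return the wrong elements.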
You mentioned memory a couple of times. If you're concerned about memory and assigning large objects, you should take a look at the Memory-limits documentation, and the related links. First, there are ways to operate on the language in R. Here I only assign one object, the function f, and use it without making any other assignments.
> f <- function(x, y) x:y ## actually, g <- ":" is only 96 bytes
> object.size(f)
# 1560 bytes
> f(5, 20)[3:7]
# [1] 7 8 9 10 11
> object.size(f)
# 1560 bytes
> f(5, 20)[3:length(f(5, 20))]
# [1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> object.size(f)
# 1560 bytes
You can also use an expression to hold an unevaluated function call.
> e <- expression(f(5, 20)) ## but again, g <- ":" is better
> eval(e)
# [1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> eval(e)[6:9]
# [1] 10 11 12 13
> eval(e)[6:length(eval(e))]
# [1] 10 11 12 13 14 15 16 17 18 19 20
> rev(eval(e))
# [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5
Depending on the type of data you're working with, there are ways to
avoid using large amounts of memory during a session. Here are a few related to
your question.
memory.profile()
gc()
# used (Mb) gc trigger (Mb) max used (Mb)
# Ncells 274711 14.7 531268 28.4 531268 28.4
# Vcells 502886 3.9 1031040 7.9 881084 6.8
?gc() is good knowledge to have, and I can't really explain it. Best to read
about it. Also, I just learned about memCompress() and memDecompress() for
in-memory compression/storage. In addition, if you're working with integer
values, notifying R about it can help save memory. That's what the L is for on
the end of the numbers in the rep.int() call. Here's a look:
x <- rep.int(5L, 1e4L)
y <- as.raw(x)
z1 <- memCompress(y)
z2 <- memCompress(y, "b")
z3 <- memCompress(y, "x")
mapply(function(a) object.size(get(a)), c('x','y','z1','z2','z3'))
# x y z1 z2 z3
# 40040 10040 88 88 168
And there is also
delayedAssign("p", rep.int(5L, 1e5L))
which is a promise object that takes up 0 bytes of memory until it is first evaluated.
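The lazy behaviour can be checked with a small sketch in which the delayed expression sets a flag when it finally runs. The flag variable evaluated exists purely for illustration:

```r
evaluated <- FALSE
# The braces hold an expression with a visible side effect, purely for illustration
delayedAssign("p", { evaluated <- TRUE; rep.int(5L, 10L) })

before <- evaluated  # still FALSE: the promise has not been forced yet
n      <- length(p)  # the first use of p forces the promise
after  <- evaluated  # now TRUE: the delayed expression has run
```

After that first use, p behaves like any ordinary vector and occupies memory accordingly.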

Is there a faster way to get percent change?

I have a data frame with around 25000 records and 10 columns. I am using code to determine the change to the previous value in the same column (NewVal) based on another column (y) with a percent change already in it.
x=c(1:25000)
y=rpois(25000,2)
z=data.frame(x,y)
z[1,'NewVal']=z[1,'x']
So I ran this:
for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+(z$NewVal[i-1]*(z$y[i]/100))}
This takes considerably longer than I expected it to. Granted I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any of the functions from packages as I would like to know what the formula for creating the values is - or if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average just like rollmean in caTools. Either that or how do I figure out what their formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 +(c(0, y[-1]/100))))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
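The reason cumprod works is that each step of the loop multiplies the previous value by (1 + y/100), so the i-th value is the start value times a running product of those factors. A small sketch with made-up numbers (start and pct are hypothetical) compares the two forms:

```r
# Small hypothetical inputs: a start value and percent changes per step
start <- 100
pct   <- c(0, 3, 0, 1, 3)  # the first entry is never used by the recurrence

# Loop version of NewVal[i] = NewVal[i-1] * (1 + pct[i] / 100)
loop_val <- numeric(length(pct))
loop_val[1] <- start
for (i in 2:length(pct)) {
  loop_val[i] <- loop_val[i - 1] * (1 + pct[i] / 100)
}

# Vectorised version: running product of the growth factors
vec_val <- start * cumprod(1 + c(0, pct[-1]) / 100)

all.equal(loop_val, vec_val)
```

The leading 0 in c(0, pct[-1]) makes the first factor 1, so the first element stays equal to the start value.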
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <-Reduce("*", c(1, 1+z$y[-1]/100), accumulate=TRUE) )
user system elapsed
0.139 0.008 0.148
> head(z)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16

Resources