Identify first match position in a string - r

I have a character string ("00010000") and need to find the position of the first "1". (This tells me the first month in which a customer is active.)
I have a dataset that looks like this:
id <- c(1:5)
seq <- c("00010000","00001000","01000000","10000000","00010000")
df <- data.frame(id,seq)
I would like to create a new field identifying the first_month_active for each id.
I can do this manually with a nested ifelse function:
df$first_month_active <-
ifelse(substr(df$seq,1,1)=="1",1,
ifelse(substr(df$seq,2,2)=="1",2,
ifelse(substr(df$seq,3,3)=="1",3,
ifelse(substr(df$seq,4,4)=="1",4,
ifelse(substr(df$seq,5,5)=="1",5,99 )))))
Which gives me the desired result:
id seq first_month_active
1 00010000 4
2 00001000 5
3 01000000 2
4 10000000 1
5 00010000 4
However, this is not an ideal solution for my data, which contains 36 months.
I would like to use a loop with an ifelse statement, but I am really struggling with the syntax:
for (i in 1:36) {
ifelse(substr(df$seq,0+i,0+i)=="1",0+i,
}
Any ideas would be greatly appreciated

Or try the stringi package
library(stringi)
stri_locate_first_fixed(df$seq, "1")[, 1]
## [1] 4 5 2 1 4

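Skip the loop and the ifelse. (Caveat: this depends on how R converts the number back to text. Under default options a value such as 10000000 may be rendered as "1e+07", which changes nchar, so verify the output on your own data.)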
9 - nchar(as.numeric(seq))
## [1] 4 5 2 1 4
This won't work the same in your data.frame because you coerced seq to factor implicitly, so just do:
9 - nchar(as.numeric(as.character(df$seq)))
## [1] 4 5 2 1 4
Edit: Just for fun, since Frank didn't convert his comment into an answer, here's a strsplit solution:
# from original vector
sapply(strsplit(seq, "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4
# from data.frame
sapply(strsplit(as.character(df$seq), "1"), nchar)[1,] + 1
## [1] 4 5 2 1 4

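You can use gregexpr. Note that it returns every match, so you get exactly one position per string only when each string contains a single "1":
> unlist(gregexpr(pattern = "1", seq, fixed = TRUE))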
[1] 4 5 2 1 4
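Since only the first match is needed, base R's regexpr is a natural fit: it reports one position per string, and -1 when there is no match. A minimal sketch, assuming the strings are stored as character (the name seq_chr and the two extra test strings are made up for illustration):

```r
# Sample strings; the last two are added here to show multiple and missing matches
seq_chr <- c("00010000", "00001000", "01000000", "10000000",
             "00010010", "00000000")

# regexpr returns the position of the first match only (-1 if none)
first_pos <- as.integer(regexpr("1", seq_chr, fixed = TRUE))

# Turn "no match" into NA so it cannot be mistaken for a real month
first_pos[first_pos == -1L] <- NA_integer_

first_pos
# [1]  4  5  2  1  4 NA
```

Because this works on the text directly, it should not care whether the strings are 8 or 36 characters long. If the column is a factor, wrap it in as.character() first.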

The following could do this job:
library(stringr)
str_locate(pattern ='1',seq)

Some comparisons:
library(stringi)
library(stringr)
seq <- c("00010010","00001000","10000010","10000000","00010000")
seq2 <- rep(seq, 5e6)
system.time(regexpr("1", seq2))
user system elapsed
4.78 0.03 4.82
system.time(9-nchar(as.numeric(as.character(seq2))))
user system elapsed
34.89 0.18 35.52
system.time(str_locate(pattern ='1',seq2))
user system elapsed
6.17 0.21 6.53
system.time(stri_locate_first_fixed(seq2, "1")[, 1])
user system elapsed
1.68 0.15 1.84
system.time(nchar(seq2)-round(log10(as.numeric(seq2))))
user system elapsed
7.67 0.09 7.86
system.time(nchar(sub('1.*', '', seq2))+1)
user system elapsed
14.61 0.11 14.93

Another one, using log:
nchar(seq)-round(log10(as.numeric(seq)))

Another option using sub
nchar(sub('1.*', '', seq))+1
#[1] 4 5 2 1 4

Related

How to sort a vector in R without repeating ranks

Good afternoon,
My question may seem very elementary, but I'm having trouble with it.
Assume we have the following vector:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
I want to sort the vector in decreasing order and then get the indices, that is:
sort.int(x, index.return=TRUE,decreasing=TRUE)
$x
[1] 1.00 1.00 0.75 0.75 0.50 0.50 0.50 0.25 0.25
$ix
[1] 3 4 1 2 5 6 7 8 9
However, the expected output should be:
y=c(2,2,1,1,3,3,3,4,4)
This means:
1 is the highest value -> 1
0.75 is the second highest value -> 2
0.5 is the third highest value -> 3
0.25 is the lowest value -> 4
I also tried:
x=c(0.75,0.75,1,1,0.5,0.5,0.5,0.25,0.25)
order(unique(sort(x)))
[1] 1 2 3 4
sort(unique(x),decreasing=TRUE)
[1] 1.00 0.75 0.50 0.25
But I don't know how to get from these results to the expected output y.
Thank you for your help!
sort will sort all the values, and use each value once. It seems like you want to ignore the indices of duplicated values after the first. We can use match for this, which will always return the index of the first match.
match(sort.int(x, decreasing = TRUE), unique(x))
# [1] 2 2 1 1 3 3 3 4 4
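The call above reproduces y for this particular x, but it looks the sorted values up in first-appearance order, which is not quite the same thing as ranking each element. For an input such as c(2, 1, 3) it would give 3 1 2, whereas the rank-style answer is 2 3 1. A hedged alternative, assuming what is wanted is a "dense" rank in decreasing order (the names levels_desc and x2 are illustrative):

```r
x <- c(0.75, 0.75, 1, 1, 0.5, 0.5, 0.5, 0.25, 0.25)

# Distinct values from largest to smallest: 1.00 0.75 0.50 0.25
levels_desc <- sort(unique(x), decreasing = TRUE)

# For each element of x, its position among the distinct values
y <- match(x, levels_desc)
y
# [1] 2 2 1 1 3 3 3 4 4

# Same idea on an input where order and rank differ
x2 <- c(2, 1, 3)
match(x2, sort(unique(x2), decreasing = TRUE))
# [1] 2 3 1
```

This keeps the result aligned with the original positions in x, so it can be stored next to x as a new column.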

delete vector entries based on another vector

I have two vectors
a <- c(1:20)
b <- c(2,11,14)
I want to delete the entries in the a vector based on the vector entries in b (I want the 2nd, 11th, and 14th entries deleted).
I've tried several methods, including:
c <- a[!a %in% b]
but that doesn't work.
Any suggestions? I've tried searching SO, but can only find deleting based on values.
You can simply index into a and remove the elements at indices in b as follows:
a <- c(1:20)
b <- c(2,11,14)
a[-b]
[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
I created 3.1 million entries and am randomly sampling 100,000 to remove. As can be seen, it is blazing fast.
a <- 1:3100000
b <- sample(a, 100000)
system.time(a[-b])
user system elapsed
0.024 0.003 0.027
Edited: Adding this extra check option based on comment below by akrun and thelatemail to handle the case where b might be null.
a[if(length(b)) -b else TRUE]
The approach by @Gopala works in most cases, except when the 'b' vector is NULL or has length zero. To make it a bit more general, we can build the logical condition using seq_along(a) with %in%:
a[!seq_along(a) %in% b]
#[1] 1 3 4 5 6 7 8 9 10 12 13 15 16 17 18 19 20
Now, if we change 'b' to
b <- vector('integer')
a[-b]
#integer(0)
a[!seq_along(a) %in% b]
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The former returns a vector of length 0, while the %in% approach returns the whole vector 'a'.
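Another way to express the same idea is to compute the positions to keep with setdiff, which also copes with an empty index vector. A minimal sketch (drop_positions is a hypothetical helper name, not something from the answers here):

```r
# Remove the elements of v at the positions given in idx.
# An empty (or NULL) idx removes nothing.
drop_positions <- function(v, idx) {
  keep <- setdiff(seq_along(v), idx)  # positions that survive
  v[keep]
}

a <- 1:20

drop_positions(a, c(2, 11, 14))   # drops the 2nd, 11th and 14th entries
drop_positions(a, integer(0))     # returns all 20 entries unchanged
```

Like the %in% version, this is likely to be slower than plain a[-b], so it is mainly useful when b can legitimately be empty.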
The other method is clearly more efficient, but if we need an approach that also handles the case I mentioned, this one can be used.
system.time(a[-b])
# user system elapsed
# 0.07 0.00 0.08
system.time(a[!seq_along(a) %in% b])
# user system elapsed
# 0.17 0.01 0.18
The approach posted by @thelatemail to make the first approach general:
system.time(a[if(length(b)==0) TRUE else -b])
# user system elapsed
# 0.05 0.00 0.05
NOTE: Benchmark data from @Gopala's post.

Replace strings in data frame columns with integer in R

I have a data frame called 'foo':
foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
'foo' was imported from a different program as a CSV file and has two columns. One is a numeric data type and the other is a factor data type.
str(foo)
'data.frame': 5 obs. of 2 variables:
$ row1: num 1 2 3 4 5
$ row2: Factor w/ 4 levels "-","1","2.01",..: 2 3 4 1 1
Notice there are dashes, e.g. "-", in foo$row2, which cause this column to be read as a factor. I want to replace the dashes with zeros, such that data.class(foo$row2) will return "numeric". The idea is to replace all dashes in each column so I can run numerical analyses on it with R.
What is the simplest way to do this in R?
Thanks,
Q: The idea is to replace all dashes in each column so I can run numerical analyses on it with R.
Use apply or sapply with sub
kk<-data.frame(apply(foo,2,function(x) as.numeric(sub("-",0,x))))
> kk
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00
> str(kk$row2)
num [1:5] 1 2.01 3 0 0
Or, you can use sapply
kk<-data.frame(sapply(names(foo),function(x)as.numeric(sub("-",0,foo[,x]))))
Update:
If you want just the second column, you don't need to use apply:
foo$row2 <- as.numeric(sub("-",0,foo[,2]))
Here is one simple way to do it. There might be a more elegant way, but this will work:
> foo <- data.frame("row1" = c(1,2,3,4,5), "row2" = c(1,2.01,3,"-","-"))
> levels(foo$row2)[levels(foo$row2)=="-"]<-0
> foo$row2<-as.numeric(as.character(foo$row2))
> class(foo$row2)
[1] "numeric"
> foo
row1 row2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 0.00
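I would use ifelse() for this:
foo$row2 <- ifelse(foo$row2 == "-", 0, as.numeric(as.character(foo$row2)))
The as.character() step is needed because row2 is a factor: calling as.numeric() directly on a factor returns the internal level codes rather than the values. R will warn about NAs introduced by coercion for the dash entries, but those positions are replaced by 0.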
How about gsub...
as.numeric( gsub("-" , 0 , foo[,2] ) )
#[1] 1.00 2.01 3.00 0.00 0.00
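One thing to watch with the sub/gsub answers: the pattern "-" matches any dash, so a negative value such as "-3" would become "03" and lose its sign. A minimal sketch that only replaces cells consisting of a lone dash, assuming the column is a factor or character (the "-3" entry is added here purely for illustration):

```r
# Like the question's data, but with one negative value to show the difference
foo <- data.frame(row1 = c(1, 2, 3, 4, 5),
                  row2 = c("1", "2.01", "-3", "-", "-"),
                  stringsAsFactors = TRUE)

# Work on the text, not on the factor's internal codes
txt <- as.character(foo$row2)

# Replace only cells that are exactly a dash (ignoring surrounding spaces)
txt[trimws(txt) == "-"] <- "0"

foo$row2 <- as.numeric(txt)
foo$row2
# [1]  1.00  2.01 -3.00  0.00  0.00
```

If the data never contains negative numbers, the shorter sub("-", 0, ...) versions give the same result.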

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, taking the values in the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first problem, of reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame but I don't know what they are right now.
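The long-to-wide step can be tried without downloading the file. A small self-contained sketch using base reshape, assuming the same three-column layout (V1 = field name, V2 = id, V3 = value); the "#798" rows are invented so that there is more than one id:

```r
# Long format: one row per (field, id) pair, as in the question
long <- data.frame(
  V1 = rep(c("name", "at_rank", "to_center", "predicted"), times = 2),
  V2 = rep(c("#797", "#798"), each = 4),
  V3 = c("Stachy, Poland", "1", "4.70", "4.70",
         "Elsewhere", "2", "1.50", "3.00"),   # second id is made-up data
  stringsAsFactors = FALSE
)

# One row per id, one column per field
wide <- reshape(long, direction = "wide", idvar = "V2", timevar = "V1")

# Drop the "V3." prefix that reshape adds to the new column names
names(wide) <- sub("^V3\\.", "", names(wide))

# Everything arrives as text, so convert the numeric fields explicitly
num_cols <- c("at_rank", "to_center", "predicted")
wide[num_cols] <- lapply(wide[num_cols], as.numeric)

wide
```

The explicit conversion at the end matters because a single value column that mixes names and numbers has to be stored as text (or a factor) in the long format.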

Is there a faster way to get percent change?

I have a data frame with around 25000 records and 10 columns. I am using code to determine the change to the previous value in the same column (NewVal) based on another column (y) with a percent change already in it.
x=c(1:25000)
y=rpois(25000,2)
z=data.frame(x,y)
z[1,'NewVal']=z[1,'x']
So I ran this:
for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+(z$NewVal[i-1]*(z$y[i]/100))}
This takes considerably longer than I expected it to. Granted I may be an impatient person - as a scathing letter drafted to me once said - but I am trying to escape the world of Excel (after I read http://www.burns-stat.com/pages/Tutor/spreadsheet_addiction.html, which is causing me more problems as I have begun to mistrust data - that letter also mentioned my trust issues).
I would like to do this without using any of the functions from packages as I would like to know what the formula for creating the values is - or if you will, I am a demanding control freak according to that friendly missive.
I would also like to know how to get a moving average just like rollmean in caTools. Either that or how do I figure out what their formula is? I tried entering rollmean and I think it refers to another function (I am new to R). This should probably be another question - but as that letter said, I don't ever make the right decisions in my life.
The secret in R is to vectorise. In your example you can use cumprod to do the heavy lifting:
z$NewVal2 <- x[1] * cumprod(with(z, 1 +(c(0, y[-1]/100))))
all.equal(z$NewVal, z$NewVal2)
[1] TRUE
head(z, 10)
x y NewVal NewVal2
1 25 4 25.00000 25.00000
2 24 3 25.75000 25.75000
3 23 0 25.75000 25.75000
4 22 1 26.00750 26.00750
5 21 3 26.78773 26.78773
6 20 2 27.32348 27.32348
7 19 2 27.86995 27.86995
8 18 3 28.70605 28.70605
9 17 4 29.85429 29.85429
10 16 2 30.45138 30.45138
On my machine, the loop takes just less than 3 minutes to run, while the cumprod statement is virtually instantaneous.
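To see why cumprod gives the same numbers, note that each step multiplies the previous value by (1 + y/100), so the i-th value is the first value times the running product of those factors. A small check on a reduced data set (n = 1000 and the seed are arbitrary choices for this sketch, and the loop writes into a plain vector rather than a data frame column):

```r
set.seed(1)                 # arbitrary seed so the run is repeatable
n <- 1000                   # smaller than the question's 25000 rows
z <- data.frame(x = 1:n, y = rpois(n, 2))

# Loop version: each value is the previous one grown by y[i] percent
loop_val <- numeric(n)
loop_val[1] <- z$x[1]
for (i in 2:n) {
  loop_val[i] <- loop_val[i - 1] * (1 + z$y[i] / 100)
}

# Vectorised version: first value times the cumulative product of the factors
vec_val <- z$x[1] * cumprod(c(1, 1 + z$y[-1] / 100))

all.equal(loop_val, vec_val)
```

Preallocating a numeric vector and filling it in the loop is also typically much faster than assigning into z$NewVal[i] on every iteration, though cumprod remains the simpler option.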
I got about an 800-fold improvement with Reduce:
system.time(z[, "NewVal"] <-Reduce("*", c(1, 1+z$y[-1]/100), accumulate=T) )
user system elapsed
0.139 0.008 0.148
> head(z, 10)
x y NewVal
1 1 1 1.000
2 2 1 1.010
3 3 1 1.020
4 4 5 1.071
5 5 1 1.082
6 6 2 1.103
7 7 2 1.126
8 8 3 1.159
9 9 0 1.159
10 10 1 1.171
> system.time(for(i in 2:nrow(z)){z$NewVal[i]=z$NewVal[i-1]+
(z$NewVal[i-1]*(z$y[i]/100))})
user system elapsed
37.29 106.38 143.16
