strsplit(rquote, split = "")[[1]] in R - r

rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
This question has been asked before on this forum and has one answer on it but I couldn't understand anything from that answer, so here I am asking this question again.
In the above code what is the meaning of [[1]] ?
The program that I'm trying to run:
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]
rcount <- 0
for (char in chars) {
if (char == "r") {
rcount <- rcount + 1
}
if (char == "u") {
break
}
}
print(rcount)
When I don't use [[1]] I get the following warning message in for loop and I get a wrong output of 1 for rcount instead of 5:
Warning message: the condition has length > 1 and only the first element will be used

strsplit is vectorized. That means it splits each element of a character vector into its own vector. To hold this vector of vectors it returns a list in which each slot (indexed by [[) corresponds to one element of the input vector.
If you use the function on a one-element vector (a single string, as you do), you get a one-slot list. Using [[1]] right after strsplit() selects the first slot of the list, which is the character vector you expected.
Unfortunately, your list chars still works in a for loop: you get a single iteration over the one slot. In the if you then compare the whole vector of letters against "r", which throws the warning. Since the first element of the comparison is TRUE, the condition holds and rcount is raised by 1, which is your result. Since you are iterating over the one phrase rather than over its letters, the loop stops there.
Maybe if you run something like strsplit(c("one", "two"), split="") , the outcome will be more straightforward.
> strsplit(c("one", "two"), split="")
[[1]]
[1] "o" "n" "e"
[[2]]
[1] "t" "w" "o"
> strsplit(c("one", "two"), split="")[[1]]
[1] "o" "n" "e"
> strsplit(c("one"), split="")[[1]][2]
[1] "n"
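The difference between looping over the list and looping over its first slot can be checked directly. This is only a minimal sketch reusing rquote from the question; the names as_list, n_iter_list and n_iter_chars are made up for illustration.

```r
rquote <- "r's internals are irrefutably intriguing"

as_list <- strsplit(rquote, split = "")   # a list with one slot
chars   <- as_list[[1]]                   # the character vector inside that slot

# Count how many times each for loop body runs
n_iter_list <- 0
for (x in as_list) n_iter_list <- n_iter_list + 1

n_iter_chars <- 0
for (x in chars) n_iter_chars <- n_iter_chars + 1

print(c(list = n_iter_list, chars = n_iter_chars))
```

The list gives a single iteration in which x is the whole vector of letters, while chars gives one iteration per letter. Note that in recent R versions (4.2.0 and later) a condition of length > 1 inside if is an error rather than a warning.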

We'll start with the below as data, without [[1]]:
rquote <- "r's internals are irrefutably intriguing"
chars2 <- strsplit(rquote, split = "")
class(chars2)
[1] "list"
It is always good to have an estimate of the value you expect back, here your '5'. To inspect the object we have both length and lengths.
length(chars2)
[1] 1 # our list
lengths(chars2)
[1] 40 # elements within our list
We'll use lengths as the upper bound of our for loop and, as you did, set up a counter outside the loop:
rcount2 <- 0
for (i in 1:lengths(chars2)) {
if (chars2[[1]][i] == 'r') {
rcount2 <- rcount2 +1
}
if (chars2[[1]][i] == 'u') {
break
}
}
print(rcount2)
[1] 6
length(which(chars2[[1]] == 'r')) # as a check, and another way to estimate
[1] 6
Now supposing, rather than list, we have a character vector:
chars1 <- strsplit(rquote, split = '')[[1]]
length(chars1)
[1] 40
rcount1 <- 0
for(i in 1:length(chars1)) {
if(chars1[i] == 'r') {
rcount1 <- rcount1 +1
}
if (chars1[i] == 'u') {
break
}
}
print(rcount1)
[1] 5
length(which(chars1 == 'r'))
[1] 6
Hey, there's your '5'. What's going on here? Head scratch...
all.equal(chars1, unlist(chars2))
[1] TRUE
That break should stop the loop at the first 'u', after 5 'r's, whether we index into a list or into a vector. So how did the final r make it into rcount2? It didn't: on rerunning, the list version also gives 5, so the 6 printed for rcount2 above was a mistake on my part (usual morning hallucination, they come and go). The list and the vector behave the same here as long as you index the list with [[1]].
As a final note, when you really want to check this kind of thing, put browser() inside your for loop and step through:
Browse[1]> i
[1] 24
Browse[1]> n
debug at #7: break
Browse[1]> chars2[[1]][i] == 'u'
[1] TRUE
Browse[1]> n
> rcount2
[1] 5
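The same two numbers can be obtained without a loop. The following is a rough sketch, assuming the string contains at least one "u" (otherwise match() returns NA and the subsetting would need a guard); first_u, before_u and the two count names are invented for this example.

```r
rquote <- "r's internals are irrefutably intriguing"
chars <- strsplit(rquote, split = "")[[1]]

first_u  <- match("u", chars)              # position of the first "u"
before_u <- chars[seq_len(first_u - 1)]    # everything the loop sees before break

rcount_before_u <- sum(before_u == "r")    # r's counted before the break
rcount_total    <- sum(chars == "r")       # r's in the whole string

print(c(before_u = rcount_before_u, total = rcount_total))
```

The first count corresponds to the loop with break, the second to length(which(chars1 == 'r')).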

Related

String matching within a list of lists [duplicate]

I have a list like this:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
> grep("ABC", map_tmp)
[1] 1 3
> grep("^ABC$", map_tmp)
[1] 1 # by using regex, I get the index of "ABC" in the list
> grep("^KML$", map_tmp)
[1] 5 # I wanted 3, but I got 5. Claiming the end of a string by "$" didn't help in this case.
> grep("^HIJ$", map_tmp)
integer(0) # the regex do not return to me the index of a string inside the vector
How can I get the index of a string (exact match) in the list?
I'm ok not to use grep. Is there any way to get the index of a certain string (exact match) in the list? Thanks!
Using lapply:
which(lapply(map_tmp, function(x) grep("^HIJ$", x))!=0)
The lapply call gives you, for each element of the list, the positions inside that element where the pattern matches (integer(0) if there is no match). Wrapping it in which(... != 0) then gives you the index of the list element where your string occurs.
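To see what the lapply step produces before which() is applied, it may help to look at the intermediate result. This is a sketch using the map_tmp list from the question; per_element is a hypothetical name, and lengths() is used here as one alternative way to pick out the non-empty results.

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# One grep() per list element: positions of exact matches inside that element
per_element <- lapply(map_tmp, function(x) grep("^HIJ$", x))
print(per_element)

# List elements that had at least one match
idx <- which(lengths(per_element) > 0)
print(idx)
```

Only the second element has a non-empty result, because "HIJ" is the second string inside c("EGF", "HIJ").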
Use either mapply or Map with str_detect to find the position. I have run it only for one string, "KML"; you can run it for all the others in the same way. I hope this is helpful.
First of all we make the list elements the same length so that we can process them easily:
library(stringr)
map_tmp_1 <- lapply(map_tmp, `length<-`, max(lengths(map_tmp)))
### Making the list even
val <- t(mapply(str_detect,map_tmp_1,"^KML$"))
> which(val[,1] == T)
[1] 3
> which(val[,2] == T)
integer(0)
In case of "ABC" string:
val <- t(mapply(str_detect,map_tmp_1,"ABC"))
> which(val[,1] == T)
[1] 1
> which(val[,2] == T)
[1] 3
>
I had the same question. I cannot explain why grep would work well in a list with characters but not with regex. Anyway, the best way I found to match a character string using common R script is:
map_tmp <- list("ABC",
c("EGF", "HIJ"),
c("KML", "ABC-IOP"),
"SIN",
"KMLLL")
sapply( map_tmp , match , 'ABC' )
It returns a list with similar structure as the input with 'NA' or '1', depending on the result of the match test:
[[1]]
[1] 1
[[2]]
[1] NA NA
[[3]]
[1] NA NA
[[4]]
[1] NA
[[5]]
[1] NA
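If what you want is the index of the list element rather than the per-element match results, an exact comparison with %in% is one option. As background, grep() coerces a non-character x with as.character(), so a vector stored inside the list is searched as a single deparsed string, which is why the ^ and $ anchors do not line up with the individual strings. The helper name find_exact below is made up for this sketch.

```r
map_tmp <- list("ABC",
                c("EGF", "HIJ"),
                c("KML", "ABC-IOP"),
                "SIN",
                "KMLLL")

# What grep() actually searches for the third element: one deparsed string
print(as.character(map_tmp)[3])

# Hypothetical helper: indices of list elements containing target exactly
find_exact <- function(lst, target) {
  which(vapply(lst, function(x) target %in% x, logical(1)))
}

print(find_exact(map_tmp, "KML"))
print(find_exact(map_tmp, "HIJ"))
print(find_exact(map_tmp, "ABC"))
```

Because %in% compares whole strings, "KMLLL" and "ABC-IOP" are not reported as matches for "KML" or "ABC".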

Once statement in R i.e. evaluate if statement only once

I am looking for a method in R to run the block inside the if statement only the first time the if statement is evaluated as TRUE, but the block would not be run again even if the if condition is TRUE again. Specifically, the method would be useful in a loop.
This would be the "once" statement (it is called so in some exotic languages).
Example:
for (id in id_list){ # runs over a list of several id's which are random
if (id == "snake"){ # I want to run this block only the first time and NOT each time id == "snake"
# now, do some calculations
# ...
}
# do some other calculations by default for all other runs inside the loop
# ...
}
I would be also curious to know how would this work in Python.
1) duplicated. Using the test input shown in the first line, iterate over an index and add a condition using duplicated. This avoids a flag, which makes it less error prone.
id_list <- c("a", "snake", "b", "snake") # test input
dup <- duplicated(id_list)
for(i in seq_along(id_list)) {
if (id_list[i] == "snake" && (!dup)[i]) print("snake")
print(i)
}
giving:
[1] 1
[1] "snake"
[1] 2
[1] 3
[1] 4
2) match. Another approach is to determine which iteration is the first instance of snake and use that in the condition.
ix <- match("snake", id_list, nomatch = 0)
for(i in seq_along(id_list)) {
if (i == ix) print("snake")
print(i)
}
giving:
[1] 1
[1] "snake"
[1] 2
[1] 3
[1] 4
3) once
Another approach is to create a once function which returns TRUE the first time it is run and FALSE otherwise. This does use a mutable variable, x, (similar to a flag) but at least it is encapsulated. The genOnce function outputs a fresh once function.
It is important to use && in the condition to ensure that the right hand side of && is only run if the left hand side is TRUE. & does not have that short circuiting property.
genOnce <- function(x = 0) function() (x <<- x + 1) == 1
once <- genOnce()
for(id in id_list) {
if (id == "snake" && once()) print("***")
print(id)
}
giving:
[1] "a"
[1] "***"
[1] "snake"
[1] "b"
[1] "snake"
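The remark about && versus & can be checked with two independent once functions. This is a small sketch built on genOnce from the answer; once_short, once_long, fired_short and fired_long are names invented for the comparison.

```r
genOnce <- function(x = 0) function() (x <<- x + 1) == 1
id_list <- c("a", "snake", "b", "snake")

once_short <- genOnce()     # combined with && (short-circuits)
once_long  <- genOnce()     # combined with &  (always evaluated)
fired_short <- integer(0)   # iterations where the && block ran
fired_long  <- integer(0)   # iterations where the &  block ran

for (i in seq_along(id_list)) {
  id <- id_list[i]
  if (id == "snake" && once_short()) fired_short <- c(fired_short, i)
  if (id == "snake" &  once_long())  fired_long  <- c(fired_long, i)
}

print(fired_short)
print(fired_long)
```

With &, once_long() is already called on "a", so its single TRUE is used up before the first snake and the block never runs. With &&, the call only happens once the left-hand side is TRUE.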
Suggested solution using a 'global' variable (i.e. 'flag') to denote first pass into if clause:
first <- TRUE
for (i in 1:5) {
if (first && i > 0) {
print("run this block only the first time")
first <- FALSE
}
print("do some other calculations")
}
Output:
[1] "run this block only the first time"
[1] "do some other calculations"
[1] "do some other calculations"
[1] "do some other calculations"
[1] "do some other calculations"
[1] "do some other calculations"

How to access variable length lists inside a list in R

When I call strsplit() on a column of a data frame, depending on the results of the strsplit(), I sometimes get one or two "sublists" as a result of splitting. For example,
v <- c("50", "1 h 30 ", "1 h", NA)
split <- strsplit(v, "h")
[[1]]
[1] "50"
[[2]]
[1] "1 "   " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
I know I can access the individual lists of split using '[]' and '[[]]' tells me the contents of those sublists, so I think I understand that. And that I can access the " 30" in [[2]] by doing split[[2]][2].
Unfortunately, I don't know how to access this programmatically over the entire column that I have. I am trying to convert the column to numeric data. But that "1 h 30" case is giving me a lot of trouble.
func1 <- function(x){
split.l <- strsplit(x, "h")
len <- lapply(split.l, length)
total <- ifelse(len == 2, as.numeric(split.l[2]) + as.numeric(split.l[1]) * 60, as.numeric(split.l[2]))
return(total)
}
v <- ifelse(grepl("h", v), func1(v), as.numeric(v))
I know len returns the vector of the length of the splits. But when it comes to actually accessing that individual sublist's second element, I simply don't know how to do it properly. This will generate an error because split.l[1] and split.l[2] will only return the first two elements of the entire original dataframe column every time. [[1]] and [[2]] won't work either. I need something like [[i]][1] and [[i]][2]. But I'm trying not to use a for loop and iterate.
To make a long story short, how do I access the inner list elements programmatically?
For reference, I did look at this which helped. But I still haven't been able to solve it. apply strsplit to specific column in a data.frame
I'm really struggling with lists and list processing in R so any help is appreciated.
A common idiom is lapply(l, `[`, 2), which applied to your example gives:
> lapply(split, `[`, 2)
[[1]]
[1] NA
[[2]]
[1] " 30 "
[[3]]
[1] NA
[[4]]
[1] NA
sapply() will collapse this to a vector if it can.
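As a quick illustration of that simplification, here is the same extraction with sapply(). The list is called parts rather than split in this sketch (an arbitrary choice, made only to avoid masking base::split()).

```r
v <- c("50", "1 h 30 ", "1 h", NA)
parts <- strsplit(v, "h")

# Second piece of every element; NA where there is no second piece
second <- sapply(parts, `[`, 2)
print(second)
```

Every result has length one here, so sapply() returns a plain character vector instead of a list.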
What is being done is that lapply() takes each component of split in turn (this is the [[i]] bit of your pseudo code), and from each of those we want to extract the nth element. We do that by applying the [ function with argument n, in this case 2L.
If you want the first element unless there is a second element, in which case take the second, you could just write a wrapper instead of using [ directly:
wrapper <- function(x) {
if(length(x) > 1L) {
x[2L]
} else {
x[1L]
}
}
lapply(split, wrapper)
which gives
> lapply(split, wrapper)
[[1]]
[1] "50"
[[2]]
[1] " 30 "
[[3]]
[1] "1 "
[[4]]
[1] NA
or perhaps
lens <- lengths(split)
out <- lapply(split, `[`, 2L)
ind <- lens == 1L
out[ind] <- lapply(split[ind], `[`, 1L)
out
but that loops over the output from strsplit() twice.
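Putting the idiom to work on the original goal, a numeric column, could look roughly like this. It is a sketch under assumptions taken from the question's own code: a value without "h" is already in minutes, the part before "h" is hours, and a missing minutes part (as in "1 h") counts as 0. The names parts, has_h, first, second and minutes are hypothetical.

```r
v <- c("50", "1 h 30 ", "1 h", NA)

has_h <- grepl("h", v)          # which entries contain an hours part
parts <- strsplit(v, "h")

# First and second piece of every element, converted to numbers
first  <- as.numeric(trimws(sapply(parts, `[`, 1L)))
second <- as.numeric(trimws(sapply(parts, `[`, 2L)))

# hours * 60 + minutes where there is an "h", plain minutes otherwise
minutes <- ifelse(has_h,
                  first * 60 + ifelse(is.na(second), 0, second),
                  first)
print(minutes)
```

No explicit for loop is needed, since sapply() does the [[i]][1] and [[i]][2] indexing for every element.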

Dynamically numbering files in R with placeholders using an If-elseif

I have a vector
x <- c(1,90,233)
I need to convert this to a vector of the form:
result = c("001.csv","090.csv","233.csv")
This is the function that I wrote to perform this operation:
convert <- function(x){
for (a in 1:length(x)){
if (x[a]<10) {
x[a]<- paste("00",x[a],".csv",sep="")
}
else if (x[a] < 100) {
x[a]<- paste("0", x[a], ".csv",sep="")
}
else {
x[a]<-paste(x[a],".csv",sep="")
}
}
x
}
The output I got was:
[1] "001.csv","90.csv","233.csv"
So x[2], which is 90, was processed in the else part and not the else if part. Then I changed the else if condition to x[a] <= 99:
convert <- function(x){
for (a in 1:length(x)){
if (x[a]<10) {
x[a]<- paste("00",x[a],".csv",sep="")
}
else if (x[a] <= 99) {
x[a]<- paste("0", x[a], ".csv",sep="")
}
else {
x[a]<-paste(x[a],".csv",sep="")
}
}
x
}
I got this output:
[1] "001.csv" "090.csv" "0233.csv"
Now both x[2] and x[3] ie 90 and 233 are being processed in the ElseIf part. What am I doing wrong here? And how do I get the output I need?
This is a little bit more dynamic as you do not need to specify the number of places held by the largest number.
Step 1:
Obtain the maximum number of places held.
(nb = max(nchar(x)))
To get:
3
Step 2:
Paste the number into a sprintf() call that will automatically format the digit.
sprintf("%0*d.csv", nb, x)
To get:
[1] "001.csv" "090.csv" "233.csv"
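The two steps can be wrapped into a small helper. This is a sketch only: the function name make_names and its ext argument are made up, and it assumes x holds non-negative whole numbers.

```r
make_names <- function(x, ext = ".csv") {
  x <- as.integer(x)                       # assumes whole, non-negative numbers
  width <- max(nchar(as.character(x)))     # step 1: widest number
  sprintf("%0*d%s", width, x, ext)         # step 2: zero-pad and add extension
}

print(make_names(c(1, 90, 233)))
print(make_names(c(5, 12)))
```

The padding width follows the largest number in the input, so the second call pads to two digits rather than three.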
The problem is that the first round of your loop assigns a character string, which converts the whole vector to type character. You can get around that using nchar:
convert <- function(x){
for (a in 1:length(x)){
if (nchar(x[a]) == 1) {
x[a]<- paste("00",x[a],".csv",sep="")
}
else if (nchar(x[a]) == 2) {
x[a]<- paste("0", x[a], ".csv",sep="")
}
else {
x[a]<-paste(x[a],".csv",sep="")
}
}
x
}
sprintf("%03d", x)
[1] "001" "090" "233"
You can avoid a call to paste by including the ".csv" in the format string:
sprintf("%03d.csv", x)
[1] "001.csv" "090.csv" "233.csv"
The problem with the original code is the conversion to character, which happens on the first element.
Here's the conversion to character:
> x <- c(1, 90, 233)
> x
[1] 1 90 233
> x[1] <- "001.csv"
> x
[1] "001.csv" "90" "233"
Here's the resulting comparison of the second element:
> "90" <= 99
[1] TRUE
> "90" < 100
[1] FALSE
Similarly for the third:
> "233" < 100
[1] FALSE
> "233" <= 99
[1] TRUE
In all of these cases, the right-hand side is converted to character, then the comparison is made, as character strings.
Your code doesn't work as expected because the whole vector gets converted into a character vector after the first assignment (conversion of numeric to character).
Please note that when a string is compared to a number, the number is converted to a string and the characters are compared one by one. E.g. if you compare "90" to 100, then 9 is compared to 1, hence control goes to the else part; and in the comparison of "233" to 99, 2 is compared to 9.
You can get around this by assigning the changed values to another vector. Or, you could use the str_pad function from the stringr package.
library(stringr)
x=c(1,90,233)
padded_name= str_pad(x,width=3,side="left",pad="0")
file_name = paste0(padded_name, ".csv")
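The first suggestion, writing the results into another vector, could look like this. It is a minimal sketch that keeps the structure of the original convert function; the result vector out is a name chosen for illustration.

```r
convert <- function(x) {
  out <- character(length(x))   # separate result vector, so x stays numeric
  for (a in seq_along(x)) {
    if (x[a] < 10) {
      out[a] <- paste0("00", x[a], ".csv")
    } else if (x[a] < 100) {
      out[a] <- paste0("0", x[a], ".csv")
    } else {
      out[a] <- paste0(x[a], ".csv")
    }
  }
  out
}

print(convert(c(1, 90, 233)))
```

Because x is never assigned a string, every comparison in the loop stays numeric.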
