Separating A String Into Characters - r

I have some ordered test results encoded in a character string. The string can be of arbitrary length. Each digit in the string represents a test result. In the following, for example, there are four test results represented:
2069
I want to tidy these up in R by splitting the string into individual observations. This is no problem with strsplit or stringr::str_split, either of which returns four values that will become my observations.
strsplit("2069" %>% as.character(), split = "") %>% unlist()
[1] "2" "0" "6" "9"
Now, however, I have realized that some results are values greater than 9. These two-digit values have been encoded with parentheses to make clear they are not individual results.
For example, in the following case I still have four values, but some have been enclosed in parentheses to group the values larger than 9.
2(10)1(12)
I'm struggling with a way to break these up so that I get
[1] "2" "10" "1" "12"
Appreciate any guidance. Thanks.

Updated: a pattern match based on the OP's new pattern shown in the comments. Here, we use str_extract_all to extract one or more digits that follow an opening parenthesis (a regex lookbehind), or (|) any single character that is not a parenthesis ([^()])
library(stringr)
str_extract_all(str1, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
-testing on the OP's extra pattern
str_extract_all(str2, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
[[5]]
[1] "10" "0" "2" "0" "1"
-Earlier solutions (Based on the assumption that all the numbers that are greater than 9 will be wrapped inside the parentheses)
We may split on the parentheses in base R
unlist(strsplit(str1[1], "\\(|\\)"))
[1] "2" "10" "1" "12"
If both cases occur, one option is to get the index of the elements that contain parentheses and split the two groups separately
i1 <- grepl("\\(|\\)", str1)
lst1 <- vector('list', length(str1))
lst1[i1] <- strsplit(str1[i1], "\\(|\\)")
lst1[!i1] <- strsplit(str1[!i1], "")
unlist(lst1)
[1] "2" "10" "1" "12" "2" "0" "6" "9" "2" "15" "2" "1" "3" "1"
or another option is ifelse with grepl to create a single delimiter and then use strsplit
lst1 <- strsplit(trimws(ifelse(grepl("\\(|\\)", str1),
gsub("\\(|\\)", ",", str1), gsub("(?<=.)(?=.)", "\\1,\\2",
str1, perl = TRUE)), whitespace = ","), ",")
lst1
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
data
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str2 <- c(str1, "(10)0201")

Maybe we can do it like below (borrowing str1 from @akrun)
> mapply(strsplit, str1, ifelse(grepl("[()]", str1), "\\(|\\)", ""))
$`2(10)1(12)`
[1] "2" "10" "1" "12"
$`2069`
[1] "2" "0" "6" "9"
$`2(15)`
[1] "2" "15"
$`2131`
[1] "2" "1" "3" "1"

Use
(?<=\()\d+(?=\))|\d
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(?<= look behind to see if there is:
--------------------------------------------------------------------------------
\( '('
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
\d+ digits (0-9) (1 or more times (matching
the most amount possible))
--------------------------------------------------------------------------------
(?= look ahead to see if there is:
--------------------------------------------------------------------------------
\) ')'
--------------------------------------------------------------------------------
) end of look-ahead
--------------------------------------------------------------------------------
| OR
--------------------------------------------------------------------------------
\d digits (0-9)
R code:
library(stringr)
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str_extract_all(str1, "(?<=\\()\\d+(?=\\))|\\d")
Results:
[[1]]
[1] "2" "10" "1" "12"
[[2]]
[1] "2" "0" "6" "9"
[[3]]
[1] "2" "15"
[[4]]
[1] "2" "1" "3" "1"
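For readers who prefer to avoid the stringr dependency, the same pattern can be sketched in base R with gregexpr and regmatches, and the pieces can then be stacked into one row per observation. This is only an illustrative sketch: the column names string, position and result are made up here, and it assumes every multi-digit value is wrapped in parentheses.

```r
# Test strings taken from the answers above
str2 <- c("2(10)1(12)", "2069", "2(15)", "2131", "(10)0201")

# Digits enclosed in parentheses, or else a single digit (PCRE lookarounds)
pattern <- "(?<=\\()\\d+(?=\\))|\\d"
parts <- regmatches(str2, gregexpr(pattern, str2, perl = TRUE))

# One row per test result; column names are hypothetical
tidy <- data.frame(
  string   = rep(str2, lengths(parts)),
  position = sequence(lengths(parts)),
  result   = as.integer(unlist(parts))
)
head(tidy, 4)
```

Converting with as.integer at the end keeps the splitting step purely textual, so any unexpected non-digit would show up as NA instead of being silently dropped.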

Related

simple character splitting is baffling me [duplicate]

This question already has answers here:
How do I deal with special characters like \^$.?*|+()[{ in my regex?
The split shown below is driving me crazy... I need some help to spot where the problem is.
> p5<-Data$poorcoverageusers[5]
> p5
[1] "405874050693761|405874004853834|405874056470063|405874055308702"
> strsplit(p5,"|")
[[1]]
[1] "4" "0" "5" "8" "7" "4" "0" "5" "0" "6" "9" "3" "7" "6" "1" "|" "4" "0" "5" "8" "7" "4" "0" "0" "4" "8" "5" "3" "8" "3" "4" "|" "4" "0" "5"
[36] "8" "7" "4" "0" "5" "6" "4" "7" "0" "0" "6" "3" "|" "4" "0" "5" "8" "7" "4" "0" "5" "5" "3" "0" "8" "7" "0" "2"
> typeof(Data$poorcoverageusers[5])
[1] "character"
I wanted it to be split by "|", so the output should have been 405874050693761 405874004853834 405874056470063 405874055308702
What mistake am I making?
Thanks for the help.
library(stringr)
s <- "405874050693761|405874004853834|405874056470063|405874055308702"
str_split(s, fixed("|")) # returns a list of character vectors
# [[1]]
# [1] "405874050693761" "405874004853834" "405874056470063" "405874055308702"
str_split(s, fixed("|"), simplify = TRUE) # returns a character matrix
# [,1] [,2] [,3] [,4]
# [1,] "405874050693761" "405874004853834" "405874056470063" "405874055308702"
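The reason the original call splits at every character is that strsplit treats its split argument as a regular expression, and a bare "|" is the alternation operator with two empty alternatives, so it matches the empty string. A minimal base R sketch of three common ways to make the pipe literal (the variable names by_fixed, by_escape and by_class are only for illustration):

```r
s <- "405874050693761|405874004853834|405874056470063|405874055308702"

# 1. Treat the pattern as a fixed string, not a regex
by_fixed <- strsplit(s, "|", fixed = TRUE)[[1]]

# 2. Escape the metacharacter
by_escape <- strsplit(s, "\\|")[[1]]

# 3. Put it in a character class, where it loses its special meaning
by_class <- strsplit(s, "[|]")[[1]]

by_fixed
```

All three return the same four-element character vector; fixed = TRUE is usually the simplest choice when no regex features are needed.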

How to create a dataframe with various length of rows in R?

I have a list of vectors, paths, which is shown below.
The code is:
for (each in paths)
{
print (each)
}
The output is :
[1] "1" "2"
[1] "1" "2" "3"
[1] "1" "2" "3" "5"
[1] "1" "2" "4"
[1] "1" "2" "4" "5"
[1] "1" "3"
[1] "1" "3" "5"
[1] "1" "4"
[1] "1" "4" "5"
[1] "1" "5"
[1] "2" "3"
[1] "2" "3" "5"
[1] "2" "4"
[1] "2" "4" "5"
[1] "3" "5"
[1] "4" "5"
How can I append all of these as rows of a data frame? as.data.frame fails due to the unequal row lengths.
A data frame is rectangular by definition, with the same number of columns in each row. You could set the length of each of your rows to be the same (they will be filled in with NA), and then rbind them together:
maxlength = max(lengths(paths))
paths2 = lapply(paths, function(x) {length(x) = maxlength; return(x)})
paths_df = do.call(rbind, args = paths2)
That will give a matrix, but you can easily convert to data frame from there.
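As a rough end-to-end sketch of the padding approach, including the final conversion to a data frame: the three-element paths list below is a shortened stand-in for the real data, and the `length<-` shortcut is just a compact way of writing the anonymous function above.

```r
# Shortened stand-in for the list printed in the question
paths <- list(c("1", "2"), c("1", "2", "3"), c("1", "2", "3", "5"))

maxlength <- max(lengths(paths))

# Setting a longer length pads each vector with NA
padded <- lapply(paths, `length<-`, maxlength)

# Bind the padded rows into a matrix, then convert
paths_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
paths_df
```

The columns get the default names V1, V2, and so on, and the shorter paths end in NA.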
A data.frame needs to be rectangular, and all elements of a given column need to be the same type of object. However, a data.frame column can itself be of type list, and the elements of that list can vary in size.
paths=list(1,c(1,2))
df=data.frame("pathNumber"= 1:length(paths))
df$path=paths
The result looks like this
pathNumber path
1 1 1
2 2 1, 2
One option is to have the list as a column of a data frame. This may be desirable if you want to have some other columns.
df <- data.frame(paths = I(paths))

R: Using gsub to replace a digit matched by pattern (n) with (n-1) in character vector

I am trying to match the last digit in a character vector and replace it with the matched digit - 1. I believe gsub is what I need to use, but I cannot figure out what to use as the 'replace' argument. I can match the last number using:
gsub('[0-9]$', ???, chrvector)
But I am not sure how to replace the matched number with itself - 1.
Any help would be much appreciated.
Thank you.
We can do this easily with gsubfn
library(gsubfn)
gsubfn("([0-9]+)", ~as.numeric(x)-1, chrvector)
#[1] "str97" "v197exdf"
Or for the last digit
gsubfn("([0-9])([^0-9]*)$", ~paste0(as.numeric(x)-1, y), chrvector2)
#[1] "str97" "v197exdf" "v33chr138d"
data
chrvector <- c("str98", "v198exdf")
chrvector2 <- c("str98", "v198exdf", "v33chr139d")
Assuming the last digit is not zero,
chrvector <- as.character(1:5)
chrvector
#[1] "1" "2" "3" "4" "5"
chrvector <- paste(chrvector, collapse='') # convert to character string
chrvector <- paste0(substring(chrvector,1, nchar(chrvector)-1), as.integer(gsub('.*([0-9])$', '\\1', chrvector))-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "4" "4"
This works even if you have the last digit zero:
chrvector <- c(as.character(1:4), '0') # [1] "1" "2" "3" "4" "0"
chrvector <- paste(chrvector, collapse='')
chrvector <- as.character(as.integer(chrvector)-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "3" "9"
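If adding a package is not an option, one possible base R sketch uses regexpr to locate the last digit and the replacement form of regmatches to overwrite it. It assumes every element contains at least one digit and, like the first variant above, that the last digit is not zero (a zero would become "-1").

```r
chrvector2 <- c("str98", "v198exdf", "v33chr139d")

# Locate the last digit: a digit followed only by non-digits up to the end
m <- regexpr("[0-9](?=[^0-9]*$)", chrvector2, perl = TRUE)

# Replace each matched digit with itself minus one
out <- chrvector2
regmatches(out, m) <- as.character(as.integer(regmatches(out, m)) - 1L)
out
# [1] "str97"      "v197exdf"   "v33chr138d"
```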

extracting text R between special characters

I have multiple strings as shown below:
filename="numbers [www.imagesplitter.net]-0-0.jpeg"
filename1="numbers [www.imagesplitter.net]-0-1.jpeg"
filename2="numbers [www.imagesplitter.net]-19-9.jpeg"
I want the text that appears between the second "-" and the last period.
I would like to get 0,1,9 respectively.
How do I do this? I am not sure how to detect the second "-" and the last period.
Try
sub('^[^-]*-[^-]*-(\\d+)\\..*$', '\\1', files)
#[1] "0" "1" "9"
or
gsub('^[^-]*-[^-]*-|\\..*$', '', files)
#[1] "0" "1" "9"
data
files <- c(filename, filename1, filename2)
I would simply use strsplit to split the strings accordingly here:
sapply(strsplit(files, '[-.]'), '[', 5)
# [1] "0" "1" "9"
Try this:
files=c(filename, filename1, filename2)
sub(".*-(.+)\\.jpeg", "\\1", files)
You could use regmatches function also.
> x <- c("numbers [www.imagesplitter.net]-0-0.jpeg","numbers [www.imagesplitter.net]-0-1.jpeg", "numbers [www.imagesplitter.net]-19-9.jpeg")
> unlist(regmatches(x, gregexpr("^(?:[^-]*-){2}\\K.*(?=\\.)", x, perl=TRUE)))
[1] "0" "1" "9"
You could use the same regex with stringr's str_extract_all function as well (the perl() wrapper used below comes from older stringr versions).
> library(stringr)
> unlist(str_extract_all(x, perl("^(?:[^-]*-){2}\\K.*(?=\\.)")))
[1] "0" "1" "9"
OR
> unlist(str_extract_all(x, perl("(?<=-)[^-.]*(?=\\.)")))
[1] "0" "1" "9"
OR
> unlist(str_extract_all(x, perl(".*-\\K\\d+")))
[1] "0" "1" "9"
you can try
sub("^[^-]+-[^-]+-(.*)\\.[^\\.]*$", "\\1", c(filename, filename1, filename2))
[1] "0" "1" "9"
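A regex-light variant is also possible. The sketch below first drops the extension with tools::file_path_sans_ext and then removes everything up to the last hyphen. It relies on the assumption, true for the three example names, that the second "-" is also the last one.

```r
files <- c("numbers [www.imagesplitter.net]-0-0.jpeg",
           "numbers [www.imagesplitter.net]-0-1.jpeg",
           "numbers [www.imagesplitter.net]-19-9.jpeg")

# Strip the extension (everything from the last period on)
stems <- tools::file_path_sans_ext(files)

# Greedy ".*-" removes everything up to and including the last hyphen
last_part <- sub(".*-", "", stems)
last_part
# [1] "0" "1" "9"
```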

Iterating over characters of string R

Could somebody explain to me why this does not print all the numbers separately in R?
numberstring <- "0123456789"
for (number in numberstring) {
print(number)
}
Aren't strings just arrays of chars? What's the way to do it in R?
In R "0123456789" is a character vector of length 1.
If you want to iterate over the characters, you have to split the string into
a vector of single characters using strsplit.
numberstring <- "0123456789"
numberstring_split <- strsplit(numberstring, "")[[1]]
for (number in numberstring_split) {
print(number)
}
# [1] "0"
# [1] "1"
# [1] "2"
# [1] "3"
# [1] "4"
# [1] "5"
# [1] "6"
# [1] "7"
# [1] "8"
# [1] "9"
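The difference between the number of elements and the number of characters can be checked directly. A small sketch (the name chars is only for illustration):

```r
numberstring <- "0123456789"

length(numberstring)  # number of elements in the vector
# [1] 1
nchar(numberstring)   # number of characters in that one element
# [1] 10

# After splitting, each character is its own element
chars <- strsplit(numberstring, "")[[1]]
length(chars)
# [1] 10
```

A for loop iterates over elements, which is why the original loop runs exactly once and prints the whole string.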
Just for fun, here are a few other ways to split a string at each character.
x <- "0123456789"
substring(x, 1:nchar(x), 1:nchar(x))
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
regmatches(x, gregexpr(".", x))[[1]]
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
scan(text = gsub("(.)", "\\1 ", x), what = character())
# [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Possible with stringr::str_split
library(stringr)
numberstring <- "0123456789"
str_split(numberstring, boundary("character"))
# [[1]]
#  [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
Here's a naive approach for iterating a string using a for loop and substring. This isn't any better than existing answers for the common case, but it might be useful if you want to break out of the loop early instead of always traversing the entire string once up front, as str_split/scan/substring(x, 1:nchar(x), 1:nchar(x))/regmatches requires.
s <- "0123456789"
if (s != "") {
for (i in 1:nchar(s)) {
print(substring(s, i, i))
}
}
The if is needed to avoid looping backwards from 1 to 0, inclusive of both ends.
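One way to drop the guard is seq_len, which returns an empty sequence for a zero-length string, so the loop body never runs in that case. The sketch below also shows the early exit mentioned above; stopping at "5" and collecting into seen are arbitrary choices for illustration.

```r
s <- "0123456789"
seen <- character(0)

# seq_len(0) is integer(0), so an empty string is handled without an if
for (i in seq_len(nchar(s))) {
  ch <- substring(s, i, i)
  if (ch == "5") break   # leave the loop early
  seen <- c(seen, ch)
}
seen
# [1] "0" "1" "2" "3" "4"
```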
Your question is not 100% clear as to the desired outcome (print each character individually from a string, or store each number in a way that the given print loop will result in each number being produced on its own line).
To store numberstring such that it prints using the loop you included:
numberstring<-c(0,1,2,3,4,5,6,7,8,9)
for(number in numberstring){print(number);}
[1] 0
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
