Getting all characters ahead of first appearance of special character in R

I want to get all characters that are ahead of the first "." if there is one. Otherwise, I want to get back the same character ("8" -> "8").
Example:
v<-c("7.7.4","8","12.6","11.5.2.1")
I want to get something like this:
[1] "7" "8" "12" "11"
My idea was to split each element at "." and then only take the first split. I found no solution that worked...

You can use sub
sub("\\..*", "", v)
#[1] "7" "8" "12" "11"
or a few stringi options:
library(stringi)
stri_replace_first_regex(v, "\\..*", "")
#[1] "7" "8" "12" "11"
# extract vs. replace
stri_extract_first_regex(v, "[^\\.]+")
#[1] "7" "8" "12" "11"
If you want to use a splitting approach, these will work:
unlist(strsplit(v, "\\..*"))
#[1] "7" "8" "12" "11"
# stringi option
unlist(stri_split_regex(v, "\\..*", omit_empty=TRUE))
#[1] "7" "8" "12" "11"
unlist(stri_split_fixed(v, ".", n=1, tokens_only=TRUE))
unlist(stri_split_regex(v, "[^\\w]", n=1, tokens_only=TRUE))
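The split-and-take-first idea from the question can also be written in base R alone. A minimal sketch, assuming the delimiter is a literal "." (hence fixed = TRUE); the name first_piece is only illustrative:

```r
v <- c("7.7.4", "8", "12.6", "11.5.2.1")

# Split on a literal "." (fixed = TRUE avoids regex escaping).
parts <- strsplit(v, ".", fixed = TRUE)

# Keep the first piece of each element; vapply guarantees a character vector.
first_piece <- vapply(parts, `[`, character(1), 1)
first_piece
# [1] "7" "8" "12" "11"
```

Elements without a "." come back unchanged because strsplit returns them as a single piece.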
Other sub variations that use a capture group to target the leading characters specifically:
sub("(\\w+).+", "\\1", v) # \w matches [[:alnum:]_] (i.e. alphanumerics and underscores)
sub("([[:alnum:]]+).+", "\\1", v) # exclude underscores
# variations on a theme
sub("(\\w+)\\..*", "\\1", v)
sub("(\\d+)\\..*", "\\1", v) # narrower: \d for digits specifically
sub("(.+?)\\..*", "\\1", v, perl = TRUE) # broader: "." matches any single character; the lazy +? stops at the first "."
# stringi variation just for fun:
stri_extract_first_regex(v, "\\w+")
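Whether the capture group in front of the escaped dot is greedy matters for this task: a greedy (.+) backtracks only as far as the last ".", while a lazy (.+?) stops at the first one. A small check of the difference, assuming PCRE via perl = TRUE for the lazy form:

```r
v <- c("7.7.4", "8", "12.6", "11.5.2.1")

# Greedy group: keeps everything up to the LAST "."
greedy <- sub("(.+)\\..*", "\\1", v)
greedy
# [1] "7.7" "8" "12" "11.5.2"

# Lazy group: keeps everything up to the FIRST "."
lazy <- sub("(.+?)\\..*", "\\1", v, perl = TRUE)
lazy
# [1] "7" "8" "12" "11"
```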

scan() would actually work well for this. Since we want everything before the first ., we can use that as a comment character and scan() will remove everything after and including that character, for each element in v.
scan(text = v, comment.char = ".")
# [1] 7 8 12 11
The above returns a numeric vector, which might be where you are headed. If you need to stick with characters, add the what argument to denote we want a character vector returned.
scan(text = v, comment.char = ".", what = "")
# [1] "7" "8" "12" "11"
Data:
v <- c("7.7.4", "8", "12.6", "11.5.2.1")
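One thing to keep in mind with the scan() approach is that scan() tokenizes on whitespace by default, so the result only lines up with the input when no element contains a space. A small sketch with a made-up vector w to illustrate:

```r
# Hypothetical input where one element contains a space.
w <- c("1 2.3", "4")

# quiet = TRUE suppresses the "Read N items" message.
tokens <- scan(text = w, comment.char = ".", what = "", quiet = TRUE)
tokens
# [1] "1" "2" "4"

# Three tokens came back from two elements.
length(tokens) == length(w)
# [1] FALSE
```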

Related

Remove characters after last occurrence of delimiter - but keep characters when delimiter occurs once at the beginning

Sorry for the awkward title - very open for suggestions how to better phrase it...
This is very similar to Question 1, Question 2 and Question 3. All of those have solutions that remove everything after the last occurrence of the delimiter (most often the underscore), including when it occurs at the beginning of the string.
I need to keep those strings where the delimiter occurs only once, at the beginning of the string.
In the example, for x[3] and x[5], I'd like to keep "-3" and "-5". My first attempt keeps -5, but not -3...
x <- c("1 - 2","2-1", "-3", "4", "-5-6")
gsub("(.*)\\-.*$", "\\1", x)
#> [1] "1 " "2" "" "4" "-5"
gsub("\\-[^\\-].*$", "", x)
#> [1] "1 " "2" "" "4" ""
edit
Ronak's current solution works for the previous example, but fails when there are characters other than digits either before or after the delimiter.
x <- c("1 - 2","2-1", "-3", "4", "-5-6", "-0.6", "20/200", "20/200-3")
stringr::str_match(x, '(-?\\d+)-?')[, 2]
#> [1] "1" "2" "-3" "4" "-5" "-0" "20" "20"
desired output
#> [1] "1" "2" "-3" "4" "-5" "-0.6" "20/200" "20/200"
(For the curious: this is for conversion of notations of visual acuity data, which tells us how well we can discriminate letters on a chart. This data can be sometimes very messy, but follows generally a certain pattern of notation.)
This seems to get what you want:
str_extract(x, "(-)?\\d+[.\\d/]*(?=-?)")
[1] "1" "2" "-3" "4" "-5" "-0.6" "20/200" "20/200"
This matches an optional - followed by a number of one or more digits followed by either . or a number or / zero or more times (*) to the left of ((?= ...)) an optional -
EDIT:
A base R solution is this:
unlist(regmatches(x, gregexpr("^(-)?\\d+[.\\d/]*(?=-?)", x, perl = TRUE)))
[1] "1" "2" "-3" "4" "-5" "-0.6" "20/200" "20/200"
Data:
x <- c("1 - 2","2-1", "-3", "4", "-5-6", "-0.6", "20/200", "20/200-3")
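A caveat with unlist(regmatches(...)) is that elements without a match are silently dropped, so the result can be shorter than the input. Here is a hedged base R sketch that keeps an NA in those positions instead. The extra element "n/a" is made up for illustration, and the lookahead for an optional "-" is left out because it always succeeds:

```r
# Same data as above plus one hypothetical non-matching entry.
x <- c("1 - 2", "2-1", "-3", "4", "-5-6", "-0.6", "20/200", "20/200-3", "n/a")

# regexpr() returns -1 for elements with no match.
m <- regexpr("^-?\\d+[.\\d/]*", x, perl = TRUE)

# Pre-fill with NA, then write matches back into the matching positions.
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
out
# [1] "1" "2" "-3" "4" "-5" "-0.6" "20/200" "20/200" NA
```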
Using str_match :
stringr::str_match(x, '(-?\\d+)-?')[, 2]
#[1] "1" "2" "-3" "4" "-5"
This captures an optional "-" followed by a number which is followed by another optional "-".
Using str_extract :
stringr::str_extract(x, '-?\\d+(?=-?)')
and in base R :
sub("(-?\\d+)-?.*", "\\1", x)

R: Using gsub to replace a digit matched by pattern (n) with (n-1) in character vector

I am trying to match the last digit in a character vector and replace it with the matched digit - 1. I believe gsub is what I need to use, but I cannot figure out what to use as the 'replace' argument. I can match the last number using:
gsub('[0-9]$', ???, chrvector)
But I am not sure how to replace the matched number with itself - 1.
Any help would be much appreciated.
Thank you.
We can do this easily with gsubfn
library(gsubfn)
gsubfn("([0-9]+)", ~as.numeric(x)-1, chrvector)
#[1] "str97" "v197exdf"
Or for the last digit
gsubfn("([0-9])([^0-9]*)$", ~paste0(as.numeric(x)-1, y), chrvector2)
#[1] "str97" "v197exdf" "v33chr138d"
data
chrvector <- c("str98", "v198exdf")
chrvector2 <- c("str98", "v198exdf", "v33chr139d")
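If adding a package is not an option, the same last-digit replacement can be sketched in base R with regexpr() and the replacement form of regmatches(). This assumes every element contains at least one digit and that the last digit is not zero (a zero would turn into -1); the name s is only illustrative:

```r
s <- c("str98", "v198exdf", "v33chr139d")

# Match the last digit: a digit followed only by non-digits up to the end.
m <- regexpr("[0-9](?=[^0-9]*$)", s, perl = TRUE)

# Decrement the matched digit and write it back in place.
last_digit <- as.integer(regmatches(s, m))
regmatches(s, m) <- as.character(last_digit - 1L)
s
# [1] "str97" "v197exdf" "v33chr138d"
```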
Assuming the last digit is not zero,
chrvector <- as.character(1:5)
chrvector
#[1] "1" "2" "3" "4" "5"
chrvector <- paste(chrvector, collapse='') # convert to character string
chrvector <- paste0(substring(chrvector,1, nchar(chrvector)-1), as.integer(gsub('.*([0-9])$', '\\1', chrvector))-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "4" "4"
This works even if you have the last digit zero:
chrvector <- c(as.character(1:4), '0') # [1] "1" "2" "3" "4" "0"
chrvector <- paste(chrvector, collapse='')
chrvector <- as.character(as.integer(chrvector)-1)
unlist(strsplit(chrvector, split=''))
# [1] "1" "2" "3" "3" "9"

extracting text R between special characters

I have multiple strings as shown below:
filename="numbers [www.imagesplitter.net]-0-0.jpeg"
filename1="numbers [www.imagesplitter.net]-0-1.jpeg"
filename2="numbers [www.imagesplitter.net]-19-9.jpeg"
I want the text that appears between the second "-" and the last period.
I would like to get 0,1,9 respectively.
How do I do this? I am not sure how to detect the second "-" and the last period.
Try
sub('^[^-]*-[^-]*-(\\d+)\\..*$', '\\1', files)
#[1] "0" "1" "9"
or
gsub('^[^-]*-[^-]*-|\\..*$', '', files)
#[1] "0" "1" "9"
data
files <- c(filename, filename1, filename2)
I would simply use strsplit to split the strings accordingly here:
sapply(strsplit(files, '[-.]'), '[', 5)
# [1] "0" "1" "9"
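Picking the fifth piece works here because the part inside the square brackets happens to contain exactly two periods. A sketch that does not depend on that, assuming the wanted text is everything after the second "-" once the extension is removed (the names stem and after_second are illustrative):

```r
files <- c("numbers [www.imagesplitter.net]-0-0.jpeg",
           "numbers [www.imagesplitter.net]-0-1.jpeg",
           "numbers [www.imagesplitter.net]-19-9.jpeg")

# Drop everything from the last period onward.
stem <- sub("\\.[^.]*$", "", files)

# Split on a literal "-" and keep whatever follows the second one.
pieces <- strsplit(stem, "-", fixed = TRUE)
after_second <- vapply(pieces,
                       function(p) paste(p[-(1:2)], collapse = "-"),
                       character(1))
after_second
# [1] "0" "1" "9"
```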
Try this:
files=c(filename, filename1, filename2)
sub(".*-(.+)\\.jpeg", "\\1", files)
You could use regmatches function also.
> x <- c("numbers [www.imagesplitter.net]-0-0.jpeg","numbers [www.imagesplitter.net]-0-1.jpeg", "numbers [www.imagesplitter.net]-19-9.jpeg")
> unlist(regmatches(x, gregexpr("^(?:[^-]*-){2}\\K.*(?=\\.)", x, perl=TRUE)))
[1] "0" "1" "9"
You could use the same regex with the str_extract_all function in stringr (the perl() wrapper exists only in older stringr releases).
> library(stringr)
> unlist(str_extract_all(x, perl("^(?:[^-]*-){2}\\K.*(?=\\.)")))
[1] "0" "1" "9"
OR
> unlist(str_extract_all(x, perl("(?<=-)[^-.]*(?=\\.)")))
[1] "0" "1" "9"
OR
> unlist(str_extract_all(x, perl(".*-\\K\\d+")))
[1] "0" "1" "9"
you can try
sub("^[^-]+-[^-]+-(.*)\\.[^\\.]*$", "\\1", c(filename, filename1, filename2))
[1] "0" "1" "9"

How to edit "row.names" after split and cut2 in R?

I want to edit out some information from row.names that are created automatically once split and cut2 were used. See following code:
#Mock data
date_time <- as.factor(c('8/24/07 17:30','8/24/07 18:00','8/24/07 18:30',
'8/24/07 19:00','8/24/07 19:30','8/24/07 20:00',
'8/24/07 20:30','8/24/07 21:00','8/24/07 21:30',
'8/24/07 22:00','8/24/07 22:30','8/24/07 23:00',
'8/24/07 23:30','8/25/07 00:00','8/25/07 00:30'))
U. <- as.numeric(c('0.2355','0.2602','0.2039','0.2571','0.1419','0.0778','0.3557',
'0.3065','0.1559','0.0943','0.1519','0.1498','0.1574','0.1929'
,'0.1407'))
#Mock data frame
test_data <- data.frame(date_time,U.)
#To use cut2
library(Hmisc)
#Splitting the data into categories
sub_data <- split(test_data,cut2(test_data$U.,c(0,0.1,0.2)))
new_data <- do.call("rbind",sub_data)
test_data <- new_data
You will see that "test_data" would have an extra column "row.names" with values such as "[0.000,0.100).6", "[0.000,0.100).10", etc.
How do I remove "[0.000,0.100)" and keep the number after the "." such as 6 and 10 so that I can reference these rows by their original row number later?
Any other better method to do this?
You could also set the names of sub_data to NULL.
names(sub_data) <- NULL
test_data <- do.call('rbind', sub_data)
row.names(test_data)
#[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"
You could use a Regular Expression (Regex), as follows:
rownames(test_data) = gsub(".*[]\\)]\\.", "", rownames(test_data))
It's cryptic if you're not familiar with regular expressions, but it basically says: match any sequence of characters (.*) followed by either a closing square bracket or a closing parenthesis ([]\\)]) and then by a period (\\.), and remove all of it.
The double backslashes are "escapes" indicating that the character following the double-backslash should be interpreted literally, rather than in its special Regex meaning (e.g., . means "match any single character", but \\. means "this is really just a period").
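The pattern can be tried out without Hmisc on a few literal names. The sample names below are made up to mimic the labels described in the question, and the character class is written as [])] since a parenthesis needs no escaping inside brackets:

```r
# Hypothetical row names in the style produced by split() plus rbind().
nm <- c("[0.000,0.100).6", "[0.000,0.100).10", "[0.200,0.356].3")

# Remove everything up to and including the "]." or ")." separator.
stripped <- gsub(".*[])]\\.", "", nm)
stripped
# [1] "6" "10" "3"

# Convert to integers if the values are to be used as row indices.
as.integer(stripped)
# [1]  6 10  3
```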
Just for fun, you can also use regmatches
> Names <- rownames(test_data)
> ( rownames(test_data) <- regmatches(Names, regexpr("[0-9]+$", Names)) )
[1] "6" "10" "5" "9" "11" "12" "13" "14" "15" "1" "2" "3" "4" "7" "8"

Split a character string by the symbol "*"

> test = "23*45"
I'd like to split test by the symbol *
I tried...
> strsplit(test,'*')
and I got...
[[1]]
[1] "2" "3" "*" "4" "5"
What I aim to have is:
[[1]]
[1] "23" "45"
You need to escape the star...
test = "23*45"
strsplit( test , "\\*" )
#[[1]]
#[1] "23" "45"
The split is a regular expression, and * means the preceding item is matched zero or more times. You are splitting on nothing, i.e. splitting into individual characters, as noted in the Details section of strsplit(). \\* means "treat * as a literal *".
Alternatively use the fixed argument...
strsplit( test , "*" , fixed = TRUE )
#[[1]]
#[1] "23" "45"
Which gets R to treat the split pattern as literal and not a regular expression.
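For comparison, here are the two forms above side by side with a third option, a one-character class, which also makes * literal. This is a small sketch in base R; the variable names are illustrative:

```r
test <- "23*45"

# Escaped regex metacharacter.
escaped <- strsplit(test, "\\*")[[1]]

# Literal matching, no regex involved.
literal <- strsplit(test, "*", fixed = TRUE)[[1]]

# Inside a character class, * has no special meaning.
in_class <- strsplit(test, "[*]")[[1]]

escaped
# [1] "23" "45"
identical(escaped, literal) && identical(literal, in_class)
# [1] TRUE
```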
You might want to look at this package:
http://www.rexamine.com/resources/stringi/
To install this package simply run:
install.packages("stringi")
Example:
library(stringi)
stri_split_fixed(test, "*")

Resources