I have a returned string like this from my code: (<C1>, 4.297, %)
And I am trying to extract only the value 4.297 from this string using gsub:
Fat<-gsub("\\D", "", stringV)
However, this extracts not only 4.297 but also the number '1' in C1.
Is there a way to extract only 4.297 from this string? Any help would be appreciated.
Thanks
How about this?
# Your sample character string
ss <- "(<C1>, 4.297, %)";
gsub(".+,\\s*(\\d+\\.\\d+),.+", "\\1", ss)
#[1] "4.297"
or
gsub(".+,\\s*([0-9\\.]+),.+", "\\1", ss)
Convert to numeric with as.numeric if necessary.
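As a minimal sketch of that conversion step, assuming the input always has the form shown in the question, the captured text can be passed straight to as.numeric. The name num is only illustrative.

```r
# Sample string from the question
ss <- "(<C1>, 4.297, %)"

# Capture the number between the two commas, then convert it to numeric
num <- as.numeric(sub(".+,\\s*([0-9.]+),.+", "\\1", ss))
num
#[1] 4.297
```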
Another option is str_extract, matching one or more digits or dots ([0-9.]+) preceded and followed by a word boundary (\\b):
library(stringr)
as.numeric(str_extract(stringV, "\\b[0-9.]+\\b"))
#[1] 4.297
If there are multiple numbers, use str_extract_all
data
stringV <- "(<C1>, 4.297, %)"
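For the multiple-number case mentioned above, a base R sketch with gregexpr and regmatches behaves much like str_extract_all. The input s is a made-up example (it is not from the question), and perl = TRUE is an assumption made here so that \\b is handled by the PCRE engine.

```r
# Hypothetical input containing two numbers
s <- "(<C1>, 4.297, 12.5, %)"

# gregexpr locates every match; regmatches returns them as a list,
# one element per input string
m <- regmatches(s, gregexpr("\\b[0-9.]+\\b", s, perl = TRUE))[[1]]

# Convert all matches to numeric
nums <- as.numeric(m)
nums
```

The 1 in C1 is skipped because there is no word boundary between C and 1.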
An alternative is to treat the string as a comma-separated record and read it with read.csv:
df <- read.csv(text = stringV, colClasses = c("character", "numeric", "character"), header = FALSE)
df
V1 V2 V3
1 (<C1> 4.297 %)
This method relies on the number being in the second position of the record.
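A short sketch of how the value could then be pulled out of the resulting data frame, with a guard for the positional assumption. The field-count check is an illustrative addition, not part of the original answer.

```r
stringV <- "(<C1>, 4.297, %)"

# Guard: this approach only makes sense if the record has exactly 3 fields
n_fields <- length(strsplit(stringV, ",", fixed = TRUE)[[1]])
stopifnot(n_fields == 3)

df <- read.csv(text = stringV, header = FALSE,
               colClasses = c("character", "numeric", "character"))

# With header = FALSE the columns are named V1, V2, V3
value <- df$V2
value
#[1] 4.297
```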
You can also use as.numeric, which converts non-numeric strings to NA (with a coercion warning):
ss <- as.numeric(unlist(strsplit(stringV, ',')))
ss[!is.na(ss)]
#[1] 4.297
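A variant of the same idea, sketched under the assumption that the coercion warning is expected and can safely be silenced. The names parts, nums and found are illustrative.

```r
stringV <- "(<C1>, 4.297, %)"

# Split on literal commas and trim the surrounding whitespace
parts <- trimws(strsplit(stringV, ",", fixed = TRUE)[[1]])

# Non-numeric pieces become NA; suppressWarnings hides the coercion warning
nums <- suppressWarnings(as.numeric(parts))

# Keep only the pieces that parsed as numbers
found <- nums[!is.na(nums)]
found
#[1] 4.297
```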
I have a character vector that looks like the one below, and I want to delete the elements that don't have any value after '_'.
How do I do that in R?
[985] "Pclo_" "P2yr13_ S329" "Basp1_ S131"
[988] "Stk39_ S405" "Srrm2_ S351" "Grin2b_ S930"
[991] "Matr3_ S604" "Map1b_ S1781" "Crmp1_"
[994] "Elmo1_" "Pcdhgc5_" "Sp4_"
[997] "Pbrm1_" "Pphln1_" "Gnl1_ S33"
[1000] "Kiaa1456_"
We can use grep
grep("_$", v1, invert = TRUE, value = TRUE)
Or endsWith
v1[!endsWith(v1, "_")]
We can use substring to get the last character in the vector and select if it is not "_".
x <- c("Pclo_","P2yr13_ S329","Basp1_ S131")
x[substring(x, nchar(x)) != '_']
#[1] "P2yr13_ S329" "Basp1_ S131"
The last character can also be extracted with a regex, using sub:
x[sub('.*(.)$', '\\1', x) != '_']
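As a quick sanity check, the approaches above can be run side by side on a small sample. The vector v1 below is a hand-picked subset of the values in the question, and the result names are illustrative.

```r
# A few values taken from the question
v1 <- c("Pclo_", "P2yr13_ S329", "Basp1_ S131", "Crmp1_", "Gnl1_ S33")

# Three ways to drop the elements that end in "_"
by_grep      <- grep("_$", v1, invert = TRUE, value = TRUE)
by_endswith  <- v1[!endsWith(v1, "_")]
by_substring <- v1[substring(v1, nchar(v1)) != "_"]

# All three give the same result
identical(by_grep, by_endswith) && identical(by_grep, by_substring)
#[1] TRUE
by_grep
#[1] "P2yr13_ S329" "Basp1_ S131"  "Gnl1_ S33"
```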
I have a table with a string column formatted like this
abcdWorkstart.csv
abcdWorkcomplete.csv
And I would like to extract the last word in that filename. So I think the beginning pattern would be the word "Work" and the ending pattern would be ".csv". I wrote something using grepl, but it is not working.
grepl("Work{*}.csv", data$filename)
Basically I want to extract whatever between Work and .csv
desired outcome:
start
complete
I think you need sub or gsub (substitute/extract) instead of grepl (find if match exists). Note that when not found, it will return the entire string unmodified:
fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
out <- sub(".*Work(.*)\\.csv$", "\\1", fn)
out
# [1] "start" "complete" "abcdNothing.csv"
You can work around this by filtering out the unchanged ones:
out[ out != fn ]
# [1] "start" "complete"
Or marking them invalid with NA (or something else):
out[ out == fn ] <- NA
out
# [1] "start" "complete" NA
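The two steps above (extract, then mark non-matches) could be wrapped in a small helper. This is only a sketch: extract_between is a hypothetical name, and its start and end arguments are assumed to be valid regex fragments.

```r
# Hypothetical helper: returns the text between `start` and `end`,
# or NA where the pattern does not match
extract_between <- function(x, start, end) {
  pattern <- paste0(".*", start, "(.*)", end, "$")
  hit <- grepl(pattern, x)
  out <- rep(NA_character_, length(x))
  out[hit] <- sub(pattern, "\\1", x[hit])
  out
}

fn <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
res <- extract_between(fn, "Work", "\\.csv")
res
# [1] "start"    "complete" NA
```

Testing with grepl first avoids comparing the output with the input, which would misfire if a match happened to leave the string unchanged.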
With str_extract from stringr. This uses positive lookarounds to match any character one or more times (.+) between "Work" and ".csv":
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
library(stringr)
str_extract(x, "(?<=Work).+(?=\\.csv)")
# [1] "start" "complete"
Just as an alternative way, remove everything you don't want.
x <- c("abcdWorkstart.csv", "abcdWorkcomplete.csv")
gsub("^.*Work|\\.csv$", "", x)
#[1] "start" "complete"
Please note:
gsub (rather than sub) is needed here, because the pattern has two alternatives and both must be removed: first ^.*Work, then \\.csv$.
Classes such as [\\s\\S] or [\\d\\D] do not work with sub/gsub here:
https://regex101.com/r/wFgkgG/1
They do work with akrun's approach:
regmatches(v1, regexpr("(?<=Work)[\\s\\S]+(?=[.]csv)", v1, perl = TRUE))
str1<-
'12
.2
12'
gsub("[^.]","m",str1,perl=TRUE)
gsub(".","m",str1,perl=TRUE)
gsub(".","m",str1,perl=FALSE)
With R's default regex engine (perl = FALSE), . also matches \n; with perl = TRUE it does not.
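The difference between the two engines can be seen on a two-line string. This is a minimal sketch; s is an illustrative input.

```r
# Two characters separated by a newline
s <- "a\nb"

# Default engine (TRE): . also matches the newline
tre_result <- gsub(".", "m", s)

# PCRE (perl = TRUE): . skips the newline unless (?s) is set
pcre_result <- gsub(".", "m", s, perl = TRUE)

tre_result == "mmm"
pcre_result == "m\nm"
```

This is why a class such as [\\s\\S] is sometimes used with perl = TRUE when the match has to span line breaks.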
Here is an option using regmatches/regexpr from base R. Regex lookarounds match the characters that are not a . after the string 'Work' and before '.csv', and regmatches extracts them:
regmatches(v1, regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE))
#[1] "start" "complete"
data
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')
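Note that regmatches returns only the elements that matched, so the result above has two elements for a three-element input. A sketch of one way to keep the output aligned with the input, using the -1 that regexpr returns for non-matches (the name out is illustrative):

```r
v1 <- c('abcdWorkstart.csv', 'abcdWorkcomplete.csv', 'abcdNothing.csv')

# regexpr returns -1 for elements without a match
m <- regexpr("(?<=Work)[^.]+(?=[.]csv)", v1, perl = TRUE)

# Start with all NA, then fill in the positions that matched
out <- rep(NA_character_, length(v1))
out[m != -1] <- regmatches(v1, m)
out
#[1] "start"    "complete" NA
```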
I have a column in a data.table full of strings in the format string+integer. e.g.
string1, string2, string3, string4, string5,
When I use sort(), it puts these strings in the wrong order:
string1, string10, string11, string12, string13, ..., string2, string20,
string21, string22, string23, ....
How would I sort these to be in the order
string01, string02, string03, string04, string05, ..., string10, string11,
string12, etc.
One method could be to add a 0 to each integer below 10 (1-9).
I suspect you would extract the string with str_extract(dt$string_column, "[a-z]+") and then add a 0 to each single-digit integer, somehow, with sprintf().
We can remove the characters that are not numbers to do the sorting
dt[order(as.integer(gsub("\\D+", "", col1)))]
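The same idea on a plain character vector, together with the sprintf zero-padding suggested in the question, might look like the sketch below. It assumes every element is a non-digit prefix followed by an integer; vec and the other names are illustrative.

```r
vec <- c("string1", "string10", "string11", "string2", "string20")

# Numeric part and text prefix of each element
num    <- as.integer(gsub("\\D+", "", vec))
prefix <- gsub("\\d+", "", vec)

# Option 1: keep the labels as they are, order by the numeric part
sorted <- vec[order(num)]
sorted
#[1] "string1"  "string2"  "string10" "string11" "string20"

# Option 2: zero-pad to two digits so that a plain sort() works
padded <- sort(sprintf("%s%02d", prefix, num))
padded
#[1] "string01" "string02" "string10" "string11" "string20"
```

The width in %02d would need to grow if the integers can exceed two digits.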
You could go for mixedsort in gtools:
vec <- c("string1", "string10", "string11", "string12", "string13","string2",
"string20", "string21", "string22", "string23")
library(gtools)
mixedsort(vec)
#[1] "string1" "string2" "string10" "string11" "string12" "string13"
# "string20" "string21" "string22" "string23"
You could use str_extract from the stringr package to obtain the digits and order by them, placing the strings without digits at the end:
x = c("string1","string3","stringZ","string2","stringX","string10")
library(stringr)
c(x[grepl("\\d+",x)][order(as.integer(str_extract(x[grepl("\\d+",x)],"\\d+")))],
sort(x[!grepl("\\d+",x)]))
#[1] "string1" "string2" "string3" "string10" "stringX" "stringZ"
Assuming the string is something like below:
library(data.table)
library(stringr)
xstring <- data.table(x = c("string1","string11","string2",'string10',"stringx"))
extracts <- str_extract(xstring$x,"(?<=string)(\\d*)")
y_string <- ifelse(nchar(extracts)==2 | extracts=="",extracts,paste0("0",extracts))
fin_string <- str_replace(xstring$x,"(?<=string)(\\d*)",y_string)
sort(fin_string)
Output:
> sort(fin_string)
[1] "string01" "string02" "string10" "string11"
[5] "stringx"
I have a date value as follows
"'2015-10-24'"
class Character
I am trying to format this value such that it looks like this '10/24/2015'
I know how to use the noquote function to strip the quotes and the gsub function to replace the - with /, but I am not sure how to switch the year, month and day so that it looks like this: '10/24/2015'
Any help is much appreciated.
We can convert to Date class after removing the ' with gsub, and then use format to get the expected output
format(as.Date(gsub("'", '', v1)), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
Or without using the gsub to remove ', we can specify the ' also in the format within as.Date
format(as.Date(v1, "'%Y-%m-%d'"), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
This can be made more compact if we are using library(lubridate)
library(lubridate)
format(ymd(v1), "'%m/%d/%Y'")
#[1] "'10/24/2015'" "'10/25/2015'"
If we don't need the ' in the output, we don't have to specify that in the format,
format(ymd(v1), "%m/%d/%Y")
#[1] "10/24/2015" "10/25/2015"
Or we can do this using only gsub by capturing the characters as groups. In the code below, we capture the first 4 characters (.{4}) as a group by wrapping them in parentheses, match the -, capture the next two characters, match another -, and capture the last two characters. In the replacement, we can reorder the capture groups as required: here the second capture group comes first (\\2), followed by /, then the third (\\3), another /, and finally the first (\\1).
gsub('(.{4})-(.{2})-(.{2})', '\\2/\\3/\\1', v1)
#[1] "'10/24/2015'" "'10/25/2015'"
To avoid the quotes,
gsub('.(.{4})-(.{2})-(.{2}).', '\\2/\\3/\\1', v1)
#[1] "10/24/2015" "10/25/2015"
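Because . matches any character, the patterns above will also rearrange strings that are not dates. A slightly stricter sketch, assuming the input is always a quoted yyyy-mm-dd value, uses \\d and literal quotes; the third element of v1 is a made-up non-date added to show that it is left untouched.

```r
v1 <- c("'2015-10-24'", "'2015-10-25'", "'not a date'")

# Match the quotes literally and require digits in each group
out <- gsub("'(\\d{4})-(\\d{2})-(\\d{2})'", "\\2/\\3/\\1", v1)
out
#[1] "10/24/2015"   "10/25/2015"   "'not a date'"
```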
In addition, there are other ways such as splitting the string
vapply(strsplit(v1, "['-]"), function(x) paste(x[c(3,4,2)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
or extracting the numeric parts with str_extract_all and pasting them together as before.
library(stringr)
vapply(str_extract_all(v1, '\\d+'), function(x)
paste(x[c(2,3,1)], collapse='/'), character(1))
#[1] "10/24/2015" "10/25/2015"
data
v1 <- c("'2015-10-24'", "'2015-10-25'")
You can also use the function strftime to get the result
d <- "'2015-10-24'"
strftime(as.Date(gsub("'", "", d)), "%m/%d/%Y")
# [1] "10/24/2015"
How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
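The effect of the lazy quantifier only shows up when the string contains more than one underscore. A small sketch with a made-up input (s is not from the question); perl = TRUE is an assumption made here so that the quantifier behaviour is that of PCRE.

```r
# Hypothetical string with two underscores
s <- "AT_GAS_1121"

# Lazy: stops at the first underscore
lazy <- sub("^.*?_", "_", s, perl = TRUE)
lazy
#[1] "_GAS_1121"

# Greedy: runs to the last underscore
greedy <- sub("^.*_", "_", s, perl = TRUE)
greedy
#[1] "_1121"
```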
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit returns the pieces of the string, split on the split parameter, as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of its two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
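For an underscore the fixed argument makes no visible difference, but it matters as soon as the separator is a regex metacharacter. A sketch with a hypothetical dot-separated string:

```r
# Hypothetical input using "." as the separator
s <- "TGAS.1121"

# Literal split: the dot is treated as a plain character
fixed_split <- strsplit(s, split = ".", fixed = TRUE)[[1]]
fixed_split
#[1] "TGAS" "1121"

# Regex split: "." matches every character, so only empty strings remain
regex_split <- strsplit(s, split = ".")[[1]]
all(regex_split == "")
#[1] TRUE
```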
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
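A variant of the same approach with vapply, which fixes the return type in advance. The third element below is a made-up string without an underscore, added to show what happens in that case.

```r
s <- c("TGAS_1121", "MGAS_1432", "NOUNDERSCORE")

# vapply guarantees a character vector with one value per input;
# x[2] is NA when there is nothing after the separator
s1 <- vapply(strsplit(s, split = "_", fixed = TRUE),
             function(x) x[2], character(1))
s1
#[1] "1121" "1432" NA
```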
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it needs only a pattern and no replacement.
The only tricky part in your example is that you want to keep the underscore, but it's possible: match lazily up to, but not including, the underscore by using a lookahead, (?=pattern).
See example:
library(stringr) # also provides the %>% pipe
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
strings %>% str_remove(".+?(?=_)")
[1] "_1121" "_1432" "_1121"
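The same lookahead also works in base R, as long as perl = TRUE is set (lookarounds are not supported by the default engine). A minimal sketch; the name res is illustrative.

```r
strings <- c("TGAS_1121", "MGAS_1432", "ATGAS_1121")

# Remove everything up to, but not including, the first underscore
res <- sub(".+?(?=_)", "", strings, perl = TRUE)
res
#[1] "_1121" "_1432" "_1121"
```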
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df <- mutate(df, col1 = as.character(col1))
df2 <- mutate(df, col1 = sapply(strsplit(col1, split = '_', fixed = TRUE), function(x) x[2]))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A
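If dplyr is not available, the same column update can be sketched in base R with sub. The pattern removes everything up to and including the first underscore; stringsAsFactors = FALSE is set explicitly so the sketch behaves the same on older R versions.

```r
df <- data.frame(col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121"),
                 col2 = c("T", "M", "A"),
                 stringsAsFactors = FALSE)

# Drop the prefix and the underscore, keeping only what follows
df$col1 <- sub("^[^_]*_", "", df$col1)
df$col1
#[1] "1121" "1432" "1121"
```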