I'm having the following difficulty in R: a dataframe has a column with some IDs, which are in the character format. What's the most concise way I can delete the 0s in front?
Example:
This is what I have:
ID <- as.character(c("001001","0001002","01003","001004","1005"))
order <- c("a","b","c","d","e")
df <- as.data.frame(cbind(ID, order))
This is what I want:
ID2 <- as.character(c("1001","1002","1003","1004","1005"))
order2 <- c("a","b","c","d","e")
df2 <- as.data.frame(cbind(ID2, order2))
I've tried replacing strings, but that also deletes 0s I want to keep (e.g. ID2[1] becomes "11").
Thanks in advance!
Use trimws from base R (the whitespace argument requires R 3.6.0 or later):
trimws(c("001001","0001002","01003","001004","1005"),which = "left",whitespace = "0")
#> [1] "1001" "1002" "1003" "1004" "1005"
Created on 2020-06-30 by the reprex package (v0.3.0)
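It is easier to do this if we convert to integer or numeric class, as numeric values cannot have a 0 prefix. After the conversion, just wrap with as.character if the column needs to remain character. This assumes every ID consists only of digits and fits in the integer range; otherwise as.integer returns NA with a warning.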
df$ID <- as.character(as.integer(df$ID))
df$ID
#[1] "1001" "1002" "1003" "1004" "1005"
It could also be done in a regex way (unnecessary though)
df$ID <- sub("^0+", "", df$ID)
In the above code, we match one or more 0s (0+) at the start (^) of the string and replace with blank ("")
If the IDs can contain characters other than digits, another option is to capture the digits that follow the leading 0s and replace the whole match with the backreference (\\1) to the captured group. This makes sure that a string such as "0xyz" is left unchanged
df$ID <- sub("^0+(\\d+)$", "\\1", df$ID)
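To see how the two patterns differ on edge cases, here is a small sketch. The vector ids is made up for illustration (the "0xyz" and "000" entries are not in the question), and perl = TRUE is my addition so that the backtracking behaviour of the second pattern is unambiguous.

```r
# Hypothetical IDs, including a non-numeric one and an all-zero one
ids <- c("001001", "0001002", "01003", "1005", "0xyz", "000")

# Plain pattern: removes every leading zero, even if nothing is left
plain <- sub("^0+", "", ids)

# Digits-only pattern: leaves "0xyz" alone and keeps one digit for "000"
guarded <- sub("^0+(\\d+)$", "\\1", ids, perl = TRUE)

data.frame(ids, plain, guarded)
```

The plain pattern turns "000" into an empty string and "0xyz" into "xyz", so the guarded version is probably the safer choice when the column is not known to be purely numeric.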
Related
I have a vector of character ids, as rownames of a data frame in R. The rownames have the following pattern:
head(foo)
[1] "ENSG00000197372 (ZNF675)" "ENSG00000112624 (GLTSCR1L)"
[3] "ENSG00000151320 (AKAP6)" "ENSG00000139910 (NOVA1)"
[5] "ENSG00000137449 (CPEB2)" "ENSG00000004779 (NDUFAB1)"
I would like to subset the above rownames (~700 entries) so as to keep only the gene symbol inside the parentheses (e.g. ZNF675) and drop the rest. Is this possible with a function like gsub?
We can use sub to match the characters up to the opening parenthesis ([^(]+ followed by \\(), capture the characters inside the parentheses that are not a closing parenthesis ([^)]+), and replace the whole match with the backreference (\\1) to the captured group
row.names(foo) <- sub("^[^(]+\\(([^)]+).*", "\\1", row.names(foo))
row.names(foo)
#[1] "ZNF675" "GLTSCR1L" "AKAP6" "NOVA1" "CPEB2" "NDUFAB1"
Or using str_extract from stringr
library(stringr)
str_extract(row.names(foo), "(?<=\\()[^)]+")
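If you would rather not load stringr, a rough base R equivalent of the same lookbehind idea is sketched below. The vector ids is just a two-element stand-in for the rownames.

```r
# Two of the rownames from the question, used as a stand-in
ids <- c("ENSG00000197372 (ZNF675)", "ENSG00000112624 (GLTSCR1L)")

# perl = TRUE is needed for the lookbehind (?<=...)
m <- regexpr("(?<=\\()[^)]+", ids, perl = TRUE)

# regmatches returns the matched substrings
symbols <- regmatches(ids, m)
symbols
```

One difference to keep in mind: regmatches silently drops elements that have no match, whereas str_extract returns NA for them, so the result can be shorter than the input.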
data
foo <- data.frame(col1 = rnorm(6))
row.names(foo) <- c("ENSG00000197372 (ZNF675)",
"ENSG00000112624 (GLTSCR1L)", "ENSG00000151320 (AKAP6)",
"ENSG00000139910 (NOVA1)",
"ENSG00000137449 (CPEB2)", "ENSG00000004779 (NDUFAB1)")
Please, I want to rename the columns of my table by removing the year label. Here are my columns names :
"PROV_201601" "MNT_201602" "PROV_201612" .... and so on !
My objective is to remove the "2016" from the name of the column. I am only familiar with R but not with regular expressions.
Any help is appreciated !
Thank you.
We can use sub to match a _ (captured as a group) followed by four digits (\\d{4}), and replace the match with the backreference to the captured group (\\1), or simply with a literal _
sub("(_)\\d{4}", "\\1", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
If it is specific to 2016 then
sub("2016", "", v1)
#[1] "PROV_01" "MNT_02" "PROV_12"
data
v1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
First, use sub() to replace the first occurrence of "2016" in each string with "". This will eliminate 2016 from the character strings.
col1 <- c("PROV_201601", "MNT_201602", "PROV_201612")
col2 <- sub("2016", "", col1)
Now rename your columns of data frame dat using names():
names(dat) <- col2
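The two steps can also be collapsed into one assignment. This is only a sketch: dat below is a hypothetical data frame standing in for the asker's table, and the pattern removes any four digits after the underscore rather than the literal "2016".

```r
# Hypothetical data frame with the column names from the question
dat <- data.frame(PROV_201601 = 1:2, MNT_201602 = 3:4, PROV_201612 = 5:6)

# Drop the four-digit year that follows the underscore, keeping the underscore
names(dat) <- sub("_\\d{4}", "_", names(dat))
names(dat)
```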
I have a returned string like this from my code: (<C1>, 4.297, %)
And I am trying to extract only the value 4.297 from this string using gsub command:
Fat<-gsub("\\D", "", stringV)
However, this extracts not only 4.297 but also the number '1' in C1.
Is there a way to extract only 4.297 from this string? Any help would be appreciated.
Thanks
How about this?
# Your sample character string
ss <- "(<C1>, 4.297, %)";
gsub(".+,\\s*(\\d+\\.\\d+),.+", "\\1", ss)
#[1] "4.297"
or
gsub(".+,\\s*([0-9\\.]+),.+", "\\1", ss)
Convert to numeric with as.numeric if necessary.
Another option is str_extract, matching one or more digits or dots ([0-9.]+) with a word boundary (\\b) on each side. The 1 in C1 is skipped because there is no word boundary between C and 1
library(stringr)
as.numeric(str_extract(stringV, "\\b[0-9.]+\\b"))
#[1] 4.297
If there are multiple numbers, use str_extract_all
data
stringV <- "(<C1>, 4.297, %)"
An alternative is to treat your string as comma-separated values and use read.csv
df <- read.csv(text = stringV, colClasses = c("character", "numeric", "character"), header = FALSE)
V1 V2 V3
1 (<C1> 4.297 %)
This method relies on the number being in the second position of the string.
You can also use as.numeric, which converts non-numeric strings to NA (with a warning), and then drop the NAs.
ss <- as.numeric(unlist(strsplit(stringV, ',')))
ss[!is.na(ss)]
#[1] 4.297
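For completeness, here is a base R sketch that does not depend on the position of the number. It assumes the value always contains a decimal point, which is what lets the pattern skip the bare 1 in C1.

```r
stringV <- "(<C1>, 4.297, %)"

# Digits, a literal dot, then digits: the "1" in "C1" has no dot, so it is skipped
m <- regexpr("[0-9]+\\.[0-9]+", stringV)

# Pull out the matched text and convert it to a number
fat <- as.numeric(regmatches(stringV, m))
fat
```

If the value could also be a whole number, the pattern would need to change, so treat this as a starting point rather than a general solution.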
I have a column in a data.table full of strings in the format string+integer. e.g.
string1, string2, string3, string4, string5,
When I use sort(), the strings come out in the wrong order.
string1, string10, string11, string12, string13, ..., string2, string20,
string21, string22, string23, ....
How would I sort these to be in the order
string01, string02, string03, string04, string05, ..., string10, string11,
string12, etc.
One method could be to add a 0 to each integer <10, 1-9?
I suspect you would extract the string with str_extract(dt$string_column, "[a-z]+") and then add a 0 to each single-digit integer...somehow with sprintf()
We can remove the characters that are not numbers to do the sorting
dt[order(as.integer(gsub("\\D+", "", col1)))]
You could go for mixedsort in gtools:
vec <- c("string1", "string10", "string11", "string12", "string13","string2",
"string20", "string21", "string22", "string23")
library(gtools)
mixedsort(vec)
#[1] "string1" "string2" "string10" "string11" "string12" "string13"
# "string20" "string21" "string22" "string23"
You could use str_extract from the stringr package to extract the digits and order by them, putting the entries without digits at the end
x = c("string1","string3","stringZ","string2","stringX","string10")
library(stringr)
c(x[grepl("\\d+",x)][order(as.integer(str_extract(x[grepl("\\d+",x)],"\\d+")))],
sort(x[!grepl("\\d+",x)]))
#[1] "string1" "string2" "string3" "string10" "stringX" "stringZ"
Assuming the string is something like below:
library(data.table)
library(stringr)
xstring <- data.table(x = c("string1","string11","string2",'string10',"stringx"))
extracts <- str_extract(xstring$x,"(?<=string)(\\d*)")
y_string <- ifelse(nchar(extracts)==2 | extracts=="",extracts,paste0("0",extracts))
fin_string <- str_replace(xstring$x,"(?<=string)(\\d*)",y_string)
sort(fin_string)
Output:
> sort(fin_string)
[1] "string01" "string02" "string10" "string11"
[5] "stringx"
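The sprintf idea from the question can also be done in base R alone. The sketch below assumes every entry ends in digits and that the numbers are below 100 (hence the fixed width of 2); vec is a made-up example vector.

```r
# Hypothetical input: a text prefix followed by a number below 100
vec <- c("string1", "string10", "string11", "string2", "string20")

prefix <- sub("[0-9]+$", "", vec)             # text part, e.g. "string"
num <- as.integer(sub("^[^0-9]*", "", vec))   # numeric part, e.g. 1, 10, 11

# Zero-pad the number to two digits and glue it back onto the prefix
padded <- sprintf("%s%02d", prefix, num)
sort(padded)
```

Entries without a trailing number (such as "stringx") would produce NA here, so they would need to be handled separately, as in the answer above.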
How do I remove part of a string? For example in ATGAS_1121 I want to remove everything before _.
Use regular expressions. In this case, you can use gsub:
gsub("^.*?_","_","ATGAS_1121")
[1] "_1121"
This regular expression matches the beginning of the string (^), any character (.) repeated zero or more times (*), and an underscore (_). The ? makes the match "lazy" so that it only matches as far as the first underscore. That match is replaced with just an underscore. See ?regex for more details and references.
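The lazy quantifier only makes a difference when the string has more than one underscore. A quick sketch with a made-up input (x is not from the question; perl = TRUE is my choice here to make the lazy matching explicit):

```r
x <- "AT_GAS_1121"   # hypothetical input with two underscores

# Lazy: stops at the first underscore
lazy <- sub("^.*?_", "_", x, perl = TRUE)

# Greedy: runs on to the last underscore
greedy <- sub("^.*_", "_", x)

c(lazy = lazy, greedy = greedy)
```

Here lazy is "_GAS_1121" and greedy is "_1121". With a single underscore, as in ATGAS_1121, both give the same result.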
You can use a built-in for this, strsplit:
> s = "TGAS_1121"
> s1 = unlist(strsplit(s, split='_', fixed=TRUE))[2]
> s1
[1] "1121"
strsplit splits the string on the split parameter and returns the pieces as a list. That's probably not what you want, so wrap the call in unlist, then index the resulting vector so that only the second of the two elements is returned.
Finally, the fixed parameter should be set to TRUE to indicate that the split parameter is not a regular expression, but a literal matching character.
If you're a Tidyverse kind of person, here's the stringr solution:
R> library(stringr)
R> strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
R> strings %>% str_replace(".*_", "_")
[1] "_1121" "_1432" "_1121"
# Or:
R> strings %>% str_replace("^[A-Z]*", "")
[1] "_1121" "_1432" "_1121"
Here's the strsplit solution if s is a vector:
> s <- c("TGAS_1121", "MGAS_1432")
> s1 <- sapply(strsplit(s, split='_', fixed=TRUE), function(x) (x[2]))
> s1
[1] "1121" "1432"
Probably the most intuitive solution is the stringr function str_remove, which is even easier than str_replace because it takes only the pattern and no replacement argument.
The only tricky part in your example is that you want to keep the underscore, but it's possible: use a lookahead, (?=pattern), so that the match stops just before the specified pattern without consuming it.
See example:
strings = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
stringr::str_remove(strings, ".+?(?=_)")
[1] "_1121" "_1432" "_1121"
Here is the strsplit solution for a data frame, using the dplyr package:
library(dplyr)
col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121")
col2 = c("T", "M", "A")
df = data.frame(col1, col2)
df
col1 col2
1 TGAS_1121 T
2 MGAS_1432 M
3 ATGAS_1121 A
df<-mutate(df,col1=as.character(col1))
df2<-mutate(df,col1=sapply(strsplit(df$col1, split='_', fixed=TRUE),function(x) (x[2])))
df2
col1 col2
1 1121 T
2 1432 M
3 1121 A
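If dplyr is not needed for anything else, the same column update can be sketched in base R with sub. The pattern below removes everything up to and including the first underscore, and stringsAsFactors = FALSE is set so that col1 is character on older R versions too.

```r
df <- data.frame(col1 = c("TGAS_1121", "MGAS_1432", "ATGAS_1121"),
                 col2 = c("T", "M", "A"),
                 stringsAsFactors = FALSE)

# Remove everything up to and including the first underscore
df$col1 <- sub("^[^_]*_", "", df$col1)
df
```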