parse string with numbers or hyphens as one string in R

parse string with numbers or hyphens as one string in R - r

I am trying to parse strings with hyphens and/or numbers to call specific rows.
gene_name <- c("EP-CAM")
Genename=paste0("RNA$",gene_name)
Gene=eval(parse(text = paste0(Genename)))
This is the error:
Error in eval(parse(text = paste0(Genename))) :
object 'CAM' not found
I would need to get RNA$EP-CAM parsed for example. Backquotes will not give me the output and only show me the string.
With numbers the same would happen. I guess this is just a problem of the parse command. Is there an alternative to it?
This is in analogy to this problem: Unexpected symbol error in parse(text = str) with hyphen after a digit
Thank you so much for you support.
D

Adding back ticks to the call works for me. The problem, here, is that "EP-CAM" isn't really a valid name.
RNA <- list(`EP-CAM` = 0)
gene <- c("EP-CAM")
geneName <- paste0("RNA$`", gene, "`")
eval(parse(text = geneName))
# [1] 0
In fact, the following renames the column as EP.CAM.
data.frame(`EP-CAM` = 0)
# EP.CAM
# 1 0

Related

R Is there a way to remove special character from beginning of a string

A field from my dataset has some of it's observations to start with a "." e.g ".TN34AB1336"instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
filter(BookingID != "WDSBKTP49392") %>%
filter(BookingID != "WDSBKTP44502") %>%
mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
the last mutate command in my code worked but there are more of such issues in another field e.g "###TN34AB1336" instead of "TN34AB1336".
So I need a way of automating the process such that all observations that doesn't start with a character is trimmed from the left by a single command in R.
I've attached a picture of a filtered field from spreadsheet to make the question clearer.

We can use regular expressions to replace anything up to the first alphanumeric character with ""to remove everything that is not a Number/Character from the beginning of a string:
names <- c("###TN34AB1336",
".TN34AB1336",
",TN654835A34",
":+?%TN735345")
stringr::str_replace(names,"^.+?(?=[[:alnum:]])","") # Matches up to, but not including the first alphanumeric character and replaces with ""
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"
``

You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Where ^ is the start of the string, [^A-Z] matches everything what is not A, B, C, ... , Z and * matches it 0 to n times.

Substitute based on regex [duplicate]

This question already has answers here:
Extracting numbers from vectors of strings
(12 answers)
Closed 1 year ago.
relatively new to R, need help with applying a regex-based substitution.
I have a data frame in one column of which I have a sequence of digits (my values of interest) followed by a string of all sorts of characters.
Example:
4623(randomcharacters)
I need to remove everything after the initial digits to continue working with the values. My idea was to use gsub to remove the non-digit characters by positive lookbehind.
The code I have is:
sub_function <- function() {
gsub("?<=[[:digit:]].", " ", fixed = T)
}
data_frame$`x` <- data_known$`x` %>%
sapply(sub_function)
But I then get the error:
Error in FUN(X[[i]], ...) : unused argument (X[[i]])
Any help would be greatly appreciated!

Here is a base R function.
It uses sub, not gsub, since there will be only one substitution. And there's no need for look behind, the meta-character ^ marks the beginning of the string, followed by an optional minus sign, followed by at least one digit. Everything else is discarded.
sub_function <- function(x){
sub("(^-*[[:digit:]]+).*", "\\1", x)
}
data <- data.frame(x = c("4623(randomcharacters)", "-4623(randomcharacters)"))
sub_function(data$x)
#[1] "4623" "-4623"
Edit
With this simple modification the function returns a numeric vector.
sub_function <- function(x){
y <- sub("(^-*[[:digit:]]+).*", "\\1", x)
as.numeric(y)
}

There are a few ways to accomplish this, but I like using functions from {tidyverse}:
library(tidyverse)
# Create some dummy data
df <- tibble(targetcol = c("4658(randomcharacters)", "5847(randomcharacters)", "4958(randomcharacters)"))
df <- mutate(df, just_digits = str_extract(targetcol, pattern = "^[[:digit:]]+"))
Output (contents of df):
targetcol just_digits
<chr> <chr>
1 4658(randomcharacters) 4658
2 5847(randomcharacters) 5847
3 4958(randomcharacters) 4958

If you always want to extract numbers from the data, you can use parse_number from readr. It will also return data in numeric form by default.
Using #Rory S' data.
sub_function <- function(x) {
readr::parse_number(x)
}
sub_function(df$targetcol)
#[1] 4658 5847 4958

RegEx for a conditional pattern in a string

I need to extract substrings from some strings,for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
I hope that if the string starts with "PREDICTED:", it can be extracted to the subsequent word(maybe "Ceratitis"), and if the string doesn't start with "PREDICTED", it can be extracted to the first word(maybe Shigella);
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
Well,it is a typical conditional regular expression.I tried,but always failed;
I used R which can compatible perl's regular expression.
I know R supports perl's regular expression so I tried to use regexpr and regmatches, two functions to extract the substrings that I want.
The code is :
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,a,perl = TRUE)
result_b = regmatches(b,m_b)
Finaly,the result is :
# result_a = "Shigella"
# result_b = "PREDICTED"
It is not the result I expect,result_a is right,result_b is wrong.
WHY???Its seem that the condition didn't work...
PS:
I tried to read some details of conditional reg-expresstion. this is the web I tried to read : https://www.regular-expressions.info/conditional.html and I try to imitate "pattern" from this web ,and also tried to use "RegexBuddy" software to find the reason.

EDIT:
To use the function below on a vector, one can do:
Vector: myvec<-c("Shigella dysenteriae","PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"

I think the reason it does not work is because (1) checks if a numbered capture group has been set but there is no first capturing group set yet, also not in the positive lookbehind (?<=PREDICTED:)?.
There are a first and second capturing group in the parts that follow. The if clause will check for group 1, it is not set so it will match group 2.
If you would make it the only capturing group (?<=(PREDICTED: )?) and omit the other 2 then the if clause will be true but you will get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
Regex demo | R demo

If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the strings starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")

Avoid that space in column name is replaced with period (".") when using read.csv()

I am using R to do some data pre-processing, and here is the problem that I am faced with: I input the data using read.csv(filename,header=TRUE), and then the space in variable names became ".", for example, a variable named Full Code became Full.Code in the generated dataframe. After the processing, I use write.xlsx(filename) to export the results, while the variable names are changed. How to address this problem?
Besides, in the output .xlsx file, the first column become indices(i.e., 1 to N), which is not what I am expecting.

If your set check.names=FALSE in read.csv when you read the data in then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need quote the column names (back quotes in some cases) or refer to the columns by location rather than name while editing.

To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).

Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)

Just to add to the answers already provided, here is another way of replacing the “.” or any other kind of punctation in column names by using a regex with the stringr package in the way like:
require(“stringr”)
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"

Split string according to occurrence of a character

My task is to split and extract the part from a string until the occurrence of the fourth underscore.
I am working with R right now but I am kind of a beginner with programming and stuff.
The input looks like this:
6_10_36_0_1
6_10_38_16_15
6_100_76_16_18.1
My required result would look like this:
6_10_36_0
6_10_38_16
6_100_76_16
My idea is the following:
substr(data$x, 0, XXX)
While XXX defines the position before the fourth underscore, maybe using grep or strsplit?
Sorry, if I asked a stupid and easy-to-answer question. However I didn't find a fitting to answers already posted.
edit:
> bestand$ID<-sub("(_[0-9.]+$)", "", bestand$x)
Fehler in `$<-.data.frame`(`*tmp*`, "ID", value = character(0)) :
replacement has 0 rows, data has 36513
> gsub("(_[0-9.]+$)", "", "6_100_63_8_2")
[1] "6_100_63_8"
>
apparently the command works, however it doesnot work with the matrix..

You can use regular expression to replace with null, in php we do
$string = '6_10_36_0_1';
$newstring =preg_replace('/(_[0-9.]+$)/', '', $string);
Edit (I dono exactly about r but roughly it would be like this)
sub("(_[0-9.]+$)", "", 'your strings or array of strings')
gsub("(_[0-9.]+$)", "", 'your strings or array of strings')
and the tutorial is here

The stringr package has lots of handy shortcuts for this kind of work:
# input data
data <- read.table(text = "6_10_36_0_1
6_10_38_16_15
6_100_76_16_18.1")
# load library
library(stringr)
# prepare regular expression
regexp <- "([[:digit:]]+_){3}[[:digit:]]+"
# process string
(str_extract(data$V1, regexp))
Which gives the desired result:
[1] "6_10_36_0" "6_10_38_16" "6_100_76_16"
To explain the regexp a little:
[[:digit:]] is any number 0 to 9
+ means the preceding item (in this case, a digit) will be matched one or more times
_ is the underscore, as is
{3} means repeat the previous string three times
This page is also very useful for this kind of string processing: http://en.wikibooks.org/wiki/R_Programming/Text_Processing

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

parse string with numbers or hyphens as one string in R - r

Related

R Is there a way to remove special character from beginning of a string

Substitute based on regex [duplicate]

RegEx for a conditional pattern in a string

Avoid that space in column name is replaced with period (".") when using read.csv()

Split string according to occurrence of a character

Categories

Resources