R: Removing substring by occurrence of character


I have a vector, species_name, in dataframe genexp_2016 which contains the common and scientific names, as well as the location of several different species. For example, species_name strings may be written as
head(genexp_2016)
rank species_name status
1 1396 Addax (Addax nasomaculatus) - Wherever found E
2 1313 Babirusa (Babyrousa babyrussa) - Wherever found E
3 1396 Baboon, gelada (Theropithecus gelada) - Wherever found T
4 229 Bat, Florida bonneted (Eumops floridanus) - Wherever found E
5 109 Bat, gray (Myotis grisescens) - Wherever found E
What I'm attempting to do is remove the end of each string in 'species_name' so that I am left with only the common name and the scientific name, without the location ('Wherever found').
I have thought about telling R to delete everything after the first occurrence of the - character, but this is an imperfect method since some species in the dataframe have a hyphen in their name, such as the black-footed ferret.
The most effective solution I've thought of is this: tell R to read each string starting from the end instead of the beginning and, upon finding the first occurrence of -, delete everything between that character's position and the end of the string. It seems like something I should be able to do in R, but my skills are not yet advanced enough to know how. Does anyone have any ideas for how I might code this, or perhaps a more efficient way to remove the location description from each string?
Thanks, and I appreciate any help you can offer.

To keep everything up to the last - (the keyword here is greedy), you could do:
x <- 'Addax (Addax nasomaculatus) - Wherever found'
sub('(.+)-.+', '\\1', x)
# [1] "Addax (Addax nasomaculatus) "
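The output above still ends in a space, and the question also worried about hyphenated names. A minimal sketch of both points, assuming a hypothetical vector modelled on the question's species_name column (the ferret entry is made up for illustration) and using trimws() to drop the trailing space:

```r
# Hypothetical sample values modelled on the question's species_name column.
species_name <- c(
  "Addax (Addax nasomaculatus) - Wherever found",
  "Ferret, black-footed (Mustela nigripes) - Wherever found"
)

# The greedy (.+) runs as far right as it can, so the "-" in the pattern
# matches the LAST hyphen; the hyphen in "black-footed" is left alone.
kept <- sub("(.+)-.+", "\\1", species_name)

# Remove the space that preceded the separator.
cleaned <- trimws(kept)
print(cleaned)
```

This relies on the location text itself containing no hyphen; if it could, a more specific separator pattern would be needed.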

Related

How to grep a string ending in a specific punctuation mark

I'm trying to grep strings that end in a dash in R, but having trouble. I've worked out how to grep strings ending in any punctuation mark, maybe not the best way but this worked:
grep("\\#[[:print:]]+[[:punct:]]$",c)
Can't for the life of me work out how to grep strings that end specifically in a dash
for example these strings:
- # (piano) - not this.
- # hello hello - not this either.
I'd like to sub all the stuff between the dashes (and including the dashes) with nothing "" and leave the text to the right of the second dash, which end in full stops. So, I would like the output to be (for example, based on the example above):
not this.
and
not this either.
Any help would be appreciated.
Thank you!
Maro
UPDATE:
Hi again everyone,
I'm just updating my original question again:
So what I had in my original data was these three examples (I tried to simplify in my original post above, but I think it might be helpful for you all to see what I was actually dealing with):
1. - # (Piano) - no, and neither can you.
2. - # (Piano) - uh-huh.
3. - # Many dreams ago - Try it again.
(numbers 1-3 are for the purposes of making things clearer, they are not part of the strings)
I was trying to find a way to delete all the stuff between and including the two dashes, and leave all the stuff after the second dash, so I wanted my output to be:
no, and neither can you.
uh-huh.
Try it again.
I ended up using this:
gsub("-[[:blank:]]#[[:blank:]]\\(?[A-Z][a-z]*\\)?[[:blank:]]-", "", c)
which helped me get 1. and 2. in one go. But this didn't help with 3 - I thought by including the question mark after the open and close bracket (which I thought meant 'optional') this would help me get all three targets, but for some reason it didn't. To then get 3, I just ended up targeting that specific string i.e. - # Many dreams ago -, by using:
gsub(("- # Many dreams ago -"), "", c)
I'm new to this, so not the best solution I'm sure.
In my original post (this has been edited a couple of times) I included square brackets around the three strings, which explains some of the answers I originally received from members of the community. Apologies for the confusion!
Thanks everyone - if there's anything that doesn't make sense, please let me know, and I'll try to clarify.
Maro

If you want to stay in between the square brackets you can start the match at #, then use a negated character class [^][]* matching optional chars other than an opening or closing square bracket, and match the last -
Replace the match with an empty string.
c <- "[- # (piano) - not this.]"
sub("#[^][]*-", "", c)
Output
[1] "[-  not this.]"
For a more specific match of that string format, you can match the whole line including the square brackets, the # and the string ending on a full stop, and capture what you want to keep.
In the replacement use the capture group value.
c <- c("[- # (piano) - not this.]", "[- # hello hello - not this either.]")
sub("\\[[^][#]*#[^][]*-\\s*([^][]*\\.)]", "\\1", c)
Output
[1] "not this." "not this either."
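For the unbracketed strings in the question's update, one pattern can cover all three cases. The asker's attempt probably missed the third string because [A-Z][a-z]* matches a single capitalised word, while "Many dreams ago" contains spaces. The sketch below is one possible approach, not the method from the answer above, and it assumes the text between the two dashes never contains a hyphen. The vector is named x here (an arbitrary choice) to avoid masking R's c() function.

```r
# The three strings from the question's update.
x <- c(
  "- # (Piano) - no, and neither can you.",
  "- # (Piano) - uh-huh.",
  "- # Many dreams ago - Try it again."
)

# ^-      the leading dash
# \\s*#   optional whitespace, then the "#"
# [^-]*   anything that is not a hyphen, so the match stops at the second dash
# -\\s*   the second dash and the whitespace after it
result <- sub("^-\\s*#[^-]*-\\s*", "", x)
print(result)
```

Because sub() replaces only the first match and the pattern is anchored at the start, the hyphen inside "uh-huh." is not touched.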

Is the operator OR sensitive to the position of the argument that follows it

I want to write fairly concise, readable R code.
I try to break lines to avoid very long lines of code. I have noticed that I get different results depending on whether or not I break the line after the OR operator in grepl, and that annoys me.
For example, with this code I get:
sigaps$Oncologie<-ifelse(
grepl("Radioth[ée]rapie|Chimioth[ée]rapie|Radiochimioth[ée]rapie|Cancer|Tumeur|Tumoral",
sigaps$Titre.de.l.étude,
ignore.case=TRUE),1,0)
table(sigaps$Oncologie)
0 1
377 157
But when I move Tumoral to the next line, I get a different result. I don't understand what is going wrong:
sigaps$Oncologie<-ifelse(
grepl("Radioth[ée]rapie|Chimioth[ée]rapie|Radiochimioth[ée]rapie|Cancer|Tumeur|
Tumoral",
sigaps$Titre.de.l.étude,
ignore.case=TRUE),1,0)
table(sigaps$Oncologie)
0 1
380 154
I have always done this. But if I can't get the same results from two ways of coding that I consider identical, have I been making a coding mistake for years?

The difference occurs because you are breaking the line inside the string: the newline and the indentation on the next line become part of the pattern. Fix it by a) not doing that or b) building the pattern with paste(..., sep="|"):
grepl(paste("Radioth[ée]rapie", "Chimioth[ée]rapie",
"Radiochimioth[ée]rapie", "Cancer",
"Tumeur", "Tumoral", sep="|"),
sigaps$Titre.de.l.étude, ignore.case=TRUE)
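The effect can be reproduced on a small scale. This is a sketch with two made-up titles (not the asker's sigaps data), where the broken pattern is written with an explicit "\n" and four spaces to stand in for the line break and indentation:

```r
# Hypothetical titles; the real data frame from the question is not available.
titles <- c("Effet tumoral", "Tumeur du foie")

# Pattern written on one line.
pattern_one_line <- "Tumeur|Tumoral"

# Same pattern with a line break and indentation inside the string,
# spelled out explicitly: the last alternative is now "\n    Tumoral".
pattern_broken <- "Tumeur|\n    Tumoral"

hits_one_line <- grepl(pattern_one_line, titles, ignore.case = TRUE)
hits_broken   <- grepl(pattern_broken, titles, ignore.case = TRUE)

# Building the pattern from pieces keeps the source lines short
# without putting whitespace into the pattern.
pattern_pasted <- paste(c("Tumeur", "Tumoral"), collapse = "|")
hits_pasted <- grepl(pattern_pasted, titles, ignore.case = TRUE)

print(rbind(hits_one_line, hits_broken, hits_pasted))
```

The first title matches only when "Tumoral" is a clean alternative, which mirrors the drop in matches seen in the question.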

How can i remove the first x number of characters of a column name from 200+ columns with each column being not the same number of characters

How can I remove a prefix from 200+ column names? For example, from "Q1: GOING OUT?" and "Q5: STATE, PROVINCE, COUNTY, ETC" I just want to remove the "Q1: " and the "Q5: ". I have looked around but haven't been able to find a way that doesn't require renaming them manually. Are there any functions for this, or ways to do it through the tidyverse? I have only been using R for 2 months.
I don't really have anything to show. I have considered using for loops and possibly using gsub or case_when, but don't really understand how to properly use them.
#probably not correctly written but tried to do it anyways
for ( x in x(0:length) and _:(length(CandyData)-1){
front -> substring(0:3)
back -> substring(4:length(CandyData))
print <- back
}
I don't really have any errors because I haven't been able to make it work properly.

Try this:
# Example names. If you already have a data frame, use col_all <- names(df) instead.
col_all <- c("Q1:GOING OUT?", "Q2:STATE", "Q100:PROVINCE", "Q200:COUNTRY", "Q299:ID")
for (col in seq_along(col_all)) {  # iterate over the column names
  colname <- col_all[col]  # the current column name
  # Position of the first ":" in the name (-1 if there is none).
  index1 <- regexpr(":", colname, fixed = TRUE)
  if (index1 > 0) {
    # Keep only the part after the ":" and store it back in col_all.
    col_all[col] <- substr(colname, index1 + 1, nchar(colname))
  }
}
names(df) <- col_all  # assign the cleaned names to the data frame

The H 1 answer is still the best: sub() or gsub() functions will do the work. And do not fear the regex, it is a powerful tool in data management.
Here is the gsub version:
names(df) <- gsub("^.*:","",names(df))
It works this way: for each name, the pattern matches everything from the start up to and including the last ":" (.* is greedy), and the match is replaced with an empty string.
Remember to upvote H 1's solution in the comments.
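If some names might contain a second colon, or a space follows the prefix as in the question's own examples, a more tightly anchored pattern may be safer. A small sketch, assuming every prefix has the form "Q", digits, ":" (the third name is a hypothetical example added to show the difference):

```r
# The first two names come from the question; the third is made up.
col_names <- c("Q1: GOING OUT?", "Q5: STATE, PROVINCE, COUNTY, ETC", "Q7: RATIO 1:2")

# Remove only a leading "Q<digits>:" and any whitespace after it.
anchored <- sub("^Q[0-9]+:\\s*", "", col_names)

# For comparison: the greedy pattern removes up to the LAST colon
# and leaves the space that followed it.
greedy <- gsub("^.*:", "", col_names)

print(anchored)
print(greedy)
```

With a real data frame the same call would be applied as names(df) <- sub("^Q[0-9]+:\\s*", "", names(df)).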

Character values stored in DATAFRAME with Double Quotes while reading into R

I have a csv file with almost 4 million records and 30+ columns.
The columns are of varied types, including numeric, alphanumeric, date, character, etc.
Attempt 1:
When I first read the file in R using the read.csv function, only 2 million of the records were read.
This may have happened because of some special characters in the data.
Attempt 2:
I provided the argument quote = "" to read.csv and all the records were read successfully.
However, this brings up 2 issues:
a. all the column names were prefixed with an 'x.' modifier:
e.g.: x.date, x.name
b. all the character columns were loaded into the dataframe enclosed in double quotes ""
Can someone please advise me how to resolve these 2 issues and get the data loaded into R successfully?
I work for a financial institution and the data is highly sensitive, so I cannot paste a screenshot here.
I also tried to recreate the scenario at home, but my efforts were of little or no avail.
The screenshot below is the closest I have come to the exact scenario:
[Dataframe screenshot: not an exact copy]
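One way to approach the two symptoms is to strip the literal quote characters after reading. The sketch below uses a tiny made-up CSV (the column names and values are assumptions, not the asker's data) and passes check.names = FALSE so that read.csv does not rewrite the quoted header names:

```r
# Hypothetical stand-in for the real file: quoted header and quoted text fields.
csv_text <- paste(
  '"date","name","amount"',
  '"2020-01-01","Ann",10.5',
  '"2020-01-02","Bob",20',
  sep = "\n"
)

# quote = "" disables quote handling, so the quotes arrive as ordinary characters.
# check.names = FALSE keeps the header text as-is instead of making it syntactic.
raw <- read.csv(text = csv_text, quote = "", check.names = FALSE,
                stringsAsFactors = FALSE)

# Strip the literal double quotes from the column names ...
names(raw) <- gsub('"', "", names(raw), fixed = TRUE)

# ... and from every character column, leaving other column types untouched.
raw[] <- lapply(raw, function(col) {
  if (is.character(col)) gsub('"', "", col, fixed = TRUE) else col
})

print(raw)
```

Note that with quote = "" a comma inside a quoted field is treated as a separator, so this only works if no field contains the delimiter.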

Does R have a wildcard expression (such as an asterisk (*))?

I hope I am not missing something obvious but here is my question:
Given data like this
Data1=
Name Number
AndyBullxxx 12
AlexPullxcx 14
AndyPamxvb 56
RickRalldfg 34
AndyPantert 45
SamSaltedf 45
I would like to be able to pull out all of the rows starting with "Andy"
subset(Data1,Name=="Andy*")
AndyBullxxx 12
AndyPamxvb 56
AndyPantert 45
So basically a wild card symbol that will let me subset all rows that begin with a certain character designation.

Try,
df[grep("^Andy",rownames(df)),]
The first argument of grep is a pattern, a regular expression. grep returns the indices of the elements that match the pattern. The ^ anchors the match at the beginning of the string.
Let's make this reproducible:
df <- data.frame(x=c(12,14,56,34,45,45))
rownames(df) <- c("AndyBullxxx", "AlexPullxcx","AndyPamxvb", "RickRalldfg","AndyPantert","SamSaltedf")
## see what grep does
grep("^Andy",rownames(df))

If you are not comfortable with regex there is a function in the utils package, which can convert wildcard based expressions to regex. So you can do
df[grepl(glob2rx('Andy*'), rownames(df)),]

I think you want to go for a regular expression:
subset(Data1, grepl("^\\bAndy", Name))
or
Data1[grepl("^\\bAndy", Data1$Name),]
In the regular expression, "^" anchors the match at the start of the string and \b marks a word boundary (the backslash must be doubled inside an R string, hence "\\b"). Regular expressions are a powerful text processing tool that requires some study. There are a lot of tutorials and websites online. One of them that I use is:
http://www.regular-expressions.info/
I can't use R right now (my wife's laptop ;)), so these pieces of code are untested. Next time, please provide an example dataset; that makes it much easier to provide a working example.
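The approaches from the answers can be compared on the question's own data. A minimal sketch that rebuilds Data1 with Name as an ordinary column (as in the question, rather than as row names) and also shows startsWith(), a base R function that avoids regular expressions altogether:

```r
# The data from the question, rebuilt as a data frame.
Data1 <- data.frame(
  Name = c("AndyBullxxx", "AlexPullxcx", "AndyPamxvb",
           "RickRalldfg", "AndyPantert", "SamSaltedf"),
  Number = c(12, 14, 56, 34, 45, 45),
  stringsAsFactors = FALSE
)

# Regular expression: ^ anchors the match at the start of the string.
by_regex <- subset(Data1, grepl("^Andy", Name))

# Wildcard converted to a regular expression by glob2rx().
by_glob <- Data1[grepl(utils::glob2rx("Andy*"), Data1$Name), ]

# Plain prefix test, no pattern syntax involved.
by_prefix <- Data1[startsWith(Data1$Name, "Andy"), ]

print(by_regex)
```

All three select the same rows here; startsWith() is the simplest when the condition is a fixed prefix.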
