Trying to filter a data frame based on consecutive words in description variables (R)

I'm asking around for help troubleshooting a project I'm doing in R. I have a data frame with every pitch from the previous season in professional baseball, and I'm trying to assign each play to the fielder who got to the ball first. To do this, I'm trying to manipulate the description variable (des), which looks something like this:
"Gary Sanchez grounds out, third baseman Eugenio Suarez to first baseman Ty France."
In this instance, my goal would be to assign this play to Eugenio Suarez, as he was the one the ball was hit to. To give this a test run, I decided to use grepl (I've used it to separate certain plays in the past). To make sure it wouldn't include plays where the third baseman was involved but wasn't the original fielder, I tried something like this:
DF %>% filter(grepl(", third baseman", des))
Commas and "to" are integral to determining the original fielder from these description variables, but this filter simply kept every description that contained a comma and involved a third baseman.
Is there some function or way I could filter for consecutive characters so that the previous filter would only match plays like the original? Let me know, thanks!

It's unclear exactly what you want to do, but it seems like you need a lookaround in the regex.
If you want to filter for all the plays with third baseman Eugenio Suarez:
grepl("(?<=, third baseman )Eugenio Suarez", desc, perl = TRUE)
This finds all the descriptions where the name "Eugenio Suarez" is preceded by ", third baseman ".
If you want to just extract the name, you can use a function like stringr::str_extract:
stringr::str_extract(desc, "(?<=, third baseman )(Eugenio Suarez)")
You can generalize this, if the name is always two words followed by " to", by using a lookahead:
stringr::str_extract(desc, "[[:alpha:]]+ [[:alpha:]]+(?= to)")

Related

How to edit each text value of a column of a data frame in r?

I am working with a large data frame in R which includes a column containing the text content of a number of tweets. Each value starts with "RT #(account which is retweeted): ", for example "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...". I need to change each value in this column to only include the account name ("#RosannaXia"). How would I be able to do this? I understand that I may be able to do this with gsub and regular expressions (a lookbehind and a lookahead), but when I tried the following lookahead code it did not do anything (or show an error):
Unnested_rts$rt_user <- gsub("[a-z](?=:)", "", Unnested_rts$rt_user, perl=TRUE)
Is there a better way to do this? I am not sure what went wrong, but I am still a very inexperienced coder. Any help would be greatly appreciated!
You can extract everything from the # up to the colon (:).
x <- "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet..."
sub('RT (#.*?):.*', '\\1', x)
#[1] "#RosannaXia"
For your case, it would be:
Unnested_rts$rt_user <- sub('RT (#.*?):.*', '\\1', Unnested_rts$rt_user)
A few things:
- according to Twitter, a handle can include alphanumerics ([A-Za-z0-9]) and underscores, so these need to be in your pattern;
- your pattern needs to capture the handle and preserve it while discarding everything else; since we don't always know how to match everything else, we'll stick with matching what we know and use .* on either side.
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", "RT #RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...", perl=TRUE)
# [1] "#RosannaXia"
Since you want this for the entire column, you can probably just do
gsub(".*(#[A-Za-z0-9_]+)(?=:).*", "\\1", Unnested_rts$rt_user, perl=TRUE)
The only catch is that if the match fails (the pattern is not found), the entire string is returned, which may not be what you want. If you want to extract only what was found, there are several techniques using gregexpr and regmatches, or perhaps stringr::str_extract.
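If NA on a failed match is the behavior you want, here's a minimal sketch with stringr::str_extract (same character class as above, no capture group needed):
library(stringr)
x <- c("RT #RosannaXia: Here’s some deep ocean wonder...", "no retweet marker here")
# str_extract() returns NA where the pattern is absent, instead of the
# whole string, so failed matches are easy to spot.
str_extract(x, "#[A-Za-z0-9_]+(?=:)")
# [1] "#RosannaXia" NA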

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Lookup table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't really found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row with no consistency, even within variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I'm also open to other, better methods if anyone has experience or ideas on how to approach this...
OK, so I figured out how to do this on my own... I created many different columns, each with a keyword I was looking to extract as its header.
[Screenshot: spreadsheet solution for structured data extraction]
Then I used this formula to extract the keywords into the correct row beneath each column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
The SEARCH function returns a number for the position of a search string and otherwise produces an error. I use ISERROR to detect that error condition, and the IF statement so that on an error it leaves the cell blank, and otherwise it takes the value of the header. I had over 100 columns of specific information to extract, plus one final column where I join all the previous cells in the row together for the final list. Worked like a charm. I recommend this approach to anyone who has a similar task.
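Since the question is open to other methods, here's a minimal R sketch of the same idea, assuming a hypothetical products.csv with the raw text in a description column and the lookup values from the question:
library(stringr)
keywords <- c("Zircon", "Diamond", "Pearl", "Ruby")
products <- read.csv("products.csv", stringsAsFactors = FALSE)
# Find every keyword occurring in each description and join multiple
# hits with ";"; rows with no hit get an empty string.
products$attributes <- sapply(
  str_extract_all(products$description, paste0(keywords, collapse = "|")),
  paste, collapse = ";"
)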

Backreferencing a repeated regex pattern using str_match in R

I am not too great at regexes and have been stuck on this problem for a while now. I have biological taxonomic information stored as strings in a "taxonomyString" column of a data frame. The strings look like this:
“domain;kingdom;phylum;class;order;genus;species”
My goal is to split each string into taxonomic-level columns (e.g., "domain" into a "Domain" column). I have accomplished this using the following (very long) code:
taxa_data_six <- taxa_data %>%
  filter(str_count(taxonomyString, pattern = ";") == 6) %>%
  tidyr::extract(taxonomyString,
                 into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"),
                 regex = "([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+)")
I had to include a lot of different possible characters in between the semicolons because some of the taxa had [brackets] around the name, etc.
Besides the code being cumbersome, after running it I have found some errors in the taxonomyString values, which I would like to clean up.
Sometimes, a class name is broken up by semicolons, e.g., what should be incertae sedis; is actually incertae;sedis;. These kinds of errors are throwing off my code, which assumes that the first semicolon always denotes the domain, the second, the kingdom, and so on.
In any case, my question is simple, but has been giving me a lot of grief. I would like to be able to group each taxonomyString by semicolons, e.g., group 1 is domain;, group 2 is kingdom;, so that I can refer back to them in another call and correct the errors. In the case of incertae;sedis;, I should be able to call group 4 and merge it with group 5. I have looked online about how to refer back to capture groups in R, and from what I've seen str_match seems to be the most efficient way to do this; however, I am uncertain why my ([:alnum:]*;) regex is not capturing the groups in str_match. I have tried different variations of this regexp (with parenthesis in different places), but I am stuck.
I am wondering if someone can help me write the str_match() function that will accomplish my goal.
Any help would be appreciated.
Edit
At this point, it seems like I should go with Wiktor's recommendation and simply split the strings by ;'s, and then fix the errors. Would anyone be able to split the strings into their own columns?
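Following that recommendation, a minimal sketch with tidyr::separate (assuming the taxa_data and taxonomyString names from the question):
library(dplyr)
library(tidyr)
# Split on the semicolons directly; fill = "right" pads short strings
# with NA instead of erroring, so malformed rows survive for cleanup.
taxa_data %>%
  separate(taxonomyString,
           into = c("Domain", "Kingdom", "Phylum", "Class", "Order", "Genus", "Species"),
           sep = ";", fill = "right")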

Is there any way to replicate Rstudio's integrated search function as a code?

For context, I asked a question earlier today about matching company names, with all their variations, against a big list of many different company names, using the stringdist function from the stringdist package in order to identify the companies in that big list. This is the question I asked.
Unfortunately, I have not been able to make any improvements to my code, which is why I'm starting to look away from stringdist and try something completely different.
I use RStudio, and I've noticed that the program's built-in search function is much more effective.
Simply searching for the company name in the top-right search box gives me the output I'm looking for, such as the longer name "AMMINEX EMISSIONS..." and "AMMINEX AS".
However, in my previous attempt with the stringdist function (see the link to my previous question), I would get results like "LAMINEX", which are not at all relevant but would appear before the more useful matches.
So the algorithm RStudio uses seems much more efficient in my case; however, I'm not quite sure whether it's possible to replicate it in code form instead of having to search manually for each company.
Assuming I have a data frame that looks like this:
Company_list <- data.frame(Companies=c('AMMINEX', 'Microsoft', 'Apple'))
What would be a way for me to search for all 3 companies at the same time and get the same type of results in a data frame, like Rstudio does in the first image?
From your description of which results are good or bad, it sounds like you want exact matches of a substring rather than strings that are merely close on those distance measures. In that case you can imitate RStudio's search function with grepl:
library(tidyverse)
demo.df <- data.frame(name = paste(rep(c("abc", "jkl", "xyz"), each = 4), sample(1:100, 4 * 3)), limbs = 1:4 * 3)
demo.df %>% filter(grepl('abc|xyz', name))
where the pipe (|) in the grepl pattern string means "or", letting you search for multiple companies at the same time. So, to search for the names from the example data frame, the pattern would be paste0(Company_list$Companies, collapse = "|"). Is this what you're after?
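For example, a minimal sketch, assuming a hypothetical big_list data frame whose name column holds the long list of company names to search through:
library(dplyr)
# Build one alternation pattern from the lookup names, then keep every
# row whose name contains any of them as a substring.
pattern <- paste0(Company_list$Companies, collapse = "|")
big_list %>% filter(grepl(pattern, name, ignore.case = TRUE))
# ignore.case = TRUE mimics the case-insensitive feel of the IDE search.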

grep or gsub for everything except a specific string in R

I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset looking at crime incidents in SF, and I want to separate cases that have a resolution from those that don't. In the resolution field, cases have things listed like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions, like the different arrests, to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, even if there was a way to just use gsub, how do I say "not NONE" for my regex in R?
trainData$Resolution = gsub("!NONE", RESOLVED, trainData$Resolution) << what's the way to negate the string here?
Based on your explanation, it seems as though you don't need regular expressions (i.e. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Perhaps you want
within(trainData, {
  ## next line only necessary if you have a factor column
  Resolution <- as.character(Resolution)
  Resolution[Resolution != "NONE"] <- "RESOLVED"
})
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl=TRUE)
You need to use the option perl=TRUE, and to drop fixed=TRUE: fixed=TRUE treats the pattern as a literal string, so the lookahead (and perl=TRUE with it) would be ignored.
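And if you do want the gsub() route from the question, the same negative lookahead works there too; a minimal sketch:
# Replace every resolution that is not exactly "NONE" with "RESOLVED";
# strings equal to "NONE" fail the lookahead and are left untouched.
trainData$Resolution <- gsub("^(?!NONE$).*", "RESOLVED", trainData$Resolution, perl = TRUE)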
