grep or gsub for everything except a specific string in R

I'm trying to match everything except a specific string in R, and I've seen a bunch of posts on this suggesting a negative lookaround, but I haven't gotten that to work.
I have a dataset of crime incidents in SF, and I want to separate cases that have a resolution from those that don't. In the resolution field, cases have values like arrest booked, arrest cited, juvenile booked, etc., or none. I want to relabel all the specific resolutions (like the different arrests) to "RESOLVED" and keep the instances with "NONE" as such. So, I thought I could gsub or grep for not "NONE".
Based on what I've read on finding all strings except one specific string, I would have thought this would work:
resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, fixed=TRUE)
Where I make a vector that searches through my training dataset, specifically the resolution column, and finds the terms that aren't "NONE". But, I just get an empty vector.
Does anyone have suggestions, or know why this might not be working in R? Or, even if there was a way to just use gsub, how do I say "not NONE" for my regex in R?
trainData$Resolution = gsub("!NONE", RESOLVED, trainData$Resolution) << what's the way to negate the string here?

Based on your explanation, it seems as though you don't need regular expressions (i.e. gsub()) at all. You can use != since you are looking for all non-matches of an exact string. Perhaps you want
within(trainData, {
  ## next line only necessary if you have a factor column
  Resolution <- as.character(Resolution)
  Resolution[Resolution != "NONE"] <- "RESOLVED"
})
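For example, on a made-up toy column (hypothetical data, just to illustrate):
# hypothetical stand-in for the real trainData
trainData <- data.frame(Resolution = c("ARREST, BOOKED", "NONE", "ARREST, CITED"))
trainData <- within(trainData, {
  Resolution <- as.character(Resolution)  # no-op if already character
  Resolution[Resolution != "NONE"] <- "RESOLVED"
})
trainData$Resolution
#> [1] "RESOLVED" "NONE"     "RESOLVED"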

resolution_vector = grep("^(?!NONE$).*", trainData$Resolution, perl=TRUE)
You need the option perl=TRUE for the negative lookahead to work, and you have to drop fixed=TRUE: fixed=TRUE makes grep treat the pattern as a literal string and overrides perl, which is why your original call returned an empty vector.
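A quick check on a toy vector (made-up values):
grep("^(?!NONE$).*", c("ARREST, BOOKED", "NONE", "JUVENILE BOOKED"), perl = TRUE)
#> [1] 1 3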

Related

Trying to filter a data frame based on consecutive words in description variables (R)

Asking around to troubleshoot solutions for a project I'm doing in R. I have a data frame with every pitch from the previous season in professional baseball, and I'm trying to assign plays to the fielder that got to the ball first. To do this, I'm trying to manipulate description variables (des), which look something like this.
"Gary Sanchez grounds out, third baseman Eugenio Suarez to first baseman Ty France."
In this instance, my goal would be to assign this play to Eugenio Suarez, as he was the one the ball was hit to. To give this a test run, I decided to use grepl (I've used it to separate certain plays in the past). To make sure it wouldn't include plays the third baseman was involved in but not the original fielder, I tried something like this.
DF %>% filter(grepl(", third baseman", des) == T)
Commas and "to" are integral in determining the original fielder from these description variables, but grepl simply filtered for every description that had a comma and which a third baseman was involved in.
Is there some function or way I could filter for consecutive characters so that the previous filter would match only plays like the original one? Let me know, thanks!
It's unclear exactly what you want to do, but it seems like you need a lookaround in the regex.
If you want to filter for all the plays with third baseman Eugenio Suarez:
grepl("(?<=, third baseman )Eugenio Suarez", desc, perl = TRUE)
This finds all the descriptions where the name "Eugenio Suarez" is preceded by ", third baseman ".
If you want to just extract the name, you can use a function like stringr::str_extract:
stringr::str_extract(desc, "(?<=, third baseman )(Eugenio Suarez)")
You can generalize this, if the name is always two words followed by " to", by using a lookahead:
stringr::str_extract(desc, "[[:alpha:]]+ [[:alpha:]]+(?= to)")
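On the example description from the question, that last pattern picks out the fielder's name:
desc <- "Gary Sanchez grounds out, third baseman Eugenio Suarez to first baseman Ty France."
stringr::str_extract(desc, "[[:alpha:]]+ [[:alpha:]]+(?= to)")
#> [1] "Eugenio Suarez"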

Is there a way to extract a substring from a cell in OpenOffice Calc?

I have tens of thousands of rows of unstructured data in csv format. I need to extract certain product attributes from a long string of text. Given a set of acceptable attributes, if there is a match, I need it to fill in the cell with the match.
Example data:
"[ROOT];Earrings;Brands;Brands>JeweleryExchange;Earrings>Gender;Earrings>Gemstone;Earrings>Metal;Earrings>Occasion;Earrings>Style;Earrings>Gender>Women's;Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Earrings>Occasion>Just to say: I Love You;Earrings>Style>Drop/Dangle;Earrings>Style>Fashion;Not Visible;Gifts;Gifts>Price>$500 - $1000;Gifts>Shop>Earrings;Gifts>Occasion;Gifts>Occasion>Christmas;Gifts>Occasion>Just to say: I Love You;Gifts>For>Her"
Lookup table of values:
Zircon, Diamond, Pearl, Ruby
Output:
Zircon
I tried using the VLOOKUP() function, but it needs to match an entire cell and works better for translating acronyms. I haven't really found a built-in function that accomplishes what I need. The data is totally unstructured and changes from row to row, with no consistency even among variations of the same product. Does anyone have an idea how to do this, or how to write an OpenOffice Calc function to accomplish it? I'm also open to other, better methods if anyone has experience or ideas on how to approach this.
OK, so I figured out how to do this on my own... I created many different columns, each with a keyword I was looking to extract as the header.
[Screenshot: spreadsheet solution for structured data extraction]
Then I used this formula to pull each keyword into the correct row beneath its column header:
=IF(ISERROR(SEARCH(CF$1,$D769)),"",CF$1)
The SEARCH function returns a numeric position for a search string and otherwise produces an error. I use ISERROR to detect that error condition, and the IF statement so that on an error the cell is left blank, and otherwise it takes the value of the header. I had over 100 columns of specific information to extract, feeding one final column where I join all the previous cells in the row together for the final list. Worked like a charm. I'd recommend this approach to anyone who has to do a similar task.
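If you ever want to do the same check outside the spreadsheet, it is a few lines of R (a rough sketch; lookup and desc here are stand-ins for your keyword list and a single raw cell):
# keywords to search for, and one (abbreviated) raw category cell
lookup <- c("Zircon", "Diamond", "Pearl", "Ruby")
desc <- "Earrings>Gemstone>Zircon;Earrings>Metal>White Gold;Gifts>For>Her"
# keep the keywords that occur literally in the string
lookup[sapply(lookup, grepl, x = desc, fixed = TRUE)]
#> [1] "Zircon"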

Backreferencing a repeated regex pattern using str_match in R

I am not too great at regexes and have been stuck on this problem for a while now. I have biological taxonomic information stored as strings in a "taxonomyString" column of a dataframe. The strings look like this:
“domain;kingdom;phylum;class;order;genus;species”
My goal is to split each level of the string (e.g., "domain") into its own taxonomic-level column (e.g., "domain" into a "Domain" column). I have accomplished this using the following (very long) code,
taxa_data_six <- taxa_data %>%
  filter(str_count(taxonomyString, pattern = ";") == 6) %>%
  tidyr::extract(taxonomyString, into = c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), regex = "([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+);([\\]\\[\\-\\p{Zs}.()[:alnum:][:blank:]]+)")
I had to include a lot of different possible characters in between the semicolons because some of the taxa had [brackets] around the name, etc.
Besides this being cumbersome, after running my code I have found some errors in the taxonomyString values, which I would like to clean up.
Sometimes, a class name is broken up by semicolons, e.g., what should be incertae sedis; is actually incertae;sedis;. These kinds of errors are throwing off my code, which assumes that the first semicolon always denotes the domain, the second, the kingdom, and so on.
In any case, my question is simple, but has been giving me a lot of grief. I would like to be able to group each taxonomyString by semicolons, e.g., group 1 is domain;, group 2 is kingdom;, so that I can refer back to them in another call and correct the errors. In the case of incertae;sedis;, I should be able to call group 4 and merge it with group 5. I have looked online about how to refer back to capture groups in R, and from what I've seen str_match seems to be the most efficient way to do this; however, I am uncertain why my ([:alnum:]*;) regex is not capturing the groups in str_match. I have tried different variations of this regexp (with parenthesis in different places), but I am stuck.
I am wondering if someone can help me write the str_match() function that will accomplish my goal.
Any help would be appreciated.
Edit
At this point, it seems like I should go with Wiktor's recommendation and simply split the strings on ;'s, and then fix the errors. Would anyone be able to split the strings into their own columns?
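Going down that route, here is a minimal sketch with tidyr::separate (assuming seven ;-separated levels, as in the example string; fill = "right" pads short rows with NA instead of erroring):
library(dplyr)
library(tidyr)
taxa_data_split <- taxa_data %>%
  separate(taxonomyString,
           into = c("Domain", "Kingdom", "Phylum", "Class", "Order", "Genus", "Species"),
           sep = ";", fill = "right", remove = FALSE)
Rows affected by the incertae;sedis glitch will come out shifted, but once the pieces sit in separate columns they are far easier to spot and repair than inside one long regex.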

Splitting strings into elements from a list

A function in a package gives me a character string in which the original strings are merged together. I need to separate them; in other words, I have to find the original elements. Here is an example and what I have tried:
orig<-c("answer1","answer2","answer3")
result<-"answer3answer2"
What I need as an outcome is:
c("answer2","answer3")
I have tried to split() result, but there is no separator string to base the split on, especially as I have no prior knowledge of what the answers will be.
I have tried to match() the result to the orig, but I would need to do that with all substrings.
There has to be an easy solution, but I haven't found it.
index <- gregexpr(paste(orig, collapse='|'), result)[[1]]
starts <- as.numeric(index)
stops <- starts + attributes(index)$match.length - 1
substring(result, starts, stops)
This should work for well-defined and reversible inputs. Alternatively, is it possible to append some string to the function's input so that the result can be easily separated afterwards?
What you're describing seems to be exactly string matching, and for your strings, grepl seems to be just the thing, in particular:
FindSubstrings <- function(orig, result){
  orig[sapply(orig, grepl, result)]
}
In more detail: grepl takes a pattern argument and checks whether it occurs in your string (result in our case), returning TRUE or FALSE. We then subset the original values by that logical vector: does each value occur in the string?
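With the example values from the question, this returns exactly the elements that were merged:
orig <- c("answer1", "answer2", "answer3")
result <- "answer3answer2"
FindSubstrings(orig, result)
#> [1] "answer2" "answer3"
(Note that the output follows the order of orig, not the order within result.)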
Possible improvements:
- fixed=TRUE may be a good idea, because you don't need the full power of regexes for simple string matching
- some match patterns may contain others; for example, "answer10" contains "answer1"
- stringi may be faster for such tasks (just rumors floating around, I haven't rigorously tested), so you may want to look into it if you do this a lot.

R: Want to do a dictionary check and remove unwanted space in between where removing space will make it a proper word

I am using R for text mining and have data that has been concatenated from different text columns. There are cases where words have been split by a space, like "functi oning". I want to detect all such cases and remove the space in between by doing a dictionary check. I know the splitWords function in aspell; I want a function that does exactly the opposite.
Here is an approach, based on some code I found (though in general you should provide some example text, or at least pseudocode, to help others respond).
First create an object that holds a huge set of correctly spelled words. Then compare each of your words to that set with adist, allowing a single edit: ideally, the internal space you would like to remove. I doubt that this will solve everything, but it may help.
sorted_words <- names(sort(table(strsplit(tolower(paste(readLines("http://www.norvig.com/big.txt"), collapse = " ")), "[^a-z]+")), decreasing = TRUE))
correct <- function(word) { c(sorted_words[adist(word, sorted_words) <= min(adist(word, sorted_words), 2)], word)[1] }
Then use the correct function.
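For the example from the question, that looks like this (assuming "functioning" occurs in big.txt, which a corpus of that size should guarantee):
correct("functi oning")
#> [1] "functioning"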
