I am having an issue with partial string matching. I have pairs of people, and I need to compare their names. To do this I have run charmatch in both directions on the two last names, to see if name1 is part of name2, and vice versa. I have a small dataset below to demonstrate the question. I use charmatch below; I have used pmatch as well and it returns the same result.
The charmatch documentation says it seeks matches for the elements of its first argument among those of its second. I take that to mean it will treat each group of characters in element1 as a pattern and see if the same group exists in element2. But that's obviously not what's happening; it looks like it's direction specific.
So... is it direction specific? And if so, what else can I use to do what I am describing? My example names aside (pun intended), what I actually run into are lots of last names where the husband has his name and the wife has hers plus the husband's. I need to be able to see if the husband's last name exists within the wife's last name.
I know it can be done with regular expressions, but I am not familiar with them (I probably should be), so I'd prefer an answer that does not use regex.
eg_data <- data.frame(name1 = c('Jimmy Conway', 'Jimmy'),
name2 = c('Conway','Jimmy Conway'))
eg_data$share_name1 <- mapply(charmatch, eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(charmatch, eg_data$name2, eg_data$name1)
eg_data$share_name <- 0
eg_data$share_name[eg_data$share_name1 == 1 | eg_data$share_name2 == 1] <- 1
Same two lines, only with str_detect (from the stringr package) instead of charmatch.
eg_data$share_name1 <- mapply(str_detect,eg_data$name1, eg_data$name2)
eg_data$share_name2 <- mapply(str_detect,eg_data$name2, eg_data$name1)
OR even
eg_data$share_name1 <- ifelse(mapply(str_detect,eg_data$name1, eg_data$name2)==TRUE,1,0)
eg_data$share_name2 <- ifelse(mapply(str_detect,eg_data$name2, eg_data$name1)==TRUE,1,0)
Thanks to anyone who looked. I hope this helps others.
This could be useful
> with(eg_data, intersect(name1, name2))
[1] "Jimmy Conway"
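To answer the "is it direction specific" part: charmatch (like pmatch) only matches from the start of the string, so it depends on both direction and position. For a regex-free "is one name contained in the other" check, here is a minimal sketch using base R's grepl with fixed = TRUE, which treats the pattern as a plain string. The helper name contains is made up for this example, and the sketch assumes the columns are character vectors rather than factors.

```r
eg_data <- data.frame(name1 = c("Jimmy Conway", "Jimmy"),
                      name2 = c("Conway", "Jimmy Conway"),
                      stringsAsFactors = FALSE)

# charmatch() only looks at the beginning of the target string
prefix_hit <- charmatch("Jimmy", "Jimmy Conway")   # matches: "Jimmy" is a prefix
suffix_hit <- charmatch("Conway", "Jimmy Conway")  # NA: "Conway" is not a prefix

# Hypothetical helper: is needle[i] a literal substring of haystack[i]?
contains <- function(needle, haystack) {
  mapply(function(n, h) grepl(n, h, fixed = TRUE),
         needle, haystack, USE.NAMES = FALSE)
}

# 1 if either name is contained in the other, 0 otherwise
eg_data$share_name <- as.integer(contains(eg_data$name1, eg_data$name2) |
                                 contains(eg_data$name2, eg_data$name1))
eg_data
```

Because fixed = TRUE disables regex interpretation, names containing characters such as "." or "(" are compared literally.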
Before I get started, I would like you to know that I am completely new to coding in R. For a group assignment our professor set up a database by scraping data from Amazon. Within the database, which is called 'dat', there is a column named 'product_name'. We were given a fixed set of utilitarian words. I think you can guess where this is going. For each entry in the column 'product_name' we have to find whether any of the utilitarian words appear and, if so, how many. We were given the following code by our professor to use for this assignment:
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}
After which I wrote the following code:
uti_words <-c("additives","antioxidant","artificial", "busy", "calcium","calories", "carb", "carbohydrates", "chemicals", "cholesterol", "convenient", "dense", "diet", "fast")
sentences <- (dat$product_name)
nb_words (lexicon=uti_words,corpus=sentences)
When I run nb_words, however, I noticed something went wrong. A sentence contained the word 'breakfast'. My code counted this as a match because the word 'fast' from 'uti_words' matched it. I don't want this to happen. Does anyone know how to make it so that I only get exact matches and no partial matches?
We may have to add a word boundary (\\b) on both sides of each word to avoid partial matches:
uti_words <- paste0("\\b", trimws(uti_words), "\\b")
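Another option sometimes suggested is to change the grepl part of the code to use fixed = TRUE, as below. Note, however, that fixed = TRUE only makes grepl treat the pattern as a literal string instead of a regex; 'fast' is still found inside 'breakfast', so on its own this does not prevent partial matches, and it must not be combined with the \\b patterns above: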
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus, fixed = TRUE)))
}
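To see the effect of the word boundaries, here is a small sketch with made-up product names (the real dat$product_name is not shown in the question, so sentences and the shortened uti_words below are assumptions for illustration only):

```r
nb_words <- function(lexicon, corpus) {
  rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}

# Made-up stand-ins for dat$product_name and the word list
sentences <- c("fast breakfast cereal", "breakfast bar", "low carb diet snack")
uti_words <- c("carb", "diet", "fast")

# Without boundaries, 'fast' also matches inside 'breakfast'
plain_counts <- nb_words(uti_words, sentences)

# With \\b on both sides, only whole words match
bounded_words  <- paste0("\\b", trimws(uti_words), "\\b")
bounded_counts <- nb_words(bounded_words, sentences)

plain_counts
bounded_counts
```

The second sentence drops from one match to zero once the boundaries are added, while the first keeps its match because 'fast' also appears there as a whole word.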
As the output of a certain operation, I have the following data frame with 729 observations.
> head(con)
Connections
1 r_con[C3-C3,Intercept]
2 r_con[C3-C4,Intercept]
3 r_con[C3-CP1,Intercept]
4 r_con[C3-CP2,Intercept]
5 r_con[C3-CP5,Intercept]
6 r_con[C3-CP6,Intercept]
As can be seen, everything except the pair of electrode names should be removed; for instance, the first observation should become C3-C3. Below is my attempt, which I expected to return the column with everything else stripped. If I'm not wrong (which I probably am), the regex syntax is OK, and from my understanding fixed=TRUE is also necessary. However, I do not understand the R output. I expected the pattern to be replaced by the empty string "", but it returns this output, which doesn't make sense to me.
> gsub("r_con\\[\\,Intercept\\]\\","",con,fixed=TRUE)
[1] "3:731"
I believe this will probably be a silly question for an expert programmer, which I am far from being, and any insight would be much appreciated.
[UPDATE WITH SOLUTION]
Thanks to Tim and Ben I realised I was using the wrong regex syntax and the wrong source. This worked for me:
con2 <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", con$Connections)
I think your problem is that you're passing the whole data frame "con" to your sub call rather than the column con$Connections. Also, as the user above me pointed out, you probably don't want to use sub.
I'm assuming that your data is consistent, i.e., the strings in con$Connections follow more or less the same pattern. Then this works:
I have set up this example:
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"))
library(stringr)
f <- function(x){
part <- str_split(x, ",")[[1]][1]
str_sub(part, 7, -1)
}
f(con$Connections[1])
sapply(con$Connections, f)
The sub function doesn't work this way. One viable approach would be to capture the quantity you want, then use this capture group as the replacement:
x <- "r_con[C3-C3,Intercept]"
term <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", x)
term
[1] "C3-C3"
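Applied to the whole column rather than a single string, the same capture-group approach might look like the sketch below. The two-row con is a stand-in for the real 729-row data frame, and the new column name Electrodes is made up for this example. The last line illustrates why fixed = TRUE does not help here: it makes the backslashes part of the literal text being searched for.

```r
# Stand-in for the real data frame
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]",
                                  "r_con[C3-CP1,Intercept]"),
                  stringsAsFactors = FALSE)

# Capture everything between "r_con[" and ",Intercept]" and keep only that
con$Electrodes <- sub("^r_con\\[([^,]+),Intercept\\]$", "\\1", con$Connections)
con

# With fixed = TRUE the pattern is the literal text r_con\[ (backslash included),
# which never occurs in the data, so nothing matches
literal_hit <- grepl("r_con\\[", con$Connections, fixed = TRUE)
literal_hit
```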
I have a column with different game titles. In order to collect them, I have to change all of them to a single, consistent spelling.
For example, I have:
str_replace_all(FavouriteGames_DF$FavGame1, pattern = c("SKYRIM|
THE ELDER SCROLLS V: SKYRIM|
ELDER SCROLLS SKYRIM|
ELDER SCROLLS V SKYRIM|
SKYRIM (BETHESDA 2011)|
SKYRIM (MODDED)|
THE ELDERSCROLLS V: SKYRIM"),
replacement = "THE ELDER SCROLLS 5: SKYRIM")
The problem is that str_replace_all is not well suited to this: it can't just search for any matching pattern and replace it with the replacement, but apparently has to go through the patterns in order, and I can't predict where in the data set each term will appear.
I do not want the function to replace incomplete matches (i.e., turning "THE ELDERSCROLLS V: SKYRIM" into "THE ELDERSCROLLS V: THE ELDER SCROLLS 5: SKYRIM").
Putting the patterns into a vector, pattern = c("1", "2"), does not work at all, because it can only check for the patterns in order.
I also tried the FindReplace function from the DataCombine package, but that one doesn't seem to work either, for reasons I do not quite understand (it claims I am missing dimensions and that the vector is not a character vector). Anyway, I want to use as few packages as possible and would prefer to stay in the tidyverse.
Does anybody have a good solution? I do not want to search for each term on its own, as I have to do this a lot and I already have to do it for 6 columns, since mutate_at doesn't seem to work with str_replace.
Thanks!
My comment as an answer:
FavouriteGames_DF$FavGame1[FavouriteGames_DF$FavGame1 %in% pattern] <- replacement
A handy solution would be to just use "SKYRIM" as a pattern, as it is the common word on all the patterns you specified. You could define a very simple function to check for that pattern and then use lapply on the specific column you want to check for:
check <- function(x){
y <- unlist(strsplit(x, " "))
if("SKYRIM" %in% y)
return("THE ELDER SCROLLS 5: SKYRIM")
else
return(x)
}
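# Apply check() to each title in the column; calling lapply on the one-column
# data frame would pass the whole column to check() at once
FavouriteGames_DF$FavGame1 <- unlist(lapply(FavouriteGames_DF$FavGame1, check))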
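Since the question mentions several columns and wants to avoid partial replacements, here is a base R sketch of the whole-string %in% approach applied to more than one column at once. The data frame games, the column list cols, and the vector names variants and canonical are all made up for illustration; the variant spellings are the ones listed in the question.

```r
# Known spellings that should all collapse to one canonical title
variants <- c("SKYRIM",
              "THE ELDER SCROLLS V: SKYRIM",
              "ELDER SCROLLS SKYRIM",
              "ELDER SCROLLS V SKYRIM",
              "SKYRIM (BETHESDA 2011)",
              "SKYRIM (MODDED)",
              "THE ELDERSCROLLS V: SKYRIM")
canonical <- "THE ELDER SCROLLS 5: SKYRIM"

# Made-up stand-in for FavouriteGames_DF
games <- data.frame(FavGame1 = c("SKYRIM (MODDED)", "TETRIS", "ELDER SCROLLS SKYRIM"),
                    FavGame2 = c("SKYRIM", "THE ELDERSCROLLS V: SKYRIM", "PONG"),
                    stringsAsFactors = FALSE)

# %in% compares whole strings, so a title is either replaced entirely or left alone
cols <- c("FavGame1", "FavGame2")
games[cols] <- lapply(games[cols], function(col) {
  col[col %in% variants] <- canonical
  col
})
games
```

Because no regex is involved, the parentheses in titles such as "SKYRIM (MODDED)" need no escaping, and the order of the variants does not matter.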
There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
Change your regex like below,
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
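The two regexes suggested in this thread are not equivalent: the first requires a capital letter at the start of each side of "v.", while the second only requires a "v." surrounded by whitespace inside a single pair of parentheses. A sketch comparing the per-row counts on the example data frame from above (the names strict, loose and count_cases are made up for this comparison):

```r
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
                          "(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
                          "I am trying to confuse you (this is not a court case)",
                          "this one is also confusing (But with Capital Letters)",
                          "this is confusing (With Capitols and v. d)"),
                 stringsAsFactors = FALSE)

# Requires an uppercase letter at the start of both parties
strict <- "(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))"
# Only requires ' v. ' somewhere inside one pair of parentheses
loose  <- "(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))"

# Count matches per row; rows without a match get 0
count_cases <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

df$numCases_strict <- count_cases(strict, df$text)
df$numCases_loose  <- count_cases(loose, df$text)
df[, c("numCases_strict", "numCases_loose")]
```

The last row is where the two differ: "(With Capitols and v. d)" is counted by the loose pattern but not by the strict one.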
I have what is probably a really dumb grep in R question. Apologies, because this seems like it should be so easy - I'm obviously just missing something.
I have a vector of strings, let's call it alice. Some of alice is printed out below:
T.8EFF.SP.OT1.D5.VSVOVA#4
T.8EFF.SP.OT1.D6.LISOVA#1
T.8EFF.SP.OT1.D6.LISOVA#2
T.8EFF.SP.OT1.D6.LISOVA#3
T.8EFF.SP.OT1.D6.VSVOVA#4
T.8EFF.SP.OT1.D8.VSVOVA#3
T.8EFF.SP.OT1.D8.VSVOVA#4
T.8MEM.SP#1
T.8MEM.SP#3
T.8MEM.SP.OT1.D106.VSVOVA#2
T.8MEM.SP.OT1.D45.LISOVA#1
T.8MEM.SP.OT1.D45.LISOVA#3
I'd like grep to give me the number after the D that appears in some of these strings, conditional on the string containing "LIS" and an empty string or something otherwise.
I was hoping that grep would return me the value of a capturing group rather than the whole string. Here's my R-flavoured regexp:
pattern <- "(?<=\\.D)([0-9]+)(?=.LIS)"
Nothing too complicated. But in order to get what I'm after, rather than just using grep(pattern, alice, value = TRUE, perl = TRUE) I'm doing the following, which seems bad:
reg.out <- regexpr(
"(?<=\\.D)[0-9]+(?=.LIS)",
alice,
perl=TRUE
)
substr(alice,reg.out,reg.out + attr(reg.out,"match.length")-1)
Looking at it now it doesn't seem too ugly, but the amount of messing about it's taken to get this utterly trivial thing working has been embarrassing. Anyone any pointers about how to go about this properly?
Bonus marks for pointing me to a webpage that explains the difference between whatever I access with $,# and attr.
Try the stringr package:
library(stringr)
str_match(alice, ".*\\.D([0-9]+)\\.LIS.*")[, 2]
You can do something like this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice)
If you only want the subset of alice where your pattern matches, try this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice[grepl(pat, alice)])
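The question also asked for an empty string where there is no match. A base R sketch that combines the pieces above, using a shortened copy of alice: ifelse keeps the result aligned with the input, and regmatches is the built-in shortcut for the substr / attr(, "match.length") arithmetic in the question. The names days, m and match_len are made up for this example.

```r
alice <- c("T.8EFF.SP.OT1.D5.VSVOVA#4",
           "T.8EFF.SP.OT1.D6.LISOVA#1",
           "T.8MEM.SP#1",
           "T.8MEM.SP.OT1.D45.LISOVA#3")

# Same length as alice: the captured number where the pattern matches, "" otherwise
pat  <- ".*\\.D([0-9]+)\\.LIS.*"
days <- ifelse(grepl(pat, alice), sub(pat, "\\1", alice), "")
days

# regexpr() returns match start positions; the match lengths are stored
# as an attribute on that result, which is what attr() retrieves
m <- regexpr("(?<=\\.D)[0-9]+(?=\\.LIS)", alice, perl = TRUE)
match_len <- attr(m, "match.length")   # -1 marks "no match"

# regmatches() does the substr() bookkeeping and returns only the matches
regmatches(alice, m)
```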