This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 6 years ago.
I have a column in a dataframe where each row contains a bunch of names separated by commas, as shown below:
Col1
----------------------------------------------------
Missy Monroe, Andy Dalton P, Deny Grove, Easton West
Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison
Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax
Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson
I am searching for three names in this column: Missy Monroe, Dennis Lee, or Bismark Bison. If any one of these names is found, the value Yes should be placed in a second column; if none of them is found, the value should be No. The final output should be as follows.
Col1 Results
----------------------------------------------------------------------
Missy Monroe, Andy Dalton P, Deny Grove, Easton West Yes
Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison Yes
Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax No
Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson Yes
Any help on accomplishing this is much appreciated.
This should work for a data frame df:
df$Results <- ifelse(grepl("(Missy Monroe|Dennis Lee|Bismark Bison)",
df$Col1), "Yes", "No")
The grepl function returns TRUE or FALSE, which is a perfect input for ifelse.
As @David Arenburg notes, if you are planning on using this column for additional data analysis, it is probably a better idea to construct it as a logical vector rather than a character vector. In this case,
df$Results <- grepl("(Missy Monroe|Dennis Lee|Bismark Bison)", df$Col1)
will suffice.
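Putting it together, a minimal reproducible sketch (the data frame below is a hypothetical reconstruction of the example rows in the question):

```r
# Hypothetical reconstruction of the example column
df <- data.frame(
  Col1 = c("Missy Monroe, Andy Dalton P, Deny Grove, Easton West",
           "Susan Schmidt, Bella Blu, Dennis Lee H, Georges Madison",
           "Maya Unger, Kal Rapinsky, Richard Izzo, Rob Kolfax",
           "Bismark Bison, Twyla Yellow Bird Bell, Yost Jefferson"),
  stringsAsFactors = FALSE
)

# grepl returns one TRUE/FALSE per row; ifelse maps that to "Yes"/"No"
df$Results <- ifelse(grepl("Missy Monroe|Dennis Lee|Bismark Bison", df$Col1),
                     "Yes", "No")
df$Results
# [1] "Yes" "Yes" "No"  "Yes"
```

Note that "Dennis Lee" still matches "Dennis Lee H" because grepl looks for the pattern anywhere in the string; anchor the pattern if you need whole-name matches only.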
Related
I have a lot of text data in a data.table. I have several text patterns that I'm interested in. I want to subset the table so it shows text that matches at least two of the patterns.
This is further complicated by the fact that some of the patterns already are an either/or, for example something like "paul|john".
I think I either want an expression that would mean directly to subset on that basis, or alternatively if I could count the number of times the patterns occur I could then use that as a tool to subset. I've seen ways to count the number of times patterns occur but not where the info is clearly linked to the IDs in the original dataset, if that makes sense.
At the moment the best I can think of would be to add a column to the data.table for each pattern, check if each pattern matches individually, then filter on the sum of the patterns. This seems quite convoluted so I am hoping there is a better way, as there are quite a lot of patterns to check!
Example data
text_table <- data.table(ID = (1:5), text = c("lucy, sarah and paul live on the same street",
"lucy has only moved here recently",
"lucy and sarah are cousins",
"john is also new to the area",
"paul and john have known each other a long time"))
text_patterns <- as.character(c("lucy", "sarah", "paul|john"))
With the example data, I would want IDs 1 and 3 in the subsetted data.
Thanks for your help!
We can paste the 'text_patterns' together with |, use that as the pattern in str_count to count the matching substrings, and check whether the count is greater than 1 to filter the rows of the data.table:
library(data.table)
library(stringr)
text_table[str_count(text, paste(text_patterns, collapse = "|")) > 1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins
#3: 5 paul and john have known each other a long time
Update
If we need to consider each 'text_pattern' as a single pattern (counted at most once per row), we loop through the patterns, check whether each one is present (str_detect), and add the results with Reduce(`+`) to create the logical vector for subsetting the rows:
i1 <- text_table[, Reduce(`+`, lapply(text_patterns,
function(x) str_detect(text, x))) >1]
text_table[i1]
# ID text
#1: 1 lucy, sarah and paul live on the same street
#2: 3 lucy and sarah are cousins
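If you prefer to stay in base R, the same "at least two patterns match" logic works with grepl, one logical vector per pattern. A sketch on the example data (built here as a plain data.frame; the same subsetting expression works on a data.table):

```r
# Example data from the question
text_table <- data.frame(
  ID = 1:5,
  text = c("lucy, sarah and paul live on the same street",
           "lucy has only moved here recently",
           "lucy and sarah are cousins",
           "john is also new to the area",
           "paul and john have known each other a long time"),
  stringsAsFactors = FALSE
)
text_patterns <- c("lucy", "sarah", "paul|john")

# One logical vector per pattern; Reduce(`+`, ...) counts how many patterns hit each row
n_matches <- Reduce(`+`, lapply(text_patterns, grepl, x = text_table$text))
text_table[n_matches >= 2, ]   # keeps IDs 1 and 3
```

Because "paul|john" is a single pattern here, row 5 counts as one match (not two), which gives the IDs 1 and 3 that the question asks for.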
This question already has answers here:
pair-wise duplicate removal from dataframe [duplicate]
(4 answers)
Closed 6 years ago.
Apologies if this is a duplicate question; it seems simple enough that it may have been asked already, although a quick search didn't bring up an exact match to my particular issue. If one exists, I'd appreciate you sharing it.
Dataframe for reference - I made the example dataframe by hand, so I don't have dput() output for now, but I could provide it:
> head(data[, 1:8], n = 4)
A B C D E F
1 Donald Will Joe Chris Greg Isaiah
2 Donald Will Jeff Chris Greg Isaiah
3 Donald Will Jeff Steve Greg Isaiah
4 Donald Will Jeff Steve Isaiah Greg
In this (small example of my larger) dataframe, I need to remove any duplicate rows, where a row is considered a duplicate if it has all of the same names as another row, regardless of which columns the names are in. So in this case, row 4 would be considered a duplicate of row 3, and I would want to remove (either) row.
Of note, the order of the columns is very important in my dataframe, and so I cannot simply sort each row alphabetically and then remove exact duplicates.
Thanks for any help!!
df <- read.table(header=TRUE,stringsAsFactors=FALSE,text="
A B C D E F
1 Donald Will Joe Chris Greg Isaiah
2 Donald Will Jeff Chris Greg Isaiah
3 Donald Will Jeff Steve Greg Isaiah
4 Donald Will Jeff Steve Isaiah Greg")
df <- df[!duplicated(t(apply(df,1,sort))),]
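To see why this works: apply(df, 1, sort) sorts the names within each row (apply returns them column-wise, hence the t()), so two rows with the same names in different columns become identical and duplicated() flags the later one. The sorted copy is used only to build the index, so the column order in df itself is untouched. A sketch of the intermediate step:

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
   A      B    C    D     E      F
1  Donald Will Joe  Chris Greg   Isaiah
2  Donald Will Jeff Chris Greg   Isaiah
3  Donald Will Jeff Steve Greg   Isaiah
4  Donald Will Jeff Steve Isaiah Greg")

# Each row sorted alphabetically; rows 3 and 4 now look identical
sorted_rows <- t(apply(df, 1, sort))
duplicated(sorted_rows)
# [1] FALSE FALSE FALSE  TRUE

df[!duplicated(sorted_rows), ]   # row 4 dropped, original column order preserved
```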
Apologies, I'm a novice but I don't seem to be able to find an answer to this question.
I've scraped tabular data from a web page. After some cleaning It appears in a single unnamed column.
[1] John
[2] Smith
[3] Tina
[4] Jordan
and so on.....
I'm obviously looking for the result of:
FirstName | LastName
[1] John Smith
[2] Tina Jordan
et al.
Much of what has gotten me to this point was sourced from: http://statistics.berkeley.edu/computing/r-reading-webpages
A very helpful resource for beginners such as myself.
I would be grateful for any advice you can give me.
Thanks,
C R Eaton
We create a logical index ('i1') and build a data.frame by extracting elements from the first column of the original dataset ('dat') using 'i1'. The 'i1' elements recycle to the length of the column, so dat[i1, 1] extracts the 1st, 3rd, 5th, etc. elements. For the last name, we simply negate 'i1' so that it extracts the 2nd, 4th, etc.
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1,1], LastName = dat[!i1, 1], stringsAsFactors=FALSE)
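A self-contained sketch, assuming the scraped result is a one-column data frame called dat (a hypothetical reconstruction of the data in the question):

```r
# Hypothetical reconstruction of the scraped one-column data frame
dat <- data.frame(V1 = c("John", "Smith", "Tina", "Jordan"),
                  stringsAsFactors = FALSE)

i1 <- c(TRUE, FALSE)                       # recycles: TRUE for odd rows, FALSE for even
d1 <- data.frame(FirstName = dat[i1, 1],   # rows 1, 3, ...
                 LastName  = dat[!i1, 1],  # rows 2, 4, ...
                 stringsAsFactors = FALSE)
d1
#   FirstName LastName
# 1      John    Smith
# 2      Tina   Jordan
```

This relies on the names strictly alternating first/last; an odd number of rows would signal a scraping problem worth checking first.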
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 years ago.
Given a conditional value in one column, I'm trying to get the unique list of values in another column using R. For instance, if the input were Sex = "M", the output should be the list of unique names (John, Allan, Matt, Chris).
If the input were Country = "US", it should return the unique names (John, Kate). Any solutions would be greatly appreciated!
Country Name Sex
US John M
US John M
US Kate F
Canada Allan M
Canada Kate F
Canada Matt M
England Nicole F
Germany Kate F
Germany Matt M
Germany Chris M
If I understand this correctly, you just need to use subset.
You would use it as
subset(data, Sex == "M", select = c("whatever", "cols you want to keep"))
Note that if you want all of the columns, you can omit select entirely.
And if you've got duplicates, you can get only the unique entries by running unique() on it.
For your data, this would be something like...
mydat=read.table("clipboard", header=TRUE)
unique(subset(mydat, Sex=="M"))
Country Name Sex
1 US John M
4 Canada Allan M
6 Canada Matt M
9 Germany Matt M
10 Germany Chris M
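If you only want the list of unique names rather than whole rows, pull the Name column out of the subset before calling unique. A sketch on a hypothetical reconstruction of the example data:

```r
# Hypothetical reconstruction of the example data
mydat <- data.frame(
  Country = c("US", "US", "US", "Canada", "Canada", "Canada",
              "England", "Germany", "Germany", "Germany"),
  Name    = c("John", "John", "Kate", "Allan", "Kate", "Matt",
              "Nicole", "Kate", "Matt", "Chris"),
  Sex     = c("M", "M", "F", "M", "F", "M", "F", "F", "M", "M"),
  stringsAsFactors = FALSE)

# Unique names for each condition
unique(subset(mydat, Sex == "M")$Name)
# [1] "John"  "Allan" "Matt"  "Chris"
unique(subset(mydat, Country == "US")$Name)
# [1] "John" "Kate"
```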
I have a data frame containing a long list of thousands of names. Many of the names have small differences that make them slightly different from one another. I would like to find a way to match these names. For example:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.')
I've looked at amatch in the stringdist package, as well as agrep, but these all require a master list of names to match another list of names against. In my case, I don't have such a master list, so I'd like to create one from the data by identifying names with highly similar patterns so I can look at them and decide whether they're the same person (which in many cases they are). I'd like an output in a new column that flags likely matches, and maybe some sort of similarity score based on Levenshtein distance or something similar. Maybe something like this:
names match SimilarityScore
1 jon smith a 9
2 jon, smith a 8
3 Jon Smith a 9
4 jon smith et al a 5
5 bob seger b 9
6 bob, seger b 8
7 bobby seger b 7
8 bob seger jr. b 5
Is something like this possible?
Drawing upon the post found here I have found that hierarchical text clustering will do what I'm looking for.
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied')
# Levenshtein Distance
e <- adist(names)
rownames(e) <- names
hc <- hclust(as.dist(e))
plot(hc)
rect.hclust(hc,k=3) #the k value provides the number of clusters
df <- data.frame(names,cutree(hc,k=3))
The output looks really good if you pick the right number of clusters (three in this case):
names cutree.hc..k...3.
jon smith jon smith 1
jon, smith jon, smith 1
Jon Smith Jon Smith 1
jon smith et al jon smith et al 1
bob seger bob seger 2
bob, seger bob, seger 2
bobby seger bobby seger 2
bob seger jr. bob seger jr. 2
jake jake 3
jakey jakey 3
jack jack 3
jakeyfied jakeyfied 3
However, names are oftentimes more complex than this, and after adding a few more difficult names, I found that the default adist options didn't give the best clustering:
names <- c('jon smith','jon, smith','Jon Smith','jon smith et al','bob seger','bob, seger','bobby seger','bob seger jr.','jake','jakey','jack','jakeyfied','1234 ranch','5678 ranch','9983','7777')
d <- adist(names)
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I was able to improve upon this by increasing the cost of the substitution value to 2, leaving the insertion and deletion costs at 1, and ignoring case. This helped to minimize the mistaken grouping of totally different four-character number strings, which I didn't want grouped:
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2)) #i=insertion, d=deletion s=substitution
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=6)
I further fine-tuned the clustering by removing common terms such as "ranch" and "et al" with the base R gsub function, and by increasing the number of clusters by one:
names<-gsub("ranch","",names)
names<-gsub("et al","",names)
d <- adist(names,ignore.case=TRUE, costs=c(i=1,d=1,s=2))
rownames(d) <- names
hc <- hclust(as.dist(d))
plot(hc)
rect.hclust(hc,k=7)
Although there are methods to let the data sort out the best number of clusters instead of manually trying to pick the number, I found that it was easiest to use trial and error, although there is information here about that approach.
The suggestion by Roman in the comments on the natural language processing is probably the best place to start. But for a back-of-the-envelope type of approach you can look at the distance in terms of ascii code:
mynames = c("abcd efghijkl mn","zbcd efghijkl mn","bbcd efghijkl mn","erqe")
asc <- function(x) { strtoi(charToRaw(x),16L) }
namesToChar= sapply(mynames, asc)
maxLength= max(unlist(lapply(namesToChar,length)))
namesToChar =lapply(namesToChar, function(x) { c(x, rep(-1, times = maxLength-length(x) )) } )
namesToChar = do.call("rbind",namesToChar)
dist(namesToChar,method="euclidean")
dist(namesToChar,method="canberra")
Though it seems to give OK enough numbers for the sample,
> dist(namesToChar,method="manhattan")
abcd efghijkl mn zbcd efghijkl mn bbcd efghijkl mn
zbcd efghijkl mn 25
bbcd efghijkl mn 1 24
erqe 257 274 256
this approach suffers from the fact that there does not seem to be an adequate distance method in the dist function for what you want to do. An element-wise binary comparison followed by a more standard distance might work ('manhattan' seems closest to your needs); you could always implement that yourself, of course. Also, the -1 padding is a hack here; you would need to replace it with the average ASCII code of your sample if you decide to go this route.
For a similarity score versus the overall population you can take inverse of the average distance against each other word.
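The element-wise binary comparison suggested above could look like the following: pad every name to the same length, split into single characters, and count the positions where two names disagree (a Hamming-style distance). This is a rough sketch under the same padding caveat as the ASCII approach, not a polished implementation:

```r
mynames <- c("abcd efghijkl mn", "zbcd efghijkl mn", "bbcd efghijkl mn", "erqe")

# Right-pad every name with spaces to equal length, then split into characters
max_len <- max(nchar(mynames))
padded  <- formatC(mynames, width = max_len, flag = "-")
chars   <- do.call(rbind, strsplit(padded, ""))

# Hamming-style distance: number of positions where the characters differ
hamming <- outer(seq_along(mynames), seq_along(mynames),
                 Vectorize(function(i, j) sum(chars[i, ] != chars[j, ])))
rownames(hamming) <- colnames(hamming) <- mynames
hamming
```

On this sample, "abcd efghijkl mn" and "bbcd efghijkl mn" come out at distance 1 while "erqe" is far from everything, which matches the intuition the manhattan distance above was approximating.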