Simple join error where some rows join and some don't - R

I have two dataframes which I'm trying to join. It should be straightforward, but I'm seeing some anomalous behavior.
Dataframe A
Name   Sample            Country  Path
John   S18902            UK       /Home/drive/John
BOB    135671            USA      /Home/drive/BOB
Tim    GB12345_serum_63  UK       /Home/drive/Tim
Wayne  12345_6789        UK       /Home/drive/Wayne
Dataframe B
Surname  Sample            State  FILE
Paul     S18902            NJ     John.csv
Gem      135671            PP     BOB.csv
Love     GB12345_serum_63  AP     Tim.csv
Dave     12345_6789        MK     Wayne.csv
I am using R Markdown to do a simple join using the following command:
Dataframec <- DataframeA %>%
  left_join(DataframeB, by = "Sample")
All rows join apart from the row where Sample == "GB12345_serum_63".
There should be a simple fix for this, but I'm out of ideas.
Thank you

If you are cutting and pasting your data directly into your question, then the reason is that your key values are technically different: they contain different numbers of trailing spaces.
I copied each value from your question, from the beginning of the value up to the start of the adjacent column name ('Country' in the first case and 'State' in the second):
DataframeA: "GB12345_serum_63"
DataframeB: "GB12345_serum_63   "
You can see that for DataframeB there are 3 space characters after the value. This can be resolved by stripping leading and trailing whitespace from your key values with a regular expression, gsub("^\\s+|\\s+$", "", x):
DataframeA$Sample <- gsub("^\\s+|\\s+$", "", DataframeA$Sample)
DataframeB$Sample <- gsub("^\\s+|\\s+$", "", DataframeB$Sample)
Now your join should work.
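For what it's worth, base R (>= 3.2.0) also ships trimws(), which does the same trimming as that gsub() call. A minimal self-contained sketch of the whole fix, using made-up stand-ins for your two dataframes:

library(dplyr)

# Toy stand-ins for DataframeA and DataframeB; note the trailing
# spaces in B's key, mimicking the copy-paste artefact
DataframeA <- data.frame(Name   = c("Tim", "Wayne"),
                         Sample = c("GB12345_serum_63", "12345_6789"),
                         stringsAsFactors = FALSE)
DataframeB <- data.frame(Surname = c("Love", "Dave"),
                         Sample  = c("GB12345_serum_63   ", "12345_6789"),
                         stringsAsFactors = FALSE)

# trimws() strips leading and trailing whitespace, like the gsub() above
DataframeA$Sample <- trimws(DataframeA$Sample)
DataframeB$Sample <- trimws(DataframeB$Sample)

DataframeA %>% left_join(DataframeB, by = "Sample")
#>    Name           Sample Surname
#> 1   Tim GB12345_serum_63    Love
#> 2 Wayne       12345_6789    Dave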

Related

Fuzzy matching by category

I am trying to fuzzy match two different dataframes based on company names, using the agrep function. To improve my matching, I would like to only match companies if they are located in the same country.
df1:                          df2:
Company              ISO      Company               ISO
Aalberts Industries  NL       Aalberts              NL
Allison              NL       Allison transmission  NL
Allison              UK       Allison transmission  UK
I use the following function to match:
testb$test2 <- ""
for (i in 1:nrow(testb)) {
  x2 <- agrep(testb$name[i], testa$name, ignore.case = TRUE, value = TRUE,
              max.distance = Inf, useBytes = TRUE, fixed = TRUE)
  testb$test2[i] <- paste0(x2, collapse = "")
}
I can create a subset for every country and then match each subset, which works, but is time-consuming. Is there another way to let R only match company names if df1$ISO == df2$ISO? Thanks!
Try indexing with the data.table package: https://www.r-bloggers.com/intro-to-the-data-table-package/.
Your company columns seem to be too dissimilar to match consistently and accurately with agrep(). For example, "Aalberts Industries" will match "Aalberts" only when you set max.distance to a value greater than 10. The same string distance would also report a match between "Algebra" and "Alleyway" — not very close at all. I recommend cleaning out the unnecessary words in your company columns before matching.
Sorry, I would make this a comment, but I don't have the required reputation. Maybe someone could convert this to a comment for me?
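To avoid building a separate subset for every country by hand, one option is to keep a single loop but restrict the agrep() candidate pool to rows with the same ISO code. A rough sketch with toy data; the column names and the max.distance threshold are illustrative, and the threshold needs tuning for the reasons given above:

df1 <- data.frame(name = c("Aalberts Industries", "Allison", "Allison"),
                  ISO  = c("NL", "NL", "UK"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("Aalberts", "Allison transmission", "Allison transmission"),
                  ISO  = c("NL", "NL", "UK"),
                  stringsAsFactors = FALSE)

df2$match <- vapply(seq_len(nrow(df2)), function(i) {
  pool <- df1$name[df1$ISO == df2$ISO[i]]  # candidates from the same country only
  hit  <- agrep(df2$name[i], pool, ignore.case = TRUE, value = TRUE,
                max.distance = 0.5)        # tune this, per the caveat above
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1))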

R how to get data column to rows of first and second values

Apologies, I'm a novice but I don't seem to be able to find an answer to this question.
I've scraped tabular data from a web page. After some cleaning it appears in a single unnamed column.
[1] John
[2] Smith
[3] Tina
[4] Jordan
and so on.....
I'm obviously looking for the result of:
FirstName | LastName
[1] John Smith
[2] Tina Jordan
et al.
Much of what has gotten me to this point was sourced from: http://statistics.berkeley.edu/computing/r-reading-webpages
A very helpful resource for beginners such as myself.
I would be grateful for any advice you can give me.
Thanks,
C R Eaton
We create a logical index ('i1') and build a data.frame by extracting the elements of the first column of the original dataset ('dat') using 'i1'. The elements of 'i1' recycle to the length of the column, so dat[i1, 1] extracts the 1st, 3rd, 5th, etc. elements. For the last name we simply negate 'i1', so that it extracts the 2nd, 4th, etc. elements.
i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1,1], LastName = dat[!i1, 1], stringsAsFactors=FALSE)
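A quick check with the data from the question, using a made-up one-column data frame dat:

dat <- data.frame(V1 = c("John", "Smith", "Tina", "Jordan"),
                  stringsAsFactors = FALSE)

i1 <- c(TRUE, FALSE)
d1 <- data.frame(FirstName = dat[i1, 1], LastName = dat[!i1, 1],
                 stringsAsFactors = FALSE)
d1
#>   FirstName LastName
#> 1      John    Smith
#> 2      Tina   Jordan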

How to remove specific duplicates in R

I have the following data:
> head(bigdata)
      type                               text
1  neutral              The week in 32 photos
2  neutral Look at me! 22 selfies of the week
3  neutral       Inside rebel tunnels in Homs
4  neutral                Voices from Ukraine
5  neutral  Water dries up ahead of World Cup
6 positive     Who's your hero? Nominate them
My duplicates will look like this (with empty $type):
7 Who's your hero? Nominate them
8 Water dries up ahead of World Cup
I remove duplicates like this:
bigdata <- bigdata[!duplicated(bigdata$text),]
The problem is, it removes the wrong duplicate. I want to remove the one where $type is empty, not the one that has a value for $type.
How can I remove a specific duplicate in R?
So here's a solution that does not use duplicated(...).
# creates an example - you have this already...
set.seed(1)   # for reproducible example
bigdata <- data.frame(type = rep(c("positive", "negative"), 5),
                      text = sample(letters[1:10], 10),
                      stringsAsFactors = FALSE)
# add some duplicates
bigdata <- rbind(bigdata, data.frame(type = "", text = bigdata$text[1:5]))
# you start here...
newdf  <- with(bigdata, bigdata[order(text, type, decreasing = TRUE), ])
result <- aggregate(newdf, by = list(text = newdf$text), head, 1)[2:3]
This sorts bigdata by text and type in decreasing order, so that for a given text the empty type appears after any non-empty type. Then we keep only the first row for each text.
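If you are already using the tidyverse, a minimal dplyr sketch of the same idea (sort so the non-empty type comes first, then keep one row per text):

library(dplyr)

result <- bigdata %>%
  arrange(text, desc(type)) %>%      # non-empty type sorts before ""
  distinct(text, .keep_all = TRUE)   # keep the first row per text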
If your data really is "big", then a data.table solution will probably be faster.
library(data.table)
DT <- as.data.table(bigdata)
setkey(DT, text, type)
DT.result <- DT[, list(type = type[.N]), by = text]
This does basically the same thing, but since setkey sorts only in increasing order, we use type[.N] to get the last occurrence of type for every text. .N is a special variable that holds the number of elements in the current group.
Note that the current development version implements a function setorder(), which orders a data.table by reference, and can order in both increasing and decreasing order. So, using the devel version, it'd be:
require(data.table) # 1.9.3
setorder(DT, text, -type)
DT[, list(type = type[1L]), by = text]
You should keep rows that are either not duplicated or not missing a type value. The duplicated function only returns the second and later duplicates of each value (check out duplicated(c(1, 1, 2))), so we need to combine that result with the result of duplicated called with fromLast=TRUE. Since the blanks in the question are empty strings rather than NA, we test for both:
bigdata <- bigdata[!(duplicated(bigdata$text) |
                     duplicated(bigdata$text, fromLast = TRUE)) |
                   !(is.na(bigdata$type) | bigdata$type == ""), ]
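A tiny worked check of that filter, on a made-up three-row frame where the empty-type duplicate is the only row that should go:

bigdata <- data.frame(type = c("neutral", "", "positive"),
                      text = c("Voices from Ukraine",
                               "Voices from Ukraine",
                               "Nominate them"),
                      stringsAsFactors = FALSE)

keep <- !(duplicated(bigdata$text) |
          duplicated(bigdata$text, fromLast = TRUE)) |
        !(is.na(bigdata$type) | bigdata$type == "")
bigdata[keep, ]
#>       type                text
#> 1  neutral Voices from Ukraine
#> 3 positive       Nominate them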
foo <- function(x) {
  x == ""
}
dup <- duplicated(bigdata$text) | duplicated(bigdata$text, fromLast = TRUE)
bigdata <- bigdata[!(dup & foo(bigdata$type)), ]

R: converting comma separated entry to columns with non-characters

I have a column of names in R that are separated by a comma.
For example:
John, Doe
Rebecca, Homes
I'd like to separate the first and last names into separate columns.
One additional problem I have is that sometimes there will be a name that doesn't have a comma. For example:
John, Doe
Rebecca, Homes
Organization LLC
I've looked at using strsplit(a, ","), but I get the following error: Error in strsplit(wn, ",") : non-character argument.
Here is a related example on Stack Overflow: Convert comma separated entry to columns
Any help regarding solving this simple problem will be greatly appreciated. Thanks.
In 2 steps:
1. Use read.table with fill=TRUE to read all the lines (you could also use readLines).
2. Treat the lines without a comma separately.
The code is something like this:
aa <- read.table(text = 'John, Doe
Rebecca, Homes
Organization LLC', sep = ',', fill = TRUE, colClasses = 'character')
## treat lines without a comma, splitting on the space
## (I assume you don't have compound names)
aa[nchar(aa$V2) == 0, ] <- do.call(rbind, strsplit(aa[nchar(aa$V2) == 0, ]$V1, ' '))
> aa
            V1     V2
1         John    Doe
2      Rebecca  Homes
3 Organization    LLC
EDIT, a better method: I use a regular expression to replace any space with a comma, so that the separator is regular. I assume that you don't have any compound names.
ff <- readLines(textConnection('John, Doe
Rebecca, Homes
Organization LLC'))
do.call(rbind, strsplit(gsub('[ ]|, |,[ ]', ',', ff), ','))
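As a side note, if the tidyverse is an option, tidyr::separate handles the missing-comma case directly via its fill argument; a minimal sketch (rows without a comma keep the full string in the first column and get NA in the second):

library(tidyr)

d <- data.frame(name = c("John, Doe", "Rebecca, Homes", "Organization LLC"),
                stringsAsFactors = FALSE)

# sep is a regex: split on the comma plus any following spaces;
# fill = "right" pads rows without a comma with NA on the right
separate(d, name, into = c("First", "Last"), sep = ",\\s*", fill = "right")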

Transforming character strings in R

I have to merge two data frames in R. The two data frames share a common id variable, the name of the subject. However, the names in one data frame are partly capitalized, while in the other they are in lower case. Furthermore, the names appear in reverse order. Here is a sample from the data frames:
DataFrame1$Name:
"Van Brempt Kathleen"
"Gräßle Ingeborg"
"Gauzès Jean-Paul"
"Winkler Iuliu"
DataFrame2$Name:
"Kathleen VAN BREMPT"
"Ingeborg GRÄSSLE"
"Jean-Paul GAUZÈS"
"Iuliu WINKLER"
Is there a way in R to make these two variables usable as an identifier for merging the data frames?
Best, Thomas
You can use gsub to convert the names around:
> names
[1] "Kathleen VAN BREMPT" "jean-paul GAULTIER"
> gsub("([^\\s]*)\\s(.*)","\\2 \\1",names,perl=TRUE)
[1] "VAN BREMPT Kathleen" "GAULTIER jean-paul"
>
This works by matching first anything up to the first space and then anything after that, and switching them around. Then add tolower() or toupper() if you want, and use match() for joining your data frames.
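A minimal sketch of that match()-based join, with made-up vectors taken from the question; this only works where the case-folded, reordered names agree exactly:

names1 <- c("Van Brempt Kathleen", "Winkler Iuliu")  # DataFrame1 style
names2 <- c("Kathleen VAN BREMPT", "Iuliu WINKLER")  # DataFrame2 style

# swap "First REST" -> "REST First", then compare case-insensitively
swapped <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", names2, perl = TRUE)
idx <- match(tolower(swapped), tolower(names1))
idx
#> [1] 1 2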
Good luck matching Grassle with Graßle though. Lots of other things will probably bite you too, such as people with two first names separated by space, or someone listed with a title!
Barry
Here's a complete solution that combines the two partial methods offered so far (and overcomes the fears expressed by Spacedman about "matching Grassle with Graßle"):
DataFrame2$revname <- gsub("([^\\s]*)\\s(.*)", "\\2 \\1", DataFrame2$Name, perl = TRUE)
DataFrame2$agnum   <- sapply(tolower(DataFrame2$revname), agrep, tolower(DataFrame1$Name))
DataFrame1$num     <- 1:nrow(DataFrame1)
merge(DataFrame1, DataFrame2, by.x = "num", by.y = "agnum")
Output:
  num              Name.x              Name.y             revname
1   1 Van Brempt Kathleen Kathleen VAN BREMPT VAN BREMPT Kathleen
2   2     Gräßle Ingeborg    Ingeborg GRÄSSLE    GRÄSSLE Ingeborg
3   3    Gauzès Jean-Paul    Jean-Paul GAUZÈS    GAUZÈS Jean-Paul
4   4       Winkler Iuliu       Iuliu WINKLER       WINKLER Iuliu
The third step would not be necessary if DataFrame1 had rownames that were still sequentially numbered (as they would be by default). The merge statement would then be:
merge(DataFrame1, DataFrame2, by.x="row.names", by.y="agnum")
--
David.
Can you add an additional column/variable to each data frame which is a lowercase version of the original name:
DataFrame1$NameLower <- tolower(DataFrame1$Name)
DataFrame2$NameLower <- tolower(DataFrame2$Name)
Then perform a merge on this:
MergedDataFrame <- merge(DataFrame1, DataFrame2, by="NameLower")
In addition to the answer using gsub to rearrange the names, you might also want to look at the agrep function, which looks for approximate matches. You can use it with sapply to find the matching rows from one data frame to the other, e.g.:
> sapply(c('newyork', 'NEWJersey', 'Vormont'), agrep, x=state.name, ignore.case=TRUE)
  newyork NEWJersey   Vormont 
       32        30        45 
