Combine cells having similar values in a row - r

I have a data frame like below.
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald trump agreement climate united states
3 donald trump agreement paris united states
4 donald trump agreement united states NA
5 donald trump climate emission united states
6 donald trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united kingdom chicken hen wimp
12 united states agreement paris NA
And I want the result to be a data frame with rows like those below.
For example,
Row 1 should stay as it is, since it doesn't have any similar rows.
Rows 2, 3, 4, 5 and 12 should be combined into a single row like
united states donald trump paris climate agreement emission
And rows 7, 9 and 11 should be combined as
united kingdom chicken hen wimp mustard
The values can be in any order.

Assume the data frame DF shown reproducibly in the Note at the end.
Convert that to a character matrix m. Let us say that two rows are similar if they have more than one element in common, and define is_similar to take two row indexes and return TRUE or FALSE accordingly. Then apply that to every pair of rows using outer. Interpret the result as the adjacency matrix of a graph and calculate its connected components. Split DF by component into a list spl of data frames, one per connected component, and reduce each to its unique non-NA values to give the list L. Finally, rework L into a character matrix.
library(igraph)
m <- as.matrix(DF)
n <- nrow(m)
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))
adj <- graph.adjacency(smat)
cl <- components(adj)$membership
str(split(1:n, cl))
## List of 6
## $ 1: int 1
## $ 2: int [1:5] 2 3 4 5 12
## $ 3: int 6
## $ 4: int [1:3] 7 9 11
## $ 5: int 8
## $ 6: int 10
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
t(do.call("cbind", lapply(L, ts)))
giving:
[,1] [,2] [,3] [,4] [,5] [,6]
1 "application" "android" "ios" NA NA NA
2 "donald_trump" "united_states" "agreement" "climate" "paris" "emission"
3 "donald_trump" "entertainer" "host" "president" NA NA
4 "hen" "pan" "united_kingdom" "chicken" "mustard" "wimp"
5 "husband" "pamela" "private_lives" NA NA NA
6 "sex" "associate" "pamela" "partner" NA NA
Note: The input in reproducible form is:
Lines <- "
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald_trump agreement climate united_states
3 donald_trump agreement paris united_states
4 donald_trump agreement united_states NA
5 donald_trump climate emission united_states
6 donald_trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private_lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united_kingdom chicken hen wimp
12 united_states agreement paris NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Update: Fixed similarity definition.
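To make the "more than one element in common" rule concrete, here is a small base R sketch that counts shared values for two pairs of rows from the input. The helper name n_shared is made up for illustration; it repeats the intersect/na.omit logic used inside is_similar.

```r
# Rows copied from the reproducible input in the Note
row8  <- c("husband", "pamela", "private_lives", NA)
row10 <- c("sex", "associate", "pamela", "partner")
row2  <- c("donald_trump", "agreement", "climate", "united_states")
row12 <- c("united_states", "agreement", "paris", NA)

# Hypothetical helper: number of non-NA values two rows have in common
n_shared <- function(a, b) length(intersect(na.omit(a), na.omit(b)))

n_shared(row8, row10)  # 1: only "pamela" is shared, so the rows stay separate
n_shared(row2, row12)  # 2: "agreement" and "united_states", so the rows are linked
```

Rows that are not directly similar can still end up in the same group when a chain of similar rows connects them, which is what the connected components step handles.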

Related

How can I alter the values of certain rows in a column, based on a condition from another column in a dataframe, using the ifelse function?

So I have this first dataframe (fish18) which consists of data on fish specimens, and a column "grade" that is to be filled with conditions in an ifelse function.
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo NA 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India NA 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa NA 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa NA 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa NA 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States NA 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
And after filling the grade column I have something like this (fish19)
species BIN collectors country grade species_frequency
1 Poecilothrissa congica BOLD:AAF7519 mljs et al, Democratic Republic of the Congo D 2
2 Acanthurus triostegus BOLD:AAA9362 Vinothkumar S, Kaleshkumar K and Rajaram R. India A 54
3 Pseudogramma polyacantha BOLD:AAC5137 Allan D. Connell South Africa C 15
4 Pomadasys commersonnii BOLD:AAD1338 Allan D. Connell South Africa A 12
5 Secutor insidiator BOLD:AAB2487 Allan D. Connell South Africa E 18
6 Sebastes macdonaldi BOLD:AAJ7419 Merit McCrea United States B 3
BIN_per_species collector_per_species countries_per_species species_per_bin
1 2 1 1 1
2 1 21 15 1
3 3 6 6 1
4 1 2 1 1
5 4 5 4 2
6 1 1 1 1
Both dataframes have many specimens belonging to the same species of fish, and the grades are supposed to be assigned per species, so every specimen of a species should get the same grade. The problem I'm having is that some rows belonging to the same species end up with different grades, especially in the case of the grades "C" and "E". What I want to incorporate into my ifelse function is: whenever a species has "C" in one row and "E" in another row, change the "C" to "E". If one species has grade "E", every other row with that species name should also have grade "E".
So far I've tried the %in% function and just using "=="
Trying with %in%
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]%in%fish19$species[fish19$grade=="C"]==TRUE,"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Trying with "=="
assign_grades=function(fish18){
fish19<-fish18 %>%
mutate(grade = ifelse(species_frequency<3,"D",ifelse(BIN_per_species==1 & (collector_per_species>1 | countries_per_species>1),"A",ifelse(BIN_per_species==1 & collector_per_species==1 | countries_per_species==1,"B",ifelse(BIN_per_species>1 & species_per_bin==1,"C",ifelse(species_per_bin>1,"E",ifelse(fish19$species[fish19$grade=="E"]==fish19$species[fish19$grade=="C"],"E",NA))) ))))
assign('fish19',fish19,envir=.GlobalEnv)
}
assign_grades(fish18)
Neither of these options worked. The desired output is that if one occurrence of a specific species name has the grade "E" assigned to it, so do all other occurrences with that same species name.
I'm sorry if this was confusing, but I tried to be as clear as I could. Thank you in advance for any responses.
Kind of a long-winded answer, but:
library(dplyr)
dat = data.frame('species'=c('a','b','c','a','a','b'),'grade'=c('E','E','C','C','C','D'))
# Count the "E" grades per species and attach that count to every row
dat %>% left_join(dat %>%
group_by(species) %>%
summarize(sum_e = sum(grade=='E')),by='species')
Then you could use an ifelse on sum_e > 0 to set the grade.
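A minimal sketch of that last step, assuming the goal is to upgrade "C" to "E" whenever the species has at least one "E" row. The name fixed is made up for illustration, and the toy data is the same as above.

```r
library(dplyr)

dat <- data.frame(species = c("a", "b", "c", "a", "a", "b"),
                  grade   = c("E", "E", "C", "C", "C", "D"),
                  stringsAsFactors = FALSE)

# Count "E" grades per species, attach the count to every row,
# then upgrade "C" to "E" wherever the species has at least one "E"
fixed <- dat %>%
  left_join(dat %>%
              group_by(species) %>%
              summarize(sum_e = sum(grade == "E")),
            by = "species") %>%
  mutate(grade = ifelse(sum_e > 0 & grade == "C", "E", grade)) %>%
  select(-sum_e)

fixed$grade
# "E" "E" "C" "E" "E" "D"
```

If every row of a species with an "E" should become "E" regardless of its current grade, dropping the grade == "C" part of the condition would do that.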

How can I overcome this error Error in tbl_vars(y) : argument "y" is missing, with no default?

I am trying to perform an inner join on 2 tables.
One is a hotel dataset which I have tokenized before using
df1 = read.csv("chennai.csv", header = TRUE, stringsAsFactors=FALSE)
library(dplyr)
library(tidytext)
hotel <- df1 %>% unnest_tokens(word,Review_Text)
data("stop_words")
hotel <- hotel %>%
anti_join(stop_words)
head(hotel)
Hotel_name Review_Title Sentiment
1 Accord Metropolitan Excellent comfortableness during stay 3
2 Accord Metropolitan Excellent comfortableness during stay 3
3 Accord Metropolitan Excellent comfortableness during stay 3
4 Accord Metropolitan Excellent comfortableness during stay 3
5 Accord Metropolitan Excellent comfortableness during stay 3
6 Accord Metropolitan Not too comfortable 1
Rating_Percentage X X.1 X.2 X.3 word
1 100 NA NA NA nice
2 100 NA NA NA stay
3 100 NA NA NA business
4 100 NA NA NA tourist
5 100 NA NA NA purpose
6 20 NA NA NA hotel
I have also used a simplified version of General Inquirer Dictionary spreadsheet
df <- read.csv("ib.csv", header=T, stringsAsFactors=FALSE)
dat <-subset(df, select=c(2,1))
head(dat)
word Scoree
1 A
2 ABANDON Negativ
3 ABANDONMENT Negativ
4 ABATE Negativ
5 ABATEMENT
6 ABDICATE Negativ
I have tried to do an inner_join where I encounter this error.
observation<- hotel %>%
+ inner_join(dat, by = "word") %>%
+ count(Scoree)
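One plausible cause, assuming the + characters at the start of the last two lines were copied from the console's continuation prompt rather than typed on purpose: with them in place the pipe ends up evaluating inner_join(dat, by = "word") on its own, with dat as the first table and no second table, which is the situation the error message describes. A minimal sketch without the prompts follows; hotel_demo and dat_demo are small made-up stand-ins for the two tables, and their contents are invented.

```r
library(dplyr)

# Made-up stand-ins for the tokenized reviews and the dictionary
hotel_demo <- data.frame(word = c("nice", "stay", "abandon", "abate"),
                         stringsAsFactors = FALSE)
dat_demo <- data.frame(word   = c("ABANDON", "ABATE", "NICE"),
                       Scoree = c("Negativ", "Negativ", "Positiv"),
                       stringsAsFactors = FALSE)

# unnest_tokens() lowercases tokens by default, so lowercase the dictionary keys too
dat_demo$word <- tolower(dat_demo$word)

# Same pipeline as in the question, without the leading "+"
observation <- hotel_demo %>%
  inner_join(dat_demo, by = "word") %>%
  count(Scoree)
observation
```

The lowercasing step is included because the dictionary shown above stores words in upper case while the tokenized reviews are lower case, so an exact join on word would otherwise find few or no matches.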

Weighting a String Distance Metric based on regular expressions

Is it possible to weight a string distance metric such as the Damerau-Levenshtein distance where the weight changes based on the character type?
I am looking to create a fuzzy match of addresses and need to weight numbers and letters differently so that an address like:
"5 James Street" and "5 Jmaes Street" are considered identical and
"5 James Street" and "6 James Street" are considered different.
I considered splitting the addresses into numbers and letters prior to applying the string distance however this will miss flats at "5a" and "5b". The ordering is also not consistent amongst the data set so one entry may be "James Street 5".
I am using R with the stringdist package currently but not restricted to these.
Thanks!
Here's an idea. It involves a bit of manual processing but it might be a good starting point. First, we compute the approximate string distance between the addresses using adist() (or stringdist() with the best suited method to your data) without paying attention to street numbers.
m <- adist(v)
rownames(m) <- v
> m
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#5 James Street 0 2 3 1 4 17 17
#5 Jmaes Street 2 0 4 3 6 17 17
#5#Jam#es Str$eet 3 4 0 4 6 17 17
#6 James Street 1 3 4 0 4 17 17
#James Street 5 4 6 6 4 0 16 17
#10a Cold Winter Road 17 17 17 17 16 0 1
#10b Cold Winter Road 17 17 17 17 17 1 0
In this case, we can clearly identify the two clusters, but we could also use hclust() to visualize a dendrogram.
cl <- hclust(as.dist(m))
plot(cl)
rect.hclust(cl, 2)
Then, we tag each street to its corresponding cluster of similarities, iterate through them and check for matching street numbers.
library(dplyr)
res <- data.frame(cluster = cutree(cl, 2)) %>%
  tibble::rownames_to_column("address") %>%
  mutate(
    # Extract all components of the address
    lst = stringi::stri_extract_all_words(address),
    # Identify the component containing the street number and return it
    num = purrr::map_chr(lst, .f = ~ grep("\\d+", .x, value = TRUE))) %>%
  # For each cluster, tag matching street numbers
  mutate(group = group_indices_(., .dots = c("cluster", "num")))
Which gives:
# address cluster lst num group
#1 5 James Street 1 5, James, Street 5 1
#2 5 Jmaes Street 1 5, Jmaes, Street 5 1
#3 5#Jam#es Str$eet 1 5, Jam, es, Str, eet 5 1
#4 6 James Street 1 6, James, Street 6 2
#5 James Street 5 1 James, Street, 5 5 1
#6 10a Cold Winter Road 2 10a, Cold, Winter, Road 10a 3
#7 10b Cold Winter Road 2 10b, Cold, Winter, Road 10b 4
You could then pull() the unique addresses based on group using distinct():
> distinct(res, group, .keep_all = TRUE) %>% pull(address)
#[1] "5 James Street" "6 James Street" "10a Cold Winter Road"
# "10b Cold Winter Road"
Data
v <- c("5 James Street", "5 Jmaes Street", "5#Jam#es Str$eet", "6 James Street",
"James Street 5", "10a Cold Winter Road", "10b Cold Winter Road")
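A simpler pairwise variant of the same idea can be sketched in base R: require the number tokens to match exactly and allow a few edits in the remaining text. This is only a sketch under assumptions. The helpers split_address and same_address and the max_edits threshold of 2 are made up for illustration, and tokens are sorted so that word order does not matter.

```r
# Hypothetical helper: split an address into number tokens (e.g. "5", "10a")
# and the remaining text, ignoring case, punctuation and word order
split_address <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  has_digit <- grepl("[0-9]", tokens)
  list(num  = sort(tokens[has_digit]),
       text = paste(sort(tokens[!has_digit]), collapse = " "))
}

# Hypothetical helper: number tokens must be identical,
# the text part may differ by at most max_edits edits
same_address <- function(a, b, max_edits = 2) {
  pa <- split_address(a)
  pb <- split_address(b)
  identical(pa$num, pb$num) &&
    drop(adist(pa$text, pb$text)) <= max_edits
}

same_address("5 James Street", "5 Jmaes Street")              # TRUE
same_address("5 James Street", "6 James Street")              # FALSE
same_address("5 James Street", "James Street 5")              # TRUE
same_address("10a Cold Winter Road", "10b Cold Winter Road")  # FALSE
```

Because a token such as "10a" is kept whole, flats "10a" and "10b" are treated as different, which is the case that plain splitting into digits and letters would miss.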

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country Player Goals
"USA" "Tim" 0
"USA" "Tim" 0
"USA" "Dempsey" 3
"USA" "Dempsey" 5
"Brasil" "Neymar" 6
"Brasil" "Neymar" 2
"Brasil" "Hulk" 5
"Brasil" "Luiz" 2
"England" "Rooney" 4
"England" "Stewart" 2
Each row represents the number of goals that a player scored per game, and also contains that player's country. I would like to have the data in the form such that I can run pairwise correlations to see whether being from the same country has some association with the number of goals that a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$player)
gets me the number of goals per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player
Player Country Goals
1 Dempsey USA 8
2 Hulk Brasil 5
3 Luiz Brasil 2
4 Neymar Brasil 8
5 Rooney England 4
6 Stewart England 2
7 Tim USA 0
Then, using good old merge, we join it to itself based on country, and then so we don't get each row twice (Dempsey, Tim and Tim, Dempsey---not to mention Dempsey, Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
Country Player.x Goals.x Player.y Goals.y
2 Brasil Hulk 5 Luiz 2
3 Brasil Hulk 5 Neymar 8
6 Brasil Luiz 2 Neymar 8
11 England Rooney 4 Stewart 2
15 USA Dempsey 8 Tim 0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
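For comparison, the dplyr version with a renamed copy might look like the sketch below. The names totals, copy, pairs, Player_2 and Goals_2 are made up for illustration, and the totals are typed in from the summarized table above.

```r
library(dplyr)

# Per-player totals, as in the summarized table above
totals <- data.frame(
  Player  = c("Dempsey", "Hulk", "Luiz", "Neymar", "Rooney", "Stewart", "Tim"),
  Country = c("USA", "Brasil", "Brasil", "Brasil", "England", "England", "USA"),
  Goals   = c(8, 5, 2, 8, 4, 2, 0),
  stringsAsFactors = FALSE
)

# Renamed copy, so that only Country is shared between the two tables
copy <- rename(totals, Player_2 = Player, Goals_2 = Goals)

# Join on Country, then keep each unordered pair of players once
pairs <- inner_join(totals, copy, by = "Country") %>%
  filter(Player < Player_2)
pairs

# Correlation between the two goal columns across all same-country pairs
cor(pairs$Goals, pairs$Goals_2)
```

Recent dplyr versions may warn that this join is many-to-many, which is expected here because every player is deliberately matched with every teammate.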
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A<-aggregate(df$Goals~df$Player+df$Country,data=df,sum)
players_in_c<-table(A[,2])
dat<-NULL
for(i in levels(df$Country)) {
  count<-players_in_c[i]
  pair<-combn(count,m=2)
  B<-A[A[,2]==i,]
  dat<-rbind(dat, cbind(B[pair[1,],],B[pair[2,],]) )
}
dat
> dat
df$Player df$Country df$Goals df$Player df$Country df$Goals
1 Hulk Brasil 5 Luiz Brasil 2
1.1 Hulk Brasil 5 Neymar Brasil 8
2 Luiz Brasil 2 Neymar Brasil 8
4 Rooney England 4 Stewart England 2
6 Dempsey USA 8 Tim USA 0

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one with another, for example var1 and var2. Then there are some features that may associate one with another, but they must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observation varies. For example, id1 and id2 are the same company redundantly entered into the data twice. In one place its name is "apples" and another "red apples". They share the same location, var1 value and var3 (after adjusting for formatting). Similarly ids 3, 5 and 6, are also really just one company, though much of the input for each is different. Some clusters would identify multiple observations, others would only have one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether the name is a substring of another name
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance=6){
  out <- integer(nrow(df))
  for(row in 1:nrow(df)){
    dists <- numeric(nrow(df))
    for(col in 1:ncol(df)){
      tempDist <- stringdist(df[row, col], df[ , col], method="lv")
      # WARNING: Matches NA perfectly.
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist
    }
    dists[row] <- Inf
    min_dist <- min(dists)
    if(min_dist < tolerance){
      out[row] <- which.min(dists)
    }
    else{
      out[row] <- row
    }
  }
  return(out)
}
test$Result <- getMatches(test[, -1])
Here test is your data. This creates a column with the index of the closest match. If it can't find a match within the given tolerance, it returns the row's own index. It definitely needs some refining and certainly needs some postprocessing.
EDIT: I will attempt some more later.
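The rule-based conditions listed in the question (same location, no conflicting var3, name substring or small edit distance) can also be sketched directly as a pairwise test followed by grouping of linked rows. This is a base R sketch under assumptions, not a tuned solution: is_same_company, max_edits = 2 and the five-row subset are chosen for illustration, and the var1/var2 comparison is left out.

```r
# Five rows taken from the example data (ids 1, 2, 3, 6 and 20)
companies <- data.frame(
  id       = c(1, 2, 3, 6, 20),
  name     = c("apples", "red apples", "green apples", "green apple", "fuji apples"),
  location = c("US", "US", "Mexico", "Mexico", "US"),
  var3     = c("12345", "12-345", "235-92", NA, "189-35"),
  stringsAsFactors = FALSE
)

# Hypothetical pairwise rule combining conditions 1 to 4 from the question
is_same_company <- function(i, j, d = companies, max_edits = 2) {
  v3 <- gsub("[^0-9]", "", d$var3[c(i, j)])   # strip formatting; NA stays NA
  same_location <- d$location[i] == d$location[j]
  var3_conflict <- !anyNA(v3) && v3[1] != v3[2]
  name_nested <- grepl(d$name[i], d$name[j], fixed = TRUE) ||
    grepl(d$name[j], d$name[i], fixed = TRUE)
  name_close <- drop(adist(d$name[i], d$name[j])) <= max_edits
  same_location && !var3_conflict && (name_nested || name_close)
}

# Pairwise match matrix (every row matches itself)
n <- nrow(companies)
adj <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) is_same_company(i, j)))

# Group linked rows: repeatedly give each row the smallest label among its matches
label <- seq_len(n)
repeat {
  new_label <- vapply(seq_len(n), function(i) min(label[adj[i, ]]), integer(1))
  if (identical(new_label, label)) break
  label <- new_label
}

# Use the id of the first row in each group as the key
companies$Result <- companies$id[label]
companies$Result
# 1 1 3 3 20
```

Writing the rule as a function of two row indexes keeps each condition easy to swap or extend, for example by adding a var1/var2 comparison as a further term.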

Resources