Weighting a String Distance Metric based on regular expressions - r

Is it possible to weight a string distance metric, such as the Damerau-Levenshtein distance, so that the weight changes based on the character type?
I am looking to create a fuzzy match of addresses and need to weight numbers and letters differently so that an address like:
"5 James Street" and "5 Jmaes Street" are considered identical and
"5 James Street" and "6 James Street" are considered different.
I considered splitting the addresses into numbers and letters prior to applying the string distance; however, this will miss flats at "5a" and "5b". The ordering is also not consistent across the data set, so one entry may be "James Street 5".
I am using R with the stringdist package currently but not restricted to these.
Thanks!

Here's an idea. It involves a bit of manual processing, but it might be a good starting point. First, we compute the approximate string distance between the addresses using adist() (or stringdist() with the method best suited to your data) without paying attention to street numbers.
m <- adist(v)
rownames(m) <- v
> m
#                       [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# 5 James Street           0    2    3    1    4   17   17
# 5 Jmaes Street           2    0    4    3    6   17   17
# 5#Jam#es Str$eet         3    4    0    4    6   17   17
# 6 James Street           1    3    4    0    4   17   17
# James Street 5           4    6    6    4    0   16   17
# 10a Cold Winter Road    17   17   17   17   16    0    1
# 10b Cold Winter Road    17   17   17   17   17    1    0
In this case, we can clearly identify the two clusters, but we could also use hclust() to visualize a dendrogram.
cl <- hclust(as.dist(m))
plot(cl)
rect.hclust(cl, 2)
Then, we assign each address to its corresponding cluster of similarities, iterate through them and check for matching street numbers.
library(dplyr)
res <- data.frame(cluster = cutree(cl, 2)) %>%
  tibble::rownames_to_column("address") %>%
  mutate(
    # Extract all components of the address
    lst = stringi::stri_extract_all_words(address),
    # Identify the component containing the street number and return it
    num = purrr::map_chr(lst, .f = ~ grep("\\d+", .x, value = TRUE))) %>%
  # For each cluster, tag matching street numbers
  mutate(group = group_indices_(., .dots = c("cluster", "num")))
Which gives:
#                address cluster                     lst num group
# 1       5 James Street       1        5, James, Street   5     1
# 2       5 Jmaes Street       1        5, Jmaes, Street   5     1
# 3     5#Jam#es Str$eet       1    5, Jam, es, Str, eet   5     1
# 4       6 James Street       1        6, James, Street   6     2
# 5       James Street 5       1        James, Street, 5   5     1
# 6 10a Cold Winter Road       2 10a, Cold, Winter, Road 10a     3
# 7 10b Cold Winter Road       2 10b, Cold, Winter, Road 10b     4
You could then pull() the unique addresses based on group using distinct():
> distinct(res, group, .keep_all = TRUE) %>% pull(address)
#[1] "5 James Street" "6 James Street" "10a Cold Winter Road"
# "10b Cold Winter Road"
Data
v <- c("5 James Street", "5 Jmaes Street", "5#Jam#es Str$eet", "6 James Street",
"James Street 5", "10a Cold Winter Road", "10b Cold Winter Road")

Related

put the resulting values from for loop into a table in r [duplicate]

I'm trying to calculate the total number of matches played by each team in the year 2019 and put them in a table along with the corresponding team names.
teams <- c("Sunrisers Hyderabad", "Mumbai Indians", "Gujarat Lions", "Rising Pune Supergiants",
           "Royal Challengers Bangalore", "Kolkata Knight Riders", "Delhi Daredevils",
           "Kings XI Punjab", "Deccan Chargers", "Rajasthan Royals", "Chennai Super Kings",
           "Kochi Tuskers Kerala", "Pune Warriors", "Delhi Capitals", " Gujarat Lions")
for (j in teams) {
  print(j)
  ipl_table %>%
    filter(season == 2019 & (team1 == j | team2 == j)) %>%
    summarise(match_count = n()) -> kl
  print(kl)
  match_played <- data.frame(Teams = teams, Match_count = kl)
}
The match count for the last team (i.e. Gujarat Lions) is 0, and it's filling 0s for all the other teams as well.
The output match_played can be found on the link given below.
I'd be really glad if someone could help me regarding this error as I'm very new to R.
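A note on why the zeros appear: kl is overwritten on every pass, and match_played is rebuilt inside the loop from the full teams vector with that single (last) count recycled across all rows. Keeping the loop but collecting the counts into a vector first would fix it, e.g.:
library(dplyr)
match_count <- numeric(length(teams))
for (j in seq_along(teams)) {
  # count 2019 matches where the team appears on either side
  match_count[j] <- ipl_table %>%
    filter(season == 2019 & (team1 == teams[j] | team2 == teams[j])) %>%
    nrow()
}
match_played <- data.frame(Teams = teams, Match_count = match_count)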
Filter for the particular season, get the data in long format, and then count the number of matches.
library(dplyr)
matches %>%
filter(season == 2019) %>%
tidyr::pivot_longer(cols = c(team1, team2), values_to = 'team_name') %>%
count(team_name) -> result
result
#   team_name                       n
#   <chr>                       <int>
# 1 Chennai Super Kings            17
# 2 Delhi Capitals                 16
# 3 Kings XI Punjab                14
# 4 Kolkata Knight Riders          14
# 5 Mumbai Indians                 16
# 6 Rajasthan Royals               14
# 7 Royal Challengers Bangalore    14
# 8 Sunrisers Hyderabad            15
Here is an example
library(tidyr)
df_2019 <- matches[matches$season == 2019, ] # get the season you need
df_long <- gather(df_2019, Team_id, Team_Name, team1:team2) # Make it long format
final_count <- data.frame(t(table(df_long$Team_Name)))[-1] # count the number of matches
names(final_count) <- c("Team", "Matches")
                         Team Matches
1         Chennai Super Kings      17
2              Delhi Capitals      16
3             Kings XI Punjab      14
4       Kolkata Knight Riders      14
5              Mumbai Indians      16
6            Rajasthan Royals      14
7 Royal Challengers Bangalore      14
8         Sunrisers Hyderabad      15
Or by using base R
final_count <- data.frame(t(table(c(df_2019$team1, df_2019$team2))))[-1]
names(final_count) <- c("Team", "Matches")
final_count
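One caveat with both approaches (an untested sketch follows): teams that played no 2019 matches drop out of the result entirely, whereas the original loop at least intended one row per team. Joining the counts (here, the result tibble from the first answer) back onto the teams vector restores the zeros:
all_teams <- data.frame(team_name = unique(trimws(teams)))  # trim the stray " Gujarat Lions"
match_played <- merge(all_teams, result, by = "team_name", all.x = TRUE)
match_played$n[is.na(match_played$n)] <- 0  # teams with no matches get 0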

How to compare two strings word by word in R

I have a dataset, let's call it "ORIGINALE", composed of several rows and only two columns, the first called "DESCRIPTION" and the second "CODICE". The description column has the right information, while the codice column, which is the key, is almost always empty, so I'm trying to look up the corresponding codice in another dataset, let's call it "REFERENCE". I take the description column, which is in natural language, and try to match it with the description in the second dataset. I have to match word by word, since there may be a different order of words, synonyms or abbreviations. Then I calculate the similarity score, keep only the best match, and accept those above a certain score. Is there a way to improve it? I'm working with around 300000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even just slightly faster.
ORIGINALE <- data.frame(
  DESCRIPTION = c("mr peter 123 rose street 3b LA", " 4c flower str jenny jane Chicago",
                  "washington miss sarah 430f name strt"),
  CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(
  DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA",
                  "jenny jane flower street 4c Chicago"),
  CODICE = c("135tg67", "aw56", "83776250"))
algoritmo <- function(x, y) {
  split1 <- strsplit(x$DESCRIPTION, " ")
  split2 <- strsplit(y$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- y$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  confronto <<- tibble(x$DESCRIPTION, risultato, distanza)
}
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)
The R tm (text mining) library can help here:
library(tm)
library(proxy) # for computing cosine similarity
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)
# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'
# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))
# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)
# convert to a regular matrix
dtm = as.matrix(dtm)
# look at it (t() transpose for readability)
t(dtm)
            Docs
Terms        1 2 3 4 5 6
  123        1 0 0 0 1 0
  peter      1 0 0 0 1 0
  rose       1 0 0 0 1 0
  street     1 0 0 1 1 1
  chicago    0 1 0 0 0 1
  flower     0 1 0 0 0 1
  jane       0 1 0 0 0 1
  jenny      0 1 0 0 0 1
  str        0 1 0 0 0 0
  430f       0 0 1 1 0 0
  miss       0 0 1 0 0 0
  name       0 0 1 1 0 0
  sarah      0 0 1 1 0 0
  strt       0 0 1 0 0 0
  washington 0 0 1 1 0 0
  brown      0 0 0 1 0 0
  green      0 0 0 0 1 0
# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")
# result (rows = ORIGINALE documents, columns = REFERENCE documents):
          4         5         6
1 0.7958759 0.1055728 0.7763932   <-- difference (smaller = more similar)
2 1.0000000 1.0000000 0.2000000
3 0.3333333 1.0000000 1.0000000
# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
  apply(similarity, 1, function(x) {
    data.table(i = which.min(x), distance = min(x))
  })
)
# result:
   i  distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i = 2, 3, 1 are the matching rows of REFERENCE
# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)
# result:
i DESCRIPTION.x CODICE.x distance DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt NA 0.3333333 sarah brown name street 430f washington 135tg67
2: 2 mr peter 123 rose street 3b LA NA 0.1055728 peter green 123 rose street 3b LA aw56
3: 3 4c flower str jenny jane Chicago NA 0.2000000 jenny jane flower street 4c Chicago 83776250
# note: the rows of ORIGINALE2 are now in a different order than ORIGINALE.
# this is caused by merging by i (= REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
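Assembled, with the merge call from above spelled out, that workaround reads:
ORIGINALE1[, ORIGINALE_i := 1:.N]              # remember the original row order
ORIGINALE2 = merge(ORIGINALE1, REFERENCE, by = 'i', all.x = T, all.y = F)
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]    # restore it after the merge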
Good question. for loops are slow in R:
for(i in 1:NROW(split1)) {
  for(j in 1:NROW(split2)) {
For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use its successor, data.table.
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']
# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)
# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]
# keep only the highest N CODICE.y for each DESCRIPTION.x
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]
# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)
Here is result:
DESCRIPTION CODICE DESCRIPTION.y CODICE.y N
1: 4c flower str jenny jane Chicago NA jenny jane flower street 4c Chicago 83776250 5
2: mr peter 123 rose street 3b LA NA peter green 123 rose street 3b LA aw56 6
3: washington miss sarah 430f name strt NA sarah brown name street 430f washington 135tg67 4
PS: There are more memory-efficient ways to do this, and this code could crash your machine with an out-of-memory error or run slowly by spilling into virtual memory; but if not, it should be faster than the for loops.
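On the memory point, one option is to skip as.matrix() and work with a sparse matrix instead (a sketch, assuming the Matrix package; tm's DocumentTermMatrix is a slam triplet matrix underneath, so its i/j/v slots can be reused directly):
library(Matrix)
dtm <- DocumentTermMatrix(corpus)    # recompute; the dense copy above overwrote it
sm <- sparseMatrix(i = dtm$i, j = dtm$j, x = dtm$v,
                   dims = c(dtm$nrow, dtm$ncol))
sm <- Diagonal(x = 1 / sqrt(rowSums(sm^2))) %*% sm   # L2-normalize each document
cos_dist <- 1 - as.matrix(tcrossprod(sm[1:3, ], sm[4:6, ]))  # same scale as proxy::dist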
What about:
library(stringdist)
library(dplyr)
library(tidyr)
data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o, data_r)
data %>%
  mutate(dist = stringsim(as.character(desc_o), as.character(desc_r))) %>%
  group_by(desc_o) %>%
  filter(dist == max(dist))
desc_o desc_r dist
<chr> <chr> <dbl>
1 " 4c flower str jenny jane Chicago" jenny jane flower street 4c Chicago 0.486
2 "mr peter 123 rose street 3b LA" peter green 123 rose street 3b LA 0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385

Combine cells having similar values in a row

I have a data frame like below.
   New_ment1_1    New_ment1_2 New_ment1_3   New_ment1_4
1  application    android     ios           NA
2  donald trump   agreement   climate       united states
3  donald trump   agreement   paris         united states
4  donald trump   agreement   united states NA
5  donald trump   climate     emission      united states
6  donald trump   entertainer host          president
7  hen            chicken     mustard       wimp
8  husband        pamela      private lives NA
9  pan            chicken     hen           wimp
10 sex            associate   pamela        partner
11 united kingdom chicken     hen           wimp
12 united states  agreement   paris         NA
And I want the result as a data frame with rows like below.
For example,
Row 1 should stay as it is, since it doesn't have any similar rows.
Rows 2, 3, 4, 5 and 12 should be combined into the same row, like
united states donald trump paris climate agreement emission
And rows 7, 9 and 11 should be combined as
united kingdom chicken hen wimp mustard
It can be in any order.
Assume the data frame DF shown reproducibly in the Note at the end.
Convert that to a character matrix m. Let us say that two rows are similar if they have more than one element in common, and define is_similar to take two row indexes and return TRUE or FALSE accordingly. Then apply that to every pair of rows using outer. Interpret the result as the adjacency matrix of a graph and calculate its connected components, splitting DF into a list L, each of whose elements is a data frame of the rows from DF that constitute one connected component. Finally, rework L into a character matrix.
library(igraph)
m <- as.matrix(DF)
n <- nrow(m)
is_similar <- function(i, j) length(intersect(na.omit(m[i, ]), na.omit(m[j, ]))) > 1
smat <- outer(1:n, 1:n, Vectorize(is_similar))
adj <- graph.adjacency(smat)
cl <- components(adj)$membership
str(split(1:n, cl))
## List of 6
## $ 1: int 1
## $ 2: int [1:5] 2 3 4 5 12
## $ 3: int 6
## $ 4: int [1:3] 7 9 11
## $ 5: int 8
## $ 6: int 10
spl <- split(DF, cl)
L <- lapply(spl, function(x) na.omit(unique(unlist(x))))
t(do.call("cbind", lapply(L, ts)))
giving:
  [,1]           [,2]            [,3]             [,4]        [,5]      [,6]
1 "application"  "android"       "ios"            NA          NA        NA
2 "donald_trump" "united_states" "agreement"      "climate"   "paris"   "emission"
3 "donald_trump" "entertainer"   "host"           "president" NA        NA
4 "hen"          "pan"           "united_kingdom" "chicken"   "mustard" "wimp"
5 "husband"      "pamela"        "private_lives"  NA          NA        NA
6 "sex"          "associate"     "pamela"         "partner"   NA        NA
Note: The input in reproducible form is:
Lines <- "
New_ment1_1 New_ment1_2 New_ment1_3 New_ment1_4
1 application android ios NA
2 donald_trump agreement climate united_states
3 donald_trump agreement paris united_states
4 donald_trump agreement united_states NA
5 donald_trump climate emission united_states
6 donald_trump entertainer host president
7 hen chicken mustard wimp
8 husband pamela private_lives NA
9 pan chicken hen wimp
10 sex associate pamela partner
11 united_kingdom chicken hen wimp
12 united_states agreement paris NA"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
Update: Fixed similarity definition.

Loading tables in R with space separation

How to load a space-separated table with spaces inside the fields?
Simple case data:
Grade Area School Goals
4 Rural Elm Popular
4 Rural Elm Sports
4 Rural Elm Grades
4 Rural Elm Popular
3 Rural Brentwood Elementary Sports
3 Suburban Ridge Popular
Notice how one of the schools has a space in its name ("Brentwood Elementary" as opposed to "Elm").
The following call fails with: "line x did not have y elements"
dat = read.table("dat.txt",header=TRUE)
Edit:
The data points are all factors and contain a fixed set of values.
Edit: full data available through http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html
Thanks to @AnandaMahto
Actually, if you can use the data source Ananda found, it's pretty easy since the <pre> area is tab delimited:
library(rvest)
pg <- html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")
dat <- pg %>% html_nodes("pre") %>% html_text()
dat <- read.table(text=dat, sep="\t", header=TRUE, stringsAsFactors=FALSE)
dat[245:249,]
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 245 girl 4 9 White Rural Sand Grades 1 3 2 4
## 246 girl 4 9 White Rural Sand Sports 3 2 1 4
## 247 girl 4 9 White Rural Sand Sports 3 2 1 4
## 248 girl 4 9 White Rural Sand Grades 2 1 3 4
## 249 girl 6 12 White Rural Brown Middle Popular 4 2 1 3
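(A side note: html() has since been deprecated in rvest; read_html() is the drop-in replacement, so today the scrape would start with:)
pg <- read_html("http://lib.stat.cmu.edu/DASL/Datafiles/PopularKids.html")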
To actually answer your question (this is a bit like Ananda's answer), you'll need to know where the problem column is and work around it. This one uses gsubfn and pre-defined values for that column to join the school names into single tokens, then split them back apart afterwards:
library(gsubfn)
# awful.txt is here https://gist.github.com/hrbrmstr/13cee15c91fdadb10fbc
lines <- readLines("awful.txt")
schools <- c("Brentwood Elementary", "Brentwood Middle", "Brown Middle", "Westdale Middle")
expr <- paste("(", paste(schools, collapse="|"), ")", sep="")
lines <- gsubfn(expr, function(x) { gsub(" ", "_", x) }, lines)
dat <- read.table(text=paste(lines, sep="", collapse="\n"),
                  header=TRUE, stringsAsFactors=FALSE)
dat$School <- gsub("_", " ", dat$School)
dat[c(1,34,94,198,255,324,377,433),]
## Gender Grade Age Race Urban.Rural School Goals Grades Sports Looks Money
## 1 boy 5 11 White Rural Elm Sports 1 2 4 3
## 34 boy 4 10 White Suburban Brentwood Elementary Grades 2 1 3 4
## 94 girl 6 11 White Suburban Brentwood Middle Grades 3 4 1 2
## 198 boy 5 10 White Rural Ridge Sports 4 2 1 3
## 255 girl 6 12 Other Rural Brown Middle Grades 3 2 1 4
## 324 boy 4 9 Other Urban Main Grades 4 1 3 2
## 377 boy 4 9 White Urban Portage Popular 4 1 2 3
## 433 girl 6 11 White Urban Westdale Middle Popular 4 2 1 3
Unfortunately, the answer to this question is pretty much "It depends on how much you know about the data set."
For instance, in the description for the dataset, it specifies the possible values for each variable. Here, we know that there are only a few schools with multi-word names, and that these follow a predictable pattern of "Elementary" and "Middle".
As such, you could read the data in using readLines and figure out the least obtrusive way to insert a delimiter before re-reading the data with read.table.
Here's an example:
Sample data:
cat("Grade Area School Goals Value",
"4 Rural Elm Popular 1",
"4 Rural Elm Sports 2",
"4 Rural Elm Grades 1",
"4 Rural Elm Popular 3",
"3 Rural Brentwood Elementary Sports 4",
"3 Rural Brentwood Middle Grades 3",
"3 Suburban Ridge Popular 3", sep = "\n", file = "test.txt")
Read it in as a character vector:
x <- readLines("test.txt")
Use gsub to force the multi-word school names to become a single word (separated by an underscore). Then, use read.table to get your data.frame.
read.table(text = gsub(" (Elementary|Middle)", "_\\1", x), header = TRUE)
# Grade Area School Goals Value
# 1 4 Rural Elm Popular 1
# 2 4 Rural Elm Sports 2
# 3 4 Rural Elm Grades 1
# 4 4 Rural Elm Popular 3
# 5 3 Rural Brentwood_Elementary Sports 4
# 6 3 Rural Brentwood_Middle Grades 3
# 7 3 Suburban Ridge Popular 3

Arrange dataframe for pairwise correlations

I am working with data in the following form:
Country   Player    Goals
"USA"     "Tim"     0
"USA"     "Tim"     0
"USA"     "Dempsey" 3
"USA"     "Dempsey" 5
"Brasil"  "Neymar"  6
"Brasil"  "Neymar"  2
"Brasil"  "Hulk"    5
"Brasil"  "Luiz"    2
"England" "Rooney"  4
"England" "Stewart" 2
Each row represents the number of goals a player scored in one game, along with that player's country. I would like to get the data into a form where I can run pairwise correlations, to see whether being from the same country is associated with the number of goals a player scores. The data would look like this:
Player_1 Player_2
0 8 # Tim Dempsey
8 5 # Neymar Hulk
8 2 # Neymar Luiz
5 2 # Hulk Luiz
4 2 # Rooney Stewart
(You can ignore the comments, they are there simply to clarify what each row contains).
How would I do this?
table(df$Player)
gets me the number of games per player, but then how do I generate these pairwise combinations?
This is a pretty classic self-join problem. I'm gonna start by summarizing your data to get the total goals for each player. I like dplyr for this, but aggregate or data.table work just fine too.
library(dplyr)
df <- df %>% group_by(Player, Country) %>% dplyr::summarize(Goals = sum(Goals))
> df
Source: local data frame [7 x 3]
Groups: Player

   Player Country Goals
1 Dempsey     USA     8
2    Hulk  Brasil     5
3    Luiz  Brasil     2
4  Neymar  Brasil     8
5  Rooney England     4
6 Stewart England     2
7     Tim     USA     0
Then, using good old merge, we join it to itself based on Country; and so we don't get each row twice (Dempsey/Tim and Tim/Dempsey, not to mention Dempsey/Dempsey), we'll subset it so that Player.x is alphabetically before Player.y. Since I already loaded dplyr I'll use filter, but subset would do the same thing.
df2 <- merge(df, df, by.x = "Country", by.y = "Country")
df2 <- filter(df2, as.character(Player.x) < as.character(Player.y))
> df2
   Country Player.x Goals.x Player.y Goals.y
2   Brasil     Hulk       5     Luiz       2
3   Brasil     Hulk       5   Neymar       8
6   Brasil     Luiz       2   Neymar       8
11 England   Rooney       4  Stewart       2
15     USA  Dempsey       8      Tim       0
The self-join could be done in dplyr if we made a little copy of the data and renamed the Player and Goals columns so they wouldn't be joined on. Since merge is pretty smart about the renaming, it's easier in this case.
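(For what it's worth, current dplyr can do the self-join directly, since inner_join() handles the renaming via its suffix argument; a sketch:)
df2 <- df %>%
  inner_join(df, by = "Country", suffix = c(".x", ".y")) %>%
  filter(as.character(Player.x) < as.character(Player.y))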
There is probably a smarter way to get from the aggregated data to the pairs, but assuming your data is not too big (national soccer data), you can always do something like:
A <- aggregate(df$Goals ~ df$Player + df$Country, data = df, sum)
players_in_c <- table(A[, 2])
dat <- NULL
for (i in levels(df$Country)) {
  count <- players_in_c[i]
  pair <- combn(count, m = 2)
  B <- A[A[, 2] == i, ]
  dat <- rbind(dat, cbind(B[pair[1, ], ], B[pair[2, ], ]))
}
dat
> dat
    df$Player df$Country df$Goals df$Player df$Country df$Goals
1        Hulk     Brasil        5      Luiz     Brasil        2
1.1      Hulk     Brasil        5    Neymar     Brasil        8
2        Luiz     Brasil        2    Neymar     Brasil        8
4      Rooney    England        4   Stewart    England        2
6     Dempsey        USA        8       Tim        USA        0
