I have the following dataset in R:
address = c( "44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
The data looks something like this:
> my_data
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
7 3665 Unit Moon Crescent Pine Rec. cntR 2
8 NO ADDRESS PROVIDED Boston Swimming Complex 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
11 1799 Orlans Street, Maryland , USA Massachusetts Hospital 3
I am trying to follow this R tutorial (https://cran.r-project.org/web/packages/RecordLinkage/vignettes/WeightBased.pdf) and learn how to remove duplicates based on fuzzy conditions. The goal (within each "block") is to keep all unique records - and for fuzzy duplicates, only keep one occurrence of the duplicate.
I tried the following code:
library(RecordLinkage)
pairs=compare.dedup(my_data, blockfld=3)
But when I inspect the results, everything is NA - given these results, I think I am doing something wrong and there does not seem to be any point in continuing until this error is resolved.
Can someone please show me how I can resolve this problem and continue on with the tutorial?
In the end, I am looking for something like this:
address name blocking_var
1 44 Ocean Road Atlanta Georgia Pancake House of America 1
2 882 4N Road River NY, NY 12345 ABC Center Building 1
4 123 Fake Road Boston Drive Boston BD Home 25 New 1
6 3665 Apt 5 Moon Crs Pine Recreational Center 2
9 31 Silver Way Road boston gym center 2
10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital 3
Thank you!
You forgot to enable the string comparison on columns (strcmp parameter):
address = c(
"44 Ocean Road Atlanta Georgia", "882 4N Road River NY, NY 12345", "882 - River Road NY, ZIP 12345", "123 Fake Road Boston Drive Boston", "123 Fake - Rd Boston 56789", "3665 Apt 5 Moon Crs", "3665 Unit Moon Crescent", "NO ADDRESS PROVIDED", "31 Silver Way Road", "1800 Orleans St, Baltimore, MD 21287, United States",
"1799 Orlans Street, Maryland , USA")
name = c("Pancake House of America" ,"ABC Center Building", "Cent. Bldg ABC", "BD Home 25 New", "Boarding Direct 25", "Pine Recreational Center", "Pine Rec. cntR", "Boston Swimming Complex", "boston gym center", "mas hospital" , "Massachusetts Hospital" )
blocking_var = c(1, 1,1,1, 1, 2,2,2,2,3,3)
my_data = data.frame(address, name, blocking_var)
library(RecordLinkage)
pairs <- compare.dedup(my_data, blockfld=3, strcmp = c("address", "name"))
pairs
#> $data
#> address name
#> 1 44 Ocean Road Atlanta Georgia Pancake House of America
#> 2 882 4N Road River NY, NY 12345 ABC Center Building
#> 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC
#> 4 123 Fake Road Boston Drive Boston BD Home 25 New
#> 5 123 Fake - Rd Boston 56789 Boarding Direct 25
#> 6 3665 Apt 5 Moon Crs Pine Recreational Center
#> 7 3665 Unit Moon Crescent Pine Rec. cntR
#> 8 NO ADDRESS PROVIDED Boston Swimming Complex
#> 9 31 Silver Way Road boston gym center
#> 10 1800 Orleans St, Baltimore, MD 21287, United States mas hospital
#> 11 1799 Orlans Street, Maryland , USA Massachusetts Hospital
#> blocking_var
#> 1 1
#> 2 1
#> 3 1
#> 4 1
#> 5 1
#> 6 2
#> 7 2
#> 8 2
#> 9 2
#> 10 3
#> 11 3
#>
#> $pairs
#> id1 id2 address name blocking_var is_match
#> 1 1 2 0.4657088 0.5014620 1 NA
#> 2 1 3 0.4256705 0.4551587 1 NA
#> 3 1 4 0.5924184 0.4543651 1 NA
#> 4 1 5 0.5139994 0.4768519 1 NA
#> 5 2 3 0.9082051 0.5802005 1 NA
#> 6 2 4 0.5112554 0.4734336 1 NA
#> 7 2 5 0.5094017 0.5467836 1 NA
#> 8 3 4 0.4767677 0.4404762 1 NA
#> 9 3 5 0.5418803 0.4761905 1 NA
#> 10 4 5 0.8550583 0.6672619 1 NA
#> 11 6 7 0.8749962 0.8306277 1 NA
#> 12 6 8 0.4385965 0.5243193 1 NA
#> 13 6 9 0.5622807 0.5502822 1 NA
#> 14 7 8 0.3974066 0.5075914 1 NA
#> 15 7 9 0.5626812 0.5896359 1 NA
#> 16 8 9 0.3942495 0.6478338 1 NA
#> 17 10 11 0.6939076 0.6843434 1 NA
#>
#> $frequencies
#> address name blocking_var
#> 0.09090909 0.09090909 0.33333333
#>
#> $type
#> [1] "deduplication"
#>
#> attr(,"class")
#> [1] "RecLinkData"
It then goes like this, using e.g. the EpiLink algorithm:
# Compute EpiLink weights
pairs_w <- epiWeights(pairs)
# Explore the pairs and their weight to find a good cutoff
getPairs(pairs_w, min.weight=0.6, max.weight=0.8)
#> id address
#> 1 2 882 4N Road River NY, NY 12345
#> 2 3 882 - River Road NY, ZIP 12345
#> 3
#> 4 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 5 11 1799 Orlans Street, Maryland , USA
#> 6
#> 7 7 3665 Unit Moon Crescent
#> 8 9 31 Silver Way Road
#> 9
#> 10 6 3665 Apt 5 Moon Crs
#> 11 9 31 Silver Way Road
#> 12
#> 13 2 882 4N Road River NY, NY 12345
#> 14 5 123 Fake - Rd Boston 56789
#> 15
#> 16 1 44 Ocean Road Atlanta Georgia
#> 17 4 123 Fake Road Boston Drive Boston
#> 18
#> 19 8 NO ADDRESS PROVIDED
#> 20 9 31 Silver Way Road
#> 21
#> 22 3 882 - River Road NY, ZIP 12345
#> 23 5 123 Fake - Rd Boston 56789
#> 24
#> name blocking_var Weight
#> 1 ABC Center Building 1
#> 2 Cent. Bldg ABC 1 0.7916856
#> 3
#> 4 mas hospital 3
#> 5 Massachusetts Hospital 3 0.7468321
#> 6
#> 7 Pine Rec. cntR 2
#> 8 boston gym center 2 0.6548348
#> 9
#> 10 Pine Recreational Center 2
#> 11 boston gym center 2 0.6386475
#> 12
#> 13 ABC Center Building 1
#> 14 Boarding Direct 25 1 0.6156913
#> 15
#> 16 Pancake House of America 1
#> 17 BD Home 25 New 1 0.6118630
#> 18
#> 19 Boston Swimming Complex 2
#> 20 boston gym center 2 0.6099491
#> 21
#> 22 Cent. Bldg ABC 1
#> 23 Boarding Direct 25 1 0.6001716
#> 24
I chose > 0.7 to classify as link, < 0.6 to classify as a non-link.
Matches in-between are labelled as "possible".
pairs_class <- epiClassify(pairs_w, threshold.upper = 0.7, threshold.lower = 0.6)
summary(pairs_class)
#>
#> Deduplication Data Set
#>
#> 11 records
#> 17 record pairs
#>
#> 0 matches
#> 0 non-matches
#> 17 pairs with unknown status
#>
#>
#> Weight distribution:
#>
#> [0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7] (0.7,0.75] (0.75,0.8] (0.8,0.85]
#> 1 6 5 1 1 1 1
#> (0.85,0.9]
#> 1
#>
#> 4 links detected
#> 6 possible links detected
#> 7 non-links detected
#>
#> Classification table:
#>
#> classification
#> true status N P L
#> <NA> 7 6 4
And the results:
# detected links, possible matches, non-links
getPairs(pairs_class, show = "links")
#> id address
#> 1 6 3665 Apt 5 Moon Crs
#> 2 7 3665 Unit Moon Crescent
#> 3
#> 4 4 123 Fake Road Boston Drive Boston
#> 5 5 123 Fake - Rd Boston 56789
#> 6
#> 7 2 882 4N Road River NY, NY 12345
#> 8 3 882 - River Road NY, ZIP 12345
#> 9
#> 10 10 1800 Orleans St, Baltimore, MD 21287, United States
#> 11 11 1799 Orlans Street, Maryland , USA
#> 12
#> name blocking_var Weight
#> 1 Pine Recreational Center 2
#> 2 Pine Rec. cntR 2 0.8801340
#> 3
#> 4 BD Home 25 New 1
#> 5 Boarding Direct 25 1 0.8054952
#> 6
#> 7 ABC Center Building 1
#> 8 Cent. Bldg ABC 1 0.7916856
#> 9
#> 10 mas hospital 3
#> 11 Massachusetts Hospital 3 0.7468321
#> 12
getPairs(pairs_class, show = "possible")
#> id address name blocking_var
#> 1 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 2 9 31 Silver Way Road boston gym center 2
#> 3
#> 4 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 5 9 31 Silver Way Road boston gym center 2
#> 6
#> 7 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 8 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 9
#> 10 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 11 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 12
#> 13 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 14 9 31 Silver Way Road boston gym center 2
#> 15
#> 16 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 17 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 18
#> Weight
#> 1
#> 2 0.6548348
#> 3
#> 4
#> 5 0.6386475
#> 6
#> 7
#> 8 0.6156913
#> 9
#> 10
#> 11 0.6118630
#> 12
#> 13
#> 14 0.6099491
#> 15
#> 16
#> 17 0.6001716
#> 18
getPairs(pairs_class, show = "nonlinks")
#> id address name blocking_var
#> 1 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 2 5 123 Fake - Rd Boston 56789 Boarding Direct 25 1
#> 3
#> 4 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 5 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 6
#> 7 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 8 2 882 4N Road River NY, NY 12345 ABC Center Building 1
#> 9
#> 10 6 3665 Apt 5 Moon Crs Pine Recreational Center 2
#> 11 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 12
#> 13 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 14 4 123 Fake Road Boston Drive Boston BD Home 25 New 1
#> 15
#> 16 7 3665 Unit Moon Crescent Pine Rec. cntR 2
#> 17 8 NO ADDRESS PROVIDED Boston Swimming Complex 2
#> 18
#> 19 1 44 Ocean Road Atlanta Georgia Pancake House of America 1
#> 20 3 882 - River Road NY, ZIP 12345 Cent. Bldg ABC 1
#> 21
#> Weight
#> 1
#> 2 0.5890881
#> 3
#> 4
#> 5 0.5865789
#> 6
#> 7
#> 8 0.5794458
#> 9
#> 10
#> 11 0.5777132
#> 12
#> 13
#> 14 0.5591162
#> 15
#> 16
#> 17 0.5541298
#> 18
#> 19
#> 20 0.5442886
#> 21
Created on 2022-11-17 with reprex v2.0.2
I have a data frame like this:
city year value
<chr> <dbl> <dbl>
1 la 1 NA
2 la 2 NA
3 la 3 NA
4 la 4 20
5 la 5 25
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
I would like to remove any cities that don't have a complete 5 years worth of data. So in this case, I'd like to remove all rows for city la despite the fact that there is data for years 4 and 5, resulting in the following data frame:
city year value
<chr> <dbl> <dbl>
1 nyc 1 18
2 nyc 2 29
3 nyc 3 24
4 nyc 4 17
5 nyc 5 30
Is this possible? Thanks in advance.
In Base R:
subset(df, !ave(value, city, FUN = anyNA))
city year value
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
in Tidyverse
df %>%
group_by(city) %>%
filter(!anyNA(value))
# A tibble: 5 x 3
# Groups: city [1]
city year value
<chr> <int> <int>
1 nyc 1 18
2 nyc 2 29
3 nyc 3 24
4 nyc 4 17
5 nyc 5 30
or even
df %>%
group_by(city) %>%
filter(all(!is.na(value)))
Another base R option with ave
> subset(df, !is.na(ave(value, city)))
city year value
6 nyc 1 18
7 nyc 2 29
8 nyc 3 24
9 nyc 4 17
10 nyc 5 30
or a data.table one
> library(data.table)
> setDT(df)[, .SD[!anyNA(value)], city]
city year value
1: nyc 1 18
2: nyc 2 29
3: nyc 3 24
4: nyc 4 17
5: nyc 5 30
I have two pieces of data.
First, cross-sectional data-1600 corn field data
Second, weather data for the corresponding counties for the corn field area from 2013 to 2020. -Date is year, month, day
I want to create panel data using two kinds of data.
The problem is to join the weather data in each field for each year, month, and day (like vlookup in Excel).
For example, the two tables below.
Cross-section
ID
county
Longtitude
Latitude
1
Texas
-101.8259
36.99026
2
Cimarron
-101.7264
36.99253
3
Texas
-101.8038
36.99012
4
Cimarron
-101.9427
36.97605
5
Cimarron
-102.2219
36.96172
6
Beaver
-102.0777
36.96919
7
Beaver
-101.6181
36.98999
Time series data
YEAR
MONTH
DAY
county
A
B
2013
1
1
Texas
1
5
2013
1
2
Texas
2
6
2013
1
3
Texas
3
7
2013
1
4
Texas
4
8
2014
1
1
Texas
9
10
2014
1
2
Texas
11
12
2014
1
3
Texas
13
14
2014
1
4
Texas
15
16
The data I want to create is below.
ID
county
Longtitude
Latitude
YEAR
MONTH
DAY
county
A
B
1
Texas
-101.8259
36.99026
2013
1
1
Texas
1
5
2
Cimarron
-101.7264
36.99253
2013
1
1
Cimarron
-
-
3
Texas
-101.8038
36.99012
2013
1
1
Texas
1
5
4
Cimarron
-101.9427
36.97605
2013
1
1
Cimarron
-
-
5
Cimarron
-102.2219
36.96172
2013
1
1
Cimarron
-
-
6
Beaver
-102.0777
36.96919
2013
1
1
Beaver
-
-
7
Beaver
-101.6181
36.98999
2013
1
1
Beaver
-
-
1
Texas
-101.8259
36.99026
2013
1
2
Texas
2
6
2
Cimarron
-101.7264
36.99253
2013
2
1
Cimarron
1
5
3
Texas
-101.8038
36.99012
2013
1
2
Texas
2
6
4
Cimarron
-101.9427
36.97605
2013
1
2
Cimarron
2
6
5
Cimarron
-102.2219
36.96172
2013
1
2
Cimarron
-
6
Beaver
-102.0777
36.96919
2013
1
2
Beaver
1
5
7
Beaver
-101.6181
36.98999
2013
1
2
Beaver
1
5
…
1
Texas
-101.8259
36.99026
2014
1
4
Texas
15
16
2
Cimarron
-101.7264
36.99253
2014
2
4
Cimarron
15
16
3
Texas
-101.8038
36.99012
2014
1
2
Texas
15
16
4
Cimarron
-101.9427
36.97605
2014
1
4
Cimarron
-
-
5
Cimarron
-102.2219
36.96172
2014
1
4
Cimarron
-
-
6
Beaver
-102.0777
36.96919
2014
1
4
Beaver
-
-
7
Beaver
-101.6181
36.98999
2014
1
4
Beaver
-
-
-Is a number other than NA.
In other words, I want to perform 1 cross section for each date using a function like left_join and rbind the data to create panel data.
If I want to make a panel from January 1, 2013 to January 1, 2014,
I need to use 7 observations * 3 (counts) *365 (dates).
My data is much more than this, so I use a loop (1600 observations, 77 counties, 10 years).
If you give me any ideas, I appreciate it!
edit:
In other words, considering only the two of January 1st and January 2nd, left_join variables such as tavg of 1600 fields of table 1 (using data from the 1st), repeat the same process on the 2nd, and then combine the two data. Is. That is, 1600*2 data are generated (of course, the values of the tavg variables on the 1st and 2nd days are different). I have to take this course for 10 years.
This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long and the row where the NA changes based on the day. I would like to add based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number because they change.
This will put 55555 into the NA cells within column FIPS where country is New York C
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to substitute de df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state=="New York"]<-5555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
Say that I have two dataframes. I have one that lists the names of soccer players, teams that they have played for, and the number of goals that they have scored on each team. Then I also have a dataframe that contains the soccer players ages and their names. How do I add an "names_age" column to the goal dataframe that is the age column for the players in the first column "names", not for "teammates_names"? How do I add an additional column that is the teammates' ages column? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can you can use dcast to get back to the wide format and retain the row ordering if it's important.