Merge two dataframes on a column with fuzzy match in R

I have two dataframes, one for 2008, the other for 2004 and 2012 data. Examples of the dataframes are below.
df_08 <- read.table(text = '
observation year x code_location location
1 2008 300 "23-940 town no. 1" "town no. 1"
2 2008 234 "23-941 town no. 2" "town no. 2"
3 2008 947 "23-942 city St 23" "city St 23"
4 2008 102 "23-943 Mtn town 5" "Mtn town 5"', header = TRUE)
df_04_12 <- read.table(text = '
observation year y code_location location
1 2004 124 "23-940 town no. 1" "town no. 1"
2 2004 395 "23-345 town # 2" "town # 2"
3 2004 1349 "23-942 city St 23" "city St 23"
4 2012 930 "53-443 Mtn town 5" "Mtn town 5"
5 2012 185 "99-999 town no. 1" "town no. 1"
6 2012 500 "23-941 town Number 2" "town Number 2"
7 2012 185 "34-942 city Street 23" "city Street 23"
8 2012 195 "23-943 Mt town 5" "Mt town 5"', header = TRUE, comment.char = "")
I want to merge df_08 into df_04_12 using the location variable (the codes are not consistent across years). However, slight variations in the location names, e.g. Mtn vs. Mt or no. vs. #, result in no match. Given these slight variations between location names, is there a way to merge these dataframes and get the following? I currently do not have any code for this, since I am not sure how to match the locations for a merge.
observation year y code_location location.x location.y y.y
1 2004 124 "23-940 town no. 1" "town no. 1" "town no. 1" 300
2 2004 395 "23-345 town # 2" "town # 2" "town no. 2" 234
3 2004 1349 "23-942 city St 23" "city St 23" "city St 23" 947
4 2012 930 "53-443 Mtn town 5" "Mtn town 5" "Mtn town 5" 102
5 2012 185 "99-999 town no. 1" "town no. 1" "town no. 1" 300
6 2012 500 "23-941 town Number 2" "town Number 2" "town no. 2" 234
7 2012 185 "34-942 city Street 23" "city Street 23" "city St 23" 947
8 2012 195 "23-943 Mt town 5" "Mt town 5" "Mtn town 5" 102

You can use Levenshtein distance on character variables, but that alone does not account for symbols. I would suggest you strip all of the symbols before merging and then use the stringdist package. There is no clean, general solution for this problem; you will have to develop your own method as it relates to your data.
Some of the methods used in fuzzy matching are string-distance calculations and Soundex transformations of the data; you just have to find out what is appropriate for your data.
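For example, here is a minimal sketch of that approach using stringdist::amatch. The cleaning rules, the maxDist value, and the new column names (location_08, x_08) are my own choices and would need tuning for real data:
library(stringdist)

# Normalise the location strings: lower-case, expand "#" to "no", drop
# remaining punctuation, and squeeze whitespace.
clean <- function(x) {
  x <- tolower(x)
  x <- gsub("#", "no", x, fixed = TRUE)
  x <- gsub("[[:punct:]]", "", x)
  gsub("\\s+", " ", trimws(x))
}

# For each cleaned location in df_04_12, find the index of the closest
# cleaned location in df_08 (Levenshtein distance, capped at maxDist).
idx <- amatch(clean(df_04_12$location), clean(df_08$location),
              method = "lv", maxDist = 5)

df_04_12$location_08 <- df_08$location[idx]  # matched 2008 location name
df_04_12$x_08        <- df_08$x[idx]         # matched 2008 value of x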

Related

Two-way table with the mean of a third variable in R

Here's my problem. I have a table, of which I show a sample here. I would like to have Country as rows, Stars as columns, and the mean of the price for each combination. I used aggregate, which gave me the information I want, but not in the shape I want it.
The table looks like this:
Country Stars Price
1 Canada 4 567
2 China 2 435
3 Russia 3 456
4 Canada 5 687
5 Canada 4 432
6 Russia 3 567
7 China 4 1200
8 Russia 3 985
9 Canada 2 453
10 Russia 3 234
11 Russia 4 546
12 Canada 3 786
13 China 2 456
14 China 3 234
15 Russia 4 800
16 China 5 987
I used this code:
aggregate(Stars[, 3], list(Country = Stars$Country, Stars = Stars$Stars), mean)
Output:
Country Stars x
1 Canada 2 453.0
2 China 2 445.5
3 Canada 3 786.0
4 China 3 234.0
5 Russia 3 560.5
6 Canada 4 499.5
7 China 4 1200.0
8 Russia 4 673.0
9 Canada 5 687.0
10 China 5 987.0
Here x stands for the mean; I would like to rename x to "price mean"...
So the goal would be to have one country per row and the number of stars as columns, with the mean price for each pair.
Thank you very much.
It seems you want an Excel-like pivot table. The pivottabler package helps a lot here; it also generates nice HTML tables (apart from displaying results).
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
            2                 3      4    5    Total
Canada    453               786  499.5  687      585
China   445.5               234   1200  987    662.4
Russia                    560.5    673           598
Total     448  543.666666666667    709  837 614.0625
For formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
             2       3        4       5   Total
Canada  453.00  786.00   499.50  687.00  585.00
China   445.50  234.00  1200.00  987.00  662.40
Russia          560.50   673.00          598.00
Total   448.00  543.67   709.00  837.00  614.06
For HTML output, use qhpvt instead.
qhpvt(df, "Country", "Stars", "mean(Price)")
Note: tidyverse and base R methods are also possible and are easy too; for example:
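A minimal tidyverse sketch of that idea (my own, assuming the data frame is called df with columns Country, Stars and Price):
library(dplyr)
library(tidyr)

df %>%
  group_by(Country, Stars) %>%
  summarise(price_mean = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Stars, values_from = price_mean)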
To obtain a two-way table of means in base R, after attaching the data you can use
tapply(Price, list(Country, Stars), mean)
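If you would rather not attach() the data, the same call can be wrapped in with() (assuming the data frame is named df):
with(df, tapply(Price, list(Country, Stars), mean))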

How to manually enter a cell in a dataframe? [duplicate]

This question already has answers here:
Update a Value in One Column Based on Criteria in Other Columns
(4 answers)
dplyr replacing na values in a column based on multiple conditions
(2 answers)
Closed 2 years ago.
This is my dataframe:
county state cases deaths FIPS
Abbeville South Carolina 4 0 45001
Acadia Louisiana 9 1 22001
Accomack Virginia 3 0 51001
New York C New York 2 0 NA
Ada Idaho 113 2 16001
Adair Iowa 1 0 19001
I would like to manually put "55555" into the NA cell. My actual df is thousands of lines long, and the row where the NA appears changes from day to day. I would like to assign the value based on the county. Is there a way to say df[df$county == "New York C",] <- df$FIPS = "55555" or something like that? I don't want to insert based on the column or row number, because they change.
This will put 55555 into the NA cells within column FIPS where county is "New York C":
df$FIPS[is.na(df$FIPS) & df$county == "New York C"] <- 55555
Output
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 55555
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
Data
df
# county state cases deaths FIPS
# 1 Abbeville South Carolina 4 0 45001
# 2 Acadia Louisiana 9 1 22001
# 3 Accomack Virginia 3 0 51001
# 4 New York C New York 2 0 NA
# 5 Ada Idaho 113 2 16001
# 6 Adair Iowa 1 0 19001
# 7 New York C New York 1 0 18000
You could use & (and) to replace the df$FIPS entries that meet the two desired conditions.
df$FIPS[is.na(df$FIPS) & df$state == "New York"] <- 55555
If you want to change values based on multiple conditions, I'd go with dplyr::mutate().
library(dplyr)
df <- df %>%
  mutate(FIPS = ifelse(is.na(FIPS) & county == "New York C", 55555, FIPS))
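A small variation (my own sketch, not part of the original answer): replace() inside mutate() does the same assignment while sidestepping ifelse()'s type-coercion quirks.
library(dplyr)

df <- df %>%
  mutate(FIPS = replace(FIPS, is.na(FIPS) & county == "New York C", 55555))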

Using a list or vector as input for a select statement

I have two data frames. The original, df1,
Country Ccode Year Happiness Power
1 France FR 2000 1872 1213
2 France FR 2001 2345 1234
3 UK UK 2000 2234 1726
4 UK UK 2001 9082 6433
and df1vars which contains only a vector of a few column names:
1 Country
2 Year
3 Happiness
I would like to select from df1 the columns listed in df1vars. Against my better judgment, I tried the following:
library(dplyr)
df2 <- select(df1, df1vars)
hoping to get this output:
Country Year Happiness
1 France 2000 1872
2 France 2001 2345
3 UK 2000 2234
4 UK 2001 9082
Instead, I got this message:
Error: `ES1varselect` must evaluate to column positions or names, not a list
Is there an efficient workaround to this aspect of the select statement?
If both are data.frames, then:
df1[, c(df1vars$Colname)]
where df1 is the data.frame mentioned in the question and df1vars is the other data.frame with the following content:
Colname
1 Country
2 Year
3 Happiness
Final output:
Country Year Happiness
1 France 2000 1872
2 France 2001 2345
3 UK 2000 2234
4 UK 2001 9082
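If you prefer to stay within dplyr, select() accepts a character vector of names when it is wrapped in all_of(); a sketch, assuming df1vars$Colname holds the column names (coerced to character in case it was read in as a factor):
library(dplyr)

df2 <- df1 %>%
  select(all_of(as.character(df1vars$Colname)))
df2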

Backtrack values in R for a logic

My request is slightly complicated.
Below is what my data looks like:
S.no Date City Sales diff Indicator
1 1 1/1/2017 New York 2795 0 0
2 2 1/31/2017 New York 4248 1453 0
3 3 3/2/2017 New York 1330 -2918 1
4 4 4/1/2017 New York 3535 2205 0
5 5 5/1/2017 New York 4330 795 0
6 6 5/31/2017 New York 3360 -970 1
7 7 6/30/2017 New York 2238 -1122 1
8 8 1/1/2017 Paris 1451 0 0
9 9 1/31/2017 Paris 2339 888 0
10 10 3/2/2017 Paris 2029 -310 1
11 11 4/1/2017 Paris 1850 -179 1
12 12 5/1/2017 Paris 2800 950 1
13 13 5/31/2017 Paris 1986 -814 0
14 14 6/30/2017 Paris 3776 1790 0
15 15 1/1/2017 London 1646 0 0
16 16 1/31/2017 London 3575 1929 0
17 17 3/2/2017 London 1161 -2414 1
18 18 4/1/2017 London 1766 605 0
19 19 5/1/2017 London 2799 1033 0
20 20 5/31/2017 London 2761 -38 1
21 21 6/30/2017 London 1048 -1713 1
diff is the current month's Sales minus the previous month's Sales within each City, and Indicator flags whether diff is negative or positive.
I want to apply a piece of logic to each group starting from the last row and working up to the first row, i.e. in reverse order.
Going in reverse, I want to capture the value of Sales where Indicator is 1, and then compare that captured Sales value to a threshold value (2000) in the next steps.
There are two cases for the comparison (captured Sales vs. threshold):
a. If the captured Sales value at the first row with Indicator = 1 (scanning from the last row towards the first) is less than 2000, store the captured value in a new dataset for that group.
b. If the captured Sales value at that row is greater than 2000, skip that Indicator = 1 row, move on to the next row where Indicator = 1, and repeat steps a) and b).
I want to collect the result in a new dataset that has a single row per City, giving the Sales value produced by the logic above along with its Date.
I simply want to understand how I can implement this logic in R. Will the rle function help?
Result:
S.no Date City Value(Sales)
3. 3/2/2017 New York 1330
11. 4/1/2017 Paris 1850
21. 6/30/2017 London 1048
Thanks,
J
If we assume that your data is already arranged in ascending order, you can do the following with base R:
threshold <- 2000
my_new_df <- my_df[my_df$Indicator == 1 & my_df$Sales < threshold, ]
my_new_df
# S.no Date City Sales diff Indicator
# 3 3 2017-03-02 New York 1330 -2918 1
# 11 11 2017-04-01 Paris 1850 -179 1
# 17 17 2017-03-02 London 1161 -2414 1
# 21 21 2017-06-30 London 1048 -1713 1
Now we have all rows where the Indicator equals one and the Sales value is less than our threshold. But London has two such rows, and we only want the last one:
my_new_df <- my_new_df[!duplicated(my_new_df$City, fromLast = TRUE),
                       c("S.no", "Date", "City", "Sales")]
my_new_df
# S.no Date City Sales
# 3 3 2017-03-02 New York 1330
# 11 11 2017-04-01 Paris 1850
# 21 21 2017-06-30 London 1048
With the fromLast argument to duplicated(), we start at the last row when checking whether the City has already appeared in the data set.
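The same logic can also be written with dplyr (a sketch of my own, under the same assumption that the rows are in ascending date order within each City):
library(dplyr)

threshold <- 2000

my_df %>%
  filter(Indicator == 1, Sales < threshold) %>%  # rows with Indicator 1 below the threshold
  group_by(City) %>%
  slice_tail(n = 1) %>%                          # keep the last such row per City
  ungroup() %>%
  select(S.no, Date, City, Sales)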

Add new column to long dataframe from another dataframe?

Say that I have two dataframes. One lists the names of soccer players, the teams they have played for, and the number of goals they scored for each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe that holds the age of the players in the first column, "names" (not of "teammates_names")? And how do I then add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
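For example (a usage sketch of my own; the column names names_age and teammates_age are just illustrative):
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]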
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
            measure.vars = c("names", "teammates_names"),
            value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so you can use dcast to get back to the wide format and retain the row ordering if it's important.
