Fuzzy string matching in R

I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title'), as well as using the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset-2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantômas - à l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep', but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good for when you have typos in strings.
In this scenario, fuzzy matching alone may not provide good results; e.g., the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the matched movies are unique.
I want to know if there is a way to achieve this task without using a loop. Worst case, if I have to use a loop, how can I make it work as efficiently and as fast as possible?
I have tried the following code, but it takes an awfully long time to run.
for (i in 1:nrow(test)) {
  for (j in 1:nrow(test1)) {
    test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                               test$title, NA)
  }
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?

What about this approach to move you forward? You can adjust the degree of match (0.85) after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates; any zeros would mean the same release date.
# jarowinkler() comes from the RecordLinkage package; note that `dataset-1`
# is not a valid R name, hence dataset_1 here
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85,
                                dataset_1$title, NA)
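For the loop-avoidance part of the question, here is a minimal sketch using stringdist::stringdistmatrix(), which computes all pairwise Jaro-Winkler distances in one vectorized call. Assumptions: test$x and test1$x are the lower-cased title vectors described above, and a 1682 x 11451 double matrix (roughly 150 MB) fits in memory.
# stringdistmatrix() returns Jaro-Winkler *distances* (0 = identical),
# so a similarity above 0.85 corresponds to a distance below 0.15
library(stringdist)

d <- stringdistmatrix(test$x, test1$x, method = "jw", p = 0.1)

best      <- apply(d, 1, which.min)            # closest test1 title per test title
best_dist <- d[cbind(seq_len(nrow(d)), best)]  # its distance

matches <- data.frame(title1 = test$x,
                      title2 = test1$x[best],
                      dist   = best_dist)
matches <- matches[matches$dist < 0.15, ]      # keep confident matches only
The fuzzyjoin package wraps the same idea as a join (stringdist_inner_join() with method = "jw" and a max_dist cutoff), after which the release-date check is a plain filter on the joined result.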

Related

Group By and Summarize

I have a quick question. I did a group-by and summarize on the following data. However, how do I summarize the length of each variable (Trump, Obama, McConnell) individually?
library(dplyr)

dta.subset.tabluea <- dta.subset %>%
  group_by(variable, catvalue2) %>%
  summarize(value = length(catvalue))
The output I got was:
variable catvalue2 value
1 Trump Slightly Warm 216
2 Trump Very Cold 778
3 Trump Very Warm 311
4 Trump <NA> 176
5 Obama Slightly Warm 251
6 Obama Very Cold 427
7 Obama Very Warm 676
8 Obama <NA> 224
9 McConnell Slightly Warm 248
10 McConnell Very Cold 731
11 McConnell Very Warm 60
12 McConnell <NA> 444
However, how do I get the total per variable (Trump, Obama, McConnell) in another column? I need this info so I can make percentages.
If I do the following, I would get the same answers as the first column:
summarize(value = length(catvalue), varvalue = length(catvalue))
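A minimal sketch of one way to get there, assuming the column names from the question: summarise per (variable, catvalue2) first, then re-group by variable alone and attach the per-variable total with mutate(), from which percentages follow.
library(dplyr)

dta.subset.tabluea <- dta.subset %>%
  group_by(variable, catvalue2) %>%
  summarize(value = n()) %>%               # n() is the idiomatic row count
  group_by(variable) %>%                   # re-group by variable alone
  mutate(varvalue = sum(value),            # total per Trump/Obama/McConnell
         pct      = 100 * value / varvalue)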

Assigning coordinates to data of a different year in R?

I have got a data frame of Germany from 2012 with 8187 rows for 8187 postal codes (and about 10 variables listed as columns), but with no coordinates. Additionally, I have got coordinates from a different shapefile with 8203 rows (also including mostly the same postal codes).
I need the correct coordinates of the 8203 cases to be assigned to the 8187 cases of the initial data frame.
The problem: the number of correct assignments needed is not simply 8187 with 16 cases missing (8203 - 8187 = 16); it is more, because some towns (with postal codes) from 2012 are not listed in the more recent shapefile and vice versa.
(I) Perhaps the easiest solution would be to obtain the coordinates from 2012 (unprojected: CRS("+init=epsg:4326")). --> Does anybody know an open-source platform for this purpose? And does it cover exactly those 8187 postal codes?
(II) Or: does anybody have experience with assigning coordinates from one year to a data set of a different year? Or should this be avoided altogether because of slightly changing borders and coordinates (especially when the data are to be mapped and visualized in polygons from 2012), and because some towns are not listed in both the older and the newer data set?
I would appreciate your expert advice on how to approach (and hopefully solve) this issue!
EDIT - MWE:
# data set from 2012
> df1
# A tibble: 9 x 4
ID PLZ5 Name Var1
<dbl> <dbl> <chr> <dbl>
1 1 1067 Dresden 01067 40
2 2 1069 Dresden 01069 110
3 224 4571 Rötha 0
4 225 4574 Deutzen 120
5 226 4575 Neukieritzsch 144
6 262 4860 Torgau 23
7 263 4862 Mockrehna 57
8 8186 99996 Menteroda 0
9 8187 99998 Körner 26
# coordinates of recent shapefile
> df2
# A tibble: 9 x 5
ID PLZ5 Name Longitude Latitude
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1067 Dresden-01067 13.71832 51.06018
2 2 1069 Dresden-01069 13.73655 51.03994
3 224 4571 Roetha 12.47311 51.20390
4 225 4575 Neukieritzsch 12.41355 51.15278
5 260 4860 Torgau 12.94737 51.55790
6 261 4861 Bennewitz 13.00145 51.51125
7 262 4862 Mockrehna 12.83097 51.51125
8 8202 99996 Obermehler 10.59146 51.28864
9 8203 99998 Koerner 10.55294 51.21257
Hence,
4 225 4574 Deutzen 120
--> is not listed in df2 and:
6 261 4861 Bennewitz 13.00145 51.51125
--> is not listed in df1.
Any ideas concerning (I) and (II)?
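For the matching part of (II), a sketch assuming PLZ5 is the join key: a left join keeps every 2012 row and attaches coordinates where the postal code exists in the shapefile; the rest get NA and can be inspected afterwards.
library(dplyr)

df_joined <- df1 %>%
  left_join(select(df2, PLZ5, Longitude, Latitude), by = "PLZ5")

# postal codes still without coordinates (e.g. 4574 Deutzen in the MWE)
missing_coords <- filter(df_joined, is.na(Longitude))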

Rfacebook: get reactions to posts

I want to use Rfacebook to get the reactions (not just likes) to specific posts but couldn't find a way to do that. Basically, I would want the same output for a comment as I get for a post:
> BBC <- getPage(page="bbcnews", token=fb_oauth, n=5, since="2017-10-03", until="2017-10-06", feed=FALSE, reactions=TRUE, verbose=TRUE)
5 posts
> BBC
id likes_count from_id from_name
1 228735667216_10155178331342217 1602 228735667216 BBC News
2 228735667216_10155178840252217 7575 228735667216 BBC News
3 228735667216_10155178915482217 5735 228735667216 BBC News
4 228735667216_10155180617187217 6843 228735667216 BBC News
5 228735667216_1964396086910573 1736 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
2 Puerto Rico: President Donald J. Trump compares Hurricane Maria to a "real catastrophe like Katrina" bbc.in/2yG9gyZ
3 Do mass shootings ever change gun laws? http://bbc.in/2fIbjv0
4 "Boris asked me to give you this" - The moment comedian Lee Nelson interrupts Prime Minister Theresa May's speech.. by handing her a P45.
5 In her big conference speech, Theresa May talked about council houses and energy prices - but the announcements were overshadowed by a coughing fit and a protester. (Via BBC Politics)\nhttp://bbc.in/2fMCIw3
created_time type link story comments_count shares_count
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ NA 406 230
2 2017-10-03T21:34:21+0000 video https://www.facebook.com/bbcnews/videos/10155178840252217/ NA 14722 12284
3 2017-10-03T21:56:01+0000 video https://www.facebook.com/bbcnews/videos/10155178915482217/ NA 3059 2418
4 2017-10-04T11:17:28+0000 video https://www.facebook.com/bbcnews/videos/10155180617187217/ NA 1737 2973
5 2017-10-04T17:16:33+0000 video https://www.facebook.com/bbcnews/videos/1964396086910573/ NA 636 238
love_count haha_count wow_count sad_count angry_count
1 125 16 18 1063 20
2 318 1155 5023 1072 23698
3 104 69 61 980 504
4 513 4127 76 10 80
5 83 467 24 11 21
Now, for the first 5 comments of the first post, I would like the same kind of output as above. I get all of it except the reactions (corresponding to the columns love_count, haha_count, wow_count, sad_count, angry_count) by using the following code:
> BBC_post <- getPost(BBC$id[1], token=fb_oauth, comments=TRUE, n.comments=5, likes=FALSE, reactions=FALSE)
> BBC_post
$post
from_id from_name
1 228735667216 BBC News
message
1 "What did those people do to deserve that?" \n\nThis woman left the scene of the Las Vegas shooting just moments before it began.
created_time type link id
1 2017-10-03T18:23:36+0000 video https://www.facebook.com/bbcnews/videos/10155178331342217/ 228735667216_10155178331342217
likes_count comments_count shares_count
1 1602 406 230
$comments
from_id from_name
1 880124212162441 David Bourton
2 10159595379610445 Valerie Gregory
3 10159810965680122 Nadeem Hussain
4 1657693134252376 Samir Amghar
5 10215327133878123 Shlomo Resnikov
message
1 It's unfathomable to the rest of the world that there are so many people who believe the killer's right to their guns are greater than their victims right to life.
2 That's backwards. The victims didn't do anything. The NRA, the politicians who are bought and paid for by them, including President Trump, and the shooter did. That is where solving the problem begins.
3 BBC ask Israel the same Question... what did the Palestinians civilians do to deserve an Apartheid regime !!!
4 Praying and thinking of the victims will not prevent the next shooting. One failed attempt at a shoe bomb and we all take off our shoes at the airport. 274 Mass shootings since January and no change in your regulation of guns.
5 As a Jew , we constantly ask those kind of questions regarding to the holocaust ,”where was god in the holocaust “? Or “How did he allow this horror”? And the answer that facilitates the most is mysterious ways of god are beyond our perception ,we cannot grasp divine calculation .
created_time likes_count comments_count id
1 2017-10-03T18:25:58+0000 225 71 10155178331342217_10155178338952217
2 2017-10-03T18:29:04+0000 79 45 10155178331342217_10155178346307217
3 2017-10-03T18:28:34+0000 60 38 10155178331342217_10155178345382217
4 2017-10-03T18:32:11+0000 37 3 10155178331342217_10155178354272217
5 2017-10-03T18:44:19+0000 16 20 10155178331342217_10155178380902217
### How do I also display the REACTIONS a comment got? It is not "reactions=TRUE", since that displays the reactions to the post itself, not to the comments of the post.
Does anyone know how to get there? Or does Rfacebook simply not allow for that (yet) since the feature of 'reacting to comments' was introduced not too long ago?
Many thanks in advance and all the best,
Ivo
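One hedged idea, untested: Rfacebook ships a getReactions() helper for post IDs, and the Graph API exposes the same /{id}/reactions edge for comments, so passing comment IDs may work; treat this as an assumption to verify rather than documented Rfacebook behaviour.
library(Rfacebook)

# assumption: getReactions() accepts comment IDs the same way it accepts
# post IDs, since both resolve to the Graph API /{id}/reactions edge
comment_reactions <- getReactions(post = BBC_post$comments$id, token = fb_oauth)
If that errors, Rfacebook may simply not support comment reactions, and a raw Graph API call per comment ID would be the fallback.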

Deleting rows dynamically based on certain condition in R

Problem description:
From the table below, I want to remove all the rows above the quarter value of 2014-Q3, i.e. rows 1 and 2.
Also note that this is a dynamic data set: when we move on to the next quarter, i.e. 2016-Q3, I want all the rows above the quarter value of 2014-Q4 to be removed automatically through code, without any manual intervention
(and when we move to the next quarter, 2016-Q4, all rows above 2015-Q1 should be removed, and so on).
I have a variable which captures the first quarter I would like to see in my final data frame (in this case 2014-Q3), and this variable will change as we progress into the future.
QTR Revenue
1 2014-Q1 456
2 2014-Q2 3113
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
How do I code this?
Here is a semi-automated method using which():
myFunc <- function(df, year, quarter) {
  # the last quarter to drop, e.g. "2014-Q2" for year = 2014, quarter = 3
  dropper <- paste(year, paste0("Q", quarter - 1), sep = "-")
  # drop rows 1 through the row whose QTR matches `dropper`
  df[-(1:which(as.character(df$QTR) == dropper)), ]
}
myFunc(df, 2014, 3)
QTR Revenue
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
To subset, you can just assign the output:
dfNew <- myFunc(df, 2014, 3)
At this point, you can pretty easily change the year and quarter to perform a new subset.
Thanks lmo
Was going through articles, and I think we can use the dplyr package to do this in a much simpler way:
> df <- df %>% slice((nrow(df) - 7):nrow(df))
This gives the result below:
> df
3 2014-Q3 23
4 2014-Q4 173
5 2015-Q1 1670
6 2015-Q2 157
7 2015-Q3 115
.. .. ..
10 2016-Q2 232
This behaves dynamically too: once more rows are entered beyond 2016-Q2, the window of 8 selected rows is maintained by the nrow() calls.
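For reference, the same trailing 8-row window can be written more directly (a sketch; slice_tail() requires dplyr >= 1.0.0):
library(dplyr)

df_new <- slice_tail(df, n = 8)   # last 8 rows
df_new <- tail(df, 8)             # base-R equivalent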

Wrong histogram from data

I have the data frame new1 with 20 columns of variables, one of which is new1$year. It covers 25 years with the following counts:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the columns differs from the numbers in table(new1$year). For example, the first column is >4000 in the histogram while it should be 2770; another example is 1995: there should be a bar lower than those of the surrounding years, yet this bar is also a little higher.
What am I doing wrong? I have tried numeric(new1$year) (the error says 'invalid length argument'), but with no different result.
Many thanks
Marco
Per my comment, try:
barplot(table(new1$year))
The reason hist does not work exactly as you intend has to do with the specification of the breaks argument. See ?hist:
one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
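If a true histogram is still wanted, explicit half-integer breakpoints force exactly one bin per year, so the bar heights match table(new1$year) (a sketch, assuming the years run 1988-2012 as above):
hist(new1$year, breaks = seq(1987.5, 2012.5, by = 1))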
