Spot erroneous row in a data frame - R

I have a data frame like this
head(data)
V1 V2 V3 V4 V5 V6 V7
1 458263182005000000 1941 2 14 -73.90 38.60 US009239
2 451063182005000002 1941 2 14 -74.00 36.90 US009239
3 447463182005000000 1941 2 14 -74.00 35.40 US009239
4 443863182105000000 1941 2 15 -74.00 34.00 US009239
5 436663182105000001 1941 2 15 -74.00 32.60 US009239
6 433063182105000000 1941 2 15 -73.80 31.70 US009239
but when I do
data <- read.table("data.dat",header=F,sep=";")
I get this error
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
How can I determine in which row something is going wrong (e.g. the format is different)?
Many thanks

R says it cannot allocate memory, so check how large the dataset is relative to your computer's memory.

Despite this being an old question: I don't think R_AllocStringBuffer has to do with the overall memory of your computer, which is also the opinion in this thread:
R could not allocate memory on ff procedure. How come?
Maybe check the delimiter ("," or ";"). A wrong delimiter, or an unbalanced quote character, can make scan() build one huge string...
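One way to locate the offending line(s) is to count the fields per line with base R's count.fields() before parsing; a minimal sketch, assuming the file really is ;-separated with 7 fields per row:
# count ;-separated fields on every line; quote = "" turns off quote handling,
# since an unbalanced quote is a classic cause of scan() building one huge string
n <- count.fields("data.dat", sep = ";", quote = "")
table(n)        # distribution of field counts across the file
which(n != 7)   # line numbers that don't have the expected 7 fields
If an unbalanced quote is indeed the culprit, re-reading with read.table("data.dat", header = FALSE, sep = ";", quote = "") may already avoid the memory error.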


Create a variable and apply a function to another one

I have a dataframe with a column Arrivo (formatted as date) and a column Giorni (formatted as integer) holding a number of days (e.g. 2, 3, 6).
I would like to apply two operations to these columns: duplicate each row as many times as the number in Giorni, and, while duplicating, create a new column called data.osservazione that starts equal to Arrivo and increases by one day with each duplicate.
From this:
No. Casa Anno Data Categoria Camera Arrivo Stornata.il Giorni
1 2.867 SEELE 2019 03/09/2019 CDV 316 28/03/2020 NA 3
2 148.000 SEELE 2020 20/01/2020 CDS 105 29/03/2020 NA 3
3 3.684 SEELE 2019 16/11/2019 CD 102 02/04/2020 NA 5
to this:
No. data.osservazione Casa Anno Data Categoria Camera Arrivo
1 2867 3/28/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
2 2867 3/29/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
3 2867 3/30/2020 SEELE 2019 03/09/2019 CDV 316 3/28/2020 0:00:00
4 148 3/29/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
5 148 3/30/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
6 148 3/31/2020 SEELE 2020 20/01/2020 CDS 105 3/29/2020 0:00:00
Stornata.il Giorni
1 #N/D 3
2 #N/D 3
3 #N/D 3
4 #N/D 3
I was able to duplicate the rows but I don't know how to concurrently create the new column with the values I need.
Please don't mind the date values in the columns, I'll fix them in the end.
Thanks in advance
Since I am a fan of the data.table package, I will propose a solution using data.table. You can install it by typing on the console: install.packages("data.table").
My approach was to create a separate data.frame with an index running from 0 up to whatever number is in Giorni for each row of the original data.frame, and then merge this new data.frame with the original data. By virtue of the many-to-one matches on the specified key, the resulting data.frame "expands" to the desired size, thereby "duplicating" the rows where necessary.
For this I used seq_len(). If you do seq_len(3L) you get [1] 1 2 3, the sequence from 1L to whatever integer you've given in length.out (when length.out >= 1L). Thus seq_len() will produce a sequence that ends at whatever is in Giorni; the challenge is to do this by row, since length.out in seq_len() must be a vector of size 1. We use by in data.table syntax to accomplish this.
So let's get started, first you load data.table:
library(data.table) # load data.table
setDT(data) # data.frame into data.table
In your example it isn't clear whether Arrivo is already in Date format; I am assuming it isn't, so I convert it to Date --you will need this to add the days later.
# if `Arrivo` isn't a Date yet, convert it (note %Y for the four-digit year)
data[["Arrivo"]] <- as.Date(data[["Arrivo"]], format = "%d/%m/%Y")
The next bit is key: using seq_len() and by in data.table syntax, I create a separate data.table --which is always a data.frame, but not the other way around-- with the sequence for every single element of Giorni, thereby expanding the data to the desired size. I use by = "No." because I want to apply seq_len() to every value of Giorni associated with the values in No..
# create an index with the count from `Giorni`, subtract by 1 so the first day is 0.
d1 <- data[, seq_len(Giorni) - 1, by = "No."]
Check the result, you can see where I am going now:
> d1
No. V1
1: 2867 0
2: 2867 1
3: 2867 2
4: 148 0
5: 148 1
Lastly, you inner join d1 with the original data, I am using data.table join syntax here. Then you add the index V1 to Arrivo:
# merge with previous data
res <- d1[data, on = "No."]
# add days to `Arrivo`, create column data.osservazione
res[ , data.osservazione := V1 + Arrivo]
Result:
> res
No. V1 Casa Anno Data Categoria Camera Arrivo
1: 2867 0 SEELE 2019 03/09/2019 CDV 316 2020-03-28
2: 2867 1 SEELE 2019 03/09/2019 CDV 316 2020-03-28
3: 2867 2 SEELE 2019 03/09/2019 CDV 316 2020-03-28
4: 148 0 SEELE 2019 20/01/2020 CDS 105 2020-03-29
5: 148 1 SEELE 2019 20/01/2020 CDS 105 2020-03-29
Stornata.il Giorni data.osservazione
1: NA 3 2020-03-28
2: NA 3 2020-03-29
3: NA 3 2020-03-30
4: NA 2 2020-03-29
5: NA 2 2020-03-30
The next commands are just cosmetic, formatting dates and deleting columns:
# reformat `Arrivo` and `data.osservazione`
cols <- c("Arrivo", "data.osservazione")
res[, (cols) := lapply(.SD, function(x) format(x=x, format="%d/%m/%Y")), .SDcols=cols]
# remove index
res[, V1 := NULL]
Console:
> res
    No.  Casa Anno       Data Categoria Camera     Arrivo Stornata.il Giorni data.osservazione
1: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        28/03/2020
2: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        29/03/2020
3: 2867 SEELE 2019 03/09/2019       CDV    316 28/03/2020          NA      3        30/03/2020
4:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        29/03/2020
5:  148 SEELE 2019 20/01/2020       CDS    105 29/03/2020          NA      2        30/03/2020
Hi @JdeMello and really thank you for the quick answer!
Indeed it was what I was looking for, but in the meantime I kind of found a solution myself using lubridate, tidyverse and purrr.
What I did was transform variables from Posix to date (revenue is my df):
revenue <- revenue %>% mutate(Data = as_date(Data), Arrivo = as_date(Arrivo), `Stornata il` = as_date(`Stornata il`), Partenza = as_date(Partenza))
Then I created another data frame but included variables id and data_obs:
revenue_1 <- revenue %>% mutate(data_obs = Arrivo, id = 1:nrow(revenue))
I created another data frame with the variable data_obs iterated by the number of Giorni:
revenue_2 <- revenue_1 %>% group_by(id, data_obs) %>%
complete(Giorni = sequence(Giorni)) %>%
ungroup() %>%
mutate(data_obs = data_obs + Giorni -1)
I extracted data_obs:
data_obs <- revenue_2$data_obs
I created another data frame to duplicate the rows:
revenue_3 <- revenue %>% map_df(.,rep, .$Giorni)
And finally created the ultimate data frame that I needed:
revenue_finale <- revenue_3 %>% mutate(data_obs = data_obs)
I know it's kind of redundant having created all those data frames, but I have very little knowledge of R at the moment and had to work around it.
I wanted to merge the data frames but, for reasons unknown to me, it didn't work out.
What I found kinda fun is that you can play with many packages to get to your point instead of using just one.
I've never used data.table so your answer is very interesting and I'll try to memorize it.
So again, really really thank you!!
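For reference, the whole expansion can also be done in one pipe with tidyr::uncount(), which replicates each row by a weight column; a minimal sketch under the same assumptions (Arrivo already converted to Date, tidyr >= 0.8 for uncount()):
library(dplyr)
library(tidyr)
revenue_exp <- revenue %>%
  uncount(Giorni, .remove = FALSE, .id = "giorno") %>%  # replicate each row Giorni times; .id numbers the copies 1..Giorni
  mutate(data_obs = Arrivo + giorno - 1)                # first observed day is Arrivo itself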

Avoid For-Loops in R

I'm sure this question has been posed before, but would like some input on my specific question. In return for your help, I'll use an interesting example.
Sean Lahman provides giant datasets of MLB baseball statistics, available free on his website (http://www.seanlahman.com/baseball-archive/statistics/).
I'd like to use this data to answer the following question: What is the average number of home runs per game recorded for each decade in the MLB?
Below I've pasted all relevant script:
teamdata = read.csv("Teams.csv", header = TRUE)
decades = c(1870,1880,1890,1900,1910,1920,1930,1940,1950,1960,1970,1980,1990,2000,2010,2020)
meanhomers = c()
for (i in 1:length(decades)) {
  meanhomers[i] = mean(teamdata$HR[teamdata$yearID >= decades[i] & teamdata$yearID < decades[i+1]])
}
My primary question is, how could this answer have been determined without resorting to the dreaded for-loop?
Side question: What simple script would have generated the decades vector for me?
(For those interested in the answer to the baseball question, see below.)
meanhomers
[1] 4.641026 23.735849 34.456522 20.421053 25.755682 61.837500 84.012500
[8] 80.987500 130.375000 132.166667 120.093496 126.700000 148.737410 173.826667
[15] 152.973333 NaN
Edit for clarity: Turns out I answered the wrong question; the answer provided above indicates the number of home runs per team per year, not per game. A little fix of the denominator would get the correct result.
Here's a data.table example (this assumes teamdata has been converted to a data.table, e.g. with setDT()). Because others showed how to use cut, I took another route for splitting the data into decades:
library(data.table)
setDT(teamdata)
teamdata[, list(HRperYear = mean(HR)), by = 10 * floor(yearID / 10)]
However, the original question mentions average HRs per game, not per year (though the code and answers clearly deal with HRs per year).
Here's how you could compute average HRs per game (and average games per team per year):
teamdata[, list(HRperYear = mean(HR), HRperGame = sum(HR) / sum(G), games = mean(G)), by = 10 * floor(yearID / 10)]
floor HRperYear HRperGame games
1: 1870 4.641026 0.08911866 52.07692
2: 1880 23.735849 0.21543555 110.17610
3: 1890 34.456522 0.25140108 137.05797
4: 1900 20.421053 0.13686067 149.21053
5: 1910 25.755682 0.17010657 151.40909
6: 1920 61.837500 0.40144445 154.03750
7: 1930 84.012500 0.54593453 153.88750
8: 1940 80.987500 0.52351325 154.70000
9: 1950 130.375000 0.84289640 154.67500
10: 1960 132.166667 0.81977946 161.22222
11: 1970 120.093496 0.74580935 161.02439
12: 1980 126.700000 0.80990313 156.43846
13: 1990 148.737410 0.95741873 155.35252
14: 2000 173.826667 1.07340167 161.94000
15: 2010 152.973333 0.94427984 162.00000
(The low average game totals in the 1980s and 1990s are due to the 1981 and 1994-95 player strikes.)
PS: Nicely-written question, but it would be extra nice for you to provide a fully reproducible example so that I don't have to go and download the CSV to answer your question. Making dummy data is OK.
You can use seq to generate sequences.
decades <- seq(1870, 2020, by=10)
You can use cut to split up numeric variables into intervals.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, dig.lab=4)
Basically it creates a factor with one level for each decade (as specified by the breaks). The dig.lab=4 is just so it prints the years as e.g. "1870" not "1.87e+03".
See ?cut for further configuration (e.g. whether '1980' falls in this decade or the next, and so on; you can even configure the labels if you think you'll use them).
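As a quick illustration of how the right argument flips which endpoint is included (toy years, just to show the behaviour):
years <- c(1880, 1885, 1890)
cut(years, breaks = c(1880, 1890, 1900), dig.lab = 4)                 # (1880,1890]: 1880 becomes NA, 1890 lands in the first bin
cut(years, breaks = c(1880, 1890, 1900), dig.lab = 4, right = FALSE)  # [1880,1890): 1880 lands in the first bin, 1890 in the second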
Then to do something for each decade, use the plyr package (data.table and dplyr are other options, but I think plyr has the easiest learning curve, and your data does not seem large enough to need data.table).
library(plyr)
ddply(teamdata, .(decade), summarize, meanhomers=mean(HR))
decade meanhomers
1 (1870,1880] 4.930233
2 (1880,1890] 25.409091
3 (1890,1900] 35.115702
4 (1900,1910] 20.068750
5 (1910,1920] 27.284091
6 (1920,1930] 67.681250
7 (1930,1940] 84.050000
8 (1940,1950] 84.125000
9 (1950,1960] 130.718750
10 (1960,1970] 133.349515
11 (1970,1980] 117.745968
12 (1980,1990] 127.584615
13 (1990,2000] 155.053191
14 (2000,2010] 170.226667
15 (2010,2020] 152.775000
Mine is a little different to yours because my intervals are (, ] whereas yours are [, ); you can pass right=FALSE to cut to switch these around.
You can also use the sqldf package in order to use SQL queries on the data.
Here is the code:
library(sqldf)
sqldf("select floor(yearID/10)*10 as decade,avg(hr) as count
from Teams
group by decade;")
decade count
1 1870 4.641026
2 1880 23.735849
3 1890 34.456522
4 1900 20.421053
5 1910 25.755682
6 1920 61.837500
7 1930 84.012500
8 1940 80.987500
9 1950 130.375000
10 1960 132.166667
11 1970 120.093496
12 1980 126.700000
13 1990 148.737410
14 2000 173.826667
15 2010 152.973333
aggregate is handy for this sort of thing. You can use your decades object with findInterval to put the years into bins:
aggregate(HR ~ findInterval(yearID, decades), data=teamdata, FUN=mean)
## findInterval(yearID, decades) HR
## 1 1 4.641026
## 2 2 23.735849
## 3 3 34.456522
## 4 4 20.421053
## 5 5 25.755682
## 6 6 61.837500
## 7 7 84.012500
## 8 8 80.987500
## 9 9 130.375000
## 10 10 132.166667
## 11 11 120.093496
## 12 12 126.700000
## 13 13 148.737410
## 14 14 173.826667
## 15 15 152.973333
Note that the intervals used are left-closed, as you desire. Also note that the intervals need not be regular. Yours are, which leads to the "side question" of how to produce the decades vector: don't even compute it. Instead, directly compute which decade each year falls in:
aggregate(HR ~ I(10 * (yearID %/% 10)), data=teamdata, FUN=mean)
## I(10 * (yearID%/%10)) HR
## 1 1870 4.641026
## 2 1880 23.735849
## 3 1890 34.456522
## 4 1900 20.421053
## 5 1910 25.755682
## 6 1920 61.837500
## 7 1930 84.012500
## 8 1940 80.987500
## 9 1950 130.375000
## 10 1960 132.166667
## 11 1970 120.093496
## 12 1980 126.700000
## 13 1990 148.737410
## 14 2000 173.826667
## 15 2010 152.973333
I usually prefer the formula interface to aggregate as used above, but you can get better names directly by using the non-formula interface. Here's the example for each of the above:
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=findInterval(yearID,decades)), FUN=mean))
## Decade mean.HR
## 1 1 4.641026
## ...
with(teamdata, aggregate(list(mean.HR=HR), list(Decade=10 * (yearID %/% 10)), FUN=mean))
## Decade mean.HR
## 1 1870 4.641026
## ...
dplyr::group_by, mixed with cut, is a good option here and avoids looping. The decades vector is just a stepped sequence.
decades <- seq(1870,2020,by=10)
cut breaks the data into categories, which I've labelled by the decades themselves for clarity.
teamdata$decade <- cut(teamdata$yearID, breaks=decades, right=FALSE, labels=decades[1:(length(decades)-1)])
Then dplyr handles the grouped summarise as neatly as you could hope
library(dplyr)
teamdata %>% group_by(decade) %>% summarise(meanhomers=mean(HR))
# decade meanhomers
# (fctr) (dbl)
# 1 1870 4.641026
# 2 1880 23.735849
# 3 1890 34.456522
# 4 1900 20.421053
# 5 1910 25.755682
# 6 1920 61.837500
# 7 1930 84.012500
# 8 1940 80.987500
# 9 1950 130.375000
# 10 1960 132.166667
# 11 1970 120.093496
# 12 1980 126.700000
# 13 1990 148.737410
# 14 2000 173.826667
# 15 2010 152.973333

Last name, First Name to First Name Last Name

I have a set of names in last, first format
Name Pos Team Week.x Year.x GID.x h.a.x Oppt.x Week1Points DK.salary.x Week.y Year.y GID.y
1 Abdullah, Ameer RB det 1 2015 2995 a sdg 19.4 4000 2 2015 2995
2 Adams, Davante WR gnb 1 2015 5263 a chi 9.9 4400 2 2015 5263
3 Agholor, Nelson WR phi 1 2015 5378 a atl 1.5 5700 2 2015 5378
4 Aiken, Kamar WR bal 1 2015 5275 a den 0.9 3300 2 2015 5275
5 Ajirotutu, Seyi WR phi 1 2015 3877 a atl 0.0 3000 NA NA NA
6 Allen, Dwayne TE ind 1 2015 4551 a buf 10.7 3400 2 2015 4551
That is just the first 6 lines. I would like to flip the names to First Name Last Name. Here is what I tried.
> strsplit(DKPoints$Name, split = ",")
This splits the name variable, but there are white spaces, so to clear them I tried,
> str_trim(splitnames)
But the results did not come out right. Here is what they look like.
[1] "c(\"Abdullah\", \" Ameer\")" "c(\"Adams\", \" Davante\")"
[3] "c(\"Agholor\", \" Nelson\")" "c(\"Aiken\", \" Kamar\")"
[5] "c(\"Ajirotutu\", \" Seyi\")" "c(\"Allen\", \" Dwayne\")"
Any advice? I would like to get a column for the data frame to look like
Ameer Abdullah
Davante Adams
Nelson Agholor
Kamar Aiken
Any advice would be much appreciated. Thanks
sub("(\\w+),\\s(\\w+)","\\2 \\1", df$name)
(\\w+) matches the names, ,\\s matches ", "(comma and space), \\2 \\1 returns the names in opposite order.
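For instance, applied to a few of the sample names (assuming the column is called Name, as in the data shown):
names <- c("Abdullah, Ameer", "Adams, Davante", "Agholor, Nelson")
sub("(\\w+),\\s(\\w+)", "\\2 \\1", names)
# [1] "Ameer Abdullah"  "Davante Adams"   "Nelson Agholor"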
Assuming all names are "Lastname, firstname" you could do something like this:
names <- c("A, B","C, D","E, F")
newnames <- sapply(strsplit(names, split = ", "),
                   function(x) paste(rev(x), collapse = " "))
> newnames
[1] "B A" "D C" "F E"
It splits each name on ", " and then pastes things back together in reverse order.
Edit: probably no problem for small datasets, but the other solutions provided are a lot faster. microbenchmark results for 100,000 names:
Unit: milliseconds
expr min lq mean median uq max neval cld
heroka 1103.0419 1242.6418 1276.7765 1274.6746 1311.1218 1557.8579 50 c
lyzander 149.4466 177.0036 206.4558 191.1249 218.1756 345.7960 50 b
johannes 142.7585 144.5943 151.0078 146.0602 147.1980 284.2589 50 a
One way, using str_split_fixed:
library(stringr)
#split Name into two columns
splits <- str_split_fixed(df$Name, ", ", 2)
#now merge these two columns the other way round
df$Name <- paste(splits[,2], splits[,1], sep = ' ')
Output:
Name Pos Team Week.x Year.x GID.x h.a.x Oppt.x Week1Points DK.salary.x Week.y Year.y GID.y
1 Ameer Abdullah RB det 1 2015 2995 a sdg 19.4 4000 2 2015 2995
2 Davante Adams WR gnb 1 2015 5263 a chi 9.9 4400 2 2015 5263
3 Nelson Agholor WR phi 1 2015 5378 a atl 1.5 5700 2 2015 5378
4 Kamar Aiken WR bal 1 2015 5275 a den 0.9 3300 2 2015 5275
5 Seyi Ajirotutu WR phi 1 2015 3877 a atl 0.0 3000 NA NA NA
6 Dwayne Allen TE ind 1 2015 4551 a buf 10.7 3400 2 2015 4551
Try this one:
df$Name2 <- paste(gsub("^.+,\\s*", "", df$Name), gsub(",.+$", "", df$Name), sep = " ")
where df is your data frame.

How to read .csv-data containing thousand separators and special handling of zeros (in R)?

R version 3.2.2 on Ubuntu 14.04
I am trying to read in R .csv-data (two columns: "id" and "variable1") containing the thousand separator ",".
So far no problem. I am using read.csv2 and the data looks like that:
> data <- read.csv2("data.csv", sep = ";", stringsAsFactors = FALSE, dec = ".")
> data[1000:1010, ]
id variable1
1 2,001
1,001 2,002
1,002 2,001
1,003 2,002
1,004 2,001
1,005 2,002
1,006 2,001
1,007 2,002
1,008 2,001
1,009 2,002
1,01 2,001
After that, I first tried to use gsub() to remove the commas:
data[, c("id", "variable1")] <- sapply(data[, c("id", "variable1")],
function(x) {as.numeric(gsub("\\,","", as.character(x)))})
> data[1000:1010, ]
id variable1
1 2001
1001 2002
1002 2001
1003 2002
1004 2001
1005 2002
1006 2001
1007 2002
1008 2001
1009 2002
101 2001
I think my problem is already obvious in the first output: there is a thousands separator, but the trailing zeros are missing. For example, the number "1000" is displayed as just "1" and "1010" as "1,01" in the "id" variable (also in the .csv data itself). Of course, R can't identify this.
So my question is: is there a way to tell R that every number must have three digits after the thousands separator when reading in the data (or after reading it), so that I get the correct numbers?
The data should look like this:
> data[1000:1010, ]
id variable1
1000 2001
1001 2002
1002 2001
1003 2002
1004 2001
1005 2002
1006 2001
1007 2002
1008 2001
1009 2002
1010 2001
Edit:
Thanks you all for your answers. Unfortunately the suggestions will work for this example but not for my data, because I think I chose bad example rows. Other rows in the data can look like this:
id1 variable1
1 1 2,001
999 999 1,102
1000 1 2,001
1001 1,001 2,002
1002 1,002 2,001
Of course, the number "1" appears twice. The first really is a "1", but the second should be "1000". Now I think I can't solve this problem in R; maybe I need a better export of the original data, because the problem also appears in the .csv data.
If "," is the only separator, i.e. all of the numbers are integers, you can set the dec argument of csv2 (or read.csv) to "," and multiply by 1000:
data <- read.csv2(
text = "id ; variable1
1 ; 2,001
1,008 ; 2,001
1,009 ; 2,002
1,01 ; 2,001
1,3 ; 2,0",
sep = ";",
stringsAsFactors = FALSE,
header = TRUE,
dec = "," )
> 1000*data
id variable1
1 1000 2001
2 1008 2001
3 1009 2002
4 1010 2001
5 1300 2000
>
After you've removed the commas, and assuming every id originally had exactly four digits, you could restore the truncated trailing zeros like this:
data$id <- data$id * (10^(4 - nchar(data$id)))
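A quick check of why this works on the sample ids (and why, as the edit above notes, it can't tell a true 1 from a truncated 1000):
id <- c(1, 1001, 101)        # comma-mangled values read in as numbers
id * (10^(4 - nchar(id)))    # gives 1000 1001 1010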

Fuzzy string matching in r

I have 2 datasets with more than 100K rows each. I would like to merge them based on fuzzy string matching of one column ('movie title') as well as using the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantômas - à l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at agrep, but it only matches one string at a time. The stringdist function is good, but you need to run it in a loop, find the minimum distance, and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when strings contain typos.
In this scenario plain fuzzy matching alone may not give good results: e.g. the movie title 'toy story' in one dataset could be matched to 'toy story 2' in the other, which is not right. So I need to consider the release date to make sure the matched movies are unique.
Is there a way to achieve this task without using a loop? Worst case, if I have to use a loop, how can I make it work as efficiently and as fast as possible?
I have tried the following code but it has taken an awful amount of time to process.
for (i in 1:nrow(test)) {
  for (j in 1:nrow(test1)) {
    test$title.match[i] <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
                                  test$title[i], NA)
  }
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates; any zeros would mean the same release date. (jarowinkler() comes from the RecordLinkage package; note that this element-wise ifelse() assumes both title columns have the same length.)
library(RecordLinkage)
dataset1$title.match <- ifelse(jarowinkler(dataset1$title, dataset2$title) > 0.85, dataset1$title, NA)
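If that element-wise comparison doesn't fit because the two title vectors have different lengths, a loop-free sketch with the stringdist package is another option. stringdistmatrix() computes the full Jaro-Winkler distance matrix in one vectorised call; note it returns a distance, not a similarity, so similarity > 0.85 corresponds roughly to distance < 0.15 (test and test1 follow the question's naming, and the 1682 x 11451 matrix costs some memory):
library(stringdist)
# full Jaro-Winkler distance matrix between the two title vectors, no explicit loop
d <- stringdistmatrix(test$title, test1$title, method = "jw", p = 0.1)
best <- apply(d, 1, which.min)                 # index of the closest test1 title for each test title
matches <- data.frame(title1 = test$title,
                      title2 = test1$title[best],
                      dist   = d[cbind(seq_len(nrow(d)), best)])
matches <- matches[matches$dist < 0.15, ]      # keep close matches, then compare release dates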
