Variable value occurring on 2 dates in R

I want to find who had an apple or an orange on at least 2 different (unique) dates. I would like to create a new column with a binary indicator for whether an individual had an orange or an apple on at least two dates (1=yes, 0=no).
The nearest I've come is this plyr code.
df1<- ddply(df, .(names, fruit), mutate, acne = ifelse(fruit=="apple" | fruit=="orange" & length(unique(dates))>=2,1,0))
This is not the solution however. anne gets apples twice but on the same date, so she should not get a 1 here. Similarly ted gets a 1, even though he only got an apple once.
This is closer, but still not correct. It gives a 1 to any fruit that has occurred twice. I need the fruit to occur on two different dates per person.
df2<- ddply(df, .(fruit), mutate, acne = ifelse(length(unique(dates))>=2, 1, 0))
If anyone could point me in the right direction here I would be very grateful.
Thank you in advance
SAMPLE DF
names<-as.character(c("john", "john", "philip", "ted", "john", "john", "anne", "john", "mary","anne", "mary","mary","philip","mary", "su","mary", "jim", "sylvia", "mary", "ted","ted","mary", "sylvia", "jim", "ted", "john", "ted"))
dates<-as.Date(c("2010-07-01", "2010-07-13", "2010-05-12","2010-02-14","2010-06-30","2010-08-15", "2010-03-21","2010-04-04","2010-09-01", "2010-03-21", "2010-12-01", "2011-01-01", "2010-08-12", "2010-11-11", "2010-05-12", "2010-12-03", "2010-07-12", "2010-12-21", "2010-02-18", "2010-10-29", "2010-08-13", "2010-11-11", "2010-05-12", "2010-04-01", "2010-05-06", "2010-09-28", "2010-11-28" ))
fruit<-as.character(c("kiwi","apple","mango", "banana","strawberry","orange","apple","raspberry", "orange","apple","orange", "apple", "strawberry", "apple", "pineapple", "peach", "orange", "nectarine", "grape","banana", "melon", "apricot", "plum", "lychee", "mango", "watermelon", "apple" ))
df<-data.frame(names,dates,fruit)
df
Desired output
names dates fruit v1
7 anne 2010-03-21 apple 0
10 anne 2010-03-21 apple 0
17 jim 2010-07-12 orange 0
24 jim 2010-04-01 lychee 0
1 john 2010-07-01 kiwi 1
2 john 2010-07-13 apple 1
5 john 2010-06-30 strawberry 1
6 john 2010-08-15 orange 1
8 john 2010-04-04 raspberry 1
26 john 2010-09-28 watermelon 1
9 mary 2010-09-01 orange 1
11 mary 2010-12-01 orange 1
12 mary 2011-01-01 apple 1
14 mary 2010-11-11 apple 1
16 mary 2010-12-03 peach 1
19 mary 2010-02-18 grape 1
22 mary 2010-11-11 apricot 1
3 philip 2010-05-12 mango 0
13 philip 2010-08-12 strawberry 0
15 su 2010-05-12 pineapple 0
18 sylvia 2010-12-21 nectarine 0
23 sylvia 2010-05-12 plum 0
4 ted 2010-02-14 banana 0
20 ted 2010-10-29 banana 0
21 ted 2010-08-13 melon 0
25 ted 2010-05-06 mango 0
27 ted 2010-11-28 apple 0

This should probably do the trick:
v1 = ave(1:nrow(df), df$names, FUN = function(x) length(unique(df$dates[x[df$fruit[x] %in% c("orange", "apple")]])) > 1)
df$v1 = v1
df = df[order(df$names),]

If I understood correctly, for the purpose of your problem, apples == oranges. So the plan is to (a) create a small data.frame where fruits are oranges or apples only, as you don't care about other fruits, (b) select only unique date/name rows, (c) aggregate by name and (d) merge back to your original data.frame to get your result:
ndf <- subset(df, fruit %in% c("apple", "orange"))
ndf <- ndf[!duplicated(ndf[, c("names", "dates")]), ]
Here you can use table, but I prefer aggregate:
v <- aggregate(rep(1, nrow(ndf)), by = ndf[, "names", drop = FALSE], sum)
v$x <- ifelse(v$x > 1, 1, 0)
rv <- merge(df, v, all.x = TRUE)  # keep people who never had an apple or an orange
rv$x[is.na(rv$x)] <- 0
It is a bit longer, code-wise, than the other answers, but it is clear and does the job.
You could just use aggregate without the first two steps, but on a huge data.frame with lots of names, aggregating for every name can prove very costly.
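For comparison, the same plan can be sketched in a few lines of base R without the merge step. This is only an illustration on a made-up subset of the sample data; the helper names target, n_dates and qualifying are assumptions and do not come from the answers here.

```r
# Small made-up sample in the same shape as the question's df
df <- data.frame(
  names = c("anne", "anne", "john", "john", "john", "ted", "ted", "su"),
  dates = as.Date(c("2010-03-21", "2010-03-21", "2010-07-13", "2010-08-15",
                    "2010-07-01", "2010-11-28", "2010-02-14", "2010-05-12")),
  fruit = c("apple", "apple", "apple", "orange", "kiwi", "apple", "banana", "pineapple"),
  stringsAsFactors = FALSE
)

# Rows where the fruit is one of the two of interest
target <- df$fruit %in% c("apple", "orange")

# Number of distinct apple/orange dates per person
n_dates <- sapply(split(df$dates[target], df$names[target]),
                  function(d) length(unique(d)))

# People with at least two distinct dates get a 1 on all of their rows
qualifying <- names(n_dates)[n_dates >= 2]
df$v1 <- as.integer(df$names %in% qualifying)
df
```

People who never had an apple or an orange simply do not appear in qualifying, so they end up with 0 without any special handling.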

I did something similar to @amit's solution using by. Rownames got mangled during do.call, but you can fix that.
result <- by(df, INDICES = df$names, FUN = function(x) {
  if (length(unique(x$dates)) == 1) {
    x$index <- 0
    return(x)
  }
  ao.sum <- sum(x$fruit %in% c("apple", "orange"))
  if (ao.sum < 2) x$index <- 0 else x$index <- 1
  x
})
do.call("rbind", result)
names dates fruit index
anne.7 anne 2010-03-21 apple 0
anne.10 anne 2010-03-21 apple 0
jim.17 jim 2010-07-12 orange 0
jim.24 jim 2010-04-01 lychee 0
john.1 john 2010-07-01 kiwi 1
john.2 john 2010-07-13 apple 1
john.5 john 2010-06-30 strawberry 1
john.6 john 2010-08-15 orange 1
john.8 john 2010-04-04 raspberry 1
john.26 john 2010-09-28 watermelon 1
mary.9 mary 2010-09-01 orange 1
mary.11 mary 2010-12-01 orange 1
mary.12 mary 2011-01-01 apple 1
mary.14 mary 2010-11-11 apple 1
mary.16 mary 2010-12-03 peach 1
mary.19 mary 2010-02-18 grape 1
mary.22 mary 2010-11-11 apricot 1
philip.3 philip 2010-05-12 mango 0
philip.13 philip 2010-08-12 strawberry 0
su su 2010-05-12 pineapple 0
sylvia.18 sylvia 2010-12-21 nectarine 0
sylvia.23 sylvia 2010-05-12 plum 0
ted.4 ted 2010-02-14 banana 0
ted.20 ted 2010-10-29 banana 0
ted.21 ted 2010-08-13 melon 0
ted.25 ted 2010-05-06 mango 0
ted.27 ted 2010-11-28 apple 0

Related

Group dataframe rows by creating a unique ID column based on the amount of time passed between entries and variable values

I'm trying to group the rows of my dataframe into "courses" when the same variables appear at regular date intervals. When there is a gap in time frequency or when one of the variables changes, I would like to give it a new course ID.
To give an example, my data looks something like this:
Date Name Item
1 2018-06-02 Johan Apple
2 2018-07-05 Johan Apple
3 2018-08-02 Johan Apple
4 2019-04-15 Johan Apple
5 2019-05-15 Johan Apple
6 2019-05-30 Samantha Orange
7 2019-06-12 Samantha Orange
8 2019-06-27 Samantha Orange
9 2018-02-15 Mary Lemon
10 2018-04-10 Mary Lemon
11 2018-06-12 Mary Lemon
12 2018-08-13 Mary Lime
13 2018-08-27 Mary Lime
14 2017-03-09 George Kiwi
Each different combination of Name and Item should generate a new course ID.
However (the tricky part), if there is a significant time gap between two transactions where the other variables are constant, it should be given a new CourseID. A significant gap is defined as either more than 6 months or more than three times the average interval up to that date for that specific combination of Item and Name.
In my example:
Because Johan had a break after August 2018, transactions after that should have a new CourseID. Ideally the interval to check for future breaks would then be based on the average in this new group.
Samantha is buying oranges on a biweekly basis with no significant gap, so all her transactions will have one CourseID.
Mary is buying lemons at a regular interval but then switches to buying limes at a regular interval, so these have two CourseIDs.
George just bought the one Kiwi, so a single CourseID
Code to reproduce:
df1 <- data.frame(Date = as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15", "2019-05-15", "2019-05-30", "2019-06-12", "2019-06-27", "2018-02-15", "2018-04-10", "2018-06-12", "2018-08-13", "2018-08-27", "2017-03-09")),
Name = c(rep("Johan", 5), rep("Samantha", 3), rep("Mary", 5), "George"),
Item = c(rep("Apple", 5), rep("Orange", 3), rep("Lemon", 3), rep("Lime",2), "Kiwi"))
I'd like to create an additional column which has a unique identifier for each course - i.e. using stringi or similar.
Ideally the output would look something like this:
Date Name Item CourseID
1 2018-06-02 Johan Apple q3J
2 2018-07-05 Johan Apple q3J
3 2018-08-02 Johan Apple q3J
4 2019-04-15 Johan Apple f8j
5 2019-05-15 Johan Apple f8j
6 2019-05-30 Samantha Orange p8U
7 2019-06-12 Samantha Orange p8U
8 2019-06-27 Samantha Orange p8U
9 2018-02-15 Mary Lemon wi9
10 2018-04-10 Mary Lemon wi9
11 2018-06-12 Mary Lemon wi9
12 2018-08-13 Mary Lime q8U
13 2018-08-27 Mary Lime q8U
14 2017-03-09 George Kiwi jJ0
I've tried going about this using max/min on the date variable, however I'm stumped when it comes to identifying the break based on the previous purchasing pattern.
There may be a package I don't know which has something for this, however I've been trying with Tidyverse so far.
Here's a dplyr approach that calculates the gap and the running average gap within each Name/Item group, then flags large gaps, and assigns a new group for each large gap or change in Name or Item. Six months is approximated as 180 days.
library(dplyr)

df1 %>%
  group_by(Name, Item) %>%
  mutate(purch_num = row_number(),
         time_since_first = Date - first(Date),
         # the first purchase in a group gets an infinite gap, so it always starts a new group
         gap = Date - lag(Date, default = as.Date(-Inf, origin = "1970-01-01")),
         avg_gap = time_since_first / (purch_num - 1),
         new_grp_flag = gap > 180 | gap > 3 * avg_gap) %>%
  ungroup() %>%
  mutate(group = cumsum(new_grp_flag))
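The question also asks that, after a break, later gaps be judged against the average of the new course only. A minimal base R sketch of that idea follows. It is not the answer's method: the function name assign_courses, its parameters max_gap and ratio, the 180-day stand-in for six months, and the choice to average only the gaps before the current one are all assumptions made for illustration.

```r
# Assign course numbers within ONE Name/Item group, given its purchase dates.
# Assumptions: dates are sorted ascending, 6 months is approximated as 180 days,
# and the average is taken over the earlier gaps of the current course only.
assign_courses <- function(dates, max_gap = 180, ratio = 3) {
  d <- as.numeric(dates)
  n <- length(d)
  course <- integer(n)
  if (n == 0) return(course)
  id <- 1L
  start <- 1L                          # index of the first purchase in the current course
  course[1] <- id
  for (i in seq_len(n)[-1]) {
    gap <- d[i] - d[i - 1]
    n_prev <- i - 1L - start           # number of earlier gaps in this course
    avg_prev <- if (n_prev > 0) (d[i - 1] - d[start]) / n_prev else NA_real_
    if (gap > max_gap || (!is.na(avg_prev) && gap > ratio * avg_prev)) {
      id <- id + 1L                    # break found: start a new course
      start <- i                       # later averages use only the new course
    }
    course[i] <- id
  }
  course
}

johan <- as.Date(c("2018-06-02", "2018-07-05", "2018-08-02", "2019-04-15", "2019-05-15"))
assign_courses(johan)
```

Applied per Name/Item group (for example through ave or a grouped mutate), the resulting number could be pasted together with Name and Item to form a unique course label.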

Only filter values in a column based on a condition

Let's say I have the following dataframe:
my_basket = data.frame(ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,NA),
Tax = c(2,4,5,6,2,3,5,1,3,4,5,6,4,NA))
This then yields:
> my_basket
ITEM_GROUP ITEM_NAME Price Tax
1 Fruit Apple 100 2
2 Fruit Banana 80 4
3 Fruit Orange 80 5
4 Fruit Mango 90 6
5 Fruit Papaya 65 2
6 Vegetable Carrot 70 3
7 Vegetable Potato 60 5
8 Vegetable Brinjal 70 1
9 Vegetable Raddish 25 3
10 Dairy Milk 60 4
11 Dairy Curd 40 5
12 Dairy Cheese 35 6
13 Dairy Milk 50 4
14 Dairy Paneer NA NA
What I now would like to do, is make a list of fruits I want to keep and then filter those, so:
fruitlist = c("Apple", "Banana")
How would I go about using tidyverse to filter the data in my data.frame to only keep the fruits in my fruitlist, but also all my Vegetables and Dairy? Normally I'd do:
my_basket %<>% filter(ITEM_NAME %in% fruitlist)
But then I'd also lose all the vegetables and dairy, which is not what I want. I've been trying to make something work with case_when but can't seem to make it work. There must be something obvious I'm missing here.
EDIT: Seconds after posting my question I finally realised:
my_basket %<>% filter(ITEM_NAME %in% fruitlist | ITEM_GROUP != "Fruit")
That solves it. I think if I'd have to filter multiple groups like this, piping the filter command repeatedly would work too.
You could use grepl with a regex alternation, while still keeping every row outside the Fruit group:
fruitlist <- c("Apple", "Banana")
regex <- paste0("^(?:", paste0(fruitlist, collapse="|"), ")$")
my_basket %<>% filter(grepl(regex, ITEM_NAME) | ITEM_GROUP != "Fruit")
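If several groups need their own keep-list, the same "in the list, or not in that group" condition can be applied group by group. The base R sketch below uses a shortened, made-up version of my_basket; the keep_lists object and the Dairy entry are hypothetical and only illustrate the pattern.

```r
# Shortened, made-up version of the basket from the question
my_basket <- data.frame(
  ITEM_GROUP = c("Fruit", "Fruit", "Fruit", "Vegetable", "Dairy", "Dairy"),
  ITEM_NAME  = c("Apple", "Banana", "Orange", "Carrot", "Milk", "Curd"),
  stringsAsFactors = FALSE
)

# One keep-list per restricted group; groups not named here are kept whole
keep_lists <- list(Fruit = c("Apple", "Banana"), Dairy = "Milk")

keep <- rep(TRUE, nrow(my_basket))
for (grp in names(keep_lists)) {
  in_grp <- my_basket$ITEM_GROUP == grp
  # within a restricted group, keep only the listed items
  keep[in_grp] <- my_basket$ITEM_NAME[in_grp] %in% keep_lists[[grp]]
}

filtered <- my_basket[keep, ]
filtered
```

Vegetables are untouched here because no keep-list is defined for them.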

R_exclude rows with a column containing a value if multiple rows exist

I have a dataframe "test" as below. I want to exclude all the rows of that person, if this person has "apple" in the "fruit" column, using R language.
I wrote:
filter(test, name != test$name[test$fruit=="apple"])
Any help is appreciated! Thanks!
> test
name fruit
1 kevin apple
2 kevin pear
3 kevin peach
4 jack apple
5 jack pear
6 jack peach
7 jack kiwi
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi
First, we find all the 'name' values which have 'apple' as a fruit.
df=unique(test$name[test$fruit=="apple"])
> df
[1] kevin jack
Levels: caleb jack justin kevin
Now we need to remove the rows from test where name is the same as one of those in df, i.e. 'kevin' or 'jack'.
test1= test[ (!(test$name %in% df)),]
> test1
name fruit
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi
Of course we can write this in a single line:
test2=test[(!(test$name %in% (unique(test$name[test$fruit=="apple"])))),]
> test2
name fruit
8 caleb grapefruit
9 caleb kiwi
10 caleb pear
11 justin pineapple
12 justin grape
13 justin watermelon
14 justin kiwi
You can do this in multiple ways.
In base R :
subset(test, !ave(fruit == 'apple', name, FUN = any))
# name fruit
#4 Justin pineapple
#5 Justin grape
Using dplyr
test %>% group_by(name) %>% filter(!any(fruit == 'apple'))
Or data.table
setDT(test)[, .SD[!any(fruit == 'apple')], name]
Another option in base R without grouping could be
subset(test, !name %in% unique(name[fruit == "apple"]))
data
test <- data.frame(name = c('Jack', 'Jack', 'Jack', 'Justin', 'Justin'),
fruit =c('pineapple', 'apple', 'grape', 'pineapple', 'grape'))
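It may help to see why the attempt in the question misbehaves. Comparing name with != against a vector of several names recycles that vector element by element instead of testing membership. The sketch below uses a made-up six-row version of test; the object names apple_names, wrong and right are chosen here for illustration.

```r
# Made-up, shortened version of the question's data
test <- data.frame(
  name  = c("kevin", "kevin", "jack", "jack", "caleb", "caleb"),
  fruit = c("apple", "pear", "apple", "kiwi", "kiwi", "pear"),
  stringsAsFactors = FALSE
)

apple_names <- unique(test$name[test$fruit == "apple"])  # "kevin" "jack"

# != recycles apple_names as kevin, jack, kevin, jack, ... and compares pairwise,
# so some kevin/jack rows survive
wrong <- test[test$name != apple_names, ]

# %in% tests membership, which is what the question needs
right <- test[!test$name %in% apple_names, ]

wrong
right
```

Because the number of rows here is a multiple of the length of apple_names, R does not even warn about the recycling.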

Filtering for multiple strings within the same column in r

My large data set (Groceries) has a column in it containing character data (Fruits) all of which is lower case and all of which contains no punctuation.
It looks a bit like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
I'm trying to select all the rows (of which there are 3,320) from the Fruits column that contain 5 specific fruits (orange, lime, lemon, grapefruit & passion fruit). Initially I'm only interested in the rows that contain all 5 of these fruits and no additional Fruits. Thus, the only row out of these 5 that should be filtered/subsetted would be row 4. The fruits do not have to be in any particular order.
The data is actually answers to a test, so eventually I'm interested in determining who got 0/5 fruits, who got 1/5, 2/5 and so on...
I've tried 2 methods so far, both to no avail.
Firstly I tried using grep(), but no rows were stored in the resulting data frame.
# 1st attempt with grep()
CorrectFruitRows <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]
And then I tried using filter(), but the selected rows don't contain just the 5 Fruits I'm seeking out, it selects all rows that contain any of the 5 fruits.
# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit",
"lime")
filter <- Groceries %>%
select(Id, Fruits) %>%
filter(str_detect(tolower(Fruits), pattern = CorrectFruits))
The result I'm after initially is a new DF containing all the columns in the Groceries table, but only the rows of those people who got all 5 of the chosen fruits correct.
Next, it would be cool to select the opposite; everyone who didn't get all 5 correct.
Finally, I'd love to be able to subset those who got a specific proportion correct. I.e. row 1 got 3 correct, row 2 only got 1 correct and row 3 only got 1 correct.
Any help would be greatly appreciated!
Here's an example of what some of the columns look like:
# Groceries
Id Age Nationality Colour question Fruits question
1 26-35 Canadian Red apple orange banana lemon grapefruit
2 26-35 US Blue grapes tomato passion fruit
3 46-55 Canadian Red strawberry orange kiwi
4 55+ US Red lemon orange passion fruit grapefruit lime
5 36-45 British Green lemon orange passion fruit grapefruit lime peach
Might need more clarification on what you intend on doing with answers that have all 5 fruits with some extra, but this should help you out. I substituted all instances of "passion fruit" with "passionfruit" to make it easier:
library(stringr)  # for str_count
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit",
"lime")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
which gives
ID Fruits Count
1 apple orange banana lemon grapefruit 3
2 grapes tomato passionfruit 1
3 strawberry orange kiwi 1
4 lemon orange passionfruit grapefruit lime 5
5 lemon orange passionfruit grapefruit lime peach 0
The first line does the passionfruit substitution, and then str_count counts all occurrences of correct fruits in df$Fruits. Finally, if all 5 fruits are correct but there are extras, Count resets to 0.
Here is my answer after seeing others' genius solutions.
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
heds1's response looks great. However, you want to be careful with plain substring matching such as grepl, because it can also match inside compound words. For example, consider the word pineapple; it contains pine and apple. Notice here that searching for apple returns pineapples.
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
The answer provided by sumshyftw is awesome. And I love that I am learning something from sumshyftw. But to demonstrate my point that unrestrained string search could mess your count:
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
Notice that it counted the pineapple as a correct answer even though the only correct fruit is an apple. To overcome this, you want to wrap your words with \\b.
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
R no longer counts pineapple as an apple.
But for the record, sumshyftw deserves the credit for working out the hard part in my example:
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
To show only those with all five fruits:
df2 <- filter(df, df$Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5
Here's one way using grepl with a target list of keywords.
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
You can then change the criteria in the match_rows vector by changing the 5 to, for example 4 for four fruit matches, etc.
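An alternative to pattern matching is to split each answer into words and compare sets, which sidesteps the pineapple/apple substring problem and also counts extra fruits separately. This is a base R sketch on the five example rows from the question; the names tokens, n_correct, n_extra and exact are assumptions, and "passion fruit" is joined into one word first, as in the first answer.

```r
fruits <- c("apple orange banana lemon grapefruit",
            "grapes tomato passion fruit",
            "strawberry orange kiwi",
            "lemon orange passion fruit grapefruit lime",
            "lemon orange passion fruit grapefruit lime peach")
correct <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")

# Join the two-word fruit, then split every answer into whole words
tokens <- strsplit(gsub("passion fruit", "passionfruit", fruits, fixed = TRUE), " +")

# Correct and extra fruits per row, counting each distinct word once
n_correct <- sapply(tokens, function(w) sum(unique(w) %in% correct))
n_extra   <- sapply(tokens, function(w) sum(!unique(w) %in% correct))

# Rows with all five correct fruits and nothing else
exact <- n_correct == length(correct) & n_extra == 0

data.frame(fruits, n_correct, n_extra, exact)
```

Negating exact gives everyone who did not get exactly the five fruits, and n_correct can be used directly to subset by the number of correct answers.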

R - Merging two data files based on partial matching of inconsistent full name formats

I'm looking for a way to merge two data files based on partial matching of participants' full names that are sometimes entered in different formats and sometimes misspelled. I know there are some different function options for partial matches (eg agrep and pmatch) and for merging data files but I need help with a) combining the two; b) doing partial matching that can ignore middle names; c) in the merged data file store both original name formats and d) retain unique values even if they don't have a match.
For example, I have the following two data files:
File name: Employee Data
Full Name Date Started Orders
ANGELA MUIR 6/15/14 25
EILEEN COWIE 6/15/14 44
LAURA CUMMING 10/6/14 43
ELENA POPA 1/21/15 37
KAREN MACEWAN 3/15/99 39
File name: Assessment data
Candidate Leading Factor SI-D SI-I
Angie muir I -3 12
Caroline Burn S -5 -3
Eileen Mary Cowie S -5 5
Elena Pope C -4 7
Henry LeFeuvre C -5 -1
Jennifer Ford S -3 -2
Karen McEwan I -4 10
Laura Cumming S 0 6
Mandip Johal C -2 2
Mubarak Hussain D 6 -1
I want to merge them based on names (Full Name in df1 and Candidate in df2), ignoring middle names (e.g. Eileen Cowie = Eileen Mary Cowie), extra spaces (Laura Cumming = Laura Cumming), misspellings (e.g. Elena Popa = Elena Pope), etc.
The ideal output would look like this:
Name Full Name Candidate Date Started Orders Leading Factor SI-D SI-I
ANGELA MUIR ANGELA MUIR Angie muir 6/15/14 25 I -3 12
Caroline Burn N/A Caroline Burn N/A N/A S -5 -3
EILEEN COWIE EILEEN COWIE Eileen Mary Cowie 6/15/14 44 S -5 5
ELENA POPA ELENA POPA Elena Pope 1/21/15 37 C -4 7
Henry LeFeuvre N/A Henry LeFeuvre N/A N/A C -5 -1
Jennifer Ford N/A Jennifer Ford N/A N/A S -3 -2
KAREN MACEWAN KAREN MACEWAN Karen McEwan 3/15/99 39 I -4 10
LAURA CUMMING LAURA CUMMING Laura Cumming 10/6/14 43 S 0 6
Mandip Johal N/A Mandip Johal N/A N/A C -2 2
Mubarak Hussain N/A Mubarak Hussain N/A N/A D 6 -1
Any suggestions would be greatly appreciated!
Here's a process that may help. You will have to inspect the results and make adjustments as needed.
df1
# v1 v2
#1 ANGELA MUIR 6/15/14
#2 EILEEN COWIE 6/15/14
#3 AnGela Smith 5/3/14
df2
# u1 u2
#1 Eileen Mary Cowie I-3
#2 Angie muir -5 5
index <- sapply(df1$v1, function(x) {
agrep(x, df2$u1, ignore.case=TRUE, max.distance = .5)
}
)
index <- unlist(index)
df2$u1[index] <- names(index)
merge(df1, df2, by.x='v1', by.y='u1')
# v1 v2 u2
#1 ANGELA MUIR 6/15/14 -5 5
#2 EILEEN COWIE 6/15/14 I-3
I had to adjust the argument max.distance in the index function. It may not work for your data, but adjust and test if it works. If this doesn't help, there is a package called stringdist that may have a more robust matching function in amatch.
Data
v1 <- c('ANGELA MUIR', 'EILEEN COWIE', 'AnGela Smith')
v2 <- c('6/15/14', '6/15/14', '5/3/14')
u1 <- c('Eileen Mary Cowie', 'Angie muir')
u2 <- c('I-3', '-5 5')
df1 <- data.frame(v1, v2, stringsAsFactors=F)
df2 <- data.frame(u1, u2, stringsAsFactors = F)
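Another way to approach the matching step is to normalise the names first (lower case, first and last word only, which drops middle names) and then pick the closest name by edit distance with adist from the utils package. The sketch below is only an illustration on a few names from the question; the normalize helper and the max_dist threshold of 3 are assumptions that would need tuning and manual inspection on real data.

```r
employees  <- c("ANGELA MUIR", "EILEEN COWIE", "ELENA POPA")
candidates <- c("Angie muir", "Caroline Burn", "Eileen Mary Cowie", "Elena Pope")

# Lower-case, trim, and keep only the first and last word (drops middle names)
normalize <- function(x) {
  parts <- strsplit(tolower(trimws(x)), "\\s+")
  vapply(parts, function(p) paste(p[1], p[length(p)]), character(1))
}

# Edit-distance matrix: rows are candidates, columns are employees
d <- utils::adist(normalize(candidates), normalize(employees))

# Closest employee for each candidate and the distance to it
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(best), best)]

# Accept the match only if it is close enough (threshold is an assumption)
max_dist <- 3
matched  <- ifelse(best_dist <= max_dist, employees[best], NA_character_)

data.frame(Candidate = candidates, FullName = matched, distance = best_dist)
```

With the matched name stored next to the original Candidate, a merge with all = TRUE on that column would keep both name formats as well as the rows that have no match.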
