How can I count a variable in R conditional on the value of another variable?

I want to count occurrences of a variable in a dataframe by another variable, conditional on the value of a third variable. Here is my data:
Name Store Purchase Date
John CVS Shampoo 1/1/2001
John CVS Toothpaste 1/1/2001
John Whole Foods Kombucha 1/1/2005
John Kroger Ice Cream 1/1/2002
Jane CVS Soap 1/1/2001
Jane Whole Foods Crackers 1/1/2004
For each purchase, I want a count of how many previous purchases were made by the specified person, and how many previous shopping trips, like this:
Name Store Purchase Date Prev_Purchase Prev_trip
John CVS Shampoo 1/1/2001 0 0
John CVS Toothpaste 1/1/2001 0 0
John Whole Foods Kombucha 1/1/2005 3 2
John Kroger Ice Cream 1/1/2002 2 1
Jane CVS Soap 1/1/2001 0 0
Jane Whole Foods Crackers 1/1/2004 1 1
If I wanted the total number of purchases/trips for each person, I would use count or tapply--is there a way to adapt these functions so that the outputs are conditional on a third variable (date)?
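For reference, the per-person totals mentioned above could be computed along these lines (a minimal sketch, assuming the data frame df defined in the Data block of the first answer below):
tapply(df$Purchase, df$Name, length)                      # total purchases per person
tapply(df$Date, df$Name, function(d) length(unique(d)))   # total shopping trips per person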

Maybe you can try base R code using ave:
transform(df,
  Prev_Purchase = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name,
                      FUN = function(x) sapply(x, function(p) sum(p > x))),
  Prev_trip = ave(as.numeric(as.Date(Date, "%d/%m/%Y")), Name,
                  FUN = function(x) sapply(x, function(p) length(unique(x[p > x]))))
)
which gives
Name Store Purchase Date Prev_Purchase Prev_trip
1 John CVS Shampoo 1/1/2001 0 0
2 John CVS Toothpaste 1/1/2001 0 0
3 John Whole Foods Kombucha 1/1/2005 3 2
4 John Kroger Ice Cream 1/1/2002 2 1
5 Jane CVS Soap 1/1/2001 0 0
6 Jane Whole Foods Crackers 1/1/2004 1 1
Data
df <- structure(list(Name = c("John", "John", "John", "John", "Jane",
"Jane"), Store = c("CVS", "CVS", "Whole Foods", "Kroger", "CVS",
"Whole Foods"), Purchase = c("Shampoo", "Toothpaste", "Kombucha",
"Ice Cream", "Soap", "Crackers"), Date = c("1/1/2001", "1/1/2001",
"1/1/2005", "1/1/2002", "1/1/2001", "1/1/2004")), class = "data.frame", row.names = c(NA,
-6L))

I think this should solve your problem. If your data is large, you may want to optimize this code chunk further.
# load packages (dplyr is needed for filter() and bind_rows())
library(lubridate)
library(dplyr)
# base function
AddInfo = function(name, date, df) {
  prev_purchase = sum(df$Name == name & df$Date < date)
  prev_trip = length(unique(filter(df, Name == name & Date < date)$Date))
  data = data.frame(
    Prev_purchase = prev_purchase,
    Prev_trip = prev_trip
  )
  return(data)
}
# define data frame
df = data.frame(
  Name = c(rep('John', 4), rep('Jane', 2)),
  Store = c('CVS', 'CVS', 'Whole Foods', 'Kroger', 'CVS', 'Whole Foods'),
  Purchase = c('Shampoo', 'Toothpaste', 'Kombucha', 'Ice Cream', 'Soap', 'Crackers'),
  Date = c('1/1/2001', '1/1/2001', '1/1/2005', '1/1/2002', '1/1/2001', '1/1/2004')
)
# parse the date strings into Date objects
df$Date = dmy(df$Date)
# apply the function to every row and bind the results
cols = mapply(AddInfo, df$Name, df$Date, MoreArgs = list(df), SIMPLIFY = FALSE)
cols = bind_rows(cols)
df = cbind(df, cols)
Here is the output:
Name Store Purchase Date Prev_purchase Prev_trip
1 John CVS Shampoo 1/1/2001 0 0
2 John CVS Toothpaste 1/1/2001 0 0
3 John Whole Foods Kombucha 1/1/2005 3 2
4 John Kroger Ice Cream 1/1/2002 2 1
5 Jane CVS Soap 1/1/2001 0 0
6 Jane Whole Foods Crackers 1/1/2004 1 1

We could also use outer:
library(dplyr)
library(lubridate)
df %>%
  mutate(Date = dmy(Date)) %>%
  group_by(Name) %>%
  mutate(Prev_Purchase = colSums(outer(Date, Date, FUN = "<")),
         Prev_trip = colSums(outer(unique(Date), Date, FUN = "<")))
# A tibble: 6 x 6
# Groups: Name [2]
# Name Store Purchase Date Prev_Purchase Prev_trip
# <chr> <chr> <chr> <date> <dbl> <dbl>
#1 John CVS Shampoo 2001-01-01 0 0
#2 John CVS Toothpaste 2001-01-01 0 0
#3 John Whole Foods Kombucha 2005-01-01 3 2
#4 John Kroger Ice Cream 2002-01-01 2 1
#5 Jane CVS Soap 2001-01-01 0 0
#6 Jane Whole Foods Crackers 2004-01-01 1 1
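For intuition: within each Name group, outer(Date, Date, "<") builds a logical matrix whose [i, j] entry is TRUE when Date[i] < Date[j], so colSums() gives, for each purchase j, the number of strictly earlier purchases. A small sketch using John's dates from the example:
d <- as.Date(c("2001-01-01", "2001-01-01", "2005-01-01", "2002-01-01"))
colSums(outer(d, d, "<"))   # 0 0 3 2 -- previous purchases for each row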

Related

Joining Dataframes in R, Matching Patterns in Strings

I have two big real-life tables to join up, but here's a little reprex:
I've got a table of small strings and I want to left join it onto a second table, with the join based on whether or not these small strings can be found inside the bigger strings in the second table.
df_1 <- data.frame(index = 1:5,
                   keyword = c("john", "ella", "mil", "nin", "billi"))
df_2 <- data.frame(index_2 = 1001:1008,
                   name = c("John Coltrane", "Ella Fitzgerald", "Miles Davis", "Billie Holliday",
                            "Nina Simone", "Bob Smith", "John Brown", "Tony Montana"))
df_results_i_want <- data.frame(index = c(1, 1:5),
                                keyword = c("john", "john", "ella", "mil", "nin", "billi"),
                                index_2 = c(1001, 1007, 1002, 1003, 1005, 1004),
                                name = c("John Coltrane", "John Brown", "Ella Fitzgerald",
                                         "Miles Davis", "Nina Simone", "Billie Holliday"))
Seems like a str_detect() call and a left_join() call might be part of the solution - ie I'm hoping for something like:
library(tidyverse)
df_results <- df_1 |> left_join(df_2, join_by(blah blah str_detect() blah blah))
I'm using dplyr 1.1 so I can use join_by(), but I'm not sure of the correct way to get what I need - can anyone help please?
I suppose I could do a simple cross join using tidyr::crossing() and then do the str_detect() stuff afterwards (and filter out things that don't match)
df_results <- df_1 |>
  crossing(df_2) |>
  mutate(match = str_detect(name, fixed(keyword, ignore_case = TRUE))) |>
  filter(match) |>
  select(-match)
but in my real life example, the cross join would produce an absolutely enormous table that would overwhelm my PC.
Thank you.
You can try fuzzyjoin::regex_join():
library(fuzzyjoin)
regex_join(df_2, df_1, by=c("name"="keyword"), ignore_case=T)
Output:
index.x name index.y keyword
1 1001 John Coltrane 1 john
2 1002 Ella Fitzgerald 2 ella
3 1003 Miles Davis 3 mil
4 1004 Billie Holliday 5 billi
5 1005 Nina Simone 4 nin
6 1007 John Brown 1 john
join_by() does not support this kind of inexact, pattern-based matching (it only handles equality and inequality conditions), but you can use fuzzyjoin:
library(dplyr)
library(stringr)
library(fuzzyjoin)
df_2 %>%
  mutate(name = tolower(name)) %>%
  fuzzy_left_join(df_1, ., by = c(keyword = "name"),
                  match_fun = \(x, y) str_detect(y, x))
index keyword index_2 name
1 1 john 1001 john coltrane
2 1 john 1007 john brown
3 2 ella 1002 ella fitzgerald
4 3 mil 1003 miles davis
5 4 nin 1005 nina simone
6 5 billi 1004 billie holliday
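For contrast, here is a minimal sketch of the kind of condition join_by() does support in dplyr 1.1 and later, namely inequality joins rather than pattern matching (the events/windows data here is purely hypothetical):
library(dplyr)
# match each event to the window containing it, using inequality conditions
events  <- data.frame(id = 1:3, when = c(5, 12, 20))
windows <- data.frame(win = c("early", "late"), from = c(0, 9), to = c(9, 25))
left_join(events, windows, join_by(when >= from, when <= to))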
We can use SQL to do that.
library(sqldf)
sqldf("select * from [df_1] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
giving:
index keyword index_2 name
1 1 john 1001 John Coltrane
2 1 john 1007 John Brown
3 2 ella 1002 Ella Fitzgerald
4 3 mil 1003 Miles Davis
5 4 nin 1005 Nina Simone
6 5 billi 1004 Billie Holliday
It can be placed in a pipeline like this:
library(magrittr)
library(sqldf)
df_1 %>%
{ sqldf("select * from [.] A
left join [df_2] B on B.name like '%' || A.keyword || '%'")
}

How to compare two strings word by word in R

I have a dataset, let's call it "ORIGINALE", composed of several rows and only two columns, the first called "DESCRIPTION" and the second "CODICE". The description column has the right information, while the column codice, which is the key, is almost always empty, so I'm trying to find the corresponding codice in another dataset, let's call it "REFERENCE". I am using the description column, which is in natural language, and trying to match it with the description in the second dataset. I have to match word by word, since there may be a different order of words, synonyms or abbreviations. Then I calculate the similarity score to keep only the best match and accept those above a certain score. Is there a way to improve it? I'm working with around 300,000 rows and, even though I know it is always going to take time, perhaps there is a way to make it even just slightly faster.
ORIGINALE <- data.frame(DESCRIPTION = c("mr peter 123 rose street 3b LA", " 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE <- data.frame(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67", "aw56", "83776250"))
library(stringdist)
library(tibble)
algoritmo <- function(ORIGINALE, REFERENCE) {
  split1 <- strsplit(ORIGINALE$DESCRIPTION, " ")
  split2 <- strsplit(REFERENCE$DESCRIPTION, " ")
  risultato <- vector()
  distanza <- vector()
  for (i in 1:NROW(split1)) {
    best_dist <- -5
    closest_match <- -5
    for (j in 1:NROW(split2)) {
      dist <- stringsim(as.character(split1[i]), as.character(split2[j]))
      if (dist > best_dist) {
        closest_match <- REFERENCE$DESCRIPTION[j]
        best_dist <- dist
      }
    }
    distanza <- append(distanza, best_dist)
    risultato <- append(risultato, closest_match)
  }
  confronto <<- tibble(ORIGINALE$DESCRIPTION, risultato, distanza)
}
algoritmo(ORIGINALE, REFERENCE)
match <- subset.data.frame(confronto, confronto$distanza >= 0.6)
missing <- subset.data.frame(confronto, confronto$distanza < 0.6)
The R tm (text mining) library can help here:
library(tm)
library(proxy) # for computing cosine similarity
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# combine ORIGINALE and REFERENCE into one data.table
both = rbind(ORIGINALE,REFERENCE)
# create "doc_id" and "text" columns (required by tm)
both[,doc_id:=1:.N]
names(both)[1] = 'text'
# convert to tm corpus
corpus = SimpleCorpus(DataframeSource(both))
# convert to a tm document term matrix
dtm = DocumentTermMatrix(corpus)
# convert to a regular matrix
dtm = as.matrix(dtm)
# look at it (t() transpose for readability)
t(dtm)
Docs
Terms 1 2 3 4 5 6
123 1 0 0 0 1 0
peter 1 0 0 0 1 0
rose 1 0 0 0 1 0
street 1 0 0 1 1 1
chicago 0 1 0 0 0 1
flower 0 1 0 0 0 1
jane 0 1 0 0 0 1
jenny 0 1 0 0 0 1
str 0 1 0 0 0 0
430f 0 0 1 1 0 0
miss 0 0 1 0 0 0
name 0 0 1 1 0 0
sarah 0 0 1 1 0 0
strt 0 0 1 0 0 0
washington 0 0 1 1 0 0
brown 0 0 0 1 0 0
green 0 0 0 0 1 0
# compute similarity between each combination of documents 1:3 and documents 4:6
similarity = proxy::dist(dtm[1:3,], dtm[4:6,], method="cosine")
# result:
                   REFERENCE document
ORIGINALE document         4         5         6
                 1 0.7958759 0.1055728 0.7763932   <-- difference (smaller = more similar)
                 2 1.0000000 1.0000000 0.2000000
                 3 0.3333333 1.0000000 1.0000000
# make a table of which REFERENCE document is most similar
most_similar = rbindlist(
  apply(similarity, 1, function(x) {
    data.table(i = which.min(x), distance = min(x))
  })
)
# result:
i distance
1: 2 0.1055728
2: 3 0.2000000
3: 1 0.3333333
# rows 1, 2, 3 are rows of ORIGINALE; i = 2, 3, 1 are the matching rows of REFERENCE
# add the results back to ORIGINALE
ORIGINALE1 = cbind(ORIGINALE,most_similar)
REFERENCE[,i:=1:.N]
ORIGINALE2 = merge(ORIGINALE1,REFERENCE,by='i',all.x=T,all.y=F)
# result:
i DESCRIPTION.x CODICE.x distance DESCRIPTION.y CODICE.y
1: 1 washington miss sarah 430f name strt NA 0.3333333 sarah brown name street 430f washington 135tg67
2: 2 mr peter 123 rose street 3b LA NA 0.1055728 peter green 123 rose street 3b LA aw56
3: 3 4c flower str jenny jane Chicago NA 0.2000000 jenny jane flower street 4c Chicago 83776250
# now the documents are in a different order than in ORIGINALE2.
# this is caused by merging by i (=REFERENCE document row).
# if order is important, then add these two lines around the merge line:
ORIGINALE1[,ORIGINALE_i:=1:.N]
ORIGINALE2 = merge(...
ORIGINALE2 = ORIGINALE2[order(ORIGINALE_i)]
Good question. for loops are slow in R:
for(i in 1:NROW(split1)) {
for(j in 1:NROW(split2)) {
For fast R, you need to vectorize your algorithm. I'm not that handy with data.frame anymore, so I'll use its successor, data.table.
library(data.table)
ORIGINALE = data.table(DESCRIPTION = c("mr peter 123 rose street 3b LA"," 4c flower str jenny jane Chicago", "washington miss sarah 430f name strt"), CODICE = c(NA, NA, NA))
REFERENCE = data.table(DESCRIPTION = c("sarah brown name street 430f washington", "peter green 123 rose street 3b LA", "jenny jane flower street 4c Chicago"), CODICE = c("135tg67","aw56", "83776250"))
# split DESCRIPTION to make tables that have one word per row
ORIGINALE_WORDS = ORIGINALE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
REFERENCE_WORDS = REFERENCE[,.(word=unlist(strsplit(DESCRIPTION,' ',fixed=T))),.(DESCRIPTION,CODICE)]
# remove empty words introduced by extra spaces in your DESCRIPTIONS
ORIGINALE_WORDS = ORIGINALE_WORDS[word!='']
REFERENCE_WORDS = REFERENCE_WORDS[word!='']
# merge the tables by word
merged = merge(ORIGINALE_WORDS,REFERENCE_WORDS,by='word',all=F,allow.cartesian=T)
# count matching words for each combination of ORIGINALE DESCRIPTION and REFERENCE DESCRIPTION and CODICE
counts = merged[,.N,.(DESCRIPTION.x,DESCRIPTION.y,CODICE.y)]
# keep only the highest N CODICE.y for each DESCRIPTION.x
topcounts = counts[order(-N)][!duplicated(DESCRIPTION.x)]
# merge the counts back to ORIGINALE
result = merge(ORIGINALE,topcounts,by.x='DESCRIPTION',by.y='DESCRIPTION.x',all.x=T,all.y=F)
Here is result:
DESCRIPTION CODICE DESCRIPTION.y CODICE.y N
1: 4c flower str jenny jane Chicago NA jenny jane flower street 4c Chicago 83776250 5
2: mr peter 123 rose street 3b LA NA peter green 123 rose street 3b LA aw56 6
3: washington miss sarah 430f name strt NA sarah brown name street 430f washington 135tg67 4
PS: There are more memory-efficient ways to do this, and this code could cause your machine to crash due to an out-of-memory error or go slowly due to needing virtual memory, but if not, it should be faster than the for loops.
What about:
library(stringdist)
library(dplyr)
library(tidyr)
data_o <- ORIGINALE %>% mutate(desc_o = DESCRIPTION) %>% select(desc_o)
data_r <- REFERENCE %>% mutate(desc_r = DESCRIPTION) %>% select(desc_r)
data <- crossing(data_o, data_r)
data %>%
  mutate(dist = stringsim(as.character(desc_o), as.character(desc_r))) %>%
  group_by(desc_o) %>%
  filter(dist == max(dist))
desc_o desc_r dist
<chr> <chr> <dbl>
1 " 4c flower str jenny jane Chicago" jenny jane flower street 4c Chicago 0.486
2 "mr peter 123 rose street 3b LA" peter green 123 rose street 3b LA 0.758
3 "washington miss sarah 430f name strt" sarah brown name street 430f washington 0.385

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The IDs, both in the ID column and in X.1 : Y.3, consist of a numeric value followed by a name, i.e., "1 John Smith" or "20 Sam Smith" will be the string.
I have an issue where, in certain instances, the ID will remain "1 John Smith" but in X.1 : Y.3 the number preceding "John Smith" may change, so for example it might be "14 John Smith". The names will always be correct; it's just the number that sometimes gets mixed up.
I have a list of 200+ IDs that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in the ID column?
I won't know which column "14 John Smith" shows up in; it could be X.1, Y.2, or Y.3 depending on the row.
I could use a replace function in a dplyr line of code, or gsub, for each of the 200+ IDs and for each column affected, but that seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
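One way to avoid writing that line 200+ times would be to loop over the ID list; here is a rough sketch of that idea, assuming ids is a character vector of the correct IDs such as "1 John Smith" (the name ids and the character-column assumption are hypothetical, not from the original post):
library(dplyr)
for (id in ids) {
  nm <- sub("^\\d+\\s+", "", id)   # drop the leading number to get the bare name
  # assuming X.1 ... Y.3 are character columns
  df <- df %>%
    mutate(across(X.1:Y.3, ~ replace(.x, grepl(nm, .x), id)))
}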
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
  idcol <- as.character(x$ID)
  searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
  sapply(x[5:10], function(y) {
    ifelse(grepl(searchname, y), idcol, as.character(y))
  })
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the function to replace the names for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
The ID list can be read as a character vector
You don't have any typos in the ID list or in your data.frame, including
different lowercase and uppercase letters in the names.
Your ID list does not contain the numbers. In case that it does contain numbers, you have to use gsub to erase them.
The example can work with a data.frame (DF) with the same structure that
you put in your question.
ID <- c("John Smith", "Adam Smith", "George Smith")
for(i in 1:length(ID)) {
DF[, 5:10][grep(ID[i], DF[, 5:10])] <- ID[i]
}
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("[0-9]+ ", "", col))

Rearranging Data

I am trying to rearrange a data set and then sort it on multiple variables. For example, right now I have something that looks like this:
ID Name Class 1 Class2 Monday 7-8 Monday 8-9
1 Brad Chem Bio Monday 7-8 NA
2 Charlene Acct NA NA Monday 8-9
3 Carly Philosophy Physics NA NA
4 Jess Chem Acct Monday 7-8 Monday 8-9
And sort the data like this:
Class Monday 7-8 Monday 8-9
Acct Jess Charlene, Jess
Bio Brad NA
Chem Brad, Jess Jess
Philosophy NA NA
Physics NA NA
I have tried separating all of the variables into different spreadsheets and then merging them, but I can't figure out how to sort the names based on both class and time, and it is proving incredibly difficult. The actual database is composed of about 70 different time options, 80 different people, and 150 different class names (chem, bio, etc.), so I can't go in and create this individually.
A tidyr solution:
df1 %>%
  gather(class_col, Class, 'Class.1', 'Class2') %>%
  filter(!is.na(Class)) %>%
  gather(date_col, date, 'Monday.7.8', 'Monday.8.9') %>%
  group_by(Class, date) %>%
  summarize(Name = paste(Name, collapse = ", ")) %>%
  spread(date, Name) %>%
  select(-`<NA>`)
# # A tibble: 5 x 3
# # Groups: Class [5]
# Class `Monday 7-8` `Monday 8-9`
# * <chr> <chr> <chr>
# 1 Acct Jess Charlene, Jess
# 2 Bio Brad <NA>
# 3 Chem Brad, Jess Jess
# 4 Philosophy <NA> <NA>
# 5 Physics <NA> <NA>
Here is some base R code for this task:
dat <- data.frame(
name=c("Brad", "Charlene", "Carly", "Jess"),
class1=c("Chem", "Acct", "Philosophy", "Chem"),
class2=c("Bio", NA, "Physics", "Acct"),
monday7.8=c("monday7.8", NA, NA, "monday7.8"),
monday8.9=c(NA, "monday8.9", NA, "monday8.9"),
stringsAsFactors=FALSE
)
classes <- c("Chem", "Acct", "Philosophy", "Physics")
times <- c("monday7.8", "monday8.9")
ret <- expand.grid(class=classes, time=times, stringsAsFactors=FALSE)
one_alloc <- function(cl, tm, dat) {
  idx <- which(!is.na(dat[, tm]) & (dat[, "class1"] == cl | dat[, "class2"] == cl))
  if (length(idx) > 0) return(paste(dat[idx, "name"], collapse = ", ")) else return(NA)
}
one_alloc <- Vectorize(one_alloc, vectorize.args=c("cl", "tm"))
ret[,"names"] <- one_alloc(cl=ret[,"class"], tm=ret[,"time"], dat=dat)
ret <- reshape(ret, timevar="time", idvar="class", direction="wide")
ret

Match text across multiple rows in R

My data.frame(Networks) contains the following:
Location <- c("Farm", "Supermarket", "Farm", "Conference",
"Supermarket", "Supermarket")
Instructor <- c("Bob", "Bob", "Louise", "Sally", "Lee", "Jeff")
Operator <- c("Lee", "Lee", "Julie", "Louise", "Bob", "Louise")
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors=FALSE)
MY QUESTION
I wish to include a new column Transactions$Count in a new data.frame Transactions that sums the exchanges between each Instructor and Operator for every Location
EXPECTED OUTPUT
Location <- c("Farm", "Supermarket", "Farm", "Conference", "Supermarket")
Person1 <- c("Bob", "Louise", "Sally", "Jeff")
Person2 < - c("Lee", "Julie", "Louise", "Louise")
Count < - c(1, 2, 1, 1, 1)
Transactions <- data.frame(Location, Person1, Person2, Count,
stringsAsFactors=FALSE)
For example, there would be a total of 2 exchanges between Bob and Lee at the Supermarket. It does not matter if one person is a instructor or operator, I am interested in their exchange. In the expected output, the two exchanges between Bob and Lee at the Supermarket are noted. There is one exchange for every other combination at the other locations.
WHAT I HAVE TRIED
I thought grepl may be of use, but I wish to iterate across 1300 rows of this data, so it may be computationally expensive.
Thank you.
You can consider using "data.table", with pmin and pmax in the "by" argument.
Example:
Networks <- data.frame(Location, Instructor, Operator, stringsAsFactors = FALSE)
library(data.table)
as.data.table(Networks)[
, TransCount := .N,
by = list(Location,
pmin(Instructor, Operator),
pmax(Instructor, Operator))][]
# Location Instructor Operator TransCount
# 1: Farm Bob Lee 1
# 2: Supermarket Bob Lee 2
# 3: Farm Louise Julie 1
# 4: Conference Sally Louise 1
# 5: Supermarket Lee Bob 2
# 6: Supermarket Jeff Louise 1
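As a side note on why this works: pmin() and pmax() also operate element-wise on character vectors, so each Instructor/Operator pair is put into a canonical alphabetical order regardless of who played which role. A small illustration:
pmin(c("Bob", "Lee"), c("Lee", "Bob"))   # "Bob" "Bob"
pmax(c("Bob", "Lee"), c("Lee", "Bob"))   # "Lee" "Lee"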
Based on your update, it sounds like this might be more appropriate for you:
as.data.table(Networks)[
, c("Person1", "Person2") := list(
pmin(Instructor, Operator),
pmax(Instructor, Operator)),
by = 1:nrow(Networks)
][
, list(TransCount = .N),
by = .(Location, Person1, Person2)
]
# Location Person1 Person2 TransCount
# 1: Farm Bob Lee 1
# 2: Supermarket Bob Lee 2
# 3: Farm Julie Louise 1
# 4: Conference Louise Sally 1
# 5: Supermarket Jeff Louise 1
You may try
library(dplyr)
Networks %>%
  group_by(Location, Person1 = pmin(Instructor, Operator),
           Person2 = pmax(Instructor, Operator)) %>%
  summarise(Count = n())
# Location Person1 Person2 Count
#1 Conference Louise Sally 1
#2 Farm Bob Lee 1
#3 Farm Julie Louise 1
#4 Supermarket Bob Lee 2
#5 Supermarket Jeff Louise 1
Or using base R
d1 <-cbind(Location=Networks[,1],
data.frame(setNames(Map(do.call, c('pmin', 'pmax'),
list(Networks[-1])), c('Person1', 'Person2'))))
aggregate(cbind(Count=1:nrow(d1))~., d1, FUN=length)
# Location Person1 Person2 Count
#1 Farm Bob Lee 1
#2 Supermarket Bob Lee 2
#3 Supermarket Jeff Louise 1
#4 Farm Julie Louise 1
#5 Conference Louise Sally 1
data
Networks <- data.frame(Location, Instructor, Operator,
stringsAsFactors=FALSE)
