Find co-occurrence of values in a large data set - r

I have a large data set with month, customer ID and store ID. There is one record per customer, per location, per month summarizing their activity at that location.
Month  Customer ID  Store
Jan    1            A
Jan    4            A
Jan    2            A
Jan    3            A
Feb    7            B
Feb    2            B
Feb    1            B
Feb    12           B
Mar    1            C
Mar    11           C
Mar    3            C
Mar    12           C
I'm interested in creating a matrix that shows the number of customers that each location shares with another. Like this:
  A B C
A 4 2 2
B 2 4 2
C 2 2 4
For example, since customer 1 visited Store A in January and then Store B the next month, they would add one to the A/B tally. I'm interested in the number of shared customers, not the number of visits.
I tried the sparse matrix approach in this thread (Creating co-occurrence matrix), but the numbers returned don't match up for a reason I cannot understand.
Any ideas would be greatly appreciated!

Update:
The original solution that I posted worked for your sample data, but only because it has
the unusual property that no customer ever visited the same store in two different
months. If that can happen in the full data, a modification is needed.
What we need is a matrix of stores by customers that has 1 if the customer ever
visited the store and zero otherwise. The original solution used
M = as.matrix(table(Dat$ID_Store, Dat$Customer))
which gives how many different months the store was visited by each customer. With
different data, these numbers might be more than one. We can fix that by using
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
If you look at this matrix, it will say TRUE and FALSE, but since TRUE counts as 1 and FALSE
as 0 in arithmetic, that will work just fine. So the full corrected solution is:
M = as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)
M %*% t(M)
  A B C
A 4 2 2
B 2 4 2
C 2 2 4
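For reference, here is a minimal self-contained sketch of that solution; the names Dat, ID_Store, and Customer are assumed from the code above:
Dat <- data.frame(
  Month    = rep(c("Jan", "Feb", "Mar"), each = 4),
  Customer = c(1, 4, 2, 3, 7, 2, 1, 12, 1, 11, 3, 12),
  ID_Store = rep(c("A", "B", "C"), each = 4)
)
M <- as.matrix(table(Dat$ID_Store, Dat$Customer) > 0)  # stores x customers, TRUE if ever visited
M %*% t(M)                                             # stores x stores count of shared customers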

We can try this too:
library(reshape2)
df <- dcast(df, CustomerID ~ Store, length, value.var = 'Store')
#  CustomerID A B C
#1          1 1 1 1
#2          2 1 1 0   # customer 2 went to stores A and B, but not C
#3          3 1 0 1
#4          4 1 0 0
#5          7 0 1 0
#6         11 0 0 1
#7         12 0 1 1
crossprod(as.matrix(df[-1]))
#  A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
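One caveat with this approach: dcast() with length counts visits, so if a customer can visit the same store in more than one month, crossprod() would overcount. A sketch of a guard, converting the counts to 0/1 first:
df[-1] <- as.data.frame(+(df[-1] > 0))  # binarize: any positive count becomes 1
crossprod(as.matrix(df[-1]))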
With the arules library:
library(arules)
write(' Jan 1 A
Jan 4 A
Jan 2 A
Jan 3 A
Feb 7 B
Feb 2 B
Feb 1 B
Feb 12 B
Mar 1 C
Mar 11 C
Mar 3 C
Mar 12 C', 'basket_single')
tr <- read.transactions("basket_single", format = "single", cols = c(2,3))
inspect(tr)
#    items   transactionID
#[1] {A,B,C} 1
#[2] {C}     11
#[3] {B,C}   12
#[4] {A,B}   2
#[5] {A,C}   3
#[6] {A}     4
#[7] {B}     7
image(tr)
crossTable(tr, sort=TRUE)
#  A B C
#A 4 2 2
#B 2 4 2
#C 2 2 4
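As a sanity check, the same table can be obtained without crossTable() by coercing the transactions to a logical customer-by-store matrix and taking its cross-product:
m <- as(tr, "matrix")  # transactions (customers) x items (stores), logical
crossprod(m)           # stores x stores count of shared customers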

Related

The R function I have previously used successfully no longer works

I am trying to create a summary for this data set
   Morph ID black white orange green
1      O  1     2     1      0     3
2      O  2     2     1      3     0
3      O  3     2     1      1     2
4      O  4     3     0      2     1
5      O  5     3     0      2     1
6      O  6     3     0      1     2
7      O  7     3     0      1     2
8      O  8     3     0      3     0
9      O  9     0     3      2     1
10     O 10     3     0      3     0
11     O 11     3     0      1     2
12     O 12     0     3      2     1
13     O 13     3     0      2     1
14     O 14     3     0      2     1
15     O 15     2     1      1     2
I previously created the summary below with a data set that has the exact same format.
            n      mean       sd min Q1 median Q3 max percZero Choice        se
sum.greenO 15 0.8666667 1.187234   0  0      0  2   3 60.00000 Orange 0.3065424
sum.greenG 15 2.1333333 1.187234   0  1      3  3   3 13.33333  Green 0.3065424
I used the function Summarize() but this function is no longer working.
I need to create the same bar graph I made for the previous data set, which I can't do without "n", "sd", or "se". (I created "se" from "n" and "sd"; it didn't come with the initial function output.)
I am confused about how a function can simply stop working. Is there an alternative function I am not aware of?
Please let me know if this doesn't make any sense.
The following R packages on CRAN all provide a function called "Summarize" with a capital S:
> library(dplyr)  # for filter() and the pipe
> collidr::CRAN_packages_and_functions() %>% filter(function_names == "Summarize")
    package_names function_names
1        alakazam      Summarize
2          basket      Summarize
3          bayesm      Summarize
4       ChemoSpec      Summarize
5  ChemoSpecUtils      Summarize
6            cold      Summarize
7        dataMaid      Summarize
8          fastJT      Summarize
9             FSA      Summarize
10        GLMpack      Summarize
11        LAGOSNE      Summarize
12           lslx      Summarize
13         MapGAM      Summarize
14 MetaIntegrator      Summarize
15         NetMix      Summarize
16          PKNCA      Summarize
17        ppclust      Summarize
18            qad      Summarize
19  radiant.model      Summarize
20         ssmrob      Summarize
Of course it is not guaranteed you made the previous summary with one of them, but hopefully this helps you find the right one.
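A guess rather than a certainty: the percZero column in your earlier output matches what FSA::Summarize() reports for a numeric variable split by a factor, so FSA may well be the package you used. A sketch, assuming your data frame is named df and you want green summarized by Morph:
library(FSA)
sum.green <- Summarize(green ~ Morph, data = df)
sum.green$se <- sum.green$sd / sqrt(sum.green$n)  # recreate se from n and sd, as before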

Assign index values on dates in Excel

I have a seemingly easy question to solve, but I am really stuck! I have a set of dates and I am aiming to assign an index value to those dates. Specifically, I want each month of each year to have a unique index value. To visualize what I want, here are dates where, as you can see, January 1996 has the value 1, February 1996 is 2, March 1996 is 3, and so on. Note: the data spans many years, from 1996 to 2018, which means that Jan 1997 should have the value 13 in the index.
Date        Index
01/01/1996      1
02/01/1996      1
03/01/1996      1
04/01/1996      1
05/01/1996      1
08/01/1996      1
01/02/1996      2
02/02/1996      2
05/02/1996      2
06/02/1996      2
07/02/1996      2
08/02/1996      2
09/02/1996      2
01/03/1996      3
04/03/1996      3
05/03/1996      3
06/03/1996      3
07/03/1996      3
08/03/1996      3
11/03/1996      3
I am trying to achieve this either in R, or Excel.
Assuming that you read your dates into R and they are of type character and not Date, this would work:
mydf$Index2 <- as.numeric(as.factor(paste0(substring(mydf$Date, 7, 10), substring(mydf$Date, 4, 5))))
mydf
#          Date Index Index2
#1  01/01/1996     1      1
#2  02/01/1996     1      1
#3  03/01/1996     1      1
#4  04/01/1996     1      1
#5  05/01/1996     1      1
#6  08/01/1996     1      1
#7  01/02/1996     2      2
#8  02/02/1996     2      2
#9  05/02/1996     2      2
#10 06/02/1996     2      2
#11 07/02/1996     2      2
#12 08/02/1996     2      2
#13 09/02/1996     2      2
#14 01/03/1996     3      3
#15 04/03/1996     3      3
#16 05/03/1996     3      3
#17 06/03/1996     3      3
#18 07/03/1996     3      3
#19 08/03/1996     3      3
#20 11/03/1996     3      3
mydf is your data.frame name. In the code above, I extract the year and the month from each date (year first, so that the factor levels sort chronologically across years rather than alphabetically), then convert the result into a factor and then into numeric, which creates the indices.
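Note that as.factor() numbers only the year-months that actually occur, so if a month had no rows, every later index would shift down. If you need Jan 1997 to be exactly 13 even with gaps, compute the index arithmetically (again assuming character dates in dd/mm/yyyy form):
yr <- as.integer(substring(mydf$Date, 7, 10))
mo <- as.integer(substring(mydf$Date, 4, 5))
mydf$Index2 <- (yr - 1996) * 12 + mo
The same arithmetic works in Excel with real dates, e.g. =(YEAR(A2)-1996)*12+MONTH(A2).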

summation for multiple columns dynamically

Hi, I have a data frame with multiple columns, i.e. the first 5 columns are my metadata and the
remaining columns (the column count will always be even) are the actual columns that need to be calculated with this
formula: (col6*col9) + (col7*col10) + (col8*col11)
country <- c("US", "US", "US", "US")
name <- c("A", "B", "c", "d")
dob <- c(2017, 2018, 2018, 2010)
day <- c(1, 4, 7, 9)
hour <- c(10, 11, 2, 4)
a <- c(1, 3, 4, 5)
d <- c(1, 9, 4, 0)
e <- c(8, 1, 0, 7)
f <- c(10, 2, 5, 6)
j <- c(1, 4, 2, 7)
m <- c(1, 5, 7, 1)
df <- data.frame(country, name, dob, day, hour, a, d, e, f, j, m)
How do I get the final summation if I have more columns? I have tried the code below, which works here but is hard-coded to these columns:
df$final <- (df$a * df$f) + (df$d * df$j) + (df$e * df$m)
Here is one way to generalize the computation:
x <- ncol(df) - 5   # number of value columns after the 5 metadata columns
df$final <- rowSums(df[6:(5 + x/2)] * df[(ncol(df) - x/2 + 1):ncol(df)])
#   country name  dob day hour a d e  f j m final
# 1      US    A 2017   1   10 1 1 8 10 1 1    19
# 2      US    B 2018   4   11 3 9 1  2 4 5    47
# 3      US    c 2018   7    2 4 4 0  5 2 7    28
# 4      US    d 2010   9    4 5 0 7  6 7 1    37
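The same computation restated, which may be easier to read: drop the metadata, split the value columns into two halves, and sum their elementwise products row by row:
vals <- df[-(1:5)]               # drop the 5 metadata columns
half <- ncol(vals) / 2
df$final <- rowSums(vals[1:half] * vals[(half + 1):(2 * half)])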

Add jitter to column value using dplyr

I have a data frame of the following format.
  author year stages
1      A 1150      1
2      B 1200      1
3      C 1200      1
4      D 1300      1
5      D 1300      1
6      E 1390      3
7      F 1392      3
8      G 1400      3
9      G 1400      3
...
I want to jitter each year and author combination by a small amount. I want documents by different authors in the same year to be jittered by unique values. For example, tokens from author B and C appear in the same year, but should be jittered by different amounts. All tokens from the same author, for example the two tokens from author G at 1400 should be jittered by the same amount.
I've tried the following, but get a unique jitter amount for each and every row.
data %>% group_by(author) %>% mutate(year = jitter(year, amount=.5))
The output of this code is the following.
  author     year stages
1      A 1150.400      1
2      B 1200.189      1
3      C 1200.222      1
4      D 1300.263      1
5      D 1299.788      1
6      E 1390.045      3
7      F 1391.964      3
8      G 1399.982      3
9      G 1399.783      3
However, I would like the following, where the crucial difference is that both tokens from author G are shifted by the same amount.
  author     year stages
1      A 1150.400      1
2      B 1200.189      1
3      C 1200.222      1
4      D 1300.263      1
5      D 1299.788      1
6      E 1390.045      3
7      F 1391.964      3
8      G 1399.982      3
9      G 1399.982      3
Calculate the jitter for one case and add the difference to all cases:
dat %>%
  group_by(author) %>%
  mutate(year = year + (year[1] - jitter(year[1], amount = .5)))
#  author     year stages
#1      A 1149.720      1
#2      B 1200.385      1
#3      C 1199.888      1
#4      D 1299.589      1
#5      D 1299.589      1
#6      E 1389.866      3
#7      F 1392.225      3
#8      G 1400.147      3
#9      G 1400.147      3
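An alternative sketch that draws one offset per author directly: inside a grouped mutate(), runif(1, ...) is evaluated once per group and recycled across that group's rows. Note this uses a plain uniform draw rather than jitter():
library(dplyr)
dat %>%
  group_by(author) %>%
  mutate(year = year + runif(1, min = -0.5, max = 0.5))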

Append sequence number to data frame based on grouping field and date field

I am attempting to append a sequence number to a data frame grouped by individuals and date. For example, to turn this:
   x          y
1  A 2012-01-02
2  A 2012-02-03
3  A 2012-02-25
4  A 2012-03-04
5  B 2012-01-02
6  B 2012-02-03
7  C 2013-01-02
8  C 2012-02-03
9  C 2012-03-04
10 C 2012-04-05
in to this:
   x          y v
1  A 2012-01-02 1
2  A 2012-02-03 2
3  A 2012-02-25 3
4  A 2012-03-04 4
5  B 2012-01-02 1
6  B 2012-02-03 2
7  C 2013-01-02 1
8  C 2012-02-03 2
9  C 2012-03-04 3
10 C 2012-04-05 4
where "x" is the individual, "y" is the date, and "v" is the appended sequence number
I have had success on a small data frame using a for loop in this code:
x <- c("A", "A", "A", "A", "B", "B", "C", "C", "C", "C")
y <- as.Date(c("1/2/2012", "2/3/2012", "2/25/2012", "3/4/2012", "1/2/2012", "2/3/2012",
               "1/2/2013", "2/3/2012", "3/4/2012", "4/5/2012"), "%m/%d/%Y")
z <- data.frame(x, y)
z$v <- rep(1, nrow(z))
for (i in 2:nrow(z)) {
  if (z$x[i] == z$x[i - 1]) {
    z$v[i] <- z$v[i - 1] + 1
  } else {
    z$v[i] <- 1
  }
}
but when I expand this to a much larger data frame (250K+ rows) the process takes forever.
Any thoughts on how I can make this more efficient?
This seems to work. May be overkill though.
## code needed revision - this is old code
## > d$v <- unlist(sapply(sapply(split(d, d$x), nrow), seq))
EDIT
I can't believe I got away with that ugly mess for so long. Here's a revision. Much simpler.
## revised 04/24/2014
> d$v <- unlist(sapply(table(d$x), seq))
> d
##    x          y v
## 1  A 2012-01-02 1
## 2  A 2012-02-03 2
## 3  A 2012-02-25 3
## 4  A 2012-03-04 4
## 5  B 2012-01-02 1
## 6  B 2012-02-03 2
## 7  C 2013-01-02 1
## 8  C 2012-02-03 2
## 9  C 2012-03-04 3
## 10 C 2012-04-05 4
Also, an interesting one is stack. Take a look.
> stack(sapply(table(d$x), seq))
##    values ind
## 1       1   A
## 2       2   A
## 3       3   A
## 4       4   A
## 5       1   B
## 6       2   B
## 7       1   C
## 8       2   C
## 9       3   C
## 10      4   C
I'm removing my previous post and replacing it with this solution. Extremely efficient for my purposes.
# order the data
z <- z[order(z$x, z$y), ]
# convert to a data.table
library(data.table)
dt.z <- data.table(z)
# obtain the vector of sequence numbers
z$seq <- dt.z[, 1:.N, "x"]$V1
The above can be accomplished in fewer steps but I wanted to illustrate what I did. This is appending sequence numbers to my data sets of over 250k records in under a second. Thanks again to Henrik and Richard.
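For reference, more recent idioms produce the same sequence column in one step (assuming z is already ordered by x and y):
library(data.table)
dt.z <- data.table(z)
dt.z[, v := seq_len(.N), by = x]  # current data.table idiom

# or in base R, with no extra packages:
z$v <- ave(seq_len(nrow(z)), z$x, FUN = seq_along)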
