I have an order data set in the following format:
Ordernumber; Category; # Sold Items
123; A; 3
123; B; 4
234; B; 2
234; C; 1
234; D; 5
...
So, every order has as many lines as there were different categories in the order.
Now I want to count, for every category pair, how often the two categories were ordered together in one order.
In the end I would like to have a "correlation" matrix like this:
  A B C D
A   1
B 1   1 1
C   1   1
D   1 1
Does anyone have a good (simple) idea?
Thank you so much!
Perhaps using matrix multiplication gets you there:
dat <- read.table(header=T, text="Ordernumber; Category; Sold Items
123; A; 3
123; B; 4
234; B; 2
234; C; 1
234; D; 5", sep=";")
tt <- table(dat[1:2])  # incidence table: orders (rows) x categories (columns)
crossprod(tt)          # t(tt) %*% tt gives the category co-occurrence counts
# Category
#Category A B C D
# A 1 1 0 0
# B 1 2 1 1
# C 0 1 1 1
# D 0 1 1 1
This keeps the diagonal, but it can easily be removed with diag().
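For example, a minimal sketch that zeroes the diagonal of the result above:
res <- crossprod(tt)
diag(res) <- 0  # drop the self-pair counts
res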
I have a table in R with a column containing a string value that I need to tokenize into 4 separate columns, each holding a 0 or 1 depending on whether the token is present. For example, a, b, c, d are all the tokens, and a column could hold any permutation of the tokens in a string, e.g.
a,c
b,d
c,a
What I want is to turn
combind
1 A,B,C
2 A
3 A,B
4 B,C
into
combind A B C
1 A,B,C 1 1 1
2 A 1 0 0
3 A,B 1 1 0
4 B,C 0 1 1
Be gentle with me as I am very new to R and it's making me very angry at the moment!
I have tried all sorts of approaches, iterating through the first table to apply a function to the first column to get the values for the second column and adding them in using the data.table library, e.g.
df[, AValPre := isAPresent(combind)]
where isAPresent is a function with a grepl call:
printf <- function(...) invisible(print(sprintf(...)))

## Setup functions for extraction of the location
containsA <- function(str) {
  cond <- grepl("A", str)   # TRUE if "A" appears anywhere in str
  if (cond[1] == TRUE)
    rv <- 1
  else
    rv <- 0
  printf("Checking A in [%s] cond %d rv %d", str, cond, rv)
  return(rv)
}
Help! I am at my complete wits' end with this....
We may do
library(qdapTools)
cbind(df, mtabulate(strsplit(df$combind, ",")))
Output:
combind A B C
1 A,B,C 1 1 1
2 A 1 0 0
3 A,B 1 1 0
4 B,C 0 1 1
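If you would rather avoid the qdapTools dependency, here is a base-R sketch of the same idea (the token set A, B, C is taken from the example; as.character guards against combind being a factor):
toks <- c("A", "B", "C")
splits <- strsplit(as.character(df$combind), ",")
ind <- sapply(toks, function(t) as.integer(vapply(splits, function(s) t %in% s, logical(1))))
cbind(df, ind)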
I have a dataframe like this:
df<- data.frame(a = 0,b=0,c=1,d=1,e=0,f=1,g=1,h=1)
print(df) would give this result
a b c d e f g h
0 0 1 1 0 1 1 1
Now, I need to find the longest span of consecutive 1s. In the above scenario, there are two 1s together (columns c and d) before a zero comes in the next column, and three together after that (columns f, g, h). I want the result to look like this, as 3 is the max of 2 and 3:
a b c d e f g h Max_Span
0 0 1 1 0 1 1 1        3
Is there an easy way to do this rather than stepping through each value and checking it against the previous one? Please advise.
You probably want the function rle.
Here is an example to see what it does (it counts the lengths of consecutive runs):
vect <- c(1, 0, 0, 1, 1, 1, 0)
rle(vect)
Run Length Encoding
lengths: int [1:4] 1 2 3 1
values : num [1:4] 1 0 1 0
Edit:
if you want only particular values, just use which:
rle_vect <- rle(vect) #first we assign the output from rle
rle_vect$lengths[which(rle_vect$values==1)] # then we can access where values==1
#[1] 1 3
In your case you want the maximum length among the runs of 1s (unlist turns the one-row data.frame into the atomic vector that rle expects):
rle_1 <- rle(unlist(df[1, ]))
max(rle_1$lengths[which(rle_1$values==1)])
#[1] 3
Data:
df[1, ]
# a b c d e f g h
#1 0 0 1 1 0 1 1 1
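To get the Max_Span column asked for in the question in one step (a sketch; with() works here because rle() returns a plain list):
df$Max_Span <- with(rle(unlist(df[1, ])), max(lengths[values == 1]))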
In R, if I have a data structure my_data like:
participant var score
1           a   ...
            b   ...
            c   ...
            a   ...
2           b   ...
            a   ...
            c   ...
            c   ...
3           b   ...
            c   ...
            a   ...
            b   ...
and I count the frequencies of var with table(my_data$participant, my_data$var), the result is:
a b c
1 1 0 0
2 0 1 0
3 0 1 0
while it should be
a b c
1 2 1 1
2 1 1 2
3 1 2 1
This happens because the function counts each line under its own 'participant' value, so the lines in which 'participant' is empty are not attributed to the participant above them.
Is there a built-in way to associate those empty lines with the same participant?
You can use na.locf from the zoo package:
# sample data
my_data = data.frame(participant = c("1", "", "", "2", "", ""), var = c("a", "a", "b", "a", "a", "c"), stringsAsFactors = FALSE)
library(zoo)
# first, replace empty elements with NA, then use na.locf
my_data$participant[nchar(my_data$participant)==0]=NA
my_data$participant = na.locf(my_data$participant)
table(my_data$participant, my_data$var)
Output:
a b c
1 2 1 0
2 2 0 1
Hope this helps!
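If you prefer to avoid the zoo dependency, the same last-observation-carried-forward fill can be sketched in base R (this assumes the first participant value is non-empty):
p <- my_data$participant
p[p == ""] <- NA
my_data$participant <- p[!is.na(p)][cumsum(!is.na(p))]  # repeat each value until the next non-NA
table(my_data$participant, my_data$var)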
I have a data.frame that looks like this:
> DF1
A B C D E
a x c h p
c d q t w
s e r p a
w l t s i
p i y a f
I would like to compare each column of my data.frame with the remaining columns in order to count the number of common elements. For example, I would like to compare column A with all the remaining columns (B, C, D, E) and count the common entities in this way:
A versus the remaining:
A vs B: 0 (no elements in common)
A vs C: 1 (c in common)
A vs D: 3 (a, p and s in common)
A vs E: 3 (p, w, a in common)
Then the same: B versus columns C, D, E, and so on.
How can I implement this?
We can loop through the column names and compare each with the other columns, taking the intersect and getting the length:
sapply(names(DF1), function(x) {
x1 <- lengths(Map(intersect, DF1[setdiff(names(DF1), x)], DF1[x]))
c(x1, setNames(0, setdiff(names(DF1), names(x1))))[names(DF1)]})
# A B C D E
#A 0 0 1 3 3
#B 0 0 0 0 1
#C 1 0 0 1 0
#D 3 0 1 0 2
#E 3 1 0 2 0
Or this can be done more compactly by taking the cross product after tabulating the long-format (melt) dataset:
library(reshape2)
tcrossprod(table(melt(as.matrix(DF1))[-1])) * !diag(5)
# Var2
#Var2 A B C D E
# A 0 0 1 3 3
# B 0 0 0 0 1
# C 1 0 0 1 0
# D 3 0 1 0 2
# E 3 1 0 2 0
NOTE: The crossprod part is also implemented in RcppEigen, which would make this faster.
An alternative is to use combn twice, once to get the column combinations and again to find the lengths of the element intersections.
cbind.data.frame returns a data.frame and setNames is used to add column names.
setNames(cbind.data.frame(t(combn(names(DF1), 2)),
         combn(names(DF1), 2, function(x) length(intersect(DF1[, x[1]], DF1[, x[2]])))),
         c("col1", "col2", "count"))
col1 col2 count
1 A B 0
2 A C 1
3 A D 3
4 A E 3
5 B C 0
6 B D 0
7 B E 1
8 C D 1
9 C E 0
10 D E 2
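If you would rather have the square matrix form of the first answer, the pairwise result can be reshaped with xtabs (a sketch, assuming the combn result above is saved as res):
res$col1 <- factor(res$col1, levels = names(DF1))
res$col2 <- factor(res$col2, levels = names(DF1))
m <- xtabs(count ~ col1 + col2, data = res)
m + t(m)  # symmetric count matrix with a zero diagonal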
I have a data.table and I need to add a column with the ratio between the number of rows where l == 1 and where l == 2 within the same cID. The code below computes the ratio, but the result is reduced to one row per cID. I need the full table, with the ratio repeated for every row of the group. Any suggestions? Thanks in advance!
x y l cID
0.03588851 0.081635056 1 1
0.952514891 0.82677373 1 1
0.722920691 0.687278396 1 1
0.772207687 0.743329599 2 1
0.682710551 0.946685728 1 2
0.795816439 0.024320077 2 2
0.50788885 0.106910923 2 2
0.145871035 0.802771467 2 2
0.092942384 0.335054397 1 3
0.439765866 0.199329139 1 4
To reproduce:
library(data.table)
x = c(0.03588851,0.952514891,0.722920691,0.772207687,0.682710551,0.795816439,0.50788885,0.145871035,0.092942384,0.439765866)
y = c(0.081635056,0.82677373,0.687278396,0.743329599,0.946685728,0.024320077,0.106910923,0.802771467,0.335054397,0.199329139)
l = c(1,1,1,2,1,2,2,2,1,1)
cID = c(1,1,1,1,2,2,2,2,3,4)
dt <- data.table(x,y,l,cID)
dt[,sum(l == 1)/sum(l == 2), by = cID]
I need to obtain the ratio column that looks like this
x y l cID ratio
0.03588851 0.081635056 1 1 3
0.952514891 0.82677373 1 1 3
0.722920691 0.687278396 1 1 3
0.772207687 0.743329599 2 1 3
0.682710551 0.946685728 1 2 0.333333333
0.795816439 0.024320077 2 2 0.333333333
0.50788885 0.106910923 2 2 0.333333333
0.145871035 0.802771467 2 2 0.333333333
0.092942384 0.335054397 1 3 Inf
0.439765866 0.199329139 1 4 Inf
You were pretty close. Your j expression computes one aggregated row per cID; assigning it with := instead adds the result as a column by reference, recycling the group value across every row of the group:
dt[, ratio := sum(l == 1) / sum(l == 2), by = cID]
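To sanity-check the ratios, you can list the per-group counts that feed them (standard data.table grouping):
dt[, .N, by = .(cID, l)]  # rows per (cID, l) pair; e.g. cID 1 has three rows with l == 1 and one with l == 2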