I have data which is in this format:
User Item
1 A
1 B
1 C
1 D
2 A
2 C
2 E
What I want to get is a frequency count for each pair of items. Order is not important, so I don't want to count the inverse (BA is the same pair as AB). I want to end up with a result similar to this, where each pair is counted at most once per user:
Pair Frequency
AB 1
AC 2
AD 1
AE 1
BC 1
BD 1
BE 0
CD 1
CE 1
What tool can I use to produce this kind of table? I'd prefer an open source solution if possible.
Edit: added an example for my comment below.
I'm reading in data from a CSV file and removing the factors with the following two lines of code:
xa<-read.csv("C:/Directory/MyData.csv")
xa<-data.frame(lapply(xa, as.character), stringsAsFactors=FALSE)
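As a side note, these two steps can usually be collapsed into one, since read.csv accepts stringsAsFactors directly:

xa<-read.csv("C:/Directory/MyData.csv", stringsAsFactors=FALSE)  # one-step equivalent

Either way, the data reads in looking like this: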
User Item
1 394324 Item A
2 124209 Item B
3 212457 Item C
4 427052 Item A
5 118281 Item D
6 156831 Item A
7 212442 Item E
8 156831 Item B
9 212442 Item A
10 177734 Item C
When I try running the suggested answer, I get an error with this result:
Error in combn(x, 2) : n < m
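(For context: combn() throws this error whenever it is handed fewer elements than the requested pair size, e.g. a user with only a single item:

combn("A", 2)
# Error in combn("A", 2) : n < m

The fix for this is described in the final solution below.)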
Well, R is open source.
Here's an example based on your tiny sample of data:
Here I just read your data in by copy-pasting it straight from your post:
> xa=read.table(stdin(),header=TRUE,as.is=TRUE)
0: User Item
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 2 A
6: 2 C
7: 2 E
8:
So that's the data in. Then with a couple of lines of code:
> # f turns one user's item vector into all unordered pair labels
> f=function(x) apply(combn(x,2),2,paste0,collapse="")
> # apply f to each user's items, then tabulate every pair label produced
> table(unlist(tapply(xa$Item,xa$User,f)))
AB AC AD AE BC BD CD CE
1 2 1 1 1 1 1 1
If you need all the empty combinations explicitly as zeroes, it takes another line or two: generate all the possible pair labels (rather than just the observed ones) and use them as factor levels, so that table includes the empty ones.
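For example, a minimal sketch along those lines, reusing f and xa from above:

> items=sort(unique(xa$Item))
> all_pairs=f(items)   # every possible unordered pair, observed or not
> observed=unlist(tapply(xa$Item,xa$User,f))
> table(factor(observed,levels=all_pairs))
AB AC AD AE BC BD BE CD CE DE
 1  2  1  1  1  1  0  1  1  0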
After some research and suggestions by Glen, I came up with the following code, which gets me a three-column CSV file with each pair combination plus its frequency count. If anyone sees a better way, let me know! But this seems to work.
The errors I was referring to in my follow-up comments were caused by users having purchased at only one location.
library(reshape2)    # for colsplit()
library(data.table)  # for data.table() and setkey()

xa<-read.csv("C:/Input.csv",as.is=TRUE)
xa<-xa[!duplicated(xa),]                   # drop exact duplicate rows
xa<-data.table(xa)
setkey(xa,ContactId,PurchaseLocation)      # sorting makes the pairs order-independent

# keep only contacts with more than one location (avoids the combn "n < m" error)
tab<-table(xa$ContactId)
xa<-xa[xa$ContactId %in% names(tab[tab>1]),]

f<-function(x) apply(combn(x,2),2,paste0,collapse="--")
xb<-as.data.frame(table(unlist(tapply(xa$PurchaseLocation,xa$ContactId,f))))

# split the "A--B" labels into two columns and keep pairs seen more than once
xc<-with(xb, cbind(Freq, colsplit(xb$Var1, pattern="--", names=c('a','b'))))
xc<-subset(xc,a!=b & a!="" & b!="" & Freq>1)
write.csv(xc,file="C:/Output.csv")
Edit: I made a small change to make it order-independent by sorting the data table on a key.
Related
I am writing my thesis in R and I would like, if possible, some help with a problem I have.
I have a table called tkalp, with 2 columns and 3001 rows; after a 'subset' command that I wrote, this table now contains 1084 rows and is called kp. Some values of kp are shown in the picture mentioned in the EDIT below.
As you can see, some values in column V1 are consecutive with step = 2 and some are not.
So my difficulty is:
1. I would like to 'break' this big list/table into smaller lists/tables that contain only consecutive numbers. I tried to implement this with the following commands, but it didn't go as planned:
for (n in 1:nrow(kp)) {
  kp1 <- subset(kp, kp[n+1,1] - kp[n,1] == 2)
}
2. After completing this task I would like to keep only the sublists that contain more than 10 rows.
Any idea or help is more than welcome! Thank you very much.
EDIT
I have uploaded a picture of my table, in which I have marked the numbers that I want to end up in separate tables. I would like to do that for the whole original table.
blue is one smaller table than the original
black another
yellow another
red another
And after I create all those smaller tables I would like to keep only the tables that contain more than 10 numbers. For example I don't want to keep the yellow table since it contains only 4 numbers.
Thank you again
What about
df <- data.frame(V1=c(1,3,5,10,12,14,20,22), V2=runif(8))
df$diff <- c(2, diff(df$V1))              # gap to the previous V1 value
df$numSubset <- cumsum(df$diff != 2) + 1  # start a new subset whenever the gap isn't 2
iter <- seq(max(df$numSubset))
listOfSubsets <- purrr::map(iter, function(i) dplyr::filter(df, numSubset == i))
Then you loop through the list and keep only those subsets you want. By the way, purrr also provides a means to filter the resulting list without looping; check the purrr documentation.
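For instance, a minimal sketch of that non-looping filter, using the question's threshold of more than 10 rows:

# keep only the sub-data-frames with more than 10 rows
bigSubsets <- purrr::keep(listOfSubsets, ~ nrow(.x) > 10)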
With base R
kp <- data.frame(V1=c(seq(8628,8618,by=-2), seq(8576,8566,by=-2), 78, 76), V2=runif(14))
kp$diffV1 <- c(-2, diff(kp$V1))                          # step between consecutive V1 values
kp$group <- cumsum(ifelse(kp$diffV1/-2 == 1, 0, 1)) + 1  # new group whenever the step isn't -2
lkp <- split(kp, kp$group)                               # list of sub-tables, one per run
# > kp
# V1 V2 diffV1 group
# 1 8628 0.74304325 -2 1
# 2 8626 0.84658101 -2 1
# 3 8624 0.74540089 -2 1
# 4 8622 0.83551473 -2 1
# 5 8620 0.63605222 -2 1
# 6 8618 0.92702915 -2 1
# 7 8576 0.81978587 -42 2
# 8 8574 0.01661538 -2 2
# 9 8572 0.52313859 -2 2
# 10 8570 0.39997951 -2 2
# 11 8568 0.61444445 -2 2
# 12 8566 0.23570017 -2 2
# 13 78 0.58397923 -8488 3
# 14 76 0.03634809 -2 3
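To then keep only the runs with more than 10 rows (point 2 of the question), base R's Filter works on the resulting list:

# drop the groups with 10 or fewer rows
lkp_big <- Filter(function(d) nrow(d) > 10, lkp)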
I got stuck on this for a long time and couldn't find an answer elsewhere.
Below is my data:
Market Start Type (0 or 1)
A 1
A 2
A 4
A 6
A 10
A 2
B 2
B 4
B 6
B 8
B 4
B 9
C 1
C 4
C 7
C 3
C 9
C 11
C 12
And I want to complete the Type column based on the following conditions:
If Market is A and Start is 1,2,3, then Type is 1, otherwise 0
If Market is B and Start is 2,4,5, then Type is 1, otherwise 0
If Market is C and Start is 4,6,9, then Type is 1, otherwise 0
In Alteryx, I tried using the Formula tool three times:
IIF([Market]="A" && [Start] in (1,2,3), "1", "0")
IIF([Market]="B" && [Start] in (2,4,5), "1", "0")
IIF([Market]="C" && [Start] in (4,6,9), "1", "0")
But the third IIF function overwrites the previous two. Is there another tool in Alteryx that can do what I want, or is there something wrong with my code?
Thanks in advance. I really appreciate it.
Your third expression evaluates to False and writes a zero for every market <> "C", overwriting the previous two results... try a single Formula tool with:
IF [Market]="A" THEN
IIF([Start] in (1,2,3),"1","0")
ELSEIF [Market]="B" THEN
IIF([Start] in (2,4,5),"1","0")
ELSEIF [Market]="C" THEN
IIF([Start] in (4,6,9),"1","0")
ENDIF
This should eliminate the overwriting.
I have a set of data in the following format:
Items Shipped | Month
A 1
B 1
C 1
D 2
E 2
F 3
G 3
H 3
I would like to show the count of items shipped each month using a calculated field in Tableau.
Item_Count | Month
3 1
2 2
3 3
Any Suggestions?
You should probably have a look at the Tableau page for their basic tutorials:
https://www.tableau.com/learn/training
Drag the [Month] pill to Rows (if it's an actual date, change it to discrete month; otherwise leave it as it is).
Drag [Items Shipped] to Columns, click on it, and change the aggregation to COUNT or COUNTD, depending on whether you want the total count or only the distinct elements. (If you specifically want a calculated field, the equivalent is COUNT([Items Shipped]).)
This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see whether the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between a start and stop I'd like a 1, otherwise a 0; for every row in "truth" that finds a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important, and I'm not looking to match on ID. Instead I want to run through the data frames entirely. The output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing whether any of my 'possible' time values fall within 2 seconds of the start or stop times would also do the trick (again with 1/0 outputs).
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
# melt truth to long form (one row per start/stop boundary); roll=TRUE then matches
# each possible time to the most recent boundary: "start" means the time falls
# inside an interval, "stop" means it falls outside
melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
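As for the alternative asked about (flagging times within 2 seconds of an interval), a sketch using a data.table non-equi update join on the same tables might look like this:

truth[, `:=`(lo = start - 2, hi = stop + 2)]   # widen each interval by 2s on both sides
possible[, near := 0L]
possible[truth, on = .(times >= lo, times <= hi), near := 1L]  # flag times landing inside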
I'll post a solution that I'm pretty sure works the way you want, in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data. Next time, please provide this from your own data set in your post, using dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for, and it should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)

#initialize vectors - 0 or false by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))

#iterate through the 'possible' dataframe
for (i in 1:nrow(possible)) {
  #boolean vector showing which 'truth' rows match this time
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#I think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly, betweenAny
possible$betweenAny <- possible.between
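Since a more efficient answer was invited: under the same assumptions (column 2 of truth holds the start times, column 3 the stop times), a vectorized sketch with outer() avoids the explicit loop:

# hit[i, j] is TRUE if possible time i lies inside truth interval j
hit <- outer(possible$Times, truth[[2]], ">=") & outer(possible$Times, truth[[3]], "<=")
possible$betweenAny <- as.integer(rowSums(hit) > 0)
truth$anyMatch <- as.integer(colSums(hit) > 0)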
I've got a table like this:
a b c
-- -- --
1 1 10
2 1 0
3 1 0
4 4 20
5 4 0
6 4 0
The b column 'points' to 'a', a bit as if the row it references were the parent.
c was computed. Now I need to propagate each parent's c value to its children.
The result would be
a b c
-- -- --
1 1 10
2 1 10
3 1 10
4 4 20
5 4 20
6 4 20
I can't come up with an UPDATE/SELECT combo that works.
So far I have a SELECT that produces the c column I'd like to get:
select t1.c from t t1 join t t2 on t1.a=t2.b;
c
----------
10
10
10
20
20
20
But I don't know how to stuff that back into c.
Thanks in advance
Cheers, phi
You have to look up the value with a correlated subquery:
UPDATE t
SET c = (SELECT c
FROM t AS parent
WHERE parent.a = t.b)
WHERE c = 0;
I finally found a way to copy my initial 'temp' SELECT JOIN back into table 't'. Something like this:
create temp table u as select t1.c from t t1 join t t2 on t1.a=t2.b;
update t set c=(select * from u where rowid=t.rowid);
I'd like to know how the two solutions compare performance-wise: yours is a single UPDATE with a correlated SELECT, while mine is two queries, each with a correlated lookup. Mine seems heavier and less aesthetic, yet I still wonder about the performance.
On the algorithmic side, yours takes care not to rewrite the parent rows and only copies to the children; mine copies each parent value onto itself as well, which is a no-op, yet it consumes some cycles :)
Cheers, Phi