R: merge two data sets by column a or column b

I have two data frames that look like this:
view
id object date maxdate
1 a 8 9
1 b 8 9
2 a 8 9
3 b 7 8
purchase
id date object purchased
1 9 a 1
2 8 a 1
3 8 b 1
One table records when a product was viewed, and the other records if and when the product was purchased (after being viewed, a product can be purchased within 24 hours). I want to merge them on columns id, date and object OR on id, maxdate = date and object. What is the best way to implement that OR condition within full_join (dplyr)? Below is the code for the data frames and the output I am looking for:
id object date maxdate purchased
1 a 8 9 1
1 b 8 9 NA
2 a 8 9 1
3 b 7 8 1
id=c(1,1,2,3)
object=c("a","b","a","b")
date=c(8,8,8,7)
maxdate=c(9,9,9,8)
view=data.frame(id,object,date,maxdate)
id=c(1,2,3)
date=c(9,8,8)
object=c("a","a","b")
purchased=c(1,1,1)
purchase=data.frame(id,date,object,purchased)
So far I have tried something like this, but it is very inefficient and confusing to clean up on a large dataset:
a=merge(view,purchase, by="id")
a$ind=ifelse(a$object.x==a$object.y & (a$date.x==a$date.y | a$maxdate==a$date.y),1,NA)
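dplyr joins have no native OR condition, but one common workaround is a sketch like the following: join twice, once on each key combination, and coalesce the results. This assumes each view row matches at most one purchase row, as in the sample data.

```r
library(dplyr)

view <- data.frame(id      = c(1, 1, 2, 3),
                   object  = c("a", "b", "a", "b"),
                   date    = c(8, 8, 8, 7),
                   maxdate = c(9, 9, 9, 8))
purchase <- data.frame(id        = c(1, 2, 3),
                       date      = c(9, 8, 8),
                       object    = c("a", "a", "b"),
                       purchased = c(1, 1, 1))

# Join once on (id, object, date) and once on (id, object, maxdate = date),
# then keep whichever of the two matches succeeded.
m1 <- left_join(view, purchase, by = c("id", "object", "date"))
m2 <- left_join(view, purchase, by = c("id", "object", "maxdate" = "date"))

view$purchased <- coalesce(m1$purchased, m2$purchased)
view
```

This yields purchased = 1, NA, 1, 1 for the four view rows, matching the desired output above.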

Are you trying to do something like this?
a=merge(view[,-4],purchase, by=c("id", "object"))
names(a) = c("id", "object", "date.viewed", "date.purchased", "purchased")
> a
id object date.viewed date.purchased purchased
1 1 a 8 9 1
2 2 a 8 8 1
3 3 b 7 8 1

Related

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I simply calculate a diff over the timestamps and want to start a new group whenever the diff is larger than a certain value. I tried the code below, but it does not work, since the dialog variable is not yet available inside the mutate that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
mutate(t_diff = c(NA, diff(time))) %>%
# This generates an error as dialog is not available as a variable at this point
mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group. Thereby, the groups are distinguished at points at which the diff to the previous element is larger than 500.
Unfortunately, I have not found a clever workaround to achieve this in an efficient way using dplyr. Obviously, iterating over the data.frame with a loop would work, but would be very inefficient.
Is there a way to achieve this in dplyr?
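One possible workaround (a sketch, not necessarily the most elegant): rather than referring to the column being built, derive an explicit group number with cumsum and take the first id of each group.

```r
library(dplyr)

df <- data.frame(time = c(1, 2, 3, 4, 5, 510, 511, 512, 513),
                 id   = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

df <- df %>%
  mutate(t_diff = c(NA, diff(time)),
         # a new group starts at the first row and wherever the gap is >= 500
         grp = cumsum(is.na(t_diff) | t_diff >= 500)) %>%
  group_by(grp) %>%
  mutate(dialog = first(id)) %>%
  ungroup() %>%
  select(-grp)
```

This produces the desired dialog column (1 for the first five rows, 6 for the last four) without any row-by-row loop.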

Long to short data manipulation in R with 2 ID pieces

In R, with a data set like the one below, I want to create a variable that is the difference between the prior and post values. I'll need to do some calculations by ID and later by group, so I want to keep both.
Original
ID group time value
1 A prior 8
1 A post 5
2 A prior 4
2 A post 7
3 B prior 3
3 B post 10
4 B prior 5
4 B post 6
Desired data
ID group new_value
1 A -3
2 A 3
3 B 7
4 B 1
I think to get there I need to make my data like this
ID group value_prior value_post
1 A 8 5
2 A 4 7
3 B 3 10
4 B 5 6
But I'm not sure how to get there while preserving ID and group.
Assuming your data is already sorted, you could use:
aggregate(value ~ ID + group, df, diff)
ID group value
1 1 A -3
2 2 A 3
3 3 B 7
4 4 B 1
Or:
library(dplyr)
df %>%
group_by(ID, group) %>%
summarise(new_value = diff(value))
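If you do want the intermediate wide layout shown in the question, a sketch with tidyr::pivot_wider (assuming the time column only ever holds "prior" and "post"; new_value is computed as post minus prior to match the desired output):

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID    = rep(1:4, each = 2),
                 group = rep(c("A", "B"), each = 4),
                 time  = rep(c("prior", "post"), 4),
                 value = c(8, 5, 4, 7, 3, 10, 5, 6))

# Spread time into value_prior / value_post columns, keeping ID and group
wide <- df %>%
  pivot_wider(names_from = time, values_from = value,
              names_prefix = "value_") %>%
  mutate(new_value = value_post - value_prior)
wide
```

This reproduces the desired new_value column (-3, 3, 7, 1) while preserving both ID and group.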

Pulling Specific Row Values based on Another Column

Simple question here -
if I have a dataframe such as:
> dat
typeID ID modelOption
1 2 1 good
2 2 2 avg
3 2 3 bad
4 2 4 marginCost
5 1 5 year1Premium
6 1 6 good
7 1 7 avg
8 1 8 bad
I want to pull only the modelOption values corresponding to a given typeID. I know I can subset out all the rows matching that typeID, but here I just want the modelOption values themselves.
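A sketch of the direct approach: index the modelOption column with a logical condition on typeID, which returns just the vector of values rather than a whole row subset.

```r
dat <- data.frame(typeID = c(2, 2, 2, 2, 1, 1, 1, 1),
                  ID = 1:8,
                  modelOption = c("good", "avg", "bad", "marginCost",
                                  "year1Premium", "good", "avg", "bad"),
                  stringsAsFactors = FALSE)

# All modelOption values where typeID is 2
dat$modelOption[dat$typeID == 2]
# "good" "avg" "bad" "marginCost"
```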

generate sequence of numbers in R according to other variables

I have a problem generating a sequence of numbers according to two other variables.
Specifically, I have the following DB (my real DB is not so balanced!):
ID1=rep((1:1),20)
ID2=rep((2:2),20)
ID3=rep((3:3),20)
ID<-c(ID1,ID2,ID3)
DATE1=rep("2013-1-1",10)
DATE2=rep("2013-1-2",10)
DATE=c(DATE1,DATE2)
IN<-data.frame(ID,DATE=rep(DATE,3))
and I would like to generate a sequence of numbers according to the number of observations for each ID on each DATE, like this:
OUTPUT<-data.frame(ID,DATE=rep(DATE,3),N=rep(rep(seq(1:10),2),3))
Curiously, the following solution works for the DB provided above, but not for the real DB!
IN$UNIQUE<-with(IN,as.numeric(interaction(IN$ID,IN$DATE,drop=TRUE,lex.order=TRUE)))#generate unique value for the combination of id and date
PROG<-tapply(IN$DATE,IN$UNIQUE,seq)#generate the sequence
OUTPUT$SEQ<-c(sapply(PROG,"["))#concatenate the sequence in just one vector
Right now I cannot understand why the solution doesn't work for the real DB. As always, any tips are greatly appreciated!
Here is an example (just one ID included) of the data set:
id date
1 F2_G 2005-03-09
2 F2_G 2005-06-18
3 F2_G 2005-06-18
4 F2_G 2005-06-18
5 F2_G 2005-06-19
6 F2_G 2005-06-19
7 F2_G 2005-06-19
8 F2_G 2005-06-19
9 F2_G 2005-06-20
Here's one using ave:
OUT <- within(IN, {N <- ave(ID, list(ID, DATE), FUN=seq_along)})
This should do what you want...
require(reshape2)
as.vector( apply( dcast( IN , ID ~ DATE , length )[,-1] , 1:2 , function(x)seq.int(x) ) )
[1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6
[27] 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2
[53] 3 4 5 6 7 8 9 10
Basically we use dcast to get the number of observations by ID and date, like so:
dcast( IN , ID ~ DATE , length )
ID 2013-1-1 2013-1-2
1 1 10 10
2 2 10 10
3 3 10 10
Then we use apply across each cell to make a sequence of integers as long as the count of ID for each date. Finally we coerce back to a vector using as.vector.
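For what it's worth, a dplyr sketch that sidesteps the balance assumption entirely: number the rows within each ID/DATE group with row_number, so uneven group sizes (like the real DB) are handled naturally.

```r
library(dplyr)

# Same balanced example data as in the question
IN <- data.frame(ID   = rep(1:3, each = 20),
                 DATE = rep(rep(c("2013-1-1", "2013-1-2"), each = 10), 3))

# Sequence restarts for every (ID, DATE) combination,
# regardless of how many rows each combination has
OUT <- IN %>%
  group_by(ID, DATE) %>%
  mutate(N = row_number()) %>%
  ungroup()
```

On the balanced example this gives N = 1..10 repeated for each ID/DATE pair; on an unbalanced data set each group simply gets a sequence as long as its own row count.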

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R similar to the one below. My real 'df' data frame is much bigger than this one, but I do not want to confuse anybody, so I have simplified things as much as possible.
So here's the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column 'id').
So, for column 'a' and for id number '1', the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (regarding only those records which have number '1' in column 'id'), number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show another example: for column 'a' and for id number '2':
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain again: in column 'a' (regarding only those observations which have number '2' in column 'id'), number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset (and then collect these values into a data frame). I know it is not a difficult task, but the PROBLEM is that I will have to change the input 'df' data frame on a regular basis, and hence both the overall number of rows and columns might change over time.
What I have done so far is to separate the 'df' data frame by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I'm really stuck now and I don't know how to move forward.
Is there a proper, "automatic" way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '1's in column 'a' for id group '3'
you could just do
> dftab[3,'a',1]
[1] 4
A combination of tapply and apply can create the data you want:
tapply(df$id, df$id, function(x) apply(df[df$id == x[1], -1], 2, table))
However, when a group doesn't contain every value, as in column 'a' for id group 1, the result for that id group will be a list rather than a nice table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
ColTables <- function(df) {
  counts <- list()
  for (a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
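For completeness, a modern tidyverse sketch of the same tabulation (pivot_longer plus count, roughly equivalent to table(melt(df, 'id')) above, but returning a long data frame instead of a 3-dimensional table):

```r
library(dplyr)
library(tidyr)

id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a  <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b  <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
cc <- c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d  <- c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e  <- c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <- data.frame(id, a, b, c = cc, d, e)

# One row per (id, column, value) combination with its count n
counts <- df %>%
  pivot_longer(-id, names_to = "variable") %>%
  count(id, variable, value)
head(counts)
```

Note that absent combinations (such as value '2' in column 'a' for id 1) are simply missing rows here; tidyr::complete() can fill them in with zero counts if needed.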
