I currently have 600,000+ offerlines that I want to match efficiently, based on the product bought and the timeframe.
By timeframe I mean that, starting from a base line, I look at all offerlines dated at most 10 days before or 10 days after it. Every line in that window with the same product should be matched.
However, my current code is very slow: after running it for a whole night, I only got to line 45,000.
I know parallelism is one option, but I want to know if there are better ways (packages, functions, logic).
Input data
Offerline n°,Customer n°,Offerdate,Product
(we clean to 1 offerline n° per day per custno, for a certain product)
Logic => match lines with same product, different Customer n°
Desired output
Base customer, Related Customer, Offerline n°, Matched Offerline n°, Product, Offerdate base, Offerdate matched line.
Current code
for(i in 1:nrow(X)){
  sku <- X[i,]$product
  date <- X[i,]$order.offer_date
  cust <- X[i,]$customer_code
  oon <- X[i,]$order.offer_number
  F <- data.frame()
  F <- X %>%
    filter(product == (X[i,]$product) & (order.offer_date <= date + 10 & order.offer_date >= date - 10)& customer_code != cust)
  if(nrow(F)== 0){next}
  else{
    for(j in 1:nrow(F)){
      skuc <- F[j,]$product
      datec <- F[j,]$order.offer_date
      custc <- F[j,]$customer_code
      oonc <- F[j,]$order.offer_number
      if(custc == cust | oon == oonc){next}
      else if(skuc != sku){next}
      else if(skuc == sku){
        if(datec <= date + 10 & datec >= date - 10){
          z <- z + 1
          Y[z,]$count <- j
          Y[z,]$base <- oon
          Y[z,]$related <- oonc
          Y[z,]$baseSku <- sku
          Y[z,]$relSku <- skuc
          Y[z,]$basedate <- as.Date(date)
          Y[z,]$reldate <- as.Date(datec)
          Y[z,]$basecust <- cust
          Y[z,]$relcust <- custc
        }
        else{next}
      }
    }
    next
  }
}
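One possible way to avoid the row-by-row loops is sketched below in base R. It assumes order.offer_date is a Date column and uses made-up toy data. The idea is to split by product, sort each piece by date, and let findInterval() locate the first and last row inside each line's 10-day window, so no line is compared against the whole table. The helper name match_offers and its window argument are hypothetical, not part of the question.

```r
# Toy data; column names follow the question's code, values are made up.
X <- data.frame(
  order.offer_number = 1:5,
  customer_code      = c("A", "B", "C", "A", "B"),
  order.offer_date   = as.Date(c("2020-01-01", "2020-01-05", "2020-01-20",
                                 "2020-01-03", "2020-01-04")),
  product            = c("p1", "p1", "p1", "p2", "p2"),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: all pairs with the same product, a different customer,
# and offer dates at most `window` days apart.
match_offers <- function(X, window = 10) {
  pieces <- lapply(split(X, X$product, drop = TRUE), function(g) {
    g <- g[order(g$order.offer_date), ]
    d <- as.numeric(g$order.offer_date)
    # For each line: first and last row (in date order) inside its window.
    first <- findInterval(d - window, d, left.open = TRUE) + 1L
    last  <- findInterval(d + window, d)
    n     <- last - first + 1L
    # Expand to (base row, related row) index pairs.
    base  <- rep(seq_along(d), times = n)
    rel   <- sequence(n) + rep(first, times = n) - 1L
    # Keep only pairs from different customers (this also drops self-pairs).
    keep  <- g$customer_code[base] != g$customer_code[rel]
    base  <- base[keep]
    rel   <- rel[keep]
    data.frame(
      basecust = g$customer_code[base],
      relcust  = g$customer_code[rel],
      base     = g$order.offer_number[base],
      related  = g$order.offer_number[rel],
      product  = g$product[base],
      basedate = g$order.offer_date[base],
      reldate  = g$order.offer_date[rel],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

matches <- match_offers(X)
```

Each pair appears twice, once with each line as the base, which mirrors what the loop in the question produces. A non-equi join from a package such as data.table would be another option, but it is not shown here.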
I am working on market transaction data where each observation contains a buyer id and a seller id. For each observation (i.e. each transaction), I would like to create a variable equal to the number of other buyers, different from the one involved in this transaction, that the associated seller has traded with. For example, given the following
data <- data.frame(Buyer_id = c("001","001","002","001"), Seller_id = c("021","022","022","021"))
I would like to obtain:
Result <- list(0,1,1,0)
I searched for existing answers to problems similar to mine, which usually involve the function mapply(), and tried to implement them, but without success.
Thank you very much for helping me.
Are you looking for something like this? If yes, then you might want to change your reproducible example to have a c instead of list when you construct your data.frame.
data <- data.frame(Buyer_id = c("001","001","002","001"),
Seller_id = c("021","022","022","021"))
data$n <- NA
for (i in seq_len(nrow(data))) {
seller <- as.character(data[i, "Seller_id"])
buyer <- as.character(data[i, "Buyer_id"])
with.buyers <- as.character(data[data$Seller_id == seller, "Buyer_id"])
with.buyers <- unique(with.buyers)
diff.buyers <- with.buyers[!(with.buyers %in% buyer)]
data[i, "n"] <- length(diff.buyers)
}
Buyer_id Seller_id n
1 001 021 0
2 001 022 1
3 002 022 1
4 001 021 0
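The same count can also be obtained without an explicit loop. The following is a minimal sketch, assuming the ids are stored as character strings: count the distinct buyers per seller once with tapply(), then look that count up for each row and subtract one for the current buyer. The name n_buyers is introduced here for illustration.

```r
data <- data.frame(Buyer_id  = c("001", "001", "002", "001"),
                   Seller_id = c("021", "022", "022", "021"),
                   stringsAsFactors = FALSE)

# Number of distinct buyers for each seller (a named vector, one entry per seller).
n_buyers <- tapply(data$Buyer_id, data$Seller_id,
                   function(b) length(unique(b)))

# Look up each row's seller by name and subtract the current buyer.
data$n <- as.integer(unname(n_buyers[as.character(data$Seller_id)])) - 1L
```

Because the distinct-buyer count is computed once per seller rather than once per row, this should scale better than the loop on large data.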
Apart from Roman Lustrik's solution, there is also an approach that uses graphs.
library(igraph)
data <- data.frame(Seller_id = c("021","022","022","021"),
Buyer_id = c("001","001","002","001"),
stringsAsFactors = FALSE)
my.graph <- graph_from_data_frame(data)
plot(my.graph)
degree(my.graph, mode = c("out"))
# Transform the graph into a simple graph. A simple graph does not allow
# duplicate edges.
my.graph <- simplify(my.graph)
plot(my.graph)
degree(my.graph, mode = c("out"))
V(my.graph)$out.degree <- degree(my.graph, mode = c("out"))
data$n <- apply(data,
                MARGIN = 1,
                FUN = function(transaction)
                {
                  node.out.degree <- V(my.graph)$out.degree[ V(my.graph)$name == transaction["Seller_id"] ]
                  if (node.out.degree <= 1) {
                    # After simplify(), the out degree is the number of distinct
                    # buyers of this seller. With at most one, the current buyer
                    # is the seller's only buyer.
                    return(0)
                  } else {
                    # The seller has traded with more than one buyer. Subtract
                    # one (the current buyer) from the out degree.
                    return(node.out.degree - 1)
                  }
                })
data
I want to find the best way to plot a chart showing the cumulative number of individuals in a group based on the date they came into the group as well as the date they may have left the group. This would be within the minimum and maximum date ranges of the date values. Each row is a person.
group_id Date_started Date_exit
1 2005-06-23 NA
1 2013-03-17 2013-09-20
2 2019-10-24 NA
3 2019-11-27 2019-11-27
4 2019-08-14 NA
3 2018-10-17 NA
4 2018-04-13 2019-10-12
1 2019-07-10 NA
I've considered creating a new data frame with one row per day within the min/max range and then applying some kind of function to tally the group totals for each row (adding to and subtracting from a running total whenever there is a new value in either of the columns). However, I'm not sure whether that's the best way to approach the problem, nor how to implement the cumulative count function in practice.
Ultimately though I want to be able to plot this as a line chart so I can see the trends over time for each group as I suspect one or more of them are much more volatile in terms of overall numbers. So again I'm not sure if ggplot2 has something already in place to handle this.
As you mentioned, you will need to create a dataframe with the desired dates and count, for each group, how many individuals are in the group.
I quickly put this together, so I'm sure there's a more optimal solution, but it should be what you're looking for.
library(ggplot2)
library(reshape2) # for melt
# your data
test <- read.table(
text =
"group_id,Date_started,Date_exit
1,2005-06-23,NA
1,2013-03-17,2013-09-20
2,2019-10-24,NA
3,2019-11-27,2019-11-27
4,2019-08-14,NA
3,2018-10-17,NA
4,2018-04-13,2019-10-12
1,2019-07-10,NA",
header = TRUE, sep = ",", stringsAsFactors = FALSE
)
# make date series
from <- min(as.POSIXct(test$Date_started))
to <- max(as.POSIXct(c(test$Date_started, test$Date_exit)), na.rm = TRUE) # latest start or exit date
datebins <- seq(from, to, by = "1 month")
d_between <- function(d, ds, de){
if(ds <= d & (de > d | is.na(de)))
return(TRUE)
return(FALSE)
}
# make df to plot
df <- data.frame(dates = datebins)
df[,paste0("g", unique(test$group_id))] <- 0
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
# plot
ggplot(melt(df, id.vars = "dates"), aes(dates, value, color = variable)) +
geom_line(size = 1) + theme_bw()
This gives a line chart with one line per group.
Feel free to play with the date bins (in seq()) as necessary.
EDIT : for loop explanation
for(i in seq_len(nrow(df))){
for(j in seq_len(nrow(test))){
gid <- paste0("g", test$group_id[j])
df[i, gid] <- df[i, gid] + d_between(df$dates[i], test$Date_started[j], test$Date_exit[j])
}
}
The first loop iterates over the chosen dates.
For each date, go through the dataframe of interest (test) with the second for loop and use the custom d_between() function to determine whether or not an individual is part of the group. That function returns a boolean (which can translate to 0/1). The value 0 or 1 is then added to the df dataframe's column corresponding to the appropriate group (with gid) at the date we checked (row i).
Note that I'm considering the individuals as part of the group as soon as they join (ds <= d), but are not a part of the group the day they quit (de > d).
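The running-total idea from the question can also be sketched directly, without checking every date against every person. This is a minimal base R sketch, assuming both date columns are converted to Date: each start date is a +1 event, each non-missing exit date is a -1 event, and a cumulative sum per group gives the membership after each event. The names events, change and members are introduced here for illustration.

```r
test <- data.frame(
  group_id     = c(1, 1, 2, 3, 4, 3, 4, 1),
  Date_started = as.Date(c("2005-06-23", "2013-03-17", "2019-10-24", "2019-11-27",
                           "2019-08-14", "2018-10-17", "2018-04-13", "2019-07-10")),
  Date_exit    = as.Date(c(NA, "2013-09-20", NA, "2019-11-27",
                           NA, NA, "2019-10-12", NA))
)

# One +1 event per start date and one -1 event per non-missing exit date.
has_exit <- !is.na(test$Date_exit)
events <- rbind(
  data.frame(group_id = test$group_id,
             date = test$Date_started, change = 1L),
  data.frame(group_id = test$group_id[has_exit],
             date = test$Date_exit[has_exit], change = -1L)
)

# Net change per group and date, then a running total within each group.
events <- aggregate(change ~ group_id + date, data = events, FUN = sum)
events <- events[order(events$group_id, events$date), ]
events$members <- ave(events$change, events$group_id, FUN = cumsum)
```

As in the loop version, a person stops counting on their exit date. The result has one row per group and event date, so it could be plotted as a step chart, for example with ggplot(events, aes(date, members, color = factor(group_id))) + geom_step().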
I'm trying to create a helper function that calculates how many rows of a data.frame match the given parameters.
getTotalParkeds <- function(place, weekday, entry_hour){
data <- PARKEDS[
PARKEDS$place == place,
PARKEDS$weekday == weekday,
PARKEDS$entry_hour == entry_hour
]
return(nrow(data))
}
Then I'm running this like:
getTotalParkeds('MyPlace', 'mon', 1)
So it is returning this error:
Warning: Error in : Length of logical index vector must be 1 or 11 (the number of columns), not 10000
I'm totally new to R, so I have no idea on what is happening.
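Here's the correction you need for your approach. Inside [ ], commas separate the row index from the column index, so your three conditions were not combined into one row filter (the second one was treated as a column index, hence the error). Join them with & and keep a single trailing comma so that they select rows: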
getTotalParkeds <- function(place, weekday, entry_hour){
data <- PARKEDS[
PARKEDS$place == place &
PARKEDS$weekday == weekday &
PARKEDS$entry_hour == entry_hour,
]
return(nrow(data))
}
Allowing different PARKEDS data, say next month's data:
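getTotalParkeds <- function(input, place, weekday, entry_hour){
  # Index input explicitly: inside subset(), `place == place` would compare
  # the column with itself and keep every row.
  keep <- input$place == place &
    input$weekday == weekday &
    input$entry_hour == entry_hour
  row.count <- sum(keep, na.rm = TRUE)
  return(row.count)
}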
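One thing to watch for when the data frame is passed in as an argument: inside subset(), a bare name is looked up in the data frame's columns first, so a function argument with the same name as a column is masked by that column. The sketch below uses made-up data and two hypothetical helpers, count_masked and count_explicit, to show the difference.

```r
# Made-up data with the column names from the question.
parked <- data.frame(place      = c("MyPlace", "MyPlace", "Other"),
                     weekday    = c("mon", "tue", "mon"),
                     entry_hour = c(1, 1, 1),
                     stringsAsFactors = FALSE)

# Both sides of `place == place` resolve to the column, so every row is kept.
count_masked <- function(input, place) {
  nrow(subset(input, place == place))
}

# Explicit indexing compares the column with the function argument.
count_explicit <- function(input, place) {
  sum(input$place == place, na.rm = TRUE)
}

masked   <- count_masked(parked, "Other")    # counts all rows
explicit <- count_explicit(parked, "Other")  # counts only matching rows
```

Giving the arguments names that differ from the column names would be another way to avoid the clash.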
I'm trying to find out how to change the values in a data.frame based on specific proportions. This is an example of the values in the data.frame where class values (with different counts) are grouped by the field "id":
> head(pts)
id class
1 245 10
2 522 10
3 522 10
4 522 10
And this is an example of the proportions:
id class perc%
245 10 100
522 10 50
522 20 50
My objective is to be able to select the values for each "id" and change them according to the "perc%" field, e.g. if I have 100 values for id=522 then change 50 values to class=10 and then 50 values to class=20 (perc%=50).
I've tried subsetting the data.frame or making conditional selections but can't find a way to basically join the "perc%" with the counts of values per "id".
Thanks in advance.
I think I see better what you are trying to do now. This code may not be very elegant, but it should get the job done. The "percentages" dataframe (df.percentages) represents the second table you described; note that I am renaming "perc%" to "perc".
library(dplyr) # for %>%, group_by, summarize, filter, mutate
colnames(df.percentages) <- c("id","class","perc")
p.check <- df.percentages %>% group_by(id) %>% summarize(sum(perc))
colnames(p.check) <- c("id","sumperc")
not100 <- which(p.check$sumperc != 100)
if(length(not100) != 0)
{
print(paste("ID",p.check$id[not100],"does not add up to 100%"))
}
rm(p.check)
ids <- unique(df.percentages$id)
for(i in 1:length(ids))
{
print("")
print(paste("Processing ID:",ids[i]))
classes.to.reassign <- pts %>% filter(id == ids[i])
if(nrow(classes.to.reassign) == 0)
{
print(paste("Could not find ID",ids[i],"in pts dataframe!"))
next
}
class.rows <- df.percentages %>%
filter(id == ids[i]) %>%
mutate(rows = as.integer(round(nrow(classes.to.reassign) * (perc / 100))))
if(nrow(classes.to.reassign) < nrow(class.rows))
{
print(paste("Cannot split", nrow(classes.to.reassign), "classes into", nrow(class.rows), "segments for ID:", ids[i]))
next
}
# push any rounding difference (positive or negative) onto the last class
class.rows$rows[nrow(class.rows)] <- class.rows$rows[nrow(class.rows)] + (nrow(classes.to.reassign) - sum(class.rows$rows))
class.rows <- class.rows %>%
mutate(cumrows = cumsum(as.integer(rows)))
print(paste("Total rows for ID",ids[i],"=",nrow(classes.to.reassign)))
cur.row <- 1
for(c in 1:nrow(class.rows))
{
last.row <- class.rows$cumrows[c]
if(last.row < cur.row) next # this class received no rows after rounding
print(paste("Assigning class",class.rows$class[c],"to rows",cur.row,"-",last.row))
classes.to.reassign$class[cur.row:last.row] <- class.rows$class[c]
cur.row <- last.row + 1
}
if(i == 1)
{pts.new <- classes.to.reassign}
else
{pts.new <- rbind(pts.new, classes.to.reassign)}
rm(classes.to.reassign, class.rows)
}
pts <- pts.new
View(pts)
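The core of the reassignment can also be written more compactly. The following is a minimal base R sketch on made-up data, assuming the percentages for each id sum to 100: cumulative percentages are turned into row cut points, and rep() then repeats each class the required number of times. The names props and reassign_classes are hypothetical.

```r
# Made-up data in the shape of the two tables from the question.
pts   <- data.frame(id = c(245, 522, 522, 522, 522), class = 10)
props <- data.frame(id    = c(245, 522, 522),
                    class = c(10, 10, 20),
                    perc  = c(100, 50, 50))

reassign_classes <- function(pts, props) {
  for (cur in unique(props$id)) {
    idx <- which(pts$id == cur)
    if (length(idx) == 0) next            # id not present in pts
    p <- props[props$id == cur, ]
    # Cumulative cut points, so rounding errors cannot accumulate.
    ends <- round(cumsum(p$perc) / 100 * length(idx))
    ends <- pmin(ends, length(idx))
    ends[length(ends)] <- length(idx)     # last class absorbs any remainder
    n <- diff(c(0, ends))                 # rows per class
    pts$class[idx] <- rep(p$class, times = n)
  }
  pts
}

pts.new <- reassign_classes(pts, props)
```

Rounding the cumulative shares, rather than each share separately, keeps every per-class count non-negative and makes the counts add up to the number of rows for that id.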
I am sorry for the cryptic title but I didn't know how to adequately summarise my problem. So here's my question. I have a data frame with dates and a name for several entities:
df <- data.frame(
time=rep(seq(as.Date("2004/1/1"), as.Date("2005/12/1"), by = "1 month"),2),
name=c(rep("a",24),rep("b",24))
)
str(df)
'data.frame': 48 obs. of 2 variables:
$ time: Date, format: "2004-01-01" "2004-02-01" ...
$ name: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
And I have another dataframe with several unevenly spaced events:
events <- data.frame(
time = c("2004-12-1", "2005-8-1", "2005-6-1", "2004-4-1"),
event = c("normal", "extraordinary", "normal", "extraordinary"),
name = c("a", "a", "b", "b")
)
I want to merge these two data frames so that each event is assigned either from the beginning of the data set up to the event, or from the last event up to the next event or the end of the data set. This would look something like:
date name event
2004-01-01 a normal
2004-02-01 a normal
...
2004-12-01 a extraordinary
2005-01-01 a extraordinary
Is there an easy way doing this in R that I don't see or do I merge these by hand? Thank you very much for your help!
I don't know of any function that does this, but here is some R code to do it yourself:
# Needed type coercions (Date for comparisons, characters to avoid 'factor' problems)
events$time <- as.Date(events$time)
events$event <- as.character(events$event)
events$name <- as.character(events$name)
df$name <- as.character(df$name)
# Events ordering (needed to detect previous events as non NA)
events <- events[ order(events$time) ,]
# Updates
df$event = NA
for(i in 1:nrow(events)) {
# Update where time is lesser than the limit, if names correspond and if an event was not already assigned to the row
df[ df$time <= events[i,"time"] & df$name == events[i,"name"] & is.na(df$event) , "event" ] = events[i,"event"]
}
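The same rule (each row takes the first event of the same name dated on or after the row's date) can also be sketched with findInterval() instead of repeated filtering. This is a minimal sketch that rebuilds the example data with character columns; rows after the last event of a name stay NA, as in the loop above.

```r
df <- data.frame(
  time = rep(seq(as.Date("2004-01-01"), as.Date("2005-12-01"), by = "1 month"), 2),
  name = rep(c("a", "b"), each = 24),
  stringsAsFactors = FALSE
)
events <- data.frame(
  time  = as.Date(c("2004-12-01", "2005-08-01", "2005-06-01", "2004-04-01")),
  event = c("normal", "extraordinary", "normal", "extraordinary"),
  name  = c("a", "a", "b", "b"),
  stringsAsFactors = FALSE
)

events <- events[order(events$name, events$time), ]
df$event <- NA_character_
for (nm in unique(events$name)) {
  ev   <- events[events$name == nm, ]
  rows <- which(df$name == nm)
  # Number of events strictly before each date, plus one: the index of the
  # first event on or after that date.
  k <- findInterval(as.numeric(df$time[rows]), as.numeric(ev$time),
                    left.open = TRUE) + 1L
  df$event[rows] <- ev$event[k]  # an index past the last event yields NA
}
```

The loop now runs once per name rather than once per event, and each lookup is a binary search over that name's sorted event dates.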
Here is a function to do what you want:
event.aligning <- function(time.dataframe, events){
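## coerce types: Date for comparisons, factors because levels() is used below
if(!inherits(events[["time"]], "Date")){
events[["time"]] <- as.Date(events[["time"]])
}
events$event <- as.factor(events$event)
events$name <- as.factor(events$name)
## lets sort on time
events <- events[order(events[["time"]]),]
## setup event column (event.types must be defined before it is used)
event.types <- levels(events$event)
time.dataframe$event <- factor(rep(NA, nrow(time.dataframe)), levels = event.types)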
rownames.tdf <- rownames(time.dataframe)
res.time.dataframe <- NULL
for( i in 1:length(levels(events$name))){
i.name <- levels(events$name)[i]
i.name.events <- subset(events, name == i.name)
first.time <- time.dataframe$time[time.dataframe$name == i.name][1]
first.event <- i.name.events$time[1]
## assume 2 events
first.event.type <- i.name.events$event[1]
second.event.type <- unique(i.name.events$event[i.name.events$event != first.event.type])
event.types <- levels(i.name.events$event)
sub.time.df <- time.dataframe[time.dataframe$name == i.name,]
rownames(sub.time.df) <- 1:length(sub.time.df[,1])
sub.time.df[1:(as.numeric(rownames(sub.time.df[sub.time.df$time == first.event,])) - 1),]$event <- second.event.type
cur.event <- first.event
for( j in 2:length(i.name.events[,1])){
next.event <- i.name.events$time[j]
sub.time.df[rownames( sub.time.df[ sub.time.df[["time"]] == cur.event,]) :
(as.numeric(rownames( sub.time.df[sub.time.df[["time"]] == next.event,])) - 1),]$event <- i.name.events$event[j-1]
cur.event <- next.event
next.event.type = i.name.events$event[j]
}
last.time <- sub.time.df$time[length(sub.time.df$time)]
last.event <- i.name.events$time[length(i.name.events$time)]
sub.time.df[rownames( sub.time.df[sub.time.df$time == last.event,]):length(sub.time.df$time),]$event <- next.event.type
res.time.dataframe <- rbind(res.time.dataframe, sub.time.df)
}
rownames(res.time.dataframe) <- rownames.tdf
return(res.time.dataframe)
}
df2 <- event.aligning(df, events)