Creating a new column after joining two data.tables in R

I have two data.tables, main and metrics, both keyed by cid.
I want to add to table main the average of each of several values located in metrics.
However, I would like to filter by code, only averaging those rows in metrics with a given code.
> metrics
cid code DZ value1 value2
1: 1001 A 101 8 21
2: 1001 B 102 11 26
3: 1001 A 103 17 25
4: 1002 A 104 25 39
5: 1002 B 105 6 30
6: 1002 A 106 23 40
7: 1003 A 107 27 32
8: 1003 B 108 16 37
9: 1003 A 109 14 42
# DESIRED OUTPUT
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37
# SAMPLE DATA
set.seed(1)
main <- data.table(cid=1e3+1:3, key="cid")
metrics <- data.table(cid=rep(1e3+1:3, each=3), code=rep(c("A", "B", "A"), 3), DZ=101:109, value1=sample(30, 9), value2=sample(20:50, 9), key="cid")
code.filters <- c("A", "B")
These lines get the desired output, but I am having difficulty assigning the new columns back into main (also, doing it programmatically would be preferred).
main[metrics[code==code.filters[[1]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[1]]]][, list(mean(c(value2))), by=cid]
main[metrics[code==code.filters[[2]]]][, list(mean(c(value1))), by=cid]
main[metrics[code==code.filters[[2]]]][, list(mean(c(value2))), by=cid]
Additionally, can someone explain why the following line only takes the last value in each group?
main[metrics[ code=="A"], A.avg.val1 := mean(c(value1))]

You don't need main. You can get it directly from metrics as follows:
> tmp.dt <- metrics[, list(A.avg.val1 = mean(value1[code == "A"]),
                           A.avg.val2 = mean(value2[code == "A"]),
                           B.avg.val1 = mean(value1[code == "B"]),
                           B.avg.val2 = mean(value2[code == "B"])), by = cid]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30
# 3: 1003 20.5 37.0 16 37
If you still want to subset with main, just do:
main <- data.table(cid = c(1001:1002))
> tmp.dt[main]
# cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
# 1: 1001 12.5 23.0 11 26
# 2: 1002 24.0 39.5 6 30

I would do this in two steps: first get the means, then reshape the data.
foo <- main[metrics]
bar <- foo[, list(val1 = mean(value1),
                  val2 = mean(value2)),
           by = c('cid', 'code')]
library(reshape2)
bar.melt <- melt(bar, id.var = c('cid', 'code'))
dcast(data = bar.melt, cid ~ code + variable)
But really, I'd probably leave the data in the "long" format because I find it much easier to work with!
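If you do want the wide table, newer data.table versions (1.9.6+) let dcast take multiple value.var columns, so the separate melt step can be skipped. A sketch using the values printed in the question; note the resulting names (val1_A etc.) differ slightly from the question's A.avg.val1:

```r
library(data.table)

# the metrics values as printed in the question
metrics <- data.table(
  cid    = rep(1001:1003, each = 3),
  code   = rep(c("A", "B", "A"), 3),
  value1 = c(8, 11, 17, 25, 6, 23, 27, 16, 14),
  value2 = c(21, 26, 25, 39, 30, 40, 32, 37, 42)
)

# means per cid/code in long form, then cast straight to wide
avg  <- metrics[, .(val1 = mean(value1), val2 = mean(value2)), by = .(cid, code)]
wide <- dcast(avg, cid ~ code, value.var = c("val1", "val2"))
wide
#    cid val1_A val1_B val2_A val2_B
# 1: 1001   12.5     11   23.0     26
# 2: 1002   24.0      6   39.5     30
# 3: 1003   20.5     16   37.0     37
```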

Working off of @Arun's answer, the following gets the desired results:
invisible(
  sapply(code.filters, function(cf)
    main[metrics[code == cf, list(avgv1 = mean(value1), avgv2 = mean(value2)), by = cid],
         paste0(cf, c(".avg.val1", ".avg.val2")) := list(avgv1, avgv2)]
))
> main
cid A.avg.val1 A.avg.val2 B.avg.val1 B.avg.val2
1: 1001 12.5 23.0 11 26
2: 1002 24.0 39.5 6 30
3: 1003 20.5 37.0 16 37


R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
  ungroup() %>%
  group_by(firm_id) %>%
  arrange(firm_id, fyear) %>% # ensure ascending fyear order within each firm; not strictly required here, as df is already sorted
  mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occurred in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just-created xrd_stock in the previous row (which makes sense if R does not evaluate the column strictly from top to bottom); notably, when I simply put a 9 in the else part, the code above runs without any errors.
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much to everyone in advance who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so that the minimum year is always in the first row; you can then use accumulate to calculate.
library(dplyr)
df %>%
  arrange(firm_id, fyear) %>%
  group_by(firm_id) %>%
  mutate(xrd_stock = purrr::accumulate(xrd[-1], ~ .y + (1 - depr) * .x,
                                       .init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
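The same recursion can also be written without purrr, using base R's Reduce with accumulate = TRUE. A minimal sketch (the helper name xrd_stock_fun is mine):

```r
growth <- 0.08
depr   <- 0.15

# the first element seeds the recursion; each later element folds in the previous stock
xrd_stock_fun <- function(xrd) {
  Reduce(function(prev, x) x + (1 - depr) * prev,
         xrd[-1],
         init = xrd[1] / (growth + depr),
         accumulate = TRUE)
}

# e.g. Cosco's xrd series from the question
round(xrd_stock_fun(c(49, 93, 121, 84, 37)))
# [1] 213 274 354 385 364
```

Applied per group, e.g. with ave(df$xrd, df$firm_id, FUN = xrd_stock_fun) after sorting by fyear.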

How to group by in base R

I would like to express the following SQL query using base R (without any particular package):
select month, day, count(*) as count, avg(dep_delay) as avg_delay
from flights
group by month, day
having count > 1000
It selects the mean departure delay and the number of flights per day on busy days (days with more than 1000 flights). The dataset is nycflights13, containing information on flights that departed from NYC in 2013.
Notice I can easily write this in dplyr as:
flights %>%
  group_by(month, day) %>%
  summarise(count = n(), avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  filter(count > 1000)
Since I was reminded earlier about the elegance of by (tip of the hat to @Parfait), here is a solution using by:
res <- by(flights, list(flights$month, flights$day), function(x)
  if (nrow(x) > 1000) {
    c(
      month = unique(x$month),
      day = unique(x$day),
      count = nrow(x),
      avg_delay = mean(x$dep_delay, na.rm = TRUE))
  })
# Store in data.frame and order by month, day
df <- do.call(rbind, res)
df <- df[order(df[, 1], df[, 2]), ]
# month day count avg_delay
#[1,] 7 8 1004 37.296646
#[2,] 7 9 1001 30.711499
#[3,] 7 10 1004 52.860702
#[4,] 7 11 1006 23.609392
#[5,] 7 12 1002 25.096154
#[6,] 7 17 1001 13.670707
#[7,] 7 18 1003 20.626789
#[8,] 7 25 1003 19.674134
#[9,] 7 31 1001 6.280843
#[10,] 8 7 1001 8.680402
#[11,] 8 8 1001 43.349947
#[12,] 8 12 1001 8.308157
#[13,] 11 27 1014 16.697651
#[14,] 12 2 1004 9.021978
As commented, you can use a combination of subset and aggregate. I changed the order of day & month to receive the same order as your dplyr approach, and use na.action = NULL to count rows including NAs.
library(nycflights13)
#> Warning: package 'nycflights13' was built under R version 3.4.4
subset(aggregate(dep_delay ~ day + month, flights,
                 function(x) cbind(count = length(x), avg_delay = mean(x, na.rm = TRUE)),
                 na.action = NULL),
       dep_delay[, 1] > 1000)
#> day month dep_delay.1 dep_delay.2
#> 189 8 7 1004.000000 37.296646
#> 190 9 7 1001.000000 30.711499
#> 191 10 7 1004.000000 52.860702
#> 192 11 7 1006.000000 23.609392
#> 193 12 7 1002.000000 25.096154
#> 198 17 7 1001.000000 13.670707
#> 199 18 7 1003.000000 20.626789
#> 206 25 7 1003.000000 19.674134
#> 212 31 7 1001.000000 6.280843
#> 219 7 8 1001.000000 8.680402
#> 220 8 8 1001.000000 43.349947
#> 224 12 8 1001.000000 8.308157
#> 331 27 11 1014.000000 16.697651
#> 336 2 12 1004.000000 9.021978
Created on 2018-04-05 by the reprex package (v0.2.0).
Not a particularly elegant solution, but this will do what you want using Base R
flights_split <- split(flights, f = list(flights$month, flights$day))
result <- lapply(flights_split, function(x) {
  if (nrow(x) > 1000) {
    data.frame(month = unique(x$month), day = unique(x$day),
               avg_delay = mean(x$dep_delay, na.rm = TRUE), count = nrow(x))
  } else {
    NULL
  }
})
do.call(rbind, result)
# month day avg_delay count
# 12.2 12 2 9.021978 1004
# 8.7 8 7 8.680402 1001
# 7.8 7 8 37.296646 1004
# 8.8 8 8 43.349947 1001
# 7.9 7 9 30.711499 1001
# 7.10 7 10 52.860702 1004
# 7.11 7 11 23.609392 1006
# 7.12 7 12 25.096154 1002
# 8.12 8 12 8.308157 1001
# 7.17 7 17 13.670707 1001
# 7.18 7 18 20.626789 1003
# 7.25 7 25 19.674134 1003
# 11.27 11 27 16.697651 1014
# 7.31 7 31 6.280843 1001
Here is my solution:
grp <- expand.grid(mth = unique(flights$month), d = unique(flights$day))
out <- mapply(function(mth, d) {
  sub_data <- subset(flights, month == mth & day == d)
  df <- data.frame(
    month = mth,
    day = d,
    count = nrow(sub_data),
    avg_delay = mean(sub_data$dep_delay, na.rm = TRUE)
  )
  df[df$count > 1000, ]
}, grp$mth, grp$d, SIMPLIFY = FALSE)
res <- do.call(rbind, out)
This is a lot slower than the dplyr solution.
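Another base-R route is tapply over an interaction of month and day. A sketch on a toy stand-in for flights (the real data lives in nycflights13, so the > 1000 filter keeps nothing here):

```r
# toy stand-in with just the columns the query needs
flights <- data.frame(month     = c(7, 7, 7, 8),
                      day       = c(8, 8, 9, 7),
                      dep_delay = c(10, NA, 30, 40))

key       <- interaction(flights$month, flights$day, drop = TRUE)
count     <- tapply(flights$dep_delay, key, length)          # length() counts NAs too
avg_delay <- tapply(flights$dep_delay, key, mean, na.rm = TRUE)

res <- data.frame(count, avg_delay)
res[res$count > 1000, ]   # empty on the toy data; the busy days on the real one
```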

Data.table: operation with group-shifted data

Consider the following data.table:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
                 level = c(137,137,137,136,136,136,135,135,135),
                 valueIn = c(13,30,56,11,25,60,8,27,51))
I would like to have the following output:
DT <- data.table(year = c(2011,2012,2013,2011,2012,2013,2011,2012,2013),
                 level = c(137,137,137,136,136,136,135,135,135),
                 valueIn = c(13,30,56, 11,25,60, 8,27,51),
                 valueOut = c(12,27.5,58, 9.5,26,55.5, NA,NA,NA))
In other words, within each year I want to calculate (valueIn[level] + valueIn[level-1]) / 2. For example, the first value is calculated like this: (13+11)/2=12.
For the moment, I do that with a for loop, in which I create data.table subsets for each level:
levelDtList <- list()
levels <- sort(DT$level, decreasing = FALSE)
for (this.level in levels) {
  levelDt <- DT[level == this.level]
  if (this.level == min(levels)) {
    valueOut <- NA
  } else {
    levelM1Data <- levelDtList[[this.level - 1]]
    valueOut <- (levelDt$valueIn + levelM1Data$valueIn) / 2
  }
  levelDt$valueOut <- valueOut
  levelDtList[[this.level]] <- levelDt
}
datatable <- rbindlist(levelDtList)
This is ugly and quite slow, so I am looking for a better, faster, data.table-based solution.
Using the shift function with type = 'lead' to get the next value, then summing and dividing by two:
DT[, valueOut := (valueIn + shift(valueIn, type = 'lead'))/2, by = year]
you get:
year level valueIn valueOut
1: 2011 137 13 12.0
2: 2012 137 30 27.5
3: 2013 137 56 58.0
4: 2011 136 11 9.5
5: 2012 136 25 26.0
6: 2013 136 60 55.5
7: 2011 135 8 NA
8: 2012 135 27 NA
9: 2013 135 51 NA
With all the parameters of the shift function spelled out:
DT[, valueOut := (valueIn + shift(valueIn, n = 1L, fill = NA, type = 'lead'))/2, by = year]
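A quick look at what shift() itself returns on a plain vector may help (a sketch, assuming data.table is loaded):

```r
library(data.table)

x <- c(13, 11, 8)                 # valueIn for 2011, across levels 137/136/135
shift(x, n = 1L, type = "lead")   # next element, NA at the end
shift(x, n = 1L, type = "lag")    # previous element, NA at the start
(x + shift(x, type = "lead")) / 2 # the pattern used above: pairwise mean with the next row
```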
We can also use shift with Reduce
DT[, valueOut := Reduce(`+`, shift(valueIn, type = "lead", 0:1))/2, by = year]
DT
# year level valueIn valueOut
#1: 2011 137 13 12.0
#2: 2012 137 30 27.5
#3: 2013 137 56 58.0
#4: 2011 136 11 9.5
#5: 2012 136 25 26.0
#6: 2013 136 60 55.5
#7: 2011 135 8 NA
#8: 2012 135 27 NA
#9: 2013 135 51 NA
It is easier to generalize, as shift can take a vector of 'n' values.
If you:
- don't mind using dplyr,
- the year is the thing that relates your items, and
- the structure shown is representative of reality,
then this could work for you:
DT %>% group_by(year) %>% mutate(valueOut = (valueIn + lead(valueIn)) / 2)

Find matching intervals in data frame by range of two column values

I have a data frame of time related events.
Here is an example:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
There are essentially two different groups, ID 1 & 2. For each of those groups, there are 18 different names. Each of those people appears in 3 different sequences, A-C. They then have active time periods during those sequences, and I mark the start/end events and calculate the duration.
I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.
Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.
I do not need to match the exact amount of shared 'active' time, I just am hoping to isolate the rows that are common.
I'm most comfortable with dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those work on precise and exact data points; I can't figure out the strategy for a range/interval.
Thank you!
UPDATE:
Here is the example of the desired result
Name Event Order Sequence start_event end_event duration Group
JOHN 3 A 392 429 37 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 2 A 384 407 23 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I'm thinking you'd isolate each event row for John, mark the start/end time frame and then iterate through every name and event for the remainder of the data frame to find time points that fit first within the same sequence, and then secondly against the bench-marked start/end time frame of John.
As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
  jpos <- which(x$Name == "JOHN")
  njpos <- which(x$Name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) {
    overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
  })
  x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]), ]
}))
# Name EventOrder Sequence start_event end_event duration Group
# A.2 JOHN 2 A 60 112 52 ID1
# A.3 JOHN 3 A 392 429 37 ID1
# A.7 ADAM 1 A 19 75 56 ID2
# A.8 ADAM 2 A 384 407 23 ID2
# C.5 JOHN 5 C 147 226 79 ID1
# C.6 JOHN 6 C 566 611 45 ID1
# C.11 ADAM 5 C 140 205 65 ID2
# C.12 ADAM 6 C 522 599 77 ID2
Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].
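As a sanity check, here is the overlap() helper applied to two of the ranges just mentioned:

```r
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)

overlap(60, 112, 19, 75)   # TRUE: [60,112] and [19,75] share [60,75]
overlap(0, 19, 19, 75)     # FALSE: the strict > means touching endpoints don't count
```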
This could be pretty easily mapped into dplyr language:
library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
  jpos <- which(name == "JOHN")
  njpos <- which(name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
  c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
  group_by(Sequence) %>%
  slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
#
# Name EventOrder Sequence start_event end_event duration Group
# (fctr) (int) (fctr) (int) (int) (int) (fctr)
# 1 JOHN 2 A 60 112 52 ID1
# 2 JOHN 3 A 392 429 37 ID1
# 3 ADAM 1 A 19 75 56 ID2
# 4 ADAM 2 A 384 407 23 ID2
# 5 JOHN 5 C 147 226 79 ID1
# 6 JOHN 6 C 566 611 45 ID1
# 7 ADAM 5 C 140 205 65 ID2
# 8 ADAM 6 C 522 599 77 ID2
If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2), ]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]), ]
  }))
}
You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:
apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))

Working with a dataframe that uses ':' and 'x' as separators/identifiers within only one column

I have a dataframe that provides all kinds of sales info- date, session, time, day of week, product type, total sales, etc. It also includes a single column that provides the order in which all products were purchased in that session. Some of the products are text names, some are numbers.
The products with text names never change, but the products with numerical names rotate as new ones are developed. (This is why they are listed in a single column- the "numerical" products change so much that the dataframe would get maddeningly wide in just a few months, plus some other issues)
Here's a small subset:
Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:PcoductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4
Okay, so in the above example, in session 20764 (the first one), sales were $15 and it was all spent on ProductA. In the next session, ProductB was purchased three times. In the third session, product 1069 was purchased twice, and so on.
I am going to be doing a lot with this, but I don't know how to tell R that, in this column, a ':' acts as a separator between products and an 'x' signifies the number of "numerical" products that were purchased. Any ideas?
Some examples of what I am trying to know:
1. Which Product was purchased first in a session;
2. Which products were purchased most often with each other; and,
3. I'd like to be able to, say, aggregate sessions that contain certain combinations of products (e.g, 1067 and 1046 and Quinto)
I know this is a broad request for on here, but any info on how to get R to recognize these unique-to-this-column identifiers would be tremendously helpful. Thanks in advance.
Also, here's the dput()
structure(list(Session = c(20764L, 31976L, 27966L, 324L, 6943L,
14899L, 25756L, 27279L, 31981L, 2782L), TotSales = c(5, 5, 20,
1, 25, 2, 9, 5, 1, 520), GameList = structure(c(6L, 9L, 4L, 3L,
5L, 2L, 8L, 7L, 3L, 1L), .Label = c("1046x2:1046x2:1046x1:1046x1:1046x1:1046x4",
"1062x2", "1067x1", "1069x2", "1071x1:1064x1:1038x2:1034x1:ProductE",
"ProductA", "ProductD:ProductD:ProductD:ProductC", "ProductB:ProductB:ProductC",
"ProductB:ProductB:ProductB"), class = "factor")), .Names = c("Session",
"TotSales", "GameList"), row.names = c(320780L, 296529L, 98969L,
47065L, 19065L, 92026L, 327431L, 291843L, 296534L, 15055L), class = "data.frame")
Here is an alternative with data.table. I won't answer all your questions, but this should get you going. First, convert to long format:
library(data.table)
dt <- data.table(df) # assumes your data is in `df`
split_fun <- function(x) {
  y <- unlist(strsplit(as.character(x), ":"))
  z <- strsplit(y, "(?<=[0-9])+x(?=[0-9]+$)", perl = TRUE)
  unlist(lapply(z, function(x) if (length(x) == 2) rep(x[[1]], x[[2]]) else x[[1]]))
}
dt.long <- dt[, list(TotSales, split_fun(GameList)), by=Session]
Now, to answer Q1 (first product in session):
dt.long[, head(V2, 1L), by=Session]
Produces:
Session V1
1: 20764 ProductA
2: 31976 ProductB
3: 27966 1069
4: 324 1067
... 6 rows omitted
And Q3 (aggregate sessions that contain multiple products):
dt.long[,
if(length(items <- .SD[all(c("ProductB") %in% V2), V2])) paste0(items, collapse=", "),
by=Session
]
Produces (note you don't have any sessions with more than one product shared, but you can easily modify the above for multiple products for your real data):
Session V1
1: 31976 ProductB, ProductB, ProductB
2: 25756 ProductC, ProductC, ProductB
Q2 is a bit trickier, but I'll leave that one to you. I'm also not 100% sure what you mean by that question. One thing worth highlighting, dt.long here has the products repeated however many times they were "xed". For example, with session 27966, product 1069 shows up twice, so you can count rows for each product if you want:
> dt.long[Session==27966]
Session TotSales V2
1: 27966 25 1069
2: 27966 25 1069
Note that the regular expression we use to split products will work so long as you don't have products with names (not codes) like "BLHABLBHA98877x998".
You need to parse the GameList column. This is probably kind of slow for bigger datasets, but should show the general idea:
options(stringsAsFactors=FALSE)
DF <- read.table(text="Session TotSales GameList
20764 15 ProductA
31976 7 ProductB:ProductB:ProductB
27966 25 1069x2
324 3 1067x1
6943 28 1071x1:1064x1:1038x2:1034x1:ProductE
14899 12 1062x2
25756 8 ProductC:ProductC:ProductB
27279 6 ProductD:ProductD:ProductD:PcoductC
31981 4 1067x1
2782 529 1046x2:1046x2:1046x1:1046x1:1046x1:1046x4", header=TRUE)
DF <- do.call(rbind,
              lapply(seq_len(nrow(DF)),
                     function(i) cbind.data.frame(DF[i, -3],
                                                  Game = strsplit(DF$GameList, ":", fixed = TRUE)[[i]])))
DF <- cbind(DF,
            t(sapply(strsplit(DF$Game, "x", fixed = TRUE),
                     function(x) { if (length(x) < 2L) x <- c(x, 1); x })))
DF <- DF[, -3]
names(DF)[3:4] <- c("Game", "Amount")
DF$Amount <- as.integer(DF$Amount)
DF$index <- seq_len(nrow(DF))
# Session TotSales Game Amount index
# 1 20764 15 ProductA 1 1
# 2 31976 7 ProductB 1 2
# 3 31976 7 ProductB 1 3
# 4 31976 7 ProductB 1 4
# 31 27966 25 1069 2 5
# 41 324 3 1067 1 6
# 7 6943 28 1071 1 7
# 8 6943 28 1064 1 8
# 9 6943 28 1038 2 9
# 10 6943 28 1034 1 10
# 11 6943 28 ProductE 1 11
# 6 14899 12 1062 2 12
# 13 25756 8 ProductC 1 13
# 14 25756 8 ProductC 1 14
# 15 25756 8 ProductB 1 15
# 16 27279 6 ProductD 1 16
# 17 27279 6 ProductD 1 17
# 18 27279 6 ProductD 1 18
# 19 27279 6 PcoductC 1 19
# 91 31981 4 1067 1 20
# 21 2782 529 1046 2 21
# 22 2782 529 1046 2 22
# 23 2782 529 1046 1 23
# 24 2782 529 1046 1 24
# 25 2782 529 1046 1 25
# 26 2782 529 1046 4 26
Note that I assume there is no x in the product names. If there is, you need a regex as shown by @BrodieG for splitting.
Now you can do things like this:
aggregate(Game~Session, DF, head, 1)
# Session Game
# 1 324 1067
# 2 2782 1046
# 3 6943 1071
# 4 14899 1062
# 5 20764 ProductA
# 6 25756 ProductC
# 7 27279 ProductD
# 8 27966 1069
# 9 31976 ProductB
# 10 31981 1067
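For reference, the two parsing rules can be shown in isolation on a single GameList string; here regexec splits a code like 1038x2 into the product and its count (a base-R sketch):

```r
s     <- "1071x1:1064x1:1038x2:1034x1:ProductE"
items <- strsplit(s, ":", fixed = TRUE)[[1]]   # ':' separates products

parsed <- lapply(items, function(it) {
  # 'x' splits a numeric code from its purchase count; text names pass through unchanged
  m <- regmatches(it, regexec("^([0-9]+)x([0-9]+)$", it))[[1]]
  if (length(m) == 3) rep(m[2], as.integer(m[3])) else it
})
unlist(parsed)
# [1] "1071"     "1064"     "1038"     "1038"     "1034"     "ProductE"
```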
