R script merge 2 rows - r

I have one data frame like:
a b c d e f g
1 Car 10/02 01/02 30/02 14 1 NA
2 Car 10/02 07/02 20/02 0 NA 7
I want to get :
a b c d e f g
1 Car 10/02 01/02 20/02 14 1 7
Like a group by (a,b) , select the min Date for c and d, select the max for e and select the non-null for f and g
How can I solve it in R ?

Using dplyr we can group_by a and b, convert c and d to actual dates using dmy from lubridate and select the minimum date, select maximum value of e and non-NA value of f and g.
library(dplyr)
library(lubridate)
df %>%
group_by(a, b) %>%
summarise(c = c[which.min(dmy(paste0(c, "/19")))],
d = d[which.min(dmy(paste0(d, "/19")))],
e = max(e),
f = f[!is.na(f)],
g = g[!is.na(g)])
# a b c d e f g
# <fct> <fct> <fct> <fct> <dbl> <int> <int>
#1 Car 10/02 01/02 20/02 14 1 7
As shown in the example, I am assuming you would have only one non-NA value for f and g if you have more than one then use which.max to select the first non-NA value from those columns.

Use library dplyr, let's say your dataframe is df. First thing is replace NA by 0
df[is.na(df)] <- 0 # you can now directly take sum of f and g column in group by
library(dplyr)
df_1 <- df %>% group_by(a,b) %>% summarise(c = min(c),d = min(d),e = max(e),f = sum(f),g = sum(g))

using data.table. I had to change one of your dates because there is no feb 30th
x <- data.frame(
a= c( "Car","Car"), b=c("10/20","10/20"), c=c("01/02","07/02"),d=c("28/02","20/02"), e=c(14,0), f=c(1, NA), g=c(NA,7))
library( data.table) # you may need to install
x <- data.table( x )
#convert to dates
x$c <- as.Date(x$c, "%d/%m")
x$d <- as.Date(x$d, "%d/%m")
# group as you specfied
x[ ,
.(
c = min( c ),
d = min( d ),
e = max( e ),
f= max( f , na.rm=T),
g= max( g , na.rm=T)
) ,
by= c( "a","b")
]

Related

Create a numerical df in r using factor df

I have a factor df that I would like it to be need it to be numerical/dummy. I used as.integer to each column and then made a cbind to the original data frame. Is there a way to do all columns at once?
data <- data.frame(
x = c('a','b','c'),
y = c('d','e','f'),
z = c('g','h','i'),
stringsAsFactors = TRUE
)
x_factor <- as.integer(data$x)
y_factor <- as.integer(data$y)
z_factor <- as.integer(data$z)
data_binded <- cbind(a,x_factor,y_factor,z_factor)
Here is dplyr solution:
library(dplyr)
data %>%
mutate(across(ends_with("factor"), as.numeric))
x y z x_factor y_factor z_factor
1 a d g 1 1 1
2 b e h 2 2 2
3 c f i 3 3 3

Count common sets of items between different customers

I have data on customers and the different products they have purchased:
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B
I would like to check which sets of products that occur together across different customers. I want to get the count for product combinations of different lengths. For example, the product combination A and B together occurs in three different customers; the product group A, B and C occurs in one customer. And so on for all different sets of 2 or more products in the data. Something like:
Product Group Number
A, B, C 1
D, E, F 1
A, B, D 1
A, B 3
Thus, I'm counting the A, B combination in customers who only have product A and B (e.g. customer 4), and in customers who have A and B, but also any other product (e.g. customer 1, who has A, B and C).
Does anyone have any ideas how to do that with either a tidyverse or base R approach? I feel like it ought to be pretty trivial - maybe pivot_wider first, then count?
I have found this question and answer that can do what I need for pairs of products, but I need to count combinations also for more products than two.
If you have the possibility to use a non-base package, you can use a tool dedicated for the task of finding item sets: arules::apriori. It is much faster on larger data sets.
library(arules)
# coerce data frame to binary incidence matrix
# use apriori to get "frequent itemsets"
r = apriori(data = as.matrix(table(dat) > 0),
# set: type of association mined, minimal support needed of an item set,
# minimal number of items per item set
par = list(target = "frequent itemsets",
support = 0,
minlen = 2))
# coerce itemset to data.frame, select relevant rows and columns
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
# items count
# 4 {B,C} 1
# 5 {A,C} 1
# 6 {E,F} 1
# 7 {D,E} 1
# 10 {D,F} 1
# 13 {B,D} 1
# 14 {A,D} 1
# 15 {A,B} 3
# 25 {A,B,C} 1
# 26 {D,E,F} 1
# 35 {A,B,D} 1
Timing on larger data set: 10000 customers with up to 6 products each. apriori is quite a lot faster.
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 38.95475 39.8621 41.44454 40.67313 41.05565 57.64655 20
# f_allan(dat) 4578.20595 4622.2363 4664.57187 4654.58713 4679.78119 4924.22537 20
# f_jay(dat) 2799.10516 2939.9727 2995.90038 2971.24127 2999.82019 3444.70819 20
# f_uwe_dt(dat) 2943.26219 3007.1212 3028.37550 3027.46511 3060.38380 3076.25664 20
# f_uwe_dplyr(dat) 6339.03141 6375.7727 6478.77979 6448.56399 6521.54196 6816.09911 20
10000 customers with up to 10 products each. apriori is several hundred times faster.
# Unit: milliseconds
# expr min lq mean median uq max neval
# f_henrik(dat) 58.40093 58.95241 59.71129 59.63988 60.43591 61.21082 20
# f_jay(dat) 52824.67760 53369.78899 53760.43652 53555.69881 54049.91600 55605.47980 20
# f_uwe_dt(dat) 22612.87954 22820.12012 22998.85072 22974.32710 23220.00390 23337.22815 20
# f_uwe_dplyr(dat) 26083.20240 26255.88861 26445.49295 26402.67887 26659.81195 27046.83491 20
On the larger data set, Allan's code gave warnings (In rawToBits(as.raw(x)) : out-of-range values treated as 0 in coercion to raw) on the toy data, which seemed to affect the result. Thus, it is not included in the second benchmark.
Data and benchmark code:
set.seed(3)
n_cust = 10000
n_product = sample(2:6, n_cust, replace = TRUE) # 2:10 in second run
dat = data.frame(
Customer = rep(1:n_cust, n_product),
Product = unlist(lapply(n_product, function(n) sample(letters[1:6], n)))) # 1:10 in 2nd run
library(microbenchmark)
res = microbenchmark(f_henrik(dat),
f_allan(dat),
f_jay(dat),
f_uwe_dt(dat),
f_uwe_dplyr(dat),
times = 20L)
Check for equality:
henrik = f_henrik(dat)
allan = f_allan(dat)
jay = f_jay(dat)
uwe_dt = f_uwe_dt(dat)
uwe_dplyr = f_uwe_dplyr(dat)
# change outputs to common format for comparison
# e.g. string format, column names, order
henrik$items = substr(henrik$items, 2, nchar(henrik$items) - 1)
henrik$items = gsub(",", ", ", henrik$items)
l = list(
henrik = henrik, allan = allan, jay = jay, uwe_dt = uwe_dt, uwe_dplyr = uwe_dplyr)
l = lapply(l, function(d){
d = setNames(as.data.frame(d), c("items", "count"))
d = d[order(d$items), ]
row.names(d) = NULL
d
})
all.equal(l[["henrik"]], l[["allan"]])
# TRUE
all.equal(l[["henrik"]], l[["jay"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dt"]])
# TRUE
all.equal(l[["henrik"]], l[["uwe_dplyr"]])
# TRUE
Functions:
f_henrik = function(dat){
r = apriori(data = as.matrix(table(dat) > 0),
par = list(target = "frequent itemsets",
support = 0,
minlen = 2))
d = as(r, "data.frame")
d[d$count > 0, c("items", "count")]
}
f_allan = function(dat){
all_multiples <- function(strings)
{
n <- length(strings)
do.call("c", sapply(1:2^n, function(x) {
mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
}))
}
dat %>%
group_by(Customer) %>%
arrange(Product) %>%
summarize(Product_group = all_multiples(Product)) %>%
group_by(Product_group) %>%
count(Product_group)
}
f_jay = function(dat){
a <- split(dat$Product, dat$Customer) ## thx to #Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
combn(unique(dat$Product), x, list)), recursive=F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group=sapply(pr, toString), number=or)
res[res$number > 0, ]
}
f_uwe_dt = function(dat){
setorder(setDT(dat), Customer, Product)
dat[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
function(m) combn(unique(Product), m, toString, FALSE)))),
by = Customer][
, .N, by = Product.Group]
}
f_uwe_dplyr = function(dat){
dat %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
}
If you define a little helper function that gets all multiple groupings:
all_multiples <- function(strings)
{
n <- length(strings)
do.call("c", sapply(1:2^n, function(x) {
mystrings <- strings[as.character(rawToBits(as.raw(x))[seq(n)]) == "01"]
if (length(mystrings) > 1) paste(mystrings, collapse = ", ") else NULL
}))
}
then you can do this nicely in a tidyverse pipe:
dat %>%
group_by(Customer) %>%
arrange(Product) %>%
summarize(Product_group = all_multiples(Product)) %>%
group_by(Product_group) %>%
count(Product_group)
#> # A tibble: 11 x 2
#> # Groups: Product_group [11]
#> Product_group n
#> <chr> <int>
#> 1 A, B 3
#> 2 A, B, C 1
#> 3 A, B, D 1
#> 4 A, C 1
#> 5 A, D 1
#> 6 B, C 1
#> 7 B, D 1
#> 8 D, E 1
#> 9 D, E, F 1
#> 10 D, F 1
#> 11 E, F 1
For the sake of completeness, here is a solution in data.table syntax which can be translated to dplyr syntax as well.
For both implementations, the core idea is the same:
sort by Product (which is an important step which has been neglected by the other answers posted so far)
For each Customer, create the product groups by using combn() with varying lengths m. Product.Group is a kind of natural key created by concatenating the included products using the toString() function.
Here, we can see why sorting Product is important : products B, A as well as A, B should appear in the same product group A, B.
Finally, count the number of occurrences by Product.Group
data.table version
library(data.table)
setorder(setDT(df), Customer, Product)
df[, .(Product.Group = unlist(lapply(tail(seq(.N), -1L),
function(m) combn(unique(Product), m, toString, FALSE)))),
by = Customer][
, .N, by = Product.Group]
Product.Group N
1: A, B 3
2: A, C 1
3: B, C 1
4: A, B, C 1
5: D, E 1
6: D, F 1
7: E, F 1
8: D, E, F 1
9: A, D 1
10: B, D 1
11: A, B, D 1
dplyr version
library(dplyr)
df %>%
arrange(Customer, Product) %>%
group_by(Customer) %>%
summarise(Product.Group = n() %>%
seq() %>%
tail(-1L) %>%
lapply(function(m) combn(unique(Product), m, toString, FALSE)) %>%
unlist()) %>%
ungroup() %>%
count(Product.Group)
Product.Group n
<chr> <int>
1 A, B 3
2 A, B, C 1
3 A, B, D 1
4 A, C 1
5 A, D 1
6 B, C 1
7 B, D 1
8 D, E 1
9 D, E, F 1
10 D, F 1
11 E, F 1
Data
library(data.table)
df <- fread("
Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")
You could split the data along customers, then get all combinations of product-pairs and triples using combn. Then find matches using %in% with outer, create data frame by collapsing products using toString and finally discard elements which are zero.
# a <- aggregate(Product ~ Customer, dat, I)$Product ## old solution
# if (is.matrix(a)) a <- as.data.frame(t(a)) ## old solution
a <- split(dat$Product, dat$Customer) ## thx to #Henrik
r <- range(lengths(a))
pr <- unlist(lapply(r[1]:r[2], function(x)
combn(unique(dat$Product), x, list)), recursive=F)
or <- rowSums(outer(pr, a, Vectorize(function(x, y) all(x %in% y))))
res <- data.frame(p.group=sapply(pr, toString), number=or)
res[res$number > 0, ]
# p.group number
# 1 A, B 3
# 2 A, C 1
# 3 A, D 1
# 6 B, C 1
# 7 B, D 1
# 13 D, E 1
# 14 D, F 1
# 15 E, F 1
# 16 A, B, C 1
# 17 A, B, D 1
# 35 D, E, F 1
Data
dat <- read.table(header=TRUE, text="Customer Product
1 A
1 B
1 C
2 D
2 E
2 F
3 A
3 B
3 D
4 A
4 B")

Adding values from one column to another with missing values in the second column using R

I want to add just some specific values from column z in dataframe df2 into dataframe df1, but just for the id = 1 and id = 3.
I have already tried solutions with ifelse, but for the missing values that kind of solutions work for the first value, until find the first missing gap.
df1$z <- ifelse((df1$id == df2$id), df2$z, 0)
Examples of the data:
df1 <- read.table(text = "
id v w
1 20 B
3 30 T
", h = T)
df2 <- read.table(text = "
id z b c d e f g h i j
1 100 z w e r y w u y q
2 800 t q j n m q i x z
3 700 f e q b a i e p w
4 300 a b c d a g s y q"
, h = T)
Expected result:
df1_add <- read.table(text = "
id v w z
1 20 B 100
3 30 T 700
", h = T)
Let's use left_join() and select() from the dplyr package:
library(dplyr)
df1_add <- df1 %>%
left_join(df2 %>% select(id, z))
df1_add
id v w z
1 1 20 B 100
2 3 30 T 700
you can try this
df_add <- df1
df_add$z = df2[df2$id %in% c(1, 3), ]$z
We can use merge from base R
merge(df1, df2[c("id", "z")])

parent child structure in R dataframe

I have a csv that contains an org structure as follows plus some additional columns. I use R to create charts and it works great !.
The challenge is when trying to create the charts for a subset manager and its children/grandchildren.
Is there any filtering that is possible in dplr or any alternative package?
Sample format:
emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3
The subset I need is for manager "C"
Scenario 1:emp_id==C should contain all nodes of 'D','D1','D2','E','E1','F'
expected structure:
manager,all_children
C D
C D1
C D2
C E
C E1
C F
Scenario 2:emp_id==C should contain all above nodes but retain mgr_id structure for 'D','E'
expected structure:
manager,all_children
C D
C E
C F
D D1
D D2
E E1
Consider the base package with by which creates a df list for every level of mgr_id (not just C):
SCENARIO 1
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("mgr_id_", "emp_id")]
names(child) <- gsub("*_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- gsub("*_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 C D1
# 5 C D2
# 6 C E1
SCENARIO 2 (where the selected columns change in grandchild and then first column rename)
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("emp_id_", "emp_id")]
names(child) <- gsub("*_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- gsub(".*_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
names(grandchild)[1] <- "mgr_id"
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 D D1
# 5 D D2
# 6 E E1
Here is one solution using functions from dplyr and data.table. dt3 is the output for scenario 1, while dt4 is the output for scenario 2.
# Load packages
library(dplyr)
library(data.table)
# Create example data frame
dt <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
# Filter levels lower than 1
filter(nest_id > 1) %>%
mutate(group_id = ifelse(nest_id > 2, 0, 1)) %>%
# Create "run_id", which will be used to fill manager label
mutate(run_id = rleid(group_id)) %>%
mutate(run_id = ifelse(run_id %% 2 == 0, run_id - 1, run_id)) %>%
group_by(run_id) %>%
mutate(manager = first(emp_id)) %>%
# Select for manager C
filter(manager %in% "C") %>%
ungroup() %>%
# Remove rows if manager == emp_id
filter(manager != emp_id) %>%
rename(all_children = emp_id)
# Scenario 1
dt3 <- dt2 %>% select(manager, all_children)
# Scenario 2
dt4 <- dt2 %>%
select(manager = mgr_id, all_children) %>%
arrange(manager, all_children)

sum different column value rows into a new column row in R

I want to combine different column value rows into a new column row.
Example df like this:
df <- data.frame(area = c("a","b","c","a"),
d = c(1,3,6,3),
f = c(3,2,8,2),
e = c(4,7,1,8),
g = c(6,9,2,9))
Where a,b,c are area column value, I want to combine/sum two rows (a,c) into one to get:
area d f e g
a+c+a 10 13 13 17
b 3 2 7 9
AND I have tried like this:
df <- aggregate(df, list(area=replace(area == c("a","c"), "a+c+a")), sum)
But it won't work.
Thank you.
Another solution using dplyr
library(dplyr)
aggr <- df[df$area %in% c("a", "c"),-1] %>%
summarize_all(sum)
rbind(df[!(df$area %in% c("a", "c")),],
bind_cols(area = "a+c+a", aggr))
# area d f e g
# 2 b 3 2 7 9
# 1 a+c+a 10 13 13 17

Resources