How to get a new column in a data frame that flags elements which appear in the set more than once in R

Data:
DB1 <- data.frame(orderItemID = c(1,2,3,4,5,6,7,8,9,10),
orderDate = c("1.1.12","1.1.12","1.1.12","1.1.12","1.1.12", "1.1.12","1.1.12","1.1.12","2.1.12","2.1.12"),
itemID = c(2,3,2,5,12,4,2,3,1,5),
size = factor(c("l", "s", "xl", "xs","m", "s", "l", "m", "xxs", "xxl")),
color = factor(c("blue", "black", "blue", "orange", "red", "navy", "red", "purple", "white", "black")),
customerID = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23))
Expected output:
selection_order = c("yes","no","no","no","no","no","yes","no","no","no")
In the data set, items can share the same size, color, or itemID. Every registered user has a unique customerID.
I want to identify when a user orders more than one product with the same itemID in different sizes or colors (for example, the user with customerID = 33 orders the same item, itemID = 2, in two different colors) and mark it in a new column, named "selection order" for example, with "Yes" or "No". It should NOT show a "Yes" when he or she orders an item with a different ID. I just want a "yes" when the same ID has been ordered more than once (on the same day or in the past), regardless of other IDs (other products).
I've tried a lot already, but nothing works. There are a few thousand different customer IDs and item IDs, so I can't subset for every ID. I tried the duplicated function, but it doesn't lead to a satisfactory solution:
The problem is that if the same person orders more than one object (so customerID is duplicated) and another person (another customerID) orders an item with the same ID (so itemID is duplicated), it gives me a "yes", when it must be a "no" in this case. (In the example, the duplicated function gives me a "yes" at orderItemID 4 instead of a "no".)

I think I understand your desired output now. Try:
library(data.table)
setDT(DB1)[, selection_order := .N > 1, by = list(customerID, itemID)]
DB1
# orderItemID orderDate itemID size color customerID selection_order
# 1: 1 1.1.12 2 l blue 33 TRUE
# 2: 2 1.1.12 3 s black 15 FALSE
# 3: 3 1.1.12 2 xl blue 1 FALSE
# 4: 4 1.1.12 5 xs orange 33 FALSE
# 5: 5 1.1.12 12 m red 14 FALSE
# 6: 6 1.1.12 4 s navy 55 FALSE
# 7: 7 1.1.12 2 l red 33 TRUE
# 8: 8 1.1.12 3 m purple 78 FALSE
# 9: 9 2.1.12 1 xxs white 94 FALSE
# 10: 10 2.1.12 5 xxl black 23 FALSE
In order to convert back to a data.frame, use DB1 <- as.data.frame(DB1) (for older versions) or setDF(DB1) for the latest data.table version.
You can do it (less efficiently) with base R too. Because itemID is numeric, ave() returns the flag as 1/0 rather than TRUE/FALSE:
transform(DB1, selection_order = ave(itemID, list(customerID, itemID), FUN = function(x) length(x) > 1))
Or using the dplyr package
library(dplyr)
DB1 %>%
group_by(customerID, itemID) %>%
mutate(selection_order = n() > 1)
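The expected output in the question uses "yes"/"no" rather than TRUE/FALSE. A minimal base R sketch of that last step, assuming a cut-down copy of DB1 with only the columns needed here (the helper name group_size is made up for illustration):

```r
# Cut-down copy of the question's data: only the columns used for grouping
DB1 <- data.frame(
  orderItemID = 1:10,
  itemID      = c(2, 3, 2, 5, 12, 4, 2, 3, 1, 5),
  customerID  = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23)
)

# Size of each (customerID, itemID) group, repeated for every row of the group
group_size <- ave(seq_along(DB1$itemID), DB1$customerID, DB1$itemID, FUN = length)

# Map the logical flag to the labels requested in the question
DB1$selection_order <- ifelse(group_size > 1, "yes", "no")
DB1$selection_order
```

Only orderItemID 1 and 7 (customer 33, item 2) should come out as "yes"; orderItemID 4 stays "no" because customer 33 ordered item 5 only once.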

The following code appends a new column selection.order to your data frame, flagging rows whose (customerID, itemID) tuple occurs more than once.
# First merge together the table to itself
m<- merge(x=DB1,y=DB1,by=c("customerID","itemID"))
# Now find duplicate instances of orderItemID, note this is assumed to be UNIQUE
m$selection.order<-sapply(m$orderItemID.x,function(X) sum(m$orderItemID.x==X)) > 1
m <- m[,c("orderItemID.x","selection.order")]
# Merge the two together
DB1<- merge(DB1, unique(m), by.x="orderItemID",by.y="orderItemID.x",all.x=TRUE,all.y=FALSE)

If you just want the subset, as you say in the title, then do this:
DB1[duplicated(DB1[c("itemID", "customerID")]),]
If you want the column, then:
f <- interaction(DB1$itemID, DB1$customerID)
DB1$multiple <- table(f)[f] > 1L
Note that it is also easy to get the actual count by simplifying the last line above.
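Two details of this answer may be worth spelling out. duplicated() on its own flags only the second and later occurrences, so keeping every row of a repeated pair needs a second pass with fromLast = TRUE; and the count comes from dropping the "> 1L" comparison. A sketch on a cut-down copy of DB1 (the column name n_orders is an assumption for illustration):

```r
DB1 <- data.frame(
  orderItemID = 1:10,
  itemID      = c(2, 3, 2, 5, 12, 4, 2, 3, 1, 5),
  customerID  = c(33, 15, 1, 33, 14, 55, 33, 78, 94, 23)
)

# Flag every row whose (itemID, customerID) pair occurs more than once,
# including the first occurrence
key      <- DB1[c("itemID", "customerID")]
all_dups <- duplicated(key) | duplicated(key, fromLast = TRUE)
DB1[all_dups, ]

# Number of orders per (itemID, customerID) pair, aligned with the rows of DB1
f <- interaction(DB1$itemID, DB1$customerID, drop = TRUE)
DB1$n_orders <- as.vector(table(f)[as.integer(f)])
```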

Related

How do I add a leading numeric identifier (not necessarily zero) to a character string in r

I apologize if this is a duplicate, I've searched through all of the "add leading zero" content I can find, and I'm struggling to find a solution I can work with. I have the following:
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier)
and I want a modified siteid that is always six (6) characters long with zeroes to fill the gaps. The Site ID can vary in nchar from 1-3, the modifier is always a length of 2, and the number of zeroes can vary depending on the length of the site ID (so that 6 is always the final modified length).
I would like the following final output:
df
# siteid modifier mod.siteid
#1 1 44 440001
#2 11 22 220011
#3 111 11 110111
Thanks for any suggestions or direction. This could also be numeric, but it seems like character manipulation has more options...?
The vocabulary here is "left pad" and "paste". Here is one way using sprintf():
df$mod.siteid <- with(df, sprintf("%s%04d", modifier, as.integer(siteid)))
# Note:
# code simplified thanks to suggestion by Maurits.
Output:
siteid modifier mod.siteid
1 1 44 440001
2 11 22 220011
3 111 11 110111
Data:
df <- data.frame(
siteid = c("1", "11", "111"),
modifier = c("44", "22", "11"),
stringsAsFactors = FALSE
)
Extra: If you don't want to left pad with 0, then using the stringi package is one option: with(df, paste0(modifier, stringi::stri_pad_left(siteid, 4, "q")))
siteid<-c("1","11","111")
modifier<-c("44","22","11")
df<-data.frame(siteid,modifier, stringsAsFactors = FALSE)
df$mod.siteid = paste0( df$modifier,
formatC( as.numeric(df$siteid), width = 4, format = "d", flag="0") )
df
# siteid modifier mod.siteid
# 1 1 44 440001
# 2 11 22 220011
# 3 111 11 110111
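If you would rather avoid both the integer conversion and an extra package, the padding can also be built by hand in base R. This is only a sketch: pad_left is a hypothetical helper, not a function from base R or stringi, and it assumes a single-character pad.

```r
siteid   <- c("1", "11", "111")
modifier <- c("44", "22", "11")

# Hypothetical helper: left-pad character strings to a fixed width.
# strrep() repeats the pad character; pmax() guards against negative counts
# when a string is already longer than the target width.
pad_left <- function(x, width, pad = "0") {
  paste0(strrep(pad, pmax(width - nchar(x), 0)), x)
}

mod.siteid <- paste0(modifier, pad_left(siteid, 4))
mod.siteid

# Any other single pad character works the same way
pad_left(siteid, 4, pad = "q")
```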

How to combine multiple variable data to a single variable data?

After making my data frame and selecting the variables I want to look at, I face a dilemma. The Excel sheet that acts as my data source was used by different people recording the same type of data.
Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26
As you can see, because the data was written differently, the major groups (RedWine, WhiteWine and Water) have been split into subgroups. How do I combine the subgroups into one group, e.g. red+Red+RedWine -> Total wine? I use the phyloseq package for this kind of dataset.
names <- c("red","white","water")
df2 <- setNames(data.frame(matrix(ncol = length(names), nrow = nrow(df))),names)
for(col in names){
df2[,col] <- rowSums(df[,grep(col,tolower(names(df)))])
}
Here,
grep(col,tolower(names(df)))
finds all the columns whose lower-cased name contains a string such as "red". You then sum those columns row-wise into the new data.frame df2, which was created with matching dimensions.
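One caveat with grep() is that it matches the string anywhere in the column name. A variation that only matches at the start of the name can be sketched with startsWith() from base R; the prefixes vector (including "neg") and the result name combined are assumptions for illustration:

```r
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE)

prefixes <- c("neg", "red", "water", "white")

# For each prefix, sum all columns whose lower-cased name starts with it
combined <- as.data.frame(lapply(setNames(prefixes, prefixes), function(p) {
  rowSums(df[startsWith(tolower(names(df)), p)])
}))
combined
```

Building the result with lapply() and as.data.frame() keeps one column per group whether df has one row or many.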
I would just create a new data.frame, easiest to do with dplyr but also doable with base R:
with dplyr
newFrame <- oldFrame %>% mutate(Mock = Mock, Neg = Neg + Neg1PCR + Neg2PCR + NegPBS, Red = red + Red + RedWine, Water = water + Water, White = white + White)
with base R (not complete but you get the point)
newFrame <- data.frame(Red = oldFrame$Red + oldFrame$red + oldFrame$RedWine...)
One can use dplyr::starts_with and dplyr::select to combine columns. The ignore.case argument of dplyr::starts_with is TRUE by default, which helps with the data.frame the OP has posted.
library(dplyr)
names <- c("red", "white", "water")
cbind(df[1], t(mapply(function(x)rowSums(select(df, starts_with(x))), names)))
# Mock red white water
# 1 1 24 28 8
Data:
df <- read.table(text =
"Mock Neg Neg1PCR Neg2PCR NegPBS red Red RedWine water Water white White
1 9 1 1 1 2 18 4 4 4 2 26",
header = TRUE, stringsAsFactors = FALSE)

Convert text within string to numeric

I'm struggling to create a new variable off a text string. Here is a sample of my data frame:
Brand Pack_Content
1 Dove 4X25 G
2 Snickers 250 G
3 Twix 2X20.7 G
4 Korkunov BULK
I would like to create a numeric variable called Grams. I've tried solutions using gsub or separate, but the need for different solutions by row (e.g., some rows require multiplying the number of packs by the pack weight, as in 4X25 G) has me stumped. A solution with dplyr is preferred.
Brand Pack_Content Grams
1 Dove 4X25 G 100
2 Snickers 250 G 250
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000
A solution using dplyr and tidyr. The key is, before using separate on the Pack_Content_new column, to replace strings such as "G" or "BULK" with "" or with meaningful numbers. If you have more than one meaningful string like "BULK", you may want to use case_when in addition to recode. After the separate call, we can replace NA with 1 in the Number column. Finally, we can calculate Grams from the values in Number and Unit_Weight.
library(dplyr)
library(tidyr)
dat2 <- dat %>%
mutate(Pack_Content_new = sub("G$", "", Pack_Content)) %>% # Remove the last G
mutate(Pack_Content_new = recode(Pack_Content_new, # Replace BULK with 1000
`BULK` = "1000")) %>%
separate(Pack_Content_new, into = c("Number", "Unit_Weight"), # Separate the Pack_Content_new column
sep = "X", convert = TRUE,
fill = "left") %>%
replace_na(list(Number = 1)) %>% # Replace NA in Number with 1
mutate(Grams = Number * Unit_Weight) # Calculate the Grams
dat2
# Brand Pack_Content Number Unit_Weight Grams
# 1 Dove 4X25 G 4 25.0 100.0
# 2 Snickers 250 G 1 250.0 250.0
# 3 Twix 2X20.7 G 2 20.7 41.4
# 4 Korkunov BULK 1 1000.0 1000.0
DATA
dat <- read.table(text = " Brand Pack_Content
1 Dove '4X25 G'
2 Snickers '250 G'
3 Twix '2X20.7 G'
4 Korkunov 'BULK'",
header = TRUE, stringsAsFactors = FALSE)
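For comparison, the same three steps (strip the unit, substitute a number for BULK, split on X and multiply) can be sketched in base R. This assumes, as above, that BULK means 1000 grams and that every other entry ends in "G":

```r
dat <- data.frame(
  Brand        = c("Dove", "Snickers", "Twix", "Korkunov"),
  Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK"),
  stringsAsFactors = FALSE
)

# Step 1: drop the trailing unit and any space before it
cleaned <- sub("\\s*G$", "", dat$Pack_Content)

# Step 2: replace the BULK marker with its assumed weight in grams
cleaned[cleaned == "BULK"] <- "1000"

# Step 3: split on "X" and multiply the pieces (a single number is its own product)
parts <- strsplit(cleaned, "X", fixed = TRUE)
dat$Grams <- vapply(parts, function(p) prod(as.numeric(p)), numeric(1))
dat
```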
Update: added in some unit extraction and conversions just for the heck of it
Update 2: Threw in some validation steps (for my own reference if no-one else) that should probably have been part of the original answer. In general, if you're using regular expressions to extract values (and you don't have time to review every single row of output in detail), it's easy to get burned when some corner case input format that wasn't considered comes along
Using data.table, stringi, and the sweet, sweet magic of regular expressions:
A note on tool selection here:
Since regular expressions are difficult enough to follow on their own, I think it's a safer bet to focus on making the transformation steps readable and clearly defined instead of trying to cram it all into a series of pipes and as few lines of code as possible.
Since dplyr doesn't allow for step-by-step manipulation (no pipes) without re-assigning the tibble after each expression, I feel data.table is a far more elegant and efficient tool for this kind of data munging work.
Create Data
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
Pre Cleaning
First off we'll strip out spaces and make everything uppercase
## Strip out Spaces
DT[,Pack := gsub("[[:space:]]+","",Pack)]
## Make everything Uppercase
DT[,Pack := toupper(Pack)]
Assumption Validation
Before we use regular expressions to extract values and do some math on them, it's probably prudent to do some validation steps to make sure we don't get burned down the road by an unexpected corner case.
## Start off by trusting nothing
DT[,Valid := FALSE]
## Mark Packs that fit formats like "BULK" as valid
DT[Pack %in% c("BULK"),Valid := TRUE]
## Mark Packs that fit formats like "4X20G" or "3.0X3KG" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
## Mark Packs that fit formats like "250G" as valid
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"),
Valid := TRUE]
print(DT)
At this point:
Brand Pack Valid
1: Dove 4X25G TRUE
2: Snickers 0.250KG TRUE
3: Twix 2X20.7G TRUE
4: Korkunov BULK TRUE
5: Reeses 2.5.5X4G FALSE
6: M&M's 2X3X3G FALSE
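The two numeric patterns above differ only in the optional "quantity X" prefix, so the same check can be sketched with a single pattern and base R's grepl(). The names num, pattern and valid are assumptions for illustration, and the input is the already cleaned (space-free, upper-case) Pack column:

```r
packs <- c("4X25G", "0.250KG", "2X20.7G", "BULK", "2.5.5X4G", "2X3X3G")

# A number with an optional integer part followed by a dot, e.g. "25" or "20.7"
num <- "([[:digit:]]+\\.)?[[:digit:]]+"

# Optional "<number>X" prefix, then a number, then the unit
pattern <- paste0("^(", num, "X)?", num, "(G|KG)$")

valid <- packs == "BULK" | grepl(pattern, packs)
data.frame(Pack = packs, Valid = valid)
```

The last two entries stay invalid because neither "2.5.5" nor a second "X" fits the pattern.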
Extracting Values
Note that we are only populating values for rows that met pre-defined expectations for what a valid format is.
## Extract the first number at the beginning of the "Pack" column followed by an X
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
## Extract last number out of the "Pack" column
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
## Extract the Units
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
print(DT)
Now we've got the following:
Brand Pack Valid Quantity Unit_Weight Units
1: Dove 4X25G TRUE 4 25.00 G
2: Snickers 0.250KG TRUE NA 0.25 KG
3: Twix 2X20.7G TRUE 2 20.70 G
4: Korkunov BULK TRUE NA NA BULK
5: Reeses 2.5.5X4G FALSE NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA
Convert units, fill in NA's, calculate weights
Now we just have to go back and fill in rows where there wasn't a weight or a quantity, optionally convert units, etc. so we can calculate weight.
## Start with a standard conversion factor of 1
DT[Valid == TRUE, Unit_Factor := 1]
## Make some Unit Conversions
DT[Units == "KG", Unit_Factor := 1000]
## Fill in Rows without a quantity with a value of 1
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
## Fill in a weight for Bulk units
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
## And finally, calculate Weight in grams
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
print(DT)
Which yields a final result:
Brand Pack Valid Quantity Unit_Weight Units Unit_Factor Grams
1: Dove 4X25G TRUE 4 25.00 G 1 100.0
2: Snickers 0.250KG TRUE 1 0.25 KG 1000 250.0
3: Twix 2X20.7G TRUE 2 20.70 G 1 41.4
4: Korkunov BULK TRUE 1 1000.00 G 1 1000.0
5: Reeses 2.5.5X4G FALSE NA NA NA NA NA
6: M&M's 2X3X3G FALSE NA NA NA NA NA
(All the steps, in condensed form)
library(data.table)
library(stringi)
DT <- data.table(Brand = c("Dove","Snickers","Twix","Korkunov","Reeses","M&M's"),
Pack = c("4X25 G","0.250 KG","2X20.7 G","BULK","2.5.5X4G","2 X 3 X 3G"))
DT[,Pack := gsub("[[:space:]]+","",Pack)]
DT[,Pack := toupper(Pack)]
DT[,Valid := FALSE]
DT[Pack %in% c("BULK"),Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+X([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[stri_detect_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(G|KG)$"), Valid := TRUE]
DT[Valid == TRUE, Quantity := as.numeric(stri_extract_first_regex(Pack,"^([[:digit:]]+\\.){0,1}[[:digit:]]+(?=X)"))]
DT[Valid == TRUE, Unit_Weight := as.numeric(stri_extract_last_regex(Pack,"([[:digit:]]+\\.){0,1}[[:digit:]]+"))]
DT[Valid == TRUE, Units := stri_extract_last_regex(Pack,"[[:alpha:]]+$")]
DT[Valid == TRUE, Unit_Factor := 1]
DT[Units == "KG", Unit_Factor := 1000]
DT[Valid == TRUE & is.na(Quantity), Quantity := 1]
DT[Pack == "BULK", `:=` (Unit_Weight = 1000, Units = "G")]
DT[Valid == TRUE, Grams := Unit_Weight*Quantity*Unit_Factor]
A final note:
I'm assuming you didn't include all the messy, dirty details of how all over the place your raw data is, so you might need to add some more steps to capture cases where you have pounds instead of grams (and all those other corner cases).
Still, with 5-7 regular expressions I think you'd probably be able to cover at least a decent amount of your potential cases.
I keep this Regex cheatsheet on RStudio's website within arm's reach most of the time.
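If more units do turn up, one way to keep the conversion step manageable is a named lookup vector instead of one assignment per unit. This is only a sketch: the "LB" entry is a hypothetical addition (1 lb = 453.59237 g) and the input vectors are made-up examples, not rows from the data above.

```r
# Grams per unit; "LB" is a hypothetical extra unit added for illustration
unit_factor <- c(G = 1, KG = 1000, LB = 453.59237)

# Made-up extracted values, in the shape produced by the steps above
units       <- c("G", "KG", "G", "LB")
unit_weight <- c(25, 0.25, 20.7, 1)
quantity    <- c(4, 1, 2, 1)

# Look up each row's factor by unit name; unknown units would give NA
grams <- unit_weight * quantity * unname(unit_factor[units])
grams
```

An unknown unit yields NA rather than a silently wrong weight, which fits the "trust nothing" approach of the validation step.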
I know you asked for a dplyr solution, but have you considered base R? Here is a small one. I hope it helps even though it's not a dplyr method.
First, keep only the numeric part and substitute X with *, which is done with the sub function. Entries that contain no digit are replaced with 1000. Then we simply evaluate the resulting expressions:
A <- sub("X", "*", sub("\\s.*", "", dat$Pack_Content))
# A logical index is safer than -grep(), which selects nothing when grep() finds no match
transform(dat, Grams = sapply(parse(text = replace(A, !grepl("\\d", A), "1000")), eval))
Brand Pack_Content Grams
1 Dove 4X25 G 100.0
2 Snickers 250 G 250.0
3 Twix 2X20.7 G 41.4
4 Korkunov BULK 1000.0
Data Used:
dat=structure(list(Brand = c("Dove", "Snickers", "Twix", "Korkunov"
), Pack_Content = c("4X25 G", "250 G", "2X20.7 G", "BULK")), .Names = c("Brand",
"Pack_Content"), class = "data.frame", row.names = c("1", "2",
"3", "4"))

Count and Reset based on Variable change [duplicate]

Problem: I need to make a unique ID field for data that has two levels of grouping. In the example code here, it is Emp and Color. The ID needs to be structured as:
Emp + unique number of each Color + sequential number for duplicated Colors.
These values are separated by periods.
Example data:
dat <- data.frame(Emp = c("A","A","A","B","B","C"),
Color = c("Red","Green","Green","Orange","Yellow","Brown"),
stringsAsFactors = FALSE)
The ID is supposed to appear as this:
ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")
ID
[1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"
The three character suffix to the ID to record the duplicates can be done as:
library(dplyr)
library(stringr)
group_by(dat, Emp, Color) %>%
mutate(suffix = str_pad(row_number(), width=3, side="left", pad="0"))
But I am unable to assign sequential numbers to the unique occurrence of Color with each Emp group.
I prefer a dplyr solution, but any method would be appreciated.
Using data.table and sprintf:
library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d',
Emp, rleid(Color), rowid(rleid(Color))),
by = Emp]
you get:
> dat
Emp Color ID
1: A Red A.01.001
2: A Green A.02.001
3: A Green A.02.002
4: B Orange B.01.001
5: B Yellow B.02.001
6: C Brown C.01.001
How this works:
You convert dat to a data.table with setDT()
Group by Emp.
Create the ID variable with the sprintf function. With sprintf you can easily paste several vectors together according to a specified format.
The use of := means that the data.table is updated by reference.
%s indicates that a string is to be used in the first part (which is Emp). %02d and %03d indicate that a number is printed with two or three digits, padded with leading zeros when needed. The dots in between are taken literally and thus included in the resulting string.
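The format string can be tried on its own, outside data.table. A small sketch with made-up values:

```r
# %s takes the string as is; %02d and %03d zero-pad integers to 2 and 3 digits
ids <- sprintf("%s.%02d.%03d", c("A", "B"), c(1L, 12L), c(1L, 25L))
ids
```

sprintf is vectorised over its arguments, which is why a single call inside the data.table expression produces one ID per row.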
Addressing the comment of @jsta: if the values in the Color column are not sequential, you can use:
setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp][, r:= NULL]
This will also maintain the order in which the Color column is presented. Instead of as.integer(factor(Color, levels = unique(Color))) you can also use match(Color, unique(Color)) as shown by akrun.
Implementing the above on a bit larger dataset to illustrate:
dat2 <- rbindlist(list(dat,dat))
dat2[, r := match(Color, unique(Color)), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp]
gets you:
> dat2
Emp Color r ID
1: A Red 1 A.01.001
2: A Green 2 A.02.001
3: A Green 2 A.02.002
4: B Orange 1 B.01.001
5: B Yellow 2 B.02.001
6: C Brown 1 C.01.001
7: A Red 1 A.01.002
8: A Green 2 A.02.003
9: A Green 2 A.02.004
10: B Orange 1 B.01.002
11: B Yellow 2 B.02.002
12: C Brown 1 C.01.002
We can try
dat %>%
group_by(Emp) %>%
mutate(temp = match(Color, unique(Color)),
temp2 = duplicated(Color)+1,
ID = sprintf("%s.%02d.%03d", Emp, temp, temp2))%>%
select(-temp, -temp2)
# Emp Color ID
# <chr> <chr> <chr>
#1 A Red A.01.001
#2 A Green A.02.001
#3 A Green A.02.002
#4 B Orange B.01.001
#5 B Yellow B.02.001
#6 C Brown C.01.001
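Note that duplicated(Color) + 1 can only produce 1 or 2, so a third repeat of the same Color would get suffix 002 again. A base R sketch that numbers every repeat, using a slightly extended example (three Green rows for Emp A; the names col_idx and dup_idx are assumptions for illustration):

```r
dat <- data.frame(
  Emp   = c("A", "A", "A", "A", "B", "B", "C"),
  Color = c("Red", "Green", "Green", "Green", "Orange", "Yellow", "Brown"),
  stringsAsFactors = FALSE
)

rows <- seq_along(dat$Color)

# Index of each Color within its Emp, in order of first appearance
col_idx <- ave(rows, dat$Emp, FUN = function(i) {
  match(dat$Color[i], unique(dat$Color[i]))
})

# Running count of each (Emp, Color) pair
dup_idx <- ave(rows, dat$Emp, dat$Color, FUN = seq_along)

dat$ID <- sprintf("%s.%02d.%03d", dat$Emp, col_idx, dup_idx)
dat
```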

