I have sample processing data in multiple sources and I'd like to combine them into a master list. How can I merge the "Status" column between 2 data frames such that a overrules b in order to collate "Y" and "partial" for each sample? Thank you in advance.
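library(data.table)
# := only works on data.tables, so convert before merging
c <- merge(as.data.table(a), as.data.table(b), by = "Sample", all = TRUE)
c[, Status := ifelse(!is.na(Status.x), Status.x, Status.y)]  # a wins wherever it has a value
c[, `:=`(Status.x = NULL, Status.y = NULL)]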
I assume you want to keep the values from a and b with an order of priority, Y covers partial that covers NA that covers nothing.
d <- merge(a,b,by="Sample",all=TRUE)
d$Status <- ""
d$Status[apply(d,1,function(x){any(is.na(x))})] <- "" # cleaning the NAs I introduced with the merge
d$Status[apply(d,1,`%in%`, x = "NA")] <- NA # or "NA" if you want to keep it this way, or "" if you want to get rid of them
d$Status[apply(d,1,`%in%`, x = "partial")] <- "partial"
d$Status[apply(d,1,`%in%`, x = "Y")] <- "Y"
d <- d[,c(1,4)]
# Sample Status
# 1 100 Y
# 2 101 Y
# 3 102
# 4 103 partial
# 5 106 Y
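The priority rule described above (Y beats partial, which beats NA) can also be written as an explicit ranking. This is a minimal base R sketch, assuming small made-up tables a and b (the question's data is not shown here) and a hypothetical helper rank_of:

```r
# Hypothetical sample data; the real a and b are not shown in the question
a <- data.frame(Sample = c(100, 101, 102),
                Status = c("Y", "partial", NA),
                stringsAsFactors = FALSE)
b <- data.frame(Sample = c(101, 102, 103),
                Status = c("Y", NA, "partial"),
                stringsAsFactors = FALSE)

m <- merge(a, b, by = "Sample", all = TRUE)

# Lower rank wins; anything not listed (including NA) ranks last
priority <- c("Y", "partial")
rank_of <- function(s) match(s, priority, nomatch = length(priority) + 1L)

m$Status <- ifelse(rank_of(m$Status.x) <= rank_of(m$Status.y),
                   m$Status.x, m$Status.y)
m <- m[, c("Sample", "Status")]
```

Ties go to a, so a still overrules b when both carry the same status.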
I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
# x y
#1 1 a
#2 3 <NA>
#3 5 b
I don't want to build a potentially large Cartesian product and then keep only the few rows that match the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the 'fuzzyjoin' package, but I cannot find examples that fit my need: the function applied for the condition takes only two arguments. I also tried to put 'start' and 'end' into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine fuzzy_left_join doesn't accept it.
I am looking for solutions that work in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric, with arbitrary conditions).
I also want to be able to express conditions like (x==start+1)|(x==end+1), giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually went through the code of fuzzy_join and found a way to do what I want, even without proper documentation. fuzzy_left_join doesn't work, but there is the following way (not really pretty, and it actually does a Cartesian product):
g <- function(x,y) (x>y[,"start"])&(x<y[,"end"])
fuzzy_join(a,b, multi_by = list(x="x",y=c("start","end"))
, multi_match_fun = g, mode = "left") %>% select(x,y)
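For the specific case of non-overlapping intervals, the Cartesian product can be avoided in base R with findInterval. This is only a sketch under that assumption (b sorted by start, intervals disjoint). It is not a general conditional join, and it swaps fuzzyjoin for a different technique:

```r
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"),
                stringsAsFactors = FALSE)

# Assumption: the intervals in b do not overlap; sort them by start
b <- b[order(b$start), ]

# Index of the last interval whose start is strictly below x (0 if none)
idx <- findInterval(a$x, b$start, left.open = TRUE)

# Keep the candidate only if x is also strictly below that interval's end
hit <- idx > 0 & a$x < b$end[pmax(idx, 1L)]
a$y <- ifelse(hit, b$y[pmax(idx, 1L)], NA_character_)
```

Each x is compared against a single candidate interval, so the work grows with the number of rows in a rather than with nrow(a) * nrow(b).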
A data.table approach could be
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: In case you want to join both data frames on the (x==start+1)|(x==end+1) condition, you can try
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
# x y
#1: 1 a
#2: 5 b
#3: 3 a
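The same (x==start+1)|(x==end+1) condition can be sketched in base R with two match() look-ups. This assumes each x matches at most one row of b per look-up, since match() returns only the first hit; the names by_start, by_end and row_in_b are mine:

```r
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"),
                stringsAsFactors = FALSE)

by_start <- match(a$x, b$start + 1)   # row of b where x == start + 1, else NA
by_end   <- match(a$x, b$end + 1)     # row of b where x == end + 1, else NA

# Prefer the start match, fall back to the end match
row_in_b <- ifelse(is.na(by_start), by_end, by_start)
res <- data.frame(x = a$x, y = b$y[row_in_b], stringsAsFactors = FALSE)
```

Unlike the rbindlist version, this keeps the rows in the order of a.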
Here is a possible answer that shows what I am trying to do: extend dplyr in some way. I would be happy to hear of ways to improve this solution, or of problems I didn't see.
The solution avoids the Cartesian product, but it duplicates one of the input data frames and the result into lists of data frames. I didn't include the final selection of columns x and y, which is easy to code.
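my_left_join <- function(.DATA1,.DATA2,.WHERE)
{
call = as.list(match.call())
df1 <- .DATA1
df1$._row_ <- 1:nrow(df1)
dfl1 <- replyr::replyr_split(df1,"._row_")
# filter .DATA2 against each single row of df1; tagging the matches with
# ._row_ is an assumed completion of the truncated lines
dfl2 <- mapply(function(.x)
{filter(.DATA2,with(.x,eval(call$.WHERE))) %>%
mutate(._row_=.x$._row_)},
dfl1, SIMPLIFY=FALSE)
df2 <- replyr::replyr_bind_rows(dfl2)
left_join(df1,df2,by="._row_") %>% select(-._row_)
}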
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==start+1)|(x==end+1))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution. Note that findOverlaps treats the range end points as inclusive.
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
x y
1 1 a
2 3 <NA>
3 5 b
After scraping some review data from a website, I am having difficulty organizing the data into a useful structure for analysis. The problem is that the data is dynamic, in that each reviewer gave ratings on anywhere between 0 and 3 subcategories (denoted as subcategories "a", "b" and "c"). I would like to organize the reviews so that each row is a different reviewer, and each column is a subcategory that was rated. Where reviewers chose not to rate a subcategory, I would like that missing data to be 'NA'. Here is a simplified sample of the data:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
vec records which subcategories were scored, and each "stop" marks the end of one reviewer's ratings. I would like to organize the result into a data frame with the following structure (expected output):
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
I would greatly appreciate any help on this, because I've been working on this issue for far longer than it should take me.
@alexis_laz provided what I believe is the best answer:
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
stops <- vec == "stop"
i = cumsum(stops)[!stops] + 1L
j = vec[!stops]
tapply(ratings, list(factor(i, 1:max(i)), factor(j)), identity) # although mean/sum work
# a b c
#[1,] 2 5 1
#[2,] 1 3 NA
#[3,] NA NA NA
#[4,] NA NA 2
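The i and j vectors above are row and column positions, so the same table can also be filled by matrix indexing instead of tapply. A small sketch, assuming vec always ends with "stop"; the names cats and out are mine, not from the answer:

```r
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)

stops <- vec == "stop"
i <- cumsum(stops)[!stops] + 1L   # reviewer (row) of each rating
j <- vec[!stops]                  # subcategory (column) of each rating

cats <- sort(unique(j))
out <- matrix(NA_real_, nrow = sum(stops), ncol = length(cats),
              dimnames = list(NULL, cats))

# A two-column index matrix assigns one cell per (row, column) pair
out[cbind(i, match(j, cats))] <- ratings
```

Reviewers who rated nothing keep their all-NA row because the matrix is pre-allocated with one row per "stop".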
base R, but I'm using a for loop...
vec <- c("a","b","c","stop", "a","b","stop", "stop", "c","stop")
ratings <- c(2,5,1, 1,3, 2)
categories <- unique(vec)[unique(vec)!="stop"]
row = 1
df = data.frame(lapply(categories, function(x){NA_integer_}))
colnames(df) <- categories
rating = 1
for(i in vec) {
  if(i=='stop') { row <- row+1
  } else { df[row,i] <- ratings[[rating]]; rating <- rating+1 }
}
Here is one option
d1 <- as.data.table(melt(split(vec, c(1, head(cumsum(vec == "stop")+1,
-1)))))[value != 'stop', ratings := ratings
][value != 'stop'][, value := as.character(value)][, L1 := as.integer(L1)]
dcast( d1[CJ(value = value, L1 = seq_len(max(L1)), unique = TRUE), on = .(value, L1)],
L1 ~value, value.var = 'ratings')[, L1 := NULL][]
# a b c
#1: 2 5 1
#2: 1 3 NA
#3: NA NA NA
#4: NA NA 2
Using base R functions and rbind.fill from plyr or rbindlist from data.table to produce the final object, we can do
# convert vec into a list, split by "stop", dropping final element
temp <- head(strsplit(readLines(textConnection(paste(gsub("stop", "\n", vec, fixed=TRUE),
collapse=" "))), split=" "), -1)
# remove empty strings, but maintain empty list elements
temp <- lapply(temp, function(x) x[nchar(x) > 0])
# match up appropriate names to the individual elements in the list with setNames
# convert vectors to single row data.frames
temp <- Map(function(x, y) setNames(as.data.frame.list(x), y),
relist(ratings, skeleton = temp), temp)
# add silly data.frame (single row, single column) for any empty data.frames in list
temp <- lapply(temp, function(x) if(nrow(x) > 0) x else setNames(data.frame(NA), vec[1]))
Now, you can produce the single data.frame (data.table) with either plyr or data.table
# with plyr, returns data.frame
do.call(rbind.fill, temp)
a b c
1 2 5 1
2 1 3 NA
3 NA NA NA
4 NA NA 2
# with data.table, returns data.table
rbindlist(temp, fill=TRUE)
a b c
1: 2 5 1
2: 1 3 NA
3: NA NA NA
4: NA NA 2
Note that the line prior to the rbinding can be replaced with
temp[lengths(temp) == 0] <- replicate(sum(lengths(temp) == 0),
setNames(data.frame(NA), vec[1]), simplify=FALSE)
where the list items that are empty data frames are replaced using subsetting instead of an lapply over the entire list.
library(data.table) # for setnames()
dfOrig <- data.frame(rbind("1", "C", "531404", "3", "B", "477644"))
setnames(dfOrig, "Value")
I have a single column vector, which actually comprises two observations of three variables. How do I convert it to a data.frame with the following structure:
ID Code Tag
"1" "C" "531404"
"3" "B" "477644"
Obviously, this is just a toy example to illustrate a real-world problem with many more observations and variables.
Here's another approach - it does rely on the dfOrig column being ordered 1,2,3,1,2,3 etc.
x <- c("ID", "Code", "Tag") # new column names
n <- length(x) # number of columns
res <- data.frame(lapply(split(as.character(dfOrig$Value), rep(x, nrow(dfOrig)/n)),
type.convert, as.is = TRUE)) # type.convert call is an assumed completion of the truncated line
The resulting data is:
> str(res)
#'data.frame': 2 obs. of 3 variables:
# $ Code: Factor w/ 2 levels "B","C": 2 1
# $ ID : int 1 3
# $ Tag : int 531404 477644
As you can see, the column classes have been converted. In case you want the Code column to be character instead of factor you can specify stringsAsFactors = FALSE in the data.frame call.
And it looks like this:
> res
# Code ID Tag
#1 C 1 531404
#2 B 3 477644
Note: You have to get the column name order in x in line with the order of the entries in dfOrig$Value.
If you want to get the column order of res as specified in x, you can use the following:
res <- res[, match(x, names(res))]
Maybe convert to a matrix with ncol:
# set number of columns
myNcol <- 3
# convert to matrix, then dataframe
res <- data.frame(matrix(dfOrig$Value, ncol = myNcol, byrow = TRUE),
stringsAsFactors = FALSE)
# convert the type and add column names
res <- as.data.frame(lapply(res, type.convert),
col.names = c("resID", "Code", "Tag"))
# resID Code Tag
# 1 1 C 531404
# 2 3 B 477644
You can create a sequence of numbers
x <- seq(1:nrow(dfOrig)) %% 3 #you can change this 3 to number of columns you need
data.frame(ID = dfOrig$Value[x == 1],
Code = dfOrig$Value[x == 2],
Tag = dfOrig$Value[x == 0])
#ID Code Tag
#1 1 C 531404
#2 3 B 477644
Another approach would be splitting the dataframe according to the sequence generated above and then binding the columns using do.call
x <- seq(1:nrow(dfOrig))%%3
res <- do.call("cbind", split(dfOrig,x))
You can definitely change the column names
colnames(res) <- c("Tag", "Id", "Code")
# Tag Id Code
#3 531404 1 C
#6 477644 3 B
I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1
...and the update data frame "df2":
K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
This loops by column, setting dt1 by reference and (hopefully) should be quick.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1),names(dt2)))
stop("Assumed for now. Can relax later if needed.")
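w = chmatch(dt2$K, dt1$K) # row of dt1 for each key in dt2 (NA if the key is new)
for (i in 2:ncol(dt2)) {
nna = !is.na(dt2[[i]]) & !is.na(w) # non-NA updates whose key already exists in dt1
set(dt1, w[nna], i, dt2[[i]][nna]) # assumed completion: update by reference with set()
}
dt1 = rbind(dt1,dt2[is.na(w)]) # append the new rows ("inserts")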
K P1 P2 P3
[1,] A 1 1 1
[2,] B 2 1 2
[3,] C 1 2 2
[4,] D 2 2 2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) #function replaces NAs if existing
df4 <- data.frame(mapply(df2[df2$K%in%dups, ], df1[df1$K%in%dups, ],
FUN = decider)) #replace all NAs of df2 with df1 values if they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming they are in the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
colnames(LS[[i]]) <- colnames(LS[[2]])
LS[[i]]
})
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
EDIT : Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as separate answer.
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
K = dt2[[1]]
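for (i in 1:nrow(dt2)) {
k = K[i]
p = unlist(dt2[i,-1,with=FALSE])
p = p[!is.na(p)]
# assumed completion of the truncated loop: write the non-NA values into the matching row
if (length(p)) dt1[K == k, (names(p)) := as.list(p)]
}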
Or can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
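The A[B] idea can be sketched as follows. This is only an illustration on the small example data, assuming all P columns are numeric and the keys are unique; the names A, U, Uc, hit and B are mine:

```r
df1 <- data.frame(K = c("A", "B", "C"), P1 = 1, P2 = 1, P3 = 1,
                  stringsAsFactors = FALSE)
df2 <- data.frame(K = c("B", "C", "D"), P1 = c(2, NA, 2),
                  P2 = c(NA, 2, 2), P3 = c(2, 2, 2),
                  stringsAsFactors = FALSE)

# Numeric matrices with the key as row names
A <- as.matrix(df1[, -1]); rownames(A) <- df1$K
U <- as.matrix(df2[, -1]); rownames(U) <- df2$K

# Restrict the update to keys present in both, columns aligned by name
common <- intersect(rownames(A), rownames(U))
Uc <- U[common, colnames(A), drop = FALSE]

# (row, col) positions of the non-NA update cells, translated to rows of A
hit <- which(!is.na(Uc), arr.ind = TRUE)
B <- cbind(match(common[hit[, "row"]], rownames(A)), hit[, "col"])
A[B] <- Uc[hit]

# Append the keys that only exist in the update ("inserts")
A <- rbind(A, U[setdiff(rownames(U), rownames(A)), colnames(A), drop = FALSE])
```

The assignment A[B] <- Uc[hit] touches only the cells that have a non-NA update, in one vectorized step rather than a loop over rows or columns.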
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
dt1s<-data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable")) # read f1.dat, melt to long/stacked format, and convert to data.table
dt2s<-data.table(melt(fread("f2.dat"), id.vars="K", na.rm=T), key=c("K","variable")) # read f2.dat, melt to long/stacked format (removing NAs), and convert to data.table
dt1s[dt2s,value:=value.new] # Update new values
dtout<-reshape(rbindlist(list(dt1s,dt1s[dt2s][is.na(value),list(K,variable,value=value.new)])), direction="wide", idvar="K", timevar="variable") # Use rbindlist to insert new records, and then reshape
setnames(dtout,colnames(dtout),sub("value.", "", colnames(dtout))) # Clean up the column names
I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of logical (TRUE / FALSE) vectors
noNAList <- function(vec){
res <- !is.na(vec)
res
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
res
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let ddply get to work (requires library(plyr))
doddply <- function(df){
ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get:
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'mydata' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=TRUE.
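To see the NA handling in action, and to add the per-group total the question asks for, here is a small sketch using the question's own warpbreaks setup with a single category column. The ave() step is my addition, not part of the answer above:

```r
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category <- c(1, 2, 3)            # recycled down the 54 rows
mydata$category[c(8, 10, 19)] <- NA

# Rows with NA in category are dropped by the formula interface (na.action = na.omit)
res <- aggregate(someValue ~ group + size + category, data = mydata, FUN = sum)
names(res)[names(res) == "someValue"] <- "sumTest"

# Total per group/size, repeated on every category row
res$sumTestTotal <- ave(res$sumTest, res$group, res$size, FUN = sum)
```

Swapping category for categoryA in the formula gives the second variant, so the same three lines can be wrapped in a function and applied over a vector of column names.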
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
res
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
  E <- as.name(substitute(VAR))
  dt[i = !is.na(eval(E)),
     j = {n <- sum(.SD[,someValue])
          .SD[, list(sumTest = sum(someValue),
                     sumTestTotal = n,
                     share = sum(someValue)/n),
              by = VAR]
         },
     by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
FUN = function(X) {
do.call(myfunction, list(dt, X))
})
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh, but here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
res <- !is.na(vec)
res
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
dfs <- list2env(listOfDf)
# create expression list
exlist <- list()
for(i in 1:length(qual)){
exlist[[qual[i]]] <- parse(text=paste("ddply(",qual[i],
",.(group,size,",qual[i],"),summarize,sumTest = sum(someValue))"))
}
res <- lapply(exlist,eval,envir=dfs)
res
}
Is this more like what you mean? I find your example extremely difficult to understand. In the code below, the method takes any column name and aggregates by it, and it can return several aggregation functions of someValue. I then find all the column names you would like to aggregate by and apply the function to that vector.
# Build a method to aggregate by column.
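agg.by.col = function (column) {
by.list = list(mydata$group, mydata$size, mydata[, column]) # assumed definition; the original line is missing
names(by.list) = c('group','size',column)
aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}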
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)