I have a dataframe with 2 numeric columns. For each row, I want to create an array of the integers that fall between the values in the two columns, including the column values themselves. Then, I want to compile all of the values into a single column to generate a histogram.
Input:
df
C1 C2
A 3 -92
B 8 -162
C 20 -97
D 50 -76
Output:
sdf5$Values
-92
-91
-90
...
2
3
-162
-161
...
7
8
...
My actual dataframe has 62 rows. My current code gives me frequencies > 100 (should have a maximum of 62 for any integer). The code worked on a dummy dataframe, so I'm not sure where things are going wrong.
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- as.data.frame(sdf3$values)
sdf5 <- rename(sdf4, Values = 1)
a <- ggplot(sdf5, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
I'm not sure exactly what goes wrong, but I think rbind.data.frame is causing trouble with the list input. As an alternative:
library(ggplot2)
df <- read.table(text = " C1 C2
A 3 -92
B 8 -162
C 20 -97
D 50 -76")
list <- mapply(":", df$C2, df$C1)
df2 <- data.frame(Values = do.call(c, list))
ggplot(df2, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
Created on 2021-02-08 by the reprex package (v1.0.0)
Something must be off with the stack function; you can check it using table. To put all the list's numbers into a single vector I'd use unlist.
df=data.frame(C1=floor(runif(80,0,200)),C2=floor(runif(80,-200,0)))
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- data.frame("Values"=sdf3$values)
table(sdf4)
# This returns the count of each unique value; some counts go up to 200,
# which is notably the limit of my runif distribution
If you use unlist, it gives the desired result.
df=data.frame(C1=floor(runif(80,0,200)),C2=floor(runif(80,-200,0)))
list <- mapply(":", df$C2, df$C1)
vec <- data.frame("Values"=unlist(list))
a <- ggplot(vec, aes(x=Values)) +
geom_histogram(binwidth = 1, center=0)
I don't know the stack function, but the problem must be there somehow.
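A small sketch (mine, not from the original post) of what I suspect happens before stack() is even involved: rbind.data.frame() treats each vector as one row and silently recycles shorter vectors to the width set by the first one, so values get repeated:
l <- list(-2:3, -1:1)        # lengths 6 and 3
wide <- do.call(rbind.data.frame, l)
dim(wide)                    # 2 x 6: -1:1 was recycled to length 6
table(stack(wide)$values)    # some counts exceed the number of list elements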
I would like to join the two data frames:
a <- data.frame(x=c(1,3,5))
b <- data.frame(start=c(0,4),end=c(2,6),y=c("a","b"))
with a condition like (x>start)&(x<end) in order to get such a result:
#  x    y
#1 1    a
#2 3 <NA>
#3 5    b
I don't want to build a potentially large cartesian product and then keep only the few rows matching the condition, and I'd like a solution using the tidyverse (I am not interested in a solution using SQL, which would be a confession of failure). I thought of the fuzzyjoin package, but I cannot find examples fitting my need: the function to apply for the condition takes only two arguments. I also tried to put start and end into a single argument with data.frame(z=I(purrr::map2(b$start,b$end,list)),y=b$y)
# z y
#1 0, 2 a
#2 4, 6 b
but although the data looks fine, fuzzy_left_join doesn't accept it.
I search for solutions working in more general cases (n variables on the LHS, m on the RHS, not necessarily numeric with arbitrary conditions).
UPDATE
I also want to be able to express conditions like (x=start+1)|(x=end+1) giving here:
# x y
#1 1 a
#2 3 a
#3 5 b
For this case you don't need multi_by or multi_match_fun; this works:
library(fuzzyjoin)
fuzzy_left_join(a, b, by = c(x = "start", x = "end"), match_fun = list(`>`, `<`))
# x start end y
# 1 1 0 2 a
# 2 3 NA NA <NA>
# 3 5 4 6 b
I eventually dug into the code of fuzzy_join and found a way to do what I want, even without proper documentation. fuzzy_left_join doesn't work here, but the following does (not really pretty, and it actually performs a cartesian product internally):
library(dplyr)  # for %>% and select()
g <- function(x, y) (x > y[,"start"]) & (x < y[,"end"])
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start","end")),
           multi_match_fun = g, mode = "left") %>% select(x, y)
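The updated OR condition can be expressed through the same multi_match_fun mechanism; a hedged sketch along the same lines (h is my name, and I haven't verified it across fuzzyjoin versions):
h <- function(x, y) (x == y[,"start"] + 1) | (x == y[,"end"] + 1)
fuzzy_join(a, b, multi_by = list(x = "x", y = c("start","end")),
           multi_match_fun = h, mode = "left") %>% select(x, y)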
A data.table approach could be
library(data.table)
name1 <- setdiff(names(setDT(b)), names(setDT(a)))
#perform left outer join and then select required columns
a[b, (name1) := mget(name1), on = .(x > start, x < end)][, .(x, y)]
which gives
x y
1: 1 a
2: 3 <NA>
3: 5 b
Sample data:
a <- data.frame(x = c(1, 3, 5))
b <- data.frame(start = c(0, 4), end = c(2, 6), y = c("a", "b"))
Update: in case you want to join both dataframes on the (x=start+1)|(x=end+1) condition, you can try
library(data.table)
DT1 <- as.data.table(a)
DT2 <- as.data.table(b)
#Perform 1st join on "x = start+1" and then another on "x = end+1". Finally row-bind both results.
DT <- rbindlist(list(DT1[DT2[, start_temp := start+1], on = c(x = "start_temp"), .(x, y), nomatch = 0],
DT1[DT2[, end_temp := end+1], on = c(x = "end_temp"), .(x, y), nomatch = 0]))
DT
# x y
#1: 1 a
#2: 5 b
#3: 3 a
A possible answer, to explain what I am trying to do: extending dplyr in some way. I will be happy to learn of ways to improve this solution, or of problems I didn't see.
The solution avoids the cartesian product, but it duplicates both one of the input data frames and the result into lists of data frames. I didn't include the final selection of the x and y columns, which is easy to code.
my_left_join <- function(.DATA1,.DATA2,.WHERE)
{
call = as.list(match.call())
df1 <- .DATA1
df1$._row_ <- 1:nrow(df1)
dfl1 <- replyr::replyr_split(df1,"._row_")
eval(substitute(
dfl2 <- mapply(function(.x)
{filter(.DATA2,with(.x,WHERE)) %>%
mutate(._row_=.x$._row_)}
, dfl1, SIMPLIFY=FALSE)
,list(WHERE=call$.WHERE)))
df2 <- replyr::replyr_bind_rows(dfl2)
left_join(df1,df2,by="._row_") %>% select(-._row_)
}
my_left_join(a,b,(x>start)&(x<end))
# x start end y
#1 1 0 2 a
#2 3 NA NA <NA>
#3 5 4 6 b
my_left_join(a,b,(x==(start+1))|(x==(end+1)))
# x start end y
#1 1 0 2 a
#2 3 0 2 a
#3 5 4 6 b
You can try a GenomicRanges solution
library(GenomicRanges)
# setup GRanges objects
a_gr <- GRanges(1, IRanges(a$x,a$x))
b_gr <- GRanges(1, IRanges(b$start, b$end))
# find overlaps between the two data sets
res <- as.data.frame(findOverlaps(a_gr,b_gr))
# create the expected output
a$y <- NA
a$y[res$queryHits] <- as.character(b$y)[res$subjectHits]
a
x y
1 1 a
2 3 <NA>
3 5 b
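A practical note: GenomicRanges is distributed through Bioconductor rather than CRAN, so installation differs from the usual install.packages() route (shown here for current Bioconductor tooling):
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicRanges")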
I'm looking at a problem where we are trying to create time-series differences for a large number of pairs (40,000+), where each time series has 150+ daily points.
Each row represents a pair of two individuals we want to compare:
pairs = data.frame("number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A")
)
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
Each row represents the time series data for a particular individual:
ts = data.frame("name" = c("A","B","C","D"),
"day1" = c(10,12,54,13),
"day2" = c(2,8,47,29),
"day3" = c(1,5,14,36)
)
ts$name <- as.character(ts$name)
I have the following R code, where the goal is to create, for each pair of individuals (so 4 in my example), a new data frame that holds the difference of their daily time series from the ts data frame. This works, but it is extremely slow when I run it on my real data set, where pairs has 40,000 rows and ts about 150 columns. Does anyone have an idea of how I could speed this up? I tried to use lapply but can't figure out how to create the differences and store them in new time series. Thanks!!
diffs<-data.frame(matrix(ncol=ncol(ts)))
colnames(diffs)<-colnames(ts)
for (row in 1:nrow(pairs)){
row1<-ts[(ts$name==pairs[row,"name1"]),]
row2<-ts[(ts$name==pairs[row,"name2"]),]
difference<-rbind(row1,row2)
difference[3,1]<-pairs[row,"number"]
difference[3,2:ncol(difference)]<-difference[1,2:ncol(difference)]-difference[2,2:ncol(difference)]
diffs<-rbind(diffs,difference[3,])
}
A few remarks up front:
i) data.frame() has an argument stringsAsFactors, which you can set to FALSE, i.e.:
pairs = data.frame(
"number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A"),
stringsAsFactors = FALSE
)
ii) speeding up your code is not really a matter of replacing the for-loop with an apply, but rather a matter of data structures and processing efficiency. Relying on a package / function that loops internally in C++ rather than R, or writing C++ code yourself, will give you the biggest boost.
iii) I'll also provide a larger dummy example here, so that you and others can test and compare timings a little more easily:
# all combinations of LETTERS, including identity pairs like A~A
pairs = cbind.data.frame(
"number" = seq(1, 676),
setNames(expand.grid(LETTERS, LETTERS), nm = c("name1", "name2"))
)
# expand.grid produces factor columns
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
ts = cbind.data.frame(
"name" = LETTERS,
matrix(sample.int(100, 150*26, replace = TRUE), ncol = 150),
stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])
iv) an improved version of your loop could then look like:
# initialize full matrix (since the ID is a number too), allocating necessary memory
diffs2 <- matrix(0, ncol = ncol(ts), nrow = nrow(pairs))
colnames(diffs2) <- colnames(ts)
# first column is given
diffs2[, 1] <- pairs$number
for (row in 1:nrow(pairs)) {
row1 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name1"], -1]))
row2 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name2"], -1]))
diffs2[row, -1] <- row1 - row2
}
This is already several times faster than what you had, but it illustrates the awkwardness of keeping a time series in a data.frame, which should instead be an object of a class that lets you work with the numeric data more directly / efficiently (several packages offer time-series classes).
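One way to act on that advice without pulling in a time-series class (my sketch, not part of the original benchmarks): keep the daily values in a plain numeric matrix keyed by name; character subscripts replicate rows, so all pairs reduce to a single vectorized subtraction:
m <- as.matrix(ts[, -1])
rownames(m) <- ts$name
diffs3 <- m[pairs$name1, , drop = FALSE] - m[pairs$name2, , drop = FALSE]
rownames(diffs3) <- pairs$number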
Now for an answer that is still fairly simple yet rather fast, using dplyr and tidyr:
# simple way of measuring time
start <- Sys.time()
xx <- tidyr::gather(ts, key = "day", value = "value", 2:151)
yy <- dplyr::left_join(pairs, xx, by = c("name1" = "name"))
zz <- dplyr::left_join(yy, xx, by = c("name2" = "name", "day" = "day"))
res <- dplyr::mutate(zz, diff = value.x - value.y)
end <- Sys.time()
duration <- end - start
duration
Time difference of 0.06700397 secs
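If one row per pair (wide format) is preferred in the end, the long result reshapes back easily; a small follow-up sketch using tidyr::spread(), the reshaping verb of that era (pivot_wider() in current tidyr):
wide <- tidyr::spread(res[, c("number", "day", "diff")], key = day, value = diff)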
You can also try the approaches from the other two answers; it's clear that the mapply solution will be slow, and the data.table one isn't working fully yet and already looks slower and more complicated.
I have a data.table solution to help.
The idea is to switch to long format so that grouping operations (the equivalent of apply) become available, and to create permuted columns to build the pairs:
name1idx <- unlist(lapply(pairs$name1,function(x){grep(x,ts$name)}))
name2idx <- unlist(lapply(pairs$name2,function(x){grep(x,ts$name)}))
plouf <-melt(setDT(ts),measure.vars = patterns("^day"),variable.name = "day")
plouf[,name1 := name[name1idx],by = day]
plouf[,value1 := value[name1idx],by = day]
plouf[,name2 := name[name2idx],by = day]
plouf[,value2 := value[name2idx],by = day]
plouf[,diff := value1 - value2]
plouf[,.(day,diff),by = .(name1,name2)]
name1 name2 day diff
1: A B day1 -2
2: A B day2 -6
3: A B day3 -4
4: B D day1 -1
5: B D day2 -21
6: B D day3 -31
7: C D day1 41
8: C D day2 18
9: C D day3 -22
10: D A day1 3
11: D A day2 27
12: D A day3 35
name1idx and name2idx are the indices of ts$name corresponding to pairs$name1 and pairs$name2. This way you get all the pairs.
I was looking for a solution in which the use of column names is dynamic and no column name other than name is hard-coded. mapply, dplyr and reshape2 have been used for this solution.
library(dplyr)  # filter(), select(), mutate() and %>%
# library(reshape2)
# A function which will filter value based on pairs
matchPair <- function(x, y){
matchedRow <- ts %>%
filter(name == x | name == y) %>%
select(-name)
data.frame(diff(as.matrix(matchedRow))) %>%
mutate(name = paste0(x, '~',y))
}
df.r <-do.call(rbind,mapply(matchPair, pairs$name1, pairs$name2,
SIMPLIFY = FALSE))
# Row names are not meaningful. Hence remove those.
row.names(df.r) <- NULL
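# Caveat: diff() subtracts in ts row order (second matching row minus the
# first), not in name1/name2 order, so the sign of each difference depends
# on which of the two names appears first in ts.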
#Result
#> df.r
# day1 day2 day3 name
#1 2 6 4 A~B
#2 1 21 31 B~D
#3 -41 -18 22 C~D
#4 3 27 35 D~A
For my question I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49-43) / (2+3)?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the four different cells 49, 43, 2 and 3 within the function.
I tried x[[1]][1] and similar things, but I can't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
library(plyr)
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
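For completeness, the same computation also reads naturally in dplyr; a small sketch of mine, not part of the original answer (note that summarize here is dplyr's, which masks plyr's if both are loaded):
library(dplyr)
DF %>% group_by(a) %>% summarize(res = (b[1] - b[2]) / sum(c))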
When you aggregate, the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example:
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error, even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there, look at the following:
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b and c, were each aggregated, but from the first attempt we know that you can't write something that accesses each column separately. So, strictly sticking with aggregate, you need two passes and three lines of code:
diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- diffb$b / Y$c
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)
You have to do a little work to get the result back into a data.frame, if you really want one:
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier.
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate, not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = -diff(b) / sum(c)), by = a]  # -diff(b) = b[1] - b[2], as in the question
You can use the base by() function:
listOfRows <-
by(data=DF,
INDICES=DF$a,
FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})
newDF <- do.call(rbind,listOfRows)
I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of TRUE/FALSE vectors (TRUE where not NA)
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let ddply get to work
doddply <- function(df){
ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get:
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third, varying variable, which is in this case category or categoryA. Like so:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'myData' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=T.
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
res <- mydata[TFvec,]
return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
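Given the edit (EITHER category OR categoryA, out of ~15 such variables), a hedged sketch that loops the same aggregate() call over the variable names; catVars and the loop are my additions, not part of the original answer:
catVars <- c("category", "categoryA")   # in reality about 15 of these
res <- lapply(catVars, function(v)
  aggregate(as.formula(paste("someValue ~ group + size +", v)),
            data = mydata, FUN = sum))
names(res) <- catVars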
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
E <- as.name(substitute(VAR))
dt[i = !is.na(eval(E)),
j = {n <- sum(.SD[,someValue])
.SD[, list(sumTest = sum(someValue),
sumTestTotal = n,
share = sum(someValue)/n),
by = VAR]
},
by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
FUN = function(X) {
do.call(myfunction, list(dt, X))
})
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh; here's my reproducible example, using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
res <- !is.na(vec)
return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
  dfs <- list2env(listOfDf)
  # create one ddply() expression per categorical variable
  exlist <- list()
  for(i in 1:length(charVec)){
    exlist[[charVec[i]]] <- parse(text=paste("ddply(",charVec[i],
      ",.(group,size,",charVec[i],"),summarize,sumTest = sum(someValue))",
      sep=""))
  }
  # evaluate each expression against the matching data frame in dfs
  res <- lapply(exlist,eval,envir=dfs)
  return(res)
}
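The post stops at the definition; a usage line, assuming the objects built above (qual, listOfDf) are in scope, would be:
y <- workhorse(qual, listOfDf)
y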
Is this more like what you mean? I find your example extremely difficult to understand. In the code below, the method can take any column and aggregate by it; it can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and apply the function to that vector.
# Build a method to aggregate by column.
agg.by.col = function (column) {
by.list=list(mydata$group,mydata$size,mydata[,column])
names(by.list) = c('group','size',column)
aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, agg.by.col)
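The returned list is unnamed; if labels help, a small follow-up (assumption: you want one named result per categorical column):
results <- lapply(cols, agg.by.col)
names(results) <- cols
results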