I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The "apply" step is running a regression for each firm and returning the residuals, so I am not aggregating by firm. plyr works well for this, but it takes a very long time to run when the number of firms is large. Is there a way to do this with data.table?
Sample Data:
dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
I need to split by each id (namely 1 and 2), run a regression, return the residuals, and append them as a column to my data. Is there a way to do this using data.table?
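For reference, the plyr version in question presumably looks something like this (an illustrative sketch only; dat stands for a data frame with the columns above, and the real model is more involved):
library(plyr)
# one lm() per id; residuals appended as a new column, rows recombined
res <- ddply(dat, .(id), transform, resid = residuals(lm(val1 ~ val2)))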
DWin's answer is correct for v1.8.0 (as currently on CRAN). But in v1.8.1 (on the R-Forge repository), := now works by group. It works for non-contiguous groups too, so there is no need to setkey first for it to line up.
dtb <- as.data.table(dat)
dtb
dte id val1 val2
1: 2001-10-02 1 10 25
2: 2001-10-03 1 11 24
3: 2001-10-04 1 12 23
4: 2001-10-02 2 13 22
5: 2001-10-03 2 14 21
dtb[, resid:=residuals(lm(val1 ~ val2)), by=id]
dte id val1 val2 resid
1: 2001-10-02 1 10 25 1.631688e-15
2: 2001-10-03 1 11 24 -3.263376e-15
3: 2001-10-04 1 12 23 1.631688e-15
4: 2001-10-02 2 13 22 0.000000e+00
5: 2001-10-03 2 14 21 0.000000e+00
To upgrade to v1.8.1, just install from the R-Forge repo (R 2.15.0+ is needed when installing any binary package from R-Forge):
install.packages("data.table", repos="http://R-Forge.R-project.org")
or install from source if you can't upgrade to latest R. data.table itself only needs R 2.12.0+.
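For example, a source install from the same repo is the same command with type="source":
install.packages("data.table", repos="http://R-Forge.R-project.org", type="source")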
Extending to the 1-million-row case:
DT = data.table(dte = Sys.Date() + 1:1000000,
                id = sample(1:2, 1000000, repl=TRUE),
                val1 = runif(1000000), val2 = runif(1000000))
setkey(DT, id)
system.time(ans1 <- cbind(DT, DT[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
12.272 0.872 13.182
ans1
dte id val1 val2 id V1
1: 2012-07-02 1 0.8369147 0.57553383 1 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 1 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 1 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 2 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 2 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 2 0.497948275
system.time(DT[, resid:=residuals(lm(val1 ~ val2)), by=id])
user system elapsed
7.436 0.648 8.107
DT
dte id val1 val2 resid
1: 2012-07-02 1 0.8369147 0.57553383 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 0.497948275
The example above only has 2 groups, is quite small at under 40MB, and Rprof shows 96% of the time is spent in lm. So in these cases := by group is not really for a speed advantage, but more for convenience; i.e., less code to write and no superfluous columns added to the output. As size grows, the avoidance of copies comes into play and speed advantages start to show. In particular, transform in j will slow down terribly as the number of groups increases.
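A rough way to see that grouping overhead in isolation (my own sketch, not from the benchmark above; it uses a cheap per-group computation instead of lm so the model fit doesn't dominate the timing):
set.seed(42)
DT2 = data.table(id = sample(1:50000, 1e6, replace=TRUE), val1 = runif(1e6))
setkey(DT2, id)
# transform() in j copies .SD for every group and stacks all the pieces at the end
system.time(DT2[, transform(.SD, r = val1 - mean(val1)), by=id])
# := by group adds the column once and fills it by reference, group by group
system.time(DT2[, r := val1 - mean(val1), by=id])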
I'm guessing this needs to be sorted by "id" to line up properly. Luckily that happens automatically when you set the key:
dat <-read.table(text="dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
", header=TRUE, sep=",")
dtb <- data.table(dat)
setkey(dtb, "id")
dtb[, residuals(lm(val1 ~ val2)), by="id"]
#---------------
cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
#---------------
dte id val1 val2 id.1 V1
[1,] 2001-10-02 1 10 25 1 1.631688e-15
[2,] 2001-10-03 1 11 24 1 -3.263376e-15
[3,] 2001-10-04 1 12 23 1 1.631688e-15
[4,] 2001-10-02 2 13 22 2 0.000000e+00
[5,] 2001-10-03 2 14 21 2 0.000000e+00
> dat <- data.frame(dte=Sys.Date()+1:1000000,
id=sample(1:2, 1000000, repl=TRUE),
val1=runif(1000000), val2=runif(1000000) )
> dtb <- data.table(dat)
> setkey(dtb, "id")
> system.time( cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
1.696 0.798 2.466
> system.time( dtb[,transform(.SD,r = residuals(lm(val1~val2))),by = "id"] )
user system elapsed
1.757 0.908 2.690
EDIT from Matthew :
This is all correct for v1.8.0 on CRAN. With the small addition that transform in j is the subject of data.table wiki point 2: "For speed don't transform() by group, cbind() afterwards". But, := now works by group in v1.8.1 and is both simple and fast. See my answer for illustration (but no need to vote for it).
Well, I voted for it. Here is the console command to install v1.8.1 on a Mac (if you have the proper Xcode tools available, since it is only there in source):
install.packages("data.table", repos= "http://R-Forge.R-project.org", type="source",
lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")
(For some reason I could not get the Mac GUI Package Installer to read r-forge as a repository.)
I have data as follows:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
10 1986 NA
10 1986 NA
10 1987 240
10 1987 NA
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
12 1984 240
I need to filter rows based on the following criteria:
For each PERMNO, sort data by date
Parse through the sorted data and delete rows after a company gets delisted (i.e. DLSTCD != NA)
If the first row corresponds to the company getting delisted, do not include any rows for that company
Based on these criteria, the following is my expected output:
PERMNO date DLSTCD
10 1983 NA
10 1985 250
11 1984 NA
11 1984 NA
11 1985 NA
11 1987 NA
I am using data.table in R to work with this data. The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to 30k PERMNOs.
I implemented three different methods for doing this, as can be seen here:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
Below I compare my implementations using a small dataset of 50k rows. Here are my results:
Time Comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see all my implementations are extremely inefficient. Can someone please help me implement a much faster version for this? Thank you.
Edit: Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u
Also, here is a customized read function for this data:
readdata = function(filename){
  data = read.csv(filename, header=TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)
  mcap = PRCABS * data$SHROUT
  hpr = data$RET
  HPR = as.numeric(levels(hpr))[hpr]
  HPR[HPR==""] = NA
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}
data <- readdata('fewdata.csv')
dt <- as.data.table(data)
Here's an attempt in data.table:
dat[
  dat[order(date),
      {
        pos <- match(TRUE, !is.na(DLSTCD));
        (.I <= .I[pos] & pos != 1) | (is.na(pos))
      },
      by=PERMNO]
  $V1]
# PERMNO date DLSTCD
#1: 10 1983 NA
#2: 10 1985 250
#3: 11 1984 NA
#4: 11 1984 NA
#5: 11 1985 NA
#6: 11 1987 NA
Testing it on 2.5 million rows, 400,000 of them with a delisting date:
set.seed(1)
dat <- data.frame(PERMNO=sample(1:22000,2.5e6,replace=TRUE), date=1:2.5e6)
dat$DLSTCD <- NA
dat$DLSTCD[sample(1:2.5e6, 400000)] <- 1
setDT(dat)
system.time({
  dat[
    dat[order(date),
        {
          pos <- match(TRUE, !is.na(DLSTCD));
          (.I <= .I[pos] & pos != 1) | (is.na(pos))
        },
        by=PERMNO]
    $V1]
})
# user system elapsed
# 0.74 0.00 0.76
Less than a second - not bad.
Building on #thelatemail's answer, here are two more variations on the same theme.
In both cases, setkey() first makes things easier to reason with :
setkey(dat,PERMNO,date) # sort by PERMNO, then by date within PERMNO
Option 1 : stack the data you want (if any) from each group
system.time(
  ans1 <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
)
user system elapsed
2.604 0.000 2.605
That's quite slow, because allocating and populating all the little bits of memory for each group's result, only for them to be stacked into one single result at the end, takes time and memory.
Option 2 : (closer to the way you phrased the question) find the row numbers to delete, then delete them.
system.time({
  todelete <- dat[, {
    w = first(which(!is.na(DLSTCD)))
    if (length(w)) .I[seq.int(from = if (w==1) 1 else w+1, to = .N)]
  }, keyby=PERMNO]
  ans2 <- dat[-todelete$V1]
})
user system elapsed
0.160 0.000 0.159
That's faster because it only stacks the row numbers to delete, followed by a single bulk operation that deletes the required rows. Since it's grouping by the first column of the key, it uses the key to make the grouping faster (groups are contiguous in RAM).
More info about .SD and .I can be found in the ?data.table manual page.
You can inspect and debug what is happening inside each group just by adding a call to browser() and having a look as follows.
> ans1 <- dat[, {
    browser()
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
Browse[1]> .SD # type .SD to look at it
date DLSTCD
1: 21679 NA
2: 46408 1
3: 68378 NA
4: 75362 NA
5: 77690 NA
---
111: 2396559 1
112: 2451629 NA
113: 2461958 NA
114: 2484403 NA
115: 2485217 NA
Browse[1]> w # doesn't exist yet because browser() before that line
Error: object 'w' not found
Browse[1]> w = first(which(!is.na(DLSTCD))) # copy and paste line
Browse[1]> w
[1] 2
Browse[1]> if (!length(w)) .SD else if (w>1) .SD[seq_len(w)]
date DLSTCD
1: 21679 NA
2: 46408 1
Browse[1]> # that is what is returned for this group
Browse[1]> n # or type n to step to next line
debug at #3: w = first(which(!is.na(DLSTCD)))
Browse[2]> help # for browser commands
Let's say you find a problem or bug with one particular PERMNO. You can make the call to browser conditional as follows.
> ans1 <- dat[, {
    if (PERMNO==42) browser()
    w = first(which(!is.na(DLSTCD)))
    if (!length(w)) .SD
    else if (w>1) .SD[seq_len(w)]
  }, keyby=PERMNO]
Browse[1]> .SD
date DLSTCD
1: 31018 NA
2: 35803 1
3: 37494 NA
4: 50012 NA
5: 52459 NA
---
128: 2405818 NA
129: 2429995 NA
130: 2455519 NA
131: 2478605 1
132: 2497925 NA
Browse[1]>
I have the following two tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
eth ZIP1      ZIP2  A  B  C
A   1          1   56 49 69
B   1          2   98 10  2
B   2          3   43 69 59
A   3          4    4 30  8
C   5          5   90 10 17
               6   19  4 84
               7   59 95 30
I would like to create a variable Inc in the df data frame where, for each observation, the value is taken from Inc at the intersection of that observation's eth (column) and ZIP (row). In my example, it would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or some other brute-force approach could solve it, but it takes time on my dataset; I'm looking for a more subtle way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed) is maybe why I haven't found any similar question when searching the forum.
Thanks!
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[ melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc"),
    Inc := i.Inc,
    on=c(ZIP1 = "ZIP2", "eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
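For instance, run on the sample Inc above, it starts like this (abridged; the values follow directly from the Inc definition):
melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc")
#   ZIP2 eth Inc
# 1    1   A  56
# 2    2   A  98
# 3    3   A  43
# ... (then the B rows and the C rows; 21 rows in total)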
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
My solution (which may seem awkward):
for (i in seq_along(df$eth)) {
  df$Inc[i] <- Inc[as.character(df$eth[i])][df$ZIP1[i], ]
}
Say I have these data:
start end duration
1 2.67026 2.903822 0.233562
2 4.40529 5.606470 1.201180
3 9.24340 10.010818 0.767418
4 11.87930 13.414140 1.534840
5 14.78210 15.182492 0.400392
6 16.51720 16.817494 0.300294
7 22.08930 25.125610 3.036310
8 32.13240 33.667240 1.534840
9 45.47880 45.912558 0.433758
10 52.85270 54.454270 1.601570
11 55.62210 56.389518 0.767418
They represent 11 events that occurred within a minute. Each has a start and end time (in seconds) and the duration of that event (in seconds).
What I want to calculate is how many seconds were spent doing these events in each 10 second bin/epoch.
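For reference, the example data can be rebuilt like this (values transcribed from the table above):
df <- data.frame(
  start = c(2.67026, 4.40529, 9.24340, 11.87930, 14.78210, 16.51720,
            22.08930, 32.13240, 45.47880, 52.85270, 55.62210),
  end   = c(2.903822, 5.606470, 10.010818, 13.414140, 15.182492, 16.817494,
            25.125610, 33.667240, 45.912558, 54.454270, 56.389518)
)
df$duration <- df$end - df$start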
A standard way of binning data in data.table would be to do something like:
as.data.table(df)[, .(total = sum(duration)), by = .(INTERVAL = cut(end, seq(0,60,10)))]
INTERVAL total
1: (0,10] 1.434742
2: (10,20] 3.002944
3: (20,30] 3.036310
4: (30,40] 1.534840
5: (40,50] 0.433758
6: (50,60] 2.368988
However, note that event 3 starts at 9.24340 seconds and ends at 10.010818 seconds. This method has only summed the durations of the first two events in the (0,10] interval. I want that first interval to also include 10 - 9.24340 = 0.7566 seconds, i.e. its total should be 2.191342 seconds. That 0.7566 seconds should in turn be subtracted from the second interval, whose total should be 2.246344 seconds.
In this example, the 0-10 and 10-20 second bins are the only ones where an event spans a cut point; however, I obviously need a solution that generalizes to any number of potential cut points.
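To make that arithmetic explicit for the boundary event:
10 - 9.24340                           # 0.75660 s of event 3 falls in (0,10]
10.010818 - 10                         # 0.010818 s of event 3 falls in (10,20]
0.233562 + 1.201180 + (10 - 9.24340)   # corrected (0,10] total: 2.191342
3.002944 - (10 - 9.24340)              # corrected (10,20] total: 2.246344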
I think a solution might be to convert the times to datetime format (including milliseconds?) and use that to cut the data, however, I wasn't able to make that work.
EDIT following #Arun's answer:
#Arun's answer works well for the problem above. But what if we want to include all intervals, even those where the summed duration is 0?
Example:
set.seed(1)
df <- data.frame(
  start = c(2.3, 3.5, 6.7, 9.4, 10.4, 13.5, 16.3, 18.1),
  duration = runif(8, 0, 1)
)
df$end <- df$start + df$duration
dt <- data.table(df)
dt
start duration end
1: 2.3 0.2655087 2.565509
2: 3.5 0.3721239 3.872124
3: 6.7 0.5728534 7.272853
4: 9.4 0.9082078 10.308208
5: 10.4 0.2016819 10.601682
6: 13.5 0.8983897 14.398390
7: 16.3 0.9446753 17.244675
8: 18.1 0.6607978 18.760798
Following Arun's solution:
lookup = data.table(start = seq(0, 18, by = 2), end = seq(2, 20, by = 2))
ans = foverlaps(dt, setkey(lookup, start, end))
ans[, sum(pmin(i.end, end) - pmax(i.start, start)), by=.(start,end)]
Result:
   start end        V1
1:     2   4 0.6376326
2:     6   8 0.5728534
3:     8  10 0.6000000
4:    10  12 0.5098897
5:    12  14 0.5000000
6:    14  16 0.3983897
7:    16  18 0.9446753
8:    18  20 0.6607978
Notice the intervals 0-2 and 4-6 are not included in the result. Obviously, we could bind these back in - but I wonder if this can be done simply by adjusting the data.table code?
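One possible tweak (my own sketch, not part of the answer below): aggregate as before, then join the aggregate back onto the full lookup so that empty bins show up, and set their totals to 0.
agg <- ans[, .(V1 = sum(pmin(i.end, end) - pmax(i.start, start))), by = .(start, end)]
full <- agg[lookup, on = c("start", "end")]   # one row per lookup interval, V1 is NA where empty
full[is.na(V1), V1 := 0]
full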
Here's a way I could think of with foverlaps().
require(data.table) # v1.9.5+ (due to bug fixes in foverlaps for double)
lookup = data.table(start = seq(0, 50, by = 10), end = seq(10, 60, by = 10))
# start end
# 1: 0 10
# 2: 10 20
# 3: 20 30
# 4: 30 40
# 5: 40 50
# 6: 50 60
ans = foverlaps(dt, setkey(lookup, start, end))
ans[, sum(pmin(i.end, end) - pmax(i.start, start)), by=.(start,end)]
# start end V1
# 1: 0 10 2.191342
# 2: 10 20 2.246344
# 3: 20 30 3.036310
# 4: 30 40 1.534840
# 5: 40 50 0.433758
# 6: 50 60 2.368988
I feel like there may be better options out there, though.
Edit: This question was originally titled "Long to wide data reshaping in R".
I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:
ID Obs 1 Obs 2 Obs 3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36
And what I want to end up with will look like this:
ID Obs 1 mean Obs 1 std dev Obs 2 mean Obs 2 std dev
1 x x x x
2 x x x x
3 x x x x
And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.
Thanks very much for any help.
This is an aggregation problem, not a reshaping problem as the question originally suggested -- we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although initially that may seem strange, in fact it simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. If one wishes to access the jth statistic of the ith observation it is therefore ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
On the other hand, suppose there are k statistic columns for each observation in the input (where k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]], or equivalently ag[-1][[k*(i-1)+j]].
For example, compare the simplicity of the first expression vs. the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)
There are a few different ways to go about it. reshape2 is a helpful package.
Personally, I like using data.table
Below is a step-by-step walkthrough.
If myDF is your data.frame:
library(data.table)
DT <- data.table(myDF)
DT
# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]
# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]
# If you would like to round the values:
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]
# If we want to add names to the columns
wide <- setnames(DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
                 c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD"))))
wide
   ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1:  1     35.333    8.021     36.333   10.214       33.0    9.644
2:  2     29.750    3.594     32.250    4.193       30.5    5.916
3:  3     41.500    4.950     43.500    4.950       39.0    4.243
Also, this may or may not be helpful
> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
Obs.1 Obs.2 Obs.3
Min. 25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median 33.00 32.00 36.00
Mean 34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max. 45.00 48.00 42.00
Here is probably the simplest way to go about it (with a reproducible example):
library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))
ID Obs_1_mean Obs_1_std_dev Obs_2_mean Obs_2_std_dev
1 1 -0.13994642 0.8258445 -0.15186380 0.4251405
2 2 1.49982393 0.2282299 0.50816036 0.5812907
3 3 -0.09269806 0.6115075 -0.01943867 1.3348792
EDIT: The following approach saves you a lot of typing when dealing with many columns.
ddply(df, .(ID), colwise(mean))
ID Obs_1 Obs_2 Obs_3
1 1 -0.3748831 0.1787371 1.0749142
2 2 -1.0363973 0.0157575 -0.8826969
3 3 1.0721708 -1.1339571 -0.5983944
ddply(df, .(ID), colwise(sd))
ID Obs_1 Obs_2 Obs_3
1 1 0.8732498 0.4853133 0.5945867
2 2 0.2978193 1.0451626 0.5235572
3 3 0.4796820 0.7563216 1.4404602
Here is the dplyr solution.
set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))
# ID Obs_1_mean Obs_2_mean Obs_3_mean Obs_1_sd Obs_2_sd Obs_3_sd
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
# 2 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
# 3 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
Here's another take on the data.table answers, using #Carson's data, that's a bit more readable (and also a little faster, because of using lapply instead of sapply):
library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
# ID mean.Obs_1 mean.Obs_2 mean.Obs_3 sd.Obs_1 sd.Obs_2 sd.Obs_3
#1: 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
#2: 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
#3: 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
The updated dplyr solution, as of 2020. The older approach now triggers two deprecation warnings:
1: summarise_each_() is deprecated as of dplyr 0.7.0.
2: funs() is deprecated as of dplyr 0.8.0.
ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(),list(mean = mean, sd = sd)))
There is a helpful function in the psych package.
You should try the following implementation:
psych::describeBy(data$dependentvariable, group = data$groupingvariable)
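For example, applied to the reproducible DF built earlier in this thread (a sketch; the column names are taken from that example, not from this answer):
psych::describeBy(DF[, c("Obs_1", "Obs_2", "Obs_3")], group = DF$ID)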
I am looking for a fast and efficient way to solve the problem described below. Any help would be appreciated, thanks in advance!
I have a couple of very large csv files that have different information about the same objects, but in my final calculation I need all of the attributes from the different tables. I am trying to calculate the load of a large number of electrical substations. First, I have a list of unique electrical substations:
Unique_Substations <- data.frame(Name = c("SubA", "SubB", "SubC", "SubD"))
In another list I have information about the customers behind these substations:
Customer_Information <- data.frame(
Customer = 1001:1010,
SubSt_Nm = sample(unique(Unique_Substations$Name), 10, replace = TRUE),
HouseHoldType = sample(1:2, 10, replace = TRUE)
)
And in another list I have information about the, let's say, solar panels on these customers' roofs (for different years):
Solar_Panels <- data.frame(
Customer = sample(1001:1010, 10, replace = TRUE),
SolarPanelYear1 = sample(10:20, 10, replace = TRUE),
SolarPanelYear2 = sample(15:20, 10, replace = TRUE)
)
Now I want to see what the load is for each substation for each year. I have a household load and a solar panel load, normalised for each type of household or solar panel:
SolarLoad <- data.frame(Load = c(0, -10, -10, 5))
HouseHoldLoad <- data.frame(Type1 = c(1, 3, 5, 2), Type2 = c(3, 5, 6, 1))
So now I have to match up these lists:
ML_SubSt_Cust <- sapply(Unique_Substations$Name,
                        function(x) which(Customer_Information$SubSt_Nm %in% x == TRUE))
ML_Cust_SolarP <- sapply(Customer_Information$Customer,
                         function(x) which(Solar_Panels$Customer %in% x == TRUE))
(Here I use the which(xxx %in% x == TRUE) method because I need multiple matches, whereas match() only returns one match.)
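A tiny illustration of that difference (my own example values, not from the data above):
match("SubA", c("SubA", "SubB", "SubA"))       # 1   -- only the first position
which(c("SubA", "SubB", "SubA") %in% "SubA")   # 1 3 -- all matching positions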
And now we come to my big question (but probably not my only problem with this method) at last. I want to calculate the maximum load on each substation for each year. To this end I first wrote a for loop over the Unique_Substations list, which is of course highly inefficient. After that I tried to speed it up using outer(), but I don't think I have properly vectorized my function. My maximum function looks as follows (I only wrote it out for the solar panel part to keep it simple):
GetMax <- function(i, Yr) {
  max(sum(Solar_Panels[unlist(ML_Cust_SolarP[ML_SubSt_Cust[[i]]], use.names = FALSE), Yr]) * SolarLoad)
}
I'm sure this is not efficient at all but I have no clue how to do it in any other way.
To get my final results I use an outer() call:
Results <- outer(1:nrow(Unique_Substations), 1:2, Vectorize(GetMax))
In my real data all of these data frames are much, much larger (40,000 rows each or so), so I really need some good optimization of the functions involved. I tried to think of ways to vectorize the function but I couldn't work it out. Any help would be appreciated.
EDIT:
Now that I fully understand the accepted answer, I have another problem. My actual Customer_Information is 188k rows long and my actual HouseHoldLoad is 53k rows long. Needless to say, this does not merge() very well. Is there another solution to this problem that does not require merge() or for loops that are too slow?
First: use set.seed() when generating random data! I ran set.seed(1000) before your code for these results.
I think a bit of merge-ing and dplyr can help here. First, we get the data into a better shape:
library(dplyr)
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load") %>%
  select(HouseHoldType=variable, Load) %>%
  mutate(HouseHoldType=gsub("Type", "", HouseHoldType))
Solar_Panels <- melt(Solar_Panels, id.vars="Customer",
                     value.name="SPYearVal") %>%
  select(Customer, SolarPanelYear=variable, SPYearVal) %>%
  mutate(SolarPanelYear=gsub("SolarPanelYear", "", SolarPanelYear))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
That gives us:
## Customer SubSt_Nm HouseHoldType SolarPanelYear SPYearVal
## 1 1001 SubB 1 1 16
## 2 1001 SubB 1 2 18
## 3 1001 SubB 1 2 16
## 4 1001 SubB 1 1 20
## 5 1002 SubD 2 1 16
## 6 1002 SubD 2 1 13
## 7 1002 SubD 2 2 20
## 8 1002 SubD 2 2 18
## 9 1003 SubA 1 2 15
## 10 1003 SubA 1 1 16
## 11 1005 SubC 2 2 19
## 12 1005 SubC 2 1 10
## 13 1006 SubA 1 1 15
## 14 1006 SubA 1 2 19
## 15 1007 SubC 1 1 17
## 16 1007 SubC 1 2 19
## 17 1009 SubA 1 1 10
## 18 1009 SubA 1 1 18
## 19 1009 SubA 1 2 18
## 20 1009 SubA 1 2 18
Now we just group and summarize:
dat %>% group_by(SubSt_Nm, SolarPanelYear) %>%
summarise(mx=max(sum(SPYearVal)*SolarLoad))
## SubSt_Nm SolarPanelYear mx
## 1 SubA 1 295
## 2 SubA 2 350
## 3 SubB 1 180
## 4 SubB 2 170
## 5 SubC 1 135
## 6 SubC 2 190
## 7 SubD 1 145
## 8 SubD 2 190
If you use data.table vs data frames, it should be pretty speedy even with 40K entries.
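As a rough sketch of that (my own translation, not code from this answer), the group/summarise step above maps to data.table like this, reusing the merged dat:
library(data.table)
setDT(dat)
# per substation and year: sum the panel values, scale by each load level, take the max
dat[, .(mx = max(sum(SPYearVal) * SolarLoad$Load)), by = .(SubSt_Nm, SolarPanelYear)]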
UPDATE: For those who cannot install dplyr, this version just uses reshape2 (hopefully that is installable):
library(reshape2)
HouseHoldLoad <- melt(HouseHoldLoad, value.name="Load")
colnames(HouseHoldLoad) <- c("HouseHoldType", "Load")
HouseHoldLoad$HouseHoldType <- gsub("Type", "", HouseHoldLoad$HouseHoldType)
Solar_Panels <- melt(Solar_Panels, id.vars="Customer", value.name="SPYearVal")
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("SolarPanelYear", "", Solar_Panels$SolarPanelYear)
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190
If you really can't install even reshape2, then this works with just the base stats package:
colnames(HouseHoldLoad) <- c("Load.1", "Load.2")
HouseHoldLoad <- reshape(HouseHoldLoad, varying=c("Load.1", "Load.2"), direction="long", timevar="HouseHoldType")[1:2]
colnames(Solar_Panels) <- c("Customer", "SolarPanelYear.1", "SolarPanelYear.2")
Solar_Panels <- reshape(Solar_Panels, varying=c("SolarPanelYear.1", "SolarPanelYear.2"), direction="long", timevar="SolarPanelYear")[1:2]
colnames(Solar_Panels) <- c("Customer", "SPYearVal")
Solar_Panels$SolarPanelYear <- gsub("^[0-9]+\\.", "", rownames(Solar_Panels))
dat <- merge(Customer_Information, Solar_Panels, by="Customer")
rbind(by(dat, list(dat$SubSt_Nm, dat$SolarPanelYear), function(x) {
mx <- max(sum(x$SPYearVal) * SolarLoad)
}))
## 1 2
## SubA 295 350
## SubB 180 170
## SubC 135 190
## SubD 145 190