I need to compute a monthly weighted average. The data frame looks like this:
Month Variable Weighting
460773 1998-06-01 11 153.00
337134 1998-06-01 9 0.96
473777 1998-06-01 10 264.00
358226 1998-06-01 6 0.52
414626 1998-06-01 10 34.00
341020 1998-05-01 9 1.64
453066 1998-05-01 5 26.00
183276 1998-05-01 8 0.51
403729 1998-05-01 6 123.00
203005 1998-05-01 11 0.89
When I use aggregate(), e.g.
Output <- aggregate(Variable ~ Month, df, mean)
Output
Month Variable
1 1998-05-01 7.8
2 1998-06-01 9.2
I get correct results. However, when I try to add weights to the aggregation, e.g.
Output <- aggregate(Variable ~ Month, df, FUN = weighted.mean, w = df$Weighting)
I get a different-vector-lengths error:
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
Is there a way to remedy this situation?
With aggregate() this is not possible, because the weight vector is not partitioned along with the data during aggregate(). You can use by(), or split() plus sapply(), or the data.table package, or the ddply() function from the plyr package, or functions from the dplyr package (see the dplyr sketch after the data.table example below).
Example with split() plus sapply():
sapply(split(df, df$Month), function(d) weighted.mean(d$Variable, w = d$Weighting))
Result:
1998-05-01 1998-06-01
5.89733 10.33142
A variant with by():
by(df, df$Month, FUN=function(d) weighted.mean(d$Variable, w = d$Weighting)) # or
unclass(by(df, df$Month, FUN=function(d) weighted.mean(d$Variable, w = d$Weighting)))
With the plyr package:
library(plyr)
ddply(df, ~Month, summarize, weighted.mean(Variable, w=Weighting))
With data.table:
library(data.table)
setDT(df)[, weighted.mean(Variable, w = Weighting), Month]
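And, as mentioned above, with dplyr (a sketch, assuming a reasonably recent dplyr, >= 0.7):
library(dplyr)
df %>%
  group_by(Month) %>%
  summarise(wmean = weighted.mean(Variable, w = Weighting))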
In case you don't have plyr, dplyr, or data.table installed and cannot install them for some reason, it is still possible to use aggregate() to compute the monthly weighted average. All you need is the following trick: aggregate over a column of row indices, so that the function receives each group's row indices and can subset both Variable and Weighting itself.
df$row <- 1:nrow(df) # the trick: a column of row indices
aggregate(row~Month, df, function(i) mean(df$Variable[i])) # mean
aggregate(row~Month, df, function(i) weighted.mean(df$Variable[i], df$Weighting[i])) # weighted mean
Here are the outputs:
Mean:
> aggregate(row~Month, df, function(i) mean(df$Variable[i]))
Month row
1 1998-05-01 7.8
2 1998-06-01 9.2
Weighted mean:
> aggregate(row~Month, df, function(i) weighted.mean(df$Variable[i], df$Weighting[i]))
Month row
1 1998-05-01 5.89733
2 1998-06-01 10.33142
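Another base-R route, for what it's worth: since a weighted mean is just sum(w*x)/sum(w), rowsum() can form both group sums directly, with no index trick needed:
with(df, rowsum(Variable * Weighting, Month) / rowsum(Weighting, Month))
#                 [,1]
# 1998-05-01  5.89733
# 1998-06-01 10.33142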
I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
library(tibble)
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
            'rate' = c(7.50, 6.50, 5.54),
            'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol=2, dimnames=list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from the tibble package (Adding a column between two columns in a data.frame) does not work, because it only accepts atomic vectors (hence, with 10 lags I would have to call add_column 10 times). I could use apply on my rate_Lags matrix, but then I lose the dimnames of rate_Lags.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (is there a function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
You may be overlooking the following:
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
cbind(df, rate_Lags)
} else {
cbind(select(df, 1:I), rate_Lags, select(df, (I+1):ncol(df)))
}
#         date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50           NA           NA    92
# 2 02/01/2000 6.50         7.50           NA    94
# 3 03/01/2000 5.54         5.54          7.5    96
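With a recent dplyr (>= 1.0), the same reordering can also be done by name with relocate(); a sketch, assuming the lag columns are named as above:
library(dplyr)
cbind(df, rate_Lags) %>%
  relocate(rate_tMinus1, rate_tMinus2, .after = rate)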
Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M){
nms_DF <- colnames(DF)
nms_M <- colnames(M)
inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
cbind(DF[seq_len(inx)], M, DF[ seq_along(nms_DF)[-seq_len(inx)] ])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, use append to insert 'rate_Lags' at the desired position, and then convert the list back to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with the tidyverse:
library(tidyverse)
rate_Lags %>%
as_tibble %>%
append(unclass(df), ., after = i1) %>%
bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
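A package-free variant of the same idea is to cbind first and then reorder the columns by name, so nothing depends on dplyr or the tidyverse being installed:
i1 <- match('rate', names(df))
out <- cbind(as.data.frame(df), rate_Lags)
out[, c(names(df)[1:i1], colnames(rate_Lags), names(df)[-(1:i1)])]
#         date rate rate_tMinus1 rate_tMinus2 price
# 1 01/01/2000 7.50           NA           NA    92
# 2 02/01/2000 6.50         7.50           NA    94
# 3 03/01/2000 5.54         5.54          7.5    96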
I have a dataframe (sample of the following form):
DateTime Ind1 Ind2 V1 V2 Ac1 Ac2 w1 w2 w3 shift
2016-05-01 00:01:00 U A 5 7 20 100 50 70 200 1
2016-05-01 00:01:20 U A 5 7 20 109 35 77 140 1
2016-05-01 00:01:40 U A 5 7 40 120 55 97 160 1
...
2016-05-01 00:08:20 U A 5 7 15 157 70 70 204 2
...
2016-05-02 00:08:20 U A 5 7 28 147 65 90 240 2
...
2016-05-02 00:20:00 U A 5 7 35 210 45 100 167 3
I need a new dataframe where some statistics (e.g. mean, standard deviation) of the columns V1 to w3 are listed for each date-and-shift combination, something similar to the following:
Date shift Ind1 Ind2 avgV1 sdV1 avgV2 sdV2 avgAC1 ....
2016-05-01 1 U A 5.3 2.9 7.8 4.5 108 .....
2016-05-01 2 U A 6.7 3.5 8.9 5.0 99 .....
SOLUTION TRIED:
I can do the following steps.
1) Extract the date from DateTime:
df$Date <- format(as.POSIXct(df$DateTime, format="%Y-%m-%d %H:%M:%S"), format="%Y-%m-%d")
2) Label the data by date and shift:
df$DateShift <- paste(df$Date, df$shift)
3) For each subset, calculate some statistics on a column:
tmp_df <- data.frame(levels(as.factor(df$DateShift)))
avgV1 <- tapply(df$V1, df$DateShift, FUN=mean)
sdV1 <- tapply(df$V1, df$DateShift, FUN=sd)
avgV2 <- tapply(df$V2, df$DateShift, FUN=mean)
....
However, I have more than 50 columns in the original dataframe, with different types of names (not as simple as in the example above).
Moreover, the statistics that I want to compute may vary (say, calculation of max and min, or some other user-defined function).
So I don't want to code by hand all the different combinations of columns and types of statistic (mean, standard deviation, etc.).
What is the way to automate this?
I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.
library(doBy)
df_avg <- summaryBy(. ~ Date + shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)
This will give a dataframe with V1.mean, V1.median, and so on.
The . ~ means "summarize all numeric variables". If you want to keep information from some factor columns in the dataframe, pass them via the id argument, e.g. id = ~somefac+somefac2.
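A minimal end-to-end sketch on the question's data (note that the Date column has to be derived first, as in step 1 of the question):
library(doBy)
df$Date <- as.Date(df$DateTime)  # derive the date from the timestamp
df_avg <- summaryBy(. ~ Date + shift, FUN = c(mean, sd), data = df, na.rm = TRUE)
# one row per Date/shift combination, with columns V1.mean, V1.sd, V2.mean, ...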
library(dplyr)
df %>%
mutate(Date = as.Date(DateTime)) %>%
group_by(Date, shift) %>%
summarise_each(funs(mean))
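Note that summarise_each() and funs() are deprecated in current dplyr; a modern equivalent that also adds the standard deviations the question asks for (assuming dplyr >= 1.0):
library(dplyr)
df %>%
  mutate(Date = as.Date(DateTime)) %>%
  group_by(Date, shift) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)), .groups = "drop")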
Edit: This question was originally titled "Long to wide data reshaping in R".
I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:
ID Obs 1 Obs 2 Obs 3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36
And what I want to end up with will look like this:
ID Obs 1 mean Obs 1 std dev Obs 2 mean Obs 2 std dev
1 x x x x
2 x x x x
3 x x x x
And so forth. What I'm unsure of is whether I need additional information in my long-form data to do this. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way to reshape the data correctly so I can start in on that process.
Thanks very much for any help.
This is an aggregation problem, not a reshaping problem as the question originally suggested -- we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In the base of R it can be done using aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame for which some columns are matrices. Although that may initially seem strange, in fact it simplifies access. ag has the same number of columns as the input DF. Its first column, ag[[1]], is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. If one wishes to access the jth statistic of the ith observation column, it is therefore ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
On the other hand, suppose there are k statistic columns for each observation in the input (k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]], or equivalently ag[-1][[k*(i-1)+j]].
For example, compare the simplicity of the first expression vs. the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)
There are a few different ways to go about it. reshape2 is a helpful package, but personally I like using data.table. Below is a step-by-step walkthrough.
If myDF is your data.frame:
library(data.table)
DT <- data.table(myDF)
DT
# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]
# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]
# If you would like to round the values:
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]
# If we want to add names to the columns
wide <- setnames(
  DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
  c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD")))
)
wide
   ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1:  1     35.333    8.021     36.333   10.214       33.0    9.644
2:  2     29.750    3.594     32.250    4.193       30.5    5.916
3:  3     41.500    4.950     43.500    4.950       39.0    4.243
Also, this may or may not be helpful:
> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
Obs.1 Obs.2 Obs.3
Min. 25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median 33.00 32.00 36.00
Mean 34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max. 45.00 48.00 42.00
Here is probably the simplest way to go about it (with a reproducible example):
library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))
ID Obs_1_mean Obs_1_std_dev Obs_2_mean Obs_2_std_dev
1 1 -0.13994642 0.8258445 -0.15186380 0.4251405
2 2 1.49982393 0.2282299 0.50816036 0.5812907
3 3 -0.09269806 0.6115075 -0.01943867 1.3348792
EDIT: The following approach saves you a lot of typing when dealing with many columns.
ddply(df, .(ID), colwise(mean))
ID Obs_1 Obs_2 Obs_3
1 1 -0.3748831 0.1787371 1.0749142
2 2 -1.0363973 0.0157575 -0.8826969
3 3 1.0721708 -1.1339571 -0.5983944
ddply(df, .(ID), colwise(sd))
ID Obs_1 Obs_2 Obs_3
1 1 0.8732498 0.4853133 0.5945867
2 2 0.2978193 1.0451626 0.5235572
3 3 0.4796820 0.7563216 1.4404602
Adding the dplyr solution:
set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))
# ID Obs_1_mean Obs_2_mean Obs_3_mean Obs_1_sd Obs_2_sd Obs_3_sd
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
# 2 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
# 3 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because it uses lapply instead of sapply):
library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
# ID mean.Obs_1 mean.Obs_2 mean.Obs_3 sd.Obs_1 sd.Obs_2 sd.Obs_3
#1: 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
#2: 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
#3: 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692
The updated dplyr solution, as of 2020. summarise_each() and funs() now warn:
1: summarise_each_() is deprecated as of dplyr 0.7.0.
and
2: funs() is deprecated as of dplyr 0.8.0.
library(dplyr)
ag.dplyr <- DF %>%
  group_by(ID) %>%
  summarise(across(.cols = everything(), list(mean = mean, sd = sd)))
There is a helpful function in the psych package.
You should try the following implementation:
psych::describeBy(data$dependentvariable, group = data$groupingvariable)
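Applied to the reproducible DF from Note 2 above, a minimal sketch; describeBy() prints one descriptive table (n, mean, sd, median, ...) per ID:
library(psych)
describeBy(DF[, -1], group = DF$ID)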
I have a dataframe where I have values, and for each value I have the counts associated with that value. So, plotting counts against values gives me the histogram. I have three types, a, b, and c.
value counts type
0 139648267 a
1 34945930 a
2 5396163 a
3 1400683 a
4 485924 a
5 204631 a
6 98599 a
7 53056 a
8 30929 a
9 19556 a
10 12873 a
11 8780 a
12 6200 a
13 4525 a
14 3267 a
15 2489 a
16 1943 a
17 1588 a
... ... ...
How do I get from this to a CDF?
So far, my approach is super inefficient: I first write a function that sums up the counts up to that value:
get_cumulative <- function(x) {
  result <- numeric(nrow(x))
  for (i in seq_along(result)) {
    result[i] <- sum(x$counts[x$value <= x$value[i]])  # total counts up to this value
  }
  x$cumulative <- result
  x
}
Then I wrap this in a ddply that splits by the type. This is obviously not the best way, and I'd love any suggestions on how to proceed.
You can use ave and cumsum (assuming your data is in df and sorted by value):
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
Here is a toy example:
df <- data.frame(counts=sample(1:100, 10), type=rep(letters[1:2], each=5))
transform(df, cdf=ave(counts, type, FUN=function(x) cumsum(x) / sum(x)))
that produces:
counts type cdf
1 55 a 0.2750000
2 61 a 0.5800000
3 27 a 0.7150000
4 20 a 0.8150000
5 37 a 1.0000000
6 45 b 0.1836735
7 79 b 0.5061224
8 12 b 0.5551020
9 63 b 0.8122449
10 46 b 1.0000000
If your data is in a data.frame DF, then the following should do:
do.call(rbind, lapply(split(DF, DF$type), FUN=cumsum))
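Note that this cumulatively sums every column (including value itself) and does not normalize to [0, 1]. A variant of the same split()/lapply() idea that yields a proper CDF column, assuming rows are sorted by value within each type:
do.call(rbind, lapply(split(DF, DF$type), function(d) {
  d$cdf <- cumsum(d$counts) / sum(d$counts)  # running share of the total counts
  d
}))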
The HistogramTools package on CRAN has several functions for converting between Histograms and CDFs, calculating information loss or error margins, and plotting functions to help with this.
If you have a histogram h then calculating the Empirical CDF of the underlying dataset is as simple as:
library(HistogramTools)
h <- hist(runif(100), plot=FALSE)
plot(HistToEcdf(h))
If you first need to convert your input data of breaks and counts into an R Histogram object, then see the PreBinnedHistogram function first.
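A sketch of that route on one type from the question's data; it assumes PreBinnedHistogram(breaks, counts) as documented in the package, and the unit-width breaks are an assumption based on the integer values shown:
library(HistogramTools)
a <- df[df$type == "a", ]  # one type at a time
h <- PreBinnedHistogram(breaks = c(a$value, max(a$value) + 1), counts = a$counts)  # breaks assumed unit-width
plot(HistToEcdf(h))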
I am new-ish to R and have what should be a simple enough question to answer; any help would be greatly appreciated.
The situation is I have a tab delimited data matrix (data matrix.txt) like below with group information included on the last column.
sampleA sampleB sampleC Group
obs11 23.2 52.5 -86.3 1
obs12 -86.3 32.5 -84.7 1
obs41 -76.2 35.8 -16.3 2
obs74 23.2 32.5 -86.8 2
obs82 -86.2 52.8 -83.2 3
obs38 -36.2 59.5 -74.3 3
I would like to replace the values in each group with the average value for that group.
How can a group average, rather than a row or column average, be calculated in R?
And how can I use this value to replace the original values? Is the replace() function usable in this situation, or is that only for replacing two known values?
Thanks in advance
The ddply function from the plyr package should do the trick.
dat <- as.data.frame(matrix(runif(80),ncol=4))
dat$group <- sample(letters[1:4],size=20,replace=T)
head(dat)
library(plyr)
ddply(.data = dat, .variables = .(group), colwise(mean))
Result:
group V1 V2 V3 V4
1 a 0.4741673 0.7669612 0.5043857 0.5039938
2 b 0.3648794 0.5776748 0.4033758 0.5748613
3 c 0.1450466 0.5399372 0.2440170 0.5124578
4 d 0.4249183 0.3252093 0.5467726 0.4416924
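To actually overwrite the original values with their group means (the "replace" part of the question), base R's ave() returns each group's mean replicated to every row, so no replace() call is needed; a minimal sketch on the same dat:
dat[1:4] <- lapply(dat[1:4], function(x) ave(x, dat$group))  # ave defaults to FUN = mean
head(dat)  # every value is now its group's column mean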