Aggregate an entire data frame with Weighted Mean - r

I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr, but only for one variable. I want to aggregate all my variables that way.

You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
# grab the value column names, generalized for v1:vWhatever
# (named 'cols' rather than 'c' to avoid shadowing base::c)
cols = colnames(dat)[-c(1,2)]
# overwrite each value column with its weighted mean by date
# (wrapping the name in parentheses replaces the deprecated with = FALSE)
for(n in cols){
dat[,
(n) := weighted.mean(get(n), nwords),
by = date]
}
#drop the weights, then keep one row per date
wms = unique(dat[,nwords:=NULL])

Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
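# pass the weights explicitly; a bare weighted.mean would ignore the ww column
by(data.frame(mat), y, function(d) weighted.mean(d$x, d$ww))
If you want the results in a data frame, I suggest the plyr package (the group variable has to be a column of the data frame):
plyr::ddply(data.frame(mat, y), "y", function(d) data.frame(wmean = weighted.mean(d$x, d$ww)))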
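The error in the question comes from passing w = dat$nWords: aggregate hands weighted.mean one group's values at a time, while the weights vector still covers every row. Below is a minimal base R sketch that keeps values and weights together within each group. The toy data and the names value_cols, pieces and result are illustrative assumptions, not part of the original post.

```r
# Toy data shaped like the question's: a date, a weight column, value columns
set.seed(1)
dat <- data.frame(
  date   = rep(c("2012-01-01", "2013-01-01", "2014-01-01"), each = 3),
  nWords = 1:9,
  v1     = rnorm(9),
  v2     = rnorm(9)
)

# Every column except the grouping and weight columns gets averaged
value_cols <- setdiff(names(dat), c("date", "nWords"))

# Split into one data frame per date so each group carries its own weights
pieces <- split(dat, dat$date)

# Weighted mean of every value column within each group
wm <- do.call(rbind, lapply(pieces, function(d) {
  sapply(d[value_cols], weighted.mean, w = d$nWords)
}))

# One row per date, one column per value variable
result <- data.frame(date = rownames(wm), wm, row.names = NULL)
result
```

Because the weights are taken from the same piece as the values, their lengths always match, which is what the original aggregate call could not guarantee.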

Related

How to identify and remove outliers in a data.frame using R?

I have a dataframe that has multiple outliers. I suspect that these outliers have produced different results than expected.
I tried to use this tip but it didn't work as I still have very different values: https://www.r-bloggers.com/2020/01/how-to-remove-outliers-in-r/
I tried the solution with the rstatix package, but I can't remove the outliers from my data.frame
library(rstatix)
library(dplyr)
df <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50))
View(df)
out_df<-identify_outliers(df$score)#identify outliers
df2<-df#copy df
df2<- df2[-which(df2$score %in% out_df),]#remove outliers from df2
View(df2)
identify_outliers expects a data.frame as input, i.e. the usage is
identify_outliers(data, ..., variable = NULL)
where
... - One unquoted expressions (or variable name). Used to select a variable of interest. Alternative to the argument variable.
So pass the data frame plus the column, then drop the rows whose score was flagged:
df2 <- subset(df, !score %in% identify_outliers(df, "score")$score)
A rule of thumb is that data points above Q3 + 1.5 x IQR or below Q1 - 1.5 x IQR are considered outliers.
Therefore you just have to identify them and remove them. I don't know how to do it with rstatix, but in base R it can be achieved as in the example below:
# Generate a demo data
set.seed(123)
demo.data <- data.frame(
sample = 1:20,
score = c(rnorm(19, mean = 5, sd = 2), 50),
gender = rep(c("Male", "Female"), each = 10)
)
#identify outliers
outliers <- which(demo.data$score > quantile(demo.data$score)[4] + 1.5*IQR(demo.data$score) | demo.data$score < quantile(demo.data$score)[2] - 1.5*IQR(demo.data$score))
# remove them from your dataframe
df2 = demo.data[-outliers,]
You can also wrap this in a function that returns the indices of the outliers:
get_outliers = function(x){
which(x > quantile(x)[4] + 1.5*IQR(x) | x < quantile(x)[2] - 1.5*IQR(x))
}
outliers <- get_outliers(demo.data$score)
df2 = demo.data[-outliers,]
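One caveat with negative indexing: if no outliers are found, which() returns an empty vector and demo.data[-outliers, ] then returns zero rows rather than the full data. A hedged variant that returns a logical mask avoids this. The helper name is_outlier and its k argument are illustrative, and the sketch assumes the column has no missing values.

```r
# Flag values outside the Q1 - k*IQR / Q3 + k*IQR fences (k = 1.5 by convention)
is_outlier <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75))
  fence <- k * (q[[2]] - q[[1]])
  x < q[[1]] - fence | x > q[[2]] + fence
}

set.seed(123)
demo.data <- data.frame(
  sample = 1:20,
  score = c(rnorm(19, mean = 5, sd = 2), 50)
)

# Keep only the rows that are not flagged
df2 <- demo.data[!is_outlier(demo.data$score), ]

# Data without outliers is returned unchanged by the mask ...
clean <- data.frame(score = c(4, 5, 6, 5, 4))
kept <- clean[!is_outlier(clean$score), , drop = FALSE]

# ... whereas negative indexing with an empty index drops every row
dropped <- clean[-which(is_outlier(clean$score)), , drop = FALSE]
```

Here kept still has all five rows, while dropped has none.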

r aggregate dynamic columns

I'd like to create an aggregation without knowing either the column names or their positions, i.e. I retrieve the names dynamically.
Also, I can only use data.frame or data.table, as I'm forced to use R version 3.1.1.
Is there an option like do.call, as explained in this answer for 'order'?
Trying a similar do.call with 'aggregate' leads to an error:
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
group_by <- c('group1','group2')
test <- do.call( aggregate.data.frame , c(by=group_by, x=smalldat, FUN=mean))
#output
#Error in is.data.frame(x) : Argument "x" missing (no default)
or is there an option with data.table?
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
# convert the data.frame to a data.table
library(data.table)
smalldat <- data.table(smalldat)
# add the group mean of x as a new column on the raw data
smalldat[, aggGroup1 := mean(x), by = group1]
Thanks for advice!
aggregate can take a formula, and you can build a formula from a string.
form = as.formula(paste(". ~", paste(group_by, collapse = " + ")))
aggregate(form, data = smalldat, FUN = mean)
# group1 group2 x y
# 1 1 a 0.1021667 -0.09798418
# 2 2 a -0.5695960 -0.67409059
# 3 1 b -1.0341342 -0.46696381
# 4 2 b -0.3102046 0.46478476
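As for why the do.call attempt fails: c() flattens the grouping vector and the data frame into one long list, so no single argument named x survives. A sketch that wraps the arguments in list() instead, assuming the smalldat and group_by objects from the question (value_cols and res are illustrative names):

```r
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
                       group2 = rep(c("a", "b"), times = 5),
                       x = rnorm(10),
                       y = rnorm(10))

# Grouping columns are only known as strings at run time
group_by <- c("group1", "group2")
value_cols <- setdiff(names(smalldat), group_by)

# list() keeps each argument intact; subsetting by name yields the
# data frame of values (x) and the list of grouping columns (by)
res <- do.call(aggregate,
               list(x = smalldat[value_cols],
                    by = smalldat[group_by],
                    FUN = mean))
res
```

The same call can be written directly as aggregate(smalldat[value_cols], by = smalldat[group_by], FUN = mean). do.call only matters when the argument list itself is built programmatically.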

R / data.table() merge on named subset of another data.table

I'm trying to put together several files and need to do a bunch of merges on column names that are created inside a loop. I can do this fine using data.frame() but am having issues using similar code with a data.table():
library(data.table)
df1 <- data.frame(id = 1:20, col1 = runif(20))
df2 <- data.frame(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
df1[,newColName] <- runif(20)
df2 <- merge(df2, df1[,c('id',newColName)], by = 'id', all.x = T) # Works fine
######################
dt1 <- data.table(id = 1:20, col1 = runif(20))
dt2 <- data.table(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
dt1[,newColName] <- runif(20)
dt2 <- merge(dt2, dt1[,c('id',newColName)], by = 'id', all.x = T) # Doesn't work
Any suggestions?
This really has nothing to do with merge(), and everything to do with how the j (i.e. column) index is, by default, interpreted by [.data.table().
You can make the whole statement work by setting with=FALSE, which causes the j index to be interpreted as it would be in a data.frame:
dt2 <- merge(dt2, dt1[,c('id',newColName), with=FALSE], by = 'id', all.x = T)
head(dt2, 3)
# id col1 col5
# 1: 1 0.4954940 0.07779748
# 2: 2 0.1498613 0.12707070
# 3: 3 0.8969374 0.66894157
More precisely, from ?data.table:
with: By default 'with=TRUE' and 'j' is evaluated within the frame
of 'x'. The column names can be used as variables. When
'with=FALSE', 'j' is a vector of names or positions to
select.
Note that this could be avoided by storing the columns in a variable like so:
cols = c('id', newColName)
dt1[ , ..cols]
The .. prefix tells data.table to "look up one level", i.e. to find cols in the calling scope rather than among the columns of dt1.
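Try dt1[,list(id,get(newColName))] in your merge. Note that the unnamed second column comes back as V2, so rename it afterwards if you need the original name.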

How to use ddply to get weighted-mean of class in dataframe?

I'm new to plyr and want to take the weighted mean of values within a class to reshape a dataframe for multiple variables. Using the following code, I know how to do this for one variable, such as x2:
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE),
x=rnorm(20), x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class),function(x) data.frame(weighted.mean(x$x2, x$weights)))
However, I would like the code to create a new data frame for x and x2 (and any amount of variables in the frame). Does anybody know how to do this? Thanks
You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add that line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to original form. Here's an example using colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted-average for every numerical field, excluding the weights themselves.
A data.table answer for fun, which also doesn't require specifying all the variables individually.
library(data.table)
frame <- as.data.table(frame)
keynames <- setdiff(names(frame),c("class","weights"))
frame[, lapply(.SD,weighted.mean,w=weights), by=class, .SDcols=keynames]
Result:
class x x2
1: B 0.1390808 -1.7605032
2: D 1.3585759 -0.1493795
3: C -0.6502627 0.2530720
4: E 2.6657227 -3.7607866

Efficiently transform multiple columns of a data frame

I have a data frame, and I want to transform all columns whose names match a certain pattern (say, take their logs). So in the example below, I want to take the log of X.1 and X.2, but not Y or Z.1.
df <- data.frame(
Y = sample(0:1, 10, replace = TRUE),
X.1 = sample(1:10),
X.2 = sample(1:10),
Z.1 = sample(151:160)
)
# option 1, won't work for dozens of fields
df$X.1 <- log(df$X.1)
df$X.2 <- log(df$X.2)
Is there a good, efficient way to do this when the dataframe is several gigabytes?
If the function accepts and returns a data.frame (as log does), apply it to the column subset directly:
cols <- c("X.1","X.2")
df[cols] <- log(df[cols])
Otherwise you will need to use lapply or a loop over the columns. These solutions will be slower than the solution above, so only use them if you must.
df[cols] <- lapply(df[cols], function(x) c(NA,diff(x)))
for(col in cols) {
# use [[ ]] so diff() receives the column vector, not a one-column data frame
df[[col]] <- c(NA,diff(df[[col]]))
}
vars <- c("X.1", "X.2")
df[vars] <- lapply(df[vars], log)
df <- data.frame(
Y = sample(0:1, 10, replace = TRUE),
X.1 = sample(1:10),
X.2 = sample(1:10),
Z.1 = sample(151:160)
)
df
This assumes that you know which variables need converting in the real dataframe (2 and 3 refer to the 2nd and 3rd variables in df, which are X.1 and X.2). Note that log10 takes base-10 logs; use log for natural logs.
df2=log10(df[c(2:3)])
df2
If the variables are far apart in the dataframe, you can select them with something like c(1,3,6,8:10,13) for the 1st, 3rd, 6th, 8th through 10th, and 13th. This works only for numerical variables.
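Since the question asks for columns that match a certain name, the column vector can also be built from a pattern instead of being typed out. A small sketch, assuming the columns of interest all start with "X." (the regular expression and the names cols and before are illustrative):

```r
set.seed(42)
df <- data.frame(
  Y   = sample(0:1, 10, replace = TRUE),
  X.1 = sample(1:10),
  X.2 = sample(1:10),
  Z.1 = sample(151:160)
)
before <- df  # copy kept only to compare against afterwards

# Select column names that start with "X." (the dot is escaped)
cols <- grep("^X\\.", names(df), value = TRUE)

# Transform just those columns in place; Y and Z.1 are left alone
df[cols] <- lapply(df[cols], log)
head(df, 3)
```

For dozens of fields only the pattern has to change. The assignment df[cols] <- ... replaces the selected columns and leaves the others untouched.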
