Aggregate dynamic columns in R

I'd like to create an aggregation without knowing either the column names or their positions, i.e. I retrieve the names dynamically.
Furthermore, I'm able to use either data.frame or data.table, as I'm forced to use R version 3.1.1.
Is there an option like do.call, as explained in this answer for order?
Trying a similar do.call with aggregate leads to an error:
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
group_by <- c('group1','group2')
test <- do.call(aggregate.data.frame, c(by = group_by, x = smalldat, FUN = mean))
#output
#Error in is.data.frame(x) : Argument "x" missing (no default)
Or is there an option with data.table?
# generate a small dataset
set.seed(1234)
smalldat <- data.frame(group1 = rep(1:2, each = 5),
group2 = rep(c('a','b'), times = 5),
x = rnorm(10),
y = rnorm(10))
# convert the data.frame to a data.table
library(data.table)
smalldat <- data.table(smalldat)
# add the aggregated variable to the raw data
smalldat[, aggGroup1 := mean(x), by = group1]
Thanks for advice!

Your do.call attempt fails because c() splices the pieces into individually named arguments (by1, by2, x.group1, x.x, ...), so aggregate.data.frame never receives an x argument. A simpler route: aggregate can take a formula, and you can build a formula from a string.
form = as.formula(paste(". ~", paste(group_by, collapse = " + ")))
aggregate(form, data = smalldat, FUN = mean)
# group1 group2 x y
# 1 1 a 0.1021667 -0.09798418
# 2 2 a -0.5695960 -0.67409059
# 3 1 b -1.0341342 -0.46696381
# 4 2 b -0.3102046 0.46478476
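For completeness: the do.call idea does work once the arguments are passed as a single list instead of being spliced with c(), and, since the question also asks about data.table, its by argument accepts a character vector of column names directly. A minimal sketch (my own, not from the original answer), assuming the same smalldat and group_by as above:
# pick the value columns dynamically as everything that isn't a group column
value_cols <- setdiff(names(smalldat), group_by)
# do.call with a proper argument list keeps smalldat intact as the x argument
do.call(aggregate, list(x = smalldat[value_cols], by = smalldat[group_by], FUN = mean))

# data.table alternative: by takes the character vector of group columns directly
library(data.table)
smallDT <- as.data.table(smalldat)
smallDT[, lapply(.SD, mean), by = group_by, .SDcols = value_cols]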

Related

using t.test within data.table on multiple columns

I am trying to run a t.test() on multiple columns of data within 'j' in a data.table. I've found a way that works, but isn't very elegant, and I feel like there's probably a more concise way using .SDcols, but haven't had any luck looking through here, or the data.table vignette. If this has been asked previously I apologize and please point me in the right direction.
My data.table has essentially the following format:
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
When I want to run a t.test comparing the values of y in group One and group Two, it's very straightforward:
DT[,t.test(y~group)]
If I want to get an output for both y and z the following works, but is clunky and inelegant. And with my actual data, I'm trying to do this over many columns so it would be more time consuming to type out each iteration I would like to run.
DT[,.(t.test(y~group), t.test(z~group))]
In the data.table vignette, using a function over a specific subset of columns is achieved by
DT[,lapply(.SD, mean), .SDcols = c('y', 'z')]
However, replacing mean with t.test yields a one-sample t.test, while I'm trying to get a two-sample t.test. I've tried:
DT[,lapply(.SD, t.test, formula = .SDcols ~ group, data = DT), .SDcols = c('y', 'z')]
But this gives me a comparison between y and z, not both the comparisons of y~group and z~group.
I've tried several versions of lapply with a custom function to get the output I want, but I won't make anyone read through my walls of unsuccessful code. Needless to say I have been unable to get that to work.
Question:
Is there a way via lapply() or function() or a way currently unknown to me, to get t.test to run over multiple columns of data within 'j' in a data.table?
Thanks in advance for your help,
Chris
To pull together the parts of the answer and to rearrange to put the name in the first column (if desired for nicer printing):
library(data.table)
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
result <-
  DT[, lapply(.SD, function(x) t.test(x ~ group)), .SDcols = y:z][
    , ttname := names(t.test(1:5))][  # add the t.test component names
    , .(ttname, y, z)]                # put the names in the first column
result
# ttname y z
# 1: statistic 0.1391646 0.1295093
# 2: parameter 3.468876 3.559917
# 3: p.value 0.8970165 0.9039359
# 4: conf.int -99.61786,109.47358 -8.209637, 8.972439
# 5: estimate 110.7286,105.8008 11.15414,10.77274
# 6: null.value 0 0
# 7: stderr 35.41031 2.94497
# 8: alternative two.sided two.sided
# 9: method Welch Two Sample t-test Welch Two Sample t-test
# 10: data.name x by group x by group
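If only one component of each test is wanted, the same .SD pattern from the accepted answer can return it directly; a small sketch (my own variation), assuming the DT above:
# two-sample t.test per column, keeping only the p-values
DT[, lapply(.SD, function(x) t.test(x ~ group)$p.value), .SDcols = c('y', 'z')]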
Here is my solution, wrapped as a function. In the accepted answer, I didn't like that the test output was rows and the variables columns; I like it the other way around, which I think makes it easier to read.
I also added an argument for rounding, and one that by default prints only the most important info, the p-value and test statistic. The function requires data.table and purrr. The inputs for the group variable and the variables to test are character, e.g. dt_ttest(dtx, 'varname', c('z','y')).
library(data.table)
library(purrr)

dt_ttest <- function(dtx, grp, thecols, decimals = 3, small = TRUE, ...) {
  # one two-sample t.test per column in thecols, split by the grouping column grp
  x1 <- dtx[, map(.SD, ~ t.test(.x ~ get(grp))), .SDcols = thecols]
  # transpose so the tested variables are rows and the t.test components are columns
  x2 <- t(x1) %>% data.table()
  setnames(x2, names(t.test(1:2)))
  x2 <- x2[, var := thecols][, !'data.name']
  tcols <- c('p.value', 'statistic', 'stderr', 'null.value', 'parameter', 'method', 'alternative')
  x2[, (tcols) := map(.SD, unlist), .SDcols = tcols]
  # round the numeric columns
  thecols2 <- keep(x2, is.numeric) %>% names()
  x2[, (thecols2) := map(.SD, ~ round(.x, decimals)), .SDcols = thecols2]
  # go one level deeper to round the two remaining list columns
  thecols3 <- c('conf.int', 'estimate')
  x2[, (thecols3) := modify_depth(.SD, 2, ~ round(.x, decimals)), .SDcols = thecols3]
  # set the column order
  setcolorder(x2, c('var', 'p.value', 'statistic', 'stderr', 'conf.int', 'estimate',
                    'parameter', 'method', 'alternative'))
  if (small) x2[, .(var, p.value, statistic)] else x2[]
}
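A quick, hypothetical call against the DT defined in the question (values will vary with the random draw, since the example data are not seeded):
dt_ttest(DT, 'group', c('y', 'z'))                 # compact: var, p.value, statistic
dt_ttest(DT, 'group', c('y', 'z'), small = FALSE)  # all t.test components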

Iterating over columns in a data frame in order to replace values from matching data in list of data frames

I'm interested in building a function making use of apply/sapply or Map that would iterate over the columns in dta and replace the values in each column with matched values from the data frame available in a nameless list of data frames, where the list item index corresponds to the column number of the dta data frame.
Example
Given objects:
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE)
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D")
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q")
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January")
)
)
Desired results
When applied on dta, the function should return a data.frame corresponding to the extract below:
unitA unitB unitC someValue
Letter B small t Apr 912876
Letter B small q March 293604
C s Apr 459066
Letter D p March 332395
Letter A small q March 650871
Letter D small q Apr 258017
Letter D p January 478546
C small q Feb 766311
C small t March 84247
Letter A small q March 875322
Letter A r Feb 339073
Letter A r Apr 839441
C r Feb 346684
Letter B p January 333775
Letter D small t January 476352
(...)
Existing approach
replaceLbls <- function(dataSet, lstDict) {
  sapply(seq_along(dataSet), function(i) {
    # Take the corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Replace values in the selected column:
    # where V1 matches, push the corresponding values from V2
    dataSet[, i][match(dataSet[, i], dtaDict[, 1])] <- dtaDict[, 2][match(dtaDict[, 1], dataSet[, i])]
  })
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
Of course the approach proposed above does not work as it will try to use NA in assignments; but it summarises what I want to achieve:
Error in x[...] <- m : NAs are not allowed in subscripted assignments
In addition: Warning message:
In `[<-.factor`(`*tmp*`, match(dataSet[, i], dtaDict[, 1]), value = c(NA, :
  invalid factor level, NA generated
Additional remarks
Source data set
The key characteristics of the data are:
The list is nameless so subsetting has to be done by item numbers not by names
Item number correspond to column numbers
There is no full match between metadata data frames available in the list of data frames and unit columns available in the data
The someValue column also should be iterated over as it may contain labels that should be replaced
Solution
I'm not interested in dplyr/data.table/sqldf-based solutions.
I'm not interested in nested for-loops
I have a hacky solution that doesn't use for loops or other packages. I needed to convert the factors to characters for it to work but you might be able to improve my solution.
The solution works by only matching values that are found in your lstMeta by creating a vector of indices where matches are found. I also used the <<- operator. If you're better at R than me, you can probably improve this.
set.seed(1)
size <- 20
# Data set
dta <-
data.frame(
unitA = sample(LETTERS[1:4], size = size, replace = TRUE),
unitB = sample(letters[16:20], size = size, replace = TRUE),
unitC = sample(month.abb[1:4], size = size, replace = TRUE),
someValue = sample(1:1e6, size = size, replace = TRUE),
stringsAsFactors = F
)
# Meta data
lstMeta <- list(
# Unit A definitions
data.frame(
V1 = c("A", "B", "D"),
V2 = c("Letter A", "Letter B", "Letter D"),
stringsAsFactors = F
),
# Unit B definitions
data.frame(
V1 = c("t", "q"),
V2 = c("small t", "small q"),
stringsAsFactors = F
),
# Unit C definitions
data.frame(
V1 = c("Mar", "Jan"),
V2 = c("March", "January"),
stringsAsFactors = F
)
)
replaceLbls <- function(dataSet, lstDict) {
  sapply(1:3, function(i) {
    # Take the corresponding metadata data frame
    dtaDict <- lstDict[[i]]
    # Replace values in the selected column:
    # where V1 matches, push the corresponding values from V2
    myUniques <- which(dataSet[, i] %in% dtaDict[, 1])
    dataSet[myUniques, i] <<- dtaDict[, 2][match(dataSet[myUniques, i], dtaDict[, 1])]
  })
  return(dataSet)
}
# Testing -----------------------------------------------------------------
replaceLbls(dataSet = dta, lstDict = lstMeta)
The following approach works well for the example data:
replaceLbls <- function(dataSet, lstDict) {
  dataSet[seq_along(lstDict)] <- Map(function(x, lst) {
    x <- as.character(x)
    idx <- match(x, as.character(lst$V1))
    replace(x, !is.na(idx), as.character(lst$V2)[na.omit(idx)])
  }, dataSet[seq_along(lstDict)], lstDict)
  dataSet
}
head(replaceLbls(dta, lstMeta))
# unitA unitB unitC someValue
# 1 Letter B small t Apr 912876
# 2 Letter B small q March 293604
# 3 C s Apr 459066
# 4 Letter D p March 332395
# 5 Letter A small q March 650871
# 6 Letter D small q Apr 258017
This assumes that you want to apply the changes to the first X columns of the data, where X is the length of the meta-list. You might want to include an extra step to convert back to factor, since this approach converts the adjusted columns to character class.
Another remark on factors: you could potentially speed up performance by working only on the levels of any factor variables instead of the whole column. The general process would be similar but requires a few more steps to check classes etc.
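That levels-based idea could look roughly like the sketch below. This is an untested illustration, assuming the unit columns are factors (as in the question's original dta, built without stringsAsFactors = F) and the same lstMeta; replaceLblsFct is a hypothetical name:
replaceLblsFct <- function(dataSet, lstDict) {
  dataSet[seq_along(lstDict)] <- Map(function(f, lst) {
    # recode the factor levels instead of the full column:
    # each distinct value is matched against V1 exactly once
    lev <- levels(f)
    idx <- match(lev, as.character(lst$V1))
    levels(f) <- ifelse(is.na(idx), lev, as.character(lst$V2)[idx])
    f
  }, dataSet[seq_along(lstDict)], lstDict)
  dataSet
}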
You can also try this:
mapr <- function(t, meta) {
  ind <- match(t, meta$V1)
  if (!is.na(ind)) { return(meta$V2[ind]) }
  else { return(t) }
}
Then using sapply:
dta <- as.data.frame(cbind(
  sapply(1:3, function(t, df, meta) sapply(df[, t], mapr, meta[[t]]),
         dta, lstMeta, simplify = TRUE),
  dta[, 4]))
A couple of mapplys can do the job:
f1 <- function(df, lst){
  # first pass: translate via match(); values without a match become NA
  d1 <- setNames(data.frame(mapply(function(x, y) x$V2[match(y, x$V1)], lst, df[1:3]),
                            df$someValue, stringsAsFactors = FALSE),
                 names(df))
  # second pass: fill the NAs back in from the original values
  as.data.frame(mapply(function(x, y) replace(x, is.na(x), y[is.na(x)]), d1, df))
}
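A quick check of f1 against the example objects (the exact values depend on the seed state when dta was generated):
head(f1(dta, lstMeta))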

Calculate Percentage and other functions using data.table

I want to apply aggregate functions and percentage function to column. I found threads that discuss aggregation (Calculating multiple aggregations with lapply(.SD, ...) in data.table R package) and threads that discuss percentage (How to obtain percentages per value for the keys in R using data.table? and Use data.table to calculate the percentage of occurrence depending on the category in another column), but not both.
Please note that I am looking for data.table-based methods; dplyr wouldn't work on the actual data set.
Here's the code to generate sample data:
set.seed(10)
IData <- data.frame(let = sample(x = LETTERS, size = 10000, replace = TRUE),
                    numbers1 = sample(x = c(1:20000), size = 10000),
                    numbers2 = sample(x = c(1:20000), size = 10000))
IData$let<-as.character(IData$let)
data.table::setDT(IData)
Here's the code to generate output using dplyr
Output <- IData %>%
  dplyr::group_by(let) %>%
  dplyr::summarise(numbers1.mean = as.double(mean(numbers1)),
                   numbers1.median = as.double(median(numbers1)),
                   numbers2.mean = as.double(mean(numbers2)),
                   sum.numbers1.n = sum(numbers1)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(perc.numbers1 = sum.numbers1.n/sum(sum.numbers1.n)) %>%
  dplyr::select(let, numbers1.mean, numbers1.median, numbers2.mean, perc.numbers1)
Sample Output (header)
If I run head(Output), I get:
let numbers1.mean numbers1.median numbers2.mean perc.numbers1
<chr> <dbl> <dbl> <dbl> <dbl>
N 10320.951 10473.0 9374.435 0.03567927
H 9683.590 9256.5 9328.035 0.03648391
L 10223.322 10226.0 9806.210 0.04005400
S 9922.486 9618.0 10233.849 0.03678742
C 9592.620 9226.0 9791.221 0.03517997
F 10323.867 10382.0 10036.561 0.03962035
Here's what I tried using data.table (unsuccessfully):
IData[, as.list(unlist(lapply(.SD, function(x)
          list(mean = mean(x), median = median(x), sum = sum(x))))),
      by = let, .SDcols = c("numbers1", "numbers2")][
      , .(Perc = numbers1.sum/sum(numbers1.sum)), by = let]
I have 2 Questions:
a) How can I solve this using data.table?
b) I have seen that the above threads use prop.table. Can someone please show me how to use this function here?
I would sincerely appreciate any guidance.
We can use a similar approach with data.table:
res <- IData[, .(numbers1.mean = mean(numbers1),
numbers1.median = median(numbers1),
numbers2.mean=mean(numbers2),
sum.numbers1.n = sum(numbers1)), let
][, perc.numbers1 := sum.numbers1.n/sum(sum.numbers1.n)
][, c("let", "numbers1.mean", "numbers1.median",
"numbers2.mean", "perc.numbers1"), with = FALSE]
head(res)
# let numbers1.mean numbers1.median numbers2.mean perc.numbers1
#1: N 10320.951 10473.0 9374.435 0.03567927
#2: H 9683.590 9256.5 9328.035 0.03648391
#3: L 10223.322 10226.0 9806.210 0.04005400
#4: S 9922.486 9618.0 10233.849 0.03678742
#5: C 9592.620 9226.0 9791.221 0.03517997
#6: F 10323.867 10382.0 10036.561 0.03962035
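On question b): for a plain vector, prop.table(x) is simply x/sum(x), so it can stand in for the manual division in the chain above. A small sketch under that assumption; the resulting perc.numbers1 matches the one computed manually:
res2 <- IData[, .(numbers1.mean = mean(numbers1),
                  numbers1.median = median(numbers1),
                  numbers2.mean = mean(numbers2),
                  sum.numbers1.n = sum(numbers1)), by = let
              ][, perc.numbers1 := prop.table(sum.numbers1.n)][]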

Drop columns when splitting data frame in R

I am trying to split a data table by a column; however, once I get the list of data tables, each one still contains the column the data table was split by. How would I drop this column once the split is complete? Or, preferably, is there a way to drop multiple columns?
This is my code:
library(data.table)
x <- rnorm(10, mean = 5, sd = 2)
y <- rnorm(10, mean = 5, sd = 2)
z <- sample(5, 10, replace = TRUE)
dt <- data.table(x, y, z)
split(dt, dt$z)
The resulting data table subsets look like this:
$`1`
x y z
1: 6.179790 5.776683 1
2: 5.725441 4.896294 1
3: 8.690388 5.394973 1
$`2`
x y z
1: 5.768285 3.951733 2
2: 4.572454 5.487236 2
$`3`
x y z
1: 5.183101 8.328322 3
2: 2.830511 3.526044 3
$`4`
x y z
1: 5.043010 5.566391 4
2: 5.744546 2.780889 4
$`5`
x y z
1: 6.771102 0.09301977 5
Thanks
Splitting a data.table is really not worthwhile unless you have some fancy parallelization step to follow. And even then, you might be better off sticking with a single table.
That said, I think you want
split( dt[, !"z"], dt$z )
# or more generally
mysplitDT <- function(x, bycols)
split( x[, !..bycols], x[, ..bycols] )
mysplitDT(dt, "z")
You would run into the same problem if you had a data.frame:
df = data.frame(dt)
split( df[-which(names(df)=="z")], df$z )
The first thing that came to mind was to iterate through the list and drop the z column.
lapply(split(dt, dt$z), function(d) { d$z <- NULL; d })
And I just noticed that you use the data.table package, so there is probably a better, data.table way of achieving your desired result.
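For what it's worth, newer versions of data.table (1.9.7 or later, if I remember correctly) ship a split.data.table method that can drop the grouping column in one step:
# split by column name and drop it from the resulting pieces
split(dt, by = "z", keep.by = FALSE)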

Aggregate an entire data frame with Weighted Mean

I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr, but only for one variable; I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
# grab the column names we want, generalized for v1:v-whatever
cols <- colnames(dat)[-c(1, 2)]
# get the weighted mean by date for each column
for (n in cols) {
  dat[, (n) := weighted.mean(get(n), nwords), by = date]
}
# keep only the unique dates and weighted means
wms <- unique(dat[, nwords := NULL])
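The loop plus unique() can also be collapsed into a single grouped lapply over .SD; a sketch assuming the same objects, run against the original dat before the loop above modifies it by reference:
# weighted mean of every value column by date, in one step
wms2 <- dat[, lapply(.SD, weighted.mean, w = nwords), by = date, .SDcols = cols]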
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case it is 'date')
y <- c(rep("A", 7), rep("B", 3))
# weighted mean of x within each group, using ww as weights
by(data = mat, y, function(d) weighted.mean(d[, "x"], d[, "ww"]))
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat, y), "y",
            function(d) data.frame(wm = weighted.mean(d$x, d$ww)))
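And to cover all columns at once without extra packages, the same idea can be spelled out with by() and a small anonymous function; a sketch assuming the dat data frame from the data.table answer above, before its in-place edits:
# weighted mean of every value column by date, bound back into one data frame
do.call(rbind, by(dat, dat$date, function(d)
  data.frame(date = d$date[1],
             lapply(d[setdiff(names(d), c("date", "nwords"))],
                    weighted.mean, w = d$nwords))))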
