How to get concise code by using a loop in R

The data has 5 columns, named A, B, C, D, and Portfolio. I want to run a linear regression model for each portfolio, so I divide the data into subsets, run the regression on each subset, and check the summaries.
The data frame looks like the table below:
A B C D Portfolio
1 ... 11
2 ... 22
3 ... 13
4 ... 11
5 ... 21
6 ... 21
7 ... 23
8 ... 12
9 ... 11
10 ... 12
11 ... 22
...
My current code is below:
Portfolio_11<-subset(df, Portfolio==11)
Portfolio_12<-subset(df, Portfolio==12)
Portfolio_13<-subset(df, Portfolio==13)
Portfolio_21<-subset(df, Portfolio==21)
Portfolio_22<-subset(df, Portfolio==22)
Portfolio_23<-subset(df, Portfolio==23)
Reg_11<-lm(A ~ B + C + D, data=Portfolio_11)
Reg_12<-lm(A ~ B + C + D, data=Portfolio_12)
Reg_13<-lm(A ~ B + C + D, data=Portfolio_13)
Reg_21<-lm(A ~ B + C + D, data=Portfolio_21)
Reg_22<-lm(A ~ B + C + D, data=Portfolio_22)
Reg_23<-lm(A ~ B + C + D, data=Portfolio_23)
summary(Reg_11)
summary(Reg_12)
summary(Reg_13)
summary(Reg_21)
summary(Reg_22)
summary(Reg_23)
I tried to simplify the R code with a loop, like this:
for (i=1:3, j=1:3){
Portfolio_ij<-subset(df, Portfolio==ij)
Reg_ij<-lm(A ~ B + C + D, data=Portfolio_ij)
summary(Reg_ij)
}
However, I am a beginner in R and don't really understand how loops work, so I would like to learn. Thank you so much.

We can use one of the group by functions
library(data.table)
dtSummary <- setDT(df)[, list(list(summary(lm(A ~ B + C + D)))), by = Portfolio]
dtSummary$V1

To make life easier for yourself, use one of the R packages for data munging. Akrun has already mentioned data.table; this is also a classic use case for dplyr's do:
library(dplyr)
df %>%
group_by(Portfolio) %>%
do(smry=summary(lm(A ~ B + C + D, data=.)))

This is a classic case for the split-apply-combine approach, or at least the split-apply part, since it's not clear what you want to do with the output. Here's one way to do that in base R, returning the results in a list called Summaries:
Summaries <- lapply(split(df, df$Portfolio), function(i) summary(lm(A ~ B + C + D, data = i)))
Working from the inside out, you:
1. Use split to break the original data into a list composed of the desired subsets, defined here by the unique values of df$Portfolio.
2. Use lapply to iterate the modeling and model-summarizing functions over the elements of the list created in step 1.
The result is a list (Summaries), the ith element of which corresponds to the ith subset of df$Portfolio. Conveniently, the list elements will have names that correspond to the unique values of df$Portfolio, so you can inspect them with Summaries[["21"]], for example. Or, if you just want to see the results in your terminal or markdown or whatever, drop the Summaries <- part.
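If the goal is a compact table rather than printed summaries, the "combine" step could look roughly like the sketch below. The simulated df (and its seed) is an assumption standing in for the asker's data, and the names models, coef_table and r_squared are made up for illustration.

```r
# Simulated stand-in for the asker's data (an assumption, not the real data)
set.seed(1)
df <- data.frame(A = rnorm(60), B = rnorm(60), C = rnorm(60), D = rnorm(60),
                 Portfolio = rep(c(11, 12, 13, 21, 22, 23), each = 10))

# Split-apply: one fitted model per portfolio
models <- lapply(split(df, df$Portfolio),
                 function(d) lm(A ~ B + C + D, data = d))

# Combine: one row of coefficients per portfolio, plus R-squared
coef_table <- t(sapply(models, coef))
r_squared  <- sapply(models, function(m) summary(m)$r.squared)
cbind(coef_table, r.squared = r_squared)
```

Keeping the fitted models in a list, rather than only their summaries, leaves the door open for predict() or residuals() later on.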

Using base R, you could try:
# create the portfolio codes 11, 12, 13, 21, 22, 23
# (first digit from 1:2, second digit from 1:3, matching the question's data)
subs <- apply(expand.grid(1:2, 1:3), 1, function(x) as.numeric(paste0(x, collapse="")))
# loop along these combinations. Note the print.
for (i in subs)
  print(summary(lm(A ~ B + C + D, data=subset(df, Portfolio==i))))
But as asked in the comments, a reproducible example would help.
Here is one built dataset:
# same as above
subs <- apply(expand.grid(1:2, 1:3), 1, function(x) as.numeric(paste0(x, collapse="")))
# here we create the dataset
n <- 50 # we want 50 rows
set.seed(1) # for the sake of reproducibility
df <- data.frame(A=rnorm(n), B=rnorm(n), C=rnorm(n), D=rnorm(n), Portfolio=sample(subs, n, replace=TRUE))
# now we can apply the loop:
for (i in subs){
  cat(rep("*", 20), "\nlm for Portfolio =", i, '\n') # a cheap console displayer
  print(summary(lm(A ~ B + C + D, data=subset(df, Portfolio==i))))
}
But, as others answered, both the data.table and dplyr packages offer a more straightforward and generic syntax than base R.
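For readers who want the loop to mirror the i/j idea from the question, a nested for loop that stores each summary in a list might look like this. It is a sketch only: the simulated df is an assumption, and results and id are illustrative names.

```r
# Simulated stand-in for the asker's data (an assumption)
set.seed(1)
df <- data.frame(A = rnorm(60), B = rnorm(60), C = rnorm(60), D = rnorm(60),
                 Portfolio = rep(c(11, 12, 13, 21, 22, 23), each = 10))

results <- list()                        # container for the summaries
for (i in 1:2) {                         # first digit of the portfolio code
  for (j in 1:3) {                       # second digit of the portfolio code
    id  <- as.numeric(paste0(i, j))      # builds 11, 12, 13, 21, 22, 23
    sub <- subset(df, Portfolio == id)
    results[[as.character(id)]] <- summary(lm(A ~ B + C + D, data = sub))
  }
}

names(results)    # the portfolio codes, as character strings
results[["21"]]   # summary for portfolio 21
```

Note that R has no `for (i = 1:3, j = 1:3)` syntax; each index needs its own `for (... in ...)`, and a variable name such as Portfolio_ij is taken literally rather than being filled in with the values of i and j, which is why the results go into a named list here.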

Related

Calculate all possible product combinations between variables

I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easily doable (manually) with a small number of columns, but even above 5 columns, this activity can get very lengthy - and error-prone, when column names contain prefix, suffix or codes inside.
It would be a bonus if I could also control the maximum number of columns to multiply together (in the example above I only considered 2 columns, but it would be great to set that as a parameter too, so as to add an extra variable a*b*c if needed).
My initial idea was to use expand.grid() with column names and then somehow do a lookup to select the whole columns values for the product - but I hope there's an easier way to do it that I am not aware of.
You can use combn to create combination of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
# a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
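The question also asks about controlling how many columns go into each product. One possible base R sketch uses combn for the column sets and Reduce for the element-wise product. product_terms is a hypothetical helper written for this illustration, not a function from any package, and the seed is arbitrary.

```r
set.seed(1)
test <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))

# product_terms(): hypothetical helper returning one column per
# m-way combination of the columns of `data`
product_terms <- function(data, m) {
  combos <- combn(names(data), m, simplify = FALSE)
  # Reduce() multiplies the selected columns element by element
  out <- lapply(combos, function(cols) Reduce(`*`, data[cols]))
  names(out) <- vapply(combos, paste, character(1), collapse = "_")
  as.data.frame(out)
}

pairs   <- product_terms(test, 2)   # columns a_b, a_c, b_c
triples <- product_terms(test, 3)   # column a_b_c

# all products of up to 3 columns next to the original data
output <- cbind(test, pairs, triples)
```

Underscores are used in the new column names so that they remain syntactically valid and can be accessed as output$a_b without backticks.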
Could this also be a solution? Ronak's solution is more elegant!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
# note: use a * b for the element-wise product; prod(a, b) would collapse
# both columns into one single number
output <- test %>%
  mutate(a_b = a * b,
         a_c = a * c,
         b_c = b * c
  ) %>%
  select(-a, -b, -c)

R function parameters in quotes / not in quotes

In R, some functions expect the function parameters to be in quotes, like this:
summarySE(xx, measurevar= "X1F1", groupvars="genotype",na.rm=TRUE)
Others seem happy with the same parameter without quotes, like this:
aov(data=xx,X1F1~genotype)
How can I convert from a string like "X1F1" to the X1F1 required by the formula?
Here's my data:
genotype X1F1 X2F1
1 R 43.33877 7.881666
2 R 130.34433 65.056984
3 R 53.39783 11.985018
4 R 23.45456 5.683387
5 R 138.50044 61.194956
6 R 108.63964 39.581222
7 R 153.60738 55.854238
8 T 264.96127 108.751380
9 T 222.94124 119.695112
10 T 119.55373 36.793537
11 T 34.97877 12.285921
You can use data.frame column names in a formula. The two lm calls below give the same result:
df <- data.frame(x = 1:100, y = rnorm(100))
# Option 1. As formula
lm(y ~ x, df)
# Option 2. As data.frame index
lm(df[, "y"] ~ df[, "x"])
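To go from strings to a formula directly, base R's reformulate() (or as.formula() on a pasted string) is one option. The sketch below rebuilds part of the question's data by hand; measurevar and groupvar are illustrative variable names.

```r
# Data re-typed from the question (X2F1 omitted for brevity)
xx <- data.frame(
  genotype = c(rep("R", 7), rep("T", 4)),
  X1F1 = c(43.33877, 130.34433, 53.39783, 23.45456, 138.50044, 108.63964,
           153.60738, 264.96127, 222.94124, 119.55373, 34.97877)
)

measurevar <- "X1F1"
groupvar   <- "genotype"

# Build the formula X1F1 ~ genotype from the two strings
f <- reformulate(groupvar, response = measurevar)

# Equivalent construction via as.formula()
f2 <- as.formula(paste(measurevar, "~", groupvar))

fit <- aov(f, data = xx)
summary(fit)
```

The same pattern works for lm() and any other function that accepts a formula object.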

Aggregating multiple dependent measures

I need to aggregate over a number of dependent measures (DMs) in R. I found the following discussion here quite useful:
Aggregate / summarize multiple variables per group (i.e. sum, mean, etc)
Based on this, the code below basically does what I need. It gets quite lengthy, however, as the number of DMs increases (I have many DMs):
aggregate(cbind(DM1, DM2, DM3, DM4, DM5 ... DMn) ~ F1 + F2 +
F3, data = sst2, mean, na.rm=TRUE)
I was therefore wondering if there was a more efficient way of writing the DMs, without having to individually type every one of them. Most DMs of interest are next to one another (i.e. DM3, DM4, DM5 etc.), so I was thinking of using something along the lines of cbind(DM1, DM3:DM10, DM14), but this doesn't seem to work. I also tried generating a list of the relevant column names. Unfortunately this didn't work either:
pr<-colnames(sst2)
pr2<-pr[pr!="DM2" & pr!="DM11" & pr!="DM12" & pr!="DM13"]
pr3<-noquote(paste(pr2,collapse=","))
pp<-aggregate(cbind(pr3) ~ F1 + F2 +
F3, data = sst2, mean, na.rm=TRUE)
Any suggestions on how to efficiently include a large number of DMs in the aggregate function (or other related functions such as ddply) would be much appreciated.
I believe that this should work
sst2 <- data.frame(F1=c("A","A","B","B","C","C"),
F2=c("A","A","A","B","B","B"),
F3=c("D","D","D","D","D","D"),
DM1=c(5,6,21,61,2,3),
DM2=c(1,5,3,6,1,6),
DM3=c(1,7,9,1,4,44))
n = 3 # number of DM columns
m = 2 # number of F columns
DM <- paste0("DM", 1:n)
attach(sst2)
# use sapply(DM,get) but this produces separate columns
tmp <- aggregate(sapply(DM, get) ~ F1 + F2,
data = sst2, mean, na.rm=TRUE)
detach(sst2)
# combine these separate columns. The apply is to each row of tmp
data.frame(F1 = tmp$F1, F2 = tmp$F2,
DM = apply(tmp[(m+1):(n+length(DM)-1)], 1, mean))
# F1 F2 DM
# 1 A A 4.166667
# 2 B A 11.000000
# 3 B B 22.666667
# 4 C B 10.000000
Edit
If your variable names are different, then the only line that would need to change is
DM <- c("mean.go.RT", "mean.SRT", "mean.SSD", "SSRT")
If these variables are in your data frame, you could easily get them with
DM <- names(sst2)[4:6]
or whichever other columns (i.e. instead of 4-6) you want.
An alternative solution using select, ddply and numcolwise:
library(plyr)  # load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
sst21 <- data.frame(F1=c("A","A","B","B","C","C"),
                    F2=c("A","A","A","B","B","B"),
                    F3=c("D","D","D","D","D","D"),
                    DM1=c(5,6,21,61,2,3),
                    DM2=c(1,5,3,6,1,6),
                    DM3=c(1,7,9,1,4,44),
                    DM4=c(2,3,6,7,2,33),
                    DM5=c(44,55,66,77,55,88))
sel1 <- dplyr::select(sst21, starts_with("F"), DM1:DM3, DM5) # select columns of interest
# Alternative: specify the columns to exclude instead (same selection here)
# sel1 <- dplyr::select(sst21, -DM4)
sst22 <- plyr::ddply(sel1, .(F1, F2, F3), plyr::numcolwise(mean, na.rm = TRUE)) # aggregate selected data
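For completeness, base R's aggregate() can also take the columns as a character vector through its data frame interface, which avoids typing every DM by hand and needs neither attach() nor get(). This is a sketch using the small example data from the answer above; dm_cols and res are illustrative names.

```r
sst2 <- data.frame(F1 = c("A", "A", "B", "B", "C", "C"),
                   F2 = c("A", "A", "A", "B", "B", "B"),
                   F3 = c("D", "D", "D", "D", "D", "D"),
                   DM1 = c(5, 6, 21, 61, 2, 3),
                   DM2 = c(1, 5, 3, 6, 1, 6),
                   DM3 = c(1, 7, 9, 1, 4, 44))

# All columns whose name starts with "DM", except the ones to skip
dm_cols <- setdiff(grep("^DM", names(sst2), value = TRUE), "DM2")

# Data frame interface: measures in x, grouping columns in by
res <- aggregate(sst2[dm_cols],
                 by  = sst2[c("F1", "F2", "F3")],
                 FUN = mean, na.rm = TRUE)
res
```

Unlike the formula interface, which by default drops any row with a missing value in any measure before aggregating, this form passes na.rm = TRUE to mean() separately for each column.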

I want to split data in R by varying block sizes but each observation is unique

I have managed to read in a data file, and subset out the 2 columns of info that I want to work with. I am now stuck because I need to split the data into chunks of varying sizes and apply a function (mean, sd) to them, save the chunks and plot the sd from each. Otherwise known generally as block averaging. Right now I have a data frame with 2 columns and 10005 rows. The head of it looks like this:
Frame CA
1 0.773
Is there an efficient way that I could subset pieces of the data from a:b so that I can dictate how the data is broken up by the "Frame" column? I have found really good answers on here but I am not sure what they mean fully or if they would work.
chunk <- function(x, n)
(mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=n),
pmin(seq.int(from=1, to=length(x), by=n)+(n-1),
length(x)), SIMPLIFY=FALSE))
I'm not sure if it is what you're looking for but with closure, a data frame can be subsetted by arbitrary indices.
(If Frame can be subsetted by a:b, it is likely to be a sequence and thus a subset may be made by row index?)
df <- data.frame(group = sample(c("a", "b"), 20, replace = TRUE),
                 val = rnorm(20))
# closure - returns a function that accepts from and to
subsetter <- function(from, to) {
  function(x) {
    x[from:to, ]
  }
}
# from and to are specified
sub1 <- subsetter(2, 4)
sub2 <- subsetter(1, 5)
# each function returns the rows between its from and to
sub1(df)
#group val
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
sub2(df)
# group val
#1 b 0.4780080
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
#5 b 0.4449554
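For the block averaging itself, one possible sketch assigns every row a block number and then computes the statistics per block with tapply. The data here is simulated (the real file is not shown in the question), and block_stats is a hypothetical helper name.

```r
# Simulated stand-in for the 10005-row data frame (an assumption)
set.seed(42)
dat <- data.frame(Frame = 1:10005, CA = rnorm(10005))

# block_stats(): hypothetical helper; splits x into consecutive blocks of
# block_size observations (the last block may be shorter)
block_stats <- function(x, block_size) {
  block <- ceiling(seq_along(x) / block_size)
  data.frame(
    block = sort(unique(block)),
    n     = as.vector(tapply(x, block, length)),
    mean  = as.vector(tapply(x, block, mean)),
    sd    = as.vector(tapply(x, block, sd))
  )
}

res <- block_stats(dat$CA, block_size = 1000)
res

# Repeat for several block sizes and keep the results in a named list
sizes   <- c(100, 500, 1000)
by_size <- lapply(sizes, function(s) block_stats(dat$CA, s))
names(by_size) <- sizes
```

The sd column of each result can then be passed to plot(), for example plot(res$block, res$sd, type = "b").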

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f.
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
A similar question was asked here by @david: "averaging every 16 columns in r" (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
  # Create index list
  if (length(by) == 1)
  {
    nc <- ncol(x)
    split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
  } else # 'by' is a vector of groups
  {
    nc <- length(by)
    split.index <- by
  }
  index.list <- split(seq(from = 1, to = nc), split.index)
  # Pass index list to fun using sapply() and return object
  sapply(index.list, function(i)
  {
    # drop = FALSE keeps a one-column group two-dimensional, so that
    # functions such as rowMeans still work on a leftover last column
    do.call(fun, list(x[, i, drop = FALSE], ...))
  })
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
Mean for rows from vectors a, b, c:
rowMeans(dat[1:3])
Means for rows from vectors d, e, f:
rowMeans(dat[4:6])
All in one call:
results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))
If you only know the names of the columns and not their order, you can use:
rowMeans(dat[c("a", "b", "c")])
rowMeans(dat[c("d", "e", "f")])
# I don't know how much this costs in speed, but it should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
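Both downsides can be sidestepped by keeping the group labels in a separate vector instead of in the column names. A sketch of that idea, using split.default() to split the data frame by columns (grp is an illustrative name, and the seed is arbitrary):

```r
set.seed(1)
p <- 99  # how many columns?
n <- 3   # how many consecutive columns per group?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))

# Group label for each column: 1,1,1,2,2,2,...
grp <- ceiling(seq_along(dat) / n)

# split.default() treats the data frame as a list of columns, so each piece
# is a data frame holding the columns of one group
dat.avs <- sapply(split.default(dat, grp), rowMeans)
dim(dat.avs)  # 4 rows, 33 groups
```

Because grp is numeric, the groups come back in numeric order (1, 2, 3, ...) rather than in the alphabetical order of character labels, and the original column names of dat are left untouched.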
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you want to apply a function to every unique combination of columns, a problem from combinatorics.
combinations <- combn(colnames(dat), 2, function(x) rowMeans(dat[x]))
To calculate statistics for every unique combination of three columns, just change the 2 to a 3. rowMeans is vectorized over rows, although combn still iterates over the combinations internally, so this is not necessarily faster than the apply-family solutions above. If the order of the columns matters, you need permutations rather than combinations; combinat::permn generates all permutations of a vector.

Resources