If I have a data frame, such as:
group=rep(1:4,each=10)
data=c(seq(1,10,1),seq(5,50,5),seq(20,11,-1),seq(0.3,3,0.3))
DF=data.frame(group,data)
Now, I would like to divide each data element by the mean of its group. For example:
group=rep(1:4,each=10)
data=c(seq(1,10,1),seq(5,50,5),seq(20,11,-1),seq(0.3,3,0.3))
DF=data.frame(group,data)
aggregate(DF,by=list(DF$group),FUN=mean)
#Group.1 group data
#1 1 1 5.50
#2 2 2 27.50
#3 3 3 15.50
#4 4 4 1.65
data1=c(seq(1,10,1)/5.5,seq(5,50,5)/27.5,seq(20,11,-1)/15.5,seq(0.3,3,0.3)/1.65)
DF1=data.frame(group, data1)
However, this is a bit convoluted and would not work easily on a large dataset. I feel like there is an apply-style approach that could be used here, but I cannot find a nice way to do it.
Here's the usual set of options (thanks to @G.Grothendieck for simplifying the ave call):
# base R
DF$newdata = ave(DF$data, DF$group, FUN = function(x) x/mean(x))
# or...
DF$newdata = DF$data / ave(DF$data, DF$group)  # ave() defaults to FUN = mean
# dplyr
library(dplyr)
DF %>% group_by(group) %>% mutate(newdata = data/mean(data))
# data.table
library(data.table)
setDT(DF)[, newdata := data/mean(data), by=group]
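As a quick sanity check, the first rows of the base R version should look roughly like this (values rounded):
head(DF, 3)
#   group data   newdata
# 1     1    1 0.1818182
# 2     1    2 0.3636364
# 3     1    3 0.5454545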
Related
Asking how to go from dplyr to base may be a weird ask, especially since I love the tidyverse, but because I learned the tidyverse first my grasp of base R is far from masterful. I need a base solution because the package I'm helping to develop doesn't want any tidyverse dependencies.
Data (there are many more columns, abbreviated here for the sake of a reprex):
library(tibble)
sample.df <- tibble(batch = rep(c(1,2,3), c(4,5,6)))
Desire base equivalent of:
sample.df %>%
mutate(rowid = row_number()) %>%
group_by(batch) %>%
summarize(idx_b = min(rowid),
idx_e = max(rowid))
# A tibble: 3 x 3
# Groups: batch [3]
batch idx_b idx_e
<dbl> <int> <int>
1 1 1 4
2 2 5 9
3 3 10 15
We create a sequence column in the data, use aggregate to get the min/max per group, and convert the matrix column into regular data.frame columns with do.call:
out <- do.call(data.frame, aggregate(rowid ~ batch,
transform(sample.df, rowid = seq_len(nrow(sample.df))),
FUN = function(x) c(b = min(x), e = max(x))))
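For this sample data, out should then look roughly like:
out
#   batch rowid.b rowid.e
# 1     1       1       4
# 2     2       5       9
# 3     3      10      15
The columns come out named rowid.b and rowid.e rather than idx_b and idx_e, so a rename may be needed if the exact names matter.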
Another base R option using unique + ave
unique(
transform(
sample.df,
idx_b = ave(1:nrow(sample.df), batch, FUN = min),
idx_c = ave(1:nrow(sample.df), batch, FUN = max)
)
)
gives
batch idx_b idx_c
1 1 1 4
5 2 5 9
10 3 10 15
Let's say I have a data set that has multiple rows and columns, and I want to record the min, max, and mean for each column and store these in their own table. How do I loop through the data frame in such a way that I can find this information for each column?
Edit: My initial data is stored in a tbl that looks like this: Initial Data. I want the output to look like this: Output Data.
Take a look at package dplyr, which will make this task more straightforward!
Here's an approach that just uses dplyr. The format isn't exactly what's in Output Data...
> df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1)) # Your Initial Data
> library(dplyr)
> df %>% summarise_all(.funs=funs(mean, min, max)) ## Approach 1: just dplyr
A_mean B_mean C_mean A_min B_min C_min A_max B_max C_max
1 4.333333 5 5.666667 2 4 1 7 6 9
Alternatively, if you also use package tidyr, you can get exactly the format you wanted for your output data:
> library(tidyr)
> df %>%
+ gather(Column, Value) %>% ## Converts dataframe from wide to long format
+ group_by(Column) %>% ## Groups by the new column containing old column names
+ summarise(Max=max(Value), Min=min(Value), Mean=mean(Value)) ## The summary functions
# A tibble: 3 x 4
Column Max Min Mean
<chr> <dbl> <dbl> <dbl>
1 A 7.00 2.00 4.33
2 B 6.00 4.00 5.00
3 C 9.00 1.00 5.67
One advantage of using these packages is that it may be more efficient, especially if df is large, than using an explicit loop.
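If you would rather stay in base R and avoid an explicit loop, a minimal sketch using sapply (assuming every column of df is numeric) is:
t(sapply(df, function(x) c(Max = max(x), Min = min(x), Mean = mean(x))))
#   Max Min     Mean
# A   7   2 4.333333
# B   6   4 5.000000
# C   9   1 5.666667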
I suggest you work with long tables instead of wide ones. While wide tables are simpler on the human eye, long tables are easier to manipulate for data analysis. That said, I think you could use the data.table package to achieve this:
# create a data frame
df <- data.frame(A=c(7,2,4), B=c(5,4,6), C=c(7,9,1))
# load data.table package
require(data.table)
# convert df to a data.table
setDT(df)
#Explanation of the following code:
# melt: turns your wide table into a long one
# .(val_mean ...) calculate and give names to calculated variables
# by = ... : group by variable. See data.table vignette
melt(df)[, .(val_mean = mean(value),
val_min = min(value),
val_max = max(value)),
by = variable]
which produces:
variable val_mean val_min val_max
1: A 4.333333 2 7
2: B 5.000000 4 6
3: C 5.666667 1 9
I currently face a problem in R that I know exactly how to deal with in Stata, but I have wasted over two hours trying to accomplish it in R.
Using the data.frame below, I want to keep exactly the first observation per group, where groups are formed by multiple variables and rows within each group are sorted by another variable. That is, the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, keep only one of the rows that share duplicate group identifiers (here that is only row 1, with (id, day) = (1, 1)), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web which does this properly if I first produce a single group identifier:
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
  temp <- subset(mydata, id1 == myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp <- temp[dim(temp)[1], ]
  } else {
    last.temp <- temp
  }
  last <- rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The package dplyr makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
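If your dplyr version is 1.0 or newer (an assumption about your setup, not part of the original answer), slice_min expresses the same idea a bit more explicitly:
mydata %>% group_by(id, day) %>% slice_min(value, n = 1, with_ties = FALSE)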
I would order the data.frame, at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
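If mydata has not been sorted yet, a minimal sketch of that sorting step (assuming, as in the question, that the row with the lowest value should be kept) is to run this before taking .SD[1]:
setorder(mydata, id, day, value)  # lowest value first within each (id, day) group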
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() at the end, dplyr's grouping structure will still be present and might mess up some of your subsequent functions.
I am trying to calculate changes in weight between visits to chicks at different nests. This requires R to look up the nest code in the current row, find the previous time that nest was visited, and subtract the weight at the previous visit from the current visit. For the first visit to each nest, I would like to output the current weight (i.e. as though the weight at the previous, non-existent visit was zero).
My data is of the form:
Nest <- c("a", "b", "c", "d", "e", "c", "b", "c")
Weight <- c(2,4,3,3,2,6,8,10)
df <- data.frame(Nest, Weight)
So the desired output here would be:
Change <- c(2,4,3,3,2,3,4,4)
I have achieved the desired output once, by subsetting to a single nest and using a for loop:
tmp <- subset(df, Nest == "a")
tmp$change <- tmp$Weight
for(x in 2:(length(tmp$Nest))){
  tmp$change[x] <- tmp$Weight[(x)] - tmp$Weight[(x-1)]
}
but when I try to fit this into ddply
df2 <- ddply(df, "Nest", function(f) {
f$change <- f$Weight
for(x in 2:(length(f$Nest))){
f$change <- f$Weight[(x)] - f$Weight[(x-1)]
}
})
the output gives a blank data.frame (0 obs. of 0 variables).
Am I approaching this the right way but getting the code wrong? Or is there a better way to do it?
Thanks in advance!
Try this:
library(dplyr)
df %>% group_by(Nest) %>% mutate(Change = c(Weight[1], diff(Weight)))
or with just base R
transform(df, Change = ave(Weight, Nest, FUN = function(x) c(x[1], diff(x))))
Here is a data.table solution. With large data sets, this is likely to be faster.
library(data.table)
setDT(df)[,Change:=c(Weight[1],diff(Weight)),by=Nest]
df
# Nest Weight Change
# 1: a 2 2
# 2: b 4 4
# 3: c 3 3
# 4: d 3 3
# 5: e 2 2
# 6: c 6 3
# 7: b 8 4
# 8: c 10 4
When I need to apply multiple functions to multiple columns and aggregate by multiple grouping columns, and I want the results bound into a data frame, I usually use aggregate() in the following manner:
# bogus functions
foo1 <- function(x){mean(x)*var(x)}
foo2 <- function(x){mean(x)/var(x)}
# for illustration purposes only
npk$block <- as.numeric(npk$block)
subdf <- aggregate(npk[,c("yield", "block")],
by = list(N = npk$N, P = npk$P),
FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})
Having the results in a nicely ordered data frame is achieved by using:
df <- do.call(data.frame, subdf)
Can I avoid the call to do.call() by somehow using aggregate() smarter in this scenario or shorten the whole process by using another base R solution from the start?
As @akrun suggested, dplyr's summarise_each is well-suited to the task.
library(dplyr)
npk %>%
group_by(N, P) %>%
summarise_each(funs(foo1, foo2), yield, block)
# Source: local data frame [4 x 6]
# Groups: N
#
# N P yield_foo2 block_foo2 yield_foo1 block_foo1
# 1 0 0 2.432390 1 1099.583 12.25
# 2 0 1 1.245831 1 2205.361 12.25
# 3 1 0 1.399998 1 2504.727 12.25
# 4 1 1 2.172399 1 1451.309 12.25
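Note that summarise_each has since been superseded; with dplyr 1.0 or later (an assumption about your version), the equivalent is written with across():
npk %>%
  group_by(N, P) %>%
  summarise(across(c(yield, block), list(foo1 = foo1, foo2 = foo2)))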
You can use
df=data.frame(as.list(aggregate(...
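Presumably that means wrapping the aggregate() call from the question, along these lines (a sketch; data.frame(as.list(...)) should flatten the matrix columns much like do.call(data.frame, ...) does):
df <- data.frame(as.list(aggregate(npk[, c("yield", "block")],
                                   by = list(N = npk$N, P = npk$P),
                                   FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))})))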