Create aggregate variable in long data format - r

I'm sure there's a question similar to this already, but I couldn't make the existing answers work.
I am trying to calculate aggregates (or subtotals) in a dataframe of long format. In the group column I want an aggregate variable "AGG" that is a sum of "value" for a specific "Year" and "var". I have tried using the aggregate() function, but didn't succeed. I used the code:
aggregate(value ~ cbind(Year,var), data = Energi5, FUN = sum)
My data looks like this
> head(df)
Year group var value
1 1966 A x 25465462
2 1966 B x 9512621
3 1966 E x 2832865
4 1966 H x 291769
5 1966 NE x 141524912
6 1966 NF x 23580353
> tail(df)
Year group var value
5403 2017 NZ y 167158
5404 2017 O y 23480
5405 2017 QF y 0
5406 2017 QS y 0
5407 2017 QZ y 16447
5408 2017 TC3000 y 488556
and I would like to obtain something like this at the end of (or in the middle of) my existing dataframe
Year group var value
5409 1966 AGG x ?
5410 1967 AGG x ?
...
5450 2017 AGG x ?
5451 1966 AGG y ?
...
I hope you can help. Thank you!

The error lies in how you are declaring the formula. See ?formula in the manual.
# Example
year <- rep(seq(1966, 2020), each = 8)
group <- rep(letters[1:4], times = 2*(2021-1966))
var <- rep(c("x", "y"), times = length(year)/2)
value <- rnorm(length(year))
data <- cbind.data.frame(year, group, var, value)
# Solution
aggregate(value ~ year * var, data, FUN=sum)
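The aggregate() call returns only the sums, not the rows appended to the long data. A minimal sketch of that last step, assuming toy data with the question's column names (the values below are made up for illustration):

```r
# Toy long-format data with the question's column names (values are hypothetical)
df <- data.frame(
  Year  = rep(c(1966, 1967), each = 4),
  group = rep(c("A", "B"), times = 4),
  var   = rep(rep(c("x", "y"), each = 2), times = 2),
  value = 1:8
)

# Sum value for every Year/var combination
agg <- aggregate(value ~ Year + var, data = df, FUN = sum)

# Label the subtotal rows and append them in the original column order
agg$group <- "AGG"
df_agg <- rbind(df, agg[, names(df)])
```

The subtotal rows can then be picked out with df_agg[df_agg$group == "AGG", ].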

There is probably a more efficient way to do this, but does this help?
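library(dplyr)
# Sum value within each Year/var combination and label the rows as group "AGG"
agg <- Energi5 %>% group_by(Year, var) %>% summarise(value = sum(value)) %>% ungroup() %>% mutate(group = "AGG")
# Append the aggregate rows to the original data
Energi5 <- bind_rows(Energi5, agg)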

Related

How can I add more columns to a data frame with a for loop

I am a beginner in R. I need to translate some EViews code to R. In EViews there is loop code that adds 10 or more columns/variables to the data by applying a function.
Here is example EViews code to estimate deflators:
for %x exp con gov inv cap ex im
frml def_{%x} = gdp_{%x}/gdp_{%x}_r*100
next
I used the dplyr package and its mutate function, but it is very tedious to add many variables this way.
library(dplyr)
nominal_gdp<-rnorm(4)
nominal_inv<-rnorm(4)
nominal_gov<-rnorm(4)
nominal_exp<-rnorm(4)
real_gdp<-rnorm(4)
real_inv<-rnorm(4)
real_gov<-rnorm(4)
real_exp<-rnorm(4)
df<-data.frame(nominal_gdp,nominal_inv,
nominal_gov,nominal_exp,real_gdp,real_inv,real_gov,real_exp)
df<-df %>% mutate(deflator_gdp=nominal_gdp/real_gdp*100,
deflator_inv=nominal_inv/real_inv*100,
deflator_gov=nominal_gov/real_gov*100,
deflator_exp=nominal_exp/real_exp*100)
print(df)
Please help me do this in R with a loop.
The answer is that your data is not as "tidy" as it could be.
This is what you have (with an added observation ID for clarity):
library(dplyr)
df <- data.frame(nominal_gdp = rnorm(4),
nominal_inv = rnorm(4),
nominal_gov = rnorm(4),
real_gdp = rnorm(4),
real_inv = rnorm(4),
real_gov = rnorm(4))
df <- df %>%
mutate(obs_id = 1:n()) %>%
select(obs_id, everything())
which gives:
obs_id nominal_gdp nominal_inv nominal_gov real_gdp real_inv real_gov
1 1 -0.9692060 -1.5223055 -0.26966202 0.49057546 2.3253066 0.8761837
2 2 1.2696927 1.2591910 0.04238958 -1.51398652 -0.7209661 0.3021453
3 3 0.8415725 -0.1728212 0.98846942 -0.58743294 -0.7256786 0.5649908
4 4 -0.8235101 1.0500614 -0.49308092 0.04820723 -2.0697008 1.2478635
Consider if you had instead, in df2:
obs_id variable real nominal
1 1 gdp 0.49057546 -0.96920602
2 2 gdp -1.51398652 1.26969267
3 3 gdp -0.58743294 0.84157254
4 4 gdp 0.04820723 -0.82351006
5 1 inv 2.32530662 -1.52230550
6 2 inv -0.72096614 1.25919100
7 3 inv -0.72567857 -0.17282123
8 4 inv -2.06970078 1.05006136
9 1 gov 0.87618366 -0.26966202
10 2 gov 0.30214534 0.04238958
11 3 gov 0.56499079 0.98846942
12 4 gov 1.24786355 -0.49308092
Then what you want to do is trivial:
df2 %>% mutate(deflator = real / nominal)
obs_id variable real nominal deflator
1 1 gdp 0.49057546 -0.96920602 -0.50616221
2 2 gdp -1.51398652 1.26969267 -1.19240392
3 3 gdp -0.58743294 0.84157254 -0.69801819
4 4 gdp 0.04820723 -0.82351006 -0.05853872
5 1 inv 2.32530662 -1.52230550 -1.52749012
6 2 inv -0.72096614 1.25919100 -0.57256297
7 3 inv -0.72567857 -0.17282123 4.19901294
8 4 inv -2.06970078 1.05006136 -1.97102841
9 1 gov 0.87618366 -0.26966202 -3.24919196
10 2 gov 0.30214534 0.04238958 7.12782060
11 3 gov 0.56499079 0.98846942 0.57158146
12 4 gov 1.24786355 -0.49308092 -2.53074800
So the question becomes: how do we get to the nice dplyr-compatible data.frame.
You need to gather your data using tidyr::gather. However, because you have 2 sets of variables to gather (the real and nominal values), it is not straightforward. I have done it in two steps, though there may be a better way.
real_vals <- df %>%
select(obs_id, starts_with("real")) %>%
# the line below is where the magic happens
tidyr::gather(variable, real, starts_with("real")) %>%
# extracting the variable name (by erasing up to the underscore)
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Same thing for nominal values
nominal_vals <- df %>%
select(obs_id, starts_with("nominal")) %>%
tidyr::gather(variable, nominal, starts_with("nominal")) %>%
mutate(variable = gsub(variable, pattern = ".*_", replacement = ""))
# Merging them... Now we have something we can work with!
df2 <-
full_join(real_vals, nominal_vals, by = c("obs_id", "variable"))
Note the importance of the observation id when merging.
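In tidyr 1.0.0 and later, pivot_longer() can do both gathers in one step through the special ".value" name. A sketch, assuming column names of the form nominal_<name> and real_<name> and made-up numbers; the deflator here follows the question's definition, nominal / real * 100:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data with the same naming scheme as the question
df <- data.frame(
  nominal_gdp = c(110, 120),
  nominal_inv = c(50, 60),
  real_gdp    = c(100, 100),
  real_inv    = c(40, 50)
)

df2 <- df %>%
  mutate(obs_id = row_number()) %>%
  # "nominal_gdp" is split at "_": the first part names the output column
  # (".value"), the second part goes into the new "variable" column
  pivot_longer(-obs_id,
               names_to = c(".value", "variable"),
               names_sep = "_") %>%
  mutate(deflator = nominal / real * 100)
```

This yields one row per obs_id and variable, with nominal, real and deflator side by side, so no join is needed.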
We can grep the matching names, and sort:
x <- colnames(df)
df[ sort(x[ (grepl("^nominal", x)) ]) ] /
df[ sort(x[ (grepl("^real", x)) ]) ] * 100
Similarly, if the columns were sorted, then we could just:
df[ 1:4 ] / df[ 5:8 ] * 100
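If a literal loop is preferred, closer to the EViews original, a base R for loop can build the column names with paste0(). A minimal sketch, assuming the nominal_<name> / real_<name> naming from the question and made-up values:

```r
# Hypothetical data using the question's naming scheme
df <- data.frame(
  nominal_gdp = c(110, 120),
  nominal_inv = c(50, 60),
  real_gdp    = c(100, 100),
  real_inv    = c(40, 50)
)

# One pass per suffix, like "for %x ... next" in EViews
for (x in c("gdp", "inv")) {
  nominal_col  <- paste0("nominal_", x)
  real_col     <- paste0("real_", x)
  deflator_col <- paste0("deflator_", x)
  df[[deflator_col]] <- df[[nominal_col]] / df[[real_col]] * 100
}
```

Adding another series only requires extending the vector of suffixes.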
We can loop over column names using purrr::map_dfc then apply a custom function over the selected columns (i.e. the columns that matched the current name from nms)
library(dplyr)
library(purrr)
#Replace anything before _ with empty string
nms <- unique(sub('.*_','',names(df)))
#Use map if you need the output as a list not a dataframe
map_dfc(nms, ~deflator_fun(df, .x))
Custom function
deflator_fun <- function(df, x){
#browser()
nx <- paste0('nominal_',x)
rx <- paste0('real_',x)
select(df, matches(x)) %>%
mutate(!!paste0('deflator_',quo_name(x)) := !!ensym(nx) / !!ensym(rx)*100)
}
#Test
deflator_fun(df, 'gdp')
nominal_gdp real_gdp deflator_gdp
1 -0.3332074 0.181303480 -183.78433
2 -1.0185754 -0.138891362 733.36121
3 -1.0717912 0.005764186 -18593.97398
4 0.3035286 0.385280401 78.78123
Note: quo_name, !!, and ensym are tools for programming with dplyr; the "Programming with dplyr" vignette covers them in more detail.

find max value in a data.frame by group and show its date as year-month-day

Here is my data frame:
df = read.csv(text = '"Date","Value","ID","WY"
1975-02-01,-1.16543693088,"Tweed",1975
1975-03-01,-1.05372283483,"Tweed",1975
1975-04-01,-1.06632370439,"Tweed",1975
1975-05-01,-1.18903485356,"Tweed",1975
1992-05-01,-1.04737467143,"Ouse",1992
1992-06-01,-1.4058281451,"Ouse",1992
1992-07-01,-1.13608647243,"Ouse",1992
1992-08-01,-0.802566581309,"Ouse",1992
1992-09-01,-0.551433852821,"Ouse",1992
1992-10-01,-0.625997598552,"Ouse",1993
1992-11-01,-0.483559758609,"Ouse",1993
1992-12-01,-0.792013395632,"Ouse",1993
1993-01-01,-0.754618121962,"Ouse",1993
1993-02-01,-1.2504282139,"Ouse",1993
1996-01-01,-0.945410385985,"Trent",1996
1996-02-01,-0.84249575782,"Trent",1996
1996-03-01,-1.10332425045,"Trent",1996
1996-04-01,-1.22634133042,"Trent",1996
1996-05-01,-1.2335181635,"Trent",1996
1996-06-01,-1.23451130358,"Trent",1996
1996-07-01,-1.25902677738,"Trent",1996
1996-08-01,-1.13068733413,"Trent",1996', header = TRUE)
I need to find the annual maximum value for each ID and WY group.
The following code does the trick very easily, but its output only shows the year of each annual maximum, whereas I am also interested in the corresponding month and day:
df_AMAX = aggregate(df$Value, by = list(df$WY, df$ID), max)
colnames(df_AMAX) = c('Date', 'ID', 'Value')
print(df_AMAX)
Date ID Value
1 1992 Ouse -0.5514339
2 1993 Ouse -0.4835598
3 1996 Trent -0.8424958
4 1975 Tweed -1.0537228
My output should be:
Date ID Value
1 1992-09-01 Ouse -0.5514339
2 1993-11-01 Ouse -0.4835598
3 1996-02-01 Trent -0.8424958
4 1975-03-01 Tweed -1.0537228
It should be a silly thing but please let me know if you have any suggestion.
Thanks
Use subset with ave. Note that the function passed to ave returns a logical but ave will coerce it to the class of Value so we use !! to make it logical again. No packages are used.
mx_all <- function(x) if (length(x)) x == max(x)
subset(df, !!ave(Value, ID, WY, FUN = mx_all))
or
mx_first <- function(x) if (length(x)) seq_along(x) == which.max(x)
subset(df, !!ave(Value, ID, WY, FUN = mx_first))
These give the same answer for the sample input and will always give the same answer if there is a unique maximum in each group but if there are multiple maxima in a group then the first one gives all of them and the second gives the first.
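The difference only shows up with ties. A small sketch with made-up data in which group "a" has two equal maxima:

```r
# Hypothetical data: group "a" has a tied maximum (5 appears twice)
d <- data.frame(
  ID    = c("a", "a", "a", "b"),
  WY    = c(1, 1, 1, 1),
  Value = c(2, 5, 5, 3)
)

mx_all   <- function(x) if (length(x)) x == max(x)
mx_first <- function(x) if (length(x)) seq_along(x) == which.max(x)

# ave() returns 0/1 in the class of Value; !! turns that back into logical
all_rows   <- subset(d, !!ave(Value, ID, WY, FUN = mx_all))    # keeps both tied rows
first_rows <- subset(d, !!ave(Value, ID, WY, FUN = mx_first))  # keeps only the first
```

Here all_rows has three rows (two for "a", one for "b") and first_rows has two.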
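There is of course a dplyr solution, too. Date has to be computed before Value is overwritten, because summarise() evaluates its arguments in order and a later expression would otherwise see the already summarised Value:
library(dplyr)
df %>%
group_by(WY, ID) %>%
summarise(
Date = Date[which.max(Value)],
Value = max(Value)) %>%
ungroup() %>%
select(Date, ID, Value)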
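Another option is a base R split-apply-combine, which keeps the whole row of each maximum and so carries the full date along. A sketch on a cut-down copy of the data (the values are rounded from the question and only illustrative):

```r
# Reduced, rounded version of the question's data (illustrative only)
df <- read.csv(text = 'Date,Value,ID,WY
1992-08-01,-0.80,Ouse,1992
1992-09-01,-0.55,Ouse,1992
1992-10-01,-0.63,Ouse,1993
1992-11-01,-0.48,Ouse,1993
1996-01-01,-0.95,Trent,1996
1996-02-01,-0.84,Trent,1996', header = TRUE)

# Split into one data frame per ID/WY combination, dropping empty combinations
groups <- split(df, list(df$ID, df$WY), drop = TRUE)

# Keep the row holding the (first) maximum of each group, then stack the rows
df_AMAX <- do.call(rbind, lapply(groups, function(g) g[which.max(g$Value), ]))
rownames(df_AMAX) <- NULL
```

As with which.max() elsewhere, only the first row is kept when a group has tied maxima.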

Refer to a particular value in `dplyr::mutate()`

I have the following code:
library(dplyr)
library(quantmod)
# inflation data
getSymbols("CPIAUCSL", src='FRED')
avg.cpi <- apply.yearly(CPIAUCSL, mean)
cf <- avg.cpi/as.numeric(avg.cpi['1991']) # using 1991 as the base year
cf <- as.data.frame(cf)
cf$year <- rownames(cf)
cf <- tail(cf, 25)
rownames(cf) <- NULL
cf$year <- lapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
rm(CPIAUCSL)
# end of inflation data get
tmp <- data.frame(year=c(rep(1991,2), rep(1992,2)), price=c(12.03, 12.98, 14.05, 14.58))
tmp %>% mutate(infl.price = price / cf[cf$year == year, ]$CPIAUCSL)
I want to get the following result:
year price
1991 12.03
1991 12.98
1992 13.64
1992 14.16
But I'm getting an error:
Warning message:
In cf$year == tmp$year :
longer object length is not a multiple of shorter object length
And with %in% it produces an incorrect result.
I think it might be easier to join the CPIAUCSL column in cf into tmp before you try to mutate:
cf$year = as.numeric(cf$year)
tmp = tmp %>% inner_join(cf, by = "year") %>% mutate(infl.price = price / CPIAUCSL)
Your cf$year column is a list (lapply returns one), which is unfriendly. It would have been nicer to have
cf$year <- sapply(cf$year, function(x) as.numeric(head(unlist(strsplit(x, "-")), 1)))
which at least returns a simple vector.
Additionally, the subsetting operator [] is not vectorized for this type of operation. The mutate() function does not iterate over rows; it operates on entire columns at a time. When you do
cf[cf$year == year, ]$CPIAUCSL
there is not just one year value; mutate is trying to compare them all at once.
You'd be better off doing a proper merge with your data and then do the mutate. This will basically do the same thing as your pseudo-merge that you were trying to do in your version.
You can do
tmp %>% left_join(cf) %>%
mutate(infl.price = price / CPIAUCSL) %>%
select(-CPIAUCSL)
to get
year price infl.price
1 1991 12.03 12.03000
2 1991 12.98 12.98000
3 1992 14.05 13.63527
4 1992 14.58 14.14962
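The same per-row lookup can also be written in base R with match(), which returns, for each year in tmp, the position of that year in cf. A sketch, assuming cf has a plain numeric year column; the conversion factors below are made up rather than taken from FRED:

```r
# Hypothetical conversion factors (not the real CPI-based values)
cf <- data.frame(year = c(1991, 1992), CPIAUCSL = c(1.00, 1.03))

tmp <- data.frame(year  = c(rep(1991, 2), rep(1992, 2)),
                  price = c(12.03, 12.98, 14.05, 14.58))

# match() gives one row index of cf per row of tmp, so the division is vectorized
idx <- match(tmp$year, cf$year)
tmp$infl.price <- tmp$price / cf$CPIAUCSL[idx]
```

Years missing from cf produce NA in idx, and therefore NA in infl.price, rather than an error.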

dcast with custom fun.aggregate

I have data that looks like this:
sample start end gene coverage
X 1 10 A 5
X 11 20 A 10
Y 1 10 A 5
Y 11 20 A 10
X 1 10 B 5
X 11 20 B 10
Y 1 10 B 5
Y 11 20 B 10
I added additional columns:
data$length <- (data$end - data$start + 1)
data$ct_lt <- (data$length * data$coverage)
I reformatted my data using dcast:
casted <- dcast(data, gene ~ sample, value.var = "coverage", fun.aggregate = mean)
So my new data looks like this:
gene X Y
A 10.00000 10.00000
B 38.33333 38.33333
This is the correct data format I desire, but I would like to fun.aggregate differently. Instead, I would like to take a weighted average, with coverage weighted by length:
( sum (ct_lt) ) / ( sum ( length ) )
How do I go about doing this?
Disclosure: no R in front of me, but I think your friend here may be the dplyr and tidyr packages.
Certainly lots of ways to accomplish this, but I think the following might get you started
library(dplyr)
library(tidyr)
data %>%
select(gene, sample, ct_lt, length) %>%
group_by(gene, sample) %>%
summarise(weight_avg = sum(ct_lt) / sum(length)) %>%
spread(sample, weight_avg)
Hope this helps...
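Since the weighted average is a ratio of two sums, it can also be sketched in base R without any packages: xtabs() sums a numeric left-hand side within each gene/sample cell, and dividing the two tables gives the gene by sample layout directly. The data below is the sample from the question:

```r
data <- read.table(text = "sample start end gene coverage
X 1 10 A 5
X 11 20 A 10
Y 1 10 A 5
Y 11 20 A 10
X 1 10 B 5
X 11 20 B 10
Y 1 10 B 5
Y 11 20 B 10", header = TRUE)

data$length <- data$end - data$start + 1
data$ct_lt  <- data$length * data$coverage

# Sum of coverage * length and sum of length per gene/sample cell
num <- xtabs(ct_lt ~ gene + sample, data = data)
den <- xtabs(length ~ gene + sample, data = data)

# Length-weighted mean coverage, genes in rows and samples in columns
casted <- num / den
```

For this sample every cell is (10 * 5 + 10 * 10) / 20 = 7.5.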

R-sq values, linear regression of several trends within one dataset

I am running into a sticky spot trying to solve for variance accounted for by trend several times within a single data set.....
My data is structured like this
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750 " , header = T)
and I am trying to return something like this
STA R-sq
a n1
b n2
c n3
where n# is the corresponding r-squared value for that location's data in the original set....
I have tried
fit <- lm(VALUE ~ YEAR + STA, data = x)
to give the model of yearly trend of VALUE for each individual station over the years data is available for VALUE, within the master data set....
Any help would be greatly appreciated.... I am really stumped on this one and I know it is just a familiarity with R problem.
To get r-squared for VALUE ~ YEAR for each group of STA, you can take this previous answer, modify it slightly and plug in your values:
library(plyr)
# assuming x is your data frame (make sure you don't have Hmisc loaded, it will interfere)
models_x <- dlply(x, "STA", function(df)
summary(lm(VALUE ~ YEAR, data = df)))
# extract the r.squared values
rsqds <- ldply(1:length(models_x), function(x) models_x[[x]]$r.squared)
# give names to rows and col
rownames(rsqds) <- unique(x$STA)
colnames(rsqds) <- "rsq"
# have a look
rsqds
rsq
a 0.6286064
b 0.5450413
c 0.8806604
EDIT: following mnel's suggestion here are more efficient ways to get the r-squared values into a nice table (no need to add row and col names):
# starting with models_x from above
rsqds <- data.frame(rsq =sapply(models_x, '[[', 'r.squared'))
# starting with just the original data in x, this is great:
rsqds <- ddply(x, "STA", summarize, rsq = summary(lm(VALUE ~ YEAR))$r.squared)
STA rsq
1 a 0.6286064
2 b 0.5450413
3 c 0.8806604
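The same per-station fit can be sketched in base R with split() and sapply(), with no extra packages. The data below is stations a and b from the question:

```r
x <- read.table(text = "STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690", header = TRUE)

# Fit VALUE ~ YEAR separately for each station and pull out r-squared
rsq <- sapply(split(x, x$STA),
              function(d) summary(lm(VALUE ~ YEAR, data = d))$r.squared)

# Arrange as a two-column table: station and its r-squared
rsqds <- data.frame(STA = names(rsq), rsq = unname(rsq))
```

Other quantities from summary(lm(...)), such as adj.r.squared, can be extracted the same way.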
#first load the data.table package
library(data.table)
#transform your dataframe to a datatable (I'm using your example)
x<- as.data.table(x)
#calculate all the metrics needed (r^2, F-statistic and so on)
x[,list(r2=summary(lm(VALUE~YEAR))$r.squared ,
f=summary(lm(VALUE~YEAR))$fstatistic[1] ),by=STA]
STA r2 f
1: a 0.6286064 8.462807
2: b 0.5450413 5.990009
3: c 0.8806604 36.897258
There's only one r-squared value, not three. Please edit your question.
# store the output
y <- summary( lm( VALUE ~ YEAR + STA , data = x ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
# or are you looking to run three separate
# lm() functions on 'a' 'b' and 'c' ..where this would be the first?
y <- summary( lm( VALUE ~ YEAR , data = x[ x$STA %in% 'a' , ] ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
