Aggregate in multiple columns [duplicate] - r

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
I have more than 1000 stations, each with more than 50 years of data, so I arranged my data frame with the stations in the columns and the dates in the rows, as in the example below.
Now I need to sum my parameter for each year in each column, and I also need to know how many non-NA values went into each sum, so that I get a specific value for each year at each station.
I hope you can help me, and sorry about the syntax and language.
year<-c(rep(2000,12),rep(2001,12),rep(2002,12), rep(2003,12))
data <- data.frame( year, month=rep(1:12,4),est1=rnorm(12*4,2,1),est2=rnorm(12*4,2,1),est3=rnorm(12*4,2,1))
data[3,3]<-NA

Sums:
> apply(data[,-(1:2)], 2, tapply, data$year, sum, na.rm=TRUE)
         est1     est2     est3
2000 23.46997 21.36984 28.24381
2001 27.32517 28.84098 24.11784
2002 23.41737 25.47548 23.82606
2003 24.63551 24.51148 28.17723
Non-NA counts:
> apply(!is.na(data[,-(1:2)]), 2, tapply, data$year, sum)
     est1 est2 est3
2000   11   12   12
2001   12   12   12
2002   12   12   12
2003   12   12   12
And a version with sapply instead of apply (see @r2evans's comment below):
sapply(data[,-(1:2)], tapply, data$year, sum, na.rm=TRUE)
sapply(data.frame(!is.na(data[,3:5])), tapply, data$year, sum)
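As an alternative sketch, the same two tables can be produced with base R's aggregate(), which returns data frames with an explicit year column. The fixed seed and the names stations, sums and counts are assumptions added for illustration; they are not part of the original answer.

```r
# Rebuild the example data from the question (seed is an assumption, for reproducibility)
set.seed(1)
year <- rep(2000:2003, each = 12)
data <- data.frame(year, month = rep(1:12, 4),
                   est1 = rnorm(48, 2, 1),
                   est2 = rnorm(48, 2, 1),
                   est3 = rnorm(48, 2, 1))
data[3, 3] <- NA

# Keep only the station columns (drop year and month)
stations <- data[, -(1:2)]

# Yearly sums per station, ignoring NA values
sums <- aggregate(stations, by = list(year = data$year),
                  FUN = sum, na.rm = TRUE)

# Number of non-NA values behind each yearly sum
counts <- aggregate(as.data.frame(!is.na(stations)),
                    by = list(year = data$year), FUN = sum)

print(sums)
print(counts)
```

Because both results share the year column, they could be merged with merge(sums, counts, by = "year", suffixes = c("_sum", "_n")) if a single table is preferred.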

Related

Age calculation for observation data in R [duplicate]

This question already has answers here:
Return date range by group
(3 answers)
Closed 3 years ago.
I have a large but very simple observation dataset, hypothetically structured as below:
> df = data.frame(ID = c("oak", "birch", rep("oak",2), "pine", "birch", "oak", rep("pine",2), "birch", "oak"),
+ yearobs = c(rep(1998,3), rep(1999,2), rep(2000,3),rep(2001,2), 2002))
> df
      ID yearobs
1    oak    1998
2  birch    1998
3    oak    1998
4    oak    1999
5   pine    1999
6  birch    2000
7    oak    2000
8   pine    2000
9   pine    2001
10 birch    2001
11   oak    2002
What I want to do is calculate the age of each unique ID (tree species in this example) as the difference between its last and first observation years ( max(yearobs)-min(yearobs) ). I have tried the lubridate and dplyr packages, but the number of observations per ID varies in my data, and I want to create an age column in the fastest way possible without storing the minimum and maximum values separately (I am avoiding for loops because my data is huge).
Desired output:
     ID age
1   oak   4
2 birch   3
3  pine   2
Any suggestion would be appreciated.
In base R you can do:
aggregate(yearobs ~ ID, data = df, FUN = function(x) max(x) - min(x))
# ID yearobs
# 1 birch 3
# 2 oak 4
# 3 pine 2
An option is to group by 'ID' and get the difference between the min and max of the 'yearobs' column:
library(dplyr)
df %>%
  group_by(ID) %>%
  summarise(age = max(yearobs) - min(yearobs))
Also, if we need to do this fast, then data.table would be another option
library(data.table)
setDT(df)[, .(age = max(yearobs) - min(yearobs)), by = ID]
Or using base R
by(df['yearobs'], df$ID, FUN = function(x) max(x)- min(x))
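A further base R sketch, offered only as a cross-check of the answers above: tapply() with diff(range(x)) computes the same max-minus-min per ID and returns a named vector. The variable names age and result are assumptions chosen for illustration.

```r
# Example data from the question
df <- data.frame(
  ID = c("oak", "birch", rep("oak", 2), "pine", "birch", "oak",
         rep("pine", 2), "birch", "oak"),
  yearobs = c(rep(1998, 3), rep(1999, 2), rep(2000, 3), rep(2001, 2), 2002)
)

# range(x) returns c(min, max); diff() of that is max - min
age <- tapply(df$yearobs, df$ID, function(x) diff(range(x)))

# Convert the named vector to a two-column data frame
result <- data.frame(ID = names(age), age = as.vector(age))
print(result)
#      ID age
# 1 birch   3
# 2   oak   4
# 3  pine   2
```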

I need to classify by categories without mixing the data of different columns [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I have the following dataset:
Year  Company  Product  Sales
2017  X        A        10
2017  Y        A        20
2017  Z        B        20
2017  X        B        10
2018  X        B        20
2018  Y        B        30
2018  X        A        10
2018  Z        A        10
I want to obtain the following summary:
Year  Product  Sales
2017  A        30
      B        30
2018  A        20
      B        50
and also the following summary:
Year  Company  Sales
2017  X        20
      Y        20
      Z        20
2018  X        30
      Y        30
      Z        10
Is there any way to do it without using loops?
I know I could do something with the aggregate function, but I don't know how to use it without mixing the data for company, product and year. For example, I get the total sales of products A and B, but the totals combine both years instead of giving A and B for 2017 and, separately, for 2018.
Do you have any suggestions?
Let's say your pandas dataframe is called df. Pass the grouping columns to groupby as a list:
df1 = df.groupby(['Year', 'Product'])['Sales'].sum()
df2 = df.groupby(['Year', 'Company'])['Sales'].sum()
I believe this would help you create your two summaries without mixing anything :)
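Since the question mentions the aggregate function, here is a minimal R sketch of the same grouping. It assumes the data frame is named df with the columns shown in the question; by_product and by_company are illustrative names. Putting both grouping variables on the right-hand side of the formula is what keeps the years separate.

```r
# Example data from the question
df <- data.frame(
  Year    = c(2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018),
  Company = c("X", "Y", "Z", "X", "X", "Y", "X", "Z"),
  Product = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Sales   = c(10, 20, 20, 10, 20, 30, 10, 10)
)

# Total sales for each Year/Product combination
by_product <- aggregate(Sales ~ Year + Product, data = df, FUN = sum)

# Total sales for each Year/Company combination
by_company <- aggregate(Sales ~ Year + Company, data = df, FUN = sum)

# Sort by year for readability
by_product <- by_product[order(by_product$Year, by_product$Product), ]
by_company <- by_company[order(by_company$Year, by_company$Company), ]
print(by_product)
print(by_company)
```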

Ntile and decile function depended on two columns in R [duplicate]

This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 4 years ago.
I would like to add a new column with an Ntile that is computed within each value of column 1 ("year") and shows the ntile number for column 2 ("mileage").
   year mileage
  <dbl>   <dbl>
1  2011    7413
2  2011   10926
3  2011    7351
4  2011   11613
5  2012    8367
6  2010   25125
mydata$Ntile <- ntile(mydata$mileage, 10)
I know the easy-to-use function ntile, but I do not know how to make it depend on 2 columns. I would like the ntiles of mileage to be calculated separately for each year (2010, 2011 and 2012) and stored in a new column "Ntile".
PS: I know there is not enough data to calculate Ntiles for 2011 and 2012, it is just an example.
I like the data.table approach:
library(data.table)
library(dplyr)  # ntile() comes from dplyr, not data.table
mydata <- as.data.table(mydata)
mydata[, Ntile := ntile(mileage, 10), by = year]
Best!

How to sum a variable by group but do not aggregate the data frame in R? [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 4 years ago.
Although I have found many ways to calculate the sum of a variable by group, all of them end up creating a new data set that collapses the repeated cases.
To be more precise, if I have a data frame:
id year
1 2010
1 2015
1 2017
2 2011
2 2017
3 2015
and I want to count the number of times I have the same ID by the different years, there are a lot of ways (using aggregate, tapply, dplyr, sqldf etc) which use a "group by" kind of functionality that in the end will give something like:
id count
1 3
2 2
3 1
I haven't managed to find a way to calculate the same thing but keep my original data frame, in order to obtain:
id year count
1 2010 3
1 2015 3
1 2017 3
2 2011 2
2 2017 2
3 2015 1
and therefore not collapse my repeated cases.
Has anybody already figured this out?
Thank you in advance.
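One possible approach, sketched in base R under the assumption that the data frame is named df: ave() applies a function within each group and returns a vector as long as its input, so the result can be assigned straight back as a new column without collapsing any rows.

```r
# Example data from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  year = c(2010, 2015, 2017, 2011, 2017, 2015)
)

# Count the rows in each id group and repeat that count on every row of the group
df$count <- ave(df$year, df$id, FUN = length)

print(df)
#   id year count
# 1  1 2010     3
# 2  1 2015     3
# 3  1 2017     3
# 4  2 2011     2
# 5  2 2017     2
# 6  3 2015     1
```

The first argument to ave() only needs to be a vector of the right length; with FUN = length its values are not used, only how many fall into each group.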

Average for column value across multiple datasets in R [duplicate]

This question already has answers here:
calculate average over multiple data frames
(5 answers)
Closed 6 years ago.
I am new to R and need help with this. I have 3 data sets from 3 different years. They have the same columns, with different values for each year. I want to find the average of each column's values across the three years, grouped by the Name field. To be specific:
Assume the first data set is:
Name Age Height Weight
A 4 20 20
B 5 22 22
C 8 25 21
D 10 25 23
second data set
Name Age Height Weight
A 5 22 25
B 6 23 26
Third data set
Name Age Height Weight
A 6 24 24
B 7 24 27
C 10 27 28
I want to find the average height for "A" across the three data sets.
We can place them in a list, rbind them, group by 'Name', and get the mean of each column:
library(data.table)
rbindlist(list(df1, df2, df3))[, lapply(.SD, mean), by = Name]
Or with dplyr:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  group_by(Name) %>%
  summarise(across(everything(), mean))  # summarise_each()/funs() are deprecated in current dplyr
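The same idea can be sketched without extra packages. This assumes the three data sets are data frames named df1, df2 and df3 (as in the answer above) holding the values from the question; combined and means are illustrative names.

```r
# The three data sets from the question
df1 <- data.frame(Name = c("A", "B", "C", "D"),
                  Age = c(4, 5, 8, 10),
                  Height = c(20, 22, 25, 25),
                  Weight = c(20, 22, 21, 23))
df2 <- data.frame(Name = c("A", "B"),
                  Age = c(5, 6),
                  Height = c(22, 23),
                  Weight = c(25, 26))
df3 <- data.frame(Name = c("A", "B", "C"),
                  Age = c(6, 7, 10),
                  Height = c(24, 24, 27),
                  Weight = c(24, 27, 28))

# Stack the rows, then average every remaining column within each Name
combined <- rbind(df1, df2, df3)
means <- aggregate(. ~ Name, data = combined, FUN = mean)

print(means)
# Average height for "A" only
print(means$Height[means$Name == "A"])  # 22
```

Names that appear in fewer data sets (C and D here) are averaged over the rows that exist, not over three years.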

Resources