Best method for averaging across rows [duplicate]

I have data with multiple observations per day, and I want to construct a table of daily averages. My instinctive approach (from other programming languages) is to sort the data by date and write a for loop to go through and average it out. But every time I see an R question involving for loops, there tends to be a strong response that R handles vector-type approaches much better. What would a smarter approach be to this problem?
For reference, my data looks something like
date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8
And I would like the output to be a new data frame that looks like
date average
2017-4-4 146
2017-4-3 55
2017-4-2 8

require("dplyr")
df <- data.frame(date = c('2017-4-4', '2017-4-4', '2017-4-4', '2017-4-3', '2017-4-3', '2017-4-2'),
observation = c(17, 412, 8, 96, 14, 8))
df %>%
group_by(date) %>%
summarise(average = mean(observation)) %>%
data.frame
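For reference, group_by()/summarise() return the groups sorted by the grouping variable, so the result comes back in ascending date order, roughly:
#       date average
# 1 2017-4-2       8
# 2 2017-4-3      55
# 3 2017-4-4     146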

tapply() can do that:
df <- read.table(header=TRUE, text=
'date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8')
df$date <- as.Date(df$date, format="%Y-%m-%d")
m <- tapply(df$observation, df$date, FUN=mean)
d.result <- data.frame(date=as.Date(names(m), format="%Y-%m-%d"), m)
# > d.result
# date m
# 2017-04-02 2017-04-02 8
# 2017-04-03 2017-04-03 55
# 2017-04-04 2017-04-04 146
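A small aside: the duplicated row names above are just tapply()'s names carried over; passing row.names=NULL drops them, and naming the second column gives a nicer header:
d.result <- data.frame(date=as.Date(names(m), format="%Y-%m-%d"), average=m, row.names=NULL)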
or
aggregate(observation ~ date, data=df, FUN=mean)
or with data.table
library("data.table")
dt <- fread(
'date observation
2017-4-4 17
2017-4-4 412
2017-4-4 9
2017-4-3 96
2017-4-3 14
2017-4-2 8')
dt[ , .(observation = mean(observation)), by=date]
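data.table keeps groups in their order of first appearance, so this returns roughly:
#        date observation
# 1: 2017-4-4         146
# 2: 2017-4-3          55
# 3: 2017-4-2           8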


Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on Stack Overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and kept getting output I didn't expect.
This time, I want to do something slightly different from my previous question:
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022; likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022. What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 both fall in the 1st week of May 2022; if more than one date falls in the same year_month_week, I will just keep the first of them. Now, here's the different part: even if there is no date inside a year_month_week, just leave it NA. So my expected output has the same number of rows as df1, including the year_month_week column. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can convert the dates in df2 into the same year_month_week format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
You can build off of a previous answer here by taking the function to count the week of the month, then generate a join key in df2. See here
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
year_month_week = paste0(
lubridate::year(date),
0,
lubridate::month(date),
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = T)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
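One caveat worth adding (an aside, not part of the original answer): paste0(year(date), 0, month(date), ...) only pads single-digit months, so October-December would produce keys like 20220101 instead of 2022101. Formatting the month with format() pads it correctly for every month:
df2 <- df2 %>%
  mutate(year_month_week = as.double(paste0(format(date, "%Y%m"),
                                            monthweeks.Date(date))))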

Convert data frame with year column to timeseries data, sorted by observation [duplicate]

I have the following data.frame, which I want to convert into 2 separate timeseries data frames for revenue and cost.
df1 = data.frame(year = c('2018','2019', '2020','2019','2020','2021'),
company=c('x','x','x','y','y','z'),
revenue=c(45,78,13,89,48,70),
cost=c(100,120,130,140,160,164),
stringsAsFactors=FALSE)
df1
year company revenue cost
1 2018 x 45 100
2 2019 x 78 120
3 2020 x 13 130
4 2019 y 89 140
5 2020 y 48 160
6 2021 z 70 164
If I want to create a new data frame for the revenue data, with the data arranged as shown below and n.a. replacing all years for which data is not available, what code can I use to do this?
2018 2019 2020 2021
1 x 45 78 13 n.a.
2 y n.a. 89 48 n.a.
3 z n.a. n.a. n.a. 70
With the tidyverse...
df1 %>% filter(company == 'x') %>% select(-cost) %>% pivot_wider(names_from = year, values_from = revenue)
If you are trying to get both revenue and costs as you imply
library(tidyr)
df2 <- pivot_wider(df1, names_from = year, values_from = c(revenue,cost))
gets what you need, I think. Cols 2-5 are the revenues and Cols 6-9 are the costs.
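If only the revenue table from the question is wanted, with one row per company and NA where a year is missing, a sketch along the same lines (missing year/company combinations become NA automatically):
rev_wide <- pivot_wider(df1[, c("year", "company", "revenue")],
                        names_from = year, values_from = revenue)
as.data.frame(rev_wide)
#   company 2018 2019 2020 2021
# 1       x   45   78   13   NA
# 2       y   NA   89   48   NA
# 3       z   NA   NA   NA   70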

Replace values based on months in a dataframe with values in another column in r, using apply functions

I am working with a time series of precipitation data and am attempting to use the median imputation method to replace all 0-value data points with the median of all data points for the month in which that 0 value was recorded.
I have two data frames, one with the original precipitation data:
> head(df.m)
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
And one with the median monthly values:
> medians
Group.1 x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
Below is the current solution that I have come up with utilizing the 1st answer here:
df.m[,"prcp"] <- sapply(df.m[,"prcp"], function(y) ifelse(y==0, medians$x,y))
This has not worked, as it only applies the first value from the medians data frame (the January value, Group.1 == 01). How can I get the correct median applied for each corresponding month?
Another way I have attempted a solution is via the below:
df.m[,"prcp"] <- sapply(medians$Group.1, function(y)
ifelse(df.m[format.Date(df.m$date, "%m") == y &
df.m$prcp == 0, "prcp"], medians[medians$Group.1 == y,"x"],
df.m[,"prcp"]))
Description of the above attempt: for every month, it tests for and returns the zero values in df.m[,"prcp"].
Same issue here as the 1st solution, but it does return all of the 0 values by month (if just executing the sapply() portion).
How can I replace all 0 in df.m$prcp with their corresponding medians from the medians df based on the month of the data?
Apologies if this is a basic question, I'm somewhat of a newbie here. Any and all help would be greatly appreciated.
Consider merging the two dataframes by month/group and then calculating with ifelse:
# MERGE TWO FRAMES
df.m$month <- format(df.m$date, "%m")
df.merge <- merge(df.m, medians, by.x="month", by.y="Group.1")
# CONDITIONAL CALCULATION
df.merge$prcp <- ifelse(df.merge$prcp == 0, df.merge$x, df.merge$prcp)
# RETURN BACK TO ORIGINAL STRUCTURE
df.m <- df.merge[names(df.m)]
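One caveat (an aside, not from the original answer): merge() sorts the result by the key, so the rows come back grouped by month rather than in the original order. If the original order matters, a sketch that carries a row index through the merge:
df.m$month  <- format(df.m$date, "%m")
df.m$row_id <- seq_len(nrow(df.m))                      # remember the original order
df.merge <- merge(df.m, medians, by.x="month", by.y="Group.1")
df.merge <- df.merge[order(df.merge$row_id), ]          # restore the original order
df.merge$prcp <- ifelse(df.merge$prcp == 0, df.merge$x, df.merge$prcp)
df.m <- df.merge[setdiff(names(df.m), "row_id")]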
A dplyr version, which does not rely on the original row order. This uses slightly modified test data to show the replacement of zeroes and the handling of multiple years:
require(dplyr)
## test data with zeroes - extended for additional years
df.m <- read.delim(text="
i prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 0 1976-06-30
8 0 1976-07-31
9 70 1976-08-31
", sep="", stringsAsFactors = FALSE)
medians <- read.delim(text="
i month x
1 01 135.90680
2 02 123.52613
3 03 113.09841
4 04 98.10044
5 05 75.21976
6 06 57.47287
7 07 54.16667
8 08 45.57653
9 09 77.87740
10 10 103.25179
11 11 124.36795
12 12 131.30695
", sep = "", stringsAsFactors = FALSE, strip.white = TRUE)
# extract the month as integer
df.m$month = as.integer(substr(df.m$date,6,7))
# match to medians by joining
result <- df.m %>%
inner_join(medians, by='month') %>%
mutate(prcp = ifelse(prcp == 0, x, prcp)) %>%
select(prcp, date)
result
yields
prcp date
1 121.00485 1975-01-31
2 122.41667 1975-02-28
3 82.74026 1975-03-31
4 104.63514 1975-04-30
5 57.46667 1975-05-31
6 38.97297 1975-06-30
7 57.47287 1976-06-30
8 54.16667 1976-07-31
9 70.00000 1976-08-31
I created small datasets with some zero values and added one line of code:
#create sample data
prcp <- c(1.5,0.0,0.0,2.1)
date <- c(01,02,03,04)
x <- c(1.11,2.22,3.33,4.44)
df <- data.frame(prcp,date)
grp <- data.frame(x,date)
#Make the assignment
df[df$prcp == 0,]$prcp <- grp[df$prcp == 0,]$x
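Back on the original month-keyed data, the same direct assignment can be done without a merge by looking up each row's month in the medians table with match() (a sketch, assuming medians$Group.1 holds the zero-padded months as character, as printed above):
mo  <- format(as.Date(df.m$date), "%m")          # "01", "02", ...
idx <- match(mo, medians$Group.1)                # medians row for each observation
df.m$prcp <- ifelse(df.m$prcp == 0, medians$x[idx], df.m$prcp)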

Reshaping issues in R: my reshaped dataframe changes 3 variables into 1

I'm a relative newbie to R, trying to reshape my data from wide format to long format, and having problems. I suspect the problem comes from how I built the data.frame: I created it from another data.frame in R, filling it with mean values taken from that larger data.frame.
What I did was create an empty data.frame (ndf):
ndf <- data.frame(matrix(ncol = 0, nrow = 3))
Then I used lapply to get the means from the large data.frame (ldf) into separate columns of the new data.frame, taking the years from the names of the large data.frame:
ndf$Year <- names(ldf)
ndf$col1 <- lapply(ldf, function(i) {mean(i$col1)})
ndf$col2 <- lapply(ldf, function(i) {mean(i$col2)})
etc.
The melt function in reshape2 apparently does not work because there are non-atomic 'measure' columns.
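An aside, not part of the original question: the non-atomic columns come from lapply(), which returns a list; sapply() (or vapply()) returns an atomic vector, and building ndf that way would avoid the problem, e.g.:
ndf$col1 <- sapply(ldf, function(i) mean(i$col1))   # numeric vector, not a list column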
With the base reshape function I used this code:
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:7]),
v.names = "cover",
timevar = "species",
times = names(ndf[2:7]),
new.row.names = 1:1000,
direction = "long")
My output then essentially just repeats the first row for each variable. My wide data.frame looks like this (sorry for the strange names):
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
And the long data.frame looks like this:
Year species cover id
1 2014 Cladonia.portentosa 11.75 1
2 2015 Cladonia.portentosa 11.75 2
3 2016 Cladonia.portentosa 11.75 3
4 2014 Erica.tetralix 35.00 1
5 2015 Erica.tetralix 35.00 2
6 2016 Erica.tetralix 35.00 3
Where the "cover" column should have the value from each year put into the cell with the corresponding year.
Please could someone tell me where I've gone wrong!?
Here is an example of 'melting' in tidyr.
You'll need tidyr but I also like dplyr and am including it here to encourage its use along with the rest of the tidyverse. You'll find endless great tutorials on the web...
library(dplyr)
library(tidyr)
Let's use iris as an example; I want a long form where Species, variable and value are the columns.
data(iris)
Here it is with gather(): we specify that variable and value are the column names for the new 'melted' columns, and that we do not want to melt the Species column, which should remain its own column.
iris_long <- iris %>%
gather(variable, value, -Species)
Inspect the iris_long object to make sure it worked.
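The first few rows should look roughly like this:
head(iris_long)
#   Species     variable value
# 1  setosa Sepal.Length   5.1
# 2  setosa Sepal.Length   4.9
# 3  setosa Sepal.Length   4.7
# 4  setosa Sepal.Length   4.6
# 5  setosa Sepal.Length   5.0
# 6  setosa Sepal.Length   5.4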
In addition to roman's answer, I thought I would share exactly what I did with my data set.
My initial "wide" data.frame ndf looked like this:
Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5
I downloaded and installed tidyr
install.packages("tidyr")
then loaded the package
library(tidyr)
I then used the gather() function in the tidyr package to gather the species columns (Cladonia.portentosa, Erica.tetralix and Eriophorum.vaginatum) into one column, with a cover column holding the values, in the new "long" data.frame.
long.ndf <- ndf %>% gather(species, cover, Cladonia.portentosa:Eriophorum.vaginatum)
Easy peasy!
Thanks again to roman for the suggestion!
I'm answering your question in case it may help someone using the reshape function.
Please could someone tell me where I've gone wrong!?
You have not specified the parameter idvar, so reshape has created one for you named id. To avoid this, just add the line idvar = "Year" to your code:
ndf <- read.table(text =
"Year Cladonia.portentosa Erica.tetralix Eriophorum.vaginatum
1 2014 11.75 35 55
2 2015 15.75 25.75 70
3 2016 22.75 5 37.5",
header=TRUE, stringsAsFactors = F)
reshape.ndf <- reshape(ndf,
varying = list(names(ndf)[2:4]),
v.names = "cover",
idvar = "Year",
timevar = "species",
times = names(ndf[2:4]),
new.row.names = 1:9,
direction = "long")
The result looks as you were expecting
reshape.ndf
Year species cover
1 2014 Cladonia.portentosa 11.75
2 2015 Cladonia.portentosa 15.75
3 2016 Cladonia.portentosa 22.75
4 2014 Erica.tetralix 35.00
5 2015 Erica.tetralix 25.75
6 2016 Erica.tetralix 5.00
7 2014 Eriophorum.vaginatum 55.00
8 2015 Eriophorum.vaginatum 70.00
9 2016 Eriophorum.vaginatum 37.50

How to group in R with partial match and assign a column with the aggregated value?

Below is the data frame I have:
Quarter Revenue
1 2014-Q1 10
2 2014-Q2 20
3 2014-Q3 30
4 2014-Q4 40
5 2015-Q1 50
6 2015-Q2 60
7 2015-Q3 70
8 2015-Q4 80
I want to find the mean of the quarters containing Q1, Q2, Q3 and Q4 separately (e.g. for rows containing Q1 I have two revenue values, 10 and 50, whose mean is 30) and insert a column holding that mean. The output should look like the one described below:
Quarter Revenue Aggregate
1 2014-Q1 10 30
2 2014-Q2 20 40
3 2014-Q3 30 50
4 2014-Q4 40 60
5 2015-Q1 50 30
6 2015-Q2 60 40
7 2015-Q3 70 50
8 2015-Q4 80 60
Could you all please let me know if there are solutions both without using the popular packages and with them?
Thanks!
We can separate "Quarter" into "Year" and "Quart", group by "Quart", and get the mean of "Revenue":
library(dplyr)
library(tidyr)
separate(df1, Quarter, into = c("Year", "Quart"), remove = FALSE) %>%
group_by(Quart) %>%
mutate(Aggregate = mean(Revenue)) %>%
ungroup() %>%
select(-Quart, -Year)
# Quarter Revenue Aggregate
# <chr> <int> <dbl>
#1 2014-Q1 10 30
#2 2014-Q2 20 40
#3 2014-Q3 30 50
#4 2014-Q4 40 60
#5 2015-Q1 50 30
#6 2015-Q2 60 40
#7 2015-Q3 70 50
#8 2015-Q4 80 60
Or we can do this compactly with data.table. Convert the data.frame to a data.table with setDT(df1); then, grouped by the substring of 'Quarter' (with the year and '-' removed), assign (:=) the mean of 'Revenue' to create 'Aggregate'.
library(data.table)
setDT(df1)[, Aggregate := mean(Revenue) ,.(sub(".*-", "", Quarter))]
One possible solution using functions from the base package:
qtr <- c("Q1", "Q2", "Q3", "Q4")
df1$Aggregate <- NA
for (q in qtr) {
  ind <- grep(q, df1$Quarter)                  # rows belonging to this quarter
  df1$Aggregate[ind] <- mean(df1$Revenue[ind])
}
Using functions from other packages (e.g., dplyr) apparently makes the code less verbose.
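For completeness, base R's ave() collapses the same grouping into a single line (grouping on the quarter extracted from the Quarter string):
df1$Aggregate <- ave(df1$Revenue, sub(".*-", "", df1$Quarter), FUN = mean)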
