Use dplyr/tidyr to turn rows into columns in R data frame - r

I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to rearrange the data frame so that the first column is the name of the month and the remaining columns (one per year) hold the values of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things need to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()

To convert the month numbers to names, you can index into month.abb. Then summarize by year and month and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
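In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The following is a minimal sketch of the same reshaping with that function, assuming dplyr >= 1.0.0 for the .groups argument. The seed and the name `wide` are illustrative and not part of the original answer, and mean() is again only one possible aggregation.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data; the seed is an assumption added for reproducibility
set.seed(42)
df <- data.frame(
  year  = floor(runif(100, min = 2015, max = 2017)),
  month = floor(runif(100, min = 1, max = 13)),
  mm    = floor(runif(100, min = 0, max = 100))
)

wide <- df %>%
  # A factor with levels in calendar order makes arrange() sort Jan..Dec
  mutate(month = factor(month.abb[month], levels = month.abb)) %>%
  group_by(year, month) %>%
  summarise(mm = mean(mm), .groups = "drop") %>%
  # One column per year, filled with the aggregated mm values
  pivot_wider(names_from = year, values_from = mm) %>%
  arrange(month)

wide
```

Storing the month as a factor replaces the arrange(match(month, month.abb)) step, because the factor levels already carry the calendar order.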

Related

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values. It spans from Dec 1, 2018 to April 1, 2020.
The columns are "date" and "value". As shown here:
date <- c("2018-12-01","2018-12-02", "2018-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate the change in each week's total relative to the same week of the previous year.
I know that I can sum by week using the following function:
Data_week <- df%>% group_by(category ,week = cut(date, "week")) %>% mutate(summed= sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the dataframe so that I can calculate the week-over-week change (e.g. week of Dec. 1, 2019 / week of Dec. 1, 2018)?
2) How can I do the above using a "customized" week? Say I want to define a week by counting 7 days back from the latest date I have data for, e.g. the latest week would start on March 26th (April 1st - 7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
mutate(year = year(date)) %>%
group_by(week = week(date),year) %>%
summarize(summed = sum(value)) %>%
arrange(year, week) %>%
ungroup %>%
mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there are also isoweek and epiweek. See this answer for a great explanation of your options.
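The pipeline above compares each week with the week immediately before it. For a year-on-year comparison, one possible sketch is to group by week number and lag within each group. The simulated data, the seed, and the names `yoy`, `prev_year` and `ratio` are assumptions for illustration, and the lag only lines up correctly when every week number appears in consecutive years.

```r
library(dplyr)
library(lubridate)

# Simulated daily data over the range described in the question (assumption)
set.seed(1)
dates <- seq.Date(as.Date("2018-12-01"), as.Date("2020-04-01"), by = "day")
df <- data.frame(date = dates, value = runif(length(dates), 1500, 2500))

yoy <- df %>%
  group_by(year = year(date), week = week(date)) %>%
  summarise(summed = sum(value), .groups = "drop") %>%
  # Within each week number, order by year so lag() picks up the prior year
  group_by(week) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    prev_year = lag(summed),
    ratio     = summed / prev_year
  ) %>%
  ungroup()

yoy
```

Rows from the first year have no earlier year to compare against, so their ratio is NA.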
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))
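For the second part of the question (weeks counted back in 7-day blocks from the latest date), a rough sketch is to bucket each row by how many whole weeks separate it from the last date. The names `weeks_back`, `week_start`, `week_end`, `n_days` and `custom` are hypothetical, and the oldest bucket may hold fewer than 7 days.

```r
library(dplyr)

# Same simulated data as in the Data section above
set.seed(1)
df <- data.frame(
  date  = seq.Date(as.Date("2018-12-01"), as.Date("2019-01-29"), by = "day"),
  value = runif(60, 1500, 2500)
)

last_day <- max(df$date)

custom <- df %>%
  # 0 = the 7 days ending on last_day, 1 = the 7 days before that, and so on
  mutate(weeks_back = as.integer(last_day - date) %/% 7L) %>%
  group_by(weeks_back) %>%
  summarise(
    week_start = min(date),
    week_end   = max(date),
    n_days     = n(),
    summed     = sum(value),
    .groups    = "drop"
  ) %>%
  arrange(week_start) %>%
  mutate(change = summed - lag(summed))

custom
```

Checking n_days before interpreting change helps avoid comparing a partial first bucket against full weeks.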

Plotting average monthly counts per decade on a plot

I have a data set that has monthly "flows" over 68 years. I am trying to make a comparison of flow distributions by decade by making a plot that has a seasonal distribution on the x-axis and displays a mean value for each decade on the plot.
Using your sample data, and the tidyverse packages, the following code will calculate the average per decade and month:
library(tidyverse)
x <- "Year Jan Feb Mar Apr May Jun Jul Aug Sep
1948 29550 47330 64940 61140 20320 17540 37850 29250 17100
1949 45700 53200 37870 36310 39200 23040 31170 23640 19720
1950 16050 17950 27040 21610 15510 16090 12010 11360 14390
1951 14280 13210 16260 24280 13570 9547 9921 8129 7304
1952 19030 29250 58860 31780 19940 16930 9268 9862 9708
1953 24340 28020 31830 29700 44980 15630 22660 14190 13430
1954 34660 23260 24390 21500 13250 10860 10700 8188 6092
1955 14050 19430 12780 19330 12210 7892 12450 10920 6850
1956 7262 20800 27680 24110 13560 8594 10150 7721 10540
1957 14470 13350 22720 39860 23980 12630 10230 7008 8567"
d <- read_table(x) %>%
mutate(
decade = (Year %/% 10)*10 # add column for decade
) %>%
select(-Year) %>% # remove the year
pivot_longer( # convert to a 'tidy' (long) format
cols = Jan:Sep,
names_to = "month",
values_to = "count"
) %>%
mutate(
month = factor(month, levels = month.abb, ordered = TRUE) # make sure months are ordered
) %>%
group_by(decade, month) %>%
summarise(
mean = mean(count)
)
If you print that dataframe, you get:
> d
# A tibble: 18 x 3
# Groups: decade [2]
decade month mean
<dbl> <ord> <dbl>
1 1940 Jan 37625
2 1940 Feb 50265
3 1940 Mar 51405
4 1940 Apr 48725
5 1940 May 29760
6 1940 Jun 20290
7 1940 Jul 34510
8 1940 Aug 26445
9 1940 Sep 18410
10 1950 Jan 18018.
11 1950 Feb 20659.
12 1950 Mar 27695
13 1950 Apr 26521.
14 1950 May 19625
15 1950 Jun 12272.
16 1950 Jul 12174.
17 1950 Aug 9672.
18 1950 Sep 9610.
If you need it back in wide format:
d2 <- d %>%
pivot_wider(
id_cols = decade,
names_from = month,
values_from = mean
)
> d2
# A tibble: 2 x 10
# Groups: decade [2]
decade Jan Feb Mar Apr May Jun Jul Aug Sep
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1940 37625 50265 51405 48725 29760 20290 34510 26445 18410
2 1950 18018. 20659. 27695 26521. 19625 12272. 12174. 9672. 9610.
(Edit: changed from line graph to dodged bar plot, to better align with OP code.)
Here's an approach using dplyr, tidyr, and ggplot2 from tidyverse.
library(tidyverse)
M %>%
group_by(Decade = floor(Year/10)*10) %>%
summarize_at(vars(Jan:Sep), mean) %>%
# This uses tidyr::pivot_longer to reshape the data longer, which gives us the
# ability to map decade to color.
pivot_longer(-Decade, names_to = "Month", values_to = "Avg") %>%
# This step to get the months to be an ordered factor in order of appearance,
# which is necessary to avoid the months showing up in alphabetical order.
mutate(Month = fct_inorder(Month)) %>%
# Alternatively, we could have aligned these thusly
# mutate(Month_order = match(Month, month.abb)) %>%
# mutate(Month = fct_reorder(Month, Month_order)) %>%
ggplot(aes(Month, Avg, fill = as.factor(Decade))) +
geom_col(position = position_dodge()) +
scale_fill_discrete(name = "Decade")
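The data frame M is not shown on this page, and summarize_at() is superseded by across() in dplyr 1.0.0 and later. As a self-contained sketch of the same idea, the following uses a small stand-in for M built from the first rows and columns of the sample data in the previous answer. The names `decade_means` and `p` are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in for M (assumption): first four years, first three months
M <- data.frame(
  Year = c(1948, 1949, 1950, 1951),
  Jan  = c(29550, 45700, 16050, 14280),
  Feb  = c(47330, 53200, 17950, 13210),
  Mar  = c(64940, 37870, 27040, 16260)
)

decade_means <- M %>%
  group_by(Decade = floor(Year / 10) * 10) %>%
  # across() applies mean to every month column within each decade
  summarise(across(Jan:Mar, mean), .groups = "drop") %>%
  pivot_longer(-Decade, names_to = "Month", values_to = "Avg") %>%
  # Calendar-ordered factor so the x-axis is not alphabetical
  mutate(Month = factor(Month, levels = month.abb))

# Dodged bars: one bar per decade within each month
p <- ggplot(decade_means, aes(Month, Avg, fill = factor(Decade))) +
  geom_col(position = position_dodge()) +
  scale_fill_discrete(name = "Decade")
```

Printing p draws the chart. Setting the factor levels from month.abb keeps the axis in calendar order even if the columns of the input are not.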

How to rearrange daily stream discharge data into monthly format and rank the discharge values for each month using R

I have a data set of daily stream discharge values from a gauging station covering approximately 50 years. The data is arranged in three columns, namely "date", "month", and "discharge". (Sample data shown here.)
Date<- as.Date(c('1938-10-01','1954-10-27', '1967-06-16','1943-01-01','1945-01-14','1945-03-14','1954-05-04','1960-04-23','1960-05-09','1962-01-18','1968-12-19','1972-01-15','1977-08-15','1981-04-11','1986-06-20','1989-01-20','1992-03-29'))
Months<- c('Oct','Oct','Jun','Jan','Jan','Mar','May','Apr','May','Jan','Dec','Jan','Aug','Apr','Jun','Jan','Mar')
Dis<-c('1000','1200','400','255','450','215','360','120','145','1204','752','635','1456','154','154','1204','450')
Sampledata<-data.frame("Date"=Date,"Months"=Months,"Disch"=Dis)
print(Sampledata)
Date Months Disch
1 1938-10-01 Oct 1000
2 1954-10-27 Oct 1200
3 1967-06-16 Jun 400
4 1943-01-01 Jan 255
5 1945-01-14 Jan 450
6 1945-03-14 Mar 215
7 1954-05-04 May 360
8 1960-04-23 Apr 120
9 1960-05-09 May 145
10 1962-01-18 Jan 1204
11 1968-12-19 Dec 752
12 1972-01-15 Jan 635
13 1977-08-15 Aug 1456
14 1981-04-11 Apr 154
15 1986-06-20 Jun 154
16 1989-01-20 Jan 1204
17 1992-03-29 Mar 450
I want to calculate ranks for each month separately across all years. For example, rank the January values for all 50 years in ascending order, with duplicate discharge values receiving the same rank. Desired output shown here:
> Date Month Disch Rank
1 1943-01-01 Jan 255 1
2 1945-01-14 Jan 450 2
3 1962-01-18 Jan 1204 4
4 1972-01-15 Jan 635 3
5 1989-01-20 Jan 1204 4
> Date Month Disch Rank
1 1945-03-14 Mar 215 1
2 1992-03-29 Mar 450 2
3 2001-03-19 Mar 450 2
Without using any packages, first convert columns 2 and 3 to numeric and then use ave and rank with the indicated ties method. Finally, order the result.
Note that the output shown in the question does not correspond to the input: for example, there are three Mar rows in the output but only two in the input. The result below therefore corresponds to the input and will not be identical to the output shown.
Sampledata2 <- transform(Sampledata,
Disch = as.numeric(as.character(Disch)),
Months = as.numeric(format(Date, "%m")))
Rank <- function(x) rank(x, ties.method = "min")
Sampledata3 <- transform(Sampledata2,
Rank = ave(Disch, Months, FUN = Rank))
o <- with(Sampledata3, order(Months, Date))
Sampledata3[o, ]
An option would be to group by 'Month' and use one of the ranking functions (dense_rank, row_number, or min_rank, depending on your needs) to rank the 'Discharge' column:
library(dplyr)
df1 %>%
group_by(Month) %>%
mutate(Rank = dense_rank(Discharge))
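The snippet above uses df1, Month and Discharge, which differ from the names in the question's sample data. The following sketch applies the same idea to a subset of the sample rows using the question's column names (Months, Disch), converting Disch to numeric first since it is stored as text. The name `ranked` is illustrative.

```r
library(dplyr)

# Jan and Mar rows taken from the question's sample data
Sampledata <- data.frame(
  Date = as.Date(c("1943-01-01", "1945-01-14", "1962-01-18", "1972-01-15",
                   "1989-01-20", "1945-03-14", "1992-03-29")),
  Months = c("Jan", "Jan", "Jan", "Jan", "Jan", "Mar", "Mar"),
  Disch  = c("255", "450", "1204", "635", "1204", "215", "450"),
  stringsAsFactors = FALSE
)

ranked <- Sampledata %>%
  # Rank numerically, not as text
  mutate(Disch = as.numeric(Disch)) %>%
  group_by(Months) %>%
  # Ties share a rank and no rank is skipped afterwards
  mutate(Rank = dense_rank(Disch)) %>%
  ungroup() %>%
  arrange(match(Months, month.abb), Date)

ranked

# The two tie-aware rankers differ only after a tie:
# min_rank(c(10, 10, 20))   gives 1 1 3
# dense_rank(c(10, 10, 20)) gives 1 1 2
```

For the January rows this reproduces the ranks 1, 2, 4, 3, 4 shown in the question's desired output.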

R, dplyr: How to divide data frame elements by specific elements

edit: Solution at the end.
I have a dataframe that contains different variables and the sum of these different variables as a variable called "total".
I want to add a new column that calculates each variable's share of the "total" variable.
Example:
library(dplyr)
name <- c('A','A',
'B','B')
month = c("oct 2018", "nov 2018",
"oct 2018", "nov 2018")
value <- seq(1:length(month))
df = data.frame(name, month, value)
# Create total variable
dfTotal =
df%>%
group_by(month)%>%
summarize(value = sum(value, na.rm = TRUE))
dfTotal[["name"]] <- "Total"
dfTotal = as.data.frame(dfTotal)
# Add total column to dataframe
df2 = rbind(df, dfTotal)
df2
which gives the dataframe
name month value
1 A oct 2018 1
2 A nov 2018 2
3 B oct 2018 3
4 B nov 2018 4
5 Total nov 2018 6
6 Total oct 2018 4
What I want is to produce a new column with the shares of the total for each month in the above dataframe, so that I get something like
name month value share
1 A oct 2018 1 0.25 (=1/4)
2 A nov 2018 2 0.33 (=2/6)
3 B oct 2018 3 0.75 (=3/4)
4 B nov 2018 4 0.67 (=4/6)
5 Total nov 2018 6 1.00 (=6/6)
6 Total oct 2018 4 1.00 (=4/4)
Does anybody know how I can produce the last column of the second dataframe from the first dataframe?
Solution:
Based on tmfmnk's comment, the following solves the problem:
df2 =
df2 %>%
group_by(month) %>%
mutate(share = value/max(value))
df2
which gives
name month value share
<fct> <fct> <int> <dbl>
1 A oct 2018 1 0.25
2 A nov 2018 2 0.333
3 B oct 2018 3 0.75
4 B nov 2018 4 0.667
5 Total nov 2018 6 1
6 Total oct 2018 4 1
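Dividing by max(value) works here because the Total row is the largest value in each month, which holds only when no value is negative. A variant that divides by the Total row explicitly is sketched below. It assumes exactly one row named "Total" per month, and the name `shares` is illustrative.

```r
library(dplyr)

# df2 as printed in the question
df2 <- data.frame(
  name  = c("A", "A", "B", "B", "Total", "Total"),
  month = c("oct 2018", "nov 2018", "oct 2018", "nov 2018", "nov 2018", "oct 2018"),
  value = c(1, 2, 3, 4, 6, 4)
)

shares <- df2 %>%
  group_by(month) %>%
  # Within each month, pick out the Total row's value as the denominator
  mutate(share = value / value[name == "Total"]) %>%
  ungroup()

shares
```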

Weekends in a Month in R

I am trying to prepare an xreg series for my ARIMA model, and I will use the number of weekends in a month for it. I can get results for a single year, but I could not find a way when the data spans more than one year, which it usually does. Here is what I have so far.
dates <- seq(from=as.Date("2001-01-01"), to=as.Date("2010-12-31"), by = "day")
wd <- weekdays(dates)
aylar <- months(dates[which(wd == "Sunday" | wd == "Saturday")])
table(aylar)
What I want is to count the weekends for each month of each year, not just for each month name, so that the result has the same length as my original forecast series.
Here is my solution:
library(chron)
library(dplyr)
library(lubridate)
month <- months(dates[chron::is.weekend(dates)])
day <- dates[chron::is.weekend(dates)]
# create data.frame
df <- data.frame(date = day, month = month, year = chron::years(day))
df %>% group_by(year, month) %>% summarize(weekends = floor(n()/2))
# year month weekends
# <dbl> <fctr> <dbl>
#1 2001 April 4
#2 2001 August 4
#3 2001 Dezember 5
#4 2001 Februar 4
#5 2001 Januar 4
#6 2001 Juli 4
#7 2001 Juni 4
#8 2001 Mai 4
#9 2001 März 4
#10 2001 November 4
## ... with 110 more rows
I hope this is a starting point for your work.
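Since weekdays() and months() return locale-dependent names (as the German month names in the output above show), a base R sketch that avoids names altogether may be useful. It counts weekend days per year-month rather than weekends, and the names `is_weekend`, `ym`, `weekend_days` and `xreg` are illustrative.

```r
dates <- seq(as.Date("2001-01-01"), as.Date("2010-12-31"), by = "day")

# %u is the weekday as a number, 1 (Monday) to 7 (Sunday), independent of locale
is_weekend <- format(dates, "%u") %in% c("6", "7")

# A "YYYY-MM" key keeps the same month of different years apart
# and sorts chronologically
ym <- format(dates, "%Y-%m")

# Summing a logical vector counts the TRUE values in each year-month
weekend_days <- tapply(is_weekend, ym, sum)

xreg <- data.frame(
  year_month   = names(weekend_days),
  weekend_days = as.integer(weekend_days)
)

head(xreg)
```

The result has one row per calendar month in the date range (120 rows for these ten years), so it can be aligned with a monthly series of the same span.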
