Use function sets to group by row values [duplicate] - r

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 3 years ago.
Hi, how do I apply lapply() to the values in rows?
Let's say my dataset is
Name value month
SP 50 January
Mary 67 January
Justin 33 January
SP 45 February
Mary 37 February
Justin 53 February
SP 55 March
Mary 67 March
Justin 23 March
So I need to find the sum of values for each month,
to get my output as:
Month Value
January 150
February 135
March 145
Currently I only know how to calculate this for the whole dataset, without the monthly separation, as in:
library(tidyverse)
x <- tibble(Name = c("SP","Mary","Justin","SP","Mary","Justin","SP","Mary","Justin"),
            value = c(50,67,33,45,37,53,55,67,23),
            Month = rep(c("January", "February", "March"), each = 3))
y <- data.frame(x,sum(x$value))
I do not know how to get my desired outcome.
Any idea how?

We can use aggregate from base R:
aggregate(value ~ Month, x, sum)
Or with dplyr:
library(dplyr)
x %>%
  group_by(Month) %>%
  summarise(value = sum(value))
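As a quick check of the aggregate approach, here is a minimal base R sketch on the question's data. It assumes each month appears three times in a row, as in the table in the question; the name monthly and the calendar-order step are illustrative additions, not part of the answer.

```r
# Data from the question (Month is assumed to repeat three times per month)
x <- data.frame(
  Name  = rep(c("SP", "Mary", "Justin"), times = 3),
  value = c(50, 67, 33, 45, 37, 53, 55, 67, 23),
  Month = rep(c("January", "February", "March"), each = 3)
)

# Sum value within each Month
monthly <- aggregate(value ~ Month, data = x, FUN = sum)

# aggregate() returns the groups in alphabetical order;
# reorder them by calendar month using the built-in month.name vector
monthly <- monthly[order(match(monthly$Month, month.name)), ]

monthly$value  # 150 135 145
```

The reordering step is optional; it only matters if the output should follow the calendar rather than the alphabet.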

Related

How to add a new column based on a conditional using R? [duplicate]

This question already has an answer here:
Add rep vector to dataframe with uneven total rows
(1 answer)
Closed 2 years ago.
My data frame looks like this:
Year sales
1976 January 250
1976 February 350
1976 March 230
1976 April 255
.
.
This goes up to December 2003.
I want to add a new column "Month" with a number from 1 to 12 for every year and repeating thereafter.
So that it would look like this:
Year Month sales
1976 January 1 250
1976 February 2 350
1976 March 3 230
1976 April 4 255
.
.
1976 December 12 320
1977 January 1 233
1977 February 2 333
.
.
Can you help me with the code, if possible without using any packages?
Thank you
Probably a safer way than Konrad's answer:
library(tidyr)
library(dplyr)
mydat %>%
  # Split the year from the month into a separate variable
  separate(Year, c("Year", "month"), sep = " ") %>%
  # Add the month number based on the name of the month
  mutate(Month_num = match(month, month.name))
This will return the correct month number even if your rows are not properly ordered.
If the first row of the table is always January, and if no months are missing, you can do
table$Month = rep(1 : 12, length.out = nrow(table))
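Since the question asked for a solution without packages, the match-based idea can also be sketched in base R. This assumes the Year column holds strings of the form "1976 January", as the table in the question suggests; mydat and month_name are illustrative names.

```r
# A few rows shaped like the question's data (assumed layout)
mydat <- data.frame(
  Year  = c("1976 January", "1976 February", "1976 March", "1976 April"),
  sales = c(250, 350, 230, 255)
)

# Drop the leading year and the space, keeping only the month name
month_name <- sub("^[0-9]+ ", "", mydat$Year)

# Look up each name's position in the built-in month.name vector
mydat$Month <- match(month_name, month.name)

mydat$Month  # 1 2 3 4
```

Like the tidyr version, this does not depend on row order, and an unrecognised month name yields NA rather than a wrong number.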

extract specific digits from column of numbers in R

Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
  mutate(
    date=readr::parse_date(substr(code, 6, 13), format="%Y%m%d"),
    day=format(date, "%d"),
    month=format(date, "%m"),
    year=format(date, "%Y"),
    day.of.year=format(date, "%j")
  )
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
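Edit: note that the output for all the new columns is character. We can tell this without using str() because of the leading zeros in the new columns. To get rid of this, convert just those columns, e.g. out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)), or append that mutate call to the end of the existing pipeline. (Calling mutate_all(as.integer) on every column would also hit code, which is too large to fit in an integer and becomes NA, and date.)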
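For comparison, the same extraction can be sketched in base R, with the new columns converted to integers right away. This is only an illustration on the four codes from the question; it assumes every code carries a valid yyyymmdd date in characters 6 to 13.

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6 to 13 hold the date as yyyymmdd
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

# %d is the day of the month, %j the day of the year
day       <- as.integer(format(date, "%d"))
dayofyear <- as.integer(format(date, "%j"))

dat2 <- data.frame(code, date, day, dayofyear)
dat2$day        # 24  2 15  7
dat2$dayofyear  # 55  2 74 250
```

These values agree with the day and dayofyear vectors the question lists as the desired result.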

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values. It spans from Dec 1, 2018 to April 1, 2020.
The columns are "date" and "value". As shown here:
date <- c("2018-12-01","2018-12-02", "2018-12-03",
...
"2020-03-30","2020-03-31","2020-04-01")
value <- c(1592,1825,1769,1909,2022, .... 2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
What I would like to do is sum the values by week and then calculate the week-over-week change from the previous year to the current one.
I know that I can sum by week using the following:
Data_week <- df %>% group_by(category, week = cut(date, "week")) %>% mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the data frame so that I can calculate the change for the same week year on year (e.g. week of Dec. 1, 2019 / week of Dec. 1, 2018)?
2) How can I do the above, but using a "customized" week? Let's say I want to define a week as moving 7 days back from the latest date I have data for. E.g. the latest week I would have would be the week starting on March 26th (April 1st -7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date),year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup %>%
  mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there is also isoweek and epiweek. See this answer for a great explanation of your options.
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))
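The pipeline above differences consecutive weeks. For the second part of the question (a custom week counted back from the latest date) together with a year-on-year comparison, here is a minimal base R sketch. It assumes that "the same week a year earlier" means the bucket 52 weeks (364 days) before; weeks_back, prev_year and yoy_ratio are hypothetical names, and the values are random placeholders rather than the asker's data.

```r
# Placeholder data covering the date range described in the question
set.seed(1)
dates <- seq.Date(as.Date("2018-12-01"), as.Date("2020-04-01"), by = "day")
df <- data.frame(date = dates, value = runif(length(dates), 1500, 2500))

# Custom week: 0 = the 7 days ending on the latest date, 1 = the 7 days before, ...
df$weeks_back <- as.integer(max(df$date) - df$date) %/% 7L

# Sum the daily values within each custom week
weekly <- aggregate(value ~ weeks_back, data = df, FUN = sum)
names(weekly)[2] <- "summed"

# Assumption: compare each week with the bucket 52 weeks earlier
weekly$prev_year <- weekly$summed[match(weekly$weeks_back + 52L, weekly$weeks_back)]
weekly$yoy_ratio <- weekly$summed / weekly$prev_year

head(weekly)
```

With this definition the latest bucket runs from March 26 to April 1, 2020. The oldest bucket is usually a partial week, so a ratio computed against it should be read with care, and weeks without a counterpart 52 weeks earlier get NA.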

Increasing Year count every January

I have a data frame that looks similar to this:
I know the starting year of the first obs (1963). The obs are in exact chronological order, so the next instance of "Jan" (obs 13) indicates that the year is 1964. Is there a way to create a column "Year" that increases the current year every time the next occurrence of "Jan" happens?
In the pic, it would be "1964" and then when "Jan" happens again, 1965 and so on....
There is an answer to a similar problem that was suggested but it doesn't quite do it and here it is:
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10, 15),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec", "Dec"))
## Convert each month-day combo to its corresponding "julian date"
datestring <- paste("1963", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 1963 - cumsum(diff(c(julian[2], julian))>0)
But this won't do it: because the last two observations have the same month ("Dec" and then another "Dec"), the count for year changes again.
The last observation should still read "1960", not "1959".
The OP has requested to complete the years in ascending order starting in 1963.
The approach below works without date conversion and dummy dates and can be amended to work with fiscal years (see here).
df$year <- 1963 + cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) < 0))
df
day month year
1 24 Jun 1963
2 21 Mar 1964
3 20 Jan 1965
4 10 Dec 1965
5 20 Jun 1966
6 20 Jan 1967
7 10 Dec 1967
8 15 Dec 1967
Note that there is a question which seems to be similar but was asking to complete years in descending order. The solution there needs to be changed in two places to work here.
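To make the one-liner easier to follow, the same computation can be unpacked step by step. This is a sketch on the sample data from the question; month_num, key and new_year are illustrative names.

```r
df <- data.frame(day   = c(24, 21, 20, 10, 20, 20, 10, 15),
                 month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec", "Dec"))

# Month number from the abbreviation (Jan = 1, ..., Dec = 12)
month_num <- match(df$month, month.abb)

# One sortable number per row: 100 * month + day, e.g. 24 Jun -> 624
key <- 100L * month_num + df$day

# A new year starts wherever the key drops relative to the previous row
new_year <- c(FALSE, diff(key) < 0)

# Count the drops and add them to the known starting year
df$year <- 1963 + cumsum(new_year)

df$year  # 1963 1964 1965 1965 1966 1967 1967 1967
```

Because only a strict drop in the key counts as a year change, two consecutive rows with the same day and month would be treated as the same year, and a gap of more than one year between rows would be counted as a single step.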

Aggregate count of timeseries values which exceed threshold, by year-month

I am learning R and using the seas package to help with some calculations. My data is in the format the seas package expects. It is a time series:
require(seas)
data(mscdata)
dat.int <- (mksub(mscdata, id=1108447))
These are the column headings; there are 20 years of data:
year yday date t_max t_min t_mean rain snow precip
However, I now need to calculate the number of days in each month on which rainfall is >= 1.0 mm. At the end, I would have two columns (each month in each year, and the total number of days in that month with rainfall >= 1.0 mm).
I'm not certain how to write this code, and any help would be appreciated.
Thank you
Lam
I now need to calculate the number of days in each month rainfall is >= 1.0mm. So at the end of it. I would have two columns ( each month in each year and total # of days in each month rainfall>= 1.0mm )
1) dat.int$date is a Date object, so the first step is to create a new column dat.int$yearmon holding the year-month, e.g. using zoo::yearmon (see: Extract month and year from a zoo::yearmon object).
require(zoo)
dat.int$yearmon <- as.yearmon(dat.int$date, "%b %y")
2) Second, do a summarize operation (I recommend plyr or the newer dplyr) on rain >= 1.0, aggregated by yearmon. Let's name the resulting column rainy_days.
If you want to store the rainy_days column back into the dat.int dataframe, use transform instead of summarize:
require(plyr)
dat.int <- ddply(dat.int, .(yearmon), transform, rainy_days=sum(rain >= 1.0) )
or, if you really just want a new summary dataframe:
rainydays_by_yearmon <- ddply(dat.int, .(yearmon), summarize, rainy_days=sum(rain >= 1.0) )
print.data.frame(rainydays_by_yearmon)
yearmon rainy_days
1 Jan 1975 14
2 Feb 1975 12
3 Mar 1975 13
4 Apr 1975 6
5 May 1975 6
6 Jun 1975 5
...
355 Jul 2004 3
356 Aug 2004 7
357 Oct 2004 14
358 Nov 2004 16
359 Dec 2004 19
Note: you can do the above with plain old R, without using the zoo or plyr/dplyr packages. But I might as well show you nicer, more scalable, maintainable code idioms.
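For reference, the "plain old R" route could look like the sketch below. It does not use the seas data: dat is a small made-up data frame with date and rain columns, and the rule that every fifth day of the month is wet is purely an assumption to keep the example deterministic.

```r
# Hypothetical daily data: three months, rain of 2.5 mm on days 5, 10, 15, ...
dat <- data.frame(date = seq.Date(as.Date("2004-01-01"), as.Date("2004-03-31"), by = "day"))
day_of_month <- as.integer(format(dat$date, "%d"))
dat$rain <- ifelse(day_of_month %% 5 == 0, 2.5, 0.2)

# Year-month label, e.g. "2004-01"
dat$yearmon <- format(dat$date, "%Y-%m")

# Count the days per year-month with at least 1.0 mm of rain
rainy <- aggregate(rain ~ yearmon, data = dat, FUN = function(r) sum(r >= 1.0))
names(rainy)[2] <- "rainy_days"

rainy$rainy_days  # 6 5 6
```

Formatting the date as "%Y-%m" keeps the groups in chronological order when sorted as text, which a label such as "Jan 2004" would not.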
