Extract specific digits from a column of numbers in R

Apologies if this is a repeat question; I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my dataframe, but I need to also create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins as the 6th digit in the code. So I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!

You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
  mutate(
    date = readr::parse_date(substr(code, 6, 13), format = "%Y%m%d"),
    day = format(date, "%d"),
    month = format(date, "%m"),
    year = format(date, "%Y"),
    day.of.year = format(date, "%j")
  )
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
Edit: note that all the new columns are character. We can tell this without using str() because of the leading zeros in the output. To get integers, convert only the derived columns, e.g. out <- out %>% mutate(across(c(year, month, day, day.of.year), as.integer)), or append that call to the end of the existing pipeline. (Converting every column would turn the 16-digit code into NA, since it exceeds the integer range.)
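For readers who would rather avoid the tidyverse dependency, the same idea can be sketched in base R. This is a minimal sketch, assuming the code column is stored as character and the date always occupies characters 6 to 13; the column names day and dayofyear follow the question's desired output.

```r
# Example data from the question
dat <- data.frame(
  code = c("1109619910224003", "1157919910102001",
           "1539820070315001", "1563120190907002"),
  stringsAsFactors = FALSE
)

# Characters 6-13 hold the date as yyyymmdd
date <- as.Date(substr(dat$code, 6, 13), format = "%Y%m%d")

# Pull out the pieces as integers (so no leading zeros)
dat$year      <- as.integer(format(date, "%Y"))
dat$month     <- as.integer(format(date, "%m"))
dat$day       <- as.integer(format(date, "%d"))
dat$dayofyear <- as.integer(format(date, "%j"))
dat
```

Because the values are converted with as.integer() as they are created, no separate clean-up step is needed afterwards.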

Related

How to organise a date in 3 different columns in r?

I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this:
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of data and cannot do it manually.
Can anybody help me? Thanks a lot.
Using strsplit, you can do it this way:
Logic: strsplit splits each date at the dashes, producing a list with one element per date, each holding three parts (year, month, and day). do.call('rbind', ...) then row-binds these list elements into a matrix with one row per date. We convert the matrix into a data frame, use setNames to name the columns, and finally cbind the result with the original precipitation column. The as.character call guards against the date column being a factor, which strsplit does not accept.
cbind(setNames(data.frame(do.call('rbind', strsplit(as.character(df$date), '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
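An alternative to splitting the string is to parse it as a Date and format each part. The following is a minimal base R sketch under the assumption that every date is a valid yyyy-mm-dd string; the out data frame is a name introduced here for illustration.

```r
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  Precipitation = c(20, 22, 23),
  stringsAsFactors = FALSE
)

# Parse once, then format each component as zero-padded text
d <- as.Date(df$date)
out <- data.frame(
  year  = format(d, "%Y"),
  month = format(d, "%m"),
  day   = format(d, "%d"),
  pp    = df$Precipitation,
  stringsAsFactors = FALSE
)
out
```

Parsing through as.Date also means an invalid date shows up as NA rather than being split silently.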
The clock-based approach below returns integer values for year, month, and day. If you really need them as zero-padded character strings, you can apply formatC(x, width = 2, flag = "0") to the result.
library(clock)
library(dplyr)
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  pp = c(20, 22, 23)
)
df %>%
  mutate(
    date = as.Date(date),
    year = get_year(date),
    month = get_month(date),
    day = get_day(date)
  )
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2

ID variable changes over time, how to calculate pct change by ID

The problem starts with the difficulty of explaining it.
I have a data set with a time dimension in which the ID variables change name over time, which makes it difficult to calculate, e.g., percentage changes over time by ID.
ID YR Value
01 2004 100
02 2005 50
03 2005 50
04 2005 10
I need to calculate the percentage change in Value over time by ID. The problem is that in 2005 the ID 01 is split into three IDs (02, 03, 04), so one has to aggregate the values of these three IDs in 2005 to get the corresponding value for ID 01 in 2005. The change for ID 01 is NOT 50/100 but sum(50, 50, 10)/100.
I have a data.frame containing only the IDs, matching the changes over time. It looks like this:
x2004 x2005
01 01
01 02
01 03
I used group_by from dplyr to create matching between IDs in the two years
group_by(x2004) %>%
summarize(onetomany = paste(sort(unique(x2005)),collapse=", "))
Which gave me a data.frame of the form
cv2004 onetomany
1 1 1, 2, 3
There I can see which IDs belong to the same group, but that is where I got stuck with the percentage calculation.
I totally understand that the problem itself is not easy to understand. This is a common problem in trade statistics: commodity codes change name over time but not content, and one has to keep track of the changes to follow developments in trade over time by commodity. Any suggestion is appreciated.
library(dplyr)
df <- data.frame("ID" = c("01", "02", "03", "04"),
                 "YR" = c(2004, 2005, 2005, 2005),
                 "Value" = c(100, 50, 50, 10))
# Total Value per year, summed across all IDs
df %>% group_by(YR) %>% summarise(sum = sum(Value))
# A tibble: 2 x 2
YR sum
<dbl> <dbl>
1 2004 100
2 2005 110
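The yearly totals above do not yet link the 2005 IDs back to ID 01. One way to do that, sketched here in base R, is to relabel each 2005 ID with its 2004 parent through a concordance table and then aggregate. The map data frame and the base_id column are hypothetical names introduced for illustration, and the mapping of 02, 03, and 04 to 01 follows the question's description of the split.

```r
# Values by ID and year, as in the question
df <- data.frame(ID = c("01", "02", "03", "04"),
                 YR = c(2004, 2005, 2005, 2005),
                 Value = c(100, 50, 50, 10),
                 stringsAsFactors = FALSE)

# Hypothetical concordance: each 2005 ID mapped to its 2004 parent
map <- data.frame(x2004 = c("01", "01", "01"),
                  x2005 = c("02", "03", "04"),
                  stringsAsFactors = FALSE)

# Relabel 2005 rows with their 2004 parent; 2004 rows keep their own ID
df$base_id <- ifelse(df$YR == 2005,
                     map$x2004[match(df$ID, map$x2005)],
                     df$ID)

# Total value per parent ID and year
totals <- aggregate(Value ~ base_id + YR, data = df, FUN = sum)

# One row per parent ID, then the percentage change from 2004 to 2005
wide <- reshape(totals, idvar = "base_id", timevar = "YR", direction = "wide")
wide$pct_change <- (wide$Value.2005 / wide$Value.2004 - 1) * 100
wide
```

With real data covering more than two years, the same relabelling would have to be chained across each year's concordance, so this sketch only covers the single 2004 to 2005 step.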

Splitting Columns by Number of Characters [duplicate]

I have a column of dates in a data table entered in 6-digit numbers as such: 201401, 201402, 201403, 201412, etc. where the first 4 digits are the year and second two digits are month.
I'm trying to split that column into two columns, one called "year" and one called "month". Been messing around with strsplit() but can't figure out how to get it to do number of characters instead of a string pattern, i.e. split in the middle of the 4th and 5th digit.
Without using any external package, we can do this with substr
transform(df1, Year = substr(dates, 1, 4), Month = substr(dates, 5, 6))
# dates Year Month
#1 201401 2014 01
#2 201402 2014 02
#3 201403 2014 03
#4 201412 2014 12
We have the option to remove or keep the column.
Or with sub
cbind(df1, read.csv(text=sub('(.{4})(.{2})', "\\1,\\2", df1$dates), header=FALSE))
Or using some package solutions
library(tidyr)
extract(df1, dates, into = c("Year", "Month"), "(.{4})(.{2})", remove=FALSE)
Or with data.table
library(data.table)
setDT(df1)[, tstrsplit(dates, "(?<=.{4})", perl = TRUE)]
tidyr::separate can take an integer for its sep parameter, which will split at a particular location:
library(tidyr)
df <- data.frame(date = c(201401, 201402, 201403, 201412))
df %>% separate(date, into = c('year', 'month'), sep = 4)
#> year month
#> 1 2014 01
#> 2 2014 02
#> 3 2014 03
#> 4 2014 12
Note the new columns are character; add convert = TRUE to coerce back to numbers.
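If the column is numeric and numeric results are wanted, integer arithmetic avoids string handling altogether. This is a small sketch, assuming every value has the form yyyymm; df1 and dates reuse the names from the first answer.

```r
df1 <- data.frame(dates = c(201401, 201402, 201403, 201412))

# Integer division drops the last two digits; the remainder keeps them
df1$Year  <- df1$dates %/% 100
df1$Month <- df1$dates %% 100
df1
```

The resulting Month values are plain numbers, so leading zeros are not preserved.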

Aggregate count of timeseries values which exceed threshold, by year-month

I am learning R and using the seas package to help with some calculations; my data is in the format the seas package expects. It is a time series:
require(seas)
data(mscdata)
dat.int <- (mksub(mscdata, id=1108447))
These are the column headings; the data cover 20 years:
year yday date t_max t_min t_mean rain snow precip
However, I now need to calculate the number of days in each month on which rainfall is >= 1.0 mm, so that I end up with two columns (each month in each year, and the total number of days in that month with rainfall >= 1.0 mm).
I'm not certain how to write this code and any help would be appreciated
Thank you
Lam
1) dat.int$date is a Date object, so the first step is to create a new column dat.int$yearmon holding the year-month, e.g. using zoo::as.yearmon (no format string is needed when the input is already a Date):
require(zoo)
dat.int$yearmon <- as.yearmon(dat.int$date)
2) Second, do a summarize operation (I recommend plyr or the newer dplyr) on rain >= 1.0 aggregated by yearmon. Let's name the resulting column rainy_days.
require(plyr)
If you want to store the rainy_days column back into the dat.int dataframe, use transform instead of summarize:
dat.int <- ddply(dat.int, .(yearmon), transform, rainy_days=sum(rain >= 1.0) )
or else, if you really just want a new summary dataframe:
rainydays_by_yearmon <- ddply(dat.int, .(yearmon), summarize, rainy_days=sum(rain >= 1.0) )
print.data.frame(rainydays_by_yearmon)
yearmon rainy_days
1 Jan 1975 14
2 Feb 1975 12
3 Mar 1975 13
4 Apr 1975 6
5 May 1975 6
6 Jun 1975 5
...
355 Jul 2004 3
356 Aug 2004 7
357 Oct 2004 14
358 Nov 2004 16
359 Dec 2004 19
Note: you can do the above in plain R, without the zoo or plyr/dplyr packages, but the package idioms are nicer, more scalable, and more maintainable.
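As the note above says, the same count can be obtained without zoo or plyr. The following is a minimal base R sketch; the toy data frame stands in for dat.int (its values are made up for illustration), and the year-month label is built with format() rather than a yearmon object.

```r
# Toy daily data standing in for dat.int (hypothetical values)
dat <- data.frame(
  date = as.Date("2004-01-29") + 0:5,   # 29 Jan to 3 Feb 2004
  rain = c(0.0, 2.5, 1.0, 0.4, 3.2, 0.0)
)

# Year-month label such as "2004-01"
dat$yearmon <- format(dat$date, "%Y-%m")

# Flag rainy days, then sum the flags within each month
dat$rainy_days <- dat$rain >= 1.0
rainy <- aggregate(rainy_days ~ yearmon, data = dat, FUN = sum)
rainy
```

Summing a logical vector counts its TRUE values, which is the same trick the ddply call relies on.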

Week Number to Starting Date of Each Week in R

A few questions have come close to what I am looking for, but I can't find one that gets it right on.
I have sales data for several products for each day over a 6-year period. I summed the data by week, starting January 1, 2008. During the period 1/1/08-12/30/13, there were 313 weeks, so I just created dataframes for each product that contained columns for week numbers 1-313 and the weekly sales for each respective week.
I am plotting them with ggplot2, adding trendlines, etc.
The x-axis obviously uses the week number for its values, but I would prefer if it used the actual dates of the start of each week (January 1, 2008, a Tuesday, January 8, 2008, December 25, 2013, etc.).
What is the best way to do this? How can I convert weeks 1-313 into their respective Start of Week dates? Or, is there a way to override the axis values on the plot itself?
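To convert your week numbers to dates, try something like this (the example starts the weeks on Monday, 2007-12-31; set start.date to as.Date("2008-01-01") if week 1 should begin on January 1, 2008):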
weeks <- 1:313
start.date <- as.Date("2007/12/31")
y <- start.date + (weeks - 1)*7
head(y)
"2007-12-31" "2008-01-07" "2008-01-14" "2008-01-21" "2008-01-28" "2008-02-04"
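If week 1 should start on January 1, 2008, as the question describes, the start dates can also be generated directly with seq(). This is a minimal sketch; week_start and sales_weeks are names introduced here for illustration.

```r
# Start date of each of the 313 weeks, beginning Tuesday 2008-01-01
week_start <- seq(as.Date("2008-01-01"), by = "week", length.out = 313)

# Attach the dates to the week numbers so they can be used on a plot axis
sales_weeks <- data.frame(week = 1:313, week_start = week_start)
head(sales_weeks, 3)
#   week week_start
# 1    1 2008-01-01
# 2    2 2008-01-08
# 3    3 2008-01-15
```

Because week_start is a Date column, plotting functions that understand dates can label the axis with it instead of the week number.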
Use package:lubridate?
Sample data (which you should have provided):
> df = data.frame(wid=1:10,z=runif(10))
> head(df)
wid z
1 1 0.2071595
2 2 0.4313403
3 3 0.7063967
4 4 0.2245014
5 5 0.2004542
6 6 0.1231366
Assuming your data are consecutive, with no gaps:
> require(lubridate)
> df$week=mdy("Jan 1 2008") + weeks(0:(nrow(df)-1))
> head(df)
wid z week
1 1 0.2071595 2008-01-01
2 2 0.4313403 2008-01-08
3 3 0.7063967 2008-01-15
4 4 0.2245014 2008-01-22
5 5 0.2004542 2008-01-29
6 6 0.1231366 2008-02-05
Then plot for nice labels:
> require(ggplot2)
> ggplot(df,aes(x=week,y=z))+geom_line()

Resources