I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to reorganise it like this:
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have far too much data to do this manually.
Can anybody help me? Thanks a lot.
Using strsplit, you can do it this way:
Logic: strsplit splits each date on the dashes, producing a list with one element per row, each holding three parts: year, month and day. do.call passes all of those list elements to rbind at once, which row-binds them into a matrix with one row per date. Since the result is a matrix, we convert it to a data frame and use setNames to name its columns. The final cbind joins this 3x3 data frame to the original precipitation column.
cbind(setNames(data.frame(do.call('rbind', strsplit(df$date, '-'))), c('Year', 'month', 'day')), 'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
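The one-liner above can be easier to follow when unpacked into separate steps. The following is only a sketch of the same idea; the intermediate names (parts, parts_matrix, parts_df, result) are made up for illustration, and the column names are changed to match the layout asked for in the question.

```r
# Sample data from the question
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  Precipitation = c(20, 22, 23)
)

# Step 1: split each date on "-" (a list with one character vector per row).
# as.character() guards against the column having been read as a factor.
parts <- strsplit(as.character(df$date), "-", fixed = TRUE)

# Step 2: row-bind the list elements into a 3-column character matrix
parts_matrix <- do.call(rbind, parts)

# Step 3: convert to a data frame and name the columns
parts_df <- setNames(
  as.data.frame(parts_matrix, stringsAsFactors = FALSE),
  c("year", "month", "day")
)

# Step 4: attach the precipitation values under the short name "pp"
result <- cbind(parts_df, pp = df$Precipitation)
result
```

The year, month and day columns stay character here, so the leading zeros are preserved.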
This returns integer values for year, month, and day. If you really need them as characters padded with 0 you can use formatC(x, width = 2, flag = "0") on the result.
library(clock)
library(dplyr)
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  pp = c(20, 22, 23)
)
df %>%
  mutate(
    date = as.Date(date),
    year = get_year(date),
    month = get_month(date),
    day = get_day(date)
  )
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
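If zero-padded character values are needed after extracting integers as above, the padding step might look like the sketch below. It uses only base R; the vectors month and day are stand-ins for the extracted columns, not part of the original answer.

```r
# Integer components, as a date-parsing approach would return them
month <- c(3L, 2L, 1L)
day   <- c(24L, 3L, 2L)

# Pad to width 2 with leading zeros; the result is a character vector
month_chr <- formatC(month, width = 2, flag = "0")
day_chr   <- formatC(day, width = 2, flag = "0")

# sprintf() is an equivalent base R alternative for integer input
same_as_sprintf <- identical(month_chr, sprintf("%02d", month))
```

Inside the mutate() call this would amount to wrapping get_month(date) and get_day(date) in formatC(..., width = 2, flag = "0").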
Hi, I have a data frame in the form shown below:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7), Date = c("20200230",
"20200422", "20100823", "20190801", "20130230", "20160230", "20150627"
)), class = "data.frame", row.names = c(NA, -7L))
ID Date
1 1 20200230
2 2 20200422
3 3 20100823
4 4 20190801
5 5 20130230
6 6 20160230
7 7 20150627
The date in the Date column is not in a standard format; it is stored in yyyymmdd form. How can I separate the year, month and day from the Date column and save them as separate new columns in the data frame, so the result looks like this?
ID Date Year Month Day
1 1 20200230 2020 02 30
2 2 20200422 2020 04 22
3 3 20100823 ....................
4 4 20190801 ....................
5 5 20130230 ....................
6 6 20160230 ....................
7 7 20150627 ....................
I tried using format(as.Date(x, format="%YYYY%mm/%dd"),"%YYYY") but it didn't work for me. I also tried the following code:
Data$Year <- year(ymd(Data$Date))
The result is in this form:
ID Date Year
1 1 20200230 NA
2 2 20200422 2020
3 3 20100823 2010
4 4 20190801 2019
5 5 20130230 NA
6 6 20160230 NA
7 7 20150627 2015
As mentioned by @neilfws, the reason I get NA is that the date is not valid; however, I really don't care about the validity and I want to extract the year in any case.
If you only want the year and are not concerned with date validation, the easiest solution is probably to extract the first 4 characters from Date and convert to numeric.
Data$Year <- as.numeric(substring(Data$Date, 1, 4))
It might be good to add some kind of check on Date, e.g. that every value contains exactly 8 digits.
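One possible form of such a check is sketched below. The sample data is invented for illustration (the last two values are malformed on purpose), and the name ok is hypothetical.

```r
# Made-up sample: the last two Date values are deliberately malformed
Data <- data.frame(
  ID = 1:4,
  Date = c("20200230", "20200422", "2010823", "2019-08-01"),
  stringsAsFactors = FALSE
)

# TRUE where Date consists of exactly 8 digits and nothing else
ok <- grepl("^[0-9]{8}$", Data$Date)

# Extract the year only for well-formed values; leave the rest as NA
Data$Year <- NA_integer_
Data$Year[ok] <- as.integer(substr(Data$Date[ok], 1, 4))
Data
```

This only checks the shape of the string, so an impossible date such as 20200230 still yields a year, which is what the question asks for.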
Base R in one expression:
# If you want to keep the Date vector:
cbind(df,
      strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
                 x = df$Date,
                 proto = list(year = integer(), month = integer(), day = integer())))
# If you want to drop the Date vector:
cbind(within(df, rm(Date)),
      strcapture(pattern = "^(\\d{4})(\\d{2})(\\d{2})$",
                 x = df$Date,
                 proto = list(year = integer(), month = integer(), day = integer())))
Apologies if this is a repeat question, I searched and could not find the specific answer I am looking for.
I have a data frame where one column is a 16-digit code, and there are a number of other columns. Here is a simplified example:
code = c("1109619910224003", "1157919910102001", "1539820070315001", "1563120190907002")
year = c(1991, 1991, 2007, 2019)
month = c(02, 01, 03, 09)
dat = as.data.frame(cbind(code,year,month))
dat
> dat
code year month
1 1109619910224003 1991 2
2 1157919910102001 1991 1
3 1539820070315001 2007 3
4 1563120190907002 2019 9
As you can see, the code contains year, month, and day information. I already have columns for year and month in my data frame, but I also need to create a day column, which would be 24, 02, 15, and 07 in this example. The date is always in the format yyyymmdd and begins at the 6th digit of the code, so I essentially need to extract the 12th and 13th digits from each code to create my day column.
I then need to create another column for day of year from the date information, so I end up with the following:
day = c(24, 02, 15, 07)
dayofyear = c(55, 2, 74, 250)
dat2 = as.data.frame(cbind(code,year,month,day,dayofyear))
dat2
> dat2
code year month day dayofyear
1 1109619910224003 1991 2 24 55
2 1157919910102001 1991 1 2 2
3 1539820070315001 2007 3 15 74
4 1563120190907002 2019 9 7 250
Any suggestions? Thanks!
You can leverage the Date data type in R to accomplish all of these tasks. First we will parse out the date portion of the code (characters 6 to 13), and convert them to Date format using readr::parse_date(). Once the date is converted, we can simply access all of the values you want rather than calculating them ourselves.
library(tidyverse)
out <- dat %>%
  mutate(
    date = readr::parse_date(substr(code, 6, 13), format = "%Y%m%d"),
    day = format(date, "%d"),
    month = format(date, "%m"),
    year = format(date, "%Y"),
    day.of.year = format(date, "%j")
  )
(I'm using tidyverse syntax here because I find it quicker for these types of problems)
Once we create these columns, we can look at the updated data.frame out:
code year month date day day.of.year
1 1109619910224003 1991 02 1991-02-24 24 055
2 1157919910102001 1991 01 1991-01-02 02 002
3 1539820070315001 2007 03 2007-03-15 15 074
4 1563120190907002 2019 09 2019-09-07 07 250
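Edit: note that the output for all the new columns except date is character. We can tell this without using str() because of the leading zeros in the new columns. To get rid of this, we can do something like out <- out %>% mutate(across(c(day, month, year, day.of.year), as.integer)), or just append that mutate call to the end of our existing pipeline. Converting every column (e.g. with mutate_all(as.integer)) would not work here, because the 16-digit code values are too large for an integer and would become NA.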
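For comparison, the same extraction can be sketched in base R without any packages. This assumes, as stated in the question, that the date always occupies characters 6 to 13 of the code; the variable names are illustrative.

```r
code <- c("1109619910224003", "1157919910102001",
          "1539820070315001", "1563120190907002")

# Characters 6 to 13 hold the date in yyyymmdd form
date <- as.Date(substr(code, 6, 13), format = "%Y%m%d")

# Day of month and day of year, converted straight to integers
day       <- as.integer(format(date, "%d"))
dayofyear <- as.integer(format(date, "%j"))

data.frame(code, date, day, dayofyear)
```

Unlike plain substring extraction, going through as.Date() turns an impossible date into NA, which may or may not be what you want.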
I have a column of dates in a data table entered in 6-digit numbers as such: 201401, 201402, 201403, 201412, etc. where the first 4 digits are the year and second two digits are month.
I'm trying to split that column into two columns, one called "year" and one called "month". Been messing around with strsplit() but can't figure out how to get it to do number of characters instead of a string pattern, i.e. split in the middle of the 4th and 5th digit.
Without using any external package, we can do this with substr
transform(df1, Year = substr(dates, 1, 4), Month = substr(dates, 5, 6))
# dates Year Month
#1 201401 2014 01
#2 201402 2014 02
#3 201403 2014 03
#4 201412 2014 12
We have the option to remove or keep the column.
Or with sub
cbind(df1, read.csv(text=sub('(.{4})(.{2})', "\\1,\\2", df1$dates), header=FALSE))
Or using some package solutions
library(tidyr)
extract(df1, dates, into = c("Year", "Month"), "(.{4})(.{2})", remove=FALSE)
Or with data.table
library(data.table)
setDT(df1)[, tstrsplit(dates, "(?<=.{4})", perl = TRUE)]
tidyr::separate can take an integer for its sep parameter, which will split at a particular location:
library(tidyr)
df <- data.frame(date = c(201401, 201402, 201403, 201412))
df %>% separate(date, into = c('year', 'month'), sep = 4)
#> year month
#> 1 2014 01
#> 2 2014 02
#> 3 2014 03
#> 4 2014 12
Note the new columns are character; add convert = TRUE to coerce back to numbers.
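If the column is stored as numbers rather than strings, integer arithmetic is another option that avoids string handling altogether. This is a sketch under the assumption that every value has the six-digit yyyymm form; the variable names are illustrative.

```r
dates <- c(201401, 201402, 201403, 201412)

# Integer division drops the last two digits, leaving the year
year <- dates %/% 100

# The remainder after dividing by 100 is the month
month <- dates %% 100

data.frame(dates, year, month)
```

The results are numeric, so months lose their leading zero (1 rather than "01").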
I have a date column where dates look like this:
19940818
19941215
What is the proper command to extract the year and month from them?
If your data is e.g.
(df <- data.frame(date = c("19940818", "19941215")))
# date
#1 19940818
#2 19941215
To add two columns, one for month and one for year, you can do
within(df, {
year <- substr(date, 1, 4)
month <- substr(date, 5, 6)
})
# date month year
# 1 19940818 08 1994
# 2 19941215 12 1994
I don't see a need to convert to Date class here since all you want is a substring of the date column.
Another option is to use extract from tidyr. Using df from @Richard Scriven's post
library(tidyr)
extract(df, date, c('year', 'month'), '(.{4})(.{2}).*', remove=FALSE)
# date year month
#1 19940818 1994 08
#2 19941215 1994 12
I am working with a dataset of about 800 weather stations, with monthly air temperature values for each station from 1986 to 2014. The data are split into three columns: (1) station name, (2) date (year and month), and (3) Temp. In general, the data look something like this:
STATION DATE TEMP
Station 1 198601 -15
Station 1 198602 -16
Station 1 201401 -10
Station 1 201402 -14
Station 2 198601 -11
Station 2 198602 -9
Station 2 201401 -5
Station 2 201402 -4
I need to extract the average temperature at each weather station for a given month in various year ranges. For example, I might need to know the average July temperature at each weather station from 1986-1990. My ideal output, then, would be a new list or data frame giving the average temperature for each station, based on my specified date range.
I am sure this can be accomplished using a for loop, but I am not very proficient at creating such code. Any suggestions would be greatly appreciated.
Using dplyr instead of data table
weather <- data.frame(station = c("Station 1", "Station 1", "Station 1", "Station 1",
                                  "Station 2", "Station 2", "Station 2", "Station 2"),
                      date = c(198601, 198602, 201401, 201402, 198601, 198602, 201401, 201402),
                      temp = c(-15, -16, -10, -14, -11, -9, -5, -4))
library(dplyr)
library(stringr)
# get month and year columns in data
weather <- mutate(weather,
                  year = str_extract(date, "\\d{4}"),
                  month = str_extract(date, "\\d{2}$"))
# get the mean for each station for each month
mean_station <- group_by(weather, station, month) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))
If you need to only do this on a certain range of dates you can add a filter on the year
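# year is a character column, so convert it before comparing with numbers
mean_station <- group_by(weather, station, month) %>%
  filter(as.integer(year) >= 1986, as.integer(year) <= 2015) %>%
  summarise(mean_temp = mean(temp, na.rm = TRUE))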
Something like this...?
> df$month <- substr(df$DATE, 5, 6)
> aggregate(TEMP ~ STATION + month, data = df, FUN = mean)
   STATION month  TEMP
1 Station1    01 -12.5
2 Station2    01  -8.0
3 Station1    02 -15.0
4 Station2    02  -6.5
Or maybe
library(data.table)
setDT(df)[, list(MeanTemp = mean(TEMP)),
          by = list(STATION, Mon = substr(DATE, 5, 6))]
# STATION Mon MeanTemp
# 1: Station 1 01 -12.5
# 2: Station 1 02 -15.0
# 3: Station 2 01 -8.0
# 4: Station 2 02 -6.5
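To restrict the average to one month and a range of years, as the question asks, the filtering and aggregation could be wrapped in a small function. The sketch below uses base R only; station_means and its arguments are hypothetical names, and it assumes DATE holds six-digit yyyymm values.

```r
# Sample data in the layout shown in the question
weather <- data.frame(
  STATION = rep(c("Station 1", "Station 2"), each = 4),
  DATE = rep(c(198601, 198602, 201401, 201402), times = 2),
  TEMP = c(-15, -16, -10, -14, -11, -9, -5, -4)
)

# Hypothetical helper: mean TEMP per station for one month,
# limited to the years from `from` to `to` (inclusive)
station_means <- function(data, month, from, to) {
  yyyymm <- as.numeric(as.character(data$DATE))
  yr <- yyyymm %/% 100   # first four digits
  mo <- yyyymm %% 100    # last two digits
  keep <- mo == month & yr >= from & yr <= to
  aggregate(TEMP ~ STATION, data = data[keep, ], FUN = mean)
}

# January means for 1986-1990, then for the whole period
jan_early <- station_means(weather, month = 1, from = 1986, to = 1990)
jan_all   <- station_means(weather, month = 1, from = 1986, to = 2014)
```

For July 1986-1990 on the full dataset the call would be station_means(weather, month = 7, from = 1986, to = 1990). Note that aggregate() stops with an error if the filter leaves no rows, so a range with no data needs separate handling.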
I am also learning R and may not answer your question directly, but I wanted to mention that the seas package is helpful for analysing this type of data.
For example:
library(seas)
pdf("test.pdf")  # open one PDF file to collect all the plots
for (i in seq_along(STATION)) {
  # make a subset for each station based on its name / unique id
  d1 <- mksub(mdata, id = STATION[i])
  dat.ss <- seas.sum(d1)
  plot(dat.ss)
}
graphics.off()  # close the PDF device
You have to ensure that the structure of your dataset (check it with str()) is in the format that seas needs.
With such a large dataset, I find that loops and functions make the analysis a bit quicker. If you have another way of looping that works, I would be grateful if you could share it.