I have a dataset with multiple columns, and I'd like to reorder the rows chronologically by date.
Station year ID
1       2020 D
2       2019 C
3       2017 A
4       2018 B
This is a really bad example, but is there code to reorder the rows by date, oldest to newest?
Station year ID
3       2017 A
4       2018 B
2       2019 C
1       2020 D
To look something like this!
Any help would be amazing! :)
Thank you
Well... "2020" is not a date, so you can sort that column as a regular integer.
But if you had dates like "2020-01-25", converting strings to dates is as easy as:
library(dplyr)

df <- tibble(n = c(1, 2, 3, 4),
             dt = c("2020-01-01", "2019-01-01", "2017-01-01", "2018-01-01"),
             l = c("D", "C", "A", "B"))

df <- df %>%
  mutate(dt = as.Date(dt)) %>%  # parse ISO-format strings into Date
  arrange(dt)                   # sort oldest to newest
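For the data exactly as posted in the question, where year is a plain integer, no date conversion is needed at all. A minimal base R sketch, assuming the columns are named Station, year, and ID as in the question (the object names `stations` and `sorted` are made up for illustration):

```r
# Rebuild the example from the question
stations <- data.frame(
  Station = c(1, 2, 3, 4),
  year    = c(2020, 2019, 2017, 2018),
  ID      = c("D", "C", "A", "B")
)

# order() returns the row indices that sort year ascending (oldest first)
sorted <- stations[order(stations$year), ]
print(sorted)
```

Use `order(stations$year, decreasing = TRUE)` if you want newest first instead.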
Use the ymd() function from the lubridate package to convert dt to Date format, and year() to extract the year. In this format you can sort your dates with arrange().
library(dplyr)
library(lubridate)
# data borrowed from abreums
df <- tibble(n = c(1,2,3,4),
dt = c("2020-01-01","2019-01-01","2017-01-01", "2018-01-01"),
l = c("D","C","A","B"))
df1 <- df %>%
  mutate(dt = ymd(dt),        # parse "2020-01-01" into a Date
         dt = year(dt)) %>%   # keep only the year, e.g. 2020
  arrange(dt)
I have a large database with a date column that contains Excel date serial numbers, incomplete dates that are missing the year (the year is in another column), and some cells with no date at all. I found out how to change the format of the dates, but the problem is how to distinguish the three types of cells in the date variable (Excel serial numbers, incomplete dates, and empty cells). I managed to do it by filtering on a helper column (value) that I DON'T have in the real database.
This is my original database:
This is the end result I need:
What I managed to do was to filter the dataset with the fictitious value column and convert the date to the required format. This is what I did:
library(dplyr)
data_a <- read.csv(text = "
year,date,value
2018,43238,1
2017,43267,2
2020,7/25,3
2018,,4
2013,,5
2000,8/23,6
2000,9/21,7")
data_b <- data_a %>%
filter(value %in% c(1,2)) %>%
mutate(data_formatted = as.Date(as.numeric(date), origin = "1899-12-30"))
data_c <- data_a %>%
filter(value %in% c(3, 6, 7)) %>%
mutate(data_formatted = as.Date(paste0(year, "/", date)))
data_d <- data_a %>%
filter(value %in% c(4, 5)) %>%
mutate(data_formatted = NA)
data_final <- rbind(data_b, data_c, data_d)
I need to do the same all at once WITHOUT using the value column.
You can use a conditional (case_when) to handle each scenario and apply a different conversion to Date for each.
Code
library(dplyr)
library(stringr)
library(lubridate)
data_a %>%
mutate(
data_formatted = case_when(
!str_detect(date,"/") ~ as.Date(as.numeric(date), origin = "1899-12-30"),
TRUE ~ ymd(paste0(year, "/", date))
)
)
Output
  year  date value data_formatted
1 2018 43238     1     2018-05-18
2 2017 43267     2     2018-06-16
3 2020  7/25     3     2020-07-25
4 2018           4           <NA>
5 2013           5           <NA>
6 2000  8/23     6     2000-08-23
7 2000  9/21     7     2000-09-21
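The same branching can also be sketched in base R, without the value column and without extra packages. The idea is to classify each cell by its shape (all digits means an Excel serial number, a slash means a month/day pair) and fill a Date vector piece by piece. The names `is_serial`, `is_partial`, and `out` are made up for this illustration:

```r
data_a <- read.csv(text = "
year,date,value
2018,43238,1
2017,43267,2
2020,7/25,3
2018,,4
2013,,5
2000,8/23,6
2000,9/21,7", stringsAsFactors = FALSE)

is_serial  <- grepl("^[0-9]+$", data_a$date)        # Excel serial numbers
is_partial <- grepl("/", data_a$date, fixed = TRUE) # month/day without a year

# Start with all-NA dates, so empty cells stay NA
out <- rep(as.Date(NA), nrow(data_a))

# Excel (Windows) serials count days from 1899-12-30
out[is_serial] <- as.Date(as.numeric(data_a$date[is_serial]),
                          origin = "1899-12-30")

# Prepend the year column to the month/day pairs before parsing
out[is_partial] <- as.Date(paste0(data_a$year[is_partial], "/",
                                  data_a$date[is_partial]),
                           format = "%Y/%m/%d")

data_a$data_formatted <- out
```

Classifying first and converting second avoids the coercion warnings that come from calling as.numeric() on strings like "7/25".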
Please try
data_a2 <- data_a %>% mutate(date2=as.numeric(ifelse(str_detect(date,'\\/'), '',date)),
date2_=as.numeric(as.Date(ifelse(str_detect(date,'\\/'), paste0(year,'/',date),''), format='%Y/%m/%d')),
date_formatted=as.Date(coalesce(date2,date2_), origin = "1970-01-01")) %>%
dplyr::select(-date2,-date2_)
Given a data frame like this:
df <- data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"), stringsAsFactors = FALSE)
How can I keep only the year in the timestamp column, given that all years are after 2000? Expected output:
data.frame(id = c(1,3), timestamp = c("2009", "2017"), stringsAsFactors = FALSE)
Base R:
format(as.Date(df$timestamp, "%d-%m-%Y %H:%M:%S"), "%Y")
[1] "2009" "2017"
So in the dataframe:
df$year <- format(as.Date(df$timestamp, "%d-%m-%Y %H:%M:%S"), "%Y")
id timestamp year
1 1 20-10-2009 11:35:12 2009
2 3 01-01-2017 12:21:21 2017
Another option, if you're into or familiar with regex, is this:
sub(".*([0-9]{4}).*", "\\1", df$timestamp)
[1] "2009" "2017"
See if this answers your question. The code and output are as follows:
library(lubridate)
library(tidyverse)
df <- data.frame(id = c(1,3,4), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21","01-01-1998 12:21:21"), stringsAsFactors = FALSE)
df$timestamp <- dmy_hms(df$timestamp)
df1 <- df %>%
  filter(year(timestamp) > 2000) %>%
  mutate(new_year = year(timestamp))
df1
#  id           timestamp new_year
#1  1 2009-10-20 11:35:12     2009
#2  3 2017-01-01 12:21:21     2017
If you're not afraid of external packages, one option would be to make use of the lubridate package:
library(dplyr)
df <- data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"))
df <- df %>%
  mutate(timestamp = lubridate::dmy_hms(timestamp)) %>%
  mutate(year = lubridate::year(timestamp))
Obviously, if you actually want to replace the timestamp column, you have to change the last mutate command. Result:
id timestamp year
1 1 2009-10-20 11:35:12 2009
2 3 2017-01-01 12:21:21 2017
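If the goal is to overwrite the timestamp column rather than add a year column, one way to change that last step is to parse and extract in a single mutate. This is only a sketch of that variation, using the same sample data:

```r
library(dplyr)
library(lubridate)

df <- data.frame(id = c(1, 3),
                 timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"))

# Parse day-month-year timestamps, then keep only the year in place
df <- df %>%
  mutate(timestamp = year(dmy_hms(timestamp)))

print(df)
```

Note that the column then holds numbers, not strings; wrap the result in as.character() if you need the character output shown in the question.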
I have a tidyverse solution to your problem:
library(tidyverse)
data.frame(id = c(1,3), timestamp = c("20-10-2009 11:35:12", "01-01-2017 12:21:21"), stringsAsFactors = FALSE) %>%
  mutate(timestamp = timestamp %>%
           str_extract("\\d{4}"))
The call str_extract("\\d{4}") extracts the first run of four consecutive digits in each value. With day-month-year timestamps like these, that is always the year.
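For comparison, the same "first four consecutive digits" extraction can be sketched in base R with regexpr() and regmatches(). The vector name `timestamps` is made up for this example:

```r
timestamps <- c("20-10-2009 11:35:12", "01-01-2017 12:21:21")

# regexpr() locates the first match per element; regmatches() pulls it out.
# Caveat: elements without any match are dropped from the result.
years <- regmatches(timestamps, regexpr("[0-9]{4}", timestamps))
print(years)
```

Because day, month, hour, minute, and second are all two digits here, the only four-digit run in each string is the year.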
I am an aspiring data scientist, and this is my first ever question on Stack Overflow.
I have this line of code to help wrangle my data. My date filter is static, and I would prefer not to have to go in and change this hardcoded value every year. What is the best alternative to make my date filter more dynamic? The date column is also difficult to work with because it is not a "date", it is a "dbl".
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
Tried so far:
df %>%
filter(DATE >= 20191231)
# load packages (lubridate for dates)
library(dplyr)
library(lubridate)
# create a sample dataframe
df <- data.frame(
DATE = c(20191230, 20191231, 20200122)
)
This looks like this:
DATE
1 20191230
2 20191231
3 20200122
# and now...
df %>% # take the dataframe
mutate(DATE = ymd(DATE)) %>% # turn the DATE column actually into a date
filter(DATE >= floor_date(Sys.Date(), "year") - days(1))
...and filter rows where DATE is on or after the day before the first day of this year (floor_date(Sys.Date(), "year") gives January 1 of the current year)
DATE
1 2019-12-31
2 2020-01-22
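If you would rather leave DATE as a number, the cutoff itself can be computed from the current year. A minimal base R sketch, where `year_end_cutoff` is a hypothetical helper name and the fixed date passed to it is only there to make the example reproducible:

```r
# Build the yyyymmdd number for December 31 of the year before `today`
year_end_cutoff <- function(today = Sys.Date()) {
  current_year <- as.integer(format(today, "%Y"))
  (current_year - 1L) * 10000L + 1231L
}

df <- data.frame(DATE = c(20191230, 20191231, 20200122))

# Pretend "today" is in 2020 so the result does not depend on when this runs
cutoff <- year_end_cutoff(as.Date("2020-06-15"))
kept <- df[df$DATE >= cutoff, , drop = FALSE]
print(kept)
```

In real use you would call `year_end_cutoff()` with no argument so the cutoff follows the system date.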
I have the raw data frame:
group=c("A", "B", "C")
demo_df=data.frame(date=c("2018-11-28", "2018-12-17", "2019-01-23"), group)
Raw data frame:
date group
1 2018-11-28 A
2 2018-12-17 B
3 2019-01-23 C
I want a data frame that expands the dates to one row per day, up to the day before the next row's date, while keeping the group information. For example, dates from 2018-11-28 to 2018-12-16 belong to group A, dates from 2018-12-17 to 2019-01-22 belong to group B, and 2019-01-23 belongs to group C.
This is the output (result_df) I want:
time=c(seq(as.Date("2018-11-28"), as.Date("2018-12-17")-1, by=1),
seq(as.Date("2018-12-17"), as.Date("2019-01-23")-1, by=1),as.Date("2019-01-23") )
group1=c(rep("A",as.numeric(as.Date("2018-12-17")-as.Date("2018-11-28"))),
rep("B",as.numeric(as.Date("2019-01-23")-as.Date("2018-12-17"))), "C" )
result_df=data.frame(time,group1 )
result_df
I am wondering if there is any more efficient way (using dplyr) to handle this issue.
Thanks in advance.
First, make sure date is stored as a date object:
demo_df$date <- as.Date(demo_df$date, format = "%Y-%m-%d")
Then using tidyverse, we first complete the sequence, then fill the group down:
library(tidyverse)
demo_df %>%
  complete(date = seq.Date(min(date), max(date), by = "day")) %>%
  fill(group)
Going through this years later, here is a variation on Mako212's answer:
demo_df %>% complete(date=full_seq(date,1)) %>% fill(group)
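The complete-then-fill idea can also be sketched in base R: build the full daily sequence, then look up which starting date each day falls after. This is an alternative to the tidyr approach, not a replacement for it; `all_days` and `idx` are illustrative names:

```r
demo_df <- data.frame(
  date  = as.Date(c("2018-11-28", "2018-12-17", "2019-01-23")),
  group = c("A", "B", "C"),
  stringsAsFactors = FALSE
)

# Every day from the first to the last date
all_days <- seq(min(demo_df$date), max(demo_df$date), by = "day")

# For each day, index of the latest start date that is <= that day
# (demo_df$date must be sorted ascending for findInterval)
idx <- findInterval(as.numeric(all_days), as.numeric(demo_df$date))

result_df <- data.frame(time = all_days, group1 = demo_df$group[idx])
head(result_df)
```

This matches the result_df built by hand in the question: 19 rows for A, 37 for B, and 1 for C.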
I need to aggregate over multiple months of data in a data frame in R, e.g. a data frame whose dates cover 2017 and 2018.
date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616
What I want is to sum based on 6 month period and category:
period category sum
1 2017_secondPeriod A 25.00972
2 2018_firstPeriod A 25.59850
3 2017_secondPeriod B 24.96924
4 2018_firstPeriod B 24.79649
5 2017_secondPeriod c 20.17096
6 2018_firstPeriod c 27.01794
What I did:
1. Select the last 6 months of 2017, and likewise the first 6 months of 2018
2. Add a new column to each subset to indicate the period
3. Combine the 2 subsets again
4. Aggregate
as follows:
library(lubridate)
df <- data.frame(
date = today() + days(1:300),
category = c("A","B","c"),
amt = runif(300)
)
df2017_secondHalf <- subset(df, month(date) %in% 7:12 & year(date) == 2017)
df2018_firstHalf <- subset(df, month(date) %in% 1:6 & year(date) == 2018)
sum1 <- aggregate(df2017_secondHalf$amt, by=list(Category=df2017_secondHalf$category), FUN=sum)
sum2 <- aggregate(df2018_firstHalf$amt, by=list(Category=df2018_firstHalf$category), FUN=sum)
df2017_secondHalf$period <- '2017_secondPeriod'
df2018_firstHalf$period <- '2018_firstPeriod'
# step 3: combine the two subsets again before aggregating
df_periods <- rbind(df2017_secondHalf, df2018_firstHalf)
aggregate(x = df_periods$amt, by = df_periods[c("period", "category")], FUN = sum)
I tried to figure it out but did not know how to aggregate over multiple months, e.g. 3 months or 6 months.
Thanks in advance
Any suggestions?
With lubridate and tidyverse (dplyr & magrittr)
First, let's create groups with Semesters, Quarter, and "Trimonthly".
library(tidyverse)
library(lubridate)
df <- df %>% mutate(Semester = semester(date, with_year = TRUE),
Quarter = quarter(date, with_year = TRUE),
Trimonthly = round_date(date, unit = "3 months" ))
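Lubridate's semester() splits the year into semesters and gives you a 1 (Jan-Jun) or 2 (Jul-Dec), prefixed with the year when with_year = TRUE (e.g. 2017.2); quarter() does a similar thing with quarters.
I add a third, the more basic round_date function, where you can specify your time frame as a size and a time unit. Note that round_date rounds each date to the nearest boundary of that time frame, so dates in the second half of a three-month window move to the start of the next one; floor_date would give the first date of the window instead. I deliberately name it "Trimonthly" so you can see how it compares to quarter().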
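To see how the two lubridate rounding helpers differ on concrete dates, here is a small sketch. The two sample dates are made up for illustration; both fall in the July to September window:

```r
library(lubridate)

d <- ymd(c("2017-08-05", "2017-09-20"))

# round_date() goes to the NEAREST three-month boundary
rounded <- round_date(d, unit = "3 months")

# floor_date() always goes back to the START of the window
floored <- floor_date(d, unit = "3 months")

data.frame(date = d, rounded = rounded, floored = floored)
```

If the Trimonthly groups should line up with calendar quarters, floor_date() is the closer match; round_date() splits each quarter across two labels.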
Pivot.Semester <- df %>%
group_by(Semester, category) %>%
summarise(Semester.sum = sum(amt))
Pivot.Quarter <- df %>%
group_by(Quarter, category) %>%
summarise(Quarter.sum = sum(amt))
Pivot.Trimonthly <- df %>%
group_by(Trimonthly, category) %>%
summarise(Trimonthly.sum = sum(amt))
Pivot.Semester
Pivot.Quarter
Pivot.Trimonthly
Optional: If you want to join the summarised data to the original DF.
df <- df %>% left_join(Pivot.Semester, by = c("category", "Semester")) %>%
left_join(Pivot.Quarter, by = c("category", "Quarter")) %>%
left_join(Pivot.Trimonthly, by = c("category", "Trimonthly"))
df
Here is a 3 line solution that uses no packages. Let k be the number of months in a period. For half year periods k is 6. For quarter year periods k would be 3, etc. Replace 02 in the sprintf format with 1 if you want one-digit suffixes (but not for monthly, since those must be two digits). Further modify the sprintf format if you want it to exactly match the question.
k <- 6
period <- with(as.POSIXlt(DF$date), sprintf("%d-%02d", year + 1900, (mon %/% k) + 1))
aggregate(amt ~ category + period, DF, sum)
giving:
category period amt
1 A 2017-02 0.7084425
2 B 2017-02 0.5682296
3 c 2017-02 0.8156812
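To make the role of k easier to see, the period formula can be wrapped in a small function and tried with different period lengths. `period_label` is a hypothetical helper name, and the two dates are made up for the example:

```r
# Label each date with "<year>-<period number>", where a period spans k months
period_label <- function(dates, k) {
  lt <- as.POSIXlt(dates)
  # lt$mon is 0-based (0 = January), so integer division by k gives the
  # 0-based period index; add 1 to make it 1-based
  sprintf("%d-%02d", lt$year + 1900, (lt$mon %/% k) + 1)
}

dates <- as.Date(c("2017-08-05", "2018-02-10"))

half_years <- period_label(dates, 6)
quarters   <- period_label(dates, 3)
months     <- period_label(dates, 1)
```

With k = 6, August falls in the second half of 2017 and February in the first half of 2018; with k = 3 they fall in quarters 3 and 1; with k = 1 the label is simply year and month.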
At the expense of using one package we can simplify the quarterly and monthly calculations by replacing the formula for period with one of these:
library(zoo)
# quarterly
period <- as.yearqtr(DF$date)
# monthly
period <- as.yearmon(DF$date)
Note: The input in reproducible form is:
Lines <- "date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616"
DF <- read.table(text = Lines)
DF$date <- as.Date(DF$date)