I have a pretty annoying date format in a data frame. Here is a sample:
"Jan 1, 2020, 8:36:55 PM" "Jan 7, 2020, 12:00:00 PM" "Jan 9, 2020, 8:24:55 PM"
The first thing I had to do was filter it by year. I ended up just using grep(), since 2020 appears in no other context, but that isn't an elegant solution. I hope the answer to my current problem can help with this, too.
Anyway, I now want to identify the weeks: I want to sum the values of a different column by week. However, I don't even know how to turn that character string into some sort of date...
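On the grep() side note: once the strings are parsed as date-times, comparing the formatted year is a more robust way to filter than pattern-matching on "2020". A minimal sketch, assuming an English locale and the format seen in the sample:

```r
x <- c("Jan 1, 2020, 8:36:55 PM", "Jan 7, 2019, 12:00:00 PM")
d <- as.POSIXct(x, format = "%b %d, %Y, %I:%M:%S %p", tz = "UTC")
# keep only the entries whose year is 2020
x[format(d, "%Y") == "2020"]
#> [1] "Jan 1, 2020, 8:36:55 PM"
```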
Just to give you a sample of my data, it would be this (already filtered for 2020):
Activity.Date Moving.Time
1 Jan 1, 2020, 8:36:55 PM 3581
2 Jan 7, 2020, 12:00:00 PM 1200
3 Jan 9, 2020, 8:24:55 PM 970
4 Jan 12, 2020, 7:51:30 PM 5564
5 Feb 4, 2020, 9:20:21 AM 1350
6 Feb 5, 2020, 9:20:00 AM 2400
7 Feb 6, 2020, 9:15:00 AM 2415
8 Feb 16, 2020, 11:55:51 AM 1836
9 Feb 17, 2020, 8:36:47 PM 511
10 Feb 25, 2020, 7:30:00 PM 928
11 Mar 4, 2020, 7:41:02 PM 558
12 Mar 6, 2020, 8:25:27 PM 2637
13 Mar 9, 2020, 8:37:11 PM 577
14 Mar 11, 2020, 7:46:10 PM 523
15 Mar 11, 2020, 10:00:25 PM 1278
16 Mar 12, 2020, 12:34:41 AM 442
17 Mar 13, 2020, 8:26:55 PM 2410
18 Mar 16, 2020, 8:25:22 PM 609
19 Sep 12, 2020, 7:27:26 PM 1884
20 Sep 15, 2020, 7:46:27 PM 1783
21 Sep 17, 2020, 8:41:19 PM 1838
22 Sep 19, 2020, 12:08:56 PM 1995
23 Sep 22, 2020, 7:29:01 PM 1776
24 Sep 24, 2020, 7:08:35 PM 1972
25 Sep 26, 2020, 7:24:52 PM 4032
26 Oct 3, 2020, 7:27:22 PM 4172
27 Oct 7, 2020, 8:00:41 PM 2987
28 Oct 8, 2020, 6:57:21 PM 2319
29 Oct 10, 2020, 7:23:39 PM 2509
30 Oct 12, 2020, 6:54:36 PM 5711
31 Oct 13, 2020, 7:56:59 PM 1764
32 Oct 14, 2020, 7:18:06 PM 4822
33 Oct 15, 2020, 8:09:31 PM 1863
34 Oct 17, 2020, 7:50:45 PM 5086
35 Oct 20, 2020, 7:58:39 PM 1583
36 Oct 21, 2020, 8:16:10 PM 4978
37 Oct 22, 2020, 7:23:26 PM 1940
38 Oct 22, 2020, 8:18:24 PM 1857
EDIT: I also need the number of rows that were summed in a third column, if possible...
You can use as.POSIXct.
x <- c("Jan 1, 2020, 8:36:55 PM", "Jan 7, 2020, 12:00:00 PM", "Jan 9, 2020, 8:24:55 PM")
as.POSIXct(x, format = '%b %d, %Y, %I:%M:%S %p', tz = 'UTC')
#[1] "2020-01-01 20:36:55 UTC" "2020-01-07 12:00:00 UTC" "2020-01-09 20:24:55 UTC"
The formats are mentioned in ?strptime.
If this is difficult to remember you can use mdy_hms from lubridate.
lubridate::mdy_hms(x)
Once you do that you can extract the week information and sum Moving.Time in each week.
library(dplyr)
library(lubridate)
df %>%
  mutate(Activity.Date = mdy_hms(Activity.Date)) %>%
  group_by(Week = week(Activity.Date)) %>%
  summarise(Moving.Time = sum(Moving.Time))
Convert your Activity.Date column to a date-time object with this:
activitydate <- as.POSIXct("Jan 1, 2020, 8:36:55 PM", format = "%b %d, %Y, %r")
Then to identify the week number:
format(activitydate, "%V") #or %U
See help for strptime for more information.
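For example, spelling the format out in full (English locale assumed), January 9 falls in ISO-8601 week 2 of 2020:

```r
d <- as.POSIXct("Jan 9, 2020, 8:24:55 PM",
                format = "%b %d, %Y, %I:%M:%S %p")
format(d, "%V")  # ISO-8601 week of the year
#> [1] "02"
```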
Update
To answer your second question about the number of rows: this is easily done with the dplyr library.
df$Activity.Date <- as.POSIXct(df$Activity.Date, format="%b %d, %Y, %r")
df$week <- format(df$Activity.Date, "%V")
library(dplyr)
df %>% group_by(week) %>% summarize(count=n(), sum=sum(Moving.Time))
Related
I have daily scores and their corresponding dates, as seen below, and I am currently struggling to convert them to quarterly values. On top of that, the years are not in chronological order, and I am quite confused about how to deal with the two issues.
see sample.
data
dates score
1 July 1, 2019 Monday 8
2 October 25, 2015 Sunday -3
3 June 17, 2020 Wednesday -5
4 January 17, 2018 Wednesday -1
5 April 15, 2019 Monday 6
6 October 30, 2019 Wednesday 10
7 March 6, 2017 Monday -2
8 November 19, 2018 Monday 3
9 June 11, 2020 Thursday 5
10 October 11, 2017 Wednesday -13
11 December 3, 2017 Sunday -8
12 November 14, 2018 Wednesday -6
13 August 22, 2017 Tuesday 8
14 December 13, 2017 Wednesday 5
15 January 22, 2016 Friday 5
dates <- sapply(date, function(x)
  trimws(grep(paste(month.name, collapse = '|'), x, value = TRUE)))
sort(as.Date(dates, '%B %d, %Y %A'))
This is a job for lubridate. You can parse your date column with lubridate::parse_date_time() and extract the quarter they fall in with lubridate::quarter():
library("tibble")
library("dplyr")
library("lubridate")
tbl <- tribble(~date, ~score,
"July 1, 2019 Monday", 8,
"October 25, 2015 Sunday", -3,
"June 17, 2020 Wednesday", -5,
"January 17, 2018 Wednesday", -1,
"April 15, 2019 Monday", 6,
"October 30, 2019 Wednesday", 10,
"March 6, 2017 Monday", -2,
"November 19, 2018 Monday", 3,
"June 11, 2020 Thursday", 5,
"October 11, 2017 Wednesday", -13,
"December 3, 2017 Sunday", -8,
"November 14, 2018 Wednesday", -6,
"August 22, 2017 Tuesday", 8,
"December 13, 2017 Wednesday", 5,
"January 22, 2016 Friday", 5)
tbl %>%
mutate(date = parse_date_time(date, "B d, Y")) %>%
mutate(quarter = quarter(date, with_year = TRUE))
#> # A tibble: 15 x 3
#> date score quarter
#> <dttm> <dbl> <dbl>
#> 1 2019-07-01 00:00:00 8 2019.3
#> 2 2015-10-25 00:00:00 -3 2015.4
#> 3 2020-06-17 00:00:00 -5 2020.2
#> 4 2018-01-17 00:00:00 -1 2018.1
#> 5 2019-04-15 00:00:00 6 2019.2
#> 6 2019-10-30 00:00:00 10 2019.4
#> 7 2017-03-06 00:00:00 -2 2017.1
#> 8 2018-11-19 00:00:00 3 2018.4
#> 9 2020-06-11 00:00:00 5 2020.2
#> 10 2017-10-11 00:00:00 -13 2017.4
#> 11 2017-12-03 00:00:00 -8 2017.4
#> 12 2018-11-14 00:00:00 -6 2018.4
#> 13 2017-08-22 00:00:00 8 2017.3
#> 14 2017-12-13 00:00:00 5 2017.4
#> 15 2016-01-22 00:00:00 5 2016.1
If you are trying to change dates column to Date class you can use as.Date.
df$new_date <- as.Date(trimws(df$dates), '%B %d, %Y')
Or this should also work with lubridate's mdy:
df$new_date <- lubridate::mdy(df$dates)
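Base R can also label quarters once you have Date values; a small sketch (English locale assumed; the trailing weekday is ignored by the parser):

```r
d <- as.Date("July 1, 2019 Monday", "%B %d, %Y")
paste(format(d, "%Y"), quarters(d))
#> [1] "2019 Q3"
```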
Once the data has been converted to date values per Ronak Shah's answer, we can use lubridate::quarter() to generate year and quarter values.
textData <- " dates|score
July 1, 2019 Monday| 8
October 25, 2015 Sunday| -3
June 17, 2020 Wednesday| -5
January 17, 2018 Wednesday| -1
April 15, 2019 Monday| 6
October 30, 2019 Wednesday| 10
March 6, 2017 Monday| -2
November 19, 2018 Monday| 3
June 11, 2020 Thursday| 5
October 11, 2017 Wednesday| -13
December 3, 2017 Sunday| -8
November 14, 2018 Wednesday| -6
August 22, 2017 Tuesday| 8
December 13, 2017 Wednesday| 5
January 22, 2016 Friday| 5
"
df <- read.csv(text=textData,
header=TRUE,
sep="|")
library(lubridate)
df$dt_quarter <- quarter(mdy(df$dates),with_year = TRUE,
fiscal_start = 1)
head(df)
We include the with_year = TRUE and fiscal_start = 1 arguments to illustrate that one can include or exclude the year in the output, and change the start month of the fiscal year from its default of 1 (January).
...and the output:
> head(df)
dates score dt_quarter
1 July 1, 2019 Monday 8 2019.3
2 October 25, 2015 Sunday -3 2015.4
3 June 17, 2020 Wednesday -5 2020.2
4 January 17, 2018 Wednesday -1 2018.1
5 April 15, 2019 Monday 6 2019.2
6 October 30, 2019 Wednesday 10 2019.4
The yearqtr class represents a year and quarter as the year plus 0 for Q1, 0.25 for Q2, 0.5 for Q3 and 0.75 for Q4. If date is defined as a yearqtr object as below then as.integer(date) is the year and cycle(date) is the quarter: 1, 2, 3 or 4. Note that junk at the end of the date field is ignored by as.yearqtr so we only need to specify month, day and year percent codes.
If you want a Date object instead of a yearqtr object then uncomment one of the commented out lines.
data is defined reproducibly in the Note at the end. (In the future please use dput to display your input data to prevent ambiguity as discussed in the information at the top of the r tag page.)
library(zoo)
date <- as.yearqtr(data$date, "%B %d, %Y")
# uncomment one of these lines if you want a Date object instead of yearqtr object
# date <- as.Date(date) # first day of quarter
# date <- as.Date(date, frac = 1) # last day of quarter
data.frame(date, score = data$score)[order(date), ]
giving the following sorted data frame assuming that we do not uncomment any of the commented out lines above.
date score
2 2015 Q4 -3
15 2016 Q1 5
7 2017 Q1 -2
...snip...
Time series
If this is supposed to be a time series with a single aggregated score per quarter then we can get a zoo series like this where data is the original data defined in the Note below.
library(zoo)
to_ym <- function(x) as.yearqtr(x, "%B %d, %Y")
z <- read.zoo(data, FUN = to_ym, aggregate = "mean")
z
## 2015 Q4 2016 Q1 2017 Q1 2017 Q3 2017 Q4 2018 Q1 2018 Q4 2019 Q2
## -3.000000 5.000000 -2.000000 8.000000 -5.333333 -1.000000 -1.500000 6.000000
## 2019 Q3 2019 Q4 2020 Q2
## 8.000000 10.000000 0.000000
or as a ts object like this:
as.ts(z)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2015 -3.000000
## 2016 5.000000 NA NA NA
## 2017 -2.000000 NA 8.000000 -5.333333
## 2018 -1.000000 NA NA -1.500000
## 2019 NA 6.000000 8.000000 10.000000
## 2020 NA 0.000000
Note
The input data in reproducible form:
data <- structure(list(dates = c("July 1, 2019 Monday", "October 25, 2015 Sunday",
"June 17, 2020 Wednesday", "January 17, 2018 Wednesday", "April 15, 2019 Monday",
"October 30, 2019 Wednesday", "March 6, 2017 Monday", "November 19, 2018 Monday",
"June 11, 2020 Thursday", "October 11, 2017 Wednesday", "December 3, 2017 Sunday",
"November 14, 2018 Wednesday", "August 22, 2017 Tuesday", "December 13, 2017 Wednesday",
"January 22, 2016 Friday"), score = c(8L, -3L, -5L, -1L, 6L,
10L, -2L, 3L, 5L, -13L, -8L, -6L, 8L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-15L))
Update
Have updated this answer several times so be sure you are looking at the latest version.
I have data with random dates from 2008 to 2020 and their corresponding values:
Date Val
September 16, 2012 32
September 19, 2014 33
January 05, 2008 26
June 07, 2017 02
December 15, 2019 03
May 28, 2020 18
I want to fill in the missing dates from January 01, 2008 to March 31, 2020, with a corresponding value of 1.
I referred to some posts like Post1 and Post2, but I was not able to solve the problem based on them. I am a beginner in R.
I am looking for data like this
Date Val
January 01, 2008 1
January 02, 2008 1
January 03, 2008 1
January 04, 2008 1
January 05, 2008 26
........
Use tidyr::complete:
library(dplyr)
df %>%
  mutate(Date = as.Date(Date, "%B %d, %Y")) %>%
  tidyr::complete(Date = seq(as.Date('2008-01-01'), as.Date('2020-03-31'),
                             by = 'day'),
                  fill = list(Val = 1)) %>%
  mutate(Date = format(Date, "%B %d, %Y"))
# A tibble: 4,475 x 2
# Date Val
# <chr> <dbl>
# 1 January 01, 2008 1
# 2 January 02, 2008 1
# 3 January 03, 2008 1
# 4 January 04, 2008 1
# 5 January 05, 2008 26
# 6 January 06, 2008 1
# 7 January 07, 2008 1
# 8 January 08, 2008 1
# 9 January 09, 2008 1
#10 January 10, 2008 1
# … with 4,465 more rows
data
df <- structure(list(Date = c("September 16, 2012", "September 19, 2014",
"January 05, 2008", "June 07, 2017", "December 15, 2019", "May 28, 2020"
), Val = c(32L, 33L, 26L, 2L, 3L, 18L)), class = "data.frame",
row.names = c(NA, -6L))
We can create a data frame with the desired date range, then join our data frame to it and replace all NAs with 1:
library(tidyverse)
days_seq %>%
  left_join(df) %>%
  mutate(Val = if_else(is.na(Val), as.integer(1), Val))
Joining, by = "Date"
# A tibble: 4,474 x 2
Date Val
<date> <int>
1 2008-01-01 1
2 2008-01-02 1
3 2008-01-03 1
4 2008-01-04 1
5 2008-01-05 33
6 2008-01-06 1
7 2008-01-07 1
8 2008-01-08 1
9 2008-01-09 1
10 2008-01-10 1
# ... with 4,464 more rows
Data
days_seq <- tibble(Date = seq(as.Date("2008/01/01"), as.Date("2020/03/31"), "days"))
df <- tibble::tribble(
~Date, ~Val,
"2012/09/16", 32L,
"2012/09/19", 33L,
"2008/01/05", 33L
)
df$Date <- as.Date(df$Date)
I have a data frame which looks like this:
structure(list(V1 = c(1174060957322141696, 1174107739209043968,
1175456617980149760, 1175463444805558272, 1175475052307013632,
1175916108697808896, 1177035962104369152, 1177959867077791744,
1180512511436709888, 1179879113844236288), V2 = structure(c(573L,
595L, 87L, 88L, 91L, 67L, 561L, 100L, 77L, 1L), .Label = c("Fri Oct 04 00:01:16 CEST 2019",
"Sat Oct 05 13:55:30 CEST 2019", "Sat Oct 05 13:55:56 CEST 2019",
"Wed Oct 02 10:25:36 CEST 2019", "Wed Oct 02 11:47:16 CEST 2019",
"Wed Oct 02 23:43:18 CEST 2019", "Wed Oct 02 23:46:07 CEST 2019",
"Wed Oct 02 23:52:27 CEST 2019", "Wed Oct 02 23:54:42 CEST 2019",
"Wed Oct 02 23:55:50 CEST 2019", "Wed Oct 02 23:56:11 CEST 2019",
"Wed Oct 02 23:56:41 CEST 2019", "Wed Oct 02 23:57:12 CEST 2019",
"Wed Oct 02 23:58:02 CEST 2019", "Wed Oct 02 23:58:53 CEST 2019",
"Wed Oct 02 23:59:05 CEST 2019", "Wed Oct 02 23:59:16 CEST 2019",
"Wed Oct 02 23:59:42 CEST 2019", "Wed Sep 18 01:47:53 CEST 2019",
"Wed Sep 25 00:50:36 CEST 2019", "Wed Sep 25 01:06:26 CEST 2019"
), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
I want to change the hours in column V4 by subtracting 07:00:00. If the hours in column V4 is smaller than 07:00:00 then it should as well change the day in column V3 and in case the day goes to the month before then it should change the month in column V2. The final aim of this is to count how many rows are there for each day, for which I can use:
count(entertainment_one, c("V2", "V3"))
but before I need to reorganise my data frame.
I am new to R and do not know where to start. Any help would be really appreciated, thank you very much!
First thing to notice is that your V2 is a factor; they do not behave as you think. Quickly convert it back to a character vector!
df$V2 <- as.character(df$V2)
Now, let's have our date as an actual date-time vector. But first, set the locale to English, as your dates are in English; otherwise, parsing dates written in a language different from your computer's locale might not work:
Sys.getlocale('LC_TIME') # take note of this value if you want to reset it.
Sys.setlocale('LC_TIME', 'english') # works on windows
df$dates <- strptime(df$V2, '%a %b %d %T CEST %Y', tz='XXX')
You see that 'XXX'? Replace it with a real timezone name: CEST is Central European Summer Time, so something like tz = 'Europe/Berlin' would match. If all your dates are in the same timezone, you probably wouldn't notice the difference...
At this point, df$dates is a POSIXlt-class object. Try adding 1 (or any small integer):
df$dates + 1
[1] "2019-10-04 00:01:17 EDT" "2019-10-05 13:55:31 EDT" "2019-10-05 13:55:57 EDT" ...
Ahh, it's counting seconds.
So to subtract 7 hours, subtract 7 hours worth of seconds:
df$offset <- df$dates - 7 * 60 * 60
See, both days and months change accordingly. Now use the package lubridate to extract day and month-components:
library(lubridate)
month(df$offset)
day(df$offset)
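To finish with the stated goal of counting rows per day, table() on the shifted dates is enough; a self-contained sketch (in the answer above, the vector would be df$offset):

```r
# hypothetical shifted date-times standing in for df$offset
offset <- as.POSIXct(c("2019-10-03 17:01:16", "2019-10-05 06:55:30",
                       "2019-10-05 06:55:56"), tz = "UTC")
counts <- table(format(offset, "%Y-%m-%d"))  # rows per (shifted) day
counts
```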
In R, what would be the best way to separate the following data into a table with 2 columns?
March 09, 2018
0.084752
March 10, 2018
0.084622
March 11, 2018
0.084622
March 12, 2018
0.084437
March 13, 2018
0.084785
March 14, 2018
0.084901
I considered using a for loop but was advised against it. I do not know how to parse things very well, so if the best method involves this process please
be as clear as possible.
The final table should look something like this:
https://i.stack.imgur.com/u5hII.png
Thank you!
Input:
input <- c("March 09, 2018",
"0.084752",
"March 10, 2018",
"0.084622",
"March 11, 2018",
"0.084622",
"March 12, 2018",
"0.084437",
"March 13, 2018",
"0.084785",
"March 14, 2018",
"0.084901")
Method:
library(dplyr)
library(lubridate)
df <- matrix(input, ncol = 2, byrow = TRUE) %>%
as_tibble() %>%
mutate(V1 = mdy(V1), V2 = as.numeric(V2))
Output:
df
# A tibble: 6 x 2
V1 V2
<date> <dbl>
1 2018-03-09 0.0848
2 2018-03-10 0.0846
3 2018-03-11 0.0846
4 2018-03-12 0.0844
5 2018-03-13 0.0848
6 2018-03-14 0.0849
Use names() or rename() to rename the columns.
names(df) <- c("Date", "Value")
data.table::fread can read "...a string (containing at least one \n)...".
The 'f' in fread stands for 'fast', so the code below should work on fairly large chunks as well.
require(data.table)
x = 'March 09, 2018
0.084752
March 10, 2018
0.084622
March 11, 2018
0.084622
March 12, 2018
0.084437
March 13, 2018
0.084785
March 14, 2018
0.084901'
o = fread(x, sep = '\n', header = FALSE)
o[, V1L := shift(V1, type = "lead")]  # pair each line with the one below it
o[, keep := (1:.N) %% 2 != 0]         # odd rows hold the dates
z = o[(keep)]                         # keep date rows; the value sits alongside in V1L
z[, keep := NULL]
z
result = data.frame(matrix(input, ncol = 2, byrow = T), stringsAsFactors = FALSE)
result
# X1 X2
# 1 March 09, 2018 0.084752
# 2 March 10, 2018 0.084622
# 3 March 11, 2018 0.084622
# 4 March 12, 2018 0.084437
# 5 March 13, 2018 0.084785
# 6 March 14, 2018 0.084901
You should next adjust the names and classes, something like this:
names(result) = c("date", "value")
result$value = as.numeric(result$value)
# etc.
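The date column can be adjusted the same way; a sketch using the format codes from the other answers (English locale assumed):

```r
result <- data.frame(date = c("March 09, 2018", "March 10, 2018"),
                     value = c("0.084752", "0.084622"),
                     stringsAsFactors = FALSE)
result$date <- as.Date(result$date, "%B %d, %Y")
result$value <- as.numeric(result$value)
class(result$date)
#> [1] "Date"
```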
Using Nik's nice input:
input = c(
"March 09, 2018",
"0.084752",
"March 10, 2018",
"0.084622",
"March 11, 2018",
"0.084622",
"March 12, 2018",
"0.084437",
"March 13, 2018",
"0.084785",
"March 14, 2018",
"0.084901"
)
My data has this format:
DF <- data.frame(
  ids = c("uniqueid1", "uniqueid1", "uniqueid1", "uniqueid2", "uniqueid2",
          "uniqueid2", "uniqueid2", "uniqueid3", "uniqueid3", "uniqueid3",
          "uniqueid4", "uniqueid4", "uniqueid4"),
  stock_year = c("April 2014", "March 2012", "April 2014", "January 2017",
                 "January 2016", "January 2015", "January 2014", "November 2011",
                 "November 2011", "December 2009", "August 2001", "July 2000",
                 "May 1999"))
ids stock_year
1 uniqueid1 April 2014
2 uniqueid1 March 2012
3 uniqueid1 April 2014
4 uniqueid2 January 2017
5 uniqueid2 January 2016
6 uniqueid2 January 2015
7 uniqueid2 January 2014
8 uniqueid3 November 2011
9 uniqueid3 November 2011
10 uniqueid3 December 2009
11 uniqueid4 August 2001
12 uniqueid4 July 2000
13 uniqueid4 May 1999
How can I completely remove the rows of every id that has a duplicated value in the stock_year column?
An example output of expected results is this:
DF <- data.frame(ids = c("uniqueid2", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid4", "uniqueid4", "uniqueid4"), stock_year = c("January 2017", "January 2016", "January 2015", "January 2014", "August 2001", "July 2000", "May 1999"))
ids stock_year
1 uniqueid2 January 2017
2 uniqueid2 January 2016
3 uniqueid2 January 2015
4 uniqueid2 January 2014
5 uniqueid4 August 2001
6 uniqueid4 July 2000
7 uniqueid4 May 1999
We can group by 'ids' and check for duplicates, keeping only those 'ids' that have no duplicates:
library(dplyr)
DF %>%
group_by(ids) %>%
filter(!anyDuplicated(stock_year))
Or using ave from base R; anyDuplicated returns 0 for a group with no duplicates, so we keep the rows whose group result is 0:
DF[with(DF, ave(as.character(stock_year), ids, FUN = anyDuplicated) == 0), ]