In R, what would be the best way to separate the following data into a table with 2 columns?
March 09, 2018
0.084752
March 10, 2018
0.084622
March 11, 2018
0.084622
March 12, 2018
0.084437
March 13, 2018
0.084785
March 14, 2018
0.084901
I considered using a for loop but was advised against it. I am not very experienced with parsing, so if the best method involves it, please be as clear as possible.
The final table should look something like this:
https://i.stack.imgur.com/u5hII.png
Thank you!
Input:
input <- c("March 09, 2018",
"0.084752",
"March 10, 2018",
"0.084622",
"March 11, 2018",
"0.084622",
"March 12, 2018",
"0.084437",
"March 13, 2018",
"0.084785",
"March 14, 2018",
"0.084901")
Method:
library(dplyr)
library(lubridate)
df <- matrix(input, ncol = 2, byrow = TRUE) %>%
as_tibble() %>%
mutate(V1 = mdy(V1), V2 = as.numeric(V2))
Output:
df
# A tibble: 6 x 2
V1 V2
<date> <dbl>
1 2018-03-09 0.0848
2 2018-03-10 0.0846
3 2018-03-11 0.0846
4 2018-03-12 0.0844
5 2018-03-13 0.0848
6 2018-03-14 0.0849
Use names() or rename() to rename the columns.
names(df) <- c("Date", "Value")
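For comparison, the same reshaping can be sketched in base R by picking the odd and even elements with recycled logical indices. This is only an illustration and not part of the answer above. The names input_demo and tbl_demo are made up for it, and only the first three pairs of the input are used.

```r
# Hypothetical shortened copy of the input vector (dates and values alternate)
input_demo <- c("March 09, 2018", "0.084752",
                "March 10, 2018", "0.084622",
                "March 11, 2018", "0.084622")

# c(TRUE, FALSE) is recycled, so it selects elements 1, 3, 5, ...
# c(FALSE, TRUE) selects elements 2, 4, 6, ...
tbl_demo <- data.frame(
  Date  = input_demo[c(TRUE, FALSE)],
  Value = as.numeric(input_demo[c(FALSE, TRUE)]),
  stringsAsFactors = FALSE
)
tbl_demo
```

The Date column stays as character here; converting it is a separate step (for example with mdy() as shown above).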
data.table::fread can read "...a string (containing at least one \n)...."
'f' in fread stands for 'fast' so the code below should work on fairly large chunks as well.
require(data.table)
x = 'March 09, 2018
0.084752
March 10, 2018
0.084622
March 11, 2018
0.084622
March 12, 2018
0.084437
March 13, 2018
0.084785
March 14, 2018
0.084901'
o = fread(x, sep = '\n', header = FALSE)
o[, V1L := shift(V1, type = "lead")]
o[, keep := (1:.N)%% 2 != 0 ]
z = o[(keep)]
z[, keep := NULL]
z
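If data.table is not available, a rough base R sketch of the same idea is to split the string on newlines and fill a two-column matrix row by row. The names x_demo, lines_demo, m_demo and out_demo are hypothetical, and the string is a shortened copy of the one above.

```r
# Hypothetical shortened copy of the multi-line string
x_demo <- "March 09, 2018
0.084752
March 10, 2018
0.084622"

# Split the string into one element per line
lines_demo <- strsplit(x_demo, "\n", fixed = TRUE)[[1]]

# Fill a two-column matrix row by row: (date, value), (date, value), ...
m_demo <- matrix(lines_demo, ncol = 2, byrow = TRUE)
out_demo <- data.frame(Date  = m_demo[, 1],
                       Value = as.numeric(m_demo[, 2]),
                       stringsAsFactors = FALSE)
out_demo
```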
result = data.frame(matrix(input, ncol = 2, byrow = TRUE), stringsAsFactors = FALSE)
result
# X1 X2
# 1 March 09, 2018 0.084752
# 2 March 10, 2018 0.084622
# 3 March 11, 2018 0.084622
# 4 March 12, 2018 0.084437
# 5 March 13, 2018 0.084785
# 6 March 14, 2018 0.084901
You should next adjust the names and classes, something like this:
names(result) = c("date", "value")
result$value = as.numeric(result$value)
# etc.
Using Nik's nice input:
input = c(
"March 09, 2018",
"0.084752",
"March 10, 2018",
"0.084622",
"March 11, 2018",
"0.084622",
"March 12, 2018",
"0.084437",
"March 13, 2018",
"0.084785",
"March 14, 2018",
"0.084901"
)
Having a dataframe like this:
data.frame(id = c(1,2,3,4), time_stamp_1 = c("Nov 2016-Current", "May 2013-Current", "Oct 2015-Current", "May 2014-Current"), time_stamp_2 = c("Mar 2015-Nov 2016", "May 2008-May 2013", "Aug 2005-Current", "Oct 2014-Jan 2015"))
How is it possible to add new columns that hold the time difference in months for every row, substituting "Sept 2022" wherever "Current" appears?
Example output:
data.frame(id = c(1,2,3,4), time_stamp_1 = c("Nov 2016-Current", "May 2013-Current", "Oct 2015-Current", "May 2014-Current"), time_stamp_2 = c("Mar 2015-Nov 2016", "May 2008-May 2013", "Aug 2005-Current", "Oct 2014-Jan 2015"), time_stamp_1_duration = c(41,43,24,53), time_stamp_2_duration = c(32,12,45,32))
The duration values above are placeholders for illustration only, not the real results.
This should do the trick. First replace every "-Current" (and any "-Sept 2022") with "-Sep 2022", since "Sep" is the abbreviation R recognizes. Then use tidyr::separate and zoo::as.yearmon() to convert to year-month format, and calculate the intervals in months as the OP asked (a yearmon difference is in years, hence the multiplication by 12):
library(dplyr)  # needed for %>%, mutate(), across(), left_join(), select()
library(tidyr)
library(zoo)
df <- data.frame(id = c(1,2,3,4), time_stamp_1 = c("Nov 2016-Current", "May 2013-Current", "Oct 2015-Current", "May 2014-Current"), time_stamp_2 = c("Mar 2015-Nov 2016", "May 2008-May 2013", "Aug 2005-Current", "Oct 2014-Jan 2015"))
# convert current and Sept to "Sep 2022"
df[2:3] <- lapply(df[2:3], function(x) gsub("-Current|-Sept 2022", "-Sep 2022", x))
df %>%
separate(time_stamp_1, into = c("my1a", "my1b"), sep = "-") %>%
separate(time_stamp_2, into = c("my2a", "my2b"), sep = "-") %>%
mutate(across(my1a:my2b, ~ as.yearmon(.x, format = "%b %Y"))) %>%
mutate(interval_1 = (my1b - my1a) * 12,
interval_2 = (my2b - my2a) * 12) %>%
left_join(df) %>% select(names(df), "interval_1", "interval_2")
Output:
id time_stamp_1 time_stamp_2 interval_1 interval_2
1 1 Nov 2016-Sep 2022 Mar 2015-Nov 2016 70 20
2 2 May 2013-Sep 2022 May 2008-May 2013 112 60
3 3 Oct 2015-Sep 2022 Aug 2005-Sep 2022 83 205
4 4 May 2014-Sep 2022 Oct 2014-Jan 2015 100 3
As G. Grothendieck mentions in the comments, we could wrap this in a function:
# thanks to G. Grothendieck
ts2mos <- function(x) {
x <- gsub("-Current|-Sept 2022", "-Sep 2022", x)
12 * (as.yearmon(sub(".*-", "", x)) - as.yearmon(x, "%b %Y"))
}
df %>% mutate(interval_1 = ts2mos(time_stamp_1),
interval_2 = ts2mos(time_stamp_2))
Using tidyverse
library(dplyr)
library(lubridate)
library(stringr)
df %>%
mutate(across(starts_with('time_stamp'), ~ {
tmp <- str_replace(.x, "Current", 'Sep 2022') %>%
str_replace("(\\w+) (\\d+)-(\\w+) (\\d+)", "\\2-\\1-01/\\4-\\3-01") %>%
interval
tmp %/% months(1)}, .names = "{.col}_duration"))
Output:
id time_stamp_1 time_stamp_2 time_stamp_1_duration time_stamp_2_duration
1 1 Nov 2016-Current Mar 2015-Nov 2016 70 20
2 2 May 2013-Current May 2008-May 2013 112 60
3 3 Oct 2015-Current Aug 2005-Current 83 205
4 4 May 2014-Current Oct 2014-Jan 2015 100 3
A tidyverse approach
library(dplyr)
library(lubridate)
library(stringr)
df %>%
mutate(across(starts_with("time_stamp"), ~ str_replace(.x, "Current", "Sep 2022")),
time_stamp_1_duration = sapply(str_split(time_stamp_1, "-"), function(x)
interval(my(x[1]), my(x[2])) %/% months(1)),
time_stamp_2_duration = sapply(str_split(time_stamp_2, "-"), function(x)
interval(my(x[1]), my(x[2])) %/% months(1)),
across(starts_with("time_stamp"), ~ str_replace(.x, "Sep 2022", "Current")))
id time_stamp_1 time_stamp_2 time_stamp_1_duration
1 1 Nov 2016-Current Mar 2015-Nov 2016 70
2 2 May 2013-Current May 2008-May 2013 112
3 3 Oct 2015-Current Aug 2005-Current 83
4 4 May 2014-Current Oct 2014-Jan 2015 100
time_stamp_2_duration
1 20
2 60
3 205
4 3
library(stringr)
timespan_to_duration <- function(x) {
x[ x == 'Current' ] <- 'Sep 2022'
x <- str_replace_all(x, '\\s+', ' 01 ')
x <- as.POSIXct(x, format = '%b %d %Y')
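# Approximation: a month is treated as 30 days, so the result can differ
# slightly from a calendar-month count
((difftime(x[ 2 ], x[ 1 ], units = 'days') |>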
as.integer()) / 30) |>
round()
}
df <- data.frame(id = c(1,2,3,4),
time_stamp_1 = c("Nov 2016-Current", "May 2013-Current", "Oct 2015-Current", "May 2014-Current"),
time_stamp_2 = c("Mar 2015-Nov 2016", "May 2008-May 2013", "Aug 2005-Current", "Oct 2014-Jan 2015"))
df$time_stamp_1_duration <- df$time_stamp_1 |>
str_split('-') |>
lapply(timespan_to_duration) |>
unlist()
df$time_stamp_2_duration <- df$time_stamp_2 |>
str_split('-') |>
lapply(timespan_to_duration) |>
unlist()
df
One possible solution uses the function tstrsplit from the data.table package. Note that I am also using the built-in constant month.abb.
df[c("duration1", "duration2")] = lapply(df[2:3], function(x) {
x = data.table::tstrsplit(sub("Current", "Sep 2022", x),
split="\\s|-",
type.convert=TRUE,
names=c("mo1", "yr1", "mo2", "yr2"))
x[c("mo1", "mo2")] = lapply(x[c("mo1", "mo2")], match, month.abb)
# whole years in months plus the month offset; also correct when both ends fall in the same year
(x$yr2 - x$yr1) * 12 + (x$mo2 - x$mo1)
})
id time_stamp_1 time_stamp_2 duration1 duration2
1 1 Nov 2016-Current Mar 2015-Nov 2016 70 20
2 2 May 2013-Current May 2008-May 2013 112 60
3 3 Oct 2015-Current Aug 2005-Current 83 205
4 4 May 2014-Current Oct 2014-Jan 2015 100 3
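The month arithmetic behind these numbers can be checked with a small base R sketch. The helper months_between and its current argument are hypothetical names introduced here, and the approach (year difference times 12 plus month difference) is a simplification for illustration rather than any of the answerers' exact code.

```r
# Hypothetical helper: months between the two ends of a "Mon YYYY-Mon YYYY" range
months_between <- function(x, current = "Sep 2022") {
  x <- sub("Current", current, x, fixed = TRUE)
  parts <- strsplit(x, "[ -]")              # e.g. "Nov" "2016" "Sep" "2022"
  vapply(parts, function(p) {
    mo <- match(p[c(1, 3)], month.abb)      # month abbreviations -> 1..12
    yr <- as.integer(p[c(2, 4)])
    as.integer((yr[2] - yr[1]) * 12 + (mo[2] - mo[1]))
  }, integer(1))
}

res_demo <- months_between(c("Nov 2016-Current",
                             "Oct 2014-Jan 2015",
                             "Mar 2015-Nov 2015"))
res_demo
# 70 3 8
```

The third input is an extra same-year range, added only to show that the formula handles that case as well.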
I have columns like these:
year period period2 Sales
2015 201504 April 2015 10000
2015 201505 May 2015 11000
2018 201803 March 2018 12000
I want to convert the period or period2 column to a date type, so that I can use it later in time series analysis.
Data:
tibble::tibble(
year = c(2015,2015,2018),
period = c(201504, 201505,201803 ),
period2 = c("April 2015", "May 2015", "March 2018"),
Sales = c(10000,11000,12000)
)
Using the lubridate package, you can transform them into date variables:
df <- tibble::tibble(
year = c(2015,2015,2018),
period = c(201504, 201505,201803 ),
period2 = c("April 2015", "May 2015", "March 2018"),
Sales = c(10000,11000,12000)
)
library(dplyr)
df %>%
mutate(period = lubridate::ym(period),
period2 = lubridate::my(period2))
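A possible base R equivalent, shown only as a sketch: as.Date() needs a complete date, so a day of "01" is pasted on before parsing. The variable names d1 and d2 are made up, and the Sys.setlocale() call is an assumption added here so that the English month names parse with %B regardless of the session locale.

```r
# Assumption: force the C locale so that English month names match %B
invisible(Sys.setlocale("LC_TIME", "C"))

period  <- c(201504, 201505, 201803)
period2 <- c("April 2015", "May 2015", "March 2018")

# Append a day so that as.Date() has a full year-month-day to parse
d1 <- as.Date(paste0(period, "01"), format = "%Y%m%d")
d2 <- as.Date(paste("01", period2), format = "%d %B %Y")

d1
d2
```

Both vectors hold the first day of each month, which is also what ym() and my() return.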
I have daily scores with their corresponding dates, as shown below, and I am struggling to convert them to quarterly values. The dates are also not in chronological order, and I am not sure how to handle both the ordering and the conversion.
See the sample:
data
dates score
1 July 1, 2019 Monday 8
2 October 25, 2015 Sunday -3
3 June 17, 2020 Wednesday -5
4 January 17, 2018 Wednesday -1
5 April 15, 2019 Monday 6
6 October 30, 2019 Wednesday 10
7 March 6, 2017 Monday -2
8 November 19, 2018 Monday 3
9 June 11, 2020 Thursday 5
10 October 11, 2017 Wednesday -13
11 December 3, 2017 Sunday -8
12 November 14, 2018 Wednesday -6
13 August 22, 2017 Tuesday 8
14 December 13, 2017 Wednesday 5
15 January 22, 2016 Friday 5
dates <- sapply(date, function(x)
trimws(grep(paste(month.name, collapse = '|'), x, value = TRUE)));
sort(as.Date(dates,'%B %d, %Y %A'))
This is a job for lubridate. You can parse your date column with lubridate::parse_date_time() and extract the quarter they fall in with lubridate::quarter():
library("tibble")
library("dplyr")
library("lubridate")
tbl <- tribble(~date, ~score,
"July 1, 2019 Monday", 8,
"October 25, 2015 Sunday", -3,
"June 17, 2020 Wednesday", -5,
"January 17, 2018 Wednesday", -1,
"April 15, 2019 Monday", 6,
"October 30, 2019 Wednesday", 10,
"March 6, 2017 Monday", -2,
"November 19, 2018 Monday", 3,
"June 11, 2020 Thursday", 5,
"October 11, 2017 Wednesday", -13,
"December 3, 2017 Sunday", -8,
"November 14, 2018 Wednesday", -6,
"August 22, 2017 Tuesday", 8,
"December 13, 2017 Wednesday", 5,
"January 22, 2016 Friday", 5)
tbl %>%
mutate(date = parse_date_time(date, "B d, Y")) %>%
mutate(quarter = quarter(date, with_year = TRUE))
#> # A tibble: 15 x 3
#> date score quarter
#> <dttm> <dbl> <dbl>
#> 1 2019-07-01 00:00:00 8 2019.3
#> 2 2015-10-25 00:00:00 -3 2015.4
#> 3 2020-06-17 00:00:00 -5 2020.2
#> 4 2018-01-17 00:00:00 -1 2018.1
#> 5 2019-04-15 00:00:00 6 2019.2
#> 6 2019-10-30 00:00:00 10 2019.4
#> 7 2017-03-06 00:00:00 -2 2017.1
#> 8 2018-11-19 00:00:00 3 2018.4
#> 9 2020-06-11 00:00:00 5 2020.2
#> 10 2017-10-11 00:00:00 -13 2017.4
#> 11 2017-12-03 00:00:00 -8 2017.4
#> 12 2018-11-14 00:00:00 -6 2018.4
#> 13 2017-08-22 00:00:00 8 2017.3
#> 14 2017-12-13 00:00:00 5 2017.4
#> 15 2016-01-22 00:00:00 5 2016.1
If you are trying to change the dates column to the Date class, you can use as.Date.
df$new_date <- as.Date(trimws(df$dates), '%B %d, %Y')
Or this should also work with lubridate's mdy :
df$new_date <- lubridate::mdy(df$dates)
Once the data has been converted to date values per Ronak Shah's answer, we can use lubridate::quarter() to generate year and quarter values.
textData <- " dates|score
July 1, 2019 Monday| 8
October 25, 2015 Sunday| -3
June 17, 2020 Wednesday| -5
January 17, 2018 Wednesday| -1
April 15, 2019 Monday| 6
October 30, 2019 Wednesday| 10
March 6, 2017 Monday| -2
November 19, 2018 Monday| 3
June 11, 2020 Thursday| 5
October 11, 2017 Wednesday| -13
December 3, 2017 Sunday| -8
November 14, 2018 Wednesday| -6
August 22, 2017 Tuesday| 8
December 13, 2017 Wednesday| 5
January 22, 2016 Friday| 5
"
df <- read.csv(text=textData,
header=TRUE,
sep="|")
library(lubridate)
df$dt_quarter <- quarter(mdy(df$dates),with_year = TRUE,
fiscal_start = 1)
head(df)
We include with_year = TRUE and fiscal_start = 1 arguments to illustrate that one can change the output to include / exclude the year information, as well as change the start month for the year from the default of 1.
...and the output:
> head(df)
dates score dt_quarter
1 July 1, 2019 Monday 8 2019.3
2 October 25, 2015 Sunday -3 2015.4
3 June 17, 2020 Wednesday -5 2020.2
4 January 17, 2018 Wednesday -1 2018.1
5 April 15, 2019 Monday 6 2019.2
6 October 30, 2019 Wednesday 10 2019.4
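Since the question asks for quarterly figures, here is a hedged base R sketch of one way to go from the parsed dates to one value per quarter. It uses a hypothetical five-row subset (scores_demo), takes the mean as the aggregate (the question does not say which summary is wanted), and forces the C locale so the English month names parse.

```r
# Assumption: force the C locale so that English month names match %B
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical subset of the question's data
scores_demo <- data.frame(
  dates = c("October 11, 2017 Wednesday", "December 3, 2017 Sunday",
            "December 13, 2017 Wednesday", "June 17, 2020 Wednesday",
            "June 11, 2020 Thursday"),
  score = c(-13, -8, 5, -5, 5)
)

# Parsing stops once the format is consumed, so the trailing weekday is ignored
d <- as.Date(scores_demo$dates, format = "%B %d, %Y")

# Label each row as "YYYY Qn" and average the scores within each label
scores_demo$qtr <- paste(format(d, "%Y"), quarters(d))
by_qtr <- aggregate(score ~ qtr, data = scores_demo, FUN = mean)
by_qtr
```

Because the labels start with the year, sorting them alphabetically also puts the quarters in chronological order.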
The yearqtr class represents a year and quarter as the year plus 0 for Q1, 0.25 for Q2, 0.5 for Q3 and 0.75 for Q4. If date is defined as a yearqtr object as below then as.integer(date) is the year and cycle(date) is the quarter: 1, 2, 3 or 4. Note that junk at the end of the date field is ignored by as.yearqtr so we only need to specify month, day and year percent codes.
If you want a Date object instead of a yearqtr object then uncomment one of the commented out lines.
data is defined reproducibly in the Note at the end. (In the future please use dput to display your input data to prevent ambiguity as discussed in the information at the top of the r tag page.)
library(zoo)
date <- as.yearqtr(data$dates, "%B %d, %Y")
# uncomment one of these lines if you want a Date object instead of yearqtr object
# date <- as.Date(date) # first day of quarter
# date <- as.Date(date, frac = 1) # last day of quarter
data.frame(date, score = data$score)[order(date), ]
giving the following sorted data frame assuming that we do not uncomment any of the commented out lines above.
date score
2 2015 Q4 -3
15 2016 Q1 5
7 2017 Q1 -2
...snip...
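To make the "year plus a fraction" encoding described above concrete, here is a small arithmetic sketch in plain R. The helpers encode_yq, decode_year and decode_quarter are hypothetical names that mirror the description only; they are not zoo functions and do not claim to reproduce zoo's internals.

```r
# Hypothetical helpers illustrating the encoding: Q1 -> +0, Q2 -> +0.25, ...
encode_yq      <- function(year, quarter) year + (quarter - 1) / 4
decode_year    <- function(yq) floor(yq)
decode_quarter <- function(yq) round((yq - floor(yq)) * 4) + 1

yq <- encode_yq(2017, 4)   # 2017.75
decode_year(yq)            # 2017
decode_quarter(yq)         # 4
```

With a real yearqtr object, as.integer() and cycle() play the roles of decode_year and decode_quarter, as the answer says.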
Time series
If this is supposed to be a time series with a single aggregated score per quarter then we can get a zoo series like this where data is the original data defined in the Note below.
library(zoo)
to_ym <- function(x) as.yearqtr(x, "%B %d, %Y")
z <- read.zoo(data, FUN = to_ym, aggregate = "mean")
z
## 2015 Q4 2016 Q1 2017 Q1 2017 Q3 2017 Q4 2018 Q1 2018 Q4 2019 Q2
## -3.000000 5.000000 -2.000000 8.000000 -5.333333 -1.000000 -1.500000 6.000000
## 2019 Q3 2019 Q4 2020 Q2
## 8.000000 10.000000 0.000000
or as a ts object like this:
as.ts(z)
## Qtr1 Qtr2 Qtr3 Qtr4
## 2015 -3.000000
## 2016 5.000000 NA NA NA
## 2017 -2.000000 NA 8.000000 -5.333333
## 2018 -1.000000 NA NA -1.500000
## 2019 NA 6.000000 8.000000 10.000000
## 2020 NA 0.000000
Note
The input data in reproducible form:
data <- structure(list(dates = c("July 1, 2019 Monday", "October 25, 2015 Sunday",
"June 17, 2020 Wednesday", "January 17, 2018 Wednesday", "April 15, 2019 Monday",
"October 30, 2019 Wednesday", "March 6, 2017 Monday", "November 19, 2018 Monday",
"June 11, 2020 Thursday", "October 11, 2017 Wednesday", "December 3, 2017 Sunday",
"November 14, 2018 Wednesday", "August 22, 2017 Tuesday", "December 13, 2017 Wednesday",
"January 22, 2016 Friday"), score = c(8L, -3L, -5L, -1L, 6L,
10L, -2L, 3L, 5L, -13L, -8L, -6L, 8L, 5L, 5L)), class = "data.frame", row.names = c(NA,
-15L))
Update
Have updated this answer several times so be sure you are looking at the latest version.
I have data with random dates from 2008 to 2020 and their corresponding values:
Date Val
September 16, 2012 32
September 19, 2014 33
January 05, 2008 26
June 07, 2017 02
December 15, 2019 03
May 28, 2020 18
I want to fill in the missing dates from January 01, 2008 to March 31, 2020, with a corresponding value of 1.
I looked at some other posts (Post1, Post2) but was not able to solve the problem based on them. I am a beginner in R.
I am looking for data like this
Date Val
January 01, 2008 1
January 02, 2008 1
January 03, 2008 1
January 04, 2008 1
January 05, 2008 26
........
Use tidyr::complete :
library(dplyr)
df %>%
mutate(Date = as.Date(Date, "%B %d, %Y")) %>%
tidyr::complete(Date = seq(as.Date('2008-01-01'), as.Date('2020-03-31'),
by = 'day'), fill = list(Val = 1)) %>%
mutate(Date = format(Date, "%B %d, %Y"))
# A tibble: 4,475 x 2
# Date Val
# <chr> <dbl>
# 1 January 01, 2008 1
# 2 January 02, 2008 1
# 3 January 03, 2008 1
# 4 January 04, 2008 1
# 5 January 05, 2008 26
# 6 January 06, 2008 1
# 7 January 07, 2008 1
# 8 January 08, 2008 1
# 9 January 09, 2008 1
#10 January 10, 2008 1
# … with 4,465 more rows
data
df <- structure(list(Date = c("September 16, 2012", "September 19, 2014",
"January 05, 2008", "June 07, 2017", "December 15, 2019", "May 28, 2020"
), Val = c(32L, 33L, 26L, 2L, 3L, 18L)), class = "data.frame",
row.names = c(NA, -6L))
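For readers who prefer base R, roughly the same fill can be sketched with merge(). The names df_demo, all_days and filled are made up, only three of the question's rows are used, and the locale call is an assumption so that the month names parse. One difference from complete(): a left join on the date range drops rows outside it (here May 28, 2020), whereas complete() keeps them.

```r
# Assumption: force the C locale so that English month names match %B
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical subset of the question's data
df_demo <- data.frame(
  Date = c("September 16, 2012", "January 05, 2008", "May 28, 2020"),
  Val  = c(32, 26, 18)
)
df_demo$Date <- as.Date(df_demo$Date, format = "%B %d, %Y")

# Every day in the requested range
all_days <- data.frame(Date = seq(as.Date("2008-01-01"),
                                  as.Date("2020-03-31"), by = "day"))

# Left join: keep every day, then fill the unmatched values with 1
filled <- merge(all_days, df_demo, by = "Date", all.x = TRUE)
filled$Val[is.na(filled$Val)] <- 1
head(filled, 5)
```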
We can create a data frame with the desired date range, join our data frame onto it, and replace all NA values with 1:
library(tidyverse)
days_seq %>%
left_join(df) %>%
mutate(Val = if_else(is.na(Val), as.integer(1), Val))
Joining, by = "Date"
# A tibble: 4,474 x 2
Date Val
<date> <int>
1 2008-01-01 1
2 2008-01-02 1
3 2008-01-03 1
4 2008-01-04 1
5 2008-01-05 33
6 2008-01-06 1
7 2008-01-07 1
8 2008-01-08 1
9 2008-01-09 1
10 2008-01-10 1
# ... with 4,464 more rows
Data
days_seq <- tibble(Date = seq(as.Date("2008/01/01"), as.Date("2020/03/31"), "days"))
df <- tibble::tribble(
~Date, ~Val,
"2012/09/16", 32L,
"2012/09/19", 33L,
"2008/01/05", 33L
)
df$Date <- as.Date(df$Date)
My data has this format:
DF <- data.frame(ids = c("uniqueid1", "uniqueid1", "uniqueid1", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid3", "uniqueid3", "uniqueid3", "uniqueid4", "uniqueid4", "uniqueid4"), stock_year = c("April 2014", "March 2012", "April 2014", "January 2017", "January 2016", "January 2015", "January 2014", "November 2011", "November 2011", "December 2009", "August 2001", "July 2000", "May 1999"))
ids stock_year
1 uniqueid1 April 2014
2 uniqueid1 March 2012
3 uniqueid1 April 2014
4 uniqueid2 January 2017
5 uniqueid2 January 2016
6 uniqueid2 January 2015
7 uniqueid2 January 2014
8 uniqueid3 November 2011
9 uniqueid3 November 2011
10 uniqueid3 December 2009
11 uniqueid4 August 2001
12 uniqueid4 July 2000
13 uniqueid4 May 1999
How can I completely remove all rows for an id when that id has the same stock_year value more than once?
An example output of expected results is this:
DF <- data.frame(ids = c("uniqueid2", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid4", "uniqueid4", "uniqueid4"), stock_year = c("January 2017", "January 2016", "January 2015", "January 2014", "August 2001", "July 2000", "May 1999"))
ids stock_year
1 uniqueid2 January 2017
2 uniqueid2 January 2016
3 uniqueid2 January 2015
4 uniqueid2 January 2014
5 uniqueid4 August 2001
6 uniqueid4 July 2000
7 uniqueid4 May 1999
We can group by 'ids' and check for duplicates to filter those 'ids' having no duplicates
library(dplyr)
DF %>%
group_by(ids) %>%
filter(!anyDuplicated(stock_year))
Or using ave from base R (anyDuplicated returns 0 for a group with no duplicates, so those are the rows to keep):
DF[with(DF, ave(as.character(stock_year), ids, FUN=anyDuplicated) == 0),]
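Another way to spell out the same logic in base R, as a sketch only: find the ids that have at least one repeated (ids, stock_year) pair, then drop every row belonging to those ids. DF_demo, is_repeat, bad_ids and kept are hypothetical names, and the data is a shortened stand-in for the question's DF.

```r
# Hypothetical shortened stand-in for DF
DF_demo <- data.frame(
  ids = c("a", "a", "a", "b", "b"),
  stock_year = c("April 2014", "March 2012", "April 2014",
                 "January 2017", "January 2016"),
  stringsAsFactors = FALSE
)

# TRUE for rows that repeat an (ids, stock_year) pair seen earlier
is_repeat <- duplicated(DF_demo[c("ids", "stock_year")])

# Drop every row belonging to an id that has at least one repeat
bad_ids <- unique(DF_demo$ids[is_repeat])
kept <- DF_demo[!DF_demo$ids %in% bad_ids, ]
kept
```

Here id "a" is removed entirely because "April 2014" occurs twice for it, while both rows of id "b" are kept.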