Removing String from the column in R - r

After running my R code, the values in the data frame column are:
25 July 2012 bet
22 June 2015 bet
09 April 2015 be
14 November 2016
I want only the dates. How can I remove "bet" and "be" from the values?
I am using the below code to extract the above values from the text document:
coalesce((substr((stringr::str_match(text, "ISDA Master Agreement dated as of (.) ")[, 2]),1,16)),(substr((stringr::str_match(text, "ISDA Master Agreement dated as of (.) ")[, 2]),1,13)))
If I swap the coalesce arguments, then the 4th value gets truncated.
I am OK with the code, but while cleaning, how should I remove the "bet" and "be"?

I am far away from being a regex expert, but here goes a tidyverse way of doing what you want:
library(tidyverse, verbose = FALSE)
df <- tibble::tribble(
~V1, ~V2,
1L, "25 July 2012 bet",
2L, "22 June 2015 bet",
3L, "09 April 2015 be",
4L, "14 November 2016"
)
df %>%
mutate(V2 = str_replace(V2, pattern = "[:space:]be.*", replacement = ""))
#> # A tibble: 4 x 2
#> V1 V2
#> <int> <chr>
#> 1 1 25 July 2012
#> 2 2 22 June 2015
#> 3 3 09 April 2015
#> 4 4 14 November 2016
Created on 2020-02-21 by the reprex package (v0.3.0)

We can use sub to remove the whitespace that precedes "be", together with "be" and everything after it:
sub("\\s+be.*", "", c("25 July 2012 bet", "09 April 2015 be"))
#[1] "25 July 2012" "09 April 2015"
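A slightly different angle, sketched under the assumption that every value starts with a date of the form "day, month name, four-digit year": instead of naming the leftovers to delete, keep only the leading date. The vector name dates_raw is made up for illustration.

```r
# Values as shown in the question
dates_raw <- c("25 July 2012 bet", "22 June 2015 bet",
               "09 April 2015 be", "14 November 2016")

# Capture "<1-2 digits> <letters> <4 digits>" at the start of the string
# and drop whatever follows it
dates_clean <- sub("^(\\d{1,2} [A-Za-z]+ \\d{4}).*$", "\\1",
                   dates_raw, perl = TRUE)
dates_clean
```

This does not depend on the trailing text starting with "be", so it should also cope with other leftovers, as long as the assumed date layout holds.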

If you use lubridate you can strip away the excess text after the date:
library(lubridate)
test_strings <- c("25 July 2012 bet", "09 April 2015 be")
dmy(test_strings)
[1] "2012-07-25" "2015-04-09"

Related

Removing dates (in any format) from a text column

Hope everyone is well.
In my dataset there is a column containing free text. My goal is to remove all dates, in any format, from the text.
This is a snapshot of the data:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
df %>% select(text)
text
1 tommorow is 2022 11 03
2 I married on 2020-01-01
3 why not going there on 2023/01/14
4 2023 08 01 will be great
The outcome should look like
text
1 tommorow is
2 I married on
3 why not going there on
4 will be great
Thank you!
The best approach would perhaps be to use a sensitive regex pattern:
df <- data.frame(
text=c('tommorow is 2022 11 03',"I married on 2020-01-01",
'why not going there on 2023/01/14','2023 08 01 will be great'))
library(tidyverse)
df |>
mutate(left_text = str_trim(str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}")))
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
This will match dates by:
\\d{1,4} = starting with either month (1-2 numeric characters), day (1-2 characters) or year (2-4 characters); followed by
\\D = anything that's not a number, i.e. the separator; followed by
\\d{1,2} = day or month (1-2 chars); followed by
\\D again; ending with
\\d{1,4} = day or year (1-2 or 2-4 chars)
The challenge is balancing sensitivity with specificity. This should not take out numbers which are clearly not dates, but might miss out:
dates with no year
dates with no separators
dates with double spaces between parts
But hopefully should catch every sensible date in your text column!
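To make the trade-off concrete, here is a small check of the same pattern against a few strings. It is only a sketch: it uses base R's grepl with perl = TRUE rather than stringr, and the sample strings are invented for illustration.

```r
# Same pattern as in the answer: digits, separator, digits, separator, digits
date_pattern <- "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}"

samples <- c(
  "2022 11 03",     # year month day, space separated
  "2020-01-01",     # ISO style
  "14/01/2023",     # day/month/year
  "3 March",        # no year
  "20230114",       # no separators
  "2023  08  01",   # double spaces between parts
  "version 1.2.3"   # not a date, but has the same shape
)

hits <- grepl(date_pattern, samples, perl = TRUE)
data.frame(samples, hits)
```

The first three strings match, the next three are the misses listed above, and the last one shows that a number sequence shaped like a date (here a version string) is matched too, so it is worth scanning the results on real data.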
Further date detection examples:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 2010',
'dec 21/22 is another date'
)
)
df |>
mutate(left_text = str_remove(text, "\\d{1,4}\\D\\d{1,2}\\D\\d{1,4}") |>
str_remove(regex(paste0("(", paste(month.name, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_remove(regex(paste0("(", paste(month.abb, collapse = "|"),
")(\\D+\\d{1,2})?\\D+\\d{1,4}"),
ignore_case = TRUE)) |>
str_trim())
#> text left_text
#> 1 tommorow is 2022 11 03 tommorow is
#> 2 I married on 2020-01-01 I married on
#> 3 why not going there on 2023/01/14 why not going there on
#> 4 2023 08 01 will be great will be great
#> 5 A trickier example: January 05,2020 A trickier example:
#> 6 or try Oct 2010 or try
#> 7 dec 21/22 is another date is another date
Final Edit - doing replace with temporary placeholders
The following code should work on a wide range of date formats. It works by replacing in a specific order so as not to accidentally chop out bits of some dates. It glues together pre-made regex patterns, which hopefully gives a clearer idea of what each bit is doing:
library(tidyverse)
df <- data.frame(
text = c(
'tommorow is 2022 11 03',
"I married on 2020-01-01",
'why not going there on 2023/01/14',
'2023 08 01 will be great',
'A trickier example: January 05,2020',
'or try Oct 26th 2010',
'dec 21/22 is another date',
'today is 2023-01-29 & tomorrow is 2022 11 03 & 2022-12-01',
'A trickier example: January 05,2020',
'2020-01-01 I married on 2020-12-01',
'Adding in 1st December 2018',
'And perhaps Jul 4th 2023'
)
)
r_year <- "\\d{2,4}"
r_day <- "\\d{1,2}(\\w{1,2})?" # With or without "st" etc.
r_month_num <- "\\d{1,2}"
r_month_ab <- paste0("(", paste(month.abb, collapse = "|"), ")")
r_month_full <- paste0("(", paste(month.name, collapse = "|"), ")")
r_sep <- "[^\\w]+" # The separators can be anything but word characters (letters, digits, underscore)
library(glue)
df |>
mutate(
text =
# Any numeric day/month/year
str_replace_all(text,
glue("{r_day}{r_sep}{r_month_num}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric month/day/year
str_replace_all(glue("{r_month_num}{r_sep}{r_day}{r_sep}{r_year}"),
"REP_DATE") |>
# Any numeric year/month/day
str_replace_all(glue("{r_year}{r_sep}{r_month_num}{r_sep}{r_day}"),
"REP_DATE") |>
# Any day[th]/monthname/year or monthname/day[th]/year
str_replace_all(regex(paste0(
glue("({r_day}{r_sep})?({r_month_full}|{r_month_ab})",
"{r_sep}({r_day}{r_sep})?{r_year}")
), ignore_case = TRUE),
"REP_DATE") |>
# And transform all placeholders to required date
str_replace_all("REP_DATE", "25th October 2022")
)
#> text
#> 1 tommorow is 25th October 2022
#> 2 I married on 25th October 2022
#> 3 why not going there on 25th October 2022
#> 4 25th October 2022 will be great
#> 5 A trickier example: 25th October 2022
#> 6 or try 25th October 2022
#> 7 25th October 2022 is another date
#> 8 today is 25th October 2022 & tomorrow is 25th October 2022 & 25th October 2022
#> 9 A trickier example: 25th October 2022
#> 10 25th October 2022 I married on 25th October 2022
#> 11 Adding in 25th October 2022
#> 12 And perhaps 25th October 2022
This should catch all the most common ways of writing dates, even with added "st"s "nd"s and "th"s after day number and irrespective of ordering of parts (apart from any format which puts "year" in the middle between "day" and "month", but that seems unlikely).
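For readers without stringr and glue, the named-month part of the idea can be sketched in base R. This is a simplified variant, not the code above: it assumes four-digit years and only the ordinal suffixes st, nd, rd and th, and the names r_month and named_date are made up for this example.

```r
# Building blocks, glued together with paste0 instead of glue
r_day   <- "\\d{1,2}(st|nd|rd|th)?"   # day number, optional ordinal suffix
r_year  <- "\\d{4}"                   # assumption: four-digit years only
r_month <- paste0("(", paste(c(month.name, month.abb), collapse = "|"), ")")
r_sep   <- "[^[:alnum:]]+"            # anything that is not a letter or digit

# Optional day before or after the month name, then the year
named_date <- paste0("(", r_day, r_sep, ")?", r_month, r_sep,
                     "(", r_day, r_sep, ")?", r_year)

x <- c("Adding in 1st December 2018",
       "And perhaps Jul 4th 2023",
       "or try Oct 2010")

replaced <- gsub(named_date, "REP_DATE", x, ignore.case = TRUE, perl = TRUE)
replaced
```

Full month names are listed before the abbreviations so that "December" is tried before "Dec"; the placeholder can then be swapped for the required date in a final gsub call.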

How to split an existing column into three and then append it to the data set?

Good morning, everyone! I am a beginner with R, and I was given an assignment. It looks as follows:
"Split the “eventdate” column in the ACLED dataset into separate day, month, and year columns that are then appended to the ACLED dataset."
We are working with strsplit() and paste(), but I suspect this is not enough.
The eventdate column in ACLED looks like this:
"01 August 2020"
I was trying to do it like this:
strsplit(brazil_acled$event_date, " ")
and then use the paste() function to append it. But I still do not understand how to create three columns out of splitting text in the existing data set.
I am really new with R and I am with students that are advanced. I am having a difficult time, and I appreciate any help.
Thank you!
Note that I need to do this without using loops.
Splitting the dates gives you a list that you want to rbind first before appending to the data frame.
S <- strsplit(as.character(ACLED$event_date), " ")
S
# [[1]]
# [1] "18" "September" "2020"
#
# [[2]]
# [1] "19" "September" "2020"
#
# [[3]]
# [1] "20" "September" "2020"
#
# [[4]]
# [1] "21" "September" "2020"
#
# [[5]]
# [1] "22" "September" "2020"
#
# [[6]]
# [1] "23" "September" "2020"
For rbinding multiple elements together we need do.call(rbind, ...) and perhaps want to assign appropriate column names.
R <- do.call(rbind, S)
colnames(R) <- c("day", "month", "year")
R
# day month year
# [1,] "18" "September" "2020"
# [2,] "19" "September" "2020"
# [3,] "20" "September" "2020"
# [4,] "21" "September" "2020"
# [5,] "22" "September" "2020"
# [6,] "23" "September" "2020"
Finally just cbind the result to the original data frame.
ACLED <- cbind(ACLED, R)
ACLED
# event_date sth_else day month year
# 1 18 September 2020 -0.78445901 18 September 2020
# 2 19 September 2020 -0.85090759 19 September 2020
# 3 20 September 2020 -2.41420765 20 September 2020
# 4 21 September 2020 0.03612261 21 September 2020
# 5 22 September 2020 0.20599860 22 September 2020
# 6 23 September 2020 -0.36105730 23 September 2020
You may also do this in one single step.
cbind(ACLED, `colnames<-`(do.call(rbind, strsplit(ACLED$event_date, " ")),
c("day", "month", "year")))
Note:
Maybe you need the split date as "integer". In this case you may modify R in the following way before cbinding it to ACLED.
R[,2] <- match(R[,2], month.name) ## using constant built into R
mode(R) <- "integer"
R
# day month year
# [1,] 18 9 2020
# [2,] 19 9 2020
# [3,] 20 9 2020
# [4,] 21 9 2020
# [5,] 22 9 2020
# [6,] 23 9 2020
Example data:
ACLED <- structure(list(event_date = c("18 September 2020", "19 September 2020",
"20 September 2020", "21 September 2020", "22 September 2020",
"23 September 2020"), sth_else = c(1.37095844714667, -0.564698171396089,
0.363128411337339, 0.63286260496104, 0.404268323140999, -0.106124516091484
)), class = "data.frame", row.names = c(NA, -6L))
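The split-and-bind steps of this answer can also be written so that each new column gets its own type, which a single matrix cannot hold. The following is a minimal sketch on a made-up two-row data frame (acled_demo is a hypothetical name, not the real ACLED data).

```r
# Toy stand-in for the ACLED data
acled_demo <- data.frame(
  event_date = c("18 September 2020", "01 August 2020"),
  stringsAsFactors = FALSE
)

# One row per date, one column per piece (character matrix)
parts <- do.call(rbind, strsplit(acled_demo$event_date, " ", fixed = TRUE))

# Append the pieces as separate columns, converting where it makes sense
acled_demo$day   <- as.integer(parts[, 1])
acled_demo$month <- parts[, 2]
acled_demo$year  <- as.integer(parts[, 3])

str(acled_demo)
```

No loop is needed: strsplit, do.call(rbind, ...) and the column assignments are all vectorised over the rows.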
Sorry, paste is of no use here, since it is more or less the opposite of strsplit. Here is another possibility using strsplit that you may find clearer.
You had the basics correct, and you recognized what you needed: some tool to take the three results from strsplit and put them in the right columns.
May I suggest sapply? Although it is not exactly simple, you can quickly see that "[", 1 is R's way of saying "grab the first piece" of each element; use 2 or 3 for the second or third piece.
# fake data
brazil_acled <- structure(list(event_date = c("18 September 2020", "19 September 2020",
"20 September 2020", "21 September 2020", "22 September 2020",
"23 September 2020")), class = "data.frame", row.names = c(NA, -6L))
brazil_acled
#> event_date
#> 1 18 September 2020
#> 2 19 September 2020
#> 3 20 September 2020
#> 4 21 September 2020
#> 5 22 September 2020
#> 6 23 September 2020
brazil_acled$day <- sapply(strsplit(brazil_acled$event_date, " "), "[", 1)
brazil_acled$month <- sapply(strsplit(brazil_acled$event_date, " "), "[", 2)
brazil_acled$year <- sapply(strsplit(brazil_acled$event_date, " "), "[", 3)
brazil_acled
#> event_date day month year
#> 1 18 September 2020 18 September 2020
#> 2 19 September 2020 19 September 2020
#> 3 20 September 2020 20 September 2020
#> 4 21 September 2020 21 September 2020
#> 5 22 September 2020 22 September 2020
#> 6 23 September 2020 23 September 2020
paste would be useful to paste them back together in a different order...
brazil_acled$YMD <- paste(brazil_acled$year, brazil_acled$month, brazil_acled$day, sep = "-")

Convert fromJSON list to a data frame

I am getting data from the BLS website using the package blsAPI.
The code is:
library(rjson) # provides fromJSON
library(blsAPI)
employ <- blsAPI(payload= "CES0500000001")
emp <- fromJSON(employ)
The data set emp is a list... this is where I am stumped. I've been trying all types of variations to convert emp to data.frame from list with no success.
Just set the argument return_data_frame = TRUE of the blsAPI function. A data.frame will be returned instead of a list (the default behaviour).
library(rjson)
library(blsAPI)
response <- blsAPI("CES0500000001", return_data_frame = TRUE)
head(response)
Output:
year period periodName value seriesID
1 2018 M08 August 126939 CES0500000001
2 2018 M07 July 126735 CES0500000001
3 2018 M06 June 126582 CES0500000001
4 2018 M05 May 126390 CES0500000001
5 2018 M04 April 126130 CES0500000001
6 2018 M03 March 125956 CES0500000001
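If you already have a parsed list and want to convert it by hand, the usual pattern is to turn each record into a one-row data frame and bind the rows. The sketch below uses an invented list called records; its shape is an assumption made for illustration and is not taken from the actual BLS response, so the real list may need an extra extraction step first.

```r
# Hypothetical parsed JSON: a list of records with identical fields
records <- list(
  list(year = "2018", period = "M08", periodName = "August", value = "126939"),
  list(year = "2018", period = "M07", periodName = "July",   value = "126735")
)

# One-row data frame per record, then stack them
emp_df <- do.call(
  rbind,
  lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE))
)

# JSON values often arrive as strings; convert the numeric column explicitly
emp_df$value <- as.numeric(emp_df$value)
emp_df
```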

change of unit of analysis for panel data in R

I am looking at foreign powers intervening in civil wars, using RStudio. The unit of analysis in my first dataset is the conflict-year, while in the second one it is the conflict-month. I need both of them in conflict-years so that I can merge them.
Is there any command that allows you to do the opposite of expanding rows?
It's hard to give you specifics without a sample of your data so we know what the structure is. I'm assuming your month-level dataset stores the month as a character string that includes a year. You should be able to extract the year with separate from the tidyr package:
library(tidyverse)
month <- c("June 2015", "July 2015", "September 2016", "August 2016", "March 2014")
conflict <- c("A", "B", "C", "D", "E")
my.data <- data.frame(month, conflict)
my.data
month conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
my.data <- my.data %>%
separate(month, c("month", "year"), sep = " ")
my.data
month year conflict
1 June 2015 A
2 July 2015 B
3 September 2016 C
4 August 2016 D
5 March 2014 E
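Once the year is available, collapsing the month-level rows to one row per conflict and year (the "opposite of expanding rows") is an aggregation. Here is a minimal base R sketch; the data frame monthly and its interventions column are invented for illustration, and summing is only one possible way to combine the months.

```r
# Hypothetical conflict-month data
monthly <- data.frame(
  conflict = c("A", "A", "A", "B", "B"),
  month = c("June 2015", "July 2015", "March 2016",
            "August 2016", "September 2016"),
  interventions = c(1, 0, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Keep only the part after the last space, i.e. the year
monthly$year <- as.integer(sub("^.* ", "", monthly$month))

# One row per conflict-year, summing the monthly counts
yearly <- aggregate(interventions ~ conflict + year, data = monthly, FUN = sum)
yearly
```

The result can then be merged with the conflict-year dataset, for example with merge(yearly, other_data, by = c("conflict", "year")), where other_data stands for your first dataset.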

Fill in missing year in ordered list of dates

I have collected some time series data from the web and the timestamp that I got looks like below.
24 Jun
21 Mar
20 Jan
10 Dec
20 Jun
20 Jan
10 Dec
...
The interesting part is that the year is missing in the data, however, all the records are ordered, and you can infer the year from the record and fill in the missing data. So the data after imputing should be like this:
24 Jun 2014
21 Mar 2014
20 Jan 2014
10 Dec 2013
20 Jun 2013
20 Jan 2013
10 Dec 2012
...
Before rolling up my sleeves and writing a for loop with nested logic: is there an easy way that might work out of the box in R to impute the missing year?
Thanks a lot for any suggestion!
Here's one idea
## Make data easily reproducible
df <- data.frame(day=c(24, 21, 20, 10, 20, 20, 10),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec"))
## Convert each month-day combo to its corresponding "julian date" (day of the year)
datestring <- paste("2012", match(df[[2]], month.abb), df[[1]], sep = "-")
date <- strptime(datestring, format = "%Y-%m-%d")
julian <- as.integer(strftime(date, format = "%j"))
## Transitions between years occur wherever julian date increases between
## two observations
df$year <- 2014 - cumsum(diff(c(julian[1], julian))>0)
## Check that it worked
df
# day month year
# 1 24 Jun 2014
# 2 21 Mar 2014
# 3 20 Jan 2014
# 4 10 Dec 2013
# 5 20 Jun 2013
# 6 20 Jan 2013
# 7 10 Dec 2012
The OP has requested to complete the years in descending order starting in 2014.
Here is an alternative approach which works without date conversion and fake dates. Furthermore, this approach can be modified to work with fiscal years which start on a different month than January.
# create sample dataset
df <- data.frame(
day = c(24L, 21L, 20L, 10L, 20L, 20L, 21L, 10L, 30L, 10L, 10L, 7L),
month = c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Jan", "Dec", "Jan",
"Jan", "Jan", "Jun"))
df$year <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb)) + df$day) > 0))
df
day month year
1 24 Jun 2014
2 21 Mar 2014
3 20 Jan 2014
4 10 Dec 2013
5 20 Jun 2013
6 20 Jan 2013
7 21 Jan 2012
8 10 Dec 2011
9 30 Jan 2011
10 10 Jan 2011
11 10 Jan 2011
12 7 Jun 2010
Completion of fiscal years
Let's assume the business has decided to start its fiscal year on February 1. Thus, January lies in a different fiscal year than February or March of the same calendar year.
To handle fiscal years, we only need to shuffle the factor levels accordingly:
df$fy <- 2014 - cumsum(c(0L, diff(100L*as.integer(
factor(df$month, levels = month.abb[c(2:12, 1)])) + df$day) > 0))
df
day month year fy
1 24 Jun 2014 2014
2 21 Mar 2014 2014
3 20 Jan 2014 2013
4 10 Dec 2013 2013
5 20 Jun 2013 2013
6 20 Jan 2013 2012
7 21 Jan 2012 2011
8 10 Dec 2011 2011
9 30 Jan 2011 2010
10 10 Jan 2011 2010
11 10 Jan 2011 2010
12 7 Jun 2010 2010
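Both variants can be folded into one helper. The following is a sketch only: the function name fill_years and its arguments are made up here, and it assumes that the records are sorted from newest to oldest with no gap longer than a year.

```r
# Infer years for records ordered from newest to oldest.
# first_month = 1 gives calendar years; first_month = 2 gives fiscal
# years starting in February, and so on.
fill_years <- function(day, month, start_year, first_month = 1L) {
  # Rotate the month order so that first_month opens the year
  lvls <- month.abb[c(first_month:12, seq_len(first_month - 1L))]
  # Sortable key: position of the month within the year, then the day
  key <- 100L * as.integer(factor(month, levels = lvls)) + as.integer(day)
  # Going back in time, a jump to a larger key means a year boundary
  start_year - cumsum(c(0L, diff(key) > 0))
}

day   <- c(24, 21, 20, 10, 20, 20, 10)
month <- c("Jun", "Mar", "Jan", "Dec", "Jun", "Jan", "Dec")

calendar_years <- fill_years(day, month, start_year = 2014)
fiscal_years   <- fill_years(day, month, start_year = 2014, first_month = 2L)
```

Two records with the same day and month are treated as belonging to the same year, so a gap of exactly one year or more between neighbouring records cannot be detected by this approach.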

Resources