Find and extract year within sentence for each cell in R - r

I have a large dataframe of 22641 obs. and 12 variables.
The first column "year" includes extracted values from satellite images in the format below.
1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc
From this format, I only want to keep the date which in this case is 19870517 and format it as date (so two different things). Usually, I use the regex to extract the words that I want, but here the date is different for each cell and I have no idea how to replace the above text with only the date. Maybe the way to do this is to search by position within the sentence but I do not know how.
Any ideas?
Thanks.

It's not clear what the "date is different in each cell" means but if it means that the value of the date is different and it is always the 7th field then either of (1) or (2) will work. If it either means that it consists of 8 consecutive digits anywhere in the text or 8 consecutive digits surrounded by _ anywhere in the text then see (3).
1) Assuming the input DF shown in reproducible form in the Note at the end, use read.table to read year, pick out the 7th field and then convert it to Date class. No packages are used.
transform(read.table(text = DF$year, sep = "_")[7],
year = as.Date(as.character(V7), "%Y%m%d"), V7 = NULL)
## year
## 1 1987-05-17
2) Another alternative is separate in tidyr. 0.8.2 or later is needed.
library(dplyr)
library(tidyr)
DF %>%
separate(year, c(rep(NA, 6), "year"), extra = "drop") %>%
mutate(year = as.Date(as.character(year), "%Y%m%d"))
## year
## 1 1987-05-17
3) If the date is the only sequence of 8 digits in the year field, use the code below. If we know it is surrounded by _ delimiters, the regular expression "_(\\d{8})_" can be used instead.
library(gsubfn)
transform(DF,
year = do.call("c", strapply(DF$year, "\\d{8}", ~ as.Date(x, "%Y%m%d"))))
## year
## 1 1987-05-17
Note
DF <- data.frame(year = "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
stringsAsFactors = FALSE)
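The question also asks about picking the date out by position. A minimal base R sketch of that idea, assuming the date is always the 7th underscore-separated field. DF2 and its second ID are made up for illustration and are not from the original data:

```r
# Hypothetical two-row data frame; the second ID is invented for this example.
DF2 <- data.frame(
  year = c("1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc",
           "1_1_1_1_LT05_127024_19990102_0000aaaabbbbccccdddd"),
  stringsAsFactors = FALSE
)

# Split every ID on "_" and keep the 7th piece (assumption: fixed position).
fields <- strsplit(DF2$year, "_", fixed = TRUE)
date_chr <- vapply(fields, function(f) f[7], character(1))

# Convert the 8-digit strings to Date class.
DF2$date <- as.Date(date_chr, format = "%Y%m%d")
DF2$date
## [1] "1987-05-17" "1999-01-02"
```

If some rows have fewer than 7 fields, f[7] is NA and as.Date returns NA for those rows, so bad IDs are easy to spot afterwards.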

Not sure if this will generalize to your whole data but maybe:
gsub(
'(^(?:.*?[^0-9])?)(\\d{8})((?:[^0-9].*)?$)',
'\\2',
'1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc',
perl = TRUE
)
## [1] "19870517"
This uses group capturing and throws away anything but bounded 8 digit strings.

You can use sub to extract the data string and as.Date to convert it into R's date format:
as.Date(sub(".+?([0-9]+)_[^_]+$", "\\1", txt), "%Y%m%d")
# [1] "1987-05-17"
where txt <- "1_1_1_1_LT05_127024_19870517_00005ff8aac6b6bf60bc"

Related

data frame with mixed date format

I would like to change all the mixed date format into one format for example d-m-y
here is the data frame
x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
I have tried using the code below, but it gives NAs
newdateformat <- as.Date(x$Birthdate,
format = "%m%d%y", origin = "2020-6-25")
newdateformat
Then I tried using parse, but it also gives NAs which means it failed to parse
require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))
[1] NA NA "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"
and I also couldn't find the format for the first date in the data frame, which is "36085.0"
I did find this code, but I still couldn't understand what the number means and what "origin" means
dates <- c(30829, 38540)
betterDates <- as.Date(dates,
origin = "1899-12-30")
P.S.: I'm quite new to R, so I'd appreciate it if you could use a simple explanation. Thank you!
You should parse each format separately. For each format, select the relevant rows with a regular expression and transform only those rows, then move on to the next format. I'll give the answer with data.table instead of data.frame because I've forgotten how to use data.frame.
library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
"Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table
# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]
# handle dates like "36085.0": days since 1904 (or 1900)
# see https://learn.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]
# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]
# result
> x
Name Birthdate Birthdate1
1: A 36085.0 2002-10-18
2: B 2001-sep-12 2001-09-12
3: C Feb-18-2005 2005-02-18
4: D 05/27/84 1984-05-27
5: E 2020-6-25 2020-06-25
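Which origin applies to the "36085.0" value depends on which Excel date system the file was saved with. As a small sketch of the difference, assuming the number really is an Excel serial date, here is the same value converted with both origins (1899-12-30 is the origin used in the question's snippet for the 1900 system):

```r
serial <- as.numeric("36085.0")

# Excel 1900 date system (origin taken from the snippet in the question)
as.Date(serial, origin = "1899-12-30")
## [1] "1998-10-17"

# Excel 1904 date system (the origin used in the answer above)
as.Date(serial, origin = "1904-01-01")
## [1] "2002-10-18"
```

The two results are 1462 days apart, so it is worth checking one known birthdate against the source spreadsheet before converting the whole column.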

Extracting String from Column

I am working with the following dataset called results and am trying to add in a column that only contains the date (ideally just the year) of the row.
I am trying to extract just the date (for example: 2012-02-10) from the column_label column.
This is the code that I use:
pattern <- "- (.*?) .RData"
subsetpk <- results %>%
filter(team=="Pakistan") %>%
mutate(year = str_extract(column_label, pattern))
This, however, only gives me NA values.
You can use a regular expression. Here '\\d{4}' just matches the first 4 consecutive digits that are found in the string. This works if your data always looks the same as your example. If not, you may need something more sophisticated. If this doesn't work, post some more example data.
library(tidyverse)
library(stringr)
df <- data.frame(column_label = c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs"))
df %>%
mutate(my_year = str_extract(column_label, '\\d{4}'))
column_label my_year
#1 Afghanistan-Pakistan-2012-02-10.RDATA.overs 2012
#2 Afghanistan-Pakistan-2019-02-10.RDATA.overs 2019
The ymd() function from the lubridate package
Transforms dates stored in character and numeric vectors to Date or POSIXct objects
So, we can pass the complete string conveniently without having to deal with regular expressions:
x <- c("Afghanistan-Pakistan-2012-02-10.RDATA.overs",
"Afghanistan-Pakistan-2019-02-10.RDATA.overs")
lubridate::ymd(x)
[1] "2012-02-10" "2019-02-10"
The year can be derived from the extracted dates by
library(lubridate)
year(ymd(x))
[1] 2012 2019
Use str_extract from the package stringr:
DATA:
results <- data.frame(
column_label = "Afghanistan-Pakistan-2012-02-10.RData.overs")
SOLUTION:
results$date <- str_extract(results$column_label, "\\d{4}-\\d{2}-\\d{2}")
RESULT:
results
column_label date
1 Afghanistan-Pakistan-2012-02-10.RData.overs 2012-02-10
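If you would rather not load any packages, the same extraction can be sketched in base R. This assumes every label contains exactly one date written as YYYY-MM-DD. The labels vector is just the example strings from this thread:

```r
labels <- c("Afghanistan-Pakistan-2012-02-10.RData.overs",
            "Afghanistan-Pakistan-2019-02-10.RData.overs")

# regexpr() finds the first match per string; regmatches() pulls it out.
date_chr <- regmatches(labels, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", labels))

# Convert to Date, then derive the year as an integer.
dates <- as.Date(date_chr)
years <- as.integer(format(dates, "%Y"))
years
## [1] 2012 2019
```

Note that regmatches() drops strings without a match, so this shortcut is only safe when every row is known to contain a date.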

Find year in random data in R

I have 71 columns in a dataframe, 10 of which include data that may include a year between 1990 and 2019 in the format YYYY (e.g. 2019). For example:
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
I am trying to find a way to pull the years from relevant cells and insert them in a new column.
So far, I am only aware of how to filter the data in a very time-consuming way. I have produced the following code, which starts like this:
dated_data <- select(undated_data, 1:71) %>%
filter(grepl("1990", id_1) | filter(grepl("1990", id_2) | filter(grepl("1991", id_1) | filter(grepl("1991", id_2)
However, it takes a really long time to write that for all ten columns and all 30 years. I am sure there is a quicker way. I also have no idea how to then pull the dates from each of the matching cells into a new cell.
The output I want looks like this:
dated_data$year <- c("2013", "2014", "2016", "1990")
Does anyone know how I do this? Thank you in advance for your help!
There are many ways. This is one of them:
Step 1: define a pattern you want to match with regex:
pattern <- "(1|2)\\d{3}"
Step 2: define a function to extract raw matches:
extract <- function(x) unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
Step 3: apply the function to your data, e.g., id_1:
extract(id_1)
[1] "2013" "2014" "2016" "1990"
Here's another way, actually simpler ;)
It uses the str_extract function from the stringr package. So you install the package and activate it:
install.packages("stringr")
library(stringr)
and use str_extract to pull your matches:
years <- str_extract(id_1,"(1|2)\\d{3}")
years
[1] "2013" "2014" "2016" "1990"
EDIT:
If not every string contains a match and you want to preserve the length of the vectors/columns, you can use ifelse to test whether the regex finds a match and, where it doesn't, to put NA.
For example, if your data is like this (note the two added strings which do not contain years):
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759", "gbgbgbgb", "hnhna25")
you can set up the ifelse test like this:
years <- ifelse(grepl("(1|2)\\d{3}", id_3), str_extract(id_3,"(1|2)\\d{3}"), NA)
years
[1] "2013" "2014" "2016" "1990" NA NA
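For comparison, here is a base R sketch that also keeps the vector length, assuming only the first year-like match per string is wanted. It pre-fills the result with NA and overwrites only the positions where regexpr() reports a match:

```r
id_3 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759",
          "gbgbgbgb", "hnhna25")

# regexpr() returns -1 for strings without a match.
m <- regexpr("(1|2)[0-9]{3}", id_3)

# Start with all NA, then fill in the matched positions only.
years <- rep(NA_character_, length(id_3))
years[m > 0] <- regmatches(id_3, m)
years
## [1] "2013" "2014" "2016" "1990" NA     NA
```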
Based on the example in your question, you are trying to filter out any rows without years and then extract the year from the string. It looks like every row only contains 1 year. Here is some code so that you do not have to write long filter statements for 10 columns and 30 years. Keep in mind that I don't have your data so I couldn't test it.
library(tidyverse)
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate(year = str_extract(id_1, pattern = paste0(1990:2019, collapse = "|")))
EDIT: based on your comment it looks like maybe some columns have a year and others do not. What we do instead is pull the year out of any column with id_* and then we coalesce the columns together. Again, without your data it's tough to test this.
undated_data %>%
select(1:71) %>%
filter_at(vars(starts_with("id_")), any_vars(grepl(paste0(1990:2019, collapse = "|"), .))) %>%
mutate_at(vars(starts_with("id_")), list(year = ~str_extract(., pattern = paste0(1990:2019, collapse = "|")))) %>%
mutate(year = coalesce(!!!select(., ends_with("_year")))) %>%
select(-ends_with("_year"))
Using tidyverse methods:
undated_data %>%
mutate_at(vars(1:71),
~ str_extract(., "(1|2)[0-9]{3}"))
(Note that the regex pattern will match numbers that may not be years, such as 2999; if your data has many "false positives" like that, you may be better off writing a custom function.)
Here is a similar solution to the one provided, but using dplyr and stringr on a data.frame.
library(stringr)
library(dplyr)
library(tidyr) # provides pivot_longer()
df<-data.frame("X1" = id_1,"X2" = id_2)
#Set in cols the column names from which years are going to be extracted
df %>%
pivot_longer(cols = c("X1","X2"), names_to = "id") %>%
arrange(id) %>%
mutate(new = unlist(str_extract_all(value, pattern = "(1|2)\\d{3}")))
Base R solution:
# Sample data: id_1; id_2 => character vectors
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
# Thanks @Chris Ruehlemann: store the date pattern: date_pattern => character scalar
date_pattern <- "(1|2)\\d{3}"
# Convert to data.frame: df => data.frame
df <- data.frame(id_1, id_2, stringsAsFactors = FALSE)
# Subset the data to only contain date information vectors: dates_subset => data.frame
dates_subset <- df[,sapply(df, function(x){any(grepl(date_pattern, x))}), drop = FALSE]
# Initialise the year vector: year => character vector:
df$years <- NA_character_
# Remove punctuation and letters, return valid dates, combine into a comma-separated string:
# Store the dates found in the string: years => character vector
df$years[which(rowSums(Vectorize(grepl)(date_pattern, dates_subset)) > 0)] <-
apply(sapply(dates_subset, function(x){
grep(date_pattern, unlist(strsplit(x, "[[:punct:]]|[a-zA-Z]")), value = TRUE)}),
1, paste, collapse = ", ")
Here may be another solution.
We just use the gsub() function with the pattern ".*(199[0-9]|20[01][0-9]).*".
The pattern captures a year between 1990 and 2019 as its only group, so we replace the original text with the contents of that first group :)
library(magrittr)
id_1 <- c("regkfg_2013", "fsgdf-2014", "f2016sghsg", "gjdg1990_3759")
id_2 <- c("dghdgl2013jg", "2fgdg_2014_hf", "ghdg_2016*89", "gc-hs1990")
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_1)
# [1] "2013" "2014" "2016" "1990"
gsub(".*(199[0-9]|20[01][0-9]).*","\\1",id_2)
#[1] "2013" "2014" "2016" "1990"
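One thing to keep in mind with the gsub() approach is that a string with no match is returned unchanged rather than as NA. The question also asks for a single year column built from several id columns. A hedged base R sketch of that, assuming each row has a year in at most one of the columns, is below. The data frame and the helper extract_year are made up for illustration:

```r
# Hypothetical data: each row has a year in only one of the two columns.
df <- data.frame(id_1 = c("regkfg_2013", "no_year_here"),
                 id_2 = c("nothing", "gc-hs1990"),
                 stringsAsFactors = FALSE)

year_pattern <- "199[0-9]|20[01][0-9]"  # years 1990 to 2019 only

# Illustrative helper: first matching year per string, NA when there is none.
extract_year <- function(x) {
  m <- regexpr(year_pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

# Extract per column, then keep the first non-NA value in each row.
per_col <- lapply(df[c("id_1", "id_2")], extract_year)
df$year <- Reduce(function(a, b) ifelse(is.na(a), b, a), per_col)
df$year
## [1] "2013" "1990"
```

With ten columns, only the vector of column names passed to lapply() would need to change.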

Adding a new column with month extracted from a separate already existing "date" (mdy) column

Trying to add a new column in my data table denoting the month (either as a numeric value or character) using an already available column of "SetDate", which is in the format mdy.
I'm new to R and having trouble. Thank you
base solution:
f = "%m/%d/%y" # note the lowercase y; it's because the year is 92, not 1992
dataset$SetDateMonth <- format(as.POSIXct(dataset$SetDate, format = f), "%m")
Basically, what it does is it converts the column from character (presumed class) to POSIXct, which allows for an easy extraction of month information.
Quick test:
format(as.POSIXct('1/1/92', format = "%m/%d/%y"), "%m")
[1] "01"
Try this (created a small example):
library(lubridate)
date_example <- "1/1/92"
lubridate::mdy(date_example)
[1] "1992-01-01"
lubridate::mdy(date_example) %>% lubridate::month()
[1] 1
If you want full month as character string, use:
lubridate::mdy(date_example) %>% lubridate::month(label = TRUE, abbr = FALSE)
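For a whole column rather than a single string, a base R sketch might look like the following. The dataset and its two SetDate values are invented for the example, and the new column names are only suggestions. month.name is used instead of format(..., "%B") so that the result does not depend on the locale:

```r
# Hypothetical data table with an mdy-formatted character column.
dataset <- data.frame(SetDate = c("1/1/92", "12/25/05"),
                      stringsAsFactors = FALSE)

# Parse once; with %y, two-digit years 69-99 map to 19xx and 00-68 to 20xx.
d <- as.Date(dataset$SetDate, format = "%m/%d/%y")

# Month as a number and as a full English name.
dataset$SetMonth <- as.integer(format(d, "%m"))
dataset$SetMonthName <- month.name[dataset$SetMonth]
dataset$SetMonth
## [1]  1 12
```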

as.Date function gives different result in a for loop

I have a slight problem where my as.Date function gives a different result when I put it in a for loop. I'm looking in a folder with subfolders (one per date) that contain images. I build date_list to organize all the dates (for plotting options at a later stage). The Julian day starts from the first of January of each year, so because I have 4 years of data, the year must be flexible.
# Set up a matrix with 4 columns and a counter q. jan is used to set all dates to the first of January
date_list <- outer(1:52, 1:4)
q = 1
jan <- "-01-01"
for (scene in folders){
year <- as.numeric(substr(scene, start=10, stop=13))
day <- as.numeric(substr(scene, start=14, stop=16))
datum <- paste(year, day, sep='_')
date_list[q, 1] <- datum
date_list[q, 2] <- year
date_list[q, 3] <- day
date_list[q, 4] <- as.Date(day, origin = as.Date(paste(year,jan, sep="")))
q = q+1
}
Output final row:
[52,] "2016_267" "2016" "267" "17068"
What am I missing in date_list[q, 4] that stops my integer from being stored as a date?
running the following code does work, but due to the large amount of scenes and folders I like to automate this:
as.Date(day, origin = as.Date(paste(year,jan, sep="")))
Thank you for your time!
Well, I assume this would answer your first question:
date_list[q, 4] <- as.character(as.Date(datum,format="%Y_%j"))
as.Date accepts a format argument (%Y and %j are documented in strptime); %j is the day of the year (the "Julian day" here). This is a little easier to read than using origin and multiple paste calls.
Your problem is actually linked to what a Date object is:
> dput(as.Date("2016-01-10"))
structure(16810, class = "Date")
When entered into a matrix (your date_list), it is coerced to character without any special treatment, like this:
> d<-as.Date("2016-01-10")
> class(d)<-"character"
> d
[1] "16810"
Hence you get only the number of days since 1970-01-01. When you ask for the date as a character representation with as.character, it gives the correct value because the Date class has an as.character method, which first computes the date in human-readable format before returning a character value.
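The coercion can be reproduced in a few lines. This is only a sketch with a made-up one-row character matrix m standing in for date_list:

```r
d <- as.Date("2016_267", format = "%Y_%j")

# Stand-in for date_list: a character matrix with one row and two columns.
m <- matrix(NA_character_, nrow = 1, ncol = 2)

m[1, 1] <- d                # the Date class is dropped; the day count is stored
m[1, 2] <- as.character(d)  # the formatted date is stored

m
##      [,1]    [,2]
## [1,] "17067" "2016-09-23"
```

A matrix can only hold one basic type, so if you want to keep real Date values next to numbers and strings, a data.frame is usually the better container.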
Now if I understood well your problem I would go this way:
First create a function to work on one string:
name_to_list <- function(name) {
dpart <- substr(name, start=10, stop=16)
date <- as.POSIXlt(dpart, format="%Y%j")
c("datum"=paste(date$year+1900,date$yday,sep="_"), "year"=date$year+1900, "julian_day"=date$yday, "date"=as.character(date) )
}
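This function just gets your substring and converts it to the POSIXlt class, which gives us the Julian day, year and date in one pass. As the year is stored as an integer count since 1900 (it could be negative), we have to add 1900 when storing the year in the fields. Note that yday is zero-based (0 to 365), so the julian_day values below are one less than the day number in the folder name; add 1 to yday if you want them to match.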
Then if your folders variable is a vector of string:
lapply(folders,name_to_list)
which for folders=c("LC81730382016267LGN00","LC81730382016287LGN00","LC81730382016167LGN00") gives:
[[1]]
datum year julian_day date
"2016_266" "2016" "266" "2016-09-23"
[[2]]
datum year julian_day date
"2016_286" "2016" "286" "2016-10-13"
[[3]]
datum year julian_day date
"2016_166" "2016" "166" "2016-06-15"
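Since substr and as.Date are both vectorised, the loop can also be avoided entirely. A minimal sketch, assuming every folder name carries the year in characters 10 to 13 and the day of year in characters 14 to 16, as in the example names above (date_df is an illustrative name):

```r
folders <- c("LC81730382016267LGN00", "LC81730382016287LGN00",
             "LC81730382016167LGN00")

# Characters 10-16 hold the year followed by the day of year, e.g. "2016267".
dpart <- substr(folders, start = 10, stop = 16)

# A data.frame keeps the date column as a real Date instead of a string.
date_df <- data.frame(
  scene = folders,
  year  = as.integer(substr(dpart, 1, 4)),
  day   = as.integer(substr(dpart, 5, 7)),
  date  = as.Date(dpart, format = "%Y%j"),
  stringsAsFactors = FALSE
)
format(date_df$date)
## [1] "2016-09-23" "2016-10-13" "2016-06-15"
```

Here the day column keeps the day number exactly as it appears in the folder name (267, 287, 167).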
Do you mean to output your day as 3 numbers? Should it not be 2 numbers?
day <- as.numeric(substr(scene, start=15, stop=16))
or
day <- as.numeric(substr(scene, start=14, stop=15))
That could at least be part of the issue. Providing an example of what typical values of "scene" are would be helpful here.
