Extracting only some digits from a numerical string - r

I have a folder full of raster files. They come in groups of 12, where each file is one band (there are 12 bands) of the Sentinel-2 satellite. I simply want to create a loop that goes through the folder and first identifies the two bands that I am interested in (Bands 4 and 5). To process them in pairs from the same set, I am trying to extract the date of the photo from the Band 4 file name as a string, which I will then use to retrieve the Band 5 file from the same date.
That is where the problem comes in. The names look like this: T31UER_20210722T105619_B12.jp2, but I only manage to extract the numbers from it and get rid of the 31, which gives me: 20190419105621042
The core of my question is then: how can I select only a small part (YYYY/MM/DD) of this string?
Here is the piece of code. As you can see, my method is to select the part I want deleted. But it doesn't work for the second step, where the part coming after the date changes every time, except for the 042.
Thank you very much!
for (f in files){
  # Load Band 4
  Bande4 <- list.files(path = "C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
                       pattern = "B04.jp2$", full.names = TRUE)
  # Copy the date
  x <- gsub("[A-z //.//(//)]", "", Bande4)
  y <- gsub("31", "", x)
  z <- gsub("??? this part changes for every file!", "", y)
  # Load the matching Band 5
  Bande5 <- list.files(path = "C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
                       pattern = z, full.names = TRUE)
  # Calculate NDVI
  NDVI <- (Bande5 - Bande4) / (Bande5 + Bande4)
  # Save the result
  r4 <- writeRaster(NDVI, "C:/Users/Perrin/Desktop/INRA/Raster/BDA/Images en vrac",
                    format = "GTiff", overwrite = TRUE)
}

You can use substr to extract certain characters from a string, e.g.:
substr(z, 1, 8)
[1] "20210722"
If your names are always in the same format, you can directly use substr without gsub first:
substr(Bande4, 8, 15)
# e.g. with
substr("T31UER_20210722T105619_B12.jp2", 8, 15)
[1] "20210722"

You can select the date because it is an 8-digit string between an underscore and a capital letter (here I assume it is always "T"):
str <- "T31UER_20210722T105619_B12.jp2"
sub("(.*_)([[:digit:]]{8})(T.*)", "\\2", str)
#> [1] "20210722"
I describe the string as a regex and keep only its second part (the parts being delimited by parentheses).
I hope it matches all your rasters!
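Putting the two answers together, the pairing logic for the loop could be sketched like this on plain file name strings (the names follow the question's T31UER_&lt;date&gt;T&lt;time&gt;_B04.jp2 pattern; the raster loading and NDVI steps are left out, and the second file name is an invented example):

```r
# Example Band 4 file names in the question's naming scheme
b4_files <- c("T31UER_20210722T105619_B04.jp2",
              "T31UER_20190419T105621_B04.jp2")

# Extract the 8-digit date token that sits between an underscore and "T"
dates <- sub(".*_(\\d{8})T\\d{6}_B04\\.jp2$", "\\1", b4_files)
dates
# [1] "20210722" "20190419"

# Each date can then be turned into a pattern that finds the matching
# Band 5 file, e.g. with list.files(path, pattern = b5_patterns[i])
b5_patterns <- paste0(dates, ".*B05\\.jp2$")
```

This keeps the date extraction independent of the tile code (the "31" in T31UER), which is what tripped up the gsub approach.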

Related

R: extract dates and numbers from PDF

I'm really struggling to extract the proper information from several thousand PDF files from the NTSB (some dates and numbers, to be specific); these PDFs don't need to be OCRed, and each report is almost identical in length and layout.
I need to extract the date and time of the accident (first page) and some other information, like the pilot's age or flight experience. What I tried does the job for several files but doesn't work for every file, since the code I am using is poorly written.
# An example with a single file
library(pdftools)
library(readr)

# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
rows <- scan(textConnection(pdf), what = "character", sep = "\n")

# Extract the date of the accident based on the 'Date & Time' occurrence
date <- rows[grep(pattern = 'Date & Time', x = rows, ignore.case = T, value = F)]
date <- strsplit(date, " ")
date[[1]][9] # this method is not desirable since the date will not always be in that position

# Pilot age
age <- rows[grep(pattern = 'Age', x = rows, ignore.case = F, value = F)]
age <- strsplit(age, split = ' ')
age <- age[[1]][length(age[[1]])] # again, I'm using the exact position in that list
age <- readr::parse_number(age)
The main issue I have is when I try to extract the date and time of the accident. Is it possible to extract that exact piece of information without indexing into a list as I did here?
I think the best approach to achieve what you want is to use regex.
In this case I use the stringr library. The main idea with regex is to find
the desired string pattern, in this case the date 'July 29, 2014, 11:15'.
Bear in mind that you'll have to check the date format for each PDF file.
library(pdftools)
library(readr)
library(stringr)
# Download the file and read it row by row
file <- 'http://data.ntsb.gov/carol-repgen/api/Aviation/ReportMain/GenerateNewestReport/89789/pdf' # less than 100 kb
destfile <- paste0(getwd(), "/example.pdf")
download.file(file, destfile)
pdf <- pdf_text(destfile)
## New code
# Regex pattern for date 'July 29, 2014, 11:15'
regex_pattern <- "[T|t]ime\\:(.*\\d{2}\\:\\d{2})"
# Getting date from page 1
grouped_matched <- str_match_all(pdf[1], regex_pattern)
# This returns a list of match matrices: column 1 is the full match,
# column 2 the first capture group
raw_date <- grouped_matched[[1]][2] # first element, second column (the group)
# Clean date
date <- trimws(raw_date)
# Using dplyr
library(dplyr)
date <- pdf[1] %>%
  str_match_all(regex_pattern) %>%
  .[[1]] %>%   # First list element
  .[2] %>%     # Second group
  trimws()     # Remove extra white spaces
You can make a function to extract the date, changing the regex pattern for different files.
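Following that suggestion, the extraction could be wrapped in a small reusable function. This sketch uses base R's regexec() as a dependency-free equivalent of str_match_all; the function name and default pattern are illustrative, not part of the original answer:

```r
# Extract the first "Month dd, yyyy, hh:mm"-style date that follows
# "Time:" in a page of text; returns NA if nothing matches
extract_date <- function(page_text,
                         pattern = "[Tt]ime: *(.*[0-9]{2}:[0-9]{2})") {
  m <- regmatches(page_text, regexec(pattern, page_text))[[1]]
  if (length(m) < 2) return(NA_character_)
  trimws(m[2])  # m[1] is the full match, m[2] the captured group
}

extract_date("Date & Time: July 29, 2014, 11:15 CDT")
# [1] "July 29, 2014, 11:15"
```

The pattern argument can then be swapped per report format, as the answer suggests.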
Regards

R How to Search for String Pattern and Extract Custom character lengths from that location?

I am looking to extract a pattern and then a custom number of characters to the left or right of that pattern. I believe this is possible with regex but am unsure how to proceed. Below is an example of the data and the output I am looking for:
library(data.table)
#my data set
df = data.table(
  event = c(1,2,3),
  notes = c("watch this movie from 4-7pm",
            "watch this musical from 5-9pm",
            "eat breakfast at this place from 7-9am")
)
#how do I point R to a string section and then pull characters around it?
#example:
grepl('pm|am', df$notes) # I can see an index of where these keywords exist, but how can I tell R
# to locate that word and then pull N digits to the left or right of it, like substr()?
#output would be
#'4-7pm', '5-9pm', '7-9am'
#right now I can extract the pattern:
library(stringr)
str_extract(df$notes, "pm")
#but I also want to then pull things to the left or right of it.
Maybe in your case just the below will work:
sapply(df$notes, function(x) {
  grep("am|pm", unlist(strsplit(x, " ")), value = T)
}, USE.NAMES = FALSE)
[1] "4-7pm" "5-9pm" "7-9am"
However, this can still fail because of edge cases.
You can also try a regex to extract all words ending with am or pm.
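That regex idea could look like this with stringr's str_extract; the pattern "\S+(?:am|pm)" is my assumption for "a run of non-space characters ending in am or pm":

```r
library(stringr)

notes <- c("watch this movie from 4-7pm",
           "watch this musical from 5-9pm",
           "eat breakfast at this place from 7-9am")

# \S+ = non-space characters, (?:am|pm) = ending in am or pm
str_extract(notes, "\\S+(?:am|pm)")
# [1] "4-7pm" "5-9pm" "7-9am"
```

This skips the locate-then-substring step entirely, at the cost of committing to a pattern for the whole token.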
Look at stringr to locate the extracted characters and build the radius:
stringr::str_locate(df$notes, "am|pm")
     start end
[1,]    26  27
[2,]    28  29
[3,]    37  38
Using stringr you could do something like this. With the matrix of locations you could tinker with moving around the radius for whatever you are looking for:
library(stringr)

# Extracting locations
locations <- str_locate(df$notes, "\\d+\\-\\d+pm|\\d+\\-\\d+am")

# Using str_sub to pull the info you want
str_sub(df$notes, locations)
[1] "12-7pm" "5-9pm"  "7-9am"
Data (I swapped out 4 for 12):
df = data.table(
  event = c(1,2,3),
  notes = c("watch this movie from 12-7pm",
            "watch this musical from 5-9pm",
            "eat breakfast at this place from 7-9am")
)

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by a dot as delimiter.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")
The lengths of the names differ, but only the first two words of each name are important.
My goal is to get names of up to 7 characters: the 3 initial characters of each of the first two words, separated by a dot.
These examples come very close to my request, but I do not know how to apply those code variations to my case:
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
       "Bet.nan",
       "Sal.gla",
       "Sal.jen",
       "Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 word characters ((\\w{1,3})), then ignore anything that is not a dot ([^\\.]*), match a dot (\\.), and then again up to 3 word characters ((\\w{1,3})). Finally, anything that comes after that (.*). We then keep only the parts in parentheses, separated by a dot (\\1.\\2).
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
  paste(substr(i[1], 1, 3), substr(i[2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not an expert in regex.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")

# A function that takes three characters from the first two words and merges them
cleaner_fun <- function(ugly_string) {
  words <- strsplit(ugly_string, "\\.")[[1]]
  short_words <- substr(words, 1, 3)
  new_name <- paste(short_words[1:2], collapse = ".")
  return(new_name)
}

# Testing the function
sapply(x, cleaner_fun, USE.NAMES = FALSE)
[1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

Exclude specific country phone numbers from a column with different types of phone numbers

I've got a problem with excluding a specific country's phone numbers from a column. The problem is that they are not all in the same format, and some countries have a 3-digit country code, e.g. "001", while others have a 4-digit country code, e.g. "0098".
sample:
00989121234567
009809121234567
989121234567
9121234567
09121234567
First I need to convert all of those formats into one format and then exclude them from that column. Output phone numbers must be in this format:
"989121234567"
You can use startsWith and substr (gsub would do as well) for this. First, though, you need a vector of prefixes:
# variables
country_codes <- c('1', '98')
prefix <- union(country_codes, paste0('00', country_codes))
numbers <- c('00989121234567', '009809121234567', '989121234567',
             '9121234567', '09121234567')

# get rid of the prefix
new_numbers <- character(length(numbers))
for (k in seq_along(prefix)) {
  ind <- startsWith(numbers, prefix[k])
  new_numbers[ind] <- substr(numbers[ind], nchar(prefix[k]) + 1, nchar(numbers[ind]))
}
new_numbers[new_numbers == ""] <- numbers[new_numbers == ""]

# results
new_numbers
# [1] "9121234567" "09121234567" "9121234567" "9121234567" "09121234567"
You can then add new country codes, e.g. 44, 31, etc., or you could also add paste0('+', country_codes) to prefix to deal with numbers of the form +1xxxx.
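For completeness, the gsub/sub route mentioned above could be sketched as a single anchored substitution. Listing the longer '00…' prefixes first keeps the intent clear regardless of regex engine; this variant is mine, not part of the original answer:

```r
country_codes <- c('1', '98')
# longer prefixes first, so '0098' is preferred over '98'
prefix <- c(paste0('00', country_codes), country_codes)
numbers <- c('00989121234567', '009809121234567', '989121234567',
             '9121234567', '09121234567')

# one anchored alternation instead of a loop
re <- paste0("^(", paste(prefix, collapse = "|"), ")")
sub(re, "", numbers)
# [1] "9121234567" "09121234567" "9121234567" "9121234567" "09121234567"
```

This gives the same result as the loop above in one vectorized call.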
If you define the vector that contains the telephone numbers as numeric, the leading zeros are removed, and you are then free to drop the numbers that you don't want.
Using the numbers provided:
nr <- c(00989121234567, 009809121234567, 989121234567, 9121234567, 09121234567)
nr
[1] 9.891212e+11 9.809121e+12 9.891212e+11 9.121235e+09 9.121235e+09
subset(nr,!grepl("^98",nr))
[1] 9121234567 9121234567
EDIT: I see you added the requirement of returning a character vector. You can just use the as.character() function on the final vector for that.

Most efficient way to read key value pairs where values span multiple lines?

What is the fastest way to parse a text file such as the example below into a two-column data.frame, which can then be transformed into a wide format?
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
Chiesa, Luca Maria
Brizzolari, Andrea
Santaniello, Enzo
Passero, Elena
Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
Using readLines is problematic because the multi-line fields don't repeat their keys. Reading it as a fixed-width table doesn't work either. Suggestions? If it weren't for the multi-line issue, this could easily be accomplished with a function that operates on each row/record like so:
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
  key                          value
1  FN Thomson Reuters Web of Science
Notes: The fields will always be uppercase and two characters. The entire title and list of authors can be concatenated into a single cell.
This should work:
library(zoo)
x <- read.fwf(file = "tempSO.txt", widths = c(2, 500), as.is = TRUE)
x$V1[x$V1 == " "] <- NA
x$V1 <- na.locf(x$V1)
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")
Here's another idea, that might be useful if you want to stay in base R:
parseEntry <- function(entry) {
  ## Split at the beginning of each line that starts with a non-space character
  ll <- strsplit(entry, "\\n(?=\\S)", perl = TRUE)[[1]]
  ## Clean up empty characters at the beginning of continuation lines
  ll <- gsub("\\n(\\s){3}", "", ll)
  ## Split each field into its two components
  read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
Read the lines of the file into a character vector using readLines and append a colon to each key. The result is then in DCF format, so we can read it using read.dcf (the function used to read R package DESCRIPTION files). The result of read.dcf is wide: a matrix with one column per key. Finally we create long, a long data.frame with one row per key:
L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
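A self-contained sketch of that approach on a shortened version of the sample record (the field values are abbreviated here, and only a few keys are kept):

```r
# Shortened sample record; continuation lines keep their leading spaces
L <- c("FN Thomson Reuters Web of Science",
       "VR 1.0",
       "AU Panseri, Sara",
       "   Chiesa, Luca Maria",
       "TI Improved determination of malonaldehyde")

L <- sub("^(\\S\\S)", "\\1:", L)   # turn "FN ..." into "FN: ..."
wide <- read.dcf(textConnection(L))
colnames(wide)
# [1] "FN" "VR" "AU" "TI"

long <- data.frame(key = colnames(wide), value = wide[1, ],
                   stringsAsFactors = FALSE)
```

The continuation line for AU is folded into the AU field by read.dcf, which is exactly what the multi-line author list needs.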
