read.zoo with date and time as index in R - r

I have the following file
"Index" "time" "open" "high" "low" "close" "numEvents" "volume"
2013-01-09 14:30:00 "2013-01-09T14:30:00.000" "110.8500" "110.8500" "110.8000" "110.8000" " 57" "32059"
2013-01-09 14:31:00 "2013-01-09T14:31:00.000" "110.7950" "110.8140" "110.7950" "110.8140" " 2" " 1088"
2013-01-09 14:32:00 "2013-01-09T14:32:00.000" "110.8290" "110.8300" "110.8290" "110.8299" " 5" " 967"
2013-01-09 14:33:00 "2013-01-09T14:33:00.000" "110.8268" "110.8400" "110.8268" "110.8360" " 8" " 2834"
2013-01-09 14:34:00 "2013-01-09T14:34:00.000" "110.8400" "110.8400" "110.8200" "110.8200" " 33" " 6400"
I want to read this file into a zoo (or xts) object in R. This file was created as an xts object and saved using write.zoo(as.zoo(xts_object), path, sep = "\t") and now I am trying to read it in using zoo_object <- read.zoo(path, sep = "\t", header=TRUE, format="%Y-%m-%d %H:%M:%S"). However, I get the following warning
Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
And when I type zoo_object into the console to show its contents I get:
time open high low close numEvents volume
2013-01-09 2013-01-09T14:30:00.000 110.8500 110.850 110.8000 110.8000 57 32059
2013-01-09 2013-01-09T14:31:00.000 110.7950 110.814 110.7950 110.8140 2 1088
2013-01-09 2013-01-09T14:32:00.000 110.8290 110.830 110.8290 110.8299 5 967
2013-01-09 2013-01-09T14:33:00.000 110.8268 110.840 110.8268 110.8360 8 2834
2013-01-09 2013-01-09T14:34:00.000 110.8400 110.840 110.8200 110.8200 33 6400
where you can see that the time is not included in the row index. I assume I could convert the time field into the index to fix the problem, but I also assume I am doing something wrong when reading this file (or maybe when writing it). After searching all day I have no idea what. Can anyone offer any insight?
dput(zoo_object) after read
dput(zoo_object)
structure(c("2013-01-09T14:30:00.000", "2013-01-09T14:31:00.000",
"2013-01-09T14:32:00.000", "2013-01-09T14:33:00.000", "2013-01-09T14:34:00.000",
"110.8500", "110.7950", "110.8290", "110.8268", "110.8400", "110.850",
"110.814", "110.830", "110.840", "110.840", "110.8000", "110.7950",
"110.8290", "110.8268", "110.8200", "110.8000", "110.8140", "110.8299",
"110.8360", "110.8200", "57", " 2", " 5", " 8", "33", "32059",
" 1088", " 967", " 2834", " 6400"), .Dim = c(5L, 7L), .Dimnames = list(
NULL, c("time", "open", "high", "low", "close", "numEvents",
"volume")), index = structure(c(15714, 15714, 15714, 15714,
15714), class = "Date"), class = "zoo")

(Please note that the object that was desired for testing was the one passed to write.zoo, not the final object.)
When format is supplied without FUN (it appears), read.zoo converts the index with as.Date, whereas I would have guessed it would use as.POSIXct. You can force the desired behavior with:
zoo_object <- read.zoo("~/test", index.column=2, sep = "\t",
header=TRUE, format="%Y-%m-%dT%H:%M:%S", FUN=as.POSIXct)
Note that I changed your format slightly because, looking at the text output in an editor, it appeared that there was a single column with "T" as a separator between the date and time text.
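To see why the original call collapsed every row onto the same index value, it may help to compare the two converters directly. The sketch below uses only base R and the timestamps from the question; tz = "UTC" is an assumption made here so that the result does not depend on the local time zone.

```r
stamps <- c("2013-01-09T14:30:00.000", "2013-01-09T14:31:00.000",
            "2013-01-09T14:32:00.000", "2013-01-09T14:33:00.000",
            "2013-01-09T14:34:00.000")
fmt <- "%Y-%m-%dT%H:%M:%S"

# as.Date keeps only the calendar day, so all five rows share one index value
as_date <- as.Date(stamps, format = fmt)

# as.POSIXct keeps the time of day, so every row gets its own index value
# (tz = "UTC" is an assumption for this demo)
as_time <- as.POSIXct(stamps, format = fmt, tz = "UTC")

length(unique(as_date))
length(unique(as_time))
```

A single distinct Date value across all rows is consistent with the "index entries in 'order.by' are not unique" warning in the question.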

Related

Add a space every three characters from the end

I need to add a space after every third character in each string, counting from the end. Elements that contain a percent sign (%) should be left unchanged.
string <- c('186527500', '3875055', '23043', '10.8%', '9.8%')
The desired result is:
186 527 500, 3 875 055, 23 043, 10.8%, 9.8%
You could do:
ifelse(grepl('%', string), string, scales::comma(suppressWarnings(as.numeric(string)), big.mark = ' '))
#> [1] "186 527 500" "3 875 055" "23 043" "10.8%" "9.8%"
Using format and ifelse:
ifelse(!grepl("\\D", string), format(as.numeric(string), big.mark = " ", trim = TRUE), string)
#[1] "186 527 500" "3 875 055" "23 043" "10.8%" "9.8%"
Here is a base R solution with prettyNum. The trick is to set big.mark to one space character.
I use a variant of my answer to another post, but instead of returning an index to the elements that cannot be converted to numeric, the function below returns the index of the elements that can. This avoids trying to put spaces in the % numbers.
check_num <- function(x){
  # indices of the elements that convert cleanly to numeric
  y <- suppressWarnings(as.numeric(x))
  which(!is.na(y))
}
string <- c('186527500', '3875055', '23043', '10.8%', '9.8%')
i <- check_num(string)
prettyNum(string[i], big.mark = " ", preserve.width = "none")
#> [1] "186 527 500" "3 875 055" "23 043"
Created on 2022-05-16 by the reprex package (v2.0.1)
You can then assign the result back to the original string.
string[i] <- prettyNum(string[i], big.mark = " ", preserve.width = "none")
An idea is to reverse the strings, add the space and reverse back, i.e.
new_str <- string[!grepl('%', string)]
stringi::stri_reverse(sub("\\s+$", "", gsub('(.{3})', '\\1 ',stringi::stri_reverse(new_str))))
#[1] "186 527 500" "3 875 055" "23 043"
Another way is via formatC, i.e.
sapply(new_str, function(i)formatC(as.numeric(i), big.mark = " ", big.interval = 3, format = "d", flag = "0", width = nchar(i)))
# 186527500 3875055 23043
# "186 527 500" "3 875 055" "23 043"
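For comparison, here is a minimal base-R sketch that does the same job with a single regular expression. It is not taken from any of the answers above, and the helper name add_spaces is made up for illustration. Because the pattern only matches when digits run all the way to the end of the string, the percent values are left alone without a separate check.

```r
string <- c('186527500', '3875055', '23043', '10.8%', '9.8%')

# Insert a space after any digit that is followed by one or more complete
# groups of three digits running to the end of the string.
# (add_spaces is a hypothetical helper name.)
add_spaces <- function(x) gsub("(\\d)(?=(\\d{3})+$)", "\\1 ", x, perl = TRUE)

result <- add_spaces(string)
result
```

One caveat: a decimal with four or more digits after the point and nothing after them (for example "12.3456") would also get a space inserted, so this sketch assumes the numeric elements are whole numbers as in the question.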

How can I create a DataFrame with separate columns from a fixed width character vector input in R? [duplicate]

I have a fixed width character vector input called "text" that looks something like this:
[1] " Report"
[2] "Group ID Name"
[3] "Number"
[4] "AA A134 abcd"
[5] "AB A123 def"
[6] "AC A345 ghikl"
[7] "BA B134 jklmmm"
[8] "AD A987 mn"
I need to create a standard DataFrame. My approach is to first create a text file and then use the read.fwf function to create a clean DataFrame from a fixed width text file input. What I have works, but it forces me to create a text file in my working directory and then read it back in as a fwf:
> cat(text, file = "mytextfile", sep = "\n", append = TRUE)
> read.fwf("mytextfile", skip = 3, widths = c(12, 14, 20))
Is it possible to achieve the same result without saving the intermediate output to my working directory? I tried using paste() and capture.output() without success. While
x = paste(text, collapse = "\n")
seemed to work at first, when I passed it to
read.fwf(x, skip = 3, widths = c(12, 14, 20))
I got
Error in file(file, "rt") : cannot open the connection
In addition: Warning Message:
In file(file, "rt") : cannot open file '
and capture.output() got me back to square one, a character vector. Any advice is greatly appreciated. Thank you.
You can wrap the character vector in textConnection so that read.fwf reads it as if it were a file, and supply the widths as before.
data <- read.fwf(textConnection(text),
widths = c(12, 14, 20), strip.white = TRUE, skip = 3)
data
# V1 V2 V3
#1 AA A134 abcd
#2 AB A123 def
#3 AC A345 ghikl
#4 BA B134 jklmmm
#5 AD A987 mn
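Since the column spacing of the sample above is hard to see once it is pasted as text, here is a small self-contained sketch of the same idea. The records, the widths (4, 6 and 8) and the column names are assumptions made up for the demo; sprintf is used only to guarantee that every field really has the stated width.

```r
# Hypothetical fixed-width records; widths 4, 6 and 8 are assumptions
widths <- c(4, 6, 8)
fixed <- c(sprintf("%-4s%-6s%-8s", "AA", "A134", "abcd"),
           sprintf("%-4s%-6s%-8s", "AB", "A123", "def"))
lines <- c("Report header to skip", fixed)

# Read the character vector through a text connection instead of a file
con <- textConnection(lines)
parsed <- read.fwf(con, widths = widths, skip = 1, strip.white = TRUE,
                   col.names = c("group", "id", "name"),
                   stringsAsFactors = FALSE)
close(con)  # release the text connection once the data has been read

parsed
```

Closing the connection explicitly avoids the "closing unused connection" warning that can otherwise show up later.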
Data:
text <- c(" Report", "Group ID Name", "Number",
"AA A134 abcd", "AB A123 def",
"AC A345 ghikl", "BA B134 jklmmm",
"AD A987 mn")

Reading a Fixed-Width Multi-Line File in R

I have data from a PDF file that I am reading into R.
library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>%
readr::read_lines()
When I read it in with this method, a character vector is returned. Multi-line information for a given column is spread out over different lines (and not all columns for each observation will have data).
A reproducible example is below:
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
" group with sale no- sale",
" boxes",
"03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
" seven",
"03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87")
I am trying to use read_fwf with fwf_widths as I read that it can handle multi-line input if you give the widths for multi-line records.
ex_result_width <- read_fwf(ex_result, fwf_widths(
c(10, 24, 16, 7, 5, 15,100),
c("date", "name","description", "value", "unit","range","ab_flag")))
I determined the sizes by typing in the console nchar with the longest string that I saw for that column.
With fwf_widths I can extract the date column by giving it a width of 10 bytes. For the name column, however, a width of, say, 24 bytes returns adjacent columns concatenated together instead of splitting the rows to account for the multi-line values. The error then cascades: the remaining columns receive the wrong data, and whatever is left over is dropped once the widths run out.
Ultimately this is the desired output:
desired_output <-tibble(
date = c("03/11/2012","03/11/2012","03/11/2012"),
name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
description = c("BES inc and corp", "firefly", "ladybug"),
value = c("no-sale", "45", "55"),
unit = c("","mg/dL","mg/dL"),
range = c("no-sale no-sale", "100 - 200", "42 - 87"),
ab_flag = c("", "", ""))
I am trying to see:
How can I get fwf_widths to recognize multi-line text and missing columns?
Is there a better way to read in the pdf file to account for multi-line values and missing columns? (I was following this tutorial but it seems to have a more structured pdf file)
str_subset(ex_result, pattern = "/\\d{2}/")
[1] "03/11/2012 BES 3RD BES inc and corp no- no- sale -"
[2] "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200"
[3] "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87"
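Filtering on the date keeps the first line of each record but discards the continuation lines. One possible next step, sketched here in base R, is to number the records instead and keep every line with the record it belongs to. This assumes that a record always starts with a dd/mm/yyyy-style date at the beginning of the line; the spacing inside each line is as it appears in the post, so slicing the individual columns would still need the real character positions.

```r
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
               " group with sale no- sale",
               " boxes",
               "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
               " seven",
               "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87")

# A line that starts with a date opens a new record; other lines continue it
starts <- grepl("^\\d{2}/\\d{2}/\\d{4}", ex_result)
record_id <- cumsum(starts)

# One list element per record: its first line plus any continuation lines
records <- split(ex_result, record_id)
lengths(records)
```

Each element of records can then be processed on its own, for example by cutting every line at the same column positions and pasting the pieces of each column together.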

R: pasting (or combining) a variable amount of rows together as one

I have a text file that I am trying to parse into a data frame. Each of the 'events' may or may not have some notes with it, and the notes can span a varying number of rows. I need to concatenate the notes for each event into one string to store in a column of the data frame.
ID: 20470
Version: 1
notes:
ID: 01040
Version: 2
notes:
The customer was late.
Project took 20 min. longer than anticipated
Work was successfully completed
ID: 00000
Version: 1
notes:
Customer was not at home.
ID: 00000
Version: 7
notes:
Fax at 2:30 pm
Called but no answer
Visit home no answer
Left note on door with call back number
Made a final attempt on 12/5/2013
closed case on 12/10 with nothing resolved
So, for example, for the second event the notes should be one long string: "The customer was late. Project took 20 min. longer than anticipated Work was successfully completed", which would then be stored in the notes column of the data frame.
For each event I know how many rows the notes span.
Something like this (actually, you would be happier and learn more figuring it out yourself, I was just procrastinating between two tasks):
x <- readLines("R/xample.txt") # you'll probably read it from a file
ids <- grep("^ID:", x) # detecting lines starting with ID:
versions <- grep("^Version:", x)
notes <- grep("^notes:", x)
nStart <- notes + 1 # lines where the notes start
nEnd <- c(ids[-1]-1, length(x)) # notes end one line before the next ID: line
ids <- sapply(strsplit(x[ids], ": "), "[[", 2)
versions <- sapply(strsplit(x[versions], ": "), "[[", 2)
notes <- mapply(function(i,j) paste(x[i:j], collapse=" "), nStart, nEnd)
df <- data.frame(ID=ids, ver=versions, note=notes, stringsAsFactors=FALSE)
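The index arithmetic above relies on at least one line following every "notes:" line. As a hedged alternative, the sketch below groups lines by event with cumsum, so an event with no note lines at all simply gets an empty string. The shortened input is adapted from the question, and the sketch assumes that note text never starts with "ID:", "Version:" or "notes:" and never shares a line with the "notes:" label.

```r
x <- c("ID: 20470", "Version: 1", "notes: ",
       "ID: 01040", "Version: 2", "notes: ", " The customer was late.",
       "Project took 20 min. longer than anticipated",
       "Work was successfully completed", "",
       "ID: 00000", "Version: 1", "notes: ", " Customer was not at home.")

event <- cumsum(grepl("^ID:", x))               # event number for every line
is_field <- grepl("^(ID|Version|notes):", x)    # label lines, not note text

# Split the note lines by event, keeping events that have no note lines
note_lines <- split(x[!is_field],
                    factor(event[!is_field], levels = seq_len(max(event))))
notes <- vapply(note_lines, function(v) {
  v <- trimws(v)
  paste(v[nzchar(v)], collapse = " ")           # drop blanks, join the rest
}, character(1))

df <- data.frame(ID = sub("^ID:\\s*", "", x[grepl("^ID:", x)]),
                 ver = sub("^Version:\\s*", "", x[grepl("^Version:", x)]),
                 note = unname(notes), stringsAsFactors = FALSE)
df
```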
dput of data
> dput(x)
c("ID: 20470", "Version: 1", "notes: ", " ", " ", "ID: 01040",
"Version: 2", "notes: ", " The customer was late.", "Project took 20 min. longer than anticipated",
"Work was successfully completed", "", "ID: 00000", "Version: 1",
"notes: ", " Customer was not at home.", "", "ID: 00000", "Version: 7",
"notes: ", " Fax at 2:30 pm", "Called but no answer", "Visit home no answer",
"Left note on door with call back number", "Made a final attempt on 12/5/2013",
"closed case on 12/10 with nothing resolved ")

How to separate one column into multiple based on a character vector of delimiters

I have a dataframe which consists of one column. I would like to separate the text into separate columns based on a vector of delimiters.
Input:
Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<- "PathReportWhole"
The intended output:
structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"),
HospitalNumber = " 233456 ", PatientName = " Jonny Begood",
DOB = " 13/01/77 ", GeneralPractitioner = NA_character_,
Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ",
Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ",
Histology = " These show chronic reflux and other bits n bobs\n ",
Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName",
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails",
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")
I was keen to use the separate function from tidyr but can't quite figure out how to make it separate according to a list of delimiters.
The list would be:
mywords<-c("Hospital Number","Patient Name","DOB:","General Practitioner:","Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:")
I then tried:
Mypath %>% separate(Mypath, mywords)
But I am clearly misunderstanding the function, which I guess can't take a list of delimiters:
Error: `var` must evaluate to a single number or a column name, not a list
Is there a simple way of doing this using tidyr (or csplit with a list, or any other way for that matter)?
Maybe make sure that it's like a dcf file, and you can use read.dcf:
Notice that "mywords" is a little different from yours. I've added colons to "Hospital Number" and "Patient Name".
mywords<-c("Hospital Number:","Patient Name:","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:",
"Histology:","Diagnosis:")
Convert the relevant column to character, and add a colon after "Hospital Number".
Mypath$PathReportWhole <- as.character(Mypath$PathReportWhole)
Mypath$PathReportWhole <- gsub("Hospital Number", "Hospital Number:", Mypath$PathReportWhole)
Make it such that each key: value pair is on its own line.
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", Mypath$PathReportWhole)
Use read.dcf to read it in:
out <- read.dcf(textConnection(temp))
Here's some sample data that makes it easier to see the resulting structure:
example <- c("var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
"var 1 xyz var 2: more text here var 5: not all values are there")
example <- data.frame(report = example)
example
# report
# 1 var 1 abc var 2: some, text var 3: 112 var 4: value var 5: even more here
# 2 var 1 xyz var 2: more text here var 5: not all values are there
And, going through the same steps:
mywords <- c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:")
temp <- as.character(example$report)
temp <- gsub("var 1", "var 1:", temp)
temp <- gsub(sprintf("(%s)", paste(mywords, collapse = "|")), "\n\\1", temp)
read.dcf(textConnection(temp))
# var 1 var 2 var 3 var 4 var 5
# [1,] "abc" "some, text" "112" "value" "even more here"
# [2,] "xyz" "more text here" NA NA "not all values are there"
read.dcf(textConnection(temp), fields = c("var 1", "var 3", "var 5"))
# var 1 var 3 var 5
# [1,] "abc" "112" "even more here"
# [2,] "xyz" NA "not all values are there"
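If a data frame is wanted rather than a character matrix, the steps above can be wrapped in a small helper. This is only a sketch: the function name fields_to_df is made up for illustration, and it assumes that every label already ends in a colon in the text and contains no regular-expression metacharacters.

```r
# Hypothetical helper wrapping the gsub + read.dcf steps shown above
fields_to_df <- function(text, labels) {
  # Put every "label:" on its own line so each value becomes a DCF field
  pattern <- sprintf("(%s)", paste(labels, collapse = "|"))
  marked <- gsub(pattern, "\n\\1", text)
  con <- textConnection(marked)
  on.exit(close(con))                 # release the connection on exit
  as.data.frame(read.dcf(con), stringsAsFactors = FALSE)
}

example <- c("var 1: abc var 2: some, text var 3: 112 var 4: value var 5: even more here",
             "var 1: xyz var 2: more text here var 5: not all values are there")
df <- fields_to_df(example, c("var 1:", "var 2:", "var 3:", "var 4:", "var 5:"))
df[[1]]
```

Fields that are missing from a record come back as NA, just as in the matrix output above.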
