I have data from a PDF file that I am reading into R.
library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>%
readr::read_lines()
When I read it in with this method, a character vector is returned. Multi-line information for a given column is spread across different lines, and not all columns for every observation will have data.
A reproducible example is below:
ex_result <- c("03/11/2012 BES 3RD BES inc and corp no- no- sale -",
" group with sale no- sale",
" boxes",
"03/11/2012 KRS six and firefly 45 mg/dL 100 - 200",
" seven",
"03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87")
I am trying to use read_fwf with fwf_widths, since I read that it can handle multi-line input if you supply the widths for multi-line records.
ex_result_width <- read_fwf(ex_result, fwf_widths(
c(10, 24, 16, 7, 5, 15,100),
c("date", "name","description", "value", "unit","range","ab_flag")))
I determined the widths by calling nchar() in the console on the longest string I saw for each column.
Using fwf_widths I can capture the date column by setting its width to 10 characters in the widths argument. But if I set the name column to, say, 24 characters, the continuation lines come back concatenated into the wrong columns instead of being merged into their parent record; this cascades, so the remaining columns hold the wrong data and whatever runs past the total width is dropped.
Ultimately this is the desired output:
desired_output <- tibble(
date = c("03/11/2012","03/11/2012","03/11/2012"),
name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
description = c("BES inc and corp", "firefly", "ladybug"),
value = c("no-sale", "45", "55"),
unit = c("","mg/dL","mg/dL"),
range = c("no-sale no-sale", "100 - 200", "42 - 87"),
ab_flag = c("", "", ""))
I am trying to figure out:
How can I get fwf_widths to recognize multi-line text and missing columns?
Is there a better way to read in the PDF file that accounts for multi-line values and missing columns? (I was following this tutorial, but it seems to work with a more structured PDF file.)
str_subset(ex_result, pattern = "/\\d{2}/")
[1] "03/11/2012 BES 3RD BES inc and corp no- no- sale -"
[2] "03/11/2012 KRS six and firefly 45 mg/dL 100 - 200"
[3] "03/11/2012 KRS core ladybuyg 55 mg/dL 42 - 87"
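Building on that date check: since every record begins with a date, one workable sketch is to tag each line with the record it belongs to (a cumsum over the date matches) and then collapse each fixed-width column across a record's lines. The toy lines and the `breaks` positions below are hypothetical stand-ins, because the pasted example has lost the PDF's real alignment; you would read the true column positions off the raw pdf_text() output.

```r
# toy records padded to a hypothetical fixed-width layout
lines <- c("03/11/2012 BES 3RD    BES corp",
           "           group with",
           "           boxes",
           "03/11/2012 KRS core   ladybug")

starts <- grepl("^\\d{2}/\\d{2}/\\d{4}", lines)  # a record begins with a date
rec    <- cumsum(starts)                          # record id for every line

breaks <- c(1, 12, 23)                            # hypothetical column starts
ends   <- c(breaks[-1] - 1, max(nchar(lines)))

squish <- function(s) trimws(gsub("\\s+", " ", s))

# collapse one record's lines column by column, then squeeze whitespace
collapse_record <- function(rl) {
  vapply(seq_along(breaks),
         function(i) squish(paste(substr(rl, breaks[i], ends[i]),
                                  collapse = " ")),
         character(1))
}

out <- t(vapply(split(lines, rec), collapse_record, character(3)))
colnames(out) <- c("date", "name", "description")
```

For the real data you would extend breaks to all seven columns; read_fwf treats every line as its own record, so the multi-line merging has to happen before (or instead of) the fixed-width parse.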
I have a vector of organization names in a dataframe. Some of them are just fine, others have the name repeated twice in the same element. Also, when that name is repeated, there is no separating space so the name has a camelCase appearance.
For example (id column added for general dataframe referencing):
id org
1 Alpha Company
2 Bravo InstituteBravo Institute
3 Charlie Group
4 Delta IncorporatedDelta Incorporated
but it should look like:
id org
1 Alpha Company
2 Bravo Institute
3 Charlie Group
4 Delta Incorporated
I have a solution that gets the result I need; reproducible example code is below. However, it seems a bit lengthy and not very elegant.
Does anyone have a better approach for the same results?
Bonus question: If organizations have 'types' included, such as Alpha Company, LLC, then my gsub() line to fix the camelCase does not work as well. Any suggestions on how to adjust the camelCase fix to account for the ", LLC" and still work with the rest of the solution?
Thanks in advance!
(Thanks to the OP & those who helped on the previous SO post about splitting camelCase strings in R)
# packages
library(stringr)
# toy data
df <- data.frame(id=1:4, org=c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
# split up & clean camelCase words
df$org_fix <- gsub("([A-Z])", " \\1", df$org)
df$org_fix <- str_trim(str_squish(df$org_fix))
# temp vector with half the org names
df$org_half <- word(df$org_fix, start=1, end=(sapply(strsplit(df$org_fix, " "), length)/2)) # stringr::word
# double the temp vector
df$org_dbl <- paste(df$org_half, df$org_half)
# flag TRUE for orgs that contain duplicates in name
df$org_dup <- df$org_fix == df$org_dbl
# corrected the org names
df$org_fix <- ifelse(df$org_dup, df$org_half, df$org_fix)
# drop excess columns
df <- df[,c("id", "org_fix")]
# toy data for the bonus question
df2 <- data.frame(id=1:4, org=c("Alpha Company, LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated"))
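For the bonus question, one small adjustment (a sketch, not part of the original solution) is to insert the space only at a lowercase-to-uppercase boundary, which leaves all-caps suffixes like LLC untouched while still splitting the duplicated names:

```r
# only split where a lowercase letter runs into an uppercase one,
# so ", LLC" survives but "InstituteBravo" gains its space
org2  <- c("Alpha Company, LLC", "Bravo InstituteBravo Institute")
fixed <- gsub("([a-z])([A-Z])", "\\1 \\2", org2)
```

The rest of the word()/paste() halving logic then works on `fixed` as before.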
Another approach is to compare the first half of the string with the second half of the string. If equal, pick the first half. It also works if there are numbers, underscores or any other characters present in the company name.
org <- c("Alpha Company", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated", "WD40WD40", "3M3M")
half <- nchar(org) / 2
ifelse(substring(org, 1, half) == substring(org, half + 1, nchar(org)),
       substring(org, 1, half),
       org)
# [1] "Alpha Company" "Bravo Institute" "Charlie Group" "Delta Incorporated" "WD40" "3M"
You can use a regex, as in the line below:
my_df$org <- str_extract(string = my_df$org, pattern = "([A-Z][a-z]+ [A-Z][a-z]+){1}")
If all individual words start with a capital letter (not followed by another capital letter), then you can use that to split on. Keep only the unique elements, then paste with collapse. This will also work for the bonus LLC option.
org <- c("Alpha CompanyCompany , LLC", "Bravo InstituteBravo Institute", "Charlie Group", "Delta IncorporatedDelta Incorporated")
sapply(
lapply(
strsplit(gsub("[^A-Za-z0-9]", "", org),
"(?<=[^A-Z])(?=[A-Z])",
perl = TRUE),
unique),
paste0, collapse = " ")
[1] "Alpha Company LLC" "Bravo Institute" "Charlie Group" "Delta Incorporated"
This question already has answers here:
Read observations in fixed width files spanning multiple lines in R
I have a fixed width character vector input called "text" that looks something like this:
[1] " Report"
[2] "Group ID Name"
[3] "Number"
[4] "AA A134 abcd"
[5] "AB A123 def"
[6] "AC A345 ghikl"
[7] "BA B134 jklmmm"
[8] "AD A987 mn"
I need to create a standard data frame. My approach is to first write a text file and then use the read.fwf function to build a clean data frame from the fixed-width input. What I have works, but it forces me to create a text file in my working directory and then read it back in as fixed-width:
> cat(text, file = "mytextfile", sep = "\n", append = TRUE)
> read.fwf("mytextfile", skip = 3, widths = c(12, 14, 20))
Is it possible to achieve the same result without saving the intermediate output to my working directory? I tried using paste() and capture.output() without success.
x = paste(text, collapse = "\n")
seemed to work at first, but when I passed it to
read.fwf(x, skip = 3, widths = c(12, 14, 20))
I got
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file '
and capture.output() got me back to square one, a character vector. Any advice is greatly appreciated. Thank you.
You can use textConnection so that read.fwf reads the text directly, then supply the widths.
data <- read.fwf(textConnection(text),
widths = c(12, 14, 20), strip.white = TRUE, skip = 3)
data
# V1 V2 V3
#1 AA A134 abcd
#2 AB A123 def
#3 AC A345 ghikl
#4 BA B134 jklmmm
#5 AD A987 mn
Data:
text <- c(" Report", "Group ID Name", "Number",
"AA A134 abcd", "AB A123 def",
"AC A345 ghikl", "BA B134 jklmmm",
"AD A987 mn")
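The same textConnection trick also rescues the paste() attempt from the question: read.fwf fails there only because it treats the collapsed string as a file name, not because the string is wrong. A minimal sketch with toy widths (the real file's columns are wider than the pasted excerpt shows, hence c(12, 14, 20) above):

```r
# toy lines padded to known widths of c(4, 6, 4)
txt <- c("header line",
         "AA  A134  abcd",
         "AB  A123  def")
x  <- paste(txt, collapse = "\n")          # single string, as in the question
df <- read.fwf(textConnection(x), widths = c(4, 6, 4),
               strip.white = TRUE, skip = 1)
```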
I have a text file I am trying to parse into a data frame. Each 'event' may or may not have some notes with it, and the notes can span varying numbers of rows. I need to concatenate the notes for each event into one string to store in a column of the data frame.
ID: 20470
Version: 1
notes:
ID: 01040
Version: 2
notes:
The customer was late.
Project took 20 min. longer than anticipated
Work was successfully completed
ID: 00000
Version: 1
notes:
Customer was not at home.
ID: 00000
Version: 7
notes:
Fax at 2:30 pm
Called but no answer
Visit home no answer
Left note on door with call back number
Made a final attempt on 12/5/2013
closed case on 12/10 with nothing resolved
So, for example, for the second event the notes should be one long string: "The customer was late. Project took 20 min. longer than anticipated Work was successfully completed", which would then be stored in the notes column of the data frame.
For each event I know how many rows the notes span.
Something like this (actually, you would be happier and learn more figuring it out yourself, I was just procrastinating between two tasks):
x <- readLines("R/xample.txt") # you'll probably read it from a file
ids <- grep("^ID:", x) # detecting lines starting with ID:
versions <- grep("^Version:", x)
notes <- grep("^notes:", x)
nStart <- notes + 1 # lines where the notes start
nEnd <- c(ids[-1]-1, length(x)) # notes end one line before the next ID: line
ids <- sapply(strsplit(x[ids], ": "), "[[", 2)
versions <- sapply(strsplit(x[versions], ": "), "[[", 2)
notes <- mapply(function(i,j) paste(x[i:j], collapse=" "), nStart, nEnd)
df <- data.frame(ID=ids, ver=versions, note=notes, stringsAsFactors=FALSE)
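One caveat: blank padding lines inside a notes block survive the paste(), so the collapsed strings can carry stray runs of spaces. A whitespace squeeze afterwards (an optional cleanup, not part of the answer above) tidies them:

```r
# collapsed notes as they come out of the paste() step
notes <- c("  The customer was late.  Work was successfully completed ",
           "   ")                     # an empty notes block collapses to spaces
# squeeze internal runs of whitespace, then trim the ends
clean <- trimws(gsub("\\s+", " ", notes))
```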
dput of data
> dput(x)
c("ID: 20470", "Version: 1", "notes: ", " ", " ", "ID: 01040",
"Version: 2", "notes: ", " The customer was late.", "Project took 20 min. longer than anticipated",
"Work was successfully completed", "", "ID: 00000", "Version: 1",
"notes: ", " Customer was not at home.", "", "ID: 00000", "Version: 7",
"notes: ", " Fax at 2:30 pm", "Called but no answer", "Visit home no answer",
"Left note on door with call back number", "Made a final attempt on 12/5/2013",
"closed case on 12/10 with nothing resolved ")
I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:
Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
..And so on
I want to remove all sentences that start with "Collected" but keep the sentences that follow. The number of following sentences varies, e.g. from 0 to 4. I tried removing every combination of "Collected" plus the last word of the sentence, but there are too many combinations, and removing everything from "Collected" to a "." removes all the subsequent sentences too. Does anyone have any suggestions? Thank you in advance.
An option using gsub:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : Starts with `Collected`
- `[^.]*` : Followed by anything other than `.`
- `\\. ` : Ends with `.` and `space`.
Replace such matches with "".
Data:
df<-read.table(text=
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)
a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
[1] "Sunny."
Or if you know that there will be a space after the period:
sub("Collected.*?\\. ","",a)
I'm facing the following problem. I've got a table with a column called title.
The title column contains rows with values like To kill a mockingbird (1960).
So basically the format of the column is [title] ([year]). What I need are two columns: title and year, year without brackets.
One other problem is that some rows contain a title that itself includes brackets. But the last 6 characters of every row are always the year wrapped in brackets.
How do I create the two columns, title and year?
What I have is:
Books$title <- c("To kill a mockingbird (1960)", "Harry Potter and the order of the phoenix (2003)", "Of mice and men (something something) (1937)")
title
To kill a mockingbird (1960)
Harry Potter and the order of the phoenix (2003)
Of mice and men (something something) (1937)
What I need is:
Books$title <- c("To kill a mockingbird", "Harry Potter and the order of the phoenix", "Of mice and men (something something)")
Books$year <- c("1960", "2003", "1937")
title year
To kill a mockingbird 1960
Harry Potter and the order of the phoenix 2003
Of mice and men (something something) 1937
We can work with substring on the last 6 characters.
First we recreate your data.frame:
df <- read.table(h=T, sep="\n", stringsAsFactors = FALSE,
text="
Title
To kill a mockingbird (1960)
Harry Potter and the order of the phoenix (2003)
Of mice and men (something something) (1937)")
Then we create a new one. The first column, Title, is everything from df$Title except the last 7 characters (we also remove the trailing space). The second column, Year, is the last 6 characters of df$Title with any space, opening, or closing bracket removed. (gsub("[[:punct:]]", ...) would have worked as well.)
data.frame(Title=substr(df$Title, 1, nchar(df$Title)-7),
Year=gsub(" |\\(|\\)", "", substr(df$Title, nchar(df$Title)-6, nchar(df$Title))))
Title Year
1 To kill a mockingbird 1960
2 Harry Potter and the order of the phoenix 2003
3 Of mice and men (something something) 1937
Does that solve your problem?
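For comparison, a regex keyed on the trailing (year) avoids character counting altogether; this is an alternative sketch, not part of the answer above:

```r
titles <- c("To kill a mockingbird (1960)",
            "Of mice and men (something something) (1937)")
year  <- sub(".*\\((\\d{4})\\)\\s*$", "\\1", titles)   # capture the final (dddd)
title <- sub("\\s*\\(\\d{4}\\)\\s*$", "", titles)      # drop it from the title
```

Because the year pattern is anchored at the end of the string, inner brackets such as (something something) are left alone.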
Try using substrRight(df$Title, 6) (a helper from the question linked below) in a loop to extract the last 6 characters, i.e. the year with brackets, and save it as a new column.
Extracting the last n characters from a string in R
Similar to #Vincent Bonhomme:
I assume the data are in a text file, which I have called so.dat, from which I read the data into a data.frame that also contains two columns for the title and year to be extracted. Then I use substr() to separate the title from the fixed-length year at the end, leaving the () alone as the OP apparently wants them:
titles <- data.frame( orig = readLines( "so.dat" ),
text = "", yr = "", stringsAsFactors = FALSE )
titles$text <- substring( titles[ , 1 ],
1, nchar( titles[ , 1 ] ) - 7 )
titles$yr <- substring( titles[ , 1 ],
nchar( titles[ , 1 ] ) - 5, nchar( titles[ , 1 ] ) )
The original data can be removed or not, depending upon further need.