Extract before and after lines based on keyword in Pdf using R programming - r

I want to extract information related to the keyword "cancer" from a list of PDFs using R.
Specifically, I want to extract the lines (or the paragraph) before and after each line containing the word "cancer" in each text file.
abstracts <- lapply(mytxtfiles, function(i) {
j <- paste0(scan(i, what = character()), collapse = " ")
regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\R+){4}[cancer][^\\r\\n]*\\R+(^[^\\r\\n]*\\R+){4}", j, perl=TRUE))})
The regex above is not working.

Here's one approach:
library(textreadr)
library(tidyverse)
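loc <- function(var, regex, n = 1, ignore.case = TRUE){
  # indices of the lines that match the keyword
  locs <- grep(regex, var, ignore.case = ignore.case)
  # add the n lines before and after each match (n was previously ignored)
  out <- sort(unique(as.vector(outer(locs, -n:n, "+"))))
  # drop indices that fall outside the document
  out <- out[out > 0]
  out[out <= length(var)]
}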
doc <- 'https://www.in.kpmg.com/pdf/Indian%20Pharma%20Outlook.pdf' %>%
read_pdf() %>%
slice(loc(text, 'cancer'))
doc
## page_id element_id text
## 1 24 28 Ranjit Shahani applauds the National Pharmaceuticals Policy's proposal of public/private
## 2 24 29 partnerships (PPPs) to tackle life-threatening diseases such as cancer and HIV/AIDS, but
## 3 24 30 stresses that, in order for them to work, they should be voluntary, and the government
## 4 25 8 the availability of medicines to treat life-threatening diseases. It notes, for example, that
## 5 25 9 while an average estimate of the value of drugs to treat the country's cancer patients is
## 6 25 10 $1.11 billion, the market is in fact worth only $33.5 million. “The big gap indicates the
## 7 25 12 because of the high cost of these medicines,” says the Policy, which also calls for tax and
## 8 25 13 excise exemptions for anti-cancer drugs.
## 9 25 14 Another area for which PPPs are proposed is for drugs to treat HIV/AIDS, India's biggest health
## 10 32 19 Variegate Trading, a UB subsidiary. The firm's major products are in the anti-infective,
## 11 32 20 anti-inflammatory, cancer, diabetes and allergy market segments and, for the year ended
## 12 32 21 December 31, 2005, it reported net sales (excluding excise duty) up 9.9 percent to $181.1
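A base-R sketch of the same idea for plain text files, as in the question. Two things work against the original attempt: scan() splits on whitespace and paste0(..., collapse = " ") joins everything into one line, so no line breaks remain for the regex to anchor on, and [cancer] is a character class that matches a single letter rather than the word. Reading with readLines() keeps the lines intact. The helper name context_lines and the demo vector are made up for illustration.

```r
# Return every line matching `pattern`, plus the n lines before and after it.
# context_lines is a hypothetical helper, not part of any package.
context_lines <- function(lines, pattern, n = 4, ignore.case = TRUE) {
  hits <- grep(pattern, lines, ignore.case = ignore.case)
  # every index within n lines of a hit
  idx <- sort(unique(as.vector(outer(hits, -n:n, "+"))))
  # keep only indices that exist in the document
  idx <- idx[idx >= 1 & idx <= length(lines)]
  lines[idx]
}

# Made-up demo text standing in for readLines("some_file.txt")
demo <- c("alpha", "beta", "a cancer study", "gamma", "delta", "epsilon")
context_lines(demo, "cancer", n = 1)
```

For real files, something like lapply(mytxtfiles, function(f) context_lines(readLines(f), "cancer")) should work, assuming mytxtfiles holds the file paths from the question.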

Related

Read table from PDF with partially filled column using Pdftools

I've written a function in R using pdftools to read a table from a pdf. The function gets the job done, but unfortunately the table contains a column for notes, which is only partially filled. As a result the data in the resulting table is shifted by one column in the row containing a note.
Here's the table.
And here's the code:
# load library
library(pdftools)
# link to report
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
# read data through pdftool
data <- pdf_text(url)
# create a function to read the pdfs
scrape_pdf <- function(list_of_tables,
table_number,
number_columns,
column_names,
first_row,
last_row) {
data <- list_of_tables[table_number]
data <- trimws(data)
data <- strsplit(data, "\n")
data <- data[[1]]
data <- data[min(grep(first_row, data)):
max(grep(last_row, data))]
data <- stringr::str_split_fixed(data, " {2,}", number_columns)
data <- data.frame(data)
names(data) <- column_names
return(data)
}
names <- c("","6m 30-9-2022","6m 30-9-2021","12m 30-3-2022")
output <- scrape_pdf(data,3,5,names,"Care fees","Basic and diluted")
And the output.
6m 30-9-2022 6m 30-9-2021 12m 30-3-2022 NA
1 Care fees 210,187 194,603 398,206
2 Management fees 59,746 50,959 105,552
3 Interest received 364 42 41
4 Other income 3,942 2,260 4,998
5 Total revenue 274,239 247,864 508,797
6
7 Fair-value movement of
8 investment properties 3 261,346 285,143 745,885
9 Total income 535,585 533,007 1,254,682
10
11 Operating expenses (265,148) (225,380) (466,238)
12 Depreciation and
13 amortisation expenses (22,996) (17,854) (35,698)
14 Finance costs (19,355) (15,250) (30,664)
15 Impairment loss 2 (10,784) - -
16 Total expenses (318,283) (258,484) (532,600)
17
18 Profit before income tax 217,302 274,523 722,082
19 Income tax (expense) / credit (23,316) 6,944 (29,209)
20 Profit for the period 193,986 281,467 692,873
21
22 Earnings per share
23 Basic and diluted (cents per share) 38.8 56.3 138.6
How can I best circumvent this issue?
Many thanks in advance!
Although readr::read_fwf() is meant for fixed-width files, it also performs pretty well on text from pdftools once header and footer rows are removed. It works even when it has to guess the column widths, though those can be specified too.
library(pdftools)
library(dplyr, warn.conflicts = FALSE)
url <- "https://www.rymanhealthcare.co.nz/hubfs/Investor%20Centre/Financial/Half%20year%20results%202022/Ryman%20Healthcare%20Limited%20-%20Announcement%20Numbers%20and%20financial%20statements%20-%2030%20September%202022.pdf"
data <- pdf_text(url)
scrape_pdf <- function(pdf_text_item, first_row_str, last_row_str){
lines <- unlist(strsplit(pdf_text_item, "\n"))
# remove 0-length lines
lines <- lines[nchar(lines) > 0]
lines <- lines[min(grep(first_row_str, lines)):
max(grep(last_row_str , lines))]
# paste lines back into single string for read_fwf()
paste(lines, collapse = "\n") %>%
readr::read_fwf() %>%
# re-connect strings in first column if values were split between rows
mutate(X1 = if_else(!is.na(lag(X1)) & is.na(lag(X3)), paste(lag(X1), X1), X1)) %>%
filter(!is.na(X3))
}
output <- scrape_pdf(data[3], "Care fees","Basic and diluted" )
Result:
output %>%
mutate(X1 = stringr::str_trunc(X1, 35))
#> # A tibble: 16 × 5
#> X1 X2 X3 X4 X5
#> <chr> <dbl> <chr> <chr> <chr>
#> 1 Care fees NA 210,187 194,603 398,206
#> 2 Management fees NA 59,746 50,959 105,552
#> 3 Interest received NA 364 42 41
#> 4 Other income NA 3,942 2,260 4,998
#> 5 Total revenue NA 274,239 247,864 508,797
#> 6 Fair-value movement of investmen... 3 261,346 285,143 745,885
#> 7 Total income NA 535,585 533,007 1,254,682
#> 8 Operating expenses NA (265,148) (225,380) (466,238)
#> 9 Depreciation and amortisation ex... NA (22,996) (17,854) (35,698)
#> 10 Finance costs NA (19,355) (15,250) (30,664)
#> 11 Impairment loss 2 (10,784) - -
#> 12 Total expenses NA (318,283) (258,484) (532,600)
#> 13 Profit before income tax NA 217,302 274,523 722,082
#> 14 Income tax (expense) / credit NA (23,316) 6,944 (29,209)
#> 15 Profit for the period NA 193,986 281,467 692,873
#> 16 Earnings per share Basic and dil... NA 38.8 56.3 138.6
Created on 2022-11-19 with reprex v2.0.2
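The amount columns in the result are still character strings such as "(265,148)" or "-". A minimal sketch for turning them into numbers, assuming that parentheses mean negative values and that a lone dash means no value (my assumption, the report itself does not say so); parse_amount is a hypothetical helper name:

```r
# Convert accounting-style strings to numeric values.
# Assumptions: "(1,234)" means -1234 and "-" means missing (NA).
parse_amount <- function(x) {
  x <- trimws(x)
  x[x == "-"] <- NA                      # lone dash: treated as missing
  negative <- grepl("^\\(.*\\)$", x)     # wrapped in parentheses
  value <- as.numeric(gsub("[(),]", "", x))
  ifelse(negative, -value, value)
}

parse_amount(c("210,187", "(10,784)", "-", "38.8"))
```

Applied to the tibble above, this could look like output %>% mutate(across(X3:X5, parse_amount)).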

Cumsum function step wise in R

I am facing a problem: I calculated the monthly interest for a mortgage, but I need to sum the results to get yearly values (always 12 months at a time).
H <- 2000000 # mortgage
i.m <- 0.03/12 # rate per month
year <- 15 # years
a <- (H*i.m*(1+i.m)^(12*year))/
((1+i.m)^(12*year)-1)
a # monthly payment
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))
interest
cumsum(a*(1-(1/(1+i.m)^(0:(year*12))))) # I want one value per year: the first 12 values summed, then the next 12 values added to that, and so on
You may do this with tapply in base R.
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
yearly <- tapply(monthly, ceiling(seq_along(monthly)/12), sum)
I think you can use the following solution:
monthly <- cumsum(a*(1-(1/(1+i.m)^(0:(year*12)))))
sapply(split(monthly, ceiling(seq_along(monthly) / 12)), function(x) x[length(x)])
1 2 3 4 5 6 7 8
2254.446 9334.668 21098.218 37406.855 58126.414 83126.695 112281.337 145467.712
9 10 11 12 13 14 15 16
182566.812 223463.138 268044.605 316202.434 367831.057 422828.023 481093.905 486093.905
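The two answers differ in what they return, so it may help to separate the two quantities. A sketch, reusing the question's formulas unchanged: per_year holds the sum of the 12 monthly values within each year, and cumulative holds the running total at the end of each year (what the second answer computes). Note that 0:(year*12) produces 181 values, so the last group contains a single leftover month. The names grp, per_year and cumulative are mine.

```r
# Inputs copied from the question
H <- 2000000      # mortgage
i.m <- 0.03/12    # rate per month
year <- 15        # years
a <- (H*i.m*(1+i.m)^(12*year)) / ((1+i.m)^(12*year)-1)
interest <- a*(1-(1/(1+i.m)^(0:(year*12))))

# Group index: 1 for the first 12 values, 2 for the next 12, and so on
grp <- ceiling(seq_along(interest) / 12)

# Sum within each year (not cumulative)
per_year <- tapply(interest, grp, sum)

# Running total at the end of each year
cumulative <- cumsum(per_year)
cumulative
```

Summing the raw monthly values per group and then applying cumsum avoids adding up values that are already cumulative.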

Is there any method : text to table with R?

I have a string that follows a pattern.
For example:
>oc
[1]"for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
The raw material is a table, but it contains some clutter, so I decided to extract the text from the table, clean and organize it with code, and then reshape it back into table form.
The raw material table looks like this
for financial company payment | manufacturer company payment | distributor people payment | other..
1 m..| 10 20 30 ...
2 m..| 8 14 15 ...
1 y..| 5 9 11 ...
I would appreciate any method, so please leave a comment. It would be a great help to me.
What I have tried so far: first, I used the extract_text function (from the tabulizer library);
second, I used regular expressions to tidy the strings;
finally, I used the scan function.
Again, any method is okay. Thank you!
Here's a solution--anything but elegant but working:
Your data:
oc <- "for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"
First, split the string at payment:
oc <- strsplit(oc, " payment ")
Prepare a matrix to fill in the data:
mt <- matrix(NA, ncol = 5, nrow = 3)
Grab the relevant elements from oc as column names:
colnames(mt) <- oc[[1]][1:5]
Define the rownames:
rownames(mt) <- c("1 month", "2 month", "1 year")
Grab the numbers from oc:
numbers <- oc[[1]][7:9]
Clean numbers:
numbers <- gsub("( 2 month| 1 year)", "", numbers)
Now break numbers into individual numbers using str_extract_all from the package stringr:
library(stringr)
numbers <- str_extract_all(numbers, "\\d+")
Iterate over the rows in mt to fill in the numbers from numbers:
for(i in 1:nrow(mt)){
mt[i,] <- numbers[[i]]
}
Finally redefine mt as a dataframe:
mt <- as.data.frame(mt)
Et voilà, the result:
mt
for financial company manufacturer company distributor people other total
1 month 10 20 30 40 100
2 month 8 14 15 30 67
1 year 5 9 11 15 40
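The steps above leave every column as character. A more compact base-R sketch of the same split-and-fill idea that keeps the values numeric, assuming (as the answer does) five columns and three hard-coded row labels; the names parts, digits and tab are mine:

```r
oc <- "for financial company payment manufacturer company payment distributor people payment other payment total payment 1 month payment 10 20 30 40 100 2 month payment 8 14 15 30 67 1 year payment 5 9 11 15 40"

# Split at " payment "; elements 1-5 are column names, 7-9 hold the numbers
parts <- strsplit(oc, " payment ", fixed = TRUE)[[1]]

# Pull all digit runs from each row and keep the first five;
# any sixth number is the leading digit of the next row label
digits <- regmatches(parts[7:9], gregexpr("[0-9]+", parts[7:9]))
mt <- do.call(rbind, lapply(digits, function(d) as.numeric(d[1:5])))

dimnames(mt) <- list(c("1 month", "2 month", "1 year"), parts[1:5])
tab <- as.data.frame(mt, check.names = FALSE)
tab
```

Because the columns are numeric, something like colSums(tab) works directly on the result.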

How to calculate customer acquisition rate by finding out overlapping with previous years?

I have a data set, CustOrder, about customer purchases from 2008-2013 with the following information (this is just part of the data):
CustID OrderYear Amount
101102 2008 22429.00
101102 2009 11045.00
101435 2010 10740.77
101435 2011 73669.50
107236 2012 162123.50
101416 2010 8102.00
101416 2011 360.00
101416 2012 36576.00
101416 2013 1960.00
101467 2012 997.00
101604 2010 2971.53
101664 2009 91.94
101664 2011 130.93
.........
Some customers purchase continuously every year (e.g. 101416), others only in certain years (e.g. 101664). I want to figure out the customer acquisition rate, that is, how many new customers were gained each year, both as a count and as a rate (for customers who did not purchase continuously, only the first purchase counts). For instance:
Year Customer TotalCustomerNumber NewCustomerRate
2008 5 5 0%
2009 3 8 37%
2010 4 12 33%
2011 2 14 14%
2012 3 17 17%
2013 2 19 10%
Does anyone have any ideas or hints on how to do it?
I appreciate any help!
I took some time to work out a solution, and this method should work. Take a look at the comments for details:
# Setting a seed for reproducibility.
set.seed(10)
# Setting what years we want allowed.
validYears <- 2008:2015
# Generating a "fake" dataset for testing purposes.
custDF <- data.frame(CustID = abs(as.integer(rnorm(250, 50, 50))), OrderYear = 0, Amount = abs(rnorm(250, 100, 1000)))
custDF$OrderYear <- sapply(custDF$OrderYear, function(x) x <- sample(validYears, 1)) # Adding random years for each purchase.
# Initializing a new data frame to store the output values.
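newDF <- data.frame(Year = validYears, NewCustomers = 0, RunningNewCustomerTotal = 0, NewCustomerRate = "", stringsAsFactors = FALSE) # keep the rate column as character so it can be filled in (needed in R < 4.0)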
custTotal <- 0 # Initializing a variable to be used in the loop.
firstIt <- 1 # Denotes the first iteration.
for (year in validYears) { # For each year in the data set (which I arbitrarily defined before making the dataset)
# Getting the unique IDs of the current year and the unique IDs of all past years.
currentIDs <- unique(custDF[custDF$OrderYear == year, "CustID"])
pastIDs <- unique(custDF[custDF$OrderYear < year, "CustID"])
if (firstIt == 1) { pastIDs <- c(-1) } # Setting a condition for the first iteration.
newIDs <- currentIDs[!(currentIDs %in% pastIDs)] # Getting all IDs that have not been previously used.
numNewIDs <- length(newIDs) # Getting the number of new IDs.
custTotal <- custTotal + numNewIDs # Getting the running total.
# Adding the new data into the data frame.
newDF[newDF$Year == year, "NewCustomers"] <- numNewIDs
newDF[newDF$Year == year, "RunningNewCustomerTotal"] <- custTotal
# Getting the rate.
if (firstIt == 1) {
NewCustRate <- 0
firstIt <- 2
} else { NewCustRate <- (1 - (newDF[newDF$Year == (year - 1), "RunningNewCustomerTotal"] / custTotal)) * 100 }
# Inputting the new data. Format and round are just getting the decimals down.
newDF[newDF$Year == year, "NewCustomerRate"] <- paste0(format(round(NewCustRate, 2)), "%")
}
With output:
> newDF
Year NewCustomers RunningNewCustomerTotal NewCustomerRate
1 2008 32 32 0%
2 2009 22 54 41%
3 2010 19 73 26%
4 2011 14 87 16%
5 2012 7 94 7.4%
6 2013 3 97 3.1%
7 2014 9 106 8.5%
8 2015 5 111 4.5%
Hope this helps!
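The same result can be reached without a loop by finding each customer's first purchase year and counting those. A sketch in base R using the rows shown in the question (Amount is left out because it is not needed); like the loop above, it reports 0 for the first year. The names first_year, new_customers and result are mine.

```r
# The sample rows from the question
CustOrder <- data.frame(
  CustID = c(101102, 101102, 101435, 101435, 107236, 101416, 101416,
             101416, 101416, 101467, 101604, 101664, 101664),
  OrderYear = c(2008, 2009, 2010, 2011, 2012, 2010, 2011,
                2012, 2013, 2012, 2010, 2009, 2011)
)

# First purchase year per customer
first_year <- tapply(CustOrder$OrderYear, CustOrder$CustID, min)

# Count new customers per year, including years with none
years <- min(CustOrder$OrderYear):max(CustOrder$OrderYear)
new_customers <- as.integer(table(factor(first_year, levels = years)))

# Running total and share of new customers in that total
total <- cumsum(new_customers)
rate <- ifelse(seq_along(years) == 1, 0, 100 * new_customers / total)

result <- data.frame(Year = years, NewCustomers = new_customers,
                     TotalCustomers = total, NewCustomerRate = round(rate, 1))
result
```

On the full data set only the CustOrder definition would change; the remaining lines stay the same.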

How to calculate time-weighted average and create lags

I have searched the forum but found nothing that answers, or hints at, how to do what I want.
I have yearly measurements of exposure data, from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average is 18) and from February to December 2001 (annual average is 19). The time-weighted average for this person would be (1/12*18) + (11/12*19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will have a contribution of 11 months in 2004 and one month in 2003, so his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1-, 2- and 5-year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to share what I tried. I first arranged the data frame. I wanted to identify which year is the key year for each subject, so I created id. variable comes from the column names (e.g., pol_2000) in your original data set; entryYear and entryMonth both come from entry in your data. check was created to identify the base year for each participant. In the next step, I extracted six rows for each participant using getMyRows in the SOfun package. Then I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base-year row for each subject and added three columns to it.
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(reshape2) # assumed source of melt(), which the code below calls
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
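The weighting described in the question can also be written as one small function. A sketch, assuming the annual averages sit in a numeric vector named by year and that the window covers the 12*k months ending with the month before entry; twa_exposure and the demo vector are hypothetical, and the value for 2000 is made up (only 18 and 19 come from the question).

```r
# Time-weighted average exposure over the k years before entry.
# entry:  a Date; annual: numeric vector named by year; k: number of years.
twa_exposure <- function(entry, annual, k = 1) {
  entry_year <- as.integer(format(entry, "%Y"))
  m <- as.integer(format(entry, "%m")) - 1L   # months elapsed in the entry year
  years <- entry_year - 0:k                   # entry year back to k years earlier
  # m months from the entry year, 12 from each full year in between,
  # and the remaining 12 - m months from the earliest year
  weights <- c(m, rep(12, k - 1), 12 - m)
  used <- weights > 0                         # skip years that contribute nothing
  sum(weights[used] * annual[as.character(years[used])]) / (12 * k)
}

# 2001 and 2002 values are from the question; the 2000 value is invented
annual <- c("2000" = 20, "2001" = 19, "2002" = 18)
twa_exposure(as.Date("2002-02-07"), annual, k = 1)  # (1/12*18) + (11/12*19)
twa_exposure(as.Date("2002-02-07"), annual, k = 2)
```

For k = 1 this reproduces the formula given in the question; for larger k it adds full weight for every complete calendar year inside the window.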

Resources