Read PDF table into R where rows have varying numbers of lines

I'm hoping to read the following PDF into a tidy data frame in R: PDF Table. The table stretches across 70+ pages.
I'm comfortable reading in tables where each cell occupies a single line, but I'm not sure how to extend that to tables where rows span a varying number of lines.
Any help would be much appreciated!

I would suggest using tabulizer; it is well suited to extracting tables from PDF files. Here is the code for the file you shared:
library(tabulizer)
# Extract one table (matrix) per page of the PDF
lst <- extract_tables(file = '8-31-2020 Paragraph IV Update_0.pdf')
# Promote the first row of each matrix to column names
renames <- function(x) {
  colnames(x) <- x[1, ]
  as.data.frame(x[-1, , drop = FALSE])
}
# Apply to every page, then bind all pages together
lst2 <- lapply(lst, renames)
df <- do.call(rbind, lst2)
Output (some rows):
head(df)
DRUG NAME DOSAGE FORM STRENGTH
1 Abacavir Sulfate Tablets 300 mg
2 Abacavir Oral Solution 20 mg/mL
3 Abacavir Sulfate, Dolutegravir\rand Lamivudine Tablets 600 mg/50 mg/300\rmg
4 Abacavir Sulfate and\rLamivudine Tablets 600 mg/300 mg
5 Abacavir Sulfate, Lamivudine\rand Zidovudine Tablets 300 mg/150 mg/300\rmg
6 Abiraterone Acetate Tablets 125 mg
RLD/NDA DATE OF\rSUBMISSION NUMBER OF\rANDAs\rSUBMITTED 180-DAY\rSTATUS
1 Ziagen\r20977 1/28/2009 1 Eligible
2 Ziagen\r20978 12/27/2012 1 Eligible
3 Triumeq\r205551 8/14/2017 5
4 Epzicom\r21652 9/27/2007 1 Eligible
5 Trizivir\r21205 3/22/2011 1 Eligible
6 Yonsa\r210308 7/23/2018 1
180-DAY\rDECISION\rPOSTING\rDATE DATE OF\rFIRST\rAPPLICANT\rAPPROVAL
1 2/11/2020 6/18/2012
2 2/11/2020 9/26/2016
3
4 2/11/2020 9/29/2016
5 2/11/2020 12/5/2013
6
DATE OF FIRST\rCOMMERCIAL\rMARKETING BY\rFTF EXPIRATION\rDATE OF LAST\rQUALIFYING\rPATENT
1 6/19/2012 5/14/2018
2 9/15/2017 5/14/2018
3 12/8/2029
4 9/29/2016 5/14/2018
5 12/17/2013 5/14/2018
6 3/17/2034
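Note the embedded \r characters: tabulizer keeps the line breaks inside multi-line cells as carriage returns, which is how rows spanning a varying number of lines survive the extraction. A minimal cleanup sketch on the df built above, replacing each \r with a space:
# Collapse the in-cell line breaks that tabulizer preserves as "\r"
df[] <- lapply(df, function(col) gsub("\r", " ", col, fixed = TRUE))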


R: combine multiple data frames based on column names into a single table

I have a bunch of large CSVs that all contain exactly the same columns, and I need to combine them into a single CSV, appending the data from each file underneath the previous one, like this:
Table 1

Prop_ID  State  Pasture  Soy  Corn
1        WI     20       45   75
2        MI     10       80   122

Table 2

Prop_ID  State  Pasture  Soy  Corn
3        MN     152      0    15
4        IL     0        10   99

Output table

Prop_ID  State  Pasture  Soy  Corn
1        WI     20       45   75
2        MI     10       80   122
3        MN     152      0    15
4        IL     0        10   99
I have more than two tables to do this with; any help would be appreciated. Thanks!
A possible solution, in base R:
rbind(df1, df2)
#> Prop_ID State Pasture Soy Corn
#> 1 1 WI 20 45 75
#> 2 2 MI 10 80 122
#> 3 3 MN 152 0 15
#> 4 4 IL 0 10 99
Or using dplyr:
dplyr::bind_rows(df1, df2)
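Since you have more than two tables, both approaches generalize to a list of data frames. A sketch, where df3 is a hypothetical third table:
dfs <- list(df1, df2, df3)
do.call(rbind, dfs)      # base R
dplyr::bind_rows(dfs)    # bind_rows also accepts a list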
Assuming all the csv files are in a single directory, and that these are the only files in that directory, this solution, using data.table, should work.
library(data.table)
setwd('<directory with your csv files>')
files <- list.files(pattern = '.+\\.csv$')
result <- rbindlist(lapply(files, fread))
list.files(...) returns a vector of the file names in a given directory, filtered by a pattern; here we ask only for files ending in .csv.
fread(...) is data.table's very fast file reader. We apply it to each file name (lapply(files, fread)) to get a list holding the contents of each CSV, then rbindlist(...) combines them row-wise.
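Since the end goal is a single CSV, you could finish by writing the combined table back to disk with data.table's fwrite (the output file name here is just an example):
fwrite(result, 'combined.csv')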

How to add columns from another data frame where there are multiple matching rows

I'm new to R and I'm stuck.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person may appear multiple times in both data frames, and I want to keep all the data.
Example:
Graduations
One person may have finished several curriculums. The original DF has more columns, but they are not relevant for the example.
Person_ID  Curriculum_ID  School_ID
1          100            10
2          100            10
2          200            10
3          300            12
4          100            10
4          200            12
Occupations
Not all graduates have jobs; everyone in the DF has exactly one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns, but they are not relevant here.
Person_ID  JOB_ID  JOB_Type
1          1223    1
3          3334    1
3          2122    0
3          7843    0
4          4522    0
4          1240    1
End result:
A new DF named "Result" containing all graduations from the first DF (Graduations) plus the matching columns from the second DF (Occupations).
Note that person 2 is not in the Occupations DF; their rows remain, but the added columns stay empty.
Note that person 3 has multiple jobs, so duplicate graduation rows are added.
Note that person 4 has both multiple jobs and multiple graduations, so extra rows are added to fit in all the data.
New DF: "Result"
Person_ID  Curriculum_ID  School_ID  JOB_ID  JOB_Type
1          100            10         1223    1
2          100            10
2          200            10
3          300            12         3334    1
3          300            12         2122    0
3          300            12         7843    0
4          100            10         4522    0
4          100            10         1240    1
4          200            12         4522    0
4          200            12         1240    1
For me the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but could not find one; probably I did not use the right keywords.
I would be very grateful for examples of how to code this.
You can use merge, which joins on the shared Person_ID column; all.x = TRUE keeps graduations without a matching occupation (filled with NA):
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join:
library(dplyr)
left_join(Graduations, Occupations)
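By default left_join matches on every shared column name, here just Person_ID; spelling that out, and storing the result under the name the question asks for, is a small sketch:
Result <- left_join(Graduations, Occupations, by = "Person_ID")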

vcdExtra::datasets not working on some packages

R 3.6.1, vcdExtra 0.7.1
vcdExtra::datasets("caret")
Error in get(x) : object 'GermanCredit' not found
vcdExtra::datasets fails on some packages, like "caret".
Am I missing something?
Thanks
If you only need the GermanCredit dataset, try this:
library(caret)
data("GermanCredit")
GermanCredit
And you will get:
Duration Amount InstallmentRatePercentage ResidenceDuration Age NumberExistingCredits NumberPeopleMaintenance Telephone
1 6 1169 4 4 67 2 1 0
2 48 5951 2 2 22 1 1 1
3 12 2096 2 3 49 1 2 1
4 42 7882 2 4 45 1 2 1
5 24 4870 3 4 53 2 2 1
Please comment if this is what you need.
This is the sequence of commands I need to run for vcdExtra::datasets("caret") to work correctly (presumably because vcdExtra::datasets() looks objects up with get(), so caret's lazy-loaded datasets must first be brought into the workspace with data()):
library(evtree)
library(caret)
data(Sacramento)
data(tecator)
data(BloodBrain)
data(cox2)
data(dhfr)
data(oil)
data(mdrr)
data(pottery)
data(scat)
data(segmentationData)
vcdExtra::datasets("caret")
The output is
Item class dim Title
1 GermanCredit data.frame 1000x21 German Credit Data
2 Sacramento data.frame 932x9 Sacramento CA Home Prices
3 absorp matrix 215x100 Fat, Water and Protein Content of Meat Samples
4 bbbDescr data.frame 208x134 Blood Brain Barrier Data
5 cars data.frame 50x2 Kelly Blue Book resale data for 2005 model year GM cars
6 cox2Class factor 462 COX-2 Activity Data
7 cox2Descr data.frame 462x255 COX-2 Activity Data
8 cox2IC50 numeric 462 COX-2 Activity Data
9 dhfr data.frame 325x229 Dihydrofolate Reductase Inhibitors Data
10 endpoints matrix 215x3 Fat, Water and Protein Content of Meat Samples
11 fattyAcids data.frame 96x7 Fatty acid composition of commercial oils
12 logBBB numeric 208 Blood Brain Barrier Data
13 mdrrClass factor 528 Multidrug Resistance Reversal (MDRR) Agent Data
14 mdrrDescr data.frame 528x342 Multidrug Resistance Reversal (MDRR) Agent Data
15 oilType factor 96 Fatty acid composition of commercial oils
16 potteryClass factor 58 Pottery from Pre-Classical Sites in Italy
17 scat data.frame 110x19 Morphometric Data on Scat
18 scat_orig data.frame 122x20 Morphometric Data on Scat
19 segmentationData data.frame 2019x61 Cell Body Segmentation
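Rather than listing each dataset by hand, you could load everything caret ships in one go; a sketch, assuming the Item column returned by data(package = ...) follows the usual "object (file)" convention:
# List the datasets shipped with caret, then load them all with data()
items <- data(package = "caret")$results[, "Item"]
# Entries like "absorp (tecator)" name the data file in parentheses; keep it
files <- unique(sub(".*\\((.*)\\)$", "\\1", items))
data(list = files, package = "caret")
vcdExtra::datasets("caret")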

Grouping data into intervals in R and visualizing the result

I have some data extracted via Hive; in the end it's a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Besides the grouping, it's not clear how to visualize the data: the spends are low and the frequencies sometimes high, so I'm not sure how to handle this.
Here is an overview via head(data):
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data into intervals, i.e. add an extra column indicating the group. The first group should contain all rows with a frequency (freq) between 1 and 100, the second group all rows with a frequency between 101 and 200, and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 4
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 3
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a clean and simple way to do this? I need the grouping for upcoming plots: I want an overview of the spend within each interval. If you have ideas for the visualization, please let me know; I was thinking of boxplots.
If you want to group freq in steps of 100 units, you can try the ceiling function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.
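To add the grouping as a column and get the boxplot overview you mentioned, a minimal ggplot2 sketch (assuming your data frame is df):
library(ggplot2)
# Interval index: 1 = freq 1-100, 2 = freq 101-200, ...
df$group <- ceiling(df$freq / 100)
# One boxplot of spend per frequency interval
ggplot(df, aes(x = factor(group), y = spend)) +
  geom_boxplot() +
  labs(x = "Frequency interval (group)", y = "Spend")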

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two sets of points, one with 41 points and the other with 110,357,407, and the row values represent the distances between the two sets (the distance from each point in list 1 to every single point in list 2). The first column is the point index (1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to reshape this matrix into just 3 columns: the first would be like the first column of the matrix, with the 110,357,407 row indices; the second would identify which of the 41 points the distance refers to; and the third would be the distance between those points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between every back point and the first pres value are complete, pres changes to 2, eventually working its way up to 41.)
I realize this will produce a ridiculously large number of rows, but this is the format I need to run some processes outside of R.
I tried using this code:
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won't be the same number of rows for each column, so I received an error (and I don't think it would have handled my pres column anyway). I experimented with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with on the forum). I also looked into melting and reshaping, but I was very confused by the functions and couldn't figure out how to apply them appropriately (or whether they are even appropriate for what I need). I would really appreciate any help, as I've been struggling with this for a long time.
Edit: just to be a little clearer about what I need, take these two smaller datasets:
back: a dataset with 5 sets of x, y points
pres: a dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # stack all the distances into one vector, column by column
# expand.grid varies 'back' fastest, matching the column-major order of d
newdf <- cbind(expand.grid(back = seq_len(nrow(df)),
                           pres = seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory; the same holds for any simple solution, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem (see the sketch after the melt example below).
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each piece to a file as it completes.
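A minimal sketch of that chunked approach, assuming df holds the point index in column 1 with the 41 distance columns after it, and writing to a hypothetical file distances_long.csv:
chunk_size <- 1e6  # rows per piece; tune to your memory
n <- nrow(df)
for (s in seq(1, n, by = chunk_size)) {
  piece <- df[s:min(s + chunk_size - 1, n), ]
  # Same expand.grid/unlist trick as above, one slice at a time
  long <- cbind(expand.grid(back = piece[[1]],
                            pres = seq_len(ncol(piece) - 1)),
                dist = unlist(piece[-1], use.names = FALSE))
  write.table(long, "distances_long.csv", sep = ",", row.names = FALSE,
              col.names = (s == 1), append = (s > 1))
}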
