Split pdf_text() output into lines and words in R

I can't upload the file to Stack Overflow, but I have a PDF containing a table spanning 3 pages. After loading library(pdftools) and calling pdf_text(), I get a character vector of 3 elements, where each element is one long string holding all the text from a page.
library(pdftools)
df <- pdf_text("file.pdf")
The data I need is on the 2nd page. I get the output:
df[2]
All Households 19,015 10,030 8,985 3,635 585 3,055 19.1 5.8 34.0\n\nHousing above standards 12,365 8,225 4,145 0 0 0 0.0 0.0 0.0\n\nBelow one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n\nBelow affordability standard12 4,885 1,230 3,660 3,125 535 2,590 64.0 43.5 70.8\n\nBelow adequacy standard13 1,360 555 810 425 75 350 31.2 13.5 43.2\n\n\n\n\n
I want to isolate the row "Below one or more housing standards" and the 8th column which contains the value "54.7".
I believe the next steps are to split the long string into lines by the line break character "\n", identify the applicable line, split the line into words, and select the 8th word.
I've tried splitting into lines using:
library(stringr)
lines <- df[2] %>% str_split("\n")
It returns a "List of 1" and I'm not sure how to work with it. Any suggestions on the syntax?
It's a bit convoluted to get to the original file.
https://www03.cmhc-schl.gc.ca/hmip-pimh/en/#Profile/126504/5/Alta%20Vista
Core Housing Need -> Full Report -> Export.
Oddly there isn't a way to just download a CSV.

Use readLines (which doesn't use the scan(text = ...) pathway and therefore needs textConnection).
library(pdftools)
#Using poppler version 0.62.0
df <- pdf_text("Downloads/TableExport.pdf")
str(df)
# chr [1:3] "Core Housing Need (2016 Statistics Canada's Census) — Alta Vista\n H "| __truncated__ ...
# for each page read in with readLines to make character vectors
# separated by \n
lines <- lapply(df, function(t) readLines( textConnection(t)) )
Then search for the line with the target:
lines[[2]][grep("Below one or more housing standards", lines[[2]])]
[1] "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1"
If you assigned that value to the name target you could get the 8th column with this rather baroque regex:
sub("(Below one or more housing standards)([ ]*\\d*[,]*\\d*){6}[ ]*(\\d*[.]*\\d*)(.*)", "\\3", target)
#[1] "54.7"
Notice the need to allow commas and decimal points in the numeric specifications. As written it may not be totally general since the first six of the numeric columns are only allowed to have commas and not decimals. I guess you could allow a character class like "[.,]" to be more general. Or even: "([ ]*\\d*[,]*\\d+[.]*\\d*){6}" (lightly tested). I suspect there are packages that will handle tabular pdf formatting in a more principled manner.
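The split-based route described in the question can also be sketched in base R, as an alternative to the regex. This is only a sketch, assuming the row always ends in nine numeric columns; `page`, `words`, and `nums` are names introduced here for illustration, and `page` is a shortened stand-in for the `df[2]` string from the question.

```r
# Shortened stand-in for df[2] (assumption: the real page has more rows)
page <- paste0(
  "Housing above standards 12,365 8,225 4,145 0 0 0 0.0 0.0 0.0\n\n",
  "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n\n",
  "Below affordability standard12 4,885 1,230 3,660 3,125 535 2,590 64.0 43.5 70.8\n"
)

# strsplit() returns a list with one element per input string; [[1]] unwraps it
lines <- strsplit(page, "\n", fixed = TRUE)[[1]]

# Keep the single line that carries the row label
target <- lines[grepl("Below one or more housing standards", lines, fixed = TRUE)]

# Split on runs of whitespace and keep the last nine tokens (the numeric columns)
words <- strsplit(trimws(target), "[[:space:]]+")[[1]]
nums <- tail(words, 9)

# The 8th table column is the 7th numeric column; drop thousands separators
value <- as.numeric(gsub(",", "", nums[7], fixed = TRUE))
value
```

The same `[[1]]` indexing applies to the "List of 1" that `str_split()` returns: `lines[[1]]` is the character vector of lines.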

This does not use pdftools, but I hope it is helpful to you. First, use the rvest package to read the URL of this table, then use html_table to extract it into a table. After that, some manual manipulation is needed:
library(tidyverse)
library(rvest)
url = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/Profile/DetailsCoreHousingNeed?geographyId=126504&t=5"
# Read the url
doc = rvest::read_html(url)
# Extract the table, and provide anonymous V<x> names
table = rvest::html_table(doc)[[1]]
names(table) = paste0("V",1:ncol(table))
# drop first two rows
table <- table %>% filter(row_number()>2)
# Manually, identify the split rows (i.e. subheadings)
split_rows = c(1,9,24,32,36,40,44,48,55,62)
# Extract the subheadings
sub_table_names = table %>% filter(row_number() %in% split_rows) %>% pull(V1)
# Now, use lapply to filter the rows that are between the splits, and use as.numeric and str_remove_all to convert to numeric values
tables = lapply(seq_along(split_rows), function(x) {
table %>%
filter(between(row_number(), split_rows[x]+1, split_rows[x+1]-1 )) %>%
mutate(across(V2:V10, ~as.numeric(str_remove_all(.x,","))))
})
# Name the list of tables
names(tables) <- sub_table_names
Output:
$`Age of primary household maintainer3`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 15 to 24 years 1030 45 980 220 0 220 21.4 0 22.4
3 25 to 34 years 2700 715 1990 555 40 515 20.6 5.6 25.9
4 35 to 44 years 2795 1360 1440 545 25 520 19.5 1.8 36.1
5 45 to 54 years 3565 2005 1565 740 135 610 20.8 6.7 39
6 55 to 64 years 3535 2225 1315 615 155 455 17.4 7 34.6
7 65 years and over 5380 3685 1700 960 220 735 17.8 6 43.2
$`Household Type4`
# A tibble: 14 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Couple with children 4360 3145 1220 585 100 485 13.4 3.2 39.8
3 Couple without children 4755 3195 1555 390 70 315 8.2 2.2 20.3
4 Senior-led (65+) couple without children 2030 1695 335 140 50 90 6.9 2.9 26.9
5 Lone-parent household 2220 810 1405 845 135 710 38.1 16.7 50.5
6 Female lone-parent household 1845 660 1190 730 105 625 39.6 15.9 52.5
7 Male lone-parent household 370 155 220 115 30 85 31.1 19.4 38.6
8 Multiple-family household 265 165 100 70 20 45 26.4 12.1 45
9 One-person household 6075 2385 3685 1525 235 1290 25.1 9.9 35
10 Female one-person households 3615 1590 2025 920 135 795 25.4 8.5 39.3
11 Senior (65+) female living alone 1810 980 830 525 90 435 29 9.2 52.4
12 Male one-person household 2455 800 1660 605 105 500 24.6 13.1 30.1
13 Senior (65+) male living alone 600 350 250 170 50 120 28.3 14.3 48
14 Other non-family household 1345 330 1015 230 25 205 17.1 7.6 20.2
$`Immigrant households5`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Non-immigrant 12500 7115 5395 1665 230 1440 13.3 3.2 26.7
3 Non-permanent resident6 430 25 400 140 10 130 32.6 40 32.5
4 Immigrant 6085 2890 3190 1825 345 1485 30 11.9 46.6
5 Landed before 2001 4105 2480 1620 1065 275 790 25.9 11.1 48.8
6 Landed 2001 to 2010 1340 340 1000 460 55 400 34.3 16.2 40
7 Recent immigrants (landed 2011-2016)7 640 70 575 310 10 295 48.4 14.3 51.3
$`Households with seniors`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one senior (65 or older) 5910 4085 1825 1015 245 770 17.2 6 42.2
3 Other household type 13105 5945 7155 2625 340 2285 20 5.7 31.9
$`Households with children under 18`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one child less than 18 years old 4465 2455 2005 1140 170 975 25.5 6.9 48.6
3 Other household type 14550 7575 6980 2500 420 2080 17.2 5.5 29.8
$`Activity limitations8`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one person with activity limitations 10955 5830 5120 2285 385 1895 20.9 6.6 37
3 All other households 8060 4195 3865 1360 200 1160 16.9 4.8 30
$`Aboriginal households9`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Aboriginal households 655 215 440 120 20 105 18.3 9.3 23.9
3 Non-Aboriginal households 18355 9815 8540 3515 565 2955 19.2 5.8 34.6
$`Incomes, shelter costs10, and STIRs11`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average household income before taxes ($) 96464 134172 54357 29101 31212 28696 NA NA NA
2 Average monthly shelter costs ($) 1256 1408 1085 1039 1243 1000 NA NA NA
3 Average STIR before taxes (%) 24 17.2 31.5 46.8 49.7 46.2 NA NA NA
4 Median household income before taxes ($) 72502 107762 44596 27711 28437 27568 NA NA NA
5 Median monthly shelter costs ($) 1097 1193 1076 1013 1115 1006 NA NA NA
6 Median STIR before taxes (%) 19.3 14 26 43.8 45.8 43.3 NA NA NA
$`Housing standards`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Housing above standards 12365 8225 4145 0 0 0 0 0 0
3 Below one or more housing standards 6650 1805 4845 3640 585 3055 54.7 32.4 63.1
4 Below affordability standard12 4885 1230 3660 3125 535 2590 64 43.5 70.8
5 Below adequacy standard13 1360 555 810 425 75 350 31.2 13.5 43.2
6 Below suitability standard14 1480 210 1270 800 55 745 54.1 26.2 58.7

You could check whether there is more up-to-date 2018 data by following the breadcrumbs to https://www150.statcan.gc.ca/n1/pub/46-25-0001/462500012021001-eng.htm .
However, if you only want one row, it is easy to save the page source with a right click:
<tr>
<th scope="row">Below one or more housing standards</th>
<td>6,650</td>
<td>1,805</td>
<td>4,845</td>
<td>3,640</td>
<td>585</td>
<td>3,055</td>
<td>54.7</td>
<td>32.4</td>
<td>63.1</td>
</tr>
For the headings you need:
HOUSEHOLDS TESTED FOR CORE HOUSING NEED 1 HOUSEHOLDS IN CORE HOUSING NEED 2 % OF HOUSEHOLDS IN CORE HOUSING NEED
TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS
and for the footnotes:
1 Data include all non-farm, non-band, non-reserve private households reporting positive incomes and shelter cost-to-income ratios less than 100 per cent.
2 A household is in core housing need if its housing does not meet one or more standards for housing adequacy (repair), suitability (crowding), or affordability and if it would have to spend 30 per cent or more of its before-tax income to pay the median rent (including utilities) of appropriately sized alternative local market housing. Adequate housing does not require any major repairs, according to residents. Suitable housing has enough bedrooms for the size and make-up of resident households. Affordable housing costs less than 30 per cent of before-tax household income.
You have a PDF and want to work with the raw text, but it's clear there is some issue with the generated searchable text; we can see that in the headings and when copying and pasting, e.g. "Belowone ormore housing standards". So here is the expected extraction from the bottom of page 2:
pdftotext -f 2 -l 2 -nopgbrk -simple -margint 650 tableexport.pdf -

Related

Looping over Dataframes and Columns in R

I have a dataframe with this structure:
df <- read.table(text="
site date v1 v2 v3 v4
a 2019-08-01 0 17 94 150
b 2019-08-01 5 25 83 148
c 2019-08-01 6 39 43 148
d 2019-08-01 10 39 144 165
a 2019-03-31 4 15 106 154
b 2019-03-31 4 21 70 151
c 2019-03-31 8 30 44 148
d 2019-03-31 9 41 144 160
a 2019-01-04 3 10 104 153
b 2019-01-04 2 16 90 150
c 2019-01-04 8 40 62 151
d 2019-01-04 9 43 142 162
a 2019-07-07 3 14 93 152
b 2019-07-07 2 23 74 147
c 2019-07-07 9 31 58 147
d 2019-07-07 9 36 123 170
a 2019-06-17 0 12 91 153
b 2019-06-17 3 25 73 147
c 2019-06-17 7 35 45 146
d 2019-06-17 8 40 134 168
a 2019-01-11 4 14 104 153
b 2019-01-11 5 18 73 151
c 2019-01-11 7 35 65 147
d 2019-01-11 11 44 134 168
a 2019-11-11 4 20 103 152
b 2019-11-11 6 22 79 152
c 2019-11-11 5 38 52 147
d 2019-11-11 10 38 144 163
a 2019-09-06 3 13 102 155
b 2019-09-06 6 17 74 149
c 2019-09-06 9 32 45 146
d 2019-09-06 11 42 138 165
", header=TRUE, stringsAsFactors=FALSE)
Now, I would like to calculate the statistic (min, max, mean, median, sd) of the variables (v1 - v4) for each of the sites for a full year, only the summer and only the winter.
First I subsetted the data for the summer and winter using the following code:
df_summer <- selectByDate(df, month = c(4:9))
df_winter <- selectByDate(df, month = c(1,2,3,10,11,12))
Then I tried to build a loop over the seasons and then over the variables. For this I created two lists:
df_list <- list(df, df_summer, df_winter)
col_names <- c("v1", "v2", "v3", "v4")
which I then tried to implement in the loop:
for (i in seq_along(df_list)){
for (j in col_names[,i]){
[j]_[i] <- describeBy([i]$[,j], [i]$site)
[j]_[i] <- data.frame(matrix(unlist([j]_[i]), nrow=length([j]_[i]), byrow=T))
[j]_[i]$site <- c("Frau2", "MW", "Sys1", "Sys4")
[j]_[i]$season <- c([i], [i], [i], [i])
[j]_[i]$type <- c([j], [j], [j], [j])
}
}
But this did not work - I get the messages:
Error: unexpected '[' in:
"for (j in col_names[,i]){
["
Error: unexpected '[' in " ["
Error: unexpected '}' in " }"
I already used the loop-"workflow" to generate the data I wanted, but this was done with copy and paste in order to get the data quick and dirty. Now I would like to tidy up the code.
Do you have an idea how I could make this work, or what I am doing wrong?
Thank you!
Matthias
UPDATE
So I tried what ekoam suggested (thank you for that!) and the following problem occurred.
Contrary to the comments I wrote below ekoam's answer, the error occurs with both datasets (the example one provided here and the actual one I'm using; I'm not sure whether I'm allowed to publish the dataset).
This is the code I used and the error I got:
df <- read_excel("C:/###/###/###/Example_data.xlsx")
df <- subset(data_watersamples, site %in% c("a","b","c", "d"))
my_summary <-
. %>%
group_by(site) %>%
summarise_at(vars(
c(v1, v2, v3, v4),
list(min = min, max = max, mean = mean, median = median, sd = sd)
)) %>%
pivot_longer(-site, names_to = c("type", "stat"), names_sep = "_") %>%
pivot_wider(names_from = "stat")
summer <- as.integer(format.Date(df$date, "%m")) %in% 4:9
df_list <- list(full_year = df, summer = df[summer, ], winter = df[!summer, ])
lapply(df_list, my_summary)
and get this error:
Error: Must subset columns with a valid subscript vector.
x Subscript has the wrong type `list`.
i It must be numeric or character.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
Error in `*tmp*`[[id - n]] :
attempt to select more than one element in integerOneIndex
Thanks for your help!
Matthias
As you want things to be tidy, how about this tidyverse approach to your problem?
library(dplyr)
library(tidyr)
my_summary <-
. %>%
group_by(site) %>%
summarise(across(
c(v1, v2, v3, v4),
list(min = min, max = max, mean = mean, median = median, sd = sd)
)) %>%
pivot_longer(-site, names_to = c("type", "stat"), names_sep = "_") %>%
pivot_wider(names_from = "stat")
summer <- as.integer(format.Date(df$date, "%m")) %in% 4:9
df_list <- list(full_year = df, summer = df[summer, ], winter = df[!summer, ])
lapply(df_list, my_summary)
Output
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
`summarise()` ungrouping output (override with `.groups` argument)
$full_year
# A tibble: 16 x 7
site type min max mean median sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a v1 0 4 2.62 3 1.69
2 a v2 10 20 14.4 14 3.07
3 a v3 91 106 99.6 102. 5.93
4 a v4 150 155 153. 153 1.49
5 b v1 2 6 4.12 4.5 1.64
6 b v2 16 25 20.9 21.5 3.52
7 b v3 70 90 77 74 6.63
8 b v4 147 152 149. 150. 1.92
9 c v1 5 9 7.38 7.5 1.41
10 c v2 30 40 35 35 3.78
11 c v3 43 65 51.8 48.5 8.84
12 c v4 146 151 148. 147 1.60
13 d v1 8 11 9.62 9.5 1.06
14 d v2 36 44 40.4 40.5 2.67
15 d v3 123 144 138. 140 7.38
16 d v4 160 170 165. 165 3.40
$summer
# A tibble: 16 x 7
site type min max mean median sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a v1 0 3 1.5 1.5 1.73
2 a v2 12 17 14 13.5 2.16
3 a v3 91 102 95 93.5 4.83
4 a v4 150 155 152. 152. 2.08
5 b v1 2 6 4 4 1.83
6 b v2 17 25 22.5 24 3.79
7 b v3 73 83 76 74 4.69
8 b v4 147 149 148. 148. 0.957
9 c v1 6 9 7.75 8 1.5
10 c v2 31 39 34.2 33.5 3.59
11 c v3 43 58 47.8 45 6.90
12 c v4 146 148 147. 146. 0.957
13 d v1 8 11 9.5 9.5 1.29
14 d v2 36 42 39.2 39.5 2.5
15 d v3 123 144 135. 136 8.85
16 d v4 165 170 167 166. 2.45
$winter
# A tibble: 16 x 7
site type min max mean median sd
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a v1 3 4 3.75 4 0.5
2 a v2 10 20 14.8 14.5 4.11
3 a v3 103 106 104. 104 1.26
4 a v4 152 154 153 153 0.816
5 b v1 2 6 4.25 4.5 1.71
6 b v2 16 22 19.2 19.5 2.75
7 b v3 70 90 78 76 8.83
8 b v4 150 152 151 151 0.816
9 c v1 5 8 7 7.5 1.41
10 c v2 30 40 35.8 36.5 4.35
11 c v3 44 65 55.8 57 9.60
12 c v4 147 151 148. 148. 1.89
13 d v1 9 11 9.75 9.5 0.957
14 d v2 38 44 41.5 42 2.65
15 d v3 134 144 141 143 4.76
16 d v4 160 168 163. 162. 3.40
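For comparison, the nested loop from the question can be kept if the results go into a named list rather than into variable names built with `[j]_[i]`, which is not valid R syntax. A minimal base R sketch, assuming a cut-down copy of the example data (two sites, two variables); `stats`, `results`, and `all_stats` are names chosen here for illustration:

```r
# Cut-down copy of the example data (assumption: two sites, two variables)
df <- data.frame(
  site = rep(c("a", "b"), times = 4),
  date = as.Date(rep(c("2019-01-04", "2019-03-31", "2019-06-17", "2019-08-01"), each = 2)),
  v1 = c(3, 2, 4, 4, 0, 3, 0, 5),
  v2 = c(10, 16, 15, 21, 12, 25, 17, 25)
)

# One function returning all five statistics as a named vector
stats <- function(x) c(min = min(x), max = max(x), mean = mean(x),
                       median = median(x), sd = sd(x))

summer <- as.integer(format(df$date, "%m")) %in% 4:9
df_list <- list(full_year = df, summer = df[summer, ], winter = df[!summer, ])
col_names <- c("v1", "v2")

results <- list()
for (season in names(df_list)) {
  d <- df_list[[season]]
  for (v in col_names) {
    # aggregate() applies stats() per site; res$x is a matrix of the statistics
    res <- aggregate(d[[v]], by = list(site = d$site), FUN = stats)
    results[[paste(v, season, sep = "_")]] <- cbind(
      data.frame(site = res$site, season = season, type = v),
      as.data.frame(res$x)
    )
  }
}

# Stack the per-season, per-variable pieces into one data frame
all_stats <- do.call(rbind, results)
```

Individual pieces stay reachable by name, for example `results[["v1_summer"]]`.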

pivot_longer, transform first column rownames, to column names

Using this data:
https://www.health.govt.nz/system/files/documents/publications/suicide-2015-tables.xlsx
The first column has rows:
"Number of suicides
Male
Female
Total
Age-standardised rate (deaths per 100,000)
Male
Female
Total"
However, I need these to be the column headers instead. Is this done via pivot_longer()?
Thanks
This is a pretty ugly data file, but hopefully this should get to what you want:
library(dplyr)
library(tidyr)
# After downloading the file in your project folder
dat <- readxl::read_excel("suicide-2015-tables.xlsx", skip = 2)
dat %>%
select(variables = ...1, `2006`:`2015`) %>% # Remove unneeded/blank columns
mutate(headers = if_else(is.na(`2006`), variables, NA_character_)) %>% # Create a headers variable
fill(headers, .direction = "down") %>% # Fill the headers down
pivot_longer(`2006`:`2015`, names_to = "year", values_to = "counts") %>% # Reshape data from wide to long
drop_na() %>%
unite("headers_vars", headers, variables, sep = " - ") %>% # Create a new variable that combines the headers and the subgroup breakdown
pivot_wider(names_from = headers_vars, values_from = counts) # Reshape back from long to wide
# A tibble: 10 x 15
year `Number of suic~ `Number of suic~ `Number of suic~ `Age-standardis~ `Age-standardis~ `Age-standardis~ `Age-specific r~
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2006 388 138 526 18.6 6.25 12.2 19.7
2 2007 371 116 487 17.4 5.01 11.0 15.2
3 2008 381 139 520 17.6 6.24 11.8 19.5
4 2009 393 117 510 17.9 5.03 11.3 18.1
5 2010 386 149 535 17.3 6.63 11.8 17.7
6 2011 377 116 493 17.0 5.06 10.9 20.1
7 2012 404 146 550 18.1 6.39 12.1 23.0
8 2013 367 147 514 16.0 6.41 11.0 17.8
9 2014 380 130 510 16.5 5.4 10.8 14.1
10 2015 384 143 527 16.4 6.1 11.1 16.9
# ... with 7 more variables: `Age-specific rates by life-stage age group (deaths per 100,000) - 25–44` <dbl>, `Age-specific rates by
# life-stage age group (deaths per 100,000) - 45–64` <dbl>, `Age-specific rates by life-stage age group (deaths per 100,000) - 65+` <dbl>,
# `Age-standardised suicide rates for Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide rates for Māori (deaths per
# 100,000) - Female` <dbl>, `Age-standardised suicide rates for non-Māori (deaths per 100,000) - Male` <dbl>, `Age-standardised suicide
# rates for non-Māori (deaths per 100,000) - Female` <dbl>
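The core of the pipeline above is the "detect header rows, then fill them down" step. It can be sketched on a toy table in base R; this assumes, as in the spreadsheet, that header rows are exactly the rows with no value in the year column. Only the 2006 values shown in the output above are used, and `raw`, `body`, and `wide` are illustrative names:

```r
# Toy stand-in for the spreadsheet: header rows have no values (assumption)
raw <- data.frame(
  variables = c("Number of suicides", "Male", "Female",
                "Age-standardised rate (deaths per 100,000)", "Male", "Female"),
  `2006` = c(NA, 388, 138, NA, 18.6, 6.25),
  check.names = FALSE
)

# A row is a header when its value column is empty
is_header <- is.na(raw$`2006`)

# cumsum() numbers the header blocks, so each row picks up the header above it
header <- raw$variables[is_header][cumsum(is_header)]

# Drop the header rows and build the combined column names
body <- raw[!is_header, ]
body$name <- paste(header[!is_header], body$variables, sep = " - ")

# One row for the year, one column per header/subgroup combination
wide <- data.frame(year = "2006")
wide[body$name] <- as.list(body$`2006`)
names(wide)
```

The `cumsum()` trick relies on the first row being a header; if it is not, the leading rows would have index 0 and silently drop out.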

Output function to create multiple dataframes by subsetted row

I am trying to create multiple data frames from a function, with each data frame being the aggregate of the rows up to a varying row value. For reference, I am using fantasy football data. Right now I have each player's stats for every week. I want to create a data frame for each week holding the cumulative stats up to that week.
Here is the function I am currently using, which only creates one list, aggregating the week 17 values.
sumuptopoint <- function(dfx,i) { listofdfs <- list()
dfy <- dfx[, !sapply(dfx, is.character)]
{for (i in 1:17)
dft <- dfy[dfy$Week < i,]
y <<- as.data.frame(aggregate(dft, list("PlayerID" = dft$PlayerID), sum))
listofdfs[[i]] <- y}
return(listofdfs)}
I expect 17 lists of aggregated data, but I only get 1 list, in which the weeks prior to week 17 are aggregated.
Here is the df:
Team ByeWeek Rank.all PlayerID Name Position Week Opponent PassingCompletio~ PassingAttempts.~ PassingCompletio~ PassingYards.all PassingTouchdow~ PassingIntercep~ PassingRating.a~
<chr> <int> <int> <int> <chr> <chr> <dbl> <chr> <int> <int> <dbl> <int> <int> <int> <dbl>
1 ARI 12 201 19763 Josh ~ QB 8.00 SF 23 40 57.5 252 2 1 82.5
2 ARI 12 319 19763 Josh ~ QB 11.0 OAK 9 20 45.0 136 3 2 67.9
3 ARI 12 372 19763 Josh ~ QB 4.00 SEA 15 27 55.6 180 1 0 88.5
4 ARI 12 392 11527 Sam B~ QB 3.00 CHI 13 19 68.4 157 2 2 89.0
5 ARI 12 407 19763 Josh ~ QB 5.00 SF 10 25 40.0 170 1 0 77.1
6 ARI 12 411 19763 Josh ~ QB 10.0 KC 22 39 56.4 208 1 2 58.5
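One possible reading of the problem: the opening brace sits before `for`, so only the first statement (`dft <- ...`) forms the loop body, and the aggregation runs once after the loop has finished. A minimal sketch of the intended loop follows, on invented toy data. It assumes "until that week" includes the week itself (`<=`, where the original uses `<`); `sum_up_to_week` and `toy` are hypothetical names.

```r
sum_up_to_week <- function(dfx, weeks = 1:17) {
  # Keep only numeric columns, as the original function does
  num <- dfx[, sapply(dfx, is.numeric), drop = FALSE]
  stat_cols <- setdiff(names(num), c("PlayerID", "Week"))
  out <- vector("list", length(weeks))
  names(out) <- paste0("week_", weeks)
  for (k in seq_along(weeks)) {
    # All statements that must run once per week go inside the braces
    dft <- num[num$Week <= weeks[k], , drop = FALSE]
    if (nrow(dft) == 0) next  # aggregate() fails on zero rows; leave NULL
    out[[k]] <- aggregate(dft[stat_cols],
                          by = list(PlayerID = dft$PlayerID), FUN = sum)
  }
  out
}

# Invented toy data, only to show the shape of the result
toy <- data.frame(
  Name = c("A", "A", "B", "B"),
  PlayerID = c(1, 1, 2, 2),
  Week = c(1, 2, 1, 2),
  PassingYards = c(100, 150, 200, 50)
)
res <- sum_up_to_week(toy, weeks = 1:2)
res$week_2
```

Returning the list from the function, instead of assigning with `<<-`, keeps each week's result separate and avoids overwriting a global `y`.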

Convert Pdf into CSV using R [closed]

My pdf table looks like the following:
I'm trying to convert this table into csv file. The code I'm using is the following:
x <- c('pdftools', 'stringr', 'tidyverse')
lapply(x, require, character.only = TRUE)
# Reading the file
pdf_text <- pdf_text('Input/file.pdf') %>%
readr::read_lines()
write.csv(pdf_text ,'pdf_text .csv', row.names = F)
Please find the file link attached.
But unfortunately I'm not getting the correct result. I tried many online options suggested. But none worked. Can someone please guide me?
Thanks!
The tabulizer package can easily extract tables from PDF.
It will return a list with one element (a matrix) for each page. So we convert the matrix to a dataframe (tibble), then we chop the headers and bind the rows.
We can then transform the values (strings) as numeric...
library(tidyverse)
library(tabulizer)
(extract_tables("c:/tmp/KMR-1989.pdf",
method = "lattice") %>%
map(as_tibble) %>%
map_dfr(slice, 4:1000) %>%
mutate_at(3:19, as.numeric) %>%
write_csv("my_pdf.csv"))
#> # A tibble: 44 x 19
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 MALLA~ 0 545. 316. 82.9 944. 2.1 0 0 2.1 0
#> 2 2 RAIKAL 321. 741. 226. 92.9 1381. 40.6 0 1.4 42 0
#> 3 3 DHARM~ 210. 503 271 187. 1171. 2 0 35.6 37.6 0
#> 4 4 VELGA~ 360. 336. 286. 143. 1124. 38 0 23.4 61.4 0
#> 5 5 KAMAN~ 0 442. 242. 130 814. 0 0 0 0 0
#> 6 6 MANTH~ 297. 394 320. 202 1213. 34.8 0 0 34.8 0
#> 7 7 KATAR~ 0 493 468 245 1206 20 0 0 20 17
#> 8 8 MAHAD~ 329 534 546 165 1574 28 0 0 28 0
#> 9 9 MUTHA~ 260. 293 296 253 1102. 31 0 0 31 0
#> 10 10 PEDDA~ 392. 277. 151. 85 905. 10.2 0 0 10.2 0
#> # ... with 34 more rows, and 7 more variables: V13 <dbl>, V14 <dbl>,
#> # V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>
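If tabulizer is not an option, the text returned by `pdf_text()` can sometimes be split into columns by hand. The sketch below assumes the columns are separated by runs of two or more spaces and that no cell is empty, which will not hold for every PDF; the sample lines and names are invented for illustration and do not come from the linked file.

```r
# Hypothetical lines as pdf_text() might return them (names and values invented)
page <- "1   AAA    0.0   545.2   316.4\n2   BBB  321.0   741.5   226.1\n"

lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
lines <- lines[nzchar(trimws(lines))]          # drop blank lines

# Split each line on runs of two or more spaces
cells <- strsplit(trimws(lines), " {2,}")

# This only works when every line yields the same number of cells
stopifnot(length(unique(lengths(cells))) == 1)

tab <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
tab[3:5] <- lapply(tab[3:5], as.numeric)       # numeric columns by position

out_file <- tempfile(fileext = ".csv")
write.csv(tab, out_file, row.names = FALSE)
```

Writing the character vector of lines directly, as in the question, yields a one-column CSV; the split step is what produces separate columns.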

How can I find the matching values from a column within 3 or more grouped dataframes?

I have grouped data frames (in my case, three data frames grouped together). I want to find the intersection between all three data frames based on a value in a column.
I have been playing around with the dplyr intersect function but don't see how I can use this with my grouped data frames. I want to find all rows within all three data frames that have the same Start.Coord value.
Here is one failed attempt with the resulting error message:
SameWithinTreatment <- SorbitolGroup %>% group_by(Sample) %>% intersect(Start.Coord)
Error in intersect_data_frame(x, y) : object 'Start.Coord' not found
Obviously I need another parameter to give to intersect(). I see that intersect() doesn't seem to be the function I need but it seems that there must be a way to do what I need.
I have done a lot of searching but everything I find only works with 2 data frames.
Here is some example data from my grouped data frames. There is one row with a common Start.Coord value between these three: the row with 8805 as the Start.Coord.
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Covera~ SD.of.Normalized.Covera~ TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 1019 1023 X1.combined 19 18 9.91 3.98 7.95
2 1510 1514 X1.combined 19 18 9.91 3.98 7.95
3 1514 1518 X1.combined 19 18 9.91 3.98 7.95
4 1520 1524 X1.combined 19 18 9.91 3.98 7.95
5 8805 8809 X1.combined 19 18 9.91 3.98 7.95
6 48185 48189 X1.combined 19 18 9.91 3.98 7.95
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X2 167 166 122. 21.7 43.4
2 11874 11878 X2 169 168 122. 21.7 43.4
3 12042 12046 X2 169 168 122. 21.7 43.4
4 18321 18325 X2 175 174 122. 21.7 43.4
5 25187 25191 X2 167 166 122. 21.7 43.4
6 25308 25312 X2 194 193 122. 21.7 43.4
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
<int> <int> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 8805 8809 X3 132 131 94.4 16.7 33.5
2 10340 10344 X3 135 134 94.4 16.7 33.5
3 11874 11878 X3 141 140 94.4 16.7 33.5
4 12042 12046 X3 137 136 94.4 16.7 33.5
5 18209 18213 X3 133 132 94.4 16.7 33.5
6 18218 18222 X3 143 142 94.4 16.7 33.5
So I would like to get back a new data frame that looks like this:
Start.Coord Stop.Coord Sample Coverage normalized.coverage Average.Normalized.Coverage SD.of.Normalized.Coverage TwoSD
8805 8809 X1.combined 19 18 9.91 3.98 7.95
8805 8809 X2 167 166 122. 21.7 43.4
8805 8809 X3 132 131 94.4 16.7 33.5
Is there a way to accomplish this?
If your 3 data frames have the same column names use rbind to combine them
SorbitolGroup<- rbind(df1,df2,df3)
then add
Start.Coord to group_by:
SorbitolGroup %>% group_by(Sample,Start.Coord)
If you want to count the number of observations in both groups
SorbitolGroup %>% group_by(Sample,Start.Coord) %>% tally()
It sounds like you need to use filter(), in addition to what @W148SMH suggested.
a <- data.frame(sample='a',value=sample(1:10,10,T))
b <- data.frame(sample='b',value=sample(1:10,10,T))
c <- data.frame(sample='c',value=sample(1:10,10,T))
df <- rbind(a,b,c)
summary(df)
df %>% filter(value==9)
df_new <- df %>% filter(value==9) # new data frame including all cases with value==9
df %>% count(sample,value)
df %>% group_by(sample,value) %>%
summarise(...) # to summarise other variables at each level of sample and value
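To keep only the rows whose Start.Coord appears in every sample, without hard-coding the value, one option is to intersect the per-sample coordinate vectors. This is a base R sketch on a reduced copy of the example data (two columns, three rows per sample); `all_rows`, `common`, and `shared` are illustrative names.

```r
# Reduced copy of the example data (assumption: only two columns kept)
df1 <- data.frame(Start.Coord = c(1019, 1510, 8805), Sample = "X1.combined")
df2 <- data.frame(Start.Coord = c(8805, 11874, 12042), Sample = "X2")
df3 <- data.frame(Start.Coord = c(8805, 10340, 11874), Sample = "X3")
all_rows <- rbind(df1, df2, df3)

# split() gives one coordinate vector per sample;
# Reduce(intersect, ...) folds intersect() over any number of them
common <- Reduce(intersect, split(all_rows$Start.Coord, all_rows$Sample))

# Keep the rows whose coordinate is shared by every sample
shared <- all_rows[all_rows$Start.Coord %in% common, ]
shared
```

Because `Reduce()` works on a list of any length, the same lines apply to four or more samples.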

Resources