My pdf table looks like the following:
I'm trying to convert this table into csv file. The code I'm using is the following:
x <- c('pdftools', 'stringr', 'tidyverse')
lapply(x, require, character.only = TRUE)
# Reading the file
pdf_text <- pdf_text('Input/file.pdf') %>%
readr::read_lines()
write.csv(pdf_text ,'pdf_text .csv', row.names = F)
Please find the file link attached.
Unfortunately, I'm not getting the correct result. I tried many of the options suggested online, but none worked. Can someone please guide me?
Thanks!
The tabulizer package can extract tables from a PDF.
extract_tables() returns a list with one element (a matrix) for each page. So we convert each matrix to a data frame (tibble), then chop off the header rows and bind the remaining rows together.
We can then convert the values (strings) to numeric:
library(tidyverse)
library(tabulizer)
(extract_tables("c:/tmp/KMR-1989.pdf",
method = "lattice") %>%
map(as_tibble) %>%
map_dfr(slice, 4:1000) %>%
mutate_at(3:19, as.numeric) %>%
write_csv("my_pdf.csv"))
#> # A tibble: 44 x 19
#> V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 MALLA~ 0 545. 316. 82.9 944. 2.1 0 0 2.1 0
#> 2 2 RAIKAL 321. 741. 226. 92.9 1381. 40.6 0 1.4 42 0
#> 3 3 DHARM~ 210. 503 271 187. 1171. 2 0 35.6 37.6 0
#> 4 4 VELGA~ 360. 336. 286. 143. 1124. 38 0 23.4 61.4 0
#> 5 5 KAMAN~ 0 442. 242. 130 814. 0 0 0 0 0
#> 6 6 MANTH~ 297. 394 320. 202 1213. 34.8 0 0 34.8 0
#> 7 7 KATAR~ 0 493 468 245 1206 20 0 0 20 17
#> 8 8 MAHAD~ 329 534 546 165 1574 28 0 0 28 0
#> 9 9 MUTHA~ 260. 293 296 253 1102. 31 0 0 31 0
#> 10 10 PEDDA~ 392. 277. 151. 85 905. 10.2 0 0 10.2 0
#> # ... with 34 more rows, and 7 more variables: V13 <dbl>, V14 <dbl>,
#> # V15 <dbl>, V16 <dbl>, V17 <dbl>, V18 <dbl>, V19 <dbl>
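The "chop the headers and bind the rows" step can be tried without a PDF. The following is a minimal sketch, assuming a list of character matrices shaped like what extract_tables() returns; the `pages` object and its values are made up purely for illustration, and only one header row is dropped per page here.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for the list returned by extract_tables():
# one character matrix per page, each starting with a header row.
pages <- list(
  matrix(c("No", "Name", "Value",
           "1",  "A",    "10.5",
           "2",  "B",    "20"), ncol = 3, byrow = TRUE),
  matrix(c("No", "Name", "Value",
           "3",  "C",    "30.25"), ncol = 3, byrow = TRUE)
)

combined <- pages %>%
  # as.data.frame() names the columns V1, V2, ... for an unnamed matrix
  map(~ as.data.frame(.x, stringsAsFactors = FALSE)) %>%
  # drop the header row of every page, then bind the pages row-wise
  map_dfr(~ slice(.x, -1)) %>%
  # convert the third column from character to numeric
  mutate(across(3, as.numeric))

combined
```

With a real PDF, the number of header rows to drop and the range of numeric columns have to be read off the extracted matrices.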
I need to divide columns despesatotal and despesamonetaria by the row named Total:
Let's suppose your data set is df.
# 1) Delete the last row
df <- df[-nrow(df),]
# 2) Build the desired data.frame (combining the CNAE names and the proportion columns)
new.df <- cbind(grup_CNAE = df$grup_CNAE,
100*prop.table(df[,-1],margin = 2))
Finally, rename your columns. Be careful with matrix versus data.frame formats, because mathematical operations can sometimes cause problems. If you use the dput function to give us a reproducible example, the answer can be more accurate.
Here is a way to get it done. This is not the best way, but I think it is very readable.
Suppose this is your data frame:
mydf = structure(list(grup_CNAE = c("A", "B", "C", "D", "E", "Total"
), despesatotal = c(71, 93, 81, 27, 39, 311), despesamonetaria = c(7,
72, 36, 22, 73, 210)), row.names = c(NA, -6L), class = "data.frame")
mydf
# grup_CNAE despesatotal despesamonetaria
#1 A 71 7
#2 B 93 72
#3 C 81 36
#4 D 27 22
#5 E 39 73
#6 Total 311 210
To divide the despesatotal values by their total, you need to use the total value (311 in this example) as the denominator. Note that the total value is located in the last row. You can pick it out by indexing the despesatotal column with nrow() as the index value.
library(dplyr)  # provides mutate()
mydf |> mutate(percentage1 = despesatotal/despesatotal[nrow(mydf)],
percentage2 = despesamonetaria /despesamonetaria[nrow(mydf)])
# grup_CNAE despesatotal despesamonetaria percentage1 percentage2
#1 A 71 7 0.22829582 0.03333333
#2 B 93 72 0.29903537 0.34285714
#3 C 81 36 0.26045016 0.17142857
#4 D 27 22 0.08681672 0.10476190
#5 E 39 73 0.12540193 0.34761905
#6 Total 311 210 1.00000000 1.00000000
library(tidyverse)
Sample data
# A tibble: 11 x 3
group despesatotal despesamonetaria
<chr> <int> <int>
1 1 198 586
2 2 186 525
3 3 202 563
4 4 300 562
5 5 126 545
6 6 215 529
7 7 183 524
8 8 163 597
9 9 213 592
10 10 175 530
11 Total 1961 5553
df %>%
mutate(percentage_total = despesatotal / last(despesatotal),
percentage_monetaria = despesamonetaria/ last(despesamonetaria)) %>%
slice(-nrow(.))
# A tibble: 10 x 5
group despesatotal despesamonetaria percentage_total percentage_monetaria
<chr> <int> <int> <dbl> <dbl>
1 1 198 586 0.101 0.106
2 2 186 525 0.0948 0.0945
3 3 202 563 0.103 0.101
4 4 300 562 0.153 0.101
5 5 126 545 0.0643 0.0981
6 6 215 529 0.110 0.0953
7 7 183 524 0.0933 0.0944
8 8 163 597 0.0831 0.108
9 9 213 592 0.109 0.107
10 10 175 530 0.0892 0.0954
This is a good place to use dplyr::mutate(across()) to divide all relevant columns by the Total row. Note this is not sensitive to the order of the rows and will apply the manipulation to all numeric columns. You can supply any tidyselect semantics to across() instead if needed in your case.
library(tidyverse)
# make sample data
d <- tibble(grup_CNAE = paste0("Group", 1:12),
despesatotal = sample(1e6:5e7, 12),
despesamonetaria = sample(1e6:5e7, 12)) %>%
add_row(grup_CNAE = "Total", summarize(., across(where(is.numeric), sum)))
# divide numeric columns by value in "Total" row
d %>%
mutate(across(where(is.numeric), ~./.[grup_CNAE == "Total"]))
#> # A tibble: 13 × 3
#> grup_CNAE despesatotal despesamonetaria
#> <chr> <dbl> <dbl>
#> 1 Group1 0.117 0.0204
#> 2 Group2 0.170 0.103
#> 3 Group3 0.0451 0.0837
#> 4 Group4 0.0823 0.114
#> 5 Group5 0.0170 0.0838
#> 6 Group6 0.0174 0.0612
#> 7 Group7 0.163 0.155
#> 8 Group8 0.0352 0.0816
#> 9 Group9 0.0874 0.135
#> 10 Group10 0.113 0.0877
#> 11 Group11 0.0499 0.0495
#> 12 Group12 0.104 0.0251
#> 13 Total 1 1
Created on 2022-11-08 with reprex v2.0.2
I'm using rvest and tidyverse to scrape and process some data off the web.
There was recently a change to the website where some of the data is now in 2 tables and you can change between them using a button.
I'm trying to figure out how to scrape the data from both. They seem to have the same css class now so I can't figure out how to access each individually.
The code below seems to grab the "extended snowfall history", but I can't seem to figure out how to get the "2022-2023 winter season" data. Obviously I'll need to do a little processing and math to put the "2022-2023 winter season" into a new row in "extended snowfall history", but I can't even figure out how to grab it.
Currently I have :
library(rvest)
library(tidyverse)
mammoth <- read_html('https://www.mammothmountain.com/on-the-mountain/historical-snowfall')
snow <- mammoth %>%
html_element('table.css-86hwhl') %>%
html_table(header= TRUE, convert = TRUE) %>%
mutate_if(is.character,as.factor) %>%
mutate_if(is.integer,as.double) %>%
select(-Total)
A simple approach would be to use rvest::html_elements('table.css-86hwhl') (plural rather than singular), which extracts every table element with the CSS class css-86hwhl instead of only the first match. Then you can manually choose the tables you want.
For example:
mammoth %>%
html_elements('table.css-86hwhl') %>%
html_table(header= TRUE, convert = TRUE)
gives a list of datasets
[[1]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[2]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[3]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
[[4]]
# A tibble: 4 × 3
Date Inches `Season Total to Date`
<chr> <chr> <chr>
1 November 8 "15\"" "28\""
2 November 7 "2\"" "13\""
3 November 3 "5\"" "11\""
4 November 2 "6\"" "6\""
[[5]]
# A tibble: 53 × 13
Season `Pre-Oct` Oct Nov Dec Jan Feb Mar Apr May Jun Jul Total
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl>
1 1969-70 22 0 0 41 78 30.5 46 27 0 0 0 244.
2 1970-71 60 0 0 109 29 19.5 24 14 0 0 0 256.
3 1971-72 22 0 9 140. 32.2 11 1 53.5 0 0 0 268.
4 1972-73 4 0 57.1 64.5 84.9 103 43 10 4 0 0 370.
5 1973-74 45 0 0 45 87.5 9 82 38 0 0 0 306.
6 1974-75 15 0 13 58.5 26 101 90 75 0 0 0 378.
7 1975-76 27 0 0 14.5 13.5 54 50 38.5 0 0 0 198.
8 1976-77 4 0 0 0 26 27 37 0 0 0 0 94
9 1977-78 6 0 26 98 95.5 97 85.5 78.5 1 0 0 488.
10 1978-79 6 0 29.5 51.5 102. 96 78 11.5 11.5 0 0 386.
# … with 43 more rows
# ℹ Use `print(n = ...)` to see more rows
You can then extract [[1]] and [[2]], which are the tables you are looking for, and go from there. I'm sure there's a more principled approach out there, but this should do the job.
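Once the second table is extracted, its columns are still strings with a trailing inch mark. As a rough sketch of the clean-up, the snippet below uses a stand-in tibble typed in from the printed output above (so it runs without fetching the page) and readr::parse_number() to strip the quote character; the object names `season` and `season_clean` are only illustrative.

```r
library(dplyr)
library(readr)

# Stand-in for the second element of the list; values copied from the
# printed output above so the sketch runs offline.
season <- tibble(
  Date = c("November 8", "November 7", "November 3", "November 2"),
  Inches = c("15\"", "2\"", "5\"", "6\""),
  `Season Total to Date` = c("28\"", "13\"", "11\"", "6\"")
)

# parse_number() drops the non-numeric characters around each number
season_clean <- season %>%
  mutate(across(c(Inches, `Season Total to Date`), parse_number))

# Season total so far: either sum the daily values or take the running maximum
season_total <- sum(season_clean$Inches)
season_total
```

From here the total can be reshaped into a row matching the columns of the extended history table.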
I can't upload the file to Stack Overflow, but I have a PDF containing a table spanning 3 pages. After using library(pdftools) and pdf_text(), I get a 3-element character vector where each element is one long string of all the text from a page.
library(pdftools)
df <- pdf_text("file.pdf")
The data I need is on the 2nd page. I get the output:
df[2]
All Households 19,015 10,030 8,985 3,635 585 3,055 19.1 5.8 34.0\n\nHousing above standards 12,365 8,225 4,145 0 0 0 0.0 0.0 0.0\n\nBelow one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n\nBelow affordability standard12 4,885 1,230 3,660 3,125 535 2,590 64.0 43.5 70.8\n\nBelow adequacy standard13 1,360 555 810 425 75 350 31.2 13.5 43.2\n\n\n\n\n
I want to isolate the row "Below one or more housing standards" and the 8th column which contains the value "54.7".
I believe the next steps are to split the long string into lines by the line break character "\n", identify the applicable line, split the line into words, and select the 8th word.
I've tried splitting into lines using:
library(stringr)
lines <- df[2] %>% str_split("\n")
It returns a "List of 1" and I'm not sure how to work with it. Any suggestions on the syntax?
It's a bit convoluted to get to the original file.
https://www03.cmhc-schl.gc.ca/hmip-pimh/en/#Profile/126504/5/Alta%20Vista
Core Housing Need -> Full Report -> Export.
Oddly there isn't a way to just download a CSV.
Use readLines, which doesn't use the scan(text = ...) pathway and therefore needs textConnection.
library(pdftools)
#Using poppler version 0.62.0
df <- pdf_text("Downloads/TableExport.pdf")
str(df)
# chr [1:3] "Core Housing Need (2016 Statistics Canada's Census) — Alta Vista\n H "| __truncated__ ...
# for each page read in with readLines to make character vectors
# separated by \n
lines <- lapply(df, function(t) readLines( textConnection(t)) )
Then search for the line with the target:
lines[[2]][grep("Below one or more housing standards", lines[[2]])]
[1] "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1"
If you assigned that value to the name target you could get the 8th column with this rather baroque regex:
sub("(Below one or more housing standards)([ ]*\\d*[,]*\\d*){6}[ ]*(\\d*[.]*\\d*)(.*)", "\\3", target)
#[1] "54.7"
Notice the need to allow commas and decimal points in the numeric specifications. As written it may not be totally general since the first six of the numeric columns are only allowed to have commas and not decimals. I guess you could allow a character class like "[.,]" to be more general. Or even: "([ ]*\\d*[,]*\\d+[.]*\\d*){6}" (lightly tested). I suspect there are packages that will handle tabular pdf formatting in a more principled manner.
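As an alternative to the regex, the plan described in the question (split into lines, find the line, split into words) can be sketched in base R. This assumes the row label never contains a bare number and that every row ends with exactly nine numeric fields, so the fields are counted from the end of the line; the `page` string is a shortened copy of the output shown in the question.

```r
# Shortened stand-in for df[2], copied from the question's output
page <- paste0(
  "Housing above standards 12,365 8,225 4,145 0 0 0 0.0 0.0 0.0\n\n",
  "Below one or more housing standards 6,650 1,805 4,845 3,640 585 3,055 54.7 32.4 63.1\n\n",
  "Below adequacy standard13 1,360 555 810 425 75 350 31.2 13.5 43.2\n"
)

# strsplit() returns a list with one element per input string; [[1]] unwraps it
lines <- strsplit(page, "\n")[[1]]

# Keep the line that contains the row label
target <- grep("Below one or more housing standards", lines,
               value = TRUE, fixed = TRUE)

# Split on runs of whitespace and keep the last nine tokens (the numeric fields)
tokens <- strsplit(trimws(target), "\\s+")[[1]]
values <- tail(tokens, 9)

# The 8th column of the table is the 7th numeric field; drop thousands separators
result <- as.numeric(gsub(",", "", values[7]))
result
```

Counting from the end sidesteps the fact that the label itself spans several words.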
This does not use pdftools, but I hope it is helpful to you. First, use the rvest package to read the URL of this table, then use html_table to extract it into a table. After that, some manual manipulation is needed.
library(tidyverse)
library(rvest)
url = "https://www03.cmhc-schl.gc.ca/hmip-pimh/en/Profile/DetailsCoreHousingNeed?geographyId=126504&t=5"
# Read the url
doc = rvest::read_html(url)
# Extract the table, and provide anonymous V<x> names
table = rvest::html_table(doc)[[1]]
names(table) = paste0("V",1:ncol(table))
# drop the first two rows
table <- table %>% filter(row_number()>2)
# Manually, identify the split rows (i.e. subheadings)
split_rows = c(1,9,24,32,36,40,44,48,55,62)
# Extract the subheadings
sub_table_names = table %>% filter(row_number() %in% split_rows) %>% pull(V1)
# Now, use lapply to filter the rows that are between the splits, and use as.numeric and str_remove_all to convert to numeric values
tables = lapply(seq_along(split_rows), function(x) {
table %>%
filter(between(row_number(), split_rows[x]+1, split_rows[x+1]-1 )) %>%
mutate(across(V2:V10, ~as.numeric(str_remove_all(.x,","))))
})
# Name the list of tables
names(tables) <- sub_table_names
Output:
$`Age of primary household maintainer3`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 15 to 24 years 1030 45 980 220 0 220 21.4 0 22.4
3 25 to 34 years 2700 715 1990 555 40 515 20.6 5.6 25.9
4 35 to 44 years 2795 1360 1440 545 25 520 19.5 1.8 36.1
5 45 to 54 years 3565 2005 1565 740 135 610 20.8 6.7 39
6 55 to 64 years 3535 2225 1315 615 155 455 17.4 7 34.6
7 65 years and over 5380 3685 1700 960 220 735 17.8 6 43.2
$`Household Type4`
# A tibble: 14 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Couple with children 4360 3145 1220 585 100 485 13.4 3.2 39.8
3 Couple without children 4755 3195 1555 390 70 315 8.2 2.2 20.3
4 Senior-led (65+) couple without children 2030 1695 335 140 50 90 6.9 2.9 26.9
5 Lone-parent household 2220 810 1405 845 135 710 38.1 16.7 50.5
6 Female lone-parent household 1845 660 1190 730 105 625 39.6 15.9 52.5
7 Male lone-parent household 370 155 220 115 30 85 31.1 19.4 38.6
8 Multiple-family household 265 165 100 70 20 45 26.4 12.1 45
9 One-person household 6075 2385 3685 1525 235 1290 25.1 9.9 35
10 Female one-person households 3615 1590 2025 920 135 795 25.4 8.5 39.3
11 Senior (65+) female living alone 1810 980 830 525 90 435 29 9.2 52.4
12 Male one-person household 2455 800 1660 605 105 500 24.6 13.1 30.1
13 Senior (65+) male living alone 600 350 250 170 50 120 28.3 14.3 48
14 Other non-family household 1345 330 1015 230 25 205 17.1 7.6 20.2
$`Immigrant households5`
# A tibble: 7 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Non-immigrant 12500 7115 5395 1665 230 1440 13.3 3.2 26.7
3 Non-permanent resident6 430 25 400 140 10 130 32.6 40 32.5
4 Immigrant 6085 2890 3190 1825 345 1485 30 11.9 46.6
5 Landed before 2001 4105 2480 1620 1065 275 790 25.9 11.1 48.8
6 Landed 2001 to 2010 1340 340 1000 460 55 400 34.3 16.2 40
7 Recent immigrants (landed 2011-2016)7 640 70 575 310 10 295 48.4 14.3 51.3
$`Households with seniors`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one senior (65 or older) 5910 4085 1825 1015 245 770 17.2 6 42.2
3 Other household type 13105 5945 7155 2625 340 2285 20 5.7 31.9
$`Households with children under 18`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one child less than 18 years old 4465 2455 2005 1140 170 975 25.5 6.9 48.6
3 Other household type 14550 7575 6980 2500 420 2080 17.2 5.5 29.8
$`Activity limitations8`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Household has at least one person with activity limitations 10955 5830 5120 2285 385 1895 20.9 6.6 37
3 All other households 8060 4195 3865 1360 200 1160 16.9 4.8 30
$`Aboriginal households9`
# A tibble: 3 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Aboriginal households 655 215 440 120 20 105 18.3 9.3 23.9
3 Non-Aboriginal households 18355 9815 8540 3515 565 2955 19.2 5.8 34.6
$`Incomes, shelter costs10, and STIRs11`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Average household income before taxes ($) 96464 134172 54357 29101 31212 28696 NA NA NA
2 Average monthly shelter costs ($) 1256 1408 1085 1039 1243 1000 NA NA NA
3 Average STIR before taxes (%) 24 17.2 31.5 46.8 49.7 46.2 NA NA NA
4 Median household income before taxes ($) 72502 107762 44596 27711 28437 27568 NA NA NA
5 Median monthly shelter costs ($) 1097 1193 1076 1013 1115 1006 NA NA NA
6 Median STIR before taxes (%) 19.3 14 26 43.8 45.8 43.3 NA NA NA
$`Housing standards`
# A tibble: 6 x 10
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 All Households 19015 10030 8985 3635 585 3055 19.1 5.8 34
2 Housing above standards 12365 8225 4145 0 0 0 0 0 0
3 Below one or more housing standards 6650 1805 4845 3640 585 3055 54.7 32.4 63.1
4 Below affordability standard12 4885 1230 3660 3125 535 2590 64 43.5 70.8
5 Below adequacy standard13 1360 555 810 425 75 350 31.2 13.5 43.2
6 Below suitability standard14 1480 210 1270 800 55 745 54.1 26.2 58.7
You could check whether more up-to-date 2018 data is available by following the breadcrumbs to https://www150.statcan.gc.ca/n1/pub/46-25-0001/462500012021001-eng.htm .
However, if you only want one row, it is easy to save the page source with a right click:
<tr>
<th scope="row">Below one or more housing standards</th>
<td>6,650</td>
<td>1,805</td>
<td>4,845</td>
<td>3,640</td>
<td>585</td>
<td>3,055</td>
<td>54.7</td>
<td>32.4</td>
<td>63.1</td>
</tr>
For the headings you need:
HOUSEHOLDS TESTED FOR CORE HOUSING NEED 1 HOUSEHOLDS IN CORE HOUSING NEED 2 % OF HOUSEHOLDS IN CORE HOUSING NEED
TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS TOTAL OWNERS RENTERS
and for the footnotes:
1 Data include all non-farm, non-band, non-reserve private households reporting positive incomes and shelter cost-to-income ratios less than 100 per cent.
2 A household is in core housing need if its housing does not meet one or more standards for housing adequacy (repair), suitability (crowding), or affordability and if it would have to spend 30 per cent or more of its before-tax income to pay the median rent (including utilities) of appropriately sized alternative local market housing. Adequate housing does not require any major repairs, according to residents. Suitable housing has enough bedrooms for the size and make-up of resident households. Affordable housing costs less than 30 per cent of before-tax household income.
You have a PDF and want to work with the raw text, but it's clear there is some issue with the generated searchable text. We can see that in the headings and when copying and pasting, for example "Belowone ormore housing standards". Here is the expected extraction from the bottom of page 2:
pdftotext -f 2 -l 2 -nopgbrk -simple -margint 650 tableexport.pdf -
I am trying to create a linear regression model from openintro::babies that predicts a baby's birthweight from all other variables in the data except case.
I have the following code:
library(tidyverse)
library(tidymodels)
babies <- openintro::babies %>%
drop_na() %>%
mutate(bwt = 28.3495 * bwt) %>%
mutate(weight = 0.453592 * weight)
linear_reg() %>%
set_engine("lm") %>%
fit(formula = bwt ~ ., data = babies %>% select(-case)) %>%
pluck("fit") %>%
augment(babies)
but in my output, I obtain the case variable as well
# A tibble: 1,174 x 14
case bwt gestation parity age height weight smoke .fitted .resid .hat .sigma .cooksd .std.resid
<int> <dbl> <int> <int> <int> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3402. 284 0 27 62 45.4 0 3459. -56.8 0.00374 449. 0.00000863 -0.127
2 2 3203. 282 0 33 64 61.2 0 3547. -344. 0.00227 449. 0.000191 -0.767
3 3 3629. 279 0 28 64 52.2 1 3244. 385. 0.00291 449. 0.000307 0.858
4 5 3062. 282 0 23 67 56.7 1 3396. -334. 0.00475 449. 0.000379 -0.746
5 6 3856. 286 0 25 62 42.2 0 3474. 381. 0.00495 449. 0.000515 0.851
6 7 3912. 244 0 33 62 80.7 0 3065. 848. 0.0137 448. 0.00715 1.90
7 8 3742. 245 0 23 65 63.5 0 3124. 618. 0.00716 449. 0.00197 1.38
8 9 3402. 289 0 25 62 56.7 0 3558. -156. 0.00301 449. 0.0000521 -0.348
9 10 4054. 299 0 30 66 61.7 1 3591. 463. 0.00462 449. 0.000710 1.03
10 11 3969. 351 0 27 68 54.4 0 4527. -558. 0.0221 449. 0.00510 -1.26
# ... with 1,164 more rows
I'm not sure whether this is the correct approach or whether the extra column is inherent in the output.
Your code is correct. You're getting the case column because of the augment(babies) call; if you replace it with augment(babies %>% select(-case)), you won't get that column. In other words, the regression model you're fitting does not take the case column into account.
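The same behaviour can be reproduced on a built-in data set. The sketch below is an illustration only: it swaps in mtcars with an artificial `id` column playing the role of case, and uses a plain lm() fit with broom::augment() rather than the tidymodels pipeline from the question.

```r
library(dplyr)
library(broom)

# mtcars with an artificial identifier column standing in for `case`
cars <- mtcars %>% mutate(id = row_number())

# The identifier is excluded from the model, as in the question
fit <- lm(mpg ~ ., data = cars %>% select(-id))

# augment() returns whatever columns are in the data it is given
with_id    <- augment(fit, data = cars)
without_id <- augment(fit, data = cars %>% select(-id))

"id" %in% names(with_id)
"id" %in% names(without_id)
```

The fitted values are the same in both results; only the columns carried along from the supplied data differ.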
Question
I use time-series data regularly. Sometimes, I would like to transmute an entire data frame to obtain some data frame of growth rates, or shares, for example.
When using transmute this is relatively straight-forward. But when I have a lot of columns to transmute and I want to keep the date column, I'm not sure if that's possible.
Below, using the economics data set, is an example of what I mean.
Example
library(dplyr)
economics %>%
transmute(date,
pce * 10,
pop * 10,
psavert * 10)
# A tibble: 574 x 4
date `pce * 10` `pop * 10` `psavert * 10`
<date> <dbl> <dbl> <dbl>
1 1967-07-01 5067 1987120 126
2 1967-08-01 5098 1989110 126
3 1967-09-01 5156 1991130 119
4 1967-10-01 5122 1993110 129
5 1967-11-01 5174 1994980 128
6 1967-12-01 5251 1996570 118
7 1968-01-01 5309 1998080 117
8 1968-02-01 5336 1999200 123
9 1968-03-01 5443 2000560 117
10 1968-04-01 5440 2002080 123
# ... with 564 more rows
Now, using transmute_at. The below predictably removes date in the .vars argument, but I haven't found a way of removing date and reintroducing it in .funs such that the resulting data frame looks as it does above. Any ideas?
economics %>%
transmute_at(.vars = vars(-c(date, uempmed, unemploy)),
.funs = list("trans" = ~ . * 10))
# A tibble: 574 x 3
pce_trans pop_trans psavert_trans
<dbl> <dbl> <dbl>
1 5067 1987120 126
2 5098 1989110 126
3 5156 1991130 119
4 5122 1993110 129
5 5174 1994980 128
6 5251 1996570 118
7 5309 1998080 117
8 5336 1999200 123
9 5443 2000560 117
10 5440 2002080 123
# ... with 564 more rows
We can use if/else inside the function.
library(dplyr)
library(ggplot2)
data(economics)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .)
# A tibble: 574 x 4
# date pce pop psavert
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we need to change the column names selectively, we can do this after the transmute_at:
library(stringr)
economics %>%
transmute_at(vars(date:psavert), ~ if(is.numeric(.)) .* 10 else .) %>%
rename_at(vars(-date), ~ str_c(., '_trans'))
# A tibble: 574 x 4
# date pce_trans pop_trans psavert_trans
# <date> <dbl> <dbl> <dbl>
# 1 1967-07-01 5067 1987120 126
# 2 1967-08-01 5098 1989110 126
# 3 1967-09-01 5156 1991130 119
# 4 1967-10-01 5122 1993110 129
# 5 1967-11-01 5174 1994980 128
# 6 1967-12-01 5251 1996570 118
# 7 1968-01-01 5309 1998080 117
# 8 1968-02-01 5336 1999200 123
# 9 1968-03-01 5443 2000560 117
#10 1968-04-01 5440 2002080 123
# … with 564 more rows
If we are changing the names of all the selected columns in transmute_at, use list(trans = ...):
economics %>%
transmute_at(vars(date:psavert), list(trans = ~if(is.numeric(.)) .* 10 else .))
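In dplyr 1.0.0 and later, the scoped verbs such as transmute_at() are superseded by across(). A possible equivalent, sketched here under the assumption that a recent dplyr is installed, keeps date as is and applies the transformation and the renaming only to the selected columns:

```r
library(dplyr)
library(ggplot2)  # loaded only for the economics data set

out <- economics %>%
  transmute(
    date,                                   # kept unchanged
    across(pce:psavert, ~ .x * 10,          # transform the selected columns
           .names = "{.col}_trans")         # and add the suffix in the same step
  )

names(out)
```

Because date is listed separately from across(), it is neither multiplied nor renamed, so no if/else on the column type is needed.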