I am adapting some code for my own needs, and it has problems. I've been able to address most of the issues, but I'm stuck on the current step. I've loaded a PDF into R and have done a series of steps to manipulate the file for text mining.
I'm now trying to split each line of text. useful is a character vector, and I've called up the 11th element.
useful[11]
>" Busti
169 425 Total 2,786 5,259 Franklin
256 410"
As you can see, there's a big space before Busti. useful[11] is the last row on the PDF page. Essentially, the first column is blank, Busti is the second column, Total is the third column, and Franklin is the fourth column of the same row.
I am then splitting useful[11] so each column is now an individual object.
split <- strsplit(useful,
                  "(?<=[0-9])\\s+(?=[A-Z])|(?<=[aA-zZ])\\s{2,}+(?=[A-Z])",
                  perl = TRUE)
split[11]
[[1]]
[1] " Busti
169 425"
[2] "Total 2,786 5,259"
[3] "Franklin 256 410"
Instead of recognizing each column as its own element, R is seeing element 1 = Busti, element 2 = Total, element 3 = Franklin, whereas I want element 1 = the blank column, element 2 = Busti, and so on.
For example, in the row above, i.e. useful[10], there is no empty column, so:
useful[10]
[1] "Total 1,399 2,915 Arkwright 154 320 Smyrna 179 319 Deposit 110 169"
So when I use the split function, I get:
split[10]
[[1]]
[1] "Total 1,399 2,915" "Arkwright 154 320" "Smyrna 179 319"
[4] "Deposit 110 169"
Could someone help me figure out how to do the proper regex to account for this issue? Thank you in advance!
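If the blank first column can simply be dropped (the answers below make the same assumption), one minimal fix is to trim each line before splitting. A sketch keeping the question's own pattern, with [aA-zZ] corrected to [A-Za-z], which is almost certainly what was meant:
split <- strsplit(trimws(useful),
                  "(?<=[0-9])\\s+(?=[A-Z])|(?<=[A-Za-z])\\s{2,}(?=[A-Z])",
                  perl = TRUE)
With the leading whitespace gone, every row splits into consistent name-plus-numbers chunks, so element 1 is always a place name.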
Here's an approach using the tidyverse and purrr:
library(tidyverse)
useful <- c(" Busti
169 425 Total 2,786 5,259 Franklin
256 410", "Total 1,399 2,915 Arkwright 154 320 Smyrna 179 319 Deposit 110 169")
map(useful, str_squish) %>%
str_split("\\s+")
# [[1]]
# [1] "Busti" "169" "425" "Total" "2,786" "5,259" "Franklin" "256" "410"
#
# [[2]]
# [1] "Total" "1,399" "2,915" "Arkwright" "154" "320" "Smyrna" "179" "319" "Deposit" "110"
# [12] "169"
Alternatively:
map(useful, str_squish) %>%
str_split("\\s+(?=[[:alpha:]])")
# [[1]]
# [1] "Busti 169 425" "Total 2,786 5,259" "Franklin 256 410"
#
# [[2]]
# [1] "Total 1,399 2,915" "Arkwright 154 320" "Smyrna 179 319" "Deposit 110 169"
And then you may want to consider...
map(useful, str_squish) %>%
str_split("\\s+(?=[[:alpha:]])") %>%
enframe %>%
unnest
# # A tibble: 7 x 2
# name value
# <int> <chr>
# 1 1 Busti 169 425
# 2 1 Total 2,786 5,259
# 3 1 Franklin 256 410
# 4 2 Total 1,399 2,915
# 5 2 Arkwright 154 320
# 6 2 Smyrna 179 319
# 7 2 Deposit 110 169
Or even...
map(useful, str_squish) %>%
str_split("\\s+(?=[[:alpha:]])") %>%
enframe %>%
unnest %>%
separate(value, c("Group", "Item1", "Item2"), sep = "\\s") %>%
mutate_at(vars(starts_with("Item")), ~ str_replace(., ",", "") %>% as.numeric)
# # A tibble: 7 x 4
# name Group Item1 Item2
# <int> <chr> <dbl> <dbl>
# 1 1 Busti 169 425
# 2 1 Total 2786 5259
# 3 1 Franklin 256 410
# 4 2 Total 1399 2915
# 5 2 Arkwright 154 320
# 6 2 Smyrna 179 319
# 7 2 Deposit 110 169
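As a side note, readr::parse_number() strips grouping commas on its own, so the last mutate step above could be written as (a sketch):
mutate_at(vars(starts_with("Item")), readr::parse_number)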
And finally, if the number of "items" is unknown or of varying length, you'll want to do something like the following and/or reference this question:
map(useful, str_squish) %>%
str_split("\\s+(?=[[:alpha:]])") %>%
enframe %>%
unnest %>%
mutate(to_sep = str_split(value, "\\s")) %>%
unnest(to_sep) %>%
group_by(value) %>%
mutate(row = row_number()) %>%
spread(row, to_sep)
# # A tibble: 7 x 5
# # Groups: value [7]
# name value `1` `2` `3`
# <int> <chr> <chr> <chr> <chr>
# 1 1 Busti 169 425 Busti 169 425
# 2 1 Franklin 256 410 Franklin 256 410
# 3 1 Total 2,786 5,259 Total 2,786 5,259
# 4 2 Arkwright 154 320 Arkwright 154 320
# 5 2 Deposit 110 169 Deposit 110 169
# 6 2 Smyrna 179 319 Smyrna 179 319
# 7 2 Total 1,399 2,915 Total 1,399 2,915
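With current tidyr, separate() can also absorb a varying number of items directly via fill = "right", which pads short rows with NA. A sketch, where the into names and the cap of three columns are assumptions:
map(useful, str_squish) %>%
  str_split("\\s+(?=[[:alpha:]])") %>%
  enframe() %>%
  unnest(value) %>%
  separate(value, into = c("Group", "Item1", "Item2"), sep = "\\s", fill = "right")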
You may want to consider breaking this off into a more specific question, especially now that you are providing the PDF and asking more directly what you are trying to achieve. That said, I'm not sure the blanks are relevant here, as you could use the following pipeline.
library(pdftools)
library(tidyverse)
text <- pdf_text("https://www.dec.ny.gov/docs/wildlife_pdf/09deerrpt.pdf")
clean_text <-
text %>%
str_squish() %>%
magrittr::extract(., 14:17) %>%
paste(collapse = " ") %>%
# First get rid of the header text
str_remove("New York State Department of Environmental.*TOTAL TAKE. ") %>%
# Now get rid of Page numbers, e.g., Page 14, Page 15
str_remove_all("Page [[:digit:]]{2}") %>%
# Get rid of the COUNTY labels since they're not going to line up anyway...
str_remove_all("[A-Z]{2,}") %>%
# Remove Totals since they won't line up...
str_remove("Statewide Totals.*") %>%
# Remove commas from numbers
str_remove_all(",") %>%
# Another squish for good measure and for some less than perfect removals above
str_squish()
clean_text %>%
# Remove the individual total lines
str_remove_all("Total\\s\\w+\\s\\w+") %>%
str_squish() %>%
str_extract_all("[A-Za-z ]+\\s\\d+\\s\\d+") %>%
unlist %>%
str_squish() %>%
data_frame(by_line = .) %>%
extract(
by_line, c("location", "adult_take", "total_take"), regex = "([A-Za-z ]+\\s?)(\\d+\\s?)(\\d+\\s?)"
) %>%
mutate(
location = str_squish(location),
adult_take = str_squish(adult_take) %>% as.numeric,
total_take = str_squish(total_take) %>% as.numeric
)
# # A tibble: 943 x 3
# location adult_take total_take
# <chr> <dbl> <dbl>
# 1 Carroll 103 215
# 2 Albany City 24 41
# 3 Allegany 115 231
# 4 Charlotte 116 248
# 5 Altona 50 87
# 6 Berne 163 292
# 7 Ashford 338 721
# 8 Chautauqua 242 613
# 9 Ausable 18 21
# 10 Bethlehem 141 280
# # ... with 933 more rows
I want to find the coordinates for a list of addresses.
I am using a data set that can be found here: "https://www.data.gv.at/katalog/dataset/kaufpreissammlung-liegenschaften-wien"
I've read this in as "data" using the read_csv function. I'm using the tidyverse and jsonlite libraries. The only relevant columns are "Straße", which is the street name, and "ON", which is the street number. The city for all of these is Vienna, Austria.
I'm using OpenStreetMap and have formatted my address data like the format requires:
data$formatted_address <- paste(ifelse(is.na(data$ON), "", data$ON), "+", tolower(data$Straße), ",+vienna", sep = "")
This formats the addresses in this column as 1+milanweg,+vienna and 12+granergasse,+vienna. When I manually input this into the API format, it all works out and I get the coordinates: https://nominatim.openstreetmap.org/search?q=1+milanweg,+vienna&format=json&polygon=1&addressdetails=1
Since I now want to do this for every row, I am using jsonlite to create the requests in R.
data$coordinates <- data.frame(lat = NA, lon = NA)

for (i in 1:nrow(data)) {
  result <- try(readLines(paste0("https://nominatim.openstreetmap.org/search?q=",
                                 URLencode(data$formatted_address[i]),
                                 "&format=json&polygon=1&addressdetails=1")),
                silent = TRUE)
  if (!inherits(result, "try-error")) {
    if (length(result) > 0) {
      result <- fromJSON(result)
      if (length(result) > 0 && is.list(result[[1]])) {
        data$coordinates[i, ] <- c(result[[1]]$lat, result[[1]]$lon)
      }
    }
  }
}
This should theoretically create the exact same API request, however, the lat and lon columns are always empty.
How can I fix this script to create a list of coordinates for each address in the data set?
Data setup
library(tidyverse)
library(httr2)
df <- df %>%
mutate(
formatted_address = str_c(
if_else(is.na(on), "", on), "+", str_to_lower(strasse), "+vienna"
) %>% str_remove_all(" ")
)
# A tibble: 57,912 × 7
kg_code katastralgemeinde ez plz strasse on formatted_address
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1617 Strebersdorf 1417 1210 Mühlweg 13 13+mühlweg+vienna
2 1607 Groß Jedlersdorf II 193 1210 Bahnsteggasse 4 4+bahnsteggasse+vienna
3 1209 Ober St.Veit 3570 1130 Jennerplatz 34/20 34/20+jennerplatz+vienna
4 1207 Lainz 405 1130 Sebastian-Brunner-Gasse 6 6+sebastian-brunner-gasse+vienna
5 1101 Favoriten 3831 1100 Laxenburger Straße 2C -2 D 2C-2D+laxenburgerstraße+vienna
6 1101 Favoriten 3827 1100 Laxenburger Straße 2 C 2C+laxenburgerstraße+vienna
7 1101 Favoriten 3836 1100 hinter Laxenburger Straße 2 C 2C+hinterlaxenburgerstraße+vienna
8 1201 Auhof 932 1130 Keplingergasse 10 10+keplingergasse+vienna
9 1213 Speising 135 1130 Speisinger Straße 29 29+speisingerstraße+vienna
10 1107 Simmering 2357 1100 BATTIGGASSE 44 44+battiggasse+vienna
# … with 57,902 more rows
# ℹ Use `print(n = ...)` to see more rows
API call and getting coordinates.
I gathered the display name matched by the API, and the lat & lon data.
get_coords <- function(address) {
cat("Getting coordinates", address, "\n")
str_c(
"https://nominatim.openstreetmap.org/search?q=",
address,
"&format=json&polygon=1&addressdetails=1"
) %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
as_tibble() %>%
select(api_name = display_name,
lat, lon) %>%
slice(1)
}
df %>%
slice_sample(n = 10) %>%
mutate(coordinates = map(
formatted_address, possibly(get_coords, tibble(
api_name = NA_character_,
lat = NA_character_,
lon = NA_character_
))
)) %>%
unnest(coordinates)
# A tibble: 10 × 10
kg_code katastralgemeinde ez plz strasse on formatted_…¹ api_n…² lat lon
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1651 Aspern 3374 1220 ERLENWEG 8 8+erlenweg+… 8, Erl… 48.2… 16.4…
2 1613 Leopoldau 6617 1210 Oswald-Redlich-Straße 31 31+oswald-r… 31, Os… 48.2… 16.4…
3 1006 Landstraße 2425 1030 HAGENMÜLLERGASSE 45018 45018+hagen… Hagenm… 48.1… 16.4…
4 1101 Favoriten 541 1100 HERNDLGASSE 7 7+herndlgas… 7, Her… 48.1… 16.3…
5 1607 Groß Jedlersdorf II 221 1210 Prager Straße 70 70+pragerst… Prager… 48.2… 16.3…
6 1006 Landstraße 1184 1030 PAULUSGASSE 2 2+paulusgas… 2, Pau… 48.1… 16.3…
7 1654 Eßling 2712 1220 KAUDERSSTRASSE 61 61+kauderss… 61, Ka… 48.2… 16.5…
8 1401 Dornbach 2476 1170 Alszeile NA +alszeile+v… Alszei… 48.2… 16.2…
9 1654 Eßling 745 1220 Kirschenallee 19 19+kirschen… 19, Ki… 48.2… 16.5…
10 1204 Hadersdorf 3139 1140 MITTLERE STRASSE NA +mittlerest… Mittle… 48.2… 16.1…
# … with abbreviated variable names ¹formatted_address, ²api_name
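One caveat: Nominatim's usage policy asks for an identifying User-Agent and at most one request per second, which matters once you run this over all 57,912 rows. httr2 can handle both in the request pipeline; a sketch (the app name and contact address are placeholders):
get_coords_polite <- function(address) {
  str_c(
    "https://nominatim.openstreetmap.org/search?q=",
    address,
    "&format=json&addressdetails=1"
  ) %>%
    request() %>%
    req_user_agent("my-geocoder (me@example.com)") %>% # placeholder contact
    req_throttle(rate = 1) %>%                         # at most one request per second
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE) %>%
    as_tibble() %>%
    select(api_name = display_name, lat, lon) %>%
    slice(1)
}
Also note that the API returns lat and lon as character, so a final mutate(across(c(lat, lon), as.numeric)) may be useful.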
I want to group by district, summing the 'incoming' values across quarters, and get the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45,000 rows and 41 variables, of which 8 are stock-type.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but only in three steps, and I think that approach is inefficient and error-prone given the data.
My approach:
basea <- df %>%
  group_by(district) %>%
  filter(quarter == 3) %>% # take only the last quarter
  summarise(across(stock, sum))

baseb <- df %>%
  group_by(district) %>%
  summarise(across(incoming, sum))

final <- full_join(basea, baseb)
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset only has 3 quarters and not 4; if that's not the case, use nth(3) instead of last().
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
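Since the real data has eight stock-type columns, across() generalizes this within the same summarise(). A sketch, assuming the stock columns share a naming pattern such as a "stock" suffix:
df %>%
  group_by(district) %>%
  summarise(across(ends_with("stock"), last),
            across(incoming, sum))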
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use-case for creating a custom function that can be QC'd and maintained easier:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as Tom Hoel's, but without using a function to subset; instead, just use []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
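Both stock[3] and last() assume the rows within each district are ordered by quarter; if that is not guaranteed, indexing by condition is safer (a sketch):
df %>%
  group_by(district) %>%
  summarise(stock = stock[quarter == 3],
            incoming = sum(incoming))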
I'll use the built-in chickwts data as an example.
Here's the data; there are 6 feed types.
> head(chickwts)
weight feed
1 179 horsebean
2 160 horsebean
3 136 horsebean
4 227 horsebean
5 217 horsebean
6 168 horsebean
> table(chickwts$feed)
casein horsebean linseed meatmeal soybean sunflower
12 10 12 11 14 12
What I want is the top rows by weight for every feed type. However, I need a different number for each feed type. For example:
top_n_feed <-
c(
"casein" = 3,
"horsebean" = 5,
"linseed" = 3,
"meatmeal" = 6,
"soybean" = 3,
"sunflower" = 2
)
How can I do this using dplyr?
To get the top n rows of each feed type by weight I can use code as below, but I'm not sure how to extend this to a different number for each feed type.
chickwts %>%
group_by(feed) %>%
slice_max(order_by = weight, n = 5)
This isn't really something that dplyr makes easy. I'd recommend merging in the data and then filtering.
tibble(feed = names(top_n_feed), topn = top_n_feed) %>%
  inner_join(chickwts) %>%
  group_by(feed) %>%
  arrange(desc(weight), .by_group = TRUE) %>%
  filter(row_number() <= topn) %>%
  select(-topn)
Any time you have a named list, think purrr::imap. Avoid joins if not required, particularly when working at scale.
library(dplyr)
library(purrr)
top_n_feed <- c(
"casein" = 3,
"horsebean" = 5,
"linseed" = 3,
"meatmeal" = 6,
"soybean" = 3,
"sunflower" = 2
)
imap_dfr(top_n_feed, ~ filter(chickwts, feed %in% .y) %>%
slice_max(order_by = weight, n = .x))
weight feed
1 404 casein
2 390 casein
3 379 casein
4 227 horsebean
5 217 horsebean
6 179 horsebean
7 168 horsebean
8 160 horsebean
9 309 linseed
10 271 linseed
11 260 linseed
12 380 meatmeal
13 344 meatmeal
14 325 meatmeal
15 315 meatmeal
16 303 meatmeal
17 263 meatmeal
18 329 soybean
19 327 soybean
20 316 soybean
21 423 sunflower
22 392 sunflower
Another way using split and map2:
library(dplyr)
library(purrr)
chickwts %>%
filter(feed %in% names(top_n_feed)) %>%
split(.$feed) %>%
map2_dfr(top_n_feed[names(.)], ~slice_max(.x, order_by = weight, n = .y))
Bring top_n_feed into the chickwts dataframe and select the top n rows for each group.
library(dplyr)
tibble::enframe(top_n_feed, name = 'feed') %>%
left_join(chickwts, by = 'feed') %>%
group_by(feed) %>%
top_n(first(value), weight)
# feed value weight
# <chr> <dbl> <dbl>
# 1 casein 3 390
# 2 casein 3 379
# 3 casein 3 404
# 4 horsebean 5 179
# 5 horsebean 5 160
# 6 horsebean 5 227
# 7 horsebean 5 217
# 8 horsebean 5 168
# 9 linseed 3 309
#10 linseed 3 260
# … with 12 more rows
For some reason I was not able to make slice_sample work for this example.
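If you want to stay with the slice family (top_n() is superseded), an arrange-plus-slice variant works on the same joined data; a sketch:
tibble::enframe(top_n_feed, name = 'feed') %>%
  left_join(chickwts, by = 'feed') %>%
  group_by(feed) %>%
  arrange(desc(weight), .by_group = TRUE) %>%
  slice(seq_len(first(value)))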
How can I scrape the data and add an additional column to show the year that it was scraped?
nba_drafts <- function(year) {
  url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")

  tables <- read_html(url) %>%
    html_nodes("#stats") %>%
    html_table() %>%
    as.tibble() %>%
    add_column(year = year)

  write.csv(tables, year, file = "nba_draftsR.csv", na = "")
}

2000:2017 %>%
  walk(function(year) {
    nba_drafts(year)
  })
Error: Column 1 must be named.
I checked your code; the error is happening at the step highlighted in the code below.
tables <- read_html(url) %>%
  html_nodes("#stats") %>%
  html_table() %>%
  as.tibble() # error is happening at this step
Debug Step:
The reason for this error is that the first three column names are blank (""), which you need to assign first; only then can you convert to a tibble or data frame.
tables <- read_html(url) %>%
  html_nodes("#stats") %>%
  html_table() %>%
  purrr::simplify() %>%
  first()
names(tables)
[1] "" "" "" "Round 1" "Round 1" "" "Totals" "Totals" "Totals" "Totals" "Totals"
[12] "Shooting" "Shooting" "Shooting" "Per Game" "Per Game" "Per Game" "Per Game" "Advanced" "Advanced" "Advanced" "Advanced"
I have added a for loop to update the names
nba_drafts <- function(year) {
  url <- glue("https://www.basketball-reference.com/draft/NBA_{year}.html")

  tables <- read_html(url) %>%
    html_nodes("#stats") %>%
    html_table() %>%
    purrr::simplify() %>%
    first()

  oldName <- names(tables)

  # updating names with col_
  for (i in 1:length(oldName)) {
    oldName[i] <- paste0("col_", i, oldName[i])
  }
  names(tables) <- oldName

  tables <- tables %>%
    as.tibble() %>%
    add_column(year = year)

  return(tables)
}
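The renaming loop can also be collapsed into one vectorized call that produces the same col_ prefixes (a sketch):
names(tables) <- paste0("col_", seq_along(tables), names(tables))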
Output:
> nba_drafts("2019")
# A tibble: 63 x 23
col_1 col_2 col_3 `col_4Round 1` `col_5Round 1` col_6 col_7Totals col_8Totals col_9Totals col_10Totals col_11Totals
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Rk Pk Tm Player College Yrs G MP PTS TRB AST
2 1 1 NOP Zion Williams… Duke 1 19 565 448 129 41
3 2 2 MEM Ja Morant Murray State 1 59 1771 1041 208 409
4 3 3 NYK RJ Barrett Duke 1 56 1704 803 279 143
5 4 4 LAL De'Andre Hunt… Virginia 1 63 2018 778 286 112
6 5 5 CLE Darius Garland Vanderbilt 1 59 1824 728 111 229
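Since the function now returns a tibble instead of writing inside the loop, the original walk() plus write.csv() (which overwrote the same file on every iteration) can be replaced by binding all years once; a sketch:
all_drafts <- purrr::map_dfr(2000:2017, nba_drafts)
write.csv(all_drafts, file = "nba_draftsR.csv", na = "", row.names = FALSE)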
I am working with the dataframe 'by_class_survival', trying to convert it to another format by swapping the rows and columns and applying conditions along the way. I have already solved this in a very rustic way, but I am wondering whether there is a better way to transpose columns and rows, plus add conditions at the moment of creating the transposition.
library(dplyr)
titanic_tbl <- dplyr::tbl_df(Titanic)
titanic_tbl <- titanic_tbl %>%
mutate_at(vars(Class:Survived), funs(factor))
by_class_survival <- titanic_tbl %>%
group_by(Class, Survived) %>%
summarize(Count = sum(n))
Original dataframe
# Class Survived Count
# 1 1st No 122
# 2 1st Yes 203
# 3 2nd No 167
# 4 2nd Yes 118
# 5 3rd No 528
# 6 3rd Yes 178
# 7 Crew No 673
# 8 Crew Yes 212
Creating a new dataframe based on the values from by_class_survival
first <- c(122,203)
second <- c(167, 118)
third <- c(528,178)
crew <- c(673,212)
titanic.df = data.frame(first,second,third,crew)
library(data.table)
t_titanic.df <- transpose(titanic.df)
rownames(t_titanic.df) <- colnames(titanic.df)
colnames(t_titanic.df) <- c("No survivor", "Survivor")
Expected result
## No survivor Survivor
## first 122 203
## second 167 118
## third 528 178
## crew 673 212
Is there a better way to reach the expected result?
You can do it in one step with reshape2::dcast:
library(reshape2)
library(dplyr)
titanic_tbl %>%
dcast(Class ~ Survived, value.var = "n", sum)
Class No Yes
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
or you can use tidyr::spread on the summarised data frame:
library(tidyr)
titanic_tbl %>%
group_by(Class, Survived) %>%
summarise(sum = sum(n)) %>%
spread(Survived, sum)
# A tibble: 4 x 3
# Groups: Class [4]
Class No Yes
<chr> <dbl> <dbl>
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
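For what it's worth, spread() has since been superseded by pivot_wider(), so the same reshape in current tidyr would be (a sketch):
titanic_tbl %>%
  group_by(Class, Survived) %>%
  summarise(sum = sum(n)) %>%
  pivot_wider(names_from = Survived, values_from = sum)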