I want to find the coordinates for a list of addresses.
I am using a data set that can be found here: "https://www.data.gv.at/katalog/dataset/kaufpreissammlung-liegenschaften-wien"
I've imported this using the read_csv function as "data". I'm using the tidyverse and jsonlite libraries. The only relevant columns are "Straße", which is the street name, and "ON", which is the street number. The city for all of these is Vienna, Austria.
I'm using OpenStreetMap and have formatted my address data like the format requires:
data$formatted_address <- paste(ifelse(is.na(data$ON), "", data$ON), "+", tolower(data$Straße), ",+vienna", sep = "")
This formats the addresses in this column as 1+milanweg,+vienna and 12+granergasse,+vienna. When I manually insert one of these into the API URL, it all works out and I get the coordinates: https://nominatim.openstreetmap.org/search?q=1+milanweg,+vienna&format=json&polygon=1&addressdetails=1
Since I now want to do this for every row, I am using jsonlite to create the requests in R.
data$coordinates <- data.frame(lat = NA, lon = NA)
for (i in 1:nrow(data)) {
result <- try(readLines(paste0("https://nominatim.openstreetmap.org/search?q=",
URLencode(data$formatted_address[i]), "&format=json&polygon=1&addressdetails=1")),
silent = TRUE)
if (!inherits(result, "try-error")) {
if (length(result) > 0) {
result <- fromJSON(result)
if (length(result) > 0 && is.list(result[[1]])) {
data$coordinates[i, ] <- c(result[[1]]$lat, result[[1]]$lon)
}
}
}
}
This should theoretically create the exact same API request; however, the lat and lon columns are always empty.
How can I fix this script to create a list of coordinates for each address in the data set?
Data setup
library(tidyverse)
library(httr2)
df <- df %>%
mutate(
formatted_address = str_c(
if_else(is.na(on), "", on), "+", str_to_lower(strasse), "+vienna"
) %>% str_remove_all(" ")
)
# A tibble: 57,912 × 7
kg_code katastralgemeinde ez plz strasse on formatted_address
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 1617 Strebersdorf 1417 1210 Mühlweg 13 13+mühlweg+vienna
2 1607 Groß Jedlersdorf II 193 1210 Bahnsteggasse 4 4+bahnsteggasse+vienna
3 1209 Ober St.Veit 3570 1130 Jennerplatz 34/20 34/20+jennerplatz+vienna
4 1207 Lainz 405 1130 Sebastian-Brunner-Gasse 6 6+sebastian-brunner-gasse+vienna
5 1101 Favoriten 3831 1100 Laxenburger Straße 2C -2 D 2C-2D+laxenburgerstraße+vienna
6 1101 Favoriten 3827 1100 Laxenburger Straße 2 C 2C+laxenburgerstraße+vienna
7 1101 Favoriten 3836 1100 hinter Laxenburger Straße 2 C 2C+hinterlaxenburgerstraße+vienna
8 1201 Auhof 932 1130 Keplingergasse 10 10+keplingergasse+vienna
9 1213 Speising 135 1130 Speisinger Straße 29 29+speisingerstraße+vienna
10 1107 Simmering 2357 1100 BATTIGGASSE 44 44+battiggasse+vienna
# … with 57,902 more rows
# ℹ Use `print(n = ...)` to see more rows
API call and getting coordinates.
I gathered the display name matched by the API, and the lat & lon data.
get_coords <- function(address) {
cat("Getting coordinates", address, "\n")
str_c(
"https://nominatim.openstreetmap.org/search?q=",
address,
"&format=json&polygon=1&addressdetails=1"
) %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
as_tibble() %>%
select(api_name = display_name,
lat, lon) %>%
slice(1)
}
df %>%
slice_sample(n = 10) %>%
mutate(coordinates = map(
formatted_address, possibly(get_coords, tibble(
api_name = NA_character_,
lat = NA_character_,
lon = NA_character_
))
)) %>%
unnest(coordinates)
# A tibble: 10 × 10
kg_code katastralgemeinde ez plz strasse on formatted_…¹ api_n…² lat lon
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 1651 Aspern 3374 1220 ERLENWEG 8 8+erlenweg+… 8, Erl… 48.2… 16.4…
2 1613 Leopoldau 6617 1210 Oswald-Redlich-Straße 31 31+oswald-r… 31, Os… 48.2… 16.4…
3 1006 Landstraße 2425 1030 HAGENMÜLLERGASSE 45018 45018+hagen… Hagenm… 48.1… 16.4…
4 1101 Favoriten 541 1100 HERNDLGASSE 7 7+herndlgas… 7, Her… 48.1… 16.3…
5 1607 Groß Jedlersdorf II 221 1210 Prager Straße 70 70+pragerst… Prager… 48.2… 16.3…
6 1006 Landstraße 1184 1030 PAULUSGASSE 2 2+paulusgas… 2, Pau… 48.1… 16.3…
7 1654 Eßling 2712 1220 KAUDERSSTRASSE 61 61+kauderss… 61, Ka… 48.2… 16.5…
8 1401 Dornbach 2476 1170 Alszeile NA +alszeile+v… Alszei… 48.2… 16.2…
9 1654 Eßling 745 1220 Kirschenallee 19 19+kirschen… 19, Ki… 48.2… 16.5…
10 1204 Hadersdorf 3139 1140 MITTLERE STRASSE NA +mittlerest… Mittle… 48.2… 16.1…
# … with abbreviated variable names ¹formatted_address, ²api_name
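When scaling this up to all ~58,000 rows, two details may matter: street names with umlauts need to be URL-encoded, and the public Nominatim usage policy asks for an identifying User-Agent and no more than one request per second. A minimal sketch of how the request could be built with httr2 so that both are handled is shown below. The helper name build_nominatim_request and the User-Agent string are placeholders, not part of the answer above, and the request is only built here, not sent.

```r
library(httr2)

# Hypothetical helper: builds (but does not send) a Nominatim search request.
build_nominatim_request <- function(address) {
  request("https://nominatim.openstreetmap.org/search") |>
    # req_url_query() URL-encodes the values, so spaces and umlauts are safe
    req_url_query(q = address, format = "json", addressdetails = 1) |>
    # Placeholder User-Agent: replace with something identifying your script
    req_user_agent("my-geocoding-script (me@example.com)") |>
    # Limit to at most one request per second
    req_throttle(rate = 1)
}

req <- build_nominatim_request("13 M\u00fchlweg, Vienna")
req$url  # inspect the encoded URL before sending anything

# To actually send it (needs network access):
# coords <- req |> req_perform() |> resp_body_json(simplifyVector = TRUE)
```

Passing the plain address as the q parameter also means the manual "+" formatting is no longer needed, since the encoding is done by req_url_query().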
I want to group by district, summing the 'incoming' values over the quarters and taking the value of 'stock' in the last quarter (3), in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45.000 rows and 41 variables of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but it takes three steps, and I think that is inefficient and error-prone given the data.
My approach:
basea <- df %>%
  group_by(district) %>%
  filter(quarter == 3) %>% # take only the last quarter
  summarise(across(stock, sum))
baseb <- df %>%
  group_by(district) %>%
  summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset has only 3 quarters, not 4. If that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
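Since the real data has several stock-type and several flow-type variables, the same idea can be extended with across(). The sketch below assumes the column names are collected in two character vectors, stock_cols and flow_cols (both hypothetical names; with the example data each holds a single column), and that rows are already ordered by quarter within each district.

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Hypothetical: list every stock-type and every flow-type column here
stock_cols <- c("stock")
flow_cols  <- c("incoming")

result <- df %>%
  group_by(district) %>%
  summarise(
    across(all_of(stock_cols), last), # stocks: take the last quarter's value
    across(all_of(flow_cols), sum)    # flows: sum over the quarters
  )

result
```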
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function, which is easier to QC and maintain:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but without using a function to subset; instead, just use []:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
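Both last(stock) and stock[3] depend on the row order within each district. If the ordering cannot be guaranteed, one option is to subset by the quarter value itself. The sketch below shuffles the example data first (the shuffling is only there to illustrate the point) and still returns the last quarter's stock.

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Shuffle the rows to show that the result no longer depends on row order
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

result <- shuffled %>%
  group_by(district) %>%
  summarise(
    stock    = stock[quarter == max(quarter)], # value in the latest quarter
    incoming = sum(incoming)
  )

result
```

This assumes each district has exactly one row for its latest quarter.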
I have an overview page of student statistics https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/index.html and each semester has specific information in a html table element, e.g. https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/2013-Wintersemester/index.html
I would like to scrape all information and put it together as a dataframe. I manually created a char vector of all URLs (perhaps there is another way).
Edit: As was mentioned, some URL parts are capitalized and some are not. This list should be complete.
winters <- seq(from=2013, to=2021)
summers <- seq(from=2014, to=2022)
winters <- paste0(winters, "-wintersemester")
summers <- paste0(summers, "-Sommersemester")
all_terms <- c(rbind(winters, summers))
all_terms[1] <- "2013-Wintersemester"
all_terms[3] <- "2014-Wintersemester"
all_url <- paste0("https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/", all_terms, "/index.html")
I can get data for a single page
all_url[1] %>%
read_html() %>%
html_table() %>%
as.data.frame()
Studierende gesamt 6645
weiblich 4206
männlich 2439
Deutsche 5001
Ausländer/innen 1644
1. Fachsemester 1783
1. Hochschulsemester 1110
But I fail to write a for loop.
tables <- list()
index <- 1
for(i in length(all_url)){
table <- all_url[i] %>%
read_html() %>%
html_table()
tables[index] <- table
index <- index + 1
}
df <- do.call("rbind", tables)
It would be great to have a dataframe with each sub-page (semester / year) as rows and all student data as columns.
Some pages appear not to be available. You could solve this using tryCatch and substitute NA for them.
library(rvest)
tables <- lapply(all_url, \(x) tryCatch(as.data.frame(html_table(read_html(x))),
error=\(e) NA)) |> setNames(all_terms)
tail(tables, 3)
# $`2021-Sommersemester`
# X1 X2
# 1 Studierende gesamt 5131
# 2 weiblich 3037
# 3 männlich 2054
# 4 Deutsche 3698
# 5 Ausländer/innen 1433
# 6 1. Fachsemester 394
# 7 1. Hochschulsemester 143
#
# $`2021-Wintersemester`
# [1] NA
#
# $`2022-Sommersemester`
# X1 X2
# 1 Studierende gesamt 4851
# 2 weiblich 2847
# 3 männlich 2004
# 4 Deutsche 3360
# 5 Ausländer/innen 1491
# 6 1. Fachsemester 403
# 7 1. Hochschulsemester 189
Thereafter you may want to rbind the non-missings,
na <- is.na(tables)
tables[!na] <- Map(`[<-`, tables[!na], 'sem', value=substr(all_terms[!na], 1, 6)) ## add year column*
res <- do.call(rbind, tables[!is.na(tables)])
head(res)
# X1 X2 sem
# 2013-Wintersemester.1 Studierende gesamt 6645 2013-W
# 2013-Wintersemester.2 weiblich 4206 2013-W
# 2013-Wintersemester.3 männlich 2439 2013-W
# 2013-Wintersemester.4 Deutsche 5001 2013-W
# 2013-Wintersemester.5 Ausländer/innen 1644 2013-W
# 2013-Wintersemester.6 1. Fachsemester 1783 2013-W
*better use sapply(strsplit(substr(all_terms[!na], 1, 6), '-'), \(x) paste(rev(x), collapse='_')) here to get valid names
and reshape the data.
reshape2::dcast(res, X1 ~ sem, value.var='X2')
# X1 2013-W 2014-S 2014-W 2015-S 2016-S 2017-S 2018-S 2019-S 2020-S 2021-S 2022-S
# 1 1. Fachsemester 1783 567 1600 557 613 693 810 611 405 394 403
# 2 1. Hochschulsemester 1110 199 1020 224 240 217 273 214 78 143 189
# 3 Ausländer/innen 1644 1510 1649 1501 1576 1613 1551 1527 1369 1433 1491
# 4 Deutsche 5001 4836 4843 4599 4682 4733 4821 4523 4040 3698 3360
# 5 männlich 2439 2347 2394 2255 2292 2388 2468 2388 2197 2054 2004
# 6 Studierende gesamt 6645 6346 6492 6100 6258 6346 6372 6051 5409 5131 4851
# 7 weiblich 4206 3999 4098 3845 3966 3958 3904 3663 3212 3037 2847
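As for why the loop in the question collects only one table: for(i in length(all_url)) iterates over a single value (the length itself) rather than over 1, 2, ..., and tables[index] <- table assigns with single brackets instead of [[ ]]. A minimal sketch of a corrected loop is shown below. The two HTML snippets are hypothetical stand-ins for the downloaded pages (with values taken from the tables above), so that the sketch runs offline; with real pages, read_html(all_url[i]) would take the place of minimal_html().

```r
library(rvest)

# Hypothetical stand-ins for two semester pages, so the sketch runs offline
pages <- list(
  "2013-Wintersemester" = "<table>
    <tr><td>Studierende gesamt</td><td>6645</td></tr>
    <tr><td>weiblich</td><td>4206</td></tr></table>",
  "2014-Sommersemester" = "<table>
    <tr><td>Studierende gesamt</td><td>6346</td></tr>
    <tr><td>weiblich</td><td>3999</td></tr></table>"
)

tables <- vector("list", length(pages))
for (i in seq_along(pages)) {               # seq_along() gives 1, 2, ...
  tab <- html_table(minimal_html(pages[[i]]))[[1]]
  tab$term <- names(pages)[i]               # remember which page it came from
  tables[[i]] <- tab                        # [[ ]] stores the whole table
}

df <- do.call(rbind, tables)
df
```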
Here's a tidyverse approach. Note that I only use links 1:4 because there's something off with (some of) the others.
library(rvest)
library(tidyverse)
# gather terms and corresponding urls
data <- tibble(
term = all_terms,
url = paste0(
"https://www.europa-uni.de/de/struktur/verwaltung/dezernat_1/statistiken/",
all_terms, "/index.html"
)
)[1:4,]
# map over `data`, scrape table, rename variables and bind the results
map(1:nrow(data), ~ {
data$url[.x] %>%
read_html() %>%
html_table() %>%
`[[`(., 1) %>%
mutate(term = data$term[.x]) %>%
rename(Kategorie = X1, Anzahl = X2)
}) %>%
bind_rows()
Result:
# A tibble: 28 × 3
Kategorie Anzahl term
<chr> <int> <chr>
1 Studierende gesamt 6645 2013-Wintersemester
2 weiblich 4206 2013-Wintersemester
3 männlich 2439 2013-Wintersemester
4 Deutsche 5001 2013-Wintersemester
5 Ausländer/innen 1644 2013-Wintersemester
6 1. Fachsemester 1783 2013-Wintersemester
7 1. Hochschulsemester 1110 2013-Wintersemester
8 Studierende gesamt 6346 2014-Sommersemester
9 weiblich 3999 2014-Sommersemester
10 männlich 2347 2014-Sommersemester
...
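The question asks for one row per semester with the student figures as columns. Starting from a long result like the one above, tidyr::pivot_wider() is one way to get there. The sketch below rebuilds a small excerpt of the result by hand (values copied from the output above) so that it runs without scraping.

```r
library(tidyr)
library(tibble)

# Small hand-typed excerpt of the long result shown above
long <- tribble(
  ~Kategorie,           ~Anzahl, ~term,
  "Studierende gesamt", 6645L,   "2013-Wintersemester",
  "weiblich",           4206L,   "2013-Wintersemester",
  "männlich",           2439L,   "2013-Wintersemester",
  "Studierende gesamt", 6346L,   "2014-Sommersemester",
  "weiblich",           3999L,   "2014-Sommersemester",
  "männlich",           2347L,   "2014-Sommersemester"
)

# One row per term, one column per category
wide <- pivot_wider(long, names_from = Kategorie, values_from = Anzahl)
wide
```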
All,
Thanks in advance. I have this school dataset. Each category (in the Category column) covers a range of student numbers (e.g., from 30 to 60 students), so I need to calculate:
the total number of classrooms that fall in each category (from category 1 to category 4), and
the percentage of classrooms that fall in the category.
For example, how many classrooms (NumOfClassrooms column) fall in Category_4, and what's the percentage of those classrooms to the total classrooms? Here is an illustrative example for my question:
ID = 1:1050
District = rep(c("AR", "CO", "AL", "KS", "IN", "ME", "KY", "ME", "MN", "NJ"), times = c(80, 120, 100, 110, 120, 100, 100, 120, 100, 100))
schoolName = randomNames::randomNames(1050, ethnicity = 5 ,which.names = "last")
Grade = rep(c("First", "Second", "Third", "Fourth"), times = c(400, 300, 200, 150))
NumOfClassrooms = sample(1:6)
StudentNumber = sample(1:90, 5)
AverageNumOfStudents = StudentNumber/NumOfClassrooms
Category = ifelse(AverageNumOfStudents > 0 & AverageNumOfStudents < 10, "category_1",
ifelse(AverageNumOfStudents >=10 & AverageNumOfStudents < 30, "category_2",
ifelse(AverageNumOfStudents >=30 & AverageNumOfStudents <= 60, "category_3",
ifelse(AverageNumOfStudents > 60 , "category_4", "NA"))))
dat = data.frame(ID, schoolName, Grade, NumOfClassrooms, StudentNumber, AverageNumOfStudents, Category)
Finally, I need to divide the results based on the "District" column into separate excel files using the following code (it should work fine once I get the above two steps).
Final_Divide = Final_df %>%
dplyr::group_by(District) %>%
dplyr::ungroup()
list_data <- split(Final_Divide,
Final_Divide$District)
options(digits=3)
Map(openxlsx::write.xlsx, list_data, paste0(names(list_data), '.xlsx'))
Thank you very much in advance.
Setting a random seed before your code for reproducibility:
set.seed(42)
# Your code creating dat
Table1 <- xtabs(NumOfClassrooms~Category, dat)
Table1
# Category
# category_1 category_2 category_4
# 1925 1575 175
Table2 <- prop.table(Table1)
round(Table2, 4) # Proportions
# Category
# category_1 category_2 category_4
# 0.5238 0.4286 0.0476
round(Table2 * 100, 2) # Percent
# Category
# category_1 category_2 category_4
# 52.38 42.86 4.76
If we include District in dat:
dat <- data.frame(ID, District, schoolName, Grade, NumOfClassrooms, StudentNumber, AverageNumOfStudents, Category)
Table3 <- xtabs(NumOfClassrooms~District+Category, dat)
addmargins(Table3)
# Category
# District category_1 category_2 category_4 Sum
# AL 187 149 16 352
# AR 143 121 14 278
# CO 220 180 20 420
# IN 220 180 20 420
# KS 198 166 19 383
# KY 187 148 17 352
# ME 407 329 36 772
# MN 176 153 17 346
# NJ 187 149 16 352
# Sum 1925 1575 175 3675
For row percentages by District:
round(prop.table(Table3, 1) * 100, 2)
# Category
# District category_1 category_2 category_4
# AL 53.12 42.33 4.55
# AR 51.44 43.53 5.04
# CO 52.38 42.86 4.76
# IN 52.38 42.86 4.76
# KS 51.70 43.34 4.96
# KY 53.12 42.05 4.83
# ME 52.72 42.62 4.66
# MN 50.87 44.22 4.91
# NJ 53.12 42.33 4.55
Here's a possible solution using the tidyverse:
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100)
This will give us:
# Groups: Category [3]
ID District schoolName Grade NumOfClassrooms StudentNumber AverageNumOfStude~ Category `Total Classroom~ `Number of Classrooms in~ `Category Percent~
<int> <chr> <chr> <chr> <int> <int> <dbl> <chr> <int> <int> <dbl>
1 1 AR Svyatetskiy First 5 87 17.4 category~ 1050 525 50
2 2 AR Booco First 1 79 79 category~ 1050 175 16.7
3 3 AR Jones First 6 49 8.17 category~ 1050 350 33.3
4 4 AR Sapkin First 3 5 1.67 category~ 1050 350 33.3
5 5 AR Fosse First 2 35 17.5 category~ 1050 525 50
6 6 AR Vanwagenen First 4 87 21.8 category~ 1050 525 50
7 7 AR Orth First 5 79 17.4 category~ 1050 525 50
8 8 AR Moline First 1 49 79 category~ 1050 175 16.7
9 9 AR Bradford First 6 5 8.17 category~ 1050 350 33.3
10 10 AR Wollman First 3 35 1.67 category~ 1050 350 33.3
# ... with 1,040 more rows
If you need a separate table of just the category/# classrooms/percentage data:
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100) %>%
select(Category, "Number of Classrooms in Category", "Category Percentage") %>%
unique()
This gives us:
# A tibble: 3 x 3
# Groups: Category [3]
Category `Number of Classrooms in Category` `Category Percentage`
<chr> <int> <dbl>
1 category_2 525 50
2 category_4 175 16.7
3 category_1 350 33.3
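Note that n() counts rows per category. If the goal is instead the total of the NumOfClassrooms column per category, a weighted count may be closer to what is needed. The sketch below uses a tiny hypothetical data frame rather than the full example data, and the output column names classrooms and percentage are made up for illustration.

```r
library(dplyr)

# Tiny hypothetical data set with the two relevant columns
dat_small <- data.frame(
  Category        = c("category_1", "category_1", "category_2", "category_4"),
  NumOfClassrooms = c(2, 3, 4, 1)
)

summary_tbl <- dat_small %>%
  # wt = NumOfClassrooms sums the classrooms instead of counting rows
  count(Category, wt = NumOfClassrooms, name = "classrooms") %>%
  mutate(percentage = classrooms / sum(classrooms) * 100)

summary_tbl
```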
Note that in your post, this code is a bit redundant:
Final_Divide = Final_df %>%
dplyr::group_by(District) %>%
dplyr::ungroup()
If you group and then immediately ungroup, you're actually just doing this:
Final_Divide <- Final_df
You could also consider adding split(.$District) to transform your data into a list all in one chunk of code:
dat %>%
mutate("Total Classrooms" = n()) %>%
group_by(Category) %>%
mutate("Number of Classrooms in Category" = n(),
"Category Percentage" = `Number of Classrooms in Category`/`Total Classrooms` * 100) %>%
split(.$District)
How does one rbind or bind_rows temporary tables created in SQL (tested and failed in Postgres and SQLite) by dplyr?
E.g.
library(dplyr)
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
copy_to(con, nycflights13::flights, "flights2",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")
Calling bind_rows gives the following error:
> bind_rows(flights_db, flights_db_2)
Error in bind_rows_(x, .id) :
Argument 1 must be a data frame or a named atomic vector, not a tbl_dbi/tbl_sql/tbl_lazy/tbl
Here both objects, 'flights' and 'flights2', hold identical records, and a union keeps only unique records. To combine the two database tables, we need
union(flights_db, flights_db_2)
The above returns a result with the same dimensions as 'flights_db', because both objects are the same and the duplicates are dropped. If we need double the number of rows, we can create a unique identifier:
flights1 <- nycflights13::flights %>%
mutate(id= 1)
flights2 <- nycflights13::flights %>%
mutate(id = 2)
copy_to(con, flights1, "flights",
temporary = FALSE,
overwrite = TRUE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
copy_to(con, flights2, "flights2",
temporary = FALSE,
overwrite = TRUE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")
Now we do the union
union(flights_db, flights_db_2) %>%
summarise(n = n())
# Source: lazy query [?? x 1]
# Database: sqlite 3.19.3 []
# n
# <int>
#1 673552
dim(nycflights13::flights)
#[1] 336776 19
To demonstrate the uniqueness, we can select a small subset of disjoint rows for the two objects and then do the union:
copy_to(con, nycflights13::flights[1:20,], "flights",
temporary = FALSE,
overwrite = TRUE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
copy_to(con, nycflights13::flights[21:30,], "flights2",
temporary = FALSE,
overwrite = TRUE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
flights_db <- tbl(con, "flights")
flights_db_2 <- tbl(con, "flights2")
union(flights_db, flights_db_2) %>%
collect
# A tibble: 30 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576
# 5 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719
# 6 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944
#10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733
# ... with 20 more rows, and 3 more variables: hour <dbl>, minute <dbl>, time_hour <dbl>
With thanks to Akrun for pointing me to the union family, it is possible to somewhat replicate bind_rows with:
Reduce(union_all, list(flights_db, flights_db, flights_db))
As noted in Akrun's answer and in the comments on it, union returns only unique records, while union_all is the equivalent of SQL's UNION ALL.
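The difference is easy to see on a small example. The sketch below uses an in-memory SQLite database with two identical three-row tables (the table names t1 and t2 and their contents are made up for illustration) and assumes the dbplyr and RSQLite packages are installed.

```r
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Two identical toy tables (hypothetical names and contents)
toy <- data.frame(id = 1:3, x = c("a", "b", "c"))
copy_to(con, toy, "t1")
copy_to(con, toy, "t2")

t1 <- tbl(con, "t1")
t2 <- tbl(con, "t2")

# union drops duplicate rows; union_all keeps them, like bind_rows would
n_union     <- union(t1, t2) %>% collect() %>% nrow()
n_union_all <- union_all(t1, t2) %>% collect() %>% nrow()

c(union = n_union, union_all = n_union_all)

DBI::dbDisconnect(con)
```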
I have around 50+ csv files that all share the same 4 columns in this order:
REG_ID region age age_num
and then years anything from 1990 till 2016 in this format:
REG_ID region age age_num y_1992 y_1993 y_1994 y_2014.15
and I was wondering what the best way to merge them would be. Going through each file to add the missing year columns would be time-consuming and would likely lead to errors.
The end format would be something like this:
REG_ID region reg_num age age_num y_1991 y_1992 y_1993
BFM2 Boucle 1 c_0_4 0 770 NA 120
BFM2 Boucle 1 c_5_9 5 810 NA 11
BFM2 Boucle 1 c_10_14 10 704 NA 130
BFM2 Boucle 1 c_15_19 15 71 NA 512
BFM2 Boucle 1 c_20_24 20 181 NA 712
Here's a way you can do it using tidyverse tools. First use dir to get a vector of csv paths, then use purrr::map to read them all in, returning a list of data frames, and then use purrr::reduce to merge all the data frames using dplyr::left_join.
library(readr)
library(purrr)
library(dplyr)
Create the data sets
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1991
BFM2,Boucle,1,c_0_4,0,770
BFM2,Boucle,1,c_5_9,5,810
BFM2,Boucle,1,c_10_14,10,704
BFM2,Boucle,1,c_15_19,15,71
BFM2,Boucle,1,c_20_24,20,181") %>%
write_csv("df_91.csv")
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1992
BFM2,Boucle,1,c_0_4,0,NA
BFM2,Boucle,1,c_5_9,5,NA
BFM2,Boucle,1,c_10_14,10,NA
BFM2,Boucle,1,c_15_19,15,NA
BFM2,Boucle,1,c_20_24,20,NA") %>%
write_csv("df_92.csv")
read_csv(
"REG_ID,region,reg_num,age,age_num,y_1993
BFM2,Boucle,1,c_0_4,0,120
BFM2,Boucle,1,c_5_9,5,11
BFM2,Boucle,1,c_10_14,10,130
BFM2,Boucle,1,c_15_19,15,512
BFM2,Boucle,1,c_20_24,20,712") %>%
write_csv("df_93.csv")
Create the final merged data set
dir(".", "\\.csv", full.names = TRUE) %>%
map(read_csv) %>%
reduce(left_join, by = c("REG_ID", "region", "reg_num", "age", "age_num"))
#> # A tibble: 5 x 8
#> REG_ID region reg_num age age_num y_1991 y_1992 y_1993
#> <chr> <chr> <int> <chr> <int> <int> <chr> <int>
#> 1 BFM2 Boucle 1 c_0_4 0 770 <NA> 120
#> 2 BFM2 Boucle 1 c_5_9 5 810 <NA> 11
#> 3 BFM2 Boucle 1 c_10_14 10 704 <NA> 130
#> 4 BFM2 Boucle 1 c_15_19 15 71 <NA> 512
#> 5 BFM2 Boucle 1 c_20_24 20 181 <NA> 712
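One thing to keep in mind: left_join keeps only the key combinations present in the first file. If some files contain regions or age groups that others lack, full_join may be the safer choice. The sketch below shows this with two small in-memory tibbles instead of files; the second tibble has an extra age group (a made-up row) that a left join would drop.

```r
library(dplyr)
library(purrr)
library(tibble)

keys <- c("REG_ID", "region", "reg_num", "age", "age_num")

df_91 <- tibble(
  REG_ID = "BFM2", region = "Boucle", reg_num = 1L,
  age = c("c_0_4", "c_5_9"), age_num = c(0L, 5L),
  y_1991 = c(770L, 810L)
)

# Hypothetical: this file has one more age group than the first one
df_93 <- tibble(
  REG_ID = "BFM2", region = "Boucle", reg_num = 1L,
  age = c("c_0_4", "c_5_9", "c_10_14"), age_num = c(0L, 5L, 10L),
  y_1993 = c(120L, 11L, 130L)
)

merged_left <- reduce(list(df_91, df_93), left_join, by = keys)
merged_full <- reduce(list(df_91, df_93), full_join, by = keys)

nrow(merged_left) # the extra age group is dropped
nrow(merged_full) # the extra age group is kept, with NA for y_1991
```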
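I think the best way would be:
library(data.table)
library(stringr)
# collect the csv files in the working directory
files_to_loop <- list.files()[str_detect(list.files(), "\\.csv$")]
# read each file into one element of a list
data <- vector("list", length(files_to_loop))
for (i in seq_along(files_to_loop)) {
  data[[i]] <- fread(files_to_loop[i])
}
# stack the tables, matching columns by name and filling missing ones with NA
data <- rbindlist(data, use.names = TRUE, fill = TRUE)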
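One caveat: rbindlist with fill = TRUE stacks the files on top of each other, so each region and age group appears once per file, with NA in the year columns that file lacks. If the goal is one row per region and age group with the year columns side by side, a keyed merge is an alternative. The sketch below compares the two on tiny hypothetical tables with a reduced set of key columns (keys is an illustrative name).

```r
library(data.table)

keys <- c("REG_ID", "age") # hypothetical, reduced set of key columns

a <- data.table(REG_ID = "BFM2", age = c("c_0_4", "c_5_9"), y_1991 = c(770L, 810L))
b <- data.table(REG_ID = "BFM2", age = c("c_0_4", "c_5_9"), y_1993 = c(120L, 11L))

# Row-wise stacking: 4 rows, each with NA in one of the year columns
stacked <- rbindlist(list(a, b), use.names = TRUE, fill = TRUE)

# Keyed merge: 2 rows, year columns side by side
merged <- Reduce(function(x, y) merge(x, y, by = keys, all = TRUE), list(a, b))

dim(stacked)
dim(merged)
```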