I am using rowwise to perform a function on each row. This takes a long time. In order to speed things up, is there a way to use parallel processing so that multiple cores are working on different rows concurrently?
As an example, I am aggregating PRISM weather data (https://prism.oregonstate.edu/) to the state level while weighting by population. This is based on https://www.patrickbaylis.com/blog/2021-08-15-pop-weighted-weather/.
Note that the below code requires downloads of daily weather data as well as the shapefile with population estimates at a very small geography.
library(prism)
library(tidyverse)
library(sf)
library(exactextractr)
library(tigris)
library(terra)
library(raster)
library(ggthemes)
################################################################################
#get daily PRISM data
prism_set_dl_dir("/prism/daily/")
get_prism_dailys(type = "tmean", minDate = "2012-01-01", maxDate = "2021-07-31", keepZip=FALSE)
# Get states shapefile and limit to lower 48
states = tigris::states(cb = TRUE, resolution = "20m") %>%
filter(!NAME %in% c("Alaska", "Hawaii", "Puerto Rico"))
setwd("/prism/daily")
################################################################################
#get list of files in the directory, and extract date
##see if it is stable (TRUE) or provisional data (FALSE)
list <- ls_prism_data(name=TRUE) %>% mutate(date1=substr(files, nchar(files)-11, nchar(files)-4),
date2=substr(product_name, 1, 11),
year = substr(date2, 8, 11), month=substr(date2, 1, 3),
month2=substr(date1, 5, 6), day=substr(date2, 5, 6),
stable = str_detect(files, "stable"))
################################################################################
#function to get population weighted weather by state
#run the population raster outside of the loop
# SOURCE: https://sedac.ciesin.columbia.edu/data/set/usgrid-summary-file1-2000/data-download - Census 2000, population counts for continental US
pop_rast = raster("/population/usgrid_data_2000/geotiff/uspop00.tif")
pop_crop = crop(pop_rast, states)
daily_weather <- function(varname, filename, date) {
weather_rast = raster(paste0(filename, "/", filename, ".bil"))
weather_crop = crop(weather_rast, states)
pop_rs = raster::resample(pop_crop, weather_crop)
states$value <- exact_extract(weather_crop, states, fun = "weighted_mean", weights=pop_rs)
names(states)[11] <- varname
states <- data.frame(states) %>% arrange(NAME) %>% dplyr::select(c(6,11))
states
}
################################################################################
days <- list %>% rowwise() %>% mutate(states = list(daily_weather("tmean", files, date1)))
As is, each row takes about 7 seconds. That adds up with 3500 rows, and I want to get other variables besides tmean, so it will take a day or more to do everything unless I can speed it up.
I am mainly interested in solutions that let me use parallel processing with rowwise, but I also welcome other suggestions for speeding up the code.
You could try either purrr or its multiprocessing equivalent furrr (with either map() or pmap()). The quickest method would be to use data.table. See this blog post for the benchmarks behind my recommendation.
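For example, here is a minimal sketch of the furrr route, assuming the daily_weather() function and the list data frame (with its files and date1 columns) from the question; the multisession plan and the worker count are assumptions you would tune for your machine, and note that furrr has to export pop_crop and states to each worker, which itself takes time for large rasters.
library(furrr)
plan(multisession, workers = 4)  # adjust to the cores available on your machine
# future_map2() applies daily_weather() to each files/date1 pair in parallel,
# replacing the rowwise() + mutate() call from the question
days <- list %>%
  mutate(states = future_map2(files, date1,
                              ~ daily_weather("tmean", .x, .y)))
Since the per-row work is dominated by raster I/O, the speedup will only approach the number of workers if disk reads are not the bottleneck.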
Related
I am trying to use the r5r R package to create an isochrone accessibility study involving supermarkets in the City of Cleveland. I started by getting the boundary for the city, created a grid, and generated the centroids for the grid. I then used OSM to get the street network and locations of supermarkets. Finally, I created a travel time matrix using the r5r package. My code is below:
# load required packages
library(tidycensus)
library(tidytransit)
library(tmap)
library(osmdata)
library(tidyverse)
library(osmextract)
library(tigris)
library(r5r)
library(sf)
cleveland_boundary = places("Ohio") %>% filter(NAME == "Cleveland") %>% st_transform(4326)
cleveland_grid = st_make_grid(cleveland_boundary, square = FALSE, n = c(100,100),
  what = "polygons") %>% st_as_sf() %>% st_filter(cleveland_boundary) %>%
  mutate(id = row_number()) %>% st_transform(4326)
cleveland_centroids = st_centroid(cleveland_grid)
cle_file = oe_match("Cleveland, Ohio")
cle_grocery = oe_read(cle_file$url, layer = "points", quiet = TRUE) %>%
st_transform(crs = st_crs(cleveland_boundary)) %>% st_filter(cleveland_boundary) %>%
rename(id = osm_id) %>% st_transform(4326)
dir.create("cle_network")
cleveland_streets = oe_read(cle_file$url, layer = "lines", quiet = TRUE, download_directory = "cle_network") %>%
filter(!is.na(highway)) %>%
st_transform(crs = st_crs(cleveland_boundary)) %>% st_filter(cleveland_boundary)
options(java.parameters = "-Xmx2G") # set up r5r core
r5r_core <- setup_r5("cle_network", verbose = FALSE, overwrite = TRUE)
ttm_wkday = travel_time_matrix(r5r_core = r5r_core,
origins = cle_grocery,
destinations = cleveland_centroids,
mode = c("WALK", "TRANSIT"),
departure_datetime = as.POSIXct("08-12-2022 14:00:00", format = "%d-%m-%Y %H:%M:%S"),
max_walk_dist = 1000,
max_trip_duration = 480,
verbose = FALSE)
I obtained GTFS data from https://www.riderta.com/sites/default/files/gtfs/latest/google_transit.zip and saved it as "CLEgtfs.zip" in the cle_network directory created in the above code.
The output of this code only gives me 532 results, with a maximum travel time of 29 minutes. This is clearly not correct, and it seems that transit travel times are not being factored in. My guess is that it is only accounting for walk time, and since I have a maximum walking distance of 1000 meters, I suspect only walking travel time is included in this travel time matrix. Is there any reason this may be happening? I would appreciate any guidance!
The departure datetime you're using is outside the valid range set in the calendar.txt of the GTFS data. The earliest start_date in the feed is 11/12/2022 (dd/mm/yyyy), but your departure date is 08/12/2022.
Basically, none of the transit services described in the feed run on this day, so that's why you have a walk-only matrix.
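For instance, a minimal sketch of the fix, keeping everything else from the question unchanged; the replacement date below is just an example of a day inside the feed's service period, so check calendar.txt in your GTFS download for the range that actually applies.
# departure date moved inside the GTFS calendar range (assumed to start 11/12/2022)
ttm_wkday = travel_time_matrix(r5r_core = r5r_core,
                               origins = cle_grocery,
                               destinations = cleveland_centroids,
                               mode = c("WALK", "TRANSIT"),
                               departure_datetime = as.POSIXct("14-12-2022 14:00:00",
                                                               format = "%d-%m-%Y %H:%M:%S"),
                               max_walk_dist = 1000,
                               max_trip_duration = 480,
                               verbose = FALSE)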
I am trying to replicate this visual, but with my own data. This is the template I am working off of - https://r-graph-gallery.com/183-choropleth-map-with-leaflet.html
My intent is to highlight every country with a value in the same color. I might make it a heatmap or something - but right now adding the polygons gives an error so I cannot try any color options at all.
# Setup
library(leaflet)
library(rgdal)
library(here)
library(tidyverse)
# Basically copy pasted from the template, but the download did not work. I manually went to the website, downloaded the file, manually un-zipped, and manually dropped it in my working directory
# download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip" , destfile="DATA/world_shape_file.zip")
# system("unzip DATA/world_shape_file.zip")
world_spdf <- readOGR(
dsn= here() ,
layer="TM_WORLD_BORDERS_SIMPL-0.3",
verbose=FALSE
)
world_spdf@data$POP2005[ which(world_spdf@data$POP2005 == 0)] = NA
world_spdf@data$POP2005 <- as.numeric(as.character(world_spdf@data$POP2005)) / 1000000 %>% round(2)
# Example of my data - I have countries and numbers associated with them, although not every country has a number
country <- c("Algeria", "Argentina", "Australia")
values <- c(1,4,4)
my_df <- data.frame(country, values)
# This is how I am trying to add MY values to the map. I have to convert the map to a tibble, add my data, then convert it back to a map. Perhaps this is the problem?
interactive_data_attempt <- world_spdf %>%
as.tibble() %>%
left_join(my_df , by = c("NAME" = "country")) %>%
mutate(texts = replace_na(texts, 0),
exists = texts > 1) %>%
st_as_sf(coords = c("LON","LAT"))
# This is the method I used to do the exact same thing in a domestic US map
bins <- c(seq(0,1,1), Inf)
pal <- colorBin(c("white","#C14A36"), domain = interactive_data_attempt$exists, bins = bins, reverse = FALSE)
# This gives an error: Error in to_ring.default(x) : Don't know how to get polygon data from object of class XY,POINT,sfg
leaflet(interactive_data_attempt) %>%
addTiles() %>%
setView(lat=10, lng=0 , zoom=2) %>%
addPolygons(fillColor = ~pal(interactive_data_attempt$exists))
You use readOGR to get an sp object, but at one point you convert it to a tibble and then to sf. Not sure about sp, but in most cases you can handle sf like a regular tibble / dataframe, i.e. left_join to it. And you can read the shapefile directly into sf with st_read.
Then there's something off with your variables, a copy-paste mixup I would guess: in my_df you have values but you never do anything with it, and in your mutate you use texts but it's unclear where that is coming from.
The binary palette is built from exists, a boolean that should indicate whether the actual value is present or not, though I'd assume you want to use the values from my_df$values instead.
Left NA values as-is, changed bins (to just 2) and adjusted some colours.
library(leaflet)
library(sf)
library(dplyr)
library(tidyr)
# download.file("http://thematicmapping.org/downloads/TM_WORLD_BORDERS_SIMPL-0.3.zip" , destfile="world_shape_file.zip")
# unzip("world_shape_file.zip",exdir = "world_shape_file")
world_sf <- st_read("world_shape_file")
world_sf$POP2005[ which(world_sf$POP2005 == 0)] = NA
world_sf$POP2005 <- as.numeric(as.character(world_sf$POP2005)) / 1000000 %>% round(2)
country <- c("Algeria", "Argentina", "Australia")
values <- c(1,4,4)
pal <- colorBin(c("blue","#C14A36"), domain = values, bins = 2, reverse = FALSE, na.color = "transparent")
world_sf %>%
left_join(
tibble(country, values),
by = c("NAME" = "country")) %>%
leaflet() %>%
addTiles() %>%
setView(lat=10, lng=0 , zoom=2) %>%
addPolygons(fillColor = ~pal(values), stroke = FALSE)
Created on 2022-11-12 with reprex v2.0.2
I'm trying to replicate a plot from the paper Michel et al., 'Quantitative Analysis of Culture Using Millions of Digitized Books' (2011). Specifically I'm trying to make the one on the top right here:
https://pubmed.ncbi.nlm.nih.gov/21163965/#&gid=article-figures&pid=fig-3-uid-2
I know the paper used v1 of the corpus but I'm doing it with v2 as it's easier to work with. When I use the Google Ngram viewer (specifying the English 2012 corpus which corresponds to v2, a year range of 1875 to 1975, and no smoothing) I get this, which looks pretty close.
When I tried to replicate this in R/ggplot I get this:
1950 and 1883 look pretty consistent with what is happening in the viewer plot, but I can't figure out what is happening with 1910. There appears to be very few occurrences of the year '1910' in the data set in comparison to some of the other years. Would anyone with a better understanding of the Google ngrams data set be able to point me in the right direction? Should I be supplementing this with something other than just the 1-gram dataset? Does the Google ngram viewer pick out occurrences of 1-grams in a different way?
The code I've used is below. A couple of other points: 1910 and 1950 do not seem to exist as 1-grams in the v2 data set, but 1883 does. To get this to even remotely work, I had to grepl for 1950 and 1910 to get any hits (i.e. they all seem to appear as parts of date ranges like 1890-1910, or with some other characters tacked on), rather than just doing a fixed search for those years in the ngram field. I also used purrr::map_dfr to do this rather than just a dplyr::case_when in case years appeared in the same ngram picked up by a grepl (e.g. the range 1883-1910 should be a hit for both of those years, not just one).
library(ggplot2)
library(dplyr)
library(purrr)
#---- Load data ----
counts_file <- file.path("data", "total_counts.txt")
ngrams_file <- file.path("data", "google_books_1gram_eng_v2.gz")
if (!dir.exists("data")) {
dir.create("data")
}
if (!file.exists(counts_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-totalcounts-20120701.txt",
counts_file
)
}
if (!file.exists(ngrams_file)) {
download.file(
"http://storage.googleapis.com/books/ngrams/books/googlebooks-eng-all-1gram-20120701-1.gz",
ngrams_file
)
}
one_grams <- read.delim(
gzfile(ngrams_file),
header = FALSE
)
names(one_grams) <- c("ngram", "year", "match_count", "volume_count")
one_grams_subset <- one_grams %>%
filter(year >= 1875 & year <= 1975)
total_counts_temp <- t(
read.table(
counts_file,
header = FALSE
)
)
total_counts_char <- do.call(
rbind,
strsplit(total_counts_temp, ",")
)
total_counts <- apply(total_counts_char, 2, as.numeric)
colnames(total_counts) <- c("year", "match_count", "page_count", "volume_count")
#---- Recreate plot 3A from Michel et al. (2011) ----
year_subset <- function(year_char, one_grams_data) {
one_grams_data %>%
filter(grepl(year_char, .[["ngram"]], fixed = TRUE)) %>%
group_by(year) %>%
summarise(year_count = sum(match_count, na.rm = TRUE)) %>%
mutate(year_gram = year_char)
}
plot_data <- map_dfr(c("1883", "1910", "1950"),
year_subset,
one_grams_subset) %>%
left_join(as_tibble(total_counts), by = "year") %>%
mutate(frequency = 10000 * year_count/match_count) %>%
select(year_gram, year, frequency, year_count)
ggplot(plot_data) +
geom_line(aes(x = year, y = frequency, colour = year_gram)) +
theme_minimal() +
labs(col = "ngram", x = "Year", y = "Frequency")
So, I am new to using R, so sorry if the questions seem a little basic!
But my work is asking me to look through census data using an API and identify some variables in each tract, then create a csv file they can look at. The code is fully written for me, I believe, but I need to change the variables to:
S2602_C01_023E - black / his
S2602_C01_081E - unemployment rate
S2602_C01_070E - not US citizen (divide by total population)
S0101_C01_030E - # over 65 (divide by total pop)
S1603_C01_009E - # below poverty (divide by total pop)
S1251_C01_010E - # child under 18 (divide by # households)
S2503_C01_013E - median income
S0101_C01_001E - total population
S2602_C01_078E - in labor force
And I need to divide some of the variables, like I have written, and export all of this into a CSV file. I just don't really know what to do with the code... I am just lost because I have never used R. I try changing the variables to the ones I need, but an error comes up. Any help would be greatly appreciated!
library(tidycensus)
library(tidyverse)
library(stringr)
library(haven)
library(profvis)
#list of variables possible
v18 <- load_variables(year = 2018,
dataset = "acs5",
cache = TRUE)
# function to get variables for all states. Year and variables can be easily edited.
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(totpop = "B01003_001",
male = "B01001_002",
female = "B01001_026",
white_alone = "B02001_002",
black_alone = "B02001_003",
americanindian_alone = "B02001_004",
asian_alone = "B02001_005",
nativehaw_alone = "B02001_006",
other_alone = "B02001_007",
twoormore = "B02001_008",
nh = "B03003_002",
his = "B03003_003",
noncit = "B05001_006",
povstatus = "B17001_002",
num_households = "B19058_001",
SNAP_households = "B19058_002",
medhhi = "B19013_001",
hsdiploma_25plus = "B15003_017",
bachelors_25plus = "B15003_022",
greater25 = "B15003_001",
inlaborforce = "B23025_002",
notinlaborforce = "B23025_007",
greater16 = "B23025_001",
civnoninstitutional = "B27010_001",
withmedicare_male_0to19 = "C27006_004",
withmedicare_male_19to64 = "C27006_007",
withmedicare_male_65plus = "C27006_010",
withmedicare_female_0to19 = "C27006_014",
withmedicare_female_19to64 = "C27006_017",
withmedicare_female_65plus = "C27006_020",
withmedicaid_male_0to19 = "C27007_004",
withmedicaid_male_19to64 = "C27007_007",
withmedicaid_male_65plus = "C27007_010",
withmedicaid_female_0to19 = "C27007_014",
withmedicaid_female_19to64 = "C27007_017",
withmedicaid_female_65plus ="C27007_020"),
geography = "tract",
state = st )
return(df)
}
#loops over all states
df_list <- setNames(lapply(states, get_census_data), states)
## if you want to keep margin of error, remove everything after %>% in the next two lines
final_df <- bind_rows(df_list) %>%
select(-moe)
colnames(final_df)[3] <- "varname"
#cleaning up final data, making it wide instead of long
final_df_wide <- final_df %>%
gather(variable, value, -(GEOID:varname)) %>%
unite(temp, varname, variable) %>%
spread(temp, value)
#exporting to csv file, adjust your path
write.csv(final_df, "C:/Users/NAME/Documents/acs_2018_tractlevel_data.csv")
Since you can't really give a reproducible example without revealing your API key, I'll try my best to figure out what could work here.
Let's first edit the function that pulls data from the API:
get_census_data <- function(st) {
Sys.sleep(5)
df <- get_acs(year = 2018,
variables = c(blackHis= "S2602_C01_023E",
unEmployRate = "S2602_C01_081E",
notUSCit = "S2602_C01_070E"),
geography = "tract",
state = st )
return(df)
}
I've just put in three of the variables, but you should get the point. Try whether this works for you and returns the data stored in the respective variables.
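Since the question also asks about the ratios and the CSV export, here is a rough sketch of how those steps could look once the loop over states has run; the column names used in mutate() (totPop, over65, notUSCit) are placeholders that must match whatever names you give the variables in get_acs(), so treat this as illustrative rather than tested.
final_df_wide <- bind_rows(df_list) %>%
  select(-moe) %>%
  # one column per variable instead of one row per variable
  pivot_wider(names_from = variable, values_from = estimate) %>%
  # derived shares from the question; totPop, over65, notUSCit are placeholder names
  mutate(notUSCit_share = notUSCit / totPop,
         over65_share = over65 / totPop)
write.csv(final_df_wide, "acs_2018_tractlevel_data.csv", row.names = FALSE)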
I am building a Shiny application in which a large ggplot2 fortified dataframe needs to be calculated over and over again, using a large amount of external source files. I am searching for the fastest and most efficient way to do this. In the following paragraph I will delve a little bit more into the subject and the code I have so far and also provide the input data to enable your kind assistance.
I am using the Helsinki Region Travel Time Matrix 2018, a dataset provided by Digital Geography Lab, a research group in the University of Helsinki. This data uses a generalised map of Helsinki capital region, in 250 x 250 meter cells (in my code grid_f), to calculate travel times between all cells in the map (grid ids are called YKR_ID, n=13231) by public transport, private car, bicycle and by foot. The calculations are stored in delimited .txt files, one text file for all the travel times to a specific cell id. The data is available for download at this website, under "Download the data". NB, the unzipped data is 13.8 GB in size.
Here is a selection from a text file in the dataset:
from_id;to_id;walk_t;walk_d;bike_s_t;bike_f_t;bike_d;pt_r_tt;pt_r_t;pt_r_d;pt_m_tt;pt_m_t;pt_m_d;car_r_t;car_r_d;car_m_t;car_m_d;car_sl_t
5785640;5785640;0;0;-1;-1;-1;0;0;0;0;0;0;-1;0;-1;0;-1
5785641;5785640;48;3353;51;32;11590;48;48;3353;48;48;3353;22;985;21;985;16
5785642;5785640;50;3471;51;32;11590;50;50;3471;50;50;3471;22;12167;21;12167;16
5785643;5785640;54;3764;41;26;9333;54;54;3764;54;54;3764;22;10372;21;10370;16
5787544;5785640;38;2658;10;7;1758;38;38;2658;38;38;2658;7;2183;7;2183;6
My interest is to visualise (with ggplot2) this 250x250m Helsinki region map for one travel mode, the private car, using any of the possible 13231 cell ids, repeatedly if the user wants. Because of this it is important that the dataframe fetch is as fast and efficient as possible. For this question, let's concentrate on the fetching and processing of the data from the external files and use only one specific id value.
In a nutshell, after I have produced a ggplot2::fortify() version of the 250 x 250 meter grid spatial dataset grid_f,
I need to scan through all the 13231 Travel Time Matrix 2018 text files
Pick only the relevant columns (from_id, to_id, car_r_t, car_m_t, car_sl_t) in each file
Pick the relevant row using from_id (in this case, origin_id <- "5985086") in each file
Join the resulting row to the fortified spatial data grid_f
My code is as follows:
# Libraries
library(ggplot2)
library(dplyr)
library(rgdal)
library(data.table)
library(sf)
library(sp)
# File paths. ttm_path is the folder which contains the unchanged Travel
# Time Matrix 2018 data from the research group's home page
ttm_path <- "HelsinkiTravelTimeMatrix2018"
gridpath <- "MetropAccess_YKR_grid_EurefFIN.shp"
#### Import grid cells
# use this CRS information throughout the app
app_crs <- sp::CRS("+init=epsg:3067")
# Read grid shapefile and transform
grid_f <- rgdal::readOGR(gridpath, stringsAsFactors = TRUE) %>%
sp::spTransform(., app_crs) %>%
# preserve grid dataframe data in the fortify
{dplyr::left_join(ggplot2::fortify(.),
as.data.frame(.) %>%
dplyr::mutate(id = as.character(dplyr::row_number() - 1)))} %>%
dplyr::select(-c(x, y))
The code above this point is meant to run only once. The code below, more or less, would be run over and over with different origin_ids.
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c(1, 2, 14, 16, 18)
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[-length(all_files)]
# lapply function
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# The part of the code that needs to be fast and efficient
result <-
lapply(all_files, FUN = TTM18_fetch, col_range, origin_id_num) %>%
data.table::rbindlist(., fill = TRUE) %>%
data.table::merge.data.table(dt_grid, ., by.x = "YKR_ID", by.y = "to_id")
The dataframe result should have 66155 rows of 12 variables, five rows for each 250x250 meter grid cell. The columns are YKR_ID, long, lat, order, hole, piece, id, group, from_id, car_r_t, car_m_t, car_sl_t.
My current lapply() and data.table::fread() solution takes about 2-3 minutes to complete. I think this is already a good achievement, but I can't help but think there are better and faster ways to complete this. So far, I have tried these alternatives to what I now have:
A conventional for loop: that was obviously a slow solution
I tried to teach myself more about vectorised functions in R, but that did not lead anywhere. Used this link
Tried to dabble with with() unsuccessfully using this SO question, inspired by this SO question
Looked into package parallel but ended up not utilising that because of the Windows environment I am using
Tried to find alternative ways to solve this with apply() and sapply() but nothing noteworthy came out of that.
As to why I didn't do all this to the data before ggplot2::fortify, I simply found it troublesome to work with a SpatialPolygonsDataFrame.
Thank you for your time.
Whenever I'm trying to figure out how to improve the performance of my R functions, I generally use the following approach. First, I look for any function calls that may be unnecessary, or identify places where multiple function calls can be simplified into one. Then, I look for the places in my code that incur the greatest time penalty by benchmarking each part separately. This can easily be done using the microbenchmark package.
For example, we can ask if we get better performance with or without piping (e.g. %>%).
# hint... piping is always slower
library(magrittr)
library(microbenchmark)
microbenchmark(
pipe = iris %>% subset(Species=='setosa'),
no_pipe = subset(iris, Species=='setosa'),
times = 200)
Unit: microseconds
expr min lq mean median uq max neval cld
pipe 157.518 196.739 308.1328 229.6775 312.6565 2473.582 200 b
no_pipe 84.894 116.386 145.4039 126.1950 139.4100 612.492 200 a
Here we find that subsetting a data.frame without piping takes nearly half the time to execute!
Next, I determine the net time penalty for each place I benchmarked by multiplying the execution time by the total number of times it needs to be executed. For the areas with the greatest net time penalty, I try to swap in faster functions and/or reduce the total number of times they need to be executed.
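As a quick illustration of what I mean by net time penalty (the numbers here are made up, not measured):
per_call <- 0.010   # assume one file read costs ~10 ms
n_calls <- 13231    # one read per TTM18 text file
per_call * n_calls  # net penalty for that step: ~132 seconds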
TLDR
In your case, you can speed things up by using the fst package, although you would need to convert your delimited text files to fst files first.
# before
TTM18_fetch <- function(x, col_range, origin_id) {
res <- data.table::fread(x, select = col_range)
res <- subset(res, from_id == origin_id)
return(res)
}
# after (NB x needs to be a fst file)
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
To convert your text files to fst:
library(data.table)
library(fst)
ttm_path <- 'REPLACE THIS'
new_ttm_path <- 'REPLACE THIS'
# Get filepaths of all of the TTM18 data. Remove metadata textfile filepath.
all_files <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE,
full.names = TRUE)
all_files <- all_files[!grepl('[Mm]eta', all_files)]
# creating new file paths and names for fst files
file_names <- list.files(path = ttm_path,
pattern = ".txt$",
recursive = TRUE)
file_names <- file_names[!grepl('[Mm]eta', file_names)]
file_names <- gsub(pattern = '\\.txt$',
replacement = '.fst',
x = file_names)
file_names <- file.path(new_ttm_path, file_names)
# txt to fst conversion
require(progress) # this will help you keep track of things
pb <- progress_bar$new(
format = " :what [:bar] :percent eta: :eta",
clear = FALSE, total = length(file_names), width = 60)
# an index file to store from_id file locations
from_id_paths <- data.table(from_id = numeric(),
file_path = character())
for(i in seq_along(file_names)){
pb$tick(tokens = list(what = 'reading'))
tmp <- data.table::fread(all_files[i], key = 'from_id')
pb$tick(0, tokens = list(what = 'writing'))
fst::write_fst(tmp,
compress = 50, # less compressed files read faster
path = file_names[i] )
pb$tick(0, tokens = list(what = 'indexing'))
from_id_paths <- rbind(from_id_paths,
data.table(from_id = unique(tmp$from_id),
file_path = file_names[i]))
}
setkey(from_id_paths, from_id)
write_fst(from_id_paths,
path = file.path(new_ttm_path, 'from_id_index.fst'),
compress = 0)
This would be the replacement
library(fst)
library(data.table)
new_ttm_path <- 'REPLACE THIS'
#### Fetch TTM18 data
origin_id <- "5985086"
origin_id_num <- as.numeric(origin_id)
# column positions of columns from_id, to_id, car_r_t, car_m_t, car_sl_t
col_range <- c('from_id', 'to_id', 'car_r_t', 'car_m_t', 'car_sl_t')
# grid_f as data.table version
dt_grid <- as.data.table(grid_f)
necessary_files <- read_fst(path = file.path(new_ttm_path,
'from_id_index.fst'),
as.data.table = TRUE
)[from_id==origin_id,file_path]
TTM18_fetch <- function(x, col_range, origin_id) {
res <- fst::read_fst(path = x,
columns = col_range,
as.data.table = TRUE)[from_id==origin_id]
return(res)
}
result <- rbindlist(lapply(necessary_files, FUN = TTM18_fetch, col_range, origin_id_num),
fill = TRUE)
result <- data.table::merge.data.table(dt_grid, result, by.x = "YKR_ID", by.y = "to_id")