bigrquery: Create a BigQuery table with geoJSON files doesn't work - r

I'd like to create a BigQuery table from geoJSON files. Although geoJSON is an accepted format in BQ (NEWLINE_DELIMITED_JSON), and bq_table_create() accepts a bq_fields specification or something coercible to one (like a data frame), the bq_table_create() function of the bigrquery package doesn't work. In my example below the output error is Erro: Unsupported type: list:
library(sf)
library(bigrquery)
library(DBI)
library(googleAuthR)
library(geojsonsf)
library(geojsonR)
# Read the shapefile
stands_sel <- st_read(
"D:/Dropbox/Stinkbug_Ml_detection_CMPC/dashboard/v_08_CMPC/sel_stands_CMPC.shp")
# Convert to geoJSON
geo <- sf_geojson(stands_sel)
# Convert geoJSON to data frame
geo_js_df <- as.data.frame(geojson_wkt(geo))
str(geo_js_df)
# 'data.frame': 2 obs. of 17 variables:
# $ SISTEMA_PR: chr "MACRO ESTACA - EUCALIPTO" "SEMENTE - EUCALIPTO"
# $ ESPECIE : chr "SALIGNA" "DUNNI"
# $ ID_UNIQUE : chr "BARBANEGRA159A" "CAMPOSECO016A"
# $ CICLO : num 2 1
# $ LOCALIDADE: chr "BARRA DO RIBEIRO" "DOM FELICIANO"
# $ ROTACAO : num 1 1
# $ CARACTER_1: chr "Produtivo" "Produtivo"
# $ VLR_AREA : num 8.53 28.07
# $ ID_REGIAO : num 11 11
# $ CD_USO_SOL: num 2433 9053
# $ DATA_PLANT: chr "2008/04/15" "2010/04/15"
# $ ID_PROJETO: chr "002" "344"
# $ CARACTERIS: chr "Plantio Comercial" "Plantio Comercial"
# $ PROJETO : chr "BARBA NEGRA" "CAMPO SECO"
# $ ESPACAMENT: chr "3.00 x 2.50" "3.5 x 2.14"
# $ CD_TALHAO : chr "159A" "016A"
# $ geometry :List of 2
# ..$ : 'wkt' chr "MULTIPOLYGON (((-51.2142 -30.3517,-51.2143 -30.3518,-51.2143 -30.3518,-51.2143 -30.3519,-51.2143 -30.3519,-51.2"| __truncated__
# ..$ : 'wkt' chr "MULTIPOLYGON (((-52.3214 -30.4271,-52.3214 -30.4272,-52.3214 -30.4272,-52.3215 -30.4272,-52.3215 -30.4272,-52.3"| __truncated__
# - attr(*, "wkt_column")= chr "geometry"
# Insert the information into BQ
bq_conn <- dbConnect(bigquery(),
project = "my-project",
use_legacy_sql = FALSE
)
# First create the table
players_table = bq_table(project = "my-project", dataset = "stands_ROI_2021", table = "CF_2021")
bq_table_create(x = players_table, fields = as_bq_fields(geo_js_df))
Erro: Unsupported type: list

You can upload a data frame with a list-type column to BigQuery by using bq_table_upload(). Try this in your script instead of bq_table_create():
bq_table_upload(players_table, geo_js_df)
For your reference, I tried this on my end using this sample data with a list-type column:
d <- data.frame(id = 1:2,
name = c("Jon", "Mark"),
children = I(list(c("Mary", "James"),
c("Greta", "Sally")))
)
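Putting it together, a minimal sketch (my addition; the table handle reuses the question's names, so adjust the project and dataset as needed):
# bq_table_upload() creates the table on first upload if it doesn't already exist
players_table <- bq_table(project = "my-project", dataset = "stands_ROI_2021", table = "CF_2021")
bq_table_upload(players_table, d)
# Read it back to verify
bq_table_download(players_table)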
(Screenshots: R console output and the created BQ table.)
EDIT:
As per this documentation, FeatureCollection is not yet supported in BigQuery; however, there is an ongoing feature request you can find here. The workaround is to convert the GeoJSON file to BigQuery new-line-delimited JSON before converting it to a data frame.
To convert the GeoJSON file to BigQuery new-line-delimited JSON, follow these steps:
Install Node.js.
Install the required packages:
npm install fs JSONStream line-input-stream yargs
Clone the GitHub repository:
git clone https://github.com/mentin/geoscripts.git
Change directory:
cd geoscripts/geojson2bq/
Convert the GeoJSON file to BigQuery new-line-delimited JSON:
node geojson2bqjson.js sel_stands.geojson > out.json
Using the new-line-delimited JSON file, convert it to a data frame in the R console, then use bq_table_upload() to upload the data to BigQuery.
library(bigrquery)
library(dplyr)
library(tidyverse)
library(jsonlite)
out <- stream_in(file('out.json'))
projectid<-"my-project"
datasetid<-"my-dataset"
bq_conn <- dbConnect(bigquery(),
project = projectid,
dataset = datasetid,
use_legacy_sql = FALSE)
players_table = bq_table(project = "my-project", dataset = "my-dataset", table = "CF_2021_test5")
bq_table_upload(players_table, out)
bq_table_download(players_table)
(Screenshots: R console output and the resulting BigQuery table.)
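As an additional check (my addition, not part of the original answer), the DBI connection created above can be used to query the uploaded table directly:
# Count the uploaded rows; bq_conn already carries the default dataset
DBI::dbGetQuery(bq_conn, "SELECT COUNT(*) AS n FROM CF_2021_test5")
# Or list all tables in the dataset
DBI::dbListTables(bq_conn)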

Related

Split PDF files in multiples files every 2 pages in R

I have a PDF document with 300 pages. I need to split this file into 150 files, each containing 2 pages. For example, the 1st document would contain pages 1 & 2 of the original file, the 2nd document pages 3 & 4, and so on.
Maybe I can use the "pdftools" package, but I don't know how.
1) pdftools Assuming that the input PDF is in the current directory and the outputs are to go into the same directory, change the inputs below, then get the number of pages num, compute the st and en vectors of start and end page numbers, and repeatedly call pdf_subset. Note that the pdf_length and pdf_subset functions come from the qpdf R package but are also made available by the pdftools R package, which imports them and re-exports them.
library(pdftools)
# inputs
infile <- "a.pdf" # input pdf
prefix <- "out_" # output pdf's will begin with this prefix
num <- pdf_length(infile)
st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
pdf_subset(infile, pages = st[i]:en[i], output = outfile)
}
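For the 300-page example this writes out_001.pdf through out_150.pdf, since nchar(num) is 3.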
2) pdfbox The Apache pdfbox utility can split into files of 2 pages each. Download the .jar command line utilities file from pdfbox and be sure you have Java installed. Then run this assuming that your input file is a.pdf and is in the current directory (or run the quoted part directly from the command line, without the quotes and without R). The jar file name below may need to be changed if a later version is used; the one named below is the latest at the time of writing (not counting alpha versions).
system("java -jar pdfbox-app-2.0.26.jar PDFSplit -split 2 a.pdf")
3) animation/pdftk Another option is to install the pdftk program, change the inputs at the top of the script below, and run. This gets the number of pages in the input, num, using pdftk, computes the start and end page numbers, st and en, and then invokes pdftk repeatedly, once for each st/en pair, to extract those pages into another file.
library(animation)
# inputs
PDFTK <- "~/../bin/pdftk.exe" # path to pdftk
infile <- "a.pdf" # input pdf
prefix <- "out_" # output pdf's will begin with this prefix
ani.options(pdftk = Sys.glob(PDFTK))
tmp <- tempfile()
dump_data <- pdftk(infile, "dump_data", tmp)
g <- grep("NumberOfPages", readLines(tmp), value = TRUE)
num <- as.numeric(sub(".* ", "", g))
st <- seq(1, num, 2)
en <- pmin(st + 1, num)
for (i in seq_along(st)) {
outfile <- sprintf("%s%0*d.pdf", prefix, nchar(num), i)
pdftk(infile, sprintf("cat %d-%d", st[i], en[i]), outfile)
}
Neither pdftools nor qpdf (on which the former depends) supports splitting PDF files other than "every page" (one file per page). You will likely need to rely on an external program; I'm confident you can get pdftk to do that by calling it once for each 2-page output.
I have a 36-page PDF here named quux.pdf in the current working directory.
str(pdftools::pdf_info("quux.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 36
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 8
# ..$ Producer : chr "pdfTeX-1.40.24"
# ..$ Author : chr ""
# ..$ Title : chr ""
# ..$ Subject : chr ""
# ..$ Creator : chr "LaTeX via pandoc"
# ..$ Keywords : chr ""
# ..$ Trapped : chr ""
# ..$ PTEX.Fullbanner: chr "This is pdfTeX, Version 3.141592653-2.6-1.40.24 (TeX Live 2022) kpathsea version 6.3.4"
# $ created : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ modified : POSIXct[1:1], format: "2022-05-17 22:54:40"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"
I also have pdftk installed and available on the path:
Sys.which("pdftk")
# pdftk
# "C:\\PROGRA~2\\PDFtk Server\\bin\\pdftk.exe"
With this, I can run an external script to create 2-page PDFs:
list.files(pattern = "pdf$")
# [1] "quux.pdf"
pages <- seq(pdftools::pdf_info("quux.pdf")$pages)
pages <- split(pages, (pages - 1) %/% 2)
pages[1:3]
# $`0`
# [1] 1 2
# $`1`
# [1] 3 4
# $`2`
# [1] 5 6
for (pg in pages) {
system(sprintf("pdftk quux.pdf cat %s-%s output out_%02i-%02i.pdf",
min(pg), max(pg), min(pg), max(pg)))
}
list.files(pattern = "pdf$")
# [1] "out_01-02.pdf" "out_03-04.pdf" "out_05-06.pdf" "out_07-08.pdf"
# [5] "out_09-10.pdf" "out_11-12.pdf" "out_13-14.pdf" "out_15-16.pdf"
# [9] "out_17-18.pdf" "out_19-20.pdf" "out_21-22.pdf" "out_23-24.pdf"
# [13] "out_25-26.pdf" "out_27-28.pdf" "out_29-30.pdf" "out_31-32.pdf"
# [17] "out_33-34.pdf" "out_35-36.pdf" "quux.pdf"
str(pdftools::pdf_info("out_01-02.pdf"))
# List of 11
# $ version : chr "1.5"
# $ pages : int 2
# $ encrypted : logi FALSE
# $ linearized : logi FALSE
# $ keys :List of 2
# ..$ Creator : chr "pdftk 2.02 - www.pdftk.com"
# ..$ Producer: chr "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
# $ created : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ modified : POSIXct[1:1], format: "2022-05-18 09:37:56"
# $ metadata : chr ""
# $ locked : logi FALSE
# $ attachments: logi FALSE
# $ layout : chr "no_layout"

Convert sf object to dataframe and restore it to original state

I'd like to convert an sf object to a data frame and then restore it to its original state. But once I make the conversion with st_as_text(st_sfc(stands_sel$geometry)), it proves very difficult to recover the geometry again. In my example:
library(sf)
# get AOI in shapefile
download.file(
"https://github.com/Leprechault/trash/raw/main/sel_stands_CMPC.zip",
zip_path <- tempfile(fileext = ".zip")
)
unzip(zip_path, exdir = tempdir())
# Open the file
setwd(tempdir())
stands_sel <- st_read("sel_stands_CMPC.shp")
st_crs(stands_sel) = 4326
# Extract geometry as text
geom <- st_as_text(st_sfc(stands_sel$geometry))
# Add the features
features <- st_drop_geometry(stands_sel)
str(features)
# Joining feature + geom
geo_df <- cbind(features, geom)
str(geo_df)
# 'data.frame': 2 obs. of 17 variables:
# $ CD_USO_SOL: num 2433 9053
# $ ID_REGIAO : num 11 11
# $ ID_PROJETO: chr "002" "344"
# $ PROJETO : chr "BARBA NEGRA" "CAMPO SECO"
# $ CD_TALHAO : chr "159A" "016A"
# $ CARACTERIS: chr "Plantio Comercial" "Plantio Comercial"
# $ CARACTER_1: chr "Produtivo" "Produtivo"
# $ CICLO : int 2 1
# $ ROTACAO : int 1 1
# $ DATA_PLANT: chr "2008/04/15" "2010/04/15"
# $ LOCALIDADE: chr "BARRA DO RIBEIRO" "DOM FELICIANO"
# $ ESPACAMENT: chr "3.00 x 2.50" "3.5 x 2.14"
# $ ESPECIE : chr "SALIGNA" "DUNNI"
# $ SISTEMA_PR: chr "MACRO ESTACA - EUCALIPTO" "SEMENTE - EUCALIPTO"
# $ VLR_AREA : num 8.53 28.07
# $ ID_UNIQUE : chr "BARBANEGRA159A" "CAMPOSECO016A"
# $ geom : chr "MULTIPOLYGON (((-51.21423 -30.35172, -51.21426 -30.35178, -51.2143 -30.35181, -51.21432 -30.35186, -51.21433 -3"| __truncated__
# Return to original format again
stands_sf <- geo_df %>%
st_geometry(geom) %>%
sf::st_as_sf(crs = 4326)
#Error in UseMethod("st_geometry") :
Please, any help to restore my stands_sf object to the original state?
I think geom isn't in a format st_geometry is expecting. st_as_text converted your geometry into WKT as discussed in the help:
The returned WKT representation of simple feature geometry conforms to the simple features access specification and extensions, known as EWKT, supported by PostGIS and other simple features implementations for addition of SRID to a WKT string.
https://r-spatial.github.io/sf/reference/st_as_text.html
Instead, use st_as_sf(wkt=) to set the new (old) geometry.
st_as_sf(geo_df, wkt = "geom", crs = 4326)
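As a quick sanity check (my addition, not part of the original answer), you can confirm the restored object's class and CRS:
stands_back <- st_as_sf(geo_df, wkt = "geom", crs = 4326)
class(stands_back)        # "sf" "data.frame"
st_crs(stands_back)$epsg  # 4326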

Error while building ExpressionSet using Bioconductor

I am trying to build an ExpressionSet for the analysis of RNA-seq data. I simply have a matrix of counts called "exprs", a data.frame of features (genes) called "features", and a data.frame of sample attributes called "phenotypes".
Here is the code I run to import all the data into R and create a single ExpressionSet object, but it returns an error.
## DE object creation
### importing 3 data files to R first
### Count MATRIX
dataDirectory <- system.file("extdata", package="Biobase")
exprs <- as.matrix(read.table("counts.txt", sep = "\t", header = TRUE, row.names = 1, as.is = TRUE))
class(exprs)
head.matrix(exprs)
str(exprs)
Output:
num [1:40220, 1:20] 12.39 6.37 11.18 10.72 10.65 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:40220] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
..$ : chr [1:20] "sample_1379_PNC" "sample_1360_PA_A" "sample_1412_PNB" "sample_1405_PA_A" ...
### Features data which contains gene names and symbols for each ensembl id (gene) -> DATAFRAME
features <- read.csv("features.txt", sep = "\t")
rownames(features) <- features$ID
features$ID <- NULL
str(features)
Output:
'data.frame': 40223 obs. of 2 variables:
$ Symbol : chr "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
$ Symbol2: chr "TSPAN6" "TNMD" "DPM1" "SCYL3" ...
### Phenotype data which contains attributes for each sample -> DATAFRAME
phenotypes <- read.csv("phenotypes.txt", sep = "\t")
rownames(phenotypes) <- phenotypes$X1
phenotypes$X1 <- NULL
str(phenotypes)
Output:
'data.frame': 20 obs. of 2 variables:
$ condition: chr "normal" "tumor" "normal" "tumor" ...
$ type : chr "mono" "mono" "poly" "mono" ...
# Load package
library(Biobase)
# Create ExpressionSet object
eset <- ExpressionSet(assayData = exprs,
phenoData = annotatedDataFrameFrom(phenotypes),
featureData = annotatedDataFrameFrom(features))
Output:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘annotatedDataFrameFrom’ for signature ‘"character"’
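A likely fix (my addition; the question is unanswered here): annotatedDataFrameFrom() is meant to derive an AnnotatedDataFrame from an assay-data object and has no method for a plain data.frame, whereas the constructor Biobase::AnnotatedDataFrame() wraps an existing data.frame directly. A minimal sketch:
library(Biobase)
# Wrap the data frames; rownames(features) must match rownames(exprs), and
# rownames(phenotypes) must match colnames(exprs) -- note the post shows
# 40223 feature rows vs. 40220 expression rows, which must be reconciled
eset <- ExpressionSet(assayData = exprs,
                      phenoData = AnnotatedDataFrame(phenotypes),
                      featureData = AnnotatedDataFrame(features))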

POST to API using httr in R results in error

I'm trying to pull data directly from an API into R using the httr package. The API doesn't require any authentication, and accepts JSON strings of lat, long, elevation, variable sets, and time period to estimate climate variables for any location. This is my first time using an API, but the code below is what I've cobbled together from various Stack Overflow posts.
library(jsonlite)
library(httr)
url = "http://apibc.climatewna.com/api/clmApi"
body <- data.frame(lat = c(48.98,50.2), ##two example locations
lon = c(-115.02, -120),
el = c(1000,100),
prd = c("Normal_1961_1990.nrm","Normal_1961_1990.nrm"),
varYSM = c("Y","SST"))
requestBody <- toJSON(list("output" = body),auto_unbox = TRUE) ##convert to JSON string
result <- POST("http://apibc.climatewna.com/api/clmApi", ##post to API
body = requestBody,
add_headers(`Content-Type`="application/json"))
content(result)
I've tried several versions of this (e.g. writing the JSON string manually, putting the body as a list in POST with encode = "json"), and it always runs, but the content always contains the error message below:
$Message
[1] "An error has occurred."
$ExceptionMessage
[1] "Object reference not set to an instance of an object."
$ExceptionType
[1] "System.NullReferenceException"
If I use GET and specify the variables directly in the URL
url = "http://apibc.climatewna.com/api/clmApi/LatLonEl?lat=48.98&lon=-115.02&el=1000&prd=Normal_1961_1990&varYSM=Y"
result <- GET(url)
content(result)
it produces the correct output, but then I can only obtain information for one location at a time. There isn't currently any public documentation about this API as it's very new, but I've attached a draft of the section explaining it using JS below. I would very much appreciate any help/suggestions on what I'm doing wrong!
Thank you!
The main problem is that jQuery.ajax encodes the data using jQuery.param before sending it to the API, so what it's sending looks something like [0][lat]=48.98&[0][lon]=-115.02.... I don't know of a package in R that does an encoding similar to jQuery.param, so we'll have to hack something together.
Modifying your example slightly:
library(httr)
body <- data.frame(lat = c(48.98,50.2), ##two example locations
lon = c(-115.02, -120),
el = c(1000,100),
prd = c("Normal_1961_1990","Normal_1961_1990"),
varYSM = c("Y","Y"))
Now, we do the encoding, like so:
out <- sapply(1:nrow(body), function(i) {
paste(c(
paste0(sprintf("[%d][lat]", i - 1), "=", body$lat[i]),
paste0(sprintf("[%d][lon]", i - 1), "=", body$lon[i]),
paste0(sprintf("[%d][el]", i - 1), "=", body$el[i]),
paste0(sprintf("[%d][prd]", i - 1), "=", body$prd[i]),
paste0(sprintf("[%d][varYSM]", i - 1), "=", body$varYSM[i])
), collapse = "&")
})
out <- paste(out, collapse = "&")
So now out is in a form that the API likes. Finally:
result <- POST(url = "http://apibc.climatewna.com/api/clmApi", ##post to API
body = out,
add_headers(`Content-Type`="application/x-www-form-urlencoded"))
noting the Content-Type. We get
df <- do.call(rbind, lapply(content(result), as.data.frame, stringsAsFactors = FALSE))
str(df)
# 'data.frame': 2 obs. of 29 variables:
# $ lat : chr "48.98" "50.2"
# $ lon : chr "-115.02" "-120"
# $ elev : chr "1000" "100"
# $ prd : chr "Normal_1961_1990" "Normal_1961_1990"
# $ varYSM : chr "Y" "Y"
# $ MAT : chr "5.2" "8"
# $ MWMT : chr "16.9" "20.2"
# $ MCMT : chr "-6.7" "-5.6"
# $ TD : chr "23.6" "25.7"
# $ MAP : chr "617" "228"
# $ MSP : chr "269" "155"
# $ AHM : chr "24.7" "79.1"
# $ SHM : chr "62.9" "130.3"
# $ DD_0 : chr "690" "519"
# $ DD5 : chr "1505" "2131"
# $ DD_18 : chr "4684" "3818"
# $ DD18 : chr "60" "209"
# $ NFFD : chr "165" "204"
# $ bFFP : chr "150" "134"
# $ eFFP : chr "252" "254"
# $ FFP : chr "101" "120"
# $ PAS : chr "194" "34"
# $ EMT : chr "-36.3" "-32.7"
# $ EXT : chr "37.1" "41.2"
# $ Eref : chr "14.7" "13.6"
# $ CMD : chr "721" "862"
# $ MAR : chr "347" "679"
# $ RH : chr "57" "57"
# $ Version: chr "ClimateBC_API_v5.51" "ClimateBC_API_v5.51"
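As the str() output shows, every column comes back as character; a small follow-up (my addition, not part of the original answer) converts the columns to appropriate types:
# Convert character columns to numeric where possible
df[] <- lapply(df, type.convert, as.is = TRUE)
str(df$MAT)  # now numeric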

R SpatialPointsDataFrame to SpatialLinesDataFrame

I've imported some GPS points from my sports watch into R:
library(plotKML)
route <- readGPX("Move_Cycling.gpx")
str(route)
The data looks like this:
List of 5
$ metadata : NULL
$ bounds : NULL
$ waypoints: NULL
$ tracks :List of 1
..$ :List of 1
.. ..$ Move:'data.frame': 677 obs. of 5 variables:
.. .. ..$ lon : num [1:677] -3.8 -3.8 -3.8 -3.8 -3.8 ...
.. .. ..$ lat : num [1:677] 52.1 52.1 52.1 52.1 52.1 ...
.. .. ..$ ele : chr [1:677] "152" "151" "153" "153" ...
.. .. ..$ time : chr [1:677] "2014-06-08T09:17:08.050Z" "2014-06-08T09:17:18.680Z" "2014-06-08T09:17:23.680Z" "2014-06-08T09:17:29.680Z" ...
.. .. ..$ extensions: chr [1:677] "7627.7999992370605141521101800" "7427.6000003814697141511.7000000476837210180.8490009442642210" "9127.523.13003521531.7000000476837210181.799999952316280" "10027.534.96003841534.1999998092651410181.88300029210510" ...
$ routes : NULL
I've managed to get the data points into a SpatialPointsDataFrame and to plot them over Google Earth with:
SPDF <- SpatialPointsDataFrame(coords=route$tracks[[1]]$Move[1:2],
data=route$tracks[[1]]$Move[1:2],
proj4string = CRS("+init=epsg:4326"))
plotKML(SPDF)
What I really want is the cycling track, i.e. a SpatialLinesDataFrame, but I can't work out how to set the ID field correctly to match the SpatialLines object with the data.
This is how far I've got:
tmp <- Line(coords=route$tracks[[1]]$Move[1:2])
tmp2 <- Lines(list(tmp), ID=c("coord"))
tmp3 <- SpatialLines(list(tmp2), proj4string = CRS("+init=epsg:4326"))
# result should be something like,
# but the ID of tmp3 and data don't match at the moment
SPDF <- SpatialLinesDataFrame(tmp3, data)
You can read the GPX file straight into a SpatialLinesDataFrame object with readOGR from the rgdal package. A GPX file can contain tracks, waypoints, etc., and these are seen by OGR as layers in the file. So simply:
> track = readOGR("myfile.gpx","tracks")
> plot(track)
should work. You should see lines.
In your last line you haven't said what your data is, but if you are trying to construct a SpatialLinesDataFrame from some SpatialLines and a data frame, it needs to be a data frame with one row per track. You can tell it not to bother matching the IDs because you don't actually have any real per-track data you are merging. So:
> SPDF = SpatialLinesDataFrame(tmp3, data.frame(who="me"),match=FALSE)
> plot(SPDF)
But if you use readOGR you don't need to go through all that. It will also read in a bit of per-track metadata from the GPX file.
Happy cycling!
As an update, here's my final solution:
library(rgdal)
library(plotKML)
track <- readOGR("Move_Cycling.gpx","tracks")
plotKML(track, colour='red', width=2, labels="Cwm Rhaeadr Trail")
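A present-day note (my addition): rgdal was retired in 2023; the sf package reads the same GPX layers, e.g.:
library(sf)
track <- st_read("Move_Cycling.gpx", layer = "tracks")
plot(st_geometry(track))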
