I am using R 3.6.3 on Ubuntu Desktop 20.04 with googledrive version 1.0.1.
I am downloading data from this open Google Drive folder with about 3k files: https://drive.google.com/drive/folders/0B-owdnU_9_lpei1wbTBpS3RyTW8?resourcekey=0-SxZAhXpvnVSBVJjG_HYZ_w
If you add it to one of your Google accounts, you can reproduce the problem with:
library(googledrive)
drive_about()
s2012 <- drive_ls(path = "CSV Stazione_Parametro_AnnoMese", pattern = paste0("storico_2012"))
drive_download(file = as_id(s2012$id[1]))
Which returns:
Error: Client error: (404) Not Found
* domain: global
* reason: notFound
* message: File not found: 0B-owdnU_9_lpY0otY3FYaF9nOG8.
* locationType: parameter
* location: fileId
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/gargle_error_request_failed>
Client error: (404) Not Found
* message: File not found: 0B-owdnU_9_lpSGR2SGtCUnlRekk.
* domain: global
* reason: notFound
* location: fileId
* locationType: parameter
Backtrace:
1. base::mapply(function(x) drive_download(file = as_id(x)), bla$id)
3. googledrive::drive_download(file = as_id(x))
5. googledrive:::as_dribble.drive_id(file)
6. googledrive::drive_get(id = x)
8. purrr::map(as_id(id), get_one_file)
9. googledrive:::.f(.x[[i]], ...)
10. gargle::response_process(response)
11. gargle:::stop_request_failed(error_message(resp), resp)
Run `rlang::last_trace()` to see the full context.
Note that the tibble returned by drive_ls is:
s2012
# A tibble: 201 x 3
name id drive_resource
* <chr> <chr> <list>
1 storico_2012_07000027_005.csv 0B-owdnU_9_lpY0otY3FYaF9nOG8 <named list [37]>
2 storico_2012_10000001_005.csv 0B-owdnU_9_lpcmFkUDYwUzh4X0k <named list [37]>
3 storico_2012_05000020_010.csv 0B-owdnU_9_lpcTlEMTFpbjJLSVE <named list [37]>
4 storico_2012_03000006_005.csv 0B-owdnU_9_lpbDJiNFZWUy1CcEU <named list [37]>
5 storico_2012_09000018_111.csv 0B-owdnU_9_lpRHlwN0JnNVNseDg <named list [37]>
6 storico_2012_07000041_005.csv 0B-owdnU_9_lpV1hINnZtSFRYaTg <named list [37]>
7 storico_2012_04000155_009.csv 0B-owdnU_9_lpMzh6a29BQ3hJbHM <named list [37]>
8 storico_2012_09000014_020.csv 0B-owdnU_9_lpS0Y0ZFIzbV9mX1U <named list [37]>
9 storico_2012_03000006_038.csv 0B-owdnU_9_lpMlpGbkpFdVdURzQ <named list [37]>
10 storico_2012_06000036_009.csv 0B-owdnU_9_lpa0kxTTBuLU83U2s <named list [37]>
# … with 191 more rows
Note that it works perfectly for other files in the same folder.
Any hint?
A question aside: is there any alternative to this library? googledrive struggles to find all the files matching a given pattern, i.e. it has to be run multiple times to find them all.
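For reference, the batch download behind the backtrace above is essentially the loop below. The tryCatch wrapper is only a defensive sketch added here for illustration (it was not in the run that produced the 404), so that a single missing id does not abort the remaining downloads:
library(googledrive)
library(purrr)
results <- map(s2012$id, function(id) {
  tryCatch(
    drive_download(file = as_id(id), overwrite = TRUE),
    error = function(e) {
      # log and skip ids that come back 404 instead of stopping the whole loop
      message("Skipping ", id, ": ", conditionMessage(e))
      NULL
    }
  )
})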
Thx,
A
I'm trying to use "pak" to install an R source library from a private CRAN repo (hosted with Artifactory) and I haven't been able to figure out how to get it to work. I start with a conda enviroment like so:
conda create --name testing_r_pak --channel=conda-forge r-base=4.1.0 r-essentials=4.1.0
conda activate testing_r_pak
Rscript -e 'install.packages("pak", repos = "http://cran.us.r-project.org")'
and then run the following to try to install my R package and print some troubleshooting info:
Rscript -e 'options(width=400); print(pak::repo_status()); pak::pkg_install("TestRPackage")'
Here's the result:
# A data frame: 6 × 10
name url type bioc_version platform path r_version ok ping error
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <dbl> <list>
1 pvt_cran https://<pvt_cran_repo_url>/ cranlike NA source src/contrib 4.1 TRUE 1.05 <NULL>
2 CRAN https://cran.rstudio.com cran NA source src/contrib 4.1 TRUE 1.05 <NULL>
3 BioCsoft https://bioconductor.org/packages/3.14/bioc bioc 3.14 source src/contrib 4.1 TRUE 1.05 <NULL>
4 BioCann https://bioconductor.org/packages/3.14/data/annotation bioc 3.14 source src/contrib 4.1 TRUE 1.06 <NULL>
5 BioCexp https://bioconductor.org/packages/3.14/data/experiment bioc 3.14 source src/contrib 4.1 TRUE 1.23 <NULL>
6 BioCworkflows https://bioconductor.org/packages/3.14/workflows bioc 3.14 source src/contrib 4.1 TRUE 1.25 <NULL>
✔ Loading metadata database ... done
Error: <callr_remote_error: Cannot install packages:
* TestRPackage: Can't find package called TestRPackage.>
in process 21027
-->
<simpleError: Cannot install packages:
* TestRPackage: Can't find package called TestRPackage.>
Stack trace:
12. (function (...) ...
13. base:::withCallingHandlers(cli_message = function(msg) { ...
14. get("pkg_install_make_plan", asNamespace("pak"))(...)
15. prop$stop_for_solution_error()
16. private$plan$stop_for_solve_error()
17. pkgdepends:::pkgplan_stop_for_solve_error(self, private)
18. base:::stop("Cannot install packages:\n", msg, call. = FALSE)
19. base:::.handleSimpleError(function (e) ...
20. h(simpleError(msg, call))
21. base:::stop(e)
22. (function (e) ...
x Cannot install packages:
* TestRPackage: Can't find package called TestRPackage.
Execution halted
Some notes:
I know I have a correctly constructed .Rprofile with my private CRAN repo added to it (a sketch of what that looks like follows these notes). As you can see, pak::repo_status() even sees it correctly.
I know I have a properly functioning private CRAN repo because running Rscript -e 'install.packages("TestRPackage", repos = "https://<pvt_cran_repo_url>")' in the same conda environment works fine.
I've tried a few different versions of R with the same result.
I've tried this on a couple of different hosts (Mac, Ubuntu) with the same result.
I've also tried this without conda, with the same result.
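For reference, the relevant .Rprofile entry is roughly the following (a sketch only; <pvt_cran_repo_url> is the same placeholder used above):
# .Rprofile -- register the private CRAN-like repo ahead of CRAN
options(repos = c(
  pvt_cran = "https://<pvt_cran_repo_url>/",
  CRAN     = "https://cran.rstudio.com"
))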
I'm pretty stumped here - I hope it's something obvious I'm missing, but I'm not sure what else to do. I'd appreciate any help.
I am trying to read a parquet file from the Databricks FileStore.
library(sparklyr)
# parquet_dir has been pre-defined
parquet_dir <- "/dbfs/FileStore/test/flc_next.parquet"
# List the files in the parquet dir
filenames <- dir(parquet_dir, full.names = TRUE)
"/dbfs/FileStore/test/flc_next.parquet/_committed_6244562942368589642"
[2] "/dbfs/FileStore/test/flc_next.parquet/_started_6244562942368589642"
[3] "/dbfs/FileStore/test/flc_next.parquet/_SUCCESS"
[4] "/dbfs/FileStore/test/flc_next.parquet/part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6769e6-925-1-c000.snappy.parquet"
# Show the filenames and their sizes
data_frame(
filename = basename(filenames),
size_bytes = file.size(filenames)
)
Warning: `data_frame()` was deprecated in tibble 1.1.0.
Please use `tibble()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# A tibble: 4 × 2
filename size_bytes
<chr> <dbl>
1 _committed_6244562942368589642 124
2 _started_6244562942368589642 0
3 _SUCCESS 0
4 part-00000-tid-6244562942368589642-0edceedf-7157-4cce-a084-0f2a4a6… 248643
# Import the data into Spark
timbre_tbl <- spark_read_parquet("flc_next.parquet", parquet_dir)
Error : $ operator is invalid for atomic vectors
I would appreciate any help or suggestions.
Thanks in advance
The first argument of spark_read_parquet expects a Spark connection; check sparklyr::spark_connect. If you are running the code in Databricks then this should work:
sc <- spark_connect(method = "databricks")
timbre_tbl <- spark_read_parquet(sc, "flc_next.parquet", parquet_dir)
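Once that works, a quick sanity check (just a hedged sketch, not part of the original answer) is to pull a few rows back into R:
library(dplyr)
timbre_tbl %>%
  head(5) %>%
  collect()   # preview the first rows of the Spark table as a local tibble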
Using the httr package, I pulled content from an API as follows; the content received is "text/xml" (the raw output is pasted at the end).
I converted it to XML as follows:
res_xml <- httr::content(res, as = "text", encoding = "UTF-8") %>%
xml2::read_xml()
When I check the children, it does seem to have a node set:
xml_children(res_xml)
{xml_nodeset (4)}
[1] <type value="searchset"/>
[2] <total value="1"/>
[3] <link>\n <relation value="self"/>\n <url value= ...
[4] <entry>\n <link>\n <relation value="self"/>\n ...
But, when I pull the "entry" node, there seems to be no data.
xml_find_all(res_xml, ".//entry")
{xml_nodeset (0)}
Instead of working with the XML format, I converted it to a list, which is a deeply nested list of lists with unequal entries.
xml_list <- xml2::as_list(res_xml)
I would need a data frame of specific entries, or better still the complete XML output as separate tables, so that I can select the data and work with it more easily. Hence I tried the following, but the output is all NULL.
lst <- xml_list$Bundle$entry
lst %>% dplyr::bind_rows()
# A tibble: 17 x 4
relation url Patient mode
<list> <list> <list> <list>
1 <NULL> <NULL> <list [0]> <NULL>
2 <NULL> <NULL> <named list [1]> <NULL>
3 <NULL> <NULL> <named list [1]> <NULL>
4 <NULL> <NULL> <named list [1]> <NULL>
.....
When I look at str(lst), the list contains NULL, list() and attr(value) entries; I am interested in the attribute values. If I convert the XML to JSON, everything is NULL.
Any help on how to flatten this list appropriately would be appreciated, or, even better, a way to parse it directly with the xml2 package (one rough attempt is sketched below).
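For illustration, here is a hedged sketch of pulling those attribute values out of the as_list() result by recursion. It assumes every leaf element stores its data in a value attribute, as in the input pasted below, and it simply drops other attributes such as the url on extension:
library(purrr)
pull_values <- function(x) {
  if (is.list(x) && length(x) == 0) {
    attr(x, "value")        # leaf element such as <gender value="female"/>
  } else if (is.list(x)) {
    map(x, pull_values)     # keep recursing into nested elements
  } else {
    x
  }
}
str(pull_values(xml_list$Bundle$entry))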
Input data:
<Bundle xmlns="http://hl7.org/fhir"><type value="searchset"/><total value="1"/>
<entry>
<resource>
<Patient><id value="TzfXm.YeCZh5GLGCXoCQqmjyn9vSjtJtIlcakCeyfbEcB"/>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-race">
<valueCodeableConcept>
<coding><system value="urn:oid:2.16.840.1.113883.5.104"/><code value="UNK"/><display value="Unknown"/></coding><text value="Unknown"/></valueCodeableConcept>
</extension>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-ethnicity">
<valueCodeableConcept>
<coding><system value="urn:oid:2.16.840.1.113883.5.50"/><code value="UNK"/><display value="Unknown"/></coding><text value="Unknown"/></valueCodeableConcept>
</extension>
<extension url="http://hl7.org/fhir/StructureDefinition/us-core-birth-sex">
<valueCodeableConcept>
<coding><system value="http://hl7.org/fhir/v3/AdministrativeGender"/><code value="F"/><display value="Female"/></coding><text value="Female"/></valueCodeableConcept>
</extension>
<identifier><use value="usual"/><system value="urn:oid:1.2.840.114350.1.13.172.3.7.5.737384.0"/><value value="E296"/></identifier>
<identifier><use value="usual"/><system value="urn:oid:1.2.840.114350.1.13.172.2.7.5.737384.100"/><value value="410000236"/></identifier><active value="true"/>
<name><use value="usual"/><text value="Mother Milltest"/><family value="Milltest"/><given value="Mother"/></name><gender value="female"/><birthDate value="1978-05-06"/><deceasedBoolean value="false"/></Patient>
</resource>
<search><mode value="match"/></search>
</entry>
</Bundle>
I'm trying out roadoi to access Unpaywall from R, but no matter what I try to query for, I'm getting this response:
Error in UseMethod("http_error") : no applicable method for
'http_error' applied to an object of class "c('simpleError', 'error',
'condition')"
Running methods(http_error) gives me this:
[1] http_error.character* http_error.integer* http_error.response*
Could this be caused by me being behind an institutional firewall? (even so, it seems weird that this would be the response...)
Is there a way around it?
http_error() (actually from the httr package) is a very simple generic: given a character URL it performs the request (http_error.character), inspects the response (http_error.response) and ultimately looks at the status code (http_error.integer). If the status code is >= 400 the function returns TRUE, otherwise FALSE.
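As a quick, hedged illustration of the character method (httpbin.org is just a convenient public test service, not part of the original question):
library(httr)
http_error("https://httpbin.org/status/404")  # TRUE:  404 >= 400
http_error("https://httpbin.org/status/200")  # FALSE: 200 <  400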
What your error says is that you (or some function in your call chain) try to call http_error on a simpleError object. My guess is that your firewall settings block the request. Because the request is blocked, the underlying httr::RETRY (which is called from oadoi_fetch) returns an error instead of a proper response object, and http_error sees just this error object and breaks.
If I locally switch off my proxy (through which I can make requests) I also get an error:
library(roadoi)
Sys.unsetenv(c("HTTP_PROXY", "HTTPS_PROXY"))
oadoi_fetch("10.1038/nature12373", email = "name#whatever.com")
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', 'error', 'condition')"
As soon as my proxy is set properly I get
Sys.setenv(HTTPS_PROXY = my_proxy, HTTP_PROXY = my_proxy)
oadoi_fetch("10.1038/nature12373", email = "name#whatever.com")
# # A tibble: 1 x 16
# doi best_oa_location oa_locations data_standard is_oa genre journal_is_oa journal_is_in_d~ journal_issns journal_name publisher title year updated non_compliant authors
# <chr> <list> <list> <int> <lgl> <chr> <lgl> <lgl> <chr> <chr> <chr> <chr> <chr> <chr> <list> <list>
# 1 10.1038~ <tibble [1 x 10]> <tibble [4 x~ 2 TRUE journa~ FALSE FALSE 0028-0836,147~ Nature Springer ~ Nanometre-s~ 2013 2019-04-0~
If the problem does indeed lie with the proxy, I would try the following, which helped me on my corporate Windows machine but may depend on your local IT setup:
## get the proxy settings
system("netsh winhttp show proxy")
Sys.setenv(HTTP_PROXY = <the proxy from netsh>, HTTPS_PROXY = <the proxy from netsh>)
Actually, you can reproduce the error easily:
httr::http_error(simpleError("Cannot reach the page"))
# Error in UseMethod("http_error") :
# no applicable method for 'http_error' applied to an object of class
# "c('simpleError', # 'error', 'condition')"
Hi, I have a deeply nested JSON file. I used sparklyr to read this JSON file and called the result "data".
First, I will show what the data structure looks like:
# Database: spark_connection
data
-a : string
-b : string
-c : (struct)
c1 : string
c2 : (struct)
c21: string
c22: string
Something like this. So if I extract "a" using:
data %>% sdf_select(a)
I can view what is inside, like:
# Database: spark_connection
a
<chr>
1 Hello world
2 Stack overflow is epic
THE PROBLEM now comes when I use sdf_select() on a deeper structure, i.e.
data %>% sdf_select(c.c2.c22)
Viewing the data inside, I get this
# Database: spark_connection
c22
<list>
1 <list [1]>
2 <list [1]>
3 <list [1]>
4 <lgl [1]>
So if I collect the data, turning the Spark data frame into an R data frame, and view it with
View(collect(data %>% sdf_select(c.c2.c22)))
The data shows
1 list("Good")
2 list("Bad")
3 NA
How do I turn every entry in each list above into a plain column, so that it shows only Good, Bad, NA instead of the list("") wrappers?
I was unable to reproduce this. I used
[{"a":"jkl","b":"mno","c":{"c1":"ghi","c2":{"c21":"abc","c22":"def"}}}]
written to a test.json, followed by
spk_df <- spark_read_json(sc, "tmp", "file:///path/to/test.json")
spk_df %>% sdf_schema_viewer()
This seems to match the schema that you provided. However when I use sparklyr.nested::sdf_select() I get a different result.
spk_df %>% sdf_select(c.c2.c22)
# # Source: table<sparklyr_tmp_7431373dca00> [?? x 1]
# # Database: spark_connection
# c22
# <chr>
# 1 def
where c22 is a character column.
My guess is that in your real data, one of the levels is actually an array of structs. If this is the case, then indexing into an array forces a list wrapping (or else data would need to be dropped). You can resolve this in spark land using sdf_explode or you can resolve it locally in a variety of ways. For example, using purrr you would do something like:
df <- collect(spk_df)
df %>% mutate(c22 = purrr::map(c22, unlist))
It is possible that you will need to write a function wrapping unlist to deal with different data types in different rows (the NA values are logical).
unlist_and_cast <- function(x) {
  as.character(unlist(x))
}
df %>% mutate(c22 = purrr::map(c22, unlist_and_cast))
would do the trick I think (untested).
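For example, assuming each element unlists to at most one value (an assumption on my part), a hedged one-line variant that yields a plain character column is:
# empty or NULL entries fall through to NA_character_
df %>% mutate(c22 = purrr::map_chr(c22, ~ as.character(unlist(.x))[1]))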