Linking functions with purrr and referencing nested variables in R

I'm scraping data from a large online database (GBIF), which requires three steps: (1) match a GBIF "key" identifier to a species name, (2) send a query to the database, getting a download key ("res") in return, and (3) download, import, and filter the data associated with that species. I've written a function for each of these (not including the actual code here, since it's unfortunately very long and requires login credentials):
get_gbif_key <- function(species) {}
get_gbif_res <- function(gbifkey) {}
get_gbif_dat <- function(gbifres) {}
I have a list of several hundred species to which I want to apply these three functions in order. I know they work individually, but I can't figure out how to feed them into each other (probably using purrr?) and reference the correct inputs from the nested outputs of the previous function.
So, for example:
> testlist <- c('Gadus morhua','Caretta caretta')
> testkey <- map(testlist, get_gbif_key)
> testkey
[[1]]
[1] 8084280
[[2]]
[1] 8894817
Here's where I'm stuck. I want to feed the keys in this list structure into the next function, but I don't know how to properly reference them using map or other functions. I can do it by manually creating a new list for the next function:
> testlist2 <- c('8084280','8894817')
> testres <- map(testlist2, get_gbif_res)
> testres
[[1]]
<<gbif download>>
Username: XXXX
E-mail: XXXX@gmail.com
Download key: 0001342-180412121330197
[[2]]
<<gbif download>>
Username: XXXX
E-mail: XXXX@gmail.com
Download key: 0001343-180412121330197
EDIT: the structure of this output may be posing a problem here. When I run listviewer::jsonedit(testres), it just looks like a normal nested list with entries 0 and 1 holding the two download keys. However, when I run str(testres), I get the following:
> str(testres)
List of 2
$ :Class 'occ_download' atomic [1:1] 0001342-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
$ :Class 'occ_download' atomic [1:1] 0001343-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
And, again, for the third one:
> testlist3 <- c('0001342-180412121330197','0001343-180412121330197')
> testdat <- map(testlist3, get_gbif_dat)
This successfully loads a list object with the desired data into R (it has two unnamed elements, 0 and 1, each of which is a list of 28 requested variables for the corresponding species). Any advice for scripting this get_gbif_key %>% get_gbif_res %>% get_gbif_dat workflow in a way that unpacks the preceding list structures correctly?

Here's what to try based on the evidence provided so far: the results suggest you should be able to succeed with nested map-ping:
yourData <- map(
  unlist(                            # to make the same class as your single-function version
    map(
      map(testlist, get_gbif_key),   # returns gbif keys
      get_gbif_res)),                # returns gbif res's
  get_gbif_dat)                      # returns data items
The last item you showed the structure for is just a list of atomic character vectors with some extra attributes, and your functions seem to handle that without difficulty, so mapping should succeed.
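If the nesting gets hard to read, the same chain can also be written as a pipeline. A minimal sketch, untested against the live API, assuming each function accepts the single value produced by the step before it (the as.character() step strips the occ_download attributes, in case get_gbif_dat insists on a plain string like the ones in testlist3):
library(purrr)  # re-exports the magrittr pipe

testdat <- testlist %>%
  map(get_gbif_key) %>%   # one GBIF key per species
  map(get_gbif_res) %>%   # one occ_download object per key
  map(as.character) %>%   # plain download-key strings, mirroring testlist3
  map(get_gbif_dat)       # one data list per species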

Related

Use purrr on a position of element in nested list?

Situation: I have a nested list (reproducible example below). I want to use purrr to iterate over the second element of each nested list and apply a date conversion function.
Problem: I can easily write a for loop to do this, but I want to use purrr. My attempts on the nested list have not worked out: a normal list is fine; nested, by position, is not.
Reproducible example code from Maurits Evers (Thank you!)
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
Any assistance appreciated!
Since you don't provide sample data, let's create some minimal mock data similar to what is shown in your screenshot.
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
To cast the third element of every list entry (the date sits in the third slot of the mock data) with as.Date, we can then do
lst <- map(lst, ~{.x[[3]] <- as.Date(.x[[3]]); .x})
We can confirm that the third element of every list element is an object of type Date
str(lst)
#List of 2
# $ :List of 3
# ..$ : chr "one"
# ..$ : chr "12345"
# ..$ : Date[1:1], format: "2019-01-01"
# $ :List of 3
# ..$ : chr "two"
# ..$ : chr "67890"
# ..$ : Date[1:1], format: "2019-01-02"
Update
A more purrr/tidyverse-canonical approach would be to use modify_at (thanks @H1)
lst <- map(lst, ~modify_at(.x, 3, as.Date))
The result is the same as before.
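As a side note, the question mentions the second element while the mock data carries the date in the third slot; if your real list really holds the date second, only the index passed to modify_at changes:
lst <- map(lst, ~modify_at(.x, 2, as.Date))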

Adding attributes to a named list erases the names of that list

Simple question based on an unexpected behavior I observed. I have a named list in R to which I add attributes with the attributes<- call. This erases the names of the list. Why, and how can I prevent that?
ex:
ll <- list(a=1:4, b="der")
str(ll)
List of 2
$ a: int [1:4] 1 2 3 4
$ b: chr "der"
attributes(ll) <- list(attr1 = "my_attr")
str(ll)
List of 2
$ : int [1:4] 1 2 3 4
$ : chr "der"
- attr(*, "attr1")= chr "my_attr"
There are no names anymore.
I can get them back doing this:
names(ll) <- c("a", "b")
str(ll)
List of 2
$ a: int [1:4] 1 2 3 4
$ b: chr "der"
- attr(*, "attr1")= chr "my_attr"
However, I would like not to have to record the names before and reapply them after. I have a feeling the original names are an attribute that gets overwritten by the attributes<- call. Any idea how to get around that?
I think this (i.e., setting a single new attribute, or modifying an existing one, while leaving existing attributes in place) is exactly what attr()<- is for:
> attr(ll,"attr1") <- "my_attr"
> ll
$a
[1] 1 2 3 4
$b
[1] "der"
attr(,"attr1")
[1] "my_attr"
From the documentation for attributes:
Assigning attributes first removes all attributes, then sets any dim
attribute and then the remaining attributes in the order given: this
ensures that setting a dim attribute always precedes the dimnames
attribute.
I think capturing names beforehand may indeed be the only way, if you must use attributes. But I would consider changing the attribute with a more targeted function, if possible. What are you trying to set?
You may for instance consider adding a comment; see ?comment for the documentation.
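A minimal sketch of comment(): it stores its value in a "comment" attribute without touching the names:
ll <- list(a = 1:4, b = "der")
comment(ll) <- "my note"
comment(ll)
# [1] "my note"
names(ll)  # still intact
# [1] "a" "b"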
A good way to add attributes to an existing object is to do:
attributes(ll) <- append(attributes(ll), list(attr1 = "my_attr"))
This is more robust, as it works for attributes on both lists AND data.frames, and requires only one line.
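Applied to the example above, the names survive because they are themselves part of attributes(ll) and get re-assigned together with the new tag:
ll <- list(a = 1:4, b = "der")
attributes(ll) <- append(attributes(ll), list(attr1 = "my_attr"))
str(ll)
# List of 2
#  $ a: int [1:4] 1 2 3 4
#  $ b: chr "der"
#  - attr(*, "attr1")= chr "my_attr"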

Can't get metadata from dataframe using DataframeSource in tm for R

I have a dataframe with the following variables:
doc_id text URL author date forum
When I run
samplecorpus <- Corpus(DataframeSource(sampledataframe))
the documentation says I should get a corpus with all of the extra variables added as document-level metadata.
https://rdrr.io/rforge/tm/man/DataframeSource.html
http://finzi.psych.upenn.edu/R/library/tm/html/DataframeSource.html
Instead, I get a corpus that has all of the right documents in the right order, but all of their metadata is blank. I need this metadata to filter the documents for future analysis.
Someone else asked a similar question, but it never got answered...
tm package: DataframeSource() (the readTabular() replacement in newer tm versions) ignores my other columns as metadata
Does anyone have any ideas on how to fix this?
Thanks!
The documentation for tm explains this if you dig down (see ??tm::DublinCore). From the docs:
A corpus has two types of metadata. Corpus metadata ("corpus") contains corpus specific metadata in form of tag-value pairs. Document level metadata ("indexed") contains document specific metadata but is stored in the corpus as a data frame. Document level metadata is typically used for semantic reasons (e.g., classifications of documents form an own entity due to some high-level information like the range of possible values) or for performance reasons (single access instead of extracting metadata of each document). The latter can be seen as a form of indexing, hence the name "indexed". Document metadata ("local") are tag-value pairs directly stored locally at the individual documents.
DataframeSource automatically assigns only the corpus metadata*. For example, see what the following prints:
library(tm)
data <- data.frame(doc_id = c(234345345, 1299),
                   text = c("The Prince and the Pauper",
                            "Little Women"),
                   author = c('Mark Twain', 'Louisa May Alcott'),
                   date = c(1881, 1868),
                   stringsAsFactors = FALSE)
samplecorpus <- Corpus(DataframeSource(data))
meta(samplecorpus)
# Or even
meta(samplecorpus[1], tag = 'author')
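For reference, on a recent tm build the first call should print the extra columns as a data frame along these lines (exact formatting may differ):
meta(samplecorpus)
#              author date
# 1        Mark Twain 1881
# 2 Louisa May Alcott 1868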
In order to assign metadata at the document level, you can work with meta to change tags. Bizarrely, this only works if you use VCorpus. So changing the above slightly, you can do:
samplecorpus <- VCorpus(DataframeSource(data))
# Can now set document metadata tags
meta(samplecorpus[[1]], tag = 'author') <- 'Mark Twain'
*EDIT:
Contemplating further (and responding to the OP's comment), I agree that the documentation is not a completely accurate description of the package's observed behavior. The quoted documentation above refers to three levels (corpus, indexed document level, and local document level), which in my example appear to correspond to samplecorpus, samplecorpus[1], and samplecorpus[[1]], respectively. If this is correct, then the metadata is being assigned by DataframeSource at the promised level (if somewhat vaguely, as the docs never specify which document level). However, the docs also claim the indexed document level is stored as a data frame and the local level as tag-value pairs, yet both are stored as lists. Confusing. I can only conclude that this is either a bug in the package implementation or an error in the docs.
Barring contacting the package authors to clear this up (not a bad idea), I would propose the following workaround:
samplecorpus <- VCorpus(DataframeSource(data))
transfer_metadata <- function(x, i, tag){
  return(meta(x[i], tag=tag)[[tag]])
}
tags <- colnames(data)
tags <- tags[! tags %in% c('doc_id', 'text')]
for(i in 1:length(samplecorpus)){
  for (tag in tags){
    meta(samplecorpus[[i]], tag=tag) <- transfer_metadata(samplecorpus, i=i, tag=tag)
  }
}
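A quick spot-check (expected output shown as a comment) confirms the tags are now populated at the local document level:
meta(samplecorpus[[1]], tag = 'author')
# [1] "Mark Twain"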
You have to check whether everything is loaded correctly. I made an example docs data.frame so you can see how it works. I used the same column names you have and added one extra column (tags). Based on this example you can check whether you have an issue somewhere.
docs <- data.frame(doc_id = c("doc_1", "doc_2"),
                   text = c("This is a text.", "This another one."),
                   url = c("https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r",
                           "https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r"),
                   author = c("Emi", "Emi"),
                   date = as.Date(c("2018-09-20", "2018-09-21")),
                   forum = c("stackoverflow", "stackoverflow"),
                   tags = c("r", "tm"),
                   stringsAsFactors = T)
# use Corpus or VCorpus
my_corpus <- Corpus(DataframeSource(docs))
meta(my_corpus)
url author date
1 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-20
2 https://stackoverflow.com/questions/52433344/cant-get-metadata-from-dataframe-using-dataframesource-in-tm-for-r Emi 2018-09-21
forum tags
1 stackoverflow r
2 stackoverflow tm
my_index <- meta(my_corpus, "tags") == "r"
inspect(my_corpus[my_index])
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 5
Content: documents: 1
doc_1
This is a text.
Now beware there is a difference in how meta is treated. If you do str(my_corpus) you will see the following:
List of 2
$ doc_1:List of 2
..$ content: chr "This is a text."
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2018-09-21 08:55:44"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "doc_1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ doc_2:List of 2
......
The meta info you see here is from meta(my_corpus, type = "local"). The metadata loaded with DataframeSource is of type "indexed"; see meta(my_corpus, type = "indexed").
Page 5 of the tm vignette is important to read and experiment with to see all the different options that meta and DublinCore offer.

Incorrectly extracting repeated values from list of XML objects in R

I'm having a problem using lapply and xml_find_first from the xml2 package to pull nodes from a list of xml objects. I'm pulling a few thousand records from the Scopus API. Since I can only get 25 records at a time, I run it so I get a list with 100+ elements of 25 records each. I know a few of the records have missing values, so my goal is to shuffle things around until I get a list where each record is its own element, then use lapply and xml_find_first so that I'll get null values where appropriate. The problem is that I end up pulling repeated values as if everything were still nested in the initial lists.
Here's a reproducible example with a list of 2 elements with 2 records each, with citedby-count missing from the last one:
```{r}
library(xml2)
# Simulate how data come in from Scopus
# Build 2 list elements, 2 entries each
el1 <- read_xml(
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-1542382496</eid>
<citedby-count>9385</citedby-count>
</entry>
<entry>
<eid>2-s2.0-0032721879</eid>
<citedby-count>4040</citedby-count>
</entry>
</feed>"
)
el2 <- read_xml( # This one's missing citedby-count for last entry
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-0041751098</eid>
<citedby-count>3793</citedby-count>
</entry>
<entry>
<eid>2-s2.0-73449149291</eid>
</entry>
</feed>"
)
# Combine into list
lst <- list(el1,el2)
# Check
lst
```
This gives me a list of two XML documents. My goal is to pull out the entries so they are list items; this way, xml_find_first should stick a null value in for the entry where citedby-count is missing.
```{r}
# Pull entry nodes
lst2 <- lapply(lst, xml_find_all, "//entry")
# Unlist
lst2 <- unlist(lst2, recursive=FALSE)
# Check - each entry is its own element
lst2
```
The hangup is when I try to extract a node that I know is missing in some of the entries in a way that will leave a null where it's missing. xml_find_first should do that. But...
```{r}
cbc <- lapply(lst2, xml_find_first, "//citedby-count")
cbc <- lapply(cbc, xml_text)
cbc # Repeats the first values of original nesting
```
So I checked what would happen with xml_find_all:
```{r}
cbc2 <- lapply(lst2, xml_find_all, "//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2 # Elements contain all values from initial nesting
```
Which makes no sense in comparison with the output of lst2 above. For some reason, pulling the text retains the values from the initial nesting, even though it doesn't show up when looking at the final list of xml objects. I'm stumped.
Indeed, as @Dave2e comments, do not simply use the "anywhere" XPath search (specifically the descendant-or-self search) with // for child elements, as the search will run on the entire document.
How can this be if I do not explicitly call the original document? If you run str() on any of your xml_find lists, you will see the object carries Rcpp external pointers to the current node and document available for recall as needed. In fact, I believe the node pointer displays when calling the list.
str(lst2)
# List of 4
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
lst2[[1]]$doc
# <pointer: 0x000000000ca7ff90>
typeof(lst2[[1]]$doc)
# [1] "externalptr"
Therefore, be careful of context when searching. You can use the dot prefix (as @Dave2e advises), .//, or no slashes at all for retrieval of child elements, which here are equivalent.
cbc2 <- lapply(lst2, xml_find_all, "citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
cbc2 <- lapply(lst2, xml_find_all, ".//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
Do note the .// will search ALL descendants (i.e., children, grandchildren, etc.) starting at current node. See What is the difference between .// and //* in XPath?
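The same context fix applies to the original xml_find_first call. Because xml_find_first returns a missing node when nothing matches, xml_text then yields exactly the NA placeholder the question was after:
cbc <- lapply(lst2, xml_find_first, ".//citedby-count")
sapply(cbc, xml_text)
# [1] "9385" "4040" "3793" NA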

Sorting xts data to look like panel data in R

I need to use the 'PerformanceAnalytics' package in R, and this package requires converting the data into an xts object. The data can be downloaded from this link: https://drive.google.com/file/d/0B8usDJAPeV85elBmWXFwaXB4WUE/edit?usp=sharing . Hence, I have created an xts object using the following commands:
data<-read.csv('monthly.csv')
dataxts <- xts(data[,-1],order.by=as.Date(data$datadate,format="%d/%m/%Y"))
But after doing this, it loses the panel data structure. I tried to sort the xts data to get it back into panel form, but failed.
Can anyone please help me reorganize the xts data to look like panel data? I need to sort it by firm id (gvkey) and date (datadate).
xts objects are sorted by time index only. They cannot be sorted by anything else.
I would encourage you to split your data.frame into a list, by gvkey. Then convert each list element to xts and remove the columns that do not vary across time, storing them as xtsAttributes. You might also want to consider using the yearmon class, since you're dealing with monthly data.
You will have to determine how you want to encode non-numeric, time-varying values, since you cannot mix types in xts objects.
Data <- read.csv('monthly.csv', nrow=1000, as.is=TRUE)
DataList <- split(Data, Data$gvkey)
xtsList <- lapply(DataList, function(x) {
  attrCol <- c("iid","tic","cusip","conm","exchg","secstat","tpci",
               "cik","fic","conml","costat","idbflag","dldte")
  numCol <- c("ajexm","ajpm","cshtrm","prccm","prchm","prclm",
              "trfm", "trt1m", "rawpm", "rawxm", "cmth", "cshom", "cyear")
  toEncode <- c("isalrt","curcdm")
  y <- xts(x[,numCol], as.Date(x$datadate, format="%d/%m/%Y"))
  xtsAttributes(y) <- as.list(x[1,attrCol])
  y
})
Each list element is now an xts object, and is much more compact, since you do not repeat completely redundant data. And you can easily run analysis on each gvkey via lapply and friends.
> str(xtsList[["1004"]])
An ‘xts’ object on 1983-01-31/2012-12-31 containing:
Data: num [1:360, 1:13] 3.38 3.38 3.38 3.38 3.38 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:13] "ajexm" "ajpm" "cshtrm" "prccm" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 13
$ iid : int 1
$ tic : chr "AIR"
$ cusip : int 361105
$ conm : chr "AAR CORP"
$ exchg : int 11
$ secstat: chr "A"
$ tpci : chr "0"
$ cik : int 1750
$ fic : chr "USA"
$ conml : chr "AAR Corp"
$ costat : chr "A"
$ idbflag: chr "D"
$ dldte : chr ""
And you can access the attributes via xtsAttributes:
> xtsAttributes(xtsList[["1004"]])$fic
[1] "USA"
> xtsAttributes(xtsList[["1004"]])$tic
[1] "AIR"
An efficient way to achieve this goal is to convert the panel data (long format) into wide format using the 'reshape2' package. After performing the estimations, convert it back to long/panel format. Here is an example:
library(foreign)
library(reshape2)
dd <- read.dta("DDA.dta") // DDA.dta is Stata data; keep only date, id and variable of interest (i.e. three columns in total)
wdd<-dcast(dd, datadate~gvkey) // gvkey is the id
require(PerformanceAnalytics)
wddxts <- xts(wdd[,-1],order.by=as.Date(wdd$datadate,format= "%Y-%m-%d"))
ssd60A<-rollapply(wddxts,width=60,SemiDeviation,by.column=TRUE,fill=NA) // e.g of rolling window calculation
ssd60A.df<-as.data.frame(ssd60A.xts) // convert dataframe to xts
ssd60A.df$datadate=rownames(ssd60A.df) // insert time index
lssd60A.df<-melt(ssd60A.df, id.vars=c('datadate'),var='gvkey') // convert back to panel format
write.dta(lssd60A.df,"ssd60A.dta",convert.factors = "string") // export as Stata file
Then simply merge it with the master database to perform some regression.
