Use purrr on a position of an element in a nested list?

Situation: I have a nested list like the one below. I want to use purrr to iterate over the second element of each nested list and apply a date conversion function.
Problem: I can easily write a for loop to do this, but I want to do it with purrr. My attempts on the nested list have not worked out: iterating over a normal list is fine, but indexing by position within a nested list is not.
Reproducible example code from Maurits Evers (Thank you!)
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
Any assistance appreciated!

Please see the comment above to understand how to provide a reproducible example including sample data.
Since you don't provide sample data, let's create some minimal mock data similar to what is shown in your screenshot.
lst <- list(
  list("one", "12345", "2019-01-01"),
  list("two", "67890", "2019-01-02"))
To cast the third element of every list element to Date with as.Date, we can then do
lst <- map(lst, ~{.x[[3]] <- as.Date(.x[[3]]); .x})
We can confirm that the third element of every list element is an object of type Date
str(lst)
#List of 2
# $ :List of 3
# ..$ : chr "one"
# ..$ : chr "12345"
# ..$ : Date[1:1], format: "2019-01-01"
# $ :List of 3
# ..$ : chr "two"
# ..$ : chr "67890"
# ..$ : Date[1:1], format: "2019-01-02"
Update
A more purrr/tidyverse-canonical approach would be to use modify_at (thanks @H1)
lst <- map(lst, ~modify_at(.x, 3, as.Date))
The result is the same as before.
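If you prefer to avoid the formula shorthand, map() forwards extra arguments to the function it applies, so the same conversion can also be written by passing modify_at's arguments directly (a sketch on the same lst as above):
library(purrr)
# map() passes 3 and as.Date through to modify_at()
lst <- map(lst, modify_at, 3, as.Date)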

Related

jsonlite array of arrays

When using jsonlite to import a JSON that has an array inside another array, I get an undesired unnamed list. Example below:
myjson <- jsonlite::fromJSON('{
  "class" : "human",
  "type" : [{
    "shape":"thin",
    "face":[{"eyes":"blues","hair":"brown"}]
  }]
}')
str(myjson)
List of 2
$ class: chr "human"
$ type :'data.frame': 1 obs. of 2 variables:
..$ shape: chr "thin"
..$ face :List of 1
.. ..$ :'data.frame': 1 obs. of 2 variables:
.. .. ..$ eyes: chr "blues"
.. .. ..$ hair: chr "brown"
I would like to access the "eyes" field as below (however it doesn't work):
myjson[["type"]][["face"]][["eyes"]]
NULL
Instead, I need to add "[[1]]" to make it work:
myjson[["type"]][["face"]][[1]][["eyes"]]
[1] "blues"
Any ideas how could I format the json to get rid of this unnamed list?
The thing is, unnamed lists are used whenever there is a JSON array [{}, {}, ...]. The fact that your first array is turned into a named structure while the second, inner one is turned into an unnamed list is because jsonlite::fromJSON has simplifyDataFrame = TRUE by default (flatten defaults to FALSE), which produces this behavior. I haven't looked into the source code, but it seems that the simplification involved (turning a one-element array of objects into a named structure) only applies to the top-level objects.
A workaround is to apply a function that turns any unnamed list with only a single element into that element itself.
myjson <- lapply(myjson, function(x) {
  if (is.list(x)) {
    # if the element is a list, replace each child with its first element
    lapply(x, function(y) y[[1]])
  } else {
    x
  }
})
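Alternatively (a sketch, not part of the workaround above): read the JSON without simplification so the structure stays as plain nested lists, then drill in with purrr::pluck(), using positional indices for the unnamed levels.
library(jsonlite)
library(purrr)
raw <- fromJSON('{
  "class" : "human",
  "type" : [{
    "shape":"thin",
    "face":[{"eyes":"blues","hair":"brown"}]
  }]
}', simplifyVector = FALSE)
# unnamed array levels are addressed by position, named ones by name
pluck(raw, "type", 1, "face", 1, "eyes")
# [1] "blues"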

Incorrectly extracting repeated values from list of XML objects in R

I'm having a problem using lapply and xml_find_first from the xml2 package to pull nodes from a list of XML objects. I'm pulling a few thousand records from the Scopus API. Since I can only get 25 records at a time, I run it so I get a list with 100+ elements of 25 records each. I know a few of the records have missing values, so my goal is to shuffle things around until I get a list where each record is its own element, then use lapply and xml_find_first so that I'll get null values where appropriate. The problem is that I end up pulling repeated values as if everything were still nested in the initial lists.
Here's a reproducible example with a list of 2 elements with 2 records each, with citedby-count missing from the last one:
```{r}
library(xml2)
# Simulate how data come in from Scopus
# Build 2 list elements, 2 entries each
el1 <- read_xml(
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-1542382496</eid>
<citedby-count>9385</citedby-count>
</entry>
<entry>
<eid>2-s2.0-0032721879</eid>
<citedby-count>4040</citedby-count>
</entry>
</feed>"
)
el2 <- read_xml( # This one's missing citedby-count for last entry
"<feed>
<blah>Bunch of stuff I don't need</blah>
<blah>Bunch of other stuff I don't need</blah>
<entry>
<eid>2-s2.0-0041751098</eid>
<citedby-count>3793</citedby-count>
</entry>
<entry>
<eid>2-s2.0-73449149291</eid>
</entry>
</feed>"
)
# Combine into list
lst <- list(el1,el2)
# Check
lst
```
This gives me a list of two xml_document objects.
My goal is to pull out the entries so they are list items. This way, xml_find_first should stick a null value in for the entry where citedby-count is missing.
```{r}
# Pull entry nodes
lst2 <- lapply(lst, xml_find_all, "//entry")
# Unlist
lst2 <- unlist(lst2, recursive=FALSE)
# Check - each entry is its own element
lst2
```
The hangup is when I try to extract a node that I know is missing in some of the entries in a way that will leave a null where it's missing. xml_find_first should do that. But...
```{r}
cbc <- lapply(lst2, xml_find_first, "//citedby-count")
cbc <- lapply(cbc, xml_text)
cbc # Repeats the first values of original nesting
```
So I checked what would happen with xml_find_all:
```{r}
cbc2 <- lapply(lst2, xml_find_all, "//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2 # Elements contain all values from initial nesting
```
Which makes no sense in comparison with the output of lst2 above. For some reason, pulling the text retains the values from the initial nesting, even though it doesn't show up when looking at the final list of xml objects. I'm stumped.
Indeed, as @Dave2e comments, do not simply use the "anywhere" XPath search (specifically the descendant-or-self axis) with // for child elements, as the search will run over the entire document.
How can this be if I do not explicitly call the original document? If you run str() on any of your xml_find lists, you will see that each object carries Rcpp external pointers to the current node and the document, available for recall as needed. In fact, I believe the node pointer displays when calling the list.
str(lst2)
# List of 4
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
# $ :List of 2
# ..$ node:<externalptr>
# ..$ doc :<externalptr>
# ..- attr(*, "class")= chr "xml_node"
lst2[[1]]$doc
# <pointer: 0x000000000ca7ff90>
typeof(lst2[[1]]$doc)
# [1] "externalptr"
Therefore, be careful of context when searching. You can use the dot prefix (as @Dave2e advises), .//, or no slashes at all to retrieve child elements; here the two are equivalent.
cbc2 <- lapply(lst2, xml_find_all, "citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
cbc2 <- lapply(lst2, xml_find_all, ".//citedby-count")
cbc2 <- lapply(cbc2, xml_text)
cbc2
# [[1]]
# [1] "9385"
# [[2]]
# [1] "4040"
# [[3]]
# [1] "3793"
# [[4]]
# character(0)
Do note that .// will search ALL descendants (i.e., children, grandchildren, etc.) starting at the current node. See What is the difference between .// and //* in XPath?
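With the relative XPath in place, xml_find_first() now behaves the way the question hoped (a sketch using lst2 from above): it returns an xml_missing node where the element is absent, which xml_text() turns into NA.
cbc <- lapply(lst2, xml_find_first, ".//citedby-count")
vapply(cbc, xml_text, character(1))
# [1] "9385" "4040" "3793" NA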

linking functions with purrr and referencing nested variables

I'm scraping data from a large online database (GBIF), which requires three steps: (1) match a GBIF "key" identifier to a species name, (2) send a query to the database, getting a download key ("res") in return, and (3) download, import, and filter the data associated with that species. I've written a function for each of these (not including the actual code here, since it's unfortunately very long and requires login credentials):
get_gbif_key <- function(species) {}
get_gbif_res <- function(gbifkey) {}
get_gbif_dat <- function(gbifres) {}
I have a list of several hundred species to which I want to apply these three functions in order. I know they work individually, but I can't figure out how to feed them into each other (probably using purrr?) and reference the correct inputs from the nested outputs of the previous function.
So, for example:
> testlist <- c('Gadus morhua','Caretta caretta')
> testkey <- map(testlist, get_gbif_key)
> testkey
[[1]]
[1] 8084280
[[2]]
[1] 8894817
Here's where I'm stuck. I want to feed the keys in this list structure into the next function, but I don't know how to properly reference them using map or other functions. I can do it by manually creating a new list for the next function:
> testlist2 <- c('8084280','8894817')
> testres <- map(testlist2, get_gbif_res)
> testres
[[1]]
<<gbif download>>
Username: XXXX
E-mail: XXXX@gmail.com
Download key: 0001342-180412121330197
[[2]]
<<gbif download>>
Username: XXXX
E-mail: XXXX#gmail.com
Download key: 0001343-180412121330197
EDIT: the structure of this output may be posing a problem here. When I run listviewer::jsonedit(testres), it just looks like a normal nested list with entries 0 and 1 holding the two download keys. However, when I run str(testres), I get the following:
> str(testres)
List of 2
$ :Class 'occ_download' atomic [1:1] 0001342-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
$ :Class 'occ_download' atomic [1:1] 0001343-180412121330197
.. ..- attr(*, "user")= chr "XXXX"
.. ..- attr(*, "email")= chr "XXXX#gmail.com"
And, again, for the third one:
> testlist3 <- c('0001342-180412121330197','0001343-180412121330197')
> testdat <- map(testlist3, get_gbif_dat)
Which successfully loads a list object with the desired data into R (it has two unnamed elements, 0 and 1, each of which is a list of 28 requested variables for each species). Any advice for scripting this get_gbif_key %>% get_gbif_res %>% get_gbif_dat workflow in a way that unpacks the preceding list structures correctly?
Here's what you should try based on the evidence provided so far. Basically, the results suggest you should be able to succeed with nested map()-ing:
yourData <- map(
  unlist(                            # to make the same class as your single-function version
    map(
      map(testlist, get_gbif_key),   # returns the gbif keys
      get_gbif_res)),                # returns the gbif res objects
  get_gbif_dat)                      # returns the data items
The last item that you showed the structure for is just a list of atomic character vectors with some extra attributes, and your functions seem to handle that without difficulty, so mapping should succeed.
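Equivalently (a sketch, assuming the get_gbif_* functions behave as described in the question), the nesting reads more naturally as a pipeline:
library(purrr)  # purrr re-exports %>%
testdat <- testlist %>%
  map(get_gbif_key) %>%   # species name -> GBIF key
  map(get_gbif_res) %>%   # key -> download object
  unlist() %>%            # flatten to a character vector of download keys
  map(get_gbif_dat)       # download key -> data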

Sorting xts data to look like panel data in R

I need to use the 'PerformanceAnalytics' package of R, which requires converting the data into an xts object. The data can be downloaded from this link: https://drive.google.com/file/d/0B8usDJAPeV85elBmWXFwaXB4WUE/edit?usp=sharing . Hence, I have created an xts object using the following commands:
data<-read.csv('monthly.csv')
dataxts <- xts(data[,-1],order.by=as.Date(data$datadate,format="%d/%m/%Y"))
But after doing this, it loses the panel data structure. I tried to sort the xts data to get it back into panel data form but failed.
Can anyone please help me reorganize the xts data to look like panel data again? I need to sort it by firm id (gvkey) and date (datadate).
xts objects are sorted by time index only. They cannot be sorted by anything else.
I would encourage you to split your data.frame into a list, by gvkey. Then convert each list element to xts and remove the columns that do not vary across time, storing them as xtsAttributes. You might also want to consider using the yearmon class, since you're dealing with monthly data.
You will have to determine how you want to encode non-numeric, time-varying values, since you cannot mix types in xts objects.
Data <- read.csv('monthly.csv', nrow=1000, as.is=TRUE)
DataList <- split(Data, Data$gvkey)
xtsList <- lapply(DataList, function(x) {
  # columns that do not vary over time become xts attributes
  attrCol <- c("iid","tic","cusip","conm","exchg","secstat","tpci",
               "cik","fic","conml","costat","idbflag","dldte")
  # numeric, time-varying columns become the xts data
  numCol <- c("ajexm","ajpm","cshtrm","prccm","prchm","prclm",
              "trfm", "trt1m", "rawpm", "rawxm", "cmth", "cshom", "cyear")
  # non-numeric, time-varying columns you still need to encode
  toEncode <- c("isalrt","curcdm")
  y <- xts(x[,numCol], as.Date(x$datadate, format="%d/%m/%Y"))
  xtsAttributes(y) <- as.list(x[1,attrCol])
  y
})
Each list element is now an xts object, and is much more compact, since you do not repeat completely redundant data. And you can easily run analysis on each gvkey via lapply and friends.
> str(xtsList[["1004"]])
An ‘xts’ object on 1983-01-31/2012-12-31 containing:
Data: num [1:360, 1:13] 3.38 3.38 3.38 3.38 3.38 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:13] "ajexm" "ajpm" "cshtrm" "prccm" ...
Indexed by objects of class: [Date] TZ: UTC
xts Attributes:
List of 13
$ iid : int 1
$ tic : chr "AIR"
$ cusip : int 361105
$ conm : chr "AAR CORP"
$ exchg : int 11
$ secstat: chr "A"
$ tpci : chr "0"
$ cik : int 1750
$ fic : chr "USA"
$ conml : chr "AAR Corp"
$ costat : chr "A"
$ idbflag: chr "D"
$ dldte : chr ""
And you can access the attributes via xtsAttributes:
> xtsAttributes(xtsList[["1004"]])$fic
[1] "USA"
> xtsAttributes(xtsList[["1004"]])$tic
[1] "AIR"
An efficient way to achieve this goal is to convert the panel data (long format) into wide format using the 'reshape2' package. After performing the estimations, convert it back to long (panel) format. Here is an example:
library(foreign)
library(reshape2)
dd <- read.dta("DDA.dta")  # DDA.dta is Stata data; keep only date, id, and the variable of interest (three columns in total)
wdd <- dcast(dd, datadate ~ gvkey)  # gvkey is the id
require(PerformanceAnalytics)
wddxts <- xts(wdd[,-1], order.by=as.Date(wdd$datadate, format="%Y-%m-%d"))
ssd60A <- rollapply(wddxts, width=60, SemiDeviation, by.column=TRUE, fill=NA)  # e.g. a rolling-window calculation
ssd60A.df <- as.data.frame(ssd60A)  # convert the xts back to a data.frame
ssd60A.df$datadate <- rownames(ssd60A.df)  # re-insert the time index as a column
lssd60A.df <- melt(ssd60A.df, id.vars=c('datadate'), var='gvkey')  # convert back to panel (long) format
write.dta(lssd60A.df, "ssd60A.dta", convert.factors = "string")  # export as a Stata file
Then simply merge it with the master database to perform some regression.
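That merge might look like the following (a sketch; master.df is a hypothetical name for your master panel, assumed to have gvkey and datadate columns in matching formats):
# merge the rolling statistic back into the hypothetical master panel
merged <- merge(master.df, lssd60A.df, by = c("gvkey", "datadate"))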

How to delete a row from a data.frame without losing the attributes

For starters: I have searched for hours on this problem by now, so if the answer turns out to be trivial, please forgive me...
What I want to do is delete a row (no. 101) from a data.frame. It contains test data and should not appear in my analyses. My problem is: Whenever I subset from the data.frame, the attributes (esp. comments) are lost.
str(x)
# x has comments for each variable
x <- x[1:100,]
str(x)
# now x has lost all comments
It is well documented that subsetting will drop all attributes - so far, it's perfectly clear. The manual (e.g. http://stat.ethz.ch/R-manual/R-devel/library/base/html/Extract.data.frame.html) even suggests a way to preserve the attributes:
## keeping special attributes: use a class with a
## "as.data.frame" and "[" method:
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x, i, ...) {
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  r
}
d <- data.frame(i = 0:7, f = gl(2, 4),
                u = structure(11:18, unit = "kg", class = "avector"))
str(d[2:4, -1])  # 'u' keeps its "unit"
I am not yet far enough into R to understand exactly what happens here. However, simply running these lines (except the last three) does not change the behavior of my subsetting. Using subset() with an appropriate logical vector (100 TRUEs plus one FALSE) gives me the same result. And simply storing the attributes in a variable and restoring them after the subset does not work either.
# Does not work...
tmp <- attributes(x)
x <- x[1:100,]
attributes(x) <- tmp
Of course, I could write all comments to a vector (var => comment), subset, and write them back using a loop (as sketched below), but that does not seem like a well-founded solution. And I am quite sure I will encounter datasets with other relevant attributes in future analyses.
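For illustration, the loop I mean would be something like this (a sketch, assuming x is my data.frame):
cmts <- lapply(x, comment)     # save each column's comment
x <- x[1:100, ]                # the subset drops them
for (nm in names(cmts)) comment(x[[nm]]) <- cmts[[nm]]  # restore them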
So this is where my efforts in stackoverflow, Google, and brain power got stuck. I would very much appreciate if anyone could help me out with a hint. Thanks!
If I understand you correctly, you have some data in a data.frame, and the columns of the data.frame have comments associated with them. Perhaps something like the following?
set.seed(1)
mydf <- data.frame(aa = rpois(100, 4),
                   bb = sample(LETTERS[1:5], 100, replace = TRUE))
comment(mydf$aa)<-"Don't drop me!"
comment(mydf$bb)<-"Me either!"
So this would give you something like
> str(mydf)
'data.frame': 100 obs. of 2 variables:
$ aa: atomic 3 3 4 7 2 7 7 5 5 1 ...
..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2 2 5 4 2 1 3 5 3 ...
..- attr(*, "comment")= chr "Me either!"
And when you subset this, the comments are dropped:
> str(mydf[1:2,]) # comment dropped.
'data.frame': 2 obs. of 2 variables:
$ aa: num 3 3
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
To preserve the comments, define the function `[.avector` as you did above (from the documentation), then add the appropriate class attribute to each of the columns in your data.frame (EDIT: to keep the factor levels of bb, add "factor" to the class of bb):
mydf$aa<-structure(mydf$aa, class="avector")
mydf$bb<-structure(mydf$bb, class=c("avector","factor"))
So that the comments are preserved:
> str(mydf[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Class 'avector' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
EDIT:
If there are many columns in your data.frame that have attributes you want to preserve, you could use lapply (EDITED to include original column class):
mydf2 <- data.frame(lapply(mydf, function(x) {
  structure(x, class = c("avector", class(x)))
}))
However, this drops comments associated with the data.frame itself (such as comment(mydf)<-"I'm a data.frame"), so if you have any, assign them to the new data.frame:
comment(mydf2)<-comment(mydf)
And then you have
> str(mydf2[1:2,])
'data.frame': 2 obs. of 2 variables:
$ aa:Classes 'avector', 'numeric' atomic [1:2] 3 3
.. ..- attr(*, "comment")= chr "Don't drop me!"
$ bb: Factor w/ 5 levels "A","B","C","D",..: 4 2
..- attr(*, "comment")= chr "Me either!"
- attr(*, "comment")= chr "I'm a data.frame"
For those who are looking for the all-in solution based on BenBarnes' explanation, here it is.
(Give your upvote to BenBarnes' post if this works for you.)
# Define the avector-subselection method (from the manual)
as.data.frame.avector <- as.data.frame.vector
`[.avector` <- function(x, i, ...) {
  r <- NextMethod("[")
  mostattributes(r) <- attributes(x)
  r
}
# Assign each column in the data.frame the (additional) class avector
# Note that this will "lose" the data.frame's attributes, therefore write to a copy
df2 <- data.frame(lapply(df, function(x) {
  structure(x, class = c("avector", class(x)))
}))
# Finally copy the attribute for the original data.frame if necessary
mostattributes(df2) <- attributes(df)
# Now subselects work without losing attributes :)
df2 <- df2[1:100,]
str(df2)
The good thing: once the class is attached to all of the data.frame's elements, subselects never again drop the attributes.
Okay, sometimes I am stunned by how complicated the most simple operations are in R. But I surely would not have learned about the "classes" feature if I had just marked and deleted the case in SPSS ;)
This is solved by the sticky package. (Full disclosure: I am the package author.) Apply sticky() to your vectors and the attributes are preserved through subset operations. For example:
> df <- data.frame(
+ sticky = sticky( structure(1:5, comment="sticky attribute") ),
+ nonstick = structure( letters[1:5], comment="non-sticky attribute" )
+ )
>
> comment(df[1:3, "nonstick"])
NULL
> comment(df[1:3, "sticky"])
[1] "sticky attribute"
This works for any attribute and not only comment.
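A quick sketch of that claim with an arbitrary, made-up units attribute:
library(sticky)
v <- sticky(structure(1:5, units = "kg"))
attr(v[2:3], "units")  # survives the subset
# [1] "kg"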
See the sticky package for details:
on Github
on CRAN
I spent hours trying to figure out how to retain attribute data (specifically variable labels) when subsetting a data.frame (removing columns). The answer was so simple I couldn't believe it: just use the function spss.get from the Hmisc package, and then no matter how you subset, the variable labels are retained.
