purrr::map_df() drops NULL rows

When using purrr::map_df(), I occasionally pass in a list of data frames where some items are NULL. When I do, map_df() returns a data frame with fewer rows than the original list has elements.
I assume what's going on is that map_df() calls dplyr::bind_rows() which ignores NULL values. However, I'm not sure how to identify my problematic rows.
Here's an example:
library(purrr)
problemlist <- list(NULL, NULL, structure(list(bounds = structure(list(northeast = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), southwest = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L), location = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), location_type = "ROOFTOP",
viewport = structure(list(northeast = structure(list(lat = 41.49,
lng = -71.46), .Names = c("lat", "lng"), class = "data.frame", row.names = 1L),
southwest = structure(list(lat = 41.49, lng = -71.46), .Names = c("lat",
"lng"), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L)), .Names = c("bounds",
"location", "location_type", "viewport"), class = "data.frame", row.names = 1L))
# what actually happens
map_df(problemlist, 'location')
# lat lng
# 1 41.49 -71.46
# desired result
map_df_with_Null_handling(problemlist, 'location')
# lat lng
# 1 NA NA
# 2 NA NA
# 3 41.49 -71.46
I considered wrapping my location accessor in one of purrr's error-handling functions (e.g. safely() or possibly()), but I'm not running into errors; I'm just not getting the desired result.
What's the best way to handle NULL values with map_df()?

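You can use the (undocumented at the time of writing) .null argument of any of the map*() functions to tell the function what to return when it encounters a NULL value. As far as I know, later purrr releases call this argument .default, and data_frame() (from dplyr/tibble, which must be loaded) has been superseded by tibble():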
map_df(problemlist, 'location', .null = data_frame(lat = NA, lng = NA) )
# lat lng
# 1 NA NA
# 2 NA NA
# 3 41.49 -71.46
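Because the argument name has changed across purrr versions, a version-independent alternative may be useful. The following is a minimal base R sketch, not the answerer's method: it locates the NULL elements (which also answers the "how do I identify my problematic rows" part) and substitutes a placeholder row before binding. The toy list x and the placeholder na_row are assumptions made for illustration.

```r
# Toy input mirroring the question: two NULL elements and one element with a location.
x <- list(
  NULL,
  NULL,
  list(location = data.frame(lat = 41.49, lng = -71.46))
)

# Positions of the problematic (NULL) elements.
null_idx <- which(vapply(x, is.null, logical(1)))

# Hypothetical placeholder row used wherever an element is NULL.
na_row <- data.frame(lat = NA_real_, lng = NA_real_)

# Swap each NULL for the placeholder, then bind all rows together.
locs <- lapply(x, function(el) if (is.null(el)) na_row else el[["location"]])
result <- do.call(rbind, locs)
```

The result keeps one row per input element, so it can be lined up with the original list by position.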

Related

Count the number a string exists in dataframe column with nested dataframes

I have the data frame below, whose genres column contains nested data frames with 3 columns each. How can I count how many times the word "Pop" appears in the name column across all the nested data frames in genres?
dat<-structure(list(name = c("Easy On Me", "All I Want For Christmas Is You",
"Overseas (feat. Central Cee)", "Last Christmas", "Shivers"),
releaseDate = c("2021-10-15", "1994-10-29", "2021-11-18",
"1984-01-01", "2021-09-09"), kind = c("songs", "songs", "songs",
"songs", "songs"), artistId = c("262836961", "91853", "1240341559",
"548421", "183313439"), artistUrl = c("https://music.apple.com/gb/artist/adele/262836961",
"https://music.apple.com/gb/artist/mariah-carey/91853", "https://music.apple.com/gb/artist/d-block-europe/1240341559",
"https://music.apple.com/gb/artist/wham/548421", "https://music.apple.com/gb/artist/ed-sheeran/183313439"
), artworkUrl100 = c("https://is3-ssl.mzstatic.com/image/thumb/Music115/v4/73/6d/7c/736d7cfb-c79d-c9a9-4170-5e71d008dea1/886449666430.jpg/100x100bb.jpg",
"https://is4-ssl.mzstatic.com/image/thumb/Music124/v4/c6/b7/27/c6b727f7-3a32-6b43-cee2-05bb71daf1cf/dj.itfmdeif.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music126/v4/2e/63/01/2e6301ee-905d-5ae8-c989-eaf9d8e7e6ae/21UM1IM30658.rgb.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music125/v4/47/55/0c/47550cd6-7ef5-bf86-c194-c7695d63c759/dj.xuditatj.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music125/v4/c5/d8/c6/c5d8c675-63e3-6632-33db-2401eabe574d/190296491412.jpg/100x100bb.jpg"
), genres = list(structure(list(genreId = c("14", "34"),
name = c("Pop", "Music"), url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2),
structure(list(genreId = c("34", "21", "17", "15", "14"
), name = c("Music", "Rock", "Dance", "R&B/Soul", "Pop"
), url = c("https://itunes.apple.com/gb/genre/id34",
"https://itunes.apple.com/gb/genre/id21", "https://itunes.apple.com/gb/genre/id17",
"https://itunes.apple.com/gb/genre/id15", "https://itunes.apple.com/gb/genre/id14"
)), class = "data.frame", row.names = c(NA, 5L)), structure(list(
genreId = c("18", "34"), name = c("Hip-Hop/Rap",
"Music"), url = c("https://itunes.apple.com/gb/genre/id18",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2),
structure(list(genreId = c("14", "34", "17"), name = c("Pop",
"Music", "Dance"), url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34", "https://itunes.apple.com/gb/genre/id17"
)), class = "data.frame", row.names = c(NA, 3L)), structure(list(
genreId = c("14", "34"), name = c("Pop", "Music"),
url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2))), row.names = c(NA,
5L), class = "data.frame")
If you want the total number of times the word Pop appears in genres:
library(tidyverse)
sum(bind_rows(dat$genres)['name'] == 'Pop')
[1] 4
If you want the number of times in each nested data frame:
map_dbl(dat$genres, ~sum(.x['name']=='Pop'))
[1] 1 1 0 1 1
In base R, to check whether each nested data frame contains "Pop" at all (presence, not a count):
mapply(`%in%`, "Pop", lapply(dat$genres, `[[`, "name"))
# Pop <NA> <NA> <NA> <NA>
# TRUE TRUE FALSE TRUE TRUE
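The difference between counting and checking presence can be seen on a small example. This is a base R sketch on made-up data (the genres list below is a hypothetical stand-in for dat$genres), using exact matching on the name column.

```r
# Hypothetical stand-in for dat$genres: a list of small data frames.
genres <- list(
  data.frame(name = c("Pop", "Music")),
  data.frame(name = c("Hip-Hop/Rap", "Music")),
  data.frame(name = c("Pop", "Music", "Dance"))
)

# Count of exact "Pop" matches in each nested data frame.
per_df <- vapply(genres, function(g) sum(g$name == "Pop"), integer(1))

# Total across all nested data frames.
total <- sum(per_df)

# Presence only: TRUE if "Pop" occurs at least once in a nested data frame.
has_pop <- vapply(genres, function(g) "Pop" %in% g$name, logical(1))
```

Exact matching with == does not count names that merely contain "Pop" as a substring; use grepl() instead if partial matches should count.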

Filter a nested list of dataframes based on logical vector in R

I have a nested list of coordinates:
coords <- list(`41` = structure(list(lon = c(11.9170235974052, 11.9890348226944,11.9266305690725),
lat = c(48.0539406017157, 48.0618200883643,48.0734094557987)),
class = "data.frame", row.names = c(NA, -3L )),
`51` = structure(list(lon = c(11.9700157009047, 11.9661664366154,11.9111812165745),
lat = c(48.0524843177559, 48.0645786453912, 48.0623193233537)),
class = "data.frame", row.names = c(NA, -3L)),
`61` = structure(list(lon = c(11.9464237941416, 11.9536554768081,11.9112311461624),
lat = c(48.040970408282, 48.0408864989903, 48.0284615642167)),
class = "data.frame", row.names = c(NA, -3L )),
`71` = structure(list(lon = c(11.9274864543974, 11.8733675039864,11.933264512569),
lat = c(48.0135478382282, 48.0216452485664, 48.0289752363299)),
class = "data.frame", row.names = c(NA, -3L)),
`81` = structure(list(lon = c(11.8837173493491, 11.9072450330566,11.8943898749275),
lat = c(48.0266639859759, 48.0132853717376, 48.0327326995006)),
class = "data.frame", row.names = c(NA, -3L )),
`91` = structure(list(lon = c(11.882538477087, 11.8377742591454,11.8817027393128),
lat = c(48.0284081468982, 48.022864811514, 48.0229810559649)),
class = "data.frame", row.names = c(NA, -3L )))
I would like to filter this list based on a nested list of logical values.
index <- list(`41` = c(TRUE, TRUE, FALSE), `51` = c(FALSE, FALSE, TRUE
), `61` = c(FALSE, FALSE, FALSE), `71` = c(FALSE, FALSE, FALSE
), `81` = c(FALSE, FALSE, FALSE), `91` = c(FALSE, FALSE, FALSE))
What is the best approach to do so?
I tried to unlist the nested lists or to create a data.frame, but it did not work out.
Thank you!
You can use Map like this:
Map(function(x, y) x[y, ], coords, index)
#$`41`
# lon lat
#1 11.91702 48.05394
#2 11.98903 48.06182
#$`51`
# lon lat
#3 11.91118 48.06232
#$`61`
#[1] lon lat
#<0 rows> (or 0-length row.names)
#...
#...
In tidyverse:
library(purrr)
library(dplyr)
map2(coords, index, ~.x %>% filter(.y))
This answer works well to turn the lists into data frames. If the ordering is consistent, then I think this is what you need:
library(purrr)
# use solution to convert lists to dataframes, storing the names in id column
coords_df <- map_df(coords, ~as.data.frame(.x), .id="id")
index_df <- map_df(index, ~as.data.frame(.x), .id="id")
# filter coordinates on the values in index
coords_df[index_df$.x,]
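The Map() approach leaves zero-row data frames in the result for elements whose index is all FALSE. If those are unwanted, they can be dropped afterwards. A minimal base R sketch on made-up data (the small coords and index lists below are hypothetical), which also aligns the two lists by name rather than by position:

```r
# Hypothetical small versions of the coords and index lists.
coords <- list(
  a = data.frame(lon = c(1, 2, 3), lat = c(4, 5, 6)),
  b = data.frame(lon = c(7, 8, 9), lat = c(10, 11, 12))
)
index <- list(
  b = c(FALSE, FALSE, FALSE),
  a = c(TRUE, TRUE, FALSE)
)

# Align index to coords by name, in case the two lists are ordered differently.
index <- index[names(coords)]

# Keep only the rows flagged TRUE in each data frame.
filtered <- Map(function(x, y) x[y, , drop = FALSE], coords, index)

# Drop elements that ended up with no rows.
non_empty <- Filter(function(d) nrow(d) > 0, filtered)
```

Here drop = FALSE is a precaution so that a data frame with a single column would still be returned as a data frame.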

How to access multi-dimensional list element using FOR loop in R

I want to access the 'bandSpecificMetadata' elements of a multi-dimensional list in R and create a vector of 'reflectanceCoefficient' values for my remote sensing project.
First, I reduced the dimension of the list and used nodes <- get('EarthObservationResult', matadata.list$resultOf) to extract the sub-list.
The problem comes when I try to build the vector (e.g. bandNumber 1 corresponds to reflectance coefficient 2.21e-5) using a for loop.
for(node in nodes[6:9]) {
bn = get("bandNumber", node)
if(bn %in% c('1','2','3','4')){
i = integer(bn)
coeffs = get("reflectanceCoefficient", node)
}
print(coeffs)
}
which prints out:
[1] "2.21386105481e-05"
[1] "2.31474175457e-05"
[1] "2.60208594123e-05"
[1] "3.83481925626e-05"
But I want a vector indexed by 1, 2, 3, 4 that holds the corresponding numbers. It seems to me that each value overwrites the previous one on every iteration.
Then I tried:
for(node in nodes[6:9]) {
n = 1:4
b[n] = get("bandNumber", node)
if(b[n] %in% c('1','2','3','4')){
i = integer(b[n])
coeffs[i] = get("reflectanceCoefficient", node)
}
print(coeffs)
}
But this results in:
Error in integer(b[n]) : invalid 'length' argument
In addition: Warning message:
In if (b[n] %in% c("1", "2", "3", "4")) { :
the condition has length > 1 and only the first element will be used
How do I fix this?
I used XML::xmlParse() to parse the xml and matadata.list <- XML::xmlToList() to convert the data to list.
For reproducible example, see below:
dput(matadata.list)
structure(list(metaDataProperty = structure(list(EarthObservationMetaData = structure(list(
identifier = "20170127_213132_0e0e_3B_AnalyticMS", acquisitionType = "NOMINAL",
productType = "L3B", status = "ARCHIVED", downlinkedTo = structure(list(
DownlinkInformation = structure(list(acquisitionStation = structure(list(
text = "Planet Ground Station Network", .attrs = structure("urn:eop:PS:stationLocation", .Names = "codeSpace")), .Names = c("text",
".attrs")), acquisitionDate = "2017-01-27T21:31:32+00:00"), .Names = c("acquisitionStation",
"acquisitionDate"))), .Names = "DownlinkInformation"),
archivedIn = structure(list(ArchivingInformation = structure(list(
archivingCenter = structure(list(text = "Planet Archive Center",
.attrs = structure("urn:eop:PS:stationLocation", .Names = "codeSpace")), .Names = c("text",
".attrs")), archivingDate = "2017-01-27T21:31:32+00:00",
archivingIdentifier = structure(list(text = "385180",
.attrs = structure("urn:eop:PS:dmsCatalogueId", .Names = "codeSpace")), .Names = c("text",
".attrs"))), .Names = c("archivingCenter", "archivingDate",
"archivingIdentifier"))), .Names = "ArchivingInformation"),
processing = structure(list(ProcessingInformation = structure(list(
processorName = "CMO Processor", processorVersion = "4.1.4",
nativeProductFormat = "GeoTIFF"), .Names = c("processorName",
"processorVersion", "nativeProductFormat"))), .Names = "ProcessingInformation"),
license = structure(list(licenseType = "20160101 - Inc - Single User",
resourceLink = structure(c("PL EULA", "https://assets.planet.com/docs/20160101_Inc_SingleUser.txt"
), class = structure("XMLAttributes", package = "XML"), namespaces = structure(c("xlink",
"xlink"), .Names = c("http://www.w3.org/1999/xlink",
"http://www.w3.org/1999/xlink")), .Names = c("title",
"href"))), .Names = c("licenseType", "resourceLink")),
versionIsd = "1.0", pixelFormat = "16U"), .Names = c("identifier",
"acquisitionType", "productType", "status", "downlinkedTo", "archivedIn",
"processing", "license", "versionIsd", "pixelFormat"))), .Names = "EarthObservationMetaData"),
validTime = structure(list(TimePeriod = structure(list(beginPosition = "2017-01-27T21:31:32+00:00",
endPosition = "2017-01-27T21:31:32+00:00"), .Names = c("beginPosition",
"endPosition"))), .Names = "TimePeriod"), using = structure(list(
EarthObservationEquipment = structure(list(platform = structure(list(
Platform = structure(list(shortName = "PlanetScope",
serialIdentifier = "0e0e", orbitType = "LEO-SSO"), .Names = c("shortName",
"serialIdentifier", "orbitType"))), .Names = "Platform"),
instrument = structure(list(Instrument = structure(list(
shortName = "PS2"), .Names = "shortName")), .Names = "Instrument"),
sensor = structure(list(Sensor = structure(list(sensorType = "OPTICAL",
resolution = structure(list(text = "3.0000",
.attrs = structure("m", .Names = "uom")), .Names = c("text",
".attrs")), scanType = "FRAME"), .Names = c("sensorType",
"resolution", "scanType"))), .Names = "Sensor"),
acquisitionParameters = structure(list(Acquisition = structure(list(
orbitDirection = "DESCENDING", incidenceAngle = structure(list(
text = "8.072969e-02", .attrs = structure("deg", .Names = "uom")), .Names = c("text",
".attrs")), illuminationAzimuthAngle = structure(list(
text = "7.610387e+01", .attrs = structure("deg", .Names = "uom")), .Names = c("text",
".attrs")), illuminationElevationAngle = structure(list(
text = "4.649194e+01", .attrs = structure("deg", .Names = "uom")), .Names = c("text",
".attrs")), azimuthAngle = structure(list(text = "1.242074e+01",
.attrs = structure("deg", .Names = "uom")), .Names = c("text",
".attrs")), spaceCraftViewAngle = structure(list(
text = "5.692807e-02", .attrs = structure("deg", .Names = "uom")), .Names = c("text",
".attrs")), acquisitionDateTime = "2017-01-27T21:31:32+00:00"), .Names = c("orbitDirection",
"incidenceAngle", "illuminationAzimuthAngle", "illuminationElevationAngle",
"azimuthAngle", "spaceCraftViewAngle", "acquisitionDateTime"
))), .Names = "Acquisition")), .Names = c("platform",
"instrument", "sensor", "acquisitionParameters"))), .Names = "EarthObservationEquipment"),
target = structure(list(Footprint = structure(list(multiExtentOf = structure(list(
MultiSurface = structure(list(surfaceMembers = structure(list(
Polygon = structure(list(outerBoundaryIs = structure(list(
LinearRing = structure(list(coordinates = "175.446585079397,-37.7068873856657 175.446633607572,-37.7045627724835 175.46731776545,-37.6311749428137 175.468010520596,-37.6311839417076 175.75989021492,-37.6819836599337 175.759889856814,-37.6820051679817 175.739424097003,-37.757826933992 175.739359440859,-37.7578262423109 175.446585079397,-37.7068873856657"), .Names = "coordinates")), .Names = "LinearRing"),
.attrs = structure("EPSG:4326", .Names = "srsName")), .Names = c("outerBoundaryIs",
".attrs"))), .Names = "Polygon"), .attrs = structure("EPSG:4326", .Names = "srsName")), .Names = c("surfaceMembers",
".attrs"))), .Names = "MultiSurface"), centerOf = structure(list(
Point = structure(list(pos = "175.603162359 -37.6944367036",
.attrs = structure("EPSG:4326", .Names = "srsName")), .Names = c("pos",
".attrs"))), .Names = "Point"), geographicLocation = structure(list(
topLeft = structure(list(latitude = "-37.6311749428",
longitude = "175.446585079"), .Names = c("latitude",
"longitude")), topRight = structure(list(latitude = "-37.6311749428",
longitude = "175.759890215"), .Names = c("latitude",
"longitude")), bottomRight = structure(list(latitude = "-37.757826934",
longitude = "175.759890215"), .Names = c("latitude",
"longitude")), bottomLeft = structure(list(latitude = "-37.757826934",
longitude = "175.446585079"), .Names = c("latitude",
"longitude"))), .Names = c("topLeft", "topRight", "bottomRight",
"bottomLeft"))), .Names = c("multiExtentOf", "centerOf",
"geographicLocation"))), .Names = "Footprint"), resultOf = structure(list(
EarthObservationResult = structure(list(product = structure(list(
ProductInformation = structure(list(fileName = "20170127_213132_0e0e_3B_AnalyticMS.tif",
productFormat = "GeoTIFF", spatialReferenceSystem = structure(list(
epsgCode = "32760", geodeticDatum = "WGS_1984",
projection = "WGS 84 / UTM zone 60S", projectionZone = "160"), .Names = c("epsgCode",
"geodeticDatum", "projection", "projectionZone"
)), resamplingKernel = "CC", numRows = "4565",
numColumns = "9194", numBands = "4", rowGsd = "3.0",
columnGsd = "3.0", radiometricCorrectionApplied = "true",
geoCorrectionLevel = "Precision Geocorrection",
elevationCorrectionApplied = "FineDEM", atmosphericCorrectionApplied = "false"), .Names = c("fileName",
"productFormat", "spatialReferenceSystem", "resamplingKernel",
"numRows", "numColumns", "numBands", "rowGsd", "columnGsd",
"radiometricCorrectionApplied", "geoCorrectionLevel",
"elevationCorrectionApplied", "atmosphericCorrectionApplied"
))), .Names = "ProductInformation"), mask = structure(list(
MaskInformation = structure(list(type = "UNUSABLE DATA",
format = "RASTER", referenceSystemIdentifier = structure(list(
text = "32760", .attrs = structure("EPSG", .Names = "codeSpace")), .Names = c("text",
".attrs")), fileName = "20170127_213132_0e0e_3B_AnalyticMS_DN_udm.tif"), .Names = c("type",
"format", "referenceSystemIdentifier", "fileName"
))), .Names = "MaskInformation"), cloudCoverPercentage = structure(list(
text = "0.01", .attrs = structure("percentage", .Names = "uom")), .Names = c("text",
".attrs")), cloudCoverPercentageQuotationMode = "AUTOMATIC",
unusableDataPercentage = structure(list(text = "0.0",
.attrs = structure("percentage", .Names = "uom")), .Names = c("text",
".attrs")), bandSpecificMetadata = structure(list(
bandNumber = "1", comment = NULL, radiometricScaleFactor = "0.01",
comment = NULL, reflectanceCoefficient = "2.21386105481e-05"), .Names = c("bandNumber",
"comment", "radiometricScaleFactor", "comment", "reflectanceCoefficient"
)), bandSpecificMetadata = structure(list(bandNumber = "2",
comment = NULL, radiometricScaleFactor = "0.01",
comment = NULL, reflectanceCoefficient = "2.31474175457e-05"), .Names = c("bandNumber",
"comment", "radiometricScaleFactor", "comment", "reflectanceCoefficient"
)), bandSpecificMetadata = structure(list(bandNumber = "3",
comment = NULL, radiometricScaleFactor = "0.01",
comment = NULL, reflectanceCoefficient = "2.60208594123e-05"), .Names = c("bandNumber",
"comment", "radiometricScaleFactor", "comment", "reflectanceCoefficient"
)), bandSpecificMetadata = structure(list(bandNumber = "4",
comment = NULL, radiometricScaleFactor = "0.01",
comment = NULL, reflectanceCoefficient = "3.83481925626e-05"), .Names = c("bandNumber",
"comment", "radiometricScaleFactor", "comment", "reflectanceCoefficient"
))), .Names = c("product", "mask", "cloudCoverPercentage",
"cloudCoverPercentageQuotationMode", "unusableDataPercentage",
"bandSpecificMetadata", "bandSpecificMetadata", "bandSpecificMetadata",
"bandSpecificMetadata"))), .Names = "EarthObservationResult"),
.attrs = structure(c("http://schemas.planet.com/ps/v1/planet_product_metadata_geocorrected_level http://schemas.planet.com/ps/v1/planet_product_metadata_geocorrected_level.xsd",
"1.2.1", "1.0"), class = structure("XMLAttributes", package = "XML"), namespaces = structure(c("xsi",
"", ""), .Names = c("http://www.w3.org/2001/XMLSchema-instance",
"", "")), .Names = c("schemaLocation", "version", "planet_standard_product_version"
))), .Names = c("metaDataProperty", "validTime", "using",
"target", "resultOf", ".attrs"))
As you did not provide any reproducible data, the following attempt may not work:
# Initialise vectors:
b <- vector(mode = "character", length = 4)
coeffs <- vector(mode = "character", length = 4)
# Get coefficients
for (i in 6:9) {
  j <- i - 5  # nodes 6..9 map to positions 1..4 of the result vectors
  b[j] <- get("bandNumber", nodes[[i]])
  coeffs[j] <- ifelse(b[j] %in% c("1", "2", "3", "4"),
                      get("reflectanceCoefficient", nodes[[i]]),  # Yes cond val
                      NA)                                         # No cond val
}
coeffs
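The "invalid 'length' argument" error in the question comes from using integer(), which creates a vector of zeros of the given length, where as.integer(), which converts a value to an integer, was presumably intended. A short sketch of the distinction (the variable names here are illustrative only):

```r
bn <- "3"

# as.integer() converts the string "3" to the number 3, usable as an index.
idx <- as.integer(bn)

# integer(3) instead allocates a zero-filled integer vector of length 3.
zeros <- integer(3)

# Store a coefficient at the position given by the band number.
coeffs <- numeric(4)
coeffs[idx] <- as.numeric("2.60208594123e-05")
```

Preallocating coeffs outside the loop and assigning by index is what keeps each iteration from overwriting the previous value.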
(edited to answer the updated question)
Have a look at these answers to work with original xml data: How to parse XML to R data frame
You already parsed the xml file and now you have lists. I think package purrr (https://purrr.tidyverse.org/) helps a lot in this case.
I assume that we know the path to the EarthObservationResult. Note how we extract reflectanceCoefficient from all sub-nodes and discard the NULL elements with compact.
library(tidyverse)
nodes <- matadata.list$resultOf$EarthObservationResult
coeffs <- nodes %>%
purrr::map("reflectanceCoefficient") %>%
purrr::compact() %>%
purrr::map_dbl(~ as.numeric(.x)) %>%
purrr::set_names(nm = NULL)
print(coeffs)
#> [1] 2.213861e-05 2.314742e-05 2.602086e-05 3.834819e-05
Created on 2018-08-28 by the reprex package (v0.2.0).
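If the coefficients should be addressable by band number, as the question asks, one option is a named numeric vector. The following base R sketch runs on a cut-down, made-up version of nodes (only two bands and a dummy product entry); it selects the bandSpecificMetadata entries by name instead of by position 6:9.

```r
# Hypothetical, cut-down version of the nodes list from the question.
nodes <- list(
  product = list(fileName = "example.tif"),
  bandSpecificMetadata = list(bandNumber = "1",
                              reflectanceCoefficient = "2.21386105481e-05"),
  bandSpecificMetadata = list(bandNumber = "2",
                              reflectanceCoefficient = "2.31474175457e-05")
)

# Keep only the per-band entries, regardless of where they sit in the list.
bands <- nodes[names(nodes) == "bandSpecificMetadata"]

# Extract the coefficients as numbers and label them with their band numbers.
coeffs <- vapply(bands, function(b) as.numeric(b[["reflectanceCoefficient"]]),
                 numeric(1))
names(coeffs) <- vapply(bands, function(b) b[["bandNumber"]], character(1))
```

A coefficient can then be looked up with, for example, coeffs[["2"]].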

Multiple lists into one data frame in R

So I'm running a package in which the output of the function I'm using is something similar to this:
area ID structure
1 150 1 house
I have several of these which I get by looping through some stuff. Basically this is my loop function:
for (k in 1:length(models)) {
for (l in 1:length(patients)) {
print(result[[l]][[k]])
tableData[[l]][[k]] <- do.call(rbind, result[[l]][[k]])
}
}
So print(result[[l]][[k]]) gives the output shown at the beginning. My issue is putting all of these into one data frame. So far the do.call() approach, which I have read is the one to use when combining lists into data frames, just doesn't work.
So where am I going wrong here?
Updated:
dput() output (area = value in this case):
list(list(structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame")),
list(structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L
), class = "data.frame")))
Edit: I initially used purrr::map_dfr to solve this problem, but purrr::reduce is much more appropriate.
The list nesting means we have to bind rows together twice. Here's a solution using the purrr and dplyr packages and assigning your dput list to the variable my_list:
library(purrr)
library(dplyr)
my_df <- reduce(my_list, bind_rows)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
my_df
#> value ID structure model
#> 1 0.03947978 1 house house wood
#> 2 0.03947978 1 house house stone
#> 3 0.03069239 2 house house wood
#> 4 0.03069239 2 house house stone
I find map-ing with purrr way more intuitive than do.call. Let me know if this helps!
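For comparison, the same two-level binding can be sketched in base R. One plausible reason the loop in the question fails is that result[[l]][[k]] is already a single data frame, so do.call(rbind, ...) receives its columns as separate arguments. The sketch below uses a simplified, made-up version of the list (fewer columns, character instead of factor) and applies do.call(rbind, ...) once per inner list and once across the results.

```r
# Hypothetical simplified version of the nested list from the question.
my_list <- list(
  list(
    data.frame(value = 0.039, ID = "1 house", model = "wood",
               stringsAsFactors = FALSE),
    data.frame(value = 0.039, ID = "1 house", model = "stone",
               stringsAsFactors = FALSE)
  ),
  list(
    data.frame(value = 0.031, ID = "2 house", model = "wood",
               stringsAsFactors = FALSE),
    data.frame(value = 0.031, ID = "2 house", model = "stone",
               stringsAsFactors = FALSE)
  )
)

# First level: bind the data frames inside each inner list.
per_patient <- lapply(my_list, function(p) do.call(rbind, p))

# Second level: bind the per-patient data frames into one.
my_df <- do.call(rbind, per_patient)
```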

Flattening list with variable nesting levels creates additional observations

I have a nested list of geocoded Moscow street addresses, converted from a nested list. However, the dataframe I was geocoding from had only addresses without zip codes, and in a few hundred (out of 33k) cases, the address returned multiple results for the same street address with different zipcodes. This created additional nesting in the list, which when converted to a dataframe results in a differing number of observations from the initial dataframe.
A result with only one address has the following structure:
(Ignore the gibberish, R console will not render Cyrillic correctly)
structure(list(results = structure(list(address_components = list(
structure(list(long_name = c("4", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ",
"127299"), short_name = c("4", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU",
"127299"), types = list("street_number", "route", c("political",
"sublocality", "sublocality_level_1"), c("locality", "political"
), c("administrative_area_level_2", "political"), c("country",
"political"), "postal_code")), .Names = c("long_name", "short_name",
"types"), class = "data.frame", row.names = c(NA, 7L))),
formatted_address = "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 4, Ìîñêâà, Ðîññèÿ, 127299",
geometry = structure(list(location = structure(list(lat = 55.8176896,
lng = 37.522891), .Names = c("lat", "lng"), class = "data.frame", row.names = 1L),
location_type = "ROOFTOP", viewport = structure(list(
northeast = structure(list(lat = 55.8190385802915,
lng = 37.5242399802915), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), southwest = structure(list(
lat = 55.8163406197085, lng = 37.5215420197085), .Names = c("lat",
"lng"), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L)), .Names = c("location",
"location_type", "viewport"), class = "data.frame", row.names = 1L),
partial_match = TRUE, place_id = "ChIJ59yLsy1ItUYR5EEBFbFJoSA",
types = list("street_address")), .Names = c("address_components",
"formatted_address", "geometry", "partial_match", "place_id",
"types"), class = "data.frame", row.names = 1L), status = "OK"), .Names = c("results",
"status"))
Whereas a result with multiple possible addresses looks like:
structure(list(results = structure(list(address_components = list(
structure(list(long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ",
"127299"), short_name = c("23", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU",
"127299"), types = list("street_number", "route", c("political",
"sublocality", "sublocality_level_1"), c("locality", "political"
), c("administrative_area_level_2", "political"), c("country",
"political"), "postal_code")), .Names = c("long_name", "short_name",
"types"), class = "data.frame", row.names = c(NA, 7L)), structure(list(
long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã",
"Ìîñêâà", "Ìîñêâà", "Ðîññèÿ", "125008"), short_name = c("23",
"óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã",
"Ìîñêâà", "Ìîñêâà", "RU", "125008"), types = list("street_number",
"route", c("political", "sublocality", "sublocality_level_1"
), c("locality", "political"), c("administrative_area_level_2",
"political"), c("country", "political"), "postal_code")), .Names = c("long_name",
"short_name", "types"), class = "data.frame", row.names = c(NA,
7L))), formatted_address = c("óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 127299",
"óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 125008"), geometry = structure(list(
location = structure(list(lat = c(55.8169112, 55.826859),
lng = c(37.5202899, 37.529427)), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1:2), location_type = c("ROOFTOP",
"ROOFTOP"), viewport = structure(list(northeast = structure(list(
lat = c(55.8182601802915, 55.8282079802915), lng = c(37.5216388802915,
37.5307759802915)), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2),
southwest = structure(list(lat = c(55.8155622197085,
55.8255100197085), lng = c(37.5189409197085, 37.5280780197085
)), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1:2)), .Names = c("location",
"location_type", "viewport"), class = "data.frame", row.names = 1:2),
partial_match = c(TRUE, TRUE), place_id = c("ChIJnVMw7C1ItUYRdfeWEQrXuAk",
"ChIJnbnwOdY3tUYR1_D9pHTqCsI"), types = list("street_address",
"street_address")), .Names = c("address_components",
"formatted_address", "geometry", "partial_match", "place_id",
"types"), class = "data.frame", row.names = 1:2), status = "OK"), .Names = c("results",
"status"))
In the results element in the second list, there is an additional level of nesting for each possible address, which when flattened creates an "extra" observation for that address, making it impossible to cbind() the geocoding results back to the list of addresses. I am using the following functions to flatten my nested lists to data-frames. How can I modify them to take only the first address when this additional nesting occurs? If the address is incorrect, the buildings will simply be discarded from the sample when I later merge with another dataframe, so I am concerned only with making each geocoded observation match to the appropriate row in the original dataframe (the source of the addresses).
flatten_googleway <- function(df) {
require(jsonlite)
res <- jsonlite::flatten(df)
res[, names(res) %in% c("geometry.location_type", "geometry.location.lat",
"geometry.location.lng", "formatted_address")]
}
moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
if (length(x$results) == 0) template_res[1, ] else flatten_googleway(x$results)
}))
##template for NA results
structure(list(formatted_address = character(0), geometry.location_type = character(0),
geometry.location.lat = numeric(0), geometry.location.lng = numeric(0)), .Names = c("formatted_address",
"geometry.location_type", "geometry.location.lat", "geometry.location.lng"
), row.names = integer(0), class = "data.frame")
Whoops, I was massively over-complicating things, as usual. I was able to fix this simply by modifying the lapply() call so that list elements with no results, and elements where more than one result is returned (length(x$results$formatted_address) > 1), are replaced by the NA template row.
moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
if (length(x$results) == 0 || length(x$results$formatted_address) > 1) template_res[1, ] else flatten_googleway(x$results)
}))
I still lose some data this way unfortunately, but identifying which address is correct out of the options given would likely be too time-consuming, and a bit silly in a dataset with so many observations.
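The original question asked for keeping only the first address when several are returned, which would avoid discarding those rows. A minimal sketch of that idea follows, under the assumption that each result has already been flattened to a plain data frame; the helper first_result(), the simplified column names, and the toy results list are all hypothetical.

```r
# Hypothetical one-row NA template with simplified column names.
template_res <- data.frame(formatted_address = NA_character_,
                           lat = NA_real_, lng = NA_real_,
                           stringsAsFactors = FALSE)

# Return the first row of a result, or the NA template if there is no result.
first_result <- function(res) {
  if (length(res) == 0 || nrow(res) == 0) {
    template_res
  } else {
    res[1, names(template_res), drop = FALSE]
  }
}

# Toy results: one match, two matches, and no match.
results <- list(
  data.frame(formatted_address = "A", lat = 1, lng = 2,
             stringsAsFactors = FALSE),
  data.frame(formatted_address = c("B1", "B2"), lat = c(3, 4), lng = c(5, 6),
             stringsAsFactors = FALSE),
  NULL
)

# Exactly one row per input element, so the output lines up with the input.
out <- do.call(rbind, lapply(results, first_result))
```

Whether the first candidate is the right address is not checked here; the sketch only guarantees one output row per input element.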

Resources