Multiple lists into one data frame in R

I'm using a function from a package whose output looks something like this:
area ID structure
1 150 1 house
I get several of these by looping over models and patients. This is my loop:
for (k in 1:length(models)) {
for (l in 1:length(patients)) {
print(result[[l]][[k]])
tableData[[l]][[k]] <- do.call(rbind, result[[l]][[k]])
}
}
The print(result[[l]][[k]]) call gives the output shown at the top. My problem is putting all of these into one data frame. So far the do.call() call, which I have read is the one to use when combining lists into data frames, just doesn't work.
Where am I going wrong?
Updated:
dput() output (area = value in this case):
list(list(structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0394797760472196, ID = "1 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame")),
list(structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "wood", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L), class = "data.frame"),
structure(list(value = 0.0306923865158472, ID = "2 house",
structure = "house", model = structure(1L, .Label = "stone", class = "factor")), .Names = c("value",
"ID", "structure", "model"), row.names = c(NA, -1L
), class = "data.frame")))

Edit: I initially used purrr::map_dfr to solve this problem, but purrr::reduce is much more appropriate.
The list nesting means we have to bind rows together twice. Here's a solution using the purrr and dplyr packages, with your dput() list assigned to the variable my_list:
library(purrr)
library(dplyr)
my_df <- reduce(my_list, bind_rows)
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
#> Warning in bind_rows_(x, .id): binding character and factor vector,
#> coercing into character vector
my_df
#> value ID structure model
#> 1 0.03947978 1 house house wood
#> 2 0.03947978 1 house house stone
#> 3 0.03069239 2 house house wood
#> 4 0.03069239 2 house house stone
I find map-ing with purrr way more intuitive than do.call. Let me know if this helps!
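For comparison, here is a base R sketch of the same idea. The loop in the question likely misbehaves because do.call(rbind, x) on a single data frame x passes its columns to rbind() as separate arguments, so the result is a matrix rather than a longer data frame. Flattening the nesting by one level first gives rbind() a plain list of data frames. The toy my_list and the helper make_row() below are made up for illustration and only mimic the shape of the dput() output:

```r
# Hypothetical helper: builds a one-row data frame shaped like the question's output
make_row <- function(value, id, model) {
  data.frame(value = value, ID = id, structure = "house", model = model,
             stringsAsFactors = FALSE)
}

# Toy stand-in for the nested list: outer level = patients, inner level = models
my_list <- list(
  list(make_row(0.039, "1 house", "wood"), make_row(0.039, "1 house", "stone")),
  list(make_row(0.031, "2 house", "wood"), make_row(0.031, "2 house", "stone"))
)

# Remove one level of nesting, leaving a flat list of four data frames
flat_list <- unlist(my_list, recursive = FALSE)

# Stack the four one-row data frames into a single data frame
my_df <- do.call(rbind, flat_list)
my_df
```

Because every element of flat_list is a data frame, rbind() dispatches to its data frame method and keeps the column types.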

Related

Count the number of times a string occurs in a dataframe column with nested dataframes

In the dataframe below, the column genres holds nested dataframes with 3 columns each. How can I find how many times the word "Pop" appears in the name column across all the nested dataframes in genres?
dat<-structure(list(name = c("Easy On Me", "All I Want For Christmas Is You",
"Overseas (feat. Central Cee)", "Last Christmas", "Shivers"),
releaseDate = c("2021-10-15", "1994-10-29", "2021-11-18",
"1984-01-01", "2021-09-09"), kind = c("songs", "songs", "songs",
"songs", "songs"), artistId = c("262836961", "91853", "1240341559",
"548421", "183313439"), artistUrl = c("https://music.apple.com/gb/artist/adele/262836961",
"https://music.apple.com/gb/artist/mariah-carey/91853", "https://music.apple.com/gb/artist/d-block-europe/1240341559",
"https://music.apple.com/gb/artist/wham/548421", "https://music.apple.com/gb/artist/ed-sheeran/183313439"
), artworkUrl100 = c("https://is3-ssl.mzstatic.com/image/thumb/Music115/v4/73/6d/7c/736d7cfb-c79d-c9a9-4170-5e71d008dea1/886449666430.jpg/100x100bb.jpg",
"https://is4-ssl.mzstatic.com/image/thumb/Music124/v4/c6/b7/27/c6b727f7-3a32-6b43-cee2-05bb71daf1cf/dj.itfmdeif.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music126/v4/2e/63/01/2e6301ee-905d-5ae8-c989-eaf9d8e7e6ae/21UM1IM30658.rgb.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music125/v4/47/55/0c/47550cd6-7ef5-bf86-c194-c7695d63c759/dj.xuditatj.jpg/100x100bb.jpg",
"https://is1-ssl.mzstatic.com/image/thumb/Music125/v4/c5/d8/c6/c5d8c675-63e3-6632-33db-2401eabe574d/190296491412.jpg/100x100bb.jpg"
), genres = list(structure(list(genreId = c("14", "34"),
name = c("Pop", "Music"), url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2),
structure(list(genreId = c("34", "21", "17", "15", "14"
), name = c("Music", "Rock", "Dance", "R&B/Soul", "Pop"
), url = c("https://itunes.apple.com/gb/genre/id34",
"https://itunes.apple.com/gb/genre/id21", "https://itunes.apple.com/gb/genre/id17",
"https://itunes.apple.com/gb/genre/id15", "https://itunes.apple.com/gb/genre/id14"
)), class = "data.frame", row.names = c(NA, 5L)), structure(list(
genreId = c("18", "34"), name = c("Hip-Hop/Rap",
"Music"), url = c("https://itunes.apple.com/gb/genre/id18",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2),
structure(list(genreId = c("14", "34", "17"), name = c("Pop",
"Music", "Dance"), url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34", "https://itunes.apple.com/gb/genre/id17"
)), class = "data.frame", row.names = c(NA, 3L)), structure(list(
genreId = c("14", "34"), name = c("Pop", "Music"),
url = c("https://itunes.apple.com/gb/genre/id14",
"https://itunes.apple.com/gb/genre/id34")), class = "data.frame", row.names = 1:2))), row.names = c(NA,
5L), class = "data.frame")
If you want the total number of times the word Pop appears in the genres:
library(tidyverse)
sum(bind_rows(dat$genres)['name'] == 'Pop')
[1] 4
If you want the number of times in each nested dataframe:
map_dbl(dat$genres, ~sum(.x['name']=='Pop'))
[1] 1 1 0 1 1
A base R alternative that checks whether "Pop" occurs in each nested dataframe:
mapply(`%in%`, "Pop", lapply(dat$genres, `[[`, "name"))
# Pop <NA> <NA> <NA> <NA>
# TRUE TRUE FALSE TRUE TRUE
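The counts above can also be obtained without any packages. The following is a minimal sketch on a cut-down stand-in for dat$genres (only the name column is kept, and the variable names genres, per_song and total are made up for illustration):

```r
# Toy stand-in for dat$genres: one small data frame per song
genres <- list(
  data.frame(name = c("Pop", "Music"), stringsAsFactors = FALSE),
  data.frame(name = c("Music", "Rock", "Dance", "R&B/Soul", "Pop"),
             stringsAsFactors = FALSE),
  data.frame(name = c("Hip-Hop/Rap", "Music"), stringsAsFactors = FALSE),
  data.frame(name = c("Pop", "Music", "Dance"), stringsAsFactors = FALSE),
  data.frame(name = c("Pop", "Music"), stringsAsFactors = FALSE)
)

# Count exact matches of "Pop" inside each nested data frame
per_song <- vapply(genres, function(g) sum(g$name == "Pop"), integer(1))

# Total across all nested data frames
total <- sum(per_song)
```

Note that == matches the whole string, so a genre such as "K-Pop" would not be counted; grepl() would be needed for substring matches.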

Reshape dataframe to be three columns wide without knowing variable names

I have a dataframe with unknown column names, but a consistent format. How can I reshape it to be three columns wide without using column names?
cpu[,1:6]
Datapoints.Timestamp Datapoints.Maximum Datapoints.Unit Datapoints.Timestamp.1 Datapoints.Maximum.1 Datapoints.Unit.1
1 2019-03-05T08:00:00Z 7.833333 Percent 2019-03-11T22:00:00Z 24.25 Percent
GOAL
Timestamp Maximum Unit
2019-03-05T08:00:00Z 7.833333 Percent
.....
Dataset:
> dput(cpu[,1:6])
structure(list(Datapoints.Timestamp = structure(1L, .Label = "2019-03-05T08:00:00Z", class = "factor"),
Datapoints.Maximum = 7.83333333332848, Datapoints.Unit = structure(1L, .Label = "Percent", class = "factor"),
Datapoints.Timestamp.1 = structure(1L, .Label = "2019-03-11T22:00:00Z", class = "factor"),
Datapoints.Maximum.1 = 24.2500000000048, Datapoints.Unit.1 = structure(1L, .Label = "Percent", class = "factor")), .Names = c("Datapoints.Timestamp",
"Datapoints.Maximum", "Datapoints.Unit", "Datapoints.Timestamp.1",
"Datapoints.Maximum.1", "Datapoints.Unit.1"), class = "data.frame", row.names = c(NA,
-1L))
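Split the columns into consecutive groups of three, give every group the names of the first three columns, and stack the groups. Here df1 is the data frame (cpu[,1:6] in the question):
# seq(1, NCOL(df1), 3) gives the first column index of each group of three
do.call(rbind, lapply(seq(1, NCOL(df1), 3), function(i)
setNames(df1[,i+(0:2)], colnames(df1)[1:3])))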
# Datapoints.Timestamp Datapoints.Maximum Datapoints.Unit
#1 2019-03-05T08:00:00Z 7.833333 Percent
#2 2019-03-11T22:00:00Z 24.250000 Percent
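The output above keeps the Datapoints. prefix, while the goal in the question shows bare names. A small variation, assuming every column name has the form prefix.Name, strips everything up to the first dot before stacking. The data frame df1 below is rebuilt by hand from the question's dput(), and new_names and long are illustrative names:

```r
# Hand-built copy of the six example columns
df1 <- data.frame(
  Datapoints.Timestamp   = "2019-03-05T08:00:00Z",
  Datapoints.Maximum     = 7.833333,
  Datapoints.Unit        = "Percent",
  Datapoints.Timestamp.1 = "2019-03-11T22:00:00Z",
  Datapoints.Maximum.1   = 24.25,
  Datapoints.Unit.1      = "Percent",
  stringsAsFactors = FALSE
)

# Derive short names from the first three columns by dropping the prefix
new_names <- sub("^[^.]*\\.", "", names(df1)[1:3])

# First column index of each group of three
starts <- seq(1, ncol(df1), by = 3)

# Rename each group to the short names, then stack the groups
long <- do.call(rbind, lapply(starts, function(i) {
  setNames(df1[, i + 0:2], new_names)
}))
rownames(long) <- NULL
long
```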

purrr::map_df() drops NULL rows

When using purrr::map_df(), I will occasionally pass in a list of data frames where some items are NULL. When I do, map_df() returns a data frame with fewer rows than the original list has elements.
I assume what's going on is that map_df() calls dplyr::bind_rows() which ignores NULL values. However, I'm not sure how to identify my problematic rows.
Here's an example:
library(purrr)
problemlist <- list(NULL, NULL, structure(list(bounds = structure(list(northeast = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), southwest = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L), location = structure(list(
lat = 41.49, lng = -71.46), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), location_type = "ROOFTOP",
viewport = structure(list(northeast = structure(list(lat = 41.49,
lng = -71.46), .Names = c("lat", "lng"), class = "data.frame", row.names = 1L),
southwest = structure(list(lat = 41.49, lng = -71.46), .Names = c("lat",
"lng"), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L)), .Names = c("bounds",
"location", "location_type", "viewport"), class = "data.frame", row.names = 1L))
# what actually happens
map_df(problemlist, 'location')
# lat lng
# 1 41.49 -71.46
# desired result
map_df_with_Null_handling(problemlist, 'location')
# lat lng
# 1 NA NA
# 2 NA NA
# 3 41.49 -71.46
I considered wrapping my location accessor in one of purrr's error handling functions (eg. safely() or possibly()), but it's not that I'm running into errors--I'm just not getting the desired results.
What's the best way to handle NULL values with map_df()?
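You can use the (as-of-present undocumented) .null argument for any of the map*() functions to tell the function what to return when it encounters a NULL value. (In more recent purrr releases this argument is, I believe, named .default, and data_frame() has since been deprecated in favour of tibble().)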
map_df(problemlist, 'location', .null = data_frame(lat = NA, lng = NA) )
# lat lng
# 1 NA NA
# 2 NA NA
# 3 41.49 -71.46
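If you would rather not depend on that argument, a base R sketch along the following lines gives the same shape of result and also shows which positions were NULL. The cut-down problemlist, na_row and null_positions are illustrative names, not part of purrr:

```r
# Cut-down stand-in for problemlist: two NULLs and one geocoding result
problemlist <- list(
  NULL,
  NULL,
  list(location = data.frame(lat = 41.49, lng = -71.46))
)

# Positions of the problematic (NULL) elements
null_positions <- which(vapply(problemlist, is.null, logical(1)))

# One-row placeholder used wherever an element is NULL
na_row <- data.frame(lat = NA_real_, lng = NA_real_)

# Extract "location" from each element, substituting the placeholder for NULLs
locations <- lapply(problemlist, function(x) {
  if (is.null(x)) na_row else x[["location"]]
})

# Stack into one data frame with one row per original list element
result <- do.call(rbind, locations)
result
```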

Flattening list with variable nesting levels creates additional observations

I have a nested list of geocoded Moscow street addresses, converted from a nested list. However, the dataframe I was geocoding from had only addresses without zip codes, and in a few hundred (out of 33k) cases, the address returned multiple results for the same street address with different zipcodes. This created additional nesting in the list, which when converted to a dataframe results in a differing number of observations from the initial dataframe.
A result with only one address has the following structure:
(Ignore the gibberish; the R console will not render Cyrillic correctly.)
structure(list(results = structure(list(address_components = list(
structure(list(long_name = c("4", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ",
"127299"), short_name = c("4", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU",
"127299"), types = list("street_number", "route", c("political",
"sublocality", "sublocality_level_1"), c("locality", "political"
), c("administrative_area_level_2", "political"), c("country",
"political"), "postal_code")), .Names = c("long_name", "short_name",
"types"), class = "data.frame", row.names = c(NA, 7L))),
formatted_address = "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 4, Ìîñêâà, Ðîññèÿ, 127299",
geometry = structure(list(location = structure(list(lat = 55.8176896,
lng = 37.522891), .Names = c("lat", "lng"), class = "data.frame", row.names = 1L),
location_type = "ROOFTOP", viewport = structure(list(
northeast = structure(list(lat = 55.8190385802915,
lng = 37.5242399802915), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1L), southwest = structure(list(
lat = 55.8163406197085, lng = 37.5215420197085), .Names = c("lat",
"lng"), class = "data.frame", row.names = 1L)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1L)), .Names = c("location",
"location_type", "viewport"), class = "data.frame", row.names = 1L),
partial_match = TRUE, place_id = "ChIJ59yLsy1ItUYR5EEBFbFJoSA",
types = list("street_address")), .Names = c("address_components",
"formatted_address", "geometry", "partial_match", "place_id",
"types"), class = "data.frame", row.names = 1L), status = "OK"), .Names = c("results",
"status"))
Whereas a result with multiple possible addresses looks like:
structure(list(results = structure(list(address_components = list(
structure(list(long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "Ðîññèÿ",
"127299"), short_name = c("23", "óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ",
"Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã", "Ìîñêâà", "Ìîñêâà", "RU",
"127299"), types = list("street_number", "route", c("political",
"sublocality", "sublocality_level_1"), c("locality", "political"
), c("administrative_area_level_2", "political"), c("country",
"political"), "postal_code")), .Names = c("long_name", "short_name",
"types"), class = "data.frame", row.names = c(NA, 7L)), structure(list(
long_name = c("23", "óëèöà Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã",
"Ìîñêâà", "Ìîñêâà", "Ðîññèÿ", "125008"), short_name = c("23",
"óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ", "Ñåâåðíûé àäìèíèñòðàòèâíûé îêðóã",
"Ìîñêâà", "Ìîñêâà", "RU", "125008"), types = list("street_number",
"route", c("political", "sublocality", "sublocality_level_1"
), c("locality", "political"), c("administrative_area_level_2",
"political"), c("country", "political"), "postal_code")), .Names = c("long_name",
"short_name", "types"), class = "data.frame", row.names = c(NA,
7L))), formatted_address = c("óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 127299",
"óë. Áîëüøàÿ Àêàäåìè÷åñêàÿ, 23, Ìîñêâà, Ðîññèÿ, 125008"), geometry = structure(list(
location = structure(list(lat = c(55.8169112, 55.826859),
lng = c(37.5202899, 37.529427)), .Names = c("lat", "lng"
), class = "data.frame", row.names = 1:2), location_type = c("ROOFTOP",
"ROOFTOP"), viewport = structure(list(northeast = structure(list(
lat = c(55.8182601802915, 55.8282079802915), lng = c(37.5216388802915,
37.5307759802915)), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2),
southwest = structure(list(lat = c(55.8155622197085,
55.8255100197085), lng = c(37.5189409197085, 37.5280780197085
)), .Names = c("lat", "lng"), class = "data.frame", row.names = 1:2)), .Names = c("northeast",
"southwest"), class = "data.frame", row.names = 1:2)), .Names = c("location",
"location_type", "viewport"), class = "data.frame", row.names = 1:2),
partial_match = c(TRUE, TRUE), place_id = c("ChIJnVMw7C1ItUYRdfeWEQrXuAk",
"ChIJnbnwOdY3tUYR1_D9pHTqCsI"), types = list("street_address",
"street_address")), .Names = c("address_components",
"formatted_address", "geometry", "partial_match", "place_id",
"types"), class = "data.frame", row.names = 1:2), status = "OK"), .Names = c("results",
"status"))
In the results element of the second list there is an additional level of nesting for each possible address. When flattened, this creates an "extra" observation for that address, making it impossible to cbind() the geocoding results back to the list of addresses. I am using the following functions to flatten my nested lists to data frames. How can I modify them to take only the first address when this additional nesting occurs? If the address is incorrect, the building will simply be discarded from the sample when I later merge with another dataframe, so I am concerned only with making each geocoded observation match the appropriate row in the original dataframe (the source of the addresses).
flatten_googleway <- function(df) {
require(jsonlite)
res <- jsonlite::flatten(df)
res[, names(res) %in% c("geometry.location_type", "geometry.location.lat",
"geometry.location.lng", "formatted_address")]
}
moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
if (length(x$results) == 0) template_res[1, ] else flatten_googleway(x$results)
}))
## template for NA results (indexing row 1 of this zero-row data frame yields a row of NAs)
template_res <- structure(list(formatted_address = character(0), geometry.location_type = character(0),
geometry.location.lat = numeric(0), geometry.location.lng = numeric(0)), .Names = c("formatted_address",
"geometry.location_type", "geometry.location.lat", "geometry.location.lng"
), row.names = integer(0), class = "data.frame")
Whoops, I was massively over-complicating things, as usual. I was able to fix this simply by modifying the lapply() call so that the NA template row replaces both the list elements with no results and the elements where x$results$formatted_address has length greater than 1 (as is the case when multiple possible results are returned).
moscowhousegeo.df <- do.call(rbind, lapply(moscowhouse.list, function(x) {
if (length(x$results) == 0 || length(x$results$formatted_address) > 1) template_res[1, ] else flatten_googleway(x$results)
}))
I still lose some data this way unfortunately, but identifying which address is correct out of the options given would likely be too time-consuming, and a bit silly in a dataset with so many observations.
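If keeping the first candidate is acceptable, as the question originally asked, one option is to take the first row of each result instead of replacing the whole result with NAs. The sketch below uses a simplified, already-flat stand-in for the geocoder output; geocoded, template_row and first_or_na() are hypothetical names, and with the real data flatten_googleway() would be applied before taking the first row:

```r
# One-row NA placeholder with the same columns as a flattened result
template_row <- data.frame(formatted_address = NA_character_,
                           lat = NA_real_, lng = NA_real_,
                           stringsAsFactors = FALSE)

# Toy geocoder output: one match, two candidate matches, and no match
geocoded <- list(
  list(results = data.frame(formatted_address = "Street A, 4",
                            lat = 55.81, lng = 37.52,
                            stringsAsFactors = FALSE)),
  list(results = data.frame(formatted_address = c("Street A, 23 (first zip)",
                                                  "Street A, 23 (second zip)"),
                            lat = c(55.81, 55.82), lng = c(37.52, 37.53),
                            stringsAsFactors = FALSE)),
  list(results = list())
)

# Keep exactly one row per input: the first candidate, or NAs if there is none
first_or_na <- function(x) {
  res <- x$results
  if (length(res) == 0) template_row else res[1, , drop = FALSE]
}

flat <- do.call(rbind, lapply(geocoded, first_or_na))
flat
```

Because every input element contributes exactly one row, flat has as many rows as geocoded has elements and can be bound back to the original addresses by position.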

How to search for string patterns in another string and include a separator?

My data is structured as follows:
dput(head(CharacterAnalysis,5))
structure(list(Character = c("A", "a", "B", "b", "C"),
Descriptor = c("Jog", "Change Direction", "Shuffle", "Walk", "Stop")),
.Names = c("Character", "Descriptor"),
row.names = c(NA, 5L), class = "data.frame")
I wish to look up each Character and its Descriptor for the strings in the following data frame, but am unsure how to do so:
dput(head(StringAnalysis,3))
structure(list(MovementString = c("ACb", "aAaB", "BbCa")),
.Names = c("MovementString"),
row.names = c(NA, 3L), class = "data.frame")
My expected outcome/ data frame would be:
dput(head(Output,3))
structure(list(MovementString = c("ACb", "aAaB", "BbCa"),
MovementPerformed = c("Jog/ Stop/ Walk", "Change Direction/ Jog/ Change Direction/ Shuffle", "Shuffle/ Walk/ Stop/ Change Direction")),
.Names = c("MovementString", "MovementPerformed"),
row.names = c(NA, 3L), class = "data.frame")
I would like a forward slash (/) or similar to separate each Descriptor, as it signals a new movement. Any advice on how to do this? My data frame CharacterAnalysis is over 1 million rows long, so I do not wish to have to search for each MovementString separately!
Thank you.
CharacterAnalysis <-
structure(list(Character = c("A", "a", "B", "b", "C"),
Descriptor = c("Jog", "Change Direction", "Shuffle", "Walk", "Stop")),
.Names = c("Character", "Descriptor"),
row.names = c(NA, 5L), class = "data.frame")
Output <-
structure(list(MovementString = c("ACb", "aAaB", "BbCa"),
MovementPerformed = c("Jog/ Stop/ Walk", "Change Direction/ Jog/ Change Direction/ Shuffle", "Shuffle/ Walk/ Stop/ Change Direction")),
.Names = c("MovementString", "MovementPerformed"),
row.names = c(NA, 3L), class = "data.frame")
# A simple approach based on names
# Build the lookup table just once
m <- CharacterAnalysis$Descriptor
names(m) <- CharacterAnalysis$Character
# Build the MovementPerformed column
Output$MovementPerformed <-
sapply(strsplit(Output$MovementString,""),
FUN = function(x) paste(m[x], collapse = "/ "))
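For completeness, here is a self-contained sketch that applies the same named-vector lookup to the input data frame StringAnalysis rather than to Output. It uses vapply() so the result is guaranteed to be a character vector; the name lookup is made up for illustration:

```r
CharacterAnalysis <- data.frame(
  Character  = c("A", "a", "B", "b", "C"),
  Descriptor = c("Jog", "Change Direction", "Shuffle", "Walk", "Stop"),
  stringsAsFactors = FALSE
)
StringAnalysis <- data.frame(
  MovementString = c("ACb", "aAaB", "BbCa"),
  stringsAsFactors = FALSE
)

# Named vector: names are the single characters, values are the descriptors
lookup <- setNames(CharacterAnalysis$Descriptor, CharacterAnalysis$Character)

# Split each string into characters, translate them, and join with "/ "
StringAnalysis$MovementPerformed <- vapply(
  strsplit(StringAnalysis$MovementString, ""),
  function(chars) paste(lookup[chars], collapse = "/ "),
  character(1)
)
StringAnalysis
```

Indexing a named vector by name is case-sensitive, which is what keeps "A" and "a" apart. A character with no entry in the lookup would come out as the string "NA" in the pasted result.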
