I would like to make some specific calculation within a large dataset.
This is my MWE using an API call (takes 3-4 sec ONLY to Download)
devtools::install_github('mingjerli/IMFData')
library(IMFData)
fdi_asst <- c("BFDA_BP6_USD","BFDAD_BP6_USD","BFDAE_BP6_USD")
databaseID <- "BOP"
startdate <- "1980-01-01"
enddate <- "2016-12-31"
checkquery <- FALSE
FDI_ASSETS <- as.data.frame(CompactDataMethod(databaseID, list(CL_FREA = "Q", CL_AREA_BOP = "", CL_INDICATOR_BOP= fdi_asst), startdate, enddate, checkquery))
my dataframe 'FDI_ASSETS' looks like this (I provide a picture instead of head() for convenience)
the last column is a list and contains three more variables:
head(FDI_ASSETS$Obs)
[[1]]
#TIME_PERIOD #OBS_VALUE #OBS_STATUS
1 1980-Q1 30.0318922812441 <NA>
2 1980-Q2 23.8926174547104 <NA>
3 1980-Q3 26.599634375058 <NA>
4 1980-Q4 32.7522451203517 <NA>
5 1981-Q1 44.124979234001 <NA>
6 1981-Q2 35.9907120805994 <NA>
MY SCOPE
I want to do the following:
if/when the "#UNIT_MULT == 6" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000
if/when the "#UNIT_MULT == 3" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000000
UPDATE
Structure of FDI_ASSETS looks like this:
str(FDI_ASSETS)
'data.frame': 375 obs. of 6 variables:
$ #FREQ : chr "Q" "Q" "Q" "Q" ...
$ #REF_AREA : chr "FI" "MX" "MX" "TO" ...
$ #INDICATOR : chr "BFDAE_BP6_USD" "BFDAD_BP6_USD" "BFDAE_BP6_USD" "BFDAD_BP6_USD" ...
$ #UNIT_MULT : chr "6" "6" "6" "3" ...
$ #TIME_FORMAT: chr "P3M" "P3M" "P3M" "P3M" ...
$ Obs :List of 375
..$ :'data.frame': 147 obs. of 3 variables:
.. ..$ #TIME_PERIOD: chr "1980-Q1" "1980-Q2" "1980-Q3" "1980-Q4" ...
.. ..$ #OBS_VALUE : chr "30.0318922812441" "23.8926174547104" "26.599634375058" "32.7522451203517" ...
.. ..$ #OBS_STATUS : chr NA NA NA NA ...
..$ :'data.frame': 60 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q3" "2002-Q1" "2002-Q2" ...
.. ..$ #OBS_VALUE : chr "9.99999999748979E-05" "9.99999997475243E-05" "9.8999999998739E-05" "-9.90000000342661E-05" ...
..$ :'data.frame': 63 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q2" "2001-Q3" "2001-Q4" ...
.. ..$ #OBS_VALUE : chr "130.0149" "189.627" "3453.8319" "630.483" ...
..$ :'data.frame': 17 obs. of 2 variables:
I downloaded your data and it is quite complicated. I have removed my wrong answer so that you can get it answered by #akrun or someone similar :) I don't have the time to parse through it right now.
I found the following solution
list_assets<-list(FDI_ASSETS=FDI_ASSETS, Portfolio_ASSETS=Portfolio_ASSETS, other_invest_ASSETS=other_invest_ASSETS, fin_der_ASSETS=fin_der_ASSETS, Reserves=Reserves)
for (df in list_assets){
for( i in 1:length(df$"#UNIT_MULT")){
if (df$"#UNIT_MULT"[i]=="6"){
df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE")
df$Obs[[i]]$"#OBS_VALUE" <- df$Obs[[i]]$"#OBS_VALUE"/1000
} else if ((df$"#UNIT_MULT"[i]=="3")){
df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE")
df$Obs[[i]]$"#OBS_VALUE" <- df$Obs[[i]]$"#OBS_VALUE"/1000000
}
}
}
Please let me know how I can modify the code in order to make it more efficient and avoid these loops.
Related
I have a list, List_A, which contains elements with a similar pattern. I am trying to nest the similar items based on the pattern. Example below
List_A <- list("Q_2020", "Q_2021", "C_2019", "C_2020", "K_2020")
Output:
# List of 5
# $ : chr "Q_2020"
# $ : chr "Q_2021"
# $ : chr "C_2019"
# $ : chr "C_2020"
# $ : chr "K_2020"
Desired Output:
# List of 3
# $ Q:List of 2
# ..$ : chr "Q_2020"
# ..$ : chr "Q_2021"
# $ C:List of 2
# ..$ : chr "C_2019"
# ..$ : chr "C_2020"
# $ K: chr "K_2020"
I know the pattern would be before "_" so I am using sub() to get the pattern - but I would like some input on how I can create the nested lists. My attempt is a bit too noisy: 1) extract pattern i.e. Q, K, C, 2) create nested list skeleton, 3) Loop an if statement.
Any input is appreciated.
We could split by substring
v1 <- unlist(List_A)
subv1 <- trimws(v1, whitespace = "_.*")
List_A_new <- lapply(split(v1, factor(subv1, levels = unique(subv1))), as.list)
-output structure
str(List_A_new)
List of 3
$ Q:List of 2
..$ : chr "Q_2020"
..$ : chr "Q_2021"
$ C:List of 2
..$ : chr "C_2019"
..$ : chr "C_2020"
$ K:List of 1
..$ : chr "K_2020"
Do you need this?
> str(split(List_A, factor(u <- substr(unlist(List_A), 1, 1), levels = unique(u))))
List of 3
$ Q:List of 2
..$ : chr "Q_2020"
..$ : chr "Q_2021"
$ C:List of 2
..$ : chr "C_2019"
..$ : chr "C_2020"
$ K:List of 1
..$ : chr "K_2020"
I have a dataframe nested within a dataframe that I'm getting from Mongo. The number of rows match in each so that when viewed it looks like a typical dataframe. My question, how do I expand the nested dataframe into the parent so that I can run dplyr selects? See the layout below
'data.frame': 10 obs. of 2 variables:
$ _id : int 1551 1033 1061 1262 1032 1896 1080 1099 1679 1690
$ personalInfo:'data.frame': 10 obs. of 2 variables:
..$ FirstName :List of 10
.. ..$ : chr "Jack"
.. ..$ : chr "Yogesh"
.. ..$ : chr "Steven"
.. ..$ : chr "Richard"
.. ..$ : chr "Thomas"
.. ..$ : chr "Craig"
.. ..$ : chr "David"
.. ..$ : chr "Aman"
.. ..$ : chr "Frank"
.. ..$ : chr "Robert"
..$ MiddleName :List of 10
.. ..$ : chr "B"
.. ..$ : NULL
.. ..$ : chr "J"
.. ..$ : chr "I"
.. ..$ : chr "E"
.. ..$ : chr "A"
.. ..$ : chr "R"
.. ..$ : NULL
.. ..$ : chr "J"
.. ..$ : chr "E"
As per suggestion, here's how you recreate the data
id <- c(1551, 1033, 1061, 1262, 1032, 1896, 1080, 1099, 1679, 1690)
fname <- list("Jack","Yogesh","Steven","Richard","Thomas","Craig","David","Aman","Frank","Robert")
mname <- list("B",NULL,"J","I","E","A","R",NULL,"J","E")
sub <- as.data.frame(cbind(fname, mname))
master <- as.data.frame(id)
master$personalInfo <- sub
We could loop the 'personalInfo', change the NULL elements of the list to NA and convert it to a real dataset with 3 columns
library(tidyverse)
out <- master %>%
pull(personalInfo) %>%
map_df(~ map_chr(.x, ~ replace(.x, is.null(.x), NA))) %>%
bind_cols(master %>%
select(id), .)
str(out)
#'data.frame': 10 obs. of 3 variables:
# $ id : num 1551 1033 1061 1262 1032 ...
# $ fname: chr "Jack" "Yogesh" "Steven" "Richard" ...
# $ mname: chr "B" NA "J" "I" ...
While #akrun's answer is probably more practical and probably the way to tidy your data, I think this output is closer to what you describe.
I create a new environment where I put the data.frame's content, there I unlist to the said environment the content of your problematic column, and finally I wrap it all back into a data.frame.
I use a strange hack with cbind as as.data.frame is annoying with list columns. Using tibble::as_tibble works fine however.
new_env <- new.env()
list2env(master,new_env)
list2env(new_env$personalInfo,new_env)
rm(personalInfo,envir = new_env)
res <- as.data.frame(do.call(cbind,as.list(new_env))) # or as_tibble(as.list(new_env))
rm(new_env)
res
# fname id mname
# 1 Jack 1551 B
# 2 Yogesh 1033 NULL
# 3 Steven 1061 J
# 4 Richard 1262 I
# 5 Thomas 1032 E
# 6 Craig 1896 A
# 7 David 1080 R
# 8 Aman 1099 NULL
# 9 Frank 1679 J
# 10 Robert 1690 E
str(res)
# 'data.frame': 10 obs. of 3 variables:
# $ fname:List of 10
# ..$ : chr "Jack"
# ..$ : chr "Yogesh"
# ..$ : chr "Steven"
# ..$ : chr "Richard"
# ..$ : chr "Thomas"
# ..$ : chr "Craig"
# ..$ : chr "David"
# ..$ : chr "Aman"
# ..$ : chr "Frank"
# ..$ : chr "Robert"
# $ id :List of 10
# ..$ : num 1551
# ..$ : num 1033
# ..$ : num 1061
# ..$ : num 1262
# ..$ : num 1032
# ..$ : num 1896
# ..$ : num 1080
# ..$ : num 1099
# ..$ : num 1679
# ..$ : num 1690
# $ mname:List of 10
# ..$ : chr "B"
# ..$ : NULL
# ..$ : chr "J"
# ..$ : chr "I"
# ..$ : chr "E"
# ..$ : chr "A"
# ..$ : chr "R"
# ..$ : NULL
# ..$ : chr "J"
# ..$ : chr "E"
I was trying to convert below nested list into data.frame but without luck. There are a few complications, mainly the column "results" of position 1 is inconsistent with position 2, as there is no result in position 2.
item length inconsistent across different positions
[[1]]
[[1]]$html_attributions
list()
[[1]]$results
geometry.location.lat geometry.location.lng
1 25.66544 -100.4354
id place_id
1 6ce0a030663144c8e992cbce51eb00479ef7db89 ChIJVy7b7FW9YoYRdaH2I_gOJIk
reference
1 CmRSAAAATdtVfB4Tz1aQ8GhGaw4-nRJ5lZlVNgiOR3ciF4QjmYC56bn6b7omWh1SJEWWqQQEFNXxGZndgEwSgl8sRCOtdF8aXpngUY878Q__yH4in8EMZMCIqSHLARqNgGlV4mKgEhDlvkHLXLiBW4F_KQVT83jIGhS5DJipk6PAnpPDXP2p-4X5NPuG9w
[[1]]$status
[1] "OK"
[[2]]
[[2]]$html_attributions
list()
[[2]]$results
list()
[[2]]$status
[1] "ZERO_RESULTS"
I tried the following codes but they aint' working.
#1
m1 <- do.call(rbind, lapply(myDataFrames, function(y) do.call(rbind, y)))
relist(m1, skeleton = myDataFrames)
#2
relist(matrix(unlist(myDataFrames), ncol = 4, byrow = T), skeleton = myDataFrames)
#3
library(data.table)
df<-rbindlist(myDataFrames, idcol = "index")
df<-rbindlist(myDataFrames, fill=TRUE)
#4
myDataFrame <- do.call(rbind.data.frame, c(myDataFrames, list(stringsAsFactors = FALSE)))
I think I have enough of the original JSON to be able to create a reproducible example:
okjson <- '{"html_attributions":[],"results":[{"geometry":{"location":{"lat":25.66544,"lon":-100.4354},"id":"foo","place_id":"quux"}}],"status":"OK"}'
emptyjson <- '{"html_attributions":[],"results":[],"status":"ZERO_RESULTS"}'
jsons <- list(okjson, emptyjson, okjson)
From here, I'll step (slowly) through the process. I've included much of the intermediate structure for reproducibility, I apologize for the verbosity. This can easily be grouped together and/or put within a magrittr pipeline.
lists <- lapply(jsons, jsonlite::fromJSON)
str(lists)
# List of 3
# $ :List of 3
# ..$ html_attributions: list()
# ..$ results :'data.frame': 1 obs. of 1 variable:
# .. ..$ geometry:'data.frame': 1 obs. of 3 variables:
# .. .. ..$ location:'data.frame': 1 obs. of 2 variables:
# .. .. .. ..$ lat: num 25.7
# .. .. .. ..$ lon: num -100
# .. .. ..$ id : chr "foo"
# .. .. ..$ place_id: chr "quux"
# ..$ status : chr "OK"
# $ :List of 3
# ..$ html_attributions: list()
# ..$ results : list()
# ..$ status : chr "ZERO_RESULTS"
# $ :List of 3
# ..$ html_attributions: list()
# ..$ results :'data.frame': 1 obs. of 1 variable:
# .. ..$ geometry:'data.frame': 1 obs. of 3 variables:
# .. .. ..$ location:'data.frame': 1 obs. of 2 variables:
# .. .. .. ..$ lat: num 25.7
# .. .. .. ..$ lon: num -100
# .. .. ..$ id : chr "foo"
# .. .. ..$ place_id: chr "quux"
# ..$ status : chr "OK"
goodlists <- Filter(function(a) "results" %in% names(a) && length(a$results) > 0, lists)
goodresults <- lapply(goodlists, `[[`, "results")
str(goodresults)
# List of 2
# $ :'data.frame': 1 obs. of 1 variable:
# ..$ geometry:'data.frame': 1 obs. of 3 variables:
# .. ..$ location:'data.frame': 1 obs. of 2 variables:
# .. .. ..$ lat: num 25.7
# .. .. ..$ lon: num -100
# .. ..$ id : chr "foo"
# .. ..$ place_id: chr "quux"
# $ :'data.frame': 1 obs. of 1 variable:
# ..$ geometry:'data.frame': 1 obs. of 3 variables:
# .. ..$ location:'data.frame': 1 obs. of 2 variables:
# .. .. ..$ lat: num 25.7
# .. .. ..$ lon: num -100
# .. ..$ id : chr "foo"
# .. ..$ place_id: chr "quux"
goodresultsdf <- lapply(goodresults, function(a) jsonlite::flatten(as.data.frame(a)))
str(goodresultsdf)
# List of 2
# $ :'data.frame': 1 obs. of 4 variables:
# ..$ geometry.id : chr "foo"
# ..$ geometry.place_id : chr "quux"
# ..$ geometry.location.lat: num 25.7
# ..$ geometry.location.lon: num -100
# $ :'data.frame': 1 obs. of 4 variables:
# ..$ geometry.id : chr "foo"
# ..$ geometry.place_id : chr "quux"
# ..$ geometry.location.lat: num 25.7
# ..$ geometry.location.lon: num -100
We now have a list-of-data.frames, a good place to be.
do.call(rbind.data.frame, c(goodresultsdf, stringsAsFactors = FALSE))
# geometry.id geometry.place_id geometry.location.lat geometry.location.lon
# 1 foo quux 25.66544 -100.4354
# 2 foo quux 25.66544 -100.4354
I am having trouble understanding the outputs when using this google_distance function. When using mydist() in ggmap I would get the number of miles, minutes, hours that it would take to get to point A to point B.
Now my output looks like this when I use google_distance. Can anyone help explain what each of the numbers is referring to?
$rows
elements
1 791 km, 790588, 7 hours 28 mins, 26859, 7 hours 35 mins, 27286, OK
My code is as follows:
results <- google_distance(origins = list(c(26.19660, -98.23591)),
destinations = list(c(31.62327, -94.64276)),
mode = "driving", key = key, simplify = TRUE)
What you're seeing is the standard JSON response, but simplified into a data.frame (as per the simplify = TRUE argument)
If you look one level deeper at your response, you'll get the description of those valeus
results$rows$elements
# [[1]]
# distance.text distance.value duration.text duration.value duration_in_traffic.text duration_in_traffic.value
# 1 791 km 790588 7 hours 28 mins 26859 7 hours 28 mins 26906
where
distance.value is in metres
duration.value is in seconds
Similarly, looking at the structure of the result object, you'll see all the JSON elements
str(results)
# List of 4
# $ destination_addresses: chr "805 E College St, Nacogdoches, TX, USA"
# $ origin_addresses : chr "1400-1498 W Houston Ave, McAllen, TX 78501, USA"
# $ rows :'data.frame': 1 obs. of 1 variable:
# ..$ elements:List of 1
# .. ..$ :'data.frame': 1 obs. of 4 variables:
# .. .. ..$ distance :'data.frame': 1 obs. of 2 variables:
# .. .. .. ..$ text : chr "791 km"
# .. .. .. ..$ value: int 790588
# .. .. ..$ duration :'data.frame': 1 obs. of 2 variables:
# .. .. .. ..$ text : chr "7 hours 28 mins"
# .. .. .. ..$ value: int 26859
# .. .. ..$ duration_in_traffic:'data.frame': 1 obs. of 2 variables:
# .. .. .. ..$ text : chr "7 hours 28 mins"
# .. .. .. ..$ value: int 26906
# .. .. ..$ status : chr "OK"
# $ status : chr "OK"
Further Reference:
Google Developers Guide: Distance Matrix
I am trying to read in a file in "flexible data format" using R.
I got the number of bytes I should be reading in (counting from EOF, e.g., I should be reading EOF-32 to EOF bytes in as my data).
I am seeking the equivalences to the fseek and fread from MATLAB in R.
I think you would do better with a different approach (if I've got the right "flexible data format" file format here). You can deal with much of these (horrible) files with basic string functions in R:
library(stringr)
# read in fdf file
l <- readLines("http://rud.is/dl/Fe.fdf")
# some basic cleanup
l <- sub("#.*$", "", l) # remove comments
l <- sub("^=.*$", "", l) # remove comments
l <- gsub("\ +", " ", l) # compress spaces
l <- str_trim(l) # beg/end space trim
l <- grep("^$", l, value=TRUE, invert=TRUE) # ignore blank lines
# start of data blocks
blocks <- which(grepl("^%block", l))
# all "easy"/simple lines
simple <- str_split_fixed(grep("^[[:digit:]%]", l, value=TRUE, invert=TRUE),
"[[:space:]]+", 2)
# "simple" name/val [unit] conversions
convert_vals <- function(simple) {
vals <- simple[,2]
names(vals) <- simple[,1]
lapply(vals, function(v) {
# if logical
if (tolower(v) %in% c("t", "true", ".true.", "f", "false", ".false.")) {
return(as.logical(gsub("\\.", "", v)))
}
# if it's just a number
# i may be missing a numeric fmt char in this horrible format
if (grepl("^[[:digit:]\\.\\+\\-]+$", v)) {
return(as.numeric(v))
}
# if value and unit convert to an actual number with a unit attribute
# or convert it here from the table starting on line 927 of fdf.f
if (grepl("^[[:digit:]]", v) & (!any(is.na(str_locate(v, " "))))) {
vu <- str_split_fixed(v, " ", 2)
x <- as.numeric(vu[,1])
attr(x, "unit") <- vu[,2]
return(x)
}
# handle "1.d-3" and other vals with other if's
# anything not handled is returned
return(v)
})
}
# handle begin/end block "complex" data conversion
convert_blocks <- function(lines) {
block_names <- sub("^%block ", "", grep("^%block", lines, value=TRUE))
lapply(blocks, function(blk_start) {
blk <- lines[blk_start]
blk_info <- str_split_fixed(blk, " ", 2)
blk_end <- which(grepl(sprintf("^%%endblock %s", blk_info[,2]), lines))
# this is overly simplistic since you have to do some conversions, but you know the line
# range of the data values now so you can process them however you need to
read.table(text=lines[(blk_start+1):(blk_end-1)],
header=FALSE, stringsAsFactors=FALSE, fill=TRUE)
}) -> blks
names(blks) <- block_names
return(blks)
}
fdf <- c(convert_vals(simple),
convert_blocks(l))
str(fdf)
Output of the str:
List of 32
$ SystemName : chr "bcc Fe ferro GGA"
$ SystemLabel : chr "Fe"
$ WriteCoorStep : chr ""
$ WriteMullikenPop : num 1
$ NumberOfSpecies : num 1
$ NumberOfAtoms : num 1
$ PAO.EnergyShift : atomic [1:1] 50
..- attr(*, "unit")= chr "meV"
$ PAO.BasisSize : chr "DZP"
$ Fe : num 2
$ LatticeConstant : atomic [1:1] 2.87
..- attr(*, "unit")= chr "Ang"
$ KgridCutoff : atomic [1:1] 15
..- attr(*, "unit")= chr "Ang"
$ xc.functional : chr "GGA"
$ xc.authors : chr "PBE"
$ SpinPolarized : logi TRUE
$ MeshCutoff : atomic [1:1] 150
..- attr(*, "unit")= chr "Ry"
$ MaxSCFIterations : num 40
$ DM.MixingWeight : num 0.1
$ DM.Tolerance : chr "1.d-3"
$ DM.UseSaveDM : logi TRUE
$ DM.NumberPulay : num 3
$ SolutionMethod : chr "diagon"
$ ElectronicTemperature : atomic [1:1] 25
..- attr(*, "unit")= chr "meV"
$ MD.TypeOfRun : chr "cg"
$ MD.NumCGsteps : num 0
$ MD.MaxCGDispl : atomic [1:1] 0.1
..- attr(*, "unit")= chr "Ang"
$ MD.MaxForceTol : atomic [1:1] 0.04
..- attr(*, "unit")= chr "eV/Ang"
$ AtomicCoordinatesFormat : chr "Fractional"
$ ChemicalSpeciesLabel :'data.frame': 1 obs. of 3 variables:
..$ V1: int 1
..$ V2: int 26
..$ V3: chr "Fe"
$ PAO.Basis :'data.frame': 5 obs. of 3 variables:
..$ V1: chr [1:5] "Fe" "0" "6." "2" ...
..$ V2: num [1:5] 2 2 0 2 0
..$ V3: chr [1:5] "" "P" "" "" ...
$ LatticeVectors :'data.frame': 3 obs. of 3 variables:
..$ V1: num [1:3] 0.5 0.5 0.5
..$ V2: num [1:3] 0.5 -0.5 0.5
..$ V3: num [1:3] 0.5 0.5 -0.5
$ BandLines :'data.frame': 5 obs. of 5 variables:
..$ V1: int [1:5] 1 40 28 28 34
..$ V2: num [1:5] 0 2 1 0 1
..$ V3: num [1:5] 0 0 1 0 1
..$ V4: num [1:5] 0 0 0 0 1
..$ V5: chr [1:5] "\\Gamma" "H" "N" "\\Gamma" ...
$ AtomicCoordinatesAndAtomicSpecies:'data.frame': 1 obs. of 4 variables:
..$ V1: num 0
..$ V2: num 0
..$ V3: num 0
..$ V4: int 1
You can see the output (and the file and this code) in this gist since it's easier to copy/past/clone a gist.
You still need to:
deal with unit conversion (but with this grid::unit-like structure that shld be far more straightforward)
swap out the naive read.table with a better "block reader"
deal with file includes (pretty simple, tho, if you add a function or two)
With a bit of tweaking/polish this cld be a new R package, not that I'd ever want a data file in this format ever.