Creating a codebook in R that considers non-labelled variables

I found a useful and simple function which creates a codebook for .dta datasets in R, the code is the following:
codebook <- map_df(dt, function(x) attributes(x)$label) %>%
gather(key = Code, value = Label) %>%
mutate(Type = map_chr(dt, typeof),
Mean = map_dbl(dt, mean, na.rm = T),
Prop_miss = map_dbl(dt, function(x) mean(is.na(x))))
The function works just fine unless the dataset contains unlabeled variables, in which case it fails. I would like to modify it so that it also reports variables with no label as "UNLABELED", along with the rest of the information. I tried something like this:
if (map_df(dt, function(x) attributes(x)$label) == NULL) {
attr(function(x) dt$(x), "label") <- "NO LABEL"
}
But it doesn't work (I am not very experienced with coding in R).
Thanks in advance for the help.

The if/else condition would go inside the function passed to map_dfr:
library(purrr)
map_dfr(dt, ~ {
if(is.null(attributes(.x)$label)) {
attr(.x, "label") <- "NO LABEL"
}
attributes(.x)$label})
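To show how the fallback label fits into the full codebook, here is a minimal base R sketch. It assumes every column is numeric (as the original Mean column already requires); the toy data frame and the helper name get_label are made up for illustration, and vapply() stands in for map_chr()/map_dbl().

```r
# Hypothetical toy data: `age` carries a label, `income` does not
dt <- data.frame(age = c(21, 35, NA), income = c(1000, 2500, 1800))
attr(dt$age, "label") <- "Age in years"

# Return the "label" attribute, or a fallback when it is missing
get_label <- function(x, default = "UNLABELED") {
  lbl <- attr(x, "label", exact = TRUE)
  if (is.null(lbl)) default else as.character(lbl)
}

# One row per variable; unlabeled variables get the fallback text
codebook <- data.frame(
  Code      = names(dt),
  Label     = vapply(dt, get_label, character(1)),
  Type      = vapply(dt, typeof, character(1)),
  Mean      = vapply(dt, mean, numeric(1), na.rm = TRUE),
  Prop_miss = vapply(dt, function(x) mean(is.na(x)), numeric(1)),
  row.names = NULL
)
print(codebook)
```

Because the fallback is applied per column, the label vector always has the same length as the other columns, which is what the original pipeline needs.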

Related

readOGR: error in evaluating the argument 'x' in selecting a method for function 'merge': length(dsn) == 1L is not TRUE

I'm attempting to adapt a function written by a coworker, but I've hit a point where neither of us knows how to solve the issue. Generally, the function reads in several different shapefiles kept in different directories, then merges them all with another necessary .csv dataset (facilitycrosswalk). I haven't really worked with spatial data in R before, so some of this is totally foreign to me. Apologies for anything confusing, as this is not my code. Here's a general recreation:
#working directory
working <- "C:/my/main/directory/"
#lists feeding into function
county <- c("County1", "County2", "County3", "County4", "County5")
longsitenames <- c("Site Name 1", "Site Name 2", "Site Name 3")
shortsitenames <- c("SN1", "SN2", "SN3")
#creating function to read and merge all shapefiles
readshapefiles <- function(county){
map2(.x = longsitenames,
.y = shortsitenames,
.f = ~ rgdal::readOGR(
glue("{working}Shared_Data/Spatial Data/Site Data/{county} County/{.x}/Final Shapefile/{.y}.shp" )) %>%
sp::merge(x = .,
y = facilitycrosswalk %>%
mutate(Name = glue("{`Facility Site Name`} Service Area")),
by.x = "Name",
by.y = "Name",
all.x = T,
all.y = F) %>%
st_as_sf(as(., "Spatial")) %>%
filter(method=="censusblock areal interpolation")) %>%
setNames(c(sitenameslong))
}
#running function and formatting output
shp_all <- readshapefiles(county) %>%
reduce(., rbind.sf) %>%
st_as_sf(as(., "Spatial")) %>%
st_transform(st_crs(wgs84))
This gives the error "Error in (function (cond) :
error in evaluating the argument 'x' in selecting a method for function 'merge': length(dsn) == 1L is not TRUE"
When I attempt to run the function on one specific county it seems to work okay, which makes me think it's having issues iterating over the list of counties, but I'm not sure why that would be. Any help is greatly appreciated!
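One thing that can be checked without any spatial packages is how many paths the function builds. The sketch below uses base R's paste0() in place of glue(); both return one string per element when a length-5 vector is combined with length-1 values. The helper build_path is hypothetical and only assembles the string, so nothing is read from disk.

```r
working <- "C:/my/main/directory/"
county <- c("County1", "County2", "County3", "County4", "County5")

# Hypothetical helper that only builds the path
build_path <- function(county, long, short) {
  paste0(working, "Shared_Data/Spatial Data/Site Data/", county, " County/",
         long, "/Final Shapefile/", short, ".shp")
}

# Passing the whole county vector yields five paths for a single site
all_at_once <- build_path(county, "Site Name 1", "SN1")
length(all_at_once)

# Iterating over the counties yields one path per call
one_by_one <- lapply(county, build_path, long = "Site Name 1", short = "SN1")
lengths(one_by_one)
```

A path vector of length five would be consistent with the "length(dsn) == 1L is not TRUE" message, and with the function working when a single county is passed. Whether every county actually contains every site folder is a separate question this sketch does not address.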

Problems generating tree diagram with hctreemap2

library(highcharter)
library(dplyr)
library(viridisLite)
library(forecast)
library(treemap)
data("Groceries", package = "arules")
dfitems <- tbl_df(Groceries@itemInfo)
set.seed(10)
dfitemsg <- dfitems %>%
mutate(category = gsub(" ", "-", level1),
subcategory = gsub(" ", "-", level2)) %>%
group_by(category, subcategory) %>%
summarise(sales = n() ^ 3 ) %>%
ungroup() %>%
sample_n(31)
dfitemsg %>%
hctreemap2(group_vars = c("category","subcategory"),
size_var = "sales")%>%
hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
Pop: {point.value:,.0f}<br>
GNI: {point.colorValue:,.0f}")
The error is the following:
Error in hctreemap2(., group_vars = c("category", "subcategory"), size_var = "sales") : Treemap data uses same label at multiple levels.
I have tried everything I can think of and nothing works. Could someone with experience explain what is happening?
When I tried your code, it also stated that the function was deprecated and to use data_to_hierarchical. Although, it's never quite that simple, right? I tried multiple ways to get hctreemap2 to work, but wasn't able to discern that issue. From there I turned to the package recommended data_to_hierarchical. Now that worked without an issue--once I figured out the right type, which in hindsight seemed kind-of obvious.
That being said, this is what I've got:
data_to_hierarchical(data = dfitemsg,
group_vars = c(category,subcategory),
size_var = sales) %>%
hchart(type = "treemap") %>%
hc_tooltip(pointFormat = "<b>{point.name}</b>:<br>
Pop: {point.value:,.0f}<br>
GNI: {point.colorValue:,.0f}")
You didn't actually designate a color, so the GNI comes up blank.
Let me know if you run into any issues.
Based on your comment:
I have not found a way to change the color to density, which is what both hctreemap2 and treemap appear to do. The function data_to_hierarchical codes the colors to the first grouping variable, i.e. the level 1 variable.
Inadvertently, I did figure out why the function hctreemap2 would not work. It checks whether any category label is the same as a subcategory label. I didn't go through all of the data, but I know there is a perfumery/perfumery pair. I don't understand why that's a hard stop. If that is a problem for this call, why wouldn't data_to_hierarchical look for this issue as well?
So, I changed the function. First, I called the function itself.
x = hctreemap2
Then I selected it from the environment pane. Alternatively, you can code View(x).
This view is read-only, but it's easier to read than the console. I copied the function and assigned it to its original name with changes. I removed two pieces of the code, which changed nothing structurally speaking to how the chart is created.
I removed the first line of code in the function:
.Deprecated("data_to_hierarchical")
and this code (about a third of the way down)
if (data %>% select(!!!group_syms) %>% map(unique) %>% unlist() %>%
anyDuplicated()) {
stop("Treemap data uses same label at multiple levels.")
}
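To see what that check reacts to, here is a small base R sketch of the same idea: pool the unique labels of each grouping column and look for repeats. The toy data frame is hypothetical, and lapply() stands in for the select()/map() calls in the original.

```r
# Hypothetical toy data: "perfumery" is both a category and a subcategory
df <- data.frame(
  category    = c("perfumery", "drinks", "drinks"),
  subcategory = c("perfumery", "beer", "wine"),
  stringsAsFactors = FALSE
)
group_vars <- c("category", "subcategory")

# Unique labels per level, pooled into one vector
labels_by_level <- lapply(df[group_vars], unique)
pooled <- unlist(labels_by_level, use.names = FALSE)

# anyDuplicated() returns the index of the first repeat, or 0 if there is none
dup_index <- anyDuplicated(pooled)
pooled[dup_index]
```

Any non-zero result is treated as TRUE by if(), which is why a single shared label such as "perfumery" is enough to trigger the stop().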
This left me to recreate the function with this code:
hctreemap2 <- function (data, group_vars, size_var, color_var = NULL, ...)
{
assertthat::assert_that(is.data.frame(data))
assertthat::assert_that(is.character(group_vars))
assertthat::assert_that(is.character(size_var))
if (!is.null(color_var))
assertthat::assert_that(is.character(color_var))
group_syms <- rlang::syms(group_vars)
size_sym <- rlang::sym(size_var)
color_sym <- rlang::sym(ifelse(is.null(color_var), size_var, color_var))
data <- data %>% mutate_at(group_vars, as.character)
name_cell <- function(..., depth) paste0(list(...),
seq_len(depth),
collapse = "")
data_at_depth <- function(depth) {
data %>%
group_by(!!!group_syms) %>%
summarise(value = sum(!!size_sym), colorValue = sum(!!color_sym)) %>%
ungroup() %>%
mutate(name = !!group_syms[[depth]], level = depth) %>%
mutate_at(group_vars, as.character()) %>% {
if (depth == 1) {
mutate(., id = paste0(name, 1))
}
else {
mutate(.,
parent = pmap_chr(list(!!!group_syms[seq_len(depth) - 1]),
name_cell, depth = depth - 1),
id = paste0(parent, name, depth))
}
}
}
treemap_df <- seq_along(group_vars) %>% map(data_at_depth) %>% bind_rows()
data_list <- treemap_df %>% highcharter::list_parse() %>%
purrr::map(~.[!is.na(.)])
colorVals <- treemap_df %>%
filter(level == length(group_vars)) %>% pull(colorValue)
highchart() %>%
hc_add_series(data = data_list, type = "treemap",
allowDrillToNode = TRUE, ...) %>%
hc_colorAxis(min = min(colorVals), max = max(colorVals), enabled = TRUE)
}
Now your code, as originally written, will work. Doing this does not change the highcharter package itself, so if you think you'll use the function again, save its code as well. You will also need the purrr library. Since you already load dplyr (where most conflicts, if any, occur), you could just load tidyverse, which attaches several libraries at once, including both dplyr and purrr.
This is what it will look like with set.seed(10):
If you drill down on the largest block:
It looks odd to me, but I'm guessing that's what you were looking for to begin with.

Purrr package add species names to each output name in SSDM

I have a list of species and I am running an ensemble SDM modelling function on the dataset, filtering by each species, to give an ensemble SDM per species.
I have used purrr package to get it running, and the code works fine when there is no naming convention added in. However, when it outputs the Ensemble.SDM for each species, they are all named the same thing "ensemble.sdm", so when I want to stack them, I cannot as they are all named the same thing.
I would like to be able to name each output of the model something different, ideally linked to the species name picked out in the line: data <- Occ_full %>% filter(NAME == .x)
The working code is written below:
list_of_species <- unique(unlist(Occ_full$NAME))
# Return unique values
output <- purrr::map(limit_list_of_species, ~ {
data <- Occ_full %>% filter(NAME == .x)
ensemble_modelling(c('GAM'), data, Env_Vars,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1)
})
The code I tried in order to name each output is below, but it does not work: it names the output with many repetitions of the row number.
output <- purrr::map(limit_list_of_species, ~ {
data <- Occ_full %>% filter(NAME == .x)
label <- as.character(data)
ensemble_modelling(c('GAM'), data, Env_Vars,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1, name = label )
})
Could anyone help me please? I simply want each "output" to be named with the species name specified in the filter. Thank you
Try using split with imap -
list_of_species <- split(Occ_full, Occ_full$NAME)
output <- purrr::imap(list_of_species,~{
ensemble_modelling(c('GAM'), .x, Env_Vars,Xcol = 'LONGITUDE',
Ycol = 'LATITUDE', rep = 1, name = .y)
})
split returns a named list (one data frame per species, named by NAME), and imap passes each element's name to the function as .y.
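The naming behaviour can be checked without running SSDM at all. In this sketch, the occ data frame and fake_model() are hypothetical stand-ins for Occ_full and ensemble_modelling(); only split() and purrr::imap() are the real calls from the answer.

```r
library(purrr)

# Hypothetical occurrence data standing in for Occ_full
occ <- data.frame(
  NAME      = c("Species A", "Species B", "Species A"),
  LONGITUDE = c(1.0, 2.0, 3.0),
  LATITUDE  = c(10.0, 20.0, 30.0)
)

# Hypothetical stand-in for ensemble_modelling(): it just records its name
fake_model <- function(data, name) {
  list(name = name, n_records = nrow(data))
}

by_species <- split(occ, occ$NAME)   # named list, one data frame per species
output <- imap(by_species, ~ fake_model(.x, name = .y))

names(output)
output[["Species A"]]$n_records
```

The result is itself a named list, so individual models can be pulled out by species name before stacking.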

Can I write a function to revalue levels of a factor?

I have a column 'lg_with_children' in my data frame that has 6 levels: 'Half and half', 'Mandarin', 'Shanghainese', 'Other', 'N/A', and 'Not important'. I want to condense these down to just 2 levels, 'Shanghainese' and 'Other'.
In order to do this I used the revalue() function from the plyr package to successfully rename the levels. I used the code below and it worked fine.
data$lg_with_children <- revalue(data$lg_with_children,
c("Mandarin" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Half and half" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("N/A" = "Other"))
data$lg_with_children <- revalue(data$lg_with_children,
c("Not important" = "Other"))
To condense the code a little, I went back to the data as it was before I revalued the levels and attempted to write a function. I tried the following after doing research on how to write your own functions (I'm rather new at this).
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c("source" = "target"))}
I intentionally left the df, col, source, and target generic because I need to revalue some other columns in the same way.
Next, I tried to run the code filling in the args and get this message:
warning message
I am not quite sure what the problem is. I tried the following adjustment to code and still nothing.
revalue_factor_levels <- function(df, col, source, target) {df$col <- revalue(df$col, c(source = target))}
Any guidance is appreciated. Thanks.
You can write your function to recode the levels - the easiest way to do that is probably to change the levels directly with levels(fac) <- list(new_lvl1 = c(old_lvl1, old_lvl2), new_lvl2 = c(old_lvl3, old_lvl4))
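As a concrete illustration of the levels(fac) <- list(...) form, here is a short base R sketch on a made-up factor that uses the level names from the question. Note that any old level left out of the list would become NA, so "Other" is listed explicitly.

```r
# Hypothetical factor using the level names from the question
fac <- factor(c("Mandarin", "Shanghainese", "N/A", "Half and half",
                "Not important", "Other", "Shanghainese"))

# Each list name is a new level; its value lists the old levels it absorbs
levels(fac) <- list(
  Shanghainese = "Shanghainese",
  Other = c("Half and half", "Mandarin", "N/A", "Not important", "Other")
)

levels(fac)
table(fac)
```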
But there are already several functions that do it out of the box. I typically use the forcats package to manipulate factors.
Check out fct_recode from the forcats package.
There are also other functions that could help you - check out the comments below.
Now, as to why your code isn't working:
df$col looks for a column literally named col. The workaround is to do df[[col]] instead.
Don't forget to return df at the end of your function
c(source = target) will create a vector with one element named "source", regardless of what happens to be in the variable source.
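The last point is easy to verify at the console. This sketch compares the literal name produced by c(source = target) with base R's setNames(), which takes the name from the value of the variable.

```r
source <- "Mandarin"
target <- "Other"

literal <- c(source = target)        # the name is the literal text "source"
dynamic <- setNames(target, source)  # the name is the value held in `source`

names(literal)
names(dynamic)
```

Only the second form gives revalue() the mapping from "Mandarin" to "Other" that the function is meant to apply.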
The solution is to create the vector c(source = target) in 2 steps.
revalue_factor_levels <- function(df, col, source, target) {
to_rename <- target
names(to_rename) <- source
df[[col]] <- revalue(df[[col]], to_rename)
df
}
Returning the df means the syntax is:
data <- revalue_factor_levels(data, "lg_with_children", "Mandarin", "Other")
I like functions that take the data as the first argument and return the modified data because they are pipeable.
library(dplyr)
data <- data %>%
revalue_factor_levels("lg_with_children", "Mandarin", "Other") %>%
revalue_factor_levels("lg_with_children", "Half and half", "Other") %>%
revalue_factor_levels("lg_with_children", "N/A", "Other")
Still, using forcats is easier and less prone to breaking on edge cases.
Edit:
There is nothing preventing you from both using forcats and creating your custom function. For example, this is closer to what you want to achieve:
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- forcats::fct_other(df[[col]], keep = ref_level)
df
}
# Will keep Shanghainese and revalue other levels to "Other".
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")
Here is what I ended up with thanks to help from the community.
revalue_factor_levels <- function(df, col, ref_level) {
df[[col]] <- fct_other(df[[col]], keep = ref_level)
df
}
data <- revalue_factor_levels(data, "lg_with_children", "Shanghainese")

Which environment should be called when using eval( ) in a function?

I've got a set of functions that I'm trying to work with and I'm struggling to figure out why the assignment isn't working. Here are the functions I'm using:
new_timeline <- function() {
timeline = structure(list(), class="timeline")
timeline$title <- list("text" = list("headline" = NULL, "text" = NULL),
"start_date" = list("year" = NULL, "month" = NULL, "day" = NULL),
"end_date" = list("year" = NULL, "month" = NULL, "day" = NULL))
return(timeline)
}
.add_date <- function(self, date, time_type) {
valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
if (!valid_date) {
stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
}
date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>% parse(text = .) %>% eval()
if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>% parse(text = .) %>% eval()
if (!is.na(date[3])) stringr::str_interp("self$title$${time_type}_date$day <- date[3]") %>% parse(text = .) %>% eval()
return(self)
}
edit_title <- function(self, headline = NULL, text = NULL, start_date = NULL, end_date = NULL) {
if (class(self) != "timeline") stop("The object passed must be a timeline object.")
if (is.null(headline) && is.null(self$title$text$headline)) stop("Headline cannot be empty when adding a new title.")
if (!is.null(headline)) self$title$text$headline <- headline
if (!is.null(text)) self$title$text$text <- text
if (!is.null(start_date)) self <- .add_date(self, date = start_date, time_type = "start")
if (!is.null(end_date)) self <- .add_date(self, date = end_date, time_type = "end")
return(self)
}
EDIT: The above code has been severely reduced per a request in the comments. The code is still sufficient to reproduce the error.
I know that's a bit long-winded, so I apologize. The first function establishes a new timeline object. The third function allows us to change the title of the timeline object and the second function is a helper function that handles dates. The code would be used like this:
library(magrittr)
#devtools::install_github("hadley/stringr")
library(stringr)
tl <- new_timeline()
tl <- tl %>% edit_title(headline = "My Timeline", text = "Example", start_date = "2015-10-18")
The code runs with no errors, but when I call tl$title$start_date$year, it comes back as NULL. Using an answer I got in this previous question I asked, I tried to set envir = globalenv() within the eval function. When I do that, the function returns an error saying that object self cannot be found.
So I'm under the impression that self is held in the parent.frame(). So I add both of these to a list: envir = list(globalenv(), parent.frame()). This causes the function to run without error, but there's still no assignment.
Where am I going wrong? Thanks in advance!
As mentioned in the comments, I think you could probably do away with all of the code parsing and just pass variables in [[ for your assignments. Anyway, when you use the pipe operator a bunch of function wrapping happens so determining how many frames to go back is painful. Here are a couple solutions modifying the .add_date function.
You already found one, using <<-, since it searches back through the parent environments until it finds the variable (or doesn't, and assigns it in the global environment).
Another would be just storing the function environment() and passing that to eval.
A third would be counting how many frames deep you go, and using sys.frame to tell eval which environment to look in.
.add_date <- function(self, date, time_type) {
valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
if (!valid_date) {
stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
}
## Examining environments
e <- environment() # current env
efirst <- sys.nframe() # frame number
print(paste("Currently in frame", efirst))
envs <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.frames()} # list of frames
elast <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.nframe()} # number of last
print(paste("Went", elast, "frames deep."))
## Go back this many frames in eval
goback <- efirst-elast
date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
## Solution 1: use sys.frame
stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>%
parse(text = .) %>% eval(envir=sys.frame(goback))
## Solution 2: use environment defined in function
if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>%
parse(text = .) %>% eval(envir=e)
return(self)
}
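For comparison, the suggestion at the top of this answer (dropping parse/eval and indexing with [[ instead) could look roughly like the sketch below. It is a simplified variant, not the original helper: the name add_date, the shortened error message, and the plain-list stand-in for the timeline object are all assumptions made for illustration.

```r
# Sketch: build the slot name as a string and index with [[ ]]
add_date <- function(self, date, time_type) {
  date <- as.character(date)
  if (!grepl("^[0-9]{4}(-[0-9]{1,2}){0,2}$", date)) {
    stop("The ", time_type, " date must be of the form 'yyyy-mm-dd'.")
  }
  parts  <- strsplit(date, "-", fixed = TRUE)[[1]]
  slot   <- paste0(time_type, "_date")     # "start_date" or "end_date"
  fields <- c("year", "month", "day")
  # Only the components actually supplied are assigned
  for (i in seq_along(parts)) {
    self$title[[slot]][[fields[i]]] <- parts[i]
  }
  self
}

# Minimal stand-in for the timeline object from the question
tl <- list(title = list(start_date = list(), end_date = list()))
tl <- add_date(tl, "2015-10-18", "start")
tl$title$start_date$year
```

Because the assignment is ordinary R code inside the function, it always modifies the local self, and no environment has to be passed to eval().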
