Saving output of lapply to respective data frames - r

I am pretty new to R. This seems like a simple question, but I just don't know the best way to approach it. I have checked similar questions but have not found the answer I am looking for.
I have a list of data frames (actually tibbles) that I want to run through the convert() function from the hablar package to convert the data type of each variable in the data frames. I then want to overwrite the original data frames. Here is a simplified example data frame (N.B. all of the variables are currently factors). For simplicity I have made adm2 and adm3 the same as adm1, but they are different in my real data.
adm1 <- data.frame(admV1 = as.factor(c("male", "female", "male", "female")),
admV2 = as.factor(c("12.2", "13.0", "14.0", "15.1")),
admV3 = as.factor(c("free text", "more free text", "even more free text", "free text again")),
admV4 = as.factor(c("2019-01-01T12:00:00", "2019-01-01T12:00:00", "2019-01-01T12:00:00", "2019-01-01T12:00:00")))
adm1 <- as_tibble(adm1)
adm2 <- adm1
adm3 <- adm1
dis1 <- data.frame(disV1 = as.factor(c("yes", "no", "yes", "no")),
disV2 = as.factor(c("12.2", "13.0", "14.0", "15.1")),
disV3 = as.factor(c("free text", "more free text", "even more free text", "free text again")),
disV4 = as.factor(c("2019-01-01+T12:00:00", "2019-01-01+T12:00:00", "2019-01-01+T12:00:00", "2019-01-01+T12:00:00")))
dis1 <- as_tibble(dis1)
dis2 <- dis1
dis3 <- dis1
I have two 'types' of data frames: admissions and discharges. I defined the variables that need to be converted to each data type (N.B. In my real example each is a character vector containing more than one variable name):
# Define data types (column names stored as character vectors)
adm_chr <- "admV3"
adm_num <- "admV2"
adm_fct <- "admV1"
adm_dte <- "admV4"
dis_chr <- "disV3"
dis_num <- "disV2"
dis_fct <- "disV1"
dis_dte <- "disV4"
I have then created a list of the datasets:
# Define datasets
adm_dfs <- list(adm1, adm2, adm3)
dis_dfs<- list(dis1, dis2, dis3)
This is what I have managed so far:
# Write function
convertDataTypes<- function(dfs, type = c("adm", "dis")){
outputs1<- dfs %>% lapply(convert(chr(paste0(type, "_chr")),
num(paste0(type, "_num")),
fct(paste0(type, "_fct"))))
outputs2<- dfs %>% mutate_at(vars(paste0(type, "_dte")),
ymd_hms, tz = "GMT")
}
# Run function
convertDataTypes(adm_dfs, "adm")
I think I need to then use lapply over outputs1 and outputs2 to assign the variables, but there is probably a much better way of approaching this. I would be very grateful for your input.

If the 'dfs' are a list of data.frames, then
library(hablar)
library(purrr)
library(dplyr)
library(stringr)   # for str_c()
library(lubridate) # for ymd_hms()
If the 'type' corresponds to each data.frame in the list use map2
convertDataTypes <- function(dfs, type = c("adm", "dis")) {
map2(dfs, type, ~ {
.type <- .y
map(.x, ~ .x %>%
convert(chr(str_c(.type, "_chr")),
num(str_c(.type, "_num")),
fct(str_c(.type, "_fct"))) %>%
mutate_at(vars(str_c(.type, "_dte")),
ymd_hms, tz = "GMT"))
})
}
dfsN <- list(adm_dfs, dis_dfs)
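As a minimal sketch of the "overwrite the original data frames" part, one option is to keep the data frames in a named list and replace the list with its converted version. The sketch below swaps in base R conversions instead of hablar::convert(), and the lookup list type_cols and the helper convert_types() are hypothetical names introduced only for illustration.

```r
# Toy data: every column starts as a factor, as in the question
adm1 <- data.frame(
  admV1 = c("male", "female"),
  admV2 = c("12.2", "13.0"),
  admV3 = c("free text", "more free text"),
  stringsAsFactors = TRUE
)
adm_dfs <- list(adm1 = adm1, adm2 = adm1)  # named list (assumption)

# Hypothetical lookup of column names per target type
type_cols <- list(
  adm = list(chr = "admV3", num = "admV2", fct = "admV1")
)

# Hypothetical helper: convert the listed columns of one data frame
convert_types <- function(df, cols) {
  df[cols$chr] <- lapply(df[cols$chr], as.character)
  # go through character first: as.numeric() on a factor returns level codes
  df[cols$num] <- lapply(df[cols$num], function(x) as.numeric(as.character(x)))
  df[cols$fct] <- lapply(df[cols$fct], as.factor)
  df
}

# Overwrite the list with the converted data frames; names are preserved
adm_dfs <- lapply(adm_dfs, convert_types, cols = type_cols$adm)
str(adm_dfs$adm1)
```

If separate objects are really needed afterwards, list2env(adm_dfs, envir = globalenv()) would copy each named element back into the global environment, although working with the list directly is usually simpler.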

Related

Error when doing Panel VAR (panelvar package) in R

I tried to run a panel var on dataset I got from Statistics Sweden and here is what I get:
df<- read_excel("Inkfördelning per kommun.xlsx")
nujavlar <- pvarfeols(dependent_vars = c("Kvintil-1", "Kvintil-4", "Kvintil-5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year")
)
Error: Can't subset columns that don't exist.
x Column `Kvintil-1` doesn't exist.
I often get this message too:
Warning in xtfrm.data.frame(x) : cannot xtfrm data frames
Error: Can't subset columns that don't exist.
x Location 2 doesn't exist.
ℹ There are only 1 column.
I have made sure that all data is numeric. I have also tried cleaning my workspace and restarting the programme. I also tried to convert it into a panel data frame with the plm package. I also tried converting my entity variable "Kommun" (Municipality) into a factor and it still doesn't work.
Here's the data if someone wants to give it a go.
https://docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC/edit?usp=sharing&ouid=113164216369677216623&rtpof=true&sd=true
The column names in your data frame are "Kvintil 1" (with a space), not "Kvintil-1", so the column you are referring to really does not exist. Please be aware that in R, syntactic variable names cannot contain hyphens or spaces; such names only work when wrapped in backticks, so it is good practice to avoid them. I have included a reproducible example below.
library(tidyverse)
library(gsheet)
library(panelvar)
url <- 'docs.google.com/spreadsheets/d/16Ak_Z2n6my-5wEw69G29_NLryQKcrYZC'
df <- gsheet2tbl(url) %>%
rename(Kvintil1 = `Kvintil 1`) %>%
rename(Kvintil2 = `Kvintil 2`) %>%
rename(Kvintil3 = `Kvintil 3`) %>%
rename(Kvintil4 = `Kvintil 4`) %>%
rename(Kvintil5 = `Kvintil 5`) %>%
as.data.frame()
nujavlar <- pvarfeols(
dependent_vars = c("Kvintil1", "Kvintil4", "Kvintil5"),
lags = 1,
transformation = "demean",
data = df,
panel_identifier = c("Kommun", "Year"))
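A small base R sketch of the same idea, for checking which requested columns are missing and then stripping the spaces from all column names at once. The toy data frame and the variable wanted are made up for illustration; they are not the real Statistics Sweden data.

```r
# Toy data frame with non-syntactic column names (hypothetical values)
df <- data.frame(
  Kommun = c("A", "B"),
  Year = c(2019, 2019),
  `Kvintil 1` = c(0.1, 0.2),
  `Kvintil 4` = c(0.3, 0.4),
  check.names = FALSE
)

# Which of the requested columns do not exist under that exact name?
wanted <- c("Kvintil-1", "Kvintil-4")
missing_cols <- setdiff(wanted, names(df))
print(missing_cols)

# Remove the spaces so the names become syntactic
names(df) <- gsub(" ", "", names(df), fixed = TRUE)
print(names(df))
```

Checking names(df) before calling the modelling function makes this kind of mismatch visible immediately.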

Converting string to data masking with dplyr inside function

I'm trying to make a function which restructures some data. I got this function to work and it looked something like this:
function_1 <- function(df, group, focal, reference){
if (reference == "rest"){
df <- df %>% dplyr::mutate({{group}} := recode({{group}}, {{focal}} := {{focal}}, .default = "rest"))
df <- df %>% dplyr::select(ac_id, question_id, question_result_score, {{group}})
df <- df[!(duplicated(dplyr::select(df, ac_id, question_id))), ]
df <- df %>% dplyr::arrange(ac_id)
}
else{
df <- dplyr::filter(df, {{group}} == {{focal}} | {{group}} == {{reference}})
df <- df %>% dplyr::select(ac_id, question_id, question_result_score, {{group}})
df <- df[!(duplicated(dplyr::select(df, ac_id, question_id))), ]
df <- df %>% dplyr::arrange(ac_id)
}
return(df)
}
# and I run the following command:
function_1(mydata, gender, "male", "rest")
This works exactly as I want it to. Now this needs to go inside another function (let's call this function_2), where I loop over different demographic characteristics (age, gender, english-native, etc.) and demographic indicators (e.g. "male" (from gender), "female" (from gender), etc.).
Inside function_2 we loop over the output of another function, which returns a dataframe with the following structure:
group    focal   reference
gender   female  male
gender   female  rest
gender   male    rest
english  native  non-native
...      ...     ...
The problem when looping over this output is (I THINK) that the input of function_1 becomes:
function_1(mydata, "gender", "female", "male")
#instead of
function_1(mydata, gender, "female", "male")
So without the quotation marks. Does anybody know a way how to fix function_1 such that it works with input as shown above?
Any help would be greatly appreciated, and if any other information is needed, let me know!
KR
P.S.
Maybe the following helps. To generate the table as shown above, we use a function which I stored in a variable called viable_cat and this output has the following properties:
> typeof(viable_cat)
[1] "character"
> class(viable_cat)
[1] "matrix" "array"
I recommend !!sym(.) for turning strings into variable names. For example:
library(dplyr)
data(mtcars)
in_var = "mpg"
out_var = "mpg2"
new = mtcars %>%
mutate(!!sym(out_var) := 2 * !!sym(in_var))
You can pass strings between multiple functions with ease.
I know this technique is no longer the one recommended in the "Programming with dplyr" vignette; it is an older approach. I find it more applicable to my use cases than some of the options currently recommended.
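For comparison, here is a hedged sketch of the .data pronoun, which also accepts column names as strings. It only covers the filtering branch of function_1; the helper keep_groups(), the toy data frame and the example matrix are hypothetical stand-ins, not the asker's real objects.

```r
library(dplyr)

# Toy stand-in for the real data (hypothetical columns)
toy <- data.frame(
  ac_id  = 1:4,
  gender = c("male", "female", "other", "female"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: `group` is a string, looked up via the .data pronoun
keep_groups <- function(df, group, focal, reference) {
  df %>% filter(.data[[group]] %in% c(focal, reference))
}

res <- keep_groups(toy, "gender", "female", "male")

# Looping over a character matrix shaped like viable_cat
viable_cat <- matrix(
  c("gender", "female", "male",
    "gender", "female", "other"),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("group", "focal", "reference"))
)
results <- lapply(seq_len(nrow(viable_cat)), function(i) {
  keep_groups(toy,
              viable_cat[i, "group"],
              viable_cat[i, "focal"],
              viable_cat[i, "reference"])
})
```

Because every argument is an ordinary string, the rows of the matrix can be passed straight through without any quoting or unquoting.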

Purrr package add species names to each output name in SSDM

I have a list of species and I am running an ensemble SDM modelling function on the dataset, filtering by each species, to give an ensemble SDM per species from the dataset.
I have used purrr package to get it running, and the code works fine when there is no naming convention added in. However, when it outputs the Ensemble.SDM for each species, they are all named the same thing "ensemble.sdm", so when I want to stack them, I cannot as they are all named the same thing.
I would like to be able to name each output of the model something different, ideally linked to the species name picked out in the line: data <- Occ_full %>% filter(NAME == .x)
The working code is written below:
list_of_species <- unique(unlist(Occ_full$NAME))
# Return unique values
output <- purrr::map(limit_list_of_species, ~ {
data <- Occ_full %>% filter(NAME == .x)
ensemble_modelling(c('GAM'), data, Env_Vars,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1)
})
The code I tried in order to name each output is below, but it does not work: it names each one with lots of repetitions of the row number.
output <- purrr::map(limit_list_of_species, ~ {
data <- Occ_full %>% filter(NAME == .x)
label <- as.character(data)
ensemble_modelling(c('GAM'), data, Env_Vars,
Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1, name = label )
})
Could anyone help me please? I simply want each "output" to be named with the species name specified in the filter. Thank you
Try using split with imap -
list_of_species <- split(Occ_full, Occ_full$NAME)
output <- purrr::imap(list_of_species,~{
ensemble_modelling(c('GAM'), .x, Env_Vars,Xcol = 'LONGITUDE',
Ycol = 'LATITUDE', rep = 1, name = .y)
})
split would ensure that the list_of_species is named which can be used in imap.
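To see the mechanism without running a full model, here is a toy sketch in which a stub function stands in for ensemble_modelling(). The stub fit_stub() and the example data are hypothetical; only split() and purrr::imap() are the point.

```r
library(purrr)

# Toy occurrence data (hypothetical values)
occ <- data.frame(
  NAME = c("sp_a", "sp_b", "sp_a"),
  LONGITUDE = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# split() returns a list named by the values of NAME
by_species <- split(occ, occ$NAME)

# Stub standing in for the real modelling call
fit_stub <- function(data, name) list(name = name, n = nrow(data))

# In imap(), .x is the list element and .y is its name
output <- imap(by_species, ~ fit_stub(.x, name = .y))
names(output)
```

The result keeps the species names as list names, so each element can be retrieved with output$sp_a and so on.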

How to convert a dataframe in long format into a list of an appropriate format?

I have a dataframe in the following long format:
I need to convert it into a list which should look something like this:
Wherein, each of the main element of the list would be the "Instance No." and its sub-elements should contain all its corresponding Parameter & Value pairs - in the format of "Parameter X" = "abc" as you can see in the second picture, listed one after the other.
Is there any existing function which can do this? I wasn't really able to find any. Any help would be really appreciated.
Thank you.
A dplyr solution
require(dplyr)
df_original <- data.frame("Instance No." = c(3,3,3,3,5,5,5,2,2,2,2),
"Parameter" = c("age", "workclass", "education", "occupation",
"age", "workclass", "education",
"age", "workclass", "education", "income"),
"Value" = c("Senior", "Private", "HS-grad", "Sales",
"Middle-aged", "Gov", "Hs-grad",
"Middle-aged", "Private", "Masters", "Large"),
check.names = FALSE)
# the split function requires a factor to use as the grouping variable.
# Param_Value will be the properly formatted vector
df_modified <- mutate(df_original,
Param_Value = paste0(Parameter, "=", Value))
# drop the parameter and value columns now that the data is contained in Param_Value
df_modified <- select(df_modified,
`Instance No.`,
Param_Value)
# there is now a list containing dataframes with rows grouped by Instance No.
list_format <- split(df_modified,
df_modified$`Instance No.`)
# The Instance No. is still in each dataframe. Loop through each and strip the column.
list_simplified <- lapply(list_format,
select, -`Instance No.`)
# unlist the remaining Param_Value column and drop the names.
list_out <- lapply(list_simplified ,
unlist, use.names = F)
There should now be a list of vectors formatted as requested.
$`2`
[1] "age=Middle-aged" "workclass=Private" "education=Masters" "income=Large"
$`3`
[1] "age=Senior" "workclass=Private" "education=HS-grad" "occupation=Sales"
$`5`
[1] "age=Middle-aged" "workclass=Gov" "education=Hs-grad"
The posted data.table solution is faster, but I think this is a bit more understandable.
require(data.table)
your_dt <- data.table(your_df)
dt_long <- melt.data.table(your_dt, id.vars='Instance No.')
class(dt_long) # for debugging
dt_long[, strVal:=paste(variable,value, sep = '=')]
result_list <- list()
for (i in unique(dt_long[['Instance No.']])){
result_list[[as.character(i)]] <- dt_long[`Instance No.`==i, strVal]
}
Just for reference, here is a base R one-liner to do this, where df is your data frame.
l <- lapply(split(df, df[["Instance No."]]),
function(x) paste0(x$Parameter, "=", x$Value))
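If named elements are preferred over "parameter=value" strings, a possible variation is to build one named character vector per instance with setNames(). This is a sketch on made-up rows, not the asker's data.

```r
# Toy long-format data (hypothetical rows)
df <- data.frame(
  `Instance No.` = c(3, 3, 5),
  Parameter = c("age", "workclass", "age"),
  Value = c("Senior", "Private", "Middle-aged"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# One data frame per instance, named by the instance number
by_instance <- split(df, df[["Instance No."]])

# One named character vector per instance: names are the parameters
as_named <- lapply(by_instance, function(x) setNames(x$Value, x$Parameter))
as_named[["3"]]
```

Individual values can then be looked up by name, for example as_named[["3"]][["age"]].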

Change column values depending on other column in R

I have a problem with my data frame.
I have a dataframe with 2 columns, 'word' and 'word_categories'. I created different variables which include the different words, e.g. 'noun' which includes all the nouns of the word column. I now want to change the labels in the word_categories column to the corresponding variable. So if the word in the word column is included in the object 'noun', I want the word_categories column to display 'noun'.
df <- read.csv("palm.csv")
noun <- c("house", ...)
adj <- c("hard", ...)
...
The data frame looks like the following. It includes other columns but they are fine.
word word_categories
house
car
hard
...
I now want to look, if the words are in any of the created objects and if so, I want the corresponding label printed in the word_categories column. So for 'house' the column should show noun, for 'hard' it should show adjective. If the word is in none of the objects, it should show nothing or 'NA'.
I tried it with the following:
palm$word_categories <- ifelse(palm$word == noun, "noun",
ifelse(palm$word == adj, "adjective", "")))
This, however, doesn't work at all and I have 7 Objects in total so the statement becomes ridiculously long. How do I do it properly?
If the dataframe is called palm (you first call it df but later you use palm) and noun and adj are vectors as you define above, I would do:
library(dplyr)
palm <- palm %>%
mutate(word_categories = case_when(word %in% noun ~ "noun",
word %in% adj ~ "adjective",
TRUE ~ NA_character_))
One way would be to create a named vector of your noun/adjective dictionaries to select each element. The name would be the word and the corresponding data would be noun, adjective etc. You didn't really supply any data so I made some up.
df <- data.frame(
stringsAsFactors = FALSE,
word = c("dog", "short", "bird", "cat", "short", "man")
)
nounName <- c('dog', 'cat', 'bird')
adjName <- c('quick', 'brown', 'short')
noun <- rep('noun', length(nounName))
adj <- rep('adjective', length(adjName))
names(noun) <- nounName
names(adj) <- adjName
partsofspeech <- c(noun, adj)
df$word_categories <- partsofspeech[df$word]
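With seven word lists, it may be easier to keep them in one named list and build the lookup from that. The following base R sketch assumes a hypothetical list called categories and a toy palm data frame; stack() and match() do the work.

```r
# Hypothetical word lists, one element per category
categories <- list(
  noun = c("house", "car"),
  adjective = c("hard")
)

# stack() turns the named list into a two-column lookup table:
# `values` holds the words and `ind` holds the category label
lookup <- stack(categories)

# Toy data frame standing in for palm
palm <- data.frame(word = c("house", "hard", "run"), stringsAsFactors = FALSE)

# match() finds each word's row in the lookup; unmatched words give NA
palm$word_categories <- as.character(lookup$ind[match(palm$word, lookup$values)])
palm
```

Adding another category only requires adding one more element to the list, with no further nesting of conditions.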
