Combine Data Wrangling workflow into a single function - r

I have a list of data frames on which I want to perform some data wrangling operations. For every year, I get a new list of data frames:
results_2018 <- list_of_objects %>%
map(~dplyr::top_n(.x, 10, Germany)) %>%
map(~rename(.x, "Answers" = "Answer.Options"))
results_2019 <- list_of_objects_2 %>%
map(~dplyr::top_n(.x, 10, Germany)) %>%
map(~rename(.x, "Answers" = "Data.Points"))
This is my code, where I calculate the top 10 values for each year for one country. Since there are 10 years of history, is there a way to combine these calculations into a single function?
I guess map2 and pmap might do the job, but I can't get my head around how this works.
Can anyone help me? (Sorry for not providing reproducible data; the datasets are quite large.)

You can try:
library(tidyverse)
list_obj <- list(list_of_objects, list_of_objects_2, ..., ..)
#If there are lot of them use
#list_obj <- mget(ls(pattern = 'list_of_objects\\d+'))
output <- map(list_obj, ~map(.x, function(x)
x %>% top_n(10, Germany) %>% rename("Answers" = "Answer.Options")))
This would return a list of lists; using map_df for the inner map might also be useful.
map(list_obj, ~map_df(.x, function(x)
x %>% top_n(10, Germany) %>% rename("Answers" = "Answer.Options")))
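Since the column being renamed differs by year ("Answer.Options" in 2018, "Data.Points" in 2019), map2() can pair each year's list with its old column name. A minimal sketch, assuming the old names are collected in a character vector in the same order as list_obj:
old_names <- c("Answer.Options", "Data.Points")  # one old column name per year; extend to match list_obj
results <- map2(list_obj, old_names, function(lst, old) {
  map(lst, ~ .x %>%
    top_n(10, Germany) %>%
    rename(Answers = all_of(old)))
})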

Related

Creating Groups based on Column Position

Good afternoon!
I think this is a pretty straightforward question, but I think I am missing a couple of steps. I would like to create groups based on column position.
I am working with a dataframe/tibble that is 33 rows long and 66 columns wide. However, every sequence of 6 columns should really be separated into its own sub-dataframe/tibble.
The sequence of the numbered columns is arbitrary to the dataframe. Below is an attempt with mtcars, where I am trying to group every 2 columns into its own sub-dataframe.
mtcars %>%
tibble() %>%
group_by(across(seq(1,2, length.out = 11))) %>%
nest()
However, that method generates errors. Something similar applies when working just within nest() as well.
Using mtcars, would like to create groups using a sequence for every 3 columns, or some other number.
Would ultimately like the mtcars dataframe to be...
Columns 1:3 to be group 1,
Columns 4:6 to be group 2,
Columns 7:9 to be group 3, etc... while retaining the information for the rows in each column.
Also considered something with pivot_longer...
mtcars %>%
tibble() %>%
pivot_longer(cols = seq(1,3, by = 1))
...but that did not generate defined groups, or continue the sequencing along all columns of the dataframe.
Hope one of you can help me with this! Would make certain tasks for work much easier.
PS - A plus if you can keep the workflow to tidyverse centric code :)
You could try this. It splits the dataframe into a list of dataframes based on the number of columns you want (3 in your example):
library(tidyverse)
list_of_dataframes <- mtcars %>%
tibble() %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
group_by(row) %>%
mutate(group = ceiling(row_number()/ 3)) %>%
ungroup() %>%
group_split(group) %>%
map(
~select(.x, row, name, value) %>%
pivot_wider()
)
EDIT
Here, based on comments from the question asker, we will avoid pivoting the data. Instead, we map the groups across the dataframe.
list_of_dataframes <- map(seq(1, ncol(mtcars), by = 3),
~mtcars %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+2, ncol(mtcars))))))
We can then wrap this in a function to make it a little easier to use and change group sizes and dataframes:
group_split_cols <- function(.data, ncols_per_group){
map(seq(1, ncol(.data), by = ncols_per_group),
~.data %>%
as_tibble() %>%
select(all_of(.x:min(c(.x+ncols_per_group-1, ncol(.data))))))
}
list_of_dataframes <- group_split_cols(.data = mtcars, ncols_per_group = 3)
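Applied to mtcars, which has 11 columns, a group size of 3 should give four tibbles of 3, 3, 3, and 2 columns respectively. A quick check:
length(list_of_dataframes)          # 4
map_int(list_of_dataframes, ncol)   # 3 3 3 2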

Dataframes within list within list - how to access each dataframe in R

I'm pretty new to R, so please bear with me.
I read in an Excel spreadsheet containing 31 years of MLB prospect data on separate sheets using the following code:
path = "../Documents/BA Prospects 1990-2021.xlsx"
prospect_data <- excel_sheets(path = path) %>%
map(~ data.frame(read_excel(path, sheet = .)))
I then wrote a function to clean up the data a bit, and applied it to all 31 elements:
pull_df <- function(data, n = 1, year = 1990) {
prospect_data[[n]] %>%
data.frame() %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
prospect_data <- lapply(prospect_data[1:31], pull_df)
What I want is to take the 31 dataframes and save each one globally. However, those dataframes are each nested within a separate list. Those 31 lists are nested within the list prospect_data. No matter what I try, with for loops and all, I can't extract those dataframes from the prospect_data list to be able to manipulate them further. Honestly, I'd settle for one big data frame with 3100 rows and 17 columns at this point. I just want to get my data into a dataframe.
I know I did a poor job of explaining it, but please help!
It is not good practice to create individual dataframes in the global environment. Using lists is a better idea, or, if you prefer, combine them into one big dataframe. Also, length(1990:2021) gives 32 values, so I have adjusted the answer below to use 1990:2020, assuming you have 31 dataframes in total in prospect_data.
library(dplyr)
library(purrr)
pull_df <- function(data, year) {
data %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
combine_df <- map2_df(prospect_data, 1990:2020, pull_df)
If I understand your problem correctly, prospect_data is a list of one-element lists, where the single element in each component list is a dataframe.
If this is the case, you can "flatten" prospect_data with:
flattened_prospect_data <- unlist(prospect_data, recursive = FALSE)
However, if prospect_data is a list of arbitrary-length lists, where the dataframe of interest is at index j, you can do the following.
flattened_prospect_data <- lapply(prospect_data, function(x) x[[j]])
If each dataframe is nested even deeper, say each one is the kth element of the jth element of the ith element of each component of prospect_data, then you can perform recursive extraction with
flattened_prospect_data <- lapply(prospect_data, function(x) x[[c(i, j, k)]])
Once you have all the dataframes in a list, you can assign each one to a name based on its year using
get_year <- function(df) unique(df$year)
create_name <- function(x) paste(x, "Prospects") # create_name(1990) returns "1990 Prospects"
df_names <- sapply(flattened_prospect_data, function(x) create_name(get_year(x)))
Map(function(x, y) assign(x,value = y, envir = .GlobalEnv), df_names, flattened_prospect_data)
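As an alternative to assign(), list2env() does the same job in one call; and if you would rather have the single big data frame you mentioned, bind_rows() will stack the list (assuming the cleaned data frames share the same columns):
list2env(setNames(flattened_prospect_data, df_names), envir = .GlobalEnv)
# or combine everything into one data frame instead of 31 global objects
all_prospects <- dplyr::bind_rows(flattened_prospect_data)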

Lagged values multiple columns with function in R

I would like to create lagged values for multiple columns in R.
First, I used a function to create lead/lag like this:
mleadlag <- function(x, n, ts_id) {
pos <- match(as.numeric(ts_id) + n, as.numeric(ts_id))
x[pos]
}
Second, I would like to apply this function to several columns in R. firm.characteristics is the list of columns for which I would like to compute lagged values.
library(dplyr)
firm.characteristics <- colnames(df)[4:6]
for(i in 1:length(firm.characteristics)){
df <- df %>%
group_by(company) %>%
mutate(!!paste0("lag_", i) := mleadlag(df[[i]] ,-1, fye)) %>%
ungroup()
}
However, I didn't get the correct values. The output for all companies in year t is the last row in year t-1. It didn't group by company and compute the lagged values.
Can anyone help me find what is wrong in the loop? Or what should I do to get the correct lagged values?
Thank you so much for your help.
A reproducible sample could look like this:
set.seed(42) ## for sake of reproducibility
n <- 6
dat <- data.frame(company=1:n,
fye=2009,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat2 <- data.frame(company=1:n,
fye=2010,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
dat3 <- data.frame(company=1:n,
fye=2011,
x=rnorm(n),
y=rnorm(n),
z=rnorm(n),
k=rnorm(n),
m=rnorm(n))
df <- rbind(dat,dat2,dat3)
I would try to stay away from loops in the tidyverse. Many of the operations that would traditionally require loops already exist in the tidyverse and are very fast, which makes for more efficient and intuitive code (the latter being my opinion). This is a great use case for dplyr's across() functionality. I first changed the df to a tibble.
df %>%
as_tibble() %>%
group_by(company) %>%
mutate(
across(all_of(firm.characteristics), ~lag(., 1L))
) %>%
ungroup()
This generates the required lagged values. For more information see dplyr's across documentation.
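If you want to keep the original columns and add new lag_-prefixed ones, as the names in the original loop suggest, across() also takes a .names specification. A sketch, assuming rows should be ordered by fye within each company before lagging:
df %>%
  as_tibble() %>%
  arrange(company, fye) %>%
  group_by(company) %>%
  mutate(
    across(all_of(firm.characteristics), ~lag(.x, 1L), .names = "lag_{.col}")
  ) %>%
  ungroup()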

A better way to split apply and combine in R using sp::merge() as function

I have a dataframe with 500 observations for each of 3106 US counties. I would like to merge that dataframe with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets, and I presume I can then rbind them back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
gm_y_corr <- tribble(~GEOID,~iter_id,~neat_variable,
01001,1,"value_1",
01003,1,"value_2",
...
01001,2,"value_3",
01003,2,"value_4",
...
01001,500,"value_5",
01003,500,"value_6")
filtered <- gm_y_corr %>%
filter(iter_id ==1)
us.gm <- sp::merge(us, filtered ,by='GEOID')
for (j in 2:500) {
tmp2 <- gm_y_corr %>%
filter(iter_id == j)
tmp3 <- sp::merge(us, tmp2,by='GEOID')
us.gm <- rbind(us.gm,tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must not be understanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R via split, or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable; you can then lapply or purrr::map a function such as merge over the pieces, and dplyr::bind_rows will collapse the returned list back into a dataframe. Since I can't manage to get the us data, I have just written what I suspect would work.
gm_y_corr %>%
group_by(iter_id) %>% # group
group_split() %>% # split
lapply(., function(x){ # apply function(x) merge(us, x, by = "GEOID") to each list element
merge(us, x, by = "GEOID")
}) %>%
bind_rows() # collapse to data frame
Equivalently, this is the same as using base R functionality. The newer group_by %>% group_split is a little more intuitive, in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to just use group_by, you would have to follow up with the dplyr::do function, which I believe does a similar thing to what I have done above, but without you having to split the data yourself.
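For reference, a minimal sketch of that group_by %>% do() route: because sp::merge() returns a Spatial object rather than a data frame, the result has to go into a named list-column, which you can then combine as shown above.
merged_by_iter <- gm_y_corr %>%
  group_by(iter_id) %>%
  do(merged = sp::merge(us, ., by = "GEOID"))  # one merged Spatial object per iter_id, stored in a list-column
# merged_by_iter$merged is then a plain list of merged Spatial objects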

XML2-Package: How to treat empty Nodes?

I am trying to extract some data from an HTML site. I got 500 nodes which should contain a date, a title and a summary. By using
url <- "https://www.bild.de/suche.bild.html?type=article&query=Migration&resultsPerPage=1000"
html_raw <- xml2::read_html(url)
main_node <- xml_find_all(html_raw, "//section[#class='query']/ol") %>%
xml_children()
xml_find_all(main_node, ".//time") #time
xml_find_all(main_node, ".//span[#class='headline']") #title
xml_find_all(main_node, ".//p[#class='entry-content']") #summary
it returns three vectors with dates, titles and summaries, which can then be knitted together. At least in theory. Unfortunately my code finds 500 dates, 500 titles but only 499 summaries. The reason for this is that one of the summary nodes is simply missing.
This leaves me with the problem that I cannot bind these into a data frame because of the difference in length. The summaries wouldn't match the exact dates and titles.
An easy solution would be to loop through the nodes and replace the empty node with a placeholder like "NA".
dates <- c()
titles <- c()
summaries <- c()
for(i in 1:length(main_node)){
date_temp <- xml_find_all(main_node[i], ".//time") %>%
xml_text(trim = TRUE) %>%
as.Date(format = "%d.%m.%Y")
title_temp <- xml_find_all(main_node[i], ".//span[@class='headline']") %>%
xml_text(trim = TRUE)
summary_temp <- xml_find_all(main_node[i], ".//p[@class='entry-content']") %>%
xml_text(trim = TRUE)
if(length(summary_temp) == 0) summary_temp <- "NA"
dates <- c(dates, date_temp)
titles <- c(titles, title_temp)
summaries <- c(summaries, summary_temp)
}
But this makes a simple three-line piece of code unnecessarily long. So my question, I guess, is: Is there a more sophisticated approach than a loop?
You could use the purrr library to help avoid the explicit loop:
library(purrr)
dates <- main_node %>% map_chr(. %>% xml_find_first(".//time") %>% xml_text())
titles <- main_node %>% map_chr(. %>% xml_find_first(".//span[@class='headline']") %>% xml_text())
summaries <- main_node %>% map_chr(. %>% xml_find_first(".//p[@class='entry-content']") %>% xml_text())
This uses the fact that xml_find_first will return NA if an element is not found, as pointed out by @Dave2e.
But also, in general, growing a vector by appending to it on each iteration of a loop is very inefficient in R. It's better to pre-allocate the vector (since it will be of a known length) and then assign each iteration's value to the proper slot (out[i] <- val). There's not really anything wrong with loops themselves in R; it's really just memory reallocation that can slow things down.
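A minimal sketch of that pre-allocation pattern, using the summaries as an example (the "NA" placeholder mirrors the original loop):
summaries <- character(length(main_node))  # pre-allocate to the known length
for (i in seq_along(main_node)) {
  s <- xml_text(xml_find_first(main_node[[i]], ".//p[@class='entry-content']"), trim = TRUE)
  summaries[i] <- if (is.na(s)) "NA" else s  # xml_find_first yields a missing node here, so xml_text gives NA
}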
