I am trying to develop a function that will take data, see if it matches a value in a category (e.g., 'Accident'), and if so, create a new dataframe using the following code.
cat.df <- function(i) {
sdb.i <- sdb %>%
filter(Category == i) %>%
group_by(Year) %>%
summarise(count = n()}
The name of the dataframe should be sdb.i, where i is the name of the category (e.g., 'Accident'). Unfortunately, I cannot get it to work. I'm notoriously bad with functions and would love some help.
It's not entirely clear what you are after, so I am making a guess.
First of all, your function cat.df is missing a closing parenthesis on the summarise() line, so it would not run.
I think it is good practice to pass all objects as parameters to a function. In my example I use the iris dataset, so I pass it to the function explicitly.
You cannot change the name of a data frame in the way you describe. I offer two alternatives: if the number of categories is small, you can simply create a separate name for each result object; if you have many categories, it is best to combine all the result objects into a list.
library(dplyr)
data(iris)
cat.df <- function(data, i) {
data <- data %>%
filter(Species == i) %>%
group_by(Petal.Width) %>%
summarise(count = n())
return(data)
}
result.setosa <- cat.df(iris, "setosa") # unique name
Species <- sort(unique(iris$Species))
results_list <- lapply(Species, function(x) cat.df(iris, x)) # combine all df's into a list
names(results_list) <- Species # name the list elements
You can then get the list elements as e.g. results_list$setosa or results_list[[1]].
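If you later want one combined data frame instead of a list, a small optional follow-up (this relies on the list names set above):
library(dplyr)
combined <- bind_rows(results_list, .id = "Species") # one data frame, with a Species column taken from the list names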
Hello, I have created a for loop to split my data with respect to a certain column, like so:
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
R doesn't like this, as it says object 'team_' not found. The code does run if I include a list, like so:
team_ <- list()
for(i in 1:(nrow(df)))
{
team_[[i]] <- df %>% filter(team == i)
}
This works... However, I am given a list with thousands of empty items and just a few that contain my filtered data sets.
Is there a simpler way to create the data sets without this list approach?
Thank you.
A simpler option is split from base R, which would be faster than using == to subset in a loop:
team_ <- split(df, df$team)
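split also names the list elements by the values of team, so individual teams are easy to pull out. A quick illustration (assuming a team id of 3 exists in df):
team_[["3"]] # rows for team 3; split stores the names as characters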
If we want to do some operations for each row, in the tidyverse this can be done with rowwise:
library(dplyr)
df %>%
rowwise %>%
... step of operations ...
or with group_by
df %>%
group_by(team) %>%
...
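For instance, a concrete version of the group_by pattern could look like this (the count is just an illustrative operation, not from the original answer):
library(dplyr)
df %>%
  group_by(team) %>% # one group per team
  summarise(n_rows = n()) # example per-team summary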
The methods akrun suggests are much better than a loop, but you should understand why this isn't working. Remember for(i in 1:nrow(df)) will give you one list item for i = 1, i = 2, and so on, right up until i = nrow(df), which is several thousand by the sounds of things. If you don't have any rows where team is 1, you will get an empty data frame as the first item, and the same will be true for every other value of i that isn't represented.
A loop like this would work:
for(i in unique(df$team)) team_[[as.character(i)]] <- df %>% filter(team == i) # index by name so numeric team ids don't leave NULL gaps
But I would stick to a non-looping method as described by akrun.
I'm pretty new to R, so please bear with me.
I read in an Excel spreadsheet containing 31 years of MLB prospect data on separate sheets using the following code:
library(readxl) # excel_sheets(), read_excel()
library(purrr) # map(); also provides the %>% pipe
path <- "../Documents/BA Prospects 1990-2021.xlsx"
prospect_data <- excel_sheets(path = path) %>%
  map(~ data.frame(read_excel(path, sheet = .)))
I then wrote a function to clean up the data a bit, and applied it to all 31 elements:
pull_df <- function(data, n = 1, year = 1990) {
prospect_data[[n]] %>%
data.frame() %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
prospect_data <- lapply(prospect_data[1:31], pull_df)
What I want is to take the 31 dataframes and save each one globally. However, those dataframes are each nested within a separate list. Those 31 lists are nested within the list prospect_data. No matter what I try, with for loops and all, I can't extract those dataframes from the prospect_data list to be able to manipulate them further. Honestly, I'd settle for one big data frame with 3100 rows and 17 columns at this point. I just want to get my data into a dataframe.
I know I did a poor job of explaining it, but please help!
It is not good practice to create individual dataframes in the global environment. Using lists is a better idea, or, if you prefer, combine them into one big dataframe. Also, length(1990:2021) gives 32 values, so I have adjusted the answer below to use 1990:2020 instead, since you have 31 dataframes in prospect_data in total.
library(dplyr)
library(purrr)
pull_df <- function(data, year) {
data %>%
filter_all(any_vars(!is.na(.))) %>%
mutate(year = year) %>%
select(year, everything())
}
combine_df <- map2_df(prospect_data, 1990:2020, pull_df)
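If you would rather keep the results as a list (as suggested above) instead of one combined dataframe, a small variant using map2:
results_list <- map2(prospect_data, 1990:2020, pull_df) # list of 31 cleaned dataframes
names(results_list) <- 1990:2020 # name each element by its year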
If I understand your problem correctly, prospect_data is a list of one-element lists, where the single element in each component list is a dataframe.
If this is the case, you can "flatten" prospect_data with:
flattened_prospect_data <- unlist(prospect_data, recursive = FALSE)
However, if prospect_data is a list of lists of arbitrary length, where the dataframe of interest is at index j, you can do the following.
flattened_prospect_data <- lapply(prospect_data, function(x) x[[j]])
If each dataframe is nested even deeper, say each one is the kth element of the jth element of the ith element of each component of prospect_data, then you can perform recursive extraction with
flattened_prospect_data <- lapply(prospect_data, function(x) x[[c(i, j, k)]])
Once you have all the dataframes in a list, you can assign each to a name based on its year using
get_year <- function(df) unique(df$year)
create_name <- function(x) paste(x, "Prospects") # create_name(1990) returns "1990 Prospects"
df_names <- sapply(flattened_prospect_data, function(x) create_name(get_year(x)))
Map(function(x, y) assign(x, value = y, envir = .GlobalEnv), df_names, flattened_prospect_data)
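Note that create_name puts a space in each name, so you will need backticks to refer to the resulting objects, e.g.:
head(`1990 Prospects`) # names containing spaces must be backtick-quoted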
I have a list object that contains several elements. One element is a data frame that I wish to modify: I want to perform some operations such as column renaming and mutating new columns.
Although one simple way is to extract the nested data frame, modify it, and finally re-assign the output to the original parent list, I'd like to avoid such a solution because it requires an intermediate assignment.
Example
Data. Let's build a list of several data objects:
library(tibble)
my_list <- lst(letters, mtcars, co2, uspop, iris)
Task.
I want to modify my_list$mtcars to:
rename the cyl column
compute a new column that takes a square root of values in mpg column
I want to modify my_list$iris to:
select columns that start with Sepal
rename them to lowercase
and ultimately I expect to get back a list object that is identical to the original my_list, except for the changes I made to mtcars and iris.
My attempt. Right now, the only way I know to achieve this involves re-assignment:
library(dplyr)
my_list$mtcars <-
my_list$mtcars %>%
rename("Number of cylinders" = cyl) %>%
mutate(sqrt_of_mpg = sqrt(mpg))
my_list$iris <-
my_list$iris %>%
select(starts_with("Sepal")) %>%
rename_with(tolower)
My question: Given my_list, how could I point to a nested element by its name, specify which actions should happen to modify it, and get back the parent my_list with just those modifications?
I imagine some sort of pipe that looks like this (just to convey my general idea):
## DEMO ##
my_list %>%
update_element(which = "mtcars", what = rename, mutate) %>%
update_element(which = "iris", what = select, rename)
Thanks!
You can try purrr's modify_at function, which applies a function to the named elements and returns the rest of the list unchanged:
library(tidyverse)
my_list %>%
modify_at("mtcars", ~rename(.,"Number of cylinders" = cyl) %>%
mutate(sqrt_of_mpg = sqrt(mpg))) %>%
modify_at("iris", ~select(., starts_with("Sepal")) %>%
rename_with(tolower))
You can use imap, which passes the name along with the data on each iteration, although this is not as close to your general idea:
library(dplyr)
my_list <- purrr::imap(my_list, ~{
if(.y == 'mtcars')
.x %>% rename("Number of cylinders" = cyl) %>%mutate(sqrt_of_mpg = sqrt(mpg))
else if(.y == 'iris')
.x %>% select(starts_with("Sepal")) %>% rename_with(tolower)
else .x
})
I have a data set similar to the example below, complex sample data. Thanks to SO user IRTFM, I was able to adapt the code and save the results (I'm only interested in the total proportions, not the confidence intervals) as a reshaped object for further processing. What I would like to do is extend this sapply to generate results for 20 other variables, and save the results as data frames in a list, since I think that is the most efficient approach. My struggle is how to extend the sapply so that I can process multiple variables at once. I thought about a for loop over a list holding the names of the variables, and started to make this list (var_list below), but that seems not the way forward; I'd rather take advantage of the apply family so that the results end up in a list.
library(survey) # using the `dclus1` object that is standard in the examples.
library(reshape)
library(tidyverse)
data(api)
stype_t <- sapply( levels(dclus1$variables$stype),
function(x){
form <- as.formula( substitute( ~I(stype %in% x), list(x=x)))
z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
c( z, c(attr(z,"ci")) )} ) %>%
as.data.frame() %>%
slice(1) %>%
reshape::melt() %>%
dplyr::mutate(value = round(value, digits = 4) * 100)
Let's say you then wanted to repeat the above using the variable awards. You could copy the lines and do it that way, but it would be better to be more efficient. So I started by making a list of the names of the two variables in this example data, but I am stumped as to how to apply this list to the code above and retain the results in a list of dataframes. I tried wrapping the sapply in an lapply, but that did not work; I'm betting my attempt was wrong. Any advice or thoughts would be appreciated.
var_list <- list("stype", "awards")
Instead of $ to reference named elements, consider the [[ extractor to reference names by string. Also, extend substitute for the dynamic variable:
# DEFINED METHOD
df_build <- function(var) {
sapply(levels(dclus1$variables[[var]]), function(x) {
form <- as.formula(substitute(~I(var %in% x),
list(var=as.name(var), x=x)))
z <- svyciprop(form, dclus1, method="me", df=degf(dclus1))
c(z, c(attr(z,"ci")))
}) %>%
as.data.frame() %>%
slice(1) %>%
reshape::melt() %>%
dplyr::mutate(value = round(value, digits = 4)*100)
}
# ITERATE THROUGH CHARACTER VECTOR AND CALL METHOD
var_list <- list("stype", "awards")
df_list <- lapply(var_list, df_build)
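Optionally, you could also name the list elements after the variables so the results are easier to reference (a small addition to the answer above):
names(df_list) <- unlist(var_list) # access results as df_list$stype or df_list$awards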
I have a dataframe with 500 observations for each of 3106 US counties. I would like to merge that dataframe with a SpatialPolygonsDataFrame.
I have tried a few approaches. I have found that if I filter the data by a variable iter_id, I can use sp::merge() to merge the datasets, and I presume I can then rbind them back together. sp::merge() does not allow a right or full join, and the spatial data needs to be in the left position, so a many-to-one merge will not work. The really nasty way I have tried is:
(I am not sure how to represent the dataframe with the variables of interest here)
library(choroplethr)
data(continental_us_states)
us <- tigris::counties(continental_us_states)
# GEOID values are quoted so the leading zeros survive (tigris stores GEOID as character)
gm_y_corr <- tribble(~GEOID, ~iter_id, ~neat_variable,
                     "01001", 1, "value_1",
                     "01003", 1, "value_2",
                     ...
                     "01001", 2, "value_3",
                     "01003", 2, "value_4",
                     ...
                     "01001", 500, "value_5",
                     "01003", 500, "value_6")
filtered <- gm_y_corr %>%
filter(iter_id ==1)
us.gm <- sp::merge(us, filtered, by = 'GEOID')
for (j in 2:500) {
tmp2 <- gm_y_corr %>%
filter(iter_id == j)
tmp3 <- sp::merge(us, tmp2, by = 'GEOID')
us.gm <- rbind(us.gm,tmp3)
}
I know there must be a better way. I have tried group_by, but multiple matches are found, so I must be misunderstanding group_by.
> geo_dat <- gm_y_corr %>%
+ group_by(iter_id)%>%
+ sp::merge(us, .,by='GEOID')
Error in .local(x, y, ...) : non-unique matches detected
I would like to merge the spatial data with the interesting data.
Here you can use the splitting functionality of base R in split, or the more recent dplyr::group_split. This will separate your data frame according to your splitting variable, and you can lapply or purrr::map a function such as merge over the pieces, then dplyr::bind_rows to collapse the returned list back into a dataframe. Since I can't manage to get the us data, I have just written what I suspect would work.
gm_y_corr %>%
group_by(iter_id) %>% # group
group_split() %>% # split
lapply(., function(x){ # apply merge(us, x, by = "GEOID") to each list element
merge(us, x, by = "GEOID")
}) %>%
bind_rows() # collapse to data frame
Equivalently, this is the same as using base R functionality; the newer group_by %>% group_split is a little more intuitive in my opinion.
gm_y_corr %>%
split(.$iter_id) %>%
lapply(., function(x){
merge(us, x, by = "GEOID")
}) %>%
bind_rows()
If you wanted to just use group_by, you would have to follow up with the dplyr::do function, which I believe does something similar to what I have done above, but without you having to split the data yourself.
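For completeness, a rough sketch of that do variant (untested, and assuming us merges like an ordinary data frame, e.g. as an sf object, so the per-group results can be recombined):
library(dplyr)
gm_y_corr %>%
  group_by(iter_id) %>% # one group per iteration id
  do(merge(us, ., by = "GEOID")) # merge each group with the spatial data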