I have created a list of dataframes with split like so:
dataframes_list <- split(df, f = df$variable3)
Each dataframe (131 in total) is in long format and has the same variables and structure. I want to apply the function pivot_wider to all of them at once.
I have been struggling with some functions of the apply family, but could not get it done:
First I reduced the number of variables within each dataframe selecting only those that should be used for pivoting
dataframes_list_2 <- lapply(dataframes_list, function (x) select(x, variable1, variable2))
Then I tried pivot_wider
dataframes_list_3 <- lapply(dataframes_list_2, function(x) pivot_wider(x, names_from = variable1, values_from = variable2))
What I obtain in this way is the list with dataframes that contain 1 observation per variable, each of them being a vector of (in this case) 12 values. What I want instead is this:
Because there was a warning telling me that my observations were not uniquely identified, I varied the code above including such variable. But what I got was this:
Can someone give me some answer to this issue?
Thank you
Each dataframe in the list has this aspect:
I had the same problem and I solved it this way:
df_list <- lapply(my_list,
                  function(x) pivot_wider(x, names_from = names, values_from = values))
bind_rows(df_list)
You will get what you needed! Hope it helps!
You could try:
map(my_list, ~ (pivot_wider(.x, names_from=1,values_from= 2)))
The numbers 1 and 2 are the column positions in my tibbles. You can also use map_dfr. To combine the data sets you can use unnest or bind_rows.
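For the "not uniquely identified" warning mentioned in the question, here is a minimal sketch. It assumes the desired result is one row per observation index within each name. The toy data and the helper column obs are hypothetical and not taken from the question. The idea is to add a running index per name before pivoting, so that each value has a unique row to land in.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the real data frame
df <- data.frame(
  variable1 = rep(c("a", "b"), times = 6),
  variable2 = 1:12,
  variable3 = rep(c("g1", "g2"), each = 6)
)
dataframes_list <- split(df, f = df$variable3)

wide_list <- lapply(dataframes_list, function(x) {
  x %>%
    select(variable1, variable2) %>%
    group_by(variable1) %>%
    mutate(obs = row_number()) %>%   # running index within each name (assumed identifier)
    ungroup() %>%
    pivot_wider(names_from = variable1, values_from = variable2)
})
```

Each element of wide_list then has one column per name plus obs, instead of list-columns holding whole vectors.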
Related
At the moment I am trying to apply GLM predict on a dataframe. The dataframe is quite large, so I want to apply predict in chunks.
I have found a solution but it is quite unwieldy. I first create an empty dataframe and then use rbind. Is there a more efficient way of doing this?
df=data[c(),]
for (x in split(data, factor(sort(rank(row.names(data))%%10)))) {
x["prediction"]=predict(model, x, type="response")
df=rbind(df,x)
}
As the comments mention, an example of what you want your output dataframe to look like would be very helpful.
But I think you can achieve what you want by making a grouping variable first then using 'group_by', something like this:
df <- data %>%
  mutate(group = rep(1:10, length.out = n())) %>% # make an arbitrary grouping factor for this example
  group_by(group) %>% # group by whatever your grouping factor is
  mutate(prediction = predict(model, newdata = pick(everything()), type = "response")) %>% # pick() needs dplyr >= 1.1.0; older versions can use cur_data()
  ungroup()
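A base R alternative to growing a data frame with rbind, as a rough sketch: predict on each chunk with lapply and put the results back in row order with unsplit. The model and data below are toy stand-ins (a binomial glm on mtcars), and n_chunks and chunk_id are names made up for this example.

```r
# Toy model and data; `model` and `data` stand in for the real objects
data <- mtcars
model <- glm(am ~ wt, data = data, family = binomial)

# Chunk id 1..10, assigned to consecutive blocks of rows
n_chunks <- 10
chunk_id <- cut(seq_len(nrow(data)), breaks = n_chunks, labels = FALSE)

# Predict chunk by chunk, collecting the results in a list
pred_list <- lapply(split(data, chunk_id), function(chunk) {
  predict(model, newdata = chunk, type = "response")
})

# unsplit() puts the pieces back in the original row order
data$prediction <- unname(unsplit(pred_list, chunk_id))
```

Because nothing is appended inside a loop, the data frame is not copied on every iteration.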
I have several series, each one indicates the deflator for the GDP for each country. (Data attached down below)
So what I want to do is to divide every column by its value at the 97th position.
I know this could be pretty simple for you, but I am struggling.
This is my code so far:
d_data <- d_data %>%
mutate_if(is.numeric, function(x) x/d_data[[97,x]])
So as you can see in the data, from columns 3 to 8 data are numeric.
I think the error is that argument x of the function refers to the column name, while in the d_data, the second argument refers to column position and that is the main issue.
How can I solve this? Thanks in advance!!
Data
The data was too large to put here (745 rows, 8 columns)
So I uploaded the dput(d_data) output here
Use mutate with across, as the _if/_at/_all variants are superseded. Also, to extract by position, use nth
library(dplyr)
d_data %>%
mutate(across(where(is.numeric), ~ .x/nth(.x, 97)))
In the OP's code, instead of d_data[[97,x]], it should be x[97] as x here is the column value itself
d_data %>%
mutate_if(is.numeric, function(x) x/x[97])
If we want to subset the original data column, we have to pass either the column index or the column name. Here, x refers to neither; it is the column's values. With across, we can get the column name with cur_column(), e.g. (mtcars %>% summarise(across(everything(), ~ cur_column()))), but that is not needed in this case
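To see the rebasing on something small, here is a sketch with a made-up five-row tibble in which row 3 plays the role of row 97. The column names and values are invented for illustration. starts_with() is used instead of where(is.numeric) only because this toy data has a numeric year column that should not be rescaled.

```r
library(dplyr)

# Hypothetical stand-in for d_data: two id columns, two numeric series
toy <- tibble(
  country = "X",
  year    = 2001:2005,
  defl_a  = c(50, 80, 100, 120, 150),
  defl_b  = c(2, 3, 4, 5, 6)
)

base_row <- 3  # plays the role of row 97 in the question

# Divide each deflator column by its own value in the base row
rebased <- toy %>%
  mutate(across(starts_with("defl_"), ~ .x / .x[base_row]))
```

After this, every rescaled column equals 1 in the base row.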
As a follow up to this question, I'm using dplyr's group_split() to make dataframes / tibbles based on the levels of a column. Continuing off of this question, I want to split off of two columns instead of 1. When I try to split and name the columns, it attributes the wrong names to some of the datasets.
Here's a simple example:
library(dplyr)
#Sample dataset to intuitively illustrate issue
example <- tibble(number = c(1:6),
even_or_odd = c("odd", "even", "odd", "even", "odd", "even"),
prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")) %>%
mutate(type = paste0(even_or_odd, "_", prime_or_not)) %>%
mutate(type_factor = factor(type, levels = unique(type)))
#Does group split to make 3 datasets
the_test <- example %>%
group_split(even_or_odd, prime_or_not) %>%
setNames(unique(example$type_factor))
#The data sets with some being correct but others not
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #wrong label :`-(
odd_prime <- the_test["odd_prime"]$odd_prime #wrong label :`-(
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
My question: how do I ensure that my group names will be attributed to the right dataset and avoid the issues here with even_not and odd_prime being mixed up?
In my actual dataset, I have 50+ combinations, so typing them all out manually is not an option. In addition, my actual dataset will have some combinations that don't consistently exist (like the odd not prime combination here), so relying on index isn't an option.
Instead of splitting by the two columns, use the factor column that was created, which ensures that it splits by the order of the levels created in the type_factor. In addition, using the unique on type_factor can have some issues if the order of the values in 'type_factor' is different i.e. unique gets the first non-duplicated value based on its occurrence. Instead, levels is better. In fact, it may be more appropriate to droplevels as well in case of unused levels.
the_test <- example %>%
group_split(type_factor) %>%
setNames(levels(example$type_factor))
group_split returns an unnamed list. If we want to avoid the pain of renaming incorrectly, use split from base R, which does return a named list. Thus, it can return in any order as long as the key/value pairs are correct
# 1 - return in a different order based on alphabetic order
split(example, example[c("even_or_odd", "prime_or_not")], drop = TRUE)
# 2 - return order based on the levels of the factor column
split(example, example$type_factor)
# 3 - With dplyr pipe
example %>%
split(.$type_factor)
# 4 - or using magrittr exposition operator
library(magrittr)
example %$%
split(x = ., f = type_factor)
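A small base R sketch of the named-list approach that also checks the labels against the contents. The data repeats the question's example without the helper columns. pieces and labels_ok are names made up here, and sep = "_" is used only so the names look like the question's type column.

```r
# Toy data mirroring the question's example (base R only)
example <- data.frame(
  number       = 1:6,
  even_or_odd  = c("odd", "even", "odd", "even", "odd", "even"),
  prime_or_not = c("prime", "prime", "prime", "not", "prime", "not")
)

# Split on both columns; names are built as <even_or_odd>_<prime_or_not>
pieces <- split(
  example,
  list(example$even_or_odd, example$prime_or_not),
  drop = TRUE,   # skip combinations that never occur (e.g. odd_not)
  sep  = "_"
)

# Check that every name matches the rows it holds
labels_ok <- vapply(names(pieces), function(nm) {
  all(paste(pieces[[nm]]$even_or_odd, pieces[[nm]]$prime_or_not, sep = "_") == nm)
}, logical(1))
```

Since split builds the names from the grouping values themselves, the order of the list no longer matters.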
Oh, of course the moment I post it, I realize that an easy solution existed:
Just change the group split to the new variable and it works!
library(dplyr)
#Does group split to make 3 datasets
the_test <- example %>%
group_split(type_factor) %>%
setNames(unique(example$type_factor))
#The data sets are now all labelled correctly
even_prime <- the_test["even_prime"]$even_prime #works!
even_not <- the_test["even_not"]$even_not #works now!
odd_prime <- the_test["odd_prime"]$odd_prime #works now!
odd_not <- the_test["odd_not"]$odd_not #works--correctly throws an error!
I have a tibble with 20 variables. So far I've been using this pipe to find out which values appear more than once in a single column
as_tibble(iris) %>% group_by(Petal.Length) %>% summarise(n=sum(n())) %>% filter(n>1)
I was wondering if I could write a line that could loop this through all the columns and return 20 different tibbles (or as many as I need in the future) in the same way the pipe above would return one tibble. I have tried writing my own loops but I've had no success, as I am quite new.
The iris example dataset has 5 columns so feel free to give an answer with 5 columns.
Thank you!
library(dplyr)
col_names <- colnames(iris)
lapply(
  col_names,
  function(col) {
    iris %>%
      group_by(across(all_of(col))) %>% # group by the column named in `col`
      summarise(n = n()) %>%
      filter(n > 1)
  }
)
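A variation on the same idea, as a sketch on a made-up two-column tibble: passing a named vector of column names to lapply makes the result a named list, so each tibble can be pulled out by column name. toy and dup_counts are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data: 1 appears twice in `a`, "y" three times in `b`
toy <- tibble(a = c(1, 1, 2, 3), b = c("x", "y", "y", "y"))

dup_counts <- lapply(
  setNames(names(toy), names(toy)),   # name each result after its column
  function(col) {
    toy %>%
      count(across(all_of(col)), name = "n") %>%  # occurrences of each value
      filter(n > 1)                               # keep values seen more than once
  }
)
```

dup_counts$a and dup_counts$b then hold the repeated values of each column along with their counts.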
In base R 4.1+ we have this one-liner. For each column it applies table and then keeps only those elements whose count exceeds 1. Finally it converts what remains of the table to a data frame. Omit stack if it is ok to return a list of table objects instead of a list of data frames.
lapply(iris, \(x) stack(Filter(function(x) x > 1, table(x))))
A variation of that is to keep only duplicated items and then add 1 giving slightly fewer keystrokes. Again we can omit stack if returning a list of table objects is ok.
lapply(iris, \(x) stack(table(x[duplicated(x)]) + 1))
I have a dataframe made up of 400'000 rows and about 50 columns. As this dataframe is so large, it is too computationally taxing to work with.
I would like to split this dataframe up into smaller ones, after which I will run the functions I would like to run, and then reassemble the dataframe at the end.
There is no grouping variable that I would like to use to split up this dataframe. I would just like to split it up by number of rows. For example, I would like to split this 400'000-row table into 400 1'000-row dataframes.
How might I do this?
Make your own grouping variable.
d <- split(my_data_frame,rep(1:400,each=1000))
You should also consider the ddply function from the plyr package, or the group_by() function from dplyr.
edited for brevity, after Hadley's comments.
If you don't know how many rows are in the data frame, or if the number of rows might not be an exact multiple of your desired chunk size, you can do
chunk <- 1000
n <- nrow(my_data_frame)
r <- rep(1:ceiling(n/chunk),each=chunk)[1:n]
d <- split(my_data_frame,r)
You could also use
r <- ggplot2::cut_width(1:n,chunk,boundary=0)
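The question also asks about reassembling at the end. Here is a minimal sketch of the full round trip on a made-up 25-row data frame with a chunk size of 10. The squaring step is only a placeholder for whatever function is actually run, and the chunk ids are computed with ceiling, which is equivalent to the rep construction above.

```r
# Hypothetical 25-row data frame; chunk size 10 gives pieces of 10, 10 and 5 rows
my_data_frame <- data.frame(id = 1:25, value = (1:25) * 2)

chunk <- 10
r <- ceiling(seq_len(nrow(my_data_frame)) / chunk)   # chunk id for every row
d <- split(my_data_frame, r)

# Run a placeholder function on each piece
processed <- lapply(d, function(piece) {
  piece$value_sq <- piece$value^2
  piece
})

# Put the pieces back together in the original row order
reassembled <- unsplit(processed, r)
```

unsplit needs the same grouping vector that was passed to split, so keep r around until the end.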
For future readers, methods based on the dplyr and data.table packages will probably be (much) faster for doing group-wise operations on data frames, e.g. something like
(my_data_frame
%>% mutate(index=rep(1:ngrps,each=full_number)[seq_len(n())])
%>% group_by(index)
%>% [mutate, summarise, do()] ...
)
There are also many answers here
I had a similar question and used this:
library(tidyverse)
n = 100 # number of rows per group
split <- df %>% group_by((row_number() - 1) %/% n) %>% group_map(~ .x)
From left to right:
you assign your result to split
you start with df as your input dataframe
then you group your data by dividing the zero-based row number by n (the number of rows per group) using integer division, so rows 1 to n land in group 0, rows n + 1 to 2n in group 1, and so on.
then you pass each group through the group_map function, which returns a list.
So in the end your split is a list in which each element is one group of your dataset.
On the other hand, you could also immediately write your data by replacing the group_map call by e.g. group_walk(~ write_csv(.x, paste0("file_", .y, ".csv"))).
You can find more info on these powerful tools on:
Cheat sheet of dplyr explaining group_by
and also below for:
group_map, group_walk follow up functions
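A quick base R check of how integer division assigns rows to groups, using a hypothetical 10-row index and chunks of 4. It compares dividing the row number directly with subtracting 1 first. The names plain and shifted are made up for this sketch.

```r
n <- 4                 # rows per chunk (illustrative value)
rows <- seq_len(10)    # stand-in for row_number() on a 10-row data frame

plain   <- rows %/% n         # first chunk is one row short
shifted <- (rows - 1) %/% n   # full chunks of n rows; only the last may be shorter

table(plain)    # group sizes 3, 4, 3
table(shifted)  # group sizes 4, 4, 2
```

Subtracting 1 before dividing keeps every chunk except possibly the last at exactly n rows.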