Split dataframe into 20 groups based on column values [duplicate]

This question already has answers here:
Splitting a continuous variable into equal sized groups
(11 answers)
How to categorize a continuous variable in 4 groups of the same size in R?
(1 answer)
R divide data into groups
(1 answer)
Closed 2 years ago.
I am fairly new to R and can't find a concise solution to a problem.
I have a data frame in R called df that looks like the one below. It contains a column called value that holds values from 0 to 1 in ascending order, and a binary column called flag that contains either 0 or 1.
df
value flag
0.033 0
0.139 0
0.452 1
0.532 0
0.687 1
0.993 1
I wish to split this data frame into X equal-width groups covering the range 0 to 1. For example, for a 4-way split the data would be divided into 0-0.25, 0.25-0.5, 0.5-0.75 and 0.75-1. Each group would also keep the flag that corresponds to each value.
I want the solution to be scalable, so that I can split the data into more groups if I wish. I am also limited to the tidyverse packages.
Does anyone have a solution for this? Thanks

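If n is the number of partitions:
n <- 4
GroupedList <- lapply(seq_len(n), function(i) {
  # rows with value in [(i - 1)/n, i/n); the last group also keeps value == 1
  df[df$value >= (i - 1) / n & (df$value < i / n | i == n), ]
})
I think this should produce a list of data frames, one per interval, where each data frame contains the rows (value and flag) that fall into that interval. The intervals are closed on the left and open on the right, so a value that sits exactly on a boundary lands in exactly one group.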

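You can use cut to divide the data into n groups and pass the result to split to get a list of data frames.
n <- 4
list_df <- split(df, cut(df$value, breaks = n))
With breaks = n, cut divides the observed range of value into n equal-width intervals. If you want to split the fixed range 0-1 into n groups instead, you can do:
list_df <- split(df, cut(df$value, seq(0, 1, length.out = n + 1), include.lowest = TRUE))
include.lowest = TRUE keeps a value of exactly 0 in the first group; without it that value would fall outside the intervals, which are open on the left by default.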
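Because the question is limited to the tidyverse, the same idea could be sketched with dplyr. This is only an illustration, assuming dplyr 0.8 or later (for group_split); the column name group is made up here, and the sample data is copied from the question.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  value = c(0.033, 0.139, 0.452, 0.532, 0.687, 0.993),
  flag  = c(0, 0, 1, 0, 1, 1)
)

n <- 4  # number of equal-width groups between 0 and 1

# Label every row with the interval its value falls into
binned <- df %>%
  mutate(group = cut(value,
                     breaks = seq(0, 1, length.out = n + 1),
                     include.lowest = TRUE))

# One data frame per non-empty interval, in interval order
list_df <- group_split(binned, group)
```

Changing n is enough to get a finer or coarser split. Keeping the interval as a column (binned) is often handier than a list, since it can be passed straight to group_by.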

Related

How to add cells based off of a specific integer? [duplicate]

This question already has answers here:
Sum elements of a vector beween zeros in R
(3 answers)
Closed 2 years ago.
I want to add up values from a column. They come in a sequence like:
0,225,2352,34234,23442,23456,0,123,...
I want to add the values from one 0 up to, but not including, the next 0.
For example, I want an output of
(0+225+2352+34234+23442+23456),(0+123+...),...
I want to store the results as a new column of totals.

One simple solution in base R is
sapply(split(x, cumsum(x == 0)), sum)
cumsum(x == 0) increases by one at every 0, so it gives each run that starts with a 0 its own label. split uses those labels to group the elements, and sapply sums each group. The final result is a named numeric vector.
Sample data
x <- c(0,225,2352,34234,23442,23456,0,123,2,0,1,42)
sapply(split(x, cumsum(x == 0)), sum)
#     1     2     3
# 83709   125    43
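Since the question asks for the totals as a new column, a sketch with ave may be closer to that goal: ave returns a vector of the same length as its input, with every element replaced by the result for its group. The data frame cells and its column names are hypothetical, introduced only for this illustration.

```r
# Hypothetical data frame; 'cells', 'value' and 'total' are illustrative names
cells <- data.frame(
  value = c(0, 225, 2352, 34234, 23442, 23456, 0, 123, 2, 0, 1, 42)
)

# cumsum(value == 0) starts a new group id at every 0;
# ave() then writes each group's sum back to every row of that group
cells$total <- ave(cells$value, cumsum(cells$value == 0), FUN = sum)
```

Every row then carries the total of the run it belongs to, so the first six rows all hold 83709.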

Method to extract all existing combination in two columns [duplicate]

This question already has answers here:
R equivalent of SELECT DISTINCT on two or more fields/variables
(4 answers)
Closed 2 years ago.
I prepared a simple example for my question because the original data set is huge.
df <- data.frame(X=c(0,0,1,1,1,1),Y=c(0,0,0,0,1,1),Z=c(1.5,2,5,0.7,3.5,4.2))
I'm trying to extract all combinations that actually exist in columns X and Y. So the expected result is (0,0),(1,0),(1,1).
But if I use expand.grid, it returns every mathematically possible combination of the elements 0 and 1, so (0,1) is included in the result.
So my question is: how do I extract only the combinations that actually exist in two different columns?
Any opinion is welcome!

We can subset the relevant columns and then call unique on them.
unique(df[c('X', 'Y')])
#   X Y
# 1 0 0
# 3 1 0
# 5 1 1
Or in dplyr, use distinct
library(dplyr)
df %>% distinct(X, Y)
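To make the difference from expand.grid concrete, the following base R sketch builds both sets for the sample data and picks out the combination that is possible but never observed. The variable names all_possible, observed and never_seen are illustrative only.

```r
df <- data.frame(X = c(0, 0, 1, 1, 1, 1),
                 Y = c(0, 0, 0, 0, 1, 1),
                 Z = c(1.5, 2, 5, 0.7, 3.5, 4.2))

# Every combination of the distinct X and Y values (4 rows)
all_possible <- expand.grid(X = unique(df$X), Y = unique(df$Y))

# Only the combinations that occur in the data (3 rows)
observed <- unique(df[c("X", "Y")])

# Combinations that are possible but never occur
key <- function(d) paste(d$X, d$Y)
never_seen <- all_possible[!(key(all_possible) %in% key(observed)), ]
```

Here never_seen holds the single pair (0, 1), which is exactly the row the question wants to exclude.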

How to create a variable using another variable as an Index? [duplicate]

This question already has answers here:
Using row-wise column indices in a vector to extract values from data frame [duplicate]
(2 answers)
Closed 3 years ago.
I'm looking to create a new variable, d, which takes its value from either a or b depending on the variable c.
dat = data.frame(a=1:10,b=11:20,c=rep(1:2,5))
The result would be:
d = c(1,12,3,14,... etc)

We can use row/column matrix indexing: the row index is the sequence of rows and the column index is the 'c' column. cbind them into a two-column matrix and use it to extract the matching elements from the dataset.
dat$d <- dat[1:2][cbind(seq_len(nrow(dat)), dat$c)]
dat$d
#[1] 1 12 3 14 5 16 7 18 9 20
NOTE: This should also work when there are multiple column values to extract.
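As a sketch of the multi-column case mentioned in the note, the example below adds a third candidate column. The data frame dat3 and the column e are made up for illustration; c is assumed to hold the position of the column to read from (1, 2 or 3).

```r
# Hypothetical data: three candidate columns a, b, e and a selector column c
dat3 <- data.frame(a = 1:6, b = 11:16, e = 21:26, c = rep(1:3, 2))

# Each row of idx is a (row, column) pair to extract
idx <- cbind(seq_len(nrow(dat3)), dat3$c)

# Index only the candidate columns, so column positions match the values in c
dat3$d <- dat3[c("a", "b", "e")][idx]
dat3$d
# [1]  1 12 23  4 15 26
```

The subset dat3[c("a", "b", "e")] matters: the values in c refer to positions within that subset, not within the full data frame.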

You can do
dat$d <- ifelse(dat$c==1,dat$a,dat$b)

A dplyr variant
library(dplyr)
dat %>%
  mutate(d = case_when(c == 1 ~ a,
                       TRUE ~ b))

Generate cross-section from panel data in R [duplicate]

This question already has answers here:
data.frame Group By column [duplicate]
(4 answers)
Closed 6 years ago.
I have a panel data file (long format) and I need to convert it to cross-sectional data. That is, I don't just need a transformation to the wide format; I need exactly one observation per individual that contains the mean for each variable.
Here's what I want to do: I have panel data (a number of observations for each individual) in a data frame and I'm looking for an easy way in R to generate a new data frame that contains aggregated data for each individual, i.e. either the sum of all observations in each variable or their mean. It might also be interesting to get a measure of volatility.
For example I have a given data frame panel_data that contains panel data:
> individual <- c(1,1,2,2,3,3)
> var1 <- c(2,3,3,3,4,3)
> panel_data <- data.frame(individual,var1)
> panel_data
  individual var1
1          1    2
2          1    3
3          2    3
4          2    3
5          3    4
6          3    3
The result should look like this:
> cross_data
  individual var1
1          1    5
2          2    6
3          3    7
Now this is only an example. I need this feature in a number of varieties, the most important one probably being the intra-individual mean for each variable.

There are ways to do this using base R or using the popular packages data.table or dplyr. Everyone has their own preference and mine is dplyr.
You can very easily perform a variety of operations to summarise your data per individual. With dplyr syntax, you first group_by individual to specify that operations should be performed on the groups defined by the variable "individual". You can then summarise each group using functions you specify.
Try the following:
library("dplyr")
panel_data %>%
  group_by(individual) %>%
  summarise(sum_var1 = sum(var1), mean_var1 = mean(var1))
Do not be put off by the %>% notation; it is just a convenient shortcut to chain operations:
x %>% f is equivalent to f(x)
x %>% f(a) is equivalent to f(x, a)
x %>% f(a) %>% g(b) is equivalent to g(f(x, a), b)
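For comparison, the base R route mentioned at the start of this answer could look roughly like the sketch below, using aggregate with a formula. The names cross_sum, cross_mean and cross_sd are illustrative; the standard deviation is used here as one possible measure of volatility, which is an assumption about what the question means by that term.

```r
# Panel data from the question
individual <- c(1, 1, 2, 2, 3, 3)
var1 <- c(2, 3, 3, 3, 4, 3)
panel_data <- data.frame(individual, var1)

# One row per individual: sum, mean and standard deviation of var1
cross_sum  <- aggregate(var1 ~ individual, data = panel_data, FUN = sum)
cross_mean <- aggregate(var1 ~ individual, data = panel_data, FUN = mean)
cross_sd   <- aggregate(var1 ~ individual, data = panel_data, FUN = sd)
```

cross_sum reproduces the cross_data table from the question. With several variables, a formula such as cbind(var1, var2) ~ individual would aggregate them in one call (var2 being a hypothetical second column).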

Apply function over consecutive groups in vector [duplicate]

This question already has answers here:
Calculate the mean of every 13 rows in data frame
(4 answers)
Closed 1 year ago.
I want to calculate the means of every three consecutive values in a vector.
Ex:
Vec<-rep(1:10)
I would like the output to be like the screenshot below:

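You can create the following function to calculate means by groups of 3 (or any other number). An incomplete group at the end of the vector is dropped:
f <- function(x, k = 3) {
  n_groups <- length(x) %/% k          # number of complete groups of size k
  means <- numeric(n_groups)
  for (g in seq_len(n_groups)) {
    # elements (g - 1) * k + 1 to g * k form group g
    means[g] <- mean(x[((g - 1) * k + 1):(g * k)])
  }
  means
}
f(1:15)
# [1]  2  5  8 11 14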

We can create a grouping variable using gl and then get the mean with ave
ave(Vec, as.numeric(gl(length(Vec), 3, length(Vec))))
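Note that ave returns a vector of the same length as Vec, with each element replaced by the mean of its group. If one value per group is wanted instead, a possible sketch uses tapply with a group index built from ceiling. The names k, grp and group_means are illustrative.

```r
Vec <- 1:10
k <- 3  # group size

# 1 1 1 2 2 2 3 3 3 4: consecutive blocks of k elements
grp <- ceiling(seq_along(Vec) / k)

# One mean per block; the incomplete last block (the value 10) is kept
group_means <- tapply(Vec, grp, mean)
```

For Vec this gives the means 2, 5, 8 and 10, where the last one comes from the single leftover element.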
