Generate cross-section from panel data in R [duplicate]

I have a panel data file (long format) and I need to convert it to cross-sectional data. That is, I don't just need a transformation to the wide format; I need exactly one observation per individual that contains the mean of each variable.
Here's what I want to do: I have panel data (a number of observations for each individual) in a data frame, and I'm looking for an easy way in R to generate a new data frame that contains aggregated data for each individual, i.e. either the sum of all observations of each variable or their mean. It might also be interesting to get a measure of volatility.
For example I have a given data frame panel_data that contains panel data:
> individual <- c(1,1,2,2,3,3)
> var1 <- c(2,3,3,3,4,3)
> panel_data <- data.frame(individual,var1)
> panel_data
individual var1
1 1 2
2 1 3
3 2 3
4 2 3
5 3 4
6 3 3
The result should look like this:
> cross_data
individual var1
1 1 5
2 2 6
3 3 7
Now this is only an example. I need this feature in a number of varieties, the most important one probably being the intra-individual mean for each variable.

There are ways to do this using base R or the popular packages data.table or dplyr. Everyone has their own preference, and mine is dplyr.
You can very easily perform a variety of operations to summarise your data per individual. With dplyr syntax, you first group_by individual to specify that operations should be performed on the groups defined by the variable "individual". You can then summarise each group using a function you specify.
Try the following:
library("dplyr")
panel_data %>%
  group_by(individual) %>%
  summarise(sum_var1 = sum(var1), mean_var1 = mean(var1))
Do not be put off by the %>% notation; it is just a convenient shortcut for chaining operations:
x %>% f is equivalent to f(x)
x %>% f(a) is equivalent to f(x, a)
x %>% f(a) %>% g(b) is equivalent to g(f(x, a), b)
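For completeness, here is a sketch of the base R and data.table equivalents; the column names sum_var1, mean_var1, and sd_var1 are my own labels, and the sd() line is one way to address the volatility measure mentioned in the question:
library(data.table)

# base R: aggregate() applies FUN to var1 within each level of individual
aggregate(var1 ~ individual, data = panel_data, FUN = mean)

# data.table: several per-group summaries in one pass
dt <- as.data.table(panel_data)
dt[, .(sum_var1  = sum(var1),
       mean_var1 = mean(var1),
       sd_var1   = sd(var1)),   # within-individual volatility
   by = individual]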


Split dataframe into 20 groups based on column values [duplicate]

I am fairly new to R and can't find a concise solution to a problem.
I have a dataframe in R called df that looks as follows. It contains a column called value that holds values from 0 to 1, ordered numerically, and a binary column called flag that contains either 0 or 1.
df
value flag
0.033 0
0.139 0
0.452 1
0.532 0
0.687 1
0.993 1
I wish to split this dataframe into X groups covering 0 to 1. For example, if I wanted a 4-way split, the data would be divided into 0-0.25, 0.25-0.5, 0.5-0.75, and 0.75-1, with each row keeping its corresponding flag.
I want the solution to be scalable so that I can split into more groups if I wish. I am also limited to the tidyverse packages.
Does anyone have a solution for this? Thanks
If n is the number of partitions:
n <- 4  # e.g. four groups
L <- seq(1, n) / n
GroupedList <- lapply(L, function(x) {
  df[df$value <= x & df$value > (x - 1/n), ]
})
I think this should produce a list of data frames, each containing the rows whose value falls in the corresponding interval. Note that each interval is treated as left-open and right-closed, so a value of exactly 0 would not be captured.
You can use cut to divide the data into n groups and pass the result to split to get a list of data frames.
n <- 4
list_df <- split(df, cut(df$value, breaks = n))
Note that breaks = n cuts the observed range of value into n equal-width bins. If you instead want the n groups to span exactly 0 to 1, supply the breakpoints yourself:
list_df <- split(df, cut(df$value, seq(0, 1, length.out = n + 1)))
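Since the question is restricted to the tidyverse, here is a sketch of the same idea with dplyr (the column name group is my own label): tag each row with its interval via cut(), then split on it.
library(dplyr)

n <- 4
list_df <- df %>%
  mutate(group = cut(value, breaks = seq(0, 1, length.out = n + 1),
                     include.lowest = TRUE)) %>%  # also capture a value of exactly 0
  group_split(group)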

Method to extract all existing combination in two columns [duplicate]

I prepared a simple code example for my question because the original data volume is huge.
df <- data.frame(X=c(0,0,1,1,1,1),Y=c(0,0,0,0,1,1),Z=c(1.5,2,5,0.7,3.5,4.2))
I'm trying to extract all combinations that actually exist in columns X and Y. So the expected result would be (0,0), (1,0), (1,1).
But if I use expand.grid, it returns all mathematically possible combinations of the elements 0 and 1, so (0,1) would be included in the result.
So my question is: how do I extract only the combinations that actually exist across two different columns?
Any opinion is welcome!
We can subset the relevant columns and then apply unique to the result.
unique(df[c('X', 'Y')])
# X Y
#1 0 0
#3 1 0
#5 1 1
Or in dplyr, use distinct
library(dplyr)
df %>% distinct(X, Y)
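If the real data is as large as the question suggests, a data.table version may be worth trying as well; this is just a sketch of the same de-duplication:
library(data.table)
dt <- as.data.table(df)
unique(dt[, .(X, Y)])   # first occurrence of each existing (X, Y) pair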

Finding average values for multiple groups in tidy data in R [duplicate]

I have a tidy dataframe with study data. "init_cont" and "family" represent the different conditions in this study. There are three possible options for init_cont (A, B, or C) and two possible options for family (D or E), yielding a 3x2 experimental design. In this example, there are two different questions that each participant must answer (specified in column "qnumber"). The "value" column indicates their response to the question asked.
id init_cont family qnumber value
1 A D 1 5
1 A D 2 3
2 B D 1 4
2 B D 2 2
3 C E 1 4
3 C E 2 3
4 A E 1 5
4 A E 2 2
I am trying to determine the best way (preferably within the tidyverse) to determine the average of the values for each question, separated by condition. There are 6 conditions, which come from the 6 combinations of the 3 options in init_cont combined with the 2 options in family. In this dataframe, there are only 2 questions, but the actual dataset has 14.
I know I could probably do this by making distinct dataframes for each of the 6 conditions and then breaking these down further to make distinct dataframes for each question, then finding the average values for each dataframe. There must be a better way to do this in fewer steps.
Using the tidyverse, to determine the average of the values grouped by one condition, say family:
data %>%
  group_by(family) %>%
  summarize(avg_value = mean(value))
You can also group by more than one variable at once, say family plus a second (hypothetical) variable such as religion:
data %>%
  group_by(family, religion) %>%
  summarize(avg_value = mean(value))
EDIT 1: Based on feedback, here's the code to get the average value grouped by init_cont, family, and qnumber:
data %>%
  group_by(init_cont, family, qnumber) %>%
  summarize(avg_value = mean(value))
We can use aggregate from base R
aggregate(value ~ family, data, mean)
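The aggregate() call extends naturally to the full 3x2-by-question design by listing all three grouping variables on the right-hand side of the formula (assuming the data frame is named data, as in the answers above):
aggregate(value ~ init_cont + family + qnumber, data = data, FUN = mean)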

R Data-Frame: Get Maximum of Variable B conditional on Variable A [duplicate]

I am searching for an efficient and fast way to do the following:
I have a data frame with, say, 2 variables, A and B, where the values for A can occur several times:
mat <- data.frame(VarA = rep(seq(1, 10), 2), VarB = rnorm(20))
VarA VarB
1 0.95848233
2 -0.07477916
3 2.08189370
4 0.46523827
5 0.53500190
6 0.52605101
7 -0.69587974
8 -0.21772252
9 0.29429577
10 3.30514605
1 0.84938361
2 1.13650996
3 1.25143046
Now I want to get a vector giving me for every unique value of VarA
unique(mat$VarA)
the maximum of VarB conditional on VarA.
In the example here that would be
1 0.95848233
2 1.13650996
3 2.08189370
etc...
My data-frame is very big so I want to avoid the use of loops.
Try this:
library(dplyr)
mat %>%
  group_by(VarA) %>%
  summarise(max = max(VarB))
Try the data.table package:
library(data.table)
mat <- data.table(mat)
result <- mat[, .(max = max(VarB)), by = VarA]  # name the summary column (the default would be V1)
print(result)
Try this:
library(plyr)
ddply(mat, .(VarA), summarise, VarB = max(VarB))
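For reference, two loop-free base R equivalents (relevant given the concern about data size); both compute the per-VarA maximum directly:
tapply(mat$VarB, mat$VarA, max)               # named vector of maxima
aggregate(VarB ~ VarA, data = mat, FUN = max)  # data frame of maxima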

R count occurrences of an element by groups [duplicate]

What is the easiest way to count the occurrences of an element in a vector or data.frame within every group?
I don't mean just counting the total (as other Stack Overflow questions ask) but giving a different number to every successive occurrence.
For example, for this simple dataframe (though I will work with dataframes with more columns):
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata, myorder = ave(rep(1, nrow(mydata)), mydata$A, FUN = cumsum))
and here the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there a single command to do it? Or a specialized package?
I want to use it later with tidyr's spread() function.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want the total number of occurrences at the end, but the cumulative occurrences up to each element.
OK, my problem is a little bit more complex:
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know how to solve the first example I wrote above.
But what happens when I also want it by a second grouping variable?
Something like occurrences(letter) by group:
group letter "occurrences within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found a way with
ave(rep(1, nrow(mydata)), list(mydata$group, mydata$letter), FUN = cumsum)
though there should be something easier.
Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the operation be carried out within the groups defined by the columns group and letter. .N is the number of rows within that group (if the by argument were empty, it would be the number of rows in the whole table), so within each sub-table the rows are indexed from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
Or a dplyr solution, which is pretty much the same:
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = 1:n())
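As a side note, dplyr's row_number() is the more idiomatic spelling of 1:n() here, and it sidesteps the 1:0 pitfall should a group ever have zero rows:
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = row_number()) %>%
  ungroup()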
