This question already has answers here:
How to get summary statistics by group
(14 answers)
Mean per group in a data.frame [duplicate]
(8 answers)
Closed 5 years ago.
I'm trying to merge rows of a data set by using the mean operator.
Basically, I want to convert data set 1 into data set 2 (see below)
Data set 1          Data set 2
ID  MEASUREMENT     ID  MEASURE
A   20              A   22.5
B   30              B   30
A   25              .
.                   .
.                   .
How can I do this in R?
Note that, unlike the example given here, my data set is really large, so I can't look through it by hand, group the rows by ID, and then call colMeans.
My idea is to order the data set, separate the measurements for each ID, compute each mean, and then regroup the data. However, this would be very time-consuming.
I would really appreciate it if someone could help with direct code or even a for loop.
This code should be able to do that for you.
library(data.table)
setDT(dat)
dat = dat[ , .(MEASURE = mean(MEASUREMENT)), by = .(ID)]
Just to be a little more complete, I'll throw in an example and a way to do this in base R.
Data:
dat = data.frame(ID = c("A","A","A","B","B","C"), MEASUREMENT = c(1:3,61,13,7))
With only base R functions:
aggregate(MEASUREMENT ~ ID, data = dat, FUN = mean)
ID MEASUREMENT
1 A 2
2 B 37
3 C 7
With data.table:
library(data.table)
setDT(dat)
dat = dat[ , .(MEASURE = mean(MEASUREMENT)), by = .(ID)]
> dat
ID MEASURE
1: A 2
2: B 37
3: C 7
You can also do this easily in dplyr, assuming your data is in a data frame called df:
library(dplyr)
df <- df %>%
  group_by(ID) %>%
  summarize(MEASURE = mean(MEASUREMENT))
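As a quick check against the numbers in the question itself, here is a minimal base R sketch. The three rows are the ones shown in "data set 1"; the name res and the renaming step are only illustrative assumptions.

```r
# The three rows shown in the question's "data set 1"
dat <- data.frame(ID = c("A", "B", "A"), MEASUREMENT = c(20, 30, 25))

# One row per ID, holding the mean of MEASUREMENT within that ID
res <- aggregate(MEASUREMENT ~ ID, data = dat, FUN = mean)

# Rename the result column to match the question's "data set 2"
names(res)[2] <- "MEASURE"
res
```

This should give 22.5 for A and 30 for B, matching the desired output. No sorting or manual regrouping is needed; aggregate handles the grouping regardless of row order.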
This question already has answers here:
Find complement of a data frame (anti - join)
(7 answers)
Closed 1 year ago.
I'd like to stick with dplyr if possible (I'm a major fan), but base R is fine too if the solution is simple.
Suppose you have two data frames, as shown below. I'd like to create a new data frame that compares the two (df1 and df2) and keeps only those complete rows from df1 whose ID does not appear among the IDs in df2. df2 serves as the "knock out" list. How would this be done?
df1 <- data.frame(
  ID = c(1,2,3,4,5),
  Value = c(10,20,30,40,50)
)
df2 <- data.frame(
  ID = c(6,7,8,1,2),
  Value = c(60,70,80,10,20)
)
The new data frame, call it df3, would look like this after applying the df2 "knock outs", when printed in the RStudio console:
ID Value
1 3 30
2 4 40
3 5 50
IDs 1 and 2 got knocked out because they appear in both df1 and df2.
This could be achieved via an anti_join:
dplyr::anti_join(df1, df2, by = "ID")
#> ID Value
#> 1 3 30
#> 2 4 40
#> 3 5 50
As I mentioned in the comments, stefan's answer is best. Here is an alternative with base R:
df1[!(df1$ID %in% df2$ID),]
ID Value
3 3 30
4 4 40
5 5 50
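Note that the base R subset keeps the original row names (3, 4, 5), whereas the desired output is numbered 1 to 3. A minimal sketch of resetting them, using the df1 and df2 from the question (df3 is the name the question proposes for the result):

```r
df1 <- data.frame(ID = c(1, 2, 3, 4, 5), Value = c(10, 20, 30, 40, 50))
df2 <- data.frame(ID = c(6, 7, 8, 1, 2), Value = c(60, 70, 80, 10, 20))

# Keep only the rows of df1 whose ID is not present in df2
df3 <- df1[!(df1$ID %in% df2$ID), ]

# Drop the inherited row names so the rows are numbered 1, 2, 3 again
rownames(df3) <- NULL
df3
```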
This question already has answers here:
data.frame Group By column [duplicate]
(4 answers)
Closed 6 years ago.
I have a panel data file (long format) and I need to convert it to cross-sectional data. That is, I don't just need a transformation to wide format; I need exactly one observation per individual that contains the mean of each variable.
Here's what I want to do: I have panel data (a number of observations for each individual) in a data frame, and I'm looking for an easy way in R to generate a new data frame that contains aggregated data for each individual, i.e. either the sum of all observations of each variable or their mean. It might also be interesting to get a measure of volatility.
For example I have a given data frame panel_data that contains panel data:
> individual <- c(1,1,2,2,3,3)
> var1 <- c(2,3,3,3,4,3)
> panel_data <- data.frame(individual,var1)
> panel_data
individual var1
1 1 2
2 1 3
3 2 3
4 2 3
5 3 4
6 3 3
The result should look like this:
> cross_data
individual var1
1 1 5
2 2 6
3 3 7
Now this is only an example. I need this in several variants, the most important one probably being the intra-individual mean of each variable.
There are ways to do this using base R or using the popular packages data.table or dplyr. Everyone has their own preference and mine is dplyr.
You can very easily perform a variety of operations to summarise your data per individual. With dplyr syntax, you first group_by individual to specify that operations should be performed on the groups defined by the variable "individual". You can then summarise each group using a function you specify.
Try the following:
library("dplyr")
panel_data %>%
  group_by(individual) %>%
  summarise(sum_var1 = sum(var1), mean_var1 = mean(var1))
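The question also mentions a measure of volatility. One possible sketch, assuming the per-individual standard deviation is an acceptable measure (the column name sd_var1 and the use of sd are assumptions, not something the question specifies):

```r
library(dplyr)

# The example panel data from the question
panel_data <- data.frame(individual = c(1, 1, 2, 2, 3, 3),
                         var1 = c(2, 3, 3, 3, 4, 3))

cross_data <- panel_data %>%
  group_by(individual) %>%
  summarise(sum_var1  = sum(var1),   # total per individual
            mean_var1 = mean(var1),  # intra-individual mean
            sd_var1   = sd(var1))    # spread within each individual
cross_data
```

Keep in mind that sd returns NA for an individual with a single observation.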
Do not be put off by the %>% notation; it is just a convenient shortcut for chaining operations:
x %>% f is equivalent to f(x)
x %>% f(a) is equivalent to f(x, a)
x %>% f(a) %>% g(b) is equivalent to g(f(x, a), b)
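These equivalences can be checked directly. A small sketch, with round and sum standing in for f and g (the vector x and the choice of functions are only illustrative):

```r
library(dplyr)  # dplyr makes the %>% operator available

x <- c(1.234, 5.678)

# x %>% f(a) is the same call as f(x, a)
same_one_step <- identical(x %>% round(1), round(x, 1))

# x %>% f(a) %>% g() is the same call as g(f(x, a))
same_two_steps <- identical(x %>% round(1) %>% sum(), sum(round(x, 1)))

c(same_one_step, same_two_steps)
```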
This question already has answers here:
Extract the maximum value within each group in a dataframe [duplicate]
(3 answers)
Closed 7 years ago.
I am searching for an efficient and fast way to do the following:
I have a data frame with, say, 2 variables, A and B, where the values for A can occur several times:
mat<-data.frame('VarA'=rep(seq(1,10),2),'VarB'=rnorm(20))
VarA VarB
1 0.95848233
2 -0.07477916
3 2.08189370
4 0.46523827
5 0.53500190
6 0.52605101
7 -0.69587974
8 -0.21772252
9 0.29429577
10 3.30514605
1 0.84938361
2 1.13650996
3 1.25143046
Now I want to get a vector giving me for every unique value of VarA
unique(mat$VarA)
the maximum of VarB conditional on VarA.
In the example here that would be
1 0.95848233
2 1.13650996
3 2.08189370
etc...
My data-frame is very big so I want to avoid the use of loops.
Try this:
library(dplyr)
mat %>%
  group_by(VarA) %>%
  summarise(max = max(VarB))
Try the data.table package:
library(data.table)
mat <- data.table(mat)
result <- mat[,max(VarB),VarA]
print(result)
Try this:
library(plyr)
ddply(mat, .(VarA), summarise, VarB = max(VarB))
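Since the question asks for a vector and wants to avoid loops, a base R sketch with tapply may also be worth considering. The seed is an assumption added only to make the random example reproducible, and max_by_A is an illustrative name:

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
mat <- data.frame(VarA = rep(seq(1, 10), 2), VarB = rnorm(20))

# Named vector: one maximum of VarB for each unique value of VarA
max_by_A <- tapply(mat$VarB, mat$VarA, max)
max_by_A
```

The names of the result are the unique values of VarA, so max_by_A["3"] looks up the maximum for VarA equal to 3.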
This question already has answers here:
Faster ways to calculate frequencies and cast from long to wide
(4 answers)
Closed 4 years ago.
Hi I have a dataframe like this
Start <- c("A")
End <- c("C")
Days <- c("Day1")
df2 <- data.frame(Start,End,Days)
I am trying to use dcast
df2 <- dcast(df2,Days ~ End,value.var="Days")
but it returns
Days C
1 Day1 Day1
My desired output is the count
Days C
1 Day1 1
What am I missing here? Kindly provide some inputs on this. Is there a better way to do this using dplyr?
We can create a column of 1s and then use dcast
dcast(transform(df2, i1=1), Days~End, value.var='i1')
# Days C
#1 Day1 1
Or another option is using the fun.aggregate
dcast(df2, Days~End, length)
# Days C
#1 Day1 1
Since the OP asked about dplyr: it has no fun.aggregate argument, so the equivalent follows the first method. For this one-row example, that means adding the count column by hand:
library(dplyr)
df2 %>%
  mutate(C = 1) %>%
  select(Days:C)
Hi, you are on the right track. When you cast your data frame, you need to supply a function that aggregates the values falling into each cell.
In this case, you want to count the occurrences in each group, so use the function length:
dcast(df2, Days ~ End, length) # or dcast(df2, Days ~ End, table)
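For comparison, the same kind of count table can be sketched in base R with xtabs. The three-row data frame below is a hypothetical extension of the question's one-row example, made up so that the counts are not all 1:

```r
# Hypothetical data: the question's single row plus two invented rows
df2 <- data.frame(Start = c("A", "A", "B"),
                  End   = c("C", "C", "D"),
                  Days  = c("Day1", "Day1", "Day2"))

# Cross-tabulate: how many rows fall into each Days/End combination
counts <- xtabs(~ Days + End, data = df2)

# Convert the contingency table into a wide data frame of counts
wide <- as.data.frame.matrix(counts)
wide
```

Here Days ends up in the row names rather than in a column, which differs from the dcast output.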
This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 7 years ago.
I have a data.frame that looks somewhat like this.
k <- data.frame(id = c(1,2,2,1,2,1,2,2,1,2), act = c('a','b','d','c','d','c','a','b','a','b'), var1 = 25:34, var2= 74:83)
I have to group the data into separate levels by the first 2 columns and compute the mean of the next 2 columns (var1 and var2). It should look like this
  id act varmean1 varmean2
1  1   a
2  1   c
3  2   a
4  2   b
5  2   d
The values of respective means are filled in varmean1 and varmean2.
My actual dataframe has 88 columns where I have to group the data into separate levels by the first 2 columns and find the respective means of the remaining. Please help me figure this out as soon as possible. Please try to use 'dplyr' package for the solution if possible. Thanks.
You have several options:
base R:
aggregate(. ~ id + act, k, mean)
or
aggregate(cbind(var1, var2) ~ id + act, k, mean)
The first option aggregates all the columns by id and act; the second aggregates only the columns you specify. In this case both give the same result, but it is good to know for when you have more columns and only want to aggregate some of them.
dplyr:
library(dplyr)
k %>%
  group_by(id, act) %>%
  summarise(across(everything(), mean))  # across() needs dplyr >= 1.0.0; it replaces the deprecated summarise_each(funs(mean))
If you want to specify the columns for which to calculate the mean, name them directly in summarise:
k %>%
  group_by(id, act) %>%
  summarise(var1mean = mean(var1), var2mean = mean(var2))
data.table:
library(data.table)
setDT(k)[, lapply(.SD, mean), by = .(id, act)]
If you want to specify the columns for which to calculate the mean, you can add .SDcols like:
setDT(k)[, lapply(.SD, mean), by = .(id, act), .SDcols=c("var1", "var2")]
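As a sanity check on any of the options above, one group can be verified by hand. A minimal base R sketch using the k from the question (res is an illustrative name): the group id = 1, act = 'a' consists of rows 1 and 9, so its means should be (25 + 33) / 2 = 29 for var1 and (74 + 82) / 2 = 78 for var2.

```r
k <- data.frame(id   = c(1, 2, 2, 1, 2, 1, 2, 2, 1, 2),
                act  = c('a', 'b', 'd', 'c', 'd', 'c', 'a', 'b', 'a', 'b'),
                var1 = 25:34,
                var2 = 74:83)

# Mean of every remaining column within each id/act combination
res <- aggregate(. ~ id + act, data = k, FUN = mean)

# Inspect the single group that can be checked by hand
res[res$id == 1 & res$act == "a", ]
```

The result has five rows, one for each id/act combination that actually occurs in k.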