R - Compute a summary column [duplicate]

This question already has answers here:
Joining aggregated values back to the original data frame [duplicate]
(5 answers)
Closed 6 years ago.
I am trying to compute an additional column in my dataframe that contains some summary data (mean, min, max). Starting from this dataframe:
Group Value
A 15
A 5
B 4
B 2
C 25
C 15
I would like to calculate means for every group:
Group Mean
A 10
B 3
C 20
But I would like to add a column to the original dataframe that repeats the group mean on every row of the same group, like this:
Group Value Mean
A 15 10
A 5 10
B 4 3
B 2 3
C 25 20
C 15 20
I managed to obtain this result by using aggregate first (to create a temporary dataframe) and then merging the original dataframe with the temporary one, using "Group" as the merging variable.
I am sure there is an easier and faster way to do this. Of note, I would like to be able to do this with base functions (e.g. no dplyr, reshape, etc.) if possible. Thank you!

In base R, this can easily be done with ave:
df$Mean <- with(df, ave(Value, Group))
df
# Group Value Mean
#1 A 15 10
#2 A 5 10
#3 B 4 3
#4 B 2 3
#5 C 25 20
#6 C 15 20
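Since the question also asks about min and max: ave accepts any summary function through its FUN argument (the default is mean), so the same pattern extends directly. A short sketch:
df$Min <- with(df, ave(Value, Group, FUN = min))
df$Max <- with(df, ave(Value, Group, FUN = max))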

Is there a good way to compare 2 data tables but compare the data from i to data of i+1 in second data table [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 2 years ago.
I have tried various functions, including compare and all.equal, but I am having difficulty finding a test to see if variables are the same.
For context, I have a data.frame which in some cases contains duplicate rows. I have tried copying the data.frame so I can compare it with itself. I would like to remove the duplicates.
One approach I considered was to take row A from dataframe 1 and subtract it from row B of dataframe 2; if the result is zero, I planned to remove one of them.
Is there an approach I can use to do this without copying my data?
Any help would be great, I'm new to R coding.
Suppose I had a data.frame named data:
data
Col1 Col2
A 1 3
B 2 7
C 2 7
D 2 8
E 4 9
F 5 12
I can use the duplicated function to identify duplicated rows and exclude them:
data[!duplicated(data),]
Col1 Col2
A 1 3
B 2 7
D 2 8
E 4 9
F 5 12
I can also perform the same action on a single column:
data[!duplicated(data$Col1),]
Col1 Col2
A 1 3
B 2 7
E 4 9
F 5 12
Sample Data
data <- data.frame(Col1 = c(1,2,2,2,4,5), Col2 = c(3,7,7,8,9,12))
rownames(data) <- LETTERS[1:6]
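Two small variations may also be useful here: unique() gives the same result as indexing with !duplicated(), and the fromLast argument keeps the last occurrence of each duplicate instead of the first. A sketch:
unique(data)                               # same result as data[!duplicated(data), ]
data[!duplicated(data, fromLast = TRUE), ] # keep the last occurrence of each duplicate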

R: how to combine the same rows as columns A and B of three dataframes, and add the corresponding C column [duplicate]

This question already has answers here:
How do you pivot data from a list of data frames in R?
(3 answers)
Closed 5 years ago.
I have some dataframes. I want to merge them when their first two columns are identical and add up the corresponding third column.
For example, I have three dataframes as follows:
> dump1
a b c
q 12 2
w 23 3
e 34 4
> dump2
a b c
q 12 1
w 23 1
s 3 1
> dump3
a b c
q 2 6
w 23 7
s 3 8
d 2 9
Now, I want to get the merged dataframe:
> dump5
a b c
d 2 9
q 2 6
s 3 9
q 12 3
w 23 11
e 34 4
The data is very big, so I would like a fast way to do this.
How can I do it? Thanks in advance.
We place the datasets in a list, rbind them with rbindlist from data.table and, grouping by 'a' and 'b', get the sum of 'c':
library(data.table)
rbindlist(list(dump1, dump2, dump3))[, .(c = sum(c)), .(a, b)]
If there are many datasets in the global environment whose names start with dump followed by numbers, then instead of specifying the object names individually we can use ls with a pattern to get the names and mget to fetch their values as a list:
rbindlist(mget(ls(pattern = "dump\\d+")))[, .(c= sum(c)), .(a, b)]
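For completeness, the same aggregation can be sketched in base R; with very large data the data.table approach above will likely be faster:
combined <- do.call(rbind, list(dump1, dump2, dump3))
aggregate(c ~ a + b, data = combined, FUN = sum)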

R Subset using first and last column names of interest [duplicate]

This question already has answers here:
refer to range of columns by name in R
(6 answers)
Closed 6 years ago.
> df
a b c d e
1 1 4 7 10 13
2 2 5 8 11 14
3 3 6 9 12 15
To subset the columns b, c, d we can use df[, 2:4] or df[, c("b", "c", "d")]. However, I am looking for a solution that fetches the columns b, c, d using something like df[, b:d]. In other words, I want to use just the first and last column names of interest to subset the data. I have been looking for a solution to this but have been unsuccessful; all the examples I have seen to date refer to each specific column name while subsetting.
It's also simple in base R, e.g.:
subset(df, select=b:d)
Or roll your own:
df[do.call(seq, as.list(match(c("b","d"), names(df))) )]
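The same idea, broken into steps for readability (idx is just a temporary name used here for illustration):
idx <- match(c("b", "d"), names(df))  # positions of the first and last columns of interest
df[seq(idx[1], idx[2])]               # all columns between them, inclusive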
If you are open to using dplyr:
dplyr::select(df, b:d)
b c d
1 4 7 10
2 5 8 11
3 6 9 12
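For reproducibility, the example data frame can be constructed like this (values taken from the table in the question):
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)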

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example:
If I use a dataset with 10 observations and 2 variables and want to group every 3 rows, I want to add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3: every block of three rows gets the same serial number.
A quick and dirty way to check whether the final group is incomplete is to look at the remainder when the number of rows is divided by the group size: nrow(df) %% 3 (change the divisor to your group size).
Assuming your data is df, you can do:
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
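A minimal worked example; the sample values appear to come from the first 10 rows of the built-in cars dataset, which is assumed here:
df <- head(cars, 10)
df$newcol <- rep(1:ceiling(nrow(df) / 3), each = 3)[1:nrow(df)]
df$newcol
# [1] 1 1 1 2 2 2 3 3 3 4
nrow(df) %% 3  # 1, so the last group contains only one observation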

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which, from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)
newid <- c()
newcount <- c()
newgroup <- c()
for (i in 1:length(unique(df$id))) {
  newid[i] <- unique(df$id)[i]
  newcount[i] <- sum(df[df$id == unique(df$id)[i], 2][1:2])
  newgroup[i] <- as.character(df$group[df$id == newid[i]][1])
}
newdf <- data.frame(newid, newcount, newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions
Can I create a dataframe directly inside a for loop, or should I stick to creating vectors that I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table:
library(data.table)
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows for each group, then sum the counts.
You can try using a self-defined function with aggregate:
sum1sttwo <- function(x) {
  return(x[1] + x[2])
}
aggregate(count~id+group, data=df,sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large. One of the main disadvantages of base R is that data.frame operations are slow. However, if you just need to aggregate a very simple/small data set, the aggregate function in base R serves its purpose.
library(plyr)
# Keep the first 2 rows for each group and id
df2 <- ddply(df, c("id", "group"), function(x) x$count[1:2])
# Aggregate by group and id
df3 <- ddply(df2, c("id", "group"), summarize, count = V1 + V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
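To address the for-loop versus apply question directly, here is a sketch of how the original loop could be replaced with split() and lapply() in base R; the data.table and dplyr answers above will likely be faster on large data:
# split the data frame by id, summarise each piece, then bind the pieces back together
newdf <- do.call(rbind, lapply(split(df, df$id), function(d) {
  data.frame(newid = d$id[1],
             newcount = sum(d$count[1:2]),
             newgroup = as.character(d$group[1]))
}))
newdf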
