Sum of data in a column based on categorical condition from another column [duplicate] - r

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 9 years ago.
Suppose I have a data frame like this:
set.seed(123)
df <- as.data.frame(cbind(y<-sample(c("A","B","C"),10,T), X<-sample(c(1,2,3),10,T)))
df <- df[order(df$V1),]
Is there a simple function to sum (or apply any FUN to) V2 by V1 and add the result to df as a new column, such that:
df$sum <- c(6,6,8,8,8,8,6,6,6,6)
df
I could write a function for that, but I have to do this frequently, so it would be better to know the simplest way to do it.

I agree with @mnel at least on his first point. I didn't see ave demonstrated in the answers he cited and I think it's the "simplest" base-R method. Using that data.frame(cbind( ...)) construction should be outlawed and teachers who demonstrate it should be stripped of their credentials.
set.seed(123)
df<-data.frame(y=sample( c("A","B","C"), 10, T),
X=sample(c (1,2,3), 10, T))
df<-df[order(df$y),] # that step is not necessary for success.
df
df$sum <- ave(df$X, df$y, FUN=sum)
df
y X sum
1 A 3 6
6 A 3 6
3 B 3 8
7 B 1 8
9 B 1 8
10 B 3 8
2 C 2 6
4 C 2 6
5 C 1 6
8 C 1 6
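As a small sketch of the same idea on hand-typed data (so the result does not depend on the random seed or R version), ave() returns one value per row, aligned with the original row order, and FUN can be any function. The data frame d and its values are made up for illustration.

```r
# Hypothetical data, typed by hand so the output is reproducible
d <- data.frame(y = c("B", "A", "B", "C", "A"),
                X = c(3, 1, 4, 5, 2))

# Group-wise sum, returned in the original row order (no sorting needed)
d$sum <- ave(d$X, d$y, FUN = sum)

# Any other function works the same way, e.g. the group mean
d$mean <- ave(d$X, d$y, FUN = mean)

d
```

Here the B rows get 7, the A rows get 3 and the single C row gets 5 in the sum column, even though the rows were never sorted by y.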

Related

Is there a good way to compare 2 data tables but compare the data from i to data of i+1 in second data table [duplicate]

This question already has answers here:
Remove duplicated rows
(10 answers)
Closed 2 years ago.
I have tried various functions including compare and all.equal but I am having difficulty finding a test to see if variables are the same.
For context, I have a data.frame which in some cases has a duplicate result. I have tried copying the data.frame so I can compare it with itself. I would like to remove the duplicates.
One approach I considered was to take row A from dataframe 1 and subtract it from row B of dataframe 2. If the difference was zero, I planned to remove one of them.
Is there an approach I can use to do this without copying my data?
Any help would be great, I'm new to R coding.
Suppose I had a data.frame named data:
data
Col1 Col2
A 1 3
B 2 7
C 2 7
D 2 8
E 4 9
F 5 12
I can use the duplicated function to identify duplicated rows and not select them:
data[!duplicated(data),]
Col1 Col2
A 1 3
B 2 7
D 2 8
E 4 9
F 5 12
I can also perform the same action on a single column:
data[!duplicated(data$Col1),]
Col1 Col2
A 1 3
B 2 7
E 4 9
F 5 12
Sample Data
data <- data.frame(Col1 = c(1,2,2,2,4,5), Col2 = c(3,7,7,8,9,12))
rownames(data) <- LETTERS[1:6]
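A short sketch of what duplicated() actually returns may help here, using the sample data above. The variable names dup, first_kept and last_kept are only illustrative; fromLast is a standard argument of duplicated().

```r
data <- data.frame(Col1 = c(1, 2, 2, 2, 4, 5), Col2 = c(3, 7, 7, 8, 9, 12))
rownames(data) <- LETTERS[1:6]

# Logical vector: TRUE marks a row that repeats an earlier row
dup <- duplicated(data)

# Keep the first occurrence of each row (row C is dropped)
first_kept <- data[!dup, ]

# Keep the last occurrence instead (row B is dropped, row C survives)
last_kept <- data[!duplicated(data, fromLast = TRUE), ]

rownames(first_kept)
rownames(last_kept)
```

Because duplicated() compares rows within a single data frame, there is no need to copy the data and compare the copy against the original.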

R join same row and calculate mean value [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 7 years ago.
I have a data frame that looks like this:
data<-data.frame(y=c(1,1,2,2,3,4,5,5),x=c(5,5,10,30,5,10,4,8))
y x
1 1 5
2 1 5
3 2 10
4 2 30
5 3 5
6 4 10
7 5 4
8 5 8
How can I merge the rows that have the same value in the y column and set the x column to the mean of their x values?
I would like something like this:
y x
1 1 5
2 2 20
3 3 5
4 4 10
7 5 6
I'm trying:
unique(data)
But it removes the values instead of doing the mean of same rows.
It is easy with dplyr. Like here:
library("dplyr")
data %>%
group_by(y) %>%
summarise(x=mean(x))
We can use aggregate
aggregate(x~y, data, mean)
Use plyr.
# Create dummy data.
nel = 30
df <- data.frame(x = round(5*runif(nel)), y= round(10*runif(nel)))
# Summarise means
require(plyr)
df$x <- as.factor(df$x)
res <- ddply(df, .(x), summarise, mu=mean(y))
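For reference, here is a minimal base-R sketch applied to the question's own data, taking the x values from the printed table in the question (10 and 30 for y = 2, 4 and 8 for y = 5). The names res and means are only illustrative.

```r
data <- data.frame(y = c(1, 1, 2, 2, 3, 4, 5, 5),
                   x = c(5, 5, 10, 30, 5, 10, 4, 8))

# One row per distinct y, with x replaced by its group mean
res <- aggregate(x ~ y, data = data, FUN = mean)
res

# tapply gives the same means as a vector named by the y values
means <- tapply(data$x, data$y, mean)
means
```

Both calls give means of 5, 20, 5, 10 and 6 for y = 1 to 5, which matches the result the question asks for.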

Adding group column to data frame [duplicate]

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to (programmatically) add a column that takes the value of count whenever count<3, and 3 otherwise, so that I get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but it doesn't work on arrays. How can I do it?
In this particular case you can just use pmin (as I stated in the comments above):
dx$group <- pmin(dx$count, 3)
In general your if/else construction does not work on vectors, but you can use the function ifelse. It takes three arguments: First the condition, then the result if the condition is met and finally the result if the condition is not met. For your example you would write the following:
dx$group <- ifelse(dx$count < 3, dx$count, 3)
Note that in your example the pmin solution is better. Just mentioning the ifelse solution for completeness.
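The two approaches can be checked side by side on the question's data frame. This is only a sketch; the names by_pmin and by_ifelse are introduced here for illustration.

```r
dx <- data.frame(id = letters[1:4], count = 1:4)

# Element-wise minimum of count and the cap of 3
by_pmin <- pmin(dx$count, 3)

# Same result with a vectorised condition
by_ifelse <- ifelse(dx$count < 3, dx$count, 3)

# if() expects a single TRUE or FALSE, which is why it cannot be used
# on a whole column; pmin() and ifelse() work element by element
dx$group <- by_pmin
dx
```

Both vectors come out as 1, 2, 3, 3, so the choice between them is mostly about readability.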

Return df with a columns values that occur more than once [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to keep only the rows whose value in column B occurs more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?
Here is a dplyr solution (using MrFlick's data.frame):
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[,if(.N >1) .SD,by=b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
(or) you can couple .I (another special variable - see ?data.table) which gives the corresponding row number in df, along with .N as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
(or) have a look at @mnel's answer for another variation (using yet another special variable, .SD).
Using table() isn't the best because then you have to rejoin it to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example
dd<-data.frame(
a=1:10,
b=c(1,1,2,3,4,4,4,5,6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
a b
1 1 1
2 2 1
5 5 4
6 6 4
7 7 4
9 9 6
10 10 6
Here, for each level of b, it counts the length of b, which is really just the number of b's and returns that back to the appropriate row for each value. Then we use that to subset.
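If you do want to go through table(), as the question asks, one possible sketch is to pull the repeated values out of the table first and then match rows against them. The names counts, repeated, by_table and by_ave are made up for this example.

```r
dd <- data.frame(a = 1:10,
                 b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6))

# Count occurrences of each value of b
counts <- table(dd$b)

# Values of b seen more than once (table names are character strings)
repeated <- names(counts)[counts > 1]

# Keep the rows whose b is one of those values
by_table <- dd[as.character(dd$b) %in% repeated, ]

# Same rows as the ave() approach
by_ave <- dd[ave(dd$b, dd$b, FUN = length) > 1, ]

by_table
```

The extra step of matching back against the original column is the "rejoin" mentioned above, which ave() avoids by returning one count per row.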
