sum by group in a data.frame - r

I'm trying to get the sum of a numeric variable for each level of a categorical variable in a data frame. I've tried tapply, but it doesn't accept a whole data.frame.
Here is a working example with some data that looks like this:
> set.seed(667)
> df <- data.frame(a = sample(c("Group A","Group B","Group C",NA), 10, rep = TRUE),
b = sample(c(1, 2, 3, 4, 5, 6), 10, rep=TRUE),
c = sample(c(11, 12, 13, 14, 15, 16), 10, rep=TRUE))
> df
a b c
1 Group A 4 12
2 Group B 6 12
3 <NA> 4 14
4 Group C 1 16
5 <NA> 2 14
6 <NA> 3 13
7 Group C 4 13
8 <NA> 6 15
9 Group B 3 16
10 Group B 5 16
Using tapply, I can get one vector at a time:
> tapply(df$b,df$a,sum)
Group A Group B Group C
4 14 5
but I am more interested in getting something like this:
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
Any help would be appreciated. Thanks.

Use aggregate instead:
aggregate(df[ , c("b","c")], df['a'], FUN=sum)
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
The second argument to aggregate (by) must be a list of grouping vectors, so passing the bare vector df$a will error out. df['a'] works because a one-column data frame is a list. aggregate then applies the function to each column of the first argument within each group.
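For what it's worth, aggregate also has a formula interface that sidesteps the list question altogether. A minimal sketch, assuming the data frame is rebuilt by hand from the table printed in the question (so it does not depend on the random seed):

```r
# Rebuild the example data from the printed table in the question
df <- data.frame(
  a = c("Group A", "Group B", NA, "Group C", NA, NA,
        "Group C", NA, "Group B", "Group B"),
  b = c(4, 6, 4, 1, 2, 3, 4, 6, 3, 5),
  c = c(12, 12, 14, 16, 14, 13, 13, 15, 16, 16)
)

# Sum b and c within each level of a; cbind() names the columns to aggregate
res <- aggregate(cbind(b, c) ~ a, data = df, FUN = sum)
print(res)
```

With the formula interface, rows where a is NA are dropped by the default na.action, which matches the three-group result shown above.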

Related

R Creating new columns using vector contains name of variables

I have a data frame and a vector containing variable names. I want to create a new variable holding the row sums of the variables named in the vector, and I want the name of the new variable to be the concatenation of those variable names.
For example, I have this data:
> data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
and this vector
>Vec
"A" , "C" , "D"
The result I want is the sum of variables A, C and D, stored in a variable named ACD.
Here's the result I want:
> data
Name A B C D ACD E
r1 1 5 12 21 34 15
r2 2 4 7 10 19 9
r3 5 15 6 9 20 6
r4 7 8 0 7 14 18
I tried this :
data <- cbind(data , as.data.frame(rowSums(data[,Vec]) ))
But I don't know how to set the name.
Here's the result I got:
>data
Name A B C D E rowSums(data[,Vec])
r1 1 5 12 21 15 34
r2 2 4 7 10 9 19
r3 5 15 6 9 6 20
r4 7 8 0 7 18 14
Note that this is just a small example to explain what I want to do.
I want to assign the result back to my data (so that it contains the new variable), as in my command above.
Edit 1: in my real program I don't know the elements of the vector (the variable names) in advance, so I cannot do data$ACD <- cbind(data , as.data.frame(rowSums(data[,Vec]) )) as suggested by Pax. I have a for loop that generates the vectors, and on each iteration I create a variable holding the sum of the variables in the vector, so I don't know how to set the name without knowing the elements of the vector.
Please tell me if you need any more clarification or information.
Thank you.
It's not a one line solution but you can set the name on the subsequent line:
data <- data.frame(A = c(1, 2, 5, 7),
B = c(5, 4, 15, 8),
C = c(12, 7, 6, 0),
D = c(21, 10, 9, 7),
E = c(15, 9, 6, 18))
Vec <- c("A" , "C" , "D")
data <- cbind(data, rowSums(data[,Vec]))
# Add name
names(data)[ncol(data)] <- paste(Vec, collapse="")
# A B C D E ACD
# 1 1 5 12 21 15 34
# 2 2 4 7 10 9 19
# 3 5 15 6 9 6 20
# 4 7 8 0 7 18 14
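If the name has to be built inside a loop, a variation is to create the column and set its name in a single assignment with [[<-. A small sketch on the same example data; new_name is just an illustrative variable name, not something from the question:

```r
data <- data.frame(A = c(1, 2, 5, 7),
                   B = c(5, 4, 15, 8),
                   C = c(12, 7, 6, 0),
                   D = c(21, 10, 9, 7),
                   E = c(15, 9, 6, 18))
Vec <- c("A", "C", "D")

# Build the column name from the vector, then assign the row sums under it
new_name <- paste(Vec, collapse = "")   # hypothetical helper variable
data[[new_name]] <- rowSums(data[Vec])
print(data)
```

Because the name is computed from Vec, the same two lines work unchanged for whatever vector the loop produces.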
Here is an option with the janitor package. adorn_totals appends a totals row or column to a data.frame. Here the name argument sets the name of the new column, and all_of(Vec) at the end selects the columns to total.
library(janitor)
adorn_totals(data, "col", fill = NA, na.rm = TRUE, name = paste(Vec, collapse = ""), all_of(Vec))
Output
A B C D E ACD
1 5 12 21 15 34
2 4 7 10 9 19
5 15 6 9 6 20
7 8 0 7 18 14

Expand data frame and add rowsums from another dataframe

I am trying to find a faster way of accomplishing the following code since my actual dataset is very large. I would like to get rid of the for loop altogether. I am trying to duplicate each row in xdf into a new data frame based on the number of columns in values. Then, next to each entry in the new dataset, show the row sums from column 1 in values up to the column j.
library(tibble)  # data_frame() is deprecated; tibble() replaces it
xdf <- tibble(
  x = c('a', 'b', 'c'),
  y = c(4, 5, 6)
)
values <- tibble(
  col_1 = c(5, 9, 1),
  col_2 = c(4, 7, 6),
  col_3 = c(1, 5, 2),
  col_4 = c(7, 8, 5)
)
for (j in seq(ncol(values))) {
  if (j == 1) {
    Temp <- cbind(xdf, z = rowSums(values[1:j]))
  } else {
    Temp <- rbind(Temp, cbind(xdf, z = rowSums(values[1:j])))
  }
}
print(Temp)
The output should be:
x y z
1 a 4 5
2 b 5 9
3 c 6 1
4 a 4 9
5 b 5 16
6 c 6 7
7 a 4 10
8 b 5 21
9 c 6 9
10 a 4 17
11 b 5 29
12 c 6 14
Is there a shorter way to accomplish this?
This is the closest answer that I could get on SO.
How to expand data frame based on values?
I am new to R, so sorry for the longwinded code.
Here's one base R option :
Repeat the rows of xdf once for each column in values, then compute rowSums over the first 1, 2, ..., ncol(values) columns and add the concatenated results as a new column of the final data frame.
newdf <- xdf[rep(seq(nrow(xdf)), ncol(values)), ]
newdf$z <- c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x])))
newdf
# A tibble: 12 x 3
# x y z
# <chr> <dbl> <dbl>
# 1 a 4 5
# 2 b 5 9
# 3 c 6 1
# 4 a 4 9
# 5 b 5 16
# 6 c 6 7
# 7 a 4 10
# 8 b 5 21
# 9 c 6 9
#10 a 4 17
#11 b 5 29
#12 c 6 14
A concise one-liner, as suggested by @sindri_baldur, doesn't require repeating the rows explicitly:
cbind(xdf, z = c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x]))))
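Since each rowSums(values[1:x]) call re-adds columns that were already summed in the previous step, another option is to accumulate the column sums once with Reduce. This is only a sketch and has not been benchmarked on large data; it assumes plain data.frames rather than tibbles so that it needs no packages:

```r
xdf <- data.frame(x = c("a", "b", "c"), y = c(4, 5, 6))
values <- data.frame(col_1 = c(5, 9, 1),
                     col_2 = c(4, 7, 6),
                     col_3 = c(1, 5, 2),
                     col_4 = c(7, 8, 5))

# Running sums across columns: element j is col_1 + ... + col_j
running <- Reduce(`+`, values, accumulate = TRUE)

# Stack one copy of xdf per column of values, then attach the running sums
newdf <- xdf[rep(seq_len(nrow(xdf)), times = ncol(values)), ]
rownames(newdf) <- NULL
newdf$z <- unlist(running)
print(newdf)
```

Each column of values is added exactly once here, whereas the sapply version sums the first x columns again for every x.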

subtracting the greater column from smaller columns in a dataframe in R

I have the input below and I would like to subtract one column from the other, always subtracting the lower value from the higher one, because I don't want negative results.
Sometimes the higher value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), class = "data.frame", row.names = c(NA, -6L))
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap the difference in abs, which returns its absolute value:
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
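An equivalent way to spell out "highest minus lowest" is with pmax and pmin, which take the element-wise maximum and minimum of the two columns. A short base R sketch on the same data:

```r
df <- data.frame(PaternalOrigin = c(16, 3, 11, 1, 1, 3),
                 MaternalOrigin = c(20, 6, 0, 3, 4, 11))

# Row-wise larger value minus row-wise smaller value, so never negative
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) -
                       pmin(PaternalOrigin, MaternalOrigin))
print(df)
```

For two columns this gives the same numbers as abs; the pmax/pmin form also extends to more than two columns, where it gives the row-wise range.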

How to calculate value for an observation by group?

I have a data frame like so:
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
mydf
group score
1 a 10
2 a 9
3 a 2
4 a 3
5 b 1
6 b 10
7 b 1
8 b 10
9 c 3
10 c 7
11 c 1
12 c 3
I can calculate the mean of each group like so:
> by(mydf[,c("score")], mydf$group, mean)
mydf$group: a
[1] 6
-------------------------------------------------------------------
mydf$group: b
[1] 5.5
-------------------------------------------------------------------
mydf$group: c
[1] 3.5
But what I wish to do is create a new column, say called residual, which contains each score's deviation from the mean of its group. It seems like there should be a way to use one of the apply functions to do this, but for some reason I can't see it.
I would want my end result to look like so:
mydf
group score residual
1 a 10 4
2 a 9 3
3 a 2 -4
4 a 3 -3
5 b 1 -4.5
6 b 10 4.5
7 b 1 -4.5
8 b 10 4.5
9 c 3 -.5
10 c 7 3.5
11 c 1 -2.5
12 c 3 -.5
Any ideas or pointers in the right direction are appreciated.
How about:
mydf$score - tapply(mydf$score, mydf$group, mean)[as.character(mydf$group)]
tapply works the same as by but with nicer output. The [as.character(mydf$group)] index subsets and replicates tapply's output so that it aligns with mydf$group.
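Base R also has ave, which returns the group mean (its default FUN) already aligned with the rows, so no indexing is needed. A sketch using the scores printed in the question, since the sample() call there has no seed:

```r
# Scores copied from the table in the question
mydf <- data.frame(group = rep(c("a", "b", "c"), each = 4),
                   score = c(10, 9, 2, 3, 1, 10, 1, 10, 3, 7, 1, 3))

# ave() repeats each group's mean once per row of that group
mydf$residual <- mydf$score - ave(mydf$score, mydf$group)
print(mydf)
```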
library(dplyr)
mydf %>% group_by(group) %>% mutate(residual = score - mean(score))
I take the data, I group by group, then I add a column (using mutate) which is the difference between the variable score and the mean of that variable in each group.
library(hash)
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
byResult <- by(mydf[,c("score")], mydf$group, mean)
h <- hash(keys= names(byResult), values =byResult)
# residual = score - group mean (row[1] is the group, row[2] the score)
residualsVar <- apply(mydf,1,function(row){
as.numeric(row[2])-as.vector(values(h,row[1]))
})
df <- cbind(mydf,residualsVar)

length of column, split by group

Hi I am using R to count the number of datapoints in a column, split by groups like so:
Type Value
---- -----
A 1
A 6
A 4
A 6
B 8
B 10
B 3
B 8
C 7
C 4
I want to plot 3 bars: how many A's, how many B's and how many C's. The values in the Value column are not important.
How do I do this?
If my data were in different columns I could obviously use
sapply(list(col1,col2,col3),length)
but I don't want to transform my data.
Thanks
If the Value column doesn't matter, then ggplot2 can help you in that regard
library(ggplot2)
set.seed(9001)
df <- data.frame(Type = c(rep("A", 4), rep("B", 4), rep("C", 2)), Value = sample(1:20, 10))
df
## Type Value
## 1 A 5
## 2 A 19
## 3 A 4
## 4 A 12
## 5 B 1
## 6 B 18
## 7 B 13
## 8 B 17
## 9 C 8
## 10 C 14
ggplot(df) + geom_bar(aes(x = Type))
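If you would rather stay in base R, table gives the counts per group directly and barplot draws them. A sketch using the Type and Value columns as listed in the question:

```r
df <- data.frame(Type  = c(rep("A", 4), rep("B", 4), rep("C", 2)),
                 Value = c(1, 6, 4, 6, 8, 10, 3, 8, 7, 4))

# Number of rows per Type; the Value column is ignored
counts <- table(df$Type)
print(counts)

# One bar per Type, with height equal to the count
barplot(counts, xlab = "Type", ylab = "Count")
```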
