Conditional sum in R in different columns

I want to sum the numbers in column B, grouped by the values in column A.
For example:
Column A : 2001 2002 2002 2002 2003 2003
Column B: 1 2 3 4 5 6
I want to add a column C that sums up B based on A. My desired result is:
Column A : 2001 2002 2002 2002 2003 2003
Column B: 1 2 3 4 5 6
Column C: 1 9 9 9 11 11 (for example, 9 = 2+3+4)
I have searched a lot but have no clue where to begin. Thanks in advance for any help!

We can use mutate from dplyr after grouping by 'A':
library(dplyr)
df1 %>%
  group_by(A) %>%
  mutate(C = sum(B))
Or with ave from base R
df1$C <- with(df1, ave(B, A, FUN = sum))
An efficient option is data.table
library(data.table)
setDT(df1)[, C := sum(B), by = A]
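As a minimal base R sketch of the ave approach, assuming the question's data sits in a data frame named df1 with columns A and B (the name df1 comes from the answer; how the data frame is built here is an assumption):

```r
# Sample data from the question
df1 <- data.frame(
  A = c(2001, 2002, 2002, 2002, 2003, 2003),
  B = c(1, 2, 3, 4, 5, 6)
)

# ave() applies FUN within each group of A and returns a vector
# of the same length as B, so it can be assigned as a new column
df1$C <- ave(df1$B, df1$A, FUN = sum)

print(df1)
```

Each row of C holds the total of B for that row's value of A: 1 for 2001, 9 for 2002 and 11 for 2003.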


Sort data.frame or data.table using vector of column names

I have a data.frame (a data.table in fact) that I need to sort by multiple columns. The names of columns to sort by are in a vector. How can I do it? E.g.
DF <- data.frame(A= 5:1, B= 11:15, C= c(3, 3, 2, 2, 1))
DF
A B C
5 11 3
4 12 3
3 13 2
2 14 2
1 15 1
sortby <- c('C', 'A')
DF[order(sortby),] ## How to do this?
The desired output is the following but using the sortby vector as input.
DF[with(DF, order(C, A)),]
A B C
1 15 1
2 14 2
3 13 2
4 12 3
5 11 3
(Solutions for data.table are preferable.)
EDIT: I'd rather avoid importing additional packages provided that base R or data.table don't require too much coding.
With data.table:
setorderv(DF, sortby)
which gives:
> DF
A B C
1: 1 15 1
2: 2 14 2
3: 3 13 2
4: 4 12 3
5: 5 11 3
For completeness, with setorder:
setorder(DF, C, A)
The advantage of using setorder/setorderv is that the data is reordered by reference, which is very fast and memory efficient. Both functions work on data.tables as well as on data.frames.
If you want to combine ascending and descending ordering, you can use the order-parameter of setorderv:
setorderv(DF, sortby, order = c(1L, -1L))
which subsequently gives:
> DF
A B C
1: 1 15 1
2: 3 13 2
3: 2 14 2
4: 5 11 3
5: 4 12 3
With setorder you can achieve the same with:
setorder(DF, C, -A)
Using dplyr, you can use arrange_at, which accepts string column names:
library(dplyr)
DF %>% arrange_at(sortby)
# A B C
#1 1 15 1
#2 2 14 2
#3 3 13 2
#4 4 12 3
#5 5 11 3
Or, with dplyr 1.0.0 or later:
DF %>% arrange(across(all_of(sortby)))
In base R, we can use
DF[do.call(order, DF[sortby]), ]
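The base R line above can also handle mixed ascending and descending order. A sketch, assuming a hypothetical logical vector desc that runs parallel to sortby (it does not appear in the question); it relies on order accepting a vector for decreasing when method = "radix":

```r
DF <- data.frame(A = 5:1, B = 11:15, C = c(3, 3, 2, 2, 1))
sortby <- c("C", "A")
desc <- c(FALSE, TRUE)  # hypothetical: C ascending, A descending

# Pass the sort columns as positional arguments to order()
keys <- unname(as.list(DF[sortby]))
idx <- do.call(order, c(keys, list(decreasing = desc, method = "radix")))

sorted <- DF[idx, ]
print(sorted)
```

Unlike setorderv, this returns a reordered copy and leaves DF itself unchanged.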
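Also possible with dplyr. Since get() accepts only a single name, convert the column names to symbols and splice them in:
DF %>%
  arrange(!!!syms(sortby))
But Ronak's answer is more elegant.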

R - counting with NA in dataframe

Let's say that I have this data frame in R:
df <- read.table(text="
id a b c
1 42 3 2 NA
2 42 NA 6 NA
3 42 1 NA 7", header=TRUE)
I'd like to sum the columns a, b and c into a new column, so the result should look like this:
id a b c d
1 42 3 2 NA 5
2 42 NA 6 NA 6
3 42 1 NA 7 8
My code below doesn't work because of the NA values. Please note that I have to choose which columns to sum, since my real data frame has some columns that I don't want to include.
df %>%
  mutate(d = a + b + c)
You can use rowSums for this; it has an na.rm parameter to drop NA values.
df %>% mutate(d = rowSums(tibble(a, b, c), na.rm = TRUE))
Or without dplyr, using just base R:
df$d <- rowSums(subset(df, select = c(a, b, c)), na.rm = TRUE)
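A small base R sketch of the rowSums approach on the question's data, with one caveat made explicit. The vector cols and the column d_strict are hypothetical names introduced for illustration:

```r
df <- data.frame(
  id = c(42, 42, 42),
  a  = c(3, NA, 1),
  b  = c(2, 6, NA),
  c  = c(NA, NA, 7)
)
cols <- c("a", "b", "c")  # hypothetical: names of the columns to add up

# NA values are dropped before summing
df$d <- rowSums(df[cols], na.rm = TRUE)

# Caveat: a row where every selected column is NA sums to 0.
# To get NA for such rows instead, check how many values are present.
all_missing <- rowSums(!is.na(df[cols])) == 0
df$d_strict <- ifelse(all_missing, NA, df$d)

print(df)
```

Selecting the columns through a character vector keeps the choice of columns in one place, which may help when the real data frame has columns that should be left out.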

R Subset using first and last column names of interest

> df
a b c d e
1 1 4 7 10 13
2 2 5 8 11 14
3 3 6 9 12 15
To subset the columns b, c, d we can use df[,2:4] or df[,c("b", "c", "d")]. However, I am looking for a solution that fetches the columns b, c, d using something like df[,b:d]. In other words, I want to use only the first and last column names of interest to subset the data. I have been looking for a solution to this but have been unsuccessful. All the examples I have seen so far name every single column when subsetting.
It's also simple in base R, e.g.:
subset(df, select=b:d)
Or roll your own:
df[do.call(seq, as.list(match(c("b","d"), names(df))) )]
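The "roll your own" line can be wrapped in a small helper for readability. This is only a sketch; cols_between is a hypothetical name, not an existing R function:

```r
df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 13:15)

# Hypothetical helper: select all columns from `first` to `last` by name
cols_between <- function(data, first, last) {
  pos <- match(c(first, last), names(data))
  if (anyNA(pos)) stop("column name not found")
  data[seq(pos[1], pos[2])]
}

res <- cols_between(df, "b", "d")
print(res)
```

The explicit check on match avoids a confusing error from seq when one of the names is misspelled.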
If you are open to using dplyr:
dplyr::select(df, b:d)
b c d
1 4 7 10
2 5 8 11
3 6 9 12

Loop through columns and apply ddply

My data frame looks like this:
Stage Var1 var2 Var1 var2
A 1 11 9 12
A 2 NA 3 13
A NA NA 2 10
B 4 14 1 4
B NA NA 4 2
B 6 16 6 8
B 7 17 100 9
C 8 NA 4 6
C 9 19 34 12
C 10 NA 5 18
C 1 0 6 3
I would like to split the data frame by group using ddply and apply mean() to each group, then repeat this for all the columns. Hence I am trying something like this:
for(i in names(NewInput)){
NewInput[[i]] <- ddply(NewInput , "Model_Stage", function(x) {
mean.Cycle2 <- mean(x$NewInput[[i]])
})
}
The above code works fine without the for loop (i.e., ddply works fine with one variable). However, when I loop over the columns I get several warnings:
In loop_apply(n, do.ply):argument is not numeric or logical: returning NA
Question:
-> How to loop through ddply over all the variables using for loop?
-> Is it possible to use apply()?
Thank you.
-Chris
You can try
library(plyr)
ddply(df1, .(Stage), colwise(mean, na.rm=TRUE))
Other options include
library(dplyr)
# summarise_each() and funs() are deprecated; across() needs dplyr >= 1.0.0
df1 %>%
  group_by(Stage) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
Or
library(data.table)
setDT(df1)[, lapply(.SD, mean, na.rm=TRUE), Stage]
Or using base R
aggregate(.~Stage, df1, FUN=mean, na.rm=TRUE, na.action=NULL)
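The warnings in the question most likely come from x$NewInput[[i]]: x has no column named NewInput, so the expression evaluates to NULL, and mean(NULL) returns NA with exactly that warning. Below is a base R sketch of both a one-call solution and an explicit loop over columns. The data is a hypothetical stand-in with distinct column names x and y, since the question's table repeats its column names:

```r
# Hypothetical stand-in for the question's data, with distinct column names
df1 <- data.frame(
  Stage = c("A", "A", "A", "B", "B"),
  x = c(1, 2, NA, 4, 6),
  y = c(11, NA, NA, 14, 16)
)

# One call handles every column; na.pass keeps rows with NA so that
# na.rm = TRUE can drop the missing values column by column
by_formula <- aggregate(. ~ Stage, df1, FUN = mean, na.rm = TRUE,
                        na.action = na.pass)

# Explicit loop over columns, closer to what the question attempted
vars <- setdiff(names(df1), "Stage")
by_loop <- sapply(df1[vars], function(col) {
  tapply(col, df1$Stage, mean, na.rm = TRUE)
})

print(by_formula)
print(by_loop)
```

The loop version returns a matrix with one row per stage and one column per variable, while aggregate returns a data frame with Stage as its first column.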

Calculate row sum value in R

Hi, I am new to R and would like some advice on how to calculate sums in a data frame.
year value
Row 1 2001 10
Row 2 2001 20
Row 3 2002 15
Row 4 2002 NA
Row 5 2003 5
How can I use R to return the total sum value by year? Many thanks!
year sum value
Row 1 2001 30
Row 2 2002 15
Row 3 2003 5
There are lots of ways to do that.
One of them is using the function aggregate like this:
year <- c(2001,2001,2002,2002,2003)
value <- c(10,20,15,NA,5)
mydf<-data.frame(year,value)
mytable <- aggregate(mydf$value, by=list(year), FUN=sum, na.rm=TRUE)
colnames(mytable) <- c('Year','sum_values')
> mytable
Year sum_values
1 2001 30
2 2002 15
3 2003 5
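aggregate also has a formula interface, which keeps the column names and needs no renaming step. A short sketch on the same data:

```r
year <- c(2001, 2001, 2002, 2002, 2003)
value <- c(10, 20, 15, NA, 5)
mydf <- data.frame(year, value)

# The formula method drops rows with NA by default (na.action = na.omit),
# so na.rm is not needed here
res <- aggregate(value ~ year, data = mydf, FUN = sum)
print(res)
```

One difference worth knowing: with the formula interface, a year whose values are all NA would be missing from the result, whereas the by = form with na.rm = TRUE would report 0 for it.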
There is also rowsum, which is quite efficient
with(mydf, rowsum(value, year, na.rm=TRUE))
# [,1]
# 2001 30
# 2002 15
# 2003 5
Or tapply
with(mydf, tapply(value, year, sum, na.rm=TRUE))
# 2001 2002 2003
# 30 15 5
Or as.data.frame(xtabs(...))
as.data.frame(xtabs(mydf[2:1]))
# year Freq
# 1 2001 30
# 2 2002 15
# 3 2003 5
LyzandeR has provided a working answer in base R. If you want to use dplyr, which is a great data management tool, you could do:
library(dplyr)
year <- c(2001, 2001, 2002, 2002, 2003)
value <- c(10, 20, 15, NA, 5)
mydf <- data.frame(year, value)
mydf %>%
  group_by(year) %>%
  summarise(sum_values = sum(value, na.rm = TRUE))
The advantage of dplyr in this case is that on larger datasets it is often faster than aggregate. I also find it more readable.
