Calculate row sum value in R

Hi, I am new to R and would like some advice on how to perform a sum calculation on a data frame.
      year value
Row 1 2001    10
Row 2 2001    20
Row 3 2002    15
Row 4 2002    NA
Row 5 2003     5
How can I use R to return the total sum value by year? Many thanks!
      year sum value
Row 1 2001        30
Row 2 2002        15
Row 3 2003         5

There are lots of ways to do that.
One of them is using the function aggregate like this:
year  <- c(2001, 2001, 2002, 2002, 2003)
value <- c(10, 20, 15, NA, 5)
mydf  <- data.frame(year, value)

mytable <- aggregate(mydf$value, by = list(year), FUN = sum, na.rm = TRUE)
colnames(mytable) <- c('Year', 'sum_values')

> mytable
  Year sum_values
1 2001         30
2 2002         15
3 2003          5
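As a side note, the formula interface to aggregate reads nicely here too; its default na.action drops the row with the NA, so na.rm is not needed:
aggregate(value ~ year, data = mydf, FUN = sum)
#   year value
# 1 2001   30
# 2 2002   15
# 3 2003    5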

There is also rowsum, which is quite efficient
with(mydf, rowsum(value, year, na.rm = TRUE))
#      [,1]
# 2001   30
# 2002   15
# 2003    5
Or tapply
with(mydf, tapply(value, year, sum, na.rm = TRUE))
# 2001 2002 2003
#   30   15    5
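Note that tapply returns a named vector; if you need a two-column data frame instead, one small sketch:
res <- with(mydf, tapply(value, year, sum, na.rm = TRUE))
data.frame(year = as.numeric(names(res)), sum_values = unname(res))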
Or as.data.frame(xtabs(...)); with the columns reordered so that value comes first, xtabs builds the formula value ~ year from the data frame and sums value by year:
as.data.frame(xtabs(mydf[2:1]))
#   year Freq
# 1 2001   30
# 2 2002   15
# 3 2003    5

LyzandeR has provided a working answer in base R. If you want to use dplyr, which is a great data manipulation tool, you could do:
library(dplyr)

year  <- c(2001, 2001, 2002, 2002, 2003)
value <- c(10, 20, 15, NA, 5)
mydf  <- data.frame(year, value)

mydf %>%
  group_by(year) %>%
  summarise(sum_values = sum(value, na.rm = TRUE))
The advantage of dplyr in this case is that, for larger datasets, it will be much faster than base R. I also find it much more readable.
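If you want to check the speed claim on your own data, a rough sketch with the microbenchmark package (assuming it is installed; on a five-row toy table the difference is negligible, the gap only shows up on large data):
library(microbenchmark)
microbenchmark(
  base  = aggregate(mydf$value, by = list(mydf$year), FUN = sum, na.rm = TRUE),
  dplyr = mydf %>% group_by(year) %>% summarise(sum_values = sum(value, na.rm = TRUE))
)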

Related

Conditional sum in R in different columns

I want to sum numbers in B column based on numbers in A column.
For example:
Column A: 2001 2002 2002 2002 2003 2003
Column B:    1    2    3    4    5    6
I want to add a column C that sums up B based on A. My desired result is:
Column A: 2001 2002 2002 2002 2003 2003
Column B:    1    2    3    4    5    6
Column C:    1    9    9    9   11   11   (9 = 2 + 3 + 4)
I have searched a lot but really have no clue where to begin; thanks in advance for any help!
We can use mutate from dplyr after grouping by 'A':
library(dplyr)

# example data from the question
df1 <- data.frame(A = c(2001, 2002, 2002, 2002, 2003, 2003), B = 1:6)

df1 %>%
  group_by(A) %>%
  mutate(C = sum(B))
Or with ave from base R
df1$C <- with(df1, ave(B, A, FUN = sum))
An efficient option is data.table
library(data.table)
setDT(df1)[, C := sum(B), by = A]
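With the example data, all three approaches produce the C column the question asks for (the dplyr version returns a grouped tibble, so its printout differs slightly):
#      A B  C
# 1 2001 1  1
# 2 2002 2  9
# 3 2002 3  9
# 4 2002 4  9
# 5 2003 5 11
# 6 2003 6 11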

Eliminate observations with the same id that actually do not correspond in R

I am using a national survey to run an econometric analysis in R.
The df is based on a survey conducted every two years: some families have been interviewed more than once, while others appear just one time.
The variable nquest represents the code number of the family; the variable nord is the code number of the family member in a given year; and the variable nordp is the code number that the same individual had in the previous survey. So when individuals are interviewed more than once, nord and nordp should be the same, but in practice this is not always true.
I need to filter the df in order to keep only the individuals that appear more than once:
df <- df %>%
  group_by(nquest, nordp) %>%
  filter(n() > 1)
Then I assign a unique id value to each individual with this command (in different years I have the same id for the same couple of nquest and nord):
df <- transform(df, id = as.numeric(interaction(nquest, nord)))
The problem is that sometimes the data were entered incorrectly, so that in one year the same individual (identified by the same nquest and nordp) is actually not the same person. For example, look at the two rows marked with **: they have the same nquest and nordp, and therefore the same id, but they are not the same person (nord is not the same, and the sex is also different).
  year id nquest nord nordp sex
**2000  1     10    1     1 F**
  2000  2     20    1     1 M
  2000  3     30    1     1 M
  2002  1     10    1     1 F
  2002  2     20    1     1 M
  2002  4     40    1     1 F
**2004  1     10    2     1 M**
  2004  2     20    1     1 M
  2004  3     30    1     1 M
So my problem is to eliminate the observations that are not really the same person, using sex as a check variable; the df has more than 50k observations, so I can't check each id by hand.
Thank you in advance.
You could do:
unique_df <- unique(df[, c("id", "nquest", "nordp", "sex")])
unique_df$id[duplicated(unique_df$nquest)]
This returns the ids with multiple different sex annotations.
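A sketch of the follow-up step, assuming you then want to drop every observation carrying one of those inconsistent ids (bad_ids and df_clean are illustrative names, not from the question):
bad_ids  <- unique_df$id[duplicated(unique_df$nquest)]
df_clean <- df[!df$id %in% bad_ids, ]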
With summarise_each and n_distinct from dplyr you could do:
library("dplyr")
DF=read.table(text="year id nquest nord nordp sex
**2000 1 10 1 1 F**
2000 2 20 1 1 M
2000 3 30 1 1 M
2002 1 10 1 1 F
2002 2 20 1 1 M
2002 4 40 1 1 F
**2004 1 10 2 1 M**
2004 2 20 1 1 M
2004 3 30 1 1 M",header=TRUE,stringsAsFactors=FALSE)
summaryDF= DF %>%
group_by(id) %>%
summarise_each(funs(n_distinct),everything(),-year,-id) %>%
filter(sex>1 & nord >1 & nquest==1 & nordp==1 ) %>% #filter conditions on resultant data.frame
as.data.frame()
summaryDF
# id nquest nord nordp sex
# 1 1 2 1 3
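Note that summarise_each was later deprecated in dplyr; on dplyr >= 1.0 an equivalent sketch of the same logic uses across():
DF %>%
  group_by(id) %>%
  summarise(across(c(nquest, nord, nordp, sex), n_distinct)) %>%
  filter(sex > 1 & nord > 1 & nquest == 1 & nordp == 1) %>%
  as.data.frame()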

Counting unique values across variables (columns) in R

I have a large dataset with repeated measures over 5 time periods.
2012 2009 2006 2003 2000
   3    1    4    4    1
   5    3    2    2    3
   6    7    3    5    6
I want to add a new column, which is the number of unique values among years 2000 to 2012. e.g.,
2012 2009 2006 2003 2000 nunique
   3    1    4    4    1       3
   5    3    2    2    3       3
   6    7    3    5    6       4
I am working in R and, if it helps, there are only 14 possible different values of the measured value at each time period.
I found this page: Count occurrences of value in a set of variables in R (per row), and tried the various solutions offered on it. What it gives me, however, is a count of each value, not the number of unique values.
Other similar questions on here seem to ask about counting the number of unique values within a variable/column, rather than across each row.
Any suggestions would be appreciated.
Here's one alternative
> df$nunique <- apply(df, 1, function(x) length(unique(x)))
> df
  2012 2009 2006 2003 2000 nunique
1    3    1    4    4    1       3
2    5    3    2    2    3       3
3    6    7    3    5    6       4
If you have a large dataset, you may want to avoid looping over the rows, but use a faster framework, like S4Vectors:
df <- data.frame('2012' = c(3, 5, 6),
                 '2009' = c(1, 3, 7),
                 '2006' = c(4, 2, 3),
                 '2003' = c(4, 2, 5),
                 '2000' = c(1, 3, 6))

dup <- S4Vectors:::duplicatedIntegerPairs(as.integer(as.matrix(df)), row(df))
dim(dup) <- dim(df)
rowSums(!dup)
Or, the matrixStats package:
m <- as.matrix(df)
mode(m) <- "integer"
rowSums(matrixStats::rowTabulates(m) > 0)
The trick is to use apply with MARGIN = 1, which passes each row to your function as a variable (here x). You can then write a custom function, in this case one that uses unique and length, to get the answer that you want.
df <- data.frame('2012' = c(3, 5, 6), '2009' = c(1, 3, 7), '2006' = c(4, 2, 3), '2003' = c(4, 2, 5), '2000' = c(1, 3, 6))
df$nunique <- apply(df, 1, function(x) length(unique(x)))
Note that sapply over a data frame iterates over columns, so
sapply(data, function(x) length(unique(x)))
counts unique values per column, not per row; for counts per row, use apply with MARGIN = 1 as above.

Split and Diff function in R

I have a data frame called data. I am splitting the data using split function by an attribute called KEY.
data <- split(data, data$KEY);
After splitting the data frame by KEY, what we get is the data for each individual firm; the data frame data held the data for all the firms in the universe. After the split, each piece has two columns, year and sales. For each piece, I have to calculate the incremental sales corresponding to each year. For instance, if we have data 2002 - 10, 2003 - 12, 2004 - 15, 2005 - 20, what I am interested in getting would be 2003 - 2, 2004 - 3, 2005 - 5, for each split.
I have written a function, called mod_sale, to perform the job mentioned:
mod_sale <- function(data) {
  data <- data[with(data, order(year)), ]
  sale_data <- diff(data$SALE)
  data <- data[-1, ]
  data$SALE <- sale_data
  return(data)
}
Currently, I am using a for loop:
mod_data <- data.frame()
for (key in names(data)) {
  a <- try(mod_sale(data[[key]]))
  if (inherits(a, "try-error")) next
  mod_data <- rbind(mod_data, a)
}
I think there is some way I can use sapply (and maybe plyr too). Can someone help me with improving this R code? I am not sure how the sapply version would go:
sapply(data, mod_sale)
Any help would be appreciated. Thanks.
Edit:
Here is a data example:
key   <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3)
sales <- c(12, 12, 15, 8, 3, 6, 3, 9, 9, 12, 3, 7)
year  <- c(2002, 2003, 2004, 2005, 2001, 2002, 2003, 2004, 2005, 2003, 2004, 2005)
ovar  <- runif(12, 5.0, 7.5)
a <- data.frame(key, sales, year, ovar)
In the resultant data.frame, I am expecting incremental sales rather than raw sales. Obviously, we will lose 3 data points, one for each key's starting year, since we are taking differences. So there will be three fewer rows in the resultant data.frame, which will have the columns key, diff(sales), year, and ovar.
This is what I would have done:
a$diffsales <- ave( a$sales, a$key, FUN=function(x) c(NA, diff(x) ) )
a
   key sales year     ovar diffsales
1    1    12 2002 6.845177        NA
2    1    12 2003 6.328153         0
3    1    15 2004 6.872669         3
4    1     8 2005 6.098920        -7
5    2     3 2001 7.154824        NA
6    2     6 2002 6.110810         3
7    2     3 2003 5.906624        -3
8    2     9 2004 5.214369         6
9    2     9 2005 5.818218         0
10   3    12 2003 5.354354        NA
11   3     3 2004 6.728992        -9
12   3     7 2005 7.412213         4
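Since the question expects three fewer rows rather than NAs, you can then drop the first year of each key:
a <- a[!is.na(a$diffsales), ]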
I appreciate the attempt to display what you'd tried. Thank you.
In the future, try to provide a small example, like this:
df <- data.frame(year = 2001:2010,
                 sale = sample(20, 10))
df <- rbind(df, df, df)
df$key <- rep(letters[1:3], each = 10)
That makes it much clearer what your data look like, and it makes it very easy for people trying to answer. The easier you make it for us, the faster+better answers you'll get.
I'd recommend sorting before splitting:
# Sort first (already sorted, but you get the idea)
df <- df[order(df$key, df$year), ]
df_split <- split(df, df$key)
You don't actually want to use sapply. (Try it and see.) You just want lapply:
out <- lapply(df_split, function(x) {
  x$sale_diff <- c(NA, diff(x$sale))
  x[-1, ]
})
You'd put it all together again using:
do.call(rbind,out)
You're right, plyr or data.table could also do this. I'll leave those examples to others.
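For reference, a minimal plyr sketch of the same split-apply-combine idea (assuming plyr is installed; ddply splits df by key, applies the function to each piece, and rbinds the results):
library(plyr)
out <- ddply(df, "key", function(x) {
  x <- x[order(x$year), ]             # sort each piece by year
  x$sale_diff <- c(NA, diff(x$sale))  # incremental sales
  x[-1, ]                             # drop the first year, which has no difference
})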
Using data.table:
library(data.table)
dt = data.table(a)
dt[, sale_diff := c(NA, diff(sales)), by = key]
dt
#     key sales year     ovar sale_diff
#  1:   1    12 2002 7.416857        NA
#  2:   1    12 2003 5.625818         0
#  3:   1    15 2004 5.018934         3
#  4:   1     8 2005 6.671986        -7
#  5:   2     3 2001 6.242739        NA
#  6:   2     6 2002 6.297763         3
#  7:   2     3 2003 6.482124        -3
#  8:   2     9 2004 6.724256         6
#  9:   2     9 2005 5.071265         0
# 10:   3    12 2003 6.136681        NA
# 11:   3     3 2004 6.974392        -9
# 12:   3     7 2005 6.517553         4

Producing a rolling average of ALL the previous observations per ID in an unbalanced panel data set

I am trying to compute rolling means of an unbalanced data set. To illustrate my point I have produced this toy example of my data:
ID year Var RollingAvg(Var)
 1 2000   2              NA
 1 2001   3               2
 1 2002   4             2.5
 1 2003   2               3
 2 2001   2              NA
 2 2002   5               2
 2 2003   4             3.5
The column RollingAvg(Var) is what I want, but can't get. In words, I am looking for the rolling average of ALL the previous observations of Var for each ID. I have tried using rollapply and ddply in the zoo and the plyr package, but I can't see how to set the rolling window length to use ALL the previous observations for each ID. Maybe I should use the plm package instead? Any help is appreciated.
I have seen other posts on rolling means for balanced panel data sets, but I can't seem to extrapolate their answers to unbalanced data.
Thanks,
M
Using data.table:
library(data.table)
d <- data.table(your_df)
d[, RollingAvg := { avg <- cumsum(Var) / seq_len(.N)
                    c(NA, avg[-length(avg)]) },
  by = ID]
(or even simplified)
d[, RollingAvg := c(NA, head(cumsum(Var)/(seq_len(.N)), -1)), by = ID]
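To try either version on the toy data from the question (your_df stands in for the real data):
your_df <- data.frame(ID   = c(1, 1, 1, 1, 2, 2, 2),
                      year = c(2000, 2001, 2002, 2003, 2001, 2002, 2003),
                      Var  = c(2, 3, 4, 2, 2, 5, 4))
d <- data.table(your_df)
d[, RollingAvg := c(NA, head(cumsum(Var) / seq_len(.N), -1)), by = ID]
d
#    ID year Var RollingAvg
# 1:  1 2000   2         NA
# 2:  1 2001   3        2.0
# 3:  1 2002   4        2.5
# 4:  1 2003   2        3.0
# 5:  2 2001   2         NA
# 6:  2 2002   5        2.0
# 7:  2 2003   4        3.5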
Assuming that years are contiguous within each ID (which is the case in the example data) and DF is the input data frame, here is a solution using just base R. cumRoll is a function that performs the required operation on one ID, and ave then applies it by ID:
cumRoll <- function(x) c(NA, head(cumsum(x) / seq_along(x), -1))
DF$Roll <- ave(DF$Var, DF$ID, FUN = cumRoll)
The result is:
> DF
  ID year Var Roll
1  1 2000   2   NA
2  1 2001   3  2.0
3  1 2002   4  2.5
4  1 2003   2  3.0
5  2 2001   2   NA
6  2 2002   5  2.0
7  2 2003   4  3.5
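For completeness, a dplyr sketch of the same idea: cummean gives the running mean including the current row, and lag shifts it back by one, so each row sees only the previous observations.
library(dplyr)
DF %>%
  group_by(ID) %>%
  mutate(Roll = lag(cummean(Var)))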
