R: Function “diff” over various groups

While searching for a solution to my problem I found this thread: Function "diff" over various groups in R. I've got a very similar question so I'll just work with the example there.
This is what my desired output should look like:
name class year diff
1 a c1 2009 NA
2 a c1 2010 67
3 b c1 2009 NA
4 b c1 2010 20
I have two variables that form subgroups: class and name. I only want to compare values that share the same name and class, and I want the differences from 2009 to 2010. If there is no 2008 value, the diff for 2009 should be NA (since no difference can be calculated).
I'm sure this works very similarly to the other thread, but I just can't make it work. I used the code below as well (and simply handled the ascending year by sorting the data differently), yet R still calculates a difference instead of returning NA.
ddply(df, .(class, name), summarize, year=head(year, -1), value=diff(value))
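For reference, the answers below all work from the data set in the linked post. A minimal reconstruction might look as follows (only the 2009/2010 rows can be read off the outputs, so the 2008 rows are omitted here, which makes the year != 2008 filters below harmless no-ops):
df <- data.frame(
  name  = rep(c("a", "b"), each = 2, times = 3),
  class = rep(c("c1", "c2", "c3"), each = 4),
  year  = rep(c(2009, 2010), 6),
  value = c(33, 100, 80, 90,   # class c1: a, then b
            80, 90, 90, 100,   # class c2: a, then b
            90, 100, 80, 99)   # class c3: a, then b
)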

Using the data set from the other post, I would do something like:
library(data.table)
df <- df[df$year != 2008, ]
setkey(setDT(df), class, name, year)
df[, diff := lapply(.SD, function(x) c(NA, diff(x))),
   .SDcols = "value", by = list(class, name)]
Which returns
df
# name class year value diff
# 1: a c1 2009 33 NA
# 2: a c1 2010 100 67
# 3: b c1 2009 80 NA
# 4: b c1 2010 90 10
# 5: a c2 2009 80 NA
# 6: a c2 2010 90 10
# 7: b c2 2009 90 NA
# 8: b c2 2010 100 10
# 9: a c3 2009 90 NA
#10: a c3 2010 100 10
#11: b c3 2009 80 NA
#12: b c3 2010 99 19
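A possible alternative within data.table is shift(), which avoids building c(NA, diff(x)) by hand (a sketch, assuming df is already keyed and ordered as above):
# same result: subtract the previous value within each class/name group
df[, diff := value - shift(value), by = list(class, name)]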

Using dplyr
library(dplyr)

df %>%
  filter(year != 2008) %>%
  arrange(name, class, year) %>%
  group_by(class, name) %>%
  mutate(diff = c(NA, diff(value)))
# Source: local data frame [12 x 5]
# Groups: class, name
# name class year value diff
# 1 a c1 2009 33 NA
# 2 a c1 2010 100 67
# 3 a c2 2009 80 NA
# 4 a c2 2010 90 10
# 5 a c3 2009 90 NA
# 6 a c3 2010 100 10
# 7 b c1 2009 80 NA
# 8 b c1 2010 90 10
# 9 b c2 2009 90 NA
# 10 b c2 2010 100 10
# 11 b c3 2009 80 NA
# 12 b c3 2010 99 19
Update:
With relative difference
df %>%
  filter(year != 2008) %>%
  arrange(name, class, year) %>%
  group_by(class, name) %>%
  mutate(diff1 = c(NA, diff(value)),
         rel_diff = round(diff1 / value[row_number() - 1], 2))
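An equivalent formulation that also behaves well when a group has more than two years is to use lag() instead of indexing with row_number() - 1 (a sketch, not part of the original answer):
df %>%
  filter(year != 2008) %>%
  arrange(name, class, year) %>%
  group_by(class, name) %>%
  mutate(diff1 = value - lag(value),               # NA in the first year of each group
         rel_diff = round(diff1 / lag(value), 2))  # difference relative to the previous value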

Related

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have data on various subjects over 5 years. I need to remove all subjects that are missing any of the years from 2011 to 2015. How can I accomplish this, so that only subject A is left in the data above?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name][keep == TRUE, ][, keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is stricter, in that it requires the unique years to be exactly equal to 2011:2015. If there is a 2016, for example, that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name][keep == TRUE, ][, keep := NULL]
Thus, if, for example, A also had a 2016 or a 2010 entry, all of A would still be kept. But anyone missing a year in 2011:2015 would be excluded.
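The same membership check also translates directly into a grouped dplyr filter (a sketch, assuming the data frame df from the question):
library(dplyr)

df %>%
  group_by(Name) %>%
  filter(all(2011:2015 %in% year)) %>%   # keep a Name only if all five years are present
  ungroup()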
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == TRUE, 1], ]
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year:
library(tidyverse)  # complete() comes from tidyr

df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
group_by(Name) %>%
filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header = TRUE)
df2 <- spread(data = df, key = Name, value = year)  # one column of years per Name
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000      # sums exceed 10000 only when all five years are present
df3 <- select(df2, num, age, X, c(4:7)[x])          # keep only those Name columns
df4 <- na.omit(df3)                                 # drop rows that still contain NA
All steps can of course be constructed as one single pipe with the %>% operator.
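For illustration, such a pipe might look something like this (a sketch that reuses the column positions 4:7 from above inside a { } block so the intermediate result can be referenced as .):
df %>%
  spread(key = Name, value = year) %>%
  { select(., num, age, X, c(4:7)[colSums(.[, 4:7], na.rm = TRUE) > 10000]) } %>%
  na.omit()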

Calculation and replacement in R

I have a dataset like the following and I need to compare the value of each year (2005-2009) with the average value of (2002-2004).
Year Firm R
2002 A 30
2003 A 11
2004 A 1
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 24
2003 B 30
2004 B 25
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
The issue is that I need to calculate the average of 2002-2004 for each firm and replace the values in years 2002-2004 with that average. For example, the new dataset should look like this:
Year Firm R
2002 A 14
2003 A 14
2004 A 14
2005 A 7
2006 A 15
2007 A 20
2008 A 3.5
2009 A 8
2002 B 26.333
2003 B 26.333
2004 B 26.333
2005 B 5.2
2006 B 11.8
2007 B 78
2008 B 90
2009 B 57
I have tried to use the following code:
df$R[df$Year==2002 & df$Year==2003 & df$Year==2004] = (df$R[df$Year==2002] + df$R[df$Year==2003] + df$R[df$Year==2004])/3
but when I apply it, nothing changes!
I hope you can help with this issue.
You can use data.table for this if you like:
library(data.table)
year <- c(rep(seq(2002,2009,1),2))
firm <- c(rep("A",8),rep("B",8))
r <- c(30,11,1,7,15,20,3.5,8,24,30,25,5.2,11.8,78,90,57)
aa <- data.table(year,firm,r)
aa[year >= 2002 & year <= 2004, r := mean(r), by = firm]
Giving this result:
year firm r
1: 2002 A 14.00000
2: 2003 A 14.00000
3: 2004 A 14.00000
4: 2005 A 7.00000
5: 2006 A 15.00000
6: 2007 A 20.00000
7: 2008 A 3.50000
8: 2009 A 8.00000
9: 2002 B 26.33333
10: 2003 B 26.33333
11: 2004 B 26.33333
12: 2005 B 5.20000
13: 2006 B 11.80000
14: 2007 B 78.00000
15: 2008 B 90.00000
16: 2009 B 57.00000
The mistake in your code is that you are not grouping by firm name, and you are using & instead of |. In my example, test.txt is a file containing the same input as in the question.
The code below should achieve what you need.
library(dplyr)
df <- read.delim('test.txt', header = T, sep = '\t')
print(df)
# get unique firm names for grouping
firms <- unique(df$Firm)
# for each firm, calculate mean and update it
for (f in firms) {
  df$R[df$Firm == f & (df$Year == 2002 | df$Year == 2003 | df$Year == 2004)] =
    sum(df$R[df$Firm == f & (df$Year == 2002 | df$Year == 2003 | df$Year == 2004)]) / 3
}
print(df)
Try this dplyr version:
library(tidyverse)
data %>%
  filter(Year < 2005) %>%          # subset the 2002-2004 rows
  group_by(Firm) %>%               # state which groups to evaluate
  summarise(m = mean(R)) %>%       # take the mean (named m)
  left_join(data) %>%              # join the original data to the summarised data
  mutate(R = ifelse(Year < 2005 & Firm == 'A', m,
                    ifelse(Year < 2005 & Firm == 'B', m, R))) %>%  # nested ifelse to define conditions
  select(Year, Firm, R) -> newdata # select the desired columns and name the result
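For comparison, a shorter grouped-mutate sketch of the same idea (assuming the same data object and the column names Year, Firm and R; this is not part of the answer above):
library(dplyr)

data %>%
  group_by(Firm) %>%
  mutate(R = ifelse(Year <= 2004, mean(R[Year <= 2004]), R)) %>%  # overwrite 2002-2004 with their mean
  ungroup()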

Drop subgroup of obs in dataframe if first observation of group is na

In R I have a dataframe df of this form:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
NA 5 2011 04 1234759
5 5 2011 05 1234759
5 5 2011 06 1234759
2 2 2001 11 1234760
NA NA 2001 11 1234760
Some of the a's and b's are NAs. I wish to subset the dataframe by id, order each subset by year and month, and then drop the whole subset/id if the first observation (in time order) of either a or b is NA.
For the example above, the intended result is:
a b year month id
1 2 2012 01 1234758
1 1 2012 02 1234758
2 2 2001 11 1234760
NA NA 2001 11 1234760
I did it the non-vectorized way, which took forever to run, as follows:
df_summary <- as.data.frame(table(df$id),stringsAsFactors=FALSE)
df <- df[order(df$id,df$year,df$month),]
remove <- ""
j <- 1
l <- 0
for(i in 1:nrow(df_summary)){
  m <- df_summary$Freq[i]                # size of this id's group
  if( is.na(df$a[j]) | is.na(df$b[j]) ) {
    l <- l + 1
    remove[l] <- df_summary$Var1[i]      # record the id to drop
  }
  j <- j + m
}
df <- df[!(df$id %in% remove),]
What is a faster, vectorized way to achieve the same result?
What I tried, also to double-check my code:
dt <- setDT(df)
remove_vectorized <- dt[, list(remove_first_na = (is.na(a[1]) | is.na(b[1]))), by = id]
which suggests removing ALL observations, which is patently wrong.
Here are a few possible data.table approaches.
First, fixing your attempt:
library(data.table)
setDT(df)[, if(!is.na(a[1L]) & !is.na(b[1L])) .SD, by = id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or we can generalize this (probably at the expense of speed):
setDT(df)[, if(Reduce(`&`, !is.na(.SD[1L, .(a, b)]))) .SD, by = id]
## OR maybe `setDT(df)[, if(Reduce(`&`, !sapply(.SD[1L, .(a, b)], is.na))) .SD , by = id]`
## (in order to avoid the matrix conversion)
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Another way is to combine the unique and na.omit methods:
indx <- na.omit(unique(setDT(df), by = "id"), by = c("a", "b"))
Then, a simple subset will do
df[id %in% indx$id]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or maybe a binary join?
df[indx[, .(id)], on = "id"]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
Or
indx <- na.omit(unique(setDT(df, key = "id")), by = c("a", "b"))
df[.(indx$id)]
# id a b year month
# 1: 1234758 1 2 2012 1
# 2: 1234758 1 1 2012 2
# 3: 1234760 2 2 2001 11
# 4: 1234760 NA NA 2001 11
(The last two are mainly for illustration)
For more info regarding data.table, please visit Getting Started on GH
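For completeness, the same first-row check can also be sketched with dplyr (assuming df as in the question; this is separate from the data.table approaches above):
library(dplyr)

df %>%
  arrange(id, year, month) %>%
  group_by(id) %>%
  filter(!is.na(first(a)) & !is.na(first(b))) %>%  # keep an id only if its earliest row has no NA in a or b
  ungroup()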

R Transform Data Frame and Remove NAs

I've converted a data set in R from LONG to WIDE format and now have one measurement per row. What would be the best way to consolidate the rows based on the "Date" column and remove the NAs?
Here is a sample of what I have:
Date M1 M2 M3 M4
1 2013 NA NA NA 2
2 2013 6 NA NA NA
3 2013 NA 19 NA NA
4 2013 NA NA 10 NA
5 2014 NA NA NA 1
6 2014 NA NA 231 NA
7 2014 NA 215 NA NA
8 2014 16 NA NA NA
This is what I'd like to create:
Date M1 M2 M3 M4
1 2013 6 19 10 2
2 2014 16 215 231 1
Any suggestions or help would be appreciated!
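For reproducibility, the sample above (referred to as mydf, data, and df1 in the answers below) can be recreated along these lines:
mydf <- read.table(text = "Date M1 M2 M3 M4
2013 NA NA NA 2
2013 6 NA NA NA
2013 NA 19 NA NA
2013 NA NA 10 NA
2014 NA NA NA 1
2014 NA NA 231 NA
2014 NA 215 NA NA
2014 16 NA NA NA", header = TRUE)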
Without knowing more about your dataset, you can try something like this:
library(data.table)
as.data.table(mydf)[, lapply(.SD, sum, na.rm = TRUE), by = Date]
# Date M1 M2 M3 M4
# 1: 2013 6 19 10 2
# 2: 2014 16 215 231 1
It doesn't have to be "data.table" (though that will be one of your fastest options); any of your favorite aggregation functions can do this.
If you have one measurement per row:
result <- aggregate(cbind(M1 = data$M1, M2 = data$M2, M3 = data$M3, M4 = data$M4),
                    by = list(Date = data$Date), FUN = sum, na.rm = TRUE)
Edit
This is better as mentioned by Ananda in the comments:
aggregate(. ~ Date, mydf, sum, na.rm = TRUE, na.action = "na.pass")
Using dplyr
library(dplyr)
df1 %>%
  group_by(Date) %>%
  summarise_each(funs(sum(., na.rm = TRUE)))
# Date M1 M2 M3 M4
#1 2013 6 19 10 2
#2 2014 16 215 231 1
If there is only one non-NA observation per column per 'Date', you could replace the summarise_each step with summarise_each(funs(na.omit(.))).
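Note that summarise_each() is superseded in current dplyr; a sketch of the same aggregation using across() (available in dplyr >= 1.0.0) would be:
library(dplyr)

df1 %>%
  group_by(Date) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))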

Removing rows of data frame if number of NA in a column is larger than 3

I have a data frame (panel data): the Ctry column gives the country names. If the number of NAs in any column (for example, Carx) is larger than 3, I want to drop the related country from my data frame. For example,
Country A has 2 NA
Country B has 4 NA
Country C has 3 NA
I want to drop country B from my data frame. I have a data frame like this (this is for illustration; my actual data frame is very large):
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
B 2000 NA
B 2001 NA
B 2002 NA
B 2003 NA
B 2004 18
B 2005 16
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
I want to create a data frame like this:
Ctry year Carx
A 2000 23
A 2001 18
A 2002 20
A 2003 NA
A 2004 24
A 2005 18
C 2000 NA
C 2001 NA
C 2002 24
C 2003 21
C 2004 NA
C 2005 24
A fairly straightforward way in base R is to use sum(is.na(.)) along with ave, to do the counting, like this:
with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x))))
# [1] 1 1 1 1 1 1 4 4 4 4 4 4 3 3 3 3 3 3
Once you have that, subsetting is easy:
mydf[with(mydf, ave(Carx, Ctry, FUN = function(x) sum(is.na(x)))) <= 3, ]
# Ctry year Carx
# 1 A 2000 23
# 2 A 2001 18
# 3 A 2002 20
# 4 A 2003 NA
# 5 A 2004 24
# 6 A 2005 18
# 13 C 2000 NA
# 14 C 2001 NA
# 15 C 2002 24
# 16 C 2003 21
# 17 C 2004 NA
# 18 C 2005 24
You can use the by() function to group by Ctry and count the NAs of each group:
DF <- read.csv(
text='Ctry,year,Carx
A,2000,23
A,2001,18
A,2002,20
A,2003,NA
A,2004,24
A,2005,18
B,2000,NA
B,2001,NA
B,2002,NA
B,2003,NA
B,2004,18
B,2005,16
C,2000,NA
C,2001,NA
C,2002,24
C,2003,21
C,2004,NA
C,2005,24',
stringsAsFactors=F)
res <- by(data = DF$Carx, INDICES = DF$Ctry, FUN = function(x) sum(is.na(x)))
validCtry <- names(res)[res <= 3]
DF[DF$Ctry %in% validCtry, ]
# Ctry year Carx
#1 A 2000 23
#2 A 2001 18
#3 A 2002 20
#4 A 2003 NA
#5 A 2004 24
#6 A 2005 18
#13 C 2000 NA
#14 C 2001 NA
#15 C 2002 24
#16 C 2003 21
#17 C 2004 NA
#18 C 2005 24
EDIT:
If you have more columns to check, you could adapt the previous code as follows:
res <- by(data = DF, INDICES = DF$Ctry,
          FUN = function(x) {
            return(sum(is.na(x$Carx)) <= 3 &&
                   sum(is.na(x$Barx)) <= 3 &&
                   sum(is.na(x$Tarx)) <= 3)
          })
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]
where, of course, you may change the condition in FUN according to your needs.
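As a hedged generalization of that idea, the per-column checks can also be written without spelling out each condition by hand (Barx and Tarx are hypothetical extra columns, exactly as in the snippet above):
cols <- c("Carx", "Barx", "Tarx")   # hypothetical set of columns to check
res <- by(data = DF[cols], INDICES = DF$Ctry,
          FUN = function(x) all(colSums(is.na(x)) <= 3))
validCtry <- names(res)[res]
DF[DF$Ctry %in% validCtry, ]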
Since you mention that your data is "very huge" (whatever that means exactly), you could try a dplyr solution and see whether it's faster than the base R solutions. If the other solutions are fast enough, just ignore this one.
require(dplyr)
newdf <- df %>% group_by(Ctry) %>% filter(sum(is.na(Carx)) <= 3)
