Subset by multiple conditions - r

Maybe it's something basic, but I couldn't find the answer.
I have
Id Year V1
1 2009 33
1 2010 67
1 2011 38
2 2009 45
3 2009 65
3 2010 74
4 2009 47
4 2010 51
4 2011 14
I need to select only the rows for the Ids that appear in all three years 2009, 2010 and 2011:
Id Year V1
1 2009 33
1 2010 67
1 2011 38
4 2009 47
4 2010 51
4 2011 14
I tried
d1_3 <- subset(d1, Year == 2009 | Year == 2010 | Year == 2011)
but it doesn't work: it keeps every Id that has at least one of those years, not just the Ids that have all three.
Can anyone suggest how I can do this in R?

I think ave could be useful here. I call your original data frame df. For each Id, check whether each of 2009-2011 is present in Year (2009:2011 %in% x). This gives a logical vector that can be summed; testing whether the sum equals 3 tells you whether all three years are present for that Id. ave() returns this result in the same numeric mode as Year, recycled over the Id's rows, so comparing it to 1 gives the logical index used to subset the data frame.
df[ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3) == 1, ]
# Id Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
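To see the intermediate step, here is a small sketch (using a reconstruction of the question's data as df) of what ave returns before it is compared to 1:
df <- data.frame(Id   = c(1, 1, 1, 2, 3, 3, 4, 4, 4),
                 Year = c(2009, 2010, 2011, 2009, 2009, 2010, 2009, 2010, 2011),
                 V1   = c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# ave() evaluates the function once per Id and recycles the result over that Id's rows;
# the logical result is coerced to numeric, so Ids with all three years get 1, the rest 0
ave(df$Year, df$Id, FUN = function(x) sum(2009:2011 %in% x) == 3)
# [1] 1 1 1 0 0 0 1 1 1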

Another way of using ave
DF
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 4 2 2009 45
## 5 3 2009 65
## 6 3 2010 74
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14
DF[ave(DF$Year, DF$Id, FUN = function(x) all(2009:2011 %in% x)) == 1, ]
## Id Year V1
## 1 1 2009 33
## 2 1 2010 67
## 3 1 2011 38
## 7 4 2009 47
## 8 4 2010 51
## 9 4 2011 14

This should do the job :)
library(plyr)
ds<-ddply(ds,.(Id),mutate,Nobs=length(Year))
ds[ds$Nobs == 3 & ds$Year %in% 2009:2011,]

I think an approach using ave is reasonable. But there are lots of ways to solve this problem. I show a few other ways using base R. Then in the last 2 examples I'll introduce the package data.table.
Again, just throwing this out there to provide some options to use different aspects of the language.
d1 <- data.frame(ID=c(1,1,1,2,3,3,4,4,4), Year=c(2009,2010,2011, 2009,2009, 2010, 2009, 2010, 2011), V1=c(33, 67, 38, 45, 65, 74, 47, 51, 14))
# long way
use_years <- as.character(2009:2011)
cnts <- table(d1[,c("ID","Year")])[,use_years]
use_id <- rownames(cnts)[rowSums(cnts)==length(use_years)]
d1[d1[,"ID"]%in%use_id,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
# another longish way
ind1 <- d1[,"Year"]%in%2009:2011
d1_ind <- d1[ind1,"ID"]
ind2 <- d1_ind %in% unique(d1_ind)[tabulate(d1_ind)==3]
d1[ind1,][ind2,]
# ID Year V1
# 1 1 2009 33
# 2 1 2010 67
# 3 1 2011 38
# 7 4 2009 47
# 8 4 2010 51
# 9 4 2011 14
OK, let's try out a couple methods using data.table. One of my favorite packages of all time. Can be a little tricky at first though, so make sure your boots are on tight (Oh, yeah, it's fast!) :)
# medium way
library(data.table)
d2 <- as.data.table(d1)
d2[ID%in%d2[Year%in%2009:2011, list(logic=nrow(.SD)==3),by="ID"][(logic),ID]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14
# short way
d2[Year%in%2009:2011][ID%in%unique(ID)[table(ID)==3]]
# ID Year V1
# 1: 1 2009 33
# 2: 1 2010 67
# 3: 1 2011 38
# 4: 4 2009 47
# 5: 4 2010 51
# 6: 4 2011 14

Related

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a cumulative day count (I'm not sure "lump sum" is the right term) and create a new column "date" that numbers the days across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I tried this code, but the result was wrong and it's also too long. It doesn't count February correctly since February has only 28 days. Are there any shorter ways?
cday <- function(data, syear = 2011, smonth = 1, sday = 1){
  year <- data[1]
  month <- data[2]
  day <- data[3]
  cmonth <- c(0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  date <- (year - syear) * 365 + sum(cmonth[1:month]) + day
  for (yr in c(syear:year)){
    if (yr == year){
      if (yr %% 4 == 0 && month > 2){ date <- date + 1 }
    } else {
      if (yr %% 4 == 0){ date <- date + 1 }
    }
  }
  return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated; look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
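For example, a rough lubridate equivalent of the first two columns might look like this (a sketch, assuming the same df as above):
library(lubridate)
df$date <- make_date(df$year, df$month, df$day)  # build Date objects from the year/month/day columns
df$julian_day <- yday(df$date)                   # day of the year, same as format(date, "%j")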

reordering database by loop in R - help me

I am trying to reshape a data frame with a loop, but it does not work for me. There is too much data to do it one by one.
fact <- rep (1:2 , each = 3)
t1 <- c(2006,2007,2008,2000,2001,2002)
t2 <- c(2007,2008,2009,2001,2002,2004)
var1 <- c(56,52,44,10,32,41)
var2 <- c(52,44,50,32,41,23)
db1 <- as.data.frame(cbind(fact, t1, t2, var1, var2))
db1
fact t1 t2 var1 var2
1 1 2006 2007 56 52
2 1 2007 2008 52 44
3 1 2008 2009 44 50
4 2 2000 2001 10 32
5 2 2001 2002 32 41
6 2 2002 2004 41 23
I need it to end up like this:
factor <- rep (1:2 , each = 4)
t <- c(2006,2007,2008,2009,2000,2001,2002,2004)
var <- c(56,52,44,50,10,32,41,23)
db2 <- as.data.frame(cbind(factor, t, var))
db2
factor t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 1 2009 50
5 2 2000 10
6 2 2001 32
7 2 2002 41
8 2 2004 23
Many thanks.
dat1 <- as.data.frame(cbind(fact, t1, var1))
names(dat1) <- c("fact", "t", "var")
dat2 <- as.data.frame(cbind(fact, t2, var2))
names(dat2) <- c("fact", "t", "var")
rbind.data.frame(dat1, dat2)
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(dat[,c(1,2,4)], dat[,c(1,3,5)])
fact t var
1 1 2006 56
2 1 2007 52
3 1 2008 44
4 2 2000 10
5 2 2001 32
6 2 2002 41
7 1 2007 52
8 1 2008 44
9 1 2009 50
10 2 2001 32
11 2 2002 41
12 2 2004 23
Or, as indicated, have a look at the reshape2 package - melt would certainly be of use, e.g.
library(reshape2)
dat <- db1
names(dat) <- c("fact", rep("t", 2), rep("var", 2))
rbind(melt(dat[, c(1,2,4)], id.vars = c("fact", "t"), value.name = "var"),
      melt(dat[, c(1,3,5)], id.vars = c("fact", "t"), value.name = "var"))

unlist and merge into a single dataframe in r

I have a list of dataframes that I need to be combined into a single one.
year<-1990:2000
v1<-1:11
v2<-20:30
df1<-data.frame(year,v1)
df2<-data.frame(year,v2)
ldf<-list(df1,df2)
I now want to collapse this list into a single data frame and get
> head(df)
year v1 v2
1 1990 1 20
2 1991 2 21
3 1992 3 22
4 1993 4 23
Note that my question is different from a similar question, where the solution was df <- ldply(ldf, data.frame).
What I am essentially looking for is a more automatic way of doing this: df <- merge(df1, df2, by = "year")
With a larger number of list elements, a convenient option is reduce with one of the join functions:
library(tidyverse)
ldf %>%
  reduce(inner_join, by = "year")
# year v1 v2
#1 1990 1 20
#2 1991 2 21
#3 1992 3 22
#4 1993 4 23
#5 1994 5 24
#6 1995 6 25
#7 1996 7 26
#8 1997 8 27
#9 1998 9 28
#10 1999 10 29
#11 2000 11 30
Is there anything wrong with:
df <- merge(ldf[[1]], ldf[[2]], by="year")
Or for a long list:
df1 <- ldf[[1]]
for (x in 2:length(ldf)) {
  df1 <- merge(df1, ldf[[x]])
}
# year v1 v2
# 1 1990 1 20
# 2 1991 2 21
# 3 1992 3 22
# 4 1993 4 23
# 5 1994 5 24
# 6 1995 6 25
# 7 1996 7 26
# 8 1997 8 27
# 9 1998 9 28
# 10 1999 10 29
# 11 2000 11 30
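The same loop can also be written as a single Reduce call in base R (a sketch using the same ldf as above):
# successively merge every data frame in the list on "year"
df <- Reduce(function(x, y) merge(x, y, by = "year"), ldf)
head(df, 3)
#   year v1 v2
# 1 1990  1 20
# 2 1991  2 21
# 3 1992  3 22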

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
I have data on various subjects observed over 5 years. I need to remove all subjects that are missing any of the years from 2011 to 2015. How can I accomplish this, so that in the given data only subject A is left?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == T, ][,keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires the unique years to be exactly equal to 2011:2015. If there is a 2016, for example, that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == T, ][,keep := NULL]
Thus, if, for example, A had a 2016 year and a 2010 year, it would still keep all of A. But anyone missing a year in 2011:2015 would be excluded.
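A quick illustration of the difference between the two checks (a sketch with a made-up year vector):
yrs <- c(2011:2015, 2016L)
identical(unique(yrs), 2011:2015)  # FALSE: the extra 2016 fails the strict check
all(2011:2015 %in% yrs)            # TRUE: all required years are present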
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == T, 1] ,]
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year:
library(tidyverse)
df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
  group_by(Name) %>%
  filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
df2 <- spread(data = df, key = Name, value = year)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)
All steps can of course be constructed as one single pipe with the %>% operator.
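For instance, the steps above could be collapsed roughly like this (a sketch, assuming the same column positions and the ad hoc threshold used above):
library(tidyverse)
df %>%
  spread(key = Name, value = year) %>%
  # keep num, age, X plus only the Name columns whose year totals pass the threshold
  select(num, age, X, c(4:7)[colSums(.[, 4:7], na.rm = TRUE) > 10000]) %>%
  na.omit()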

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of the variable "assets", but I have some NAs. However, I cannot simply delete the NA observations; I need to delete the NA row and all subsequent observations for that productreference.
Here an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to apply functions by group using plyr, and I have been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 indicates a missing (NA) value.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
  filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
                    function(x) cumsum(is.na(x$assets)) == 0),
             mydf$productreference), ]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if (any(is.na(assets))) .SD[seq(which(is.na(assets))[1] - 1)]
              else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first occurrence of NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
  s1 <- mydf[mydf$productreference == i, ]
  s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets))) - 1), ]
  mydf2 <- rbind(mydf2, s2)
  mydf2 <- mydf2[!is.na(mydf2$assets), ]
}
mydf2
