R ifelse condition: frequency of consecutive NAs

I'm new to R. I looked for similar questions but could not find one that solves my problem, so any help would be appreciated.
I have a data frame M:
date value
1 182-2002-01-01 23.95
2 182-2002-01-02 17.47
3 182-2002-01-03 NA
4 183-2002-01-01 NA
5 183-2002-01-02 5.50
6 183-2002-01-03 17.02
What I need to do is: if there are fewer than 5 consecutive NAs, I will just repeat the previous value (e.g. 17.47), and if there are more than 5 NAs in a row, I will need to delete the whole month.
I tried the rle function many times but couldn't get it to work. Many thanks for your help.

I'm going to adjust your question a little bit for the purposes of demonstration.
I'm going to use a dataset similar to yours, but with a threshold of 2 NAs in a row. This generalises to 5 very easily, don't worry. I'm also going to use a data set that better demonstrates the solution.
So first, how to get your data to look like what I'm going to use:
library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")),
value=M$value)
Now that's out of the road, this is the data I'm going to work with:
set.seed(1234)
M2<-expand.grid(ID=182, year=2002:2004, month=1:2, day=1:3, KEEP.OUT.ATTRS=FALSE)
M2 <- M2[with(M2, order(year, month, day, ID)),] #sort the data
M2$value <- sample(c(NA, rnorm(100)), nrow(M2),
prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
M2
ID year month day value
1 182 2002 1 1 -0.5012581
7 182 2002 1 2 1.1022975
13 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
10 182 2002 2 2 1.1022975
16 182 2002 2 3 -1.2519859
2 182 2003 1 1 NA
8 182 2003 1 2 NA
14 182 2003 1 3 NA
5 182 2003 2 1 0.9729168
11 182 2003 2 2 0.9594941
17 182 2003 2 3 NA
3 182 2004 1 1 NA
9 182 2004 1 2 -1.1088896
15 182 2004 1 3 0.9594941
6 182 2004 2 1 -0.4027320
12 182 2004 2 2 -0.0151383
18 182 2004 2 3 -1.0686427
First, we're going to remove all cases where, within a month, there are 2 or more NAs in a row:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$value))
if(any(runs$lengths[runs$values] >= maxlen)) NULL else x
}
library(plyr)
rem <- ddply(M2, .(ID, year, month), NA_run, 2)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 NA
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
You can see that the month containing two or more NAs in a row (January 2003) has been removed. The two adjacent NAs that remain (rows 9 and 10) are still there because they belong to different months. Now we're going to fill in the remaining NAs. The na.rm=FALSE argument is there to keep the NAs if they're right at the beginning (which is what you want, I think).
library(zoo)
rem$value <- na.locf(rem$value, na.rm=FALSE)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 0.9594941
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
Now all you need to do to make this 5 or more with your data is to change the maxlen value passed to NA_run (the last argument in the ddply call) from 2 to 5.
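To see what the rle check inside NA_run is doing, here is a minimal base R sketch on a made-up vector. The names x and na_run_lengths are illustrative only and are not part of the code above.

```r
# Made-up vector with one NA run of length 2 and one of length 5
x <- c(3.1, NA, NA, 2.4, NA, NA, NA, NA, NA, 7.0)

# rle() on the logical vector gives the length of each run of TRUE/FALSE
runs <- rle(is.na(x))

# Keep only the lengths of the runs that are NA runs (values == TRUE)
na_run_lengths <- runs$lengths[runs$values]

# The same kind of test that NA_run applies, for three thresholds
has_run_of_2 <- any(na_run_lengths >= 2)
has_run_of_5 <- any(na_run_lengths >= 5)
has_run_of_6 <- any(na_run_lengths >= 6)
```

With a threshold of 5, this vector would be flagged for removal; with a threshold of 6 it would be kept.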
EDIT: Alternatively, if you don't want values to copy over from previous months:
library(zoo)
rem$value <- ddply(rem, .(ID, year, month), summarise,
value=na.locf(value, na.rm=FALSE))$value
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427

I'd do this in two steps:
1) An rle, rollapply, or shift-based strategy to fill in the small gaps (fewer than 5 NAs in a row).
2) A by, aggregate, or ddply-based strategy to take any month with NAs remaining after step 1 and make the whole month NA.
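As a rough sketch of those two steps in base R only: fill_short_gaps is a hypothetical helper, the toy data is made up, and a run of exactly 5 NAs is treated as "long" here, which is an assumption because the question does not say what should happen in that case. Gaps are measured on the whole series, so a gap that spans a month boundary is counted as one run.

```r
# Hypothetical helper: carry the last value forward, but only across
# NA runs shorter than `maxlen`; longer runs are left as NA.
fill_short_gaps <- function(x, maxlen = 5) {
  runs <- rle(is.na(x))
  # TRUE for every position that belongs to a short NA run
  short <- rep(runs$values & runs$lengths < maxlen, runs$lengths)
  # Index of the most recent non-NA value at each position (0 if none yet)
  last_obs <- cummax(seq_along(x) * !is.na(x))
  fillable <- short & last_obs > 0
  x[fillable] <- x[last_obs[fillable]]
  x
}

# Made-up data: two "months" of 8 days each
toy <- data.frame(month = rep(1:2, each = 8),
                  value = c(1, NA, NA, 4, 5, 6, 7, 8,
                            9, NA, NA, NA, NA, NA, 15, 16))

# Step 1: fill the short gaps
toy$value <- fill_short_gaps(toy$value, maxlen = 5)

# Step 2: set a whole month to NA if it still contains any NA
toy$value <- ave(toy$value, toy$month,
                 FUN = function(v) if (anyNA(v)) rep(NA_real_, length(v)) else v)
```

In this toy example the 2-day gap in month 1 is filled with the previous value, while month 2 keeps its 5-day gap after step 1 and is therefore blanked out in step 2.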

Related

Panel data in long format

I have two data frames:
d1:
Id group occu D Year
12 1 1 12 2007
13 4 2 67 2007
14 6 3 34 2007
15 7 1 88 2007
16 2 2 72 2007
17 1 1 43 2007
18 4 1 66 2007
and d2:
Id group occu D Year
12 1 1 34 2010
13 4 2 100 2010
14 6 3 76 2010
15 7 1 99 2010
16 2 2 102 2010
17 1 1 55 2010
18 4 1 32 2010
The variables "group" and "occu" are factors. I want to build a panel data set for the years 2007 and 2010 in long form in R.
How can I do this?
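One possible approach, sketched here under the assumption that both data frames have exactly the same columns and the same factor levels, is to stack them with rbind and sort by Id and Year. The name panel is illustrative.

```r
# Rebuild the two example data frames from the question
d1 <- data.frame(Id    = 12:18,
                 group = factor(c(1, 4, 6, 7, 2, 1, 4)),
                 occu  = factor(c(1, 2, 3, 1, 2, 1, 1)),
                 D     = c(12, 67, 34, 88, 72, 43, 66),
                 Year  = 2007)
d2 <- data.frame(Id    = 12:18,
                 group = factor(c(1, 4, 6, 7, 2, 1, 4)),
                 occu  = factor(c(1, 2, 3, 1, 2, 1, 1)),
                 D     = c(34, 100, 76, 99, 102, 55, 32),
                 Year  = 2010)

# Stack the two years, then order so each Id's rows sit together
panel <- rbind(d1, d2)
panel <- panel[order(panel$Id, panel$Year), ]
rownames(panel) <- NULL
```

Each Id then has one row per year, which is the usual long layout for panel data.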

data.frame - Make observations independent within a 3 day window

I have got a data.frame with 4 columns and ca. 6000 rows.
The columns are:
ID,
Day (the number of day since the first day of the time period considered),
Year (the year in which the observation has been recorded) and
Count (the number of observations pooled, between all the IDs, in that particular day).
df = read.table(text = 'ID Day Year Count
33012 12448 2001 46
35004 12448 2001 46
35008 12448 2001 46
37006 12448 2001 46
21009 4835 1980 44
24005 4835 1980 44
27001 4835 1980 44
27002 4835 1980 44
25005 5569 1982 34
29001 5569 1982 34
29002 5569 1982 34
30003 5569 1982 34', header = TRUE)
I need to create a time window of three days and run a for loop for each Day, counting the number of observations in that time range.
E.g. starting from Day 12448 (or "day 0"), I need to check the whole data frame for observations recorded on Day 12447 (the day prior, or "day -1") and Day 12449 (the day after, or "day +1").
In other words, taking di = 12448, does any "di +1" and/or "di -1" exist in the data frame?
If yes, I have to delete "di +1" and/or "di -1" from the data frame in order to avoid overlapping, and add both of their "Count" values to the "Count" of "di".
Do you have any hint that could help me write the for loop?
@thepule, thanks a lot. I tried to run your code on my dataset. I created a vector with all the days in the column "Day"
days <- unique(df$Day)
and adjusted the for loop accordingly, but it doesn't work, in the sense that I obtain very low values in the column Count.
Where is the mistake?
Here is an example of my data frame:
df = read.table(text ='ID Day Year Count
33012 12448 2001 5
35004 12448 2001 5
35008 12448 2001 5
37006 12448 2001 5
37008 12448 2001 5
27900 12800 2002 4
27987 12800 2002 4
27123 12800 2002 4
27341 12800 2002 4
56124 14020 2003 3
12874 14020 2003 3
11447 14020 2003 3
11231 12447 2001 2
31879 12447 2001 2
56784 12449 2001 1
64148 12799 2002 1
45613 12801 2001 1
77632 10324 1991 1
55313 14002 2003 1
11667 14019 2003 1', header = TRUE)
My output, after the for loop, should be:
ID Day Year Count
1 33012 12448 2001 8
2 35004 12448 2001 8
3 35008 12448 2001 8
4 37006 12448 2001 8
5 37008 12448 2001 8
6 27900 12800 2002 6
7 27987 12800 2002 6
8 27123 12800 2002 6
9 27341 12800 2002 6
10 56124 14020 2003 4
11 12874 14020 2003 4
12 11447 14020 2003 4
13 77632 10324 1991 1
14 55313 14002 2003 1
N.B. each ID has at most one observation per year.
N.B. the Count column is sorted in decreasing order (decreasing = TRUE).
Updated answer:
# Create data frame
tt <- read.table(text = "
ID Day Year Count
33012 12448 2001 5
35004 12448 2001 5
35008 12448 2001 5
37006 12448 2001 5
37008 12448 2001 5
27900 12800 2002 4
27987 12800 2002 4
27123 12800 2002 4
27341 12800 2002 4
56124 14020 2003 3
12874 14020 2003 3
11447 14020 2003 3
11231 12447 2001 2
31879 12447 2001 2
56784 12449 2001 1
64148 12799 2002 1
45613 12801 2001 1
77632 10324 1991 1
55313 14002 2003 1
11667 14019 2003 1", header = TRUE)
# Vector of day targets you want to repeat the procedure for
targets <- unique(tt$Day)
for (i in targets) {
temp <- tt$Count[tt$Day == i]
if(length(temp) > 0) { # skip days already removed as a neighbour of an earlier target
condition <- tt$Day == i - 1
if(any(condition)) {
tt$Count[tt$Day == i] <- mean(tt$Count[condition]) + tt$Count[tt$Day == i]
tt <- tt[!condition,]
}
condition2 <- tt$Day == i + 1
if(any(condition2)) {
tt$Count[tt$Day == i] <- mean(tt$Count[condition2]) + tt$Count[tt$Day == i]
tt <- tt[!condition2,]
}
}
}
Output:
tt
ID Day Year Count
1 33012 12448 2001 8
2 35004 12448 2001 8
3 35008 12448 2001 8
4 37006 12448 2001 8
5 37008 12448 2001 8
6 27900 12800 2002 6
7 27987 12800 2002 6
8 27123 12800 2002 6
9 27341 12800 2002 6
10 56124 14020 2003 4
11 12874 14020 2003 4
12 11447 14020 2003 4
18 77632 10324 1991 1
19 55313 14002 2003 1
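The same loop can be wrapped in a function so it is easier to test on a small case. This is only a sketch of the technique above: merge_adjacent_days is a hypothetical name and the six-row data frame is made up.

```r
# Hypothetical wrapper around the loop above: for each day, absorb the
# Count of the day before and the day after, then drop those rows.
merge_adjacent_days <- function(tt) {
  for (d in unique(tt$Day)) {
    # A day may already have been removed as a neighbour of an earlier day
    if (!any(tt$Day == d)) next
    for (nb in c(d - 1, d + 1)) {
      is_nb <- tt$Day == nb
      if (any(is_nb)) {
        tt$Count[tt$Day == d] <- tt$Count[tt$Day == d] + mean(tt$Count[is_nb])
        tt <- tt[!is_nb, ]
      }
    }
  }
  tt
}

# Made-up example: day 10 has neighbours 9 and 11; days 20 and 30 have none
small <- data.frame(ID    = 1:6,
                    Day   = c(10, 10, 9, 11, 20, 30),
                    Count = c(2, 2, 1, 1, 1, 1))
out <- merge_adjacent_days(small)
```

Days are processed in order of first appearance, so the result depends on how the rows are sorted. With the data sorted by decreasing Count, the busiest day of each cluster is visited first and absorbs its neighbours.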

How can I drop observations within a group following the occurrence of NA?

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to apply functions by group using plyr, and I have also been able to create a 0-1 column, where 0 indicates that assets has a valid entry and 1 marks a missing value (NA).
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
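To see why the cumsum(is.na(assets)) == 0 condition works, here is a small base R sketch. It first walks through one group by hand and then applies the same idea to the whole data frame with ave. The names one_group, na_so_far and kept are illustrative.

```r
# One group from the example (productreference 5)
one_group <- c(23, NA, 14, NA)

# Running count of NAs seen so far: 0 1 1 2
na_so_far <- cumsum(is.na(one_group))

# Rows before the first NA are exactly those where the count is still 0
one_group[na_so_far == 0]

# The same idea on the full example, grouped with ave()
mydf <- data.frame(
  productreference = c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5),
  Year   = c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,
             1998,1999,2000,2000,2001,2002,2003),
  assets = c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA))

kept <- mydf[ave(as.integer(is.na(mydf$assets)),
                 mydf$productreference, FUN = cumsum) == 0, ]
```

Because the running count never goes back to 0 after the first NA in a group, every later row of that group is dropped, including rows with valid values.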
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)]
else .SD, by = productreference] # seq_len() returns no rows when the first value is NA
You can also do it using base R and a for loop. This code is a bit longer than the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in unique(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2

Merging data frames with different number of rows and different columns

I have two data frames with different number of columns and rows. I want to combine them into one data frame.
> month.saf
Name NCDC Year Month Day HrMn Temp Q
244 AP 99999 2014 2 1 0 12 1
245 AP 99999 2014 2 1 300 12.2 1
246 AP 99999 2014 2 1 600 14.4 1
247 AP 99999 2014 2 1 900 18.6 1
248 AP 99999 2014 2 1 1200 18 1
249 AP 99999 2014 2 1 1500 13.6 1
250 AP 99999 2014 2 1 1800 11.8 1
251 AP 99999 2014 2 1 2100 10.8 1
252 AP 99999 2014 2 2 0 8.4 1
253 AP 99999 2014 2 2 300 8.6 1
254 AP 99999 2014 2 2 600 19.8 2
255 AP 99999 2014 2 2 900 22.8 1
256 AP 99999 2014 2 2 1200 20.8 1
257 AP 99999 2014 2 2 1500 16.4 1
258 AP 99999 2014 2 2 1800 13.4 1
259 AP 99999 2014 2 2 2100 12.4 1
> T2Mdf
V1 V2
0 293.494262695312 291.642639160156
300 294.003479003906 292.375091552734
600 296.809997558594 295.207885742188
900 298.287811279297 297.181549072266
1200 298.317565917969 297.725708007813
1500 298.134002685547 296.226165771484
1800 296.006805419922 293.354248046875
2100 293.785491943359 293.547210693359
0.1 294.638732910156 293.019866943359
300.1 292.179992675781 291.256958007812
The output that I want is like this:
Name NCDC Year Month Day HrMn Temp Q V1 V2
244 AP 99999 2014 2 1 0 12 1 293.4942627 291.6426392
245 AP 99999 2014 2 1 300 12.2 1 294.003479 292.3750916
246 AP 99999 2014 2 1 600 14.4 1 296.8099976 295.2078857
247 AP 99999 2014 2 1 900 18.6 1 298.2878113 297.1815491
248 AP 99999 2014 2 1 1200 18 1 298.3175659 297.725708
249 AP 99999 2014 2 1 1500 13.6 1 298.1340027 296.2261658
250 AP 99999 2014 2 1 1800 11.8 1 296.0068054 293.354248
251 AP 99999 2014 2 1 2100 10.8 1 293.7854919 293.5472107
252 AP 99999 2014 2 2 0 8.4 1 294.6387329 293.0198669
253 AP 99999 2014 2 2 300 8.6 1 292.1799927 291.256958
254 AP 99999 2014 2 2 600 19.8 2 292.2477417 291.3471069
255 AP 99999 2014 2 2 900 22.8 1 294.2276306 294.2766418
256 AP 99999 2014 2 2 1200 20.8 1 NA NA
257 AP 99999 2014 2 2 1500 16.4 1 NA NA
258 AP 99999 2014 2 2 1800 13.4 1 NA NA
259 AP 99999 2014 2 2 2100 12.4 1 NA NA
I tried cbind but it gives me an error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 216, 220
I also tried rbind.fill(), but it gives me something like:
V1 V2 Name USAF NCDC Year Month Day HrMn I Type QCP Temp Q
1 293.494262695312 291.642639160156 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
2 294.003479003906 292.375091552734 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
3 296.809997558594 295.207885742188 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
4 298.287811279297 297.181549072266 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
5 298.317565917969 297.725708007813 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
6 <NA> <NA> AP 421820 99999 2014 2 1 0 4 FM-12 NA 12 1
7 <NA> <NA> AP 421820 99999 2014 2 1 300 4 FM-12 NA 12.2 1
8 <NA> <NA> AP 421820 99999 2014 2 1 600 4 FM-12 NA 14.4 1
9 <NA> <NA> AP 421820 99999 2014 2 1 900 4 FM-12 NA 18.6 1
10 <NA> <NA> AP 421820 99999 2014 2 1 1200 4 FM-12 NA 18 1
How is it possible to do this in R?
If A and B are the two input data frames, here are some solutions:
1) merge This solution works regardless of whether A or B has more rows.
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
The first two arguments could be replaced with just A and B respectively if A and B have default rownames, i.e. 1, 2, ..., or if they have consistent rownames. That is, merge(A, B, by = 0, all = TRUE)[-1] .
For example, if we have this input:
# test inputs
A <- data.frame(BOD, row.names = letters[1:6])
B <- setNames(2 * BOD[1:2, ], c("X", "Y"))
then:
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
gives:
Time demand X Y
1 1 8.3 2 16.6
2 2 10.3 4 20.6
3 3 19.0 NA NA
4 4 16.0 NA NA
5 5 15.6 NA NA
6 7 19.8 NA NA
1a) An equivalent variation is:
do.call("merge", c(lapply(list(A, B), data.frame, row.names=NULL),
by = 0, all = TRUE))[-1]
2) cbind.zoo This solution assumes that A has more rows and that B's entries are all of the same type, e.g. all numeric. A is not restricted. These conditions hold in the data of the question.
library(zoo)
data.frame(A, cbind(zoo(, 1:nrow(A)), as.zoo(B)))
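A different base R route, not one of the solutions above, is to pad the shorter data frame with NA rows and then use cbind. This is a minimal sketch; cbind_pad is a hypothetical helper, and it relies on the fact that indexing a data frame past its last row returns rows of NA.

```r
# Hypothetical helper: pad both data frames to the same number of rows,
# then bind them column-wise
cbind_pad <- function(a, b) {
  n <- max(nrow(a), nrow(b))
  pad <- function(d) {
    d <- d[seq_len(n), , drop = FALSE]  # rows past nrow(d) come back as NA
    rownames(d) <- NULL                 # drop the original row names
    d
  }
  cbind(pad(a), pad(b))
}

# Same test inputs as above (BOD ships with base R)
A <- data.frame(BOD, row.names = letters[1:6])
B <- setNames(2 * BOD[1:2, ], c("X", "Y"))
out <- cbind_pad(A, B)
```

Like the merge solution, this pairs rows by position and works whichever input is longer, and the rows come back in their original order.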

In R, sum over all rows above a given row and restarting at new ID?

The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 58
3 1998 10 10
3 1999 12 22
The number of years and the specific years vary for each ID.
I have a feeling that it's some advanced options in ddply but I have not been able to find the answer. I've also tried working with for/while loops but since these are dreadfully slow in R and my data-set is large, it's not working all that well.
Thanks in advance!
You can use the cumsum function and apply it with ave to all subgroups.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
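For a self-contained version, here is a sketch that rebuilds the example data (the answer assumes a data frame called dat already exists) and sorts it first, since cumsum depends on row order within each ID.

```r
# Rebuild the example data from the question
dat <- data.frame(ID    = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
                  Year  = c(1999, 2000, 2001, 2002, 2000, 2001, 2002, 1998, 1999),
                  Score = c(10, 11, 14, 22, 19, 17, 22, 10, 12))

# Make sure years are in ascending order within each ID
dat <- dat[order(dat$ID, dat$Year), ]

# Running total of Score, restarting at each new ID
dat$Total <- ave(dat$Score, dat$ID, FUN = cumsum)
```

ave returns a vector of the same length as its input, so the result can be assigned straight back as a new column.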
If your data is large, then ddply will be slow.
data.table is the way to go.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]
