I have two data frames with different numbers of columns and rows, and I want to combine them into one data frame.
> month.saf
Name NCDC Year Month Day HrMn Temp Q
244 AP 99999 2014 2 1 0 12 1
245 AP 99999 2014 2 1 300 12.2 1
246 AP 99999 2014 2 1 600 14.4 1
247 AP 99999 2014 2 1 900 18.6 1
248 AP 99999 2014 2 1 1200 18 1
249 AP 99999 2014 2 1 1500 13.6 1
250 AP 99999 2014 2 1 1800 11.8 1
251 AP 99999 2014 2 1 2100 10.8 1
252 AP 99999 2014 2 2 0 8.4 1
253 AP 99999 2014 2 2 300 8.6 1
254 AP 99999 2014 2 2 600 19.8 2
255 AP 99999 2014 2 2 900 22.8 1
256 AP 99999 2014 2 2 1200 20.8 1
257 AP 99999 2014 2 2 1500 16.4 1
258 AP 99999 2014 2 2 1800 13.4 1
259 AP 99999 2014 2 2 2100 12.4 1
> T2Mdf
V1 V2
0 293.494262695312 291.642639160156
300 294.003479003906 292.375091552734
600 296.809997558594 295.207885742188
900 298.287811279297 297.181549072266
1200 298.317565917969 297.725708007813
1500 298.134002685547 296.226165771484
1800 296.006805419922 293.354248046875
2100 293.785491943359 293.547210693359
0.1 294.638732910156 293.019866943359
300.1 292.179992675781 291.256958007812
The output that I want is like this:
Name NCDC Year Month Day HrMn Temp Q V1 V2
244 AP 99999 2014 2 1 0 12 1 293.4942627 291.6426392
245 AP 99999 2014 2 1 300 12.2 1 294.003479 292.3750916
246 AP 99999 2014 2 1 600 14.4 1 296.8099976 295.2078857
247 AP 99999 2014 2 1 900 18.6 1 298.2878113 297.1815491
248 AP 99999 2014 2 1 1200 18 1 298.3175659 297.725708
249 AP 99999 2014 2 1 1500 13.6 1 298.1340027 296.2261658
250 AP 99999 2014 2 1 1800 11.8 1 296.0068054 293.354248
251 AP 99999 2014 2 1 2100 10.8 1 293.7854919 293.5472107
252 AP 99999 2014 2 2 0 8.4 1 294.6387329 293.0198669
253 AP 99999 2014 2 2 300 8.6 1 292.1799927 291.256958
254 AP 99999 2014 2 2 600 19.8 2 292.2477417 291.3471069
255 AP 99999 2014 2 2 900 22.8 1 294.2276306 294.2766418
256 AP 99999 2014 2 2 1200 20.8 1 NA NA
257 AP 99999 2014 2 2 1500 16.4 1 NA NA
258 AP 99999 2014 2 2 1800 13.4 1 NA NA
259 AP 99999 2014 2 2 2100 12.4 1 NA NA
I tried cbind but it gives me an error:
Error in data.frame(..., check.names = FALSE) : arguments imply
differing number of rows: 216, 220
I also tried rbind.fill(), but it gives me something like this:
V1 V2 Name USAF NCDC Year Month Day HrMn I Type QCP Temp Q
1 293.494262695312 291.642639160156 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
2 294.003479003906 292.375091552734 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
3 296.809997558594 295.207885742188 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
4 298.287811279297 297.181549072266 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
5 298.317565917969 297.725708007813 <NA> NA NA NA NA NA NA NA <NA> NA <NA> NA
6 <NA> <NA> AP 421820 99999 2014 2 1 0 4 FM-12 NA 12 1
7 <NA> <NA> AP 421820 99999 2014 2 1 300 4 FM-12 NA 12.2 1
8 <NA> <NA> AP 421820 99999 2014 2 1 600 4 FM-12 NA 14.4 1
9 <NA> <NA> AP 421820 99999 2014 2 1 900 4 FM-12 NA 18.6 1
10 <NA> <NA> AP 421820 99999 2014 2 1 1200 4 FM-12 NA 18 1
How is it possible to do this in R?
If A and B are the two input data frames, here are some solutions:
1) merge This solution works regardless of whether A or B has more rows.
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
The first two arguments could be replaced with just A and B respectively if A and B have default rownames, i.e. 1, 2, ..., or if they have consistent rownames. That is, merge(A, B, by = 0, all = TRUE)[-1].
For example, if we have this input:
# test inputs
A <- data.frame(BOD, row.names = letters[1:6])
B <- setNames(2 * BOD[1:2, ], c("X", "Y"))
then:
merge(data.frame(A, row.names=NULL), data.frame(B, row.names=NULL),
by = 0, all = TRUE)[-1]
gives:
Time demand X Y
1 1 8.3 2 16.6
2 2 10.3 4 20.6
3 3 19.0 NA NA
4 4 16.0 NA NA
5 5 15.6 NA NA
6 7 19.8 NA NA
1a) An equivalent variation is:
do.call("merge", c(lapply(list(A, B), data.frame, row.names=NULL),
by = 0, all = TRUE))[-1]
2) cbind.zoo This solution assumes that A has more rows and that B's entries are all of the same type, e.g. all numeric. A is not restricted. These conditions hold in the data of the question.
library(zoo)
data.frame(A, cbind(zoo(, 1:nrow(A)), as.zoo(B)))
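A plain base R padding approach can also work here; this is only a minimal sketch and it assumes, like the zoo solution, that A has at least as many rows as B:
# pad B with NA rows so it has as many rows as A, then bind the columns
# (indexing a data frame with out-of-range row numbers yields NA rows)
B_pad <- B[seq_len(nrow(A)), , drop = FALSE]
rownames(B_pad) <- NULL
cbind(data.frame(A, row.names = NULL), B_pad)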
I have the following dataset:
Pet Shop  Year  Item    Price
A         2021  dog     300
A         2021  dog     250
A         2021  fish    20
A         2020  turtle  50
A         2020  dog     250
A         2020  cat     280
A         2019  rabbit  180
A         2019  cat     165
A         2019  cat     270
B         2021  dog     350
B         2021  fish    80
B         2021  fish    70
B         2020  cat     220
B         2020  turtle  90
B         2020  turtle  80
B         2020  fish    55
B         2019  fish    75
C         2021  dog     280
C         2020  cat     260
C         2020  cat     270
C         2019  fish    65
C         2019  cat     270
The code for the data is as follows
Pet_Shop = c(rep("A",9), rep("B",8), rep("C",5))
Year = c(2021,2021,2021,2020,2020,2020,2019,2019,2019,
         2021,2021,2021,2020,2020,2020,2020,2019,
         2021,2020,2020,2019,2019)
Item = c("dog","dog","fish","turtle","dog","cat","rabbit","cat","cat",
         "dog","fish","fish","cat","turtle","turtle","fish","fish",
         "dog","cat","cat","fish","cat")
Price = c(300,250,20,50,250,280,180,165,270,350,80,70,220,90,80,55,75,280,260,270,65,270)
Data = data.frame(Pet_Shop, Year, Item, Price)
Does anyone here know how I can use pivot_wider or spread (or any other method) to achieve the following table? It groups each shop by year and takes the average price of identical items for that shop and year. I'm having trouble incorporating the year.
Pet Shop  Year  dog                     fish  turtle  cat    rabbit
A         2021  Average(300,250) = 275  20    NA      NA     NA
A         2020  250                     NA    50      280    NA
A         2019  NA                      NA    NA      217.5  NA
B         2021  350                     75    NA      NA     NA
B         2020  NA                      55    85      220    NA
B         2019  NA                      75    NA      NA     NA
C         2021  280                     NA    NA      NA     NA
C         2020  NA                      NA    NA      265    NA
C         2019  NA                      60    NA      270    NA
In pivot_wider you may pass a function (values_fn) to be applied to each combination of Pet_Shop and Year.
result <- tidyr::pivot_wider(Data, names_from = Item,
values_from = Price, values_fn = mean)
result
# Pet_Shop Year dog fish turtle cat rabbit
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2021 275 20 NA NA NA
#2 A 2020 250 NA 50 280 NA
#3 A 2019 NA NA NA 218. 180
#4 B 2021 350 75 NA NA NA
#5 B 2020 NA 55 85 220 NA
#6 B 2019 NA 75 NA NA NA
#7 C 2021 280 NA NA NA NA
#8 C 2020 NA NA NA 265 NA
#9 C 2019 NA 65 NA 270 NA
The same can also be done with data.table's dcast:
library(data.table)
dcast(setDT(Data), Pet_Shop + Year ~ Item,
value.var = "Price", fun.aggregate = mean)
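If you would rather use spread, note that it does not aggregate on its own; a quick sketch of one workaround is to average first and then reshape:
library(dplyr)
library(tidyr)
Data %>%
  group_by(Pet_Shop, Year, Item) %>%
  summarise(Price = mean(Price), .groups = "drop") %>%
  spread(Item, Price)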
I have some data in which I divide the mdo value by the count of mdo instances in the previous group.
I am also calculating the sog average.
But I want the sog average to be computed over the same instances as the result (mdo/count) value.
library(dplyr)
library(lubridate)
library(purrr)
df <- tibble(mydate = as.Date(c("2019-05-11 23:01:00", "2019-05-11 23:02:00", "2019-05-11 23:03:00", "2019-05-11 23:04:00",
"2019-05-12 23:05:00", "2019-05-12 23:06:00", "2019-05-12 23:07:00", "2019-05-12 23:08:00",
"2019-05-13 23:09:00", "2019-05-13 23:10:00", "2019-05-13 23:11:00", "2019-05-13 23:12:00",
"2019-05-14 23:13:00", "2019-05-14 23:14:00", "2019-05-14 23:15:00", "2019-05-14 23:16:00",
"2019-05-15 23:17:00", "2019-05-15 23:18:00", "2019-05-15 23:19:00", "2019-05-15 23:20:00",
"2019-05-15 23:21:00", "2019-05-15 23:22:00", "2019-05-15 23:23:00", "2019-05-15 23:24:00",
"2019-05-15 23:25:00")),
mdo = c(1500, 1500, 1500, 1500,
1500, 1500, NA, 0,
0, 0, 900, 900, NA, NA, 1100, 1100,
1100, 200, 200, 200,200,
1100, 1100, 1100, 0
),
sog = c(12, 12, 12, 11, 10,9,
2,8.8, 8.7, 7.8, 11, 11, 12, 11,
9.54, 9.8, 10.4,4, 4, 4.5, 3.6,
7, 8, 9, 0))
df1 <- df %>%
mutate(grp = data.table::rleid(mdo))
df1 <- df1 %>%
#Keep only non-NA value
filter(!is.na(mdo)) %>%
#count occurence of each grp
count(grp, name = 'count') %>%
#Shift the count to the previous group
mutate(count = lag(count)) %>%
#Join with the original data
right_join(df1, by = 'grp') %>%
arrange(grp)
group_mdo <- df1 %>%
select(grp, mdo) %>%
unique() %>%
mutate(prev_mdo = lag(mdo)) %>%
select(-mdo) %>%
tidyr::fill(prev_mdo, .direction = "down")
df1 <- df1 %>%
left_join(group_mdo, by = "grp") %>%
mutate(result = ifelse(prev_mdo != 0, mdo / count, 0)) %>%
mutate(sog_avg = ifelse(prev_mdo != 0, map_dbl(.x = grp - 1, ~ mean(sog[grp == .x], na.rm = TRUE)), NA))
The result right now is:
grp count mydate mdo sog prev_mdo result sog_avg
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 11 NA NA NA
1 NA 2019-05-12 1500 10 NA NA NA
1 NA 2019-05-12 1500 9 NA NA NA
2 NA 2019-05-12 NA 2 1500 NA 11
3 6 2019-05-12 0 8.8 1500 0 2
3 6 2019-05-13 0 8.7 1500 0 2
3 6 2019-05-13 0 7.8 1500 0 2
4 3 2019-05-13 900 11 0 0 NA
4 3 2019-05-13 900 11 0 0 NA
5 NA 2019-05-14 NA 12 900 NA 11
5 NA 2019-05-14 NA 11 900 NA 11
6 2 2019-05-14 1100 9.54 900 550 11.5
6 2 2019-05-14 1100 9.8 900 550 11.5
6 2 2019-05-15 1100 10.4 900 550 11.5
7 3 2019-05-15 200 4 1100 66.7 9.91
7 3 2019-05-15 200 4 1100 66.7 9.91
7 3 2019-05-15 200 4.5 1100 66.7 9.91
7 3 2019-05-15 200 3.6 1100 66.7 9.91
8 4 2019-05-15 1100 7 200 275 4.03
8 4 2019-05-15 1100 8 200 275 4.03
8 4 2019-05-15 1100 9 200 275 4.03
9 3 2019-05-15 0 0 1100 0 8
My desired result:
grp count mydate mdo sog prev_mdo result sog_avg
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 12 NA NA NA
1 NA 2019-05-11 1500 11 NA NA NA
1 NA 2019-05-12 1500 10 NA NA NA
1 NA 2019-05-12 1500 9 NA NA NA
2 NA 2019-05-12 NA 2 1500 NA NA
3 6 2019-05-12 0 8.8 1500 0 0
3 6 2019-05-13 0 8.7 1500 0 0
3 6 2019-05-13 0 7.8 1500 0 0
4 3 2019-05-13 900 11 0 0 0
4 3 2019-05-13 900 11 0 0 0
5 NA 2019-05-14 NA 12 900 NA NA
5 NA 2019-05-14 NA 11 900 NA NA
6 2 2019-05-14 1100 9.54 900 550 11
6 2 2019-05-14 1100 9.8 900 550 11
6 2 2019-05-15 1100 10.4 900 550 11
7 3 2019-05-15 200 4 1100 66.7 9.91
7 3 2019-05-15 200 4 1100 66.7 9.91
7 3 2019-05-15 200 4.5 1100 66.7 9.91
7 3 2019-05-15 200 3.6 1100 66.7 9.91
8 4 2019-05-15 1100 7 200 275 4.03
8 4 2019-05-15 1100 8 200 275 4.03
8 4 2019-05-15 1100 9 200 275 4.03
9 3 2019-05-15 0 0 1100 0 0
Where result is zero, sog_avg should be zero; where result is NA, sog_avg should be NA.
And where result is computed from the previous group's count, sog_avg should be computed from that same previous group's values.
So, for example:
For mdo = 1100, result is 550 because the count in the previous non-null group (mdo value 900) is 2:
1100 / 2 = 550. At this point sog_avg should be (11 + 11) / 2 = 11, because that previous non-null group has 2 values.
Here is a data.table approach. It relies heavily on summarising groups with base table or tapply and then lagging those results. Note that this answer would fail if mdo is not constant throughout a group.
library(data.table)
dt = as.data.table(df)
# consecutive identical mdo values (including runs of NA) form one group
dt[, grp := rleid(mdo)]
# for non-NA rows, assign each row the size of the previous non-NA group
dt[!is.na(mdo),
count := {
cnt = table(grp)
rep(shift(cnt), cnt)
}
]
setcolorder(dt, c("grp", "count", "mydate", "mdo", "sog"))
# previous group's mdo for every row: take the last mdo of each group,
# shift it one group back, repeat per row, and carry it forward past NA groups
dt[,
prev_mdo := {
ord = table(grp)
nafill(rep(shift(mdo[cumsum(ord)]), ord), "locf")
}
]
dt[, result := fifelse(prev_mdo != 0L, mdo / count, 0)]
# average sog per group, shift it one group back, and spread it over the rows
dt[!is.na(result),
sog_avg := {
mn = tapply(sog, grp, mean)
rep(shift(mn), table(grp))
}]
# where result is 0 or NA, sog_avg should mirror it
dt[result == 0 | is.na(result), sog_avg := result]
dt
#> grp count mydate mdo sog prev_mdo result sog_avg
#> 1: 1 NA 2019-05-11 1500 12.00 NA NA NA
#> 2: 1 NA 2019-05-11 1500 12.00 NA NA NA
#> 3: 1 NA 2019-05-11 1500 12.00 NA NA NA
#> 4: 1 NA 2019-05-11 1500 11.00 NA NA NA
#> 5: 1 NA 2019-05-12 1500 10.00 NA NA NA
#> 6: 1 NA 2019-05-12 1500 9.00 NA NA NA
#> 7: 2 NA 2019-05-12 NA 2.00 1500 NA NA
#> 8: 3 6 2019-05-12 0 8.80 1500 0.00000 0.000000
#> 9: 3 6 2019-05-13 0 8.70 1500 0.00000 0.000000
#> 10: 3 6 2019-05-13 0 7.80 1500 0.00000 0.000000
#> 11: 4 3 2019-05-13 900 11.00 0 0.00000 0.000000
#> 12: 4 3 2019-05-13 900 11.00 0 0.00000 0.000000
#> 13: 5 NA 2019-05-14 NA 12.00 900 NA NA
#> 14: 5 NA 2019-05-14 NA 11.00 900 NA NA
#> 15: 6 2 2019-05-14 1100 9.54 900 550.00000 11.000000
#> 16: 6 2 2019-05-14 1100 9.80 900 550.00000 11.000000
#> 17: 6 2 2019-05-15 1100 10.40 900 550.00000 11.000000
#> 18: 7 3 2019-05-15 200 4.00 1100 66.66667 9.913333
#> 19: 7 3 2019-05-15 200 4.00 1100 66.66667 9.913333
#> 20: 7 3 2019-05-15 200 4.50 1100 66.66667 9.913333
#> 21: 7 3 2019-05-15 200 3.60 1100 66.66667 9.913333
#> 22: 8 4 2019-05-15 1100 7.00 200 275.00000 4.025000
#> 23: 8 4 2019-05-15 1100 8.00 200 275.00000 4.025000
#> 24: 8 4 2019-05-15 1100 9.00 200 275.00000 4.025000
#> 25: 9 3 2019-05-15 0 0.00 1100 0.00000 0.000000
#> grp count mydate mdo sog prev_mdo result sog_avg
I've got a list with more than 5000 elements and I want to save them in a .csv data frame with a specific layout.
library(XML)
url <- "http://www.omie.es/aplicaciones/datosftp/datosftp.jsp?path=/marginalpdbc/"
doc <- htmlParse(url)
links <- xpathSApply(doc, "//a/@href")
free(doc)
head(links)
wanted <- links[grepl("http*", links)]
head(wanted)
GetMe <- paste("", wanted, sep = "")
datos<-lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = F, sep = ";", as.is = TRUE,skip=1))
This gives me 7 variables with 25 rows in each list element.
V1 V2 V3 V4 V5 V6 V7
1 1999 1 1 1 3.350 0.02030303 NA
2 1999 1 1 2 3.595 0.02178788 NA
3 1999 1 1 3 3.293 0.01995758 NA
4 1999 1 1 4 2.800 0.01696970 NA
5 1999 1 1 5 2.516 0.01524848 NA
6 1999 1 1 6 2.516 0.01524848 NA
7 1999 1 1 7 2.516 0.01524848 NA
8 1999 1 1 8 2.516 0.01524848 NA
9 1999 1 1 9 2.516 0.01524848 NA
10 1999 1 1 10 2.516 0.01524848 NA
11 1999 1 1 11 2.516 0.01524848 NA
12 1999 1 1 12 2.840 0.01721212 NA
13 1999 1 1 13 2.840 0.01721212 NA
14 1999 1 1 14 3.595 0.02178788 NA
15 1999 1 1 15 3.586 0.02173333 NA
16 1999 1 1 16 2.840 0.01721212 NA
17 1999 1 1 17 2.840 0.01721212 NA
18 1999 1 1 18 2.840 0.01721212 NA
19 1999 1 1 19 4.172 0.02528485 NA
20 1999 1 1 20 3.639 0.02205455 NA
21 1999 1 1 21 3.661 0.02218788 NA
22 1999 1 1 22 3.661 0.02218788 NA
23 1999 1 1 23 3.661 0.02218788 NA
24 1999 1 1 24 3.638 0.02204848 NA
25 * NA NA NA NA NA NA
I want to have them all in the same data frame with the following layout:
FECHA AÑO MES DIASEM DIA H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12 H13 H14 H15
01/01/2003 2003 1 M 1 15 10.97 8.22 5.24 2.65 2.13 2.06 0.02 0 0 0.77 2.1 3.5 5.33 6.33
02/01/2003 2003 1 J 2 8.33 4.2 2.87 2.63 2.56 2.56 3.51 5.15 10 17.17 20 21.02 21.02 20 17.62
03/01/2003 2003 1 V 3 14.27 9.47 5.08 3.57 3.01 3.01 4.61 9.41 12.83 16.27 17.62 19.66 19.6 17.62 16.2
Here V1 is the year, V2 the month, V3 the day, V4 the hour, and V6 holds the value for each row.
In the final data frame each hour has to be its own column.
Thanks for your help!
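One way to get most of the way there, sketched under the assumption that every element of datos has the layout shown above and that each date appears in only one file (this does not build the FECHA or DIASEM columns, which could be derived afterwards with as.Date and weekdays):
library(dplyr)
library(tidyr)
wide <- datos %>%
  lapply(function(d) {
    d <- d[d$V1 != "*", ]                      # drop the trailing "*" row
    data.frame(year  = as.integer(d$V1),
               month = as.integer(d$V2),
               day   = as.integer(d$V3),
               hour  = paste0("H", as.integer(d$V4)),
               value = as.numeric(d$V6))
  }) %>%
  bind_rows() %>%
  pivot_wider(names_from = hour, values_from = value)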
My data takes the following form:
df <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("USA", 16)),
Quarter=rep(1:8,2),Income=20:35)
df2 <- data.frame(Sector=c(rep("A",8),rep("B",8)), Country = c(rep("UK", 16)),
Quarter=rep(1:8,2),Income=32:47)
df <- rbind(df, df2)
What I want to do is calculate the growth rate from the first quarter of one year to the first quarter of the next year, within country and sector. In the example above it would be the growth rate from quarter 1 to quarter 5. So for Sector A in the USA it would be (24/20)-1 = 0.2.
I then want to append this data to the dataframe as a new column.
I looked at the solutions in:
How calculate growth rate in long format data frame?
But I didn't have the R skills to get it to work when the lag is more than one time unit. Any suggestions?
ADDITION
So what I want is the growth rate, that is (24/20)-1 = 0.2 in the example below, not 1-(24/20), which I first wrote. The desired output should look something like this:
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2
6 A USA 6 25 0.1904
7 A USA 7 26 0.1818
I think you need something like this:
library(dplyr)
df %>%
#group by sector and country
group_by(Sector, Country) %>%
#calculate growth as (Income / Income lagged 4 quarters) - 1
mutate(growth = Income / lag(Income, 4) - 1)
Output
Source: local data frame [32 x 5]
Groups: Sector, Country [4]
Sector Country Quarter Income growth
(fctr) (fctr) (int) (int) (dbl)
1 A USA 1 20 NA
2 A USA 2 21 NA
3 A USA 3 22 NA
4 A USA 4 23 NA
5 A USA 5 24 0.2000000
6 A USA 6 25 0.1904762
7 A USA 7 26 0.1818182
8 A USA 8 27 0.1739130
9 B USA 1 28 NA
10 B USA 2 29 NA
.. ... ... ... ... ...
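The same grouped lag can also be written with data.table's shift, if you prefer (a minimal sketch on the same df):
library(data.table)
setDT(df)[, growth := Income / shift(Income, 4) - 1, by = .(Sector, Country)]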
Another answer takes a merge-based approach: shift the quarters by four, join the shifted copy back onto the data, and compare the two Income columns.
library(data.table)  # for copy()
df3 = copy(df)
df3$Quarter = df3$Quarter - 4
df = merge(df, df3, c('Sector','Country','Quarter'), suffixes = c('','_prev'), all.x = T)
df$growth = 1 - (df$Income_prev/df$Income)
> df
Sector Country Quarter Income Income_prev growth
1 A USA 1 20 24 -4
2 A USA 2 21 25 -4
3 A USA 3 22 26 -4
4 A USA 4 23 27 -4
5 A USA 5 24 NA NA
6 A USA 6 25 NA NA
7 A USA 7 26 NA NA
8 A USA 8 27 NA NA
9 A UK 1 32 36 -4
10 A UK 2 33 37 -4
11 A UK 3 34 38 -4
12 A UK 4 35 39 -4
13 A UK 5 36 NA NA
14 A UK 6 37 NA NA
15 A UK 7 38 NA NA
16 A UK 8 39 NA NA
17 B USA 1 28 32 -4
18 B USA 2 29 33 -4
19 B USA 3 30 34 -4
20 B USA 4 31 35 -4
21 B USA 5 32 NA NA
22 B USA 6 33 NA NA
23 B USA 7 34 NA NA
24 B USA 8 35 NA NA
25 B UK 1 40 44 -4
26 B UK 2 41 45 -4
27 B UK 3 42 46 -4
28 B UK 4 43 47 -4
29 B UK 5 44 NA NA
30 B UK 6 45 NA NA
31 B UK 7 46 NA NA
32 B UK 8 47 NA NA
I'm new to R. I was looking for similar questions but was not able to find one that fixes mine; any help would be appreciated.
I have a data frame M:
date value
1 182-2002-01-01 23.95
2 182-2002-01-02 17.47
3 182-2002-01-03 NA
4 183-2002-01-01 NA
5 183-2002-01-02 5.50
6 183-2002-01-03 17.02
What I need to do is: if there are fewer than 5 NAs in a row, I will just repeat the previous number (17.47), and if there are 5 or more NAs in a row, I will need to delete the whole month.
I tried the rle function many times, but it didn't work. Many thanks for your help.
I'm going to adjust your question a little bit for the purposes of demonstration.
I'm going to use a similar dataset to yours, but for 2 NAs in a row. This generalises to 5 very easily, don't worry. I'm also going to use a data set that better demonstrates the solution.
So first, here is how to get your data to look like what I'm going to use:
library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")),
value=M$value)
Now that that's out of the way, this is the data I'm going to work with:
set.seed(1234)
M2<-expand.grid(ID=182, year=2002:2004, month=1:2, day=1:3, KEEP.OUT.ATTRS=FALSE)
M2 <- M2[with(M2, order(year, month, day, ID)),] #sort the data
M2$value <- sample(c(NA, rnorm(100)), nrow(M2),
prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
M2
ID year month day value
1 182 2002 1 1 -0.5012581
7 182 2002 1 2 1.1022975
13 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
10 182 2002 2 2 1.1022975
16 182 2002 2 3 -1.2519859
2 182 2003 1 1 NA
8 182 2003 1 2 NA
14 182 2003 1 3 NA
5 182 2003 2 1 0.9729168
11 182 2003 2 2 0.9594941
17 182 2003 2 3 NA
3 182 2004 1 1 NA
9 182 2004 1 2 -1.1088896
15 182 2004 1 3 0.9594941
6 182 2004 2 1 -0.4027320
12 182 2004 2 2 -0.0151383
18 182 2004 2 3 -1.0686427
First, we're going to remove all cases where, within a month, there are 2 or more NAs in a row:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$value))
if(any(runs$lengths[runs$values] >= maxlen)) NULL else x
}
library(plyr)
rem <- ddply(M2, .(ID, year, month), NA_run, 2)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 NA
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
You can see that the months containing two NAs in a row have been removed. The run that remains is there because its NAs belong to two different months. Now we're going to fill in the remaining NAs. The na.rm=FALSE argument is there to keep NAs if they're right at the beginning (which is what you want, I think).
library(zoo)
rem$value <- na.locf(rem$value, na.rm=FALSE)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 0.9594941
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
Now all you need to do to make this 5 or more with your data is to change the value of the maxlen argument in NA_run to 5.
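With the real data (after the colsplit step above), that call would look something like this:
rem <- ddply(M2, .(ID, year, month), NA_run, 5)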
EDIT: Alternatively, if you don't want values to copy over from previous months:
library(zoo)
rem$value <- ddply(rem, .(ID, year, month), summarise,
value=na.locf(value, na.rm=FALSE))$value
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
I'd do this in two steps (a rough sketch follows below):
1) An rle-, rollapply-, or shift-based strategy to fill in the small gaps (fewer than 5 NAs in a row).
2) A by-, aggregate-, or ddply-based strategy to take any month with NAs remaining after step 1 and make the whole month NA.
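A rough sketch of those two steps, assuming the M2 data frame from above and using zoo's na.locf with a maxgap in place of a hand-rolled rle:
library(zoo)
library(dplyr)
M3 <- M2 %>%
  group_by(ID, year, month) %>%
  # step 1: fill gaps of fewer than 5 NAs in a row with the last observation
  mutate(value = na.locf(value, na.rm = FALSE, maxgap = 4)) %>%
  # step 2: any month that still contains an NA becomes all NA
  mutate(value = if (anyNA(value)) NA_real_ else value) %>%
  ungroup()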