I have got a data.frame with 4 columns and ca. 6000 rows.
The columns are:
ID,
Day (the number of days since the first day of the time period considered),
Year (the year in which the observation was recorded) and
Count (the number of observations pooled, across all the IDs, on that particular day).
df = read.table(text = 'ID Day Year Count
33012 12448 2001 46
35004 12448 2001 46
35008 12448 2001 46
37006 12448 2001 46
21009 4835 1980 44
24005 4835 1980 44
27001 4835 1980 44
27002 4835 1980 44
25005 5569 1982 34
29001 5569 1982 34
29002 5569 1982 34
30003 5569 1982 34', header = TRUE)
I need to create a three-day time window and loop over each Day, counting the number of observations in that window.
For example, starting from Day 12448 ("day 0"), I need to check the whole data frame for observations recorded on Day 12447 (the day before, "day -1") and Day 12449 (the day after, "day +1").
In other words, taking di = 12448, does any "di + 1" and/or "di - 1" exist in the data frame?
If so, I have to delete the "di + 1" and/or "di - 1" rows from the data frame to avoid overlap, and add their "Count" values to the "Count" of "di".
Do you have any hints that could help me write the for loop?
@thepule, thanks a lot. I tried to run your code on my dataset. So I created a vector with all the days in the column "Day"
days <- unique(df$Day)
and adjusted the for loop accordingly, but it doesn't work: I obtain very low values in the column Count.
Where is the mistake?
Here is an example of my data frame:
df = read.table(text ='ID Day Year Count
33012 12448 2001 5
35004 12448 2001 5
35008 12448 2001 5
37006 12448 2001 5
37008 12448 2001 5
27900 12800 2002 4
27987 12800 2002 4
27123 12800 2002 4
27341 12800 2002 4
56124 14020 2003 3
12874 14020 2003 3
11447 14020 2003 3
11231 12447 2001 2
31879 12447 2001 2
56784 12449 2001 1
64148 12799 2002 1
45613 12801 2001 1
77632 10324 1991 1
55313 14002 2003 1
11667 14019 2003 1', header = TRUE)
My output, after the for loop, should be:
ID Day Year Count
1 33012 12448 2001 8
2 35004 12448 2001 8
3 35008 12448 2001 8
4 37006 12448 2001 8
5 37008 12448 2001 8
6 27900 12800 2002 6
7 27987 12800 2002 6
8 27123 12800 2002 6
9 27341 12800 2002 6
10 56124 14020 2003 4
11 12874 14020 2003 4
12 11447 14020 2003 4
13 77632 10324 1991 1
14 55313 14002 2003 1
N.B. each ID has at most one observation per year.
N.B. the Count column is sorted in decreasing order (decreasing = TRUE).
Updated answer:
# Create data frame
tt <- read.table(text = "
ID Day Year Count
33012 12448 2001 5
35004 12448 2001 5
35008 12448 2001 5
37006 12448 2001 5
37008 12448 2001 5
27900 12800 2002 4
27987 12800 2002 4
27123 12800 2002 4
27341 12800 2002 4
56124 14020 2003 3
12874 14020 2003 3
11447 14020 2003 3
11231 12447 2001 2
31879 12447 2001 2
56784 12449 2001 1
64148 12799 2002 1
45613 12801 2001 1
77632 10324 1991 1
55313 14002 2003 1
11667 14019 2003 1", header= T)
# Vector of day targets you want to repeat the procedure for
targets <- unique(tt$Day)
for (i in targets) {
temp <- tt$Count[tt$Day == i]
if (length(temp) > 0) {
condition <- tt$Day == i - 1
if(any(condition)) {
tt$Count[tt$Day == i] <- mean(tt$Count[condition]) + tt$Count[tt$Day == i]
tt <- tt[!condition,]
}
condition2 <- tt$Day == i + 1
if(any(condition2)) {
tt$Count[tt$Day == i] <- mean(tt$Count[condition2]) + tt$Count[tt$Day == i]
tt <- tt[!condition2,]
}
}
}
Output:
tt
ID Day Year Count
1 33012 12448 2001 8
2 35004 12448 2001 8
3 35008 12448 2001 8
4 37006 12448 2001 8
5 37008 12448 2001 8
6 27900 12800 2002 6
7 27987 12800 2002 6
8 27123 12800 2002 6
9 27341 12800 2002 6
10 56124 14020 2003 4
11 12874 14020 2003 4
12 11447 14020 2003 4
18 77632 10324 1991 1
19 55313 14002 2003 1
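The same merging logic can be sketched as a small reusable function. This is only an illustration under stated assumptions: the helper name merge_adjacent_days and the toy data are hypothetical, Count is assumed to be constant within a Day, and days are processed in the order in which they first appear (so rows should be sorted by decreasing Count, as in the question).

```r
# Toy data (hypothetical values): one row per day to keep the example short
d <- data.frame(
  Day   = c(12448, 12800, 12447, 12449, 12799, 10324),
  Count = c(5, 4, 2, 1, 1, 1)
)

merge_adjacent_days <- function(d) {
  for (i in unique(d$Day)) {
    if (!any(d$Day == i)) next      # day was already absorbed by a neighbour
    for (nb in c(i - 1, i + 1)) {
      hit <- d$Day == nb
      if (any(hit)) {
        # Count is constant within a day, so the first value is enough
        d$Count[d$Day == i] <- d$Count[d$Day == i] + d$Count[hit][1]
        d <- d[!hit, , drop = FALSE]
      }
    }
  }
  d
}

res <- merge_adjacent_days(d)
res
```

Skipping days that have already been absorbed avoids touching a neighbour twice, which is the role of the length check in the loop above.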
I'm trying to calculate the compound annual growth rate of my data (snippet shown below). Does anyone know the best way to do this, or whether there is a function that does part of the job?
Data (only worried about the preds column here; the others can be ignored):
year month timestep ymin ymax preds date
1 1998 1 1 17.84037 18.58553 18.21295 1998-01-01
2 1998 2 2 17.05009 17.70642 17.37826 1998-02-01
3 1998 3 3 16.97067 17.61320 17.29193 1998-03-01
4 1998 4 4 18.38551 19.00838 18.69695 1998-04-01
5 1998 5 5 21.39082 21.97338 21.68210 1998-05-01
6 1998 6 6 24.77679 25.35464 25.06571 1998-06-01
7 1998 7 7 27.27057 27.82818 27.54938 1998-07-01
8 1998 8 8 28.24703 28.76702 28.50702 1998-08-01
9 1998 9 9 27.72370 28.24619 27.98494 1998-09-01
10 1998 10 10 25.83783 26.33969 26.08876 1998-10-01
11 1998 11 11 22.94968 23.42268 23.18618 1998-11-01
12 1998 12 12 19.50499 20.05466 19.77982 1998-12-01
13 1999 1 13 17.98323 18.50530 18.24426 1999-01-01
14 1999 2 14 17.20124 17.61746 17.40935 1999-02-01
15 1999 3 15 17.11064 17.53492 17.32278 1999-03-01
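One common definition of the compound annual growth rate is (last / first)^(1 / years) - 1. A minimal sketch of that formula follows; the helper cagr() and the data frame name dat are hypothetical (neither appears in the question), and comparing the same month in two different years is an assumption made here to keep the seasonal swing out of the result.

```r
# Hypothetical helper: compound annual growth rate between two values
cagr <- function(first, last, n_years) {
  (last / first)^(1 / n_years) - 1
}

# Rows 1 and 13 from the table above: January 1998 and January 1999
dat <- data.frame(
  date  = as.Date(c("1998-01-01", "1999-01-01")),
  preds = c(18.21295, 18.24426)
)

# Elapsed time in years between the first and last date
n_years <- as.numeric(diff(range(dat$date))) / 365.25
growth  <- cagr(dat$preds[1], dat$preds[nrow(dat)], n_years)
growth
```

With the full series, the first and last rows (or the first and last yearly means) would take the place of the two rows used here.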
I have a time-series dataset with yearly values for 30 years for >200,000 study units. All units start with the same value, 'healthy==1', and can transition to 3 other classes: 'exposed==2', 'infected==3' and 'recover==4'; some units also remain 'healthy' throughout the time series. The dataset is in long format.
I would like to transform the dataset so that it keeps all 30 years for each unit but is collapsed to only 'healthy==1' and 'infected==3', i.e. I would classify 'exposed==2' as 'healthy==1', and the first time a 'healthy' unit gets 'infected==3', it remains infected for the rest of the time series even though it might 'recover==4' or change state again (get infected and recover again).
Healthy units that never transition to another class will remain classified as healthy throughout the time series.
I am kind of stumped on how to code this in R; any ideas would be greatly appreciated.
Example of the dataset for two units; one remains healthy throughout the time series and the other has multiple transitions.
UID annual_change_val year
1 control1 1 1990
4 control1 1 1991
5 control1 1 1992
7 control1 1 1993
9 control1 1 1994
12 control1 1 1995
13 control1 1 1996
16 control1 1 1997
18 control1 1 1998
20 control1 1 1999
22 control1 1 2000
24 control1 1 2001
26 control1 1 2002
28 control1 1 2003
30 control1 1 2004
31 control1 1 2005
33 control1 1 2006
35 control1 1 2007
38 control1 1 2008
40 control1 1 2009
42 control1 1 2010
44 control1 1 2011
46 control1 1 2012
48 control1 1 2013
50 control1 1 2014
52 control1 1 2015
53 control1 1 2016
55 control1 1 2017
57 control1 1 2018
59 control1 1 2019
61 control1 1 2020
2 control64167 1 1990
3 control64167 1 1991
6 control64167 1 1992
8 control64167 2 1993
10 control64167 2 1994
11 control64167 2 1995
14 control64167 2 1996
15 control64167 2 1997
17 control64167 3 1998
19 control64167 3 1999
21 control64167 4 2000
23 control64167 4 2001
25 control64167 4 2002
27 control64167 4 2003
29 control64167 3 2004
32 control64167 4 2005
34 control64167 4 2006
36 control64167 4 2007
37 control64167 4 2008
39 control64167 4 2009
41 control64167 4 2010
43 control64167 4 2011
45 control64167 4 2012
47 control64167 4 2013
49 control64167 4 2014
51 control64167 4 2015
54 control64167 4 2016
56 control64167 4 2017
58 control64167 4 2018
60 control64167 4 2019
62 control64167 4 2020
If for some reason you only want to use base R,
df$annual_change_val[df$annual_change_val == 2] <- 1
df$annual_change_val[df$annual_change_val == 4] <- 3
The first line means: take the annual_change_val column from ($) dataframe df, subset it ([) so that you're only left with values equal to 2, and re-assign (<-) to those a value of 1 instead. Similarly for the second line.
Update, based on comment/clarification.
Here, I recode the values as before, and then I create a temporary variable called first_inf which holds the first year in which the UID was "infected" (status=3), or Inf if the UID was never infected. I then set the status to 3 for every year from that year onwards (within UID), so units that were never infected stay healthy.
library(dplyr)
d %>%
mutate(status = if_else(annual_change_val %in% c(1,2),1,3)) %>%
group_by(UID) %>%
mutate(first_inf = min(c(year[status==3], Inf)),
status = if_else(year>=first_inf,3,status)) %>%
ungroup() %>%
select(!first_inf)
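The "once infected, always infected" rule can also be sketched in base R with a running maximum per unit. This is a minimal illustration on hypothetical toy data (the UIDs "a" and "b" and their values are made up), assuming the rows can be sorted by year within each UID.

```r
# Toy data (hypothetical): unit "a" stays healthy, unit "b" changes state
d <- data.frame(
  UID  = rep(c("a", "b"), each = 6),
  year = rep(1990:1995, times = 2),
  annual_change_val = c(1, 1, 1, 1, 1, 1,
                        1, 2, 3, 4, 1, 3)
)

# Collapse to two classes: 1/2 -> healthy (1), 3/4 -> infected (3)
d$status <- ifelse(d$annual_change_val %in% c(1, 2), 1, 3)

# Sort by unit and year so the running maximum follows time order
d <- d[order(d$UID, d$year), ]

# Within each unit, cummax() keeps status at 3 once it has been reached
d$status <- ave(d$status, d$UID, FUN = cummax)
d
```

Because the two remaining classes are coded 1 and 3, the running maximum can never drop back to 1 after the first infection.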
You can simply change the values from 2 to 1, and from 4 to 3, as Andrea mentioned in the comments. If d is your data, then
library(dplyr)
d %>% mutate(status = if_else(annual_change_val %in% c(1,2),1,3))
library(data.table)
setDT(d)[, status:=fifelse(annual_change_val %in% c(1,2),1,3)]
I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.
Here is an example:
productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf
# productreference Year assets
# 1 1 2000 2
# 2 1 2001 3
# 3 1 2002 NA
# 4 1 2003 2
# 5 2 1999 34
# 6 2 2000 NA
# 7 2 2001 45
# 8 3 2005 1
# 9 3 2006 23
# 10 3 2007 34
# 11 3 2008 56
# 12 4 1998 56
# 13 4 1999 67
# 14 4 2000 23
# 15 5 2000 23
# 16 5 2001 NA
# 17 5 2002 14
# 18 5 2003 NA
I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.
mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1
I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!
I believe this is what you want:
library(dplyr)
group_by(mydf, productreference) %>%
filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
#
# productreference Year assets
# (dbl) (dbl) (dbl)
# 1 1 2000 2
# 2 1 2001 3
# 3 2 1999 34
# 4 3 2005 1
# 5 3 2006 23
# 6 3 2007 34
# 7 3 2008 56
# 8 4 1998 56
# 9 4 1999 67
# 10 4 2000 23
# 11 5 2000 23
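To see why this filter works, it may help to look at the intermediate vector for a single group. A small sketch using the assets values of product 1 from the question:

```r
# assets for productreference 1
assets <- c(2, 3, NA, 2)

# Running count of NAs seen so far: 0 until the first NA, >= 1 afterwards
na_so_far <- cumsum(is.na(assets))
na_so_far

# Rows to keep: everything strictly before the first NA
keep <- na_so_far == 0
assets[keep]
```

The running count never decreases, so every row from the first NA onwards is dropped, including later rows that have valid values.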
Here is the same approach using data.table:
library(data.table)
dt <- as.data.table(mydf)
dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]
# productreference Year assets nas
# 1: 1 2000 2 0
# 2: 1 2001 3 0
# 3: 2 1999 34 0
# 4: 3 2005 1 0
# 5: 3 2006 23 0
# 6: 3 2007 34 0
# 7: 3 2008 56 0
# 8: 4 1998 56 0
# 9: 4 1999 67 0
#10: 4 2000 23 0
#11: 5 2000 23 0
Here is a base R option
mydf[unsplit(lapply(split(mydf, mydf$productreference),
function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]
# productreference Year assets
#1 1 2000 2
#2 1 2001 3
#5 2 1999 34
#8 3 2005 1
#9 3 2006 23
#10 3 2007 34
#11 3 2008 56
#12 4 1998 56
#13 4 1999 67
#14 4 2000 23
#15 5 2000 23
Or an option with data.table
library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq_len(which(is.na(assets))[1]-1)]
else .SD, by = productreference]
You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference, and for every subset we look for the first NA in assets and exclude that row and all following rows.
mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
s1 <- mydf[mydf$productreference==i,]
s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
mydf2 <- rbind(mydf2, s2)
mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2
The following is what I have:
ID Year Score
1 1999 10
1 2000 11
1 2001 14
1 2002 22
2 2000 19
2 2001 17
2 2002 22
3 1998 10
3 1999 12
The following is what I would like to do:
ID Year Score Total
1 1999 10 10
1 2000 11 21
1 2001 14 35
1 2002 22 57
2 2000 19 19
2 2001 17 36
2 2002 22 58
3 1998 10 10
3 1999 12 22
The amount of years and the specific years vary for each Id.
I have a feeling that it's some advanced options in ddply but I have not been able to find the answer. I've also tried working with for/while loops but since these are dreadfully slow in R and my data-set is large, it's not working all that well.
Thanks in advance!
You can use the cumsum function and apply it with ave to all subgroups.
transform(dat, Total = ave(Score, ID, FUN = cumsum))
ID Year Score Total
1 1 1999 10 10
2 1 2000 11 21
3 1 2001 14 35
4 1 2002 22 57
5 2 2000 19 19
6 2 2001 17 36
7 2 2002 22 58
8 3 1998 10 10
9 3 1999 12 22
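The snippet above assumes the question's table is stored in a data frame called dat. A self-contained sketch might look like the following; the explicit sort is an addition made here, since cumsum follows row order and the totals only make sense if the years are in ascending order within each ID.

```r
# The data from the question
dat <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  Year  = c(1999, 2000, 2001, 2002, 2000, 2001, 2002, 1998, 1999),
  Score = c(10, 11, 14, 22, 19, 17, 22, 10, 12)
)

# Make sure rows are in time order within each ID
dat <- dat[order(dat$ID, dat$Year), ]

# Running total of Score within each ID
dat$Total <- ave(dat$Score, dat$ID, FUN = cumsum)
dat
```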
If your data is large, ddply is likely to be slow.
data.table is usually a faster option.
library(data.table)
DT <- data.table(dat)
# create your desired column in `DT`
DT[, agg.Score := cumsum(Score), by = ID]
I'm new to R, and I was looking for similar questions, but was not able to find one to fix mine, any help would be appreciated.
I have a data frame M:
date value
1 182-2002-01-01 23.95
2 182-2002-01-02 17.47
3 182-2002-01-03 NA
4 183-2002-01-01 NA
5 183-2002-01-02 5.50
6 183-2002-01-03 17.02
What I need to do is: if there are fewer than 5 consecutive NAs, I will just repeat the previous number (17.47), and if there are more than 5 NAs in a row, I will need to delete the whole month.
I tried the rle function many times, but it didn't work. Many thanks for your help.
I'm going to adjust your question a little bit for the purposes of demonstration.
I'm going to use a dataset similar to yours, but with a threshold of 2 NAs in a row. This generalises to 5 very easily, don't worry. I'm also going to use a data set that better demonstrates the solution.
So first, how to get your data to look like what I'm going to use:
library(reshape)
M2<-data.frame(colsplit(M$date, "-", c("ID", "year", "month", "day")),
value=M$value)
Now that's out of the road, this is the data I'm going to work with:
set.seed(1234)
M2<-expand.grid(ID=182, year=2002:2004, month=1:2, day=1:3, KEEP.OUT.ATTRS=FALSE)
M2 <- M2[with(M2, order(year, month, day, ID)),] #sort the data
M2$value <- sample(c(NA, rnorm(100)), nrow(M2),
prob=c(0.5, rep(0.5/100, 100)), replace=TRUE)
M2
ID year month day value
1 182 2002 1 1 -0.5012581
7 182 2002 1 2 1.1022975
13 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
10 182 2002 2 2 1.1022975
16 182 2002 2 3 -1.2519859
2 182 2003 1 1 NA
8 182 2003 1 2 NA
14 182 2003 1 3 NA
5 182 2003 2 1 0.9729168
11 182 2003 2 2 0.9594941
17 182 2003 2 3 NA
3 182 2004 1 1 NA
9 182 2004 1 2 -1.1088896
15 182 2004 1 3 0.9594941
6 182 2004 2 1 -0.4027320
12 182 2004 2 2 -0.0151383
18 182 2004 2 3 -1.0686427
First, we're going to remove all cases where, within a month, there are 2 or more NAs in a row:
NA_run <- function(x, maxlen){
runs <- rle(is.na(x$value))
if(any(runs$lengths[runs$values] >= maxlen)) NULL else x
}
library(plyr)
rem <- ddply(M2, .(ID, year, month), NA_run, 2)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 NA
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 NA
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
You can see that the month with consecutive NAs (January 2003) has been removed. The two adjacent NAs that remain (rows 9 and 10) are still there because they belong to different months. Now we're going to fill in the remaining NAs. The na.rm=FALSE argument is there to keep the NAs if they're right at the beginning (which is what you want, I think).
library(zoo)
rem$value <- na.locf(rem$value, na.rm=FALSE)
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 0.9594941
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
Now all you need to do to make this 5 or more with your data is to change the value of the maxlen argument in NA_run to 5.
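To make the check inside NA_run more concrete, here is a small sketch of what rle returns for a toy vector (the values are made up for illustration):

```r
x <- c(1, NA, NA, 3, NA)

# Run-length encoding of the NA indicator
runs <- rle(is.na(x))
runs$lengths   # length of each run
runs$values    # TRUE where the run consists of NAs

# Lengths of the NA runs only
na_runs <- runs$lengths[runs$values]

# Is there any run of at least 2 NAs?
too_long <- any(na_runs >= 2)
too_long
```

NA_run applies exactly this test to the value column of each month and drops the month when the test is TRUE.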
EDIT: Alternatively, if you don't want values to copy over from previous months:
library(zoo)
rem$value <- ddply(rem, .(ID, year, month), summarise,
value=na.locf(value, na.rm=FALSE))$value
rem
ID year month day value
1 182 2002 1 1 -0.5012581
2 182 2002 1 2 1.1022975
3 182 2002 1 3 1.1022975
4 182 2002 2 1 -0.1623095
5 182 2002 2 2 1.1022975
6 182 2002 2 3 -1.2519859
7 182 2003 2 1 0.9729168
8 182 2003 2 2 0.9594941
9 182 2003 2 3 0.9594941
10 182 2004 1 1 NA
11 182 2004 1 2 -1.1088896
12 182 2004 1 3 0.9594941
13 182 2004 2 1 -0.4027320
14 182 2004 2 2 -0.0151383
15 182 2004 2 3 -1.0686427
I'd do this in two steps:
1. An rle, rollapply, or shift-based strategy to fill in the small gaps (fewer than 5 NAs in a row).
2. A by, aggregate, or ddply-based strategy to take any month with NAs remaining after step 1 and make the whole month NA.
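The two steps could be sketched in base R roughly as follows. The helper names fill_short_gaps and blank_incomplete_months, the max_gap parameter and the toy vectors are all hypothetical; the sketch assumes the values are already sorted by date and that month is a grouping vector of the same length.

```r
# Step 1: fill runs of at most `max_gap` NAs with the last observed value
fill_short_gaps <- function(x, max_gap = 4) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  for (k in which(r$values & r$lengths <= max_gap)) {
    # a run at the very start has no previous value and is left as NA
    if (starts[k] > 1) x[starts[k]:ends[k]] <- x[starts[k] - 1]
  }
  x
}

# Step 2: blank out every month that still contains an NA
blank_incomplete_months <- function(x, month) {
  has_na <- as.logical(ave(is.na(x), month, FUN = any))
  x[has_na] <- NA
  x
}

# Toy data: a gap of 2 in month 1, a gap of 5 in month 2
value <- c(1, NA, NA, 4, NA, NA, NA, NA, NA, 10)
month <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)

filled <- fill_short_gaps(value)
result <- blank_incomplete_months(filled, month)
result
```

Rows that end up as NA can then be dropped with is.na if the whole month should be deleted rather than kept as missing.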