R Fill cells with previous data - r

I have a table like the following:
days Debit loaddate
1 23/01/2014 138470289.4 23/01/2014
2 24/01/2014 NA NA
3 25/01/2014 NA NA
4 26/01/2014 NA NA
5 27/01/2014 NA NA
There is one row for each day, and in the loaddate column another date appears after a run of NAs:
28 19/02/2014 NA NA
29 20/02/2014 NA NA
30 21/02/2014 NA NA
31 22/02/2014 9090967.9 22/02/2014
32 23/02/2014 NA NA
33 24/02/2014 308083.5 24/02/2014
I would like to replace each NA in loaddate column with the previous date in loaddate.
I tried:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-data3[i-1,'loaddate1']}
}
But I got the wrong format:
> head(data3)
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> 16093
3 2014-01-25 NA <NA> 16093
4 2014-01-26 NA <NA> 16093
5 2014-01-27 NA <NA> 16093
6 2014-01-28 NA <NA> 16093
I need to get the date format also. If I do:
for(i in 1:nrow(data3))
{ if (!is.na(data3[i,'Debit']))
{data3[i,'loaddate1']<-as.Date(data3[i,'loaddate'], format='%Y-%m-%d')}
else {data3[i,'loaddate1']<-as.Date(data3[i-1,'loaddate1'], format='%Y-%m-%d')}
}
I got the wrong result (with NA).
days Debit loaddate loaddate1
1 2014-01-23 138470289 2014-01-23 16093
2 2014-01-24 NA <NA> <NA>
3 2014-01-25 NA <NA> <NA>
4 2014-01-26 NA <NA> <NA>
5 2014-01-27 NA <NA> <NA>
6 2014-01-28 NA <NA> <NA>
How can I get the right result in the right format?
Also, is there a better way to do this replacement, i.e. without a loop?
Thanks.

Try zoo::na.locf and make sure to use the appropriate date format:
library(zoo)
data3$loaddate <- as.Date(na.locf(data3$loaddate), format='%d/%m/%Y')
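As a minimal sketch of that approach on a cut-down copy of the table, assuming loaddate is still a character column in dd/mm/yyyy form (the small data3 below is made up for illustration). Passing na.rm = FALSE is an addition here: by default na.locf drops leading NAs, which would change the vector length if the first row were missing.

```r
library(zoo)

# Illustrative subset of the table from the question
data3 <- data.frame(
  days     = c("23/01/2014", "24/01/2014", "25/01/2014",
               "22/02/2014", "23/02/2014", "24/02/2014"),
  Debit    = c(138470289.4, NA, NA, 9090967.9, NA, 308083.5),
  loaddate = c("23/01/2014", NA, NA, "22/02/2014", NA, "24/02/2014"),
  stringsAsFactors = FALSE
)

# Carry the last non-NA value forward, then parse the strings as dates.
# na.rm = FALSE keeps any leading NA so the length still matches nrow(data3).
data3$loaddate <- as.Date(na.locf(data3$loaddate, na.rm = FALSE),
                          format = "%d/%m/%Y")
data3
```

Because the whole column is converted in one step, it keeps the Date class instead of falling back to the underlying day counts.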

Related

Reconstruct dataframe with dates as date intervals in R

I have a dataset that looks like the following. It lists which campaigns are active for each household, with the start and end dates of each campaign:
campaign_id household_id campaign_type start_date end_date
1 26 1 Type B 2016-12-28 2017-02-19
2 8 1 Type A 2017-05-08 2017-06-25
3 12 1 Type B 2017-07-12 2017-08-13
4 13 1 Type A 2017-08-08 2017-09-24
5 18 1 Type A 2017-10-30 2017-12-24
6 20 1 Type C 2017-11-27 2018-02-05
7 22 1 Type B 2017-12-06 2018-01-07
8 23 1 Type B 2017-12-28 2018-02-04
I have created a new data frame with the structure below, which should show which campaigns are active for a given household on a given date (there is one column per campaign number; I have omitted the rest here):
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 NA NA NA NA
2 1 2016-12-06 NA NA NA NA
3 1 2016-12-28 NA NA NA NA
4 1 2017-02-08 NA NA NA NA
5 1 2017-03-03 NA NA NA NA
6 1 2017-03-08 NA NA NA NA
7 1 2017-03-13 NA NA NA NA
8 1 2017-03-29 NA NA NA NA
9 1 2017-04-03 NA NA NA NA
10 1 2017-04-19 NA NA NA NA
What I want to do is mark the campaigns that are active on each date in the rows of the second data frame. For example, if household_id 1 has campaign 2 running on 2016-11-14 but no other campaigns, the row should look like this:
household_id date campaign1 campaign2 campaign3 campaign4
1 1 2016-11-14 0 1 0 0
How can I build this? Should I loop over the first data frame and assign to the second one in each iteration, or is there a better and faster way? Thanks in advance.
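One possible way to build the indicator columns, sketched under assumptions: the data frame names campaigns and visits, the cut-down sample rows, and the campaign<id> column naming are all made up for illustration. The loop runs once per campaign row rather than once per date row, and each comparison inside it is vectorised over all dates.

```r
# Illustrative subset of the campaign table
campaigns <- data.frame(
  campaign_id  = c(26, 8, 12),
  household_id = c(1, 1, 1),
  start_date   = as.Date(c("2016-12-28", "2017-05-08", "2017-07-12")),
  end_date     = as.Date(c("2017-02-19", "2017-06-25", "2017-08-13"))
)

# Illustrative subset of the household/date table
visits <- data.frame(
  household_id = 1,
  date = as.Date(c("2016-11-14", "2016-12-28", "2017-02-08", "2017-03-03"))
)

for (i in seq_len(nrow(campaigns))) {
  col <- paste0("campaign", campaigns$campaign_id[i])
  # Rows of the same household whose date lies inside the campaign window
  active <- visits$household_id == campaigns$household_id[i] &
    visits$date >= campaigns$start_date[i] &
    visits$date <= campaigns$end_date[i]
  # Create the indicator column on first use, then flag the active rows
  if (is.null(visits[[col]])) visits[[col]] <- 0L
  visits[[col]][active] <- 1L
}
visits
```

For large tables a non-equi join (for example with data.table) would likely be faster, but the loop keeps the logic easy to check.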

Transfer pivottable to another table in R

In my research I have a dataset of cancer patients with some clinical information like cancer stage and treatment etc. Each patient has one row in a table with this clinical information. In addition, each patient has, at one or several timepoints during the treatment, taken blood samples, depending on how long the patient has been followed at the clinic. The first sample is from the first visit and the second sample is from the second visit at the clinic, and so on.
In the table, there is a variable (ie. column) that is named Sample_Time_1, which is the time for the first sample. Sample_Time_2 has the time (date) for the second sample and so on.
However - the samples were analysed at the lab and I got the result in a pivottable, which means I have a table where each sample has one row and therefore the results from one patient is displayed on several rows.
For example, create two tables:
x <- c(1,2,2,3,3,3,3,4,5,6,6,6,6,7,8,9,9,10)
y <- as.Date(c("2011-05-17","2012-06-30","2012-08-11","2011-10-15","2011-11-25","2012-01-07","2012-02-15","2011-08-13","2012-02-03","2011-11-08","2011-12-21","2012-02-01","2012-03-12","2012-01-03","2012-04-20","2012-03-31","2012-05-10","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
z <- c(123,185,153,153,125,148,168,187,194,115,165,167,143,151,129,130,151,134)
Sheet_1 <- matrix(c(x,y,z), ncol=3, byrow=FALSE)
colnames(Sheet_1) <- c("ID","Sample_Time", "Sample_Value")
Sheet_1 <- as.data.frame(Sheet_1)
Sheet_1$Sample_Time <- y
x1 <- c(1,2,3,4,5,6,7,8,9,10)
x2 <- c(3,3,2,3,2,2,4,2,3,3)
x3 <- c(1,2,2,3,3,1,3,1,1,2)
x4 <- as.Date(c("2011-05-17","2012-06-30","2011-10-15","2011-08-13","2012-02-03","2011-11-08","2012-01-03","2012-04-20","2012-03-31","2011-12-15"), format="%Y-%m-%d", origin="1960-01-01")
x5 <- as.Date(c(NA,"2012-08-11","2011-11-25",NA,NA,"2011-12-21",NA,NA,"2012-05-10",NA), format="%Y-%m-%d", origin="1960-01-01")
x6 <- as.Date(c(NA,NA,"2012-01-07",NA,NA,"2012-02-01",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
x7 <- as.Date(c(NA,NA,"2012-02-15",NA,NA,"2012-03-12",NA,NA,NA,NA), format="%Y-%m-%d", origin="1960-01-01")
Sheet_2 <- as.data.frame(c(1:10))
colnames(Sheet_2) <- "ID"
Sheet_2$Stage <- x2
Sheet_2$Treatment <- x3
Sheet_2$Sample_Time_1 <- x4
Sheet_2$Sample_Time_2 <- x5
Sheet_2$Sample_Time_3 <- x6
Sheet_2$Sample_Time_4 <- x7
Sheet_2$Sample_Value_1 <- NA
Sheet_2$Sample_Value_2 <- NA
Sheet_2$Sample_Value_3 <- NA
Sheet_2$Sample_Value_4 <- NA
I would like to transfer the Sample_Value for the first date a sample was taken from a patient from Sheet_1 to Sheet_2$Sample_Value_1 and if there are more samples, I would like to transfer them to column "Sample_Value_2" and so on.
I have tried a double for-loop. For each patient (=ID) in Sheet_1 I run through Sheet_2, and if there is a match on ID, I use another for-loop to see if there is a match on a Sample_Time and insert (using if) the Sample_Value. However, I cannot get it to work and I have a strong feeling there must be a better way.
Any suggestions?
Is this what you want?
Prepare Sheet_1 for reshaping from long to wide by introducing an extra column with unique ID for each blood sample per patient
Sheet_1$uniqid <- with(Sheet_1, ave(as.character(ID), ID, FUN = seq_along))
And with this, do the re-shaping
S_1 <- reshape( Sheet_1, idvar = "ID", timevar = "uniqid", direction = "wide")
which gives you
> S_1
ID Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2 Sample_Time.3
1 1 2011-05-17 123 <NA> NA <NA>
2 2 2012-06-30 185 2012-08-11 153 <NA>
4 3 2011-10-15 153 2011-11-25 125 2012-01-07
8 4 2011-08-13 187 <NA> NA <NA>
9 5 2012-02-03 194 <NA> NA <NA>
10 6 2011-11-08 115 2011-12-21 165 2012-02-01
14 7 2012-01-03 151 <NA> NA <NA>
15 8 2012-04-20 129 <NA> NA <NA>
16 9 2012-03-31 130 2012-05-10 151 <NA>
18 10 2011-12-15 134 <NA> NA <NA>
Sample_Value.3 Sample_Time.4 Sample_Value.4
1 NA <NA> NA
2 NA <NA> NA
4 148 2012-02-15 168
8 NA <NA> NA
9 NA <NA> NA
10 167 2012-03-12 143
14 NA <NA> NA
15 NA <NA> NA
16 NA <NA> NA
18 NA <NA> NA
The number after the dot in the colnames is the uniqid.
Now you can merge the relevant columns from Sheet_2
S_2 <- merge( Sheet_2[ 1:3 ], S_1, by = "ID" )
and the result should be what you are looking for:
> S_2
ID Stage Treatment Sample_Time.1 Sample_Value.1 Sample_Time.2 Sample_Value.2
1 1 3 1 2011-05-17 123 <NA> NA
2 2 3 2 2012-06-30 185 2012-08-11 153
3 3 2 2 2011-10-15 153 2011-11-25 125
4 4 3 3 2011-08-13 187 <NA> NA
5 5 2 3 2012-02-03 194 <NA> NA
6 6 2 1 2011-11-08 115 2011-12-21 165
7 7 4 3 2012-01-03 151 <NA> NA
8 8 2 1 2012-04-20 129 <NA> NA
9 9 3 1 2012-03-31 130 2012-05-10 151
10 10 3 2 2011-12-15 134 <NA> NA
Sample_Time.3 Sample_Value.3 Sample_Time.4 Sample_Value.4
1 <NA> NA <NA> NA
2 <NA> NA <NA> NA
3 2012-01-07 148 2012-02-15 168
4 <NA> NA <NA> NA
5 <NA> NA <NA> NA
6 2012-02-01 167 2012-03-12 143
7 <NA> NA <NA> NA
8 <NA> NA <NA> NA
9 <NA> NA <NA> NA
10 <NA> NA <NA> NA
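One optional refinement, sketched on a cut-down copy of Sheet_1: passing sep = "_" to reshape produces column names such as Sample_Value_1, which match the naming used in Sheet_2 in the question. The sketch assumes the rows are already ordered by date within each ID (otherwise sort first with order()), and it numbers the samples with an integer counter rather than the character one used above.

```r
# Illustrative subset of Sheet_1
Sheet_1 <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  Sample_Time = as.Date(c("2011-05-17", "2012-06-30", "2012-08-11",
                          "2011-10-15", "2011-11-25", "2012-01-07")),
  Sample_Value = c(123, 185, 153, 153, 125, 148)
)

# Number the samples within each patient: 1, 1, 2, 1, 2, 3
Sheet_1$uniqid <- ave(seq_along(Sheet_1$ID), Sheet_1$ID, FUN = seq_along)

# Long to wide; sep = "_" yields Sample_Time_1, Sample_Value_1, ...
S_1 <- reshape(Sheet_1, idvar = "ID", timevar = "uniqid",
               direction = "wide", sep = "_")
names(S_1)
```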

I need to add several rows together based on the fact that they have something in common with another row

Using the information on hand, I need to predict how much of a particular product we need next month. I have several months' worth of data going back; however, the data is separated both by VPN and by a separate warehouse number. I just need to know how much to order in general and can ignore the warehouse separation. We'll be adding that back in later.
There are multiple duplicates of many of the VPNs, and I would like to consolidate all the duplicates and sum the numbers that have been separated.
VPN Month To Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>
So I want to combine all the duplicates and add all the numbers from the rows into just one row per VPN.
I've tried using the aggregate function and it didn't work for me. I may have used it wrong, though.
Any help would be appreciated!
Also, there are some cases where it may cause an infinite number to show up. If anyone has any further advice on how to handle that, it would be welcome.
You basically want to know how to compute a sum by group in your data frame.
You will find plenty of answers on this.
I have a data.table solution for your case:
plouf <- read.table(text = " VPN Month.To.Date December November October September August July June May April March
0A36227-AA 15 6 4 2 NA 4 6 4 2 <NA> 4
0A36227-AA NA 1 NA NA NA NA 1 <NA> <NA> <NA> <NA>
0A36227-AA 2 3 1 NA 2 3 3 1 <NA> 2 3
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA 1 NA NA NA <NA> 1 <NA> <NA> <NA>
0A36258-AA NA NA NA 1 NA NA <NA> <NA> 1 <NA> <NA>
0A36258-AA 1 NA NA NA NA NA <NA> <NA> <NA> <NA> <NA>",
stringsAsFactors = FALSE, header = TRUE)
Here is the code:
library(data.table)
DT <- setDT(plouf)
tochange <- names(DT)[!names(DT) %in% "VPN"]
Here the tochange vector holds the names of the columns you want to sum.
DT[,c(tochange) := lapply(.SD,function(x){as.numeric(x)}),.SDcols = tochange]
DT[,lapply(.SD,function(x){sum(x,na.rm = TRUE)}),.SDcols = tochange,by = VPN]
The first line converts every column in tochange to numeric (the literal "<NA>" strings become NA, with a coercion warning).
The second line performs the sum, ignoring the NAs and grouping by VPN. I am not 100% sure that is what you wanted.
VPN Month.To.Date December November October September August July June May April March
1: 0A36227-AA 17 10 5 2 2 7 10 5 2 2 7
2: 0A36258-AA 2 0 1 2 0 0 0 1 2 0 0
I hope it helps.
Here is the dplyr equivalent:
library(dplyr)
# across() needs dplyr >= 1.0; it replaces the deprecated funs() helpers
plouf %>%
mutate(across(all_of(tochange), as.numeric)) %>%
group_by(VPN) %>%
summarise(across(all_of(tochange), ~ sum(.x, na.rm = TRUE)))
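The question mentions that aggregate did not work. The failing call is not shown, so this is only a guess, but one common cause is that the formula interface of aggregate drops every row containing an NA by default. A minimal base R sketch on made-up data (the sales data frame and its values are illustrative) that keeps those rows and ignores the NAs inside sum:

```r
# Illustrative data with the same shape as the question's table
sales <- data.frame(
  VPN      = c("0A36227-AA", "0A36227-AA", "0A36258-AA", "0A36258-AA"),
  December = c(6, 1, NA, NA),
  November = c(4, NA, 1, NA)
)

# na.action = na.pass keeps rows that contain NA;
# na.rm = TRUE is passed on to sum() so the NAs are skipped while adding
totals <- aggregate(. ~ VPN, data = sales, FUN = sum,
                    na.rm = TRUE, na.action = na.pass)
totals
```

A group with only NAs in a column sums to 0 here, just as in the data.table and dplyr versions.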

Day difference between rows related to other row that is not NA

Suppose I have a data frame (df) like this:
Date x
20.01.2016 34
21.01.2016 28
22.01.2016 NA
23.01.2016 NA
24.01.2016 56
25.01.2016 NA
26.01.2016 28
I want to add a column (z) to this data frame:
Date x z
20.01.2016 34 -
21.01.2016 28 1
22.01.2016 NA NA
23.01.2016 NA NA
24.01.2016 56 3
25.01.2016 NA NA
26.01.2016 28 2
where z shows the day difference between the related row's date and closest previous date (where x is not NA).
For example for the date 24.01.2016 the closest previous date is 21.01.2016 where x is not NA. So the day difference of these two dates is 3.
How can I do this using R?
I will be very glad for any help. Thanks a lot.
Assuming your Date column is of class Date (i.e. df$Date <- as.Date(df$Date, format = '%d.%m.%Y')), then:
df$z[!is.na(df$x)] <- c(NA, as.numeric(diff(df$Date[!is.na(df$x)])))
df
# Date x z
#1 2016-01-20 34 NA
#2 2016-01-21 28 1
#3 2016-01-22 NA NA
#4 2016-01-23 NA NA
#5 2016-01-24 56 3
#6 2016-01-25 NA NA
#7 2016-01-26 28 2
We can use data.table
library(data.table)
setDT(df)[, Date := as.IDate(Date, "%d.%m.%Y")][!is.na(x), z := Date - shift(Date)]
df
# Date x z
#1: 2016-01-20 34 NA
#2: 2016-01-21 28 1
#3: 2016-01-22 NA NA
#4: 2016-01-23 NA NA
#5: 2016-01-24 56 3
#6: 2016-01-25 NA NA
#7: 2016-01-26 28 2

How can I fill up missing information using the previous values for each column? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Replacing NAs with latest non-NA value
How can I fill up missing information using the previous values for each column?
Date.end Date.beg Pollster Serra.PSDB
2012-06-26 2012-06-25 Datafolha 31.0
2012-06-27 <NA> <NA> NA
2012-06-28 <NA> <NA> NA
2012-06-29 <NA> <NA> NA
2012-06-30 <NA> <NA> NA
2012-07-01 <NA> <NA> NA
2012-07-02 <NA> <NA> NA
2012-07-03 <NA> <NA> NA
2012-07-04 <NA> Ibope 22
2012-07-05 <NA> <NA> NA
2012-07-06 <NA> <NA> NA
2012-07-07 <NA> <NA> NA
2012-07-08 <NA> <NA> NA
2012-07-09 <NA> <NA> NA
2012-07-10 <NA> <NA> NA
2012-07-11 <NA> <NA> NA
2012-07-12 2012-07-09 Veritá 31.4
I'm not sure this is the best way to do it; there is probably a package that offers exactly this functionality. The following approach may not have the best performance, but it works and should be fine for small to medium datasets. I would be cautious about applying it to very large datasets (more than a million rows or so).
fillNAByPreviousData <- function(column) {
# First we find the positions that contain NAs
navals <- which(is.na(column))
# and the positions that are filled with data.
filledvals <- which(! is.na(column))
# If no two NAs followed each other, navals-1 would give the
# entries we need. Here, however, we have to find the last filled position
# before each NA. We can do this with the following sapply trick
# (note: it assumes the first entry of the column is not NA):
fillup <- sapply(navals, function(x) max(filledvals[filledvals < x]))
# And finally replace the NAs with our data.
column[navals] <- column[fillup]
column
}
Here is an example using a test dataset:
set.seed(123)
test <- 1:20
test[floor(runif(5,1, 20))] <- NA
> test
[1] 1 2 3 4 5 NA 7 NA 9 10 11 12 13 14 NA 16 NA NA 19 20
> fillNAByPreviousData(test)
[1] 1 2 3 4 5 5 7 7 9 10 11 12 13 14 14 16 16 16 19 20
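For larger inputs, the same fill can be sketched without sapply by using cummax on the positions of the non-NA entries. This is a swapped-in alternative, not the function above; the name fill_forward is made up for illustration. Unlike the sapply version, it leaves leading NAs as NA instead of failing on them.

```r
fill_forward <- function(x) {
  filled <- !is.na(x)
  # Position of the most recent non-NA entry at each index (0 if none yet)
  idx <- cummax(seq_along(x) * filled)
  # Leading NAs have no previous value, so they stay NA
  idx[idx == 0] <- NA
  x[idx]
}

test <- 1:20
test[c(6, 8, 15, 17, 18)] <- NA
fill_forward(test)
```

Since it only indexes the input, the same function also works for character and Date vectors, so it can be applied to each column with lapply.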

Resources