Replacing zero with previous value in time series R

I have this time series:
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 125 100
2013-01-18 120 0
And I want the output to be:
Quant1 Quant2
2013-01-23 400 200
2013-01-22 125 100
2013-01-21 125 100
2013-01-20 125 100
2013-01-18 120 0
I am trying this, but it does not seem to work; I am getting a warning ("NULL Warning encountered while processing method"):
replace(df,df == 0, NA)
df <- na.locf(df)
df[is.na(df)] <- 0
Any suggestions?
Update
As per the most voted answer I tried the following (I modified the input dates):
> z <- structure(c(400L, 0L, 0L, 125L, 120L, 200L, 0L, 0L, 100L,
+ 0L), .Dim = c(5L, 2L), .Dimnames = list(NULL, c("Quant1", "Quant2"
+ )), index = structure(c(15728, 15727, 15726, 15725, 15723), class = "Date"),
+ class = "zoo")
> z
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 125 100
2013-01-18 120 0
> L <- rowSums(z != 0) > 0
> z[] <- coredata(z)[which(L)[cumsum(L)],]
> z
Quant1 Quant2
2013-01-23 400 200
2013-01-22 0 0
2013-01-21 0 0
2013-01-20 0 0
2013-01-18 120 0

In the future please make your questions self-contained, including the library calls and the dput(x) output of any input x.
We assume this is a zoo object, as shown at the end. We will call it z since df suggests that it's a data frame.
library(zoo)
L <- rowSums(z != 0) > 0
z[] <- coredata(z)[which(L)[cumsum(L)],]
giving:
> z
Quant1 Quant2
2013-01-18 400 200
2013-01-20 400 200
2013-01-21 400 200
2013-01-22 125 100
2013-01-23 120 0
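To see why the indexing line works, here is an illustrative breakdown (an editorial aside, not part of the original answer), using the L that the Note input below produces:
L <- c(TRUE, FALSE, FALSE, TRUE, TRUE)
which(L)              # 1 4 5      -> positions of rows containing a non-zero value
cumsum(L)             # 1 1 1 2 3  -> how many such rows have been seen so far
which(L)[cumsum(L)]   # 1 1 1 4 5  -> for each row, the most recent non-zero row at or before it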
Note: This input was used:
z <- structure(c(400L, 0L, 0L, 125L, 120L, 200L, 0L, 0L, 100L,
0L), .Dim = c(5L, 2L), .Dimnames = list(NULL, c("Quant1", "Quant2"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")

I also assumed it to be a zoo object and built the following function by hand, which only checks whether Quant1 is zero or not.
It is less elegant and probably slower (one should replace the for loop with some apply function) than the previous approach by Grothendieck, but maybe it is somewhat instructive for you.
require(zoo)
times <- as.POSIXct(c("2013-01-18", "2013-01-20", "2013-01-21", "2013-01-22", "2013-01-23",
                      "2013-01-25", "2013-01-29", "2013-02-02", "2013-02-04"))
quant1 <- c(400, 0, 0, 125, 120, 0, 70, 0, 0)
quant2 <- c(200, 0, 0, 100, 0, 300, 150, 80, 200)
z <- zoo(data.frame(Quant1 = quant1, Quant2 = quant2), order.by = times)

repl_zeros <- function(z) {
  diffs <- c(0, diff(as.numeric(z$Quant1 == 0)))
  beginnings <- which(diffs == 1)
  ends <- which(diffs == -1) - 1
  valueindices <- ends + 1
  for (i in 1:length(valueindices)) {
    z[beginnings[i]:ends[i], ]$Quant1 <- z[valueindices[i], ]$Quant1
    z[beginnings[i]:ends[i], ]$Quant2 <- z[valueindices[i], ]$Quant2
  }
  z
}
Note: repl_zeros replaces zeros with the following values, as in your example, whereas the title of your question says you want to replace them with previous values. Adjusting it to what you really meant should be easy though (one possible adjustment is sketched below).
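For example, a possible adjustment (an untested sketch, working on the coredata matrix and assuming the series does not start with a zero run) that takes the value from the row just before each zero run instead of the row just after it:
repl_zeros_prev <- function(z) {
  m <- coredata(z)
  diffs <- c(0, diff(as.numeric(m[, "Quant1"] == 0)))
  beginnings <- which(diffs == 1)                                    # first row of each zero run
  ends <- c(which(diffs == -1) - 1, nrow(m))[seq_along(beginnings)]  # last row of each run
  for (i in seq_along(beginnings)) {
    # repeat the row just before the run across the whole run (column-major fill)
    m[beginnings[i]:ends[i], ] <- rep(m[beginnings[i] - 1, ], each = ends[i] - beginnings[i] + 1)
  }
  z[] <- m
  z
}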

Related

How to find the minimum value in between two values using R

In a column of data I am trying to identify the minimum value between a new high and the previous new high. In the example below I marked where the new highs are and what the minimum value is between them. What is the R formula to figure this out? In Excel I would be able to do it using the MATCH and MAX/MIN formulas. I am not sure how to find the minimum value in a segment of a column in R.
data
0 New High
-80
-160
-160
-160
-160
-160
-347
-351
-351
-444
-444
-444
43 New High -444
43
10
10
-6
20
352 New High -6
352
352
528 New High 352
528
511
511
518
472
You can use cummax to calculate the cumulative maximum up to each row and create a new group whenever the current row's value is greater than the previous cummax value. Within each group you can return the minimum value.
library(dplyr)
df %>%
  group_by(group = cumsum(V1 > lag(cummax(V1), default = first(V1)))) %>%
  summarise(min_value = min(V1))
# group min_value
# <int> <int>
#1 0 -444
#2 1 -6
#3 2 352
#4 3 472
This considers the last part as another group and hence also returns the minimum value of that part. You can remove the last row if it is not needed.
To apply this to multiple columns, we can write a function and call it with lapply:
apply_fun <- function(data, col) {
  col1 <- sym(col)
  data %>%
    group_by(group = cumsum(!!col1 > lag(cummax(!!col1),
                                         default = first(!!col1)))) %>%
    summarise(min_value = min(!!col1))
}
result <- lapply(names(df), apply_fun, data = df)
data
df <- structure(list(V1 = c(0L, -80L, -160L, -160L, -160L, -160L, -160L,
-347L, -351L, -351L, -444L, -444L, -444L, 43L, 43L, 10L, 10L,
-6L, 20L, 352L, 352L, 352L, 528L, 528L, 511L, 511L, 518L, 472L
)), class = "data.frame", row.names = c(NA, -28L))

loop to create a dataframe with new column, then combine them together

I want to duplicate my dataset at different flight altitude levels. I can do it manually by creating data frames with differing altitude levels and then rbind-ing them together, but I want to make it faster, perhaps with a for loop.
This is the example dataset:
structure(list(heading = c(0L, 71L, 132L, 143L, 78L, 125L, 0L,
171L, 165L, 159L), thermal = c(1.25823300871478, 1.2972715238927,
1.65348398199965, 2.04165937130312, 1.496194948775, 1.70668245624966,
1.32775326817617, 1.37003605552932, 1.85841102388127, 1.20642577473389
), WS = c(17.1590022110329, 7.60663206413036, 16.3515501561529,
15.8336908137001, 7.11013207359218, 8.69420768960291, 5.23228331387401,
10.2762569508197, 3.79321542059933, 4.80008774506314), trackId = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("ke1601", "ke1607",
"mwb1501", "mwb1502", "mwb1503", "mwb1504", "nsm1605", "rcees17110",
"rcees17111", "X27230893", "X27231081", "X27233186", "X27234135",
"X52409530"), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
I was coding manually like this:
msl100 <- df %>% mutate(alt = 100)
msl200 <- df %>% mutate(alt = 200)
msl300 <- df %>% mutate(alt = 300)
msl400 <- df %>% mutate(alt = 400)
msl500 <- df %>% mutate(alt = 500)
df1 <- rbind(msl100, .........)
I need to do this for every 100 meters up to a height of 5100 meters.
This can be done purely through a cbind, as the rows of the original data will be recycled to match the longer alt vector:
cbind(dat, alt=rep(seq(100,5100,100), each=nrow(dat)))
This should be much faster than looping over values.
Consider a cross join merge:
expanded_df <- merge(df, data.frame(alt=seq(100, 5100, 100)), by = NULL)
Create a sequence, use lapply to loop over it, use transform to add the new column, and rbind the results:
do.call(rbind, lapply(seq(100, 5100, 100), function(x) transform(df, alt = x)))
# heading thermal WS trackId alt
#1 0 1.258233 17.159002 mwb1501 100
#2 71 1.297272 7.606632 mwb1501 100
#3 132 1.653484 16.351550 mwb1501 100
#4 143 2.041659 15.833691 mwb1501 100
#5 78 1.496195 7.110132 mwb1501 100
#6 125 1.706682 8.694208 mwb1501 100
#7 0 1.327753 5.232283 mwb1501 100
#8 171 1.370036 10.276257 mwb1501 100
#9 165 1.858411 3.793215 mwb1501 100
#10 159 1.206426 4.800088 mwb1501 100
#11 0 1.258233 17.159002 mwb1501 200
#12 71 1.297272 7.606632 mwb1501 200
#....
Using the tidyverse, that would be:
library(dplyr)
library(purrr)
map_df(seq(100, 5100, 100), ~df %>% mutate(alt = .x))
We can use crossing from the tidyr package.
library(dplyr)
library(tidyr)
df2 <- crossing(df, tibble(alt = seq(100, 5100, 100)))
If the order is important, create an ID column, arrange by it, and then delete it.
df3 <- df %>%
  mutate(ID = 1:n()) %>%
  crossing(tibble(alt = seq(100, 5100, 100))) %>%
  arrange(alt, ID) %>%
  select(-ID)
Another (fast) data.table-based alternative would be to do
library(data.table)
setDT(df)[, .(alt = seq(100, 5100, 100)), by = names(df)]
# heading thermal WS trackId alt
# 1: 0 1.258233 17.159002 mwb1501 100
# 2: 0 1.258233 17.159002 mwb1501 200
# 3: 0 1.258233 17.159002 mwb1501 300
# 4: 0 1.258233 17.159002 mwb1501 400
# 5: 0 1.258233 17.159002 mwb1501 500
#---
#506: 159 1.206426 4.800088 mwb1501 4700
#507: 159 1.206426 4.800088 mwb1501 4800
#508: 159 1.206426 4.800088 mwb1501 4900
#509: 159 1.206426 4.800088 mwb1501 5000
#510: 159 1.206426 4.800088 mwb1501 5100
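Since the question explicitly asks about a for loop, here is what that version might look like (a sketch for comparison; the vectorised answers above are generally preferable): collect the pieces in a pre-allocated list and bind once at the end.
alts <- seq(100, 5100, 100)
out <- vector("list", length(alts))      # pre-allocate a list instead of growing a data frame
for (i in seq_along(alts)) {
  out[[i]] <- transform(df, alt = alts[i])
}
df1 <- do.call(rbind, out)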

Create new variable based on the Look up table

I want to create a new variable in the data frame using a lookup table. I have df1 (a data frame) with Amount and Term, and I need to create a new variable "Premium" whose values come from the lookup table.
I tried the ifelse function but it's too tedious.
Below is an illustration/example
df1 <- data.frame(Amount, Term)
df1
# Amount Term
# 1 2500 23
# 2 3600 30
# 3 7000 45
# 4 12000 50
# 5 16000 38
And I need to create the new variable 'Premium' using the Premium lookup table below.
                 Term
Amount           0-24 Mos  25-36 Mos  37-48 Mos  49-60 Mos
0 - 5,000             133        163        175        186
5,001 - 10,000        191        213        229        249
10,001 - 15,000       229        252        275        306
15,001 - 20,000       600        615        625        719
20,001 - 25,000       635        645        675        786
So the output for Premium should be:
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
Data
df1 <- structure(list(Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
Term = c(23L, 30L, 45L, 50L, 38L)),
class = "data.frame",
row.names = c(NA, -5L))
lkp <- structure(c(133L, 191L, 229L, 600L, 635L,
163L, 213L, 252L, 615L, 645L,
175L, 229L, 275L, 625L, 675L,
186L, 249L, 306L, 719L, 786L),
.Dim = 5:4,
.Dimnames = list(Amount = c("0 - 5,000", "5,001 - 10,000",
"10,001 - 15,000", "15,001 - 20,000",
"20,001 - 25,000"),
Term = c("0-24 Mos", "25-36 Mos", "37-48 Mos",
"49-60 Mos")))
Code
First create the upper limits for month and amount using regular expressions on the column and row names (you did not post your data in a reproducible way, so this regex may need adaptation based on your real lookup table structure):
(month <- c(0, as.numeric(sub("\\d+-(\\d+) Mos$",
"\\1",
colnames(lkp)))))
# [1] 0 24 36 48 60
(amt <- c(0, as.numeric(sub("^\\d+,*\\d* - (\\d+),(\\d+)$",
"\\1\\2",
rownames(lkp)))))
# [1] 0 5000 10000 15000 20000 25000
Get the positions for each element of df1 using findInterval:
(rows <- findInterval(df1$Amount, amt))
# [1] 1 1 2 3 4
(cols <- findInterval(df1$Term, month))
# [1] 1 2 3 4 3
Use these indices to subset the lookup matrix:
df1$Premium <- lkp[cbind(rows, cols)]
df1
# Amount Term Premium
# 1 2500 23 133
# 2 3600 30 163
# 3 7000 45 229
# 4 12000 50 306
# 5 16000 38 625
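One caveat worth adding (not from the original answer): if an Amount or Term falls at or beyond the upper boundary of the table (say an Amount of 25,000 or more, or a Term of 60), findInterval returns an index one past the last row or column and the matrix subscript fails, so you may want to clamp the indices:
rows <- pmin(findInterval(df1$Amount, amt), nrow(lkp))   # clamp to the last amount band
cols <- pmin(findInterval(df1$Term, month), ncol(lkp))   # clamp to the last term band
df1$Premium <- lkp[cbind(rows, cols)]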
To get to what you want you need to organise the table and categorise the data. I have provided a potential workflow to handle such situations. Hope this is helpful:
library(tidyverse)
df1 <- data.frame(
  Amount = c(2500L, 3600L, 7000L, 12000L, 16000L),
  Term = c(23L, 30L, 45L, 50L, 38L)
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# functions for analysis ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

amount_tier_function <- function(x) {
  case_when(x <= 5000 ~ "Tier_5000",
            x <= 10000 ~ "Tier_10000",
            x <= 15000 ~ "Tier_15000",
            x <= 20000 ~ "Tier_20000",
            TRUE ~ "Tier_25000")
}

month_tier_function <- function(x) {
  case_when(x <= 24 ~ "Tier_24",
            x <= 36 ~ "Tier_36",
            x <= 48 ~ "Tier_48",
            TRUE ~ "Tier_60")
}

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Recut lookup table headings ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

lookup_df <- data.frame(stringsAsFactors = FALSE,
  amount_tier = c("Tier_5000", "Tier_10000", "Tier_15000", "Tier_20000",
                  "Tier_25000"),
  Tier_24 = c(133L, 191L, 229L, 600L, 635L),
  Tier_36 = c(163L, 213L, 252L, 615L, 645L),
  Tier_48 = c(175L, 229L, 275L, 625L, 675L),
  Tier_60 = c(186L, 249L, 306L, 719L, 786L)
)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Join everything together ####
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

lookup_df_tidy <- lookup_df %>%
  gather(mth_tier, Premium, -amount_tier)

df1 %>%
  mutate(amount_tier = amount_tier_function(Amount),
         mth_tier = month_tier_function(Term)) %>%
  left_join(., lookup_df_tidy) %>%
  select(-amount_tier, -mth_tier)

How to assign a value in a data frame based on multiple conditions of another data frame

I have two data frames, one consisting of numerical values called 'esame':
media id_poll fin
1 5.330000e+00 360 1
2 6.833333e-02 361 0
3 0.000000e+00 362 0
4 NA 363 0
5 8.200000e-01 364 0
6 3.416667e-01 365 0
7 0.000000e+00 366 0
8 0.000000e+00 367 0
9 0.000000e+00 368 0
10 NA 369 0
11 6.150000e-01 370 0
12 0.000000e+00 371 0
13 0.000000e+00 372 0
14 NA 373 0
15 0.000000e+00 374 0
16 0.000000e+00 375 0
17 0.000000e+00 376 0
18 1.298333e+00 377 0
And the second one consists of numerical ranges, which I would like to use to check which range the 'media' field of the first data frame falls into.
If it's in the first range I would like to assign "1" to the field "fin" of the first data frame; if it's in the second, I would like to assign "2", and so on.
So here it is the second data.frame with some of the conditions I'll need:
Range1 Range2 Range3 Range4 ID
0.5 9.9 29.9 >30 360
0.5 15.9 49.9 >50 361
0 4.9 24.9 >25 362
First of all, I suppose I won't need to declare Range4 as that information is already included in Range3. I removed the initial value of all numerical ranges as I need just a single number to check against (or so I think). The same row for ID 360 could be written as:
Range1 Range2 Range3 Range4 ID
0.5 0.6-9.9 10-29.9 >30 360
So my guess is to do something like this:
esame$fin<-ifelse (esame$media<0.6 & datofinale$id_poll=="360", "1", "0")
I could substitute the "0" value with another 'ifelse' statement and go on manually.
Is there any faster way to do that? (The list containing all the conditions is actually much larger than this example.)
Thank you for any advice.
Not too nice, but this should work:
require(dplyr)
inner_join(Data, Data1, by = c("id_poll" = "ID")) %>%
  rowwise() %>%
  mutate(fin = findInterval(media, c(-Inf, Range1, Range2, Range3), left.open = TRUE))
Reproducible data
esame <- structure(list(media = c(5.33, 0.06833333, 0, NA, 0.82, 0.3416667,
    0, 0, 0, NA, 0.615, 0, 0, NA, 0, 0, 0, 1.298333), id_poll = 360:377,
    fin = c(1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, 0L)), .Names = c("media", "id_poll", "fin"),
    row.names = c(NA, -18L), class = c("data.table", "data.frame"))
df1 <- structure(list(Range1 = c(0.5, 0.5, 0), Range2 = c(9.9, 15.9,
    4.9), Range3 = c(29.9, 49.9, 24.9), Range4 = c(">30", ">50",
    ">25"), ID = 360:362), .Names = c("Range1", "Range2", "Range3",
    "Range4", "ID"), row.names = c(NA, -3L), class = c("data.table",
    "data.frame"))
dplyr solution
Using case_when
library(dplyr)
df2 <- left_join(esame, df1, by = c("id_poll" = "ID")) %>%
  mutate(fin = case_when(media > Range3 ~ 4,
                         media > Range2 ~ 3,
                         media > Range1 ~ 2,
                         media <= Range1 ~ 1,
                         is.na(Range1) == T ~ 0))  # else case
Output
media ID fin Range1 Range2 Range3 Range4
1 5.33000000 360 2 0.5 9.9 29.9 >30
2 0.06833333 361 1 0.5 15.9 49.9 >50
3 0.00000000 362 1 0.0 4.9 24.9 >25
4 NA 363 0 NA NA NA <NA>
5 0.82000000 364 0 NA NA NA <NA>
We can consider each row in the range data.frame as a vector and ask whether the current media value is greater than the values in this vector.
For simplicity, I'm assuming that every value in the first data.frame has a counterpart in the second, and that they are both ordered the same way.
for (i in 1:nrow(esame)) {
  greater.than <- esame[i, 1] > range[i, 1:3]   # TRUE where media exceeds the boundary, FALSE where it is within; you want the first FALSE
  esame$fin[i] <- max(which(greater.than)) + 1  # position of the last TRUE + 1, i.e. the position of the first FALSE
}
Here dat is the first data frame and tad the second. It uses nested ifelse() calls, puts 0 if the value is NA, and assumes that the first range runs from 0 up to the value given. However, please show some example results so we can check whether it works properly.
dat$fin <- sapply(1:nrow(dat), function(x)
  ifelse(dat[x, 1] >= tad[x, 1] & !is.na(dat[x, 1]), 1,
         ifelse(dat[x, 1] >= tad[x, 2] & !is.na(dat[x, 1]), 2,
                ifelse(dat[x, 1] >= tad[x, 3] & !is.na(dat[x, 1]), 3, 0))))
>dat
media id_poll fin
1 5.33000000 360 1
2 0.06833333 361 0
3 0.00000000 362 1

Operation on multiple zoo object with data frame

I have two time series (zoo) objects and a data frame
z1
z1 <- structure(c(400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L,400L, 125L, 125L, 125L, 120L
,400L, 125L, 125L, 125L, 120L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 400 400 400 400
2013-01-20 125 125 125 125
2013-01-21 125 125 125 125
2013-01-22 125 125 125 125
2013-01-23 120 120 120 120
z2
z2 <- structure(c(40L, 12L, 25L, 15L, 10L,40L, 25L, 15L, 123L, 190L,150L, 115L, 155L, 105L, 80L
,40L, 425L, 225L, 115L, 20L), .Dim = c(5L, 4L), .Dimnames = list(NULL, c("T1", "T2", "T3", "T6"
)), index = structure(c(15723, 15725, 15726, 15727, 15728), class = "Date"),
class = "zoo")
T1 T2 T3 T6
2013-01-18 40 40 150 40
2013-01-20 12 25 115 425
2013-01-21 25 15 155 225
2013-01-22 15 123 105 115
2013-01-23 10 190 80 20
df
l <- "Name, DOB, TypeOfApply, House
T1, 2008-12-16, sync,44
T2, 2008-12-15, sync,54
T3, 2008-12-19, async,34
T4, 2008-12-18, async,84
T5, 2008-12-11, sync,94"
df <- read.csv(text = l)
I want to apply a formula (a function I created called "calc") based on the condition that TypeOfApply == "sync". z1 and z2 will always have the same number of rows and columns.
calc(z1,z2,df$DOB-2013-01-18,df$House)
T1 T2 T3 T6
2013-01-18 calc(400,40,((2008-12-16)-(2013-01-18)),44) calc(400,40,((2008-12-15)-(2013-01-18)),54) 400 400
2013-01-20 calc(125,12,((2008-12-16)-(2013-01-20)),44) calc(400,25,((2008-12-15)-(2013-01-20)),54) 125 125
2013-01-21 calc(125,25,((2008-12-16)-(2013-01-21)),44) calc(400,15,((2008-12-15)-(2013-01-21)),54) 125 125
2013-01-22 calc(125,15,((2008-12-16)-(2013-01-22)),44) calc(400,123,((2008-12-15)-(2013-01-22)),54) 125 125
2013-01-23 calc(120,10,((2008-12-16)-(2013-01-23)),44) calc(400,190,((2008-12-15)-(2013-01-23)),54) 120 120
So, in this code T1 and T2 will have the formula applied, but the others will not:
T3 - Type of Apply is async
T5 - Does not exist in z1 and z2
T6 - Does not exist in df
Update
Sequence of names in df may be different. So it may be like T2, T1, T3, T5, T4
Just as a sample calc function:
calc <- function(x, y, z, v) {
  val <- x + y + (z/365) + v
  return(val)
}
Here, I am using str_trim as there are leading/trailing spaces in the "df" columns. I converted the factor column "DOB" to the "Date" class and created an "indx" based on the condition that the "TypeOfApply" elements are "sync" and the corresponding "Name" elements are present in the column names of "z1". This "indx" is used for subsetting "df" as well as "z1" and "z2". Then use the "Map" function to get the corresponding columns of "z1" and "z2" and the elements of "df1$DOB" and "df1$House", which can be used as inputs to the "calc" function.
library(stringr)
indx <- intersect(with(df, str_trim(Name[str_trim(TypeOfApply) == 'sync'])),
                  colnames(z1))
df1 <- df[str_trim(as.character(df$Name)) %in% indx, c(2, 4)]
df1$DOB <- as.Date(str_trim(df1$DOB))
Map(function(u, v, x, y) calc(u, v, x - as.Date('2013-01-18'), y),
    as.data.frame(z1[, indx]), as.data.frame(z2[, indx]), df1$DOB, df1$House)
Update
Using the calc function from OP's post
z3 <- z1[,indx]
index <- as.Date('2013-01-18')
z3[] <- mapply(calc, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB-index, df1$House)
z3
# T1 T2
#2013-01-18 479.9068 489.9041
#2013-01-20 176.9068 199.9041
#2013-01-21 189.9068 189.9041
#2013-01-22 179.9068 297.9041
#2013-01-23 169.9068 359.9041
Suppose I change the order of the "df" rows:
set.seed(24)
df <- df[sample(1:nrow(df)),]
Then, the "Map" list elements will be in the same order as "indx", for example,
indx
#[1] "T2" "T1"
df1
# DOB House
#2 2008-12-15 54
#1 2008-12-16 44
Map(function(u,v,x,y) u, as.data.frame(z1[,indx]),
as.data.frame(z2[,indx]), df1$DOB, df1$House)
#$T2
#[1] 400 125 125 125 120
#$T1
#[1] 400 125 125 125 120
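If the goal is to keep the untouched columns (T3 and T6 here) next to the recalculated ones, one option (a sketch, not part of the original answer) is to copy z1 and overwrite only the matched columns:
z_out <- z1
coredata(z_out)[, indx] <- coredata(z3)  # only the sync columns present in both df and z1 are replaced
z_out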
