Here is what I am trying to achieve in the whatiswant column:
df1 <- data.frame(value = c(99.99, 99.98, 99.97, 99.96, 99.95, 99.94,
                            99.93, 99.92, 99.91, 99.9, 99.9, 99.9),
                  new_value = c(NA, NA, 99.98, NA, 99.97, NA,
                                NA, NA, NA, NA, NA, NA),
                  whatiswant = c(99.99, 99.96, 99.98, 99.95, 99.97, 99.94,
                                 99.93, 99.92, 99.91, 99.9, 99.9, 99.9))
To explain it in words: whatiswant should take the value of new_value, and the rows without a new_value should take the next lowest value that is still available. I think it is some kind of lag operation. Here is the data.frame:
value new_value whatiswant
1 99.99 NA 99.99
2 99.98 NA 99.96
3 99.97 99.98 99.98
4 99.96 NA 99.95
5 99.95 99.97 99.97
6 99.94 NA 99.94
7 99.93 NA 99.93
8 99.92 NA 99.92
9 99.91 NA 99.91
10 99.90 NA 99.90
11 99.90 NA 99.90
12 99.90 NA 99.90
EDIT: The logic explained in steps:
Step 1. If new_value is not NA, then col3 takes that value, so the 3rd and 5th rows are settled.
Step 2. In row 1, col3 takes the value of col1, as col2 is NA.
Step 3. In row 2, col3 takes the value of col1 row 4, because the col1 values of rows 2 and 3 were already used in Step 1.
Step 4. In row 4, col3 takes the value of col1 row 5, as all col1 values from the rows above were taken in previous steps.
Step 5. The remaining rows 6-12 of col3 take the same values as col1 rows 6-12, since col2 is NA there and none of the col1 values in rows 6-12 were used in previous steps.
Here it is in the form of a function, with each step commented; ask if anything is unclear:
t1 <- function(df) {
  df[, 'whatiswant'] <- df[, 'new_value']                # step 1, use value of new_value
  sapply(1:nrow(df), function(row) {                     # loop over each row
    x <- df[row, ]                                       # take the row, just to use a single var later
    ret <- unlist(x['whatiswant'])                       # initial value
    if (is.na(ret)) {                                    # if empty
      if (x['value'] %in% df$whatiswant) {               # test if the corresponding value is already present
        ret <- df$value[!df$value %in% df$whatiswant][1] # if yes, take the first value not yet present
      } else {
        ret <- unlist(x['value'])                        # if not, take this value
      }
    }
    if (is.na(ret)) ret <- min(df$value)                 # no value left, take the min
    df$whatiswant[row] <<- ret                           # update df outside sapply so the next presence test works
  })
  return(df)                                             # return the updated df
}
Output:
> df1[,3] <- NA # Set last column to NA
> res <- t1(df1)
> res
value new_value whatiswant
1 99.99 NA 99.99
2 99.98 NA 99.96
3 99.97 99.98 99.98
4 99.96 NA 99.95
5 99.95 99.97 99.97
6 99.94 NA 99.94
7 99.93 NA 99.93
8 99.92 NA 99.92
9 99.91 NA 99.91
10 99.90 NA 99.90
11 99.90 NA 99.90
12 99.90 NA 99.90
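A note on the design: the sapply here is used purely for its side effect, updating df in the enclosing environment through <<- so that each presence test sees the rows already filled. A plain for loop expresses the same logic without the non-local assignment; here is a sketch of an equivalent version (t1_loop is a made-up name):
t1_loop <- function(df) {
  df$whatiswant <- df$new_value                     # step 1, use value of new_value
  for (row in seq_len(nrow(df))) {
    ret <- df$whatiswant[row]
    if (is.na(ret)) {
      if (df$value[row] %in% df$whatiswant) {       # value already used elsewhere?
        ret <- df$value[!df$value %in% df$whatiswant][1]
      } else {
        ret <- df$value[row]
      }
    }
    if (is.na(ret)) ret <- min(df$value)            # nothing left, take the min
    df$whatiswant[row] <- ret                       # ordinary assignment, visible to the next test
  }
  df
}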
I must be asking the question terribly, because I can't find what I'm looking for!
I have a large Excel file that looks like this for every day of the month:
Date     Well1
1/1/16   10
1/2/16   NA
1/3/16   NA
1/4/16   NA
1/5/16   20
1/6/16   NA
1/7/16   25
1/8/16   NA
1/9/16   NA
1/10/16  35
etc      NA
I want to make a new column with the difference between the non-NA rows, divided by the number of rows between each pair of non-NA rows. I'm aiming for something like this:
Date     Well1  Adjusted
1/1/16   10     =(20-10)/4 = 2.5
1/2/16   NA     1.25
1/3/16   NA     1.25
1/4/16   NA     1.25
1/5/16   20     =(25-20)/2 = 2.5
1/6/16   NA     2.5
1/7/16   25     =(35-25)/3 = 3.3
1/8/16   NA     3.3
1/9/16   NA     3.3
1/10/16  35     etc
etc      NA     etc
I'm thinking I should use lead or lag, but the number of steps between each non-NA row varies, so I'm not sure what to pass as n in the lead/lag functions. I've used group_by so that each month stands alone, and I've attempted case_when and ifelse. Mostly I need ideas on translating the Excel approach into a workable R format.
With some diff-ing and repeating of values, you should be able to get there.
dat$Date <- as.Date(dat$Date, format="%m/%d/%y")
nas <- is.na(dat$Well1)
# rate between consecutive non-NA readings (difference divided by days elapsed),
# then repeated across each run of NAs by indexing with cumsum(!nas)
dat$adj <- with(dat[!nas,],
  diff(Well1) / as.numeric(diff(Date), units="days")
)[cumsum(!nas)]
# Date Well1 adj
#1 2016-01-01 10 2.5
#2 2016-01-02 NA 2.5
#3 2016-01-03 NA 2.5
#4 2016-01-04 NA 2.5
#5 2016-01-05 20 2.5
#6 2016-01-06 NA 2.5
#7 2016-01-07 25 5.0
#8 2016-01-08 NA 5.0
#9 2016-01-09 NA 5.0
#10 2016-01-10 40 NA
dat being used is:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Base R in the same vein as @thelatemail's answer, but with the transformations all in one expression:
nas <- is.na(dat$Well1)
res <- within(dat, {
  Date <- as.Date(Date, "%m/%d/%y")
  Adjusted <- (diff(Well1[!nas]) /
               as.numeric(diff(Date[!nas]), units = "days"))[cumsum(!nas)]
})
Data:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Maybe this should work:
library(dplyr)
df1 %>%
  # // remove the rows with NA
  na.omit %>%
  # // create a new column with the lead values of Well1
  transmute(Date, Well2 = lead(Well1)) %>%
  # // join with the original data
  right_join(df1 %>%
               mutate(rn = row_number())) %>%
  # // restore the original order
  arrange(rn) %>%
  # // create a grouping column based on the NA values
  group_by(grp = cumsum(!is.na(Well1))) %>%
  # // subtract the first element of Well1 from the first element of Well2
  # // and divide by the number of rows - n() - in the group
  mutate(Adjusted = (first(Well2) - first(Well1))/n()) %>%
  ungroup %>%
  select(-grp, -Well2)
I have the following database (X) that contains daily stock returns over time. I show the first 12 rows. The returns can contain random NAs.
Obs. Asset Date Ret
1 DJ 1997-10-06 NA
2 DJ 1997-10-07 NA
3 DJ 1997-10-08 -1.13
4 DJ 1997-10-09 -0.136
5 DJ 1997-10-10 NA
6 DJ 1997-10-14 NA
7 DJ 1997-10-15 NA
8 DJ 1997-10-16 -0.225
9 DJ 1997-10-17 -0.555
10 DJ 1997-10-20 NA
11 DJ 1997-10-21 0.102
12 DJ 1997-10-22 NA
I want to calculate the cumulative return over a 5-day window, so I get a cumulative return from observation 5 onward, ignoring NAs. The cumulative return should only be NA when all the returns within the window are NA.
I tried:
X <- X %>%
  mutate(product = (as.numeric(rollapply(1 + ret/100, 5, prod,
    partial = TRUE, na.rm = TRUE, align = "right")) - 1) * 100)
Which gives an undesired result:
1  1997-10-06 DJ            NA  0.000000000
2  1997-10-07 DJ            NA  0.000000000
3  1997-10-08 DJ -1.1277917526 -1.127791753
4  1997-10-09 DJ -0.1364864885 -1.262738958
5  1997-10-10 DJ            NA -1.262738958
6  1997-10-14 DJ            NA -1.262738958
7  1997-10-15 DJ            NA -1.262738958
8  1997-10-16 DJ -0.2250333841 -0.361212732
9  1997-10-17 DJ -0.5545946845 -0.778380045
10 1997-10-20 DJ            NA -0.778380045
11 1997-10-21 DJ  0.1022404757 -0.676935389
12 1997-10-22 DJ            NA -0.676935389
I want to get NAs before the 5th observation, so rows 1-4 are NA. Row 5 computes the cumulative return over rows 1-5, row 6 over rows 2-6, and so on.
Reprex:
X <- data.frame(Date=c("1997-10-06" ,"1997-10-07", "1997-10-08" ,"1997-10-09", "1997-10-10",
"1997-10-14", "1997-10-15" ,"1997-10-16", "1997-10-17","1997-10-20", "1997-10-21" ,"1997-10-22"),
ret=c(NA,NA,-1.1277918,-0.1364865, NA , NA , NA ,-0.2250334 ,-0.5545947, NA, 0.1022405, NA))
You could replace the NAs with 0 and then use zoo::rollsum's default behavior:
library(zoo)
library(tidyr)
library(dplyr)
df %>%
  dplyr::mutate(ret = zoo::rollsum(tidyr::replace_na(ret, 0), k = 5,
                                   na.pad = TRUE, align = "right"))
Date ret
1 1997-10-06 NA
2 1997-10-07 NA
3 1997-10-08 NA
4 1997-10-09 NA
5 1997-10-10 -1.2642783
6 1997-10-14 -1.2642783
7 1997-10-15 -1.2642783
8 1997-10-16 -0.3615199
9 1997-10-17 -0.7796281
10 1997-10-20 -0.7796281
11 1997-10-21 -0.6773876
12 1997-10-22 -0.6773876
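If you want the compounded return that the question's formula describes, rather than a plain sum, the same replace-the-NAs idea works with a rolling product. A minimal sketch, assuming ret holds percentage returns as in the reprex (an NA return is treated as 0%, i.e. a factor of 1):
library(zoo)
library(tidyr)
library(dplyr)
X %>%
  dplyr::mutate(cumret = (zoo::rollapplyr(1 + tidyr::replace_na(ret, 0)/100,
                                          width = 5, FUN = prod, fill = NA) - 1) * 100)
# fill = NA pads rows 1-4 with NA; each later row compounds the previous 5 returns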
Here's some code that takes your data, changes all the NA in ret to 0 and then calculates a rolling 5-period sum using the RcppRoll and tidyverse packages.
# Load libraries
library('RcppRoll')
library('tidyverse')
# Load data
df <- data.frame(Date=c("1997-10-06" ,"1997-10-07", "1997-10-08" ,"1997-10-09", "1997-10-10",
"1997-10-14", "1997-10-15" ,"1997-10-16", "1997-10-17","1997-10-20", "1997-10-21" ,"1997-10-22"),
ret=c(NA,NA,-1.1277918,-0.1364865, NA , NA , NA ,-0.2250334 ,-0.5545947, NA, 0.1022405, NA), stringsAsFactors = F)
# Change NA's to Zeros
df[is.na(df[,2]),2] <- 0
# Calculate Rolling Sum
df_new <- df %>% mutate(rollsum = roll_sum(ret, n=5, align = 'right', fill=NA))
Hey, I want to compute the variance of a column. My dataframe is sorted by date (as.Date() format). Here is a snippet of it:
Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA
The dataframe ranges from January 2004 up to December 2018. But I do not want to compute the variance of the whole columns.
I want to compute the variance over one year (12 values), moving forward month by month.
I do not really know how to start. I can imagine using the zoo package and rollapply, but the problem there is (I think) that rollapply uses the values around each point, not only the past values?
I also found this question: R: create a data frame out of a rolling window, so my idea was to get rid of the date column. It is easy to build the matrix, but I do not understand how to apply the variance function to my data...
Is there a smart way to compute it all in one go, also using the date information? If not, I would appreciate any other solution as well!
We can use rollapplyr to perform the rolling computations. Since there are only 11 rows in the question's data we can't take 12-month windows, but we can illustrate the idea with 3-month windows instead. rollapplyr is right-aligned, so each window uses only the current and prior months, which addresses the concern in the question. Remove fill = NA if you want to omit the NA rows, or replace it with partial = TRUE if you want variances over fewer than 12 points near the beginning. If you want a data frame result, use fortify.zoo(zv).
library(zoo)
z <- read.zoo(DF)
zv <- rollapplyr(z, 3, var, fill = NA)
zv
giving this zoo object:
USA ARG BRA CHL COL MEX PER
2012-04-01 NA NA NA NA NA NA NA
2012-05-01 NA NA NA NA NA NA NA
2012-06-01 0 1.287083e-04 4.998008e-04 1.126781e-09 1.237524e-11 5.208793e-06 NA
2012-07-01 0 1.033001e-04 5.217420e-05 9.109406e-10 3.883996e-12 3.565057e-06 NA
2012-08-01 0 9.358558e-06 1.396497e-05 2.060928e-09 4.221043e-12 4.600220e-06 NA
2012-09-01 0 1.113297e-05 3.108380e-08 9.159058e-10 4.826929e-12 7.453672e-07 NA
2012-10-01 0 1.988357e-06 4.498977e-08 2.485889e-10 2.953403e-12 8.001948e-07 NA
2012-11-01 0 3.560373e-06 1.944961e-05 2.615387e-10 1.168389e-11 2.971477e-07 NA
2012-12-01 0 3.717777e-05 2.655440e-05 1.271886e-10 1.814869e-11 4.312436e-07 NA
2013-01-01 0 2.042867e-05 3.268476e-05 2.806455e-10 7.540331e-11 1.231438e-06 NA
2013-02-01 0 4.134729e-07 1.129013e-04 1.186146e-10 1.983651e-11 3.263780e-07 NA
We can plot the log of the variances like this:
library(ggplot2)
autoplot(log(zv), facet = NULL) + geom_point() + ylab("log(var(.))")
Note
We assume that the starting point is the data frame generated reproducibly below:
Lines <- "Date USA ARG BRA CHL COL MEX PER
2012-04-01 1 0.2271531 0.4970299 0.001956865 0.0005341452 0.07341428 NA
2012-05-01 1 0.2218906 0.4675895 0.001911405 0.0005273186 0.07026524 NA
2012-06-01 1 0.2054076 0.4531661 0.001891352 0.0005292575 0.06897811 NA
2012-07-01 1 0.2033470 0.4596730 0.001950686 0.0005312600 0.07269619 NA
2012-08-01 1 0.1993882 0.4596039 0.001980537 0.0005271514 0.07268987 NA
2012-09-01 1 0.1967152 0.4593390 0.002011212 0.0005305549 0.07418838 NA
2012-10-01 1 0.1972730 0.4597584 0.002002203 0.0005284380 0.07428555 NA
2012-11-01 1 0.1937618 0.4519187 0.001979805 0.0005238670 0.07329656 NA
2012-12-01 1 0.1854037 0.4500448 0.001993309 0.0005323795 0.07453949 NA
2013-01-01 1 0.1866007 0.4607501 0.002013112 0.0005412329 0.07551040 NA
2013-02-01 1 0.1855950 0.4712956 0.002011067 0.0005359562 0.07554661 NA"
DF <- read.table(text = Lines, header = TRUE)
I have a dataframe named "Historical_Stock_Prices_R" like this:
Date1 MSFT AAPL GOOGL
25-01-05 21.03 4.87 88.56
26-01-05 21.02 4.89 94.62
27-01-05 21.10 4.91 94.04
28-01-05 21.16 5.00 95.17
I use the following code to get a list of monthly max and monthly mean log returns from the daily price data:
return <- cbind.data.frame(
  date = Historical_Stock_Prices_R$Date1[2:nrow(Historical_Stock_Prices_R)],
  apply(Historical_Stock_Prices_R[, 2:4], 2,
        function(x) log(x[-1]/x[-length(x)])*100))
return$Date <- as.Date(return$date,format="%d-%m-%y")
RMax <- aggregate(return[,-1],
by=list(Month=format(return$Date,"%y-%m")),
FUN=max)
RMean <- aggregate(return[,-1],
by=list(Month=format(return$Date,"%y-%m")),
FUN=mean)
But now I have a matrix (not a dataframe) named "df" like this:
AAPL.High ABT.High ABBV.High ACN.High ADBE.High
07-01-02 NA NA NA NA NA
03-01-07 12.37 24.74 NA 37 41.32
04-01-07 12.28 25.12 NA 37.23 41
05-01-07 12.31 25 NA 36.99 40.9
Now how can I calculate the same monthly mean and monthly max using similar code?
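Not a tested answer, but one way to approach it: the matrix keeps its dates in the rownames, so turning it into a data frame with an explicit date column puts it back in the shape the earlier code already handles. A sketch (the all-NA first row is dropped; columns that are entirely NA, like ABBV.High here, will still produce NA results unless handled separately):
# move the rownames into an explicit Date1 column and make it a data.frame
df2 <- data.frame(Date1 = rownames(df), as.data.frame(df),
                  stringsAsFactors = FALSE)
df2 <- df2[rowSums(!is.na(df2[, -1])) > 0, ]   # drop rows where every price is NA
# same log-return construction as before, over all price columns
ret <- cbind.data.frame(
  date = df2$Date1[2:nrow(df2)],
  apply(df2[, -1], 2, function(x) log(x[-1]/x[-length(x)])*100))
ret$Date <- as.Date(ret$date, format = "%d-%m-%y")
price_cols <- setdiff(names(ret), c("date", "Date"))
RMax  <- aggregate(ret[price_cols],
                   by = list(Month = format(ret$Date, "%y-%m")), FUN = max)
RMean <- aggregate(ret[price_cols],
                   by = list(Month = format(ret$Date, "%y-%m")), FUN = mean)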
I have a data frame, simplified as follows:
head(dendro)
X DateTime ID diameter dendro ring DOY month mday year Rain_mm_Tot Through_Tot temp
1 1 2012-06-21 13:45:00 r1_1 5482 1 1 173 6 22 113 NA NA NA
2 2 2012-06-21 13:45:00 r2_3 NA 3 2 173 6 22 113 NA NA NA
3 3 2012-06-21 13:45:00 r1_2 5534 2 1 173 6 22 113 NA NA NA
4 4 2012-06-21 13:45:00 r2_4 NA 4 2 173 6 22 113 NA NA NA
5 5 2012-06-21 13:45:00 r1_3 5606 3 1 173 6 22 113 NA NA NA
6 6 2012-06-21 13:45:00 r2_5 NA 5 2 173 6 22 113 NA NA NA
The dataframe is first split by "ID", so it becomes a list with one data frame per ID.
After that I apply a function that includes a loop, and the result is a new column "diameter2" with the result I want from the function. That works OK:
dendro_sp <- split(dendro, dendro$ID)
library(changepoint)
dendro_sp <- lapply(dendro_sp, function(x){
  x <- subset(x, !is.na(diameter))
  cpfit <- cpt.mean(x$diameter, method = "BinSeg")
  x$diameter2 <- x$diameter
  cpts <- cpfit@cpts               # changepoint locations (S4 slot, so @ not $)
  means <- param.est(cpfit)$mean
  meanZero <- means[1]
  for(i in 1:(length(cpts)-1)){
    x$diameter2[(cpts[i]+1):cpts[i+1]] <-
      x$diameter2[(cpts[i]+1):cpts[i+1]] + (meanZero - means[i+1])
  }
  return(x)
})
dendro2 <- do.call(rbind, dendro_sp)
rownames(dendro2) <- NULL
My problem is that I want to apply the function conditionally, for example only to r1_1 and r1_3, and for the rest of the IDs simply copy the "diameter" value into the new "diameter2" column instead of applying the function:
ifelse(diameter$ID %in% c("r1_1","r1_3"), apply_the_function_to_r11_and_r13_to_calculate_diameter2, otherwise_write_diameter_value_in_diameter2_column)
Remember that the dataframe "dendro" is split by ID; I don't know if that is important when defining the condition for several IDs.
Thanks
I am not sure if I understand the problem correctly, but I'll try to answer.
I assume you want to apply a function to the "diameter" field of the "diameter" data.frame, conditioning on the "ID" field, and return the result in the corresponding "diameter2" field. I don't know how the function works, so forgive me if this does not work.
Selected fields
diameter$diameter2[diameter$ID=="r1_1" | diameter$ID=="r1_3"] <-
  yourfun(diameter$diameter[diameter$ID=="r1_1" | diameter$ID=="r1_3"])
Unselected fields
diameter$diameter2[diameter$ID!="r1_1" & diameter$ID!="r1_3"] <-
  diameter$diameter[diameter$ID!="r1_1" & diameter$ID!="r1_3"]
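Since the data in the question is already split by ID, an alternative is to branch inside the lapply. A sketch, assuming each list element holds exactly one ID and reusing the changepoint code from the question (changepoint must be loaded):
ids_to_adjust <- c("r1_1", "r1_3")
dendro_sp <- lapply(split(dendro, dendro$ID), function(x) {
  x <- subset(x, !is.na(diameter))
  if (unique(x$ID) %in% ids_to_adjust) {
    # changepoint adjustment, as in the question
    cpfit <- cpt.mean(x$diameter, method = "BinSeg")
    x$diameter2 <- x$diameter
    cpts  <- cpfit@cpts
    means <- param.est(cpfit)$mean
    for (i in 1:(length(cpts) - 1)) {
      x$diameter2[(cpts[i] + 1):cpts[i + 1]] <-
        x$diameter2[(cpts[i] + 1):cpts[i + 1]] + (means[1] - means[i + 1])
    }
  } else {
    x$diameter2 <- x$diameter   # other IDs: just copy the raw diameter
  }
  x
})
dendro2 <- do.call(rbind, dendro_sp)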