R filter() produces NAs

I see posts on how the filter() function can deal with NAs, but I have the opposite problem.
I have a fully complete dataset that I process through a long algorithm. One piece of that algorithm is a series of FIR filters. After going through the filters, my data sometimes comes back with NAs at the beginning and end. They do not pad the original dimensions; they "replace" the values that would otherwise have come out of the filter.
I've come up with a couple of hacks to remove the NAs after they are created, but I'm wondering if I can prevent the NAs from showing up in the first place?
set.seed(500)
xoff <- sample(-70:70, 8200, replace=TRUE)
filt <- c(-75, -98, -130, -174, -233, -312, -412, -524, -611, -574, -246, 485, 1503, 2545, 3446, 4174, 4749, 5189, 5502, 5689, 5750, 5689, 5502, 5189, 4749, 4174, 3446, 2545, 1503, 485, -246, -574, -611, -524, -412, -312, -233, -174, -130, -98, -75)
xs <- filter(xoff, filt)
Forty NAs are returned: 20 at the head, 20 at the tail.
sum(is.na(xs))
head(xs, n = 21)
tail(xs, n = 21)
The length of the original data and the filtered vector are identical.
length(xoff)==length(xs)
Another filter I use sometimes produces 20 NAs: 10 at the head, 10 at the tail. So it is filter-dependent.
Things I've thought about:
-Is the length of the data indivisible by the length of the filter? No, the filter is length=41, and the data is length=8200. 8200/41 = 200.
-Unlike a moving average which would need n observations prior to starting the smooth, this FIR filter provides the filter values and doesn't rely on prior observations.
So, any help debugging this is much appreciated. Thanks!
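For what it's worth, this is the documented behaviour of stats::filter() with the default method = "convolution" and sides = 2: a symmetric filter of length n leaves (n - 1) / 2 NAs at each end, because those positions have no complete window. With a length-41 filter that is exactly 20 NAs at the head and 20 at the tail, and a length-21 filter gives 10 and 10, matching what you see. A minimal sketch with a small filter, showing that circular = TRUE is one way to avoid the NAs, at the cost of wrapping the series around:

```r
# A length-5 symmetric filter leaves (5 - 1) / 2 = 2 NAs at each end
x    <- as.numeric(1:20)
filt <- rep(1/5, 5)

xs <- stats::filter(x, filt)   # sides = 2 (centred) by default
sum(is.na(xs))                 # 4 NAs: 2 at the head, 2 at the tail

xs_circ <- stats::filter(x, filt, circular = TRUE)  # wrap the ends around
sum(is.na(xs_circ))            # 0 NAs, but the ends mix head and tail data
```

Whether circular = TRUE is acceptable depends on the application; the other common approach is to pad the input with repeated edge values before filtering and drop the padded positions afterwards.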

Related

Simple data manipulation task in R

Let's say I have a data frame like the following sample data.
qty_available<- c(13500, 8500, 4600)
supply_qty<- c(0, 1000, 0)
forecast<- c(1200, 400, 3000)
demand_q<- c( 100, 800, 6000)
df<- data.frame(qty_available, supply_qty, forecast, demand_q)
I am attempting to do the following manipulation: I want qty_available to equal previous qty_available + supply qty - forecast - demand quantity. I can ignore the first observation because it is irrelevant in the context of my task.
So in the second observation, we would have 13,500 + 1000 -400 -800 giving us 13,300. The third observation would then be the 13,300 + 0 - 3000 -6000 giving us 4300.
I have attempted this as follows, but it doesn't work; I don't think the answers "flow through":
df<- mutate(df, qty_available = lag(qty_available) + supply_qty - forecast - demand_q)
I am trying to work this so that the answer ends up becoming 4300 for the third observation.
I am mimicking a process in Excel through R in which the correct value is 4300. I just can't figure out how to mimic that process in R.
How would I go about doing this in R? Any help is greatly appreciated. I'm sure it's fairly simple, but I just can't seem to figure it out.
With that code, I think the third observation would give us -500, because the second qty_available observation stays at 8500 instead of becoming 13,300. So it would be 8,500 + 0 - 3000 - 6000 = -500. If the updated second value (13,300) flowed through to the next row, we would get the expected 4300; but lag() takes the previous value as it was in the original column, not the updated one.
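One way to make the updates flow through is to notice that each row simply adds supply_qty - forecast - demand_q to the running total, so a cumulative sum anchored at the first qty_available does the job. A sketch using the sample data above (base R only, no packages needed):

```r
qty_available <- c(13500, 8500, 4600)
supply_qty    <- c(0, 1000, 0)
forecast      <- c(1200, 400, 3000)
demand_q      <- c(100, 800, 6000)
df <- data.frame(qty_available, supply_qty, forecast, demand_q)

net <- df$supply_qty - df$forecast - df$demand_q   # per-row change
# Keep row 1 as-is, then accumulate the changes from row 2 onward
df$qty_available <- df$qty_available[1] + c(0, cumsum(net[-1]))
df$qty_available
#> [1] 13500 13300  4300
```

Reduce() with accumulate = TRUE (or purrr::accumulate()) would give the same result if you prefer an explicit recursion.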

How to create a "dynamic" column in R?

I'm coding a portfolio analysis tool based off back-tests. In short, I want to add a column that starts at X value which will be the initial capital plus the result of the first trade, and have the rest of the values updated from the % change of each trade, but I haven't sorted out a way to put that logic into code. The following code is a simplified example.
profit <- c(10, 15, -5, -6, 20)
change <- profit / 1000
balance <- c(1010, 1025, 1020, 1014, 1034)
data <- data.frame(profit, change, balance)
So far, the only way I can think of is to create a separate vector that increases or decreases based on the change column, but I'm not sure how to make it take the previous value into account. Doing balance = start_capital * (1 + change) gives the proportional increase relative to the same initial value every time, not the previous balance adjusted by the new change (I hope I explained myself).
Thanks,
Fernando.
EDIT
I have the correct change value in the actual program, since each back-test updates its own balance with the result of each new trade, so the change column in the real data is correct. But my code combines several back-tests, and because the balance is updated per separate back-test rather than for the combined set, the balance column is not usable once everything is combined; that's why I added the change column.
If you want to do this via the change column, we can use Reduce:
start_capital <- 1000
Reduce(function(x, y) x + x*y, data$change, init = start_capital, accumulate = TRUE)[-1]
#[1] 1010.000 1025.150 1020.024 1013.904 1034.182
Reduce with accumulate = TRUE gives the output in a cumulative form taking the output of the current iteration as input to the next one.
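Since each step multiplies the previous balance by (1 + change), the same result can also be obtained with cumprod, which some may find easier to read than Reduce. A sketch using the sample data from the question:

```r
profit <- c(10, 15, -5, -6, 20)
change <- profit / 1000
start_capital <- 1000

# Compound the starting capital by each successive percentage change
balance <- start_capital * cumprod(1 + change)
round(balance, 3)
#> [1] 1010.000 1025.150 1020.024 1013.904 1034.182
```

This matches the Reduce output above; the choice between the two is mostly stylistic.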

How to compute the mean and sd

I need help on 4b please
‘Warpbreaks’ is a built-in dataset in R. Load it using the function data(warpbreaks). It consists of the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn. It has three variables namely, breaks, wool, and tension.
b. For the ‘AM.warpbreaks’ dataset, compute for the mean and the standard deviation of the breaks variable for those observations with breaks value not exceeding 30.
data(warpbreaks)
warpbreaks <- data.frame(warpbreaks)
AM.warpbreaks <- subset(warpbreaks, wool=="A" & tension=="M")
mean(AM.warpbreaks<=30)
sd(AM.warpbreaks<=30)
This is how I understood the problem, hence the last two lines of code. However, those two lines throw errors, while the first three run successfully. Can anybody tell me what the error is?
Thanks! :)
Another way to go about it, subsetting inline rather than creating intermediate datasets:
data(warpbreaks)
AM.warpbreaks <- subset(warpbreaks, wool == "A" & tension == "M")
mean(AM.warpbreaks[which(AM.warpbreaks$breaks <= 30), "breaks"])
sd(AM.warpbreaks[which(AM.warpbreaks$breaks <= 30), "breaks"])
This way you aren't generating a bunch of datasets and then working on remembering which is which. This is more a personal preference, though.
There are two problems with your code. The first is that you are comparing to 30, but you're looking at the entire data frame, rather than just the "breaks" column.
AM.warpbreaks$breaks <= 30
is an expression that refers to the breaks being less than thirty.
But mean(AM.warpbreaks$breaks <= 30) will not give the answer you want either: R evaluates the inner expression to a logical TRUE/FALSE vector, so the mean is the proportion of observations with breaks <= 30, not the mean number of breaks.
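A quick illustration of that point, with made-up break counts: mean() of a logical vector is the proportion of TRUE values, not the mean of the underlying numbers.

```r
breaks <- c(26, 30, 54, 25)
mean(breaks <= 30)            # proportion of observations <= 30: 0.75
mean(breaks[breaks <= 30])    # mean of the values that are <= 30: 27
```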
Generally, you just want to take another subset for an analysis like this.
AM.lt.30 <- subset(AM.warpbreaks, breaks <= 30)
mean(AM.lt.30$breaks)
sd(AM.lt.30$breaks)

Interpreting output from package robfilter (robust.filter)

I'm using the R package robfilter to analyse some time-series data, specifically the function robust.filter. However, when I pass a time series of length 38 as an argument, the vectors that make up the output list have inconsistent lengths. I would have expected them all to be the same length, but possibly I'm misinterpreting the output. Here is an example:
tmp1 <- c(21.40253, 21.71123, 23.62187, 23.34300, 22.81753, 25.05459, 19.13591,
18.75162, 19.92034, 19.98294, 20.07355, 19.76710, 18.87155, 20.06639,
19.69027, 21.33667, 21.57617, 20.84389, 22.28439, 21.73989, 21.82995,
23.02375, 21.99251, 24.88138, 27.75982, 28.84098, 27.67645, 27.04585,
27.16779, 25.62208, 25.90392, 26.92163, 26.83929, 26.83194, 30.43352,
30.95034, 32.41596, 31.87539)
length(tmp1)
The length is 38
library(robfilter)
tmp2 <- robust.filter(tmp1, width=7, shiftd=4, extrapolate=TRUE)
length(tmp2[["y"]])
length(tmp2[["ol"]])
length(tmp2[["level"]])
y, which holds the original data, has length 38, but the vector that identifies the outliers has length 41, and the one holding the filtered series has length 40!
Can I assume for example that the first observation in "ol" corresponds to the first point in the original series? I would appreciate any insights as this inconsistent length has me confused.
In the case of the above series, no outliers were actually identified, so here is another example where robust.filter does flag outliers:
tmp3 <- c(9.590999, 10.736618, 11.213917, 11.486491, 9.727762,
7.564208, 7.995007, 7.718619, 7.908130, 7.598344, 7.884147,
8.560636, 9.490633, 9.768715, 9.221128, 8.535356, 9.589786,
9.900386, 10.496643, 10.817289, 11.371327, 12.310138, 11.572224,
13.117717, 13.720533, 12.880585, 12.830893, 12.578935, 11.910936,
11.063447, 10.916194, 10.881677, 11.246900, 11.264994, 9.844785,
10.512842, 10.609419, 10.866941, 11.541334, 12.041648, 12.188250,
12.289139, 11.412508, 11.603581, 11.018384, 12.374552, 12.010114,
11.701049, 11.550803, 11.029398, 11.109258, 11.263335, 11.201110,
11.047172, 11.398097, 11.068206, 11.639072, 12.182218, 11.574394,
12.970866, 12.214502, 12.271814, 11.529558, 13.413776, 13.452780,
12.854925, 13.494725, 13.381464, 13.054178, 13.346170, 12.622088,
15.365530, 10.252811, 11.067396, 10.791832, 9.584768, 10.765442,
10.781584, 9.646298, 10.452633)
length(tmp3) #80
tmp4 <- robust.filter(tmp3, width=9, shiftd=4, extrapolate=TRUE)
length(tmp4[["y"]]) #80
length(tmp4[["ol"]]) #81
length(tmp4[["level"]]) #80
Again, the vector that indicates which data points are outliers has an inconsistent length.
Note this problem doesn't always occur. Often the lengths of the vectors are consistent.
I looked a bit at the source code (the file robust-filter.R) and noted a few index values computed from ceiling(m/2) and ceiling(m/3) that could produce the somewhat padded vectors. The lengths of $ol, $level, etc. seem to be related to the window width: for example, with a series of length 42 and a window width of 7, all result vectors have length 42, since 42/7 divides evenly.
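If all you need is to line the output back up with the input, one cautious workaround is to trim each component back to the length of the original series. This is a sketch on a mock result list shaped like the output described above (the robust.filter call itself is omitted); it assumes the extra entries are padding at the tail, which is not obvious from the source, so verify against your own data before relying on it:

```r
n <- 38  # length of the original series

# Mock result list with the mismatched lengths reported in the question
res <- list(y = rnorm(n), ol = numeric(n + 3), level = numeric(n + 2))

# Trim every component down to its first n entries (assumes tail padding!)
trimmed <- lapply(res, function(v) v[seq_len(min(length(v), n))])
lengths(trimmed)   # all components now have length 38
```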

randomly separate a group of elements into several groups

I have a set of 1000 elements and would like to put 200 in subset1, 300 in subset2, and 500 in subset3. All elements should have an equal assignment probability. How can I do that in R?
My current approach is to first choose 200 at random and put them into subset1, then randomly pick 300 from the remaining 800. I am not sure that is exactly correct.
I think the correct approach is to reorder the 1000-element sequence at random and select the first 200, then the next 300, and the remaining 500, but I do not know how to do that in R.
You can use the function sample() to get a random permutation of your original data and then select the first 200, then 300, and so on.
# original data
x <- runif(1000)
# random permutation
y <- sample(x)
# data selection
y[1:200]
y[201:500]
y[501:1000]
This is a slightly different version of what @Didzis has proposed; it uses split to return a list of three vectors (or of whatever type x is):
Using rep to get exactly 200, 300, and 500 elements:
split(sample(x),rep(1:3,times=c(200,300,500)))
Using the prob argument of sample to get 200, 300, and 500 elements in expectation:
split(x,sample(1:3,1000,replace=TRUE,prob=c(.2,.3,.5)))
You probably want the first of these.
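An equivalent variant shuffles the group labels instead of the data, which keeps the elements of each subset in their original relative order. A sketch; like the first split example, it yields exactly 200, 300, and 500 elements:

```r
x <- runif(1000)

# Randomly permute a fixed pool of labels, then split on them
g <- sample(rep(1:3, times = c(200, 300, 500)))
groups <- split(x, g)
lengths(groups)   # groups "1", "2", "3" have 200, 300, 500 elements
```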
