I have a data.table with 3 input columns as follows and a fourth column representing my target output:
require(data.table)
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak = c(9,15,15,23,27,27,31,39,49,49,50),
                   Valid_reversal = c(T,T,F,F,T,F,T,F,T,F,F),
                   Target_output = c(5,5,13,5,19,23,19,19,39,42,39))
I'm not sure if this is completely necessary, but I'll try to explain the dataset to hopefully make it easier to see what I'm trying to do. This is a little hard to explain in writing, so please bear with me!
The "Created" column represents the row number location of a price 'peak' (i.e. reversal point) in a time-series of financial data that I'm analysing. The "Next_peak" column represents the corresponding row number (in the original data set) of the next peak which exceeds the peak for that row. e.g. looking at row 1, the "Next_peak" value is 9, corresponding to the same row location as the "Created" level on row 2 of this summarised table. This means that the second peak exceeds the first peak. Conversely, in row 2 where the second peak's data is stored, the "Next peak" value of 15 suggests that it isn't until the 4th peak (i.e. corresponding to the '15' value in the "Created" column) that the second peak's price level is exceeded.
Lastly, the "Valid_reversal" column denotes whether the "Created" and "Next_peak" levels are within a predefined threshold of each other. For example, the "T" in the first row indicates that the peaks at rows 5 and 9 ("Next_peak") meet this criterion. If I then go to the row where "Created" equals 9, there is also a "T", indicating that its "Next_peak" value of 15 also meets the criterion. However, in the 4th row, where Created = 15, there is an "F": the next peak does not meet the criterion.
What I'm trying to do is to link the 'chains' of valid reversal points and then return the original starting "Created" value. i.e. I want rows 1, 2 and 4 to have a value of '5', suggesting that the peaks for these rows were all within a predefined threshold of the original peak in row 5 of the original data-set.
Conversely, row 3 should simply return 13 as there were no valid reversals at the "Next_peak" value of 15 relative to the peak formed at row 13.
I can create the desired output with the following code, however, it's not a workable solution as the number of steps could easily exceed 3 with my actual data sets where there are more than 3 peaks which are 'linked' with the same reversal point.
I could do this with a 'for' loop, but I'm wondering if there is a better way to do this, preferably in a manner which is as vectorised as possible as the actual data set that I'm using contains millions of rows.
Here's my current approach:
Test[Valid_reversal == TRUE, Step0 := Next_peak]
Test[, Step1 := sapply(seq_len(.N), function(x) ifelse(any(!(Created[x] %in% Step0[seq_len(x)])),
                                                       Created[x], NA))]
Test[, Step2 := unlist(ifelse(is.na(Step1),
                              lapply(.I, function(x) Step1[which.max(Step0[seq_len(x - 1)] == Created[x])]),
                              Step1))]
Test[, Step3 := unlist(ifelse(is.na(Step2),
                              lapply(.I, function(x) Step2[which.max(Step0[seq_len(x - 1)] == Created[x])]),
                              Step2))]
As you can see, while this data set only needs 3 iterations, the number of steps in the approach that I've taken is not definable in advance (as far as I can see). Therefore, to implement this approach, I'd have to repeat Step 2 until all values had been calculated, potentially via a 'while' loop. I'm struggling a little to work out how to do this.
Please let me know if you have any thoughts on how to address this in a more efficient way.
Thanks in advance,
Phil
Edit: I didn't mention above that the "Next_peak" values aren't necessarily monotonically increasing. The first example happened to allow nafill to be used; however, as the following example / sample output shows, nafill wouldn't give the correct output in a case like this:
Test <- data.table(Created = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak = c(27,15,15,19,23,27,42,39,42,49,50),
                   Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
                   Target_output = c(5,9,13,9,9,23,5,31,39,5,5))
If I understand your requirements correctly, you can use nafill after Step 1:
# step 0 & 1 (NA_real_ because Created is a double column, and fifelse requires matching types)
Test[, out :=
  Test[(Valid_reversal)][.SD, on = .(Next_peak = Created), mult = "last",
                         fifelse(is.na(x.Created), i.Created, NA_real_)]
]

# your steps 2, 3, ...
Test[Valid_reversal | is.na(out), out := nafill(out, "locf")]
Edit for the new example: you can use igraph to find the chains:
# step 0 & 1 (NA_real_ to match the double-typed Created column)
Test[, out :=
  Test[(Valid_reversal)][.SD, on = .(Next_peak = Created), mult = "last",
                         fifelse(is.na(x.Created), i.Created, NA_real_)]
]

# steps 2, 3, ...
library(igraph)
g <- graph_from_data_frame(Test[Valid_reversal | is.na(out)])
# clusters() is also available under its newer name, components()
DT <- setDT(stack(clusters(g)$membership), key = "ind")[,
  ind := as.numeric(levels(ind))[ind]][,
  root := min(ind), values]
Test[Valid_reversal | is.na(out), out := DT[.SD, on = .(ind = Created), root]]
Just for completeness, here is a while-loop version:
# step 0 & 1 (NA_real_ to match the double-typed Created column)
Test[, out :=
  Test[(Valid_reversal)][.SD, on = .(Next_peak = Created), mult = "last",
                         fifelse(is.na(x.Created), i.Created, NA_real_)]
]

# steps 2, 3, ...
while (Test[, any(is.na(out))]) {
  Test[is.na(out), out := Test[.SD, on = .(Next_peak = Created), mult = "first", x.out]]
}
Test
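For completeness, here is a self-contained check of the while-loop version against the second (non-monotonic) sample data set from the question; NA_real_ is used in the join because the sample columns are doubles:

```r
library(data.table)

# second sample data set from the question (Next_peak not monotonic)
Test <- data.table(Created        = c(5,9,13,15,19,23,27,31,39,42,49),
                   Next_peak      = c(27,15,15,19,23,27,42,39,42,49,50),
                   Valid_reversal = c(T,T,F,T,F,F,T,F,F,T,F),
                   Target_output  = c(5,9,13,9,9,23,5,31,39,5,5))

# step 0 & 1: rows whose Created is never a valid Next_peak start their own chain
Test[, out :=
  Test[(Valid_reversal)][.SD, on = .(Next_peak = Created), mult = "last",
                         fifelse(is.na(x.Created), i.Created, NA_real_)]
]

# steps 2, 3, ...: repeatedly pull the chain root from the row this one points back to
while (Test[, any(is.na(out))]) {
  Test[is.na(out), out := Test[.SD, on = .(Next_peak = Created), mult = "first", x.out]]
}

# Test$out is now 5 9 13 9 9 23 5 31 39 5 5, matching Target_output
```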
I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to check the condition against neighboring observations only, in either direction (i.e., to check whether, out of 4 neighboring observations in a given column of my data frame, at least 3 of them contain the same value). I first tried creating an indicator variable recording whether the condition is satisfied (1 = yes, 0 = no). Then I set up a series of ifelse() statements within a loop to try to assign the proper categorization to the observations where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the data frame after running the loop, only the observation where the condition is satisfied receives the value, rather than its neighbors as well. Here is my code:
# sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NA

# loop stays 2 observations away from each edge to avoid out-of-bounds indexing
for (i in 3:(nrow(sample_dat) - 2)) {
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i] == 1 &
                                    ((sample_dat$initial_ind[i - 2] == 1 |
                                        sample_dat$initial_ind[i - 1] == 1) &
                                       (sample_dat$initial_ind[i + 2] == 1 |
                                          sample_dat$initial_ind[i + 1] == 1)),
                                  "trending",
                                  "non-trending")
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 out of 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over a rolling window of your data. The question then becomes one of creating a function that satisfies your needs. If I've understood correctly, you want a function that checks whether the condition is true for at least 3 out of 5 observations: the observation plus its four closest neighbors. In that case, adding up the 1s and checking whether the sum is above 2 works.
library(zoo)

sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))

trend_test <- function(x) {
  ifelse(sum(x) > 2, "trending", "non-trending")
}

sample_dat$violate_new <- rollapply(sample_dat$initial_ind, FUN = trend_test,
                                    width = 5, fill = NA)
Edit: If you want a function that checks whether the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument of rollapply:
trend_test_2 <- function(x) {
  ifelse(sum(x) > 2, "trending", "non-trending")
}

sample_dat$violate_new <- rollapply(sample_dat$initial_ind, FUN = trend_test_2,
                                    width = 4, fill = NA, align = "left")
I cluster product IDs on amount of sales and profit of sales to identify product IDs on which I need to focus more.
The code below takes column 2 (amount of sales) and column 3 (profit of sales) as input for kmeans. With the current labeling, row 1 is product 1, row 2 is product 2, etc. I want the labels to be the product IDs (which are in data_nz[,1]) instead of row indices.
k2 <- kmeans(data_nz[,2:3], centers = 3, nstart = 1000)
When I output the observations in my clusters (excluding cluster 2, because those are the ones I don't care about):
k2$cluster[k2$cluster != 2]
I get the row indices and the cluster number, but what I want is the product ID and the cluster number.
My dataset has the columns: Product_ID, amount_of_sales, profit_of_sales.
Can someone point me in the right direction?
You already have an ordered vector of product IDs in data_nz[, 1], which matches the vector with cluster number (k2$cluster). You can look at them side by side like this:
data.frame(product_id = data_nz[[1]],
           cluster = k2$cluster)
If you want to drop certain rows you can:
data.frame(product_id = data_nz[[1]],
           cluster = k2$cluster)[k2$cluster != 2, ]
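As a runnable sketch with invented product IDs and sales figures (your real data_nz would replace this):

```r
set.seed(42)

# invented example data in the same shape as the question's data_nz:
# three visibly separated groups of products
data_nz <- data.frame(Product_ID      = paste0("P", 1:9),
                      amount_of_sales = c(10, 12, 11, 100, 105, 98, 500, 510, 495),
                      profit_of_sales = c(1, 2, 1, 20, 22, 19, 90, 95, 92))

k2 <- kmeans(data_nz[, 2:3], centers = 3, nstart = 1000)

# label cluster assignments with product IDs instead of row indices
res <- data.frame(product_id = data_nz[[1]], cluster = k2$cluster)
```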
I want to calculate a running total in the invested-money column. My conditions: if buy_indicator == "BUY" and sell_indicator == "HOLD", the invested-money value should be negative close_price * 100, where 100 is the volume of shares (a constant); else if buy_indicator == "HOLD" and sell_indicator == "SELL", the invested-money value should be positive close_price * 100; else (buy_indicator == "HOLD" and sell_indicator == "HOLD") it should carry the previous row's value.
My dataset looks like this: [screenshot of the dataset omitted]
You can use ifelse to generate a column of +1 or -1, which you can then multiply by 100 * close price. Example:
positiveorneg <- ifelse(buy_indicator == "BUY" & sell_indicator == "HOLD", -1, 1)
moneytoinvest <- positiveorneg * 100 * closeprice
You can then use cumsum to get the hopefully positively trended line of your money.
mymoney <- cumsum(moneytoinvest)
Don't spend it all in one place.
EDIT: if you have more than one condition you can nest ifelse statements:
ifelse(buy_indicator == "BUY" & sell_indicator == "HOLD", -1,
       ifelse(buy_indicator == "HOLD" & sell_indicator == "SELL", 1, 0))
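Putting the pieces together on a small invented data set (the indicator and price values below are made up; real data would come from your table):

```r
# hypothetical sample in the shape described in the question
trades <- data.frame(buy_indicator  = c("BUY", "HOLD", "HOLD", "BUY", "HOLD"),
                     sell_indicator = c("HOLD", "HOLD", "SELL", "HOLD", "SELL"),
                     close_price    = c(10, 11, 12, 9, 13))

# -1 for a buy, +1 for a sell, 0 for hold/hold (adding 0 keeps the running total unchanged)
direction <- ifelse(trades$buy_indicator == "BUY" & trades$sell_indicator == "HOLD", -1,
             ifelse(trades$buy_indicator == "HOLD" & trades$sell_indicator == "SELL", 1, 0))

# running total of invested money
trades$invested <- cumsum(direction * 100 * trades$close_price)
# invested: -1000 -1000 200 -700 600
```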
I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which genes in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set, I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set. (The top section of the picture is an example of the spreadsheet with a numerical value at each time point; the bottom section is an example of what the time comparisons looked like across the data.)
I then imposed an advanced filter with multiple AND / OR criteria that followed this logic (source: Jeeped):
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set (in the photograph, A on the left is an example of the original filter, and B shows the filter sliding across the data).
The filters produced multiple sheets of genes that fit my criteria. To figure out how many unique genes met the criteria, I merged Column A (Gene_IDs) of all the sheets and removed duplicates to produce a list of unique gene IDs.
The process took me nearly 3 hours due to the size of each spreadsheet (37 columns, 27,000 rows before filtering). Can this process be expedited? And if so, can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R, but a common and easy-to-reason-about way is to use the any function. It basically takes a series of logical tests and puts an "OR" between them, so that if any of them returns true, it returns true. You can run it across the relevant columns of each row and then combine the result with an AND with the logical test for column a. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
  a = 1:100,
  b = 1:100,
  c = 51:150,
  d = 101:200,
  value = rep("a", 100)
)

# apply() runs any() once per row, so the OR is evaluated row by row
# (calling any() on whole columns would collapse everything to a single TRUE/FALSE)
df[df$a > 2 & apply(df[, c("b", "c", "d")] > 5, 1, any), "value"] <- "Test Passed!"
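To mirror the question's actual criteria (column A >= 2.0 AND at least one of columns B-E <= 0.5) row by row, a sketch with invented gene IDs and ratio values might look like this:

```r
# invented ratio table in the shape of the question's filtered spreadsheet
ratios <- data.frame(Gene_ID = c("g1", "g2", "g3"),
                     A = c(2.5, 1.0, 3.0),
                     B = c(0.4, 0.4, 0.9),
                     C = c(0.9, 0.9, 0.9),
                     D = c(0.9, 0.9, 0.4),
                     E = c(0.9, 0.9, 0.9))

# apply() evaluates any() once per row, so the OR runs across columns B..E
keep <- ratios$A >= 2 & apply(ratios[, c("B", "C", "D", "E")] <= 0.5, 1, any)

# the list of unique gene IDs meeting the criteria
unique(ratios$Gene_ID[keep])
# g1 and g3 pass: A >= 2 and at least one later column <= 0.5
```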