Is there a more efficient way to subset time series data frame on irregular, repeated binary trigger column? - r

I am working with a time-series data stream from an experiment. We record multiple data channels, including a trigger channel ('X7.ramptrig' in linked data: Time-Series Data Example), that indicates when a relevant event occurs in other channels.
I am trying to create subsets of the next n-rows (e.g. 15,000) of the time-series (time steps are 0.1ms) that occur after onset of a trigger ('1'). That column has multiple triggers ('1') interspersed at irregular intervals. Every other time step is a '0', indicating no new event.
I am asking to see if there is a more efficient solution to directly subset the subsequent n-rows after a trigger is detected instead of the indirect (possibly inflexible) solution I have come up with.
Link to simple example data:
https://gtvault-my.sharepoint.com/:t:/g/personal/shousley6_gatech_edu/EZZSVk6pPpJPvE0fXq1W2KkBhib1VDoV_X5B0CoSerdjFQ?e=izlkml
I have a working solution that creates an index from the trigger channel and splits the dataset on that index. Because triggers have variability in placement in time, the subsequent data frame subsets are not consistent and there are occasionally 'extra' subsets that precede 'important' ones ('res$0' in example). Additionally, I need to have the subsets be matched for total time and aligned for trigger onset.
My current solution 'cuts' the lists of data frames to the same size (in the example to the first 15,000 rows). While this technically works it seems clunky. I also tried to translate a SQL solution using FETCH NEXT but those functions are not available in the SQLite supported in R.
I am completely open to alternatives so please be unconstrained by my current solution.
library(dplyr)  # provides top_n()
## create index to detect whenever an event trigger occurs
idx <- c(0, cumsum(diff(Time_Series_Data_Example$X7.ramptrig) > 0))
## split the original data frame on event triggers
split1 <- split(Time_Series_Data_Example, idx)
## cut each data frame down to 1.5 s (15,000 rows at 0.1 ms per step)
res <- lapply(split1, function(x) {
  top_n(x, -15000)
})
Here is an example of data output: 'head(res[["1"]])'
For the example data and code provided, the output is 4 subsets, 3 of which are 'important' and time synced to the trigger. The first 'res$0' is a throw away subset.
Thanks in advance and please let me know how I can improve my question asking (this is my first attempt).
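One possible way to take the next n rows directly, sketched in base R: locate each 0 -> 1 transition in the trigger column and index a fixed-length window starting at that row. The function name extract_after_trigger and the toy data are hypothetical, and the sketch assumes a window starts on the trigger row itself, as it does in the split-based approach above.

```r
# Extract the n rows starting at each trigger onset (0 -> 1 transition).
# `df`, `trig_col` and `n` are placeholders for the real data and settings.
extract_after_trigger <- function(df, trig_col, n) {
  trig <- df[[trig_col]]
  # A row is an onset when it is 1 and the previous row was 0
  onsets <- which(diff(c(0, trig)) == 1)
  # Drop onsets that do not have n full rows available
  onsets <- onsets[onsets + n - 1 <= nrow(df)]
  windows <- lapply(onsets, function(i) df[i:(i + n - 1), , drop = FALSE])
  names(windows) <- as.character(onsets)
  windows
}

# Toy data (hypothetical): triggers at rows 3, 8 and 12
toy <- data.frame(time = 1:12,
                  X7.ramptrig = c(0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1))
res <- extract_after_trigger(toy, "X7.ramptrig", n = 3)
```

Every element of res has exactly n rows and begins at a trigger onset, so no throw-away subset is produced and no trimming step is needed. With the real data the call would be something like extract_after_trigger(Time_Series_Data_Example, "X7.ramptrig", 15000).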

Related

Grouping and transposing data in R

It is hard to explain this without just showing what I have, where I am, and what I need in terms of data structure:
What structure I had:
Where I have got to with my transformation efforts:
What I need to end up with:
Notes:
I've not given actual names for anything as the data is classed as sensitive, but:
Metrics are things that can be measured- for example, the number of permanent or full-time jobs. The number of metrics is larger than presented in the test data (and the example structure above).
Each metric has many years of data (whilst writing the code I have restricted myself to just 3 years, and the illustration of the structure is based on this test). The number of years captured will change over time; generally it will increase.
The number of policies will fluctuate, I've just labelled them policy 1, 2 etc for sensitivity reasons and limited the number whilst testing the code. Again, I have limited the number to make it easier to check the outputs.
The source data comes from a workbook of surveys with a tab for each policy. The initial import creates a list of tibbles consisting of a row for each metric, and 4 columns (the metric names, the values for 2024, the values for 2030, and the values for 2035). I converted this to a dataframe, created a vector to be a column header and used cbind() to put this on top to get the "What structure I had" data.
To get to the "Where I have got to with my transformation efforts" version of the table, I removed all the metric columns, created another vector of metrics and used rbind() to put this as the first column.
The idea in my head was to group the data by policy to get a vector for each metric, then transpose this so that the metric became the column, and the grouped data would become the row. Then expand the data to get the metrics repeated for each year. A friend of mine who does coding (but has never used R) has suggested using loops might be a better way forward. Again, I am not sure of the best approach so welcome advice. On Reddit someone suggested using pivot_wider/pivot_longer but this appears to be a summarise tool and I am not trying to summarise the data rather transform its structure.
Any suggestions on approaches or possible tools/functions to use would be gratefully received. I am learning R whilst trying to pull this data together to create a database that can be used for analysis, so, if my approach sounds weird, feel free to suggest alternatives. Thanks
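For what it is worth, pivot_longer() and pivot_wider() from tidyr reshape data without aggregating it, so they may fit the transformation described. The sketch below assumes the import is a named list with one tibble per policy, each holding a metric column plus one column per year, as described above. The object names, metric names and values are all hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the imported workbook: one tibble per policy tab
sheets <- list(
  policy_1 = tibble(metric = c("metric_a", "metric_b"),
                    `2024` = c(10, 1), `2030` = c(20, 2), `2035` = c(30, 3)),
  policy_2 = tibble(metric = c("metric_a", "metric_b"),
                    `2024` = c(11, 4), `2030` = c(21, 5), `2035` = c(31, 6))
)

reshaped <- bind_rows(sheets, .id = "policy") %>%   # stack tabs, keep policy name
  pivot_longer(cols = -c(policy, metric),           # year columns become rows
               names_to = "year", values_to = "value") %>%
  pivot_wider(names_from = metric,                  # metrics become columns
              values_from = value)
```

The result has one row per policy and year and one column per metric. Extra years or policies in the source need no change to the code, because the year columns are selected by exclusion rather than by name.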

R - how to use a function on a list of data frames

I have a little problem with my code. I hope you can help me :)
I used an apply function to create a list of 20 data frames (data about stock index returns, grouped by year and index: three companies and the stock index, over 5 years). Now I want to apply a function with two arguments to each of them. It calculates the ratio of the covariance between the returns of a selected company and the stock index to the variance, for every year, which is why I grouped the data. How can I do this automatically, without manually typing code for every year and company?
I don't know whether I should use a for loop or whether there is another way.
The other thing: how can I delete unnecessary columns from a list of data frames?
I'll be thankful for your help.
And sorry for my English :D
You may consider purrr::map_dfr(). The first argument will be your list of data frames, and the second the action to do with that data frame. The final result will be a single data frame uniting the result of all of the above. Your code will likely look something like this:
purrr::map_dfr(list_of_dataframes, function(x) {...})
Within the braces, replace ... with your logic. In that context, x will be list_of_dataframes[[1]], then list_of_dataframes[[2]], and so on.
You may want to consult the documentation of the package purrr for further details.
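As a rough illustration of that pattern, the sketch below applies a two-argument function to every data frame in a list and also drops an unwanted column from each. The function cov_to_var, the column names company, index and extra, and the simulated data are all hypothetical, and the variance is assumed to be that of the index returns.

```r
library(purrr)

# Hypothetical two-argument function: covariance of company and index returns
# divided by the variance of the index returns (an assumption)
cov_to_var <- function(company, index) cov(company, index) / var(index)

# Hypothetical stand-in for the list: one data frame per company and year
set.seed(1)
returns_list <- list(
  A_2015 = data.frame(company = rnorm(10), index = rnorm(10), extra = 1),
  B_2015 = data.frame(company = rnorm(10), index = rnorm(10), extra = 1)
)

# Drop an unnecessary column from every data frame in the list
returns_list <- lapply(returns_list,
                       function(x) x[, names(x) != "extra", drop = FALSE])

# Apply the function to each data frame; .id stores the list names
results <- map_dfr(returns_list,
                   function(x) data.frame(ratio = cov_to_var(x$company, x$index)),
                   .id = "group")
```

results then holds one row per data frame, with the list name in group and the computed value in ratio.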

R Updating A Column In a Large Dataframe

I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game is analyzed. That's easy. However, how do I systematically update the csv file, without wasting huge amounts of time reading in and writing out the csv file, while accounting for the fact that this is going to take a huge amount of time and I need to do it in stages? I believe a best practice for chess.com's API is to use Sys.sleep after every API call so that you lower the likelihood that you are accidentally making concurrent requests, which the API doesn't handle very well. So I have Sys.sleep for a quarter of a second. If we assume the API call itself takes no time, then this means this program will need to run for 90 hours because of the sleep time alone. My goal is to make it so that I can easily run this program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great to get whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to write out the new dataframe and replace the old Games.csv every 1000 or so API calls. See the commented code below.
My overall question is, when I need to update a column in csv that is large, what is the smart way to update that column incrementally?
library(bigchess)
library(rjson)
library(jsonlite)  # provides read_json()
df <- read.csv("Games.csv")
for (i in 1:nrow(df)) {
  data <- read_json(df$urls[i])
  if (data$analysisLogExists == TRUE) {
    df$Analyzed[i] <- 1
  }
  if (data$analysisLogExists == FALSE) {
    df$Analyzed[i] <- 0
  }
  Sys.sleep(.25)
  ## This won't work because the second time I run it I'll just reread the original lines.
  ## If I try to account for this by subsetting only the columns that haven't been updated,
  ## then this still doesn't work because the write command below will not be writing the whole dataset to the csv.
  if (i %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
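One way to make such a loop resumable, as a minimal sketch: treat NA in the Analyzed column as "not checked yet", loop only over those rows, and always write the whole data frame back. This assumes unchecked rows are stored as NA in the csv. The function name update_analyzed and its arguments are hypothetical, and the API call is passed in as check_fun so the logic can be tried without network access.

```r
# Resumable update: only rows whose Analyzed value is still NA are processed,
# and the full data frame is saved every `chunk` processed rows.
update_analyzed <- function(path, check_fun, chunk = 1000, pause = 0.25,
                            max_rows = Inf) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  if (!"Analyzed" %in% names(df)) df$Analyzed <- NA
  todo <- which(is.na(df$Analyzed))                    # rows not yet checked
  todo <- todo[seq_len(min(length(todo), max_rows))]   # optional cap per run
  done <- 0
  for (i in todo) {
    df$Analyzed[i] <- as.integer(isTRUE(check_fun(df$urls[i])))
    done <- done + 1
    if (done %% chunk == 0) write.csv(df, path, row.names = FALSE)
    Sys.sleep(pause)
  }
  write.csv(df, path, row.names = FALSE)               # final save for this run
  invisible(df)
}

# Possible use with the API call from the question (not run here):
# update_analyzed("Games.csv",
#                 function(u) jsonlite::read_json(u)$analysisLogExists,
#                 max_rows = 50000)
```

Because every run rereads the file and skips rows that already have a value, the job can be stopped and restarted at any point and loses at most the last chunk of API calls. Rewriting a 1.3 million row csv every 1000 calls is still fairly slow, so a larger chunk, or a faster reader and writer such as data.table::fread() and fwrite(), may be worth considering.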

DADA2 - Calculating percent of reads that merged

I have been following the tutorial for DADA2 in R for a 16S data-set, and everything runs smoothly; however, I do have a question on how to calculate the total percent of merged reads. After the step to track reads through the pipeline with the following code:
merger <- mergePairs(dadaF1, derepF1, dadaR1, derepR1, verbose=TRUE)
and then tracking the reads through each step:
getN <- function(x) sum(getUniques(x))
track <- cbind(out_2, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
I get a table that looks like this (here I am viewing the resulting track data frame made with the above code):
Here, input is the total number of sequences I put in (after demuxing) and filtered is the total after filtering based on a parameter of my choosing. denoisedF and denoisedR are the sequences that have been denoised (forward and reverse reads, respectively), merged is the total number of merged reads (from the mergePairs command above), and nonchim is the total number of sequences that are not chimeras.
My question is this .... to calculate the percent of merged reads - is this a simple division? Say take the first row - (417/908) * 100 = 46% or should I somehow incorporate the denoisedF and denoisedR columns in this calculation?
Thank you very much in advance!
The track object is a matrix (see class(track)), thus you can run operations accordingly. In your case:
track[, "merged"]/track[, "input"] * 100
Or, you could convert the track object into a data frame for a "table" output.
However, I usually export the track output as an excel file and then do my modification there. It is easier to be shared and commented on with non-R users.
I find the write_xlsx function from the writexl package particularly convenient.
Cheers
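A small worked example of that division, as a sketch only. In the matrix below, only 908 (input) and 417 (merged) come from the question; every other number is made up for illustration. It assumes the columns are named as in the description above (input, filtered, denoisedF, denoisedR, merged, nonchim).

```r
# Hypothetical track matrix; only 908 and 417 are taken from the question
track <- matrix(c(908, 800, 780, 770, 417, 400,
                  1000, 900, 880, 870, 600, 580),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("sample1", "sample2"),
                                c("input", "filtered", "denoisedF",
                                  "denoisedR", "merged", "nonchim")))

# Percent of input reads that ended up merged
pct_of_input <- track[, "merged"] / track[, "input"] * 100

# Percent of denoised forward reads that merged (a different denominator)
pct_of_denoised <- track[, "merged"] / track[, "denoisedF"] * 100

round(pct_of_input, 1)
```

Which denominator is appropriate depends on what you want to report: dividing by input measures how much of the raw data survives through merging, while dividing by denoisedF isolates the loss at the merging step itself.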

Summarizing attributes across sequences in a single sequence object?

I'm using TraMineR to analyze sets of sequences. Each coherent set of sequences may contain 100 work processes from a single project for a single period of time. Using TraMineR I can easily calculate descriptive statistics for each sequence, however I'm more interested in descriptive statistics of the sequence object itself - subsuming all the smaller sequences within.
For example, to get state frequencies, I run:
seqstatd(sequences.sts)
However, this gives me the state frequencies for each sequence within my sequence object. I want to access the frequencies of states across all sequences inside of my sequence object. How can I accomplish this?
I am not sure I understand your question, since seqstatd() returns the cross-sectional frequencies at each successive position, and NOT the state frequencies for each sequence. The latter are returned by seqistatd().
Assuming you refer to the outcome of seqistatd(), you would get the mean time spent in each state with seqmeant(sequences.sts).
For other summaries you can use the apply function. For instance, you get the variance of the time spent in each state with
tab <- seqistatd(mvad.seq)
vart <- apply(tab,2,var)
head(vart)
Hope this helps.
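If the goal is a single distribution of states over the whole sequence object, one option is to sum the per-sequence counts from seqistatd() and normalise them. This is a sketch using the mvad example data shipped with TraMineR; it assumes columns 17 to 86 of mvad hold the monthly states.

```r
library(TraMineR)

data(mvad)
mvad.seq <- seqdef(mvad, 17:86)     # assumption: columns 17:86 are the states

tab <- seqistatd(mvad.seq)          # time spent in each state, per sequence
overall <- colSums(tab) / sum(tab)  # share of all positions spent in each state
round(overall, 3)
```

When all sequences have the same length, this is the same information as seqmeant() expressed as proportions rather than as mean durations.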
