I have a number of sales opportunities in various Excel files, broken down by region, type, etc. Each file is a single column that simply lists the dollar amount of each opportunity. In R I have run a simulation to determine whether each opportunity closes with a sale or not, and repeated the simulation 100,000 times. I know I can't pass the full results table back to Tableau, because it has 100,000 rows (one total per simulation), while the data I'm pulling into Tableau only has the dollar value of each opportunity, so its length is just the number of opportunities of that type.
What I have in R is basically the first block of code below, repeated a number of times with varying inputs and probabilities; the totals vectors are then combined to get a quarter total vector.
## number of opportunities in this pipeline
APN <- ncol(APACPipelineNew)
## one close/no-close draw per opportunity per simulation (100,000 x APN matrix)
APNSales <- matrix(rbinom(100000 * APN, 1, 0.033), 100000, APN)
## scale each column by the corresponding opportunity's dollar value
APNSales <- sweep(APNSales, 2, APACPipelineNew, '*')
## total simulated sales for each of the 100,000 runs
APNTotals <- rowSums(APNSales)
...
Q1APACN <- APNTotals + ABNTotals + AFNTotals
...
Q1Total <- Q1APACT + Q1EMEAT + Q1NAMT
What I'd like to do is set this up as a dashboard in Tableau so that it can automatically update each week, but I'm not sure how to pass the simulation back into Tableau given the difference in length of the data.
Some suggestions:
For R, you can use the Windows Task Scheduler to run a job at any given interval (or use the taskscheduleR package; see the sketch below).
After you save the R output, you can manually refresh your dashboard if it is in Tableau Desktop (I do not know whether you can schedule an extract refresh from a desktop dashboard).
However, if your dashboard lives on Tableau Server, you can schedule an extract refresh every week. Obviously, schedule the R update to run before the Tableau extract refresh.
If you only want the data to update when the number of rows differs from the previous weekly run, you can build that logic into R, although saving the R data and refreshing the extract with the same data and row count should not cause any problems.
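For example, a minimal sketch with taskscheduleR (the task name, script path, and start time are placeholders to replace with your own; the scheduled script would end by writing out whatever summary file Tableau reads, e.g. with write.csv):
library(taskscheduleR)
## schedule an existing R script (placeholder path) to run every Monday at 06:00,
## before the weekly Tableau extract refresh
taskscheduler_create(
  taskname  = "weekly_pipeline_simulation",
  rscript   = "C:/scripts/run_pipeline_simulation.R",
  schedule  = "WEEKLY",
  days      = "MON",
  starttime = "06:00"
)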
I've got a dataframe, which is stored in a csv, of 63 columns and 1.3 million rows. Each row is a chess game, each column is details about the game (e.g. who played in the game, what their ranking was, the time it was played, etc). I have a column called "Analyzed", which is whether someone later analyzed the game, so it's a yes/no variable.
I need to use the API offered by chess.com to check whether a game has been analyzed. That's easy. However, how do I systematically update the csv without wasting huge amounts of time reading it in and writing it out, while accounting for the fact that the job will take a long time and needs to be done in stages? I believe a best practice for chess.com's API is to call Sys.sleep after every request to lower the likelihood of accidentally making concurrent requests, which the API doesn't handle very well, so I sleep for a quarter of a second after each call. Even if the API call itself took no time, the program would need to run for about 90 hours because of the sleep time alone. My goal is to be able to run the program in chunks, so that I don't need to run it for 90 hours in a row.
The code below works great for getting whether a game has been analyzed, but I don't know how to intelligently update the original csv file. I think my best bet would be to write out the new data frame and replace the old Games.csv every 1,000 or so API calls. See the commented code below.
My overall question is: when I need to update a column in a large csv, what is the smart way to do it incrementally?
library(bigchess)
library(jsonlite)
df <- read.csv("Games.csv")
for (i in 1:nrow(df)) {
  ## query the chess.com API for this game and record whether it was analyzed
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- ifelse(isTRUE(data$analysisLogExists), 1, 0)
  ## pause between calls to avoid accidental concurrent requests
  Sys.sleep(0.25)
  ## This won't work because the second time I run it I'll just reread the original rows.
  ## If I try to account for this by subsetting only the rows that haven't been updated,
  ## then the write command below will no longer be writing the whole dataset to the csv.
  if (i %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
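One possible way to make this resumable (a sketch, not the only option; it assumes the Analyzed column is stored as NA for games that haven't been checked yet, so a re-run skips rows that already hold a 0/1 value):
library(jsonlite)
df <- read.csv("Games.csv")
## rows still to be checked; on a re-run, previously filled rows are skipped
todo <- which(is.na(df$Analyzed))
for (i in todo) {
  data <- read_json(df$urls[i])
  df$Analyzed[i] <- ifelse(isTRUE(data$analysisLogExists), 1, 0)
  Sys.sleep(0.25)
  ## checkpoint: overwrite the full csv every 1,000 completed calls, so stopping
  ## the script loses at most the last 1,000 results
  if (match(i, todo) %% 1000 == 0) {
    write.csv(df, "Games.csv", row.names = FALSE)
  }
}
## final write after the loop (or after interrupting a chunk)
write.csv(df, "Games.csv", row.names = FALSE)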
I have been following the DADA2 tutorial in R for a 16S data set, and everything runs smoothly; however, I do have a question about how to calculate the total percentage of merged reads. After the merging step:
merger <- mergePairs(dadaF1, derepF1, dadaR1, derepR1, verbose=TRUE)
and then tracking the reads through each step:
getN <- function(x) sum(getUniques(x))
track <- cbind(out_2, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
The resulting track data frame (made with the code above) has one row per sample, with columns input, filtered, denoisedF, denoisedR, merged, and nonchim.
Here input is the total number of sequences I put in (after demultiplexing), filtered is the total after filtering on parameters of my choosing, denoisedF and denoisedR are the denoised forward and reverse reads, merged is the total number of merged reads (from the mergePairs command above), and nonchim is the total number of sequences that are not chimeras.
My question is this: to calculate the percent of merged reads, is it a simple division? Say, taking the first row: (417/908) * 100 = 46%. Or should I somehow incorporate the denoisedF and denoisedR columns in this calculation?
Thank you very much in advance!
The track object is a matrix (see class(track)), thus you can run operations accordingly. In your case:
track[, "merged"]/track[, "input"] * 100
Or, you could convert the track object into a data frame for a "table" output.
However, I usually export the track output as an Excel file and do my modifications there; it is easier to share and comment on with non-R users.
I find the write_xlsx function from the writexl package particularly convenient.
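For example, a short sketch combining the two (the output file name is a placeholder):
library(writexl)
## convert the track matrix to a data frame, keeping the sample names as a
## column because write_xlsx does not write row names
track_df <- data.frame(sample = rownames(track), as.data.frame(track))
track_df$percent_merged <- track_df$merged / track_df$input * 100
## export for sharing with non-R users
write_xlsx(track_df, "track_summary.xlsx")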
Cheers
I am working with a time-series data stream from an experiment. We record multiple data channels, including a trigger channel ('X7.ramptrig' in linked data: Time-Series Data Example), that indicates when a relevant event occurs in other channels.
I am trying to create subsets of the next n rows (e.g. 15,000) of the time series (time steps are 0.1 ms) that occur after the onset of a trigger ('1'). The trigger column has multiple '1's interspersed at irregular intervals; every other time step is '0', indicating no new event.
I am asking whether there is a more efficient way to directly subset the subsequent n rows after a trigger is detected, instead of the indirect (and possibly inflexible) solution I have come up with.
Link to simple example data:
https://gtvault-my.sharepoint.com/:t:/g/personal/shousley6_gatech_edu/EZZSVk6pPpJPvE0fXq1W2KkBhib1VDoV_X5B0CoSerdjFQ?e=izlkml
I have a working solution that creates an index from the trigger channel and splits the dataset on that index. Because trigger placement in time varies, the resulting data frame subsets are not consistent in length, and there are occasionally 'extra' subsets that precede the 'important' ones ('res$0' in the example). Additionally, I need the subsets to be matched for total time and aligned to trigger onset.
My current solution 'cuts' the list of data frames down to the same size (in the example, the first 15,000 rows). While this technically works, it seems clunky. I also tried to translate a SQL solution using FETCH NEXT, but that syntax is not available in the SQLite flavor supported in R.
I am completely open to alternatives so please be unconstrained by my current solution.
library(dplyr)
## create an index that increments at every rising edge of the trigger channel
idx <- c(0, cumsum(diff(Time_Series_Data_Example$X7.ramptrig) > 0))
## split the original data frame on those trigger onsets
split1 <- split(Time_Series_Data_Example, idx)
## cut each data frame down to 1.5 s (15,000 rows at 0.1 ms per step)
res <- lapply(split1, function(x) top_n(x, -15000))
Here is an example of the output: head(res[["1"]])
For the example data and code provided, the output is 4 subsets, 3 of which are 'important' and time synced to the trigger. The first 'res$0' is a throw away subset.
Thanks in advance and please let me know how I can improve my question asking (this is my first attempt).
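One more direct alternative (a sketch, untested against the linked file; it assumes the trigger column is 0/1 and that at least 15,000 rows follow each onset): find the row index of each trigger onset and slice a fixed-length window starting there, which avoids the throw-away 'res$0' subset and keeps every window the same length and aligned to its trigger.
## row indices where the trigger channel rises from 0 to 1
onsets <- which(diff(Time_Series_Data_Example$X7.ramptrig) > 0) + 1
## take exactly 15,000 rows (1.5 s at 0.1 ms steps) starting at each onset
res <- lapply(onsets, function(start) {
  Time_Series_Data_Example[start:(start + 15000 - 1), ]
})
names(res) <- seq_along(onsets)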
I am trying to integrate an R time series model with Tableau, and I am new to this integration. Please help me resolve the error mentioned below. Here is my Tableau calculation for the R integration; the calculation is valid, but I am getting an error.
SCRIPT_REAL(
"library(forecast);
cln_count_ts <- ts(.arg1,frequency = 7);
arima.fit <- auto.arima(log10(cln_count_ts));
forecast_ts <- forecast(arima.fit, h =10);",
SUM([Count]))
Error : Error in auto.arima(log10(cln_count_ts)) : No suitable ARIMA model found
When Tableau calls R, Python, or another tool, it does so as a "table calc". That means it sends the external system one or more vectors as arguments and expects a single vector in response.
Depending on your data and calculation, you may want to send all your data to R in a single call, passing a very large vector, or call it several times with different vectors - say forecasting each region separately. Or even call R multiple times with many vectors of size one (aka scalars).
So with table calcs, you have other decisions to make beyond just choosing the function to invoke. Chiefly, you have to decide how to partition your data for analysis. And in some cases, you also need to determine the order that the data appears in the vectors you send to R - say if the order implies a time series.
The Tableau terms for specifying how to divide and order data for table calculations are "partitioning and addressing". See the section on that topic in the online help. You can change those settings by using the "Edit Table Calc" menu item.
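As an illustration only (not a fix for the auto.arima error above, which may depend on how the data is partitioned): a SCRIPT_REAL table calc has to return either a single number or a numeric vector the same length as the one it receives, so the R script's last expression should evaluate to such a vector (the forecast object in the code above is a list, not a plain numeric vector). A minimal sketch along those lines, returning the in-sample fitted values so the lengths match:
SCRIPT_REAL(
"library(forecast);
cln_count_ts <- ts(.arg1, frequency = 7);
arima.fit <- auto.arima(cln_count_ts);
as.numeric(fitted(arima.fit))",
SUM([Count]))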
I have a 3GB csv file. It is too large to load into R on my computer. Instead I would like to load a sample of the rows (say, 1000) without loading the full dataset.
Is this possible? I cannot seem to find an answer anywhere.
If you don't want to pay thousands of dollars for Revolution R so that you can load/analyze your data in one go, then sooner or later you need to figure out a way to sample your data.
And that step is easier to do outside R.
(1) Linux Shell:
Assuming your data is in a consistent format, with one record per row, you can do:
sort -R data | head -n 1000 >data.sample
This randomly sorts all the rows and writes the first 1000 to a separate file, data.sample.
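You can then read the sample back into R. One caveat (a sketch; the file names match the command above): the original header line gets shuffled along with the data, so it may or may not end up in data.sample, and it is safer to read without headers and take the column names from the full file.
## read the shuffled sample without headers
df_sample <- read.csv("data.sample", header = FALSE)
## take the column names from the first line of the full file ("data" above);
## if the header line also landed in the sample, drop that row afterwards
hdr <- read.csv("data", nrows = 1)
names(df_sample) <- names(hdr)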
(2) Database:
If the data will not fit into memory anyway, another solution is to store it in a database. For example, I have many tables stored in a MySQL database in a nice tabular format, and I can take a sample by doing:
select * from tablename order by rand() limit 1000
You can easily communicate between MySQL and R using RMySQL, and you can index columns to keep queries fast. Since the database holds the full dataset, you can also compare the mean or standard deviation of the whole dataset against your sample if you want to check how representative it is.
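A minimal sketch of that round trip with RMySQL (the connection details, table name, and column name are placeholders):
library(RMySQL)
## connect to the database (all credentials below are placeholders)
con <- dbConnect(MySQL(), dbname = "mydb", host = "localhost",
                 user = "myuser", password = "mypassword")
## pull a random sample of 1000 rows into an R data frame
sample_df <- dbGetQuery(con, "select * from tablename order by rand() limit 1000")
## compare the sample to the full table on one (placeholder) numeric column
dbGetQuery(con, "select avg(amount), std(amount) from tablename")
mean(sample_df$amount); sd(sample_df$amount)
dbDisconnect(con)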
Based on my experience, these are the two most commonly used ways of dealing with 'big' data.