I have 30GB of COVID viral-exposure risk data in a single flat CSV file. I have built an R Shiny app with filters that let me select a subset of these data and plot them. Eventually I would like a handful of people to use the app, with password protection, in a secure environment outside my organisation. Obviously I can't read the 30GB into memory, so I have tried:
disk.frame
SQLite
Downsampling
But these are still super slow. I'm told that if I move to Azure SQL my data bottleneck will go away, but that's about $10K per year. Is that my best option?
Please ask if I need to give more info.
EDIT: Some info about the data:
Each row is a simulated passenger on public transport. Each column is some attribute of their journey: where they got on/off, the times they got on/off, how close they were to an infectious person, whether they themselves were infectious, and then the dose they received through breathing it in or touching a surface. Then there are columns about the passenger loading, the prevalence of the virus in the community, and others about the train itself. So there is a lot of repetition and no indexing. I think that might be where the bottleneck is coming from.
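If the filters always hit the same handful of columns, indexing those columns in the SQLite file you have already built may help a lot. A minimal sketch, assuming a table called passengers and hypothetical column names mask, boarding_station, load_factor and ventilation:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "passengers.sqlite")  # placeholder file name

# Index the columns the Shiny filters actually use, so SQLite can avoid
# scanning the whole 30GB table for every query.
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_mask      ON passengers (mask)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_boarding  ON passengers (boarding_station)")
dbExecute(con, "CREATE INDEX IF NOT EXISTS idx_load_vent ON passengers (load_factor, ventilation)")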
An example query would be:
Select passengers who wore masks &
passengers that got on at FIN &
passengers where the train was 50% full &
passengers that were on a train with bad ventilation.
Plot the dose they received (roughly as sketched below).
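For illustration, that kind of filter can be pushed down to the database so that only the matching rows ever reach R. A rough dbplyr sketch, reusing the connection and the hypothetical column names from the indexing example above:

library(dplyr)
library(dbplyr)

passengers <- tbl(con, "passengers")    # lazy reference; nothing is read yet

dose_subset <- passengers %>%
  filter(mask == 1,
         boarding_station == "FIN",
         load_factor == 0.5,
         ventilation == "bad") %>%
  select(dose) %>%
  collect()                             # only the filtered rows come into memory

hist(dose_subset$dose)

dbplyr translates the pipeline into a single SQL query, so the heavy lifting happens in the database rather than in the Shiny session.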
Azure SQL will meet the need for a database that will handle that data volume, and it will provide a secure environment in which to do that.
As for pricing, it doesn't need to cost $10K per year, unless you have very specific performance requirements. I just quoted an S2 database (50 DTUs, 250GB storage) for $89/month. If you want super-scalability, you can go serverless, and the same size database can support 2 vCores scaling on demand to 16 vCores for $113/month.
Now, does that mean you have to use Azure's SQL offering? No, but it could be a viable solution for you.
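If you do go down the Azure SQL route, the Shiny app can talk to it in much the same way. A minimal sketch, assuming the Microsoft ODBC Driver 17 for SQL Server is installed, with the server, database and credentials as placeholders:

library(DBI)
library(odbc)

con <- dbConnect(odbc::odbc(),
                 Driver   = "ODBC Driver 17 for SQL Server",
                 Server   = "yourserver.database.windows.net",  # placeholder
                 Database = "passenger_db",                     # placeholder
                 UID      = "shiny_reader",                     # placeholder
                 PWD      = Sys.getenv("AZURE_SQL_PWD"),        # keep the password out of the code
                 Port     = 1433)

The dbplyr code above then works unchanged against Azure SQL, and the filtering again happens server-side.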
We've set up a Cosmos account in North Europe with geo-replication to West Europe. Consistency is set to "Session" (the default). The intent is to use North Europe as the single write region and both regions as read regions. This is because the requirement is to have no performance degradation while batch data is ingested into the database. We are using ADF to do the batch ingestion.
The question I have is: how do I monitor the metrics for the read-only region? When I look at Metrics in Cosmos, I still only see North Europe in the dropdown.
Thanks. It turns out this is not a problem.
I found that when you create a write region and one or more read regions, the other regions' metrics will not be visible until there are some metrics to report. The replication of data itself does not contribute to the metrics/throughput usage.
To test this, I wrote some Python code to fetch some data with the secondary read region set as the preferred location. Just two minutes after executing the code, the read region appeared in the Metrics region dropdown.
The Python code I used to define the client is below:
client = CosmosClient(ENDPOINT, {'masterKey': MASTER_KEY}, preferred_locations=['West Europe'])  # point reads at the secondary read region
I am closing this question.
I saw the same behaviour on my side.
I have a test database with East Asia as the write region and other regions as read. When I opened the Metrics page, only East Asia appeared in the region filter dropdown. I suspect this reflects where the operations come from (all my operations were issued from that region, so it was the only choice offered). After I removed the East Asia region under "Replicate data globally" and ran some queries, I could see the other region in Metrics.
I also tested another of my databases, which does not have global distribution enabled and which I had not used for a long time. When I opened the Metrics page, it offered no region choices at all. But after executing a query and waiting a while, the region showed up in the dropdown.
I couldn't find much support for this in R. I'm trying to read a number of RTF files into R to construct a data frame, but I'm struggling to find a good way to parse the RTF files and ignore the structure/formatting. There are really only two lines of text I want to pull from each file, but they're nested within the structure of the file.
I've pasted a sample RTF file below. The two strings I'd like to capture are:
"Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
"The technology level [...] and managerial implications." (the full paragraph)
Any thoughts on how to efficiently parse this? I think regular expressions might help me, but I'm struggling to form the right expression to get the job done.
{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720
\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720
\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720
\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720
\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720
\f1\fs24 \cf0 \
}
1) A simple way if you are on Windows is to read it in using WordPad or Word and then save it as a plain text document.
2) Alternatively, to parse it directly in R: read in the RTF file and find the lines matching the pattern pat, producing g. Then replace any \\' escape sequences with plain single quotes, producing noq. Finally remove pat and any trailing junk. This works on the sample, but you may need to revise the patterns if there are other embedded \\ sequences besides the \\' we already handle:
Lines <- readLines("myfile.rtf")      # raw RTF lines
pat <- "^\\\\f0.*\\\\cf0 "            # the two text lines start with \f0 ... \cf0
g <- grep(pat, Lines, value = TRUE)   # keep only those lines
noq <- gsub("\\\\'", "'", g)          # drop the backslash from \' escapes so the next step doesn't truncate there
sub("\\\\.*", "", sub(pat, "", noq))  # strip the leading pattern and everything from the first remaining RTF control word
For the indicated file this is the output:
[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."
Is there a proper way, an equation or technique in general, to say: "My web application needs to support N total users, which via this equation/technique/rock-hard experience tells me that I need to support X concurrent page requests"?
From my research and/or gut feeling it seems like it would be something like:
totalLoadCapabilityRequired = (totalUsersN * 0.10) * 0.5
where 0.10 assumes roughly 10% of users are online at any given time,
and the whole thing is multiplied by 0.5 to suggest a 50% chance of those online users executing a request at roughly the same time.
Any insights would help me make sure the support I implement in my application is on par with the demand. I expect a lot of users but don't want to over-anticipate too early. I know for starters that the organisation I am programming for has 45,000 users it wants on my system, with the expectation of many more if it succeeds.
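Purely as an illustrative worked example, plugging those 45,000 users into the rough formula above gives (45,000 * 0.10) * 0.5 = 2,250 concurrent page requests as a ballpark starting point, not a guarantee.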
Here are a couple of things to think about:
What's the time span in which you expect the bulk of your visits? If it's an office application within the same physical company, your capacity planning should be based on an 8-hour period. If most visits will come from the same continent, you can plan for a 12-hour period instead, etc. Base your visitor spread on that.
Which pages do you anticipate will be the most popular and how heavy are those pages (i.e. how many pages can you load in one second)? Get an understanding of parts that would benefit from caching to squeeze out more performance.
Don't plan based on peak load; design your app to scale and start small.
Design your app in a way that lets you take runtime snapshots at, say, every 500th request; you can use tools like xhprof to create files that you can run through cachegrind-compatible tools to analyse the performance as it runs.
In short, there's no catch-all formula :) For a ballpark figure your formula will probably be good enough, but take the above points into consideration.
Suppose I want to regress Gross Profit on Total Revenue in R. I need data for this, and the more, the better.
There is a package on CRAN that I find very useful, quantmod, which does what I need:
library(quantmod)
getFinancials(Symbol = "AMD", src = "google")   # auto-assigns the result to AMD.f
# to get the row names of the matrix: rownames(AMD.f$IS$A)
Total.Revenue <- AMD.f$IS$A["Revenue", ]
Gross.Profit  <- AMD.f$IS$A["Gross Profit", ]
# finally:
reg1 <- lm(Gross.Profit ~ Total.Revenue)
The biggest issue that I have is that this library gets me data only for 4 years (4 observations, and who runs a regression with only 4 observations???). Is there any other way (maybe other libraries) that would get data for MORE than 4 years?
I agree that this is not an R programming question, but I'm going to make a few comments anyway before this question is (likely) closed.
It boils down to this: getting reliable fundamental data across sectors and markets is difficult enough even if you have money to spend. If you are looking at the US then there are a number of options, but all the major (read 'relatively reliable') providers require thousands of dollars per month - FactSet, Bloomberg, Datastream and so on. For what it's worth, for working with fundamental data I prefer and use FactSet.
Generally speaking, because the Excel tools offered by each provider are more mature, I have found it easier to populate spreadsheets with the data and then read the data into R. Then again, I typically deal with the fundamentals of a few dozen companies at most, because once you move out of the domain of your "known" companies the time it takes to check anomalies increases exponentially.
There are numerous potential "gotchas". The most obvious is that definitions vary from sector to sector. "Sales" for an industrial company is very different from "sales" for a bank, for example. Another problem is changes in definitions. Pretty much every year some accounting regulation or other changes and breaks your data series. Last year minorities were reported here, but this year this item is moved to another position in the P&L and so on.
Another problem is companies themselves changing. How does one deal with mergers, acquisitions and spin-offs, for example? This sort of thing can make measuring organic sales growth next to impossible. Yet another point to bear in mind is that if you're dealing with operating or net profit, you have to consider exceptionals and whether to adjust for them.
Dealing with companies outside the US adds a whole bunch of further problems. Of course, the major data providers try to standardise globally (FactSet Fundamentals for example). This just adds another layer of abstraction and typically it is hard to check to see how the data has been manipulated.
In short, getting the data is onerous and I know of no reliable free sources. Unless you're dealing with the simplest items for a very homogenous group of companies, this is a can of worms even if you do have the data.
I have a SQL Server database which contains stock market quotes and other related data.
This database needs to be updated at regular interval, say 1 minute.
My question is:
How do I get stock quotes every 1 minute and update it to database?
I really appreciate your help.
Thanks!
You know, you're really approaching the question from the wrong side. It's like saying "I have a car, a Mercedes coupé: how can I find the best road from A to B?" The answer is totally unrelated to the car.
Same with your question: this is not a SQL or even an ASP.NET question to start with. The solution is independent of both the SQL Server you use and your web technology. Your main question is:
How do I get stock quotes every 1 minute and update them in the database?
Here we go. I assume you (a) mean US stocks and (b) mean all of them, not a handful. One minute is too small an interval to make scraping sites like yahoo.com feasible; the main problem is that there are thousands of stocks (actually more like tens of thousands), and you don't want to hit Yahoo scraping thousands of pages per minute.
At the same time, an end-retail data feed provider will not work. They support X symbols at a time, with X typically in the low hundreds, sometimes upgradable to 500 or so.
If you need stock data every minute for all US stocks, then this is technically identical to "real-time prices", which ends up costing money. In addition you need a commercial, higher-end data feed, of which I know of... one. Sorry. The cost is going to be near or fully four digits, without (!) publication rights.
And that is NxCore: they offer US stocks (all exchanges) in real time, a complete feed with all corrections etc. There is a native API and a C# wrapper, so you can take the real-time data feed, update your current prices in memory and write them out to SQL Server every minute. Preferably not from ASP.NET (a bad choice for something that should run 24/7 without interruption, unless you make significant setup changes) but from an installed Windows service. It takes some bandwidth; I have no real idea how much (I am getting 4 exchanges from them, but no stocks, only the CME Group futures: CME, CBOT, NYMEX and COMEX).
Note that with this setup you can go faster, too, but if you go fully real-time you need a serious server. We're talking about a billion updates or so per day...
An end-user SQL Server setup (i.e. little RAM and a few slow discs) won't work.
Too expensive? There are plenty of data feeds around at a lower price, but they will not give you "stocks" as in "all of them", just "a selection".
If you are OK with non-real-time data, i.e. pulling things down at the end of the day, eoddata.com has a decent offer. You could then pull things up via an ASP.NET page, but again, you will not have the data during the day, just after the close. The smallest granularity is 1 minute. Republication rights are again a no, but you can probably talk to them.
This isn't really SQL Server specific; a typical solution is that you run a process that polls an external source (a web service or the like) at regular intervals and uses that information to update the database. You can either implement this as a simple command-line program that gets executed every minute by the task scheduler, or you can make it a Windows service that sleeps most of the time and only wakes up once a minute to do its processing. Once you have that, writing to the database is business as usual.
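As a very rough sketch of that polling pattern (written in R, since several other questions on this page use R; the quote endpoint, the DSN, the staging table and the stored procedure are all hypothetical placeholders, not a real API), something like the following could be scheduled to run every minute:

library(DBI)
library(odbc)
library(httr)
library(jsonlite)

# Hypothetical quote source; substitute whatever feed you actually license.
fetch_quotes <- function() {
  resp <- GET("https://example.com/api/quotes?symbols=ALL")   # placeholder URL
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))    # assumed to return symbol/price/timestamp rows
}

con <- dbConnect(odbc::odbc(), dsn = "StockDb")               # placeholder DSN pointing at the SQL Server database

quotes <- fetch_quotes()
dbWriteTable(con, "quotes_staging", quotes, overwrite = TRUE) # land the one-minute snapshot in a staging table
dbExecute(con, "EXEC dbo.usp_merge_quotes")                   # hypothetical stored procedure that merges into the live table

dbDisconnect(con)

The same structure carries over directly to a C# console application or Windows service if that fits your stack better.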