Estimating CPU and Memory Requirements for a Big Data Project [closed]

I am working on a big data analysis based on social network data combined with data about the social network's users from other internal sources, such as a CRM database.
I realize there are a lot of good memory profiling, CPU benchmarking, and HPC packages and code snippets out there. I'm currently using the following:
system.time() to measure the CPU and elapsed time of my functions
Rprof(tf <- "rprof.log", memory.profiling=TRUE) to profile memory usage
Rprofmem("Rprofmem.out", threshold = 10485760) to log objects that exceed 10MB
require(parallel) to give me multicore and parallel functionality for use in my functions
source('http://rbenchmark.googlecode.com/svn/trunk/benchmark.R') to benchmark CPU usage differences between single-core and parallel modes
sort( sapply(ls(),function(x){format(object.size(get(x)), units = "Mb")})) to list object sizes
print(object.size(x=lapply(ls(), get)), units="Mb") to give me total memory used at the completion of my script
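For example, I could wrap a few of these calls into a small helper that records one row of measurements per run (a rough sketch; profile_run and its arguments are just placeholder names):
profile_run <- function(workload, input) {
  # time one run of the workload; system.time() gives user/system/elapsed seconds
  timing <- system.time(result <- workload(input))
  # approximate memory footprint of the result, in MB
  mem_mb <- as.numeric(object.size(result)) / 1024^2
  data.frame(
    n       = NROW(input),
    user    = timing[["user.self"]],
    elapsed = timing[["elapsed"]],
    mem_mb  = mem_mb
  )
}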
The tools above give me lots of good data points, and I know that many more tools exist to provide related information as well as to minimize memory use and make better use of HPC/cluster technologies, such as those mentioned in this StackOverflow post and in CRAN's HPC task view. However, I don't know a straightforward way to synthesize this information and forecast my CPU, RAM, and/or storage requirements as the size of my input data grows over time with increased usage of the social network that I'm analyzing.
Can anyone give examples or make recommendations on how to do this? For instance, is it possible to build a chart or a regression model that shows how many CPU cores I will need as the size of my input data increases, holding constant the CPU speed and the amount of time the scripts should take to complete?
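To make concrete what I mean, here is a rough sketch of the kind of thing I'm picturing (simulate_input() and my_analysis() are made-up placeholders for my real data preparation and analysis code, and profile_run() is the helper sketched above): time the workload at several input sizes, fit a simple scaling model, and extrapolate.
# measure the workload at increasing input sizes
sizes <- c(1e4, 2e4, 4e4, 8e4, 1.6e5)
measurements <- do.call(rbind, lapply(sizes, function(n) {
  input <- simulate_input(n)        # placeholder: build an input of n rows
  profile_run(my_analysis, input)   # placeholder workload, helper from above
}))

# log-log regression: the slope estimates how runtime scales with input size
fit <- lm(log(elapsed) ~ log(n), data = measurements)

# extrapolate elapsed time (seconds) to a 10x larger input
n_future <- 1.6e6
predicted_sec <- exp(predict(fit, newdata = data.frame(n = n_future)))

# a chart of the fit makes the trend easy to show to others
plot(measurements$n, measurements$elapsed, log = "xy",
     xlab = "input rows", ylab = "elapsed seconds")
lines(sizes, exp(predict(fit)), col = "blue")

# the same approach applied to mem_mb forecasts RAM, and dividing the
# predicted serial time by a target wall-clock time gives a first rough
# estimate of how many cores I would need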

Related

Partition Limit in AAS [closed]

Noob Question
Is there a limit to the number of partitions that can be created in AAS?
We would like to create date-wise partitions in Azure Analysis Services to speed up the incremental loads we intend to perform, since we receive data from multiple sources and the data is constantly updated.
A limit to the number of partitions is not listed here. I would recommend leaning towards dozens or hundreds of partitions per table, not thousands. And I would lean towards ensuring that partitions generally have at least a million rows, for optimal performance. Why? Because a million rows is the size of a segment in the VertiPaq compression scheme, and if your partitions are all much smaller (say 50,000 rows) then you will limit the maximum segment size and make compression and performance worse.
That being said, partitioning is mainly about processing performance. So if partitioning by day and processing one or a few days of data incrementally reduces processing time significantly compared with partitioning weekly or monthly, then that sounds like a great partitioning scheme.
I suppose you could merge older daily partitions together into monthly partitions after they get past the window in which they are often processed. I would recommend checking whether this does reduce the number of segments using DAX Studio though. (I can’t recall off the top of my head whether it does.)

Single Process (Shared-Memory) Multiple-CPU Parallelism in R [closed]

I have used mclapply quite a bit and love it. It is a memory hog but very convenient. Alas, now I have a different problem that is not simply embarrassingly parallel.
Can R (especially Unix R) employ multiple CPU cores on a single computer, sharing the same memory space, without resorting to copying full OS processes, so that
there is minimal process overhead; and
modifications of global data by one CPU are immediately visible to the other CPUs?
If yes, can R lock some memory just like files (flock)?
I suspect that the answer is no, and learning this definitively would be very useful. If the answer is yes, please point me in the right direction.
regards,
/iaw
You can use the Rdsm package for distributed shared-memory parallelism, i.e. multiple R processes using the same memory space.
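Here is a minimal sketch of the usual Rdsm pattern (assuming Rdsm and its bigmemory dependency are installed; check the package documentation for the exact current API, including rdsmlock()/rdsmunlock(), which are its answer to your flock question):
library(parallel)
library(Rdsm)

cls <- makeCluster(2)         # two worker processes on the same machine
mgrinit(cls)                  # initialise Rdsm on the cluster
mgrmakevar(cls, "w", 1, 2)    # "w": a 1x2 matrix living in shared memory

clusterEvalQ(cls, {
  # myinfo$id is the worker's Rdsm ID; each worker writes its own cell,
  # and the write is immediately visible to the other process
  w[1, myinfo$id] <- myinfo$id * 100
  NULL                        # avoid shipping the shared matrix back
})

print(w[, ])                  # the manager's copy points at the same memory
stopCluster(cls)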
Besides that, you can employ a multi-threaded BLAS/LAPACK (e.g. OpenBLAS or Intel MKL), and you can use C/C++ (and probably Fortran) code together with OpenMP. See assembling a matrix from diagonal slices with mclapply or %dopar%, like Matrix::bandSparse, for an example.
Have you taken a look at Microsoft R Open (available for Linux), with the Intel Math Kernel Library (MKL)?
I've seen very good performance improvements without rewriting code.
https://mran.microsoft.com/documents/rro/multithread
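For a quick sanity check of how much a multi-threaded BLAS buys you on your own machine, something along these lines works (a sketch assuming the RhpcBLASctl package is installed to control the thread count; with a single-threaded reference BLAS both timings will be roughly equal):
library(RhpcBLASctl)   # lets you change the number of BLAS threads at runtime

set.seed(1)
x <- matrix(rnorm(2000 * 2000), nrow = 2000)

blas_set_num_threads(1)
t1 <- system.time(crossprod(x))[["elapsed"]]   # single-threaded baseline

blas_set_num_threads(get_num_cores())
tn <- system.time(crossprod(x))[["elapsed"]]   # all physical cores

cat("1 thread:", t1, "s; all cores:", tn, "s\n")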

Need advice to choose graph database [closed]

I'm looking for options for a graph database to be used in a project. I expect to have ~100,000 writes (vertex + edge) per day, and far fewer reads (several times per hour). The most frequent query traverses to a depth of 2 edges, and I expect it to return ~10-20 result nodes.
I don't have experience with graph databases and want to work with Gremlin so that I can switch to another graph database if needed. Right now I am considering two options: Neo4j and Titan.
As far as I can see, there is a large community and plenty of information and tooling for Neo4j, so I'd prefer to start with it. Its capacity numbers should be enough for our needs (~34 billion nodes, ~34 billion edges). But I'm not sure what hardware requirements I will face in this case. I also didn't see any parallelisation options for its queries.
On the other hand, Titan is built for horizontal scalability and integrates with highly parallel tools like Spark, so I expect its hardware requirements to scale roughly linearly. But there is much less information, community, and tooling for Titan.
I'll be glad to hear your suggestions.
Sebastian Good made a wonderful presentation comparing several databases to each other. You might have a look at his results here.
A quick summary of the presentation is here.
For benchmarks of each graph database with different datasets, node sizes, and caches, have a look at this GitHub repository by socialsensor. Just to let you know, the results in the repo are a bit different from the ones in the presentation.
My personal recommendation is:
If you have deep pockets, go for Neo4j. With the technical support and the easy Cypher query language, things will go pretty quickly.
If you support open source (and are patient with its development cycles), go for Titan DB with the Amazon DynamoDB backend. This will give you "infinite" scalability and good performance with both EC2 machines and DynamoDB tables. Check here for the docs and here for the code for more information.

How to make a choice between OpenTSDB and InfluxDB or other TSDS? [closed]

They are both open-source distributed time series databases: OpenTSDB is for metrics, while InfluxDB handles metrics and events with no external dependencies; OpenTSDB, on the other hand, is based on HBase.
Are there any other comparisons between them?
And if I want to store and query/analyze time series metrics in real time with no deterioration or loss, which would be better?
At one of the conferences I heard about people running something like Graphite/OpenTSDB to collect metrics centrally, and InfluxDB locally on each server to collect metrics only for that server. (InfluxDB was chosen for local storage because it is easy to deploy and light on memory.)
This is not directly related to your question, but the idea appealed to me a lot, so I wanted to share it.
Warp 10 is another option worth considering (I'm part of the team building it); check it out at http://www.warp10.io/.
It is based on HBase but also has a standalone version that will work fine for volumes in the low hundreds of billions of datapoints, so it should fit most use cases out there.
Among the strengths of Warp 10 is the WarpScript language which is built from the ground up for manipulating (Geo) Time Series.
Yet another open-source option is Blueflood: http://blueflood.io.
Disclaimer: like Paul Dix, I'm biased by the fact that I work on Blueflood.
Based on your short list of requirements, I'd say Blueflood is a good fit. Perhaps if you can specify the size of your dataset, the type of analysis you need to run or any other requirements that you think make your project unique, we could help steer you towards a more precise answer. Without knowing more about what you want to do, it's going to be hard for us to answer more meaningfully.

What's the best instance size selection for Azure VM roles? [closed]

I currently have 2 Extra Small web roles (MVC4) running on Azure Cloud Services (Windows Server 2012). I logged in over RDP and checked the instances' resource usage in Task Manager, and found that memory usage is very high: one instance is about 92% used with only 56 MB of free memory left, and the other is at 86% with 150 MB free. The website is very slow. Is it possible that the poor performance is caused by the low memory? Do you think it would be better to upgrade the VM size to Small or larger?
Thanks a lot.
Honestly, only you can determine the best instance size. From Small (1 core, 1.75GB, 100Mbps NIC) to Extra Large (8 cores, 14GB, 800Mbps NIC), machines scale in a straightforward way, and you should pick the smallest instance size that can properly and efficiently run your app, then scale out/in as necessary. The A6/A7 machines are significantly larger (A6: 4 cores, 28GB, 1000Mbps NIC; A7: 8 cores, 56GB, 2000Mbps NIC), and the Extra Small is very limited (shared core, 768MB, 5Mbps NIC). Extra Small instances may have issues running certain workloads.
So: you may be running into the Extra Small resource limitations for your particular app. You should do some empirical testing on Small through Extra Large to see where your low-volume app runs fine, then pick that size and use multiple instances to handle heavier load.
When picking the size, you'll probably reach a bottleneck with a specific resource (CPU, RAM, network), and you'll need to pick based on that. For example, if you really need 6GB RAM, you're now looking at a Large, even if you're barely utilizing CPU.
More details on instances sizes, here.
It's always easy to scale up to Small first and then go to Large if needed. You are going to be more than doubling your memory with a Small at 1.75 GB. Plus, on an Extra Small you are using a shared CPU core, whereas on a Small you don't share the core.
Going to a Large with 7 GB of memory would be overkill, I think.
