I've scraped a lot of data (Twitter user information) for research purposes, and at the moment it is all stored as a list object in my global environment. Because of the Twitter rate limit I append entries frequently until I reach my goal (~200,000 entries). At the moment the list holds about 100,000 entries (~70 MB). The problem is that I want to save all of this to my SSD as a backup, but when I save my environment it runs the whole night and then throws an error. This means that if my computer crashes, I'll lose all my effort! When I save just the object with the list.save() function from the rlist package, it also runs for several hours.
Do you have any suggestions how I should handle this issue? Thank you!
I think saveRDS should help.
saveRDS() is used when you want to save only one object.
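A minimal sketch, assuming the list is called tweet_list (the name is just a placeholder):

# serialize the single list object to disk; compress = FALSE is an
# optional trade-off: a larger file, but usually a faster write
saveRDS(tweet_list, file = "tweet_list.rds", compress = FALSE)

# read it back later; unlike load(), readRDS() returns the object,
# so you bind it to whatever name you like
tweet_list <- readRDS("tweet_list.rds")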
I would like to know how to transfer data between tasks without storing it in between.
In the attached image you can find the flow of tasks. As of now I am storing the output CSV file of each task on my local machine and fetching it again as input to the next task. I wanted to know if there is any other way to pass data between tasks without storing it after each task. I researched a bit and came across XComs. I wanted to make sure whether XComs are the right way to achieve this, or whether I am wrong. I could not find any practical examples. Any help is appreciated, as I am just a newbie to Airflow who started a couple of days ago.
Short answer is no: tasks require data to be at rest before moving to the next task. XComs are best suited to short strings that can be shared between tasks (file directories, object names, etc.). Your current flow of storing the data in CSV files between tasks is the optimal way of running your flow.
XCom is intended for sharing little pieces of information, like the length of a SQL table, specific values, or things like that. It is not made for sharing dataframes (which can be huge), because the shared information is written to the metadata database.
So either you keep exporting the CSV files to your computer (or uploading them somewhere) and reading them back in the next operator, or you combine the operators into one.
I'm running Rcrawler on a very large website, so it takes a very long time (3+ days with the default page depth). Is there a way to skip downloading all the HTML files to make the process faster?
I only need the URLs that are stored in the INDEX.
Or can anyone recommend another way to make Rcrawler run faster?
I have tried running it with a smaller page depth (5), but it is still taking forever.
I am dealing with the same issue. Depending on the source, in some cases I am even running at depth 1.
Best,
Janusz
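For what it's worth, newer Rcrawler releases expose arguments that may help with the original question; a hedged sketch (check ?Rcrawler, since saveOnDisk in particular may not exist in older versions, and the site URL is a placeholder):

library(Rcrawler)
# crawl for links only: skip writing HTML pages to disk, cap the
# depth, and raise the number of parallel processes/connections
Rcrawler(Website = "https://www.example.com",
         MaxDepth = 2,          # stop early instead of the default depth
         saveOnDisk = FALSE,    # don't store the downloaded pages
         no_cores = 4,          # parallel crawler processes
         no_conn = 4)           # simultaneous connections
# the URL index lands in the INDEX data frame in the global environment
head(INDEX)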
How do you usually work with the data contained in a RecordStore:
Do you always "query" the RecordStore directly when you have to perform some operations over its records (searching, sorting, etc.), or
Do you "cache" those records in a vector or array, so that you later query that vector or array instead of the RecordStore?
Personally, I was following the second approach until yesterday, when I got a nasty exception reminding me that memory is a luxury we should be really careful about when developing J2ME apps :S
Taking memory into consideration, I'm now not really sure that keeping arrays is such a good idea.
In any case, I would like to hear your opinions, guys. After all, you've got more experience.
Thanks for your time.
That depends on the number of records and the size of each record.
If you have already had an OutOfMemoryError with the Vector approach, then try to work with only a single record at a time.
If you structure your records well, you can do fast searches on them. String searches will probably be slower.
Keep in mind that, although RMS has no fixed maximum size, it is advisable to call RecordStore.getSizeAvailable() to get an idea of how much information you can store on a given device.
Here is a good tutorial on RMS:
http://www.ibm.com/developerworks/library/j-j2me3/
I have a web application that talks to R using PL/R when doing adaptive testing.
I need to find a way to store static data persistently between calls.
I have one expensive calculation that creates an item bank, then a lot of cheap ones that fetch the next item after each response submission. However, I currently can't find a way to store the result of the expensive calculation persistently.
Putting it into the db seems to be a lot of overhead.
library(catR)
data(tcals)
itembank <- createItemBank(tcals) # this is the expensive call
nextItem(itembank, 0) # item 63 is selected
I tried to save and load the result, like this, but it doesn't seem to work: the result of the second NOTICE is just 'itembank'.
save(itembank, file="pltrial.Rdata")
pg.thrownotice(itembank)
aaa=load("pltrial.Rdata")
pg.thrownotice(aaa)
I tried saving and loading the workspace as well, but didn't succeed with that either.
Any idea how to do this?
The load function loads objects directly into your workspace. You don't have to assign the return value (which is just a character vector with the names of the loaded objects, as you discovered). If you call ls() after loading, you should find your itembank object sitting there.
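A minimal sketch of both patterns; the saveRDS() alternative and the per-session cache are assumptions on my part, not something from the thread:

# load() restores objects under their original names; its return
# value is only a character vector of those names
load("pltrial.Rdata")
pg.thrownotice(itembank)   # itembank is now bound in the workspace

# alternative: saveRDS()/readRDS() serialize one object and hand it
# back explicitly, avoiding the naming surprise
saveRDS(itembank, file = "pltrial.rds")
itembank <- readRDS("pltrial.rds")

# since the R interpreter embedded by PL/R persists for the lifetime
# of the database connection, you can also build the item bank once
# per session and reuse it across the cheap calls
if (!exists("itembank", envir = .GlobalEnv)) {
  itembank <<- createItemBank(tcals)
}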
I am designing a web API which requires fast read-only access to a large dataset which will hopefully be kept in memory, constantly ready for access. Access will be from a static class which will just do some super-fast lookups on the data.
So, I want to pre-cache a Dictionary<string,Dictionary<string,Dictionary<string,myclass>>>, with the total number of elements at the third-level dictionary being around 1 million, which will increase eventually, but let's say to no more than 2 million ever. 'myclass' is a small class with a (small) list of strings, an int, an enum, and a couple of bools, so nothing major. It should be a bit over 100 MB in memory.
From what I can tell, the way to do this is simply to call my StaticClass.Load() method to read all this data in from a file during the Application_Start event in Global.asax.
I am wondering what things I need to consider/worry about with this. I am guessing it is not as simple as calling Load() and then assuming everything will be OK for future access. Will the GC know to leave the data there, even if the API is not hit for a couple of hours?
To complicate things, I want to reload this data every day as well. I think I'll just be able to throw out the old dataset and load the new one in from another file, but I'll get to that later.
Cheers
Please see my similar question IIS6 ASP.NET 2.0 Application Cache - data storage options and performance for large amounts of data, but in particular the answer from Marc and his last paragraph about options for large caches, which I think would apply to your case.
The standard ASP.NET application Cache could work for you here. Check out this article. With it you get built-in management of dependencies (e.g. when the file changes) or time-based expiry. The linked article shows an Application_Start example.
My concern is the size of what you want to cache.
Cheers guys
I also found this article which addresses the options: http://www.asp.net/data-access/tutorials/caching-data-at-application-startup-cs
Nothing really gives recommendations for large amounts of data, or even defines what counts as a 'large amount'. I'll keep doing my research, but Redis looks pretty good.