Dask data type conversion and resampling - big data

I am here from the chat room and will post my code
df['DateTime'] = dd.to_datetime(df['DateTime'])
df['KWH/hh (per half hour) '].astype(float)
df = df['KWH/hh (per half hour) '].fillna(0)
df = df.set_index(df['DateTime'], npartitions='auto')
df = df.resample('D').sum()
When I run the code on a prototype with 20% of my original data it took around 1.8 minutes, but when I move to the real data (1.2 GB) the time increases dramatically (more than 1 hour).
I tried the same problem with chunked read_csv, which is much faster, and better than that are generators.
My question is how to improve the performance. I am using a Pro with 8 cores; Dask should give me better performance through parallel computing, right?
Edit: this is my generator version:
start = time.time()
before = memory_footprint()
dataframes = (pd.read_csv(fname) for fname in filenames)
temp = (remove_unnessasey_data(df, l1) for df in dataframes)
help = (change_structure(t) for t in temp)
for i in range(1, 688):
    dd.append(next(help))
after = memory_footprint()
du = time.time() - start

You mention that you can do this operation "by generator" or "chunks". This suggests to me that your time column is already sorted. As the docs point out, you should tell Dask this, so that it doesn't sort your data, which may well be the principal cause of the slowness here.
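A minimal sketch of that, assuming the frame still has a parsed DateTime column that really is in time order (sorted=True tells Dask to trust the existing order instead of shuffling):
df = df.set_index('DateTime', sorted=True)  # sets divisions without a sort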
Another suggestion: you start with three row-wise operations one after the other. You may want to use map_partitions to do all three in a single step instead.
def preprocess(df):
    df['DateTime'] = pd.to_datetime(df['DateTime'])
    df['KWH/hh (per half hour) '] = df['KWH/hh (per half hour) '].astype(float)
    return df['KWH/hh (per half hour) '].fillna(0)

df = df.map_partitions(preprocess)
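Putting the two suggestions together might look like the sketch below (untested on your data; it assumes the file is already ordered by DateTime, uses a hypothetical 'data-*.csv' pattern, and keeps the whole frame in preprocess so the index can still be set afterwards):

import dask.dataframe as dd
import pandas as pd

def preprocess(part):
    # one pass over each partition: parse dates, cast to float, fill missing values
    part['DateTime'] = pd.to_datetime(part['DateTime'])
    part['KWH/hh (per half hour) '] = part['KWH/hh (per half hour) '].astype(float).fillna(0)
    return part

df = dd.read_csv('data-*.csv')               # hypothetical file pattern
df = df.map_partitions(preprocess)
df = df.set_index('DateTime', sorted=True)   # no shuffle needed for pre-sorted data
daily = df['KWH/hh (per half hour) '].resample('D').sum().compute()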

Related

Efficient way to compile NC4 file information from separate files in R

I am currently trying to compile temperature information from the WDFE5 data set, which is quite large, and am struggling to find an efficient way to meet my goals. My main goals are to:
Determine the max temperature for individual days for each individual grid cell
Change the time step from hourly to daily and from UTC to MST.
The data set is stored in monthly NC4 files and contains the temperature data in a 3-dimensional matrix (time, lat, lon). My main question is whether there is an efficient way to compile this data to meet my goals, or to manipulate the NC4 files to be easier to work with (somehow merge the monthly files into one mega file?).
I have tried two rather convoluted ways to catch the holes between months (example: due to the time conversion, some dates end up spanning two months, which requires me to read in the next file and continue reading the data).
My first try was to read one month/file at a time, using pmax() to get the max value of each grid cell, comparing time steps over 24 hours, and then repeating the process. I have been using ncvar_get() with start and count to read only one time step at a time. To catch days that span two months, I was able to create a convoluted function to merge the two by calculating the number of 1-hour periods left in one month and how many would be needed from the next.
My second try still involved pmax(), but I used a different method to fill in the holes between months. I set a date vector from the time variable for each hourly time step and matched by day. While this seems better, it still has to read multiple NC4 files, which gets very convoluted compared to just reading one NC4 file with all the needed information.
In the end, I tested a few cases and both solutions seem to work, but they run extremely slowly and seem very overcomplicated to me. I was wondering if anyone had suggestions on how to better set up the NC4 files for reading and time conversion.

Poor performance Arrow Parquet multiple files

After watching the mind-blowing webinar at the RStudio conference here, I was pumped enough to dump an entire SQL Server table to Parquet files. The result was 2886 files (78 entities over 37 months) with around 700 million rows in total.
Doing a basic select returned all rows in less than 15 seconds! (Just an out-of-this-world result!!) At the webinar Neal Richardson from Ursa Labs was showcasing the NY Taxi dataset with 2 billion rows in under 4 seconds.
I felt it was time to do something more daring, like a basic mean, sd, and mode over a year's worth of data, but that took a minute per month, so I sat 12.4 minutes waiting for a reply from R.
What is the issue? My badly written R query? Or simply too many files, or the granularity (decimal values?)?
Any ideas?
PS: I did not want to open a Jira case on the apache-arrow board, as I see that Google search does not retrieve answers from there.
My guess (without actually looking at the data or profiling the query) is two things:
You're right, the decimal type is going to require some work in converting to an R type because R doesn't have a decimal type, so that will be slower than just reading in an int32 or float64 type.
You're still reading in ~350 million rows of data to your R session, and that's going to take some time. In the example query on the arrow package vignette, more data is filtered out (and the filtering is very fast).

time taken to read a large CSV file in Julia

I have a large CSV file - almost 28 million rows and 57 columns - 8.21GB - the data is of different types - integers, strings, floats - but nothing unusual.
When I load it in Python/Pandas it takes 161 seconds, using the following code.
df = pd.read_csv("file.csv", header=0, low_memory=False)
In Julia, it takes a little longer - over an hour. UPDATE: I am not sure why, but when I ran the code this morning (twice, to check), it took around 702 and 681 seconds. This is much better than an hour, but it is still way slower than Python.
My Julia code is also pretty simple:
df = CSV.File("file.csv") |> DataFrame
Am I doing something wrong? Is there something I can do to speed it up? Or is this just the price you pay to play with Julia?
From the CSV.jl documentation:
In some cases, sinks may make copies of incoming data for their own safety; by calling CSV.read(file, DataFrame), no copies of the parsed CSV.File will be made, and the DataFrame will take direct ownership of the CSV.File's columns, which is more efficient than doing CSV.File(file) |> DataFrame which will result in an extra copy of each column being made.
so you could try
CSV.read("file.csv", DataFrame)

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to have multiple unknown queries (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar; the problem is that I'm going to have a lot of IO/network activity, since Spark is going to have to keep re-reading the data set from disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I said above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
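As a rough sketch of the first idea in Python (assuming the data is already split into per-window files with a 'value' column; the file names and helper are made up for illustration):

import pandas as pd

# Precalculate once: one max per window (e.g. one file per slice of the data).
window_files = ['window_000.csv', 'window_001.csv']          # hypothetical paths
window_max = {f: pd.read_csv(f)['value'].max() for f in window_files}

def subset_max(full_windows, partial_rows):
    # full_windows: windows entirely covered by the requested subset
    # partial_rows: raw rows from the 'outermost' slices that only partly overlap
    candidates = [window_max[w] for w in full_windows]
    if len(partial_rows):
        candidates.append(partial_rows['value'].max())
    return max(candidates)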
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
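In code, the precalculation could look something like this (a Python sketch that assumes each row is a dict of name -> value pairs, as in the small example above):

from collections import defaultdict

def build_window_index(rows):
    # rows: iterable of {name: value} dicts belonging to one window
    index = defaultdict(lambda: [0, 0])   # name -> [sum, count]
    for row in rows:
        for name, value in row.items():
            index[name][0] += value
            index[name][1] += 1
    return dict(index)

window1 = [
    {'aaa': 20, 'abcd': 25, 'bb': 10, 'caca': 25, 'ddddd': 50, 'bada': 30},
    {'aaa': 12, 'abcd': 31, 'bb': 15, 'caca': 24, 'ddddd': 48, 'bada': 43},
]
print(build_window_index(window1))
# {'aaa': [32, 2], 'abcd': [56, 2], 'bb': [25, 2], 'caca': [49, 2], 'ddddd': [98, 2], 'bada': [73, 2]}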
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
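The query step then only needs to stream through the relevant window indices, one at a time (a sketch; load_index is a stand-in for however the per-window indices are stored):

def average_per_name(window_ids, wanted_names, load_index):
    # load_index(window_id) -> {name: [sum, count]} for that window
    totals = {name: [0, 0] for name in wanted_names}
    for wid in window_ids:                 # only one index in memory at a time
        index = load_index(wid)
        for name in wanted_names:
            if name in index:
                s, c = index[name]
                totals[name][0] += s
                totals[name][1] += c
    return {name: s / c for name, (s, c) in totals.items() if c}

For the small example above, averaging 'caca' over windows 1-3 would combine three (sum, count) pairs into a single division at the end.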
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
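The partial ends can reuse the same index builder on just the raw rows that fall outside whole windows, for example (continuing the hypothetical helpers from the sketches above):

def average_with_partial_ends(full_window_ids, partial_rows, wanted_names, load_index):
    # partial_rows: the leftover raw rows (dicts of name -> value) from the partial windows
    extra_index = build_window_index(partial_rows)
    def load(wid):
        return extra_index if wid == 'partial' else load_index(wid)
    return average_per_name(list(full_window_ids) + ['partial'], wanted_names, load)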
At least that's how I would do it.

Reduce computation time

Most of the data sets that I have worked with have generally been of moderate size (mostly fewer than 100k rows), and hence my code's execution time has usually not been that big a problem for me.
But I was recently trying to write a function that takes 2 dataframes as arguments (with, say, m and n rows) and returns a new dataframe with m*n rows. I then have to perform some operations on the resulting data set. So even with small values of m and n (say around 1000 each), the resulting dataframe would have more than a million rows.
When I try even simple operations on this dataset, the code takes an intolerably long time to run. Specifically, my resulting dataframe has 2 columns with numeric values, and I need to add a new column which compares the values of these columns and categorizes them as "Greater than", "Less than", or "Tied".
I am using the following code:
df %>% mutate(compare = ifelse(var1 == var2, "Tied",
                        ifelse(var1 > var2, "Greater than", "Less than")))
And, as I mentioned before, this takes forever to run. I did some research on this, and figured out that operations on a data.table are apparently significantly faster than on a dataframe, so maybe that's one option I can try.
But I have never used data.tables before. So before I plunge into that, I was quite curious to know if there are any other ways to speed up computation time for large data sets.
What other options do you think I can try?
Thanks!
For large problems like this I like to parallelize. Since operations on individual rows are atomic, meaning that the outcome of an operation on a particular row is independent of every other row, this is an "embarrassingly parallel" situation.
library(doParallel)
library(foreach)

registerDoParallel()  # You could specify the number of cores to use here. See the documentation.

df$compare <- foreach(m = df$m, n = df$n, .combine = 'c') %dopar% {
  # Borrowing from #nicola in the comments because it's a good solution.
  c('Less Than', 'Tied', 'Greater Than')[sign(m - n) + 2]
}
