How to convert RDD to spark dataframe using sparklyr? - r

I have a lot of text files pushed by Azure IoT to blob storage, spread across many folders, and I want to read them into a Delta Lake table with one row per line of each file. I used to read them file by file, but that takes too long, so I want to use Spark to speed up the processing. It needs to fit into a Databricks workflow written in R.
I found the spark_read_text function for reading text files, but it cannot read a directory recursively; it only works if all the files are in one directory.
Here is an example of a file path (appid/partition/year/month/day/hour/minute/file):
app_id/10/2023/02/06/08/42/gdedir22hccjq
Partition is a random folder (there are around 30 of them right now) that Azure IoT seems to create to process data in parallel, so data for the same date can be split across several folders, which does not make reading any easier.
The only function I found that does this is textFile (called on the Spark context), which accepts wildcards and handles directories recursively. The only problem is that it returns an RDD, and I can't find a way to turn it into a Spark DataFrame that could ultimately be accessed through a tbl_spark R object.
Here is what I did so far:
You need to set the config to read folders recursively (here I do this on Databricks in a dedicated Python cell):
%py
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
Then I can create an RDD:
j_rdd <- spark_context(sc) %>%
invoke("textFile", "/mnt/my_cont/app_id/*/2022/11/17/*", 10L)
This works to create the RDD, and as you can see I can match all the partitions (before the year) with a "*", as well as the folders for hours and minutes recursively with the "*" at the end.
I can collect it and create an R data frame:
lst <- invoke(j_rdd, "collect")
data.frame(row = unlist(lst))
This correctly gets my data: one column of text and one row for each line of each file (I can't show an example for privacy reasons, but it's not important).
The problem is that I don't want to collect; I want to update a Delta table with this data, and I can't find a way to get a sparklyr object that I can use. The j_rdd object I got looks like this:
> j_rdd
<jobj[2666]>
org.apache.spark.rdd.MapPartitionsRDD
/mnt/my_cont/app_id/*/2022/11/17/* MapPartitionsRDD[80] at textFile at NativeMethodAccessorImpl.java:0
The closest I got so far: I tried to adapt existing code that converts data to a DataFrame using invoke, but I don't seem to be doing it correctly:
contents_field <- invoke_static(sc, "sparklyr.SQLUtils", "createStructField", "contents", "character", TRUE)
schema <- invoke_static(sc, "sparklyr.SQLUtils", "createStructType", list(contents_field))
j_df <- invoke(hive_context(sc), "createDataFrame", j_rdd, schema)
invoke(j_df, "createOrReplaceTempView", "tmp_test")
dfs <- tbl(sc, "tmp_test")
dfs %>% sdf_nrow()
I only have one column of character data, so I thought it would work, but I get this error:
Error : org.apache.spark.SparkException: Job aborted due to stage failure: Task 14 in stage 25.0 failed 4 times, most recent failure: Lost task 14.3 in stage 25.0 (TID 15158) (10.221.193.133 executor 2): java.lang.RuntimeException: Error while encoding: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, contents), StringType, false), true, false, true) AS contents#366
at org.apache.spark.sql.errors.QueryExecutionErrors$.expressionEncodingError(QueryExecutionErrors.scala:1192)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:236)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:208)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hashAgg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$3(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.$anonfun$runTask$1(ShuffleMapTask.scala:81)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:156)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:125)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:95)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:832)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1681)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:835)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:690)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Serializer.apply(ExpressionEncoder.scala:233)
... 28 more
Does anyone have an idea how to convert this RDD object (using R/sparklyr), which I got back from the invoke function, into something usable without collecting the data?

Finally, I found that spark_read_text can also read multiple files with wildcards, but you have to put a wildcard for each directory level and for the files; it cannot discover folders recursively.
For example:
dfs <- spark_read_text(sc, "/mnt/container/app_id/10/2023/02/06/*")
...doesn't work. But:
dfs <- spark_read_text(sc, "/mnt/container/app_id/10/2023/02/06/*/*/*")
...works. Also:
dfs <- spark_read_text(sc, "/mnt/container/app_id/*/2023/02/06/*/*/*")
...with a wildcard above the date also works.
As the directory depth doesn't change in my case, that's enough for me.
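Since the pattern needs exactly one "*" per remaining directory level, it can be built from the date instead of typed by hand. This is only a sketch in base R: the function name build_glob and its arguments are made up for illustration, and the layout is assumed to be the appid/partition/year/month/day/hour/minute/file structure described in the question.

```r
# Build a glob with one "*" per directory level below the day folder.
# build_glob and its argument names are illustrative, not part of sparklyr.
build_glob <- function(root, date, partition = "*", levels_below_day = 3L) {
  day_path  <- format(as.Date(date), "%Y/%m/%d")   # e.g. "2023/02/06"
  wildcards <- rep("*", levels_below_day)          # hour, minute, file
  paste(c(root, partition, day_path, wildcards), collapse = "/")
}

path <- build_glob("/mnt/container/app_id", "2023-02-06")
print(path)
# [1] "/mnt/container/app_id/*/2023/02/06/*/*/*"

# With a live Spark connection `sc`, the result would be passed on like this:
# dfs <- sparklyr::spark_read_text(sc, path)
```

If the directory depth ever changes, only levels_below_day needs adjusting.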

Related

RStudio read_delim(): intermittently receive error std::bad_alloc upon opening files with unusual delimiter

I received a series of 100+ files from a client. The client received the files as part of litigation, so they didn't have to be transmitted in a convenient fashion; they just all had to be present. In a single .zip file, the files have names like Folder1.001, Folder1.002, Folder3.001, etc. When I unpack these files with 7-Zip, they don't have a .txt, .csv, or any other standard file extension, so Windows interprets them as a ".001 File" or ".002 File." This is not the issue, because I know that the files are delimited by a ~ and are 118 columns wide. Each file has between 2.5M and 4.9M rows and is about 1 GB in size when unzipped.
This is my first ever post here, so please excuse any breach of etiquette.
I am working in a .Rmd file on a virtual machine running Windows. I have R4.2.2 (64-bit), and RStudio 2022.12.0+353. All work is being done within a drive on the virtual machine that has 9+ GB free out of 300 GB total. The size of this virtual drive could be increased, if necessary.
My goal here is to examine one variable in each file, see whether cases fall within a given range for that variable, and save the rows that do. I have been saving them as .rds files using write_rds().
I have been bringing in the files using a read_delim() statement specifying 'delim = "~"'. I created a vector of 120 column names which I use because the columns are not labeled. These commands on their own are not an issue. A successful import looks like the below.
work1 <- read_delim("Data\\Folder1\\File1.001", delim = "~", col_names = vNames1)
Rows: 2577668 Columns: 120
── Column specification ────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It mishandles the columns named Skip1 and Skip9 as logical values, but those aren't a necessary part of my analysis.
I then filter and write the file using
work1 <- work1 %>% filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
write_rds(work1, "Data\\Working\\Folder1_001.rds")
I have also done this with the read_delim() and filter() piped into a single command. This is not the issue. NOTE: Before I read in the next file (File1.002), I now have a work1 file that is at most, 4000 cases, down from millions when it was imported.
Since I have over 100 of these files, I have written multiple code chunks to do a few of these at a time. After one to three read_delim() statements in a row, I get the below error.
work2 <- read_delim("Data\\Folder1\\File1.002", delim = "~", col_names = vNames1)
Error std::bad_alloc
which I understand has to do with memory allocation. I can close RStudio and restart it, and that allows me to do one or two more imports, filterings, and writings. Doing that for over 100 files is far too inefficient.
I condensed my code a step further by writing the read_delim() step within the write_rds() step, which looks like the below.
write_rds((read_delim("Data\\Folder1\\File003",
delim = "~", col_names = vNames1) %>%
filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)),
"Data\\Working\\Folder1_003.rds")
Rows: 2577668 Columns: 120
── Column specification ────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Yet after 1 or 2 successful runs, I get the same Error std::bad_alloc message.
Using traceback(), it seems like it is related to vroom::vroom(), but I'm not sure how to check any further.
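One way to keep peak memory bounded, sketched here in base R rather than readr/vroom, is to read each file a fixed number of lines at a time and keep only the rows in range, so the full multi-million-row table never exists in memory. The function name filter_file_in_chunks, the chunk size, and the switch to read.table are all assumptions made for illustration; whether this avoids the std::bad_alloc on this particular machine is untested.

```r
# Read a "~"-delimited file chunk by chunk and keep only rows whose ZIP
# column lies strictly between `lower` and `upper`.
# All names and defaults here are illustrative.
filter_file_in_chunks <- function(path, col_names, zip_col = "Press_ZIP",
                                  lower = 78900, upper = 99900,
                                  chunk_lines = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0L) break          # end of file reached
    # Parse only this chunk; read everything as character to avoid type guessing.
    chunk <- read.table(text = lines, sep = "~", col.names = col_names,
                        colClasses = "character", quote = "",
                        comment.char = "", fill = TRUE)
    zip  <- suppressWarnings(as.numeric(chunk[[zip_col]]))
    keep <- !is.na(zip) & zip > lower & zip < upper
    kept[[length(kept) + 1L]] <- chunk[keep, , drop = FALSE]
  }
  do.call(rbind, kept)                       # small: only the matching rows
}
```

The result for each file could then be written out with write_rds() inside a loop over the file paths, as in the original workflow.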

downloading data and saving data to a folder in batches

I have 200,000 links that I am trying to download. I tried downloading them all in one go, but I ran into memory issues.
I am trying to create a function that will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
This stores the results as a nice list. I can then apply names(parsed_files) <- urls_to_parse to name the list elements after the links they were downloaded from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the data below, how could I download the data in batches of, say, 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for(i in start:end){
output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
names(output) <- urls_to_parse[start:end]
save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data: - 100 links
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over the URLs for a batch job as you showed is a bad idea. If you have thousands of files to download, how do you recover from errors?
Performance does not depend solely on your computer's configuration; network performance is crucial too.
Here are a couple of suggestions.
Option 1
Partition all URLs into batches so that they can be downloaded in parallel. The number of files downloaded at once could equal the number of cores in your computer. See this question: reading multiple files quickly in R
Store these batches in a queue object, for example using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
Pop the queue and pass the batch of URLs to your parallel file download function.
Use a retryable file download function, as in: HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move to the next partition.
Wrap the whole operation in a retryable loop. For example: How to retry a statement on error?
Why do I use a queue? Because it makes retrying on error easy.
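A sketch in R (all_urls, batch_size and download_parallel are placeholders for your own data and parallel download function):
# Split the URLs into batches; a plain list serves as the queue here.
file_url_partitions <- split(all_urls, ceiling(seq_along(all_urls) / batch_size))
max_attempts <- 3
attempt <- 1
while (length(file_url_partitions) > 0 && attempt <= max_attempts) {
  batch <- file_url_partitions[[1]]               # pop the next batch
  file_url_partitions <- file_url_partitions[-1]
  tryCatch({
    download_parallel(batch)
  }, error = function(e) {
    # Push the failed batch back onto the queue and count the attempt.
    file_url_partitions <<- c(file_url_partitions, list(batch))
    attempt <<- attempt + 1
  })
}
Note: I don't have access to RStudio or an R environment right now, so I could not test this.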
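Independently of the retry logic, the batching-and-saving step from the question could look roughly like the following. It is a minimal base-R sketch: download_in_batches, its arguments, and the batch_00001.rds file naming are invented for illustration, and fetch stands in for whatever does the actual work (for example parse_filing). A URL that fails is recorded as NA, and a batch whose file already exists is skipped, so an interrupted run can be restarted.

```r
# Process `urls` in batches and save each batch to its own .rds file.
# `fetch` is any function taking one URL; all names here are illustrative.
download_in_batches <- function(urls, fetch, out_dir, batch_size = 1000L) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  batches <- split(urls, ceiling(seq_along(urls) / batch_size))
  for (i in seq_along(batches)) {
    out_file <- file.path(out_dir, sprintf("batch_%05d.rds", i))
    if (file.exists(out_file)) next            # finished in an earlier run
    results <- lapply(batches[[i]], function(u) {
      tryCatch(fetch(u), error = function(e) NA)   # a failed URL becomes NA
    })
    names(results) <- batches[[i]]
    saveRDS(results, out_file)                 # only one batch held in memory
  }
  invisible(list.files(out_dir, pattern = "^batch_\\d+\\.rds$", full.names = TRUE))
}

# Example call (assumes edgarWebR is loaded and urls_to_parse exists):
# files <- download_in_batches(urls_to_parse, parse_filing, "data/batches", 10L)
```

Because each batch is written to disk and then dropped, memory use is bounded by the batch size rather than by the total number of links.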
Option 2
Download files separately using a download manager/similar and use downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html

SparkR collect method crashes with OutOfMemory on Java heap space

With SparkR, I'm trying, for a PoC, to collect an RDD that I created from text files containing around 4M lines.
My Spark cluster is running in Google Cloud. It was deployed with bdutil and consists of 1 master and 2 workers, each with 15 GB of RAM and 4 cores. My HDFS repository is based on Google Storage with gcs-connector 1.4.0.
SparkR is installed on each machine, and basic tests work on small files.
Here is the script I use:
Sys.setenv("SPARK_MEM" = "1g")
sc <- sparkR.init("spark://xxxx:7077", sparkEnvir=list(spark.executor.memory="1g"))
lines <- textFile(sc, "gs://xxxx/dir/")
test <- collect(lines)
First time I run this, it seems to be working fine, all the tasks are run successfully, spark's ui says that the job completed, but I never get the R prompt back :
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.executor.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 WARN SparkConf: Setting 'spark.driver.extraClassPath' to ':/home/hadoop/hadoop-install/lib/gcs-connector-1.4.0-hadoop1.jar' as a work-around.
15/06/04 13:36:59 INFO Slf4jLogger: Slf4jLogger started
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SocketConnector@0.0.0.0:52439
15/06/04 13:37:00 INFO Server: jetty-8.y.z-SNAPSHOT
15/06/04 13:37:00 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/06/04 13:37:54 INFO GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop1
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library is available
15/06/04 13:37:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/06/04 13:37:55 WARN LoadSnappy: Snappy native library not loaded
15/06/04 13:37:55 INFO FileInputFormat: Total input paths to process : 68
[Stage 0:=======================================================> (27 + 10) / 68]
Then after a CTRL-C to get the R prompt back, I try to run the collect method again, here is the result :
[Stage 1:==========================================================> (28 + 9) / 68]15/06/04 13:42:08 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at org.spark_project.protobuf.ByteString.toByteArray(ByteString.java:515)
at akka.remote.serialization.MessageContainerSerializer.fromBinary(MessageContainerSerializer.scala:64)
at akka.serialization.Serialization$$anonfun$deserialize$1.apply(Serialization.scala:104)
at scala.util.Try$.apply(Try.scala:161)
at akka.serialization.Serialization.deserialize(Serialization.scala:98)
at akka.remote.MessageSerializer$.deserialize(MessageSerializer.scala:23)
at akka.remote.DefaultMessageDispatcher.payload$lzycompute$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.payload$1(Endpoint.scala:58)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:76)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:937)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:415)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I understand the exception message, but I don't understand why I get it only the second time.
Also, why does collect never return after the job completes in Spark?
I Googled every piece of information I have, but I had no luck finding a solution. Any help or hint would be greatly appreciated!
Thanks
This appears to be a combination of Java's inefficient in-memory object representations and some long-lived object references, which keep some collections from being garbage-collected in time for the new collect() call to replace the old result.
I experimented with some options, and for my sample 256MB file that contains ~4M lines, I indeed reproduce your behavior where collect is fine the first time, but OOMs the second time, when using SPARK_MEM=1g. I then set SPARK_MEM=4g instead, and then I'm able to ctrl+c and re-run test <- collect(lines) as many times as I want.
For one thing, even if references didn't leak, note that after the first time you ran test <- collect(lines), the variable test is holding that gigantic array of lines, and the second time you call it, the collect(lines) executes before finally being assigned to the test variable and thus in any straightforward instruction-ordering, there's no way to garbage-collect the old contents of test. This means the second run will make the SparkRBackend process hold two copies of the entire collection at the same time, leading to the OOM you saw.
To diagnose, on the master I started SparkR and first ran
dhuo@dhuo-sparkr-m:~$ jps | grep SparkRBackend
8709 SparkRBackend
I also checked top and it was using around 22MB of memory. I fetched a heap profile with jmap:
jmap -heap:format=b 8709
mv heap.bin heap0.bin
Then I ran the first round of test <- collect(lines) at which point running top showed it using ~1.7g of RES memory. I grabbed another heap dump. Finally, I also tried test <- {} to get rid of references to allow garbage-collection. After doing this, and printing out test and showing it to be empty, I grabbed another heap dump and noticed RES still showed 1.7g. I used jhat heap0.bin to analyze the original heap dump, and got:
Heap Histogram
All Classes (excluding platform)
Class                                            Instance Count   Total Size
class [B                                                  25126     14174163
class [C                                                  19183      1576884
class [<other>                                            11841      1067424
class [Lscala.concurrent.forkjoin.ForkJoinTask;              16      1048832
class [I                                                   1524       769384
...
After running collect, I had:
Heap Histogram
All Classes (excluding platform)
Class                                            Instance Count   Total Size
class [C                                                2784858    579458804
class [B                                                  27768     70519801
class java.lang.String                                  2782732     44523712
class [Ljava.lang.Object;                                  2567     22380840
class [I                                                   1538      8460152
class [Lscala.concurrent.forkjoin.ForkJoinTask;              27      1769904
Even after I nulled out test, it remained about the same. This shows us 2784858 instances of char[], for a total size of 579MB, and also 2782732 instances of String, presumably holding those char[]'s above it. I followed the reference graph all the way up, and got something like
char[] -> String -> String[] -> ... -> class scala.collection.mutable.DefaultEntry -> class [Lscala.collection.mutable.HashEntry; -> class scala.collection.mutable.HashMap -> class edu.berkeley.cs.amplab.sparkr.JVMObjectTracker$ -> java.util.Vector@0x785b48cd8 (36 bytes) -> sun.misc.Launcher$AppClassLoader@0x7855c31a8 (138 bytes)
And then AppClassLoader had something like thousands of inbound references. So somewhere along that chain something should've been removing their reference but failing to do so, causing the entire collected array to sit in memory while we try to fetch a second copy of it.
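To put the histogram figures in perspective, here is a quick back-of-the-envelope calculation in R using only the numbers from the post-collect heap dump. It assumes that the "Total Size" column is a reasonable approximation of the memory those objects occupy and it ignores every other class, so treat it as a rough lower bound rather than an exact accounting.

```r
# Figures copied from the post-collect heap histogram.
char_arrays <- list(count = 2784858, bytes = 579458804)  # class [C
strings     <- list(count = 2782732, bytes = 44523712)   # class java.lang.String

avg_char_array_bytes <- char_arrays$bytes / char_arrays$count
avg_string_bytes     <- strings$bytes / strings$count

# Memory held by one collected copy of the lines, and by two copies at once.
one_copy_mib   <- (char_arrays$bytes + strings$bytes) / 1024^2
two_copies_mib <- 2 * one_copy_mib

cat(sprintf("avg char[]: %.0f bytes, avg String: %.0f bytes\n",
            avg_char_array_bytes, avg_string_bytes))
cat(sprintf("one copy: %.0f MiB, two copies: %.0f MiB\n",
            one_copy_mib, two_copies_mib))
# avg char[]: 208 bytes, avg String: 16 bytes
# one copy: 595 MiB, two copies: 1190 MiB
```

One copy fits within a 1g heap (1024 MiB), but two copies of the strings alone already exceed it, which matches the observation that the first collect succeeds and the second one runs out of memory.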
Finally, to answer your question about hanging after the collect, it appears it has to do with the data not fitting in the R process's memory; here's a thread related to that issue: https://www.mail-archive.com/user@spark.apache.org/msg29155.html
I confirmed that using a smaller file with only a handful of lines, and then running collect indeed does not hang.

Allowing (or circumventing) parallel write access to a file

I'm calling a Windows executable from multiple parallel R processes (via a system call within parSapply). This .exe (let's call it my.exe) is passed a filename as an argument, and processes this file (details are probably irrelevant). Unfortunately, my.exe creates a log file (in the same directory as my.exe) that it writes to while it runs, and, since the log file's name is fixed, subsequent R processes calling my.exe result in my.exe throwing the error:
Cannot create result file "log.res".
Do you have write access in the current directory?
I've managed to work around this by creating multiple copies of my.exe (as many as the number of cores in my cluster, i.e. 7). I can then ensure that each is only in use by a single R process at any one time, by passing to the cores a vector of 7 paths to .bat files, each of which repeatedly calls a given copy of my.exe.
Is there a more elegant way to deal with this issue, perhaps by having the processes create virtual instances of my.exe automagically? I don't require the log files.
Since this is an error thrown by the program and not by R, I suspect there might be no way to permit concurrent write access to the log file from the R side of things.
Ideally, I want to be doing something like this:
ff <- c('a', 'long', 'vector', 'of', 'file', 'paths') # abbreviated
parSapply(cl, ff, function(f) system(sprintf("my.exe %s", f)))
but instead I've resorted to doing (more or less) this (after copying my.exe to c:/1/, c:/2/, c:/3/, through c:/7/):
cat(paste('CALL C:/1/my.exe', ff[1:10], '/RUN=YES'), file='run1.bat', sep='\n')
cat(paste('CALL C:/2/my.exe', ff[11:20], '/RUN=YES'), file='run2.bat', sep='\n')
cat(paste('CALL C:/3/my.exe', ff[21:30], '/RUN=YES'), file='run3.bat', sep='\n')
cat(paste('CALL C:/4/my.exe', ff[31:40], '/RUN=YES'), file='run4.bat', sep='\n')
cat(paste('CALL C:/5/my.exe', ff[41:50], '/RUN=YES'), file='run5.bat', sep='\n')
cat(paste('CALL C:/6/my.exe', ff[51:60], '/RUN=YES'), file='run6.bat', sep='\n')
cat(paste('CALL C:/7/my.exe', ff[61:70], '/RUN=YES'), file='run7.bat', sep='\n')
parSapply(cl, c('run1.bat', 'run2.bat', 'run3.bat', 'run4.bat',
'run5.bat', 'run6.bat', 'run7.bat'), system)
(Above, instead of letting parSapply assign the 70 elements of ff to the various processes, I manually split them when creating the batch files, and then run the batch files in parallel.)
It sounds like your basic strategy is the only known solution to the problem, but I think it can be done more elegantly. For instance, you could avoid creating .BAT files by having each worker execute a different command line based on a worker ID. The worker ID could be assigned using:
# Assign worker IDs to each of the cluster workers
setid <- function(id) assign(".Worker.id", id, pos=globalenv())
clusterApply(cl, seq_along(cl), setid)
Also, you may want to automate the creation of the directories that contain "my.exe". I also prefer to use a symlink rather than a copy of the executable:
# Create directories containing a symlink to the real executable
exepath <- "C:/bin/my.exe" # Path to the real executable
pdir <- getwd() # Parent of the new executable directories
myexe <- file.path(pdir, sprintf("work_%d", seq_along(cl)), "my.exe")
for (x in myexe) {
  dir.create(dirname(x), showWarnings=FALSE)
  if (file.exists(x))
    unlink(x)
  file.symlink(exepath, x)
}
If symlinks don't fool "my.exe" into creating the log file in the desired directory, you could try using "file.copy" instead of "file.symlink".
Now you can run your parallel job using:
# Each worker executes a different symlink to the real executable
worker.fun <- function(f, myexe) {
  system(sprintf("%s %s /RUN=YES", myexe[.Worker.id], f))
}
ff <- c('a', 'long', 'vector', 'of', 'file', 'paths')
parSapply(cl, ff, worker.fun, myexe)
You could also delete the directories that were created, but they don't use much space since symlinks are used, so it might be better to keep them, especially during debugging/testing.

Logfile analysis in R?

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. It might not be the first thought to do it in R, but still R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys, I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R Package. I won't be able to do that straight away though, because I need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but I have no idea).
I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalizer and AWStats), but neither parses server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access logs, in sum, Apache servers (likewise, nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X, that directory is in /var, just below root:
$> pwd
/var/log/apache2
$> ls
access_log error_log
For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and, more significantly, because the data file you are left with after parsing is an irregular time series: you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
I wanted an app into which I could toss raw error logs (of any size, but usually several hundred MB at a time) and have something useful come out the other end, which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
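The parse-then-grid split described above can be sketched roughly as follows. This is not the code from that project, only a minimal illustration in Python. It assumes Apache 2.2-style error log lines of the form "[timestamp] [level] [client ip] message"; the names parse_line and grid_counts, the sample lines, and the 60-second bin width are all made up for the example.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Assumed layout (Apache 2.2 style): [timestamp] [level] [client ip] message
# The [client ...] part is optional, e.g. for server start/stop notices.
LINE = re.compile(
    r'\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] '
    r'(?:\[client (?P<client>[^\]]+)\] )?(?P<msg>.*)'
)

def parse_line(line):
    """Return (timestamp, level, client, message), or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group('ts'), '%a %b %d %H:%M:%S %Y')
    return ts, m.group('level'), m.group('client'), m.group('msg')

def grid_counts(timestamps, step=timedelta(seconds=60)):
    """Count events per fixed-width bin, keeping empty bins so the series is regular."""
    if not timestamps:
        return []
    start = min(timestamps)
    counts = Counter((ts - start) // step for ts in timestamps)
    return [(start + i * step, counts.get(i, 0)) for i in range(max(counts) + 1)]

# Hypothetical sample: two entries share one timestamp, the third comes later
sample = [
    '[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test',
    '[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] File does not exist: /var/www/favicon.ico',
    '[Wed Oct 11 14:35:10 2000] [notice] caught SIGTERM, shutting down',
]
parsed = [p for p in map(parse_line, sample) if p is not None]
series = grid_counts([p[0] for p in parsed])
for bin_start, n in series:
    print(bin_start.isoformat(), n)
```

Keeping the empty bins is the point of the gridding step: the irregular entries become a regular series that can be written out and loaded into a data frame.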
I have been building analytics tools for a long time, but only in the past four years have I been using R. So my first impression, immediately upon parsing a raw log file and loading the data frame in R, is what a pleasure R is to work with and how well suited it is for tasks of this sort. A few welcome surprises:
- Serialization. Persisting working data in R takes a single command (save). I knew this, but I didn't know how efficient this binary format is. The actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB, i.e. 100 : 1 compression. (Note: I pushed this down further to about 300 : 1 by using the data.table library and manually setting the compression level argument to the save function);
- IO. My Data Warehouse relies heavily on redis, a lightweight datastructure server that resides entirely in RAM and writes to disk asynchronously. The project itself is only about two years old, yet there's already a redis client for R in CRAN (by B.W. Lewis, version 1.6.1 as of this post);
- Primary Data Analysis. The purpose of this Project was to build a Library for our Network Ops guys to use. My goal was a "one command = one data view" type interface. So for instance, I used the excellent googleVis Package to create professional-looking scrollable/paginated HTML tables with sortable columns, into which I loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elements (e.g., sorting a column) delivered useful descriptive analytics. As another example, I wrote a lot of thin wrappers over some basic data juggling and table-like functions; each of these functions I would, for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part because quite often the function required no wrapper: the single command with the arguments supplied was enough to generate a useful view of the data.
A couple of examples of the last bullet:
# what are the most common issues that cause an error to be logged?
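err_order = function(df) {
  # count occurrences of each issue description
  t0 = xtabs(~Issue_Descr, df)
  m = cbind(names(t0), t0)
  rownames(m) = NULL
  colnames(m) = c("Cause", "Count")
  # sort by count, most frequent first
  x = as.numeric(m[, 2])
  ndx = order(x, decreasing=TRUE)
  # drop=FALSE keeps m a matrix even when there is only one issue
  m = m[ndx, , drop=FALSE]
  m1 = data.frame(Cause=m[, 1], Count=as.numeric(m[, 2]),
                  CountAsProp=100 * as.numeric(m[, 2]) / nrow(df))
  # keep only the issues that account for at least 1% of all entries
  subset(m1, CountAsProp >= 1)
}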
# calling this function, passing in a data frame, returns something like:
  Cause                            Count  CountAsProp
1 'connect to unix://var/ failed'    200         40.0
2 'object buffered to temp file'     185         37.0
3 'connection refused'                94         18.8
The Primary Data Cube Displayed for Interactive Analysis Using googleVis:
A contingency table (from an xtabs function call) displayed using googleVis
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, has three different regexp engines to parse with, and so on.
And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.
I have used R to load and parse IIS log files with some success. Here is my code.
Load IIS Log files
require(data.table)
setwd("Log File Directory")
# get a list of all the log files
log_files <- Sys.glob("*.log")
# This line
# 1) reads each log file
# 2) concatenates them
IIS <- do.call( "rbind", lapply( log_files, read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) )
# Add field names, copied from the "#Fields:" header line of one of the log files
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken")
#Change it to a data.table
IIS <- data.table( IIS )
#Query at will
IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]
I did a logfile analysis recently using R. It was nothing really complex, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was the data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but in the end I decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem in reading logfiles in R.
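To illustrate why normalization shrinks log data, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, purely so the example is self-contained; the table names, column names and sample rows are hypothetical and not taken from the setup described above. Repeated strings (URIs, user agents) are stored once in lookup tables, and each log row keeps only integer ids.

```python
import sqlite3

# Hypothetical, already-parsed log rows: (timestamp, uri, user agent, status)
rows = [
    ("2015-04-01 13:39:22", "/index.html", "curl/7.41.0", 502),
    ("2015-04-01 13:39:23", "/index.html", "curl/7.41.0", 200),
    ("2015-04-01 13:39:25", "/about.html", "curl/7.41.0", 200),
]

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE uri   (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE agent (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE hit   (ts TEXT,
                        uri_id   INTEGER REFERENCES uri(id),
                        agent_id INTEGER REFERENCES agent(id),
                        status   INTEGER);
""")

def lookup_id(table, value):
    """Return the id of value in a lookup table, inserting it the first time it is seen."""
    con.execute(f"INSERT OR IGNORE INTO {table} (value) VALUES (?)", (value,))
    return con.execute(f"SELECT id FROM {table} WHERE value = ?", (value,)).fetchone()[0]

for ts, uri, agent, status in rows:
    con.execute("INSERT INTO hit VALUES (?, ?, ?, ?)",
                (ts, lookup_id("uri", uri), lookup_id("agent", agent), status))

# A descriptive table: hits per URI and status, joined back to the readable strings
summary = con.execute("""
    SELECT uri.value, hit.status, COUNT(*)
    FROM hit JOIN uri ON uri.id = hit.uri_id
    GROUP BY uri.value, hit.status
    ORDER BY uri.value, hit.status
""").fetchall()
print(summary)
```

The aggregated result is small enough to pull into R, which then only has to handle the summary rather than the raw logs.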
#!/usr/bin/env python3
import argparse
import csv
import re
import sys

# Two sample lines, used when no log file is given
SAMPLE = ['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'''] * 2
# 'Referer' and 'Extra' name the two quoted "-" fields around the user agent in the
# sample line; what the last one holds depends on the server's log format
HEADER = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'Referer', 'User-Agent', 'Extra']
# One token per bare word, [bracketed] group or "quoted" group
TOKEN = re.compile(r'\[([^\]]*)\]|"([^"]*)"|(\S+)')

def parse(line):
    tokens = [a or b or c for a, b, c in TOKEN.findall(line)]
    ip, ident, user, stamp, request, status, size, referer, agent, extra = tokens
    timestamp, offset = stamp.split()
    verb, endpoint, version = request.split()
    return [ip, ident, user, timestamp, offset, verb, endpoint, version, status, size, referer, agent, extra]

parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='source', default=None)
arguments = parser.parse_args()
if arguments.source:
    # Read the log file (text mode, read-only)
    with open(arguments.source) as fin:
        lines = fin.read().splitlines()
else:
    lines = SAMPLE

writer = csv.writer(sys.stdout, lineterminator='\n')
writer.writerow(HEADER)
writer.writerows(parse(line) for line in lines if line.strip())
Demo output:
IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,Referer,User-Agent,Extra
54.67.81.141,-,-,01/Apr/2015:13:39:22,+0000,GET,/,HTTP/1.1,502,173,-,curl/7.41.0,-
54.67.81.141,-,-,01/Apr/2015:13:39:22,+0000,GET,/,HTTP/1.1,502,173,-,curl/7.41.0,-
This format can easily be read into R using read.csv. And, it doesn't require any 3rd party libraries.

Resources