I am very new to the Big Data technologies I am attempting to work with, but have so far managed to set up sparklyr in RStudio to connect to a standalone Spark cluster. Data is stored in Cassandra, and I can successfully bring large datasets into Spark memory (cache) to run further analysis on them.
However, recently I have been having a lot of trouble bringing one particularly large dataset into Spark memory, even though the cluster should have more than enough resources (60 cores, 200 GB RAM) to handle a dataset of its size.
I thought that by limiting the data being cached to just a few select columns of interest I could overcome the issue (using the answer code from my previous query here), but it does not. What happens is the jar process on my local machine ramps up to take over all the local RAM and CPU resources and the whole process freezes, while on the cluster executors keep getting dropped and re-added. Weirdly, this happens even when I select only 1 row for caching (which should make this dataset much smaller than other datasets which I have had no problem caching into Spark memory).
I've had a look through the logs, and these seem to be the only informative errors/warnings early on in the process:
17/03/06 11:40:27 ERROR TaskSchedulerImpl: Ignoring update with state FINISHED for TID 33813 because its task set is gone (this is likely the result of receiving duplicate task finished status updates) or its executor has been marked as failed.
17/03/06 11:40:27 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 8167), so marking it as still running
...
17/03/06 11:46:59 WARN TaskSetManager: Lost task 3927.3 in stage 0.0 (TID 54882, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 3863), so marking it as still running
17/03/06 11:46:59 WARN TaskSetManager: Lost task 4300.3 in stage 0.0 (TID 54667, 213.248.241.186, executor 100): ExecutorLostFailure (executor 100 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 167626 ms
17/03/06 11:46:59 INFO DAGScheduler: Resubmitted ShuffleMapTask(0, 14069), so marking it as still running
And then, after 20 minutes or so, the whole job crashes with:
java.lang.OutOfMemoryError: GC overhead limit exceeded
I've changed my connection config to increase the heartbeat interval (spark.executor.heartbeatInterval: '180s'), and have seen how to increase memoryOverhead by changing settings on a YARN cluster (using spark.yarn.executor.memoryOverhead), but not on a standalone cluster.
In my config file, I have experimented by adding each of the following settings one at a time (none of which have worked):
spark.memory.fraction: 0.3
spark.executor.extraJavaOptions: '-Xmx24g'
spark.driver.memory: "64G"
spark.driver.extraJavaOptions: '-XX:MaxHeapSize=1024m'
spark.driver.extraJavaOptions: '-XX:+UseG1GC'
UPDATE: My full current yml config file is as follows:
default:
  # local settings
  sparklyr.sanitize.column.names: TRUE
  sparklyr.cores.local: 3
  sparklyr.shell.driver-memory: "8G"
  # remote core/memory settings
  spark.executor.memory: "32G"
  spark.executor.cores: 5
  spark.executor.heartbeatInterval: '180s'
  spark.ext.h2o.nthreads: 10
  spark.cores.max: 30
  spark.memory.storageFraction: 0.6
  spark.memory.fraction: 0.3
  spark.network.timeout: 300
  spark.driver.extraJavaOptions: '-XX:+UseG1GC'
  # other configs for spark
  spark.serializer: org.apache.spark.serializer.KryoSerializer
  spark.executor.extraClassPath: /var/lib/cassandra/jar/guava-18.0.jar
  # cassandra settings
  spark.cassandra.connection.host: <cassandra_ip>
  spark.cassandra.auth.username: <cassandra_login>
  spark.cassandra.auth.password: <cassandra_pass>
  spark.cassandra.connection.keep_alive_ms: 60000
  # spark packages to load
  sparklyr.defaultPackages:
  - "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0-M1"
  - "com.databricks:spark-csv_2.11:1.3.0"
  - "com.datastax.cassandra:cassandra-driver-core:3.0.2"
  - "com.amazonaws:aws-java-sdk-pom:1.10.34"
So my questions are:
Does anyone have any ideas about what to do in this instance?
Are there config settings I can change to help with this issue?
Alternatively, is there a way to import the Cassandra data in batches with RStudio/sparklyr as the driver?
Or alternatively again, is there a way to munge/filter/edit data as it is brought into cache so that the resulting table is smaller (similar to using SQL querying, but with more complex dplyr syntax)?
OK, I've finally managed to make this work!
I'd initially tried the suggestion of @user6910411 to decrease the Cassandra input split size, but this failed in the same way. After playing around with LOTS of other things, today I tried changing that setting in the opposite direction:
spark.cassandra.input.split.size_in_mb: 254
By INCREASING the split size, there were fewer Spark tasks, and thus less overhead and fewer calls to the GC. It worked!
I have a strange problem with one of my Plone sites: when I clear and rebuild the zcatalog, the Zope client quits silently after some time. No errors.
I did the process using the ZMI (ZEO + ZEO client, standalone) and using 'zinstance debug'. Same result: the client quits silently.
I'm using a standard Plone 4.3 with some add-on Products on an Ubuntu Server 12.04 box.
Tasks I have done in order to find out the problem, without success:
I've checked the permissions on the filesystem.
I've reinstalled Plone 4.3
Packing the database works ok, but the problem persists.
Checked the free inodes on the filesystem.
Executing the process on another computer works successfully.
Executing the client with the fg parameter; no messages when quitting.
Backing up the DB and restoring it. Same result after restoring, but when restored on another computer the process rebuilds the catalog (with the same Plone version and add-ons).
Reindexing an index of the catalog causes the same failure: quitting with no messages.
ZODB/scripts/fstest.py shows no errors.
ZODB/scripts/fsrefs.py shows no errors.
Any clues?
David, you are right. I just found the problem yesterday, but it was too late (I was tired) to report it here.
This Plone instance is installed on a VPS (OpenVZ) with 512 MB of RAM, and the kernel killed the Python process silently when there was no free memory.
One of my last tests was to rebuild the catalog with "log progress" enabled; there I saw that the process quit at different points, but always at around 30%. Then by chance I executed dmesg, and voilà, the mystery was resolved. Look:
[2233907.698115] Out of memory in UB: OOM killed process 17819 (python) score 0 vm:799612kB, rss:497324kB, swap:45480kB
[2235168.564053] Out of memory in UB: OOM killed process 445 (python) score 0 vm:790380kB, rss:498036kB, swap:46924kB
[2236752.744927] Out of memory in UB: OOM killed process 17964 (python) score 0 vm:790392kB, rss:494232kB, swap:45584kB
[2237461.280724] Out of memory in UB: OOM killed process 26584 (python) score 0 vm:790328kB, rss:497932kB, swap:45940kB
[2238443.104334] Out of memory in UB: OOM killed process 1216 (python) score 0 vm:799512kB, rss:494132kB, swap:44632kB
[2239457.938721] Out of memory in UB: OOM killed process 12821 (python) score 0 vm:794896kB, rss:502000kB, swap:42656kB
Is it possible to debug a core file generated by an executable compiled without the debug (-g) flag?
If yes, any pointers or tutorials on it?
Yes you can. It will not be easy though. I will give you an example.
Let's say that I have the following program called foo.c:
int main(void)
{
    *((char *) 0) = '\0';   /* deliberately write through a null pointer */
}
I'll compile it and make sure that there are no symbols:
$ cc foo.c
$ strip a.out
$ file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
Ok, time to run it:
$ ./a.out
Segmentation fault (core dumped)
Oops. There seems to be a bug. Let's start a debugger:
$ gdb ./a.out core
[..]
Reading symbols from /tmp/a.out...(no debugging symbols found)...done.
[..]
Core was generated by `./a.out'.
Program terminated with signal 11, Segmentation fault.
#0 0x0804839c in ?? ()
(gdb) bt
#0 0x0804839c in ?? ()
#1 0xb7724e37 in __libc_start_main () from /lib/i386-linux-gnu/libc.so.6
#2 0x08048301 in ?? ()
Hmm, looks bad. No symbols. Can we figure out what happened?
(gdb) x/i $eip
=> 0x804839c: movb $0x0,(%eax)
Looks like it tried to store a byte with a value of zero at the memory location pointed to by the EAX register. Why did it fail?
(gdb) p $eax
$1 = 0
(gdb)
It failed because the EAX register is pointing to memory address zero and the program tried to store a byte at that address. Oops!
Unfortunately I do not have pointers to any good tutorials. Searching for "gdb reverse engineering" gives some links which have potentially helpful bits and pieces.
Update:
I noticed the comment that this is about debugging a core dump at a customer's site. When you ship stripped binaries to a customer, you should always keep a debug version of that binary.
I would recommend not stripping, and even giving the source code. All code that I write goes to a customer with the source code. I have been on the customer side too many times, facing an incompetent vendor who has shipped a broken piece of software but does not know how to fix it. It sucks.
This actually seems to be a duplicate of this question:
Debug core file with no symbols
There is some additional info there.
Yes, you can;
this is what people who, for example, write cracks are doing.
Unfortunately I don't have the slides and documents of a course I followed at university anymore, but googling for reverse engineering or disassembly tutorials will give you some starting points. Also, knowing your way around in assembly code is essential.
Our class was based on a book (mainly chapters 1 & 3), but there is a new edition out now:
Computer Systems: A Programmer's Perspective by R.E. Bryant and D.R. O'Hallaron,
which explains the basics behind computer systems and also gives you a good understanding of how programs work on those systems.
Also, when learning this, be aware that 64-bit CPUs have different assembly code than 32-bit CPUs, just in case.
If the program is compiled without the -g flag, the binary contains no debug symbols, so you cannot do source-level debugging of the core file.
Otherwise you can do so as:
gdb executable corefile
You can find more at:
http://wwwpub.zih.tu-dresden.de/~mlieber/practical_debugging/04_gdb.pdf
What would be a simpler description of file descriptors than Wikipedia's? Why are they required? Say we take shell processes as an example: how does the concept apply to them?
Does a process table contain more than one file descriptor? If yes, why?
In simple words, when you open a file, the operating system creates an entry to represent that file and stores information about that opened file. So if there are 100 files opened in your OS, then there will be 100 entries in the OS (somewhere in the kernel). These entries are represented by integers like (...100, 101, 102...). This entry number is the file descriptor.
So it is just an integer number that uniquely represents an opened file for the process.
If your process opens 10 files, then your process table will have 10 entries for file descriptors.
Similarly, when you open a network socket, it is also represented by an integer, and it is called a socket descriptor.
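To make this concrete, here is a minimal C sketch (the file name /etc/hostname is just an illustrative choice): each successful open() hands back the next small integer available to the calling process, since 0, 1 and 2 are usually already taken.
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* 0, 1 and 2 are already in use for stdin, stdout and stderr */
    int fd1 = open("/etc/hostname", O_RDONLY);   /* typically 3 */
    int fd2 = open("/etc/hostname", O_RDONLY);   /* typically 4 */

    printf("fd1 = %d, fd2 = %d\n", fd1, fd2);

    close(fd1);
    close(fd2);
    return 0;
}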
I hope you understand.
A file descriptor is an opaque handle that is used in the interface between user and kernel space to identify file/socket resources. Therefore, when you use open() or socket() (system calls to interface with the kernel), you are given a file descriptor, which is an integer (it is actually an index into the process's u structure - but that is not important). Therefore, if you want to interface directly with the kernel, using system calls to read(), write(), close(), etc., the handle you use is a file descriptor.
There is a layer of abstraction overlaid on the system calls, which is the stdio interface. This provides more functionality/features than the basic system calls do. For this interface, the opaque handle you get is a FILE*, which is returned by the fopen() call. There are many, many functions that use the stdio interface: fprintf(), fscanf(), fclose(), and so on, which are there to make your life easier. In C, stdin, stdout, and stderr are FILE*, which in UNIX respectively map to file descriptors 0, 1 and 2.
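As a sketch of those two layers side by side (example.txt is only a placeholder name), the same file is written first through the raw descriptor interface and then through the stdio interface, with fileno() revealing the descriptor hidden inside the FILE*:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* kernel interface: open() returns an integer file descriptor */
    int fd = open("example.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, "via write()\n", 12);
    close(fd);

    /* stdio layer: fopen() returns a FILE*, which wraps a descriptor */
    FILE *fp = fopen("example.txt", "a");
    fprintf(fp, "via fprintf()\n");
    printf("descriptor behind the FILE*: %d\n", fileno(fp));
    fclose(fp);
    return 0;
}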
Hear it from the horse's mouth: APUE (Richard Stevens).
To the kernel, all open files are referred to by file descriptors. A file descriptor is a non-negative number.
When we open an existing file or create a new file, the kernel returns a file descriptor to the process. The kernel maintains a table of all open file descriptors that are in use. The allotment of file descriptors is generally sequential: the file is given the next free file descriptor from the pool of free descriptors. When we close the file, the file descriptor is freed and becomes available for further allotment.
When we want to read or write a file, we identify the file with the file descriptor that was returned by the open() or creat() function call, and use it as an argument to either read() or write().
It is by convention that UNIX system shells associate file descriptor 0 with the standard input of a process, file descriptor 1 with the standard output, and file descriptor 2 with the standard error.
File descriptors range from 0 to OPEN_MAX. The maximum file descriptor value can be obtained with ulimit -n. For more information, go through the 3rd chapter of the APUE book.
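As a small illustration of that 0/1/2 convention, this sketch writes to the terminal using nothing but the descriptor constants from unistd.h:
#include <unistd.h>

int main(void)
{
    const char out[] = "this goes to file descriptor 1 (standard output)\n";
    const char err[] = "this goes to file descriptor 2 (standard error)\n";

    write(STDOUT_FILENO, out, sizeof out - 1);   /* fd 1 */
    write(STDERR_FILENO, err, sizeof err - 1);   /* fd 2 */
    return 0;
}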
Other answers added great stuff. I will add just my 2 cents.
According to Wikipedia we know for sure: a file descriptor is a non-negative integer. The most important thing that I think is missing would be to say:
File descriptors are bound to a process ID.
We know the most famous file descriptors are 0, 1 and 2.
0 corresponds to STDIN, 1 to STDOUT, and 2 to STDERR.
Say we take shell processes as an example: how does the concept apply to them?
Check out this code
#>sleep 1000 &
[12] 14726
We created a process with the id 14726 (PID).
Using lsof -p 14726 we can get things like this:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sleep 14726 root cwd DIR 8,1 4096 1201140 /home/x
sleep 14726 root rtd DIR 8,1 4096 2 /
sleep 14726 root txt REG 8,1 35000 786587 /bin/sleep
sleep 14726 root mem REG 8,1 11864720 1186503 /usr/lib/locale/locale-archive
sleep 14726 root mem REG 8,1 2030544 137184 /lib/x86_64-linux-gnu/libc-2.27.so
sleep 14726 root mem REG 8,1 170960 137156 /lib/x86_64-linux-gnu/ld-2.27.so
sleep 14726 root 0u CHR 136,6 0t0 9 /dev/pts/6
sleep 14726 root 1u CHR 136,6 0t0 9 /dev/pts/6
sleep 14726 root 2u CHR 136,6 0t0 9 /dev/pts/6
The 4th column, FD, and the very next column, TYPE, correspond to the file descriptor and the file descriptor type.
Some of the values for the FD can be:
cwd – Current Working Directory
txt – Text file
mem – Memory mapped file
mmap – Memory mapped device
But the real file descriptor is under:
NUMBER – Represents the actual file descriptor.
The character after the number, e.g. the "u" in "1u", represents the mode in which the file is opened: r for read, w for write, u for read and write.
TYPE specifies the type of the file. Some of the values of TYPEs are:
REG – Regular File
DIR – Directory
FIFO – First In First Out
But in this example all the numbered file descriptors are of type:
CHR – Character special file (or character device file)
Now, we can identify the file descriptors for STDIN, STDOUT and STDERR easily with lsof -p PID, or we can see the same if we list /proc/PID/fd.
Note also that the file descriptor table the kernel keeps track of is not the same as the file table or the inode table. These are separate, as some other answers have explained.
You may ask yourself where these file descriptors physically live and what is stored in /dev/pts/6, for instance:
sleep 14726 root 0u CHR 136,6 0t0 9 /dev/pts/6
sleep 14726 root 1u CHR 136,6 0t0 9 /dev/pts/6
sleep 14726 root 2u CHR 136,6 0t0 9 /dev/pts/6
Well, /dev/pts/6 lives purely in memory. These are not regular files, but so-called character device files. You can check this with ls -l /dev/pts/6: the listing will start with c, in my case crw--w----.
Just to recall, most Linux-like OSes define seven types of files:
Regular files
Directories
Character device files
Block device files
Local domain sockets
Named pipes (FIFOs) and
Symbolic links
File Descriptors (FD):
In Linux/Unix, everything is a file. Regular files, directories, and even devices are files. Every file has an associated number called a File Descriptor (FD).
Your screen also has a file descriptor. When a program is executed, the output is sent to the file descriptor of the screen, and you see the program output on your monitor. If the output were sent to the file descriptor of the printer, the program output would have been printed.
Error Redirection:
Whenever you execute a program/command at the terminal, three files are always open: standard input, standard output, and standard error.
These files are always present whenever a program is run. As explained before, a file descriptor is associated with each of these files.
File               Name     File Descriptor
Standard Input     STDIN    0
Standard Output    STDOUT   1
Standard Error     STDERR   2
For instance, while searching for files, one typically gets permission-denied errors or other kinds of errors. These errors can be saved to a particular file.
Example 1
$ ls mydir 2>errorfile.txt
The file descriptor for standard error is 2.
If there is no directory named mydir, then the error output of the command will be saved to the file errorfile.txt.
Using "2>" we redirect the error output to a file named "errorfile.txt".
Thus, program output is not cluttered with errors.
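For the curious, here is a rough C sketch, an assumed simplification rather than actual shell source, of what the shell arranges for ls mydir 2>errorfile.txt: open the file, duplicate its descriptor onto descriptor 2, then run the command:
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("errorfile.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    dup2(fd, STDERR_FILENO);   /* descriptor 2 now refers to errorfile.txt */
    close(fd);                 /* the extra descriptor is no longer needed  */

    /* replace this process with "ls mydir"; its errors land in the file */
    execlp("ls", "ls", "mydir", (char *) NULL);
    return 1;                  /* only reached if execlp() fails */
}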
I hope you got your answer.
More points regarding file descriptors:
File descriptors (FDs) are non-negative integers (0, 1, 2, ...) that are associated with files that are opened.
0, 1, and 2 are standard FDs that correspond to STDIN_FILENO, STDOUT_FILENO and STDERR_FILENO (defined in unistd.h), opened by default on behalf of the shell when the program starts.
FDs are allocated in sequential order, meaning the lowest possible unallocated integer value (see the sketch after this list).
FDs for a particular process can be seen in /proc/$pid/fd (on Unix-based systems).
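As a sketch of the sequential-allocation point above (the file name is only an example), closing descriptor 0 means the very next open() reuses it, because 0 is now the lowest free value:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    printf("first open  -> fd %d\n", fd);        /* typically 3 */

    close(STDIN_FILENO);                         /* free descriptor 0 */

    int fd2 = open("/etc/hostname", O_RDONLY);
    printf("second open -> fd %d\n", fd2);       /* 0, the lowest free value */
    return 0;
}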
As an addition to the other answers, Unix considers everything to be a file. Your keyboard is a file that is read-only from the perspective of the kernel. The screen is a write-only file. Similarly, folders and input/output devices are also considered to be files. Whenever a file is opened, say when a device driver [for a device file] requests an open(), or when a process opens a user file, the kernel allocates a file descriptor, an integer that identifies the access to that file, such as it being read-only, write-only, etc. [For reference: https://en.wikipedia.org/wiki/Everything_is_a_file ]
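As a quick illustration of that point, the sketch below reads a character device (/dev/urandom) with exactly the same descriptor calls that would be used for a regular file:
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    unsigned char buf[4];
    int fd = open("/dev/urandom", O_RDONLY);   /* a device, opened like any file */

    if (fd != -1 && read(fd, buf, sizeof buf) == (ssize_t) sizeof buf)
        printf("random bytes: %02x %02x %02x %02x\n",
               buf[0], buf[1], buf[2], buf[3]);

    close(fd);
    return 0;
}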
File descriptors
To the kernel, all open files are referred to by file descriptors.
A file descriptor is a non-negative integer.
When we open an existing file or create a new file, the kernel returns a file descriptor to the process.
When we want to read or write a file, we identify the file with the file descriptor that was returned by open or create, passing it as an argument to either read or write.
Historically, each UNIX process had 20 file descriptors at its disposal, numbered 0 through 19, but that was extended to 63 by many systems.
The first three are already opened when the process begins:
0: The standard input
1: The standard output
2: The standard error output
When the parent process forks a new process, the child process inherits the file descriptors of the parent.
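Here is a minimal sketch of that inheritance (shared.txt is just a placeholder name): a descriptor opened before fork() stays valid in the child, so parent and child write through the same open file:
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    if (fork() == 0) {              /* child: inherits fd from the parent */
        write(fd, "child\n", 6);
        _exit(0);
    }

    wait(NULL);                     /* parent: waits, then uses the same fd */
    write(fd, "parent\n", 7);
    close(fd);
    return 0;
}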
All the answers provided are great; here is my version --
File descriptors are non-negative integers that act as an abstract handle to "files" or I/O resources (like pipes, sockets, or data streams). These descriptors help us interact with these I/O resources and make working with them very easy. The I/O system is visible to a user process as a stream of bytes (an I/O stream). A Unix process uses descriptors (small unsigned integers) to refer to I/O streams. The system calls related to I/O operations take a descriptor as an argument.
Valid file descriptors range from 0 to a maximum descriptor number that is configurable (ulimit, /proc/sys/fs/file-max). The kernel assigns descriptors for standard input (0), standard output (1) and standard error (2) in the FD table. If a file open is not successful, the call returns -1.
When a process makes a successful request to open a file, the kernel returns a file descriptor which points to an entry in the kernel's global file table. The file table entry contains information such as the inode of the file, byte offset, and the access restrictions for that data stream (read-only, write-only, etc.).
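A sketch of that file-table detail: dup() yields a second descriptor pointing at the same file table entry, so the byte offset is shared, while a second open() of the same file creates a fresh entry with its own offset (the file name is only an example):
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int a = open("/etc/hostname", O_RDONLY);
    int b = dup(a);                           /* same file table entry as a  */
    int c = open("/etc/hostname", O_RDONLY);  /* brand new file table entry  */

    char ch;
    read(a, &ch, 1);                          /* advances the offset shared by a and b */

    printf("offset seen through b: %ld\n", (long) lseek(b, 0, SEEK_CUR));  /* 1 */
    printf("offset seen through c: %ld\n", (long) lseek(c, 0, SEEK_CUR));  /* 0 */

    close(a);
    close(b);
    close(c);
    return 0;
}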
Any operating system has processes running, say p1, p2, p3 and so forth. Each process usually makes ongoing use of files.
Each process has an associated process table (or process tree, in another phrasing).
Usually, operating systems represent each file in each process by a number (that is to say, in each process tree/table).
The first file used in the process is file0, second is file1, third is file2, and so forth.
Any such number is a file descriptor.
File descriptors are usually integers (0, 1, 2 and not 0.5, 1.5, 2.5).
Given that we often describe processes as "process tables", and given that tables have rows (entries), we can say that the file descriptor cell in each entry is used to represent the whole entry.
In a similar way, when you open a network socket, it has a socket descriptor.
In some operating systems you can run out of file descriptors, but such a case is extremely rare, and the average computer user shouldn't worry about it.
File descriptors might be global (process A starts at, say, 0 and ends at, say, 1; process B starts at, say, 2 and ends at, say, 3), and so forth, but as far as I know, in modern operating systems file descriptors are usually not global and are actually process-specific (process A starts at 0 and ends at, say, 5, while process B starts at 0 and ends at, say, 10).
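Echoing the socket remark above, a tiny sketch: socket() hands back an ordinary descriptor, which the usual descriptor calls such as close() (or read()/write() once connected) accept:
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);   /* just another small integer */
    printf("socket descriptor: %d\n", sock);
    close(sock);
    return 0;
}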
File descriptors are nothing but references to any open resource. As soon as you open a resource, the kernel assumes you will be doing some operations on it. All the communication between your program and the resource happens over an interface, and this interface is provided by the file descriptor.
Since a process can open a resource more than once, it is possible for a resource to have more than one file descriptor.
You can view all file descriptors linked to a process by simply running
ls -li /proc/<pid>/fd/ (here pid is the process ID of your process).
As an addition to all the simplified responses above:
If you are working with files in a bash script, it can be handy to use a file descriptor.
For example:
If you want to read from and write to the file "test.txt", use a file descriptor as shown below:
FILE=$1          # give the name of the file on the command line
exec 5<>"$FILE"  # '5' here acts as the file descriptor
# Reading from the file line by line using file descriptor
while read LINE; do
echo "$LINE"
done <&5
# Writing to the file using descriptor
echo "Adding the date: `date`" >&5
exec 5<&- # Closing a file descriptor
I don't know the kernel code, but I'll add my two cents here since I've been thinking about this for some time, and I think it'll be useful.
When you open a file, the kernel returns a file descriptor to interact with that file.
A file descriptor is an implementation of an API for the file you're opening. The kernel creates this file descriptor, stores it in an array, and gives it to you.
This API requires an implementation that allows you to read and write to the file, for example.
Now, think about what I said again, remembering that everything is a file — printers, monitors, HTTP connections etc.
That's my summary after reading https://www.bottomupcs.com/file_descriptors.xhtml.