Can two Unix processes simultaneously write to different positions in a single file?

This is an unresolved exam question of mine.
Can two Unix processes simultaneously write to different positions in a single file?
1. Yes, the two processes will have their own file table entries.
2. No, the shared i-node contains a single offset pointer.
3. Only one process will have write privilege.
4. Yes, but only if we operate using NFS.

There is no file offset recorded in an inode, so answer 2 is incorrect.
There is no documented reason for a process to have its access rights modified, so answer 3 is incorrect.
NFS allows simultaneous access by processes on different hosts; the question here is about processes on the same host, so NFS shouldn't make a difference.
Here is a shell script demonstrating that the remaining answer, 1, is correct:
# create a 10 MiB file
dd if=/dev/zero of=/var/tmp/file bs=1024k count=10
# create two 1 MiB files
cd /tmp
printf "aaaaaaaa" > aa
printf "bbbbbbbb" > bb
i=0
while [ $i -lt 17 ]; do
  cat aa aa > aa.new && mv aa.new aa
  cat bb bb > bb.new && mv bb.new bb
  i=$((i+1))
done
ls -lG /var/tmp/file /tmp/aa /tmp/bb
# launch 10 processes that write at different locations in the same file
# conv=notrunc prevents dd from truncating the file
# conv=fdatasync (GNU dd) makes each dd flush its data to disk before exiting
i=0
while [ $i -lt 5 ]; do
  (
    dd if=/tmp/aa of=/var/tmp/file conv=notrunc,fdatasync bs=1024k count=1 seek=$((i*2)) 2>/dev/null &
    dd if=/tmp/bb of=/var/tmp/file conv=notrunc,fdatasync bs=1024k count=1 seek=$((i*2+1)) 2>/dev/null &
  ) &
  i=$((i+1))
done
# check concurrency
printf "\n%d processes are currently writing to /var/tmp/file\n" "$(fuser /var/tmp/file 2>/dev/null | wc -w)"
# wait for write completion and check the file contents
wait
printf "/var/tmp/file contains:\n"
od -c /var/tmp/file
Its output shows ten processes successfully and simultaneously writing to the very same file:
-rw-r--r-- 1 jlliagre 1048576 oct. 30 08:25 /tmp/aa
-rw-r--r-- 1 jlliagre 1048576 oct. 30 08:25 /tmp/bb
-rw-r--r-- 1 jlliagre 10485760 oct. 30 08:25 /var/tmp/file
10 processes are currently writing to /var/tmp/file
/var/tmp/file contains:
0000000 a a a a a a a a a a a a a a a a
*
4000000 b b b b b b b b b b b b b b b b
*
10000000 a a a a a a a a a a a a a a a a
*
14000000 b b b b b b b b b b b b b b b b
*
20000000 a a a a a a a a a a a a a a a a
*
24000000 b b b b b b b b b b b b b b b b
*
30000000 a a a a a a a a a a a a a a a a
*
34000000 b b b b b b b b b b b b b b b b
*
40000000 a a a a a a a a a a a a a a a a
*
44000000 b b b b b b b b b b b b b b b b
*
50000000

Explanation:
Yes, the two processes will have their own file table entries.
If the file is opened twice with the open function, two file descriptors are created, each backed by its own file table entry.
Each file table entry has its own file status flags and its own file offset.
So both descriptors have write permission, and each initially points to the first character of the file.
If we seek each descriptor to a different position and then write through both, this is easy to test.
The initial content of file.txt:
My name is Chandru. This is an empty file.
Code for testing:
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd1, fd2;

    /* open the same file twice: each open() gets its own file offset */
    if ((fd1 = open("file.txt", O_WRONLY)) < 0) {
        perror("Error");
        exit(1);
    }
    if ((fd2 = open("file.txt", O_WRONLY)) < 0) {
        perror("Error");
        exit(1);
    }

    /* position the first descriptor at offset 20 and write there */
    if (lseek(fd1, 20, SEEK_SET) != 20) {
        printf("Cannot seek\n");
        exit(1);
    }
    if (write(fd1, "testing", 7) != 7) {
        printf("Error write\n");
        exit(1);
    }

    /* position the second descriptor at offset 10 and write there */
    if (lseek(fd2, 10, SEEK_SET) != 10) {
        printf("Cannot seek\n");
        exit(1);
    }
    if (write(fd2, "condition", 9) != 9) {
        printf("Error write\n");
        exit(1);
    }

    return 0;
}
Output:
After running the program, the content of file.txt is
My name isconditionitesting empty file.
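As a side note (not part of the original answer), the same test can be written without lseek() at all. The sketch below uses pwrite(), which takes the offset as an argument and does not move the descriptor's file offset, so there is no window between the seek and the write in which another writer could interfere; the file name and offsets simply mirror the example above.

#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* two independent descriptors for the same file, as in the test above */
    int fd1 = open("file.txt", O_WRONLY);
    int fd2 = open("file.txt", O_WRONLY);
    if (fd1 < 0 || fd2 < 0) {
        perror("open");
        exit(1);
    }

    /* pwrite() writes at an explicit offset and leaves the file offset
       untouched, so concurrent writers cannot race between seek and write */
    if (pwrite(fd1, "testing", 7, 20) != 7)
        perror("pwrite fd1");
    if (pwrite(fd2, "condition", 9, 10) != 9)
        perror("pwrite fd2");

    close(fd1);
    close(fd2);
    return 0;
}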

Yes, they can, of course, with the following caveats:
Depending on the open() mode, one process can easily wipe the file contents (e.g. by opening with O_TRUNC).
Depending on scheduling, the order of the write operations is not deterministic.
There is (in general) no mandatory locking; careful design calls for advisory locking, as in the sketch below.
If they write in the same area using buffered I/O, the results can be non-deterministic.
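Regarding the advisory-locking caveat, here is a minimal sketch (my own illustration, not part of the original answer) of how a cooperating writer could take an fcntl() record lock on a byte range before writing to it; the file name, range, and data are placeholders:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/tmp/file", O_WRONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* advisory write lock on the first 1 MiB; it only protects against
       other processes that also take fcntl() locks on the same range */
    struct flock lk;
    memset(&lk, 0, sizeof lk);
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 1048576;

    if (fcntl(fd, F_SETLKW, &lk) < 0) {   /* F_SETLKW blocks until the range is free */
        perror("fcntl");
        return 1;
    }

    pwrite(fd, "locked write", 12, 0);    /* write while holding the lock */

    lk.l_type = F_UNLCK;                  /* release the lock */
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return 0;
}

Locks of this kind are also released automatically when the process closes the file or exits.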

Related

Checking for equal values in 2 different data frames row by row

I have 2 different data frames, one of 5.5 MB and the other of 25 GB. I want to check whether, for each row, these two data frames have the same values in 2 particular columns.
For e.g.
x 0 0 a
x 1 2 b
y 1 2 c
z 3 4 d
and
x 0 0 w
x 1 2 m
y 5 6 p
z 8 9 q
I want to check if the 2nd and 3rd columns are equal for each row; if yes, I return the 4th column from both data frames. Then I should have:
a w
b m
c m
The 2 data frames are sorted with respect to the 2nd and 3rd column values. I tried in R but the 2nd file (25 GB) is too big. How can I obtain this new file in a "faster" way (even a few hours)?
With GNU awk for arrays of arrays:
$ cat tst.awk
NR==FNR { a[$2,$3][$4]; next }
($2,$3) in a {
    for (val in a[$2,$3]) {
        print val, $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
and with any awk (a bit less efficiently):
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] FS $4; next }
($2,$3) in a {
    split(a[$2,$3],vals)
    for (i in vals) {
        print vals[i], $4
    }
}
$ awk -f tst.awk small_file large_file
a w
b m
c m
When reading small_file (NR==FNR is only true while the first file is being read; look up those variables in the awk man page), the above creates an associative array a[] that maps an index built from the concatenation of the 2nd and 3rd fields to the list of 4th-field values seen for that field combination. Then, when reading large_file, it looks that array up for the current 2nd/3rd field combination and loops through all of the values stored for that combination in the first phase, printing each stored value (the $4 from small_file) together with the current $4.
You said your small file is 5.5 MB and the large file is 25 GB. Since 1 MB is about 1,048,576 characters (see https://www.computerhope.com/issues/chspace.htm) and each of your lines is about 8 characters long, your small file is about 130 thousand lines and your large one about 134 million lines, so on an average computer I expect the above to take no more than a minute or two to run; it certainly won't take anything like an hour.
An alternative to Ed Morton's solution, based on the same idea:
$ cat tst.awk
NR==FNR { a[$2,$3] = a[$2,$3] $4 ORS; next }
($2,$3) in a {
    s = a[$2,$3]; gsub(ORS, OFS $4 ORS, s)
    printf "%s", s
}
$ awk -f tst.awk small_file large_file
a w
b m
c m

Selecting specific rows of a tab-delimited file using bash (linux)

I have a directory with a lot of tab-delimited txt files with several rows and columns, e.g.
File1
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C221C:p.D461W
3 s5 t1 c.G31T:p.G61R
File2
Id Sample Time ... Variant[Column16] ...
1 s1 t0 c.B481A:p.G861S
2 s2 t2 c.C21C:p.D61W
3 s5 t1 c.G1T:p.G1R
and what I am looking for is to create a new file with:
all the different unique variants
the number of times each variant is repeated
and the file location(s)
i.e.:
NewFile
Variant Nº of repeated Location
c.B481A:p.G861S 2 File1,File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
c.G1T:p.G1R 1 File2
I think a basic bash script using awk, sort, and uniq will work, but I do not know where to start. Or, if RStudio or Python 3 is easier, I could try that.
Thanks!!
Pure bash. Requires version 4.0+
# two associative arrays
declare -A files
declare -A count
# use a glob pattern that matches your files
for f in File{1,2}; do
  {
    read header                  # skip the header line
    while read -ra fields; do
      variant=${fields[3]}       # use index "15" for the 16th column
      (( count[$variant] += 1 ))
      files[$variant]+=",$f"
    done
  } < "$f"
done
for variant in "${!count[@]}"; do
  printf "%s\t%d\t%s\n" "$variant" "${count[$variant]}" "${files[$variant]#,}"
done
outputs
c.B481A:p.G861S 2 File1,File2
c.G1T:p.G1R 1 File2
c.C221C:p.D461W 1 File1
c.G31T:p.G61R 1 File1
c.C21C:p.D61W 1 File2
The order of the output lines is indeterminate: associative arrays have no particular ordering.
Pure bash would be hard I think but everyone has some awk lying around :D
awk 'FNR==1 { next }                  # skip the header line of each file
{
  ++n[$16]                            # count occurrences of the variant in column 16
  if ($16 in a) {
    a[$16] = a[$16] "," ARGV[ARGIND]  # append this file name (ARGIND is GNU awk)
  } else {
    a[$16] = ARGV[ARGIND]
  }
}
END {
  printf("%-24s %6s %s\n", "Variant", "Nº", "Location")
  for (v in n) printf("%-24s %6d %s\n", v, n[v], a[v])
}' *

Subtracting lines in one file from another file

I couldn't find an answer that truly subtracts one file from another.
My goal is to remove lines in one file that occur in another file.
Multiple occurrences should be respected, which means, for example, that if a line occurs 4 times in file A and only once in file B, file C should have 3 of those lines.
File A:
1
3
3
3
4
4
File B:
1
3
4
File C (desired output)
3
3
4
Thanks in advance
In awk:
$ awk 'NR==FNR{a[$0]--;next} ($0 in a) && ++a[$0] > 0' f2 f1
3
3
4
Explained:
NR==FNR {                    # for each record in the first file (f2)
    a[$0]--                  # decrement a[value] for every identical value (starting from 0)
    next
}
($0 in a) && ++a[$0] > 0     # if the record is in a, increment a[value];
                             # once past the count from the first file, output the line
If you also want to print items in f1 that do not appear in f2 at all, you can drop ($0 in a) &&:
$ echo 5 >> f1
$ awk 'NR==FNR{a[$0]--;next} (++a[$0] > 0)' f2 f1
3
3
4
5
If the input files are already sorted, as shown in the sample, comm would be better suited:
$ comm -23 f1 f2
3
3
4
option description from man page:
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do:
awk 'NR==FNR { ++cnt[$1]; next }
     cnt[$1]-- > 0 { next }
     1' f2 f1

Extract data before and after a match (big file)

I have got a big file (around 80K lines).
My main goal is to find the patterns and print, for example, 10 lines before and 10 lines after each pattern.
The pattern occurs multiple times across the file.
Using the grep command:
grep -i <my_pattern>* -B 10 -A 10 <my_file>
I get only some of the data; I think it must be something related to the buffer size...
I need a command (grep, sed, or awk) that will handle all the matches
and will print 10 lines before and after the pattern.
Example :
My patterns hide here:
a
b
c
pattern_234
c
b
a
a
b
c
pattern_567
c
b
a
This happens multiple times across the file.
Running this command:
grep -i pattern_* -B 3 -A 3 <my_file>
will give the right output:
a
b
c
c
b
a
a
b
c
c
b
It works, but not every time:
if I have 80 patterns, not all 80 will be shown.
awk to the rescue
awk -v n=4 '                # n is the number of context lines to print
{
    for (i=1; i<=n; i++)    # keep the last n lines in an indexed array
        p[i] = p[i+1]
    p[n+1] = $0
}
/pattern/ {                 # if the pattern matched
    c = n + 1               # set the counter to the after-match line count
    for (i=1; i<=n; i++)    # print the previously saved lines
        print p[i]
}
c-- > 0'                    # print the lines after the match until the counter runs out
This will print 4 lines before and 4 lines after each match of pattern; change the value of n as per your needs.
If you need a non-symmetric number of lines before and after, you need two variables:
awk -vb=2 -va=3 '{for(i=1;i<=b;i++) p[i]=p[i+1];p[b+1]=$0} /pattern/{c=a+1;for(i=1;i<=b;i++) print p[i]} c-->0'

Command line tool to listen on several FIFOs at once

I am looking for a tool to read several FIFOs at once (probably using select(2)) and output what is read, closing the stream when all the FIFOs are closed. To be more precise, the program would behave as follows:
$ mkfifo a b
$ program a b > c &
$ echo 'A' > a
$ echo 'B' > b
[1] + done program a b > c
$ cat c
A
B
$ program a b > c &
$ echo 'B' > b
$ echo 'A' > a
[1] + done program a b > c
$ cat c
B
A
My first attempt was to use cat, but the second example will not work (echo 'B' > b will hang), because cat reads from each argument in order, not simultaneously. What is the correct tool to use in this case?
tail will do that.
Use:
tail -q -n +1 a b
Edit: Sorry that didn't work. I'll see if I can find something else.
Sorry, I could not find anything.
If you don't want to program this yourself then my suggestion is multiple commands:
#!/bin/sh
rm c
cat a >> c &
cat b >> c &
wait
You may get some interleaving, but otherwise everything should work fine. The wait prevents the script from exiting until all the cat commands are done (in case you have some command you need to run after everything is finished), and the rm makes sure c starts out empty, since the cat commands append to the file.
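If you do want to program it yourself, be aware that a single-process select(2)/poll(2) loop over FIFOs is fiddlier than it looks: a FIFO opened for reading reports end-of-file whenever no writer has it open, so a naive loop would close both FIFOs before the first echo ever connects. The sketch below is my own untested illustration (not an existing tool) of the simpler route that mirrors the shell workaround above: one child per FIFO, each blocking in open() until its writer arrives and copying everything it reads to stdout.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(int argc, char **argv)
{
    /* one child per FIFO named on the command line */
    for (int i = 1; i < argc; i++) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }
        if (pid == 0) {
            int fd = open(argv[i], O_RDONLY);   /* blocks until a writer opens this FIFO */
            if (fd < 0) { perror(argv[i]); _exit(1); }
            char buf[4096];
            ssize_t r;
            while ((r = read(fd, buf, sizeof buf)) > 0)
                write(STDOUT_FILENO, buf, r);   /* forward the data to stdout */
            _exit(0);
        }
    }
    /* the parent exits only after every FIFO has been written to and closed */
    while (wait(NULL) > 0)
        ;
    return 0;
}

As with the cat version, output from different FIFOs may interleave if they are written concurrently.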
