How to remove bad lines using cmd in data.table (R)?

I have a csv.gz file whose content looks like:
bogusfile <- '1,2,3
1,2,3
2,,3
1,2,3,4
1,2,3
1,2,3
1,2,3'
I know there are only 3 columns, but sometimes an extra bogus 4th column pops up and messes up my parsing with fread.
Fortunately there is a cmd argument in fread. How can we use it to discard all the lines that contain more than 2 commas (these would be the offending rows with extra columns)?
Something like fread(cmd = ' linux magic to clean myfile.csv.gz')?
I was not able to make it work.
What do you think?
Thanks!

data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "[^,]*,[^,]*,[^,]*," Noobie.txt')
# V1 V2 V3
# 1: 1 2 3
# 2: 1 2 3
# 3: 2 NA 3
# 4: 1 2 3
# 5: 1 2 3
# 6: 1 2 3
I had to use grep -E instead of egrep because of Windows, and I had to specify the full path to grep because Rtools is not on my default path. If you are on something other than Windows, you should be able to shorten this to fread(cmd = "egrep -v ..."). (And make sure you are in the correct directory or provide a relative/absolute path to the file.)
The regex "[^,]*,[^,]*,[^,]*," is a bit literal; it can be shortened to "([^,]*,){3,}", which says:
([^,]*,) a group of non-commas followed by a comma
{3,} three or more of those groups
-v omit lines that match the pattern
so
data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "([^,]*,){3,}" Noobie.txt')
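To sanity-check the pattern before reading, the same regex can be run without -v, which shows the offending rows instead of dropping them (add -c to just count them); a small sketch reusing the same path and file name:
c:/Rtools/bin/grep.exe -E "([^,]*,){3,}" Noobie.txt
c:/Rtools/bin/grep.exe -E -c "([^,]*,){3,}" Noobie.txt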
If the data is compressed (gz), on non-Windows platforms you can choose from among:
gzip -cd filename.csv.gz | egrep -v "([^,]*,){3,}"
gunzip -c filename.csv.gz | egrep -v "([^,]*,){3,}"
zgrep -E -v "([^,]*,){3,}" filename.csv.gz
These won't work on Windows, since system and similar functionality in R on Windows does not use bash for its shell, so the | infix operator doesn't do what one expects. There might be a way to get | to work in system et al., but I don't know how to get it to work with data.table::fread(..., cmd=).
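On non-Windows platforms the whole pipeline can, in principle, be handed to fread directly; an untested sketch, using the first of the pipelines above:
data.table::fread(cmd = 'gzip -cd filename.csv.gz | grep -E -v "([^,]*,){3,}"')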
Admittedly untested since ... I'm on Windows :-(

Related

Using graph6 format with graph-tool

I have downloaded a bunch of graphs from http://users.cecs.anu.edu.au/~bdm/data/graphs.html and I want to do some analysis. I want to use the graph-tool Python module for this, but I can't find a convenient way to convert from graph6 format to a format compatible with graph-tool. There must be an easy way to do this... any help would be appreciated.
-- EDIT:
A possible solution is to convert from g6 to gt format... but I haven't found any tools that do this.
The graph6 format looks annoying to work with, but fortunately the documentation mentions a tool named showg for pretty-printing graphs. It's easy to simply parse the output of that program.
First, build the showg tool. (Use clang or gcc as appropriate for your system. Or just download the binary they provide on their website.)
$ curl -s http://users.cecs.anu.edu.au/%7Ebdm/data/showg.c > showg.c
$ clang -o showg showg.c
$ ./showg --help
Download some example data and look at it. I think the -e option produces the easiest output to work with.
$ curl -s http://users.cecs.anu.edu.au/%7Ebdm/data/graph4.g6 > graph4.g6
$ ./showg -p10 -e graph4.g6
Graph 10, order 4.
4 5
0 2 0 3 1 2 1 3 2 3
Here's a simple script that reads the edge list from ./showg -p<N> -e and creates a graph_tool.Graph object:
# load_graph.py
import sys
import graph_tool as gt
# Read stdin and parse last line as a list of edge pairs
line = sys.stdin.readlines()[-1]
nodes = [int(n) for n in line.split()]
n0 = nodes[0::2]
n1 = nodes[1::2]
edges = list(zip(n0, n1))
# Load graph
g = gt.Graph()
g.add_edge_list(edges)
print("Loaded graph with the following edges:")
print(g.get_edges())
Let's give it a try:
$ ./showg -p10 -e graph4.g6 | python load_graph.py
Loaded graph with the following edges:
[[0 2]
[0 3]
[1 2]
[1 3]
[2 3]]
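If every graph in the file is needed rather than just number 10, the same showg/Python pipeline can be looped over the graph indices; a rough sketch, assuming graph4.g6 holds the 11 graphs on 4 vertices (adjust the range to whatever file you downloaded):
# loop over each graph index and load it one at a time
for i in $(seq 1 11); do
    ./showg -p"$i" -e graph4.g6 | python load_graph.py
done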

List a count of similar filenames in a folder

I have a folder which contains multiple files of a similar name (differentiated by a datetime in the filename)
I would like to be able to get a count of each file group/type in that folder.
i.e.
file1_25102019_111402.csv
file1_24102019_111502.csv
file1_23102019_121402.csv
file1_22102019_101402.csv
file2_25102019_161404.csv
file2_24102019_131205.csv
file2_23102019_121306.csv
I need to be able to return something like this;
file1 4
file2 3
Ideally the answer would be something like "count of files whose first x characters are ABCD"
The filenames can be anything. The date part in the example was just to demonstrate that the filenames begin with similar text but are distinguished by "something" further along in the name (in this case a date).
So I want to be able to group them by the first X characters in the filename,
i.e. I want to be able to say, "give me counts of all files grouped by the first 4 characters, or the first 5 characters, etc."
In SQL I'd do something like this:
select substr(object_name,1,5),
count(*)
from all_objects
group by substr(object_name,1,5)
Edited to show more examples;
File1weifwoeivnw
File15430293fjwnc
File15oiejfiwem
File2sidfsfe
File29fu09f4n
File29ewfoiwwf
File22sdiufsnvfvs
Pseudo Code:
Example 1:
ls count of first 4 characters
Output
File 7
Example 2:
ls count of first 5 characters
Output
File1 3
File2 4
Example 3
ls count of first 6 characters
Output
File1w 1
File15 2
File2s 1
File29 2
File22 1
If you want to extract the first 5 characters you can use
ls | cut -c1-5 | sort | uniq -c | awk '{ print $2,$1 }'
which prints for the first example from the question
file1 4
file2 3
If you want to have a different number of characters, change the cut command as necessary, e.g. cut -c1-6 for the first 6 characters.
If you want to separate the fields with a TAB character instead of a space, change the awk command to
awk -vOFS=\\t '{ print $2,$1 }'
This would result in
file1 4
file2 3
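The counting can also be done in a single awk pass, without cut and sort; a sketch for the 5-character case (change the substr length as needed):
ls | awk '{ cnt[substr($0, 1, 5)]++ } END { for (k in cnt) print k, cnt[k] }'
Note that the output order of the for (k in cnt) loop is arbitrary; append | sort if a sorted listing is needed.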
Other solutions that work with the first example (file names with a date and time string), but not with the additional examples added later:
With your first example files, the command
ls | sed 's/_[0-9]\{8\}_[0-9]\{6\}/_*/' | sort | uniq -c
prints
4 file1_*.csv
3 file2_*.csv
Explanation:
The sed command replaces the sequence of a _, 8 digits, another _ and another 6 digits with _*.
With your first example file names, you will get file1_*.csv 4 times and file2_*.csv 3 times.
sort sorts the lines.
uniq -c counts the number of subsequent lines that are equal.
Or if you want to strip everything from the first _ up to the end, you can use
ls | sed 's/_.*//' | sort | uniq -c
which will print
4 file1
3 file2
You can add the awk command from the first solution to change the output format.
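For example, a direct combination of the two pipelines above:
ls | sed 's/_.*//' | sort | uniq -c | awk '{ print $2,$1 }'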

Extract column using grep

I have a data frame with >100 columns each labeled with a unique string. Column 1 represents the index variable. I would like to use a basic UNIX command to extract the index column (column 1) + a specific column string using grep.
For example, if my data frame looks like the following:
Index A B C...D E F
p1 1 7 4 2 5 6
p2 2 2 1 2 . 3
p3 3 3 1 5 6 1
I would like to use some command to extract only a column "X", which I will specify with grep, and display both column 1 and the column I grepped. I know that I can use cut -f1 myfile for the first part, but need help with selecting the other column via grep. As a more concrete example, if my grep phrase were "B", I would like the output to be:
Index B
p1 7
p2 2
p3 3
I am new to UNIX, and have not found much in similar examples. Any help would be much appreciated!!
You need to use awk:
awk '{print $1,$3}' <namefile>
This simple command prints the first ($1) and third ($3) columns of the file. awk is actually much more powerful; I think you should have a look at its man page.
A nice combo is using grep and awk with a pipe. The following code will print column 1 and 3 of only the lines of your file that contain 'p1':
grep 'p1' <namefile> | awk '{print $1,$3}'
If, instead, you want to select lines by line number you can replace grep with sed:
sed -n 1p <namefile> | awk '{print $1,$3}'
Actually, awk can be used alone in all the examples:
awk '/p1/{print $1,$3}' <namefile> # will print only lines containing p1
awk '{if(NR == 1){print $1,$3}}' <namefile> # Will print only first line
First figure out the command to find the column number.
columnname=C
sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t*]//g' | wc -c
Once you know the number, use cut
cut -f1,3 < datafile
Combine into one command
cut -f1,$(sed -n "1 s/${columnname}.*//p" datafile |
sed 's/[^\t]//g' | wc -c) < datafile
Finished? No, you should improve the first sed command when one header can be a substring of another header: include tabs in your match and put the tabs back in the replacement string.
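A possible sketch of that improvement, assuming GNU sed and that the wanted column is not the first one; matching the name between tabs prevents a header that merely starts with the same characters from being matched, and the trailing group keeps it working for the last column:
columnname=C
colnum=$(sed -n "1 s/\t${columnname}\(\t.*\)*$/\t/p" datafile |
    sed 's/[^\t]//g' | wc -c)
cut -f1,"$colnum" < datafile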
If you want to retain the first column and a column which contains a specific string in the first row (e.g. B), then this should work. It assumes that your string is only present once, though.
awk '{if(NR==1){c=0;for(i=1;i<=NF;i++){c++;if($i=="B"){n=c}}}; print $1,$n}' myfile.txt
There is probably a better solution with amazing awk, but this should work.
EXPLANATION: In the first row (NR==1), it iterates through all columns (for(i=1;i<=NF;i++)) until it finds the string and saves its column number; that column, together with the first one, is then printed for every line.
If you want to pass the string as a variable then you can use the -v option.
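For example, a sketch passing the header name in via -v (colname is just an illustrative variable name):
awk -v colname="B" '{if(NR==1){for(i=1;i<=NF;i++){if($i==colname){n=i}}}; print $1,$n}' myfile.txt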

grep command to count the number of lines which match certain parameters and do not match certain parameters

I know how to count lines using grep which match certain parameters, and lines which do not match certain parameters, separately. But how can we combine this functionality?
For Example:
Name Date Value
A 04-08-2014 1000
B 04-08-2014 2000
C 04-06-2014 3000
D 04-06-2014 1000
2000 04-08-2014 5000
So I am looking for a command which counts the number of lines that match the Date 04-08-2014 but do not have the Value 2000.
Note: 2000 should be part of the "Value" column and the dates should be in the "Date" column.
Any help would be highly appreciated !
Using awk:
$ awk '$2=="04-08-2014" && $3!="2000" {count++}END{print count}' file
2
What about:
cat file | grep 04-08-2014 | grep -c -v 2000
or:
grep "04-08-2014" file | grep -vc "2000"
Note that these pipelines match the strings anywhere on the line, so they would also drop the row whose Name is 2000; the awk solution above tests the Value column specifically.

Awk, compare 2 files and write out incomparables

I usually use
awk BEGIN {FS=" "} NR==FNR{arry[$1]=$0; next} $1 in array && $0=arr[$1] FS in fields infile1 infile2 > outfile
to extract common fields in 2 files based on field of interest. But this time I need incomparables. I have 2 files with equal lines but 200 lines in the second file do not have the same coding as in file1.
I tried to :
paste f1 f2 | sort -n -k1,2
by both fields, hoping to get $1==$2 and take the unequal fields, but I don't get $1==$2 even where there should be a match.
How can I do this?
Since you seem to compare by the first field, and since I don't know what your data files look like, I am going to attempt this blindly:
$ cat data1.txt
dana 100
john 101
fiona 102
$ cat data2.txt
dana 100
john 501
fiona 102
$ cat data[12].txt | sort | uniq -u
john 101
john 501
The above solution prints out the lines that do not appear in both files (uniq -u keeps only lines that occur exactly once). Since I don't fully understand your data files, I am going to ask: does the following solve your problem?
diff data1.txt data2.txt
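If the comparison should instead be keyed on the first field rather than the whole line, a hedged awk sketch (it prints the lines of data2.txt whose key is missing from data1.txt or whose content differs):
awk 'NR==FNR {a[$1]=$0; next} !($1 in a) || a[$1] != $0' data1.txt data2.txt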

Resources