I have downloaded a bunch of graphs from http://users.cecs.anu.edu.au/~bdm/data/graphs.html and I want to do some analysis. I want to use the graph-tool Python module for this, but I can't find a convenient way to convert from graph6 format to a format compatible with graph-tool. There must be an easy way to do this... any help would be appreciated.
-- EDIT:
A possible solution is to convert from g6 to gt format... but I haven't found any tools that do this.
The graph6 format looks annoying to work with, but fortunately the documentation mentions a tool named showg for pretty-printing graphs. It's easy to parse the output of that program.
First, build the showg tool. (Use clang or gcc as appropriate for your system. Or just download the binary they provide on their website.)
$ curl -s http://users.cecs.anu.edu.au/%7Ebdm/data/showg.c > showg.c
$ clang -o showg showg.c
$ ./showg --help
Download some example data and look at it. I think the -e option produces the easiest output to work with.
$ curl -s http://users.cecs.anu.edu.au/%7Ebdm/data/graph4.g6 > graph4.g6
$ ./showg -p10 -e graph4.g6
Graph 10, order 4.
4 5
0 2 0 3 1 2 1 3 2 3
Here's a simple script that reads the edge list from ./showg -p<N> -e and creates a graph_tool.Graph object:
# load_graph.py
import sys
import graph_tool as gt
# Read stdin and parse last line as a list of edge pairs
line = sys.stdin.readlines()[-1]
nodes = [int(n) for n in line.split()]
n0 = nodes[0::2]
n1 = nodes[1::2]
edges = list(zip(n0, n1))
# Load graph
g = gt.Graph()
g.add_edge_list(edges)
print("Loaded graph with the following edges:")
print(g.get_edges())
Let's give it a try:
$ ./showg -p10 -e graph4.g6 | python load_graph.py
Loaded graph with the following edges:
[[0 2]
[0 3]
[1 2]
[1 3]
[2 3]]
I have a csv.gz file whose content looks like:
bogusfile <- '1,2,3
1,2,3
2,,3
1,2,3,4
1,2,3
1,2,3
1,2,3'
I know there are only 3 columns, but sometimes an extra bogus 4th column pops up and messes up my parsing with fread.
Fortunately there is a cmd argument in fread. How can we use it to discard all the lines that contain more than 2 commas (these would be the offending rows with extra columns)?
Something like fread(cmd = ' linux magic to clean myfile.csv.gz')?
I was not able to make it work.
What do you think?
Thanks!
data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "[^,]*,[^,]*,[^,]*," Noobie.txt')
# V1 V2 V3
# 1: 1 2 3
# 2: 1 2 3
# 3: 2 NA 3
# 4: 1 2 3
# 5: 1 2 3
# 6: 1 2 3
I had to use grep -E instead of egrep because of Windows ... and I had to specify the full path to grep because Rtools is not in my default path. If you are on something other than Windows, you should be able to shorten this to fread(cmd="egrep -v ..."). (And make sure you are in the correct directory or provide a relative/absolute path to the file.)
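For example, on a non-Windows system the shortened call would look something like this (just a sketch of the idea, untested here, reusing the file name from above):
# same filter as above, assuming egrep is on the PATH
data.table::fread(cmd = 'egrep -v "[^,]*,[^,]*,[^,]*," Noobie.txt')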
The regex "[^,]*,[^,]*,[^,]*," is a bit literal; it can be shortened to "([^,]*,){3,}", which says
([^,]*,) a group of non-commas followed by a comma
{3,} three or more
-v omit lines that match the pattern
so
data.table::fread(cmd = 'c:/Rtools/bin/grep.exe -E -v "([^,]*,){3,}" Noobie.txt')
If the data is compressed (gz), on non-Windows platforms you can choose from among:
gzip -cd filename.csv.gz | egrep -v "([^,]*,){3,}"
gunzip -c filename.csv.gz | egrep -v "([^,]*,){3,}"
zgrep -E -v "([^,]*,){3,}" filename.csv.gz
It won't work on Windows since system and similar functionality in R on Windows does not use bash for its shell, so the | infix operator doesn't do what one expects. There might be a way to get | to work in system et al, but I don't know how to get it to work with data.table::fread(..., cmd=).
Admittedly untested since ... I'm on Windows :-(
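That said, on a non-Windows system, where cmd is handed to a POSIX shell, the decompression and the filter should be combinable in one call. A sketch along those lines, equally untested here (filename.csv.gz is a stand-in for the real file):
# non-Windows sketch: decompress and drop rows with 3 or more commas in one cmd pipeline
data.table::fread(cmd = 'gzip -cd filename.csv.gz | egrep -v "([^,]*,){3,}"')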
I have a problem.
I need to import data in R but the separator is ",".
Not just a comma but a comma surrounded by two quotes.
But if I put it in as the separator, my command contains:
"DownloadFormat"="","".
And R does not understand it. How can I escape this separator?
1) readLines/gsub Questions to SO on R should include a complete verifiable example. Without one, we provide our own in the Note at the end. The code may need to be modified depending on the actual data. First read the data line by line using readLines and remove all double quotes. Then re-read it using read.csv.
L <- gsub('"', '', readLines("hugo.dat"))
DF <- read.csv(text = L)
DF
giving:
a b c d
1 1 2 3 4
2 13 14 15 16
2) pipe/sed Another possibility is the one-liner:
read.csv(pipe("sed -e 's/\"//g' hugo.dat"))
On Windows be sure that you have Rtools installed and that C:\Rtools\bin is on your Windows PATH (assuming the default Rtools installation directory). Although this worked for me on both straight Windows and on Linux using bash, you might need to modify it slightly depending on what shell you use, due to differences in how different shells deal with escaping and quoting.
Note
Lines <- 'a","b","c","d
1","2","3","4
13","14","15","16'
cat(Lines, "\n", file = "hugo.dat")
Using @G.Grothendieck's example hugo.dat file, we can add missing quotes, and read as CSV:
read.csv(textConnection(paste0('"', readLines("hugo.dat"), '"')))
# a b c d
# 1 1 2 3 4
# 2 13 14 15 16
Once again, I am having a great time with Notebook and the emerging rmagic infrastructure, but I have another question about the bridge between the two. Currently I am attempting to pass several subsets of a pandas DataFrame to R for visualization with ggplot2. Just to be clear upfront, I know that I could pass the entire DataFrame and perform additional subsetting in R. My preference, however, is to leverage the data management capability of Python and the subset-wise operations I am performing are just easier and faster using pandas than the equivalent operations in R. So for the sake of efficiency and morbid curiosity...
I have been trying to figure out if there is a way to push several objects at once. The wrinkle is that sometimes I don't know in advance how many items will need to be pushed. To retain flexibility, I have been populating dictionaries with DataFrames throughout the front end of the script. The following code provides a reasonable facsimile of what I am working through (I have not converted via com.convert_to_r_dataframe for simplicity, but my real code does take this step):
import numpy as np
import pandas as pd
from pandas import DataFrame
%load_ext rmagic
d1 = DataFrame(np.arange(16).reshape(4, 4))
d2 = DataFrame(np.arange(20).reshape(5, 4))
d_list = [d1, d2]
names = ['n1', 'n2']
d_dict = dict(zip(names, d_list))
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
%Rpush n1
As can be seen, I can assign a static name and push the DataFrame into the R namespace individually (as well as in a 'list' >> %Rpush n1 n2). What I cannot do is something like the following:
for name in d_dict.keys():
    %Rpush d_dict[name]
That snippet raises an exception >> KeyError: u'd_dict[name]'. I also tried to deposit the dynamically named DataFrames in a list, but the list entries end up holding the data themselves rather than references to the named objects:
df_list = []
for name in d_dict.keys():
    exec '%s=d_dict[name]' % name
    exec 'df_list.append(%s)' % name
print df_list
for df in df_list:
    %Rpush df
[ 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15,
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19]
%Rpush did not throw an exception when I looped through the list's contents, but the DataFrames could not be found in the R namespace. I have not been able to find much discussion of this topic beyond talk about the conversion of lists to R vectors. Any help would be greatly appreciated!
Rmagic's push uses the name that you give it both to look up the Python variable, and to name the R variable it creates. So it needs a valid name, not just any expression, on both sides.
There's a trick you can do to get the name from a Python variable:
d1=DataFrame(np.arange(16).reshape(4,4))
name = 'd1'
%Rpush {name}
# equivalent to %Rpush d1
But if you want to do more advanced things, it's best to get hold of the r object and use that to put your objects into R. Rmagic is just a convenience wrapper over rpy2, which is a full API. So you can do:
from rpy2.robjects import r
r.assign('a', 1)
You can mix and match which interface you use - rmagic and rpy2 are talking to the same instance of R.
I use R for most of my statistical analysis. However, cleaning/processing data, especially when dealing with sizes of 1Gb+, is quite cumbersome, so I use common UNIX tools for that. But my question is: is it possible to, say, run them interactively in the middle of an R session? An example: Let's say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can easily be extracted through cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 <file1 | awk something something >file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut --fields=1,2 < data.txt | awk something_else"))
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or to not even read the unwanted fields assuming there are 4 fields:
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can be used. In this example we assume a csv file, data.csv, and that the desired fields are called a and b. If it's not a csv file then use appropriate arguments to read.csv.sql to specify another separator, etc.:
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
I think you may be looking for littler, which integrates R into Unix command-line pipelines.
Here is a simple example computing the file size distribution of /bin:
edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4 5736 23580 61180 55820 1965000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
1 | 01111112233459
2 | 3
3 | 15
4 |
5 |
6 |
7 |
8 |
9 | 5
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 6
edd@max:~/svn/littler/examples$
and all it takes for that is three lines:
edd@max:~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
See ?system for how to run shell commands from within R.
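For instance, a minimal sketch of the workflow from the question, run without leaving the R session (result1 stands in for the output of the first R step, and the awk condition $2 > 0 is only an illustrative placeholder for the real filter):
# write the result of the first R step to file1 (tab-separated, so cut's default delimiter works)
write.table(result1, "file1", sep = "\t", row.names = FALSE, col.names = FALSE)
# run the shell step from within R
system("cut --fields=1,2 < file1 | awk '$2 > 0' > file2")
# pick up file2 for the next R step
dat2 <- read.table("file2")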
Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:
You can combine several different programming languages in one script and execute them separately, in sequence, export the results or the code, ...
It is a little bit like Sweave, only that the code blocks can be python, bash, R, sql, and numerous others. Check it out: org-mode and babel and an example using different programming languages
Apart from that, I think org-mode and babel are the perfect way of writing even pure R scripts.
Preparing data before working with it in R is quite common, and I have a lot of Unix and Perl scripts for pre-processing; at various times I have also maintained pre-processing scripts/programs for MySQL, MongoDB, Hadoop, C, etc.
However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory-mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.
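In case the bigmemory pointer is useful, here is a minimal sketch (big.csv, big.bin and big.desc are hypothetical file names; this assumes a purely numeric CSV):
library(bigmemory)
# file-backed, memory-mapped matrix: the data stays on disk and is paged in as needed
x <- read.big.matrix("big.csv", header = TRUE, type = "double",
                     backingfile = "big.bin", descriptorfile = "big.desc")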
I usually use
awk 'BEGIN {FS=" "} NR==FNR {arr[$1]=$0; next} ($1 in arr) && ($0 = $0 FS arr[$1])' infile1 infile2 > outfile
to extract common fields in 2 files based on a field of interest. But this time I need the non-matching records: I have 2 files with the same number of lines, but 200 lines in the second file do not have the same coding as in file1.
I tried:
paste f1 f2 | sort -n -k1,2
sorting by both fields, hoping to get rows where $1==$2 so I could pick out the unequal ones, but I don't get $1==$2 even where there should be a match.
How can I do this?
Since you seem to compare by the first field and since I don't know what your data files look like, I am going to blindly attempt this:
$ cat data1.txt
dana 100
john 101
fiona 102
$ cat data2.txt
dana 100
john 501
fiona 102
$ cat data[12].txt|sort|uniq -u
john 101
john 501
The above solution prints out lines that are not the same, based on the first field. Since I don't fully understand your data files, I am going to ask this: does the following solve your problem?
diff data1.txt data2.txt