I use R for most of my statistical analysis. However, cleaning and processing data, especially at sizes of 1 GB and up, is quite cumbersome in R, so I use common UNIX tools for that. My question is: is it possible to run them interactively in the middle of an R session? An example: let's say file1 is the output dataset from an R process, with 100 rows. From this, for my next R process, I need a specific subset of columns 1 and 2, file2, which can be easily extracted with cut and awk. So the workflow is something like:
Some R process => file1
cut --fields=1,2 < file1 | awk 'something something' > file2
Next R process using file2
Apologies in advance if this is a foolish question.
Try this (adding other read.table arguments if needed):
# 1
DF <- read.table(pipe("cut -f1,2 < data.txt | awk 'something_else'"))
or in pure R:
# 2
DF <- read.table("data.txt")[1:2]
or, to avoid reading the unwanted fields at all (assuming there are 4 fields):
# 3
DF <- read.table("data.txt", colClasses = c(NA, NA, "NULL", "NULL"))
The last line could be modified for the case where we know we want the first two fields but don't know how many other fields there are:
# 3a
n <- count.fields("data.txt")[1]
read.table("data.txt", header = TRUE, colClasses = c(NA, NA, rep("NULL", n-2)))
The sqldf package can also be used. In this example we assume a csv file, data.csv, and that the desired fields are called a and b. If it's not a csv file, then pass appropriate arguments to read.csv.sql to specify a different separator, etc.:
# 4
library(sqldf)
DF <- read.csv.sql("data.csv", sql = "select a, b from file")
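(A note on how this works: read.csv.sql reads the file into a temporary SQLite database, runs the query there, and returns only the selected columns to R, so the full file never has to pass through read.table.)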
I think you may be looking for littler, which integrates R into Unix command-line pipelines.
Here is a simple example computing the file size distribution of /bin:
edd@max:~/svn/littler/examples$ ls -l /bin/ | awk '{print $5}' | ./fsizes.r
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4 5736 23580 61180 55820 1965000 1
The decimal point is 5 digit(s) to the right of the |
0 | 00000000000000000000000000000000111111111111111111111111111122222222+36
1 | 01111112233459
2 | 3
3 | 15
4 |
5 |
6 |
7 |
8 |
9 | 5
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 | 6
edd@max:~/svn/littler/examples$
and all it takes for that is three lines:
edd@max:~/svn/littler/examples$ cat fsizes.r
#!/usr/bin/r -i
fsizes <- as.integer(readLines())
print(summary(fsizes))
stem(fsizes)
See ?system for how to run shell commands from within R.
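A minimal sketch of that pattern, assuming the file1/file2 names from the question and a placeholder awk program:

# run the shell pipeline from inside the R session
system("cut -f1,2 file1 | awk '{print $0}' > file2")
DF <- read.table("file2")
# or capture a command's output directly as a character vector
n_lines <- system("wc -l < file2", intern = TRUE)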
Staying in the tradition of literate programming, using e.g. org-mode and org-babel will do the job perfectly:
You can combine several different programming languages in one script and execute them separately or in sequence, export the results or the code, and more.
It is a little bit like Sweave, only that the code blocks can be python, bash, R, sql, and numerous others. Check it out: org-mode and babel, and an example using different programming languages.
Apart from that, I think org-mode and babel is the perfect way of writing even pure R scripts.
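For instance, a minimal org-babel sketch (the file names are hypothetical) that chains a shell block into an R block:

#+begin_src sh :results silent
cut -f1,2 data.txt > data2.txt
#+end_src

#+begin_src R :results value
head(read.table("data2.txt"))
#+end_src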
Preparing data before working with it in R is quite common, and I have a lot of scripts for Unix and Perl pre-processing, and have, at various times, maintained scripts/programs for MySQL, MongoDB, Hadoop, C, etc. for pre-processing.
However, you may get better mileage for portability if you do some kinds of pre-processing in R. You might try asking new questions focused on some of these particulars. For instance, to load large amounts of data into memory mapped files, I seem to evangelize bigmemory. Another example is found in the answers (especially JD Long's) to this question.
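As a hedged sketch of that bigmemory route (the file name, separator, and type here are assumptions, not from the original):

library(bigmemory)
# Map a large whitespace-delimited file into a file-backed big.matrix,
# so it is memory-mapped rather than held entirely in RAM; later sessions
# can re-attach via the descriptor file instead of re-reading the data.
x <- read.big.matrix("data.txt", sep = " ", type = "double",
                     backingfile = "data.bin", descriptorfile = "data.desc")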
Related
I would like to get the number of individuals in each population, in the order in which the populations are read in, from a vcf file. The fields of my file look like this:
##fileformat=VCFv4.2
##fileDate=20180425
##source="Stacks v1.45"
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=AD,Number=1,Type=Integer,Description="Allele Depth">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihood">
##INFO=<ID=locori,Number=1,Type=Character,Description="Orientation the corresponding Stacks locus aligns in">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT CHALIFOUR_2003_ChHis-1 CHALIFOUR_2003_ChHis-13 CHALIFOUR_2003_ChHis-14 CHALIFOUR_2003_ChHis-15
un 1027 13_65 C T . PASS NS=69;AF=0.188;locori=p GT:DP:AD 0/1:16:9,7 0/0:39:39,0 0/0:17:17,0 0/0:39:39,0
See the example file linked here: vcf file
For example, in the file that I have linked to, I have two populations, Chalifour 2003 and Chalifour 2015. Individuals have a prefix "CHALIFOUR_2003..." that identifies this.
I would like to be able to extract something like:
Chalifour_2003* 35
Chalifour_2015* 45
With the "35" and "45" indicating the number of individuals in each population (though these numbers are made up). I don't care at all about the format of the output, I just need the numbers, and it is important that the populations are listed in the order in which they would be read into the file.
Any suggestions for avenues to try to get this information would be much appreciated.
Using the data.table package to read in the vcf file, you can do the following:
library(data.table)
# fread's automatic skip detection starts reading at the #CHROM header row
df <- fread("~/Downloads/ChaliNoOddsWithOuts.vcf")
# the first 9 columns of a VCF are fixed; the rest are the sample names
samples <- colnames(df)[-c(1:9)]
# strip the per-individual suffix, keep the POPULATION_YEAR prefix, and count
table(gsub("(.*_.*)_.*", "\\1", samples))
If you don't insist on using R, then this is a one-liner in bash that does the job:
grep "#CHROM" file.vcf | tr '\t' '\n' | tail -n +10 | cut -f1,2 -d'_' | uniq -c
(grep pulls out the header row, tr puts one column name per line, tail drops the nine fixed VCF columns, cut keeps the POPULATION_YEAR prefix, and uniq -c counts each consecutive run, preserving file order.)
Once again, I am having a great time with the Notebook and the emerging rmagic infrastructure, but I have another question about the bridge between the two. Currently I am attempting to pass several subsets of a pandas DataFrame to R for visualization with ggplot2. Just to be clear up front, I know that I could pass the entire DataFrame and perform the additional subsetting in R. My preference, however, is to leverage the data management capability of Python; the subset-wise operations I am performing are just easier and faster in pandas than the equivalent operations in R. So, for the sake of efficiency and morbid curiosity...
I have been trying to figure out if there is a way to push several objects at once. The wrinkle is that sometimes I don't know in advance how many items will need to be pushed. To retain flexibility, I have been populating dictionaries with DataFrames throughout the front end of the script. The following code provides a reasonable facsimile of what I am working through (I have not converted via com.convert_to_r_dataframe for simplicity, but my real code does take this step):
import numpy as np
import pandas as pd
from pandas import DataFrame
%load_ext rmagic
d1 = DataFrame(np.arange(16).reshape(4,4))
d2 = DataFrame(np.arange(20).reshape(5,4))
d_list = [d1, d2]
names = ['n1', 'n2']
d_dict = dict(zip(names, d_list))
for name in d_dict.keys():
    exec '%s = d_dict[name]' % name
%Rpush n1
As can be seen, I can assign a static name and push the DataFrame into the R namespace individually (as well as in a 'list' >> %Rpush n1 n2). What I cannot do is something like the following:
for name in d_dict.keys():
    %Rpush d_dict[name]
That snippet raises an exception: KeyError: u'd_dict[name]'. I also tried to deposit the dynamically named DataFrames in a list, but the list references end up pointing to the data rather than to the object references:
df_list = []
for name in d_dict.keys():
    exec '%s = d_dict[name]' % name
    exec 'df_list.append(%s)' % name
print df_list
for df in df_list:
    %Rpush df
[ 0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15,
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19]
%Rpush did not throw an exception when I looped through the list's contents, but the DataFrames could not be found in the R namespace. I have not been able to find much discussion of this topic beyond talk about the conversion of lists to R vectors. Any help would be greatly appreciated!
Rmagic's push uses the name that you give it both to look up the Python variable, and to name the R variable it creates. So it needs a valid name, not just any expression, on both sides.
There's a trick you can do to get the name from a Python variable:
d1 = DataFrame(np.arange(16).reshape(4,4))
name = 'd1'
%Rpush {name}
# equivalent to %Rpush d1
But if you want to do more advanced things, it's best to get hold of the r object and use that to put your objects in. Rmagic is just a convenience wrapper over rpy2, which is a full API. So you can do:
from rpy2.robjects import r
r.assign('a', 1)
You can mix and match which interface you use - rmagic and rpy2 are talking to the same instance of R.
I know it's not a code-level question, but I wanted your views.
I need to perform "prediction analysis" at the UNIX level using a time series model (like ARIMA).
We have implemented the same in R, but my work environment does not support R.
Data snapshot
Year | Month | Data1 | Data2 | Data3
2012 | Jan   | 1     | 1     | 3
2012 | Feb   | 2     | 21    | 4
So I want to implement some algorithm that will help me find the predicted values for future months.
Is there any other way of implementing "time series prediction analysis" in UNIX (preferably Perl/shell)?
Since you are interested in Perl and statistics, I'm sure you are aware of PDL. There are some specific time series statistics modules available and, of course, since Perl is involved, other CPAN modules can be used.
R is still king and has a lot of packages to choose from, and, lucky for us, R and Perl play nicely together using Statistics::R. I've not tried using Statistics::R from the PDL shell, but this too may be possible to some extent.
Here's a pdl example session using MVA
/home/zombiepl % pdl
pdl> use Statistics::MVA::MultipleRegression;
pdl> $lol = [ [qw/745 36 66/],
[qw/895 37 68/],
[qw/442 47 64/],
[qw/440 32 53/],
[qw/1598 1 101/],];
pdl> linear_regression($lol);
The coefficients are: B[0] = -281.426985090045, B[1] = -7.61102966577879,
B[2] = 19.0102910918022.
R^2 is 0.943907302962818
Cheers and good luck with your project.
I usually use
awk 'BEGIN {FS=" "} NR==FNR {arr[$1]=$0; next} $1 in arr {print arr[$1], $0}' infile1 infile2 > outfile
to extract the common lines of 2 files based on the field of interest. But this time I need the non-matching lines. I have 2 files with an equal number of lines, but 200 lines in the second file do not have the same coding as in file1.
I tried:
paste f1 f2 | sort -n -k1,2
sorting by both fields, hoping to get $1 == $2 so I could take the unequal fields, but I don't get $1 == $2 even where there should be matches.
How can I do this?
Since you seem to compare by the first field, and since I don't know what your data files look like, I am going to make a blind attempt at this:
$ cat data1.txt
dana 100
john 101
fiona 102
$ cat data2.txt
dana 100
john 501
fiona 102
$ cat data[12].txt | sort | uniq -u
john 101
john 501
The above solution prints out the lines that are not the same, based on the first field. Since I don't fully understand your data files, I am going to ask this: does the following solve your problem?
diff data1.txt data2.txt
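If you would rather stay in R for this step, here is an equivalent hedged sketch (same data1.txt/data2.txt as above), swapping in readLines for the shell pipeline:

# read both files as character vectors, one element per line
f1 <- readLines("data1.txt")
f2 <- readLines("data2.txt")
# the files have equal line counts, so compare positionally
mismatch <- f1 != f2
data.frame(file1 = f1[mismatch], file2 = f2[mismatch])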
I am interested in (functional) vector manipulation in R. Specifically, what are R's equivalents to Perl's map and grep?
The following Perl script greps the even array elements and multiplies them by 2:
@a1 = (1..8);
@a2 = map {$_ * 2} grep {$_ % 2 == 0} @a1;
print join(" ", @a2);
# 4 8 12 16
How can I do that in R? I got this far, using sapply for Perl's map:
> a1 <- c(1:8)
> sapply(a1, function(x){x * 2})
[1] 2 4 6 8 10 12 14 16
Where can I read more about such functional array manipulations in R?
Also, is there a Perl to R phrase book, similar to the Perl Python Phrasebook?
Quick ones:
Besides sapply, there are also lapply(), tapply(), by(), aggregate() and more in base R. Then there are loads of add-on packages on CRAN, such as plyr.
For basic functional programming as in other languages: Reduce(), Map(), Filter(), ... all of which are on the same help page; try help(Reduce) to get started (see the sketch after this list).
As noted in the earlier answer, vectorisation is even more appropriate here.
As for grep, R actually has three regexp engines built-in, including a Perl-based version from libpcre.
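A quick sketch of those functional tools and of R's grep family (the vector a1 mirrors the question's example):

a1 <- 1:8
Filter(function(x) x %% 2 == 0, a1)    # like Perl's grep: 2 4 6 8
unlist(Map(function(x) x * 2, a1))     # like Perl's map (Map returns a list)
Reduce(`+`, a1)                        # a fold: 36
grep("o", month.name, value = TRUE)    # matches returned as strings
grepl("er$", month.name, perl = TRUE)  # logical vector via the PCRE engine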
You seem to be missing a few things that R does have. I'd suggest a good recent book on R and the S language; my recommendation would be Chambers (2008), "Software for Data Analysis".
R has "grep", but it works entirely different than what you're used to. R has something much better built in: it has the ability to create array slices with a boolean expression:
a1 <- c(1:8)
a2 <- a1 [a1 %% 2 == 0]
a2
[1] 2 4 6 8
For map, you can apply a function as you did above, but it's much simpler to just write:
a2 * 2
[1] 4 8 12 16
Or in one step:
a1[a1 %% 2 == 0] * 2
[1] 4 8 12 16
I have never heard of a Perl-to-R phrase book; if you ever find one, let me know! In general, R has less documentation than either Perl or Python, because it's such a niche language.