Gnuplot: How do I skip columns in matrix input to plot? - plot

I have data file of the form:
unimportant1 unimportant2 unimportant3 matrixdata[i]
1e4 2e5 3e2 1 2 3 4 5
2e3 1e1 7e3 5 4 3 2 1
... ... ... ...
2e3 1e4 4e2 4 4 4 4 4
So it has column headers (here "unimportant1" to "unimportant3") as the first row. I want gnuplot to ignore the first three unimportant columns, i.e. the data entries in exponential notation, and plot the matrixdata as a matrix, as if I had done it like this:
#!/usr/bin/gnuplot -p
plot '-' matrix with image
1 2 3 4 5
5 4 3 2 1
...
4 4 4 4 4
e
How do I get gnuplot to ignore the first three columns and the header row and plot the rest as a matrix image? For compatibility, I would prefer a gnuplot built-in to do that, but I could write a shell script and use the plot '< ...' syntax to preprocess the data file.
Edit: So neuhaus' answer almost solved it. The only thing I'm missing is how to ignore the first row (line) with the text header data. The every option seems to expect numeric data, so the whole plot fails because the input is not a matrix. I don't want to comment out the first line, as I'm using the unimportant data sets for other 2D plots that, in turn, use the header data.
So how do I skip a row in a matrix plot that already uses every to skip columns?

When using matrix, gnuplot must first parse the data file before it can skip rows and columns. Your first row evaluates to four invalid numbers while the second row has eight numbers, so I get the error "Matrix does not represent a grid".
If you don't want to comment out the first line or skip it with an external tool like < tail -n +2 matrix.dat, then you could change it to contain some dummy strings like
unimportant1 unimportant2 unimportant3 matrixdata[i] B C D E
1e4 2e5 3e2 1 2 3 4 5
2e3 1e1 7e3 5 4 3 2 1
... ... ... ...
2e3 1e4 4e2 4 4 4 4 4
Now your first row has as many entries as the other rows, and you can plot this file with
plot 'test.txt' matrix every ::3:1 with image
This still gives you a warning (matrix contains missing or undefined values), which you can safely ignore.

I'm not familiar with matrix plots, but I got some sample data and
plot 'matrix.dat' matrix every ::3 with image
seems to do the trick.

You could probably use shell commands, for instance, the following skips the first six lines of a file:
plot '<tail -n +7 terrain0.dem' matrix with image
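If preprocessing outside gnuplot is acceptable, the header row and the first three columns can both be dropped before gnuplot sees the data. The following is only a minimal sketch, assuming whitespace-separated fields; the file names data.dat and matrix_only.dat and the helper strip_matrix are made up for illustration.

```shell
# Hypothetical sample file mirroring the layout in the question.
tmp=$(mktemp -d)
cat > "$tmp/data.dat" <<'EOF'
unimportant1 unimportant2 unimportant3 matrixdata[i]
1e4 2e5 3e2 1 2 3 4 5
2e3 1e1 7e3 5 4 3 2 1
2e3 1e4 4e2 4 4 4 4 4
EOF

# Drop line 1 (the header) and print fields 4..NF of every remaining line.
strip_matrix() {
    awk 'NR > 1 {
        for (i = 4; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}

strip_matrix "$tmp/data.dat" > "$tmp/matrix_only.dat"
cat "$tmp/matrix_only.dat"
```

The stripped file can then be plotted with plot 'matrix_only.dat' matrix with image, while the original file stays untouched for the 2D plots that rely on the header.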

Related

GNUPLOT with point-size variables stored in a different file

I have a data file with the following format :
y1 y2 y3 y4 ...
1.3 1.1 0.5 0.5 ...
0.2 0.4 0.6 0.1 ...
I know how to use Gnuplot to plot the data in this file. Suppose I have 50 columns, then I use:
plot for [col=1:50] filename using 0:col with lines ...
Now, I want to make a scatter plot instead of a line plot, with points of variable size. I have a different file storing the point-size variables. I know I need to also use a for loop and:
w p ps variable
However, since the point-size variables are stored in a different file, I do not know how to write the using specification. Normally one uses
using 0:1:2
where the point-size variables are stored in the second column etc. But what if these variables are stored in a different file?
I think I can solve this problem by combining both the data and the pointsize variables file into a single file, but I wonder if one can do this using gnuplot.
Thanks
If there is a one-to-one matchup of lines in the two files, then yes. Assuming file.dat is formatted like the one you show above, and ps.dat contains one header record and then in column 1 the point size for all points in that same line of the data file:
# read point sizes into a data block in gnuplot
set datafile columnheaders
set table $pointsize
plot "ps.dat" using 1 with table
unset table
# Now plot the data, using the value of $pointsize[j+1] for row j of points
# There are two tricky bits here
# 1) the line numbers are counted starting with 0
# but array and datablock entries are counted starting from 1.
# 2) $pointsize is an array of strings. We need to convert this to a
# real number in order to use it as a point size
plot for [i=1:*] "file.dat" using 0:i:(real($pointsize[$0+1])) with points ps variable
file.dat
y1 y2 y3 y4
1 2 4 3
2 3 5 4
3 4 6 5
4 5 8 6
ps.dat
ps
1
5
2
3
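The alternative mentioned in the question, combining both files into one, can be done outside gnuplot when the lines match one-to-one. This is a rough sketch using the sample files above; the output name combined.dat is an assumption.

```shell
# Recreate the two sample files from the answer in a scratch directory.
tmp=$(mktemp -d)
printf 'y1 y2 y3 y4\n1 2 4 3\n2 3 5 4\n3 4 6 5\n4 5 8 6\n' > "$tmp/file.dat"
printf 'ps\n1\n5\n2\n3\n' > "$tmp/ps.dat"

# paste joins line N of the first file with line N of the second file,
# so the point size ends up as an extra last column.
paste -d ' ' "$tmp/file.dat" "$tmp/ps.dat" > "$tmp/combined.dat"
cat "$tmp/combined.dat"
```

In the combined file the point size sits in the last column (column 5 here), so a using specification of the form 0:i:5 together with ps variable should apply. The header line still has to be skipped, for example with the set datafile columnheaders setting used above.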

x must be numeric while trying to create histogram in R

I am a newbie in R. I need to generate some graphs. I imported an excel file and need to create a histogram on one column. My importing code is-
file=read.xlsx('femalecommentcount.xlsx',1,header=FALSE)
col=file[2]
col looks like this (part) -
36961 1
36962 1
36963 7
36964 1
36965 2
36966 1
36967 1
36968 4
36969 1
36970 6
36971 3
36972 1
36973 6
36974 6
36975 2
36976 2
36977 8
36978 2
36979 1
36980 1
36981 1
The first column is the row number. I'm not sure how to remove this. The second column is the data that I want a histogram of. The hist() function requires a vector, and I'm not sure how exactly to convert.
If I simply call -
hist(col)
it gives-
Error in hist.default(col) : 'x' must be numeric
I have tried few commands randomly from the internet, but they didn't work.
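My eventual goal is to just generate a good histogram (and maybe other charts) on that column, to get a good understanding of the spread of my data.
It should be col=file[[2]] or col=file[, 2] (solution given in a comment). file[2] returns a one-column data frame, whereas file[[2]] and file[, 2] return the column itself as a vector, which is what hist() needs.
The data also has to be imported correctly, so that the column is read as numeric.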

Replace semicolon-separated values to tab

I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work. separate_data in this case consist only of 1 element:
[[1]]
[1] "1"
Based on the OP, it's not directly stated whether the raw data file contains multiple observations of a single variable, or should be broken into n-tuples. Since the OP does state that read.table results in a single row where s/he expects it to contain multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData),sep=";")
columns <- 5 # set desired # of columns
observations <- length(value) / columns
observation <- unlist(lapply(1:observations,function(x) rep(x,times=columns)))
variable <- rep(1:columns,times=observations)
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
observation variable value
1 1 1 4.094573
2 1 2 4.079999
3 1 3 4.068667
4 1 4 4.059601
5 1 5 4.052183
6 2 1 4.094573
7 2 2 4.079999
8 2 3 4.068667
9 2 4 4.059601
10 2 5 4.052183
>
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
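If the goal is only to get one value per line (or tab-separated values), the conversion can also be done outside R. The sketch below assumes a single-line input file named 1.txt as in the question; the output file names are made up.

```shell
# Hypothetical input: one line of semicolon-separated values.
tmp=$(mktemp -d)
printf '4.0945725440979;4.07999897003174;4.0686674118042\n' > "$tmp/1.txt"

# One value per line: replace every semicolon with a newline.
tr ';' '\n' < "$tmp/1.txt" > "$tmp/column.txt"

# Tab-separated values on a single line: replace every semicolon with a tab.
tr ';' '\t' < "$tmp/1.txt" > "$tmp/tabs.txt"

cat "$tmp/column.txt"
```

Note that a trailing semicolon in the input would produce an empty last line in column.txt.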

merge and plot multiple text files

I have sixty text files, each with two columns as shown below, each representing a unique sample, and headed 'Coverage' and 'Count'. The length of each file differs by a few rows, because for some values of Coverage the Count is zero and is therefore not printed. Each file is about 1000 rows long. Each file is named in the format "B001.BaseCovDist.txt" to "B060.BaseCovDist.txt", and in R I have them as "B001" to "B060".
How can I combine the data frames by Coverage? This is complicated by missing rows. I've tried various approaches in bash, base R, reshape(2), and dplyr.
How can I make a single graph of Count (y-axis) against Coverage (x-axis) with each unique sample as a different series? ggplot2 seems ideal, but I seem to need a loop or a list to add the series without having to type out all of the names in full (which would be ridiculous).
One approach that seemed good was to add a third column that contains the unique sample name because this creates a molten dataset. However this didn't work in bash (awk) because the number of whitespace delimiters varies by row.
Any help would be very welcome.
Coverage Count
1 0 7089359
2 1 983611
3 2 658253
4 3 520767
5 4 448916
6 5 400904
A good starting point is to consider a long format for the data rather than a wide format. Since you mentioned reshape2, this should make sense, but check out tidyr as well; the docs for both describe the differences between long and wide.
Going with a long format, try the following:
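library(dplyr)    # bind_rows()
library(ggplot2)  # ggplot(), used below
# 'foo.csv' is a placeholder pattern; adjust it and the reader to your files
allfiles <- lapply(list.files(pattern='foo.csv'),
                   function(fname) cbind(fname=fname, read.csv(fname)))
dat <- bind_rows(allfiles)  # was rbind_all() in older dplyr versions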
dat
## fname Coverage Count
## 1 B001.BaseCovDist.txt 0 7089359
## 2 B001.BaseCovDist.txt 1 983611
## 3 B001.BaseCovDist.txt 2 658253
## 4 B001.BaseCovDist.txt 3 520767
## 5 B001.BaseCovDist.txt 4 448916
## 6 B001.BaseCovDist.txt 5 400904
ggplot(data=dat, aes(x=Coverage, y=Count, group=fname)) + geom_line()
Just to add to your answer, r2evans: I added a gsub command so that the filename suffix is removed from the added column (and also some boring import modifiers).
allfiles <- lapply(list.files(pattern='[.]BaseCovDist[.]txt$'), function(sample) cbind(sample=gsub("[.]BaseCovDist[.]txt$","", sample), read.table(sample, header=T, skip=3)))
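The approach that failed in bash, adding the sample name as a third column, can work with awk because its default field splitting treats any run of whitespace as one delimiter. The following is only a sketch under assumptions: each file is taken to have exactly one header line followed by two columns, and the output name all_samples.tsv is made up.

```shell
# Two hypothetical sample files with irregular whitespace between columns.
tmp=$(mktemp -d)
printf 'Coverage Count\n0   7089359\n1 983611\n' > "$tmp/B001.BaseCovDist.txt"
printf 'Coverage Count\n0 6500000\n2    658253\n' > "$tmp/B002.BaseCovDist.txt"

# Build one long-format table: sample, Coverage, Count (tab-separated).
{
    printf 'sample\tCoverage\tCount\n'
    for f in "$tmp"/B*.BaseCovDist.txt; do
        # Strip the directory and the common suffix to get e.g. "B001".
        sample=$(basename "$f" .BaseCovDist.txt)
        # FNR > 1 skips the header line of each file.
        awk -v s="$sample" 'FNR > 1 && NF >= 2 { print s "\t" $1 "\t" $2 }' "$f"
    done
} > "$tmp/all_samples.tsv"

cat "$tmp/all_samples.tsv"
```

The resulting file is already in long format, so it could be read into R with read.table(..., header=TRUE, sep="\t") and plotted with the ggplot call above, grouping by sample instead of fname.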

Stepwise fill dataframe

I'm using a for-loop to perform operations on specific subsets of my data. At the end of each iteration of the for loop, I have all the values that I need to fill a row of my dataframe.
So far I tried
df=NULL
for(...){
# stuff to calculate
newline=c(allthethingscalculated)
df=rbind(df,newline)
}
This results in the contents of the dataframe not being accessible using '$', because the rows are then atomic vectors.
I also tried to append the values I get at the end of each iteration to already existing vectors and, when the for loop ends, create a dataframe from these vectors using
x<-data.frame(a,b,c,d,...)
but appending the values to the respective vectors didn't work; the values weren't added.
Any ideas on this?
Since my for loop iterates over IDs in my data, I realized I could do something like this:
uids=unique(data$id)
filler=c(1:length(uids))
df=data.frame(uids,filler,filler,filler,filler,filler,filler,filler,filler,filler)
for(i in uids){
...
df[i,]<-newline
}
I used filler to create a dataframe with the correct number of columns and rows so I don't get an error like 'replacement has length of 9, replacement has length of 1'
Is there a better way to do this? Using this approach I still have the values of filler in the respective row that I'd need to remove?
This should work. Can you show us your data?
R) x=data.frame(a=rep(1,3),b=rep(2,3),c=rep(3,3))
R) d=c(4,4,4)
R) rbind(x,d)
a b c
1 1 2 3
2 1 2 3
3 1 2 3
4 4 4 4
R) cbind(x,d)
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4

Resources