Cut Out Middle of String - unix

This is what my data looks like
orthogroup12213.faa.aligned.treefile.rooting.0.gtpruned.rearrange.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0
orthogroup12706.faa.aligned.treefile.rooting.0.gtpruned.rearrange.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
...
I want to end up with something like this (without the .faa.aligned.treefile.rooting.0.gtpruned.rearrange.0):
orthogroup12213 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 8.0
orthogroup12706 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I have tried a variety of 'cut' functions but with no luck. Please help!

I would use sed:
sed 's/\.[^ ]*//' file
This says, "Find the first dot, and all characters that follow it that aren't a space, and replace them with nothing."
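As a quick sanity check, the command can be run on a copy of the first sample line (the filename `file` above is just a placeholder):

```shell
# Feed one sample line through the substitution; only the first field is
# touched, because [^ ]* stops at the first space.
printf '%s\n' 'orthogroup12213.faa.aligned.treefile.rooting.0.gtpruned.rearrange.0 0.0 1.0' |
  sed 's/\.[^ ]*//'
# prints: orthogroup12213 0.0 1.0
```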

Your example shows two lines that both contain .faa.aligned.treefile.rooting.0.gtpruned.rearrange.0. When this is a fixed string AND the first part is always exactly 15 characters, you might use cut:
# bad solution, only cut
cut -c1-15,68- file
This solution is fragile: when the length of the leading string or of the middle part changes, it breaks.
When you know that the string to remove starts with a dot and the first space is the next cutting point, you can use
# also bad
sed 's/[.]/ /' file | cut -d" " -f1,3-
It is nice to keep it simple with cut, but cut needs simple input.
First think about the best way to find the middle string, and use something like sed or awk for that.
# example with sed; use double quotes so $str expands, and note that the
# dots in $str are regex metacharacters (here each happens to match its
# literal dot anyway)
str='.faa.aligned.treefile.rooting.0.gtpruned.rearrange.0'
sed "s/$str//" file
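For completeness, an awk sketch along the same lines: rewrite only the first field, assuming the part you want to keep never itself contains a dot. Note that awk rebuilds the line with single spaces between fields, which is harmless for this data:

```shell
# Delete everything from the first dot onward, but only inside field 1;
# the trailing 1 is awk shorthand for "print the (modified) line".
awk '{ sub(/\..*/, "", $1) } 1' file
```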

Related

Create a non-linear structure in higher dimensional space in Julia

Suppose I’m creating a non-linear structure in a 20-dim space. Right now I have code
using Random
"`Uniform(0,b)`, with `0` excluded for sure, and we really mean it."
struct PositiveUniform{T}
    b::T
end
function Base.rand(rng::Random.AbstractRNG, pu::PositiveUniform)
    while true
        r = rand(rng)
        r > 0 && return r * pu.b
    end
end
m = rand(PositiveUniform(20))
mat_new = [cos(m),sin(m),cos(m),sin(m),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]'
for i in 1:84
    m = rand(PositiveUniform(20))
    vector = [cos(m),sin(m),cos(m),sin(m),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    mat_new = vcat(mat_new, vector')
end
mat_new'
And my mat_new' looks like this (output not shown).
I'm wondering: does this matrix satisfy my expectation?
(edit: I made a 2D structure 20 x 85, padding with zeros, since that is what was wanted, not a 20D array.)
const cols = 85
const rows = 20
const vec85 = rand(cols) .* rows
const mat2D = vcat(cos.(vec85)', sin.(vec85)', cos.(vec85)', sin.(vec85)', zeros(rows - 4, cols))
display(mat2D)
displays:
20×85 Matrix{Float64}:
-0.917208 -0.999591 -0.95458 -0.681959 0.999834 … 0.704834 0.961039 0.982991 0.967226 0.306118
0.398409 0.0286128 -0.297954 -0.731391 0.0182257 0.709372 0.276413 0.183653 -0.253917 -0.951993
-0.917208 -0.999591 -0.95458 -0.681959 0.999834 0.704834 0.961039 0.982991 0.967226 0.306118
0.398409 0.0286128 -0.297954 -0.731391 0.0182257 0.709372 0.276413 0.183653 -0.253917 -0.951993
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋱ ⋮
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Note that I kept your (perhaps unintended?) algorithm of setting the argument of sin and cos to nrows times rand(). If you want to pad the 2D array to contain more than just your matrix, I would look at PaddedViews.jl or similar.

Convert an unlabeled NxN matrix to a table of position and values in R

I have an unlabeled N x N matrix like the one below. It is saved in a csv.
0.5 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.4 0.0 0.0 0.0 0.0 0.0 0.3 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.2 0.0 0.0 0.0 0.2 0.0 0.0 0.0 0.0 0.0
0.1 0.0 0.0 0.0 0.0 0.2 0.0 0.7 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.0 0.0
0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to convert this into a data table with x coordinates, y coordinates, and values as the columns as I believe this is what needs to be done to plot the matrix as a heatmap.
I am completely unfamiliar with R, besides basic syntax, so please be verbose in any suggestions!
Thank you all so much for any help you can provide!
We may read the data with read.table/read.csv, convert the data.frame object to a matrix (as.matrix), add the table class (as.table), and then convert back to a data.frame. This returns a data.frame with three columns, i.e. row, column, and value, in long format:
as.data.frame(as.table(m1))
data
m1 <- as.matrix(read.table('file.txt', header = FALSE))

How to create row subgroups by name in a dataframe with R

I have a big dataframe (378 x 87, S3: data.frame), and I want to simplify it by subsetting the rows in a way that has experimental significance.
I can do that because the 378 rows are in fact subgroups of the same geographical region (around 14 groups, recognizable by the initial letters), which are named as follows:
DAW2-11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0 0.1 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 5.0
DAW2-12 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0 0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 3.0
DAW2-13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.5 0 0 0.0 0.0 0.0 0.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 10.0
DAW2-21 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.5 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 1.0
DAW2-22 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.5 0.0 0.5 0 0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 10.0
DAW2-23 0.0 0.0 0.0 5.0 0.5 0.0 0.0 0.0 0.0 0.0 0.1 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.1 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.5 0.0 2.0
.
.
.
The question is: how can I create a new dataframe by subsetting on the first letters, DAW2 in this case, and likewise for the rest of the 14 groups?
Replicating data from the OP to create two split groups, a solution in Base R is as follows:
textFile <- "DAW2-11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0 0.1 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 5.0
DAW2-12 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0 0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 3.0
DAW2-13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.5 0 0 0.0 0.0 0.0 0.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 10.0
DAW2-21 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.5 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 1.0
DAW2-22 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.5 0.0 0.5 0 0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 10.0
DAW2-23 0.0 0.0 0.0 5.0 0.5 0.0 0.0 0.0 0.0 0.0 0.1 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.1 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.5 0.0 2.0
DAW2-11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.1 0.0 0.1 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 5.0
DAW3-12 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0 0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 3.0
DAW3-13 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 0.1 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.0 0.0 0.5 0 0 0.0 0.0 0.0 0.0 15.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 10.0
DAW3-21 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 0.5 7.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.1 0.0 1.0
DAW3-22 0.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0.1 0.5 0.0 0.5 0 0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 10.0
DAW3-23 0.0 0.0 0.0 5.0 0.5 0.0 0.0 0.0 0.0 0.0 0.1 0.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0.0 0.0 0 0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.1 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.5 0.0 2.0
"
data <- read.table(text = textFile,header = FALSE,stringsAsFactors = FALSE)
data$splitVar <- as.factor(substr(data$V1,1,4))
splitData <- split(data,data$splitVar)
splitData$DAW2[1:5]
At this point the object splitData contains two data frames, one for the rows from the original data frame where splitVar == 'DAW2' and another where it is equal to DAW3.
The split() function uses the value of the split variable to name each data frame in the resulting list, so subsequent R code can use the $ form of the extract operator to access a data frame by its region code.
We'll print the first 5 columns of the first data frame in the list to illustrate that the first data frame only contains data for DAW2.
> splitData$DAW2[1:5]
V1 V2 V3 V4 V5
1 DAW2-11 0.0 0.0 0 0
2 DAW2-12 0.1 0.0 0 0
3 DAW2-13 0.0 0.0 0 0
4 DAW2-21 0.0 0.0 0 0
5 DAW2-22 0.0 0.1 0 0
6 DAW2-23 0.0 0.0 0 5
7 DAW2-11 0.0 0.0 0 0
>
Note: Given the sample data, it appears that the region is represented by the first four characters of the first column in the data frame. If the region information has a variable number of characters ending before the -, we can create the splitVar as follows.
data$splitVar <- as.factor(sapply(strsplit(data$V1,"-"),function(x) x[1]))
Now that we have a version of the code that produces correct output, we can simplify the solution as noted in Daniel's comment, which uses a regular expression with sub() to delete all characters in V1 starting with -.
splitData <- split(data,sub("-.*","",data$V1))
...and the output from the DAW3 data frame:
> splitData <- split(data,sub("-.*","",data$V1))
> splitData$DAW3[1:5]
V1 V2 V3 V4 V5
8 DAW3-12 0.1 0.0 0 0
9 DAW3-13 0.0 0.0 0 0
10 DAW3-21 0.0 0.0 0 0
11 DAW3-22 0.0 0.1 0 0
12 DAW3-23 0.0 0.0 0 5
>

Transform UpperTriangular to Cholesky in Julia

Having a dataset X, I am trying to perform a Cholesky factorization, followed by a Cholesky update. My setting is the following:
data = readtable("PCA_transformed_data_gt1000.csv",header= true)
data = delete!(data, :1)
n,d = size(data)
s = 6.6172
S0 = s*eye(d)
kappa_0 = 0.001
nu_0 = d
mu_0 = zeros(d)
S0 = LinAlg.chol(S0+kappa_0*dot(mu_0,mu_0'))
The type of S0 is
julia> typeof(S0)
UpperTriangular{Float64,Array{Float64,2}}
I am trying to perform the Cholesky update as
U = sqrt((1+1/kappa_0)) * LinAlg.lowrankdowndate!(S0, sqrt(kappa_0)*mu_0)
and get the following error
ERROR: MethodError: no method matching lowrankdowndate!(::UpperTriangular{Float64,Array{Float64,2}}, ::Array{Float64,1})
Closest candidates are:
lowrankdowndate!(::Base.LinAlg.Cholesky{T,S<:AbstractArray{T,2}}, ::Union{Base.ReshapedArray{T,1,A<:DenseArray,MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N}}},DenseArray{T,1},SubArray{T,1,A<:Union{Base.ReshapedArray{T,N,A<:DenseArray,MI<:Tuple{Vararg{Base.MultiplicativeInverses.SignedMultiplicativeInverse{Int64},N}}},DenseArray},I<:Tuple{Vararg{Union{Base.AbstractCartesianIndex,Colon,Int64,Range{Int64}},N}},L}}) at linalg/cholesky.jl:502
I tried something like
convert(S0,Base.LinAlg.Cholesky)
but got the following
ERROR: MethodError: First argument to `convert` must be a Type, got [2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239 0.0; 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.57239]
Any ideas how to perform that task?
There are actually two Cholesky factorization functions, and it seems you need the other one, which returns a Cholesky object: cholfact. From a Cholesky object you can extract the upper triangular factor by indexing with :U, like so:
C = LinAlg.cholfact(M)
U = C[:U] # <--- this is upper triangular
For the code in the question, this becomes:
data = readtable("PCA_transformed_data_gt1000.csv",header= true)
data = delete!(data, :1)
n,d = size(data)
s = 6.6172
S0 = s*eye(d)
kappa_0 = 0.001
nu_0 = d
mu_0 = zeros(d)
S1 = LinAlg.cholfact(S0+kappa_0*dot(mu_0,mu_0))
U = sqrt((1+1/kappa_0)) * LinAlg.lowrankdowndate!(S1, sqrt(kappa_0)*mu_0)[:U]
The changes are to the dot product (the transpose is unnecessary and causes a problem in 0.6), and to index the result of lowrankdowndate! with [:U] to get the upper triangular matrix. Also, the result of cholfact is bound to S1 instead of overwriting S0, for type stability.
Hope this helps.

How do I use .prt files in R?

I am trying to use climate data for analysis using R code. However, I came across a data format for which I cannot find any documentation.
The .prt extension is used for many applications but I believe mine is a Printer-formatted file.
It has no proper delimiters and it cannot be processed by any other application but I can easily view it in a text editor. Because of the nature of the climate data, processing it in C or Python would be very cumbersome.
Kindly help me to read this file into R or to convert it to a file format readable in R.
EDIT:
The data in the prt file is in the format below. As you can see, it follows a map of India, with no proper format or delimiters. Each file consists of certain climate values for each day of the year, and I have 53 such files:
Day= 1-Jan
66.5E 67.5E 68.5E 69.5E 70.5E 71.5E 72.5E 73.5E 74.5E 75.5E 76.5E 77.5E 78.5E 79.5E 80.5E 81.5E 82.5E 83.5E 84.5E 85.5E 86.5E 87.5E 88.5E 89.5E 90.5E 91.5E 92.5E 93.5E 94.5E 95.5E 96.5E 97.5E
37.5N
36.5N 0.0 0.0 0.0 0.0 0.0 0.0
35.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
34.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0
33.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0
32.5N 0.0 0.0 0.0 0.0 0.0 0.0
31.5N 0.0 0.0 0.0 0.0 0.0 0.0
30.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
29.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
28.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.0 8.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
27.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
25.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
24.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
20.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
19.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
18.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
16.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
15.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0
13.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12.5N 0.0 0.0 0.0 0.0 0.0 0.0 1.4
11.5N 0.0 0.0 0.0 0.0 0.5
10.5N 0.0 0.0 0.0 0.0 0.0
9.5N 0.0 0.0 0.0 2.4
8.5N 0.0 0.3 2.5
Day= 2-Jan
I've tried the method suggested in the comments and this is the output I received. But this is not the output I require. I need each of the values under the latitude-longitude to be separate, not as part of an array element.
>
[1] " Day= 1-Jan"
[2] " 66.5E 67.5E 68.5E 69.5E 70.5E 71.5E 72.5E 73.5E 74.5E 75.5E 76.5E 77.5E 78.5E 79.5E 80.5E 81.5E 82.5E 83.5E 84.5E 85.5E 86.5E 87.5E 88.5E 89.5E 90.5E 91.5E 92.5E 93.5E 94.5E 95.5E 96.5E 97.5E"
[3] " 37.5N "
[4] " 36.5N 0.0 0.0 0.0 0.0 0.0 0.0 "
[5] " 35.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[6] " 34.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[7] " 33.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[8] " 32.5N 0.0 0.0 0.0 0.0 0.0 0.0 "
[9] " 31.5N 0.0 0.0 0.0 0.0 0.0 0.0 "
[10] " 30.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[11] " 29.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[12] " 28.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 12.0 8.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
[13] " 27.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0"
[14] " 26.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[15] " 25.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[16] " 24.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[17] " 23.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[18] " 22.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[19] " 21.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
[20] " 20.5N 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 "
Well, your data format is quite irregular. I wasn't sure whether you have just one date per file or several (your example seems to start a second day at the bottom). But assuming the latter (which should work for the first scenario as well), here's one strategy: use readLines() to get the data in, then extract the data of interest with read.fwf:
lines <- readLines("test.prt")
days <- grep("Day=", lines)
outlist <- lapply(days, function(day){
    headers <- strsplit(gsub("^\\s+","",lines[day+1])," ")[[1]]
    date <- gsub(".*Day= ", "", lines[day], perl=T)
    con <- textConnection(lines[day:(day+30)+1])
    dd <- read.fwf(con, widths=rep(6, 33), header=F, skip=1)
    names(dd) <- c("lat", headers)
    close(con)
    dd <- reshape(dd, idvar="lat", ids="lat",
        times=names(dd)[-1], timevar="lon",
        varying=list(names(dd)[-1]), v.names="obs",
        direction="long")
    dd <- cbind(date=date, dd)
    dd <- subset(dd, !is.na(obs))
    rownames(dd) <- NULL
    dd
})
do.call(rbind, outlist)
So we read all the lines in, then find all the "Day=" positions. Then we read the headers from the next line, and create a textConnection() to read the rest of the block with read.fwf() (which apparently does not have a text= parameter). Next, I reshape the data so that you get one row for each lat/lon. I chose to also merge in the date from the section header and to remove the missing values. Finally, once I have a data.frame for each day, I rbind them all together. The results look like this:
date lat lon obs
1 1-Jan 24.5N 68.5E 0
2 1-Jan 23.5N 68.5E 0
3 1-Jan 27.5N 69.5E 0
4 1-Jan 24.5N 69.5E 0
5 1-Jan 23.5N 69.5E 0
6 1-Jan 22.5N 69.5E 0