I have downloaded a file with the extension .mea. It's climate data. I don't know how to import it into R; I don't even know how to open it on macOS. Here is what the first lines of the data look like.
IPCC Data Distribution Centre Results from model HADCM3 11-07-2002
Grid is 96 * 73 Month is Jan
HADCM A1F
Total precipitation (mm/day)
7008 format is (10F8.2) missing code is 9999.99
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I did it the following way:
First I split the file into 12 smaller files, each containing one month's data (5 header lines plus 701 data lines, since 7008 values at 10 per line take 701 lines), using the command-line split utility:
split -l 706 filename newfilePrefix
Then I read in each small file with the following:
readr::read_table(filename, col_names = FALSE, skip = 5)
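To read all 12 months at once and reshape each one into its 96 * 73 grid, a sketch along these lines should work (base scan is used because it tolerates the short final line; the newfilePrefix file pattern, the 5 header lines per file, and filling the grid longitude-fastest are assumptions):
# read each split file, mask the missing code, reshape to the grid
files <- sort(list.files(pattern = "^newfilePrefix"))
months <- lapply(files, function(f) {
  v <- scan(f, skip = 5)                 # 7008 values, 10 per line
  v[v == 9999.99] <- NA                  # missing code from the header
  matrix(v, nrow = 73, ncol = 96, byrow = TRUE)  # assumed 73 lat x 96 lon, row-major
})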
Say I have a set of numbers and I want to sum them into cohorts based on a predetermined distribution. A simple example: if the cumulative amount of a set of numbers is 100 and the distribution is 0.2, 0.3, 0.5 for cohorts 1, 2 and 3 respectively, then I'd want to find a subset of numbers whose sum is 20, another unique subset whose sum is 30, and a final unique subset whose sum is 50. Obviously it doesn't have to be exact; it just has to match the distribution reasonably closely.
I have a way in VBA in which I can use the Solver add-in to find the best way to take a subset of numbers within a set of numbers and get close to (within 3,000, say) a predetermined distribution. This involves a SUMPRODUCT of the list of numbers with a binary 0/1 constraint column, and then finding the difference between the total amount required for that cohort and the SUMPRODUCT described.
Once this has been done, any number in this subset is removed and we perform the solver method on the shrunken set. I have attached an image of the procedure's evolution that I hope is clear; the colors correspond to the iterations. In the first (green) iteration we have the full list, and the variables that change are those in the corresponding green column containing 0/1, to get a SUMPRODUCT close to 142,449.09.
Note: the sum of the full list is 1,424,490.85 in this example.
The "Difference" line is the solver objective, and after each iteration the objective shifts one column to the right. (I have set it so that if the difference is within 1,000 then it displays zero, as this appeared to speed up the method.) The simulated value is calculated from the corresponding colored SUMPRODUCT, and the theoretical value is simply the probability multiplied by the total sum of all the numbers.
I have attached the code below, but in reality this method isn't time-effective, especially if I have to do this across multiple datasets, which is the reality of the problem. I'd like to move this project into a more efficient language like R (which I've had experience with, albeit not to a high degree), as I believe it could do this process quicker and more effectively.
I am also aware my algorithm has flaws: some of the later cohorts won't be as accurate, because we are observing a smaller dataset. It also seems to include zeros in the SUMPRODUCT, which I don't want it to do (see grey column). Also, I want all the numbers to be used, and sometimes a number is omitted because its inclusion would move the result further away from the theoretical distribution. I am not sure how to fix the above, so I'd appreciate some advice on this front.
Has anyone done such a thing in R?
I also understand this could be a problem for Cross Validated - I really wasn't sure so feel free to move. I have attached the code and the tables in text form below.
Thanks in advance,
Sub solversimple()
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim rCell, rChange, rSum
    Dim i As Integer
    Set wb = Application.ThisWorkbook
    Set ws = wb.Sheets("Output")
    For i = 1 To 5
        ' objective cell, 0/1 decision column and sum column shift one column right per cohort
        rCell = ws.Range("q8").Offset(0, i - 1).Address
        rChange = ws.Range("h4:h36").Offset(0, i - 1).Address
        rSum = ws.Range("I5:I39").Offset(0, i - 1).Address
        SolverReset
        SolverOk SetCell:=rCell, MaxMinVal:=2, ValueOf:=0, ByChange:=rChange, _
            Engine:=3, EngineDesc:="Evolutionary"
        SolverAdd CellRef:=rChange, Relation:=5, FormulaText:="binary"
        SolverAdd CellRef:=rSum, Relation:=5, FormulaText:="binary"
        SolverSolve True
    Next i
End Sub
Full List | After 1st it. | After 2nd it. | After 3rd it. | After 4th it. | 0/1 flags for iterations 1-5
49000.21 49000.21 49000.21 49000.21 49000.21 0.00 0.00 0.00 0.00 1.00
51591.99 51591.99 51591.99 51591.99 51591.99 0.00 0.00 0.00 0.00 1.00
18390.18 18390.18 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
45490.39 45490.39 45490.39 45490.39 45490.39 0.00 0.00 0.00 0.00 1.00
37506.41 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00
1460.11 1460.11 1460.11 0.00 0.00 0.00 0.00 1.00 1.00 0.00
136564.86 136564.86 136564.86 136564.86 0.00 0.00 0.00 0.00 1.00 1.00
41581.29 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00
6138.26 6138.26 6138.26 0.00 0.00 0.00 0.00 1.00 0.00 0.00
23831.37 23831.37 23831.37 23831.37 0.00 0.00 0.00 0.00 1.00 1.00
4529.44 4529.44 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1291.53 1291.53 1291.53 0.00 0.00 0.00 0.00 1.00 0.00 0.00
1084.88 1084.88 1084.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00
33516.76 33516.76 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
43393.83 43393.83 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00
81000.69 81000.69 81000.69 0.00 0.00 0.00 0.00 1.00 0.00 0.00
25397.64 25397.64 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00
29473.54 29473.54 29473.54 0.00 0.00 0.00 0.00 1.00 1.00 1.00
39097.70 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00
59669.99 59669.99 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
18639.97 18639.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
97198.13 97198.13 97198.13 0.00 0.00 0.00 0.00 1.00 0.00 1.00
5558.69 5558.69 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
16298.63 16298.63 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00
67621.61 67621.61 67621.61 0.00 0.00 0.00 0.00 1.00 0.00 0.00
69388.09 69388.09 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
193524.89 193524.89 193524.89 193524.89 0.00 0.00 0.00 0.00 1.00 1.00
12455.61 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00
7261.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
77879.68 77879.68 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
53891.97 53891.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
70602.68 70602.68 70602.68 70602.68 70602.68 0.00 0.00 0.00 0.00 1.00
4157.96 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00
Cohort 1.00 2.00 3.00 4.00 5.00
Probability 0.10 0.30 0.20 0.25 0.15
Theoretical 142449.09 427347.26 284898.17 356122.71 213673.63
Simulated 142060.85 426554.86 285268.75 353921.12 216685.28
Difference 0.00 0.00 0.00 2201.59 3011.65
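For reference, the "Theoretical" row above is just each cohort probability multiplied by the grand total; in R:
probs <- c(0.10, 0.30, 0.20, 0.25, 0.15)
probs * 1424490.85   # reproduces the Theoretical row above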
Using R, create some test input. Then, using a greedy approach, determine the ordering indices o. Use findInterval to determine the breakpoints b, then create a grouping vector and rearrange it to correspond to the original order of x, so that x[i] is in group g[i]. Note that split(x, g) creates a list of length(d) groups (shown after the code).
# test input
set.seed(123)
x <- sample(20, 20)
d <- c(.2, .3, .5) # assume in increasing order
o <- order(x)
b <- findInterval(cumsum(d) * sum(x), cumsum(x[o]))
g <- rep(seq_along(d), diff(c(0, b)))[order(o)]
# check distribution of result
tapply(x, g, sum) / sum(x)
## 1 2 3
## 0.1714286 0.3285714 0.5000000
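The groups themselves (rather than just their relative sums) can be listed with split, as noted above:
# the members of each cohort, as a list of length(d)
split(x, g)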
I have the following month-wise data for each customer.
I would like to do a 3-month forecast for each customer.
Note: many observations are zeros (no transaction), so I need to tackle this sparse dataset.
CustomerName 01/2009 02/2009 03/2009 04/2009 05/2009 06/2009 07/2009 08/2009 09/2009 10/2009
Aaron Bergman 0.00 0.00 0.00 0.00 0.00 0.0 4270.87 0.00 0.00 0
Aaron Hawkins 0.00 0.00 0.00 0.00 0.00 0.0 0.00 455.04 0.00 0
Aaron Smayling 136.29 4658.69 0.00 119.34 4674.16 0.0 0.00 0.00 0.00 0
Adam Bellavance 0.00 0.00 0.00 0.00 2107.55 0.0 0.00 0.00 0.00 0
Adam Hart 60.52 0.00 0.00 0.00 0.00 0.0 0.00 0.00 0.00 0
Adam Shillingsburg 0.00 1749.50 125.86 0.00 0.00 5689.4 3275.74 1296.30 9887.52 0
Adrian Barton 0.00 66.00 0.00 0.00 0.00 55.0 0.00 0.00 0.00 0
Adrian Hane 0.00 23.66 0.00 0.00 46.22 0.0 0.00 0.00 0.00 0
Adrian Shami 10.00 0.00 0.00 33.00 0.00 48.0 0.00 0.00 42.00 0
Aimee Bixby 56.33 22.99 0.00 44.28 0.00 0.0 0.00 66.12 0.00 48.22
How can I do some sort of batch time-series forecasting, say using auto.arima, for each customer?
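A minimal sketch of one way to batch this with the forecast package, assuming the data sits in a data frame dat with CustomerName in column 1 and one column per month (the name dat, the January 2009 start, and the 3-month horizon are assumptions):
library(forecast)
# fit auto.arima per customer and forecast 3 months ahead
fc <- lapply(seq_len(nrow(dat)), function(i) {
  y <- ts(unlist(dat[i, -1]), start = c(2009, 1), frequency = 12)
  forecast(auto.arima(y), h = 3)   # all-zero or sparse rows collapse to simple models
})
names(fc) <- dat$CustomerName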
I have a file that has space-separated columns, from which I want to extract specific data. Below is the format of the file:
12:00:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 AM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
01:03:01 AM all 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
01:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
01:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 PM 0 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 PM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 PM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 PM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
Now from this file I want the rows that have a time like 12:01:01 AM/PM, i.e. one per hour, and that have all in the CPU column.
So after extraction I want the data below, but I am not able to get it.
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
Please suggest how I can get that data in UNIX.
If you add the -E option to grep, it allows you to look for "Extended Regular Expressions". One such expression is
"CPU|01:01"
which will allow you to find all lines containing the word "CPU" (such as your column-heading line) and also any lines with "01:01" in them. This is called an "alternation" and uses the pipe symbol (|) to separate the alternate sub-parts.
So, an answer would be:
grep -E "CPU|01:01 .*all" yourFile > newFile
Try running:
man grep
to get the manual (help) page.
awk to the rescue!
If you need field-specific matches, awk is the right tool.
$ awk '$3=="all" && $1~/01:01$/' file
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
You can extract the header as well, with this:
$ awk 'NR==1 || $3=="all" && $1~/01:01$/' file
I would like a transition probability matrix that looks like this (adding up to 1.0 for each row).
0.00 0.00 1.00 0.00 0.00 0.00 0.00
0.00 0.50 0.50 0.00 0.00 0.00 0.00
0.33 0.00 0.33 0.33 0.00 0.00 0.00
How can I get it in R?
transMatrx <- matrix(c(0, 0, 1, 0, 0, 0, 0,
                       0, 0.5, 0.5, 0, 0, 0, 0,
                       0.33, 0, 0.33, 0.33, 0, 0, 0),
                     nrow = 3, ncol = 7, byrow = TRUE)
nrow = 3 and ncol = 7 give the three rows of seven columns; byrow = TRUE fills the values left to right, top to bottom.
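A quick sanity check on the result (the third row sums to 0.99 because of the rounded 0.33 entries):
# each row of a transition matrix should sum to (approximately) 1
rowSums(transMatrx)
## [1] 1.00 1.00 0.99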
I'm trying to run a compositional analysis of the use of different types of habitat by ground-nesting chicks on a set of data using RStudio. It starts processing but never stops, and I have to manually stop the processing or kill RStudio. (Same result in R.)
I'm using the compana function from the adehabitatHS package. I'm able to run the sample pheasant and squirrel data from adehabitat without any problems. (I've tried calling compana from both packages, with the same result.)
For each chick, the available habitat varies, as it's taken as a buffer zone around its nest site.
My data
These are the available habitats for each chick:
grass fallow.plot oil.seed.rape spring.barley winter.wheat maize other.crops other woodland hedgerow
1 23.35 7.53 45.75 0.00 0.00 0.00 0.00 0.00 23.37 0.00
2 86.52 10.35 0.00 0.00 1.24 0.00 0.00 1.89 0.00 0.00
3 5.18 10.33 28.36 38.82 0.00 0.00 17.17 0.14 0.00 0.00
4 4.26 18.32 27.31 32.66 3.82 0.00 0.00 5.02 5.52 3.09
5 4.26 18.32 27.31 32.66 3.82 0.00 0.00 5.02 5.52 3.09
6 12.52 10.35 0.00 0.00 0.00 18.02 43.59 13.15 2.37 0.00
7 21.41 11.56 59.25 0.00 0.00 0.00 0.00 5.82 0.00 1.96
8 21.41 11.56 59.25 0.00 0.00 0.00 0.00 5.82 0.00 1.96
9 36.17 16.93 0.00 30.14 0.00 0.00 0.00 7.08 9.68 0.00
10 0.00 12.17 26.49 0.00 3.99 55.77 0.00 1.58 0.00 0.00
11 0.00 10.27 67.41 1.93 18.30 0.00 0.00 1.18 0.00 0.91
12 2.66 5.38 0.00 14.39 54.06 0.00 8.40 3.83 7.84 3.44
13 2.66 5.38 0.00 14.39 54.06 0.00 8.40 3.83 7.84 3.44
14 84.22 8.00 0.00 0.00 0.00 2.90 0.00 0.22 3.84 0.82
15 84.22 8.00 0.00 0.00 0.00 2.90 0.00 0.22 3.84 0.82
16 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
17 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
18 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
19 86.85 13.04 0.00 0.00 0.00 0.00 0.00 0.11 0.00 0.00
20 21.41 8.11 0.47 8.08 0.00 0.00 56.78 2.26 0.00 2.89
These are the used habitats (mcp):
grass fallow.plot oil.seed.rape spring.barley winter.wheat maize other.crops other woodland hedgerow
1 41.14 58.67 0.19 0.00 0.00 0.00 0.00 0.00 0 0.0
2 35.45 64.55 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
3 10.10 60.04 7.72 21.37 0.00 0.00 0.00 0.77 0 0.0
4 0.00 44.55 0.00 50.27 0.00 0.00 0.00 5.18 0 0.0
5 2.82 48.48 44.80 0.00 0.00 0.00 0.00 0.00 0 3.9
6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 0 0.0
7 0.00 87.41 12.59 0.00 0.00 0.00 0.00 0.00 0 0.0
8 0.00 83.59 16.41 0.00 0.00 0.00 0.00 0.00 0 0.0
9 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
10 0.00 18.93 0.00 0.00 0.00 81.07 0.00 0.00 0 0.0
11 0.00 100.00 0.00 0.00 0.00 0.00 0.00 0.00 0 0.0
12 0.00 22.79 0.00 0.00 77.13 0.00 0.00 0.08 0 0.0
13 0.00 0.00 0.00 0.00 100.00 0.00 0.00 0.00 0 0.0
14 54.60 44.97 0.00 0.00 0.00 0.00 0.00 0.43 0 0.0
15 62.86 36.57 0.00 0.00 0.00 0.00 0.00 0.57 0 0.0
16 11.15 88.10 0.00 0.00 0.00 0.00 0.00 0.75 0 0.0
17 20.06 79.62 0.00 0.00 0.00 0.00 0.00 0.32 0 0.0
18 38.64 60.95 0.00 0.00 0.00 0.00 0.00 0.41 0 0.0
19 3.81 95.81 0.00 0.00 0.00 0.00 0.00 0.38 0 0.0
20 0.00 3.56 0.00 0.00 0.00 0.00 96.44 0.00 0 0.0
I've tried both parametric and randomisation tests with the same results. The code I'm running:
habuse <- compana(used, avail, test = "randomisation",rnv = 0.001, nrep = 500, alpha = 0.1)
habuse <- compana(used, avail, test = "parametric")
Any ideas where I'm going wrong?
I've discovered the answer to my own question. For the used data, the function replaces 0 values with the value you specify (0.001 in my case). But it doesn't replace 0 values in the available data, and it doesn't like them either.
I replaced all the 0s with 0.001 in the available table, adjusted the other values so each row still summed to 100, and the function worked.
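A minimal sketch of that fix, assuming avail is the available-habitat table shown above (renormalising rows back to 100 is my reading of "adjusted the other values"):
# replace zeros in the available data, then renormalise each row to 100
avail[avail == 0] <- 0.001
avail <- avail / rowSums(avail) * 100
habuse <- compana(used, avail, test = "randomisation", rnv = 0.001,
                  nrep = 500, alpha = 0.1)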