Creating a transition matrix in R - r

I would like a transition probability matrix that looks like this (each row adding up to 1.0).
0.00 0.00 1.00 0.00 0.00 0.00 0.00
0.00 0.50 0.50 0.00 0.00 0.00 0.00
0.33 0.00 0.33 0.33 0.00 0.00 0.00
How can I get it in R?

transMatrx <- matrix(c(0,   0,  1,   0,   0, 0, 0,
                       0,  .5,  .5,  0,   0, 0, 0,
                       .33, 0,  .33, .33, 0, 0, 0),
                     nrow = 3, byrow = TRUE)
nrow = 3 sets the number of rows (the matrix then has 7 columns).
byrow = TRUE fills the matrix left to right, top to bottom.
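If you also want each row to sum to exactly 1 (the .33 row above only sums to 0.99), a minimal follow-up sketch is to normalize the matrix by its row sums:
# divide each row by its sum so every row is a proper probability distribution
transMatrx <- transMatrx / rowSums(transMatrx)
rowSums(transMatrx)  # should print 1 1 1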

Related

How to match elements from data frame with values from an array in R?

I want to match elements from df1 with values from an array1.
df1 <- c('A','S','E','E','V','G','H','P','K','L','W','N','P','A','A','S','E','N','M','Y','S','G','D','R','H')
array1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
A 0.15 0.00 0.10 0.10 0.05 0.00 0.05 0.00 0.00 0.05 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00
C 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
D 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.05 0.05 0.00 0.0 0.10 0.0 0.00 0.25 0.10 0.20 0.10 0.00 0.15 0.05 0.00 0.00 0.05
E 0.05 0.10 0.05 0.05 0.00 0.05 0.00 0.10 0.10 0.20 0.00 0.0 0.05 0.0 0.00 0.00 0.05 0.10 0.00 0.20 0.10 0.05 0.15 0.10 0.10
F 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.05 0.0 0.05 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
G 0.00 0.00 0.10 0.00 0.05 0.00 0.00 0.00 0.05 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.05 0.00 0.00 0.00
H 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.2 0.05 0.1 0.05 0.05 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00
K 0.00 0.10 0.00 0.05 0.00 0.05 0.05 0.05 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.10 0.00 0.05
L 0.00 0.00 0.05 0.05 0.05 0.05 0.10 0.00 0.10 0.00 0.00 0.0 0.00 0.2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00
M 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
N 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.05 0.05 0.00 0.00 0.00 0.00 0.05 0.00
P 0.00 0.00 0.00 0.05 0.05 0.00 0.10 0.10 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.10 0.00 0.05 0.00
Q 0.00 0.05 0.05 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.05 0.0 0.00 0.1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05
R 0.00 0.00 0.05 0.00 0.05 0.15 0.00 0.00 0.00 0.05 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.00
S 0.10 0.10 0.00 0.00 0.05 0.00 0.00 0.00 0.00 0.00 0.05 0.0 0.00 0.0 0.15 0.10 0.20 0.05 0.10 0.10 0.05 0.00 0.05 0.05 0.10
T 0.00 0.00 0.00 0.05 0.00 0.05 0.00 0.05 0.05 0.00 0.00 0.0 0.00 0.0 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.05 0.05 0.00
V 0.05 0.05 0.00 0.05 0.00 0.00 0.00 0.05 0.05 0.00 0.10 0.2 0.15 0.0 0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
W 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Y 0.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.10 0.0 0.00 0.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The expected outcome can be a list or a df:
0.15, 0.10, 0.05, 0.05, 0.00, 0.00, 0.00, 0.10, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.10, 0.05, 0.05, 0.00, 0.00, 0.05, 0.05, 0.00, 0.00, 0.00
This is what I have tried:
res <- left_join(df1, array1, by = array1[[y]])
view(res)
You can use matrix subsetting on array1:
array1[cbind(match(df1, rownames(array1)), 1:ncol(array1))]
#[1] 0.15 0.10 0.05 0.05 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00
#[14] 0.00 0.00 0.10 0.05 0.05 0.00 0.00 0.05 0.05 0.00 0.00 0.00
match(df1, rownames(array1)) builds the row index for each element of df1; pairing it with the column index 1:ncol(array1) selects one value per (row, column) pair.
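To see how this kind of indexing works in isolation, here is a small sketch on a toy matrix (the names m and rows are only for illustration):
# index a matrix with a two-column (row, col) matrix: one element per pair
m <- matrix(1:9, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))
rows <- match(c("b", "a", "c"), rownames(m))  # row to pick for each column
m[cbind(rows, 1:3)]
#[1] 2 4 9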

How to effectively fit a set of numbers into a predetermined distribution

Say I have a set of numbers and I want to sum them to fit cohorts based on a predetermined distribution. A simple example: if the cumulative amount of a set of numbers is 100 and the distribution is 0.2, 0.3, 0.5 for cohorts 1, 2 and 3 respectively, then I'd want to find a subset of numbers whose sum is 20, another unique subset whose sum is 30 and a final unique subset whose sum is 50. Obviously it doesn't have to be exact; it just must match the distribution reasonably closely.
I have a way in VBA in which I can use the Solver add-in to find the best way to take a subset of numbers within a set and get close to (within 3000, say) a predetermined distribution. This involves using a SUMPRODUCT of the list of numbers with a binary 0/1 variable range, and then finding the difference between the total amount required for that cohort and the SUMPRODUCT described.
Once this has been done, any number in this subset is removed and we perform the Solver method on the shrunken set. I have attached an image of the procedure's evolution that I hope is clear; the colors correspond to the iterations. In the first (green) iteration we have the full list, and the variables that change are those in the corresponding green column containing 0/1, to get a SUMPRODUCT close to 142,449.09.
Note the sum of the full list is: 1,424,490.85 in this example.
The "Difference" line is the solver objective, and after each iteration the objective is one column shift to the right. (I have set it so that if the difference is within 1000 then it displays zero - as this appeared to speed up the method). The simulated is that calculated from the corresponding colored sumproduct and the theoretical is simply just the probability multiplied by the total sum of all the numbers.
I have attached the code below but in reality this method isn't time effective, especially if I have to do this across multiple datasets - which is the reality of the problem. I'd like to be able to move this project into a more efficient language like R (which I've had experience with - albeit not to a high degree), as I believe it could do this process quicker and more effectively?
I am also aware my algorithm has flaws as some of the later cohorts won't be as accurate as we are observing a smaller dataset. It seems to include zeros in the sum product which I don't want it to do (see grey column). Also I want all the numbers to be used and sometimes a number will be omitted as it's inclusion means it is further away from the theoretical distribution. I am not sure how to do the above so I'd appreciate some advice on this front.
Has anyone done such a thing in R?
I also understand this could be a problem for Cross Validated - I really wasn't sure so feel free to move. I have attached the code and the tables in text form below.
Thanks in advance,
Sub solversimple()
Dim wb As Workbook
Dim ws As Worksheet
Dim rCell, rChange, rSum
Dim i As Integer
Set wb = Application.ThisWorkbook
Set ws = wb.Sheets("Output")
For i = 1 To 5
rCell = ws.Range("q8").Offset(0, i - 1).Address
rChange = ws.Range("h4:h36").Offset(0, i - 1).Address
rSum = ws.Range("I5:I39").Offset(0, i - 1).Address
SolverReset
SolverOk SetCell:=rCell, MaxMinVal:=2, ValueOf:=0, ByChange:=rChange, _
Engine:=3, EngineDesc:="Evolutionary"
SolverAdd CellRef:=rChange, Relation:=5, FormulaText:="binary"
SolverAdd CellRef:=rSum, Relation:=5, FormulaText:="binary"
SolverSolve True
Next i
End Sub
Full List List after 1st It List after 2nd List after 3rd List after 4th 1 2 3 4 5
49000.21 49000.21 49000.21 49000.21 49000.21 0.00 0.00 0.00 0.00 1.00
51591.99 51591.99 51591.99 51591.99 51591.99 0.00 0.00 0.00 0.00 1.00
18390.18 18390.18 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
45490.39 45490.39 45490.39 45490.39 45490.39 0.00 0.00 0.00 0.00 1.00
37506.41 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 1.00
1460.11 1460.11 1460.11 0.00 0.00 0.00 0.00 1.00 1.00 0.00
136564.86 136564.86 136564.86 136564.86 0.00 0.00 0.00 0.00 1.00 1.00
41581.29 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00 0.00
6138.26 6138.26 6138.26 0.00 0.00 0.00 0.00 1.00 0.00 0.00
23831.37 23831.37 23831.37 23831.37 0.00 0.00 0.00 0.00 1.00 1.00
4529.44 4529.44 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
1291.53 1291.53 1291.53 0.00 0.00 0.00 0.00 1.00 0.00 0.00
1084.88 1084.88 1084.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00
33516.76 33516.76 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
43393.83 43393.83 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00
81000.69 81000.69 81000.69 0.00 0.00 0.00 0.00 1.00 0.00 0.00
25397.64 25397.64 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00
29473.54 29473.54 29473.54 0.00 0.00 0.00 0.00 1.00 1.00 1.00
39097.70 0.00 0.00 0.00 0.00 1.00 0.00 0.00 1.00 1.00
59669.99 59669.99 0.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00
18639.97 18639.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
97198.13 97198.13 97198.13 0.00 0.00 0.00 0.00 1.00 0.00 1.00
5558.69 5558.69 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
16298.63 16298.63 0.00 0.00 0.00 0.00 1.00 0.00 1.00 1.00
67621.61 67621.61 67621.61 0.00 0.00 0.00 0.00 1.00 0.00 0.00
69388.09 69388.09 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00
193524.89 193524.89 193524.89 193524.89 0.00 0.00 0.00 0.00 1.00 1.00
12455.61 0.00 0.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00
7261.88 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00
77879.68 77879.68 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00
53891.97 53891.97 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
70602.68 70602.68 70602.68 70602.68 70602.68 0.00 0.00 0.00 0.00 1.00
4157.96 0.00 0.00 0.00 0.00 1.00 1.00 1.00 0.00 1.00
Cohort 1 2 3 4 5
Probability 0.10 0.30 0.20 0.25 0.15
Theoretical 142449.09 427347.26 284898.17 356122.71 213673.63
Simulated 142060.85 426554.86 285268.75 353921.12 216685.28
Difference 0.00 0.00 0.00 2201.59 3011.65
Using R, create some test input. Then, using a greedy approach, determine the ordering indices o. Use findInterval to determine the breakpoints b, then create a grouping vector and rearrange it to correspond to the original order of x, so that x[i] is in group g[i]. Note that split(x, g) creates a list of length(d) groups (see the usage sketch after the code).
# test input
set.seed(123)
x <- sample(20, 20)
d <- c(.2, .3, .5) # assume in increasing order
o <- order(x)
b <- findInterval(cumsum(d) * sum(x), cumsum(x[o]))
g <- rep(seq_along(d), diff(c(0, b)))[order(o)]
# check distribution of result
tapply(x, g, sum) / sum(x)
## 1 2 3
## 0.1714286 0.3285714 0.5000000
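As a usage sketch, the grouping vector g can be fed straight to split to recover the subsets themselves; their sums give the same proportions that the tapply call checked above:
groups <- split(x, g)         # list of length(d) subsets of x
sapply(groups, sum) / sum(x)  # same proportions as the tapply check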

How to extract value of CPU idle from sar command using AWK

From the output of a sar command, I want to extract only the lines in which the %iowait value is higher than a set threshold.
I tried using AWK but somehow I'm not able to perform the action.
sar -u -f sa12 | sed 's/\./,/g' | awk -f" " '{ if ( $7 -gt 0 ) print $0 }'
I tried to substitute the . with , and using -gt but still no joy.
Can someone suggest a solution?
If we need the entire line of sar -u output where %iowait > 0.01, we can use this:
Command
sar -u | grep -v "CPU" | awk '$7 > 0.01'
Output will be similar to
03:40:01 AM all 3.16 0.00 0.05 0.11 0.00 96.68
04:40:01 PM all 0.19 0.00 0.05 0.02 0.00 99.74
If we wish to output specific fields, say only %iowait, we can use the command given below.
Command to output specific field(s):
sar -u | grep -v "CPU" | awk '{if($7 > 0.01 ) print $7}'
Output will be
0.11
0.02
Note: grep -v is used just to remove the heading lines from the output.
Hope this helps,
My sar -u gives several lines similar to the following:
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:05:01 AM all 0.29 0.00 0.30 0.01 0.00 99.40
12:15:01 AM all 0.33 0.00 0.34 0.00 0.00 99.32
12:25:01 AM all 0.33 0.00 0.30 0.01 0.00 99.36
12:35:01 AM all 0.31 0.00 0.29 0.01 0.00 99.39
12:45:01 AM all 0.33 0.00 0.32 0.01 0.00 99.35
12:55:01 AM all 0.32 0.00 0.30 0.00 0.00 99.38
01:05:01 AM all 0.32 0.00 0.28 0.00 0.00 99.39
01:15:01 AM all 0.33 0.00 0.30 0.01 0.00 99.37
01:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.39
01:35:01 AM all 0.31 0.00 0.33 0.00 0.00 99.36
01:45:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
01:55:01 AM all 0.31 0.00 0.30 0.00 0.00 99.38
02:05:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
02:15:01 AM all 0.32 0.00 0.30 0.01 0.00 99.38
02:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.38
02:35:01 AM all 0.33 0.00 0.33 0.00 0.00 99.33
02:45:01 AM all 0.35 0.00 0.32 0.01 0.00 99.32
02:55:01 AM all 0.28 0.00 0.30 0.00 0.00 99.42
03:05:01 AM all 0.32 0.00 0.31 0.00 0.00 99.37
03:15:01 AM all 0.34 0.00 0.30 0.01 0.00 99.36
03:25:01 AM all 0.32 0.00 0.29 0.01 0.00 99.38
03:35:01 AM all 0.33 0.00 0.26 0.00 0.00 99.40
03:45:01 AM all 0.34 0.00 0.29 0.00 0.00 99.36
03:55:01 AM all 0.30 0.00 0.28 0.01 0.00 99.41
04:05:01 AM all 0.32 0.00 0.30 0.01 0.00 99.37
04:15:01 AM all 0.37 0.00 0.31 0.01 0.00 99.32
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
To filter out those where %iowait is greater than, let's say, 0.01:
sar -u | awk '$7>0.01{print}'
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
05:15:01 AM all 0.34 0.00 0.32 0.02 0.00 99.32
06:35:01 AM all 0.33 0.22 1.23 4.48 0.00 93.74
06:45:01 AM all 0.16 0.00 0.12 0.02 0.00 99.71
10:35:01 AM all 0.22 0.00 0.13 0.02 0.00 99.63
12:15:01 PM all 0.42 0.00 0.16 0.03 0.00 99.40
01:45:01 PM all 0.17 0.00 0.11 0.02 0.00 99.71
04:05:01 PM all 0.15 0.00 0.12 0.03 0.00 99.70
04:15:01 PM all 0.42 0.00 0.23 0.10 0.00 99.25
Edit:
As correctly pointed out by @Ed Morton, the awk code can be shortened to simply awk '$7>0.01', since the default action is to print the current line.

Opening .mea file in R

I have downloaded a file with the extension .mea. It's climate data. I don't know how to import it into R; I don't even know how to open it in macOS. Here is what the first lines of the data look like.
IPCC Data Distribution Centre Results from model HADCM3 11-07-2002
Grid is 96 * 73 Month is Jan
HADCM A1F
Total precipitation (mm/day)
7008 format is (10F8.2) missing code is 9999.99
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
I did it the following way:
First I split the file into 12 small files, each containing one month's data (5 header lines plus 701 data lines, since 7008 values at 10 per line need 701 lines), using the command-line split utility:
split -l 706 filename newfilePrefix
Then I read each small file with the following:
readr::read_table(filename, col_names=FALSE, skip=5)
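To turn one chunk into the 96 * 73 grid, here is a minimal sketch in base R, assuming split's default chunk naming (newfilePrefixaa, newfilePrefixab, ...) and that the values are written 10 per line in grid order; the 9999.99 missing code comes from the file header, and the matrix orientation may need transposing depending on how the model writes the grid:
# read the 7008 values of the first monthly chunk, skipping the 5 header lines
vals <- scan("newfilePrefixaa", skip = 5)
vals[vals == 9999.99] <- NA                 # missing code, per the file header
grid <- matrix(vals, nrow = 96, ncol = 73)  # 96 * 73 = 7008 grid cells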

Extract Data from a File Unix

I have a file with space-separated columns from which I want to extract specific data. Below is the format of the file:
12:00:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 AM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
01:03:01 AM all 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
01:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
01:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 PM 0 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 PM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 PM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 PM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
Now from this file I want the rows whose time looks like 12:01:01 AM/PM, i.e. one per hour, and that have all in the CPU column.
So after extraction I want the data below, but I am not able to get it.
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
Please suggest how I can get that data in UNIX.
If you add the -E option to grep it allows you to look for "Extended Regular Expressions". One such expression is
"CPU|01:01"
which will allow you to find all lines containing the word "CPU" (such as your column heading line) and also any lines with "01:01" in them. It is called an "alternation" and uses the pipe symbol (|) to separate alternate sub-parts.
So, an answer would be"
grep -E "CPU|01:01 .*all" yourFile > newFile
Try running:
man grep
to get the manual (help) page.
awk to the rescue!
If you need field-specific matches, awk is the right tool.
$ awk '$3=="all" && $1~/01:01$/' file
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
You can extract the header as well with this:
$ awk 'NR==1 || $3=="all" && $1~/01:01$/' file
