How to extract value of CPU idle from sar command using AWK - unix

From the output of a sar command, I want to extract only the lines in which the %iowait value is higher than a set threshold.
I tried using AWK but somehow I'm not able to perform the action.
sar -u -f sa12 | sed 's/\./,/g' | awk -f" " '{ if ( $7 -gt 0 ) print $0 }'
I tried substituting the . with , and using -gt, but still no joy.
Can someone suggest a solution?

If you need the entire line of sar -u output where %iowait > 0.01, you can use this:
Command
sar -u | grep -v "CPU" | awk '$7 > 0.01'
Output will be similar to
03:40:01 AM all 3.16 0.00 0.05 0.11 0.00 96.68
04:40:01 PM all 0.19 0.00 0.05 0.02 0.00 99.74
If you wish to output specific fields, say only %iowait, you can use the command given below.
Command to output specific field(s):
sar -u | grep -v "CPU" | awk '{if($7 > 0.01 ) print $7}'
Output will be
0.11
0.02
Note: grep -v is used just to remove the heading lines from the output.
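If the threshold should be configurable instead of hard-coded, one possible sketch, assuming the same layout as above where %iowait is the 7th field, is to pass it to awk with -v (the variable name limit and the value 0.05 are just examples; add the -f file option to sar if you are reading a saved file, as in the question):
limit=0.05
LC_ALL=C sar -u | grep -v "CPU" | awk -v t="$limit" '$7+0 > t'
Here $7+0 forces a numeric comparison, and LC_ALL=C should make sar print decimal points rather than commas, which avoids the sed substitution attempted in the question.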
Hope this helps,

My sar -u gives several lines similar to the following:
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
12:00:01 AM CPU %user %nice %system %iowait %steal %idle
12:05:01 AM all 0.29 0.00 0.30 0.01 0.00 99.40
12:15:01 AM all 0.33 0.00 0.34 0.00 0.00 99.32
12:25:01 AM all 0.33 0.00 0.30 0.01 0.00 99.36
12:35:01 AM all 0.31 0.00 0.29 0.01 0.00 99.39
12:45:01 AM all 0.33 0.00 0.32 0.01 0.00 99.35
12:55:01 AM all 0.32 0.00 0.30 0.00 0.00 99.38
01:05:01 AM all 0.32 0.00 0.28 0.00 0.00 99.39
01:15:01 AM all 0.33 0.00 0.30 0.01 0.00 99.37
01:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.39
01:35:01 AM all 0.31 0.00 0.33 0.00 0.00 99.36
01:45:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
01:55:01 AM all 0.31 0.00 0.30 0.00 0.00 99.38
02:05:01 AM all 0.31 0.00 0.28 0.01 0.00 99.40
02:15:01 AM all 0.32 0.00 0.30 0.01 0.00 99.38
02:25:01 AM all 0.31 0.00 0.30 0.01 0.00 99.38
02:35:01 AM all 0.33 0.00 0.33 0.00 0.00 99.33
02:45:01 AM all 0.35 0.00 0.32 0.01 0.00 99.32
02:55:01 AM all 0.28 0.00 0.30 0.00 0.00 99.42
03:05:01 AM all 0.32 0.00 0.31 0.00 0.00 99.37
03:15:01 AM all 0.34 0.00 0.30 0.01 0.00 99.36
03:25:01 AM all 0.32 0.00 0.29 0.01 0.00 99.38
03:35:01 AM all 0.33 0.00 0.26 0.00 0.00 99.40
03:45:01 AM all 0.34 0.00 0.29 0.00 0.00 99.36
03:55:01 AM all 0.30 0.00 0.28 0.01 0.00 99.41
04:05:01 AM all 0.32 0.00 0.30 0.01 0.00 99.37
04:15:01 AM all 0.37 0.00 0.31 0.01 0.00 99.32
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
To filter out those where %iowait is greater than, let's say, 0.01:
sar -u | awk '$7>0.01{print}'
Linux 4.4.0-127-generic (v1) 06/12/2018 _x86_64_ (1 CPU)
04:25:01 AM all 1.78 2.04 0.59 0.05 0.00 95.55
05:15:01 AM all 0.34 0.00 0.32 0.02 0.00 99.32
06:35:01 AM all 0.33 0.22 1.23 4.48 0.00 93.74
06:45:01 AM all 0.16 0.00 0.12 0.02 0.00 99.71
10:35:01 AM all 0.22 0.00 0.13 0.02 0.00 99.63
12:15:01 PM all 0.42 0.00 0.16 0.03 0.00 99.40
01:45:01 PM all 0.17 0.00 0.11 0.02 0.00 99.71
04:05:01 PM all 0.15 0.00 0.12 0.03 0.00 99.70
04:15:01 PM all 0.42 0.00 0.23 0.10 0.00 99.25
Edit:
As correctly pointed out by @Ed Morton, the awk code can be shortened to simply awk '$7>0.01', since the default action is to print the current line.
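Note that the kernel banner ("Linux 4.4.0-127-generic ...") slipped into the filtered output above because its 7th field is not numeric. A sketch that keeps only data rows, assuming every data row starts with a timestamp, is:
sar -u | awk '$1 ~ /^[0-9]/ && $7+0 > 0.01'
The leading-digit test drops the banner and any "Average:" lines, while $7+0 forces a numeric comparison so the header row (whose 7th field is the literal %iowait) is filtered out as well.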

Related

Minimizing weighted sum of matrix while ensuring distribution of outcomes in R

I am struggling with an optimization problem involving a simple matrix operation. The task is the following: I have a square matrix D containing "damage multipliers" stemming from a production reduction in producing countries (columns) and felt by "receiving" countries (rows).
AUT BEL BGR CYP CZE DEU DNK ESP EST FIN FRA GBR GRC HRV HUN IRL ITA LTU LUX LVA MLT NLD POL PRT ROU SVK SVN SWE
AUT 1.48 0.15 0.18 0.08 0.19 0.22 0.01 0.01 0.02 0.02 0.05 0.01 0.01 0.02 0.14 0.00 0.02 0.03 0.02 0.02 0.00 0.04 0.10 0.09 0.11 0.16 0.17 0.11
BEL 0.03 2.70 0.34 0.09 0.05 0.03 0.02 0.01 0.04 0.09 0.09 0.02 0.01 0.01 0.03 0.01 0.01 0.03 0.08 0.02 0.00 0.04 0.03 0.37 0.09 0.07 0.15 0.29
BGR 0.01 0.02 9.81 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.06 0.00 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 0.02 0.12 0.01 0.00 0.01
CYP 0.00 0.01 0.01 9.87 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
CZE 0.19 0.11 0.08 0.07 4.14 0.27 0.01 0.00 0.01 0.01 0.03 0.01 0.00 0.00 0.05 0.00 0.03 0.05 0.01 0.01 0.00 0.02 0.32 0.07 0.03 2.57 0.05 0.05
DEU 0.29 2.54 0.27 0.15 0.19 1.71 0.10 0.04 0.06 0.22 0.22 0.09 0.03 0.02 0.11 0.03 0.08 0.12 0.08 0.07 0.00 0.28 0.28 0.55 0.25 0.26 0.11 1.09
DNK 0.01 0.09 0.02 0.09 0.01 0.14 3.43 0.00 0.02 0.12 0.02 0.02 0.00 0.00 0.01 0.00 0.01 0.02 0.01 0.02 0.00 0.01 0.03 0.05 0.01 0.01 0.01 1.39
ESP 0.02 0.26 0.06 0.05 0.02 0.03 0.02 2.72 0.45 0.04 0.22 0.05 0.04 0.01 0.01 0.05 0.06 0.02 0.01 0.01 0.00 0.02 0.03 1.28 0.05 0.02 0.01 0.32
EST 0.00 0.01 0.00 0.03 0.00 0.00 0.00 0.00 5.03 0.17 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05 0.00 0.04 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.05
FIN 0.01 0.09 0.02 0.03 0.01 0.01 0.06 0.00 0.21 5.48 0.01 0.01 0.00 0.00 0.00 0.01 0.00 0.02 0.01 0.02 0.00 0.01 0.02 0.05 0.01 0.01 0.00 1.99
FRA 0.04 0.89 0.11 0.13 0.03 0.08 0.03 0.18 0.04 0.08 5.19 0.05 0.02 0.01 0.03 0.05 0.06 0.06 0.03 0.03 0.00 0.14 0.04 0.54 0.08 0.04 0.03 0.79
GBR 0.03 0.80 0.09 2.13 0.03 0.05 0.12 0.08 0.03 0.30 0.15 3.13 0.02 0.01 0.02 0.41 0.02 0.12 0.02 0.05 0.00 0.19 0.06 0.36 0.05 0.04 0.02 2.28
GRC 0.00 0.04 0.14 0.26 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.00 2.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.03 0.00 0.00 0.02
HRV 0.19 0.01 0.01 0.03 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.25 0.03 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.09 0.01
HUN 0.29 0.07 0.08 0.17 0.30 0.08 0.02 0.00 0.01 0.01 0.06 0.00 0.00 0.01 4.83 0.00 0.01 0.09 0.01 0.05 0.00 0.01 0.05 0.04 0.13 0.23 0.06 0.04
IRL 0.00 0.03 0.01 0.06 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.03 0.00 0.00 0.00 1.80 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.02 0.00 0.00 0.00 0.03
ITA 0.76 0.46 0.40 0.20 0.06 0.24 0.02 0.18 0.04 0.05 0.19 0.03 0.14 0.06 0.06 0.06 4.16 0.05 0.02 0.07 0.00 0.14 0.05 0.37 0.15 0.08 0.21 0.34
LTU 0.00 0.02 0.01 0.01 0.00 0.00 0.00 0.00 0.02 0.03 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.18 0.00 0.03 0.00 0.00 0.01 0.01 0.00 0.00 0.00 0.02
LUX 0.00 0.14 0.00 0.02 0.00 0.02 0.00 0.00 0.00 0.00 0.02 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.04 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.01
LVA 0.00 0.01 0.00 0.03 0.00 0.00 0.00 0.00 0.05 0.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.16 0.00 6.77 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.03
MLT 0.00 0.00 0.00 0.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.67 0.00 0.00 0.00 0.00 0.00 0.00 0.01
NLD 0.02 0.86 0.07 0.08 0.02 0.04 0.03 0.01 0.03 0.11 0.08 0.03 0.01 0.01 0.02 0.05 0.01 0.07 0.03 0.02 0.00 2.03 0.03 0.23 0.04 0.03 0.02 0.43
POL 0.02 0.09 0.03 0.19 0.16 0.13 0.01 0.01 0.01 0.02 0.06 0.01 0.00 0.00 0.02 0.00 0.01 0.33 0.00 0.03 0.00 0.01 2.18 0.05 0.02 0.11 0.01 0.11
PRT 0.00 0.05 0.01 0.10 0.00 0.03 0.01 0.07 0.02 0.01 0.04 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.53 0.01 0.00 0.00 0.07
ROU 0.04 0.06 0.89 0.13 0.04 0.02 0.00 0.00 0.00 0.01 0.01 0.00 0.01 0.00 0.31 0.00 0.02 0.02 0.00 0.01 0.00 0.00 0.03 0.04 10.52 0.06 0.01 0.03
SVK 0.23 0.04 0.02 0.08 1.12 0.60 0.00 0.00 0.00 0.01 0.32 0.00 0.00 0.00 0.11 0.00 0.00 0.07 0.00 0.02 0.00 0.00 0.34 0.03 0.03 7.06 0.02 0.03
SVN 0.13 0.01 0.02 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.01 0.00 0.05 0.01 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.01 6.77 0.01
SWE 0.02 0.20 0.05 0.08 0.02 0.03 0.26 0.01 0.12 0.90 0.04 0.03 0.00 0.01 0.01 0.01 0.01 0.03 0.01 0.06 0.00 0.02 0.05 0.12 0.03 0.02 0.02 8.05
The values represent the effect of a unitary shock in production: ie. if country AUT reduces production by one unit, the damage felt in country DEU is 0.29. Hence, the matrix can be seen as a symmetric network of production effects between countries.
My goal is to find the optimal weights of a weighted unitary shock (i.e. weighting the columns so that the total reduction of production summed over all countries equals 1) that:
ensure a certain distribution of damage across receiving (row) countries (i.e. the row sums), let's say an equal distribution,
while at the same time minimizing the damage to the overall economic system.
I've tried solving it as a simple non-linear optimization problem with equality constraints, using the package Rsolnp:
# objective function to be minimized (global damage)
damage <- function(weights) {
  D_weighted <- t(t(D) * weights)
  return(sum(D_weighted))
}
# constraints (combined in one function):
constr <- function(weights) {
  # constraint 1: sum of weights needs to be 1
  c1 <- sum(weights)
  # constraint 2: equal distribution in damage outcome
  D_weighted <- t(t(D) * weights)
  damage_per_country <- rowSums(D_weighted) / sum(D_weighted)
  c2 <- damage_per_country / sum(D_weighted)
  return(c(c1, c2))
}
# target distribution of damage outcome (for example: equal distribution)
targ_dist <- c(rep(1/ncol(D), ncol(D)))
# starting weights (start with the same production reduction in every country)
startweights <- rep(1/ncol(D), ncol(D))
# run optimization with Rsolnp
opt_weights <- solnp(pars = startweights, fun = damage, eqfun = constr, eqB = c(1, targ_dist),
                     LB = rep(0, ncol(D)), UB = rep(1, ncol(D)),
                     control = list(outer.iter = 1000, trace = 0, tol = 0.001))
but it doesn't converge and returns a warning message:
"The linearized problem has no feasible solution. The problem may not be feasible".
Changing the tolerance doesn't solve the problem. It might be that this solver is not suited for this kind of problem or I need to reformulate the problem completely. I'd be thankful for any help!

Correct displaced columns in a table

A bit of context: the file shown below is generated by a VLSI tool. It consists of timing delays caused by various components in a circuit. When I generate this "timing file", the fields are sometimes not properly organised.
The generated file:
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
What I want:
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
The displacement for something4 and something6 keeps recurring throughout the table in a particular order (say every 2 lines or every line). Only something2 has a different displacement, whereas all the other displacements follow something4/something6.
So far I have no clue how to proceed with this. Any way to fix this?
$ awk '{gsub(/ {6}/,","); gsub(/ +/,",")} 1' file | column -s, -t
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
or:
$ awk 'BEGIN{FS=OFS="\t"} {gsub(/ {6}/,FS); gsub(/ +/,FS); $1=$1} 1' file
something1 0.20 0.00 0.00
something2 6 12.95
something3 0.00 0.08 0.00 0.00 0.07
something4 6 8.70
something5 0.00 0.03 0.00 0.00 0.05
something6 5 4.70
Another way with awk
awk 'NF>3{$2=OFS$2}NF==4{$2=OFS$2}{$1=$1}1' OFS='\t' infile
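For the tab-separated variants, a quick sketch to verify that every line ended up with the expected number of fields (fixed.txt is a hypothetical name for the cleaned output):
awk -F'\t' '{print NF}' fixed.txt | sort | uniq -c
This prints how many lines have each field count, so any remaining displacement shows up immediately.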

Extract Data from a File Unix

I have a file with space-separated columns from which I want to extract specific data. Below is the format of the file:
12:00:01 AM CPU %usr %nice %sys %iowait %steal %irq %soft %guest %idle
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 AM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:02:01 AM all 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
01:03:01 AM all 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
01:01:01 AM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
01:01:01 AM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:02:01 PM 0 93.42 0.00 0.53 0.00 0.00 0.00 0.10 0.00 5.95
12:03:01 PM 1 88.62 0.00 1.71 0.00 0.00 0.00 0.71 0.00 8.96
12:01:01 PM 2 92.56 0.00 0.70 0.00 0.00 0.00 1.17 0.00 5.58
12:01:01 PM 3 86.90 0.00 1.57 0.00 0.00 0.00 0.55 0.00 10.99
Now, from this file I want the rows whose time is like 12:01:01 AM/PM (i.e. one sample per hour) and that have "all" in the CPU column.
So after extraction I want the data below, but I am not able to get it.
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
Please suggest how I can get that data in UNIX.
If you add the -E option to grep it allows you to look for "Extended Regular Expressions". One such expression is
"CPU|01:01"
which will allow you to find all lines containing the word "CPU" (such as your column heading line) and also any lines with "01:01" in them. It is called an "alternation" and uses the pipe symbol (|) to separate alternate sub-parts.
So, an answer would be:
grep -E "CPU|01:01 .*all" yourFile > newFile
Try running:
man grep
to get the manual (help) page.
awk to the rescue!
If you need field-specific matches, awk is the right tool.
$ awk '$3=="all" && $1~/01:01$/' file
12:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
01:01:01 AM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
12:01:01 PM all 78.13 0.00 0.98 0.00 0.00 0.00 0.56 0.00 20.33
You can extract the header as well with this:
$ awk 'NR==1 || $3=="all" && $1~/01:01$/' file

corr.test arguments imply differing number of rows

I have seen this error multiple times in different projects and I was wondering if there is a way to tell which line caused the error in general?
My specific case:
http://archive.ics.uci.edu/ml/machine-learning-databases/00275/
#using the bike.csv
data<-read.csv("PATH_HERE\\Bike-Sharing-Dataset\\day.csv",header=TRUE)
require(psych)
corr.test(data)
data<-data[,c("atemp","casual","cnt","holiday","hum","mnth","registered",
"season","temp","weathersit","weekday","windspeed","workingday","yr")]
data[data=='']<-NA
#View(data)
require(psych)
cors<-corr.test(data)
returns the error:
Error in data.frame(lower = lower, r = r[lower.tri(r)], upper = upper, :
arguments imply differing number of rows: 0, 91
It works for me
> #using the bike.csv
> data <- read.csv("day.csv",header=TRUE)
> require(psych)
> corr.test(data)
Error in cor(x, use = use, method = method) : 'x' must be numeric
> data <- data[,c("atemp","casual","cnt","holiday","hum","mnth","registered",
+ "season","temp","weathersit","weekday","windspeed","workingday","yr")]
> data[data==''] <- NA
> #View(data)
>
> require(psych)
> cors <- corr.test(data)
> cors
Call:corr.test(x = data)
Correlation matrix
atemp casual cnt holiday hum mnth registered season temp
atemp 1.00 0.54 0.63 -0.03 0.14 0.23 0.54 0.34 0.99
casual 0.54 1.00 0.67 0.05 -0.08 0.12 0.40 0.21 0.54
cnt 0.63 0.67 1.00 -0.07 -0.10 0.28 0.95 0.41 0.63
holiday -0.03 0.05 -0.07 1.00 -0.02 0.02 -0.11 -0.01 -0.03
hum 0.14 -0.08 -0.10 -0.02 1.00 0.22 -0.09 0.21 0.13
mnth 0.23 0.12 0.28 0.02 0.22 1.00 0.29 0.83 0.22
registered 0.54 0.40 0.95 -0.11 -0.09 0.29 1.00 0.41 0.54
season 0.34 0.21 0.41 -0.01 0.21 0.83 0.41 1.00 0.33
temp 0.99 0.54 0.63 -0.03 0.13 0.22 0.54 0.33 1.00
weathersit -0.12 -0.25 -0.30 -0.03 0.59 0.04 -0.26 0.02 -0.12
weekday -0.01 0.06 0.07 -0.10 -0.05 0.01 0.06 0.00 0.00
windspeed -0.18 -0.17 -0.23 0.01 -0.25 -0.21 -0.22 -0.23 -0.16
workingday 0.05 -0.52 0.06 -0.25 0.02 -0.01 0.30 0.01 0.05
yr 0.05 0.25 0.57 0.01 -0.11 0.00 0.59 0.00 0.05
weathersit weekday windspeed workingday yr
atemp -0.12 -0.01 -0.18 0.05 0.05
casual -0.25 0.06 -0.17 -0.52 0.25
cnt -0.30 0.07 -0.23 0.06 0.57
holiday -0.03 -0.10 0.01 -0.25 0.01
hum 0.59 -0.05 -0.25 0.02 -0.11
mnth 0.04 0.01 -0.21 -0.01 0.00
registered -0.26 0.06 -0.22 0.30 0.59
season 0.02 0.00 -0.23 0.01 0.00
temp -0.12 0.00 -0.16 0.05 0.05
weathersit 1.00 0.03 0.04 0.06 -0.05
weekday 0.03 1.00 0.01 0.04 -0.01
windspeed 0.04 0.01 1.00 -0.02 -0.01
workingday 0.06 0.04 -0.02 1.00 0.00
yr -0.05 -0.01 -0.01 0.00 1.00
Sample Size
[1] 731
Probability values (Entries above the diagonal are adjusted for multiple tests.)
atemp casual cnt holiday hum mnth registered season temp
atemp 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.00
casual 0.00 0.00 0.00 1.00 1.00 0.04 0.00 0.00 0.00
cnt 0.00 0.00 0.00 1.00 0.28 0.00 0.00 0.00 0.00
holiday 0.38 0.14 0.06 0.00 1.00 1.00 0.15 1.00 1.00
hum 0.00 0.04 0.01 0.67 0.00 0.00 0.58 0.00 0.03
mnth 0.00 0.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00
registered 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00
season 0.00 0.00 0.00 0.78 0.00 0.00 0.00 0.00 0.00
temp 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00
weathersit 0.00 0.00 0.00 0.35 0.00 0.24 0.00 0.60 0.00
weekday 0.84 0.11 0.07 0.01 0.16 0.80 0.12 0.93 1.00
windspeed 0.00 0.00 0.00 0.87 0.00 0.00 0.00 0.00 0.00
workingday 0.16 0.00 0.10 0.00 0.51 0.87 0.00 0.74 0.15
yr 0.21 0.00 0.00 0.83 0.00 0.96 0.00 0.96 0.20
weathersit weekday windspeed workingday yr
atemp 0.05 1.00 0.00 1.00 1.00
casual 0.00 1.00 0.00 0.00 0.00
cnt 0.00 1.00 0.00 1.00 0.00
holiday 1.00 0.25 1.00 0.00 1.00
hum 0.00 1.00 0.00 1.00 0.13
mnth 1.00 1.00 0.00 1.00 1.00
registered 0.00 1.00 0.00 0.00 0.00
season 1.00 1.00 0.00 1.00 1.00
temp 0.05 1.00 0.00 1.00 1.00
weathersit 0.00 1.00 1.00 1.00 1.00
weekday 0.40 0.00 1.00 1.00 1.00
windspeed 0.29 0.70 0.00 1.00 1.00
workingday 0.10 0.33 0.61 0.00 1.00
yr 0.19 0.88 0.75 0.96 0.00
To see confidence intervals of the correlations, print with the short=FALSE option
>
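As a side note, the "'x' must be numeric" error above comes from passing non-numeric columns (such as the dteday date column in day.csv) to corr.test. Instead of listing the numeric column names by hand, a sketch that keeps only the numeric columns programmatically would be:
day <- read.csv("Bike-Sharing-Dataset/day.csv")
num <- day[sapply(day, is.numeric)]  # drops non-numeric columns such as dteday
require(psych)
corr.test(num)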
It works for me:
rm(list=ls())
# http://archive.ics.uci.edu/ml/machine-learning-databases/00275/
#using the bike.csv
day <- read.csv("Bike-Sharing-Dataset//day.csv")
require(psych)
day<-day[,c("atemp","casual","cnt","holiday","hum","mnth","registered",
"season","temp","weathersit","weekday","windspeed","workingday","yr")]
day[day=='']<-NA
require(psych)
corr.test(day)
# corr.test(day)
# Call:corr.test(x = day)
# Correlation matrix
# atemp casual cnt holiday hum mnth registered season temp weathersit weekday windspeed workingday yr
# atemp 1.00 0.54 0.63 -0.03 0.14 0.23 0.54 0.34 0.99 -0.12 -0.01 -0.18 0.05 0.05
# casual 0.54 1.00 0.67 0.05 -0.08 0.12 0.40 0.21 0.54 -0.25 0.06 -0.17 -0.52 0.25
# cnt 0.63 0.67 1.00 -0.07 -0.10 0.28 0.95 0.41 0.63 -0.30 0.07 -0.23 0.06 0.57
# holiday -0.03 0.05 -0.07 1.00 -0.02 0.02 -0.11 -0.01 -0.03 -0.03 -0.10 0.01 -0.25 0.01
# hum 0.14 -0.08 -0.10 -0.02 1.00 0.22 -0.09 0.21 0.13 0.59 -0.05 -0.25 0.02 -0.11
# mnth 0.23 0.12 0.28 0.02 0.22 1.00 0.29 0.83 0.22 0.04 0.01 -0.21 -0.01 0.00
# registered 0.54 0.40 0.95 -0.11 -0.09 0.29 1.00 0.41 0.54 -0.26 0.06 -0.22 0.30 0.59
# season 0.34 0.21 0.41 -0.01 0.21 0.83 0.41 1.00 0.33 0.02 0.00 -0.23 0.01 0.00
# temp 0.99 0.54 0.63 -0.03 0.13 0.22 0.54 0.33 1.00 -0.12 0.00 -0.16 0.05 0.05
# weathersit -0.12 -0.25 -0.30 -0.03 0.59 0.04 -0.26 0.02 -0.12 1.00 0.03 0.04 0.06 -0.05
# weekday -0.01 0.06 0.07 -0.10 -0.05 0.01 0.06 0.00 0.00 0.03 1.00 0.01 0.04 -0.01
# windspeed -0.18 -0.17 -0.23 0.01 -0.25 -0.21 -0.22 -0.23 -0.16 0.04 0.01 1.00 -0.02 -0.01
# workingday 0.05 -0.52 0.06 -0.25 0.02 -0.01 0.30 0.01 0.05 0.06 0.04 -0.02 1.00 0.00
# yr 0.05 0.25 0.57 0.01 -0.11 0.00 0.59 0.00 0.05 -0.05 -0.01 -0.01 0.00 1.00
# Sample Size
# [1] 731
# Probability values (Entries above the diagonal are adjusted for multiple tests.)
# atemp casual cnt holiday hum mnth registered season temp weathersit weekday windspeed workingday yr
# atemp 0.00 0.00 0.00 1.00 0.01 0.00 0.00 0.00 0.00 0.05 1.00 0.00 1.00 1.00
# casual 0.00 0.00 0.00 1.00 1.00 0.04 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
# cnt 0.00 0.00 0.00 1.00 0.28 0.00 0.00 0.00 0.00 0.00 1.00 0.00 1.00 0.00
# holiday 0.38 0.14 0.06 0.00 1.00 1.00 0.15 1.00 1.00 1.00 0.25 1.00 0.00 1.00
# hum 0.00 0.04 0.01 0.67 0.00 0.00 0.58 0.00 0.03 0.00 1.00 0.00 1.00 0.13
# mnth 0.00 0.00 0.00 0.60 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00
# registered 0.00 0.00 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00
# season 0.00 0.00 0.00 0.78 0.00 0.00 0.00 0.00 0.00 1.00 1.00 0.00 1.00 1.00
# temp 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.05 1.00 0.00 1.00 1.00
# weathersit 0.00 0.00 0.00 0.35 0.00 0.24 0.00 0.60 0.00 0.00 1.00 1.00 1.00 1.00
# weekday 0.84 0.11 0.07 0.01 0.16 0.80 0.12 0.93 1.00 0.40 0.00 1.00 1.00 1.00
# windspeed 0.00 0.00 0.00 0.87 0.00 0.00 0.00 0.00 0.00 0.29 0.70 0.00 1.00 1.00
# workingday 0.16 0.00 0.10 0.00 0.51 0.87 0.00 0.74 0.15 0.10 0.33 0.61 0.00 1.00
# yr 0.21 0.00 0.00 0.83 0.00 0.96 0.00 0.96 0.20 0.19 0.88 0.75 0.96 0.00
#
# To see confidence intervals of the correlations, print with the short=FALSE option
cheers

Principal component analysis (PCA) in R: why are the scores not orthogonal? (using Psych package)

I ran PCA in R using the principal() function in the "psych" package. I passed the argument rotate = "none", which asks for an orthogonal (unrotated) solution. From what I understand, the scores of PC1 and PC2 should be orthogonal (i.e. there should be zero correlation between (raw data)×(loading of PC1) and (raw data)×(loading of PC2)). However, I got 90% correlation. Why is that?
> #load the package
> library(psych)
> #calculate the correlation matrix
> corMat <- cor(data)
> #run PCA
> pca.results <- principal(r = corMat, rotate = "none", nfactors = 20, covar = FALSE, scores = TRUE)
> pca.results
Principal Components Analysis
Call: principal(r = corMat, nfactors = 20, rotate = "none", covar = FALSE,
scores = TRUE)
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
payroll.chg -0.30 0.85 0.21 0.35 -0.03 0.02 0.07 -0.11 -0.02
HH.empl.chg -0.26 0.62 0.64 -0.35 0.01 -0.06 0.06 0.00 0.01
pop.empl.ratio -0.92 -0.34 0.13 0.04 0.06 -0.03 -0.04 0.03 -0.04
u.rate 0.99 0.10 0.02 0.04 0.01 0.04 0.04 0.04 0.01
median.duration.unempl 0.88 0.44 -0.02 0.02 -0.04 0.06 0.02 0.13 -0.05
LT.unempl.unempl.ratio 0.86 0.49 -0.04 0.01 -0.07 0.02 0.00 0.08 -0.02
U4 0.99 0.13 0.01 0.03 0.01 0.04 0.04 0.05 0.01
U6 0.98 0.13 -0.05 -0.02 0.00 0.06 0.04 0.03 0.04
vacancy.rate -0.87 0.35 -0.18 -0.11 -0.01 0.22 0.10 0.03 -0.01
hires.rate -0.92 0.08 0.24 0.21 -0.16 0.06 0.00 0.05 0.09
unemployed.to.employed 0.89 0.17 0.21 -0.02 0.05 0.24 -0.25 -0.05 0.00
Layoff.rate..JOLT. 0.23 -0.86 0.19 -0.03 -0.40 0.09 0.03 -0.02 -0.05
Exhaustion.rate 0.95 0.19 0.14 0.14 0.00 -0.07 0.01 0.06 -0.04
Quits.rate..JOLT. -0.98 0.01 0.04 0.04 0.01 0.02 -0.06 0.10 0.13
participation.rate -0.67 -0.61 0.31 0.14 0.16 -0.01 -0.03 0.11 -0.08
insured.u.rate 0.88 -0.40 0.17 0.08 0.12 0.05 0.09 -0.03 0.02
Initial.jobless.claims 0.78 -0.60 0.04 -0.06 0.06 0.05 0.07 0.02 0.07
Continuing.claims 0.86 -0.44 0.15 0.06 0.14 0.08 0.09 -0.05 0.03
Jobs.plentiful.jobs.hardtoget -0.98 0.00 -0.02 0.01 0.08 0.13 0.04 -0.02 -0.04
vacancy.unempl.ratio -0.97 0.04 -0.05 -0.03 0.08 0.18 0.07 0.03 -0.03
PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
payroll.chg -0.06 0.02 -0.02 0.00 0.03 0.00 0.00 0.00 0.00
HH.empl.chg 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
pop.empl.ratio -0.02 0.00 -0.01 0.01 0.00 0.00 0.00 0.01 0.01
u.rate -0.01 0.00 0.03 -0.03 0.02 0.00 0.00 -0.01 -0.01
median.duration.unempl 0.02 0.05 -0.06 -0.01 -0.03 0.01 -0.02 0.00 0.00
LT.unempl.unempl.ratio 0.01 0.02 -0.01 0.02 0.00 0.00 0.05 0.00 0.00
U4 -0.01 0.00 0.04 -0.02 0.02 0.00 -0.01 -0.01 0.01
U6 -0.01 0.01 0.03 -0.03 0.02 -0.02 0.00 0.03 0.00
vacancy.rate -0.08 -0.06 0.01 0.01 -0.01 0.04 0.00 0.00 0.00
hires.rate 0.01 0.00 0.04 0.00 -0.06 -0.01 0.00 0.00 0.00
unemployed.to.employed -0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00
Layoff.rate..JOLT. 0.01 0.00 -0.01 -0.01 0.03 0.00 0.00 0.00 0.00
Exhaustion.rate 0.05 -0.07 0.02 0.06 0.01 -0.01 -0.02 0.00 0.00
Quits.rate..JOLT. 0.04 -0.01 -0.04 0.00 0.05 0.02 0.00 0.00 0.00
participation.rate -0.06 0.00 0.02 -0.02 0.01 0.01 0.01 0.00 0.00
insured.u.rate 0.04 -0.02 -0.02 0.00 -0.02 0.02 0.01 0.00 0.02
Initial.jobless.claims -0.09 0.06 0.00 0.06 0.01 -0.01 -0.01 0.00 0.00
Continuing.claims 0.05 -0.02 -0.02 -0.02 -0.01 0.01 0.01 0.01 -0.02
Jobs.plentiful.jobs.hardtoget 0.11 0.07 0.05 0.02 0.01 0.02 0.00 0.00 0.00
vacancy.unempl.ratio 0.03 -0.01 -0.03 0.00 0.01 -0.06 0.00 0.00 0.00
PC19 PC20 h2 u2
payroll.chg 0.00 0.00 1 5.6e-16
HH.empl.chg 0.00 0.00 1 -2.9e-15
pop.empl.ratio 0.01 0.01 1 -1.6e-15
u.rate -0.01 0.01 1 1.1e-16
median.duration.unempl 0.00 0.00 1 -4.4e-16
LT.unempl.unempl.ratio 0.00 0.00 1 -6.7e-16
U4 0.01 0.00 1 -4.4e-16
U6 0.00 0.00 1 2.2e-16
vacancy.rate 0.00 0.00 1 0.0e+00
hires.rate 0.00 0.00 1 4.4e-16
unemployed.to.employed 0.00 0.00 1 -2.2e-16
Layoff.rate..JOLT. 0.00 0.00 1 -2.2e-15
Exhaustion.rate 0.00 0.00 1 -4.4e-16
Quits.rate..JOLT. 0.00 0.00 1 1.1e-16
participation.rate 0.00 -0.01 1 5.6e-16
insured.u.rate -0.01 0.00 1 -6.7e-16
Initial.jobless.claims 0.00 0.00 1 -2.0e-15
Continuing.claims 0.01 0.00 1 -6.7e-16
Jobs.plentiful.jobs.hardtoget 0.00 0.00 1 2.2e-16
vacancy.unempl.ratio 0.00 0.00 1 -2.2e-16
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
SS loadings 14.23 3.73 0.83 0.37 0.28 0.20 0.12 0.07 0.05 0.05 0.02 0.02
Proportion Var 0.71 0.19 0.04 0.02 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00
Cumulative Var 0.71 0.90 0.94 0.96 0.97 0.98 0.99 0.99 0.99 1.00 1.00 1.00
Proportion Explained 0.71 0.19 0.04 0.02 0.01 0.01 0.01 0.00 0.00 0.00 0.00 0.00
Cumulative Proportion 0.71 0.90 0.94 0.96 0.97 0.98 0.99 0.99 0.99 1.00 1.00 1.00
PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20
SS loadings 0.01 0.01 0.01 0 0 0 0 0
Proportion Var 0.00 0.00 0.00 0 0 0 0 0
Cumulative Var 1.00 1.00 1.00 1 1 1 1 1
Proportion Explained 0.00 0.00 0.00 0 0 0 0 0
Cumulative Proportion 1.00 1.00 1.00 1 1 1 1 1
Test of the hypothesis that 20 components are sufficient.
The degrees of freedom for the null model are 190 and the objective function was 68.46
The degrees of freedom for the model are -20 and the objective function was 0
Fit based upon off diagonal values = 1
To find the component scores you can skip the step in which you find the correlations; principal will do that for you. Then you can skip the step Hong Ooi suggested and just find the scores directly. They should be orthogonal.
Using your example:
pca.results <- principal(data,nfactors=20,rotate='none')
#then correlate the scores
cor(pca.results$scores) #these should be orthogonal
Bill
What you've got there are not the PCA scores, but the PCA loadings. To get the scores, use the predict method on your model. You should find that the predicted scores are indeed uncorrelated with each other.
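A minimal sketch of that approach, assuming data is the numeric matrix from the question and fitting principal on the raw data rather than on corMat so that scores can be formed:
library(psych)
pca.results <- principal(data, nfactors = 20, rotate = "none")
scores <- predict(pca.results, data)  # predict.psych returns the component scores
round(cor(scores), 2)                 # off-diagonal entries should be ~0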
