Is it OK to scale labeled/binary data for principal component analysis (PCA)?

I have a dataset with about 20 columns of labeled data (encoded with sklearn.preprocessing.LabelEncoder), 140 binary columns (0 and 1), and 3 columns of numerical values. There are about 4400 rows. I had a hard time training a deep neural network on this dataset, so I decided to reduce the features and remove unnecessary ones.
So here is what I did: I scaled those 3 numerical columns and ran PCA with sklearn.decomposition.PCA(), but the result was 99.7% explained variance on the first component, 0.3% on the second, and 0 on the rest.
Then, more or less at random, I tried scaling the whole dataset:
df = sklearn.preprocessing.scale(df)
I then ran PCA again, and this time the result was a bit more promising (though not perfect, I guess). Here are the explained variance ratios:
>>> print(np.round(pca.explained_variance_ratio_ * 100, decimals=1))
[2.7 2.1 1.8 1.4 1.3 1.2 1.1 1.1 1. 1. 0.9 0.9 0.9 0.8 0.8 0.8 0.8 0.7
0.7 0.7 0.7 0.7 0.7 0.7 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.5
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4
0.4 0.4 0.4 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3
0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.3 0.2 0.2
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2
0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
0.1 0.1 0.1 0.1 0.1 0. 0. 0. 0. 0. 0. ]
So here are my questions:
Is it OK to scale data in this manner, i.e., to scale label-encoded and binary columns?
If yes, is such a PCA useful for feature extraction? (The best component explains only 2.7% of the variance.)
P.S. I'm pretty new to all of this. If I need to provide any other information, please let me know.

Related

How to convert a nested loop to lapply in R

I have a list named "dahak" that contains 30,000 numbers between 1 and 10. I want to compare every number with every other number in the list: if two numbers are equal, append 1 to weight_list; if they are not equal, compute a weight x from their difference and append x to weight_list.
Here is the code:
for (j in 1:num_nodes) {
  for (k in 1:num_nodes) {
    if (j == k) {
      weight_list <- c(weight_list, 0)
    } else if (as.numeric(dahak[j]) == as.numeric(dahak[k])) {
      weight_list <- c(weight_list, 1)
    } else if (as.numeric(dahak[j]) != as.numeric(dahak[k])) {
      x = 1 - (abs(as.numeric(dahak[j]) - as.numeric(dahak[k])) / 10)
      weight_list <- c(weight_list, x)
    }
  }
}
How can I optimize this code, and how can I do it with lapply?
It sounds like you want to create a 30,000 x 30,000 matrix. It also sounds like dahak is a vector rather than a list. If that's really what you want to do, you can simplify your logic and vectorize it like this:
get_weights <- function(x) 1 - abs(x - as.numeric(dahak))/10
weights <- do.call(rbind, lapply(as.numeric(dahak), get_weights)) - diag(length(dahak))
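To run this on a small example, one possible setup is the same dummy data that @ThomasIsCoding defines in the answer below:
set.seed(1)
num_nodes <- 15                                # small stand-in for the 30,000 values
dahak <- sample(10, num_nodes, replace = TRUE) # numbers between 1 and 10
(Exact values can differ across R versions, since the default sampling algorithm changed in R 3.6.)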
Using the same dummy data as @ThomasIsCoding, I get:
weights
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
#> [1,] 0.0 0.9 0.7 0.3 1.0 0.4 0.3 0.6 0.6 0.8 1.0 0.9 0.6 0.9 0.5
#> [2,] 0.9 0.0 0.8 0.4 0.9 0.5 0.4 0.7 0.7 0.7 0.9 0.8 0.7 1.0 0.6
#> [3,] 0.7 0.8 0.0 0.6 0.7 0.7 0.6 0.9 0.9 0.5 0.7 0.6 0.9 0.8 0.8
#> [4,] 0.3 0.4 0.6 0.0 0.3 0.9 1.0 0.7 0.7 0.1 0.3 0.2 0.7 0.4 0.8
#> [5,] 1.0 0.9 0.7 0.3 0.0 0.4 0.3 0.6 0.6 0.8 1.0 0.9 0.6 0.9 0.5
#> [6,] 0.4 0.5 0.7 0.9 0.4 0.0 0.9 0.8 0.8 0.2 0.4 0.3 0.8 0.5 0.9
#> [7,] 0.3 0.4 0.6 1.0 0.3 0.9 0.0 0.7 0.7 0.1 0.3 0.2 0.7 0.4 0.8
#> [8,] 0.6 0.7 0.9 0.7 0.6 0.8 0.7 0.0 1.0 0.4 0.6 0.5 1.0 0.7 0.9
#> [9,] 0.6 0.7 0.9 0.7 0.6 0.8 0.7 1.0 0.0 0.4 0.6 0.5 1.0 0.7 0.9
#> [10,] 0.8 0.7 0.5 0.1 0.8 0.2 0.1 0.4 0.4 0.0 0.8 0.9 0.4 0.7 0.3
#> [11,] 1.0 0.9 0.7 0.3 1.0 0.4 0.3 0.6 0.6 0.8 0.0 0.9 0.6 0.9 0.5
#> [12,] 0.9 0.8 0.6 0.2 0.9 0.3 0.2 0.5 0.5 0.9 0.9 0.0 0.5 0.8 0.4
#> [13,] 0.6 0.7 0.9 0.7 0.6 0.8 0.7 1.0 1.0 0.4 0.6 0.5 0.0 0.7 0.9
#> [14,] 0.9 1.0 0.8 0.4 0.9 0.5 0.4 0.7 0.7 0.7 0.9 0.8 0.7 0.0 0.6
#> [15,] 0.5 0.6 0.8 0.8 0.5 0.9 0.8 0.9 0.9 0.3 0.5 0.4 0.9 0.6 0.0
I guess this might be the simplification you are looking for, where outer and ifelse are used.
Below is an example with dummy data:
set.seed(1)
num_nodes <- 15
dahak <- sample(10,num_nodes,replace = TRUE)
If you want weight_list as a matrix of dimensions num_nodes by num_nodes, then you can try
weight_list <- (u <- ifelse((z <- abs(outer(dahak, dahak, FUN = "-"))) != 0, 1 - z/10, 1)) - diag(diag(u))
such that
> weight_list
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 0.0 0.5 0.8 0.2 0.3 0.8 0.3 0.4 0.2 0.6 0.6 0.9 0.7 0.9 0.8
[2,] 0.5 0.0 0.7 0.7 0.8 0.7 0.8 0.9 0.7 0.9 0.9 0.4 0.8 0.4 0.7
[3,] 0.8 0.7 0.0 0.4 0.5 1.0 0.5 0.6 0.4 0.8 0.8 0.7 0.9 0.7 1.0
[4,] 0.2 0.7 0.4 0.0 0.9 0.4 0.9 0.8 1.0 0.6 0.6 0.1 0.5 0.1 0.4
[5,] 0.3 0.8 0.5 0.9 0.0 0.5 1.0 0.9 0.9 0.7 0.7 0.2 0.6 0.2 0.5
[6,] 0.8 0.7 1.0 0.4 0.5 0.0 0.5 0.6 0.4 0.8 0.8 0.7 0.9 0.7 1.0
[7,] 0.3 0.8 0.5 0.9 1.0 0.5 0.0 0.9 0.9 0.7 0.7 0.2 0.6 0.2 0.5
[8,] 0.4 0.9 0.6 0.8 0.9 0.6 0.9 0.0 0.8 0.8 0.8 0.3 0.7 0.3 0.6
[9,] 0.2 0.7 0.4 1.0 0.9 0.4 0.9 0.8 0.0 0.6 0.6 0.1 0.5 0.1 0.4
[10,] 0.6 0.9 0.8 0.6 0.7 0.8 0.7 0.8 0.6 0.0 1.0 0.5 0.9 0.5 0.8
[11,] 0.6 0.9 0.8 0.6 0.7 0.8 0.7 0.8 0.6 1.0 0.0 0.5 0.9 0.5 0.8
[12,] 0.9 0.4 0.7 0.1 0.2 0.7 0.2 0.3 0.1 0.5 0.5 0.0 0.6 1.0 0.7
[13,] 0.7 0.8 0.9 0.5 0.6 0.9 0.6 0.7 0.5 0.9 0.9 0.6 0.0 0.6 0.9
[14,] 0.9 0.4 0.7 0.1 0.2 0.7 0.2 0.3 0.1 0.5 0.5 1.0 0.6 0.0 0.7
[15,] 0.8 0.7 1.0 0.4 0.5 1.0 0.5 0.6 0.4 0.8 0.8 0.7 0.9 0.7 0.0

Subset using i statement dynamically created from another data.table's variables

I have data similar to the following:
set.seed(1)
dt <- data.table(ID = 1:10,
                 Status = c(rep("OUT", 2), rep("IN", 2), "ON", rep("OUT", 2), rep("IN", 2), "ON"),
                 t1 = round(rnorm(10), 1), t2 = round(rnorm(10), 1), t3 = round(rnorm(10), 1),
                 t4 = round(rnorm(10), 1), t5 = round(rnorm(10), 1), t6 = round(rnorm(10), 1),
                 t7 = round(rnorm(10), 1), t8 = round(rnorm(10), 1))
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
I need to apply constraints to dt similar to the following (which are read in from a csv using fread):
dt_constraints <- data.table(columns = c("t1", "t3", "t7", "t8"),
                             operator = c(rep(">=", 2), rep("<=", 2)),
                             values = c(-0.6, -0.5, 2.4, 0.5))
columns operator values
1 t1 >= -0.6
2 t3 >= -0.5
3 t7 <= 2.4
4 t8 <= 0.5
I can easily subset dt by typing in the various constraints in the i statement:
dt_sub <- dt[t1>=-.6 & t3 >=-.5 & t7<=2.4 & t8<=.5,]
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0 -0.7
3 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
4 7 OUT 0.5 0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
5 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
6 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
But since the constraints are constantly changing (a new constraints csv is read in each time), I am looking for an efficient way to programmatically apply the constraints directly from dt_constraints to subset dt. The actual data is quite large, as is the number of constraints, so efficiency is key.
Thanks so much.
There is an alternative approach which uses non-equi joins for subsetting:
thresholds <- dt_constraints[, values]
# build the join conditions, e.g. "t1>=V1", "t3>=V2", "t7<=V3", "t8<=V4"
cond <- dt_constraints[, paste0(columns, operator, "V", .I)]
# non-equi join against the thresholds; which = TRUE returns the matching row numbers of dt
dt[dt[as.list(thresholds), on = cond, which = TRUE]]
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
4: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
5: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
6: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
We can paste the constraints into a single string and then eval/parse it:
dt[eval(parse(text=do.call(paste, c(dt_constraints, collapse= ' & '))))]
# ID Status t1 t2 t3 t4 t5 t6 t7 t8
#1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
#2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
#3: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
#4: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
#5: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
#6: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
If we are using the tidyverse, then:
library(dplyr)
dt %>%
  filter(!!rlang::parse_expr(do.call(paste, c(dt_constraints, collapse = ' & '))))
# ID Status t1 t2 t3 t4 t5 t6 t7 t8
#1 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
#2 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
#3 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
#4 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
#5 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
#6 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6

Using R, data.table, conditionally sum columns

I have a data table similar to this (except it has 150 columns and about 5 million rows):
set.seed(1)
dt <- data.table(ID = 1:10,
                 Status = c(rep("OUT", 2), rep("IN", 2), "ON", rep("OUT", 2), rep("IN", 2), "ON"),
                 t1 = round(rnorm(10), 1), t2 = round(rnorm(10), 1), t3 = round(rnorm(10), 1),
                 t4 = round(rnorm(10), 1), t5 = round(rnorm(10), 1), t6 = round(rnorm(10), 1),
                 t7 = round(rnorm(10), 1), t8 = round(rnorm(10), 1))
which outputs:
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
Using data.table, I would like to add a new column (using :=) called Total that would contain the following:
For each row,
if Status=OUT, sum columns t1:t4 and t8
if Status=IN, sum columns t5,t6,t8
if Status=ON, sum columns t1:t3 and t6:t8
The final output should look like this:
ID Status t1 t2 t3 t4 t5 t6 t7 t8 Total
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5 3.7
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7 0.6
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6 1.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9 -1.4
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3 1.4
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3 -1.0
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4 -0.5
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0 -0.2
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1 0.6
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6 2.2
I am fairly new to data.table (currently using version 1.9.6) and would like a solution that uses efficient data.table syntax.
I think doing it one by one, as suggested in comments, is perfectly fine, but you can also create a lookup table:
cond = data.table(Status = c("OUT", "IN", "ON"),
                  cols = Map(paste0, 't', list(c(1:4, 8), c(5, 6, 8), c(1:3, 6:8))))
# Status cols
#1: OUT t1,t2,t3,t4,t8
#2: IN t5,t6,t8
#3: ON t1,t2,t3,t6,t7,t8
dt[cond, Total := Reduce(`+`, .SD[, cols[[1]], with = F]), on = 'Status', by = .EACHI]
# ID Status t1 t2 t3 t4 t5 t6 t7 t8 Total
# 1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5 3.7
# 2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7 0.6
# 3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6 1.6
# 4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9 -1.4
# 5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3 1.4
# 6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3 -1.0
# 7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4 -0.5
# 8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0 -0.2
# 9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1 0.6
#10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6 2.2

R histogram assigns wrong number of observations to each bin

Below is the data I am working with. I do a simple hist(data), and the frequencies of -0.3 through 0.4 are correct. However, for some reason R seems to combine the frequencies of -0.5 and -0.4, the two leftmost bins. There are 3 counts of -0.5 and 5 counts of -0.4, but R plots 8 counts for both -0.5 and -0.4.
Any idea why this may be going on? How to fix it?
[1] -0.1 0.0 0.1 0.1 0.3 0.0 0.0 0.1 0.1 0.1 0.2 0.1 -0.1 0.2 0.0
[16] -0.4 0.2 0.0 -0.1 0.0 0.1 0.1 -0.1 0.0 0.0 0.1 0.0 -0.1 0.0 0.3
[31] -0.2 0.4 -0.1 0.0 -0.2 0.0 0.1 0.1 0.0 0.1 0.2 -0.1 0.1 0.1 -0.1
[46] 0.2 0.1 -0.1 0.1 0.0 -0.1 0.4 -0.1 -0.1 0.0 0.0 -0.1 0.1 0.1 0.0
[61] 0.1 -0.1 0.2 -0.1 0.1 -0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0 -0.1 0.1
[76] 0.2 -0.2 0.0 0.0 -0.1 0.2 0.0 0.0 0.0 -0.3 0.0 -0.1 -0.1 0.1 -0.2
[91] -0.1 -0.3 -0.1 -0.3 -0.2 -0.2 0.0 0.0 0.0 -0.2 0.1 0.0 0.0 0.1 0.0
[106] 0.0 -0.2 -0.1 0.2 -0.1 0.0 -0.1 -0.1 -0.2 0.1 0.1 0.0 0.1 0.2 0.1
[121] 0.0 0.1 -0.2 0.2 0.0 0.0 0.1 0.1 0.0 -0.1 0.1 0.0 0.1 -0.1 0.2
[136] 0.0 0.1 0.1 0.0 0.1 -0.1 0.0 0.0 0.1 0.2 -0.1 0.1 0.0 0.1 0.0
[151] -0.1 0.0 0.2 0.1 -0.1 0.1 -0.2 0.1 0.1 -0.1 0.1 -0.2 -0.1 0.1 -0.1
[166] 0.0 0.0 -0.3 0.0 0.1 -0.2 0.1 -0.4 -0.2 -0.2 -0.3 0.0 -0.4 -0.3 -0.5
[181] -0.5 -0.5 -0.4 -0.3 -0.4 -0.1 0.0 -0.1 -0.2 -0.2 0.1 0.0 0.2 -0.1 -0.1
[196] 0.0 0.3 0.2 -0.1 0.0 0.0 0.0 -0.3 0.4 0.3 0.1 0.0 -0.1 0.1 -0.1
[211] 0.1 0.0 0.0 0.2 0.2 0.1 0.3 -0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.3
[226] 0.0 0.0 -0.1 0.0 0.2 0.2 0.0 0.0 0.0 0.2 0.1 0.0 0.0 0.2 0.3
[241] 0.1 -0.1 0.0 0.4 0.0 0.2 -0.1 0.1
Here is the output of the histogram. You can see 8 counts for both -0.5 and -0.4, which isn't what is in the data:
$breaks
[1] -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
$counts
[1] 8 8 17 46 75 60 23 7 4
The comments above explain what's happening: the breaks are the left and right limits of the intervals, not their centers.
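As a quick illustration (a minimal sketch with made-up values that mirror the counts in the question), both -0.5 and -0.4 fall into the first interval [-0.5, -0.4], because hist() uses right-closed intervals with include.lowest = TRUE by default:
x <- c(rep(-0.5, 3), rep(-0.4, 5))  # 3 counts of -0.5 and 5 counts of -0.4
h <- hist(x, breaks = c(-0.5, -0.4, -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4), plot = FALSE)
h$counts
# [1] 8 0 0 0 0 0 0 0 0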
How to fix it? If you are dealing only with numbers discretized to integer multiples of 0.1, you can place the breaks halfway between the values (at ..., -0.55, -0.45, ..., 0.45) with
data <- c(-0.5, -0.5, 0.4)
breaks <- ((min(data)*10):(max(data)*10+1))/10-0.05
result <- hist(data, breaks)
But is a histogram really what you need here? It seems you just want to count the number of occurrences of each value, which is much easier with
data <- c(-0.5, -0.5, 0.4)
aggregate(data, list(data), "length")
returning
Group.1 x
1 -0.5 2
2 0.4 1
And for plotting, have a look at barplot
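For example, a minimal sketch (assuming data is the numeric vector from the question) that tabulates the values and plots the counts:
counts <- table(data)  # frequency of each distinct value
barplot(counts, xlab = "value", ylab = "count")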

Gnuplot: "all contours drawn in a single color" does not work

I am trying to draw all contour lines in the same color, following the example from here: http://gnuplot.sourceforge.net/demo/contours.25.gnu
The example itself works, but my own code does not:
set xyplane 0;
set pm3d
set contour
set cntrparam levels 6
unset surface;
unset key;
set pm3d map
set title "t";
splot for [i=1:1] "-" using 1:2:3 notitle with lines lc rgb "dark-blue";
....data....
Can you help me find the problem?
Here is the code file to download:
https://dl.dropboxusercontent.com/u/45318932/contourpm3d.plt
I am using gnuplot 4.6.5.
The relevant line is
unset clabel
I know, that is very unintuitive; I don't know the reason behind it.
Here is the complete script with the respective changes, for reference:
set xyplane 0;
set pm3d
set contour
unset clabel
set cntrparam levels 6
unset surface;
unset key;
set pm3d map
splot for [i=1:1] "-" using 1:2:3 notitle with lines lw 2 lc rgb "dark-blue";
#a1 a2 t
0.0 0.0 25.0
0.0 0.1 28.0
0.0 0.2 37.0
0.0 0.3 23.0
0.0 0.4 23.0
0.0 0.5 15.0
0.0 0.6 16.0
0.0 0.7 33.0
0.0 0.8 16.0
0.0 0.9 20.0
0.0 1.0 14.0
0.1 0.0 25.0
0.1 0.1 47.0
0.1 0.2 26.0
0.1 0.3 14.0
0.1 0.4 16.0
0.1 0.5 15.0
0.1 0.6 27.0
0.1 0.7 13.0
0.1 0.8 14.0
0.1 0.9 20.0
0.1 1.0 0.0
0.2 0.0 25.0
0.2 0.1 28.0
0.2 0.2 26.0
0.2 0.3 14.0
0.2 0.4 16.0
0.2 0.5 16.0
0.2 0.6 32.0
0.2 0.7 14.0
0.2 0.8 19.0
0.2 0.9 0.0
0.2 1.0 0.0
0.3 0.0 57.0
0.3 0.1 36.0
0.3 0.2 26.0
0.3 0.3 14.0
0.3 0.4 15.0
0.3 0.5 16.0
0.3 0.6 31.0
0.3 0.7 18.0
0.3 0.8 0.0
0.3 0.9 0.0
0.3 1.0 0.0
0.4 0.0 42.0
0.4 0.1 23.0
0.4 0.2 26.0
0.4 0.3 19.0
0.4 0.4 15.0
0.4 0.5 16.0
0.4 0.6 34.0
0.4 0.7 0.0
0.4 0.8 0.0
0.4 0.9 0.0
0.4 1.0 0.0
0.5 0.0 54.0
0.5 0.1 23.0
0.5 0.2 26.0
0.5 0.3 17.0
0.5 0.4 15.0
0.5 0.5 16.0
0.5 0.6 0.0
0.5 0.7 0.0
0.5 0.8 0.0
0.5 0.9 0.0
0.5 1.0 0.0
0.6 0.0 21.0
0.6 0.1 23.0
0.6 0.2 23.0
0.6 0.3 16.0
0.6 0.4 16.0
0.6 0.5 0.0
0.6 0.6 0.0
0.6 0.7 0.0
0.6 0.8 0.0
0.6 0.9 0.0
0.6 1.0 0.0
0.7 0.0 21.0
0.7 0.1 16.0
0.7 0.2 27.0
0.7 0.3 12.0
0.7 0.4 0.0
0.7 0.5 0.0
0.7 0.6 0.0
0.7 0.7 0.0
0.7 0.8 0.0
0.7 0.9 0.0
0.7 1.0 0.0
0.8 0.0 61.0
0.8 0.1 27.0
0.8 0.2 33.0
0.8 0.3 0.0
0.8 0.4 0.0
0.8 0.5 0.0
0.8 0.6 0.0
0.8 0.7 0.0
0.8 0.8 0.0
0.8 0.9 0.0
0.8 1.0 0.0
0.9 0.0 27.0
0.9 0.1 21.0
0.9 0.2 0.0
0.9 0.3 0.0
0.9 0.4 0.0
0.9 0.5 0.0
0.9 0.6 0.0
0.9 0.7 0.0
0.9 0.8 0.0
0.9 0.9 0.0
0.9 1.0 0.0
1.0 0.0 35.0
1.0 0.1 0.0
1.0 0.2 0.0
1.0 0.3 0.0
1.0 0.4 0.0
1.0 0.5 0.0
1.0 0.6 0.0
1.0 0.7 0.0
1.0 0.8 0.0
1.0 0.9 0.0
1.0 1.0 0.0
e
with the output
