I have a data table similar to this (except it has 150 columns and about 5 million rows):
set.seed(1)
dt <- data.table(ID=1:10, Status=c(rep("OUT",2),rep("IN",2),"ON",rep("OUT",2),rep("IN",2),"ON"),
t1=round(rnorm(10),1), t2=round(rnorm(10),1), t3=round(rnorm(10),1),
t4=round(rnorm(10),1), t5=round(rnorm(10),1), t6=round(rnorm(10),1),
t7=round(rnorm(10),1),t8=round(rnorm(10),1))
which outputs:
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
Using data.table, I would like to add a new column (using :=) called Total that would contain the following:
For each row,
if Status=OUT, sum columns t1:t4 and t8
if Status=IN, sum columns t5,t6,t8
if Status=ON, sum columns t1:t3 and t6:t8
The final output should look like this:
ID Status t1 t2 t3 t4 t5 t6 t7 t8 Total
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5 3.7
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7 0.6
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6 1.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9 -1.4
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3 1.4
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3 -1.0
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4 -0.5
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0 -0.2
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1 0.6
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6 2.2
I am fairly new to data.table (currently using version 1.9.6) and would like a solution that uses efficient data.table syntax.
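For comparison, the straightforward approach of handling each Status one by one can be sketched with rowSums() and .SDcols (a minimal sketch; the column sets are taken from the conditions above):

```r
library(data.table)
# assign Total group by group; .SDcols picks the columns to sum per Status
dt[Status == "OUT", Total := rowSums(.SD), .SDcols = c(paste0("t", 1:4), "t8")]
dt[Status == "IN",  Total := rowSums(.SD), .SDcols = c("t5", "t6", "t8")]
dt[Status == "ON",  Total := rowSums(.SD), .SDcols = paste0("t", c(1:3, 6:8))]
```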
I think doing it one by one, as suggested in comments, is perfectly fine, but you can also create a lookup table:
cond = data.table(Status = c("OUT", "IN", "ON"),
cols = Map(paste0, 't', list(c(1:4, 8), c(5,6,8), c(1:3, 6:8))))
# Status cols
#1: OUT t1,t2,t3,t4,t8
#2: IN t5,t6,t8
#3: ON t1,t2,t3,t6,t7,t8
dt[cond, Total := Reduce(`+`, .SD[, cols[[1]], with = FALSE]), on = 'Status', by = .EACHI]
# ID Status t1 t2 t3 t4 t5 t6 t7 t8 Total
# 1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5 3.7
# 2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7 0.6
# 3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6 1.6
# 4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9 -1.4
# 5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3 1.4
# 6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3 -1.0
# 7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4 -0.5
# 8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0 -0.2
# 9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1 0.6
#10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6 2.2
I have data similar to the following:
set.seed(1)
dt <- data.table(ID=1:10, Status=c(rep("OUT",2),rep("IN",2),"ON",rep("OUT",2),rep("IN",2),"ON"),
t1=round(rnorm(10),1), t2=round(rnorm(10),1), t3=round(rnorm(10),1),
t4=round(rnorm(10),1), t5=round(rnorm(10),1), t6=round(rnorm(10),1),
t7=round(rnorm(10),1),t8=round(rnorm(10),1))
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 3 IN -0.8 -0.6 0.1 0.4 0.7 0.3 0.7 0.6
4: 4 IN 1.6 -2.2 -2.0 -0.1 0.6 -1.1 0.0 -0.9
5: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
6: 6 OUT -0.8 0.0 -0.1 -0.4 -0.7 2.0 0.2 0.3
7: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
8: 8 IN 0.7 0.9 -1.5 -0.1 0.8 -1.0 1.5 0.0
9: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
10: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
I need to apply constraints to dt similar to the following (which are read in from a csv using fread):
dt_constraints <- data.table(columns=c("t1","t3","t7","t8"), operator=c(rep(">=",2),rep("<=",2)),
values=c(-.6,-.5,2.4,.5))
columns operator values
1 t1 >= -0.6
2 t3 >= -0.5
3 t7 <= 2.4
4 t8 <= 0.5
I can easily subset dt by typing in the various constraints in the i statement:
dt_sub <- dt[t1>=-.6 & t3 >=-.5 & t7<=2.4 & t8<=.5,]
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0 -0.7
3 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
4 7 OUT 0.5 0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
5 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
6 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
But, since the constraints are constantly changing (a new constraints csv is read in each time), I am looking for an efficient way to programmatically apply the constraints directly from dt_constraints to subset dt. The actual data is quite large, as is the number of constraints, so efficiency is key.
Thanks so much.
There is an alternative approach which uses non-equi joins for subsetting:
thresholds <- dt_constraints[, values]
cond <- dt_constraints[, paste0(columns, operator, "V", .I)]
dt[dt[as.list(thresholds), on = cond, which = TRUE]]
ID Status t1 t2 t3 t4 t5 t6 t7 t8
1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
3: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
4: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
5: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
6: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
We can paste the constraints into a single expression string and then evaluate it:
dt[eval(parse(text=do.call(paste, c(dt_constraints, collapse= ' & '))))]
# ID Status t1 t2 t3 t4 t5 t6 t7 t8
#1: 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
#2: 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
#3: 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
#4: 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
#5: 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
#6: 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
If we are using the tidyverse, then
library(dplyr)
dt %>%
filter(!!rlang::parse_expr(do.call(paste, c(dt_constraints, collapse= ' & '))))
# ID Status t1 t2 t3 t4 t5 t6 t7 t8
#1 1 OUT -0.6 1.5 0.9 1.4 -0.2 0.4 2.4 0.5
#2 2 OUT 0.2 0.4 0.8 -0.1 -0.3 -0.6 0.0 -0.7
#3 5 ON 0.3 1.1 0.6 -1.4 -0.7 1.4 -0.7 -1.3
#4 7 OUT 0.5 0.0 -0.2 -0.4 0.4 -0.4 -1.8 -0.4
#5 9 IN 0.6 0.8 -0.5 1.1 -0.1 0.6 0.2 0.1
#6 10 ON -0.3 0.6 0.4 0.8 0.9 -0.1 2.2 -0.6
I have searched all the lapply questions and solutions, and none of those solutions seems to address and/or work for the following...
I have a list "temp" that contains the names of 100 data frames: "sim_rep1.dat" through "sim_rep100.dat".
Each data frame has 2000 observations and the same 11 variables: ARAND and w1-w10, all of which are numeric.
For all 100 data frames, I am trying to create a new variable called "ps_true" that incorporates certain of the "w" variables, each with a unique coefficient.
The only use of lapply that is working for me is the following:
lapply(mget(paste0("sim_rep", 1:100,".dat")), transform,
ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1)
When I run the code above, R loops through all 100 data frames and shows newly calculated values for ps_true in the console. Unfortunately, the new column is not getting added to the data frames.
When I try to create a function, the wheels come completely off.
I have tried different variations of the following:
lapply(temp, function(x){
ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1
cbind(x, ps_true)
return(x)
})
Error in FUN(X[[i]], ...) : object 'w1' not found results from the function shown above
Error in x$w1 : $ operator is invalid for atomic vectors results if I try to reference x$w1 instead
Error in FUN(X[[i]], ...) : object 'w1' not found results if I try to reference x[[w1]] instead
Error in x[["w1"]] : subscript out of bounds results if I try to reference x[["w1"]] instead
I am hoping there is something obvious that I am missing. I'd appreciate your insights and suggestions to solve this frustrating problem.
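One likely cause of the column "not sticking": lapply() returns modified copies, which must be written back for the change to persist. A minimal sketch, assuming the 100 data frames live in the global environment:

```r
nms <- paste0("sim_rep", 1:100, ".dat")
# transform() adds ps_true to a copy of each data frame
res <- lapply(mget(nms, envir = .GlobalEnv), transform,
              ps_true = (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
                                   0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1)
# write the transformed copies back over the originals
list2env(res, envir = .GlobalEnv)
```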
In response to Uwe's addendum:
The code I had used to read all the files was the following:
temp = list.files(pattern='*.dat')
for (i in 1:length(temp)) {
assign(temp[i], read.csv(temp[i], header=F,sep="",
col.names = c("ARAND", "w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10")))
}
According to the OP, there are 100 data.frames with identical column names. The OP wants to create a new column in all of the data.frames using exactly the same formula.
This indicates a fundamental flaw in the design of the data structure. I guess no database admin would create 100 identical tables where only the data contents differ. Instead, they would create one table with an additional column identifying the origin of each row. Then all subsequent operations would be applied to one table instead of being repeated for each of many.
In R, the data.table package has the convenient rbindlist() function which can be used for this purpose:
library(data.table) # CRAN version 1.10.4 used
# get list of data.frames from the given names and
# combine the rows of all data sets into one large data.table
DT <- rbindlist(mget(temp), idcol = "origin")
# now create new column for all rows across all data sets
DT[, ps_true := (1 + exp(-(0.8*w1 - 0.25*w2 + 0.6*w3 -
0.4*w4 - 0.8*w5 - 0.5*w6 + 0.7*w7)))^-1]
DT
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep100.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep100.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep100.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep100.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep100.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
DT now consists of 200 K rows. Performance is no reason to worry, as data.table was built to deal with large (even larger) data efficiently.
The origin of each row can be identified in case the data of the individual data sets need to be treated separately. E.g.,
DT[origin == "sim_rep47.dat"]
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep47.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep47.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep47.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep47.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep47.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
1996: sim_rep47.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
1997: sim_rep47.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
1998: sim_rep47.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
1999: sim_rep47.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
2000: sim_rep47.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
extracts all rows belonging to data set sim_rep47.dat.
Data
For test and demonstration, I've created 100 sample data.frames using the code below:
# create vector of file names
temp <- paste0("sim_rep", 1:100, ".dat")
# create one sample data.frame
nr <- 2000L
nc <- 11L
set.seed(123L)
foo <- as.data.frame(matrix(round(rnorm(nr * nc), 1), nrow = nr))
names(foo) <- c("ARAND", paste0("w", 1:10))
str(foo)
# create 100 individually named data.frames by "copying" foo
for (t in temp) assign(t, foo)
# print warning message on using assign
fortunes::fortune(236)
# verify objects have been created
ls()
Addendum: Reading all files at once
The OP has named the single data.frames sim_rep1.dat, sim_rep2.dat, etc. which resemble typical file names. Just in case the OP indeed has 100 files on disk I would like to suggest a way to read all files at once. Let's suppose all files are stored in one directory.
# path to data directory
data_dir <- file.path("path", "to", "data", "directory")
# create vector of file paths
files <- dir(data_dir, pattern = "sim_rep\\d+\\.dat", full.names = TRUE)
# read all files and create one large data.table
# NB: it might be necessary to add parameters to fread()
# or to use another file reader depending on the file type
DT <- rbindlist(lapply(files, fread), idcol = "origin")
# rename origin to contain the file names without path
DT[, origin := factor(origin, labels = basename(files))]
DT
origin ARAND w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ps_true
1: sim_rep1.dat -0.6 -0.5 0.2 -0.7 0.5 2.4 -0.2 -0.9 -1.1 0.3 -0.8 0.0287485
2: sim_rep1.dat -0.2 0.2 0.7 1.0 1.8 -0.2 0.8 0.3 -1.3 -1.6 -0.2 0.4588433
3: sim_rep1.dat 1.6 -0.5 0.7 -0.7 -1.7 0.9 -1.2 -1.0 1.1 -0.3 -2.1 0.2432395
4: sim_rep1.dat 0.1 1.2 -1.3 -0.1 0.3 -0.6 0.4 0.3 0.8 -1.2 -1.7 0.8313184
5: sim_rep1.dat 0.1 0.2 -2.0 0.6 -0.3 0.2 0.2 0.5 -0.9 -0.8 -1.1 0.7738186
---
199996: sim_rep99.dat 0.1 -1.4 1.6 -0.7 -1.0 -0.6 0.8 -0.6 -0.5 -0.4 -0.8 0.1323889
199997: sim_rep99.dat 0.3 1.3 -2.4 -0.7 -0.4 0.0 1.0 -0.2 1.0 -0.1 0.3 0.6769959
199998: sim_rep99.dat 0.3 1.2 0.0 -1.3 -0.8 -0.7 -0.3 0.1 0.9 0.9 -1.3 0.7824498
199999: sim_rep99.dat 0.5 -0.7 0.2 0.5 1.1 -0.3 0.3 -0.5 -0.8 1.9 -0.7 0.2669799
200000: sim_rep99.dat -0.5 1.1 0.8 0.2 -0.6 -0.5 -0.4 1.1 -1.8 0.9 -1.3 0.9175867
All data sets are now stored in one large data.table DT consisting of 200 K rows. However, the order of the data sets differs, since files is sorted alphabetically, i.e.,
head(files)
[1] "./data/sim_rep1.dat" "./data/sim_rep10.dat" "./data/sim_rep100.dat"
[4] "./data/sim_rep11.dat" "./data/sim_rep12.dat" "./data/sim_rep13.dat"
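If the original replication order matters, the file list can be re-sorted on the number embedded in each base name before reading. A sketch:

```r
# order files by the numeric part of their base name (sim_rep1, sim_rep2, ...)
files <- files[order(as.integer(gsub("\\D", "", basename(files))))]
```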
You probably just need single brackets:
test = data.frame('w1' = c(1,2,3),'w2' = c(2,3,4))
temp = list(test,test,test)
temp2 = lapply(temp,function(x){cbind(x,setNames(x['w1'] + x['w2'],'ps_true'))})
temp2
[[1]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
[[2]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
[[3]]
w1 w2 ps_true
1 1 2 3
2 2 3 5
3 3 4 7
I have a data table similar to this except much larger:
set.seed(1)
dt <- data.table(t1=round(rnorm(5),1), t2=round(rnorm(5),1), t3=round(rnorm(5),1),
t4=round(rnorm(5),1), t5=round(rnorm(5),1), t6=round(rnorm(5),1),
t7=round(rnorm(5),1),t8=round(rnorm(5),1))
Which outputs:
t1 t2 t3 t4 t5 t6 t7 t8
1: -0.6 -0.8 1.5 0.0 0.9 -0.1 1.4 -0.4
2: 0.2 0.5 0.4 0.0 0.8 -0.2 -0.1 -0.4
3: -0.8 0.7 -0.6 0.9 0.1 -1.5 0.4 -0.1
4: 1.6 0.6 -2.2 0.8 -2.0 -0.5 -0.1 1.1
5: 0.3 -0.3 1.1 0.6 0.6 0.4 -1.4 0.8
I would like to rename columns t3:t8 as hour_t3:hour_t8, to output like this:
t1 t2 hour_t3 hour_t4 hour_t5 hour_t6 hour_t7 hour_t8
1: -0.6 -0.8 1.5 0.0 0.9 -0.1 1.4 -0.4
2: 0.2 0.5 0.4 0.0 0.8 -0.2 -0.1 -0.4
3: -0.8 0.7 -0.6 0.9 0.1 -1.5 0.4 -0.1
4: 1.6 0.6 -2.2 0.8 -2.0 -0.5 -0.1 1.1
5: 0.3 -0.3 1.1 0.6 0.6 0.4 -1.4 0.8
These two methods work:
names(dt)[3:8] <- c(paste0("hour_t", 3:8))
and
setnames(dt, 3:8, c(paste0("hour_t", 3:8)))
but, I would like to be able to subset by reference using something like this:
setnames(dt, "t3":"t8", c(paste0("hour_t", 3:8)))
When I use such syntax or subset with c("t3":"t8"), I get the following error:
Error in "t3":"t8" : NA/NaN argument
In addition: Warning messages:
1: In setnames(dt, c("t3":"t8"), c(paste0("hour_t", 3:8))) :
NAs introduced by coercion
2: In setnames(dt, c("t3":"t8"), c(paste0("hour_t", 3:8))) :
NAs introduced by coercion
Any thoughts on how to subset the columns to rename by reference/column name instead of by position would be greatly appreciated. Thanks.
I am still quite new to data.table and am using data.table version 1.9.6.
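For what it's worth, setnames() also accepts a character vector of old names, so the rename can be done by name rather than by position. A minimal sketch:

```r
# rename by old-name vector instead of by column index
setnames(dt, paste0("t", 3:8), paste0("hour_t", 3:8))
```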
Below is the data I am working with. I do a simple hist(data), and the frequencies of -.3 through .4 are correct. However, R seems to combine the frequencies of -.5 and -.4, the two leftmost bins: there are 3 counts of -.5 and 5 counts of -.4, but R plots 8 counts for both -.5 and -.4.
Any idea why this may be going on? How to fix it?
[1] -0.1 0.0 0.1 0.1 0.3 0.0 0.0 0.1 0.1 0.1 0.2 0.1 -0.1 0.2 0.0
[16] -0.4 0.2 0.0 -0.1 0.0 0.1 0.1 -0.1 0.0 0.0 0.1 0.0 -0.1 0.0 0.3
[31] -0.2 0.4 -0.1 0.0 -0.2 0.0 0.1 0.1 0.0 0.1 0.2 -0.1 0.1 0.1 -0.1
[46] 0.2 0.1 -0.1 0.1 0.0 -0.1 0.4 -0.1 -0.1 0.0 0.0 -0.1 0.1 0.1 0.0
[61] 0.1 -0.1 0.2 -0.1 0.1 -0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0 -0.1 0.1
[76] 0.2 -0.2 0.0 0.0 -0.1 0.2 0.0 0.0 0.0 -0.3 0.0 -0.1 -0.1 0.1 -0.2
[91] -0.1 -0.3 -0.1 -0.3 -0.2 -0.2 0.0 0.0 0.0 -0.2 0.1 0.0 0.0 0.1 0.0
[106] 0.0 -0.2 -0.1 0.2 -0.1 0.0 -0.1 -0.1 -0.2 0.1 0.1 0.0 0.1 0.2 0.1
[121] 0.0 0.1 -0.2 0.2 0.0 0.0 0.1 0.1 0.0 -0.1 0.1 0.0 0.1 -0.1 0.2
[136] 0.0 0.1 0.1 0.0 0.1 -0.1 0.0 0.0 0.1 0.2 -0.1 0.1 0.0 0.1 0.0
[151] -0.1 0.0 0.2 0.1 -0.1 0.1 -0.2 0.1 0.1 -0.1 0.1 -0.2 -0.1 0.1 -0.1
[166] 0.0 0.0 -0.3 0.0 0.1 -0.2 0.1 -0.4 -0.2 -0.2 -0.3 0.0 -0.4 -0.3 -0.5
[181] -0.5 -0.5 -0.4 -0.3 -0.4 -0.1 0.0 -0.1 -0.2 -0.2 0.1 0.0 0.2 -0.1 -0.1
[196] 0.0 0.3 0.2 -0.1 0.0 0.0 0.0 -0.3 0.4 0.3 0.1 0.0 -0.1 0.1 -0.1
[211] 0.1 0.0 0.0 0.2 0.2 0.1 0.3 -0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.3
[226] 0.0 0.0 -0.1 0.0 0.2 0.2 0.0 0.0 0.0 0.2 0.1 0.0 0.0 0.2 0.3
[241] 0.1 -0.1 0.0 0.4 0.0 0.2 -0.1 0.1
Here is the output of the histogram. You can see 8 counts for both -.5 and -.4, which isn't in the data:
$breaks
[1] -0.5 -0.4 -0.3 -0.2 -0.1 0.0 0.1 0.2 0.3 0.4
$counts
[1] 8 8 17 46 75 60 23 7 4
The comments above explain what's happening - the breaks are the left and right limits of the intervals, not the centers.
How to fix it? If you are dealing just with numbers discretized to multiples of 0.1, you can set your breaks at the midpoints -0.55, -0.45, ..., 0.45 by
data <- c(-0.5, -0.5, 0.4)
breaks <- ((min(data)*10):(max(data)*10+1))/10-0.05
result <- hist(data, breaks)
But is a histogram really what you need here? It seems that you just want to count the number of occurrences of each value, which is much easier with
data <- c(-0.5, -0.5, 0.4)
aggregate(data, list(data), "length")
returning
  Group.1 x
1    -0.5 2
2     0.4 1
And for plotting, have a look at barplot
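For example, a minimal sketch of counting with table() and plotting the counts with barplot():

```r
data <- c(-0.5, -0.5, 0.4)
counts <- table(data)   # counts of each distinct value
barplot(counts)
```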
I am trying to draw all contour lines in the same color, following the example from here: http://gnuplot.sourceforge.net/demo/contours.25.gnu
The example itself works, but my own code does not:
set xyplane 0;
set pm3d
set contour
set cntrparam levels 6
unset surface;
unset key;
set pm3d map
set title "t";
splot for [i=1:1] "-" using 1:2:3 notitle with lines lc rgb "dark-blue";
....data....
Can you help me find the problem?
Here to download the code file:
https://dl.dropboxusercontent.com/u/45318932/contourpm3d.plt
I am using gnuplot 4.6.5.
The relevant line is
unset clabel
I know, that is very unintuitive; I don't know the reason behind it.
Here is the complete script with the respective changes, for reference:
set xyplane 0;
set pm3d
set contour
unset clabel
set cntrparam levels 6
unset surface;
unset key;
set pm3d map
splot for [i=1:1] "-" using 1:2:3 notitle with lines lw 2 lc rgb "dark-blue";
#a1 a2 t
0.0 0.0 25.0
0.0 0.1 28.0
0.0 0.2 37.0
0.0 0.3 23.0
0.0 0.4 23.0
0.0 0.5 15.0
0.0 0.6 16.0
0.0 0.7 33.0
0.0 0.8 16.0
0.0 0.9 20.0
0.0 1.0 14.0
0.1 0.0 25.0
0.1 0.1 47.0
0.1 0.2 26.0
0.1 0.3 14.0
0.1 0.4 16.0
0.1 0.5 15.0
0.1 0.6 27.0
0.1 0.7 13.0
0.1 0.8 14.0
0.1 0.9 20.0
0.1 1.0 0.0
0.2 0.0 25.0
0.2 0.1 28.0
0.2 0.2 26.0
0.2 0.3 14.0
0.2 0.4 16.0
0.2 0.5 16.0
0.2 0.6 32.0
0.2 0.7 14.0
0.2 0.8 19.0
0.2 0.9 0.0
0.2 1.0 0.0
0.3 0.0 57.0
0.3 0.1 36.0
0.3 0.2 26.0
0.3 0.3 14.0
0.3 0.4 15.0
0.3 0.5 16.0
0.3 0.6 31.0
0.3 0.7 18.0
0.3 0.8 0.0
0.3 0.9 0.0
0.3 1.0 0.0
0.4 0.0 42.0
0.4 0.1 23.0
0.4 0.2 26.0
0.4 0.3 19.0
0.4 0.4 15.0
0.4 0.5 16.0
0.4 0.6 34.0
0.4 0.7 0.0
0.4 0.8 0.0
0.4 0.9 0.0
0.4 1.0 0.0
0.5 0.0 54.0
0.5 0.1 23.0
0.5 0.2 26.0
0.5 0.3 17.0
0.5 0.4 15.0
0.5 0.5 16.0
0.5 0.6 0.0
0.5 0.7 0.0
0.5 0.8 0.0
0.5 0.9 0.0
0.5 1.0 0.0
0.6 0.0 21.0
0.6 0.1 23.0
0.6 0.2 23.0
0.6 0.3 16.0
0.6 0.4 16.0
0.6 0.5 0.0
0.6 0.6 0.0
0.6 0.7 0.0
0.6 0.8 0.0
0.6 0.9 0.0
0.6 1.0 0.0
0.7 0.0 21.0
0.7 0.1 16.0
0.7 0.2 27.0
0.7 0.3 12.0
0.7 0.4 0.0
0.7 0.5 0.0
0.7 0.6 0.0
0.7 0.7 0.0
0.7 0.8 0.0
0.7 0.9 0.0
0.7 1.0 0.0
0.8 0.0 61.0
0.8 0.1 27.0
0.8 0.2 33.0
0.8 0.3 0.0
0.8 0.4 0.0
0.8 0.5 0.0
0.8 0.6 0.0
0.8 0.7 0.0
0.8 0.8 0.0
0.8 0.9 0.0
0.8 1.0 0.0
0.9 0.0 27.0
0.9 0.1 21.0
0.9 0.2 0.0
0.9 0.3 0.0
0.9 0.4 0.0
0.9 0.5 0.0
0.9 0.6 0.0
0.9 0.7 0.0
0.9 0.8 0.0
0.9 0.9 0.0
0.9 1.0 0.0
1.0 0.0 35.0
1.0 0.1 0.0
1.0 0.2 0.0
1.0 0.3 0.0
1.0 0.4 0.0
1.0 0.5 0.0
1.0 0.6 0.0
1.0 0.7 0.0
1.0 0.8 0.0
1.0 0.9 0.0
1.0 1.0 0.0
e
with the output shown in the resulting contour plot (image not reproduced here).