Hi, I'm pushing data into a matrix so I can create a heatmap. The code I am using is identical to what is published here (http://sebastianraschka.com/Articles/heatmaps_in_r.html). For some of my datasets, some of the values change when I convert the data to matrix format. Other datasets work fine, and I am unsure what difference between them underlies this strange behaviour.
Example code;
data <- read.csv("mydata.txt", sep="\t", header =TRUE)
rnames <- data[,1]
mat_data <- data.matrix(data[,2:ncol(data)])
rownames(mat_data) <- rnames
Now example dataframes..
head(data)
1 1.108029 0.42 0.19 0.04 0.47 -0.08 0.47 0.04 0.10
2 1.108029 0.34 0.40 0.25 0.56 -0.08 -0.06 0.11 0.20
3 1.121099 0.1 -0.45 0.11 -0.22 -0.07 -0.40 0.24 -0.17
4 1.123857 0.26 -0.15 0.15 0.31 0.2 -0.24 -0.27 0.40
5 1.129303 0.11 0.13 0.01 -0.11 0.38 0.29 -0.15 -0.18
6 1.135904 0.4 0.07 0.11 0.03 0.6 -0.32 0.14 -0.12
head(mat_data)
tg_q2_rep_A tg_q2_rep_B tg_q2_rep_C tg_q2_rep_D tg_q4_rep_A tg_q4_rep_B tg_q4_rep_C tg_q4_rep_D
1.10802929 70 0.19 0.04 0.47 5 0.47 0.04 0.10
1.1080293 65 0.40 0.25 0.56 5 -0.06 0.11 0.20
1.12109912 49 -0.45 0.11 -0.22 4 -0.40 0.24 -0.17
1.12385707 62 -0.15 0.15 0.31 53 -0.24 -0.27 0.40
1.12930344 50 0.13 0.01 -0.11 65 0.29 -0.15 -0.18
1.1359041 69 0.07 0.11 0.03 69 -0.32 0.14 -0.12
You can see the rownames have had numbers appended to the ends and the first data for tg_q2_rep_A and tg_q4_rep_A have been changed.
If anyone can suggest how to approach this I'd appreciate it. I've been trying to figure this out for days :/
EDIT
As requested ..
> str(data)
'data.frame': 137 obs. of 33 variables:
$ CpG_id.chr.pos.: num 1.11 1.11 1.12 1.12 1.13 ...
$ tg_q2_rep_A : Factor w/ 75 levels "-0.01","-0.02",..: 70 65 49 62 50 69 71 63 57 7 ...
$ tg_q2_rep_B : num 0.19 0.4 -0.45 -0.15 0.13 0.07 0.5 -0.33 0.23 -0.22 ...
$ tg_q2_rep_C : num 0.04 0.25 0.11 0.15 0.01 0.11 0.16 0.03 0.23 -0.32 ...
$ tg_q2_rep_D : num 0.47 0.56 -0.22 0.31 -0.11 0.03 0.31 0.21 0 0.06 ...
$ tg_q4_rep_A : Factor w/ 73 levels "-0.04","-0.05",..: 5 5 4 53 65 69 50 53 59 46 ...
$ tg_q4_rep_B : num 0.47 -0.06 -0.4 -0.24 0.29 -0.32 0.07 -0.23 0.1 -0.09 ...
$ tg_q4_rep_C : num 0.04 0.11 0.24 -0.27 -0.15 0.14 0.14 0.36 0.1 -0.05 ...
$ tg_q4_rep_D : num 0.1 0.2 -0.17 0.4 -0.18 -0.12 0.15 0.18 -0.21 -0.14 ...
$ tg_q6_rep_A : Factor w/ 79 levels "-0.02","-0.03",..: 46 3 7 67 65 77 64 61 41 12 ...
$ tg_q6_rep_B : Factor w/ 87 levels "-0.01","-0.03",..: 68 79 34 11 82 1 63 1 36 32 ...
$ tg_q6_rep_C : num 0.22 0.5 -0.32 0.13 0.24 0.25 0.35 0.07 0.01 -0.44 ...
$ tg_q6_rep_D : Factor w/ 82 levels "-0.04","-0.05",..: 55 50 27 74 71 68 73 61 5 31 ...
$ tg_q8_rep_A : Factor w/ 73 levels "-0.01","-0.02",..: 49 9 2 52 45 50 13 55 48 9 ...
$ tg_q8_rep_B : num 0.05 0.07 -0.31 0.02 0 -0.33 0.03 -0.05 0.08 0.1 ...
$ tg_q8_rep_C : num 0.35 0.5 -0.06 -0.1 0.24 -0.45 -0.27 0.1 0.15 -0.29 ...
$ tg_q8_rep_D : num 0.15 0.08 -0.08 0.31 0.28 0.43 0.41 0.25 -0.05 -0.04 ...
$ tg_w2_rep_A : Factor w/ 72 levels "-0.01","-0.02",..: 49 16 24 66 60 62 62 68 52 49 ...
$ tg_w2_rep_B : num 0.11 0.24 -0.03 -0.43 0.67 -0.13 0.05 -0.4 -0.13 -0.18 ...
$ tg_w2_rep_C : num 0 0.33 -0.09 0 0.12 -0.35 0.06 0.33 0.15 -0.19 ...
$ tg_w2_rep_D : num -0.04 0 -0.03 0.44 0.04 0.23 0.28 0.19 -0.21 -0.17 ...
$ tg_w4_rep_A : Factor w/ 69 levels "-0.0","-0.01",..: 55 58 53 50 52 67 68 63 27 8 ...
$ tg_w4_rep_B : num 0.29 0.63 -0.37 0.09 0.22 -0.21 0.1 -0.14 -0.04 -0.09 ...
$ tg_w4_rep_C : num 0.09 0.13 -0.08 0.17 0.15 -0.33 0 0.38 0.1 -0.62 ...
$ tg_w4_rep_D : num 0.11 0.33 -0.32 0.41 -0.1 0.07 0.23 0.22 0.1 0.06 ...
$ tg_w6_rep_A : Factor w/ 74 levels "-0.01","-0.02",..: 56 45 4 69 59 47 2 40 47 12 ...
$ tg_w6_rep_B : num 0.07 0.13 -0.14 0.15 0.13 -0.17 0.33 0.12 0.07 -0.15 ...
$ tg_w6_rep_C : num 0.13 0.22 0.31 0.08 0.16 -0.33 -0.05 0.43 0.43 -0.06 ...
$ tg_w6_rep_D : num 0.28 0.11 -0.2 0.66 -0.18 0.16 0.26 0.27 0.06 -0.02 ...
$ tg_w8_rep_A : Factor w/ 67 levels "-0.01","-0.02",..: 52 40 37 44 48 61 48 53 39 63 ...
$ tg_w8_rep_B : num 0.3 0.09 -0.22 -0.1 0.14 -0.25 0.1 -0.49 0.19 0.15 ...
$ tg_w8_rep_C : num 0.23 0.27 0.11 -0.25 0.17 -0.13 0.23 0.47 0.33 -0.09 ...
$ tg_w8_rep_D : num -0.04 0.1 -0.25 0.37 -0.09 0.18 0.26 0.2 -0.35 -0.11 ...
I don't think anything has actually been appended to your rownames. head(data) prints the first column to 7 significant digits (1.108029), whereas rownames(mat_data) <- rnames converts the numbers to character strings with up to 15 significant digits (1.10802929), so you are most likely seeing the full stored values. Note also that a matrix, unlike a data.frame, does not require unique row names, so R is not altering them to make them unique.
I'm not entirely certain what's going on with columns tg_q2_rep_A and tg_q4_rep_A, but it looks as though those values have been replaced by the integer codes of a factor. That can happen if the class of those columns in your original data.frame, data, was "factor" rather than "numeric". Try this to check the classes:
sapply(data, class)
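If you've got a mixture of numbers and letters in that column, for example, read.csv cannot parse it as numeric and will store it as a factor (the default before R 4.0; later versions keep it as character). When you convert those columns to numeric format, which is what data.matrix() does, the output is the internal integer code of each factor level, not the number that the level's label spells out.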
I didn't get the same problem for those two columns when I copied and pasted your data into a csv file and loaded it into R, but I'm guessing that you haven't given us all the data there. My first step to figure this out would be to check the classes of the columns.
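As a minimal sketch of the factor issue described above (the toy data frame and the stray "n/a" entry are assumptions made up for illustration, not taken from the asker's file), converting through as.character() recovers the intended numbers, whereas data.matrix() on a factor returns level codes:

```r
# Toy frame: rep_A was read as a factor because of one non-numeric entry.
toy <- data.frame(
  id    = c(1.108029, 1.121099, 1.123857),
  rep_A = factor(c("0.42", "0.1", "n/a")),
  rep_B = c(0.19, -0.45, -0.15)
)

# data.matrix() replaces the factor by its internal level codes.
codes <- data.matrix(toy[, c("rep_A", "rep_B")])

# Convert factor columns via their labels; unparseable labels become NA.
to_numeric <- function(col) {
  if (is.factor(col)) suppressWarnings(as.numeric(as.character(col))) else col
}
toy[] <- lapply(toy, to_numeric)

mat <- data.matrix(toy[, c("rep_A", "rep_B")])
rownames(mat) <- toy$id
```

Entries that come back as NA after the conversion point at the cells in the input file that were not valid numbers, which is usually the quickest way to find out why a column was read as a factor.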
Suppose I have a dataframe as follows:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
I have a custom function that makes new columns. (Note, my actual function is a lot more complex and can't be vectorized without a custom function, so please ignore the substance of the transformation here.) For example:
newfun <- function(var = NULL) {
newname <- paste0(var, "NEW")
df[[newname]] <- df[[var]]/100
return(df)
}
I want to apply this over many columns of the dataset repeatedly and have the dataset "build up." This happens just fine when I do the following:
df <- newfun("alpha")
df <- newfun("beta")
df <- newfun("gamma")
Obviously this is redundant and a case for map. But when I do the following I get back a list of dataframes, which is not what I want:
df <- data.frame(
alpha = 0:20,
beta = 30:50,
gamma = 100:120
)
out <- c("alpha", "beta", "gamma") %>%
map(function(x) newfun(x))
How can I iterate over a vector of column names AND see the changes repeatedly applied to the same dataframe?
Writing the function so that it reaches outside its own scope to find some df is risky and will eventually bite you, especially when you see something like:
df[['a']] <- 2
# Error in df[["a"]] <- 2 : object of type 'closure' is not subsettable
You will get this error when it doesn't find your variable named df, and instead finds the base function named df. Two morals from this discovery:
While I admit to using df myself, it's generally bad practice to name variables the same as R functions (especially from base); and
Scope breach is sloppy: it makes a workflow unreproducible and often makes problems or changes difficult to troubleshoot.
To remedy this, and since your function relies on knowing what the old and new variable names are or should be, I think applying a data-only function over the columns with lapply (or purrr's map/pmap, or base R Map) may work better. Further, I suggest that you name the new variables outside of the function, making it "data-only".
dat <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)
cols <- c("alpha", "beta", "gamma")
myfunc <- function(x) x/100
setNames(lapply(dat[,cols], myfunc), paste0("new", cols))
# $newalpha
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13 0.14 0.15 0.16 0.17
# [19] 0.18 0.19 0.20
# $newbeta
# [1] 0.30 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47
# [19] 0.48 0.49 0.50
# $newgamma
# [1] 1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 1.10 1.11 1.12 1.13 1.14 1.15 1.16 1.17
# [19] 1.18 1.19 1.20
From here, we just need to column-bind (cbind) it:
cbind(dat, setNames(lapply(dat[,cols], myfunc), paste0("new", cols)))
# alpha beta gamma newalpha newbeta newgamma
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# ...
Special note: if you plan on doing this iteratively (repeatedly), be careful. Iteratively adding rows to a frame is known to be a bad idea, and I suspect (without proof at the moment) that doing the same with columns is also bad. For that reason, if you do this a lot, consider using do.call(cbind, c(list(dat), ...)) where ... is the list of things to add. This results in a single call to cbind and therefore only a single memory-copy of the original dat. (Contrast that with iteratively calling the *bind functions, which make a complete copy with each pass and scale poorly.)
additions <- lapply(1:3, function(i) setNames(lapply(dat[,cols], myfunc), paste0("new", i, cols)))
str(additions)
# List of 3
# $ :List of 3
# ..$ new1alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new1beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new1gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new2alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new2beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new2gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
# $ :List of 3
# ..$ new3alpha: num [1:21] 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 ...
# ..$ new3beta : num [1:21] 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 ...
# ..$ new3gamma: num [1:21] 1 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09 ...
do.call(cbind, c(list(dat), additions))
# alpha beta gamma new1alpha new1beta new1gamma new2alpha new2beta new2gamma new3alpha new3beta new3gamma
# 1 0 30 100 0.00 0.30 1.00 0.00 0.30 1.00 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01 0.01 0.31 1.01 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02 0.02 0.32 1.02 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03 0.03 0.33 1.03 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04 0.04 0.34 1.04 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05 0.05 0.35 1.05 0.05 0.35 1.05
# ...
An alternative approach is to change your function to only return a vector:
newfun2 <- function(var = NULL) {
df[[var]] / 100
}
newfun2('alpha')
# [1] 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.11 0.12 0.13
#[15] 0.14 0.15 0.16 0.17 0.18 0.19 0.20
Then, using base R, you can use lapply() to loop over the column names and assign all the new columns at once:
cols <- c("alpha", "beta", "gamma")
df[, paste0(cols, 'NEW')] <- lapply(cols, newfun2)
#or
#df[, paste0(cols, 'NEW')] <- purrr::map(cols, newfun2)
df
alpha beta gamma alphaNEW betaNEW gammaNEW
1 0 30 100 0.00 0.30 1.00
2 1 31 101 0.01 0.31 1.01
3 2 32 102 0.02 0.32 1.02
4 3 33 103 0.03 0.33 1.03
5 4 34 104 0.04 0.34 1.04
6 5 35 105 0.05 0.35 1.05
7 6 36 106 0.06 0.36 1.06
8 7 37 107 0.07 0.37 1.07
9 8 38 108 0.08 0.38 1.08
10 9 39 109 0.09 0.39 1.09
11 10 40 110 0.10 0.40 1.10
12 11 41 111 0.11 0.41 1.11
13 12 42 112 0.12 0.42 1.12
14 13 43 113 0.13 0.43 1.13
15 14 44 114 0.14 0.44 1.14
16 15 45 115 0.15 0.45 1.15
17 16 46 116 0.16 0.46 1.16
18 17 47 117 0.17 0.47 1.17
19 18 48 118 0.18 0.48 1.18
20 19 49 119 0.19 0.49 1.19
21 20 50 120 0.20 0.50 1.20
Based on the way you wrote your function, a for loop that repeatedly assigns the result of newfun to df works pretty well.
vars <- names(df)
for (i in vars){
df <- newfun(i)
}
df
# alpha beta gamma alphaNEW betaNEW gammaNEW
# 1 0 30 100 0.00 0.30 1.00
# 2 1 31 101 0.01 0.31 1.01
# 3 2 32 102 0.02 0.32 1.02
# 4 3 33 103 0.03 0.33 1.03
# 5 4 34 104 0.04 0.34 1.04
# 6 5 35 105 0.05 0.35 1.05
# 7 6 36 106 0.06 0.36 1.06
# 8 7 37 107 0.07 0.37 1.07
# 9 8 38 108 0.08 0.38 1.08
# 10 9 39 109 0.09 0.39 1.09
# 11 10 40 110 0.10 0.40 1.10
# 12 11 41 111 0.11 0.41 1.11
# 13 12 42 112 0.12 0.42 1.12
# 14 13 43 113 0.13 0.43 1.13
# 15 14 44 114 0.14 0.44 1.14
# 16 15 45 115 0.15 0.45 1.15
# 17 16 46 116 0.16 0.46 1.16
# 18 17 47 117 0.17 0.47 1.17
# 19 18 48 118 0.18 0.48 1.18
# 20 19 49 119 0.19 0.49 1.19
# 21 20 50 120 0.20 0.50 1.20
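The "build up" pattern in the loop above can also be written as a fold. The sketch below uses base R's Reduce() with a hypothetical helper, add_scaled(), that takes the data frame as an explicit argument instead of reaching for a global df; purrr::reduce() with its .init argument follows the same idea.

```r
# Starting data, passed in explicitly rather than looked up globally.
df0 <- data.frame(alpha = 0:20, beta = 30:50, gamma = 100:120)

# Hypothetical helper: takes a frame and one column name,
# returns the frame with one extra "<name>NEW" column.
add_scaled <- function(d, var) {
  d[[paste0(var, "NEW")]] <- d[[var]] / 100
  d
}

# Reduce() feeds the result of each call into the next one,
# starting from df0, so the columns accumulate on one frame.
result <- Reduce(add_scaled, c("alpha", "beta", "gamma"), df0)
names(result)
```

Because each step receives the previous step's output, the final value is a single data frame rather than a list of three.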
I am interested in a yeast dataset from UCI (please see the link). The data is saved in text format, and I would like to load it into RStudio. I copied and pasted it into a Word document and then tried to load that into RStudio, but I got unreadable characters instead of the data.
https://archive.ics.uci.edu/ml/datasets/Yeast
Any help please?
Grabbing the data is pretty easy; you can just pass the file URL directly to read.table. Getting the names is a lot more work, as they're buried in a text file. If you like, you can extract them with regex:
library(tidyverse)
yeast <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', stringsAsFactors = FALSE)
l <- readLines('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.names')
l <- l[(grep('^7', l) + 1):(grep('^8', l) - 1)]
l <- l[grep('\\d\\..*:', l)]
names(yeast) <- make.names(c(sub('.*\\d\\.\\s+(.*):.*', '\\1', l), 'class'))
str(yeast)
#> 'data.frame': 1484 obs. of 10 variables:
#> $ Sequence.Name: chr "ADT1_YEAST" "ADT2_YEAST" "ADT3_YEAST" "AAR2_YEAST" ...
#> $ mcg : num 0.58 0.43 0.64 0.58 0.42 0.51 0.5 0.48 0.55 0.4 ...
#> $ gvh : num 0.61 0.67 0.62 0.44 0.44 0.4 0.54 0.45 0.5 0.39 ...
#> $ alm : num 0.47 0.48 0.49 0.57 0.48 0.56 0.48 0.59 0.66 0.6 ...
#> $ mit : num 0.13 0.27 0.15 0.13 0.54 0.17 0.65 0.2 0.36 0.15 ...
#> $ erl : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#> $ pox : num 0 0 0 0 0 0.5 0 0 0 0 ...
#> $ vac : num 0.48 0.53 0.53 0.54 0.48 0.49 0.53 0.58 0.49 0.58 ...
#> $ nuc : num 0.22 0.22 0.22 0.22 0.22 0.22 0.22 0.34 0.22 0.3 ...
#> $ class : chr "MIT" "MIT" "MIT" "NUC" ...
...or just copy them all out by hand.
I am new to R and I want to predict the Class variable in my test set using XGBoost. My training data set looks as follows.
> str(train)
'data.frame': 5000 obs. of 37 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ A1 : num 0.36 0.33 0.33 0.31 0.33 0.31 0.3 0.3 0.3 0.3 ...
$ A2 : num 0.45 0.4 0.4 0.4 0.37 0.37 0.4 0.4 0.35 0.37 ...
$ A3 : num 0.47 0.42 0.4 0.4 0.4 0.38 0.42 0.42 0.38 0.38 ...
$ A4 : num 0.37 0.31 0.33 0.31 0.31 0.3 0.33 0.34 0.3 0.3 ...
$ A5 : num 0.33 0.33 0.31 0.33 0.31 0.31 0.3 0.31 0.3 0.3 ...
$ A6 : num 0.4 0.4 0.4 0.37 0.37 0.4 0.4 0.38 0.37 0.38 ...
$ A7 : num 0.42 0.4 0.4 0.4 0.38 0.4 0.42 0.42 0.38 0.4 ...
$ A8 : num 0.31 0.33 0.31 0.31 0.3 0.31 0.34 0.31 0.3 0.28 ...
$ A9 : num 0.33 0.31 0.33 0.31 0.31 0.3 0.31 0.3 0.3 0.3 ...
$ A10 : num 0.4 0.4 0.37 0.37 0.4 0.4 0.38 0.37 0.38 0.37 ...
$ A11 : num 0.4 0.4 0.4 0.38 0.4 0.4 0.42 0.4 0.4 0.35 ...
$ A12 : num 0.33 0.31 0.31 0.3 0.31 0.31 0.31 0.3 0.28 0.3 ...
$ A13 : num 0.4 0.36 0.33 0.33 0.33 0.3 0.31 0.31 0.31 0.3 ...
$ A14 : num 0.49 0.44 0.4 0.39 0.39 0.39 0.42 0.44 0.37 0.36 ...
$ A15 : num 0.52 0.46 0.41 0.41 0.41 0.41 0.46 0.46 0.41 0.41 ...
$ A16 : num 0.4 0.33 0.32 0.31 0.32 0.32 0.35 0.35 0.29 0.29 ...
$ A17 : num 0.36 0.33 0.33 0.33 0.3 0.3 0.31 0.31 0.3 0.3 ...
$ A18 : num 0.44 0.4 0.39 0.39 0.39 0.39 0.44 0.42 0.36 0.37 ...
$ A19 : num 0.46 0.41 0.41 0.41 0.41 0.42 0.46 0.44 0.41 0.39 ...
$ A20 : num 0.33 0.32 0.31 0.32 0.32 0.33 0.35 0.33 0.29 0.31 ...
$ A21 : num 0.33 0.33 0.33 0.3 0.3 0.3 0.31 0.31 0.3 0.3 ...
$ A22 : num 0.4 0.39 0.39 0.39 0.39 0.4 0.42 0.37 0.37 0.36 ...
$ A23 : num 0.41 0.41 0.41 0.41 0.42 0.46 0.44 0.39 0.39 0.39 ...
$ A24 : num 0.32 0.31 0.32 0.32 0.33 0.35 0.33 0.31 0.31 0.29 ...
$ A25 : num 0.4 0.35 0.33 0.33 0.33 0.33 0.31 0.31 0.29 0.29 ...
$ A26 : num 0.49 0.47 0.42 0.39 0.39 0.4 0.42 0.4 0.36 0.36 ...
$ A27 : num 0.53 0.5 0.44 0.41 0.41 0.41 0.44 0.41 0.38 0.38 ...
$ A28 : num 0.41 0.39 0.34 0.31 0.31 0.31 0.34 0.33 0.29 0.28 ...
$ A29 : num 0.35 0.33 0.33 0.33 0.33 0.31 0.31 0.31 0.29 0.31 ...
$ A30 : num 0.47 0.42 0.39 0.39 0.4 0.42 0.4 0.4 0.36 0.34 ...
$ A31 : num 0.5 0.44 0.41 0.41 0.41 0.43 0.41 0.41 0.38 0.36 ...
$ A32 : num 0.39 0.34 0.31 0.31 0.31 0.34 0.33 0.31 0.28 0.28 ...
$ A33 : num 0.33 0.33 0.33 0.33 0.31 0.31 0.31 0.31 0.31 0.31 ...
$ A34 : num 0.42 0.39 0.39 0.4 0.42 0.42 0.4 0.37 0.34 0.34 ...
$ A35 : num 0.44 0.41 0.41 0.41 0.43 0.43 0.41 0.39 0.36 0.36 ...
$ Class: Factor w/ 6 levels "A","B","C","D",..: 3 3 3 3 3 3 3 3 4 4 ...
My test data set looks just the same, except that the Class attribute is empty. I have used this code to predict the Class for my test data set.
library(data.table)  # setDT()
library(xgboost)     # xgb.DMatrix(), xgb.cv()
train <- read.csv("cse_DS_Intro2TRAIN.csv")
test <- read.csv("cse_DS_Intro2TEST.csv")
setDT(train)
setDT(test)
labels <- train$Class
ts_label <- test$Class
new_tr <- model.matrix(~.+0,data = train[,-c("Class"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("Class"),with=F])
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(
booster = "gbtree",
objective = "binary:logistic",
eta=0.3,
gamma=0,
max_depth=6,
min_child_weight=1,
subsample=1,
colsample_bytree=1
)
xgbcv <- xgb.cv(params = params
,data = dtrain
,nrounds = 100
,nfold = 5
,showsd = T
,stratified = T
,print.every.n = 10
,early.stop.round = 20
,maximize = F
)
When I run the above code, I get this error.
Error in xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj) :
[16:49:39] amalgamation/../src/objective/regression_obj.cc:108: label must
be in [0,1] for logistic regression
Is it possible to predict a factor type data using XGBoost in R?
P.S. have used Random Forest to predict the class variable previously and it worked well.
Your target classes must start from 0. Try the following example:
library(xgboost)
data(agaricus.train)
data(agaricus.test)
train = agaricus.train
param = list("objective" = "binary:logistic" ,"eval_metric" = "logloss" ,
"eta" =1 , "max.depth" = 2)
This model works because train$label is coded 0/1, so the predicted probabilities are for class 1:
model <- xgboost(data = train$data, label = train$label,
nrounds = 20, objective = "binary:logistic")
This model does not work. Notice the error message you get when the labels start from 1:
model <- xgboost(data = train$data, label = train$label+1,
nrounds = 20, objective = "binary:logistic")
Just convert your labels to numeric values that start from 0, and that should work.
Update:
Also, since you have six classes, the "objective" should be "multi:softmax" or "multi:softprob" rather than "binary:logistic", and you should also include the "num_class" parameter.
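A rough sketch of the label preparation and parameter list this implies is shown below. The factor values are made up as a stand-in for train$Class, and the block deliberately stops short of calling xgboost itself; the resulting labels and params would be passed to xgb.DMatrix() and xgb.cv() in place of the ones in the question.

```r
# Made-up stand-in for train$Class: a factor with six levels A..F.
class_factor <- factor(c("C", "C", "D", "A", "F", "B", "E"),
                       levels = LETTERS[1:6])

# Factor codes are 1-based; multiclass labels must run from 0 to num_class - 1.
labels <- as.integer(class_factor) - 1L
num_class <- nlevels(class_factor)

# Multiclass objective instead of "binary:logistic".
params <- list(
  booster   = "gbtree",
  objective = "multi:softprob",
  num_class = num_class,
  eta       = 0.3,
  max_depth = 6
)

# Mapping 0-based predicted class indices back to the level names
# (predicted_index is illustrative, not real model output).
predicted_index <- c(2L, 0L, 5L)
predicted_class <- levels(class_factor)[predicted_index + 1L]
```

With "multi:softprob" the model returns one probability per class for each row, while "multi:softmax" returns the predicted class index directly.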
I want to create a new column that takes the minimum value of three columns and then adds to it or subtracts from it, depending on a condition.
I have the following data frame called df:
a b c
1 0.60 0.27 0.14
2 0.48 0.32 0.21
3 0.42 0.24 0.35
4 0.28 0.33 0.41
5 0.52 0.28 0.22
6 0.34 0.30 0.37
7 0.38 0.28 0.35
8 0.34 0.28 0.40
9 0.53 0.26 0.22
10 0.17 0.27 0.58
11 0.34 0.35 0.33
12 0.19 0.27 0.56
13 0.56 0.29 0.17
14 0.55 0.28 0.19
15 0.29 0.24 0.48
16 0.23 0.31 0.47
17 0.40 0.32 0.28
18 0.50 0.27 0.24
19 0.45 0.28 0.27
20 0.68 0.26 0.05
21 0.40 0.32 0.28
22 0.23 0.26 0.50
23 0.46 0.33 0.20
24 0.46 0.24 0.28
25 0.44 0.24 0.31
26 0.46 0.26 0.27
27 0.30 0.29 0.40
28 0.45 0.20 0.34
29 0.53 0.27 0.20
30 0.33 0.34 0.33
31 0.20 0.26 0.55
32 0.65 0.29 0.06
33 0.45 0.24 0.32
34 0.30 0.26 0.45
35 0.20 0.36 0.45
36 0.38 0.16 0.38
Every row must sum to 1, but as you can notice, just some of them satisfy that condition.
df_total <- rowSums(df[c("a", "b", "c")])
print(df_total)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
1.01 1.01 1.01 1.02 1.02 1.01 1.01 1.02 1.01 1.02 1.02 1.02 1.02 1.02 1.01 1.01 1.00 1.01 1.00
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
0.99 1.00 0.99 0.99 0.98 0.99 0.99 0.99 0.99 1.00 1.00 1.01 1.00 1.01 1.01 1.01 0.92
For example, in row 36 of df I need to add to the lowest value (0.16) whatever amount makes a, b and c sum to 1.
I guess there's an easier way to do this, but this is the code I have so far, and it doesn't work. Why?
df_total <- rowSums(df[c("a", "b", "c")])
df_for_sum <- df_total[df_total > 1] - 1 #The ones which are above 1
df_for_minus <- -(df_total[df_total < 1]) + 1 #The ones which are below 1
equal_to_100 <- df_total[df_total == 1] #The ones which are ok
df <- df %>%
mutate(d = ifelse(rowSums(df[c("a","b","c")]) > 1,
apply(df[rowSums(df[c("a","b","c")]) > 1,], 1, min) - df_for_sum,
ifelse(rowSums(df[c("a","b","c")]) < 1,
apply(df[rowSums(df[c("a","b","c")]) < 1,], 1, min) + df_for_minus,
ifelse(rowSums(df[c("a","b","c")]) == 1,
apply(df[rowSums(df[c("a","b","c")]) == 1,], 1, min), ""))))
And this is the output:
a b c d
1 0.60 0.27 0.14 0.13
2 0.48 0.32 0.21 0.2
3 0.42 0.24 0.35 0.23
4 0.28 0.33 0.41 0.26
5 0.52 0.28 0.22 0.2
6 0.34 0.30 0.37 0.29
7 0.38 0.28 0.35 0.27
8 0.34 0.28 0.40 0.26
9 0.53 0.26 0.22 0.21
10 0.17 0.27 0.58 0.15
11 0.34 0.35 0.33 0.31
12 0.19 0.27 0.56 0.17
13 0.56 0.29 0.17 0.15
14 0.55 0.28 0.19 0.17
15 0.29 0.24 0.48 0.23
16 0.23 0.31 0.47 0.22
17 0.40 0.32 0.28 0.33 #From here til the end it's wrong!
18 0.50 0.27 0.24 0.19
19 0.45 0.28 0.27 0.28
20 0.68 0.26 0.05 0.24
21 0.40 0.32 0.28 0.28
22 0.23 0.26 0.50 0.26
23 0.46 0.33 0.20 0.25
24 0.46 0.24 0.28 0.27
25 0.44 0.24 0.31 0.3
26 0.46 0.26 0.27 0.21
27 0.30 0.29 0.40 0.24
28 0.45 0.20 0.34 0.0599999999999999
29 0.53 0.27 0.20 0.33
30 0.33 0.34 0.33 0.06
31 0.20 0.26 0.55 0.15
32 0.65 0.29 0.06 0.27
33 0.45 0.24 0.32 0.17
34 0.30 0.26 0.45 0.15
35 0.20 0.36 0.45 0.17
36 0.38 0.16 0.38 0.24
Any thoughts? Any easier way?
You want to calculate the excess difference first:
diff <- 1 - rowSums(df)
then add that to the minimum:
df$d <- apply(df, 1, min) + diff
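As for why the original attempt goes wrong: a likely reason is that the yes/no arguments given to ifelse() there are built from row subsets, so they are shorter than the test vector and no longer line up with the rows they are supposed to describe. Computing everything on full-length vectors avoids that. The sketch below uses three rows taken from the question (rows 1, 17 and 36) and pmin() as a vectorised alternative to apply(df, 1, min):

```r
# Three rows from the question: sums are 1.01, 1.00 and 0.92.
df <- data.frame(
  a = c(0.60, 0.40, 0.38),
  b = c(0.27, 0.32, 0.16),
  c = c(0.14, 0.28, 0.38)
)

# Signed amount each row is away from 1 (negative when the row sums above 1).
diff <- 1 - rowSums(df[c("a", "b", "c")])

# Row-wise minimum of the three columns, shifted by that amount.
df$d <- pmin(df$a, df$b, df$c) + diff
```

Every vector here has one element per row, so no branch on "above 1", "below 1" or "equal to 1" is needed: the sign of diff takes care of adding or subtracting.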
Here's how to do that without ifelse in dplyr:
df2 <- df1 %>%
mutate(difference = 1-rowSums(.) ) %>%
rowwise() %>%
mutate(d = min(c(a,b,c))+difference )
df2
a b c difference d
(dbl) (dbl) (dbl) (dbl) (dbl)
1 0.60 0.27 0.14 -0.01 0.13
2 0.48 0.32 0.21 -0.01 0.20
3 0.42 0.24 0.35 -0.01 0.23
4 0.28 0.33 0.41 -0.02 0.26
5 0.52 0.28 0.22 -0.02 0.20
6 0.34 0.30 0.37 -0.01 0.29
7 0.38 0.28 0.35 -0.01 0.27
8 0.34 0.28 0.40 -0.02 0.26
9 0.53 0.26 0.22 -0.01 0.21
10 0.17 0.27 0.58 -0.02 0.15
11 0.34 0.35 0.33 -0.02 0.31
12 0.19 0.27 0.56 -0.02 0.17
13 0.56 0.29 0.17 -0.02 0.15
14 0.55 0.28 0.19 -0.02 0.17
15 0.29 0.24 0.48 -0.01 0.23
16 0.23 0.31 0.47 -0.01 0.22
17 0.40 0.32 0.28 0.00 0.28
18 0.50 0.27 0.24 -0.01 0.23
19 0.45 0.28 0.27 0.00 0.27
20 0.68 0.26 0.05 0.01 0.06
21 0.40 0.32 0.28 0.00 0.28
22 0.23 0.26 0.50 0.01 0.24
23 0.46 0.33 0.20 0.01 0.21
24 0.46 0.24 0.28 0.02 0.26
25 0.44 0.24 0.31 0.01 0.25
26 0.46 0.26 0.27 0.01 0.27
27 0.30 0.29 0.40 0.01 0.30
28 0.45 0.20 0.34 0.01 0.21
29 0.53 0.27 0.20 0.00 0.20
30 0.33 0.34 0.33 0.00 0.33
31 0.20 0.26 0.55 -0.01 0.19
32 0.65 0.29 0.06 0.00 0.06
33 0.45 0.24 0.32 -0.01 0.23
34 0.30 0.26 0.45 -0.01 0.25
35 0.20 0.36 0.45 -0.01 0.19
36 0.38 0.16 0.38 0.08 0.24
Data:
df1 <-read.table(text="a b c
0.6 0.27 0.14
0.48 0.32 0.21
0.42 0.24 0.35
0.28 0.33 0.41
0.52 0.28 0.22
0.34 0.3 0.37
0.38 0.28 0.35
0.34 0.28 0.4
0.53 0.26 0.22
0.17 0.27 0.58
0.34 0.35 0.33
0.19 0.27 0.56
0.56 0.29 0.17
0.55 0.28 0.19
0.29 0.24 0.48
0.23 0.31 0.47
0.4 0.32 0.28
0.5 0.27 0.24
0.45 0.28 0.27
0.68 0.26 0.05
0.4 0.32 0.28
0.23 0.26 0.5
0.46 0.33 0.2
0.46 0.24 0.28
0.44 0.24 0.31
0.46 0.26 0.27
0.3 0.29 0.4
0.45 0.2 0.34
0.53 0.27 0.2
0.33 0.34 0.33
0.2 0.26 0.55
0.65 0.29 0.06
0.45 0.24 0.32
0.3 0.26 0.45
0.2 0.36 0.45
0.38 0.16 0.38",header=TRUE,stringsAsFactors=FALSE)
I'm trying to load a file whose columns are separated by spaces, but the number of spaces between columns varies. Because of this, R thinks every space is a new column when reading the file and produces extra empty columns. Is there a way to load the data without this problem?
Example Data :
AAT_ECOLI 0.49 0.29 0.48 0.50 0.56 0.24 0.35 cp
ACEA_ECOLI 0.07 0.40 0.48 0.50 0.54 0.35 0.44 cp
ACEK_ECOLI 0.56 0.40 0.48 0.50 0.49 0.37 0.46 cp
ACKA_ECOLI 0.59 0.49 0.48 0.50 0.52 0.45 0.36 cp
You can see that there are three spaces between the first and second columns, and two spaces between the second and third columns.
I'm using this code for loading data
xxx <- read.csv("../Datasets/Ecoli/ecoli.data", header=FALSE,sep=" ")
I tried three spaces as the separator and other variations, but none of them worked.
Original data file : https://drive.google.com/file/d/0B_XEmkrWR-hCMXVySVI2bU5waGs/view?usp=sharing
Thank you
read.table works perfectly on your downloaded data set. No arguments other than file are necessary (unless you don't want factors). I tend to reserve read.csv for files that are actually comma-separated.
df <- read.table("Downloads/ecoli.data")
str(df)
# 'data.frame': 336 obs. of 9 variables:
# $ V1: Factor w/ 336 levels "AAS_ECOLI","AAT_ECOLI",..: 2 3 4 5 6 8 9 12 ...
# $ V2: num 0.49 0.07 0.56 0.59 0.23 0.67 0.29 0.21 0.2 0.42 ...
# $ V3: num 0.29 0.4 0.4 0.49 0.32 0.39 0.28 0.34 0.44 0.4 ...
# $ V4: num 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 ...
# $ V5: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# $ V6: num 0.56 0.54 0.49 0.52 0.55 0.36 0.44 0.51 0.46 0.56 ...
# $ V7: num 0.24 0.35 0.37 0.45 0.25 0.38 0.23 0.28 0.51 0.18 ...
# $ V8: num 0.35 0.44 0.46 0.36 0.35 0.46 0.34 0.39 0.57 0.3 ...
# $ V9: Factor w/ 8 levels "cp","im","imL",..: 1 1 1 1 1 1 1 1 1 1 ...
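The reason this works is that read.table's default sep = "" means "any run of whitespace", whereas sep = " " treats every single space as a field boundary. A small sketch of the difference, using two lines typed in by hand with the spacing described in the question (the exact spacing is an assumption):

```r
# Two sample lines with runs of two or three spaces between fields.
txt <- "AAT_ECOLI   0.49  0.29  0.48  0.50  0.56  0.24  0.35   cp
ACK_ECOLI   0.59  0.49  0.48  0.50  0.52  0.45  0.36   cp"

# sep = " ": each single space starts a new (mostly empty) column.
single_space <- read.csv(text = txt, header = FALSE, sep = " ")

# Default sep = "": consecutive whitespace counts as one separator.
any_whitespace <- read.table(text = txt, header = FALSE,
                             stringsAsFactors = FALSE)

ncol(single_space)    # more than the 9 real columns
ncol(any_whitespace)  # the 9 real columns
```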
You need to set strip.white=T and sep='' :
xxx <- read.csv("c:\\r_stack_overflow\\test.csv", header=FALSE, strip.white=T, sep='')
> xxx
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
> dim(xxx)
[1] 4 9
And it works!
UPDATE:
It works perfect with your data too:
xxx <- read.csv("c:\\r_stack_overflow\\ecoli.data", header=FALSE, strip.white=T, sep='')
Output:
> xxx
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
5 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
6 ALKH_ECOLI 0.67 0.39 0.48 0.5 0.36 0.38 0.46 cp
7 AMPD_ECOLI 0.29 0.28 0.48 0.5 0.44 0.23 0.34 cp
8 AMY2_ECOLI 0.21 0.34 0.48 0.5 0.51 0.28 0.39 cp
9 APT_ECOLI 0.20 0.44 0.48 0.5 0.46 0.51 0.57 cp
10 ARAC_ECOLI 0.42 0.40 0.48 0.5 0.56 0.18 0.30 cp
11 ASG1_ECOLI 0.42 0.24 0.48 0.5 0.57 0.27 0.37 cp
12 BTUR_ECOLI 0.25 0.48 0.48 0.5 0.44 0.17 0.29 cp
13 CAFA_ECOLI 0.39 0.32 0.48 0.5 0.46 0.24 0.35 cp
14 CAIB_ECOLI 0.51 0.50 0.48 0.5 0.46 0.32 0.35 cp
15 CFA_ECOLI 0.22 0.43 0.48 0.5 0.48 0.16 0.28 cp
16 CHEA_ECOLI 0.25 0.40 0.48 0.5 0.46 0.44 0.52 cp
17 CHEB_ECOLI 0.34 0.45 0.48 0.5 0.38 0.24 0.35 cp
18 CHEW_ECOLI 0.44 0.27 0.48 0.5 0.55 0.52 0.58 cp
19 CHEY_ECOLI 0.23 0.40 0.48 0.5 0.39 0.28 0.38 cp
20 CHEZ_ECOLI 0.41 0.57 0.48 0.5 0.39 0.21 0.32 cp
21 CRL_ECOLI 0.40 0.45 0.48 0.5 0.38 0.22 0.00 cp
22 CSPA_ECOLI 0.31 0.23 0.48 0.5 0.73 0.05 0.14 cp
23 CYNR_ECOLI 0.51 0.54 0.48 0.5 0.41 0.34 0.43 cp
24 CYPB_ECOLI 0.30 0.16 0.48 0.5 0.56 0.11 0.23 cp
25 CYPC_ECOLI 0.36 0.39 0.48 0.5 0.48 0.22 0.23 cp
26 CYSB_ECOLI 0.29 0.37 0.48 0.5 0.48 0.44 0.52 cp
27 CYSE_ECOLI 0.25 0.40 0.48 0.5 0.47 0.33 0.42 cp
28 DAPD_ECOLI 0.21 0.51 0.48 0.5 0.50 0.32 0.41 cp
29 DCP_ECOLI 0.43 0.37 0.48 0.5 0.53 0.35 0.44 cp
30 DDLA_ECOLI 0.43 0.39 0.48 0.5 0.47 0.31 0.41 cp
31 DDLB_ECOLI 0.53 0.38 0.48 0.5 0.44 0.26 0.36 cp
32 DEOC_ECOLI 0.34 0.33 0.48 0.5 0.38 0.35 0.44 cp
33 DLDH_ECOLI 0.56 0.51 0.48 0.5 0.34 0.37 0.46 cp
34 EFG_ECOLI 0.40 0.29 0.48 0.5 0.42 0.35 0.44 cp
35 EFTS_ECOLI 0.24 0.35 0.48 0.5 0.31 0.19 0.31 cp
36 EFTU_ECOLI 0.36 0.54 0.48 0.5 0.41 0.38 0.46 cp
37 ENO_ECOLI 0.29 0.52 0.48 0.5 0.42 0.29 0.39 cp
38 FABB_ECOLI 0.65 0.47 0.48 0.5 0.59 0.30 0.40 cp
39 FES_ECOLI 0.32 0.42 0.48 0.5 0.35 0.28 0.38 cp
40 G3P1_ECOLI 0.38 0.46 0.48 0.5 0.48 0.22 0.29 cp
41 G3P2_ECOLI 0.33 0.45 0.48 0.5 0.52 0.32 0.41 cp
42 G6PI_ECOLI 0.30 0.37 0.48 0.5 0.59 0.41 0.49 cp
43 GCVA_ECOLI 0.40 0.50 0.48 0.5 0.45 0.39 0.47 cp
44 GLNA_ECOLI 0.28 0.38 0.48 0.5 0.50 0.33 0.42 cp
45 GLPD_ECOLI 0.61 0.45 0.48 0.5 0.48 0.35 0.41 cp
46 GLYA_ECOLI 0.17 0.38 0.48 0.5 0.45 0.42 0.50 cp
47 GSHR_ECOLI 0.44 0.35 0.48 0.5 0.55 0.55 0.61 cp
48 GT_ECOLI 0.43 0.40 0.48 0.5 0.39 0.28 0.39 cp
49 HEM6_ECOLI 0.42 0.35 0.48 0.5 0.58 0.15 0.27 cp
50 HEMN_ECOLI 0.23 0.33 0.48 0.5 0.43 0.33 0.43 cp
51 HPRT_ECOLI 0.37 0.52 0.48 0.5 0.42 0.42 0.36 cp
52 IF1_ECOLI 0.29 0.30 0.48 0.5 0.45 0.03 0.17 cp
53 IF2_ECOLI 0.22 0.36 0.48 0.5 0.35 0.39 0.47 cp
54 ILVY_ECOLI 0.23 0.58 0.48 0.5 0.37 0.53 0.59 cp
55 IPYR_ECOLI 0.47 0.47 0.48 0.5 0.22 0.16 0.26 cp
56 KAD_ECOLI 0.54 0.47 0.48 0.5 0.28 0.33 0.42 cp
57 KDSA_ECOLI 0.51 0.37 0.48 0.5 0.35 0.36 0.45 cp
58 LEU3_ECOLI 0.40 0.35 0.48 0.5 0.45 0.33 0.42 cp
59 LON_ECOLI 0.44 0.34 0.48 0.5 0.30 0.33 0.43 cp
60 LPLA_ECOLI 0.42 0.38 0.48 0.5 0.54 0.34 0.43 cp
61 LYSR_ECOLI 0.44 0.56 0.48 0.5 0.50 0.46 0.54 cp
62 MALQ_ECOLI 0.52 0.36 0.48 0.5 0.41 0.28 0.38 cp
63 MALZ_ECOLI 0.36 0.41 0.48 0.5 0.48 0.47 0.54 cp
64 MASY_ECOLI 0.18 0.30 0.48 0.5 0.46 0.24 0.35 cp
65 METB_ECOLI 0.47 0.29 0.48 0.5 0.51 0.33 0.43 cp
66 METC_ECOLI 0.24 0.43 0.48 0.5 0.54 0.52 0.59 cp
67 METK_ECOLI 0.25 0.37 0.48 0.5 0.41 0.33 0.42 cp
And dimensions:
> dim(xxx)
[1] 336 9
There's probably a better way, but I believe this should work (nrow = 4 matches the four example rows above; use the actual number of rows for the full file):
file_df <- scan('data.txt', what = list("","","","","","","","",""))
df <- data.frame(matrix(unlist(file_df), nrow=4))