I am new to R and I want to predict the Class variable in my test set using XGBoost. My training data set looks as follows.
> str(train)
'data.frame': 5000 obs. of 37 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ A1 : num 0.36 0.33 0.33 0.31 0.33 0.31 0.3 0.3 0.3 0.3 ...
$ A2 : num 0.45 0.4 0.4 0.4 0.37 0.37 0.4 0.4 0.35 0.37 ...
$ A3 : num 0.47 0.42 0.4 0.4 0.4 0.38 0.42 0.42 0.38 0.38 ...
$ A4 : num 0.37 0.31 0.33 0.31 0.31 0.3 0.33 0.34 0.3 0.3 ...
$ A5 : num 0.33 0.33 0.31 0.33 0.31 0.31 0.3 0.31 0.3 0.3 ...
$ A6 : num 0.4 0.4 0.4 0.37 0.37 0.4 0.4 0.38 0.37 0.38 ...
$ A7 : num 0.42 0.4 0.4 0.4 0.38 0.4 0.42 0.42 0.38 0.4 ...
$ A8 : num 0.31 0.33 0.31 0.31 0.3 0.31 0.34 0.31 0.3 0.28 ...
$ A9 : num 0.33 0.31 0.33 0.31 0.31 0.3 0.31 0.3 0.3 0.3 ...
$ A10 : num 0.4 0.4 0.37 0.37 0.4 0.4 0.38 0.37 0.38 0.37 ...
$ A11 : num 0.4 0.4 0.4 0.38 0.4 0.4 0.42 0.4 0.4 0.35 ...
$ A12 : num 0.33 0.31 0.31 0.3 0.31 0.31 0.31 0.3 0.28 0.3 ...
$ A13 : num 0.4 0.36 0.33 0.33 0.33 0.3 0.31 0.31 0.31 0.3 ...
$ A14 : num 0.49 0.44 0.4 0.39 0.39 0.39 0.42 0.44 0.37 0.36 ...
$ A15 : num 0.52 0.46 0.41 0.41 0.41 0.41 0.46 0.46 0.41 0.41 ...
$ A16 : num 0.4 0.33 0.32 0.31 0.32 0.32 0.35 0.35 0.29 0.29 ...
$ A17 : num 0.36 0.33 0.33 0.33 0.3 0.3 0.31 0.31 0.3 0.3 ...
$ A18 : num 0.44 0.4 0.39 0.39 0.39 0.39 0.44 0.42 0.36 0.37 ...
$ A19 : num 0.46 0.41 0.41 0.41 0.41 0.42 0.46 0.44 0.41 0.39 ...
$ A20 : num 0.33 0.32 0.31 0.32 0.32 0.33 0.35 0.33 0.29 0.31 ...
$ A21 : num 0.33 0.33 0.33 0.3 0.3 0.3 0.31 0.31 0.3 0.3 ...
$ A22 : num 0.4 0.39 0.39 0.39 0.39 0.4 0.42 0.37 0.37 0.36 ...
$ A23 : num 0.41 0.41 0.41 0.41 0.42 0.46 0.44 0.39 0.39 0.39 ...
$ A24 : num 0.32 0.31 0.32 0.32 0.33 0.35 0.33 0.31 0.31 0.29 ...
$ A25 : num 0.4 0.35 0.33 0.33 0.33 0.33 0.31 0.31 0.29 0.29 ...
$ A26 : num 0.49 0.47 0.42 0.39 0.39 0.4 0.42 0.4 0.36 0.36 ...
$ A27 : num 0.53 0.5 0.44 0.41 0.41 0.41 0.44 0.41 0.38 0.38 ...
$ A28 : num 0.41 0.39 0.34 0.31 0.31 0.31 0.34 0.33 0.29 0.28 ...
$ A29 : num 0.35 0.33 0.33 0.33 0.33 0.31 0.31 0.31 0.29 0.31 ...
$ A30 : num 0.47 0.42 0.39 0.39 0.4 0.42 0.4 0.4 0.36 0.34 ...
$ A31 : num 0.5 0.44 0.41 0.41 0.41 0.43 0.41 0.41 0.38 0.36 ...
$ A32 : num 0.39 0.34 0.31 0.31 0.31 0.34 0.33 0.31 0.28 0.28 ...
$ A33 : num 0.33 0.33 0.33 0.33 0.31 0.31 0.31 0.31 0.31 0.31 ...
$ A34 : num 0.42 0.39 0.39 0.4 0.42 0.42 0.4 0.37 0.34 0.34 ...
$ A35 : num 0.44 0.41 0.41 0.41 0.43 0.43 0.41 0.39 0.36 0.36 ...
$ Class: Factor w/ 6 levels "A","B","C","D",..: 3 3 3 3 3 3 3 3 4 4 ...
My test data set looks the same, except that the Class attribute is empty. I used this code to predict the Class for my test data set.
library(data.table)  # for setDT()
library(xgboost)     # for xgb.DMatrix() and xgb.cv()
train <- read.csv("cse_DS_Intro2TRAIN.csv")
test <- read.csv("cse_DS_Intro2TEST.csv")
setDT(train)
setDT(test)
labels <- train$Class
ts_label <- test$Class
new_tr <- model.matrix(~.+0,data = train[,-c("Class"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("Class"),with=F])
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(
booster = "gbtree",
objective = "binary:logistic",
eta=0.3,
gamma=0,
max_depth=6,
min_child_weight=1,
subsample=1,
colsample_bytree=1
)
xgbcv <- xgb.cv(params = params
,data = dtrain
,nrounds = 100
,nfold = 5
,showsd = T
,stratified = T
,print.every.n = 10
,early.stop.round = 20
,maximize = F
)
When I run the above code, I get this error.
Error in xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj) :
[16:49:39] amalgamation/../src/objective/regression_obj.cc:108: label must
be in [0,1] for logistic regression
Is it possible to predict a factor variable using XGBoost in R?
P.S. I have used Random Forest to predict the Class variable previously and it worked well.
Your target classes must start from 0. Try the following example:
library(xgboost)
data(agaricus.train)
data(agaricus.test)
train = agaricus.train
param = list("objective" = "binary:logistic" ,"eval_metric" = "logloss" ,
"eta" =1 , "max.depth" = 2)
This model works because train$label starts from 0, so the output probabilities are for class '1':
model <- xgboost(data = train$data, label = train$label,
nrounds = 20, objective = "binary:logistic")
This model would not work. Notice the error message when the labels start from 1:
model <- xgboost(data = train$data, label = train$label+1,
nrounds = 20, objective = "binary:logistic")
Just convert the labels to numeric type so that they start from 0; that should work.
Update:
Also, since you have 6 classes, the "objective" should be "multi:softmax" or "multi:softprob", and you should also include the "num_class" parameter.
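The advice above can be sketched roughly as follows. This is only an illustration on made-up data: the toy data frame, the feature names A1 and A2, and the choice of 10 rounds are assumptions, and the training step only runs if the xgboost package is installed.

```r
# Toy stand-in for the asker's data: two numeric features and a 6-level factor.
set.seed(1)
train <- data.frame(A1 = runif(60), A2 = runif(60),
                    Class = factor(rep(LETTERS[1:6], each = 10)))

# Factor -> integer codes 1..6 -> shifted to 0..5, which is what xgboost expects.
labels <- as.integer(train$Class) - 1L
num_class <- nlevels(train$Class)

params <- list(
  objective = "multi:softmax",  # predicts a class index rather than a probability
  num_class = num_class,        # required for the multi:* objectives
  eta = 0.3,
  max_depth = 6
)

# Train only when xgboost is available (assumption: a reasonably recent version).
if (requireNamespace("xgboost", quietly = TRUE)) {
  dtrain <- xgboost::xgb.DMatrix(data = as.matrix(train[, c("A1", "A2")]),
                                 label = labels)
  model <- xgboost::xgb.train(params = params, data = dtrain, nrounds = 10)
  pred <- predict(model, dtrain)
  # Map the 0-based predictions back to the original level names.
  pred_class <- levels(train$Class)[as.integer(pred) + 1L]
}
```

With "multi:softprob" instead, the predictions would be per-class probabilities rather than a single class index.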
I am interested in the Yeast dataset from UCI (please see the link). The data is saved in text format and I would like to load it into RStudio. I saved it in Microsoft Word (copy and paste) and then tried to load it into RStudio, but I got garbled text instead of the data.
https://archive.ics.uci.edu/ml/datasets/Yeast
Any help please?
Grabbing the data is pretty easy; you can just pass the file URL directly to read.table. Getting the names is a lot more work, as they're buried in a text file. If you like, you can extract them with regex:
library(tidyverse)
yeast <- read.table('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.data', stringsAsFactors = FALSE)
l <- readLines('https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/yeast.names')
l <- l[(grep('^7', l) + 1):(grep('^8', l) - 1)]
l <- l[grep('\\d\\..*:', l)]
names(yeast) <- make.names(c(sub('.*\\d\\.\\s+(.*):.*', '\\1', l), 'class'))
str(yeast)
#> 'data.frame': 1484 obs. of 10 variables:
#> $ Sequence.Name: chr "ADT1_YEAST" "ADT2_YEAST" "ADT3_YEAST" "AAR2_YEAST" ...
#> $ mcg : num 0.58 0.43 0.64 0.58 0.42 0.51 0.5 0.48 0.55 0.4 ...
#> $ gvh : num 0.61 0.67 0.62 0.44 0.44 0.4 0.54 0.45 0.5 0.39 ...
#> $ alm : num 0.47 0.48 0.49 0.57 0.48 0.56 0.48 0.59 0.66 0.6 ...
#> $ mit : num 0.13 0.27 0.15 0.13 0.54 0.17 0.65 0.2 0.36 0.15 ...
#> $ erl : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
#> $ pox : num 0 0 0 0 0 0.5 0 0 0 0 ...
#> $ vac : num 0.48 0.53 0.53 0.54 0.48 0.49 0.53 0.58 0.49 0.58 ...
#> $ nuc : num 0.22 0.22 0.22 0.22 0.22 0.22 0.22 0.34 0.22 0.3 ...
#> $ class : chr "MIT" "MIT" "MIT" "NUC" ...
...or just copy them all out by hand.
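If you go the by-hand route, a minimal sketch could look like this. The column names are the ones shown in the str() output above; the two inline rows are taken from that same output and stand in for the real file, so no download is needed to try it.

```r
# Column names typed out by hand, matching the str() output above.
yeast_cols <- c("Sequence.Name", "mcg", "gvh", "alm", "mit",
                "erl", "pox", "vac", "nuc", "class")

# Two sample rows standing in for yeast.data; for the real file, pass the
# file URL to read.table() instead of text =.
sample_txt <- "ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT
ADT2_YEAST  0.43  0.67  0.48  0.27  0.50  0.00  0.53  0.22  MIT"

yeast <- read.table(text = sample_txt, col.names = yeast_cols,
                    stringsAsFactors = FALSE)
str(yeast)
```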
I want to create a new column that takes the minimum of three columns and then adds to or subtracts from it, depending on a condition.
I have the following data frame called df:
a b c
1 0.60 0.27 0.14
2 0.48 0.32 0.21
3 0.42 0.24 0.35
4 0.28 0.33 0.41
5 0.52 0.28 0.22
6 0.34 0.30 0.37
7 0.38 0.28 0.35
8 0.34 0.28 0.40
9 0.53 0.26 0.22
10 0.17 0.27 0.58
11 0.34 0.35 0.33
12 0.19 0.27 0.56
13 0.56 0.29 0.17
14 0.55 0.28 0.19
15 0.29 0.24 0.48
16 0.23 0.31 0.47
17 0.40 0.32 0.28
18 0.50 0.27 0.24
19 0.45 0.28 0.27
20 0.68 0.26 0.05
21 0.40 0.32 0.28
22 0.23 0.26 0.50
23 0.46 0.33 0.20
24 0.46 0.24 0.28
25 0.44 0.24 0.31
26 0.46 0.26 0.27
27 0.30 0.29 0.40
28 0.45 0.20 0.34
29 0.53 0.27 0.20
30 0.33 0.34 0.33
31 0.20 0.26 0.55
32 0.65 0.29 0.06
33 0.45 0.24 0.32
34 0.30 0.26 0.45
35 0.20 0.36 0.45
36 0.38 0.16 0.38
Every row must sum to 1 but, as you can see, only some of them satisfy that condition.
df_total <- rowSums(df[c("a", "b", "c")])
print(df_total)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
1.01 1.01 1.01 1.02 1.02 1.01 1.01 1.02 1.01 1.02 1.02 1.02 1.02 1.02 1.01 1.01 1.00 1.01 1.00
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
0.99 1.00 0.99 0.99 0.98 0.99 0.99 0.99 0.99 1.00 1.00 1.01 1.00 1.01 1.01 1.01 0.92
So, for example, in row 36 of df I need to add to the lowest value (which is 0.16) the number that makes a, b and c sum to 1.
I guess there's an easier way to do this, but I have done this code so far and it doesn't work...Why?
df_total <- rowSums(df[c("a", "b", "c")])
df_for_sum <- df_total[df_total > 1] - 1 #The ones which are above 1
df_for_minus <- -(df_total[df_total < 1]) + 1 #The ones which are below 1
equal_to_100 <- df_total[df_total == 1] #The ones which are ok
df <- df %>%
mutate(d = ifelse(rowSums(df[c("a","b","c")]) > 1,
apply(df[rowSums(df[c("a","b","c")]) > 1,], 1, min) - df_for_sum,
ifelse(rowSums(df[c("a","b","c")]) < 1,
apply(df[rowSums(df[c("a","b","c")]) < 1,], 1, min) + df_for_minus,
ifelse(rowSums(df[c("a","b","c")]) == 1,
apply(df[rowSums(df[c("a","b","c")]) == 1,], 1, min), ""))))
And this is the output:
a b c d
1 0.60 0.27 0.14 0.13
2 0.48 0.32 0.21 0.2
3 0.42 0.24 0.35 0.23
4 0.28 0.33 0.41 0.26
5 0.52 0.28 0.22 0.2
6 0.34 0.30 0.37 0.29
7 0.38 0.28 0.35 0.27
8 0.34 0.28 0.40 0.26
9 0.53 0.26 0.22 0.21
10 0.17 0.27 0.58 0.15
11 0.34 0.35 0.33 0.31
12 0.19 0.27 0.56 0.17
13 0.56 0.29 0.17 0.15
14 0.55 0.28 0.19 0.17
15 0.29 0.24 0.48 0.23
16 0.23 0.31 0.47 0.22
17 0.40 0.32 0.28 0.33 #From here til the end it's wrong!
18 0.50 0.27 0.24 0.19
19 0.45 0.28 0.27 0.28
20 0.68 0.26 0.05 0.24
21 0.40 0.32 0.28 0.28
22 0.23 0.26 0.50 0.26
23 0.46 0.33 0.20 0.25
24 0.46 0.24 0.28 0.27
25 0.44 0.24 0.31 0.3
26 0.46 0.26 0.27 0.21
27 0.30 0.29 0.40 0.24
28 0.45 0.20 0.34 0.0599999999999999
29 0.53 0.27 0.20 0.33
30 0.33 0.34 0.33 0.06
31 0.20 0.26 0.55 0.15
32 0.65 0.29 0.06 0.27
33 0.45 0.24 0.32 0.17
34 0.30 0.26 0.45 0.15
35 0.20 0.36 0.45 0.17
36 0.38 0.16 0.38 0.24
Any thoughts? Any easier way?
You want to calculate the excess difference first:
diff <- 1 - rowSums(df)
then add that to the minimum:
df$d <- apply(df, 1, min) + diff
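As a quick check of the idea, here is a small sketch on three rows from the question (rows 1, 17 and 36). Restricting the calculation to the columns a, b and c, and the extra step that writes the adjusted value back into a copy of the data, are additions for illustration and not part of the approach above.

```r
# Rows 1, 17 and 36 from the question.
df <- data.frame(a = c(0.60, 0.40, 0.38),
                 b = c(0.27, 0.32, 0.16),
                 c = c(0.14, 0.28, 0.38))
cols <- c("a", "b", "c")

# Amount each row is short of (positive) or over (negative) 1.
diff <- 1 - rowSums(df[cols])

# Adjusted minimum; using df[cols] keeps the result stable if df gains columns.
df$d <- unname(apply(df[cols], 1, min) + diff)

# Optional: put the adjusted value back in place of each row's minimum.
adj <- as.matrix(df[cols])
min_col <- max.col(-adj, ties.method = "first")  # column index of the row minimum
adj[cbind(seq_len(nrow(adj)), min_col)] <- df$d
rowSums(adj)  # every row should now sum to 1, up to floating-point error
```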
Here's how to do that without ifelse in dplyr:
library(dplyr)
df2 <- df1 %>%
mutate(difference = 1-rowSums(.) ) %>%
rowwise() %>%
mutate(d = min(c(a,b,c))+difference )
df2
a b c difference d
(dbl) (dbl) (dbl) (dbl) (dbl)
1 0.60 0.27 0.14 -0.01 0.13
2 0.48 0.32 0.21 -0.01 0.20
3 0.42 0.24 0.35 -0.01 0.23
4 0.28 0.33 0.41 -0.02 0.26
5 0.52 0.28 0.22 -0.02 0.20
6 0.34 0.30 0.37 -0.01 0.29
7 0.38 0.28 0.35 -0.01 0.27
8 0.34 0.28 0.40 -0.02 0.26
9 0.53 0.26 0.22 -0.01 0.21
10 0.17 0.27 0.58 -0.02 0.15
11 0.34 0.35 0.33 -0.02 0.31
12 0.19 0.27 0.56 -0.02 0.17
13 0.56 0.29 0.17 -0.02 0.15
14 0.55 0.28 0.19 -0.02 0.17
15 0.29 0.24 0.48 -0.01 0.23
16 0.23 0.31 0.47 -0.01 0.22
17 0.40 0.32 0.28 0.00 0.28
18 0.50 0.27 0.24 -0.01 0.23
19 0.45 0.28 0.27 0.00 0.27
20 0.68 0.26 0.05 0.01 0.06
21 0.40 0.32 0.28 0.00 0.28
22 0.23 0.26 0.50 0.01 0.24
23 0.46 0.33 0.20 0.01 0.21
24 0.46 0.24 0.28 0.02 0.26
25 0.44 0.24 0.31 0.01 0.25
26 0.46 0.26 0.27 0.01 0.27
27 0.30 0.29 0.40 0.01 0.30
28 0.45 0.20 0.34 0.01 0.21
29 0.53 0.27 0.20 0.00 0.20
30 0.33 0.34 0.33 0.00 0.33
31 0.20 0.26 0.55 -0.01 0.19
32 0.65 0.29 0.06 0.00 0.06
33 0.45 0.24 0.32 -0.01 0.23
34 0.30 0.26 0.45 -0.01 0.25
35 0.20 0.36 0.45 -0.01 0.19
36 0.38 0.16 0.38 0.08 0.24
Data:
df1 <-read.table(text="a b c
0.6 0.27 0.14
0.48 0.32 0.21
0.42 0.24 0.35
0.28 0.33 0.41
0.52 0.28 0.22
0.34 0.3 0.37
0.38 0.28 0.35
0.34 0.28 0.4
0.53 0.26 0.22
0.17 0.27 0.58
0.34 0.35 0.33
0.19 0.27 0.56
0.56 0.29 0.17
0.55 0.28 0.19
0.29 0.24 0.48
0.23 0.31 0.47
0.4 0.32 0.28
0.5 0.27 0.24
0.45 0.28 0.27
0.68 0.26 0.05
0.4 0.32 0.28
0.23 0.26 0.5
0.46 0.33 0.2
0.46 0.24 0.28
0.44 0.24 0.31
0.46 0.26 0.27
0.3 0.29 0.4
0.45 0.2 0.34
0.53 0.27 0.2
0.33 0.34 0.33
0.2 0.26 0.55
0.65 0.29 0.06
0.45 0.24 0.32
0.3 0.26 0.45
0.2 0.36 0.45
0.38 0.16 0.38",header=TRUE,stringsAsFactors=FALSE)
I'm trying to load a file whose columns are separated by spaces, but the number of spaces between columns varies. Because of this, when I read the file, R thinks every space starts another column and produces extra empty columns. Is there a way to load the data without this problem?
Example Data :
AAT_ECOLI 0.49 0.29 0.48 0.50 0.56 0.24 0.35 cp
ACEA_ECOLI 0.07 0.40 0.48 0.50 0.54 0.35 0.44 cp
ACEK_ECOLI 0.56 0.40 0.48 0.50 0.49 0.37 0.46 cp
ACKA_ECOLI 0.59 0.49 0.48 0.50 0.52 0.45 0.36 cp
You can see that there are 3 spaces between the first and second columns, and two spaces between the 2nd and 3rd columns.
I'm using this code for loading data
xxx <- read.csv("../Datasets/Ecoli/ecoli.data", header=FALSE,sep=" ")
I tried 3 spaces and other separators, but none of them worked.
Original data file : https://drive.google.com/file/d/0B_XEmkrWR-hCMXVySVI2bU5waGs/view?usp=sharing
Thank you
read.table works perfectly on your downloaded data set. No arguments other than file are necessary (unless you don't want factors). I tend to reserve read.csv for files that are actually comma-separated.
df <- read.table("Downloads/ecoli.data")
str(df)
# 'data.frame': 336 obs. of 9 variables:
# $ V1: Factor w/ 336 levels "AAS_ECOLI","AAT_ECOLI",..: 2 3 4 5 6 8 9 12 ...
# $ V2: num 0.49 0.07 0.56 0.59 0.23 0.67 0.29 0.21 0.2 0.42 ...
# $ V3: num 0.29 0.4 0.4 0.49 0.32 0.39 0.28 0.34 0.44 0.4 ...
# $ V4: num 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 0.48 ...
# $ V5: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# $ V6: num 0.56 0.54 0.49 0.52 0.55 0.36 0.44 0.51 0.46 0.56 ...
# $ V7: num 0.24 0.35 0.37 0.45 0.25 0.38 0.23 0.28 0.51 0.18 ...
# $ V8: num 0.35 0.44 0.46 0.36 0.35 0.46 0.34 0.39 0.57 0.3 ...
# $ V9: Factor w/ 8 levels "cp","im","imL",..: 1 1 1 1 1 1 1 1 1 1 ...
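To see why no extra arguments are needed, here is a small sketch using the four example rows from the question, passed inline through text = instead of a file. read.table's default sep = "" treats any run of whitespace as one separator; stringsAsFactors = FALSE is set only to make the result the same across R versions.

```r
# The four sample rows from the question, with uneven spacing between columns.
txt <- "AAT_ECOLI   0.49  0.29  0.48  0.50  0.56  0.24  0.35   cp
ACEA_ECOLI  0.07  0.40  0.48  0.50  0.54  0.35  0.44   cp
ACEK_ECOLI  0.56  0.40  0.48  0.50  0.49  0.37  0.46   cp
ACKA_ECOLI  0.59  0.49  0.48  0.50  0.52  0.45  0.36   cp"

# Default sep = "" means "one or more spaces, tabs or newlines".
ecoli <- read.table(text = txt, stringsAsFactors = FALSE)
dim(ecoli)
```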
You need to set strip.white=T and sep='' :
xxx <- read.csv("c:\\r_stack_overflow\\test.csv", header=FALSE, strip.white=T, sep='')
> xxx
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
> dim(xxx)
[1] 4 9
And it works!
UPDATE:
It works perfectly with your data too:
xxx <- read.csv("c:\\r_stack_overflow\\ecoli.data", header=FALSE, strip.white=T, sep='')
Output:
> xxx
V1 V2 V3 V4 V5 V6 V7 V8 V9
1 AAT_ECOLI 0.49 0.29 0.48 0.5 0.56 0.24 0.35 cp
2 ACEA_ECOLI 0.07 0.40 0.48 0.5 0.54 0.35 0.44 cp
3 ACEK_ECOLI 0.56 0.40 0.48 0.5 0.49 0.37 0.46 cp
4 ACKA_ECOLI 0.59 0.49 0.48 0.5 0.52 0.45 0.36 cp
5 ADI_ECOLI 0.23 0.32 0.48 0.5 0.55 0.25 0.35 cp
6 ALKH_ECOLI 0.67 0.39 0.48 0.5 0.36 0.38 0.46 cp
7 AMPD_ECOLI 0.29 0.28 0.48 0.5 0.44 0.23 0.34 cp
8 AMY2_ECOLI 0.21 0.34 0.48 0.5 0.51 0.28 0.39 cp
9 APT_ECOLI 0.20 0.44 0.48 0.5 0.46 0.51 0.57 cp
10 ARAC_ECOLI 0.42 0.40 0.48 0.5 0.56 0.18 0.30 cp
11 ASG1_ECOLI 0.42 0.24 0.48 0.5 0.57 0.27 0.37 cp
12 BTUR_ECOLI 0.25 0.48 0.48 0.5 0.44 0.17 0.29 cp
13 CAFA_ECOLI 0.39 0.32 0.48 0.5 0.46 0.24 0.35 cp
14 CAIB_ECOLI 0.51 0.50 0.48 0.5 0.46 0.32 0.35 cp
15 CFA_ECOLI 0.22 0.43 0.48 0.5 0.48 0.16 0.28 cp
16 CHEA_ECOLI 0.25 0.40 0.48 0.5 0.46 0.44 0.52 cp
17 CHEB_ECOLI 0.34 0.45 0.48 0.5 0.38 0.24 0.35 cp
18 CHEW_ECOLI 0.44 0.27 0.48 0.5 0.55 0.52 0.58 cp
19 CHEY_ECOLI 0.23 0.40 0.48 0.5 0.39 0.28 0.38 cp
20 CHEZ_ECOLI 0.41 0.57 0.48 0.5 0.39 0.21 0.32 cp
21 CRL_ECOLI 0.40 0.45 0.48 0.5 0.38 0.22 0.00 cp
22 CSPA_ECOLI 0.31 0.23 0.48 0.5 0.73 0.05 0.14 cp
23 CYNR_ECOLI 0.51 0.54 0.48 0.5 0.41 0.34 0.43 cp
24 CYPB_ECOLI 0.30 0.16 0.48 0.5 0.56 0.11 0.23 cp
25 CYPC_ECOLI 0.36 0.39 0.48 0.5 0.48 0.22 0.23 cp
26 CYSB_ECOLI 0.29 0.37 0.48 0.5 0.48 0.44 0.52 cp
27 CYSE_ECOLI 0.25 0.40 0.48 0.5 0.47 0.33 0.42 cp
28 DAPD_ECOLI 0.21 0.51 0.48 0.5 0.50 0.32 0.41 cp
29 DCP_ECOLI 0.43 0.37 0.48 0.5 0.53 0.35 0.44 cp
30 DDLA_ECOLI 0.43 0.39 0.48 0.5 0.47 0.31 0.41 cp
31 DDLB_ECOLI 0.53 0.38 0.48 0.5 0.44 0.26 0.36 cp
32 DEOC_ECOLI 0.34 0.33 0.48 0.5 0.38 0.35 0.44 cp
33 DLDH_ECOLI 0.56 0.51 0.48 0.5 0.34 0.37 0.46 cp
34 EFG_ECOLI 0.40 0.29 0.48 0.5 0.42 0.35 0.44 cp
35 EFTS_ECOLI 0.24 0.35 0.48 0.5 0.31 0.19 0.31 cp
36 EFTU_ECOLI 0.36 0.54 0.48 0.5 0.41 0.38 0.46 cp
37 ENO_ECOLI 0.29 0.52 0.48 0.5 0.42 0.29 0.39 cp
38 FABB_ECOLI 0.65 0.47 0.48 0.5 0.59 0.30 0.40 cp
39 FES_ECOLI 0.32 0.42 0.48 0.5 0.35 0.28 0.38 cp
40 G3P1_ECOLI 0.38 0.46 0.48 0.5 0.48 0.22 0.29 cp
41 G3P2_ECOLI 0.33 0.45 0.48 0.5 0.52 0.32 0.41 cp
42 G6PI_ECOLI 0.30 0.37 0.48 0.5 0.59 0.41 0.49 cp
43 GCVA_ECOLI 0.40 0.50 0.48 0.5 0.45 0.39 0.47 cp
44 GLNA_ECOLI 0.28 0.38 0.48 0.5 0.50 0.33 0.42 cp
45 GLPD_ECOLI 0.61 0.45 0.48 0.5 0.48 0.35 0.41 cp
46 GLYA_ECOLI 0.17 0.38 0.48 0.5 0.45 0.42 0.50 cp
47 GSHR_ECOLI 0.44 0.35 0.48 0.5 0.55 0.55 0.61 cp
48 GT_ECOLI 0.43 0.40 0.48 0.5 0.39 0.28 0.39 cp
49 HEM6_ECOLI 0.42 0.35 0.48 0.5 0.58 0.15 0.27 cp
50 HEMN_ECOLI 0.23 0.33 0.48 0.5 0.43 0.33 0.43 cp
51 HPRT_ECOLI 0.37 0.52 0.48 0.5 0.42 0.42 0.36 cp
52 IF1_ECOLI 0.29 0.30 0.48 0.5 0.45 0.03 0.17 cp
53 IF2_ECOLI 0.22 0.36 0.48 0.5 0.35 0.39 0.47 cp
54 ILVY_ECOLI 0.23 0.58 0.48 0.5 0.37 0.53 0.59 cp
55 IPYR_ECOLI 0.47 0.47 0.48 0.5 0.22 0.16 0.26 cp
56 KAD_ECOLI 0.54 0.47 0.48 0.5 0.28 0.33 0.42 cp
57 KDSA_ECOLI 0.51 0.37 0.48 0.5 0.35 0.36 0.45 cp
58 LEU3_ECOLI 0.40 0.35 0.48 0.5 0.45 0.33 0.42 cp
59 LON_ECOLI 0.44 0.34 0.48 0.5 0.30 0.33 0.43 cp
60 LPLA_ECOLI 0.42 0.38 0.48 0.5 0.54 0.34 0.43 cp
61 LYSR_ECOLI 0.44 0.56 0.48 0.5 0.50 0.46 0.54 cp
62 MALQ_ECOLI 0.52 0.36 0.48 0.5 0.41 0.28 0.38 cp
63 MALZ_ECOLI 0.36 0.41 0.48 0.5 0.48 0.47 0.54 cp
64 MASY_ECOLI 0.18 0.30 0.48 0.5 0.46 0.24 0.35 cp
65 METB_ECOLI 0.47 0.29 0.48 0.5 0.51 0.33 0.43 cp
66 METC_ECOLI 0.24 0.43 0.48 0.5 0.54 0.52 0.59 cp
67 METK_ECOLI 0.25 0.37 0.48 0.5 0.41 0.33 0.42 cp
And dimensions:
> dim(xxx)
[1] 336 9
There's probably a better way, but I believe this should work:
file_df <- scan('data.txt', what = list("","","","","","","","",""))
df <- data.frame(matrix(unlist(file_df), nrow=4))
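A variant of the same scan() idea that avoids hard-coding nrow = 4 and keeps the numeric columns numeric might look like this. It is only a sketch: the V1..V9 names and the inline text = input are assumptions made so it runs without the file.

```r
txt <- "AAT_ECOLI   0.49  0.29  0.48  0.50  0.56  0.24  0.35   cp
ACEA_ECOLI  0.07  0.40  0.48  0.50  0.54  0.35  0.44   cp
ACEK_ECOLI  0.56  0.40  0.48  0.50  0.49  0.37  0.46   cp
ACKA_ECOLI  0.59  0.49  0.48  0.50  0.52  0.45  0.36   cp"

# One template per column: character, seven numerics, character.
template <- c(list(""), rep(list(0), 7), list(""))

# scan() returns a list with one vector per column; use file = for a real file.
fields <- scan(text = txt, what = template, quiet = TRUE)
names(fields) <- paste0("V", 1:9)

# The list converts directly, whatever the number of rows.
df <- as.data.frame(fields, stringsAsFactors = FALSE)
```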
How can I get the data from a website that presents multiple options, such as the ticker of the stock and the beginning and end of the period I want the data for?
The data is generated by this line of the page source:
<td><input name="button" type="button" class="boton" id="button" value="Buscar" onclick="getInf_Cotizaciones('SIDERC1',document.getElementById('anoIni').value+document.getElementById('mesIni').value+'01',document.getElementById('anoFin').value+document.getElementById('mesFin').value+'01')" /></td>
However, the data doesn't show up in the HTML source code. How can I get R to download this data?
If you use "Developer Mode" on any modern browser and sort the "timeline" view of the "network resources" (they all have this) by "start time", you'd see that site submits the following URL:
http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20140901&fec_fin=20141001&nemonico=SIDERC1
when posting data based on the <select> box choices. You can, then, use the rvest package to grab the resultant table:
library(rvest)
pg <- read_html("http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=20140901&fec_fin=20141001&nemonico=SIDERC1")  # read_html() replaces the older html()
pg %>% html_table()
## [[1]]
## Precio fecha actual NA NA NA NA NA NA NA Precios fecha anterior NA
## 1 Fecha cotización Apertura Cierre Máxima Mínima Promedio CantidadNegociada MontoNegociado (S/.) Fechaanterior Cierreanterior
## 2 01/10/2014 0.33 0.32 0.33 0.32 0.32 193,148.00 62,707.36 30/09/2014 0.32
## 3 30/09/2014 0.33 0.32 0.33 0.32 0.33 542,761.00 177,545.23 29/09/2014 0.34
## 4 29/09/2014 0.34 0.34 0.34 0.34 0.34 42,738.00 14,530.92 26/09/2014 0.34
## 5 26/09/2014 0.34 0.34 0.34 0.34 0.34 139,829.00 47,503.57 25/09/2014 0.35
## 6 25/09/2014 0.35 0.35 0.35 0.35 0.35 56,100.00 19,635.00 23/09/2014 0.35
## 7 24/09/2014 23/09/2014 0.35
## 8 23/09/2014 0.35 0.35 0.35 0.35 0.35 79,800.00 27,900.00 19/09/2014 0.35
## 9 22/09/2014 19/09/2014 0.35
## 10 19/09/2014 0.35 0.35 0.35 0.35 0.35 73,655.00 25,592.70 18/09/2014 0.35
## 11 18/09/2014 0.35 0.35 0.35 0.35 0.35 50,000.00 17,500.00 17/09/2014 0.35
## 12 17/09/2014 0.35 0.35 0.35 0.35 0.35 94,000.00 32,900.00 16/09/2014 0.36
## 13 16/09/2014 0.36 0.36 0.36 0.36 0.36 49,582.00 17,666.87 15/09/2014 0.35
## 14 15/09/2014 0.35 0.35 0.35 0.35 0.35 63,900.00 22,365.00 12/09/2014 0.35
## 15 12/09/2014 0.35 0.35 0.35 0.35 0.35 100,000.00 35,000.00 11/09/2014 0.36
## 16 11/09/2014 0.36 0.36 0.36 0.36 0.36 79,680.00 28,684.80 10/09/2014 0.36
## 17 10/09/2014 0.36 0.36 0.36 0.36 0.36 136,169.00 49,020.84 09/09/2014 0.36
## 18 09/09/2014 0.35 0.36 0.36 0.35 0.36 420,200.00 151,074.07 08/09/2014 0.35
## 19 08/09/2014 0.35 0.35 0.35 0.35 0.35 90,344.00 31,620.40 05/09/2014 0.34
## 20 05/09/2014 0.34 0.34 0.34 0.34 0.34 212,500.00 72,250.00 04/09/2014 0.33
## 21 04/09/2014 0.33 0.33 0.33 0.33 0.33 12,500.00 4,125.00 03/09/2014 0.34
## 22 03/09/2014 0.33 0.34 0.34 0.33 0.33 186,000.00 61,970.00 02/09/2014 0.33
## 23 02/09/2014 0.34 0.33 0.34 0.33 0.34 221,613.00 74,654.42 28/08/2014 0.35
## 24 01/09/2014 28/08/2014 0.35
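To request other tickers or periods, the URL can be built from its three query parameters. The sketch below only constructs the URL; the helper name bvl_url is hypothetical, and whether the site still answers at this address is an assumption.

```r
# Hypothetical helper: builds the query URL from a ticker and two dates.
# The parameter names (fec_inicio, fec_fin, nemonico) come from the URL above.
bvl_url <- function(ticker, start, end) {
  sprintf("http://www.bvl.com.pe/jsp/cotizacion.jsp?fec_inicio=%s&fec_fin=%s&nemonico=%s",
          format(start, "%Y%m%d"), format(end, "%Y%m%d"), ticker)
}

url <- bvl_url("SIDERC1", as.Date("2014-09-01"), as.Date("2014-10-01"))
url

# The table could then be fetched as in the answer, for example:
# tables <- rvest::html_table(rvest::read_html(url))
```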
Hi, I'm pushing data into a matrix so I can create a heatmap. The code I am using is identical to what is published here (http://sebastianraschka.com/Articles/heatmaps_in_r.html). For some of my datasets, when I push the data into the matrix format I get strange behaviour: some of the values change. Some of my datasets work fine but others do not, and I am unsure what differences underlie this behaviour.
Example code;
data <- read.csv("mydata.txt", sep="\t", header =TRUE)
rnames <- data[,1]
mat_data <- data.matrix(data[,2:ncol(data)])
rownames(mat_data) <- rnames
Now example dataframes..
head(data)
1 1.108029 0.42 0.19 0.04 0.47 -0.08 0.47 0.04 0.10
2 1.108029 0.34 0.40 0.25 0.56 -0.08 -0.06 0.11 0.20
3 1.121099 0.1 -0.45 0.11 -0.22 -0.07 -0.40 0.24 -0.17
4 1.123857 0.26 -0.15 0.15 0.31 0.2 -0.24 -0.27 0.40
5 1.129303 0.11 0.13 0.01 -0.11 0.38 0.29 -0.15 -0.18
6 1.135904 0.4 0.07 0.11 0.03 0.6 -0.32 0.14 -0.12
head(mat_data)
tg_q2_rep_A tg_q2_rep_B tg_q2_rep_C tg_q2_rep_D tg_q4_rep_A tg_q4_rep_B tg_q4_rep_C tg_q4_rep_D
1.10802929 70 0.19 0.04 0.47 5 0.47 0.04 0.10
1.1080293 65 0.40 0.25 0.56 5 -0.06 0.11 0.20
1.12109912 49 -0.45 0.11 -0.22 4 -0.40 0.24 -0.17
1.12385707 62 -0.15 0.15 0.31 53 -0.24 -0.27 0.40
1.12930344 50 0.13 0.01 -0.11 65 0.29 -0.15 -0.18
1.1359041 69 0.07 0.11 0.03 69 -0.32 0.14 -0.12
You can see the rownames have had numbers appended to the ends and the first data for tg_q2_rep_A and tg_q4_rep_A have been changed.
If anyone can suggest how to approach this I'd appreciate it. I've been trying to figure this out for days :/
EDIT
As requested ..
> str(data)
'data.frame': 137 obs. of 33 variables:
$ CpG_id.chr.pos.: num 1.11 1.11 1.12 1.12 1.13 ...
$ tg_q2_rep_A : Factor w/ 75 levels "-0.01","-0.02",..: 70 65 49 62 50 69 71 63 57 7 ...
$ tg_q2_rep_B : num 0.19 0.4 -0.45 -0.15 0.13 0.07 0.5 -0.33 0.23 -0.22 ...
$ tg_q2_rep_C : num 0.04 0.25 0.11 0.15 0.01 0.11 0.16 0.03 0.23 -0.32 ...
$ tg_q2_rep_D : num 0.47 0.56 -0.22 0.31 -0.11 0.03 0.31 0.21 0 0.06 ...
$ tg_q4_rep_A : Factor w/ 73 levels "-0.04","-0.05",..: 5 5 4 53 65 69 50 53 59 46 ...
$ tg_q4_rep_B : num 0.47 -0.06 -0.4 -0.24 0.29 -0.32 0.07 -0.23 0.1 -0.09 ...
$ tg_q4_rep_C : num 0.04 0.11 0.24 -0.27 -0.15 0.14 0.14 0.36 0.1 -0.05 ...
$ tg_q4_rep_D : num 0.1 0.2 -0.17 0.4 -0.18 -0.12 0.15 0.18 -0.21 -0.14 ...
$ tg_q6_rep_A : Factor w/ 79 levels "-0.02","-0.03",..: 46 3 7 67 65 77 64 61 41 12 ...
$ tg_q6_rep_B : Factor w/ 87 levels "-0.01","-0.03",..: 68 79 34 11 82 1 63 1 36 32 ...
$ tg_q6_rep_C : num 0.22 0.5 -0.32 0.13 0.24 0.25 0.35 0.07 0.01 -0.44 ...
$ tg_q6_rep_D : Factor w/ 82 levels "-0.04","-0.05",..: 55 50 27 74 71 68 73 61 5 31 ...
$ tg_q8_rep_A : Factor w/ 73 levels "-0.01","-0.02",..: 49 9 2 52 45 50 13 55 48 9 ...
$ tg_q8_rep_B : num 0.05 0.07 -0.31 0.02 0 -0.33 0.03 -0.05 0.08 0.1 ...
$ tg_q8_rep_C : num 0.35 0.5 -0.06 -0.1 0.24 -0.45 -0.27 0.1 0.15 -0.29 ...
$ tg_q8_rep_D : num 0.15 0.08 -0.08 0.31 0.28 0.43 0.41 0.25 -0.05 -0.04 ...
$ tg_w2_rep_A : Factor w/ 72 levels "-0.01","-0.02",..: 49 16 24 66 60 62 62 68 52 49 ...
$ tg_w2_rep_B : num 0.11 0.24 -0.03 -0.43 0.67 -0.13 0.05 -0.4 -0.13 -0.18 ...
$ tg_w2_rep_C : num 0 0.33 -0.09 0 0.12 -0.35 0.06 0.33 0.15 -0.19 ...
$ tg_w2_rep_D : num -0.04 0 -0.03 0.44 0.04 0.23 0.28 0.19 -0.21 -0.17 ...
$ tg_w4_rep_A : Factor w/ 69 levels "-0.0","-0.01",..: 55 58 53 50 52 67 68 63 27 8 ...
$ tg_w4_rep_B : num 0.29 0.63 -0.37 0.09 0.22 -0.21 0.1 -0.14 -0.04 -0.09 ...
$ tg_w4_rep_C : num 0.09 0.13 -0.08 0.17 0.15 -0.33 0 0.38 0.1 -0.62 ...
$ tg_w4_rep_D : num 0.11 0.33 -0.32 0.41 -0.1 0.07 0.23 0.22 0.1 0.06 ...
$ tg_w6_rep_A : Factor w/ 74 levels "-0.01","-0.02",..: 56 45 4 69 59 47 2 40 47 12 ...
$ tg_w6_rep_B : num 0.07 0.13 -0.14 0.15 0.13 -0.17 0.33 0.12 0.07 -0.15 ...
$ tg_w6_rep_C : num 0.13 0.22 0.31 0.08 0.16 -0.33 -0.05 0.43 0.43 -0.06 ...
$ tg_w6_rep_D : num 0.28 0.11 -0.2 0.66 -0.18 0.16 0.26 0.27 0.06 -0.02 ...
$ tg_w8_rep_A : Factor w/ 67 levels "-0.01","-0.02",..: 52 40 37 44 48 61 48 53 39 63 ...
$ tg_w8_rep_B : num 0.3 0.09 -0.22 -0.1 0.14 -0.25 0.1 -0.49 0.19 0.15 ...
$ tg_w8_rep_C : num 0.23 0.27 0.11 -0.25 0.17 -0.13 0.23 0.47 0.33 -0.09 ...
$ tg_w8_rep_D : num -0.04 0.1 -0.25 0.37 -0.09 0.18 0.26 0.2 -0.35 -0.11 ...
The row names have not actually had digits appended. head(data) prints numeric columns with about 7 significant digits, whereas assigning rnames to rownames(mat_data) converts each number to a character string with up to 15 significant digits, so you are seeing the full stored values (1.10802929 rather than 1.108029). A matrix, unlike a data.frame, does not require unique row names, so nothing is being altered to make them unique.
As for columns tg_q2_rep_A and tg_q4_rep_A, it looks as though those values have been replaced by the integer codes of factor levels. That happens when the class of those columns in your original data.frame, data, is "factor" rather than "numeric". Try this to check the classes:
sapply(data, class)
If you've got a mixture of numbers and letters in a column, for example, R will read it as a factor (the default for read.csv before R 4.0). When you convert such a column to numeric format, which is what data.matrix() does, the output is the internal integer code of each factor level, not the number the level's label represents.
I didn't get the same problem for those two columns when I copied and pasted your data into a csv file and loaded it into R, but I'm guessing that you haven't given us all the data there. My first step to figure this out would be to check the classes of the columns.
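The factor behaviour can be reproduced on a tiny made-up example. The data frame below and the stray "n/a" entry are assumptions for illustration; the actual offending values in the real file are unknown.

```r
# Made-up data: column x was read as a factor because of one non-numeric entry.
data <- data.frame(id = c(1.108029, 1.121099, 1.123857),
                   x  = factor(c("0.42", "0.1", "n/a")),
                   y  = c(0.19, -0.45, -0.15))

# data.matrix() turns a factor into its integer level codes, not its labels.
codes <- data.matrix(data[, c("x", "y")])[, "x"]

# Convert via character to recover the numbers; non-numeric entries become NA.
x_num <- suppressWarnings(as.numeric(as.character(data$x)))

# Rows whose original entry could not be parsed as a number.
bad_rows <- which(is.na(x_num))

data$x <- x_num
mat_data <- data.matrix(data[, c("x", "y")])
```

Inspecting the entries at bad_rows in the original file usually shows what made R treat the column as text in the first place.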