how to remove decimals in matrix rownames? - r

I have a matrix like this:
12Q_S12 14Q_S14 16Q_S16 18Q_S2 22Q_S6 28Q_S12
ENSG00000000003.14 1.18007 0.0000 1.20602 2.24477 1.27663 1.12392
ENSG00000000005.5 0.00000 0.0000 0.00000 0.00000 0.00000 0.00000
and I would like to remove the decimal part only from the rownames (ENSG00000000003.14, ENSG00000000005.5, ...). Any help?
Expected:
12Q_S12 14Q_S14 16Q_S16 18Q_S2 22Q_S6 28Q_S12
ENSG00000000003 1.18007 0.0000 1.20602 2.24477 1.27663 1.12392
ENSG00000000005 0.00000 0.0000 0.00000 0.00000 0.00000 0.00000

You need to reassign the rownames, removing the part after the dot; you can do it with gsub():
rownames(tab) <- gsub("\\..*", "", rownames(tab))
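For a quick check, here is a minimal self-contained sketch; the matrix values are invented for illustration:
# toy matrix with Ensembl-style versioned rownames (values invented)
tab <- matrix(c(1.18007, 0, 1.20602, 0), nrow = 2,
              dimnames = list(c("ENSG00000000003.14", "ENSG00000000005.5"),
                              c("12Q_S12", "14Q_S14")))
rownames(tab) <- gsub("\\..*", "", rownames(tab))  # "\\..*" = first dot and everything after
rownames(tab)
# [1] "ENSG00000000003" "ENSG00000000005"
Since each rowname contains only one dot, sub() would work just as well here.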

Related

Display only specific statistics (median, mean, std, var, min val, max val, quantiles) with one command, with no extra statistics

Display only the specific statistics median, mean, standard deviation, variance, interquartile range, minimum value, maximum value, and the first and third quartiles for the variables of the dataframe.
I tried the following command, but it's not showing the median, min, max, and variance.
library("RcmdrMisc")
numSummary(df[, c('v1','v2'), drop=FALSE],
           statistics=c("median", "mean", "sd", "var", "IQR",
                        "min", "max", "quantiles"),
           quantiles=c(0, .25, .5, .75, 1))
Update
For numSummary, type ?numSummary in your console and you get this example:
if (require("car")){
data(Prestige)
Prestige[1, "income"] <- NA
print(numSummary(Prestige[,c("income", "education")],
statistics=c("mean", "sd", "quantiles", "cv", "skewness", "kurtosis")))
print(numSummary(Prestige[,c("income", "education")], groups=Prestige$type))
remove(Prestige)
}
Output:
mean sd cv skewness kurtosis 0% 25% 50% 75% 100% n NA
income 6742.92079 4230.450564 0.6273914 2.2565718 7.2046866 611.00 4075.000 5902.00 8131.0000 25879.00 101 1
education 10.73804 2.728444 0.2540915 0.3345254 -0.9782158 6.38 8.445 10.54 12.6475 15.97 102 0
Variable: income
mean sd IQR 0% 25% 50% 75% 100% n NA
bc 5374.136 2004.330 2892.75 1656 3836.75 5216.5 6729.50 8895 44 0
prof 10499.733 5505.145 5687.50 4614 6516.75 8645.0 12204.25 25879 30 1
wc 5052.304 1944.325 3175.50 2448 3450.00 4741.0 6625.50 8780 23 0
Variable: education
mean sd IQR 0% 25% 50% 75% 100% n NA
bc 8.359318 1.1648343 1.3525 6.38 7.570 8.35 8.9225 10.93 44 0
prof 14.084194 1.3940248 2.2100 11.09 12.940 14.44 15.1500 15.97 31 0
wc 11.021739 0.9233076 0.8850 9.17 10.575 11.13 11.4600 12.79 23 0
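One hedged note, based on my reading of the help page: the statistics argument of numSummary appears to accept only "mean", "sd", "se(mean)", "IQR", "quantiles", "cv", "skewness" and "kurtosis", so the median, min and max are reported through the quantiles (the 50%, 0% and 100% columns) rather than requested by name, which would explain why the original call showed nothing for them. A sketch along those lines, assuming a data frame df with numeric columns v1 and v2:
library("RcmdrMisc")
numSummary(df[, c("v1", "v2"), drop = FALSE],
           statistics = c("mean", "sd", "IQR", "quantiles"),
           quantiles = c(0, .25, .5, .75, 1))  # 0% = min, 50% = median, 100% = max
# the variance, if needed, is just the "sd" column squared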
First answer, alternatives:
You could use stat.desc from the pastecs package, base summary(), or Hmisc::describe(mtcars) from the Hmisc package.
Here is an example with pastecs:
#install.packages("pastecs")
library(pastecs)
stat.desc(mtcars)
Output:
mpg cyl disp hp drat wt qsec vs
nbr.val 32.0000000 32.0000000 3.200000e+01 32.0000000 32.00000000 32.0000000 32.0000000 32.00000000
nbr.null 0.0000000 0.0000000 0.000000e+00 0.0000000 0.00000000 0.0000000 0.0000000 18.00000000
nbr.na 0.0000000 0.0000000 0.000000e+00 0.0000000 0.00000000 0.0000000 0.0000000 0.00000000
min 10.4000000 4.0000000 7.110000e+01 52.0000000 2.76000000 1.5130000 14.5000000 0.00000000
max 33.9000000 8.0000000 4.720000e+02 335.0000000 4.93000000 5.4240000 22.9000000 1.00000000
range 23.5000000 4.0000000 4.009000e+02 283.0000000 2.17000000 3.9110000 8.4000000 1.00000000
sum 642.9000000 198.0000000 7.383100e+03 4694.0000000 115.09000000 102.9520000 571.1600000 14.00000000
median 19.2000000 6.0000000 1.963000e+02 123.0000000 3.69500000 3.3250000 17.7100000 0.00000000
mean 20.0906250 6.1875000 2.307219e+02 146.6875000 3.59656250 3.2172500 17.8487500 0.43750000
SE.mean 1.0654240 0.3157093 2.190947e+01 12.1203173 0.09451874 0.1729685 0.3158899 0.08909831
CI.mean.0.95 2.1729465 0.6438934 4.468466e+01 24.7195501 0.19277224 0.3527715 0.6442617 0.18171719
var 36.3241028 3.1895161 1.536080e+04 4700.8669355 0.28588135 0.9573790 3.1931661 0.25403226
std.dev 6.0269481 1.7859216 1.239387e+02 68.5628685 0.53467874 0.9784574 1.7869432 0.50401613
coef.var 0.2999881 0.2886338 5.371779e-01 0.4674077 0.14866382 0.3041285 0.1001159 1.15203687
am gear carb type
nbr.val 32.00000000 32.0000000 32.0000000 NA
nbr.null 19.00000000 0.0000000 0.0000000 NA
nbr.na 0.00000000 0.0000000 0.0000000 NA
min 0.00000000 3.0000000 1.0000000 NA
max 1.00000000 5.0000000 8.0000000 NA
range 1.00000000 2.0000000 7.0000000 NA
sum 13.00000000 118.0000000 90.0000000 NA
median 0.00000000 4.0000000 2.0000000 NA
mean 0.40625000 3.6875000 2.8125000 NA
SE.mean 0.08820997 0.1304266 0.2855297 NA
CI.mean.0.95 0.17990541 0.2660067 0.5823417 NA
var 0.24899194 0.5443548 2.6088710 NA
std.dev 0.49899092 0.7378041 1.6152000 NA
coef.var 1.22828533 0.2000825 0.5742933 NA
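The stat.desc() output above is wide; as a small usage note, you can trim and round it for readability (basic = FALSE drops the counts block, per the pastecs documentation):
round(stat.desc(mtcars, basic = FALSE), 2)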

Calculating the median of observations in a particular set of columns in R

I have an sf object containing the following columns of data:
HR60 HR70 HR80 HR90 HC60 HC70 HC80 HC90
0.000000 0.000000 8.855827 0.000000 0.0000000 0.0000000 0.3333333 0.0000000
0.000000 0.000000 17.208742 15.885624 0.0000000 0.0000000 1.0000000 1.0000000
1.863863 1.915158 3.450775 6.462453 0.3333333 0.3333333 1.0000000 2.0000000
...
How can I calculate the median of the HR60 to HR90 columns for all observations and place it in a new column, let's say HR_median? I tried apply(), but it works on the whole dataset, and I need only these four columns to be considered.
We can select those columns and apply median() across each row:
df1$HR_median <- apply(subset(df1, select = HR60:HR90), 1, median)
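One caveat, since the question mentions an sf object: sf geometry columns are "sticky" and tend to survive column selection, so it may be safer to drop the geometry first. A sketch under that assumption:
library(sf)
# drop the sticky geometry column, then select the four numeric columns
hr <- st_drop_geometry(df1)[, c("HR60", "HR70", "HR80", "HR90")]
df1$HR_median <- apply(hr, 1, median, na.rm = TRUE)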

how can I print variable importance with the gbm function?

I used the gbm function to implement gradient boosting, and I want to perform classification.
After that, I used the varImp() function to print the variable importance of the gradient boosting model.
But only 4 variables have non-zero importance, out of the 371 variables in my data. Is that right?
This is my code and result.
> library(gbm)
> asd <- read.csv("bigdatafile.csv", header=TRUE)
> asd1 <- gbm(TARGET ~ ., n.trees=50, distribution="adaboost", verbose=TRUE,
+             interaction.depth=1, data=asd)
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.5840 nan 0.0010 0.0011
2 0.5829 nan 0.0010 0.0011
3 0.5817 nan 0.0010 0.0011
4 0.5806 nan 0.0010 0.0011
5 0.5795 nan 0.0010 0.0011
6 0.5783 nan 0.0010 0.0011
7 0.5772 nan 0.0010 0.0011
8 0.5761 nan 0.0010 0.0011
9 0.5750 nan 0.0010 0.0011
10 0.5738 nan 0.0010 0.0011
20 0.5629 nan 0.0010 0.0011
40 0.5421 nan 0.0010 0.0010
50 0.5321 nan 0.0010 0.0010
>varImp(asd1,numTrees = 50)
Overall
CA0000801 0.00000
AS0000138 0.00000
AS0000140 0.00000
A1 0.00000
PROFILE_CODE 0.00000
A2 0.00000
CB_thinfile2 0.00000
SP_thinfile2 0.00000
thinfile1 0.00000
EW0001901 0.00000
EW0020901 0.00000
EH0001801 0.00000
BS_Seg1_Score 0.00000
BS_Seg2_Score 0.00000
LA0000106 0.00000
EW0001903 0.00000
EW0002801 0.00000
EW0002902 0.00000
EW0002903 0.00000
EW0002904 0.00000
EW0002906 0.00000
LA0300104_SP 56.19052
ASMGRD2 2486.12715
MIX_GRD 2211.03780
P71010401_1 0.00000
PS0000265 0.00000
P11021100 0.00000
PE0000123 0.00000
There are 371 variables, so I didn't list the others above; they all have zero importance.
TARGET is the target variable, and I produced 50 trees. TARGET has two levels, so I used adaboost.
Is there a mistake in my code? Only a few variables have non-zero importance.
Thank you for your reply.
You cannot use importance() nor varImp() here; those are for random forests.
However, you can use summary.gbm from the gbm package.
Ex:
summary.gbm(boost_model)
The output is a table of each variable's relative influence, sorted in decreasing order.
In your code, n.trees is very low and shrinkage is also very low (the StepSize column above shows 0.001).
Just adjust these two parameters.
n.trees is the number of trees; increasing it reduces the error on the training set, but setting it too high may lead to over-fitting.
interaction.depth (maximum nodes per tree) is the number of splits performed on each tree, starting from a single node.
shrinkage is considered a learning rate. Shrinkage is commonly used in ridge regression, where it shrinks regression coefficients toward zero and thus reduces the impact of potentially unstable coefficients.
I recommend 0.1 for all data sets with more than 10,000 records.
Also, use a small shrinkage when growing many trees.
If you use n.trees = 1000 and shrinkage = 0.1, you will get different values.
And if you want to know the relative influence of each variable in the gbm, use summary.gbm() rather than varImp(). Of course varImp() is a good function, but I recommend summary.gbm().
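Putting the advice together, here is a hedged sketch of a refit; it assumes the asker's data frame asd, with TARGET coded 0/1 as the adaboost distribution requires:
library(gbm)
asd1 <- gbm(TARGET ~ ., data = asd,
            distribution = "adaboost",
            n.trees = 1000,          # many more boosting iterations
            shrinkage = 0.1,         # larger learning rate than the 0.001 step size above
            interaction.depth = 1,
            verbose = TRUE)
# relative influence of each variable, without the bar plot
head(summary(asd1, n.trees = 1000, plotit = FALSE), 10)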
Good luck.

Line-by-line parsing of text file containing data with Julia?

I'm trying to read tensor elements written to a text file. The first line of the file defines the tensor dimensions. The next lines give the tensor values. In Matlab, I was able to achieve this with the following code, but I am having a difficult time writing an equivalent function in Julia. Any help is greatly appreciated.
fid = fopen(fname);
shape = sscanf(fgetl(fid), '%i');
for j = 1:shape(3)
    for i = 1:shape(1)
        A(i,:,j) = str2num(fgets(fid));
    end
end
fclose(fid);
The first lines of a typical file are reproduced below:
4 4 48
1.00000 0.00000 0.00000 0.00000
0.00000 1.00000 0.00000 0.00000
0.00000 0.00000 1.00000 0.00000
0.00000 0.00000 0.00000 1.00000
-1.00000 0.00000 0.00000 0.00000
0.00000 1.00000 0.00000 0.00000
0.00000 0.00000 -1.00000 0.00000
0.00000 0.00000 0.00000 1.00000
-1.00000 0.00000 0.00000 0.00000
...
As @colin said in his comment, such a file can easily be read into Julia with this:
julia> data, heading = readdlm("/tmp/data.txt", header=true)
(
9x4 Array{Float64,2}:
1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 1.0 0.0
0.0 0.0 0.0 1.0
-1.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0
0.0 0.0 -1.0 0.0
0.0 0.0 0.0 1.0
-1.0 0.0 0.0 0.0,
1x4 Array{AbstractString,2}:
"4" "4" "48" "")
The two values returned are the array of Float64s and the header row as an array of strings.
Any use?
If you do want to read line by line, you can use the following:
a = open("/path/to/data.txt", "r")
for line in eachline(a)
    print(line)  ## or whatever else you want to do with the line
end
close(a)
In particular, syntax like this might be useful to you:
LineArray = split(replace(line, "\n", ""), "\t")
It will (a) remove the line break at the end of the line and (b) split the line into an indexed array, so that you can pull elements out of it based on the predictable positions they occupy in the line.
You could also put:
Header = readline(a);
right after you open the file if you want to pull out the header specifically, and then run the loop above. Alternatively, you could use enumerate() over eachline(a) and branch on the enumeration index (e.g. treat the line as the header when the index equals 1).
Note though that this will be slower than the answer from daycaster, so it's only worthwhile if you really need the extra flexibility.

Extract information from conditional formula

I'd like to write an R function that accepts a formula as its first argument, similar to lm() or glm() and friends. In this case, it's a function that takes a data frame and writes out a file in SVMLight format, which has this general form:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
For example, the following data frame:
result qid f1 f2 f3 f4 f5 f6 f7 f8
1 -1 1 0.0000 0.1253 0.0000 0.1017 0.00 0.0000 0.0000 0.9999
2 -1 1 0.0098 0.0000 0.0000 0.0000 0.00 0.0316 0.0000 0.3661
3 1 1 0.0000 0.0000 0.1941 0.0000 0.00 0.0000 0.0509 0.0000
4 -1 2 0.0000 0.2863 0.0948 0.0000 0.34 0.0000 0.7428 0.0608
5 1 2 0.0000 0.0000 0.0000 0.4347 0.00 0.0000 0.9539 0.0000
6 1 2 0.0000 0.7282 0.9087 0.0000 0.00 0.0000 0.0000 0.0355
would be represented as follows:
-1 qid:1 2:0.1253 4:0.1017 8:0.9999
-1 qid:1 1:0.0098 6:0.0316 8:0.3661
1 qid:1 3:0.1941 7:0.0509
-1 qid:2 2:0.2863 3:0.0948 5:0.3400 7:0.7428 8:0.0608
1 qid:2 4:0.4347 7:0.9539
1 qid:2 2:0.7282 3:0.9087 8:0.0355
The function I'd like to write would be called something like this:
write.svmlight(result ~ f1+f2+f3+f4+f5+f6+f7+f8 | qid, data=mydata, file="out.txt")
Or even
write.svmlight(result ~ . | qid, data=mydata, file="out.txt")
But I can't figure out how to use model.matrix() and/or model.frame() to know what columns it's supposed to write. Are these the right things to be looking at?
Any help much appreciated!
Partial answer. You can subscript a formula object to get a parse tree of the formula:
> f<-a~b+c|d
> f[[1]]
`~`
> f[[2]]
a
> f[[3]]
b + c | d
> f[[3]][[1]]
`|`
> f[[3]][[2]]
b + c
> f[[3]][[3]]
d
Now all you need is code to walk this tree.
UPDATE: Here is an example of a function that walks the tree.
walker <- function(formu) {
  if (!is(formu, "formula"))
    stop("Want formula")
  lhs <- formu[[2]]
  formu <- formu[[3]]
  if (formu[[1]] != '|')
    stop("Want conditional part")
  condi <- formu[[3]]
  flattener <- function(f) {
    if (length(f) < 3) return(f)
    c(Recall(f[[2]]), Recall(f[[3]]))
  }
  vars <- flattener(formu[[2]])
  list(lhs = lhs, condi = condi, vars = vars)
}
walker(y ~ a + b | c)
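To tie this back to the original question: the pieces returned by walker() can be used to pull the right columns out of the data. A sketch building on walker() above; the helper name and the toy data are mine, and it assumes the feature part of the formula lists plain column names:
svmlight_parts <- function(formula, data) {
  p <- walker(formula)
  list(target   = data[[deparse(p$lhs)]],             # response column
       qid      = data[[deparse(p$condi)]],           # conditioning column
       features = data[vapply(p$vars, deparse, "")])  # feature columns
}
mydata <- data.frame(result = c(-1, 1), qid = c(1, 1),
                     f1 = c(0, 0.0098), f2 = c(0.1253, 0))
str(svmlight_parts(result ~ f1 + f2 | qid, mydata))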
Also look at the documentation for terms.formula and terms.object. Looking at the code of functions that take conditional formulas can also help, for example the lmer function in the lme4 package.
I used
formu.names <- all.vars(formu)
Y.name <- formu.names[1]
X.name <- formu.names[2]
block.name <- formu.names[3]
in the code I wrote about doing a post-hoc analysis for a Friedman test:
http://www.r-statistics.com/2010/02/post-hoc-analysis-for-friedmans-test-r-code/
But it will only work for formulas of the form Y ~ X | block.
I hope others will give a better answer.
