Finding the mean of a subset - r

I have made a subset from the dataframe 'Indometh' called 'indo':
indo
Subject time conc
1 1 0.25 1.50
13 2 0.50 1.63
24 3 0.50 1.49
25 3 0.75 1.16
34 4 0.25 1.85
35 4 0.50 1.39
36 4 0.75 1.02
46 5 0.50 1.04
57 6 0.50 1.44
58 6 0.75 1.03
I want to find the average concentration for the subset. I have tried the following, but to no avail:
mean(subset(indo, conc >1 & conc <2))
I know summary(indo) will show the mean of the concentration but wanted to know if there was another way I could do this just for conc.

You can try subsetting via bracket notation:
mean(indo$conc[indo$conc > 1 & indo$conc < 2])
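If you would rather keep your original subset() call, the reason it fails is that subset() returns a data frame, not a numeric vector; extracting the conc column from the result works too. A small sketch of two equivalent options:
mean(subset(indo, conc > 1 & conc < 2)$conc)          # take the column, then the mean
with(subset(indo, conc > 1 & conc < 2), mean(conc))   # or evaluate mean(conc) inside the subset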

Related

How to merge three tables by interleaving them in R?

I have a data frame as follows. I want to see the evolution from RIK_T1 to RIK_T2 through the frequency, row %, and column % of each combination. How can I show them all at once?
ID <- c('1','2','3','4','5','6','7','8','9','10')
RIK_T1 <- c('20','15','20','20','97','20','20','20','15','15')
RIK_T2 <- c('20','15','15','20','97','97','20','20','20','20')
df <- data.frame(ID, RIK_T1, RIK_T2)
df
TAB <- table(df$RIK_T1, df$RIK_T2)
t1 <- addmargins(TAB)                         # TABLE-01: counts
TAB_row <- prop.table(TAB, 1)                 # row proportions
t2 <- round(addmargins(TAB_row), digits = 2)  # TABLE-01-1
TAB_col <- prop.table(TAB, 2)                 # column proportions
t3 <- round(addmargins(TAB_col), digits = 2)  # TABLE-01-2
I get the following three tables: counts, row %, and column %.
Counts:
      15   20   97  Sum
15     1    2    0    3
20     1    4    1    6
97     0    0    1    1
Sum    2    6    2   10

Row %:
      15   20   97  Sum
15  0.33 0.67 0.00 1.00
20  0.17 0.67 0.17 1.00
97  0.00 0.00 1.00 1.00
Sum 0.50 1.33 1.17 3.00

Column %:
      15   20   97  Sum
15  0.50 0.33 0.00 0.83
20  0.50 0.67 0.50 1.67
97  0.00 0.00 0.50 0.50
Sum 1.00 1.00 1.00 3.00
Is it possible to merge them into one table, as follows?
15 20 97 Sum
R%/C% R%/C% R%/C% R%/C%
15 1 2 0 3
0.33/0.50 0.67/0.33 0.00/0.00 1.00/0.83
20 1 4 1 6
0.17/0.50 0.67/0.67 0.17/0.50 1.00/1.67
97 0 0 1 1
0.00/0.00 0.00/0.00 1.00/0.50 1.00/0.50
Sum 2 6 2 10
0.50/1.00 1.33/1.00 1.17/1.00 3.00/3.00
Thanks in advance.
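One way to get close to that layout is to format each row%/col% pair as a string and interleave the count rows with the percentage rows. A minimal sketch using the t1, t2, and t3 objects built above (the exact formatting and row labelling are assumptions about the layout wanted):
# build "row%/col%" strings from the two proportion tables
pct <- matrix(paste(format(t2, nsmall = 2), format(t3, nsmall = 2), sep = "/"),
              nrow = nrow(t1))
# interleave: counts on odd rows, percentage strings on even rows
out <- matrix("", nrow = 2 * nrow(t1), ncol = ncol(t1),
              dimnames = list(as.vector(rbind(rownames(t1), "")), colnames(t1)))
out[seq(1, nrow(out), 2), ] <- format(t1)
out[seq(2, nrow(out), 2), ] <- pct
noquote(out)  # print without quotation marks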

Obtaining Probabilities in KNN Classifier in R

I have the following data sets:
TRAIN dataset
Sr A B C XX
1 0.09 0.52 11.1 high
2 0.13 0.25 11.1 low
3 0.20 0.28 11.1 high
4 0.29 0.50 11.1 low
5 0.31 0.58 11.1 high
6 0.32 0.37 11.1 high
7 0.37 0.58 11.1 low
8 0.38 0.40 11.1 low
9 0.42 0.65 11.1 high
10 0.42 0.79 11.1 low
11 0.44 0.34 11.1 high
12 0.45 0.89 11.1 low
13 0.57 0.72 11.1 low
TEST dataset
Sr A B C XX
1 0.54 1.36 9.80 low
2 0.72 0.82 9.80 low
3 0.19 0.38 9.90 high
4 0.25 0.44 9.90 high
5 0.29 0.54 9.90 high
6 0.30 0.54 9.90 high
7 0.42 0.86 9.90 low
8 0.44 0.86 9.90 low
9 0.49 0.66 9.90 low
10 0.54 0.76 9.90 low
11 0.54 0.76 9.90 low
12 0.68 1.08 9.90 low
13 0.88 0.51 9.90 high
Sr : Serial Number
A-C : Parameters
XX : Output Binary Parameter
I am trying to use the KNN classifier to develop a predictor model with 5 nearest neighbors. Following is the code that I have written:
library(class)  # provides knn()

train_input <- as.matrix(train[, -ncol(train)])
train_output <- as.factor(train[, ncol(train)])
test_input <- as.matrix(test[, -ncol(test)])
prediction <- knn(train_input, test_input, train_output, k = 5, prob = TRUE)
# note: cbind() coerces the factors to their integer codes, which is
# why Actual and Predicted print as 1 and 2 below
resultdf <- as.data.frame(cbind(test[, ncol(test)], prediction))
colnames(resultdf) <- c("Actual", "Predicted")
RESULT dataset
A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
I have the following concerns:
What should I do to obtain probability values? Is this the probability of getting high or low, i.e., P(high) or P(low)?
The levels are set to 1 (high) and 2 (low), based on the order of first appearance: if low had appeared before high in the train dataset, it would have the value 1. I feel this is not good practice. Is there any way I can avoid this?
If there were more than two classes in the classifier, how would I handle this?
I am using the class and e1071 libraries.
Thanks.
Utility function built before the "text" argument to scan was introduced:
rd.txt <- function(txt, header = TRUE, ...) {
  tconn <- textConnection(txt)
  rd <- read.table(tconn, header = header, ...)
  close(tconn)
  rd
}
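(For reference, read.table later gained a text argument, so in current R the helper reduces to a one-line wrapper; this equivalent is an aside, not part of the original answer:)
rd.txt <- function(txt, header = TRUE, ...) read.table(text = txt, header = header, ...)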
RESULT <- rd.txt(" A P
1 2 2
2 2 2
3 1 2
4 1 1
5 1 1
6 1 2
7 2 2
8 2 2
9 2 2
10 2 2
11 2 2
12 2 1
13 1 2
")
> prop.table(table(RESULT))
P
A 1 2
1 0.15385 0.23077
2 0.07692 0.53846
You can also pass a margin argument to prop.table (1 for rows, 2 for columns) to get row or column proportions (a.k.a. probabilities).
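On the first concern: class::knn with prob = TRUE attaches the winning class's vote share to its result as an attribute, and setting factor levels explicitly addresses the ordering concern. A minimal sketch reusing the question's variable names (the conversion to P(high) is an assumption about which orientation you want):
library(class)

# prob = TRUE stores the proportion of votes for the *winning* class
# as the "prob" attribute of the returned factor
prediction <- knn(train_input, test_input, train_output, k = 5, prob = TRUE)
vote_share <- attr(prediction, "prob")

# orient it as P(high) for the binary case
p_high <- ifelse(prediction == "high", vote_share, 1 - vote_share)

# concern 2: fix the level order yourself instead of relying on
# order of first appearance in the data
train_output <- factor(train[, ncol(train)], levels = c("low", "high"))
With more than two classes, the attribute still reports only the winning class's share, so 1 - vote_share no longer corresponds to a single alternative; for full per-class probabilities you would have to tally the neighbor votes yourself or switch to a method that returns them.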

Is it possible to draw a histogram with 2 gaps in R? [duplicate]

I'm not sure exactly what to call this, but I'm trying to achieve a sort of "broken histogram" or "axis gap" effect: http://gnuplot-tricks.blogspot.com/2009/11/broken-histograms.html (example is in gnuplot) with R.
It looks like I should be using the gap.plot() function from the plotrix package, but I've only seen examples of doing that with scatter and line plots. I've been able to add a break in the box around my plot and put a zigzag in there, but I can't figure out how to rescale my axes to zoom in on the part below the break.
The whole point is to be able to show the top value for one really big bar in my histogram while zooming into the majority of my bins which are significantly shorter. (Yes, I know this could potentially be misleading, but I still want to do it if possible)
Any suggestions?
Update 5/10/2012 1040 EST:
If I make a regular histogram with the data and use <- to save it into a variable (hdata <- hist(...)), I get the following values for the following variables:
hdata$breaks
[1] 0.00 0.20 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.30 0.31 0.32 0.33
[16] 0.34 0.35 0.36 0.37 0.38 0.39 0.40 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48
[31] 0.49 0.50 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.60 0.61 0.62 0.63
[46] 0.64 0.65 0.66 0.67 0.68 0.69 0.70 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78
[61] 0.79 0.80 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.90 0.91 0.92 0.93
[76] 0.94 0.95 0.96 0.97 0.98 0.99 1.00
hdata$counts
[1] 675 1 0 1 2 2 0 1 0 2
[11] 1 1 1 2 5 2 1 0 2 0
[21] 2 1 2 2 1 2 2 2 6 1
[31] 0 2 2 2 2 3 5 4 0 1
[41] 5 8 6 4 10 3 7 7 4 3
[51] 7 6 16 11 15 15 16 25 20 22
[61] 31 42 48 62 57 45 69 70 98 104
[71] 79 155 214 277 389 333 626 937 1629 3471
[81] 175786
I believe I want to use $breaks as my x-axis and $counts as my y-axis.
You could use the gap.barplot function from the plotrix package.
# install.packages('plotrix', dependencies = TRUE)
require(plotrix)
example(gap.barplot)
or
twogrp <- c(rnorm(10) + 4, rnorm(10) + 20)
gap.barplot(twogrp, gap = c(8, 16), xlab = "Index", ytics = c(3, 6, 17, 20),
            ylab = "Group values", main = "Barplot with gap")
This will produce a barplot with the y-axis broken between 8 and 16.
update 2012-05-09 19:15:42 PDT
Would it be an option to use facet_wrap with "free" (or "free_y") scales? That way you would be able to compare the data side by side, but with different y scales.
Here is my quick example,
library(ggplot2)
source("http://www.ling.upenn.edu/~joseff/rstudy/data/coins.R")
coins$foo <- ifelse(coins$Mass.g >= 10, "low", "high")
m <- ggplot(coins, aes(x = Mass.g))
m + geom_histogram(binwidth = 2) + facet_wrap(~ foo, scales = "free")
The above produces side-by-side histograms, each with its own y scale.
This seems to work:
gap.barplot(hdata$counts, gap = c(4000, 175000), xlab = "Counts",
            ytics = c(0, 3500, 175000), ylab = "Frequency",
            main = "Barplot with gap", xtics = hdata$counts)

How to pick only efficient frontier points in a plot of portfolio performance?

The name of this question does not do it justice. This is best explained by a numerical example. Let's say I have the following portfolio data, called data.
> data
Stdev AvgReturn
1 1.92 0.35
2 1.53 0.34
3 1.39 0.31
4 1.74 0.31
5 1.16 0.30
6 1.27 0.29
7 1.78 0.28
8 1.59 0.27
9 1.05 0.27
10 1.17 0.26
11 1.62 0.25
12 1.33 0.25
13 0.96 0.24
14 1.47 0.24
15 1.09 0.24
16 1.20 0.24
17 1.49 0.23
18 1.01 0.23
19 0.88 0.22
20 1.21 0.22
21 1.37 0.22
22 1.09 0.22
23 0.95 0.21
24 0.81 0.21
I have already sorted the data.frame by AvgReturn, which I believe makes this easier. My goal is essentially to eliminate all the points that make no sense to choose, i.e., I would not want a portfolio with a lower AvgReturn but a higher Stdev (taking Stdev as an appropriate measure of risk for now).
Essentially, does anyone know of an efficient (in the code sense) way to choose the "rational" portfolio choices? I have manually added a third column to this data frame to show which portfolio choices should be kept. I would remove portfolio 4 because I could choose portfolio 3 instead and receive the same return with a lower stdev. Similarly, I would never choose 8, because I can choose 5 with a higher return and a lower stdev.
> res
Stdev AvgReturn Keep
1 1.92 0.35 TRUE
2 1.53 0.34 TRUE
3 1.39 0.31 TRUE
4 1.74 0.31 FALSE
5 1.16 0.30 TRUE
6 1.27 0.29 FALSE
7 1.78 0.28 FALSE
8 1.59 0.27 FALSE
9 1.05 0.27 TRUE
10 1.17 0.26 FALSE
11 1.62 0.25 FALSE
12 1.33 0.25 FALSE
13 0.96 0.24 TRUE
14 1.47 0.24 FALSE
15 1.09 0.24 FALSE
16 1.20 0.24 FALSE
17 1.49 0.23 FALSE
18 1.01 0.23 FALSE
19 0.88 0.22 TRUE
20 1.21 0.22 FALSE
21 1.37 0.22 FALSE
22 1.09 0.22 FALSE
23 0.95 0.21 FALSE
24 0.81 0.21 TRUE
The only way I can think of solving this issue is by looping through and checking each condition. That, however, would be relatively inefficient in R, my preferred language for this solution. I am having difficulty thinking of a vectorized solution. Any help is appreciated!
EDIT
Here I believe is a solution:
domstrat <- function(data) {
  keep <- c(-1, sign(diff(cummin(data[[1]]))))
  data[which(keep != 0), ]
}
Applied to the data above (domstrat(data)), it returns:
Stdev AvgReturn
1 1.92 0.35
2 1.53 0.34
3 1.39 0.31
5 1.16 0.30
9 1.05 0.27
13 0.96 0.24
19 0.88 0.22
24 0.81 0.21
This uses the function cummax to identify the qualifying points, testing the data (sorted by Stdev) against its cumulative maximum:
> data <- data[order(data$Stdev),]
> data[ which(data$AvgReturn == cummax(data$AvgReturn)) , ]
Stdev AvgReturn
24 0.81 0.21
19 0.88 0.22
13 0.96 0.24
9 1.05 0.27
5 1.16 0.30
3 1.39 0.31
2 1.53 0.34
1 1.92 0.35
> plot(data)
> points( data[ which(data$AvgReturn == cummax(data$AvgReturn)) , ] , col="green")
It's not actually the convex hull but what might be called the "monotonically increasing hull".
You can define a custom R function which contains some logic to decide whether or not to keep a certain portfolio depending on the standard deviation and the average return:
portfolioKeep <- function(x) {
  # x[1] contains the Stdev for the input row
  # x[2] contains the AvgReturn for the input row
  # make your decision based on these inputs here...
  # and remember to return either TRUE or FALSE
}
Next we can use an apply function on your input data frame to come up with the Keep column you want:
# your 'input' data frame
input.mat <- data.matrix(input)
# apply custom function to rows
keep <- apply(input.mat, 1, portfolioKeep)
# bind keep vector to input data frame
input <- cbind(input, keep)
The above code first converts the input data frame into a numeric matrix so that we can use the apply function on it. The apply function runs portfolioKeep on each row, returning either TRUE or FALSE. Finally, we bind the keep column onto the original data frame for convenience.
And now you can do your reporting easily with the data frame input with which you started.
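Note that a keep/drop decision cannot actually be made from a single row in isolation: a row is dominated or not only relative to the other rows. So one hedged way to fill in the logic is to compare each row against the whole data frame (the dominated helper below is hypothetical, not part of the answer above):
# a row is dominated if some other portfolio has no higher Stdev and
# no lower AvgReturn, with at least one of the two strictly better
dominated <- function(i, d) {
  any((d$Stdev <  d$Stdev[i] & d$AvgReturn >= d$AvgReturn[i]) |
      (d$Stdev <= d$Stdev[i] & d$AvgReturn >  d$AvgReturn[i]))
}
data$Keep <- !vapply(seq_len(nrow(data)), dominated, logical(1), d = data)
This is quadratic in the number of portfolios, but for data already sorted as in the question, the cummax approach above is the faster route.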
