Subset and remove rows from a dataset - r

I want to exclude some rows from my dataset while also subsetting it, something like what I wrote below:
a <- c(2, 4, 6, 6, 8, 10, 12, 13, 14)
c <- c(2, 2, 2, 2, 2, 2, 4, 4, 4)
d <- c(10, 10, 10, 30, 30, 30, 50, 50, 50)
ID <- rep(c("no", "bo", "fo"), each = 3)
mydata <- data.frame(ID, a, c, d)

library(reshape2)  # melt() comes from reshape2
gg.df <- melt(mydata, id = "ID", variable.name = "variable")
# use != (not ==-) to keep variable "a" while excluding ID "fo"
gg.df[gg.df$variable == "a" & gg.df$ID != "fo", ]
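A dplyr equivalent, as a sketch (filter() keeps only the rows that satisfy all conditions):
library(dplyr)

# same subset: keep variable "a", exclude ID "fo"
gg.df %>% filter(variable == "a", ID != "fo")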

Related

Zelen Exact Test - trying to use k 2x2 tables in the function zelen.test()

I am trying to use the zelen.test function from the package NSM3, but I am having difficulty reading my data into the function.
You can recreate my data using
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(data, ncol = 2)
The documentation on CRAN gives the usage zelen.test(z, example = F, r = 3), where z is an array of k 2 x 2 matrices, example is set to FALSE because TRUE returns the p-value for a built-in example I cannot access, and r is the number of decimal places the user wants in the returned p-value.
I've tried:
zelen.test(events, r = 4)
I thought it may want the study number and the trial data, so I tried this:
studies <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7)
data <- c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1, 12, 13,
74, 74, 77, 85, 31, 37, 11, 7, 18, 18, 96, 97, 48, 40)
events <- matrix(cbind(studies, events), ncol = 3)
zelen.test(events, r = 4)
but it continues to return an error stating
"Error in z[1, 1, ] : incorrect number of dimensions" for both cases I tried above.
Any help would be greatly appreciated!
If we check the source code by typing zelen.test at the console, we can see that when example = TRUE it constructs a 3D array:
...
if (example)
z <- array(c(2, 1, 2, 5, 1, 5, 4, 1), dim = c(2, 2, 2))
...
The expected dimensions of the input z are also specified in the documentation (?zelen.test):
z - data as an array of k 2x2 matrices. Small data sets only!
So we need to construct a 3-dimensional array:
library(NSM3)
z1 <- array(c(4, 2, 3, 3, 8, 3, 4, 7), c(2, 2, 2))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 1
Or with a 3rd dimension of length 3:
z1 <- array(c(4, 2, 3, 3, 8, 3, 4, 7, 0, 7, 1, 1), c(2, 2, 3))
zelen.test(z1, r = 4)
# Zelen's test:
# P = 0.1238
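To run all seven studies at once, a sketch along the same lines. array() fills column-major, so this assumes each consecutive group of four values in data forms one 2x2 table, as the examples above assume; if your tables are instead the paired rows of your 14 x 2 events matrix, the values need rearranging first. Note also the documentation's "Small data sets only!" warning - the exact test may be slow or infeasible for counts this large.
z_all <- array(data, dim = c(2, 2, 7))  # seven 2x2 tables
zelen.test(z_all, r = 4)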

Logarithmic scaling with ggplot2 in R

I am trying to create a diagram using ggplot2. There are several very small values to be displayed and a few larger ones. I'd like to display all of them in an appropriate way using logarithmic scaling. This is what I do:
plotPointsPre <- ggplot(data = solverEntries, aes(x = val, y = instance,
color = solver, group = solver))
...
finalPlot <- plotPointsPre + coord_trans(x = 'log10') + geom_point() +
xlab("costs") + ylab("instance")
This is the result (plot omitted): it looks just the same as without coord_trans(x = 'log10'). However, if I use the transformation on the y-axis, it works (plot omitted).
How do I achieve logarithmic scaling on the x-axis? It does not seem to be about the x-axis itself: if I swap the x and y values, the scaling works on the x-axis and no longer on the y-axis. So there seems to be some problem with the displayed values. Does anybody have an idea how to fix this?
Edit: here's the data contained in solverEntries:
solverEntries <- data.frame(instance = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 16, 16, 16, 17, 17, 17, 17, 18, 18, 18, 18, 19, 19, 19, 19, 20, 20, 20, 20),
solver = c(4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1),
time = c(1, 24, 13, 6, 1, 41, 15, 5, 1, 26, 16, 5, 1, 39, 7, 4, 1, 28, 11, 3, 1, 31, 12, 3, 1, 38, 20, 3, 1, 37, 10, 4, 1, 25, 11, 3, 1, 32, 18, 4, 1, 27, 21, 3, 1, 23, 22, 3, 1, 30, 17, 2, 1, 36, 8, 3, 1, 37, 19, 4, 1, 40, 21, 3, 1, 29, 11, 4, 1, 33, 10, 3, 1, 34, 9, 3, 1, 35, 14, 3),
val = c(6553.48, 6565.6, 6565.6, 6577.72, 6568.04, 7117.14, 6578.98, 6609.28, 6559.54, 6561.98, 6561.98, 6592.28, 6547.42, 7537.64, 6549.86, 6555.92, 6546.24, 6557.18, 6557.18, 6589.92, 6586.22, 6588.66, 6588.66, 6631.08, 6547.42, 7172.86, 6569.3, 6582.6, 6547.42, 6583.78, 6547.42, 6575.28, 6555.92, 6565.68, 6565.68, 6575.36, 6551.04, 6551.04, 6551.04, 6563.16, 6549.86, 6549.86, 6549.86, 6555.92, 6544.98, 6549.86, 6549.86, 6561.98, 6558.36, 6563.24, 6563.24, 6578.98, 6566.86, 7080.78, 6570.48, 6572.92, 6565.6, 7073.46, 6580.16, 6612.9, 6557.18, 7351.04, 6562.06, 6593.54, 6547.42, 6552.3, 6552.3, 6558.36, 6553.48, 6576.54, 6576.54, 6612.9, 6555.92, 6560.8, 6560.8, 6570.48, 6566.86, 6617.78, 6572.92, 6578.98))
Your data in its current form is not log-distributed: most val values sit around 6500, with a few about 10% higher. If you want to stretch the data, you could build a custom transformation with scales::trans_new(), or use the simpler version below, which just subtracts a baseline value so that a log transform becomes useful. After subtracting 6500, the small values map to around 50 and the large values to around 1000, a much more appropriate range for a log scale. We then apply the same shift to the breaks so that the labels appear in the right spots (i.e. the label 6550 is placed where the data value 6550 - 6500 = 50 lands).
This method makes the underlying values more distinguishable, but at the cost of distorting the proportions between them. You can mitigate that by picking useful breaks and labeling them with a scaling statistic, e.g. labeling the 7000 break as "+7% over min".
library(ggplot2)

my_breaks <- c(6550, 6600, 6750, 7000, 7500)
baseline <- 6500

ggplot(data = solverEntries,
       aes(x = val - baseline, y = instance,
           color = solver, group = solver)) +
  geom_point() +
  scale_x_log10(breaks = my_breaks - baseline,
                labels = my_breaks, name = "val")
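For reference, the scales::trans_new() route mentioned above might look roughly like this (a sketch under the same baseline assumption; the transformation name is our choice):
library(ggplot2)
library(scales)

baseline <- 6500
# transform and inverse must be exact mirrors so that
# breaks and labels land in the right places
shift_log <- trans_new(
  name      = "shift_log",
  transform = function(x) log10(x - baseline),
  inverse   = function(x) 10^x + baseline
)

ggplot(solverEntries, aes(x = val, y = instance,
                          color = solver, group = solver)) +
  geom_point() +
  scale_x_continuous(trans = shift_log,
                     breaks = c(6550, 6600, 6750, 7000, 7500))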
Is this what you're looking for?
x_data <- seq(from = 1, to = 50)
y_data <- 2 * x_data + rnorm(n = 50, mean = 0, sd = 5)

# non-log y
ggplot() +
  aes(x = x_data, y = y_data) +
  geom_point()

# log y scale
ggplot() +
  aes(x = x_data, y = y_data) +
  geom_point() +
  scale_y_log10()

# log x scale
ggplot() +
  aes(x = x_data, y = y_data) +
  geom_point() +
  scale_x_log10()
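The distinction that likely explains the original symptom: scale_x_log10() transforms the data before any statistics are computed, while coord_trans(x = "log10") only transforms the coordinate system at drawing time. On data spanning barely 15% of its own magnitude, like the val column above, a log axis looks almost identical to a linear one either way, whereas instance spans 1 to 20 and shows the effect clearly - which is why subtracting a baseline first helps. A minimal side-by-side sketch, reusing the data above:
# scale transform: data is log-transformed before stats
ggplot() + aes(x = x_data, y = y_data) + geom_point() + scale_x_log10()

# coordinate transform: drawing space is transformed after stats
ggplot() + aes(x = x_data, y = y_data) + geom_point() + coord_trans(x = "log10")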

Quade test in R

I would like to perform a Quade test with more than one covariate in R. I know the command quade.test and I have seen the example below:
## Conover (1999, p. 375f):
## Numbers of five brands of a new hand lotion sold in seven stores
## during one week.
y <- matrix(c( 5, 4, 7, 10, 12,
1, 3, 1, 0, 2,
16, 12, 22, 22, 35,
5, 4, 3, 5, 4,
10, 9, 7, 13, 10,
19, 18, 28, 37, 58,
10, 7, 6, 8, 7),
nrow = 7, byrow = TRUE,
dimnames =
list(Store = as.character(1:7),
Brand = LETTERS[1:5]))
y
quade.test(y)
My question is as follows: how could I introduce more than one covariate? In this example the covariate is the Store variable.

dplyr: Create a new variable as a function of all existing variables without defining their names

In the dataframe below, I want to create a new variable as the following function of all existing ones:
as.numeric(paste0(df[i,],collapse=""))
However, I don't want to define the column names explicitly, because their number and names may be different each time. How can I do that using dplyr?
The equivalent in base R would be something like this:
apply(df, 1, function(x) as.numeric(paste0(x, collapse = "")))
df <- structure(list(X1 = c(50, 2, 2, 50, 5, 5, 2, 50, 5, 5, 50, 2,
5, 5, 50, 2, 2, 50, 9, 9, 9, 9, 9, 9), X2 = c(2, 50, 5, 5, 50,
2, 5, 5, 50, 2, 2, 50, 9, 9, 9, 9, 9, 9, 50, 2, 2, 50, 5, 5),
X3 = c(5, 5, 50, 2, 2, 50, 9, 9, 9, 9, 9, 9, 50, 2, 2, 50,
5, 5, 2, 50, 5, 5, 50, 2), X4 = c(9, 9, 9, 9, 9, 9, 50, 2,
2, 50, 5, 5, 2, 50, 5, 5, 50, 2, 5, 5, 50, 2, 2, 50)), class = "data.frame", .Names = c("X1",
"X2", "X3", "X4"), row.names = c(NA, -24L))
You can try:
library(dplyr)
df %>% mutate(newcol = as.numeric(do.call(paste0, df)))
Or (as you suggested, maybe more dplyr style):
df %>% mutate(newcol = as.numeric(do.call(paste0, .)))
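Another tidyverse route, sketched with tidyr::unite(), which concatenates an arbitrary set of columns without naming them:
library(dplyr)
library(tidyr)

# paste all columns together, keep the originals,
# then convert the concatenated string to numeric
df %>%
  unite(newcol, everything(), sep = "", remove = FALSE) %>%
  mutate(newcol = as.numeric(newcol))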

How can I calculate the mean of the top 4 observations in my column?

How can I calculate the mean of the top 4 observations in my column?
c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
For instance, in the above I would have (50+60+50+60)/4 = 55. I only know how to use quantile(), but it does not seem to work for this.
Any ideas?
Since you're interested in only the top 4 items, you can use a partial sort instead of a full sort. If your vector is huge, this can save quite some time:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
idx <- seq(length(x) - 3, length(x))  # positions of the top 4 in the sorted vector
mean(sort(x, partial = idx)[idx])
# [1] 55
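A quick way to check the speed claim on a large vector (a sketch; exact timings will vary by machine):
set.seed(1)
big <- rnorm(1e7)
idx <- seq(length(big) - 3, length(big))

system.time(mean(sort(big, partial = idx)[idx]))      # partial sort
system.time(mean(sort(big, decreasing = TRUE)[1:4]))  # full sort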
Try this:
vec <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(sort(vec, decreasing=TRUE)[1:4])
gives
[1] 55
Maybe something like this:
v <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(head(sort(v, decreasing = TRUE), 4))
First, you sort the vector so that the largest values come first. Then head takes the first 4 values of that vector, and mean averages them.
To be different! Also, please try to do some research on your own before posting.
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(tail(sort(x), 4))
Just to show that you can use quantile in this exercise:
mean(quantile(x, 1 - (0:3)/length(x), type = 1))
# [1] 55
With type = 1 (the inverse of the empirical CDF), the probabilities 1 - (0:3)/length(x) pick out exactly the four largest order statistics. However, the other answers are clearly more efficient.
You could use the order function. Ordering by -x gives the values in descending order; then just average the first 4:
x <- c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12)
mean(x[order(-x)][1:4])
[1] 55
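All of these approaches generalize naturally; a small helper for the mean of the top n values (the name top_n_mean is ours, purely for illustration):
top_n_mean <- function(x, n = 4) {
  # sort descending, take the first n values, average them
  mean(sort(x, decreasing = TRUE)[seq_len(n)])
}
top_n_mean(c(12, 13, 15, 1, 5, 9, 34, 50, 60, 50, 60, 4, 6, 8, 12))
# [1] 55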
