I am interested in plotting the ranges of several variables so that the names appear on the Y-axis and the range on the X-axis, for better visualization.
I have used the following code:
primer_matrix1a <- matrix(
c(
"EF1", 65, 217,
"EF6", 165, 197,
"EF14", 96, 138,
"EF15", 103, 159,
"EF20", 86, 118,
"G9", 115, 173,
"G25", 112, 140,
"BE22", 131, 135,
"TT20", 180, 190
)
,nrow=9,ncol=3,byrow = T)
# Format data
Primer_name <- primer_matrix1a[,1]
Primer_name <- matrix(c(Primer_name),nrow = 9,byrow = T)
Primer_values<- matrix(c(as.numeric(primer_matrix1a[ ,2-3])),nrow = 9,ncol = 2,byrow = T)
Primer_Frame <- data.frame(Primer_name,Primer_values)
colnames(Primer_Frame) <- c("Primer","min","max")
Primer_Frame$mean<- mean(c(Primer_Frame$min,Primer_Frame$max))
library(ggplot2)

ggplot(Primer_Frame, aes(x=Primer))+
geom_linerange(aes(ymin=min,ymax=max),linetype=2,color="blue")+
geom_point(aes(y=min),size=3,color="red")+
geom_point(aes(y=max),size=3,color="red")+
theme_bw()
but the plot is weird: EF15 goes from 103 to 159 and G9 from 115 to 173, so their ranges should overlap, yet in the plot they do not, so I am doing something wrong.
It looks like something is getting muddled when you rebuild the matrix: 2-3 evaluates to -1, so primer_matrix1a[, 2-3] happens to keep the right columns (everything except the first), but as.numeric() flattens them column by column (all the mins, then all the maxes), and matrix(..., byrow = TRUE) then refills the values row by row, pairing mins and maxes from different primers (see the demonstration at the end of this answer). The approach is more complex than it needs to be, so you might want to start afresh. It is probably easiest to convert to a data frame and do the formatting there, rather than fiddling around with all the matrix functions:
df <- as.data.frame(primer_matrix1a)
names(df)<- c("Primer","min","max")
df$min <- as.numeric(as.character(df$min))  # converts factor to numeric
df$max <- as.numeric(as.character(df$max))
df$mean <- (df$min + df$max) / 2            # per-primer mean, not one grand mean
library(ggplot2)

ggplot(df, aes(x = Primer)) +
  geom_linerange(aes(ymin = min, ymax = max), linetype = 2, color = "blue") +
  geom_point(aes(y = min), size = 3, color = "red") +
  geom_point(aes(y = max), size = 3, color = "red") +
  theme_bw()
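And for completeness, a minimal demonstration of the scrambling described above, using the primer_matrix1a from your question:
# 2-3 evaluates to -1, which (by luck) keeps columns 2 and 3; but as.numeric()
# reads the matrix column by column, and byrow = TRUE then refills row by row,
# so each primer ends up paired with values from other primers
vals <- as.numeric(primer_matrix1a[, 2-3])
head(matrix(vals, nrow = 9, ncol = 2, byrow = TRUE), 3)
#      [,1] [,2]
# [1,]   65  165
# [2,]   96  103
# [3,]   86  115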
I am attempting an analysis that requires extracting a few (2 or 3) consecutive values on which to perform further analysis later.
I have two vectors: a is the output from a machine recording consecutive cellular signals; b is the same output shifted by 1. This arrangement is used to look at the variability between one signal and the next:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
What I am trying to do is identify the most homogeneous (stable) region in this set of data, i.e. one where each value is similar to the next.
The idea I had was to perform a subtraction between a and b and consider the absolute value:
c <- abs(a-b)
which gives
c
[1] 20 5 45 2 8 2 7 25 30 20 10 50 1 51
Now, if I want the 3 closest consecutive points, I can clearly see that the sequence 2 8 2 is by far the one I would pick, but I have no idea how to automatically extract these 3 values, especially from arrays of hundreds of data points.
Initial data:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
Find absolute difference between two vectors:
res <- abs(a - b)
For each element in res, get its neighbours and calculate the sum of absolute differences:
# with res[(x-1):(x+1)] we extract x and its neighbours; restricting x to
# 2:(n - 1) avoids the short window at the start and the NA at the end
n <- length(res)
resSimilarity <- sapply(2:(n - 1), function(x) sum(res[(x - 1):(x + 1)]))
resPosition <- which.min(resSimilarity) + 1  # +1 maps back to a position in res
# [1] 5
To extract the values from the original vectors, use:
a[(resPosition - 1):(resPosition + 1)]
# [1] 180 182 190
b[(resPosition - 1):(resPosition + 1)]
# [1] 182 190 188
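If you need this for longer series or other window widths, you could wrap the same idea in a small helper (a hypothetical generalization, keeping the windows-over-res convention used above):
# hypothetical helper: start of the most homogeneous run of k consecutive
# differences in res; k = 3 reproduces the example above
most_stable_window <- function(res, k = 3) {
  sums <- sapply(seq_len(length(res) - k + 1),
                 function(i) sum(res[i:(i + k - 1)]))
  which.min(sums)
}

start <- most_stable_window(res)  # 4
a[start:(start + 2)]              # 180 182 190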
Here is one more alternative:
a <- c(150, 130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181)
b <- c(130, 135, 180, 182, 190, 188, 195, 170, 140, 120, 130, 180, 181, 130)
res <- abs(a-b)
> which.min(diff(c(0, cumsum(res)), lag=3))
[1] 4
> res[(4):(4+2)]
[1] 2 8 2
The above code uses cumsum to get the cumulative sums of the absolute differences. diff with lag=3 on c(0, cumsum(res)) then returns the differences between cumulative sums three positions apart, and each such difference is exactly the sum of three consecutive elements of res, i.e. a rolling 3-sum. which.min therefore returns the start of the window whose 3-element sum is smallest.
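For what it's worth, base R can also compute the rolling 3-sum directly with stats::filter, which keeps the window centred and pads the ends with NA (a sketch, not from either answer above):
# rolling sum of 3 consecutive differences; sides = 2 centres the window,
# so the ends come back as NA and which.min() simply skips them
win <- as.numeric(stats::filter(res, rep(1, 3), sides = 2))
centre <- which.min(win)      # 5, the middle of the 2 8 2 window
a[(centre - 1):(centre + 1)]  # 180 182 190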
I have a (for me) pretty complex problem. I have two vectors:
vectora <- c(111, 245, 379, 516, 671)
vectorb <- c(38, 54, 62, 67, 108)
Furthermore, I have two variables:
x = 80
y = 0.8
A third vector is computed from the variables x and y in the following way:
vectorc <- vectora^y/(1+(vectora^y-1)/x)
The goal is to minimize the deviation between vectorb and vectorc by changing x and y. The deviation is defined by the following function:
deviation <- (abs(vectorb[1]-vectorc[1])) + (abs(vectorb[2]-vectorc[2])) + (abs(vectorb[3]-vectorc[3])) + (abs(vectorb[4]-vectorc[4])) + (abs(vectorb[5]-vectorc[5]))
How can I do this in R?
You can use the optim function!
Here's how it'd work:
vectora <- c(111, 245, 379, 516, 671)
vectorb <- c(38, 54, 62, 67, 108)
fn <- function(v) {
  x <- v[1]
  y <- v[2]
  vectorc <- vectora^y / (1 + (vectora^y - 1) / x)
  sum(abs(vectorb - vectorc))  # the deviation to be minimized
}
optim(c(80, 0.8), fn)
The output of that is:
$par
[1] 91.4452617 0.8840952
$value
[1] 37.2487
$counts
function gradient
151 NA
$convergence
[1] 0
$message
NULL
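A convergence code of 0 means optim reports successful convergence. As a quick sanity check, you can plug the fitted parameters back into the formula and compare vectorc with vectorb (a sketch reusing the objects above):
fit <- optim(c(80, 0.8), fn)
x <- fit$par[1]
y <- fit$par[2]
vectorc <- vectora^y / (1 + (vectora^y - 1) / x)
round(rbind(target = vectorb, fitted = vectorc), 1)  # fitted values vs targets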
For example, I have the following variables:
Store.Number <- c("105, 105, 105, 106, 106, 106, 110, 110, 110, 110")
Date <- c("2017-01-04", "2016-07-06", "2016-04-04", "2017-01-31", "2016-10-31",
"2016-05-11", "2017-01-26", "2016-10-28", "2016-07-20", "2016-04-27")
Jan012016 <- c("369",NA,NA.......)
a <- as.Date("01/01/2016", "%m/%d/%Y)
I want to write a function whose loop first checks, for i = 1:
Store.Number[i+1, 1] == Store.Number[i, 1]
If TRUE:
abs(Date[i, 2] - Date[i+1, 2])
Else:
abs(Date[i, 2] - a)
For example:
Store.Number[2, 1] == Store.Number[1, 1]  # TRUE
Jan012016 = abs(Date[2, 2] - Date[1, 2])
Else:
Jan012016 = abs(Date[i, 2] - a)
The functions most commonly used in R to split data into subsets and run a calculation on each are probably tapply and aggregate. In this case tapply gives output closest to what you want, so here is an example:
Store.Number <- c(105, 105, 105, 106, 106, 106, 110, 110, 110, 110)
Date <- as.Date(c("2017-01-04", "2016-07-06", "2016-04-04", "2017-01-31", "2016-10-31",
"2016-05-11", "2017-01-26", "2016-10-28", "2016-07-20", "2016-04-27"), "%Y-%m-%d")
a <- as.Date("01/01/2016", "%m/%d/%Y")
# for each store: prepend the baseline date a, then take the successive gaps
store.diffs <- tapply(Date, Store.Number, function(x) abs(diff(c(a, x))))
Jan012016 <- do.call(c,store.diffs)
names(Jan012016) <- NULL
Jan012016
# Time differences in days
# [1] 369 182 93 396 92 173 391 90 100 84
Note that I've got rid of the quotes in your definition of Store.Number, and converted Date from character to Date.
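If tapply feels opaque, the same grouping can be written with split() and lapply(), which some find easier to read (a sketch using the objects defined above):
# split the dates by store, then compute the same per-group gaps
by_store <- split(Date, Store.Number)
diffs <- lapply(by_store, function(x) abs(diff(c(a, x))))
unlist(diffs, use.names = FALSE)  # plain numbers of days, stores in order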
Suppose this is my data set:
ID <- 1:50
mou<-sample(c(2000, 2500, 440, 4990, 23000, 450, 3412, 4958,745,1000), 50, replace= TRUE)
calls<-sample(c(50, 51, 12, 60, 90, 888, 444, 668, 16, 89, 222,33, 243, 239, 333, 645,23, 50,555), 50, replace= TRUE)
rev<- sample(c(100, 345, 758, 44, 58, 334, 50000, 888, 205, 940,298, 754), 50, replace= TRUE)
dt<- data.frame(mou, calls, rev)
I made a boxplot of calls and, while analyzing it, looked at the components of the object it returns.
x<-boxplot(dt$calls)
names(x)
> names(x)
[1] "stats" "n" "conf" "out" "group" "names"
Looking at the output of x$stats, I figured that the stats object gives me the lower whisker, the lower hinge, the median, the upper hinge and the upper whisker for each group. But I am a little confused about what the object "out" really means. Does it contain the outlier values or something else?
The out object for my boxplot gives the following results:
> x$out
[1] 555 10000 555 555 555 555 555 10000
It gives you: "The values of any data points which lie beyond the extremes of the whiskers".
Take a look at ?boxplot.stats, where that description comes from, for more insight.
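So yes: those are the points plotted as outliers. You can check this against the whisker ends stored in x$stats (a quick sketch with the objects above):
# whiskers end at x$stats[1] (lower) and x$stats[5] (upper);
# x$out holds every observation falling outside that range
flagged <- dt$calls[dt$calls < x$stats[1] | dt$calls > x$stats[5]]
all(sort(flagged) == sort(x$out))  # should be TRUE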
In R I have 2 vectors:
u <- c(109, 77, 57, 158, 60, 63, 42, 20, 139, 15, 64, 18)
v <- c(734, 645, 1001, 1117, 1071, 687, 162, 84, 626, 64, 218, 79)
I want to test H: u and v are independent, so I run a chi-square test:
chisq.test( as.data.frame( rbind(u,v) ) )
and get a very low p-value, meaning that I can reject H, i.e. that u and v are not independent.
But when I type
chisq.test(u,v)
I get a p-value of 0.23, which means that I can accept H.
Which one of these two tests should I choose?
Furthermore, I want to find the entries in these vectors that cause this low p-value. Any ideas on how to do this?
The test statistic uses the sum of squared standardised residuals, so you can look at those values to get an idea of how much particular entries contribute:
m <- chisq.test(u, v)
residuals(m)  # Pearson residuals: (observed - expected) / sqrt(expected)
m$stdres      # standardized residuals, approximately N(0, 1) under independence
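Note that m here comes from chisq.test(u, v), which cross-tabulates u against v. If the low p-value you saw came from the table version in your first call, inspect that fit instead (a sketch):
# residuals of the contingency-table version of the test
m2 <- chisq.test(as.data.frame(rbind(u, v)))
round(m2$stdres, 2)                        # cells with |values| > 2 stand out
which(abs(m2$stdres) > 2, arr.ind = TRUE)  # their row/column positions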