R inference from one matrix to a data frame - r

I think this may be a very simple question, but since I'm new to R, I hope someone can outline how to solve it step by step. Thanks!
The question: I have an (n * 2) matrix (say m) whose first column holds the index of a row in another data frame (say d) and whose second column holds a value (a p value).
What I want to do: if the p value in some row r of m is less than 0.05, plot the row of d whose index is given in the first column of row r of m.
..............
The data is somewhat like what I draw below:
m:
ind p_value
2 0.02
23 0.03
56 0.12
64 0.54
105 0.04
d:
gene_id s1 s2 s3 s4 ... sn
IDH1 0.23 3.01 0 0.54 ... 4.02
IDH2 0.67 0 8.02 10.54 ... 0.72
...
So IDH2 corresponds to the first row of m, whose index column is 2.

This works:
toplot <- d[m[m[, 'p_value'] < .05, 'ind'], ]
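A small self-contained sketch of the indexing above, using made-up toy data (the contents of `d` and `m` here are assumptions for illustration, not the asker's data):

```r
# Toy stand-ins for the asker's objects (all values are made up)
d <- data.frame(gene_id = paste0("gene", 1:6),
                s1 = c(0.23, 0.67, 1.10, 0.00, 2.50, 0.40),
                s2 = c(3.01, 0.00, 0.75, 1.20, 0.10, 0.90))
m <- cbind(ind = c(2, 4, 5), p_value = c(0.02, 0.12, 0.04))

# Inner step: row indices of d whose p value is below 0.05
sig <- m[m[, "p_value"] < 0.05, "ind"]

# Outer step: pull those rows out of d
toplot <- d[sig, ]
```

Splitting the one-liner into two steps makes it easier to inspect `sig` before plotting. The resulting `toplot` can then be passed to whatever plotting function is appropriate for the data.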

Related

How to fix the error "Subscript out of bounds"

I have a question about fixing the error:
"subscript out of bounds".
I am analyzing data of an eye-tracking experiment. You may find example data below:
Stimulus Timebin Language Percentage on AOI
1 11 L1 0.80
1 11 L2 0.60
1 12 L1 0.80
1 12 L2 0.50
1 13 L1 0.83
1 13 L2 0.50
...
10 37 L1 0.00
10 37 L2 0.50
10 38 L1 0.70
10 38 L2 0.50
10 39 L1 0.60
10 39 L2 0.70
10 40 L1 0.75
10 40 L2 0.89
...
I would like to do a growth curve analysis with Language and Timebin as independent variables and percentage on the area of interest (AOI) as the dependent variable, with Stimulus as a random factor. I have 40 time bins for each stimulus and condition. To avoid potential collinearity, I want to create orthogonal polynomials. The code below was used to create independent (orthogonal) polynomial time terms (linear, quadratic, and cubic).
Gaze_1_Poly <- poly((unique(Gaze_1$timebin)), 3)
Gaze_1[,paste("ot", 1:3, sep="")] <- Gaze_1_Poly[Gaze_1$timebin, 1:3]
I always get an error telling me that a subscript is out of bounds:
Error in Gaze_1_Poly[Gaze_1$timebin, :
subscript out of bounds
So I checked the classes of the variables, and I don't think they are the problem:
Stimulus Timebin Language percentage on AOI
"character" "integer" "factor" "numeric"
I cannot figure out the reason. Can someone give me a hand?
See comment above. Let me know if this is what you had in mind.
library(dplyr)
Gaze_1 %>%
left_join(data.frame(Timebin = unique(.$Timebin), poly(unique(.$Timebin), degree = 3)),
by = 'Timebin') %>%
setNames(c("Stimulus", "Timebin", "Language", "Percentage on AOI", "ot1", "ot2", "ot3"))
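One possible cause of the error, offered as a guess rather than a diagnosis: if the time bins do not start at 1 (the sample rows start at 11), then indexing the rows of the polynomial matrix by the raw time bin value runs past its last row. A minimal sketch that looks up each row by position with `match()` instead, using made-up toy data and assuming the column is named `Timebin` as in the sample:

```r
# Toy data: time bins 11..40 for two languages (structure is an assumption)
Gaze_1 <- expand.grid(Timebin = 11:40, Language = c("L1", "L2"))

bins <- sort(unique(Gaze_1$Timebin))
Gaze_1_Poly <- poly(bins, 3)   # one row per unique time bin (30 rows here)

# Map each Timebin value to its row position in the polynomial matrix
rows <- match(Gaze_1$Timebin, bins)
Gaze_1[, paste0("ot", 1:3)] <- Gaze_1_Poly[rows, 1:3]
```

With this lookup, time bin 11 maps to row 1 of `Gaze_1_Poly`, so the largest bin no longer indexes past the end of the matrix.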

Comparing changes across two matrices

I'm performing some biogeographic analyses in R and the result is encoded as a pair of matrices. Columns represent geographic regions, rows indicate nodes in a phylogenetic tree and values in the matrix are the probability that the branching event occurred in the geographic region indicated by the column. A very simple example would be:
> One_node<-matrix(c(0,0.8,0.2,0),
+ nrow=1, ncol=4,
+ dimnames = list(c("node_1"),
+ c("A","B","C","D")))
> One_node
A B C D
node_1 0 0.8 0.2 0
In this case, the most probable location for node_1 is region B. In reality, the output of the analysis is encoded as two separate 79x123 matrices. The first is the probabilities of a node occupying a given region before an event and the second is the probabilities of a node occupying a given region after an event (rowSums=1). Some slightly more complicated examples:
before<-matrix(c(0,0,0,0,0.9,
0.8,0.2,0.6,0.4,0.07,
0.2,0.8,0.4,0.6,0.03,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
after<-matrix(c(0,0,0,0,0.9,
0.2,0.8,0.4,0.6,0.03,
0.8,0.2,0.6,0.4,0.07,
0,0,0,0,0),
nrow=5, ncol=4,
dimnames = list(c("node_1","node_2","node_3","node_4","node_5"),
c("A","B","C","D")))
> before
A B C D
node_1 0.0 0.80 0.20 0
node_2 0.0 0.20 0.80 0
node_3 0.0 0.60 0.40 0
node_4 0.0 0.40 0.60 0
node_5 0.9 0.07 0.03 0
> after
A B C D
node_1 0.0 0.20 0.80 0
node_2 0.0 0.80 0.20 0
node_3 0.0 0.40 0.60 0
node_4 0.0 0.60 0.40 0
node_5 0.9 0.03 0.07 0
Specifically, I'm only interested in extracting the row numbers where column B is the highest in before and column C is the highest in after, and vice versa, because I'm trying to extract node numbers in a tree where taxa have moved B->C or C->B.
So the output I'm looking for would be something like:
> BC
[1] 1 3
> CB
[1] 2 4
There will be rows where B>C or C>B but where neither is the highest in the row (node_5) and I need to ignore these. The row numbers are then used to query a separate dataframe that provides the data I want.
I hope this all makes sense. Thanks in advance for any advice!
You could do something like this...
maxBefore <- apply(before, 1, which.max) #find highest columns before (by row)
maxAfter <- apply(after, 1, which.max) #and highest columns after
BC <- which(maxBefore==2 & maxAfter==3) #rows with B highest before, C after
CB <- which(maxBefore==3 & maxAfter==2) #rows with C highest before, B after
BC
node_1 node_3
1 3
CB
node_2 node_4
2 4
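A variant of the same idea that compares column names instead of the hard-coded positions 2 and 3, which may be easier to read with 123 regions. This is only a sketch: it uses the example matrices from the question and `max.col()` with `ties.method = "first"`, which is an assumption about how ties should be handled.

```r
nodes <- paste0("node_", 1:5)
regions <- c("A", "B", "C", "D")
before <- matrix(c(0, 0, 0, 0, 0.9,
                   0.8, 0.2, 0.6, 0.4, 0.07,
                   0.2, 0.8, 0.4, 0.6, 0.03,
                   0, 0, 0, 0, 0),
                 nrow = 5, ncol = 4, dimnames = list(nodes, regions))
after <- matrix(c(0, 0, 0, 0, 0.9,
                  0.2, 0.8, 0.4, 0.6, 0.03,
                  0.8, 0.2, 0.6, 0.4, 0.07,
                  0, 0, 0, 0, 0),
                nrow = 5, ncol = 4, dimnames = list(nodes, regions))

# Name of the most probable region in each row
top_before <- colnames(before)[max.col(before, ties.method = "first")]
top_after  <- colnames(after)[max.col(after, ties.method = "first")]

# Row numbers where the most probable region switched
BC <- which(top_before == "B" & top_after == "C")
CB <- which(top_before == "C" & top_after == "B")
```

Rows such as node_5, where region A is the most probable in both matrices, drop out because neither comparison matches.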

Adding NA's to a vector

Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at the percent change from 1 period before. I'm looking for a way to fill in NAs no matter how many lags I throw in, so that the percent change vector is equal in length to the foo vector.
Any help would be much appreciated!
You could use your own modification of diff:
mydiff <- function(data, diff){
c(diff(data, lag = diff), rep(NA, diff))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
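If the NAs should instead sit at the start, so that each change lines up with the period in which it is realised (an assumption about what is wanted), a sketch along the same lines could pad at the front and work on log prices as in the question. The helper name `lagged_log_return` is hypothetical.

```r
foo <- c(102.25, 102.87, 102.25, 100.87, 103.44, 103.87, 103.00)

# Hypothetical helper: log return over `lag` periods, NA-padded at the front
lagged_log_return <- function(x, lag = 1) {
  c(rep(NA_real_, lag), diff(log(x), lag = lag))
}

# Every column has the same length as foo, whatever the lag
returns <- data.frame(foo  = foo,
                      ret1 = lagged_log_return(foo, 1),
                      ret3 = lagged_log_return(foo, 3))
```

The sketch assumes `lag` is smaller than the length of the vector.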
Let's say you have an array holding the numbers 1 to 10 arranged as a matrix with 5 rows and 2 columns, and you want to set the 2nd column to NA.
# Make a 5 x 2 matrix of the elements 1:10
Array_test <- array(1:10, dim = c(5, 2, 1))
Array_test
Array_test[, 2, ] <- NA  # set the 2nd column to NA
Array_test
# Similarly, to set only one element of the matrix to NA,
# say the 4th row of the 2nd column:
Array_test[4, 2, ] <- NA

Analysing subsets of data from one data frame defined by another data frame

I need to know how to take the mean/median etc. from rows of one data frame selected according to whether they meet a condition that refers to another. Difficult to explain, so I'll just give an example.
> d
Position Value
1 0 0.20
2 5 0.30
3 10 0.45
4 15 0.23
5 20 0.71
6 25 0.10
7 30 0.20
8 35 0.22
9 40 0.80
10 45 0.50
11 50 0.31
12 55 0.40
And also:
> d2
Segment Start End
1 1 0 15
2 2 20 40
3 3 45 55
Basically, "d" gives a variable's value at a certain 'position.' "d2" gives start and end points (or positions) of several 'segments' of the data from "d". Now, what I want is the mean and median of the "Value" entries from "d" in each "segment." So for segment 1, because it has start and end positions 0 and 15, respectively, it would return the mean of the entries for 0, 5, 10, and 15 from "d". Note that the segments are not necessarily of equal length, so it would not work to just take the mean of the first n entries, second n entries, third n entries, and so on.
One could think of the segments as segments on a chromosome; and each point on the chromosome has a "value" that describes some characteristic of that point on the chromosome, and I have data on what this value equals at each point, and also data on where each segment begins and ends (segments are all contiguous, just not equal length), and now want to compute, say, the mean value for all the points within each segment. Suffice it to say, unlike with my example, in the actual data set there are far too many segments to compute these manually, hence the question. Thanks.
You could try
mapply(function(s,e) {
mean(d$Value[d$Position>=s & d$Position<=e])}
, d2$Start, d2$End)
That should give you a vector the same length as the number of rows of d2, so you know where all the values belong.
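Since the question asks for both the mean and the median, the same `mapply()` call can return both per segment. A sketch using the example data from the question (the column names `mean` and `median` in the result are my own choice):

```r
d <- data.frame(Position = seq(0, 55, by = 5),
                Value = c(0.20, 0.30, 0.45, 0.23, 0.71, 0.10,
                          0.20, 0.22, 0.80, 0.50, 0.31, 0.40))
d2 <- data.frame(Segment = 1:3, Start = c(0, 20, 45), End = c(15, 40, 55))

# One row per segment: mean and median of the values inside [Start, End]
stats <- t(mapply(function(s, e) {
  v <- d$Value[d$Position >= s & d$Position <= e]
  c(mean = mean(v), median = median(v))
}, d2$Start, d2$End))

# Attach the statistics to the segment table
result <- cbind(d2, stats)
```

Binding the result back onto `d2` keeps each statistic next to the segment it belongs to.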

Sampling and Calculation in R

I have a file that contains two columns (Time, VA). The file is large, and I managed to read it into R (using read and subset, which is not practical for a large file). Now I want to do sampling based on the time, where each sample has a sample size and a sample shift. The sample size is a fixed value for the whole sampling process, e.g. sampleSize = 10 seconds. The sample shift is the start point for each new sample (after the first sample). For example, if sampleShift = 4 sec and sampleSize = 10 sec, the second sample will start from 5 sec and span 10 sec, since the sample size is 10 sec. For each sample I want to feed the VA values to a function that does some calculation.
Sampling <- function(values){
# Perform the sampling
lastRowNumber<- #specify the last row manually
sampleSize<-10
lastValueInFile<-lastRowNumber-sampleSize
for (i in 1: (lastValueInFile ) ){
EndOfShift<-9+i
sample<-c(1:sampleSize)
h<-1
for(j in i:EndOfShift){
sample[h] <- values[j,1]
h<-h+1
}
print(sample)
#Perform the Calculation on the extracted sample
#--Samp_Calculation<-SomFunctionDoCalculation(sample)
}
}
The problems with my attempt are:
1) I have to specify the last row number manually for each file I read.
2) I was sampling based on row numbers, not the Time value. Also, the shift was one row for each sample.
file sample:
Time VA
0.00000 1.000
0.12026 2.000
0.13026 2.000
0.14026 2.000
0.14371 3.000
0.14538 4.000
..........
..........
15.51805 79.002
15.51971 79.015
15.52138 79.028
15.52304 79.040
15.52470 79.053
.............
Any suggestions for a more professional way?
I've generated some test data as follows:
val <- data.frame (time=seq(from=0,to=15,by=0.01),VA=c(0:1500))
... then the function:
sampTime <- function (values,sampTimeLen)
{
# return a data frame for a random sample of the data frame -values-
# of length -sampTimeLen-
minTime <- values$time[1]
maxTime <- values$time[length(values$time)] - sampTimeLen
startTime <- runif(1,minTime,maxTime)
values[(values$time >= startTime) & (values$time <= (startTime+sampTimeLen)),]
}
... can be used as follows:
> sampTime(val,0.05)
time VA
857 8.56 856
858 8.57 857
859 8.58 858
860 8.59 859
861 8.60 860
... which I think is what you were looking for.
(EDIT)
Following the clarification that you want a sample from a specific time rather than a random time, this function should give you that:
sampTimeFrom <- function (values,sampTimeLen,startTime)
{
# return a data frame for sample of the data frame -values-
# of length -sampTimeLen- from a specific -startTime-
values[(values$time >= startTime) & (values$time <= (startTime+sampTimeLen)),]
}
... which gives:
> sampTimeFrom(val,0.05,0)
time VA
1 0.00 0
2 0.01 1
3 0.02 2
4 0.03 3
5 0.04 4
6 0.05 5
> sampTimeFrom(val,0.05,0.05)
time VA
6 0.05 5
7 0.06 6
8 0.07 7
9 0.08 8
10 0.09 9
11 0.10 10
If you want multiple samples, they can be delivered with sapply() like this:
> samples <- sapply(seq(from=0,to=0.15,by=0.05),function (x) sampTimeFrom(val,0.05,x))
> samples[,1]
$time
[1] 0.00 0.01 0.02 0.03 0.04 0.05
$VA
[1] 0 1 2 3 4 5
In this case the output will overlap but making the sampTimeLen very slightly smaller than the shift value (which is shown in the by= parameter of the seq) will give you non-overlapping samples. Alternatively, one or both of the criteria in the function could be changed from >= or <= to > or <.
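To tie this back to the sample size and sample shift in the question, here is a sketch that builds every window start from the data itself and applies a calculation to each window. It assumes windows start at 0, 4, 8, ... seconds, uses a half-open interval so a boundary point belongs to one window end only, and uses `mean()` as a placeholder for the asker's own calculation.

```r
val <- data.frame(time = seq(from = 0, to = 15, by = 0.01), VA = 0:1500)

sampleSize  <- 10   # window length in seconds
sampleShift <- 4    # distance between consecutive window starts in seconds

# Start times of every window that fits completely inside the data,
# so the last row never has to be specified by hand
starts <- seq(from = min(val$time),
              to   = max(val$time) - sampleSize,
              by   = sampleShift)

# Apply a placeholder calculation (mean of VA) to each time window
results <- sapply(starts, function(s) {
  in_window <- val$time >= s & val$time < s + sampleSize
  mean(val$VA[in_window])
})
```

Because the windows are selected by time rather than by row number, the approach does not depend on the rows being evenly spaced.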
