EDITED: I have made progress, but I didn't think my original question was as well constructed as it could be.
I am new to R and computer programming in general and I am attempting to write my first for loop.
I want to be able to do some tidal analysis using harmonic constituents from NOAA.
My initial data frame, data, looks like this:
Constituent # Name Amplitude Phase Speed
1 M2 3.264 29.0 28.98
2 S2 0.781 51.9 30.0
3 N2 0.63 12.3 28.43
4 K1 1.263 136.8 15.04
5 M4 0.043 286.0 57.96
The equation for wave height is h(t)= Amplitude*cos(Speed*t-Phase) where t is time.
Therefore I need to perform this calculation for each constituent (row) and then sum the results across constituents at each time.
So my intermediate result will be a table with ncol = the number of time stamps and nrow = the number of constituents:
T1 T2 T3...
data[1,2]*cos(data[1,4]*T1-data[1,3]) data[1,2]*cos(data[1,4]*T2-data[1,3])
data[2,2]*cos(data[2,4]*T1-data[2,3]) data[2,2]*cos(data[2,4]*T2-data[2,3])
...
data[n,2]*cos(data[n,4]*T1-data[n,3]) data[n,2]*cos(data[n,4]*T2-data[n,3])
With this table I can sum the columns to get my final answer of what the tide height is at each time stamp.
To do this I have attempted to create a for loop.
DF <- NULL
for (i in 1:nrow(data)) {
  DF <- matrix(c(DF, data[i,2]*cos(pi/180*(data[i,4]*Time[,] - data[i,3]))))
}
This returns all the results as a single vector; I can't figure out how to separate it into columns by time stamp. It just runs through all the time stamps for the first constituent, then the second, and so on. For my current station I have 37 constituents and 100 time stamps, so my matrix DF is 1 column with 3700 rows.
I have tried pre-setting the matrix DF with the appropriate number of columns and rows, but this returns a single result for all rows and columns. I have also tried a nested if statement with time, and many other things that I can't remember.
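For reference, the whole constituent-by-time table can also be built in one vectorized step with outer() and no loop at all. This is only a sketch: it assumes the data frame has numeric columns named Amplitude, Phase and Speed (as printed above) and that the angles are in degrees, hence the same pi/180 conversion as in my loop.
t <- 1:100  # time stamps
# rows = constituents, columns = time stamps
M <- data$Amplitude * cos(pi/180 * (outer(data$Speed, t) - data$Phase))
height <- colSums(M)  # tide height at each time stamp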
EDIT 2: I used Rusan's approach and finished what I was doing with the script below. Any other approaches are appreciated.
library(reshape2)  # provides dcast(); data.table's dcast would also work
Time <- matrix(seq(1, 100, 1))  # my time series
n <- hh3(Time)  # function outlined by Rusan below
b <- matrix(c(rep(Time[1,1]:Time[nrow(Time),1], nrow(wave_table))))  # a repeating time index to bind with n
height <- matrix(colSums(dcast(data.frame(cbind(b, n)), Constituent ~ V1, value.var = "V1.1")[,-1]))  # the sum of all constituents at each time stamp, i.e. the final height of the wave at each time
This allows me to sum all the constituents at each time stamp: height = the sum of all constituents at time t. So for my example above, height(t1) = M2(t1) + S2(t1) + N2(t1) + K1(t1) + M4(t1).
My final output is a matrix holding the single vector height, which I want to use to create an inundation duration curve.
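One possible sketch of that curve from the height vector, assuming an inundation duration curve here means plotting each height against the fraction of time it is equaled or exceeded:
h <- sort(height, decreasing = TRUE)
plot(seq_along(h)/length(h), h, type = "l",
     xlab = "Fraction of time exceeded", ylab = "Tide height")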
Perhaps this is not an answer, but I would suggest a different approach using the data.table package in R.
library(data.table)
# use your own data location
wave_table <- fread(input = "F:\\wave.csv")
wave_table
# Constituent Name Amplitude Phase Speed
# 1: 1 M2 3.264 29.0 28.98
# 2: 2 S2 0.781 51.9 30.00
# 3: 3 N2 0.630 12.3 28.43
# 4: 4 K1 1.263 136.8 15.04
# 5: 5 M4 0.043 286.0 57.96
#create a function which does your calculation on the named columns of your data,
#taking time 't' as a parameter
hh<-function(t){ wave_table[,{Amplitude*cos(Speed*t-Phase)}] }
hh2<-function(t) wave_table[,{Amplitude*cos(Speed*t-Phase)}, by=Name]
hh3<-function(t) wave_table[,{Amplitude*cos(Speed*t-Phase)}, by=Constituent]
hh4<-function(t) wave_table[,{sum(Amplitude*cos(Speed*t-Phase))}, by=Constituent]
#Now the function `hh` can be used like this, giving you a bit
#more flexibility with what you want to do, perhaps?
hh(1)
#3.26334722 -0.77775795 -0.57472163 -0.91362687 -0.01165717
or
hh2(1)
# Name V1
#1: M2 3.26334722
#2: S2 -0.77775795
#3: N2 -0.57472163
#4: K1 -0.91362687
#5: M4 -0.01165717
or
hh4(1) # after adding an extra row to your data: Constituent=1, Name=M3,
# Amp=1.263, Phase=51.9, Speed=15.04
# Constituent V1
#1: 1 4.10718774
#2: 2 -0.77775795
#3: 3 -0.57472163
#4: 4 -0.91362687
#5: 5 -0.01165717
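To extend this to a whole time series, one option (a sketch, assuming integer time stamps 1 to 100) is to evaluate hh once per time stamp with sapply and then sum the constituents in each column:
times <- 1:100
M <- sapply(times, hh)  # rows = constituents, columns = time stamps
height <- colSums(M)    # total tide height at each time stamp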
In general, loops in R should be avoided for this type of problem; they are slow, and much better vectorized tools are available. Loops are typically a last resort.
If the functions hh to hh4 do not do exactly what you want, there are other variations that could be used. Check out http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
I have the following data with two observations per subject:
SUBJECT <- c(8,8,10,10,11,11,15,15)
POSITION <- c("H","L","H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30,90,90)
RESPONSE <- c(5.6,5.2,0,0,4.8,4.9,1.2,.9)
DATA <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I want the rows of DATA which have SUBJECT numbers that are in a vector, V:
V <- c(8,10,10)
How can I obtain both observations from DATA whose SUBJECT number is in V and have those observations repeated the same number of times as the corresponding SUBJECT number appears in V?
Desired result:
SUBJECT <- c(8,8,10,10,10,10)
POSITION <- c("H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30)
RESPONSE <- c(5.6,5.2,0,0,0,0)
OUT <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I thought some variation of the %in% operator would do the trick, but it does not account for repeated subject numbers in V. Even though a subject number is listed twice in V, I only get one copy of the corresponding rows from DATA.
I could also create a loop and append matching observations, but this piece is inside a bootstrap sampler, and that option would dramatically increase computation time.
merge is your friend:
merge(list(SUBJECT=V), DATA)
# SUBJECT POSITION TIME RESPONSE
#1 8 H 90 5.6
#2 8 L 90 5.2
#3 10 H 30 0.0
#4 10 L 30 0.0
#5 10 H 30 0.0
#6 10 L 30 0.0
As @Frank implies, this logic can be translated to data.table or dplyr or SQL or anything else that will handle a left join.
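For example, hedged sketches of the same join in data.table and dplyr (both keep the duplicate matches for the repeated 10 in V):
library(data.table)
as.data.table(DATA)[data.table(SUBJECT = V), on = "SUBJECT"]
library(dplyr)
inner_join(data.frame(SUBJECT = V), DATA, by = "SUBJECT")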
I'm new to R and having a difficult time thinking about the right way to approach a problem. I'm used to doing most of my data analysis in excel, so I think I'm stuck in spreadsheetland. Now I'm getting into data that's too large to do comfortably in excel, so I wanted to step into the light and use R. Thanks in advance for any help you have.
So let's use ChickWeight as an example:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Say I want to be able to split the data frame by both diet and time point such that it would be easy to generate a table of average weights with Time for columns and Diet for rows. Something like:
0 2 4 6 (time)
1
2 <average weights
3 go in here>
4
(diet)
In my head, the easiest way to do this would be to generate a 2d array containing these values so that I can access them like average_weight[<Time>][<Diet>].
I would also like it to be easy to access all of the average weights for a given time or a given diet, using something like average_weight[<Time>][].
I've gotten the sense that I'm not thinking about this problem right, because none of the tools I've found seem to point me in the right direction. The closest I've gotten is using split()
chicks_by_time_and_diet <- split(ChickWeight, list(ChickWeight$Time, ChickWeight$Diet))
But this returns a list of length 55, not a two-dimensional array. I've also tried looking into plyr. This sounded like it was exactly what I wanted, but it's unclear to me exactly how to use it towards this end.
Any help is appreciated, thank you!
Bonus:
In reality my data frame has many more factors than ChickWeight, and if it were possible to access all of the factors for a given 'Time' and 'Diet', that would be ideal.
E.g. pretend that ChickWeight has another factor, height. Would it be possible to store both the average height and weight for a given diet at a particular location in the array such that average_weight_and_height[<Time>][<Diet>] returns a list of (weight, height)?
Using dplyr/tidyr
library(dplyr)
library(tidyr)
ChickWeight %>%
group_by(Time, Diet) %>%
summarise(weight=mean(weight)) %>%
spread(Time, weight)
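For the bonus question, summarise can return several columns at once before you reshape. ChickWeight has no height column, so this sketch uses the real weight column plus a per-cell count as a stand-in:
ChickWeight %>%
  group_by(Time, Diet) %>%
  summarise(mean_weight = mean(weight), n = n())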
tapply is made just for this:
> with(ChickWeight, tapply(weight, list(Time, Diet), mean))
1 2 3 4
0 41.40000 40.7 40.8 41.0000
2 47.25000 49.4 50.4 51.8000
4 56.47368 59.8 62.2 64.5000
6 66.78947 75.4 77.9 83.9000
8 79.68421 91.7 98.4 105.6000
10 93.05263 108.5 117.1 126.0000
12 108.52632 131.3 144.4 151.4000
14 123.38889 141.9 164.5 161.8000
16 144.64706 164.7 197.4 182.0000
18 158.94118 187.7 233.1 202.9000
20 170.41176 205.6 258.9 233.8889
21 177.75000 214.7 270.3 238.5556
You can also use data.table or dplyr, though you will need to reshape the results of those to get to the 2D (or 3D) formats:
library(data.table)
DT <- data.table(ChickWeight)[, mean(weight), by=.(Time, Diet)]
dcast.data.table(DT, Time ~ Diet)
Or, as Arun points out (here we just use a normal data frame):
reshape2::dcast(ChickWeight, Time ~ Diet, value.var="weight", fun.aggregate=mean)
A lot of R analysis involves getting comfortable with data in "long format" (see DT before we dcast it), where dimensions are represented by columns.
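Note that the tapply result is an ordinary matrix with character dimnames, so it can be indexed exactly the way the question asks:
avg <- with(ChickWeight, tapply(weight, list(Time, Diet), mean))
avg["10", "3"]  # average weight at Time 10 on Diet 3
avg[, "2"]      # average weights at every time point for Diet 2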
I have a .csv file with several columns, but I am only interested in two of them (TIME and USER). The USER column consists of the value markers 1 or 2 in chunks, and the TIME column contains a value in seconds. I want to calculate the difference between the TIME value of the first 2 in a chunk of the USER column and the first 1 in a chunk of the USER column. I want to accomplish this in R. It would be ideal for there to be another column added to my data file containing these differences.
So far I have only imported the .csv into R.
Latency <- read.csv("/Users/alinazjoo/Documents/Latency_allgaze.csv")
I'm going to guess your data looks like this:
# sample data
set.seed(15)
rr<-sample(1:4, 10, replace=T)
dd<-data.frame(
user=rep(1:5, each=10),
marker=rep(rep(1:2,10), c(rbind(rr, 5-rr))),
time=1:50
)
Then you can calculate the difference using the base functions aggregate and transform. Observe:
namin<-function(...) min(..., na.rm=T)
dx<-transform(aggregate(
cbind(m2=ifelse(marker==2,time,NA), m1=ifelse(marker==1, time,NA)) ~ user,
dd, namin, na.action=na.pass),
diff = m2-m1)
dx
# user m2 m1 diff
# 1 1 4 1 3
# 2 2 15 11 4
# 3 3 23 21 2
# 4 4 35 31 4
# 5 5 44 41 3
We use aggregate to find the minimal time for each of the two kinds of markers, then we use transform to calculate the difference between them.
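If you also want the difference attached to every original row (the extra column the question asks for), a simple sketch is to merge dx back onto dd:
dd2 <- merge(dd, dx[, c("user", "diff")], by = "user")
head(dd2)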
I tried to run a stochastic simulation of an epidemiological SEIR model using the code below.
library(GillespieSSA)
parms <- c(beta=0.591,sigma=1/8,gamma=1/7)
x0 <- c(S=50,E=0,I=1,R=0)
a <- c("beta*S*I","sigma*E","gamma*I")
nu <- matrix(c(-1,0,0,
1,-1,0,
0,1,-1,
0,0,1),nrow=4,byrow=TRUE)
set.seed(12345)
out <- lapply(X=1:10,FUN=function(x) ssa(x0,a,nu,parms,tf=50)$data)
out
I managed to obtain the 10 simulation runs that I wanted. The time is in continuous form. Now I have to extract time in discrete form, such as 1, 2, 3, ..., 50, from each simulation. What kind of code should I use?
I tried using data.frame and extracting, but I am still not able to do it.
Thanks in advance for any help.
Let's say the data looks like this:
df <- data.frame(t=seq(0.4,4.5,0.03), x=1:137)
## t x
## 1 0.40 1
## 2 0.43 2
## 3 0.46 3
## 4 0.49 4
## 5 0.52 5
To get the discrete time index values:
idx <- diff(ceiling(df$t)) == 1
The discrete time series will then be:
df[idx,]
## t x
## 21 1.00 21
## 54 1.99 54
## 87 2.98 87
## 121 4.00 121
Having run the simulation myself, one problem seems to be that many of the time stamps land quite far from an integer.
To see these remainders, check: out[[1]][,1] %% 1
The good news is that you can use the output from this, with a tuning parameter, to select what you want. For this purpose, you'll want to find the distance from one and then control what counts as an acceptable gap.
Do this as follows and save the result (a vector of TRUE and FALSE values):
selection <- abs((out[[1]][,1] %% 1) - 1) < 0.1
You can then subset the matrix out using the selection index we just saved:
out[[1]][selection,]
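Since out holds 10 simulations, the same selection can be applied to each one, for example with lapply (a sketch reusing the 0.1 tolerance from above):
out_sel <- lapply(out, function(m) m[abs((m[, 1] %% 1) - 1) < 0.1, ])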
It is easy to do an exact binomial test on two values, but what happens if one wants to do the test on a whole bunch of numbers of successes and numbers of trials? I created a data frame of test sensitivities and potential numbers of enrollees in a study, and then for each row I calculate how many successes that would be. Here is the code.
sens <-seq(from=.1, to=.5, by=0.05)
enroll <-seq(from=20, to=200, by=20)
df <-expand.grid(sens=sens,enroll=enroll)
df <-transform(df,succes=sens*enroll)
But now, how do I use each row's combination of successes and number of trials to do the binomial test?
I am only interested in the upper limit of the 95% confidence interval of the binomial test. I want that single number to be added to the data frame as a column called "upper.limit"
I thought of something along the lines of
binom.test(succes,enroll)$conf.int
alas, conf.int gives something such as
[1] 0.1266556 0.2918427
attr(,"conf.level")
[1] 0.95
All I want is just 0.2918427
Furthermore, I have a feeling that there has to be a do.call in there somewhere, and maybe even an lapply, but I do not know how that would go through the whole data frame. Or should I perhaps be using plyr?
Clearly my head is spinning. Please make it stop.
If this gives you (almost) what you want, then try this:
binom.test(succes,enroll)$conf.int[2]
And apply across the board or across the rows as it were:
> df$UCL <- apply(df, 1, function(x) binom.test(x[3],x[2])$conf.int[2] )
> head(df)
sens enroll succes UCL
1 0.10 20 2 0.3169827
2 0.15 20 3 0.3789268
3 0.20 20 4 0.4366140
4 0.25 20 5 0.4910459
5 0.30 20 6 0.5427892
6 0.35 20 7 0.5921885
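An equivalent sketch with mapply, which passes the two columns directly instead of converting each row to a vector the way apply does:
df$UCL <- mapply(function(s, n) binom.test(s, n)$conf.int[2],
                 df$succes, df$enroll)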
Here you go:
R> newres <- do.call(rbind, apply(df, 1, function(x) {
+ bt <- binom.test(x[3], x[2])$conf.int;
+ newdf <- data.frame(t(x), UCL=bt[2]) }))
R>
R> head(newres)
sens enroll succes UCL
1 0.10 20 2 0.31698
2 0.15 20 3 0.37893
3 0.20 20 4 0.43661
4 0.25 20 5 0.49105
5 0.30 20 6 0.54279
6 0.35 20 7 0.59219
R>
This uses apply to loop over your existing data, computes the test, and returns the value you want by sticking it into a new (one-row) data.frame. We then glue all those 90 data.frame objects into a single new one with do.call(rbind, ...) over the list we got from apply.
Ah yes, if you just want to directly insert a single column, the other answer rocks, as it is simple. My longer answer shows how to grow or construct a data.frame during the sweep of apply.