Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14, the second window would include rows 8 to 21, and so forth. After subsetting, the data needs to be reshaped from wide to long (currently using the melt function in the reshape2 package) so that the PAR values are in one column and the variable (par14 or par15) is in the other column. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank sum test on each window, comparing PAR by year (par14 or par15). Below is the code I wrote to prove the concept; for the first window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply but ran into problems finding a way to apply it to an entire data frame and not just one column. I have searched the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.

Consider an lapply over a sequence of window start rows, one every 7 rows, stopping early enough that every window still has its full 14 rows (a start row past the end of the data leaves nothing to compare, and wilcox.test then raises an error). Each call returns a one-row data frame holding the Wilcoxon test p-value and a week indicator; the list items are then row-bound into a single, final data frame:
library(reshape2)
# window start rows: 1, 8, 15, ... up to the last start that leaves 14 rows
# (trailing rows that do not fill a complete window are left out)
slidingWindow <- seq(1, nrow(rollingpar)-13, by=7)
slidingWindow
# with the 139 rows described in the question:
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
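For anyone who wants to try the idea without reshape2, here is a minimal base R sketch of the same sliding window. The data frame below is a made-up stand-in for rollingpar (random values, 139 rows as in the question), and the names window_size, step and starts are only illustrative. It uses the two-sample form wilcox.test(x, y) rather than the formula form, after dropping missing values by hand.

```r
# Synthetic stand-in for the question's data frame (values are invented)
set.seed(1)
rollingpar <- data.frame(par14 = rnorm(139, mean = 1300, sd = 100),
                         par15 = rnorm(139, mean = 1250, sd = 100))
rollingpar$par14[c(2, 4, 6)] <- NaN   # mimic a few missing days

window_size <- 14  # rows per window (two weeks)
step        <- 7   # rows to slide between windows (one week)

# Start row of every complete window: 1, 8, 15, ...
starts <- seq(1, nrow(rollingpar) - window_size + 1, by = step)

wilcoxdf <- do.call(rbind, lapply(starts, function(i) {
  rows <- i:(i + window_size - 1)
  x <- rollingpar$par14[rows]
  y <- rollingpar$par15[rows]
  # drop NA/NaN explicitly before testing
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  data.frame(start_row = i,
             end_row   = i + window_size - 1,
             p.value   = wilcox.test(x, y)$p.value)
}))
head(wilcoxdf)
```

Changing step to 1 gives the day-by-day sliding that the loop in the question was attempting.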

Related

When plotting values over a large period of time, the dates populate out of order

I am trying to plot many years of data to view the trends that my data takes. When utilizing plot(dx) I get a date vs. value plot, but the date jumps around and doesn't seem to be in order. Below is the code I am utilizing, and a copy of the plot that is produced. Any help would be appreciated.
oz8<-(SRDAILYAVG_1004$Ozone.8.hr.Max.Value)
dd<-(SRDAILYAVG_1004$Date)
MDA8<-exp(oz8)
dx<-data.frame(date=dd,value=MDA8)
plot(dx)
A sample of the data stored in dx is provided as well.
241 8/29/2001 NaN
242 8/30/2001 NaN
243 8/31/2001 1.019182
244 9/1/2001 1.031486
245 9/2/2001 1.030455
246 9/3/2001 1.025315
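The question does not show how the Date column was read in. One common cause of this symptom, assumed here, is that the dates are stored as text (or a factor) in month/day/year form, so they are not treated as real dates when plotted. A minimal sketch with a small made-up sample in the same format, converting with as.Date and ordering before plotting:

```r
# Small invented sample in the same m/d/Y text format as the question
dx <- data.frame(date  = c("9/1/2001", "8/30/2001", "8/31/2001", "9/2/2001"),
                 value = c(1.031486, NaN, 1.019182, 1.030455),
                 stringsAsFactors = FALSE)

# Convert the text to Date objects so R knows the chronological order
dx$date <- as.Date(dx$date, format = "%m/%d/%Y")

# Sort by date, then plot date against value
dx <- dx[order(dx$date), ]
plot(dx$date, dx$value, type = "l", xlab = "Date", ylab = "value")
```

If the column is a factor rather than character, wrapping it in as.character() before as.Date() avoids converting the internal integer codes.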

Mean Y for individual X values

I have a data set in .dta format with height and weight of baseball players. I want to calculate the mean height for each individual weight value.
From what I've been able to find, I could use dplyr and "group_by", but my R script does not recognize the command, despite having installed and called the package.
Thanks!
Here is an example coded in base R using baseball player height and weight data obtained from the UCLA SOCR MLB HeightsWeights data set.
After cleaning the data (weight is missing for one player), I posted it to GitHub to make it accessible without having to clean it again.
theCSVFile <- "https://raw.githubusercontent.com/lgreski/datasciencedepot/gh-pages/data/baseballPlayers.csv"
# create the target directory first; download.file() fails if it does not exist
dir.create("./data",showWarnings=FALSE)
download.file(theCSVFile,"./data/baseballPlayers.csv",method="curl")
theData <- read.csv("./data/baseballPlayers.csv",header=TRUE,stringsAsFactors=FALSE)
# mean height for each distinct weight value
aggData <- aggregate(HeightInInches ~ WeightInPounds,data=theData,FUN=mean)
head(aggData)
...and the output is:
> head(aggData)
WeightInPounds HeightInInches
1 150 70.75000
2 155 69.33333
3 156 75.00000
4 160 71.46667
5 163 70.00000
6 164 73.00000
>
regards,
Len

How to Interpret "Levels" in Random Forest using R/Rattle

I am brand new at using R/Rattle and am having difficulty understanding how to interpret the last line of this code output. Here is the function call along with its output:
> head(weatherRF$model$predicted, 10)
336 342 94 304 227 173 265 44 230 245
No No No No No No No No No No
Levels: No Yes
This code is implementing a weather data set in which we are trying to get predictions for "RainTomorrow". I understand that this function calls for the predictions for the first 10 observations of the data set. What I do NOT understand is what the last line ("Levels: No Yes") means in the output.
The output is a factor, which is R's type for categorical variables.
The "Levels:" line lists every value the factor is permitted to take, here No and Yes, whether or not each one appears among the ten predictions shown.
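A tiny illustration, using a hand-made factor rather than the actual weatherRF model (the name pred is just for this example): the levels are stored with the factor even when one of them never occurs in the values.

```r
# A factor whose values are all "No" but whose permitted levels are No and Yes
pred <- factor(c("No", "No", "No"), levels = c("No", "Yes"))

print(pred)     # shows the values, then a final "Levels:" line
levels(pred)    # the permitted values
table(pred)     # counts per level, including levels with a count of zero
```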

Sum row values based on previous ones

I'll try to be specific: I want to create a new column on a data frame in which the values are the sum of the previous values in another column.
So I already have the first two columns (ID and Value) below and want to create the third one (Sum), but I don't know how to do this.
In the column "Sum", each value is the running total of the values in "Value" up to that row, so for example, 31.098 (Sum) is the sum of 16.91 and 14.18 (Value):
ID Value Sum
157 16.91531834 16.91531834
142 14.18365203 31.09897037
205 11.93528052 43.03425089
89 11.83021643 54.86446732
53 6.3668838 61.23135112
204 3.99243539 65.22378651
202 3.21496113 68.43874764
17 1.93317924 70.37192688
220 1.74406388 72.11599076
147 1.59697415 73.71296491
33 1.42887161 75.14183652
138 1.28178189 76.42361841
154 1.19773062 77.62134903
It is the first time I'm posting here. Until now I found everything I was searching for already answered... so, sorry if this kind of question is already answered too (it must have been!), but I wasn't able to find it. I'm not a native speaker (as you probably guessed already), so maybe I didn't use the proper keywords...
Thanks!!
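The "Sum" column in the table is a running (cumulative) total, which base R computes with cumsum. A minimal sketch using the first four rows from the table above (the data frame name df is only for illustration):

```r
# First four rows of the example table
df <- data.frame(ID    = c(157, 142, 205, 89),
                 Value = c(16.91531834, 14.18365203, 11.93528052, 11.83021643))

# Running total of Value, row by row
df$Sum <- cumsum(df$Value)
df
```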

Looping within a loop in R

I'm trying to build quite a complex loop in R.
I have a set of data set as an object called p_int (p_int is peak intensity).
For this example the structure of p_int i.e. str(p_int) is:
num [1:1599]
The size of p_int can vary i.e. [1:688], [1:1200] etc.
What I'm trying to do with p_int is to construct a complex loop to extract the monoisotopic peaks, these are peaks with certain characteristics which will be extracted into a second object: mono_iso:
Search the first eight results in p_int. Of these eight, find the result with the greatest score (this score also needs to be above 50).
Once this result has been found, record it into mono_iso.
The loop will then fix on to this position of where this result is located within the large dataset. From this position it will then skip the next result along the dataset before doing the same for the next set of 8 results.
So something similar to this:
16 Results: 100 120 90 66 220 90 70 30 70 100 54 85 310 200 33 41
** So, to begin with, the loop would take the first 8 results:
100 120 90 66 220 90 70 30
**It would then decide which peak is the greatest:
220
**It would determine whether 220 was greater than 50
IF YES: It would record 220 into "mono_iso"
IF NO: It would move on to the next set of 8 results
**220 is greater than 50... so records into mono_iso
The loop would then place its position at 220, skip the "90", and begin the same thing again for the next set of 8 results, starting at the next result in line, in this case the 70:
70 30 70 100 54 85 310 200
It would then record the "310" value (highest value) and do the same thing again etc etc until the end of the set of data.
Hope this makes perfect sense. If anyone could possibly help me out into making such a loop work with R-script, I'd very much appreciate it.
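Use this, which splits p_int into consecutive, non-overlapping groups of 8 and keeps each group's maximum when it exceeds 50 (it does not re-anchor the next group on the position of the previous peak):
mono_iso <- aggregate(p_int, by=list(group=((seq_along(p_int)-1)%/%8)+1), function(x)ifelse(max(x)>50,max(x),NA))$x
This puts NA for every group whose maximum is 50 or less. If you want to filter those out, use this: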
mono_iso <- mono_iso[!is.na(mono_iso)]
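The re-anchoring described in the question (jump to the peak, skip the result right after it, then take the next 8) could be sketched with a while loop. This is only one reading of the description: the names window, threshold and start are illustrative, and moving on to the block after the current one when no peak exceeds 50 is an assumption, since the question leaves that case open.

```r
# The 16 example results from the question
p_int <- c(100, 120, 90, 66, 220, 90, 70, 30, 70, 100, 54, 85, 310, 200, 33, 41)

window    <- 8   # results examined per step
threshold <- 50  # a peak must exceed this to be recorded

mono_iso <- numeric(0)
start <- 1
while (start <= length(p_int)) {
  end   <- min(start + window - 1, length(p_int))
  block <- p_int[start:end]
  best  <- which.max(block)   # position of the first maximum within the block
  if (block[best] > threshold) {
    mono_iso <- c(mono_iso, block[best])
    # move to the peak, skip the result right after it, restart at the next one
    start <- start + best - 1 + 2
  } else {
    # assumption: no qualifying peak, so continue with the following block
    start <- end + 1
  }
}
mono_iso
```

On the example data the windows are results 1 to 8, then 7 to 14, then the two leftover results, which matches the walk-through in the question.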
