I wanted to use the nested loop below to work out a variable 'data' for every day within a number of years.
x is a vector of length 20 (number of years) and each of the 20 entries is the number of days the inner loop is to run for.
I also have a vector 'start' that has 20 dates in the format "1981-02-01".
I wanted to create a matrix of the output (data) that would have the data for each day in rows and then one column per year.
However, the code I am using below does not seem to be updating the counters (yrcntr and daycntr), which is causing the whole thing to not work.
Also, when I try to assign values to 'data' within the loop using the counters as indices (data[daycntr yrcntr]), it's not working.
I'm not even getting an error.
I'm not sure how to write out the format of 'data' used below here, but I'll give it a go:
datamat=
tmax tmin date
  11    4 "1981-03-31"
  13    6 "1981-04-01"
  12    7 "1981-04-02"
and 'start' is a vector of dates in the format: "1981-04-02" "1981-04-03"
tmax<-datamat[,1]
tmin<-datamat[,2]
tdates<-datamat[,3]
yrcntr<-0;
daycntr<-0;
for (yr in 1:length(x)){
  yrcntr<-yrcntr+1
  #find the row in the temp data that matches the startdate each year
  tempidx<- (which(tdates==start[yrcntr]))-1
  for (days in 1:numdays[yr]){
    daycntr<-daycntr+1
    dlytempidx=tempidx+1
    data[daycntr yrcntr]<- (tmax[dlytempidx]+tmin[dlytempidx])
  }
  rm(tempidx)
}
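For reference, here is a minimal sketch of how the loop could be made to work. The toy temperatures, the two start dates, and the names first_row and row are made up for illustration, and x is used for the number of days per year as described in the text. Compared with the code above, it preallocates 'data' as a matrix, separates the two indices with a comma, lets the for loops supply the counters (so the day index restarts at 1 for every year), and advances the temperature row by one for each day instead of always reading the same row.

```r
# Toy temperature data (assumed for illustration): 10 consecutive days
tdates <- as.character(seq(as.Date("1981-03-31"), by = "day", length.out = 10))
tmax   <- c(11, 13, 12, 14, 15, 13, 12, 16, 17, 15)
tmin   <- c(4, 6, 7, 6, 6, 4, 3, 8, 9, 7)

start <- c("1981-04-01", "1981-04-05")  # one start date per "year"
x     <- c(3, 2)                        # number of days to process per "year"

# Preallocate: one row per day, one column per year (unused cells stay NA)
data <- matrix(NA_real_, nrow = max(x), ncol = length(x))

for (yr in seq_along(x)) {
  # row of the temperature data that matches this year's start date
  first_row <- which(tdates == start[yr])
  for (day in seq_len(x[yr])) {
    row <- first_row + day - 1          # move forward one row per day
    data[day, yr] <- tmax[row] + tmin[row]
  }
}

print(data)
```

Because the loop variables yr and day already count the iterations, separate yrcntr and daycntr variables are not needed in this version.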
I'm having problems printing the row name for specific values within a matrix. The following two questions have been difficult.
On which day(s) did she arrive the fastest in the first week? (Only the day(s) of the week should print. Hint: use the row names.)
Determine the day(s) of the second week on which she arrived at work within half an hour. (Only the day(s) of the week should print.)
This is the data set called commutes
          Week1 Week2
Monday       26    22
Tuesday      35    23
Wednesday    24    36
Thursday     31    32
Friday       34    25
1) You can use the which() function to find the index of the smallest value in the first column. You provide which() with a logical object (in this case, a vectorized equal test). Supposing you have your matrix bound to m:
ind = which(m[,'Week1'] == min(m[,'Week1']))
You can then use the index to get the matching row name with rownames():
day = rownames(m)[ind]
2) This is essentially the same thing, except you will be expecting a vector of indices rather than a single index. Again use which() to find the indices which match the desired logical expression:
inds = which(m[,'Week2'] < 30)
days = rownames(m)[inds]
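As a self-contained illustration of both steps, the sketch below rebuilds the commutes data as a matrix (assuming it really is a matrix with row and column names, as the question suggests) and looks up the row names. The names fastest_week1 and under_30_week2 are made up for this example.

```r
# Rebuild the data set from the question as a named matrix
commutes <- matrix(
  c(26, 35, 24, 31, 34,
    22, 23, 36, 32, 25),
  ncol = 2,
  dimnames = list(
    c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
    c("Week1", "Week2")
  )
)

# 1) day(s) with the shortest commute in week 1
fastest_week1 <- rownames(commutes)[which(commutes[, "Week1"] == min(commutes[, "Week1"]))]

# 2) day(s) in week 2 with a commute under 30 minutes
under_30_week2 <- rownames(commutes)[which(commutes[, "Week2"] < 30)]

print(fastest_week1)
print(under_30_week2)
```

Comparing against min() rather than calling which.min() keeps every day in the result if two days tie for the fastest time.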
I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
match(nitrate$date[i] %in% rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[j-6:j-1]
And prints the sum in a new column in nitrate
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$ndate)) {
day = nitrate$ndate[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in 1:length(nitrate$ndate)) {
Grab the day we want final result for:
day = nitrate$ndate[i]
Take the rainfall sum of the five preceding days and put it in the results column:
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
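One way to soften the first two caveats without extra packages is to select the rain rows by comparing dates rather than by position. The following is only a sketch on the example data from the question. It assumes that "previous 5 days" means days day-5 through day-1, and it silently sums fewer days when some dates are missing or fall before the start of the rain record.

```r
# Example data from the question
rain <- data.frame(mm = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0),
                   date = 1:15)
nitrate <- data.frame(nconc = c(15, 12, 14, 20, 8.5),
                      ndate = c(6, 8, 11, 13, 14))

# For each sampling day, sum the rain whose date falls in the 5 days before it
nitrate$prev_five_rainfall <- sapply(nitrate$ndate, function(day) {
  in_window <- rain$date >= day - 5 & rain$date <= day - 1
  sum(rain$mm[in_window])
})

print(nitrate)
```

Because rows are picked by the value of rain$date, the result does not depend on the rain data frame being sorted or on row i holding day i.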
@nelsonauner's answer does all the heavy lifting. One thing to note, though: in my actual data the dates are not numeric as in the example above. They are dates listed as MM/DD/YYYY, converted with as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
day = nitrate$diffdays[i]
nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)]) # 5 days
}
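An alternative that skips the conversion to day counts is to compare Date values directly, since adding or subtracting a number from a Date shifts it by that many days. The sketch below uses made-up dates in March 2015 with the rainfall values from the question; it is an illustration under those assumptions, not a drop-in replacement for the real data.

```r
# Made-up daily rain record: 15 consecutive days starting 03/01/2015
rain <- data.frame(
  date = as.Date("03/01/2015", "%m/%d/%Y") + 0:14,
  mm   = c(0, 0, 0, 0, 5,  0, 0, 2, 0, 0,  10, 0, 0, 0, 0)
)

# Made-up nitrate samples, dates given as MM/DD/YYYY strings
nitrate <- data.frame(
  date  = as.Date(c("03/06/2015", "03/08/2015", "03/11/2015",
                    "03/13/2015", "03/14/2015"), "%m/%d/%Y"),
  nconc = c(15, 12, 14, 20, 8.5)
)

# Sum the rain that fell on the 5 calendar days before each sample date
nitrate$prev_five_rainfall <- sapply(seq_len(nrow(nitrate)), function(i) {
  d <- nitrate$date[i]
  sum(rain$mm[rain$date >= d - 5 & rain$date <= d - 1])
})

print(nitrate)
```

This selects rain rows by their actual date, so it does not rely on the row number in rain lining up with a day count.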
Consider a dataframe of the form
                 id      start        end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with start value of the earlier row and end value of the later year
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just iterate over the script several times to combine ids that have, for example, 4 rows in consecutive years that satisfy the condition.
The Question
I know how to program this iteratively: go over every single row and check whether there is a row somewhere in the whole data frame whose start date next year follows on from the end date this year. But that is extremely slow and unsatisfying from an aesthetic point of view. I'm very much a beginner with R, so I have no clue where to even look to do this more efficiently. I'm open to any suggestion.
Warning: growing a data frame with rbind() inside a loop like this is bad practice (the data frame is copied on every iteration), but this is the easiest solution I could think of. Let df be your data.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
s = subset(df, id==i)
df2 = rbind(df2, data.frame(id = i, start = min(s$start), end = max(s$end)))
}
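Note that the loop above collapses all rows of an id into a single row, whether or not the records are consecutive. A sketch that joins only records satisfying the rule from the question (one ends on December 31 and the next starts on January 1 of the following year) could look like this. The example data and the names prev_end, same_id, continues and grp are made up, and the sketch assumes that records of the same id do not overlap, so that sorting by start date puts candidate pairs next to each other.

```r
# Made-up example: id 1 has three chained records, id 2 has a one-year gap
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  start = as.Date(c("2009-03-20", "2010-01-01", "2011-01-01",
                    "2010-01-01", "2012-01-01")),
  end   = as.Date(c("2009-12-31", "2010-12-31", "2011-07-04",
                    "2010-12-31", "2012-12-31"))
)

df <- df[order(df$id, df$start), ]
n  <- nrow(df)

# End date and id of the previous row (nothing precedes the first row)
prev_end <- c(as.Date(NA), df$end[-n])
same_id  <- c(FALSE, df$id[-1] == df$id[-n])

# A row continues the previous one if it has the same id, the previous record
# ended on Dec 31, and this record starts on the very next day (Jan 1)
continues <- same_id &
  format(prev_end, "%m-%d") == "12-31" &
  df$start == prev_end + 1

# Each row that does not continue its predecessor opens a new group
grp <- cumsum(!continues)

# Collapse every group to one row: earliest start, latest end
merged <- do.call(rbind, lapply(split(df, grp), function(s) {
  data.frame(id = s$id[1], start = min(s$start), end = max(s$end))
}))
rownames(merged) <- NULL

print(merged)
```

Because cumsum() labels whole chains at once, records spanning three or more consecutive years are merged in a single pass, so the script does not have to be run repeatedly.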
I am wondering how to create a subset of data in R based on a list of dates, rather than by a date range.
For example, I have the following data set data which contains 3 years of 6-minute data.
              date zone month day year hour minute temp speed gust dir
1 09/06/2009 00:00  PDT     9   6 2009    0      0   62     2   15 156
2 09/06/2009 00:06  PDT     9   6 2009    0      6   62    13   16 157
I have used breeze<-subset(data, ws>=15 & wd>=247.5 & wd<=315, select=date:dir) to select the rows which meet my criteria for a sea breeze, which is fine, but what I want to do is create a subset of the days which contain those times that meet my criteria.
I have used...
as.character(breeze$date)
trimdate<-strtrim(breeze$date, 10)
breezedate<-as.Date(trimdate, "%m/%d/%Y")
breezedate<-format(breezedate, format="%m/%d/%Y")
...to extract the dates from each row that meets my criteria, so I have a variable called breezedate that contains a list of the dates that I want (not the most elegant coding to do this, I'm sure). There are about two hundred dates in the list. What I am trying to do with the next command is to create, from my original dataset data, a subset which contains only those days which meet the sea breeze criteria, not just the specific times.
breezedays<-(data$date==breezedate)
I think one of my issues here is that I am comparing one value to a list of values, but I am not sure how to make it work.
Let's assume your breezedate list looks like this and data$date is a simple string:
breezedate <- as.Date(c("2009-09-06", "2009-10-01"))
This is probably what you want (note the trailing comma, which selects rows):
breezedays <- data[as.Date(data$date, '%m/%d/%Y') %in% breezedate, ]
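To see the whole idea end to end, here is a small sketch with four made-up observations. The column names speed and dir are taken from the printed data, the thresholds come from the question, and the helper column day is an assumption added for this example. The key difference from the == attempt is that == compares element by element (recycling the shorter vector), whereas %in% asks, for each row, whether its day appears anywhere in the list.

```r
# Made-up 4-row sample in the layout shown in the question
data <- data.frame(
  date  = c("09/06/2009 00:00", "09/06/2009 00:06",
            "09/07/2009 00:00", "10/01/2009 12:00"),
  speed = c(2, 16, 3, 18),
  dir   = c(156, 260, 150, 300),
  stringsAsFactors = FALSE
)

# Calendar day of each observation, parsed from the first 10 characters
data$day <- as.Date(substr(data$date, 1, 10), "%m/%d/%Y")

# Observations that meet the sea breeze criteria
breeze <- subset(data, speed >= 15 & dir >= 247.5 & dir <= 315)

# Unique days on which at least one observation met the criteria
breezedate <- unique(breeze$day)

# All observations from those days, not just the qualifying times
breezedays <- data[data$day %in% breezedate, ]

print(breezedays)
```

Here the 00:00 reading on 09/06/2009 is kept even though it does not meet the criteria itself, because another reading on the same day does.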
The intersect() function returns the values that two vectors have in common. Both vectors must be in the same format, so the time has to be trimmed from data$date first, and the result is the shared date values themselves, not the matching rows of data.
To use, run the following:
breezedays <- intersect(strtrim(as.character(data$date), 10), breezedate) # returns the dates that appear in both data$date and breezedate
I have a data.frame with two columns. The first column contains various specific times during a day. The second column contains the animal behavior (behavior period) that I observed at each specific time:
Time; Behavior
10:20; feeding
10:25; feeding
10:30; resting
...
For each of those behavior periods I have an additional dataset (TimeSeries) which contains data about the actual animal movement (output from a movement sensor). Each TimeSeries has about 100 rows:
Time; Var1; Var2
10:20:01; 1345; 5232
10:20:02; 1423; 5271
...
Now I would like to link each TimeSeries with the behavior from the first dataset. So, that R knows that "feeding" is related to the TimeSeries of 10:20 and 10:25 and that "resting" is related to the TimeSeries of 10:30 and so on.
Afterwards I want to use this "knowledge" to calculate mean and sd from each TimeSeries. So I will have all the means and sd's from all TimeSeries for each behavior.
It is not clear whether your times are currently characters, factors, POSIXct, variables, etc. So you should first convert them (possibly in a new column) to a numeric variable, something like the number of seconds since midnight. Functions like strptime, difftime, and as.numeric may help.
Add a column to the first data frame that is just 1:nrow(firstdf). Then add a column to the second dataframe that is computed by the findInterval function:
seconddf$newcol <- findInterval( seconddf$seconds, firstdf$seconds )
Now you can merge the 2 data frames on the new columns and the finer grained times will be associated with the activity from the most recent time.
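A minimal sketch of these steps on made-up data follows. It assumes the times are character strings in "HH:MM" or "HH:MM:SS" form; the helper to_seconds, the column name period, and all sensor values are hypothetical. Note that findInterval() needs the behavior times sorted in increasing order and returns 0 for sensor rows recorded before the first behavior time, which the merge then drops.

```r
# Hypothetical helper: convert "HH:MM" or "HH:MM:SS" strings to seconds since midnight
to_seconds <- function(hms) {
  parts <- strsplit(hms, ":", fixed = TRUE)
  sapply(parts, function(p) {
    p <- as.numeric(p)
    if (length(p) == 2) p <- c(p, 0)  # "HH:MM" has no seconds field
    sum(p * c(3600, 60, 1))
  })
}

# Made-up behavior observations and sensor readings
firstdf <- data.frame(
  Time     = c("10:20", "10:25", "10:30"),
  Behavior = c("feeding", "feeding", "resting"),
  stringsAsFactors = FALSE
)
seconddf <- data.frame(
  Time = c("10:20:01", "10:20:02", "10:25:01", "10:25:02", "10:30:01", "10:30:02"),
  Var1 = c(1345, 1423, 1500, 1520, 900, 950),
  stringsAsFactors = FALSE
)

firstdf$seconds  <- to_seconds(firstdf$Time)
seconddf$seconds <- to_seconds(seconddf$Time)

# Number the behavior periods, then assign each sensor row to the most
# recent period that started at or before its time
firstdf$period  <- seq_len(nrow(firstdf))
seconddf$period <- findInterval(seconddf$seconds, firstdf$seconds)

# Attach the behavior to every sensor row
merged <- merge(seconddf, firstdf[, c("period", "Behavior")], by = "period")

# Mean and standard deviation of Var1 for each time series, labeled by behavior
means <- aggregate(Var1 ~ period + Behavior, data = merged, FUN = mean)
sds   <- aggregate(Var1 ~ period + Behavior, data = merged, FUN = sd)

print(means)
print(sds)
```

Grouping by both period and Behavior keeps one row per time series while still carrying the behavior label, so the per-series statistics can afterwards be compared across behaviors.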