R Table modification - r

How do I take the average of several entries in one column whose corresponding entries in another column are identical?
For instance, I have a large table with, say, 3 columns, two of which are time and prices. Values in the time column repeat: if 10:30 appears 4 times, I need to take the average of the corresponding prices entries and collapse them into a single 10:30 row with a single price. Can someone give me some pointers?
Sample data:
time prices size
10:00 23 1
10:15 12 3
10:30 12 1
10:30 19 4
10:45 12 1
I would like to modify rows 3 and 4 merging into a single row, averaging the prices.

How about something like
tapply(prices, time, mean)
For a more complete picture, see ?tapply
But what would you like to do with the column size?
EDIT:
To take the mean of prices and the last value of size, here's one suggestion:
myDF<-data.frame(time=c("10:00","10:15","10:30","10:30","10:45"),
prices=c(23,12,12,19,12),size=c(1,3,1,4,1))
theRows <- tapply(seq_len(nrow(myDF)), myDF$time, function(x) {
return(data.frame(time = head(myDF[x, "time"],1), prices = mean(myDF[x, "prices"]),
size = tail(myDF[x, "size"], 1)))
}
)
Reduce(function(...) rbind(..., deparse.level = FALSE), theRows)
p.s. This can be done very well using ddply -- see Paul's answer, too!

You could also take a look at the plyr package. I would use ddply for this:
library(plyr)
ddply(df, .(time), summarise,
mean_price = mean(prices),
sum_size = sum(size))
This assumes your data is in df. For a more elaborate description of plyr, please take a look at this paper in the Journal of Statistical Software.
Other alternatives include using data.table, or ave.
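For completeness, here is a minimal base R sketch of the same summary (mean of prices, sum of size) using aggregate and merge. It reuses the sample data from the question; the names avg, tot and res are only illustrative.

```r
myDF <- data.frame(time = c("10:00", "10:15", "10:30", "10:30", "10:45"),
                   prices = c(23, 12, 12, 19, 12),
                   size = c(1, 3, 1, 4, 1))

# One aggregate call per summary function, then join the results on time
avg <- aggregate(prices ~ time, data = myDF, FUN = mean)
tot <- aggregate(size ~ time, data = myDF, FUN = sum)
res <- merge(avg, tot, by = "time")

# ave() keeps the original rows and attaches the group mean to each of them
myDF$avg_price <- ave(myDF$prices, myDF$time)
```

aggregate returns one row per time value, whereas ave is the better fit if you want to keep every original row.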

Related

creating columns of monthly averages in R

I have a dataframe in R where each row corresponds to a household. One column describes a date in 2010 when that household planted crops. The remainder of the dataset contains over 1000 columns describing the temperature on every day between 2007-2010 for those households.
This is the basic form:
Date 2007-01-01 2007-01-02 2007-01-03
1 2010-05-01 70 72 61
2 2010-02-10 63 59 73
3 2010-03-06 60 59 81
I need to create columns for each household that describe the monthly mean temperatures of the two months following their planting date in each of the three years prior to 2010.
For instance: if a household planted on 2010-05-01, I would need the following columns:
mean temp of 2007-05-01 through 2007-06-01
mean temp of 2007-06-02 through 2007-07-01
mean temp of 2008-05-01 through 2008-06-01
...
mean temp of 2009-06-02 through 2009-07-01
I skipped two columns, but you get the idea. Specific code would be most helpful, but in general, I am just looking for a way to pull data from specific columns based upon a date that is described by another column.
Hi @bricevk, you could use the apply function. It lets you apply a function over your data either row-wise or column-wise.
https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/apply
Say your data is in an object df. The call below applies the mean function over the temperature columns of df (the 2 indicates columns; the first column is dropped because it holds the planting date), giving you the column-wise mean. This would be the daily average across households, assuming each column is a day.
Averages <- apply(df[, -1], 2, mean)
If I didn't answer this the way you would like perhaps I have not really understood your dataset. Could you try explain it more clearly?
I suggest using the tidyverse. To work with it, though, you first have to make your data tidy. In your example, things would be easier if you reshaped the data so that observations are in rows and variables are in columns. If I understood your data correctly, you have households planting crops (are the row names the planting dates?) and then daily temperature readings. I'd aim for something like:
-----------------------------------------------------------------------------
| Household ID | planting date | Date of control | Temperature controlled |
-----------------------------------------------------------------------------
First, store your planting date as a regular column rather than as row names, for example:
library(dplyr)
df <- tibble::rownames_to_column(data, "PlantingDate")
You also need a household ID variable, which you haven't shown us.
Then you can reshape to tidy data with tidyr, using
library(tidyr)
df <- gather(df,"DateOfControl","Temperature",-c(PlantingDate,ID))
Once you have that, you can use the lubridate package; something like
library(lubridate)
df %>%
mutate(DateOfControl = as.Date(DateOfControl)) %>%
group_by(ID, PlantingDate, Year = year(DateOfControl), Month = month(DateOfControl)) %>%
summarise(MeanT = mean(Temperature))
could work
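If you would rather pull the columns directly by date, as the question asks, here is a minimal base R sketch. It assumes the temperatures sit in a matrix whose column names are dates; the toy data and the helper names shift and window_mean are hypothetical, not from the question.

```r
# Toy stand-in for the real data (random temperatures; values are hypothetical)
days <- seq(as.Date("2007-01-01"), as.Date("2009-12-31"), by = "day")
planting <- as.Date(c("2010-05-01", "2010-02-10", "2010-03-06"))
set.seed(42)
temps <- matrix(round(runif(length(planting) * length(days), 50, 90)),
                nrow = length(planting),
                dimnames = list(NULL, as.character(days)))

# Move a date by a calendar step such as "-3 years" or "1 month"
shift <- function(d, by) seq(d, by = by, length.out = 2)[2]

# Mean of one household's temperatures over the columns dated start..end
window_mean <- function(row, start, end) {
  mean(temps[row, days >= start & days <= end])
}

# First-month mean in 2007 for every household
first_month_2007 <- sapply(seq_along(planting), function(i) {
  start <- shift(planting[i], "-3 years")  # same day, three years earlier
  end   <- shift(start, "1 month")         # e.g. 2007-05-01 through 2007-06-01
  window_mean(i, start, end)
})
```

The other windows follow the same pattern: change the year offset for 2008 and 2009, and start the second window one day after the first one ends.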

List of Dataframes filter row based on Date

I am currently working with a list of dataframes.
Actually, I have about a hundred csv files representing forecasts of some kind, where the date on which the forecast was made is in the first line, the lines thereafter contain the predicted values. The data might look like this:
2010/04/15 10:12:51 #Date of the forecast
2010/05/02 2372 #Date for which the forecast was made and the value assigned
2010/05/09 2298
2009/04/15 10:09:13 #another forecast
....
2010/05/02 2298 #also predicts for 2010/05/02
As you might guess, the forecasts do predict values quite some time ahead (e.g. 5 years), which means predictions for the date 2010/05/02 were not only made on 2010/04/15 but also 2009/04/15 and so on (actually, forecasts are done weekly).
I would like to compare how the predicted value for a specified date (for example 2010/05/02) has changed over time.
Right now, I read in all the .csv files as dataframes and save each resulting dataframe in a list.
(Sadly, the date on which the prediction was made gets lost. I hoped to name the list elements with the respective date but have not yet figured out how to do this. Still, I am pretty sure I'll find something somewhere; it is not the main problem here.)
That's where the question title comes in: I would like to know how to filter a list of dataframes by row value.
So, I'd like to be able to use a function: function(2010/05/02) and get as a result the rows of each Element of the list (each dataframe in the list) where Date is 2010/05/02.
In this case I'd like to get:
2010/05/02 2372
2010/05/02 2298
I know how to do this using a for loop, but it takes far too long.
I am happy for any suggestions.
(By this example you might understand why it is important to know when the prediction was made- which I would not have right now. I was thinking about adding a new row containing the date on which the prediction was made in each dataframe)
Threads visited until now include:
get column from list of dataframes R
convert a row of a data frame to a simple vector in R
How to get the name of a data.frame within a list? (which more or less addresses the name problem)
As you can see, no thread was particularly helpful.
As requested, a small reproducible example:
dateList <- as.Date(seq(0,100,5),origin="2010-01-01")
forecasts <- seq(2000,3000,50)
df1 <- data.frame(dateList,forecasts)
df2 <- data.frame(dateList-50,forecasts)
l <- list(df1,df2)
Here we have dates starting at 2010-01-01 in five-day steps. I would, for example, like to know the predicted values for 2010-01-01 in both dataframes.
The first dataframe looks like this:
dateList forecasts
1 2010-01-01 2000
2 2010-01-06 2050
3 2010-01-11 2100
while the second looks like this:
10 2009-12-27 2450
11 2010-01-01 2500
12 2010-01-06 2550
I was hoping to find out for example the predicted values for 2010-01-01.
So, for example:
function(2010-01-01):
2000
2500
Couldn't wait for your example so I made a small one. Let me know if this is in the general direction of what you're after.
xy <- list(df1 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
df2 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
df3 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3))
)
getValueOnDate <- function(x, list.all) {
lapply(list.all, FUN = function(m) m[m$dates %in% x, ])
}
out <- getValueOnDate(as.Date("2016-01-02"), list.all = xy)
do.call("rbind", out)
dates value
df1 2016-01-02 0.7665590
df2 2016-01-02 0.9907976
df3 2016-01-02 0.4909025
You can obviously modify the function to return just the values.
You could alternatively use the following approach, assuming your list is called ls and the date column is called date in every data frame:
my.ls <- lapply(ls, subset, date == "2010/05/02")
df <- do.call("rbind", my.ls)
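Applied to the reproducible example from the question, the same idea might look like the sketch below. It also tags each row with the name of the list element it came from, which is one way to keep track of which forecast produced a value. The element names made_A and made_B and the helper valuesOnDate are hypothetical, and the second data frame's date column is named explicitly so both frames share column names.

```r
dateList <- as.Date(seq(0, 100, 5), origin = "2010-01-01")
forecasts <- seq(2000, 3000, 50)
df1 <- data.frame(dateList, forecasts)
df2 <- data.frame(dateList = dateList - 50, forecasts)
l <- list(made_A = df1, made_B = df2)  # hypothetical names, e.g. the forecast dates

valuesOnDate <- function(day, dfs) {
  hits <- lapply(names(dfs), function(nm) {
    d <- dfs[[nm]]
    d <- d[d$dateList == day, , drop = FALSE]  # keep only rows for the requested date
    d$source <- rep(nm, nrow(d))               # remember which forecast the row came from
    d
  })
  do.call(rbind, hits)
}

res <- valuesOnDate(as.Date("2010-01-01"), l)
```

Here res$forecasts holds 2000 and 2500, the two values the question asks for.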

Using dplyr::mutate between two dataframes to create column based on date range

Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
If you find the dplyr solution a bit slow (mainly due to rowwise), you might want to use data.table for pure speed:
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481
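One thing to check: the first result above, 144.8444, equals sum(daily$value[1:6]), so both approaches include the end date. Subtracting .Machine$double.eps from a Date does not change it at that magnitude, and foverlaps treats intervals as closed. If you need the end-exclusive sum from the question, here is a minimal vectorised sketch in base R using cumulative sums. It assumes daily is sorted, has no missing days, and contains every start and end date; the names cs, first and last are illustrative.

```r
set.seed(1)
date.range <- seq(as.Date("2008-03-01"), as.Date("2008-03-31"), by = "day")
daily <- data.frame(date = date.range, value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# Cumulative sum with a leading zero: cs[k + 1] is the total of the first k days
cs <- c(0, cumsum(daily$value))

first <- match(intervals$start, daily$date)    # index of the first day included
last  <- match(intervals$end, daily$date) - 1  # index of the last day included (end exclusive)

# Sum over first..last is the difference of two cumulative totals
intervals$nhdd <- cs[last + 1] - cs[first]
```

Because this needs only two lookups per interval, it avoids scanning daily once per row, which matters with 11 million intervals.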

Combine different rows

Consider a dataframe of the form
id start end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with start value of the earlier row and end value of the later year
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just run the script several times to combine ids that have, for example, 4 rows in consecutive years that satisfy the condition.
The Question
I'd know how to program this in an iterative manner, where I would go over every single row and check whether there's a row somewhere in the whole data frame with a start date next year that corresponds to the end date this year. But that's extremely slow and unsatisfying from an aesthetic point of view. I'm very much a beginner with R, so I have no clue where to even look to do this more efficiently. I'm open to any suggestion.
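Warning: growing a data frame with rbind() inside a loop is slow, but this is the easiest solution I could think of. Let df be your data. Note that it collapses all rows of an id into one, whether or not the periods are consecutive.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
s = subset(df, id==i)
# one row per id: earliest start and latest end
df2 = rbind(df2, data.frame(id = i, start = min(s$start), end = max(s$end)))
}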
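A loop-free alternative that follows the stated rule more closely is sketched below: sort by id and start, flag rows that continue the previous row (same id, previous row ends on 31 December, this row starts the next day), and collapse each chain in one pass. The toy data and the names continues, is_last and merged are hypothetical, and the sketch assumes records of the same id do not overlap.

```r
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  start = as.Date(c("2009-03-20", "2010-01-01", "2011-01-01",
                    "2009-01-01", "2011-01-01")),
  end   = as.Date(c("2009-12-31", "2010-12-31", "2011-07-04",
                    "2009-12-31", "2011-12-31"))
)

df <- df[order(df$id, df$start), ]
n <- nrow(df)

# TRUE where a row continues the row before it
continues <- c(FALSE,
               df$id[-1] == df$id[-n] &
               format(df$end[-n], "%m-%d") == "12-31" &
               df$start[-1] == df$end[-n] + 1)

# A row closes its chain when the next row does not continue it
is_last <- c(!continues[-1], TRUE)

# Each chain keeps the start of its first row and the end of its last row
merged <- data.frame(id    = df$id[!continues],
                     start = df$start[!continues],
                     end   = df$end[is_last])
```

Chains of any length are merged at once, so there is no need to rerun the script for ids with 3 or 4 consecutive years.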

Data aggregation loop in R

I am facing a problem concerning aggregating my data to daily data.
I have a data frame where NAs have been removed (Link of picture of data is given below). Data has been collected 3 times a day, but sometimes due to NAs, there is just 1 or 2 entries per day; some days data is missing completely.
I am now interested in calculating the daily mean of "dist": this means summing up the data of "dist" of one day and dividing it by number of entries per day (so 3 if there is no data missing that day). I would like to do this via a loop.
How can I do this with a loop? The problem is that sometimes I have 3 entries per day and sometimes just 2 or even 1. I would like to tell R that for every day, it should sum up "dist" and divide it by the number of entries that are available for every day.
I just have no idea how to formulate a for loop for this purpose. I would really appreciate if you could give me any advice on that problem. Thanks for your efforts and kind regards,
Jan
Data frame: http://www.pic-upload.de/view-11435581/Data_loop.jpg.html
Edit: I used aggregate and tapply as suggested, however, the mean value of the data was not really calculated:
Group.1 x
1 2006-10-06 12:00:00 636.5395
2 2006-10-06 20:00:00 859.0109
3 2006-10-07 04:00:00 301.8548
4 2006-10-07 12:00:00 649.3357
5 2006-10-07 20:00:00 944.8272
6 2006-10-08 04:00:00 136.7393
7 2006-10-08 12:00:00 360.9560
8 2006-10-08 20:00:00 NaN
The code used was:
dates<-Dis_sub$date
distance<-Dis_sub$dist
aggregate(distance,list(dates),mean,na.rm=TRUE)
tapply(distance,dates,mean,na.rm=TRUE)
Don't use a loop. Use R. Some example data :
dates <- rep(seq(as.Date("2001-01-05"),
as.Date("2001-01-20"),
by="day"),
each=3)
values <- rep(1:16,each=3)
values[c(4,5,6,10,14,15,30)] <- NA
and any of :
aggregate(values,list(dates),mean,na.rm=TRUE)
tapply(values,dates,mean,na.rm=TRUE)
gives you what you want. See also ?aggregate and ?tapply.
If you want a dataframe back, you can look at the package plyr :
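Data <- data.frame(dates, values)
require(plyr)
# ddply passes each group to summarise, which computes the group mean
ddply(Data, "dates", summarise, mean_value = mean(values, na.rm = TRUE))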
Keep in mind that ddply is not fully supporting the date format (yet).
Look at the data.table package especially if your data is huge. Here is some code that calculates the mean of dist by day.
library(data.table)
dt = data.table(Data)
dt[, list(avg_dist = mean(dist, na.rm = TRUE)), by = 'date']
It looks like your main problem is that your date field has times attached. The first thing you need to do is create a column that has just the date using something like
Dis_sub$date_only <- as.Date(Dis_sub$date)
Then using Joris Meys' solution (which is the right way to do it) should work.
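Put together, the two steps might look like the sketch below. The data frame is a small hypothetical stand-in for Dis_sub, since the real data is only available as a picture. Passing tz explicitly to as.Date is a precaution against timestamps shifting to a neighbouring day.

```r
# Hypothetical stand-in for Dis_sub: timestamps with 1 to 3 readings per day
Dis_sub <- data.frame(
  date = as.POSIXct(c("2006-10-06 12:00:00", "2006-10-06 20:00:00",
                      "2006-10-07 04:00:00", "2006-10-07 12:00:00",
                      "2006-10-07 20:00:00", "2006-10-08 04:00:00"),
                    tz = "UTC"),
  dist = c(600, 800, 300, 600, 900, 100)
)

# Drop the time of day so that all readings of one day share a key
Dis_sub$date_only <- as.Date(Dis_sub$date, tz = "UTC")

# Mean of dist per day, however many readings that day has
daily <- aggregate(dist ~ date_only, data = Dis_sub, FUN = mean)
```

Each day's mean is divided by the number of readings actually present, so days with 1 or 2 entries need no special handling.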
However if for some reason you really want to use a loop you could try something like
newFrame <- data.frame()
for (d in as.character(unique(Dis_sub$date_only))) {
meanDist <- mean(Dis_sub$dist[Dis_sub$date_only == d], na.rm = TRUE)
newFrame <- rbind(newFrame, data.frame(date = d, meanDist = meanDist))
}
But keep in mind that this will be slow and memory-inefficient.

Resources