I have a data frame like ba, shown below.
I need to split it by region and then merge the pieces on date.
This works when I do it manually, as shown below. But when there are more than two regions, I need to do the extraction with sapply and then merge the results (I'm not sure how to do that with a loop or sapply). Please advise how I can extract by "region" and then merge dynamically, even when there are more than two regions (e.g. betasol, alpha, atpTax).
> ba
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-07 betasol 1408
6 2012-06-12 betasol 1471
7 2012-06-15 betasol 1384
8 2012-06-21 betasol 1390
9 2012-06-22 atpTax 1252
10 2012-06-23 betasol 1442
> dfa <- ba[ba$region == "atpTax", c("date", "AveElapsedTime")]
> dfb <- ba[ba$region == "betasol", c("date", "AveElapsedTime")]
> merge(dfa, dfb, by="date", all=TRUE)
date AveElapsedTime.x AveElapsedTime.y
1 2012-05-19 NA 1372
2 2012-05-22 1652 NA
3 2012-06-02 1552 1630
4 2012-06-07 NA 1408
5 2012-06-12 NA 1471
6 2012-06-15 NA 1384
7 2012-06-21 NA 1390
8 2012-06-22 1252 NA
9 2012-06-23 NA 1442
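# Return the date and AveElapsedTime columns for a single region z
extractfun <- function(z, ab) {
  ab[ab$region == z, c("date", "AveElapsedTime")]
}
# lapply keeps each region's data frame intact as one list element
lapply(unique(as.character(ba$region)), FUN = extractfun, ab = ba)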
library(reshape)
cast(ba, date ~ region, value = "AveElapsedTime")
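For an arbitrary number of regions, one base R possibility is to split the data by region and fold merge over the pieces with Reduce. This is only a minimal sketch, not the method used in the post: the names pieces and wide are made up for illustration, and the sample rows are copied from ba above.

```r
# Sample rows copied from the question's ba
ba <- data.frame(
  date = c("2012-05-19", "2012-05-22", "2012-06-02", "2012-06-02", "2012-06-07",
           "2012-06-12", "2012-06-15", "2012-06-21", "2012-06-22", "2012-06-23"),
  region = c("betasol", "atpTax", "betasol", "atpTax", "betasol",
             "betasol", "betasol", "betasol", "atpTax", "betasol"),
  AveElapsedTime = c(1372, 1652, 1630, 1552, 1408, 1471, 1384, 1390, 1252, 1442),
  stringsAsFactors = FALSE
)

# One data frame per region, keeping only the merge key and the value
pieces <- split(ba[, c("date", "AveElapsedTime")], ba$region)

# Rename the value column after its region so merged columns stay distinct
pieces <- Map(function(d, nm) setNames(d, c("date", nm)), pieces, names(pieces))

# Fold a full outer merge on date over however many regions there are
wide <- Reduce(function(x, y) merge(x, y, by = "date", all = TRUE), pieces)
wide
```

Because Reduce simply walks the list, the same code should work unchanged when a third or fourth region appears in the data.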
A sample of my data is available here.
I am trying to calculate the growth rate (change in weight (wt) over time) for each squirrel.
When I have my data in wide format:
squirrel fieldBirthDate date1 date2 date3 date4 date5 date6 age1 age2 age3 age4 age5 age6 wt1 wt2 wt3 wt4 wt5 wt6 litterid
22922 2017-05-13 2017-05-14 2017-06-07 NA NA NA NA 1 25 NA NA NA NA 12 52.9 NA NA NA NA 7684
22976 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 3 25 NA NA NA NA 15.5 50.9 NA NA NA NA 7692
22926 2017-05-13 2017-05-16 2017-06-07 NA NA NA NA 0 25 NA NA NA NA 10.1 48 NA NA NA NA 7719
I am able to calculate growth rate with the following code:
library(dplyr)
#growth rate between weight 1 and weight 3, divided by age when weight 3 is recorded
growth <- growth %>%
mutate (g.rate=((wt3-wt1)/age3))
#growth rate between weight 1 and weight 2, divided by age when weight 2 is recorded
merge.growth <- merge.growth %>%
mutate (g.rate=((wt2-wt1)/age2))
However, when the data is in long format (a format needed for the analysis I am running afterwards):
squirrel litterid date age wt
22922 7684 2017-05-13 0 NA
22922 7684 2017-05-14 1 12
22922 7684 2017-06-07 25 52.9
22976 7692 2017-05-13 1 NA
22976 7692 2017-05-16 3 15.5
22976 7692 2017-06-07 25 50.9
22926 7719 2017-05-14 0 10.1
22926 7719 2017-06-08 25 48
I cannot use the mutate function I used above. I am hoping to create a new column that includes growth rate as follows:
squirrel litterid date age wt g.rate
22922 7684 2017-05-13 0 NA NA
22922 7684 2017-05-14 1 12 NA
22922 7684 2017-06-07 25 52.9 1.704
22976 7692 2017-05-13 1 NA NA
22976 7692 2017-05-16 3 15.5 NA
22976 7692 2017-06-07 25 50.9 1.609
22926 7719 2017-05-14 0 10.1 NA
22926 7719 2017-06-08 25 48 1.516
22758 7736 2017-05-03 0 8.8 NA
22758 7736 2017-05-28 25 43 1.368
22758 7736 2017-07-05 63 126 1.860
22758 7736 2017-07-23 81 161 1.879
22758 7736 2017-07-26 84 171 1.930
I have been calculating the growth rates (growth between each weight and the first recorded weight) in Excel; however, I would like to do the calculations in R instead, since I have a large number of squirrels to work with. I suspect if/else statements or loops might be the way to go here, but I am not well versed in that sort of coding. Any suggestions or ideas are welcome!
You can use group_by to calculate this for each squirrel:
group_by(df, squirrel) %>%
mutate(g.rate = (wt - nth(wt, which.min(is.na(wt)))) /
(age - nth(age, which.min(is.na(wt)))))
That leaves NaNs where the age term is zero (the row holding the first recorded weight, where the calculation is 0/0). If you assign the result back to df, you can change those to NAs with df$g.rate[is.nan(df$g.rate)] <- NA.
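The same idea can be sketched in base R, which may help if dplyr is not available. This is an illustrative version under assumptions: growth_from_first is a hypothetical helper name, and the sample rows are a subset of the long-format data from the question.

```r
# A few long-format rows from the question (two squirrels)
df <- data.frame(
  squirrel = c(22922, 22922, 22922, 22926, 22926),
  age      = c(0, 1, 25, 0, 25),
  wt       = c(NA, 12, 52.9, 10.1, 48)
)

# Hypothetical helper: growth relative to the first non-missing weight
growth_from_first <- function(d) {
  first <- which(!is.na(d$wt))[1]           # row of the first recorded weight
  rate  <- (d$wt - d$wt[first]) / (d$age - d$age[first])
  rate[is.nan(rate)] <- NA                  # the baseline row itself is 0/0
  d$g.rate <- rate
  d
}

# Apply the helper per squirrel and stack the pieces back together
df <- do.call(rbind, lapply(split(df, df$squirrel), growth_from_first))
df
```

Rows recorded before the first weight stay NA because the weight itself is missing there.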
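An alternative uses data.table and its shift function, which takes the value from the previous row. Note that this measures the change from the previous weighing rather than from the first one: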
library(data.table)
df= data.table(df)
df[,"growth":=(wt-shift(wt,1))/age,by=.(squirrel)]
Suppose I have two datasets:
ds1
NO ID DOB ID2 count
1 4083 2007-10-01 3625 5
2 4408 2008-07-01 3603 2
3 4514 2007-07-01 3077 3
4 4396 2008-05-01 3413 5
5 4222 2003-12-01 3341 1
ds2
loc share
12 445
23 4
10 56
1 1
23 34
I want "share" column of ds2 to be added to ds1 so that it would look like
dsmerged
NO ID DOB ID2 count share
1 4083 2007-10-01 3625 5 445
2 4408 2008-07-01 3603 2 4
3 4514 2007-07-01 3077 3 56
4 4396 2008-05-01 3413 5 1
5 4222 2003-12-01 3341 1 34
I tried merge as follows:
dsmerged <- merge(ds1[,c(1:5)],ds2[,c(2)])
It does add the "share" column, but it also duplicates the dataset (5*5=25 rows). Obviously I don't want those duplicated rows. Thank you.
If you know that the rows represent the same id then you can just cbind
ds3 <- cbind(ds1, share = ds2$share)
but it would be better if you had an id to join on.
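The 25 rows in the question come from merge finding no common column between its two inputs, in which case it returns every combination of rows. A rough sketch of joining on an id instead is shown here; it assumes the rows of ds2 really are in the same order as ds1, and the NO column added to ds2 is a hypothetical key created only for illustration (only some columns of ds1 are reproduced).

```r
# Trimmed copies of the question's data
ds1 <- data.frame(
  NO    = 1:5,
  ID    = c(4083, 4408, 4514, 4396, 4222),
  count = c(5, 2, 3, 5, 1)
)
ds2 <- data.frame(
  loc   = c(12, 23, 10, 1, 23),
  share = c(445, 4, 56, 1, 34)
)

# Assumption: row i of ds2 belongs to row i of ds1
stopifnot(nrow(ds1) == nrow(ds2))
ds2$NO <- seq_len(nrow(ds2))   # hypothetical key built from row position

# With a shared key, merge matches row to row instead of crossing them
dsmerged <- merge(ds1, ds2[, c("NO", "share")], by = "NO")
dsmerged
```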
Using dplyr
library(dplyr)
bind_cols(ds1, ds2['share'])
Or with data.table
setDT(ds1)[, share := ds2[["share"]]]
I have been trying to calculate the growth rate comparing quarter 1 from one year to quarter 1 for the following year.
In excel the formula would look like this ((B6-B2)/B2)*100.
What is the best way to accomplish this in R? I know how to get the differences from period to period, but cannot accomplish it with 4 time periods' difference.
Here is the code:
date <- c("2000-01-01","2000-04-01", "2000-07-01",
"2000-10-01","2001-01-01","2001-04-01",
"2001-07-01","2001-10-01","2002-01-01",
"2002-04-01","2002-07-01","2002-10-01")
value <- c(1592,1825,1769,1909,2022,2287,2169,2366,2001,2087,2099,2258)
df <- data.frame(date,value)
Which will produce this data frame:
date value
1 2000-01-01 1592
2 2000-04-01 1825
3 2000-07-01 1769
4 2000-10-01 1909
5 2001-01-01 2022
6 2001-04-01 2287
7 2001-07-01 2169
8 2001-10-01 2366
9 2002-01-01 2001
10 2002-04-01 2087
11 2002-07-01 2099
12 2002-10-01 2258
Here's an option using the dplyr package:
# Convert date column to date format
df$date = as.POSIXct(df$date)
library(dplyr)
library(lubridate)
In the code below, we first group by month, which allows us to operate on each quarter separately. The arrange function just makes sure that the data within each quarter is ordered by date. Then we add the yearOverYear column using mutate which calculates the ratio of the current year to the previous year for each quarter.
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1))
date value month yearOverYear
1 2000-01-01 1592 1 NA
2 2001-01-01 2022 1 1.2701005
3 2002-01-01 2001 1 0.9896142
4 2000-04-01 1825 4 NA
5 2001-04-01 2287 4 1.2531507
6 2002-04-01 2087 4 0.9125492
7 2000-07-01 1769 7 NA
8 2001-07-01 2169 7 1.2261164
9 2002-07-01 2099 7 0.9677271
10 2000-10-01 1909 10 NA
11 2001-10-01 2366 10 1.2393924
12 2002-10-01 2258 10 0.9543533
If you prefer to have the data frame back in overall date order after adding the year-over-year values:
df = df %>% group_by(month=month(date)) %>%
arrange(date) %>%
mutate(yearOverYear=value/lag(value,1)) %>%
ungroup() %>% arrange(date)
Or using data.table
library(data.table) # v1.9.5+
setDT(df)[, .(date, yoy = (value-shift(value))/shift(value)*100),
by = month(date)
][order(date)]
Here's a very simple solution:
YearOverYear<-function (x,periodsPerYear){
if(NROW(x)<=periodsPerYear){
stop("too few rows")
}
else{
indexes<-1:(NROW(x)-periodsPerYear)
return(c(rep(NA,periodsPerYear),(x[indexes+periodsPerYear]-x[indexes])/x[indexes]))
}
}
> cbind(df,YoY=YearOverYear(df$value,4))
date value YoY
1 2000-01-01 1592 NA
2 2000-04-01 1825 NA
3 2000-07-01 1769 NA
4 2000-10-01 1909 NA
5 2001-01-01 2022 0.27010050
6 2001-04-01 2287 0.25315068
7 2001-07-01 2169 0.22611645
8 2001-10-01 2366 0.23939235
9 2002-01-01 2001 -0.01038576
10 2002-04-01 2087 -0.08745081
11 2002-07-01 2099 -0.03227294
12 2002-10-01 2258 -0.04564666
df$yoy <- c(rep(NA,4),(df$value[5:nrow(df)]-df$value[1:(nrow(df)-4)])/df$value[1:(nrow(df)-4)]*100);
df;
## date value yoy
## 1 2000-01-01 1592 NA
## 2 2000-04-01 1825 NA
## 3 2000-07-01 1769 NA
## 4 2000-10-01 1909 NA
## 5 2001-01-01 2022 27.010050
## 6 2001-04-01 2287 25.315068
## 7 2001-07-01 2169 22.611645
## 8 2001-10-01 2366 23.939235
## 9 2002-01-01 2001 -1.038576
## 10 2002-04-01 2087 -8.745081
## 11 2002-07-01 2099 -3.227294
## 12 2002-10-01 2258 -4.564666
Another base R solution. It requires the date column to be in Date format, so that the common months can be used as a grouping variable, within which the growth-rate function is applied.
# set date to a Date object
df$date <- as.Date(df$date)
# order by date
df <- df[order(df$date), ]
# function to calculate differences
f <- function(x) c(NA, 100*diff(x)/x[-length(x)])
df$yoy <- ave(df$value, format(df$date, "%m"), FUN=f)
# date value yoy
# 1 2000-01-01 1592 NA
# 2 2000-04-01 1825 NA
# 3 2000-07-01 1769 NA
# 4 2000-10-01 1909 NA
# 5 2001-01-01 2022 27.010050
# 6 2001-04-01 2287 25.315068
# 7 2001-07-01 2169 22.611645
# 8 2001-10-01 2366 23.939235
# 9 2002-01-01 2001 -1.038576
# 10 2002-04-01 2087 -8.745081
# 11 2002-07-01 2099 -3.227294
# 12 2002-10-01 2258 -4.564666
or
c(rep(NA, 4), 100 * diff(df$value, lag = 4) / head(df$value, -4))
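The one-liner above can be wrapped in a small function and checked against the Excel formula ((B6-B2)/B2)*100 from the question. This is just a sketch: yoy_pct is a hypothetical name, and the lag of 4 assumes quarterly data sorted by date with no missing quarters.

```r
value <- c(1592, 1825, 1769, 1909, 2022, 2287,
           2169, 2366, 2001, 2087, 2099, 2258)

# Hypothetical helper: percent change versus the value `lag` periods earlier
yoy_pct <- function(x, lag = 4) {
  if (length(x) <= lag) return(rep(NA_real_, length(x)))  # nothing to compare
  c(rep(NA_real_, lag), 100 * diff(x, lag = lag) / head(x, -lag))
}

yoy <- yoy_pct(value)

# Row 5 compares Q1 2001 with Q1 2000, like ((B6-B2)/B2)*100 in Excel
excel_style <- (value[5] - value[1]) / value[1] * 100
all.equal(yoy[5], excel_style)
```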
I have two data frames with different dimensions:
1:
head(x)
Year GDP_deflator
1 1825 NA
2 1826 NA
3 1827 NA
4 1828 NA
5 1829 NA
6 1829 NA
7 1830 NA
8 1830 NA
9 1830 NA
10 1831 NA
dim(x)
1733 2
2:
head(dataDef)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1830 1.866437
7 1831 1.960316
8 1832 2.029601
9 1833 1.880957
10 1834 1.845750
dim(dataDef)
101 2
I would like to substitute values from dataDef$GDP_deflator column into x$GDP_deflator column conditioned on Year column. In other words, I would like the answer to be:
head (x)
Year GDP_deflator
1 1825 1.788002
2 1826 1.884325
3 1827 2.016997
4 1828 1.802907
5 1829 1.781999
6 1829 1.781999
7 1830 1.866437
8 1830 1.866437
9 1830 1.866437
10 1831 1.960316
So the repeating years (i.e. 1830) get the same value, 1.866437. Any suggestions?
Best Regards
One possibility is to use match:
x$GDP_deflator <- dataDef$GDP_deflator[match(x$Year, dataDef$Year)]
You want to merge the two data.frames. It's a many-to-one merge.
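A minimal sketch of that many-to-one merge is given here, using a few rows modelled on the question's data. The name filled is made up for illustration. The all-NA GDP_deflator column is dropped from x first so that merge does not create .x/.y suffixed duplicates. Unlike the match approach, merge returns the rows sorted by Year rather than in the original order of x.

```r
# x has repeated years and an empty deflator column
x <- data.frame(
  Year         = c(1825, 1826, 1829, 1829, 1830, 1830, 1830, 1831),
  GDP_deflator = NA_real_
)

# dataDef has one row per year
dataDef <- data.frame(
  Year         = c(1825, 1826, 1827, 1828, 1829, 1830, 1831),
  GDP_deflator = c(1.788002, 1.884325, 2.016997, 1.802907,
                   1.781999, 1.866437, 1.960316)
)

# Keep every row of x (all.x = TRUE) and look up its deflator by Year
filled <- merge(x["Year"], dataDef, by = "Year", all.x = TRUE)
filled
```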
I have a data frame like the one below. I need to plot it by region, with date on the x axis and AveElapsedTime on the y axis.
>avg_data
date region AveElapsedTime
1 2012-05-19 betasol 1372
2 2012-05-22 atpTax 1652
3 2012-06-02 betasol 1630
4 2012-06-02 atpTax 1552
5 2012-06-02 Tax 1552
6 2012-06-07 betasol 1408
7 2012-06-12 betasol 1471
8 2012-06-15 betasol 1384
9 2012-06-21 betasol 1390
10 2012-06-22 atpTax 1252
11 2012-06-23 betasol 1442
If I rearrange the data above by region, it looks like the table below. A point should not be plotted if there is no value (NA) for a particular date.
date atpTax betasol Tax
1 2012-05-19 NA 1372 NA
2 2012-05-22 1652 NA NA
3 2012-06-02 1552 1630 1552
4 2012-06-07 NA 1408 NA
5 2012-06-12 NA 1471 NA
6 2012-06-15 NA 1384 NA
7 2012-06-21 NA 1390 NA
8 2012-06-22 1252 NA NA
9 2012-06-23 NA 1442 NA
I tried the ggplot command below, but I get a geom_path error.
ggplot(avg_data, aes(date, AveElapsedTime)) + geom_line(aes(col=region)) + opts(axis.text.x = theme_text(angle=90, hjust=1))
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
> str(avg_data)
'data.frame': 11 obs. of 3 variables:
$ date : Factor w/ 9 levels "2012-05-19","2012-05-22",..: 1 2 3 3 3 4 5 6 7 8 ...
$ region : Factor w/ 3 levels "atpTax","betasol",..: 2 1 2 1 3 2 2 2 2 1 ...
$ AveElapsedTime: int 1372 1652 1630 1552 1552 1408 1471 1384 1390 1252 ...
Please advise on this.
As the error message indicates, you need to specify the group. Like this:
ggplot(avg_data, aes(date, AveElapsedTime, colour=region, group=region)) +
geom_point() + geom_line()