I'm working with some wind direction data for a potential paper. I am trying to compare the number of days the wind is blowing easterly (negative U) with the number of days it is blowing westerly (positive U). I need to calculate this over an austral summer, i.e. the period from October to March (for example, October 1993 to March 1994).
Here is a sample of my data frame:
Year Month Day Hour Minutes Seconds Ws U V
1 1993 1 1 0 0 0 3.750620 2.822403 1.281318
2 1993 1 1 6 0 0 4.207054 3.600465 1.719147
3 1993 1 1 12 0 0 5.050543 3.155271 3.243411
4 1993 1 1 18 0 0 3.165194 -0.477054 2.926124
5 1993 1 2 0 0 0 1.529690 -0.721395 -0.503101
6 1993 1 2 6 0 0 1.950233 0.303333 -1.728295
7 1993 1 2 12 0 0 4.548992 -2.868217 3.307519
8 1993 1 2 18 0 0 6.563643 -6.245194 1.744419
9 1993 1 3 0 0 0 5.868992 -5.805969 -0.594031
10 1993 1 3 6 0 0 6.530620 -6.446667 -0.689535
11 1993 1 3 12 0 0 7.085736 -6.657984 1.834884
12 1993 1 3 18 0 0 7.685349 -7.111008 2.571783
13 1993 1 4 0 0 0 6.508760 -6.414574 -0.678837
14 1993 1 4 6 0 0 6.141860 -6.006822 -0.272558
15 1993 1 4 12 0 0 7.388295 -6.744574 1.862868
16 1993 1 4 18 0 0 7.281163 -7.054264 0.896512
17 1993 1 5 0 0 0 4.847287 -4.431628 -0.813643
18 1993 1 5 6 0 0 3.482558 -1.670078 2.048915
19 1993 1 5 12 0 0 5.698992 1.097287 5.433721
20 1993 1 5 18 0 0 4.894031 1.445736 4.440465
21 1993 1 6 0 0 0 1.983411 0.783023 1.556047
22 1993 1 6 6 0 0 2.315891 -1.225891 1.756744
23 1993 1 6 12 0 0 4.525581 -4.016124 1.723721
24 1993 1 6 18 0 0 5.123566 -4.618682 0.759225
25 1993 1 7 0 0 0 3.449147 -2.639457 -1.627442
26 1993 1 7 6 0 0 2.067364 1.185891 -0.760233
27 1993 1 7 12 0 0 5.675814 3.872171 3.419690
28 1993 1 7 18 0 0 6.278450 3.989767 4.684031
29 1993 1 8 0 0 0 6.562636 5.496667 3.329302
30 1993 1 8 6 0 0 7.762636 5.280310 5.516589
31 1993 1 8 12 0 0 9.283953 5.575659 7.294264
So far I have managed to do this calculation for one month only (see code below), but I'm unsure how to do it from October of one year to March of the next. When I tried filter(wind,Year==1993:1994,Month==10:3,U>0) I got this warning:
Warning message:
In Month == 10:3 :
longer object length is not a multiple of shorter object length
This is what I have done so far to calculate the number of positive and negative directions for October 1993, which has worked. I am new to R and Stack Overflow, so I hope I have set this out correctly!
filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos<-filter(wind,Year==1993,Month==10,U>0)
Oct_1993_pos
filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg<-filter(wind,Year==1993,Month==10,U<0)
Oct_1993_neg
sum(Oct_1993_pos$U>0)
sum(Oct_1993_neg$U<0)
Your first error (Month == 10:3) occurs because you are comparing a vector (Month) with another vector. When you do this, R does an element-wise comparison, i.e. Month[1] == 10, Month[2] == 9, etc. When the vectors are of unequal length, R recycles the shorter one. If the longer length is not an exact multiple of the shorter one, R still recycles but issues a warning:
c(1,2,3,1,2,3) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE FALSE
c(1,2,3,1,2) == c(1,2)
[1] TRUE TRUE FALSE FALSE FALSE
Warning message:
In c(1, 2, 3, 1, 2) == c(1, 2) :
longer object length is not a multiple of shorter object length
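To test whether each month belongs to a set of months, %in% avoids recycling altogether. A minimal sketch, using a made-up Month vector rather than the real data:

```r
# Hypothetical month values, for illustration only
Month <- c(1, 3, 10, 12, 6)

# TRUE where the month falls in the October-March period
in_summer <- Month %in% c(10:12, 1:3)
in_summer
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
```

Each element of Month is checked against the whole set, so the lengths of the two vectors do not need to match.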
For counting positive and negative U's, you can exploit that summing logicals simply counts the number of TRUEs:
sum(c(FALSE, TRUE, TRUE, FALSE))
[1] 2
And you can obtain such logicals with a simple comparison:
sum(U > 0)
For your calculations I would recommend using dplyr. With this you can repeat your counting across any collection of subsets. Try:
# if following fails, run install.packages("dplyr")
library(dplyr)
monthly <- wind %>% group_by(Year, Month) %>%
  summarise(
    pos=sum(U > 0),
    neg=sum(U < 0),
    nowind=sum(U == 0),
    entries=n()
  )
Edit in response to comment:
Depending on whether you need intermediate results or not, we could do a couple of things. Regarding the period October to March, you have to be careful if your data spans several years.
monthly %>% filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>% ungroup() %>%
  summarise_at(vars(pos, neg, nowind, entries), sum)
or, just filter before you summarise:
wind %>% filter((Month >= 10 & Year == 1993) | (Month <= 3 & Year == 1994)) %>%
  summarise(
    pos=sum(U > 0),
    neg=sum(U < 0),
    nowind=sum(U == 0),
    entries=n()
  )
Note that I am using the single boolean operators (|, &) and not the double ones (||, &&), because we want to keep the element-wise comparisons (the double variants work on a single TRUE/FALSE value only).
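The difference between the two families of operators can be seen on small made-up vectors. This is only an illustration; a and b are hypothetical and not part of the wind data:

```r
# Hypothetical logical vectors, for illustration only
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

elementwise_and <- a & b      # one result per element: TRUE FALSE FALSE
elementwise_or  <- a | b      # one result per element: TRUE TRUE TRUE

# && expects a single value on each side, so compare single elements
single_and <- a[1] && b[1]    # TRUE
```

In R 4.3 and later, calling && or || on vectors longer than one signals an error, which is another reason to use & and | inside filter().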
If you want to see winter vs. summer periods across multiple years, we have to figure out how to group the seasons correctly. For this, I will build a data set of years and months:
library(tidyr)
seasons <- crossing(month=1:12, year=1992:1994) %>% arrange(year, month) %>%
  mutate(
    # a new season starts in April (winter) and in October (summer),
    # so that March still belongs to the October-March period
    season_start = month %in% c(4, 10),
    season = cumsum(season_start)
  )
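The grouping trick here is that a running sum over a logical "start" flag increases by one at every flagged row, which gives each stretch of rows its own id. A small sketch with made-up flags (not the real seasons table):

```r
# Hypothetical flags: TRUE marks the first month of a new season
season_start <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

# The running count of TRUEs labels each run of months with one id
season <- cumsum(season_start)
season
# [1] 0 0 1 1 1 2 2
```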
With this approach, we've split the problem in two: 1) Define the seasons you wish to summarise over, and 2) summarise it.
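inner_join(wind, seasons, by=c('Year'='year','Month'='month')) %>%
  group_by(season) %>%
  summarise(
    # first() assumes wind is in chronological order; min(Month) would
    # give 1 for a season that runs from October to March
    seasonstart = paste0(first(Year), '-', first(Month)),
    pos=sum(U > 0),
    neg=sum(U < 0),
    nowind=sum(U == 0),
    entries=n()
  )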
So, to summarise over the period October-March, do the same as before and just define a different grouping.
As an exercise, try adding Year and/or Month to the group_by call in the last example.
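An alternative that needs no lookup table is to label each October-March row with the year in which its summer started. The following base R sketch assumes the column names Year, Month and U from the question; the toy values and the season column are made up for illustration:

```r
# Toy data with the same column names as the question (values are made up)
wind <- data.frame(
  Year  = c(1993, 1993, 1993, 1994, 1994, 1994, 1994),
  Month = c(9, 10, 12, 1, 3, 4, 11),
  U     = c(1.2, -0.5, 2.0, -3.1, 0.4, 1.0, -2.2)
)

# Keep only October-March rows
summer <- wind[wind$Month %in% c(10:12, 1:3), ]

# January-March belong to the summer that started the previous October
summer$season <- ifelse(summer$Month >= 10, summer$Year, summer$Year - 1)

# Count positive and negative U per summer
pos <- tapply(summer$U > 0, summer$season, sum)
neg <- tapply(summer$U < 0, summer$season, sum)
counts <- data.frame(season = names(pos),
                     pos = as.vector(pos),
                     neg = as.vector(neg))
counts
```

Here the row for season 1993 covers October 1993 to March 1994. Note that these are counts of observations, so with four readings per day they are not yet counts of days.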
I have panel data with 15 years of county-level economic measures (for which I have created an index). There are missing values that I would like to interpolate. However, because the values are randomly missing by year, linear interpolation doesn't work: it only gives me interpolated values between the first and last data points. This is a problem because I need interpolated values for the entire series.
Since all of the series have more than 5 data points, is there any code out there that would interpolate the series based on data that already exists within the specific series?
I first thought about indexing my data to try and run a loop, but then I found code on linear interpolation by groups. While the latter filled in some of the NAs, it did not interpolate all of them. Here is an example of my data, for which some of the values get interpolated but not all.
library(dplyr)
data <- read.csv(text="
index,year,value
1,2001,20864.135
1,2002,20753.867
1,2003,NA
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,NA
1,2008,NA
1,2009,9021.556
1,2010,NA
1,2011,NA
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,NA
2,2013,128.646
2,2014,NA
2,2015,NA")
Using
interpolation<-data %>%
group_by(index) %>%
mutate(valueIpol = approx(year, value, year,
method = "linear", rule = 1, f = 0, ties = mean)$y)
I get the following interpolated values.
1,2001,20864.135
1,2002,20753.867
1,2003,19231.046
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,11604.686
1,2008,10313.121
1,2009,9021.556
1,2010,10612.955
1,2011,12204.353
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,116.394
2,2013,128.646
2,2014,NA
2,2015,NA
Any help would be appreciated. I'm pretty new to R and have never worked with loops but I have looked up other "interpolation by groups" help. Nothing seems to solve the issue of filling in data when the first and last points are NA's as well.
Maybe this could help:
library(imputeTS)
for(i in unique(data$index)) {
data[data$index == i,] <- na.interpolation(data[data$index == i,])
}
This only works when each group is already ordered by year (which is the case in your example). In imputeTS 3.0 and later the function is called na_interpolation; the old name is deprecated.
Output would look like this:
> data
index year value
1 1 2001 20864.135
2 1 2002 20753.867
3 1 2003 19231.046
4 1 2004 17708.224
5 1 2005 12483.767
6 1 2006 12896.251
7 1 2007 11604.686
8 1 2008 10313.121
9 1 2009 9021.556
10 1 2010 10612.955
11 1 2011 12204.353
12 1 2012 13795.752
13 1 2013 16663.741
14 1 2014 19349.992
15 1 2015 19349.992
16 2 2001 151.108
17 2 2002 151.108
18 2 2003 151.108
19 2 2004 151.108
20 2 2005 151.108
21 2 2006 151.108
22 2 2007 151.108
23 2 2008 151.108
24 2 2009 107.205
25 2 2010 90.869
26 2 2011 104.142
27 2 2012 116.394
28 2 2013 128.646
29 2 2014 128.646
30 2 2015 128.646
Since the na.interpolation function uses approx internally, you can pass approx parameters through to adjust the behavior.
The parameters you used in your example (method = "linear", rule = 1, f = 0, ties = mean) are the approx defaults. If you want to use these, you don't have to add anything.
Otherwise you would change the line inside the loop to, for example, this:
data[data$index == i,] <- na.interpolation(data[data$index == i,], ties = "ordered", f = 1, rule = 2)
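The same behaviour at the ends of a series (repeating the nearest observed value) can also be sketched in base R by calling approx with rule = 2 within each group. This is an illustration on made-up data, not the imputeTS implementation; panel and fill_group are hypothetical names:

```r
# Made-up panel with NAs at the start, middle and end of each series
panel <- data.frame(
  index = rep(1:2, each = 5),
  year  = rep(2001:2005, 2),
  value = c(NA, 10, NA, 30, NA,
            5, NA, NA, 20, 25)
)

# Interpolate linearly inside the observed range; rule = 2 repeats the
# closest observed value for years outside that range
fill_group <- function(g) {
  approx(g$year, g$value, xout = g$year, method = "linear", rule = 2)$y
}

# Apply per index and put the results back in the original row order
panel$valueIpol <- unsplit(lapply(split(panel, panel$index), fill_group),
                           panel$index)
panel$valueIpol
# [1] 10 10 20 30 30  5 10 15 20 25
```

Whether carrying the first or last observation outwards is appropriate depends on the data; it is a constant extension, not a trend-based extrapolation.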
I have the following data frame:
> dx
estyear age us_population
1 1980 0 3559857
2 1980 1 3315535
3 1981 0 3607440
4 1981 1 3436005
Split into groups
years <- c(dx[,'estyear'])
> years
[1] 1980 1980 1981 1981
dx.split <- split(dx, years)
> dx.split
$`1980`
estyear age us_population
1 1980 0 3559857
2 1980 1 3315535
$`1981`
estyear age us_population
3 1981 0 3607440
4 1981 1 3436005
Then I return the subset for one year. When passing a string literal, this works just fine.
dx.split$'1980'
> dx.split$'1980'
estyear age us_population
1 1980 0 3559857
2 1980 1 3315535
This is where the problem occurs: when passing a variable set to the same value, it returns NULL.
selyear <- '1980'
> selyear
[1] "1980"
dx.split$selyear
> dx.split$selyear
NULL
Hopefully this is something simple.
The reason I want to use a variable is that I intend to iterate over the data frame by year, driven by external inputs. I am open to alternative approaches that get to the same result and/or are easier.
Thanks
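The $ operator takes the name after it literally, so dx.split$selyear looks for an element called "selyear". To look up an element by the value stored in a variable, use double brackets. A minimal sketch that rebuilds the data frame from the question:

```r
# Data frame from the question
dx <- data.frame(
  estyear = c(1980, 1980, 1981, 1981),
  age = c(0, 1, 0, 1),
  us_population = c(3559857, 3315535, 3607440, 3436005)
)
dx.split <- split(dx, dx$estyear)

selyear <- "1980"
by_name <- dx.split$selyear      # NULL: no element is named "selyear"
by_value <- dx.split[[selyear]]  # the 1980 rows

# Iterating over all years works the same way
rows_per_year <- sapply(names(dx.split), function(yr) nrow(dx.split[[yr]]))
```

The names of the list are character strings, so a numeric year should be converted with as.character() before it is used inside [[ ]].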
My dataset looks like this:
Year Risk Resource Utilization Band Percent
2014 0 .25
2014 1 .19
2014 2 .17
2014 3 .31
2014 4 .06
2014 5 .01
2015 0 .23
2015 1 .21
2015 2 .19
2015 3 .31
2015 4 .06
2015 5 .31
I am attempting to compare the percentage change from year to year in the dataset I am working with. For example, band 0 decreased by 2 percentage points from 2014 to 2015. So far, I have created a loop that goes through each year and bin and runs the calculation. The issue I am having is that every result is printed with an index of 1, so I have a bunch of repeating 1s next to my calculations. Here is the code I have been using; any help is much appreciated.
Results.data <- data.frame()
head(data)
percent <- 0
baseyear <- 0
nextyear <- 0
bin <- 0
yearPlus1 <-0
bin2 <-0
percent1 <-0
percent2 <-0
percentDif <-0
for(i in 1:nrow(data))
{
percent[i] <- data$PERCENT[i]
baseyear[i] <- as.numeric(data$YEAR_RISK[i])
bin[i] <- as.numeric(data$RESOURCE_UTILIZATION_BAND[i])
#print(percent[i])
#print(baseyear[i])
#print(bin[i])
}
for (k in 1:nrow(data))
{
for (j in 1:nrow(data))
{
yearPlus1 <- as.numeric(baseyear[j])-1
firstYear <- as.numeric(baseyear[k])
bin2 <-bin[j]
bin1 <- bin[k]
percent1 <- as.numeric(percent[k])
percent2 <- as.numeric(percent[j])
if(firstYear==yearPlus1 && bin1==bin2)
{
percentDif <- percent2 - percent1
print(percentDif)
Results.data <- rbind(Results.data, c(percentDif))
}
}
}
If I understand your question, you can use grouping and vectorization to avoid loops. Here's an example using the dplyr package.
The code below first sorts by Year_Risk so that the data are ordered properly by time. Then we group by Resource_Utilization_Band so that we can get results separately for each level of Resource_Utilization_Band. Finally, we calculate the difference in Percent from year to year. The lag function returns the previous value in a sequence. (Instead of lag, we could have done Change = c(NA, diff(Percent)) as well.) All of these operations are chained one after the other using the dplyr chaining operator (%>%).
(Note that when I imported your data, I also changed your column names by adding underscores to make them legal R column names.)
library(dplyr)
# Year-over-year change within each Resource_Utilization_Band
# (Assuming your starting data frame is called "dat")
dat %>% arrange(Year_Risk) %>%
group_by(Resource_Utilization_Band) %>%
mutate(Change = Percent - lag(Percent))
Year_Risk Resource_Utilization_Band Percent Change
1 2014 0 0.25 NA
2 2014 1 0.19 NA
3 2014 2 0.17 NA
4 2014 3 0.31 NA
5 2014 4 0.06 NA
6 2014 5 0.01 NA
7 2015 0 0.23 -0.02
8 2015 1 0.21 0.02
9 2015 2 0.19 0.02
10 2015 3 0.31 0.00
11 2015 4 0.06 0.00
12 2015 5 0.31 0.30
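For readers who prefer not to load dplyr, the c(NA, diff(Percent)) variant mentioned above can be sketched in base R with ave. The data frame below is a shortened, made-up copy of the example with three bands only:

```r
# Shortened example data (three bands, two years)
dat <- data.frame(
  Year_Risk = rep(2014:2015, each = 3),
  Resource_Utilization_Band = rep(0:2, times = 2),
  Percent = c(0.25, 0.19, 0.17, 0.23, 0.21, 0.19)
)

# Sort by year so that diff() compares consecutive years
dat <- dat[order(dat$Year_Risk), ]

# Within each band: NA for the first year, then the change from the year before
dat$Change <- ave(dat$Percent, dat$Resource_Utilization_Band,
                  FUN = function(p) c(NA, diff(p)))
dat
```

ave applies the function to each band separately and returns the results in the original row order, which is the same idea as group_by followed by mutate.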
I want to transform an ordinal variable (0-2), where 0 is no rights, 1 is some rights, and 2 is full rights, into a dichotomous variable.
The original ordinal variable is coded for each country and year (country-year unit).
I want to create a dichotomous variable (let's call it Improvement) capturing all annual positive changes for each country-year. So when it goes from 0 to 1 (or from 0 to 2, or from 1 to 2), I want it to be 1 for that year and country, and zero otherwise.
Below I give an example of what my data look like. "RIGHTS" is the original ordinal variable. "MY DICHOTOMOUS" is the variable I want to calculate in R. How can I do it?
COUNTRY YEAR RIGHTS MY DICHOTOMOUS
A 1990 0 0
A 1991 0 0
A 1992 0 0
A 1993 1 1
A 1994 0 0
B 1990 1 1
B 1991 1 0
B 1992 1 0
B 1993 1 0
B 1994 1 0
Please note that the original data can go the other way as well, i.e. the change can be negative. I do not want to code negative changes in this dichotomous variable.
We can use diff. Testing for > 0 rather than == 1 also catches a jump from 0 to 2:
df1$dichotomous <- +c(FALSE,diff(df1$RIGHTS) > 0)
df1$dichotomous
#[1] 0 0 0 1 0 1 0 0 0 0
This assumes you don't consider starting with a 1 in rights as a 1 in dichotomous:
x <- rights
n <- length(x)
dichotomous <- c(0, as.numeric(x[-1] - x[-n] > 0))
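Both snippets above compare each row with the row before it, even when the previous row belongs to a different country. A per-country version can be sketched with ave. This assumes the rows are sorted by year within each country and, like the snippet above, codes the first year of every country as 0 (so B 1990 becomes 0 rather than 1); the improvement column name is made up:

```r
# Example data from the question
df1 <- data.frame(
  COUNTRY = rep(c("A", "B"), each = 5),
  YEAR = rep(1990:1994, times = 2),
  RIGHTS = c(0, 0, 0, 1, 0,
             1, 1, 1, 1, 1)
)

# Within each country: 0 for the first year, then 1 whenever RIGHTS
# increased compared with the previous year
df1$improvement <- ave(df1$RIGHTS, df1$COUNTRY,
                       FUN = function(r) c(0, as.numeric(diff(r) > 0)))
df1$improvement
# [1] 0 0 0 1 0 0 0 0 0 0
```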
You might have to use a series of ifelse() statements, but then again I might be misreading your question. An example is posted below.
# hard-coded for the sample data: country A copies RIGHTS, B is 1 only in 1990
MY.DATA$MY.DICHOTOMOUS <- with(MY.DATA,ifelse(COUNTRY=="A",RIGHTS,ifelse(COUNTRY=="B"&YEAR==1990,1,0)))