Ifelse statements for a dataframe in R

I am hoping that someone can help me figure out how to write an if-else statement to work on my dataset. I have data on tree growth rates by year. I need to calculate whether growth rates decreased by >50% from one year to the next. I am having trouble applying an ifelse statement to calculate my final field. I am relatively new to R, so my code is probably not very efficient, but here is an example of what I have so far:
For an example dataset,
test<-data.frame(year=c("1990","1991","1992","1993"),value=c(50,25,20,5))
year value
1 1990 50
2 1991 25
3 1992 20
4 1993 5
I then calculate the difference between the current year and previous year's growth ("value"):
test[-1,"diff"]<-test[-1,"value"]-test[-nrow(test),"value"]
year value diff
1 1990 50 NA
2 1991 25 -25
3 1992 20 -5
4 1993 5 -15
and then calculate what 50% of each years' growth would be:
test$chg<-test$value * 0.5
year value diff chg
1 1990 50 NA 25.0
2 1991 25 -25 12.5
3 1992 20 -5 10.0
4 1993 5 -15 2.5
I am then trying to use an ifelse statement to calculate a field "abrupt" that would be "1" when the decline from one year to the next is greater than 50%. This is the code I am trying to use, but I'm not sure how to properly reference the "chg" field from the previous year, because I am getting a warning (copied below):
test$abrupt<-ifelse(test$diff<0 && abs(test$diff)>=test[-nrow(test),"chg"],1,0)
Warning message:
In abs(test$diff) >= test[-nrow(test), "chg"] :
longer object length is not a multiple of shorter object length
> test
year value diff chg abrupt
1 1990 50 NA 25.0 NA
2 1991 25 -25 12.5 NA
3 1992 20 -5 10.0 NA
4 1993 5 -15 2.5 NA
A test of a similar ifelse statement worked when I just assigned a few numbers, but I'm not sure how to get this to work in the context of a dataframe. Here is an example of it working on just a few values:
prevyear<-50
curryear<-25
chg<-prevyear*0.5
> chg
[1] 25
> diff<-curryear-prevyear
> diff
[1] -25
> abrupt<-ifelse(diff<0 && abs(diff)>= chg,1,0)
> abrupt
[1] 1
If anyone could help me figure out how to apply a similar ifelse statement to my dataframe I would greatly appreciate it! Thank you for any help you can provide.
thank you,
Katie

It's throwing a warning because the two vectors being compared, abs(test$diff) >= test[-nrow(test),"chg"], have different lengths. Also, for the logical AND you are using && (which returns only a single TRUE or FALSE) when you should be using & (which is vectorized: it operates element-wise on two vectors and returns a vector of the same length). Try this:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)

I would change where you're putting chg so that it lines up with the diff you want to compare it to:
test$chg[2:nrow(test)] <- test$value[1:(nrow(test)-1)] * 0.5
Then, correct your logical operator like Blue Magister said:
test$abrupt<-ifelse(test$diff<0 & abs(test$diff)>=test$chg,1,0)
and you have your results:
year value diff chg abrupt
1 1990 50 NA NA NA
2 1991 25 -25 25.0 1
3 1992 20 -5 12.5 0
4 1993 5 -15 10.0 1
Also, you may find the function diff helpful: rather than doing this:
test[-1,"value"]-test[-nrow(test),"value"]
you can just do
diff(test$value)
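Putting the pieces together, here is a minimal base-R sketch of the whole calculation on the same test data, with the previous year's chg aligned against diff as above:
test$diff <- c(NA, diff(test$value))               # change from the previous year
test$chg <- c(NA, test$value[-nrow(test)] * 0.5)   # 50% of the previous year's value
test$abrupt <- ifelse(test$diff < 0 & abs(test$diff) >= test$chg, 1, 0)
This reproduces the abrupt column shown above (NA, 1, 0, 1).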

Related

R: Interpolation of values for NAs by indices/groups when first or last values aren't given

I have panel data with county-level values for 15 years of different economic measures (for which I have created an index). There are missing values that I would like to interpolate. However, because the values are randomly missing by year, linear interpolation doesn't work: it only gives me interpolated values between the first and last non-missing data points. This is a problem because I need interpolated values for the entire series.
Since all of the series have more than 5 data points, is there any code out there that would interpolate the series based on data that already exists within the specific series?
I first thought about indexing my data and running a loop, but then I found code for linear interpolation by groups. While the latter solved some of the NAs, it did not interpolate all of them. Here is an example of my data where some of the values get interpolated but not all.
library(dplyr)
data <- read.csv(text="
index,year,value
1,2001,20864.135
1,2002,20753.867
1,2003,NA
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,NA
1,2008,NA
1,2009,9021.556
1,2010,NA
1,2011,NA
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,NA
2,2013,128.646
2,2014,NA
2,2015,NA")
Using
interpolation <- data %>%
  group_by(index) %>%
  mutate(valueIpol = approx(year, value, year,
                            method = "linear", rule = 1, f = 0, ties = mean)$y)
I get the following interpolated values.
1,2001,20864.135
1,2002,20753.867
1,2003,19231.046
1,2004,17708.224
1,2005,12483.767
1,2006,12896.251
1,2007,11604.686
1,2008,10313.121
1,2009,9021.556
1,2010,10612.955
1,2011,12204.353
1,2012,13795.752
1,2013,16663.741
1,2014,19349.992
1,2015,NA
2,2001,NA
2,2002,NA
2,2003,NA
2,2004,NA
2,2005,NA
2,2006,NA
2,2007,NA
2,2008,151.108
2,2009,107.205
2,2010,90.869
2,2011,104.142
2,2012,116.394
2,2013,128.646
2,2014,NA
2,2015,NA
Any help would be appreciated. I'm pretty new to R and have never worked with loops, but I have looked up other "interpolation by groups" answers. Nothing seems to solve the issue of filling in data when the first and last points are NAs.
Maybe this could help:
library(imputeTS)
for (i in unique(data$index)) {
  data[data$index == i, ] <- na.interpolation(data[data$index == i, ])
}
This only works when the groups themselves are already ordered by year (which is the case in your example).
Output would look like this:
> data
index year value
1 1 2001 20864.135
2 1 2002 20753.867
3 1 2003 19231.046
4 1 2004 17708.224
5 1 2005 12483.767
6 1 2006 12896.251
7 1 2007 11604.686
8 1 2008 10313.121
9 1 2009 9021.556
10 1 2010 10612.955
11 1 2011 12204.353
12 1 2012 13795.752
13 1 2013 16663.741
14 1 2014 19349.992
15 1 2015 19349.992
16 2 2001 151.108
17 2 2002 151.108
18 2 2003 151.108
19 2 2004 151.108
20 2 2005 151.108
21 2 2006 151.108
22 2 2007 151.108
23 2 2008 151.108
24 2 2009 107.205
25 2 2010 90.869
26 2 2011 104.142
27 2 2012 116.394
28 2 2013 128.646
29 2 2014 128.646
30 2 2015 128.646
Since the na.interpolation function uses approx internally, you can pass parameters of approx through to adjust the behavior.
The parameters you used in your example (method = "linear", rule = 1, f = 0, ties = mean) are the defaults, so if you want these you don't have to add anything.
Otherwise you would change the line inside the loop, for example to this:
data[data$index == i,] <- na.interpolation(data[data$index == i,], ties ="ordered", f = 1, rule = 2)
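If you prefer to avoid the loop, a roughly equivalent dplyr sketch (assuming, as above, that each group is ordered by year and using the same na.interpolation defaults) would be:
library(dplyr)
library(imputeTS)

data <- data %>%
  arrange(index, year) %>%                      # make sure each group is ordered by year
  group_by(index) %>%
  mutate(value = na.interpolation(value)) %>%   # interpolate within each index separately
  ungroup()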

R: tapply(x,y,sum) returns NA instead of 0

I have a data set that contains occurrences of events over multiple years, regions, quarters, and types. Sample:
REGION Prov Year Quarter Type Hit Miss
xxx yy 2008 4 Snow 1 0
xxx yy 2009 2 Rain 0 1
I have variables defined to examine the columns of interest:
syno.h <- data$Hit
quarter.number <- data$Quarter
syno.wrng <- data$Type
I wanted to get the number of Hits per type and quarter for all of the data. Given that the Hits are either 0 or 1, a simple sum() with tapply was my first attempt.
tapply(syno.h, list(syno.wrng, quarter.number), sum)
this returned:
1 2 3 4
ARCO NA NA NA 0
BLSN 0 NA 15 74
BLZD 4 NA 17 54
FZDZ NA NA 0 1
FZRA 26 0 143 194
RAIN 106 126 137 124
SNOW 43 2 215 381
SNSQ 0 NA 18 53
WATCHSNSQ NA NA NA 0
WATCHWSTM 0 NA NA NA
WCHL NA NA NA 1
WIND 47 38 155 167
WIND-SUETES 27 6 37 56
WIND-WRECK 34 14 44 58
WTSM 0 1 7 18
For some of the types that have no occurrences in a given quarter, tapply returns NA instead of zero. I have checked the data a number of times, and I am confident that it is clean. The values that aren't NA are also correct.
If I check the type/quarter combinations that return NA from tapply using just sum(), I get the values I expect:
sum(syno.h[quarter.number==3&syno.wrng=="BLSN"])
[1] 15
> sum(syno.h[quarter.number==1&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="BLSN"])
[1] 0
> sum(syno.h[quarter.number==2&syno.wrng=="ARCO"])
[1] 0
It seems that my issue is with how I use tapply with sum, and not with the data itself.
Does anyone have any suggestions on what the issue may be?
Thanks in advance
I have two potential solutions for you, depending on exactly what you are looking for. If you are just interested in the number of positive Hits per Type and Quarter and don't need a record of combinations with no Hits, you can get an answer with
aggregate(data[["Hit"]], by = data[c("Type","Quarter")], FUN = sum)
If it is important to keep a record of the ones where there are no hits as well, you can use
dataHit <- data[data[["Hit"]] == 1, ]
# keep all Type/Quarter levels from the full data so zero-hit combinations still appear
dataHit[["Type"]] <- factor(dataHit[["Type"]], levels = sort(unique(data[["Type"]])))
dataHit[["Quarter"]] <- factor(dataHit[["Quarter"]], levels = sort(unique(data[["Quarter"]])))
table(dataHit[["Type"]], dataHit[["Quarter"]])
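As a footnote on the NA behaviour: tapply leaves a cell NA when no rows at all fall into that Type/Quarter combination, because sum is never called for it (cells that do have rows but only zero Hits correctly show 0). If you simply want zeros in the empty cells, one alternative sketch, assuming Hit is a numeric 0/1 column, is xtabs, which sums the left-hand variable over the cross-classification and fills empty cells with 0:
# sum of Hit by Type and Quarter; combinations with no rows show 0 rather than NA
xtabs(Hit ~ Type + Quarter, data = data)
In R 3.4.0 and later, tapply also accepts a default argument, so tapply(syno.h, list(syno.wrng, quarter.number), sum, default = 0) should give the same result.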

Removing certain values from the dataframe in R

I am not sure how to do this, but I need to cluster this dataframe mydf, and I want to omit the Inf (infinite) values and the values greater than 50. I need a table that has no Inf values and no values greater than 50. How can I get such a table (maybe by nullifying those cells)? The clustering part itself is not a problem, because I can do it with the mfuzz package. So the only problem I have is that I want to scale the cluster within a 0-50 margin.
mydf
s.no A B C
1 Inf Inf 999.9
2 0.43 30 23
3 34 22 233
4 3 43 45
You can use NA, the built-in missing data indicator in R:
?NA
By doing this:
mydf[mydf > 50 | mydf == Inf] <- NA
mydf
s.no A B C
1 1 NA NA NA
2 2 0.43 30 23
3 3 34.00 22 NA
4 4 3.00 43 45
Anything you do downstream in R should have NA-handling methods, even if it's just na.omit.
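For instance, a couple of common base-R options on the modified mydf (just a sketch):
na.omit(mydf)                  # drop every row that still contains an NA
mydf[complete.cases(mydf), ]   # equivalent row filter, a bit more explicit
colMeans(mydf, na.rm = TRUE)   # or keep the NAs and skip them per calculation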

Populate a column with forecasts of panel data using data.table in R

I have panel data with "entity" and "year". I have a column "x" with values that I treat as a time series. I want to create a new column "xp" where, for each "entity" and each "year", I give the value obtained by forecasting from the previous 5 years. If there are fewer than 5 previous values available, xp = NA.
For the sake of generality, the forecast is the output of a function built in R from a couple of predefined functions found in packages like "forecast". If it is easier with a specific function, let's use forecast(auto.arima(x.L5:x.L1), h = 1).
For now, I use data.table in R because it is so much faster for all the other manipulations I make on my dataset.
However, what I want to do is not data.table 101 and I struggle with it.
I would so much appreciate a bit of your time to help me on that.
Thanks.
Here is an extract of what i would like to do:
entity year x xp
1 1980 21 NA
1 1981 23 NA
1 1982 32 NA
1 1983 36 NA
1 1984 38 NA
1 1985 45 42.3 =f((21,23,32,36,38))
1 1986 50 48.6 =f((23,32,36,38,45))
2 1991 2 NA
2 1992 4 NA
2 1993 6 NA
2 1994 8 NA
2 1995 10 NA
2 1996 12 12.4 =f((2,4,6,8,10))
2 1997 14 13.9 =f((4,6,8,10,12))
...
As suggested by Eddi, I found a way using rollapply:
DT <- data.table(mydata)
DT <- DT[order(entity,year)]
DT[, xp := rollapply(.SD$x, 5, timeseries, align = "right", fill = NA, by = "entity")]
with:
timeseries <- function(x) {
  fit <- auto.arima(x)
  value <- as.data.frame(forecast(fit, h = 1))[1, 1]
  return(value)
}
For a sample of mydata, it works perfectly. However, when I use the whole dataset (150k lines), after some computing time, I get the following error message:
Error in seq.default(start.at,NROW(data),by = by) : wrong sign in 'by' argument
Where does it come from?
Can it come from the "5" parameter in rollapply and from the specifics of certain entities in the dataset (not enough data...)?
Thanks again for your time and help.
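One thing to note: in zoo::rollapply the by argument is a numeric step (evaluate every by-th point), not a grouping variable, so the grouping normally has to go through data.table's own by= instead. A sketch of how the grouped rolling forecast could be written, assuming the timeseries function above and that xp in year t should use the five years before t (hence the shift):
library(data.table)
library(zoo)        # rollapplyr
library(forecast)   # auto.arima, forecast

DT <- data.table(mydata)
setorder(DT, entity, year)

# for each entity, forecast from the 5 preceding values of x; shift() moves the
# rolling result down one row so the window ends in year t-1, and entities with
# fewer than 6 years are left as NA
DT[, xp := if (.N >= 6) shift(rollapplyr(x, 5, timeseries, fill = NA)) else NA_real_,
   by = entity]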

R - Combining multiple columns together within a data frame, while keeping connected data

So I've looked quite a lot for an answer to this question, but I can't find an answer that satisfies my needs or my understanding of R.
First, here's some code to just give you an idea of what my data set looks like
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28, H1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
H2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6))
> df
Year Subdiv H1 H2
1 1991 24 31.20 53.9
2 1992 25 34.00 121.5
3 1993 26 70.20 16.9
4 1994 27 19.80 11.9
5 1995 28 433.70 114.6
6 1996 24 126.34 129.9
7 1997 25 178.39 221.1
8 1998 26 30.40 433.4
9 1999 27 56.90 319.2
10 2000 28 818.30 52.6
So what I've got here is a data set containing the abundance of herring of different ages in different areas ("Subdiv") over time. H1 stands for herring at age 1. My real data set contains more ages as well as more areas (and additional species of fish).
What I would like to do is combine the abundance of different ages into one column while keeping the connected data (Year, Subdiv) as well as creating a new column for Age.
Like so:
Year Subdiv Abun Age
1 1991 24 31.20 1
2 1992 25 34.00 1
3 1993 26 70.20 1
4 1994 27 19.80 1
5 1995 28 433.70 1
6 1991 24 53.9 2
7 1992 25 121.5 2
8 1993 26 16.9 2
9 1994 27 11.9 2
10 1995 28 114.6 2
Note: Yes, I removed some rows, but only to not crowd the screen
I hope this is enough information to make clear what I need so that someone can help.
Since I have more species of fish, if someone would like to include a description for adding a Species column as well, that would be helpful.
Here's code for the same data, just duplicated for sprat (Sn):
df <- data.frame("Year" = 1991:2000, "Subdiv" = 24:28, H1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
H2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6),
S1 = c(31.2,34,70.2,19.8,433.7,126.34,178.39,30.4,56.9,818.3),
S2 = c(53.9,121.5,16.9,11.9,114.6,129.9,221.1,433.4,319.2,52.6))
Cheers!
I don't think the tags of this question are unrelated, but if you don't find them fitting for my question, go ahead and change them.
This is a typical reshape-then-supplement task, so you can:
1) 'Melt' your data with reshape2
library("reshape2")
df.m<-melt(df,id.vars=c("Year","Subdiv"))
2) Then add additional columns based on the variable column that holds your previous df's column names
library("stringr")
df.m$Fish<-str_extract(df.m$variable,"[A-Z]")
df.m$Age<-str_extract(df.m$variable,"[0-9]")
I recommend you look up the reshape functions, as these are very commonly required and learning them will save you lots of time in the future:
http://www.statmethods.net/management/reshape.html
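As a small follow-up to match the column names in the question, you could rename the melted value column and drop the now-redundant variable column:
names(df.m)[names(df.m) == "value"] <- "Abun"   # the abundance column asked for
df.m$variable <- NULL                           # its information now lives in Fish and Age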
I think the basic data.frame function will do exactly what you want. Try something like:
data.frame(Year = df$Year, Subdiv = df$Subdiv, Abun = c(df$H1, df$H2),
           Age = rep(c(1, 2), each = nrow(df)))
So I'm concatenating the values you want in the abundance column, and creating a new column that is just the ages replicated for each row. You can create a similar species column easily.
Hope that helps!
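For the four-column sprat version of df, the same base-R idea could be extended like this (a sketch; the species labels are only illustrative):
data.frame(Year    = df$Year,
           Subdiv  = df$Subdiv,
           Abun    = c(df$H1, df$H2, df$S1, df$S2),
           Age     = rep(c(1, 2, 1, 2), each = nrow(df)),
           Species = rep(c("Herring", "Herring", "Sprat", "Sprat"), each = nrow(df)))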
