NA/NaN Error in R

I have just started using R and have a somewhat complex question. I have a data frame called "data" in which each individual is assigned a PID number. I want to write a loop to find the closest of two dates (SampleDate and LTROT.Date), since there are multiple sample dates for each LTROT.Date. When running this code I keep getting "Error in start.of.PID:end.of.PID : NA/NaN argument". The data is confidential, so I can't provide it. I am new to Stack Overflow, so I apologize if my question doesn't meet some of the guidelines.
unique <- unique(data$PID)
z <- 1
end.of.PID <- 0
max <- 100000000
sample.ideal <- vector(length = 58)
for (i in unique) {
  start.of.PID <- (end.of.PID + 1)
  multi <- sum(unique[i] == data$PID)
  end.of.PID <- (start.of.PID + multi) - 1
  for (j in start.of.PID:end.of.PID) {
    Sample.Date <- as.Date(data$SampleDate)
    LTROT.Date <- as.Date(data$LTROT.Date)
    time <- Sample.Date[j] - LTROT.Date[j]
    if (time < max) {
      max <- time
      sample.ID <- data$SampleID[j]
    } else {
      max <- max
    }
    sample.ideal[z] <- sample.ID
    z <- z + 1
  }
}

Mistake in OP's code:

unique <- unique(data$PID)
......
for (i in unique) {  # i is an item of the "unique" vector, not an index
  start.of.PID <- (end.of.PID + 1)
  multi <- sum(unique[i] == data$PID)  # here i has been used as an index
  # The above line should be written as:
  multi <- sum(i == data$PID)
The OP did not provide sample data with the question, but based on the logic of the for loop a dplyr-based solution looks like the easier option. The result can be obtained by a self-join followed by a filter for the record having the minimum date difference. The query can be written as:
library(dplyr)

data %>%
  mutate(SampleDate = as.Date(SampleDate), LTROT.Date = as.Date(LTROT.Date)) %>%
  inner_join(., ., by = "PID") %>%
  group_by(PID) %>%
  mutate(MinDateDiff = (SampleDate.x - LTROT.Date.y)) %>%
  filter(MinDateDiff == min(MinDateDiff)) %>%
  select(PID, SampleDate = SampleDate.x, LTROT.Date = LTROT.Date.y, MinDateDiff)
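Since the real data is confidential, here is a quick sketch of that pipeline on invented data. The PIDs and dates below, the abs() suggestion, and the trailing distinct() are my own assumptions for illustration, not part of the original answer:

library(dplyr)

# Invented stand-in for the confidential "data" frame
data <- data.frame(
  PID        = c(1, 1, 2, 2),
  SampleID   = c("s1", "s2", "s3", "s4"),
  SampleDate = c("2020-01-05", "2020-01-20", "2020-03-03", "2020-03-30"),
  LTROT.Date = c("2020-01-01", "2020-01-01", "2020-03-01", "2020-03-01")
)

data %>%
  mutate(SampleDate = as.Date(SampleDate), LTROT.Date = as.Date(LTROT.Date)) %>%
  inner_join(., ., by = "PID") %>%
  group_by(PID) %>%
  # wrap the difference in abs() if "closest" should ignore whether the sample
  # falls before or after the LTROT date
  mutate(MinDateDiff = SampleDate.x - LTROT.Date.y) %>%
  filter(MinDateDiff == min(MinDateDiff)) %>%
  select(PID, SampleDate = SampleDate.x, LTROT.Date = LTROT.Date.y, MinDateDiff) %>%
  distinct()  # the self-join duplicates pairings when LTROT.Date repeats within a PID

This keeps, for each PID, the pairing with the minimum date difference.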

Related

Adding a column to a data frame by calculating each value to be added

Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it more clearly.
My problem is that I want to add a column to a data frame of daily returns of a stock. Let's say the returns are normally distributed, and I would like to add a column containing the historical value at risk, computed by a function I wrote myself.
The restriction is that the function should be applied to each observation together with the 249 observations before it.
So when the next observation is calculated, it should again take only the 249 observations of the days before; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df <- data.frame(Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
                 Returns = rnorm(500))

# function
VaR.hist <- function(x, n = 250, hd = 20, q = 0.05) {
  width <- nrow(x)
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width - 249):width], probs = q)
  VaR <- quantil * sqrt(hd) %>%
    return()
}

# Run the function on the data frame
df$VaR <- df$Returns %>% VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error that I get instead of my new variable...
Thanks!
As wibom wrote in the comments, nrow(x) does not work for vectors. What you need is length() instead. Also, you do not need return() in the last line, as R automatically returns the value of the last expression in a function when there is no earlier return().
library(dplyr)

df <- data.frame(Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
                 Returns = rnorm(500))

# function
VaR.hist <- function(x, n = 250, hd = 20, q = 0.05) {
  width <- length(x)  # length() is needed here, as x is a vector; nrow() only works for data.frames/matrices
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width - 249):width], probs = q)
  quantil * sqrt(hd)
}

# Run the function on the data frame
df$VaR <- df$Returns %>% VaR.hist()
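To see why the original version failed: nrow() is only defined for two-dimensional objects, so on a plain vector it returns NULL, which makes the index range an empty sequence and triggers the "argument of length 0" error:

x <- rnorm(500)
nrow(x)    # NULL, because x is a vector, not a data.frame or matrix
length(x)  # 500
# with width <- nrow(x), (width - 249):width becomes (NULL - 249):NULL,
# which raises "Error in (width - 249):width : argument of length 0"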
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated from the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?
library(tidyverse)

set.seed(42)
df <- tibble(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)

the_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata %>%
    filter(ridx <= i, ridx > i - 249) %>%
    pull(Returns)
  quantil <- quantile(r, probs = q)
  VaR <- quantil * sqrt(hd)
}

df <- df %>%
  mutate(ridx = row_number()) %>%
  mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)

a_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
  quantil <- quantile(r, probs = q)
  VaR <- quantil * sqrt(hd)
}

df$ridx <- 1:nrow(df)  # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
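If you would rather not manage the row index yourself, the same rolling computation can be sketched with zoo::rollapplyr(). The zoo package, the 250-observation width, and partial = TRUE (which evaluates the shorter leading windows instead of returning NA) are my additions, not part of the answers above:

library(zoo)

set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)

# right-aligned window: each value is computed from the current day and the 249 before it
df$VaR <- rollapplyr(df$Returns, width = 250, partial = TRUE,
                     FUN = function(r, hd = 20, q = 0.05) quantile(r, probs = q) * sqrt(hd))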

How do I reference the number of rows in a group in dplyr?

I'm trying to write a function to use with dplyr that uses the number of rows in the group. Is there any way to reference the number of rows in the group in dplyr, other than just creating a new column? This would be equivalent to the .N variable in data.table.
Here's an example of what I'm trying to do:
library(dplyr)
library(RcppRoll)
# Function I'm trying to create
rollingMean <- function(x, n = 4) {
  if (.N < n) {  # I want to test whether we have fewer than 4 rows
    out <- mean(x)  # if so, return the overall mean
  } else {
    out <- roll_meanr(x, n)
  }
  return(out)
}
# Fake data
tmp <- data.frame(X = 1:21, grouping = c(rep(letters[1:2], 10), letters[3]))

tmp %>%
  group_by(grouping) %>%
  mutate(ma = rollingMean(X)) %>%
  tail  # Of course, this doesn't work, but the value of ma for the last row should be 21
This seems like it would be fairly simple to do. Does anyone know how to do it?
I think the test in rollingMean just needs to be
if (length(x) < n)
There is an ?n function in dplyr, but it's special --
... can only be used from within ‘summarise’, ‘mutate’ and ‘filter’ ...
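Putting that together, a corrected version of the function might look like this. length(x) gives the group size because mutate() passes each group's column in as a plain vector (defaults kept from the question):

library(dplyr)
library(RcppRoll)

rollingMean <- function(x, n = 4) {
  if (length(x) < n) {  # fewer than n observations in the group
    mean(x)             # fall back to the overall mean (recycled across the group)
  } else {
    roll_meanr(x, n)    # right-aligned rolling mean otherwise
  }
}

tmp <- data.frame(X = 1:21, grouping = c(rep(letters[1:2], 10), letters[3]))
tmp %>%
  group_by(grouping) %>%
  mutate(ma = rollingMean(X)) %>%
  tail()  # the last row (group "c", a single observation) now gets ma = 21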

How can I convert these nested loops into an R loop function like sapply or tapply or

I have this code which I run over a data frame t.
for (i in years) {
  for (j in type) {
    x <- rbind(x, cbind(i, j,
                        sum(t[(t$year == i) & (t$type == j), ]$Emissions,
                            na.rm = TRUE)))
  }
}
Basically, I have two vectors years and type. I'm finding the sum of each category and merging that into a data frame. The above code works, but I cannot figure out how to use one of the loop functions.
Yes, there are ways to do this using the apply functions. I'm going to suggest a high performance approach using dplyr, though.
library(dplyr)
x <- t %>%
  group_by(year, type) %>%
  summarize(SumEmmissions = sum(Emissions, na.rm = TRUE))
I think you will find that it is much faster than either a loop or apply approach.
=================== Proof, as requested ===============
library(dplyr)

N <- 1000000
Nyear <- 50
Ntype <- 40
myt <- data.frame(year = sample.int(50, N, replace = TRUE),
                  type = sample.int(4, N, replace = TRUE),
                  Emissions = rnorm(N))
years <- 1:Nyear
type <- 1:Ntype

v1 <- function() {
  x <- myt %>%
    group_by(year, type) %>%
    summarize(SumEmmissions = sum(Emissions, na.rm = TRUE))
}

v2 <- function() {
  x <- data.frame()
  for (i in years) {
    for (j in type) {
      x <- rbind(x, cbind(i, j,
                          sum(myt[(myt$year == i) & (myt$type == j), ]$Emissions, na.rm = TRUE)))
    }
  }
}

v3 <- function() {
  t0 <- myt[myt$year %in% years & myt$type %in% type, ]
  x <- aggregate(Emissions ~ year + type, t0, sum, na.rm = TRUE)
}
system.time(v1())
   user  system elapsed
  0.051   0.000   0.051
system.time(v2())
   user  system elapsed
176.482   0.402 177.231
system.time(v3())
   user  system elapsed
  7.758   0.011   7.783
As the sizes and number of groups increase, so does the performance spread.
Pick out all rows for which year is in years and type is in type, giving t0. Then aggregate Emissions by year and type.
t0 <- t[t$year %in% years & t$type %in% type, ]
aggregate(Emissions ~ year + type, t0, sum, na.rm = TRUE)
If the years and type vectors contain all years and types, then the first line could be omitted and t0 in the second line replaced with t.
Next time please make your example reproducible.
Update: Some corrections.
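Since the question's t, years, and type are not shown, here is a sketch of the same two lines on invented data using those names (all values are made up):

set.seed(1)
t <- data.frame(year = sample(2000:2004, 100, replace = TRUE),  # "t" shadows base::t; kept only to mirror the question
                type = sample(1:3, 100, replace = TRUE),
                Emissions = rnorm(100))
years <- 2000:2004
type <- 1:3

t0 <- t[t$year %in% years & t$type %in% type, ]
aggregate(Emissions ~ year + type, t0, sum, na.rm = TRUE)  # one row per (year, type) with summed Emissions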

Finding maximum data with corresponding value from other column in R

I have daily data (4011 days) together with an indicator (1 = weekday, 2 = weekend). I want to find weekly maxima with the corresponding indicator. For example, let's say the data is:
mydat <- matrix(c(0.027, 0.034, 0.019, 0.021, 0.026, 0.024, 0.058, 0.026, 0.064,
                  0.066, 0.026, 0.101, 0.069, 0.054,
                  rep(2, 2), rep(1, 5), rep(2, 2), rep(1, 5)), ncol = 2)
I have tried the following code. I manage to get the maxima per sequence (in this case, weekly maxima), but I don't want the maximum of the indicator column as well. Here is the code:
week.max <- function(vec) {
  if (length(vec[is.na(vec)]) == 7) {
    return(NA)
  } else {
    return(max(vec, na.rm = TRUE))
  }
}

max.week.dat <- apply(mydat, 2, function(x)
  tapply(x, rep(1:(length(x)/7), each = 7, len = length(x)), week.max))
and the result:
matrix(c(0.058,0.101,2,2),ncol=2)
I want the output like this:
matrix(c(0.058,0.101,1,1),ncol=2)
Many thanks in advance.
Here is the data (with an extra day in the third week)
mydat <- data.frame(value = c(0.027, 0.034, 0.019, 0.021, 0.026, 0.024, 0.058, 0.026, 0.064,
                              0.066, 0.026, 0.101, 0.069, 0.054, 0.95),
                    ind = c(rep(2, 2), rep(1, 5), rep(2, 2), rep(1, 5), 2))
Your function
week.max <- function(vec) {
  if (length(vec[is.na(vec)]) == 7) {
    return(NA)
  } else {
    return(max(vec, na.rm = TRUE))
  }
}
Add the week information
mydat$week <- c(rep(1:2,each=7),3)
Then subset each week down to the row matching the weekly maximum (the same subset-by-group idiom):
library(plyr)
ddply(mydat, .(week), subset, subset = value == week.max(value), select = -week)
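On the data above, the result should be one row per week, carrying the week's maximum value together with that row's own indicator:

ddply(mydat, .(week), subset, subset = value == week.max(value), select = -week)
#   week value ind
# 1    1 0.058   1
# 2    2 0.101   1
# 3    3 0.950   2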

Subsetting data by condition

I am trying to reshape/reduce my data. So far I employ a for loop (very slow), but from what I can tell this should be quite fast with plyr.
I have many groups (firms, as a factor in the dataset), and I want to drop entirely every firm that shows a 0 entry for value in any of that firm's cells. I thus create a new data.frame but leave out all groups showing 0 for value at some point.
The for loop:
Data Creation:
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))
-----------------------------
splitby <- mydf$firmname
new.data <- data.frame()
for (i in 1:(length(unique(splitby)))) {
  temp <- subset(mydf, splitby == as.character(paste(unique(splitby)[i])))
  if (all(temp$value > 0) == "TRUE") {
    new.data <- rbind(new.data, temp)
  }
}
Delete all empty firm factors
new.data$splitby <- factor(new.data$splitby)
Is there a way to achieve that with the plyr package? Can the subset function be used in that context?
EDIT: For the purpose of reproducing the problem, data creation, as suggested by BenBarnes, has been added. Ben, thanks a lot for that. Furthermore, my code has been altered to comply with the answers provided below.
You could supply an anonymous function to the .fun argument in ddply():
set.seed(1)
mydf <- data.frame(firmname = sample(LETTERS[1:5], 40, replace = TRUE),
                   value = rpois(40, 2))

library(plyr)
ddply(mydf, .(firmname), function(x) if (any(x$value == 0)) NULL else x)
Or using [, as suggested by Andrie:
firms0 <- unique(mydf$firmname[which(mydf$value == 0)])
mydf[-which(mydf$firmname %in% firms0), ]
Note that the results of ddply are sorted according to firmname.
EDIT
For the example in your comments, this approach is again faster than using ddply() to subset, selecting only firms with more than three entries:
firmTable <- table(mydf$firmname)
firmsGT3 <- names(firmTable)[firmTable > 3]
mydf[mydf$firmname %in% firmsGT3, ]
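For comparison, the original drop-firms-with-a-zero task can also be written as a base-R one-liner with ave(); this alternative is my addition, not part of the original answer:

# within each firm, all(v > 0) is computed once and recycled across the firm's rows (as 1/0)
keep <- ave(mydf$value, mydf$firmname, FUN = function(v) all(v > 0)) == 1
mydf[keep, ]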
