Replace NA with mean value for specific day and grid - r

I am using a large dataset and I am not used to using one this big (286,212 rows, 19 columns) and I am not sure how to go about my problem. the data is made up of values for each day of the year for 782 grid references and I have this for 15 years. It looks as follows
**Month Day Grid x2004 x2005 x2006 x2007**
1 1 A10 0.091 0.134 NA 0.066
1 2 A10 0.12 0.10 0.23 0.054
1 3 A10 0.55 NA NA 0.08
1 1 B10 NA 0.134 NA 0.17
1 2 B10 0.14 0.151 NA 0.21
1 3 B10 0.43 0.162 0.24 NA
However some of the days are missing and I want to insert the mean of that day for that specific grid using values from the other years. So if the Grid A10 for day 1 in 2006 is missing. I want to insert the mean for day 1 grid A10 from 2004, 2005, 2007, in this case 0.097.
I am trying the following code
ind <- which(is.na(data$x2005))
data$x2005[ind] <- sapply(ind, function(i)
with(data, rowMeans(data[c(data$x2004[i], data$x2006[i], data$x2007[i], data$x2008[i], data$x2009[i],
data$x2010[i], data$x2011[i], data$x2012[i],
data$x2013[i], data$x2014[i], data$x2015[i],
data$x2016[i], data$x2017[i]),], na.rm=TRUE)))
and I plan to do that for all years but it is telling me
"Error in rowMeans(data[c(data$x2006[i], data$x2007[i], data$x2012[i]), :
'x' must be numeric"
Although when I check class, it says that they are all numeric, so I am not sure why x is not numeric. I also don't know if even when i get the mean part sorted, if the code will work so that I am getting the mean specific to each grid and day.
Please Help. Thanks

Can you adapt this to your code:
for(i in 1:ncol(data)){
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}

Related

Mutate group of columns based on values in other group of columns

I am trying to convert values in one column to NA based on if the values in another corresponding column are NA. I need to do this for two large groups of corresponding columns so I cannot mutate each column one by one.
For example, below, 2002 inflationNext2Years turns to NA since 2002 realReturnNext2Years is NA.
year <- c(2000, 2001, 2002)
realReturnNext1Years <- c(.1,.2,.3)
realReturnNext2Years <- c(.15,.25, NA)
realReturnNext3Years <- c(.45, NA, NA)
inflationNext1Years <- c(.02, .03, .07)
inflationNext2Years <- c(.03, .05, .08)
inflationNext3Years <- c(.04, .06, .09)
data <- data.frame(year, realReturnNext1Years, realReturnNext2Years, realReturnNext3Years, inflationNext1Years, inflationNext2Years, inflationNext3Years)
data
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
1 2000 0.1 0.15 0.45 0.02 0.03 0.04
2 2001 0.2 0.25 NA 0.03 0.05 0.06
3 2002 0.3 NA NA 0.07 0.08 0.09
I am trying to covert data into:
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
2000 0.1 0.15 0.45 0.02 0.03 0.04
2001 0.2 0.25 NA 0.03 0.05 NA
2002 0.3 NA NA 0.07 NA NA
Since I have many columns, I cannot do this one column at a time. I tried to use mutate_at with an ifelse() but was not sure how to test if the number of years lined up.
I have a vector of the realReturn column names and another vector of the inflation column names. I am trying to change the inflation columns to NA if their corresponding realReturnColumn is NA, but keep the inflation column the same if the realReturnColumn is not NA.
We can collect indices of "realReturnNext" columns using grep, get the position of their NA's and replace the corresponding positions in "inflationNext" cols to NA's
real_cols <- grep("^realReturnNext", colnames(data))
inflation_cols <- grep("^inflationNext", colnames(data))
data[inflation_cols][is.na(data[real_cols])] <- NA
# year realReturnNext1Years realReturnNext2Years realReturnNext3Years
#1 2000 0.1 0.15 0.45
#2 2001 0.2 0.25 NA
#3 2002 0.3 NA NA
# inflationNext1Years inflationNext2Years inflationNext3Years
#1 0.02 0.03 0.04
#2 0.03 0.05 NA
#3 0.07 NA NA

Transformation of missing values by taking log(x+1)

I am trying to learn R and I have a data frame which contains 68 continuous and categorical variables. There are two variables -> x and lnx, on which I need help. Corresponding to a large number of 0's & NA's in x, lnx shows NA. Now, I want to write a code through which I can take log(x+1) in order to replace those NA's in lnx to 0, where corresponding x is also 0 (if x == 0, then I want only lnx == 0, if x == NA, I want lnx == NA). Data frame looks something like this -
a b c d e f x lnx
AB1001 1.00 3.00 67.00 13.90 2.63 1776.7 7.48
AB1002 0.00 2.00 72.00 38.70 3.66 0.00 NA
AB1003 1.00 3.00 48.00 4.15 1.42 1917 7.56
AB1004 0.00 1.00 70.00 34.80 3.55 NA NA
AB1005 1.00 1.00 34.00 3.45 1.24 3165.45 8.06
AB1006 1.00 1.00 14.00 7.30 1.99 NA NA
AB1007 0.00 3.00 53.00 11.20 2.42 0.00 NA
I tried writing the following code -
data.frame$lnx[is.na(data.frame$lnx)] <- log(data.frame$x +1)
but I get the following warning message and the output is wrong:
number of items to replace is not a multiple of replacement length. Can someone guide me please.
Thanks.
In R you can select rows using conditionals and assign values directly. In you example you could do this:
df[is.na(df$lnx) & df$x == 0,'lnx'] <- 0
Here's what this does:
is.na(df$lnx) returns a logical vector the length of df$lnx telling, for each row, whether lnx is NA. df$x == 0 does the same thing, checking whether, for each row, x == 0. By using the & operator, we combine those vectors into one that contains TRUE only for rows where both conditions are TRUE.
We then use the bracket notation to select the lnx column of those rows where both conditions are TRUE in df and then insert the value 0 into those cells using <-
The specific error your getting is because log(data.frame$x +1) and df$lnx[is.na(df$lnx)] are different lengths. log(data.frame$x +1) produces a vector whose length is the number of rows of your data frame while the length of df$lnx[is.na(df$lnx)] is the number of rows that have NA in lnx
Using a dplyr solution:
library(dplyr)
df %>%
mutate(lnx = case_when(
x == 0.0 ~ 0,
is.na(x) ~ NA_real_))
This yields for your example:
# A tibble: 7 x 8
a b c d e f x lnx
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AB1001 1. 3. 67. 13.9 2.63 1777. NA
2 AB1002 0. 2. 72. 38.7 3.66 0. 0.
3 AB1003 1. 3. 48. 4.15 1.42 1917. NA
4 AB1004 0. 1. 70. 34.8 3.55 NA NA
5 AB1005 1. 1. 34. 3.45 1.24 3165. NA
6 AB1006 1. 1. 14. 7.30 1.99 NA NA
7 AB1007 0. 3. 53. 11.2 2.42 0. 0.

Using dplyr to manipulate header information imbedded in csv

First off, I am learning to use dplyr after having used base-r for most of my career (not really a data analyst, but trying to learn). I don't know if dplyr is the best option for this, or if I should use something else.
I have a data file generated by a piece of equipment that is very messy. There are header/tombstone data embedded within the data (time/date/location/sensor data for a specific location between rows of data for that location). The files are relatively large (150,000 observations x 14 variables), and I have successfully used dplyr to separate the actual data from the tombstone data (tombstone data has 6 rows of information spread over the 14 columns).
I am trying to create a single row of the tombstone information to append to the actual measurements so that it can be easily readable in R for analysis without relying on a "blackbox" solution from the manufacturer.
a sample of the data file and my script is provided below:
# Read csv file of data into R
data <- read_csv("data.csv", col_names = FALSE)
data
# A tibble: 155,538 x 14
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 80.00 19.00 0.00 37.0 1.0 0.0 3.00 NA NA NA NA NA NA
2 1.4e+01 8.00 6.00 13.00 43.0 9.0 33.0 50.00 1.00 -1.60 -2.00 50.10 14.88 NA
3 5.9e-01 5.15 2.02 -0.57 0.0 0.0 0.0 0.00 24.58 28.02 25.64 25.37 NA NA
4 0.0e+00 0.00 0.00 0.00 0.0 NA NA NA NA NA NA NA NA NA
5 3.0e+04 30000.00 -32768.00 -32768.00 0.0 NA NA NA NA NA NA NA NA NA
6 0.0e+00 0.00 0.00 0.00 0.0 0.0 0.0 0.25 20.30 NA NA NA NA NA
7 3.7e+01 cm BT counts 1.0 0.1 NA NA NA NA NA NA NA NA
8 NA 0.25 13.30 145.46 7.5 -11.0 2.1 0.80 157.00 149.00 158.00 143.00 100.00 2147483647
9 NA 0.35 13.37 144.54 7.8 -10.9 2.4 -0.40 153.00 150.00 148.00 146.00 100.00 2147483647
10 NA 0.45 14.49 144.65 8.4 -11.8 1.8 -0.90 139.00 156.00 151.00 152.00 100.00 2147483647
# ... with 155,528 more rows
# Get header information from file and create index(ens) of header information to later append header data to each line of measured data
header <- data %>%
filter(!is.na(data[,1])) %>%
mutate_all(as.character) %>%
mutate(ens = rep(1:(nrow(header)/6), each = 6)) %>%
group_by(ens)
n.head <- bind_cols(header[header$ens == 1,][1,], header[header$ens == 1,][2,], header[header$ens == 1,][3,], header[header$ens == 1,][4,], header[header$ens == 1,][5,], header[header$ens == 1,][6,])
Rows 2:7 have the information I am trying to work with, I know that creating a row of 90+ variables is not ideal, but this is a first step in cleaning this data up so that I can then work with it.
the last row with n.head is what I am hoping to end up with, without needing to write a loop to run that ~20,000 times... Any help would be appreciated, thank you in advance for input!
The trick here is to use tidy::spread() and tibble::enframe to get the header columns spread out into a single row data frame.
library(tidyverse)
header <- data[2:7] %>%
# convert the data frame to a vector
t %>%
as.vector %>%
# then change it back into a single row data frame that's in long format
enframe %>%
# then push that back into a wide format, ie. 1 row and a bajillion columns
spread(name, value)
# replicate the row as many times as you have data
header[2:nrow(actualdata,] <- header
#use bind_cols() to glue your header rows onto each row of the actual data
actualdata <- data[7:nrow(data),] %>%
bind_cols(foo)

sapply? tapply? ddply? dataframe variable based on rolling index of previous values of another variable

I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of functions which basically rely on a rolling index of a variable, with a function, and should naturally flow back into the dataframe they came from.
For example,
data<-as.data.frame(as.matrix(seq(1:30)))
data$V1<-data$V1/100
str(data)
data$V1<-NA # rolling 5 day product
for (i in 5:nrow(data)){
start<-i-5
end<-i
data$V1_MA5d[i]<- (prod(((data$V1[start:end]/100)+1))-1)*100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat<-NA
data$V1_MA5_cat[data$V1_MA5d<.5]<-0
data$V1_MA5_cat[data$V1_MA5d>.5]<-1
data$V1_MA5_cat[data$V1_MA5d>1.5]<-2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n<-NA
data$V1_MA5_cat_n[data$V1_MA5_cat==0]<-nrow(subset(data,V1_MA5_cat==0))
data$V1_MA5_cat_n[data$V1_MA5_cat==1]<-nrow(subset(data,V1_MA5_cat==1))
data$V1_MA5_cat_n[data$V1_MA5_cat==2]<-nrow(subset(data,V1_MA5_cat==2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this one of a few ways. Its worth mentioning here that you did write a "correct" for loop in R. You preallocated the vector by assigning data$V1_MA5d <- NA. This way you are filling rather than growing and its actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice my code inside the [] is different from yours. check out the difference:
i <- 10
i - 5:i
(i-5):i
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
As per the comment, this will give you a vector of length 26... instead you can add a few arguments to rollapply to make it match with your initial data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.

R Sorting Data Frame by Date

I'm working on a R data.frame which is made of stocks'dividends per year (I've got 60 stocks in columns and the usual calendar in rows). When a dividend is paid, I've got the figure and otherwise there is a NA.
Basically , here is how my Data.frame looks like
BARC LN BARN SE BAS GY BATS LN
1999-01-01 0.26 NA NA
1999-01-02 NA 0.56 0.35 NA
1999-01-03 NA NA NA NA
2000-01-04 NA NA 0.40 NA
1999-01-05 0.23 0.28 NA NA
2001-01-06 NA NA NA NA
2001-01-07 0.85 NA 0.15 NA
I would like to get the amount of dividend paid per year for each stock in order to compute the dividend yield ratio and finally get a Data;frame like the one below :
BARC LN BARN SE BAS GY BATS LN
1999 NA NA NA NA
2000 NA NA NA NA
2001 NA NA NA NA
How can i do that?
So, assuming your data is in a data.frame like the one you've posted above called div:
div <- structure(list(barc.ln = c(0.26, NA, NA, NA, 0.23, NA, 0.85),
barn.se = c(NA, 0.56, NA, NA, 0.28, NA, NA), bas.gy = c(NA,
0.35, NA, 0.4, NA, NA, 0.15), bats.ln = c(NA, NA, NA, NA,
NA, NA, NA)), .Names = c("barc.ln", "barn.se", "bas.gy",
"bats.ln"), row.names = c("1999-01-01", "1999-01-02", "1999-01-03",
"2000-01-04", "1999-01-05", "2001-01-06", "2001-01-07"), class = "data.frame")
just as you've done you can extract the years from your row.names:
div$years <- as.POSIXlt(row.names(div))$year + 1900
The plyr and reshape2 packages work well here and I think make the code particularly clear. Specifically, I'll use melt to make the data long and then ddply to split into groups and sum the dividends:
library(plyr)
library(reshape2)
div.melt <- melt(div, id.vars='years')
div.sum <- ddply(div.melt,
.(years, variable),
summarise,
dividend = sum(value, na.rm=TRUE))
> div.sum
years variable dividend
1 1999 barc.ln 0.49
2 1999 barn.se 0.84
3 1999 bas.gy 0.35
4 1999 bats.ln 0.00
5 2000 barc.ln 0.00
6 2000 barn.se 0.00
7 2000 bas.gy 0.40
8 2000 bats.ln 0.00
9 2001 barc.ln 0.85
10 2001 barn.se 0.00
11 2001 bas.gy 0.15
12 2001 bats.ln 0.00
>
you can then use another function from reshape2 called cast to format your data "wide":
> dcast(div.sum, years ~ variable, value.var='dividend')
years barc.ln barn.se bas.gy bats.ln
1 1999 0.49 0.84 0.35 0
2 2000 0.00 0.00 0.40 0
3 2001 0.85 0.00 0.15 0
>
I think you can do this pretty easily using by(). Here's how I did it. I've put each block, along with an explanation below each block.
dividends <- data.frame(barc_ln=c(0.26,NA,NA,NA,0.23,NA,0.85),
barn_se=c(NA,0.56,NA,NA,0.28,NA,NA),
bas_gy=c(NA,0.35,NA,0.40,NA,NA,0.15),
bats_ln=c(NA,NA,NA,NA,NA,NA,NA),
row.names=c("1999-01-01","1999-01-02","1999-01-03","2000-01-04","1999-01-05","2001-01-06","2001-01-07"))
This just creates the original data frame you gave.
dividends[,"dates"] <- as.Date(row.names(dividends))
dividends <- dividends[order(dividends[,"dates"]),]
dividends[,"year"] <- format(dividends$dates,"%Y")
This takes the row name dates, and then turns them into a new column ("dates") in the data frame. Then, we order the data frame (not necessarily required, but I find it to be more intuitive) by date and extract the year (as a character, mind you) using format.
div_output <- data.frame(row.names=unique(dividends$year))
Next, I create the output data frame that will receive the data. I use the unique() function on the year variable to get the unique vector of years. They're already ordered (one advantage of ordering the data frame).
for(x in 1:4) {
div_output[,x] <- by(dividends[,x],INDICES=dividends$year,FUN=sum,na.rm=TRUE)
}
names(div_output) <- names(dividends)[1:4]
Using a simple loop, we just go through each of the columns and apply the by() function. The variable is the column, the indices are the year, and we just use the sum function. I tag on na.rm=TRUE so that instead of NAs, you get actual data.
print(div_output)
barc_ln barn_se bas_gy bats_ln
1999 0.49 0.84 0.35 0
2000 0.00 0.00 0.40 0
2001 0.85 0.00 0.15 0
And there's the output I get.

Resources