Impute NA values with previous value in R - r

We have closing prices for 101 variables (companies). The data contain a lot of NA values, because the stock market is closed on Saturdays and Sundays, and we need to replace each of those NA values with the previous value, if there is one. So far we have not succeeded. This is our data example.
There are also companies that have no data in the first years because they were not yet on the stock market, so they have NA values for that period. And there are companies that went bankrupt and have NA values from then on. Both of these should become 0.
How should we do this, given that we have several conditions for filling our NAs?
Thanks in advance.

My understanding of the rules is:
columns that are all NA are to be left as all NA
leading NA values are left as NA
interior NA values are replaced with the most recent non-NA values
trailing NA values are replaced with 0
To try this out we use the built-in data frame BOD replacing the 1st, 3rd and last rows with NA and adding a column of NA values -- see Note at end.
We define a logical vector ok having one element per column which is TRUE for columns having at least one element that is not NA and FALSE for other columns. Then operating only on the columns for which ok is TRUE we fill in the trailing NA values with 0 using na.fill. Then we use na.locf to fill in the interior NA values.
library(zoo)
ok <- !apply(is.na(BOD), 2, all)
BOD[, ok] <- na.locf(na.fill(BOD[, ok], c(NA, NA, 0)), na.rm = FALSE)
giving:
Time demand X
1 NA NA NA <-- leading NA values are left intact
2 2 10.3 NA
3 2 10.3 NA <-- interior NA values are filled in with last non-NA value
4 4 16.0 NA
5 5 15.6 NA
6 0 0.0 NA <- trailing NA values are filled in with 0
Note
We used the following input above:
BOD[c(1, 3, 6), ] <- NA
BOD <- cbind(BOD, X = NA)
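For comparison, the four rules listed above can also be sketched in base R, one column at a time. The helper name fill_prices is hypothetical and not part of the zoo-based answer. This is only a minimal sketch, assuming each column is a vector ordered by date.

```r
# Hypothetical helper: applies the four rules to a single vector.
fill_prices <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) == 0) return(x)           # all NA: leave untouched
  first <- ok[1]
  last <- ok[length(ok)]
  idx <- first:last                        # span between first and last observed value
  span <- x[idx]
  # position of the most recent non-NA value at each step (carry forward)
  pos <- cummax(ifelse(is.na(span), 0L, seq_along(span)))
  x[idx] <- span[pos]                      # interior NA: last observed value
  if (last < length(x)) {
    x[(last + 1):length(x)] <- 0           # trailing NA: 0
  }
  x                                        # leading NA stay NA
}

# Same test input as in the Note, built on a copy of the built-in BOD data.
d <- BOD
d[c(1, 3, 6), ] <- NA
d$X <- NA
d[] <- lapply(d, fill_prices)
d
```

Working column by column keeps each rule visible as a separate step, at the cost of being longer than the na.fill/na.locf one-liner.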

Related

Reformat wrapped data coerced into a dataframe? (R)

I have some data I need to extract from a .txt file in a very weird, wrapped format. It looks like this:
eid nint hisv large NA
1 1.00 1.00000e+00 0 1.0 NA
2 -152552.00 -6.90613e+04 -884198 -48775.7 1151.70
3 -5190.13 4.17751e-05 NA NA NA
4 2.00 1.00000e+00 0 1.0 NA
5 -172188.00 -8.16684e+04 -809131 -56956.1 -1364.07
6 -5480.54 4.01573e-05 NA NA NA
Luckily, I do not need all of this data. I just want to match each eid with the value written in scientific notation, like so:
eid sigma
1 1 4.17751e-005
2 2 4.01573e-005
3 3 3.72098e-005
This data goes on for hundreds of thousands of eids. For each record, I need to discard the last three values of the first row and all of the values in row 2, and keep the last (second) value in the third row. That value should then be placed next to the first value of row 1, and the process repeated for the next record. The column names other than 'eid' are totally disposable, too. I've never had to deal with wrapped data before, so I don't know where to begin.
Edited to show the data frame after read-in.
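The steps described above can be sketched with plain row indexing. This is a minimal sketch, assuming every record occupies exactly three consecutive rows of the data frame as read in. The object names wrapped, first and result are illustrative only.

```r
# Toy copy of the first two columns of the six rows shown above.
wrapped <- data.frame(
  eid  = c(1, -152552, -5190.13, 2, -172188, -5480.54),
  nint = c(1, -6.90613e+04, 4.17751e-05, 1, -8.16684e+04, 4.01573e-05)
)

# The approach only works if the rows come in complete groups of three.
stopifnot(nrow(wrapped) %% 3 == 0)

first <- seq(1, nrow(wrapped), by = 3)   # first row of each record

result <- data.frame(
  eid   = wrapped[first, 1],             # first value of row 1 of each record
  sigma = wrapped[first + 2, 2]          # second value of row 3 of each record
)
result
```

Because the columns are addressed by position, the disposable column names do not matter.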

Rolling mean with ZOO when columns have NA values

I have calculated a rolling mean using data.table and zoo.
The code is below. I am calculating it on the artprice column, which contains NA rows.
library(data.table)
library(zoo)
rollmean1 <- data.table(newdf)
rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, function(x) rollmeanr(artprice, x, fill = NA))][]
Output:
> head(rollmean2)
spdate SP500close artprice MA2 MA3 MA126
1: 19870330 289.20 83.6 NA NA NA
2: 19870331 291.70 NA 290.450 NA NA
3: 19870401 292.39 NA 292.045 291.0967 NA
4: 19870402 293.63 NA 293.010 292.5733 NA
5: 19870403 300.41 NA 297.020 295.4767 NA
6: 19870406 301.95 NA 301.180 298.6633 NA
Notice that the artprice column contains mostly NA values. I would like to ignore them and still compute the rolling average, but I do not want to drop the NA rows, because artprice would then no longer line up with the date column.
Any ideas on how to achieve this?
I was leaning towards something like the following, using (x[!is.na(x)]) so that it recognizes the NA values but continues with the calculation:
rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, function(x) rollmeanr(artprice, (x[!is.na(x)]), fill = NA))][]
Any insight is appreciated.
EDIT: Changed code to use rollapply.
> rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, rollapplyr, data=artprice, mean, rm.na = TRUE, Fill = NA)][]
Warning message:
In `[.data.table`((rollmean1), , `:=`(paste0("MA", 126), lapply(126, :
Supplied 7491 items to be assigned to 7616 items of column 'MA126' (recycled leaving remainder of 125 items).
> tail(rollmean2)
spdate SP500close artprice MA126
1: 20170524 2404.39 NA NA
2: 20170525 2415.07 NA NA
3: 20170526 2415.82 NA NA
4: 20170530 2412.91 NA NA
5: 20170531 2411.80 NA NA
6: 20170601 2430.06 NA NA
It is throwing this warning message:
Warning message:
In `[.data.table`((rollmean1), , `:=`(paste0("MA", 126), lapply(126, :
Supplied 7491 items to be assigned to 7616 items of column 'MA126'
The data frame rollmean1 has 7616 rows:
> nrow(rollmean1)
[1] 7616
And nrow of output table:
> nrow(rollmean2)
[1] 7616
The rolling calculation uses a window of 126, so it cannot begin until row 126, which leaves the first 125 rows unfilled at the start of the data set. The warning reflects this (7616 - 125 = 7491), but I still do not get my desired outcome.
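One thing worth checking in the rollapply attempt is the spelling of the arguments: mean() expects na.rm (not rm.na), and the padding argument of rollapplyr is the lower-case fill (not Fill). Misspelled names are passed on to mean() and silently ignored, which would explain a result that is 125 items short. The following is a minimal sketch on a toy vector, assuming zoo is installed; x and ma3 are illustrative names, and a window of 3 stands in for 126.

```r
library(zoo)

# Toy series with gaps; in the question the column is artprice.
x <- c(1, NA, 3, 4, NA, NA, NA, 8)

# na.rm = TRUE is forwarded to mean(); fill = NA pads the first width - 1 positions,
# so the result has the same length as the input and stays aligned with the dates.
ma3 <- rollapplyr(x, width = 3, FUN = mean, na.rm = TRUE, fill = NA)

length(ma3) == length(x)
ma3
```

Windows that contain only NA values yield NaN rather than NA, because mean() of an empty set is NaN. In the data.table call from the question, the same idea would presumably read `MA126 := rollapplyr(artprice, 126, mean, na.rm = TRUE, fill = NA)`.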

Create multiple columns from a single column and clean up results

I have a data frame like this:
foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))
I want to add three columns, by splitting the first column with separator _, and retain only the numeric values obtained after the split. For those rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:
> library(tidyr)
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start Zero Start <NA> <NA> NA
2 Zero Start Zero Start <NA> <NA> NA
3 Zero Start Zero Start <NA> <NA> NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 0.270126
7 Zero Stop Zero Stop <NA> <NA> NA
8 Zero Start Zero Start <NA> <NA> NA
I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to like them both, so I can live with that. However, I have two kinds of problems:
The new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
When Point.Type can't be split, rpm_nom should be NA; instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns gets the Zero Start/Zero Stop. I want all three of them to be NA. How can I do that?
NOTE: I'm using tidyr, but of course you don't need to, if you think there's a better way to do this.
You can post-process the columns with dplyr:
library(dplyr)
foo <- foo %>%
separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
sep="_", remove = FALSE, extra="drop", fill="right") %>%
mutate_each(funs(as.numeric(gsub("[^0-9]","",.))), rpm_nom, GVF_nom, p0in_nom)
The gsub("[^0-9]","",.) part removes every character that is not a digit. If you want to keep decimal points, you can use [^0-9.] instead of [^0-9] (as @PierreLafortune did in his answer), but be aware that this also keeps points that are not meant to be decimal points. Wrapping the call in as.numeric converts the results to numeric values and at the same time turns the empty strings into NA. This gives the following result:
> foo
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start NA NA NA NA
2 Zero Start NA NA NA NA
3 Zero Start NA NA NA NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.9685555
7 Zero Stop NA NA NA NA
8 Zero Start NA NA NA NA
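In more recent dplyr releases, mutate_each() and funs() are deprecated, so the same post-processing step may need to be written with across(). The following is a minimal sketch, assuming dplyr 1.0 or later and tidyr are installed. The shortened foo and the names new_cols and res are illustrative only.

```r
library(dplyr)
library(tidyr)

# Shortened version of the data frame from the question.
foo <- data.frame(
  Point.Type = c("Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww", "Zero Stop"),
  Point.Value = c(NA, 0.5, NA)
)

new_cols <- c("rpm_nom", "GVF_nom", "p0in_nom")

res <- foo %>%
  separate(Point.Type, new_cols, sep = "_", remove = FALSE,
           extra = "drop", fill = "right") %>%
  # strip every non-digit character, then convert; empty strings become NA
  mutate(across(all_of(new_cols), ~ as.numeric(gsub("[^0-9]", "", .x))))
res
```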
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") :=
lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
function(x) as.numeric(gsub("[^0-9]","",x)))
]
will give a similar result:
> foo
Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1: Zero Start NA NA NA NA
2: Zero Start NA NA NA NA
3: Zero Start NA NA NA NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445 3000 10 13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 1.18581340 3000 10 13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 2.14475950 3000 10 13
7: Zero Stop NA NA NA NA
8: Zero Start NA NA NA NA
The advantage of this is that foo is updated by reference. As this is faster and more memory efficient, it is especially valuable for large datasets.
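With base R we can first set the values to NA where necessary and then coerce the new columns to numeric. The replacement is restricted to the three new columns (bar[2:4]), because stripping characters from Point.Value would drop its minus signs:
bar[2:4] <- lapply(bar[2:4], function(x) {
is.na(x) <- grepl("Zero", x)
as.numeric(gsub("[^0-9.]", "", x))})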
# Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1 Zero Start NA NA NA NA
# 2 Zero Start NA NA NA NA
# 3 Zero Start NA NA NA NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.2958815
# 7 Zero Stop NA NA NA NA
# 8 Zero Start NA NA NA NA
To reduce this to one line (@Jaap):
bar[2:4] <- lapply(bar[2:4], function(x) as.numeric(gsub("[^0-9.]", "", x)))
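The difference between the character classes used in these answers is easiest to see on a small vector. The following base R sketch uses made-up input strings. The third pattern, which also keeps the minus sign, is an addition for illustration and does not come from any of the answers.

```r
x <- c("3000rpm", "10%", "13barG", "1.0", "Zero Start", "-1.468033")

# Keep digits only: decimal points and minus signs are removed as well.
digits_only <- as.numeric(gsub("[^0-9]", "", x))

# Keep digits and points: decimals survive, minus signs do not.
with_points <- as.numeric(gsub("[^0-9.]", "", x))

# Keep digits, points and minus signs (a trailing "-" in the class is literal).
with_sign <- as.numeric(gsub("[^0-9.-]", "", x))

data.frame(x, digits_only, with_points, with_sign)
```

Strings without any digits, such as "Zero Start", are reduced to the empty string, which as.numeric turns into NA.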

How to extract the row immediately following the last observation within a group in R and rbind them together

My data frame, ABC, is as follows:
Dialled_nbr Ringing_nbr Phone_state duration
111 NA
111 NA
111 NA
NA NA Active 60
NA NA Active 0
222 NA
222 NA
222 NA
NA NA Active 90
NA NA Active 0
NA NA
NA 456
NA 456
NA NA Active 100
I want to extract the row that immediately follows the last observation within a group for Dialled_nbr.
The answer I want is:
Dialled_nbr Ringing_nbr Phone_state duration
NA NA Active 60
NA NA Active 90
I am new to R....Please help...
Here's a cryptic solution:
x = c(111,111,111,123456,222,222,222,67890);
x[c(T,x[2:length(x)] != x[1:(length(x)-1)]) & c(x[1:(length(x)-1)] != x[2:length(x)],T)];
It basically calculates a logical vector representing which elements are not equal to their immediately preceding element (passing the first element unconditionally), and then ANDs that with a logical vector representing which elements are not equal to their immediately following element (passing the last element unconditionally). Hence, the final logical vector represents which elements are not in a group of 2-or-more consecutive identical values. You then index the original vector with that logical vector to get your result.
Actually, upon re-reading your question, the above line may not be what you're looking for, because it would get any value that is not equal to either one of its adjacent elements, even if it's not preceded by a group of 2-or-more identical values (although your example data suggests that all isolated values will follow a 2-or-more group). This one might be more appropriate:
x = c(111,111,111,123456,222,222,222,67890);
group <- c(T,x[2:length(x)] == x[1:(length(x)-1)]) | c(x[1:(length(x)-1)] == x[2:length(x)],F);
x[!group & c(F,group[1:(length(group)-1)])];
This one constructs a logical vector of elements which are equal to either their preceding or their following element. Thus, the TRUE values are the group elements, and the FALSE values are the non-group elements. You can then get all non-group elements by inverting the group vector, and then AND that with a logical vector which represents whether the preceding element is a group element, thus producing a logical vector which represents only the non-group elements that follow a group. You can then use that to index the original vector to get the result.
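The same selection (isolated values that directly follow a group of 2-or-more identical values) can also be expressed with run-length encoding, which some readers may find easier to follow. This is a minimal sketch, assuming the vector contains no NA values. The names r, keep and after_group are illustrative.

```r
x <- c(111, 111, 111, 123456, 222, 222, 222, 67890)

r <- rle(x)   # runs of identical consecutive values

# A run of length 1 whose preceding run has length 2 or more.
keep <- r$lengths == 1 & c(FALSE, head(r$lengths, -1) >= 2)

after_group <- r$values[keep]
after_group
```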
Looking at your updated question, it now appears that you want to select only the rows where Dialled_nbr is NA and where the previous row did not have an NA in Dialled_nbr. You can accomplish that with this:
df <- data.frame(
Dialled_nbr=c(111,111,111,NA,NA,222,222,222,NA,NA,NA,NA,NA,NA),
Ringing_nbr=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,456,456,NA),
Phone_state=c('','','','Active','Active','','','','Active','Active','','','','Active'),
duration=c('','','','60','0','','','','90','0','','','','100')
);
df[is.na(df$Dialled_nbr) & !c(F,is.na(df$Dialled_nbr[1:(length(df$Dialled_nbr)-1)])),];
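Written out step by step, the condition in that one-liner looks as follows. This is a sketch on a reduced data frame with a numeric duration column. The names is_na, prev_is_na and picked are illustrative.

```r
df <- data.frame(
  Dialled_nbr = c(111, 111, 111, NA, NA, 222, 222, 222, NA, NA, NA, NA, NA, NA),
  duration    = c(NA, NA, NA, 60, 0, NA, NA, NA, 90, 0, NA, NA, NA, 100)
)

is_na <- is.na(df$Dialled_nbr)

# Was the previous row NA? The first row has no predecessor, so it gets FALSE,
# matching the c(F, ...) in the one-liner.
prev_is_na <- c(FALSE, head(is_na, -1))

# NA rows whose previous row held a dialled number.
picked <- df[is_na & !prev_is_na, ]
picked
```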
Using data.table_1.9.5
library(data.table)
setDT(df)[!is.na(shift(Dialled_nbr)) & is.na(Dialled_nbr)]
# Dialled_nbr Ringing_nbr Phone_state duration
#1: NA NA Active 60
#2: NA NA Active 90
You could replace the NA values with 0 so that ordinary comparisons can be used:
df <- data.frame(
Dialled_nbr=c(111,111,111,NA,NA,222,222,222,NA,NA,NA,NA,NA,NA),
Ringing_nbr=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,456,456,NA),
Phone_state=c('','','','Active','Active','','','','Active','Active','','','','Active'),
duration=c('','','','60','0','','','','90','0','','','','100'))
df[is.na(df)]=0
vec = with(df, c(head(Dialled_nbr,1), head(Dialled_nbr,-1)))
df[df$Dialled_nbr!=vec & df$Dialled_nbr==0,]
# Dialled_nbr Ringing_nbr Phone_state duration
#4 0 0 Active 60
#9 0 0 Active 90

subsetting error in R

I have a large dataframe called dualbeta which contains 2 rows and 6080 columns. Here is a sample:
row.names A.Close AA.Close AADR.Close AAIT.Close AAL.Close
1 upside 1.253929 0.9869027 0.6169613 0.6353903 0.1782124
2 downside 1.027412 1.1936236 0.5915299 0.5697878 0.1702382
I am trying to extract only those columns with an upside >= 1.00 and a downside <= 1.00. I used combinations <- subset(dualbeta, upside>=1.00 & downside<=1.00) but I get the following:
row.names A.Close AA.Close AADR.Close AAIT.Close
1 NA NA NA NA NA
2 NA.1 NA NA NA NA
3 NA.2 NA NA NA NA
4 NA.3 NA NA NA NA
5 NA.4 NA NA NA NA
...
It should just return a 2 by x table, where x is the number of combinations found. I do not know why I am getting a bunch of rows. Additionally, I thought I had NA values in dualbeta, so I used na.omit(dualbeta)->dualbeta, but it deleted everything and turned dualbeta into a 0 by 6080 data frame. I also used which(is.na(dualbeta)), which returned 3307 and 3308, but when I checked those columns, they did not contain NAs.
You might work on the transpose of the data in order to select rows with the proper characteristics (which are columns in the transpose). In the code below, x stands for the data frame from the question:
# Fix up the data, use proper row names
rownames(x) <- x$row.names
# Remove old row name column
x <- x[-1]
# transpose and subset
subset(data.frame(t(x)), upside > 1 & downside < 1)
This expression returns a zero-row result with your example data. Changing the thresholds shows what is returned:
subset(data.frame(t(x)), upside > .6 & downside < .6)
## upside downside
## AADR.Close 0.6169613 0.5915299
## AAIT.Close 0.6353903 0.5697878
You can extract the data with simple indexing.
Let's say this is your data:
dualbeta<-data.frame(matrix(runif(24,0,2),
nrow=2,
dimnames=list(c("upside","downside"), letters[1:12])))
then you can extract with
dualbeta[, dualbeta[1,]>=1.00 & dualbeta[2,]<=1.00]
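A variant of the same indexing on the sample values from the question is sketched below. It addresses the rows by name and adds drop = FALSE so that the result stays a data frame even when only one column qualifies. The 0.6 thresholds are used only because no sample column satisfies the original thresholds of 1.00. The names keep and combinations are illustrative.

```r
dualbeta <- data.frame(
  A.Close    = c(1.253929, 1.027412),
  AA.Close   = c(0.9869027, 1.1936236),
  AADR.Close = c(0.6169613, 0.5915299),
  AAIT.Close = c(0.6353903, 0.5697878),
  AAL.Close  = c(0.1782124, 0.1702382),
  row.names  = c("upside", "downside")
)

# One logical value per column: upside at or above, downside at or below the threshold.
keep <- unlist(dualbeta["upside", ]) >= 0.6 & unlist(dualbeta["downside", ]) <= 0.6

combinations <- dualbeta[, keep, drop = FALSE]
combinations
```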
