I have calculated the rolling mean using data.table and zoo.
Code below: I am calculating this on the artprice column which contains NA rows.
library(data.table)
library(zoo)
rollmean1 <- data.table(newdf)
rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, function(x) rollmeanr(artprice, x, fill = NA))][]
Output:
> head(rollmean2)
spdate SP500close artprice MA2 MA3 MA126
1: 19870330 289.20 83.6 NA NA NA
2: 19870331 291.70 NA 290.450 NA NA
3: 19870401 292.39 NA 292.045 291.0967 NA
4: 19870402 293.63 NA 293.010 292.5733 NA
5: 19870403 300.41 NA 297.020 295.4767 NA
6: 19870406 301.95 NA 301.180 298.6633 NA
Notice that the artprice column has mostly NA values. I would like to ignore them and still compute the rolling average, but I don't want to drop the NA rows from artprice, because the column would then no longer line up with the date column.
Any ideas on how to achieve this?
I was leaning towards something like this, so that it recognizes there are NA values but continues with the calculation: (x[!is.na(x)])
rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, function(x) rollmeanr(artprice, (x[!is.na(x)]), fill = NA))][]
Any insight appreciated.
EDIT: Changed code to use rollapply.
> rollmean2 <- (rollmean1)[, paste0('MA',126) := lapply(126, rollapplyr, data=artprice, mean, rm.na = TRUE, Fill = NA)][]
Warning message:
In `[.data.table`((rollmean1), , `:=`(paste0("MA", 126), lapply(126, :
Supplied 7491 items to be assigned to 7616 items of column 'MA126' (recycled leaving remainder of 125 items).
> tail(rollmean2)
spdate SP500close artprice MA126
1: 20170524 2404.39 NA NA
2: 20170525 2415.07 NA NA
3: 20170526 2415.82 NA NA
4: 20170530 2412.91 NA NA
5: 20170531 2411.80 NA NA
6: 20170601 2430.06 NA NA
It is throwing this warning message:
Warning message:
In `[.data.table`((rollmean1), , `:=`(paste0("MA", 126), lapply(126, :
Supplied 7491 items to be assigned to 7616 items of column 'MA126'
data frame rollmean1 has 7616 rows
> nrow(rollmean1)
[1] 7616
And nrow of output table:
> nrow(rollmean2)
[1] 7616
The rolling calculation uses a 126-row window (MA126). That means the first value cannot be computed until row 126, leaving 125 rows unfilled at the start of the data set. R recognizes this, but it still does not produce my desired output.
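One likely cause, offered as a sketch rather than a confirmed fix: the call spells the arguments rm.na and Fill, whereas mean expects na.rm and rollapplyr expects fill. Misspelled arguments are passed on to mean and silently ignored, so NAs are not dropped and no padding is added, which leaves the result 125 items short. The toy vector and the window of 3 below are assumptions standing in for artprice and 126.

```r
library(zoo)

# Toy series standing in for artprice: mostly NA, like the column in the question
artprice <- c(83.6, NA, NA, 85.0, NA, 86.4, NA, NA)
width <- 3  # hypothetical small window; the question uses 126

# na.rm = TRUE is forwarded to mean(); fill = NA pads the first (width - 1) slots
ma <- rollapplyr(artprice, width, mean, na.rm = TRUE, fill = NA)

# The result is as long as the input, so it lines up with the date column
length(ma) == length(artprice)
```

A window that contains no non-NA value at all comes back as NaN from mean(..., na.rm = TRUE). Because the result has the same length as the input, it can be assigned with := without any recycling.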
Given an uncertain number of columns containing source values for the same variable, I would like to create a column that holds the final value to be selected, depending on source importance and availability.
Reproducible data:
set.seed(123)
actuals = runif(10, 500, 1000)
get_rand_vector <- function(){return (runif(10, 0.95, 1.05))}
get_na_rand_ixs <- function(){return (round(runif(5,0,10),0))}
df = data.frame("source_1" = actuals*get_rand_vector(),
"source_2" = actuals*get_rand_vector(),
"source_n" = actuals*get_rand_vector())
df[["source_1"]][get_na_rand_ixs()] <- NA
df[["source_2"]][get_na_rand_ixs()] <- NA
df[["source_n"]][get_na_rand_ixs()] <- NA
My manual solution is as follows:
df$available <- ifelse(
!is.na(df$source_1),
df$source_1,
ifelse(
!is.na(df$source_2),
df$source_2,
df$source_n
)
)
Given the desired result of:
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
How could I automatically iterate over the available sources to pick the value to be used? In some cases n_sources could be 1, 2, 3, ..., 7, and priority follows the natural order (1 > 2 > ...).
Once you have all of the candidate vectors in order and in an appropriate data structure (e.g., data.frame or matrix), you can use apply to apply a function over the rows. In this case, we just look for the first non-NA value. Thus, after the first block of code above, you only need the following line:
df$available <- apply(df, 1, FUN = function(x) x[which(!is.na(x))[1]])
coalesce() from dplyr is designed for this:
library(dplyr)
df %>%
mutate(available = coalesce(!!!.))
source_1 source_2 source_n available
1 NA NA NA NA
2 NA NA 930.1242 930.1242
3 716.9981 NA 717.9234 716.9981
4 NA 988.0446 NA 988.0446
5 931.7081 NA 924.1101 931.7081
6 543.6802 533.6798 NA 543.6802
7 744.6525 767.4196 783.8004 744.6525
8 902.8788 955.1173 NA 902.8788
9 762.3690 NA 761.6135 762.3690
10 761.4092 702.6064 708.7615 761.4092
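Without dplyr, the same priority-ordered fallback can be sketched in base R by folding over the columns with Reduce. This is only an illustration: the helper name first_available is made up here, and the small data frame copies four rows from the desired result above.

```r
# Four rows taken from the desired result shown in the question
df <- data.frame(
  source_1 = c(NA, NA, 716.9981, NA),
  source_2 = c(NA, NA, NA, 988.0446),
  source_n = c(NA, 930.1242, 717.9234, NA)
)

# Hypothetical helper: keep a where it is available, otherwise fall back to b
first_available <- function(a, b) ifelse(is.na(a), b, a)

# Reduce() walks the columns left to right, so column order is the priority order
df$available <- Reduce(first_available, df)
df$available
```

This works for any number of source columns, as long as they are ordered by priority and are the only columns in the data frame when Reduce is called.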
We have 101 variables (companies) with their closing prices. We have a lot of NA values (the stock market is closed on Saturdays and Sundays, which gives NA values in our data), and we need to impute each NA with the previous value if there is one, but we haven't succeeded. This is our data example
There are also companies that have no data in the first years, since they were not yet on the stock market, so they have NA values for this period. And there are companies that go bankrupt and start having NA values, so these should both become 0.
How should we do this, given that we have several conditions for filling our NAs?
Thanks in advance.
My understanding of the rules is:
columns that are all NA are to be left as all NA
leading NA values are left as NA
interior NA values are replaced with the most recent non-NA values
trailing NA values are replaced with 0
To try this out we use the built-in data frame BOD replacing the 1st, 3rd and last rows with NA and adding a column of NA values -- see Note at end.
We define a logical vector ok having one element per column which is TRUE for columns having at least one element that is not NA and FALSE for other columns. Then operating only on the columns for which ok is TRUE we fill in the trailing NA values with 0 using na.fill. Then we use na.locf to fill in the interior NA values.
library(zoo)
ok <- !apply(is.na(BOD), 2, all)
BOD[, ok] <- na.locf(na.fill(BOD[, ok], c(NA, NA, 0)), na.rm = FALSE)
giving:
Time demand X
1 NA NA NA <-- leading NA values are left intact
2 2 10.3 NA
3 2 10.3 NA <-- interior NA values are filled in with last non-NA value
4 4 16.0 NA
5 5 15.6 NA
6 0 0.0 NA <- trailing NA values are filled in with 0
Note
We used the following input above:
BOD[c(1, 3, 6), ] <- NA
BOD <- cbind(BOD, X = NA)
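For a single column, the same rules can also be sketched in base R without zoo. The function name fill_by_position is made up for this illustration, and the input vector mirrors the Time column after the NA rows were inserted.

```r
# Hypothetical helper implementing the rules for one vector:
# leading NAs stay NA, interior NAs take the last non-NA value, trailing NAs become 0
fill_by_position <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) == 0) return(x)       # an all-NA column is left untouched
  first <- ok[1]
  last <- ok[length(ok)]
  idx <- cumsum(!is.na(x))             # position of the most recent non-NA value
  inside <- first:last
  x[inside] <- x[ok][idx[inside]]      # carry the last observation forward
  if (last < length(x)) {
    x[(last + 1):length(x)] <- 0       # trailing NAs become 0
  }
  x
}

time_col <- c(NA, 2, NA, 4, 5, NA)
fill_by_position(time_col)
```

To apply it to every column of a data frame, something like df[] <- lapply(df, fill_by_position) should do, where df is a placeholder name.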
I have a data frame like this:
foo=data.frame(Point.Type = c("Zero Start","Zero Start", "Zero Start", "3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww","Zero Stop","Zero Start"),
Point.Value = c(NA,NA,NA,rnorm(3),NA,NA))
I want to add three columns, by splitting the first column with separator _, and retain only the numeric values obtained after the split. For those rows where the first column doesn't contain any _, the three new columns should be NA. I got somewhat close using separate, but that's not enough:
> library(tidyr)
> bar = separate(foo,Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"), sep="_", remove = FALSE, extra="drop", fill="right")
> bar
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start Zero Start <NA> <NA> NA
2 Zero Start Zero Start <NA> <NA> NA
3 Zero Start Zero Start <NA> <NA> NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG -1.468033
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 1.280868
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000rpm 10% 13barG 0.270126
7 Zero Stop Zero Stop <NA> <NA> NA
8 Zero Start Zero Start <NA> <NA> NA
I'm not sure why my data frame now contains two apparently different kinds of NA, but is.na seems to accept them both, so I can live with that. However, I have two problems:
The new columns should be at least numeric, and possibly integer. Instead they're character, because of the trailing rpm, %, barG. How do I get rid of those?
When Point.Type can't be split, rpm_nom should be NA; instead it becomes Zero Start or Zero Stop. Changing the fill= option only changes which one of the new columns gets the Zero Start/Zero Stop. I want all three of them to be NA. How can I do that?
NOTE: I'm using tidyr, but of course you don't need to, if you think there's a better way to do this.
You can post-process the columns with dplyr:
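library(dplyr)
library(tidyr)  # provides separate()
foo <- foo %>%
  separate(Point.Type, c("rpm_nom", "GVF_nom", "p0in_nom"),
           sep = "_", remove = FALSE, extra = "drop", fill = "right") %>%
  # across() applies the same cleanup to each new column (needs dplyr >= 1.0.0)
  mutate(across(c(rpm_nom, GVF_nom, p0in_nom),
                ~ as.numeric(gsub("[^0-9]", "", .))))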
The gsub("[^0-9]","",.) part removes all non-numeric characters. If you want to keep decimal points, you can use [^0-9.] instead of [^0-9] (as @PierreLafortune does in his answer), but be aware that this also keeps points that are not meant to be decimal points. Wrapping it in as.numeric converts the strings to numeric values and at the same time turns the empty strings into NA. This gives the following result:
> foo
Point.Type rpm_nom GVF_nom p0in_nom Point.Value
1 Zero Start NA NA NA NA
2 Zero Start NA NA NA NA
3 Zero Start NA NA NA NA
4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -1.2361145
5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 -0.8727960
6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.9685555
7 Zero Stop NA NA NA NA
8 Zero Start NA NA NA NA
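To see the difference between the two patterns in isolation, here is a small sketch on a plain character vector. The input values are pieces of the Point.Type strings from the question; the vector names are made up for the example.

```r
raw <- c("3000rpm", "10%", "13barG", "Zero Start", "1.0")

# Keep digits only: the decimal point in "1.0" is removed, giving 10
digits_only <- as.numeric(gsub("[^0-9]", "", raw))

# Keep digits and points: "1.0" survives as 1
digits_and_dot <- as.numeric(gsub("[^0-9.]", "", raw))

digits_only     # 3000 10 13 NA 10
digits_and_dot  # 3000 10 13 NA  1
```

"Zero Start" contains no digits, so gsub leaves an empty string, which as.numeric turns into NA.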
Or using data.table (as contributed by @DavidArenburg in the comments):
library(data.table)
setDT(foo)[, c("rpm_nom","GVF_nom","p0in_nom") :=
lapply(tstrsplit(Point.Type, "_", fixed = TRUE)[1:3],
function(x) as.numeric(gsub("[^0-9]","",x)))
]
will give a similar result:
> foo
Point.Type Point.Value rpm_nom GVF_nom p0in_nom
1: Zero Start NA NA NA NA
2: Zero Start NA NA NA NA
3: Zero Start NA NA NA NA
4: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww -0.09255445 3000 10 13
5: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 1.18581340 3000 10 13
6: 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 2.14475950 3000 10 13
7: Zero Stop NA NA NA NA
8: Zero Start NA NA NA NA
The advantage of this is that foo is updated by reference. As this is faster and more memory efficient, it is especially valuable with large datasets.
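With base R we can first set NA values where necessary and then coerce to class numeric:
# Columns 2:4 are the three new columns; Point.Value is left alone,
# because the pattern would also strip the minus sign from negative values
bar[2:4] <- lapply(bar[2:4], function(x) {
  is.na(x) <- grepl("Zero", x)
  as.numeric(gsub("[^0-9.]", "", x))})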
# Point.Type rpm_nom GVF_nom p0in_nom Point.Value
# 1 Zero Start NA NA NA NA
# 2 Zero Start NA NA NA NA
# 3 Zero Start NA NA NA NA
# 4 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.3558397
# 5 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 1.1454829
# 6 3000rpm_10%_13barG_Sdsdsa_1.0_ss_Pww 3000 10 13 0.2958815
# 7 Zero Stop NA NA NA NA
# 8 Zero Start NA NA NA NA
To reduce to one line (@Jaap):
bar[2:4] <- lapply(bar[2:4], function(x) as.numeric(gsub("[^0-9.]", "", x)))
I am new to R and have been trying to adapt some code to analyse gene expression and genetic mutation status to predict outcomes in breast cancer patients.
The original code was published in Nature for acute myeloid leukemia data sets and can be downloaded from: http://www.nature.com/ncomms/2015/150109/ncomms6901/full/ncomms6901.html
I am following the code in supplemental data 4.
I am unable to replicate their data, because I hit an error when building the data.frame.
I am able to load all of my data from cBioPortal using the following code:
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")
brca_tcga <- getCancerStudies(mycgds)[15,1] ## 15 for BRCA
cases <- getCaseLists(mycgds,brca_tcga)[8,1] ## 8 for RNA expression z scores
g <- lapply(split(as.numeric(entrez), seq_along(entrez)%/%500), function(genes) getProfileData(mycgds,genes,getGeneticProfiles(mycgds,brca_tcga)[2,1],cases)) ## loads my sample information into a data.frame "g"
Then I try to run the following code:
g <- do.call("cbind", g)
which yields an error:
> g <- do.call("cbind", g)
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 173, 0
I have tried to follow related threads, but some of them are above my head. I am not sure if something went wrong in constructing the data.frame, or where to begin to fix this issue. Any assistance would be appreciated, as would a pointer to a good document explaining what's going on.
I can print my data by calling g:
WDR38 WDR63 WDR86 ZBED9 ZCWPW2 ZNF283 ZNF300P1 ZNF418 ZNF600
TCGA.AB.2803.03 NA NA NA NA NA NA NA NA NA
TCGA.AB.2805.03 NA NA NA NA NA NA NA NA NA
TCGA.AB.2806.03 NA NA NA NA NA NA NA NA NA
TCGA.AB.2807.03 NA NA NA NA NA NA NA NA NA
TCGA.AB.2808.03 NA NA NA NA NA NA NA NA NA
This is a small example, but I am unable to get through the next step of the code.
:-(
Thank you all for any assistance or education you may provide!
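The message "differing number of rows: 173, 0" suggests that at least one element of the list g is a data frame with zero rows, possibly a gene chunk for which nothing was returned. A minimal sketch of how one might check for that and drop the empty pieces before binding is below. The toy list is an assumption standing in for the real g, and whether dropping chunks is acceptable depends on the analysis.

```r
# Toy stand-in for g: a list of data frames, one of which has no rows
g <- list(
  data.frame(A = 1:3, B = 4:6),
  data.frame(X = numeric(0)),  # empty chunk
  data.frame(C = 7:9)
)

# Count the rows of every chunk to find the odd ones out
n_rows <- vapply(g, nrow, integer(1))
n_rows

# Keep only the chunks that actually contain rows, then bind the columns
g_ok <- g[n_rows > 0]
combined <- do.call("cbind", g_ok)
dim(combined)
```

Looking at which(n_rows == 0) shows which chunks came back empty, which is a starting point for working out why.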
I have a large dataframe called dualbeta which contains 2 rows and 6080 columns. Here is a sample:
row.names A.Close AA.Close AADR.Close AAIT.Close AAL.Close
1 upside 1.253929 0.9869027 0.6169613 0.6353903 0.1782124
2 downside 1.027412 1.1936236 0.5915299 0.5697878 0.1702382
I am trying to extract only those columns with an upside >= 1.00 and a downside <= 1.00. I used combinations <- subset(dualbeta, upside>=1.00 & downside<=1.00) but I get the following:
row.names A.Close AA.Close AADR.Close AAIT.Close
1 NA NA NA NA NA
2 NA.1 NA NA NA NA
3 NA.2 NA NA NA NA
4 NA.3 NA NA NA NA
5 NA.4 NA NA NA NA
...
It should just return a 2 by x table, where x is the number of combinations found. I do not know why I am getting a bunch of rows. Additionally, I thought I had NA values in dualbeta, so I used na.omit(dualbeta)->dualbeta, but it deleted everything and turned dualbeta into a 0 by 6080 data frame. I also used which(is.na(dualbeta)), which returned 3307 and 3308, but when I checked those columns, they did not contain NAs.
You might work on the transpose of the data in order to select rows with the proper characteristics (which are columns in the transpose):
# Fix up the data, use proper row names
rownames(x) <- x$row.names
# Remove old row name column
x <- x[-1]
# transpose and subset
subset(data.frame(t(x)), upside > 1 & downside < 1)
This expression returns a zero-length result with your example data. Changing the parameters shows what is returned:
subset(data.frame(t(x)), upside > .6 & downside < .6)
## upside downside
## AADR.Close 0.6169613 0.5915299
## AAIT.Close 0.6353903 0.5697878
You can subset the data with simple indexing.
Let's say this is your data
dualbeta<-data.frame(matrix(runif(24,0,2),
nrow=2,
dimnames=list(c("upside","downside"), letters[1:12])))
then you can extract with
dualbeta[, dualbeta[1,]>=1.00 & dualbeta[2,]<=1.00]
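A variant of the same indexing idea, sketched under the assumption that upside and downside are real row names rather than values in a row.names column. The first three columns copy the sample in the question; X.Close is a hypothetical column added so that at least one column passes the filter.

```r
dualbeta <- data.frame(
  A.Close    = c(1.253929, 1.027412),
  AA.Close   = c(0.9869027, 1.1936236),
  AADR.Close = c(0.6169613, 0.5915299),
  X.Close    = c(1.10, 0.90),  # hypothetical column, not in the original data
  row.names  = c("upside", "downside")
)

# unlist() turns each one-row data frame into a plain named numeric vector
up   <- unlist(dualbeta["upside", ])
down <- unlist(dualbeta["downside", ])
keep <- up >= 1 & down <= 1

# drop = FALSE keeps a data frame even when only one column survives
combinations <- dualbeta[, keep, drop = FALSE]
combinations
```

Selecting rows by name avoids relying on row order, and drop = FALSE keeps the 2 by x shape when x is 1.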