Adding NAs to a vector - r

Let's say I have a vector of prices:
foo <- c(102.25,102.87,102.25,100.87,103.44,103.87,103.00)
I want to get the percent change from x periods ago and, say, store it into another vector that I'll call log_returns. I can't bind vectors foo and log_returns into a data.frame because the vectors are not the same length. So I want to be able to append NA's to log_returns so I can put them in a data.frame. I figured out one way to append an NA at the end of the vector:
log_returns <- append((diff(log(foo), lag = 1)),NA,after=length(foo))
But that only helps if I'm looking at the percent change from 1 period before. I'm looking for a way to fill in NAs no matter how many lags I throw in, so that the percent-change vector is equal in length to the foo vector.
Any help would be much appreciated!

You could use your own modification of diff:
mydiff <- function(data, diff){
  # difference at the requested lag, padded with NAs to keep the original length
  c(diff(data, lag = diff), rep(NA, diff))
}
mydiff(foo, 1)
[1] 0.62 -0.62 -1.38 2.57 0.43 -0.87 NA
data.frame(foo = foo, diff = mydiff(foo, 3))
foo diff
1 102.25 -1.38
2 102.87 0.57
3 102.25 1.62
4 100.87 2.13
5 103.44 NA
6 103.87 NA
7 103.00 NA
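Note that mydiff pads the NAs at the end, so each difference is stored in the row of the earlier observation. If you would rather align each change with the later observation (a common convention for returns), a small variant pads at the front instead (mydiff_front is a hypothetical name, not part of the original answer):
mydiff_front <- function(data, diff){
  # same differences, but NA-padded at the front so row i holds the change from row i - diff
  c(rep(NA, diff), diff(data, lag = diff))
}
data.frame(foo = foo, diff = mydiff_front(foo, 3))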

Let's say you have an array with the numbers 1 to 10 arranged in matrix form, with 5 rows and 2 columns, and you want the 2nd column to be assigned NA.
# First make a 5*2 matrix of the elements 1:10
Array_test <- array(1:10, dim = c(5, 2, 1))
Array_test
Array_test[, 2, ] <- NA  # assign NA to the whole 2nd column
Array_test
# Similarly, to make only one element of the entire matrix NA,
# say the 4th row, 2nd column:
Array_test[4, 2, ] <- NA

Related

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and store the result in a new column; however, when Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be done.
Please note that these columns are not rounded in the same way, so many numbers are the same but displayed with a different number of decimal places (nsmall).
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper perhaps. Or two roads merging into one.
The two columns seem to have been created by two separate processes; the first looks to have more accuracy. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
                       test$Sample_Intensity_nRTC,
                       test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
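For example, a one-line equivalent of the ifelse above (assuming the same test data frame):
library(dplyr)
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)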

Create dataframe with missing data

I'm very new to R, so please excuse my potentially noob question.
I have hormone concentration data from 23 individuals, collected hourly. I've interpolated between the hourly collections to get concentrations between 2.0 and 15.0 pg/ml at intervals of 0.1; this gives 131 rows of data per individual.
Some individuals' concentrations, however, don't go beyond 6.0 pg/ml (for example), which means I have data frames with unequal numbers of rows across individuals. I need all individuals to have 131 rows for the next step, where I combine all the data.
I've tried to create a data frame of NAs with 131 rows and two columns, and then add the individual's interpolated data into the NA data frame, so that the end result is a 131-row data frame with missing data as NA, but it's not going so well.
interp_saliva_002_x <- as.tibble(matrix(NA, nrow = 131, ncol = 1))
interp_sequence <- as.numeric(seq(2, 15, 0.1))
interp_saliva_002_x[1] <- interp_sequence
colnames(interp_saliva_002_x)[1] <- "saliva_conc"
test <- left_join(interp_saliva_002_x, interp_saliva_002, by = "saliva_conc")
Can you help me to understand where I'm going wrong or is there a more logical way to do this?
Thank you!
Let's assume you have 3 vectors with different lengths:
A <- seq(1, 5); B <- seq(2, 8); C <- seq(3, 5)
Change the length of each vector to the length you want (in your case it's 131; I picked 7 for simplicity):
length(A) <- 7; length(B) <- 7; length(C) <- 7  # this pads the missing entries with NA
Next you can cbind the vectors into a matrix:
m <- cbind(A, B, C)
# A B C
#[1,] 1 2 3
#[2,] 2 3 4
#[3,] 3 4 5
#[4,] 4 5 NA
#[5,] 5 6 NA
#[6,] NA 7 NA
#[7,] NA 8 NA
You can also change your matrix to a data frame:
df <- as.data.frame(m)
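Applied to the original problem, the same trick would pad each individual's interpolated values out to the full 131-entry grid before combining (the object and column names below are assumptions based on the question):
saliva_002 <- interp_saliva_002$saliva_conc  # one individual's interpolated values
length(saliva_002) <- 131                    # pads the remaining entries with NA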

Find the 2 max values for each factor in R

I have a question about finding, for each unique ID, the rows with the two largest values in the weight column, then calculating the mean of the layer column over those rows. A sample of my data is here:
ID layer weight
1 0.6843629 0.35
1 0.6360772 0.70
1 0.6392318 0.14
2 0.3848640 0.05
2 0.3882660 0.30
2 0.3877026 0.10
2 0.3964194 0.60
2 0.4273218 0.02
2 0.3869507 0.12
3 0.4748541 0.07
3 0.5853659 0.42
3 0.5383678 0.10
3 0.6060287 0.60
4 0.4859274 0.08
4 0.4720740 0.48
4 0.5126481 0.08
4 0.5280899 0.48
5 0.7492097 0.07
5 0.7220433 0.35
5 0.8750000 0.10
5 0.8302752 0.50
6 0.4306283 0.10
6 0.4890895 0.25
6 0.3790714 0.20
6 0.5139686 0.50
6 0.3885678 0.02
6 0.4706815 0.05
For each ID, I want to calculate the mean value of layer, using only the rows with the two highest weights.
I can do this with the following code in R:
library(plyr); library(data.table)
ind.max1 <- ddply(index1, "ID", function(x) x[which.max(x$weight), ])
dt1 <- data.table(index1, key = c("layer"))
dt2 <- data.table(ind.max1, key = c("layer"))
index2 <- dt1[!dt2]
ind.max2 <- ddply(index2, "ID", function(x) x[which.max(x$weight), ])
ind.max.all <- merge(ind.max1, ind.max2, all = TRUE)
ind.ndvi.mean <- as.data.frame(tapply(ind.max.all$layer, list(ind.max.all$ID), mean))
This uses ddply to select the highest weight value per ID and put it into a data frame along with layer. I then remove these highest-weight rows from the original data frame using data.table, repeat the ddply max selection to get the second-highest weights, and merge the two max-weight data frames into one. Finally, I compute the mean with tapply.
There must be a more efficient way to do this. Does anyone have any insight? Cheers.
You could use data.table
library(data.table)
setDT(dat)[, .(Meanlayer = mean(layer[order(-weight)[1:2]])), by = ID]
# ID Meanlayer
#1: 1 0.6602200
#2: 2 0.3923427
#3: 3 0.5956973
#4: 4 0.5000819
#5: 5 0.7761593
#6: 6 0.5015291
Order the weight column in descending order: -weight.
Select the first two positions of that ordering with [1:2], within each group defined by ID.
Subset the corresponding layer values based on that index: layer[order(-weight)[1:2]].
Take the mean.
Alternatively, in 1.9.3 (the development version at the time this was written) or any later version, a function setorder is exported for reordering data.tables in any order, by reference:
require(data.table) ## 1.9.3+
setorder(setDT(dat), ID, -weight) ## dat is now reordered as we require
dat[, mean(layer[1:min(.N, 2L)]), by=ID]
By ordering first, we avoid the call to order() for each group (unique value in ID). This'll be more advantageous with more groups. And setorder() is much more efficient than order() as it doesn't need to create a copy of your data.
This actually is a question for StackOverflow... anyway!
Don't know if the version below is efficient enough for you...
s.ind <- tapply(df$weight, df$ID, function(x) order(x, decreasing = TRUE))
val <- tapply(df$layer, df$ID, function(x) x)
foo <- function(x, y) list(x[y][1:2])
lapply(mapply(foo, val, s.ind), mean)
I think this will do it. Assuming the data is called dat,
> sapply(split(dat, dat$ID), function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
})
# 1 2 3 4 5 6
# 0.6602200 0.3923427 0.5956973 0.5000819 0.7761593 0.5015291
You'll likely want to include na.rm = TRUE as the second argument to mean to account for any rows that contain NA values.
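For instance, the same call with na.rm added:
sapply(split(dat, dat$ID), function(x) {
  with(x, mean(layer[weight %in% rev(sort(weight))[1:2]], na.rm = TRUE))
})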
Alternatively, mapply is probably faster, and has the exact same code just in a different order,
mapply(function(x) {
with(x, {
mean(layer[ weight %in% rev(sort(weight))[1:2] ])
})
}, split(dat, dat$ID))

Reformatting Messy Data Frame Column in R

I've imported a large data frame from a CSV file with oddly formatted numerical data. Here's a reproducible example of the data frame I'm working with:
df <- data.frame("r1" = c(1,2,3,4,5), "r2" = c(1,2.01,-3,"-","2,000"))
'r2' contains values with negative signs (e.g. -3) as well as zeros represented as dashes ("-"). To run some numerical analysis on this messy r2 column, I will need to:
Replace the "-" entries with zeros ("0") without removing the negative sign in front of the negative values.
Avoid coercion of legitimate values like "2,000" to NA. For some reason, when I run the command df$r2 <- as.numeric(sub("-", 0, df$r2)), R coerces the values formatted with commas to NA, thus corrupting the data in the column.
Here's an example of the output after running df$r2 <- as.numeric(sub("-", 0, df$r2)):
Warning message:
NAs introduced by coercion
r1 r2
1 1 1.00
2 2 2.01
3 3 3.00
4 4 0.00
5 5 NA
As you can see, "2,000" was coerced to NA. -3 was erroneously converted to 3 (dash removed). But hey, at least we got rid of the "-" in row 3, right!!!
Here's ultimately what I would like to produce:
r1 r2
1 1 1.00
2 2 2.01
3 3 -3.00
4 4 0.00
5 5 2000
Note that the comma from row 5 is removed. Column r2 should be formatted such that I can run commands like sum(df$r2) on it.
Your approach was sound. Just run the substitution twice: once to replace anything that is just a dash with a zero, and once more to remove any commas.
df$r2 <- as.numeric(gsub('^-$', '0', gsub(',', '', df$r2)))
And, if you aren't familiar with regular expressions, by ^-$ I mean match only strings that start (^), contain a dash, and then end ($).
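A quick check of why the anchors matter, on the two kinds of dash values:
gsub('^-$', '0', c("-", "-3"))
# [1] "0"  "-3"
The lone dash becomes "0" while the negative sign on "-3" survives.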
nograpes' solution is way cooler:
## df <- data.frame("r1" = c(1,2,3,4,5), "r2" = c(1,2.01,-3,"-","2,000"))
df$r2 <- as.numeric(gsub(",", "", df$r2))
df$r2[is.na(df$r2)] <- 0
## r1 r2
## 1 1 1.00
## 2 2 2.01
## 3 3 -3.00
## 4 4 0.00
## 5 5 2000.00
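Either way, the cleaned column is now numeric, so commands like the sum mentioned above work:
sum(df$r2)
## [1] 2000.01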

Avoiding a loop when entry i might take the value of entry i-1

I have vectors with mostly NA entries. I want to replace every NA with that vector's preceding non-NA entry. Specifically, the first entry will always be a 1, followed by a bunch of NAs, which I want to replace with 1. The ith entry could be another numeric, let's say 2, followed by more NAs that I want to replace with 2. And so on. The loop below achieves this, but there has to be a more R way to do it, right? I wanted to vectorize with ifelse(), but I couldn't figure out how to replace the ith entry with the i-1th entry.
> vec <- rep(NA, 10)
> vec
[1] NA NA NA NA NA NA NA NA NA NA
> vec[1] <- 1; vec[4] <- 2; vec[7] <- 3
> vec
[1] 1 NA NA 2 NA NA 3 NA NA NA
> for (i in 1:length(vec)) if (is.na(vec[i])) vec[i] <- vec[i-1]
> vec
[1] 1 1 1 2 2 2 3 3 3 3
Thanks!
If context helps, I am adjusting for stock splits from the WRDS database, which has a column that shows when and how a split occurs.
This is already implemented in package zoo, function na.locf (Last Observation Carried Forward). See also here: Propagating data within a vector
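A minimal sketch using na.locf on the vector from the question:
library(zoo)
vec <- c(1, NA, NA, 2, NA, NA, 3, NA, NA, NA)
na.locf(vec)
# [1] 1 1 1 2 2 2 3 3 3 3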
If you're adjusting for stock splits, maybe you could use adjRatios from the TTR package. All the heavy lifting is done in C, so it is very fast. See adjustOHLC for an example of how to use adjRatios to adjust your data.
> require(quantmod)
> getSymbols("IBM",from="1980-01-01")
[1] "IBM"
> spl <- getSplits("IBM",from="1980-01-01")
> div <- getDividends("IBM",from="1980-01-01")
>
> head(spl)
IBM.spl
1997-05-28 0.5
1999-05-27 0.5
> head(div)
[,1]
1980-02-06 0.215
1980-05-08 0.215
1980-08-07 0.215
1980-11-05 0.215
1981-02-05 0.215
1981-05-07 0.215
>
> adjusters <- adjRatios(spl, div, Cl(IBM))
> head(adjusters)
Split Div
1980-01-02 0.25 0.7510967
1980-01-03 0.25 0.7510967
1980-01-04 0.25 0.7510967
1980-01-07 0.25 0.7510967
1980-01-08 0.25 0.7510967
1980-01-09 0.25 0.7510967
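To actually apply these ratios, multiply the raw series by the Split and Div columns. This mirrors what adjustOHLC does internally for the close, but treat the line below as a sketch rather than a drop-in replacement for adjustOHLC:
> IBM_adj_close <- Cl(IBM) * adjusters[, "Split"] * adjusters[, "Div"]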
