I am trying to convert values in one column to NA based on whether the values in a corresponding column are NA. I need to do this for two large groups of corresponding columns, so I cannot mutate each column one by one.
For example, below, 2002 inflationNext2Years turns to NA since 2002 realReturnNext2Years is NA.
year <- c(2000, 2001, 2002)
realReturnNext1Years <- c(.1,.2,.3)
realReturnNext2Years <- c(.15,.25, NA)
realReturnNext3Years <- c(.45, NA, NA)
inflationNext1Years <- c(.02, .03, .07)
inflationNext2Years <- c(.03, .05, .08)
inflationNext3Years <- c(.04, .06, .09)
data <- data.frame(year, realReturnNext1Years, realReturnNext2Years, realReturnNext3Years, inflationNext1Years, inflationNext2Years, inflationNext3Years)
data
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
1 2000 0.1 0.15 0.45 0.02 0.03 0.04
2 2001 0.2 0.25 NA 0.03 0.05 0.06
3 2002 0.3 NA NA 0.07 0.08 0.09
I am trying to convert data into:
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
2000 0.1 0.15 0.45 0.02 0.03 0.04
2001 0.2 0.25 NA 0.03 0.05 NA
2002 0.3 NA NA 0.07 NA NA
Since I have many columns, I cannot do this one column at a time. I tried to use mutate_at with an ifelse() but was not sure how to test if the number of years lined up.
I have a vector of the realReturn column names and another vector of the inflation column names. I am trying to change the inflation columns to NA if their corresponding realReturnColumn is NA, but keep the inflation column the same if the realReturnColumn is not NA.
We can collect the indices of the "realReturnNext" columns using grep, find the positions of their NAs, and replace the corresponding positions in the "inflationNext" columns with NA:
real_cols <- grep("^realReturnNext", colnames(data))
inflation_cols <- grep("^inflationNext", colnames(data))
data[inflation_cols][is.na(data[real_cols])] <- NA
# year realReturnNext1Years realReturnNext2Years realReturnNext3Years
#1 2000 0.1 0.15 0.45
#2 2001 0.2 0.25 NA
#3 2002 0.3 NA NA
# inflationNext1Years inflationNext2Years inflationNext3Years
#1 0.02 0.03 0.04
#2 0.03 0.05 NA
#3 0.07 NA NA
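The question also mentions having a vector of realReturn column names and another of inflation column names. If those two vectors are already in matching order, a sketch that pairs them up explicitly with Map would work too (the paste0 calls below just rebuild the vectors for this toy example):
real_names <- paste0("realReturnNext", 1:3, "Years")
inflation_names <- paste0("inflationNext", 1:3, "Years")
# for each pair of columns, blank out the inflation value wherever the realReturn value is NA
data[inflation_names] <- Map(function(infl, real) replace(infl, is.na(real), NA),
                             data[inflation_names], data[real_names])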
I am using a large dataset (286,212 rows, 19 columns), much bigger than I am used to, and I am not sure how to go about my problem. The data is made up of values for each day of the year for 782 grid references, and I have this for 15 years. It looks as follows:
Month Day Grid x2004 x2005 x2006 x2007
1 1 A10 0.091 0.134 NA 0.066
1 2 A10 0.12 0.10 0.23 0.054
1 3 A10 0.55 NA NA 0.08
1 1 B10 NA 0.134 NA 0.17
1 2 B10 0.14 0.151 NA 0.21
1 3 B10 0.43 0.162 0.24 NA
However, some of the days are missing and I want to insert the mean of that day for that specific grid using values from the other years. So if Grid A10 for day 1 in 2006 is missing, I want to insert the mean for day 1, Grid A10 across 2004, 2005, and 2007, which in this case is 0.097.
I am trying the following code
ind <- which(is.na(data$x2005))
data$x2005[ind] <- sapply(ind, function(i)
with(data, rowMeans(data[c(data$x2004[i], data$x2006[i], data$x2007[i], data$x2008[i], data$x2009[i],
data$x2010[i], data$x2011[i], data$x2012[i],
data$x2013[i], data$x2014[i], data$x2015[i],
data$x2016[i], data$x2017[i]),], na.rm=TRUE)))
I plan to do that for all years, but it is telling me:
"Error in rowMeans(data[c(data$x2006[i], data$x2007[i], data$x2012[i]), :
'x' must be numeric"
Although when I check the class, it says they are all numeric, so I am not sure why x is not numeric. I also don't know whether, even once I get the mean part sorted, the code will give me the mean specific to each grid and day.
Please help. Thanks.
Can you adapt this to your code:
for(i in 1:ncol(data)){
  # replace each column's NAs with that column's overall mean
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
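That loop imputes the overall column mean, though. If the goal is specifically the mean across the other years for the same grid and day, note that each row already is one grid/day combination, so a row-wise mean over the year columns does exactly that. The error in the original attempt most likely comes from handing rowMeans a slice that still contains the non-numeric Grid column (and from using data values as row indices). A sketch of the row-wise approach, assuming all year columns are named like x2004:
year_cols <- grep("^x20", names(data), value = TRUE)
# mean across the available years, computed row by row (i.e. per grid and day)
row_means <- rowMeans(data[year_cols], na.rm = TRUE)
for (col in year_cols) {
  miss <- is.na(data[[col]])
  data[[col]][miss] <- row_means[miss]
}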
I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2000-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4x0 + 0.3x0 + 0.2x0.05 + 0.1x0.07 = 0.017, and SWeights = 0.2 + 0.1 = 0.3.
My initial idea was to assign the weights to each column, like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. I soon realized that this is impractical, as there are hundreds of columns. Any advice or suggestion is appreciated!
Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type. I'm not sure whether that's also the case with your real data, but I would first convert them to an appropriate type:
df1[-1] <- lapply(df1[-1], type.convert, as.is = TRUE)  # as.is = TRUE avoids factor conversion (and a warning in recent R)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we can easily create the new columns in two simple steps. This does have a to-matrix conversion overhead, but maybe you should work with matrices in the first place. Either way, this is fully vectorized:
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec  # sum of weight * return for each row
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec  # sum of weights where the return is > 0
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2000-04-30 0.05 0.03 0.19 0.03 0.070 1.0
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient as these are basically by-row operations:
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE
)[, c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2000-04-30 0.070 1.0
Consider the following list:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
How do I manage to have a data frame with all terms associated with these 3 words in the columns and showing:
The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example the pair (oil, they) would show NA)
Here's a solution using reshape2 to help reshape the data
library(reshape2)
aa <- do.call(rbind, Map(function(d, n)
  cbind.data.frame(
    xterm = if (length(d) > 0) names(d) else NA,
    cor = if (length(d) > 0) d else NA,
    term = n),
  a, names(a))
)
dcast(aa, term~xterm, value.var="cor")
Or you could use dplyr and tidyr
library(dplyr)
library(devtools)  # this installs the development version of tidyr from GitHub; the CRAN release (install.packages("tidyr")) may suffice today
install_github('hadley/tidyr')
library(tidyr)
a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
cor=x, stringsAsFactors=FALSE)), term)
a1 %>%
spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
# term 15.8 ability above agreement analysts buyers clearly emergency fixed
#1 oil 0.87 NA 0.76 0.71 0.79 0.70 0.8 0.75 0.73
#2 opec 0.85 0.8 0.82 0.76 0.85 0.83 NA 0.87 NA
# late market meeting prices prices. said that they trying who winter
#1 0.8 0.75 0.77 0.72 NA 0.78 0.73 NA 0.8 0.8 0.8
#2 NA NA 0.88 NA 0.79 0.82 NA 0.8 NA NA NA
Update
aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
res <- aNew3 %>%
spread(xterm, cor)
dim(res)
#[1] 1021 160
res[1:3,1:5]
# term ... 100,000 10.8 1.1
#1 ... NA NA NA NA
#2 100,000 NA NA NA 1
#3 10.8 NA NA NA NA
I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of computations that basically apply a function over a rolling window of a variable, and whose results should naturally flow back into the data frame they came from.
For example,
data<-as.data.frame(as.matrix(seq(1:30)))
data$V1<-data$V1/100
str(data)
data$V1_MA5d <- NA # rolling 5 day product
for (i in 5:nrow(data)){
start<-i-5
end<-i
data$V1_MA5d[i]<- (prod(((data$V1[start:end]/100)+1))-1)*100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat<-NA
data$V1_MA5_cat[data$V1_MA5d<.5]<-0
data$V1_MA5_cat[data$V1_MA5d>.5]<-1
data$V1_MA5_cat[data$V1_MA5d>1.5]<-2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n<-NA
data$V1_MA5_cat_n[data$V1_MA5_cat==0]<-nrow(subset(data,V1_MA5_cat==0))
data$V1_MA5_cat_n[data$V1_MA5_cat==1]<-nrow(subset(data,V1_MA5_cat==1))
data$V1_MA5_cat_n[data$V1_MA5_cat==2]<-nrow(subset(data,V1_MA5_cat==2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this one of a few ways. It's worth mentioning here that you did write a "correct" for loop in R: you preallocated the vector by assigning data$V1_MA5d <- NA. This way you are filling rather than growing, and it's actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice my code inside the [] is different from yours. Check out the difference:
i <- 10
i - 5:i    # 5 4 3 2 1 0  (":" binds tighter than "-", so this is 10 - (5:10))
(i-5):i    # 5 6 7 8 9 10 (the intended window of indices)
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
Note that this will give you a vector of length 26 rather than 30; you can add a few arguments to rollapply to make it line up with your initial data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
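Since the goal was for the result to flow back into the data frame, the padded, right-aligned version can simply be assigned to a new column (a sketch reusing myfun from above; note that a width of 5 uses five values per window, so from the sixth row onward the numbers differ slightly from the six-element window in the original loop, as the two output vectors above show):
data$V1_MA5d <- rollapply(data$V1, 5, myfun, fill = NA, align = 'right')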
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.
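One such alternative, sketched with dplyr (it assumes the cuts column created above and attaches the per-group count as a new column):
library(dplyr)
data %>%
  group_by(cuts) %>%
  mutate(V1_MA5_cat_n = n()) %>%
  ungroup()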
I'm working on an R data.frame of stocks' dividends (I've got 60 stocks in columns and the usual calendar in rows). When a dividend is paid, I've got the figure; otherwise there is an NA.
Basically, here is what my data.frame looks like:
BARC LN BARN SE BAS GY BATS LN
1999-01-01 0.26 NA NA NA
1999-01-02 NA 0.56 0.35 NA
1999-01-03 NA NA NA NA
2000-01-04 NA NA 0.40 NA
1999-01-05 0.23 0.28 NA NA
2001-01-06 NA NA NA NA
2001-01-07 0.85 NA 0.15 NA
I would like to get the amount of dividends paid per year for each stock in order to compute the dividend yield ratio, and finally get a data.frame like the one below:
BARC LN BARN SE BAS GY BATS LN
1999 NA NA NA NA
2000 NA NA NA NA
2001 NA NA NA NA
How can I do that?
So, assuming your data is in a data.frame like the one you've posted above called div:
div <- structure(list(barc.ln = c(0.26, NA, NA, NA, 0.23, NA, 0.85),
barn.se = c(NA, 0.56, NA, NA, 0.28, NA, NA), bas.gy = c(NA,
0.35, NA, 0.4, NA, NA, 0.15), bats.ln = c(NA, NA, NA, NA,
NA, NA, NA)), .Names = c("barc.ln", "barn.se", "bas.gy",
"bats.ln"), row.names = c("1999-01-01", "1999-01-02", "1999-01-03",
"2000-01-04", "1999-01-05", "2001-01-06", "2001-01-07"), class = "data.frame")
Just as you've done, you can extract the years from your row.names:
div$years <- as.POSIXlt(row.names(div))$year + 1900
The plyr and reshape2 packages work well here and I think make the code particularly clear. Specifically, I'll use melt to make the data long and then ddply to split into groups and sum the dividends:
library(plyr)
library(reshape2)
div.melt <- melt(div, id.vars='years')
div.sum <- ddply(div.melt,
.(years, variable),
summarise,
dividend = sum(value, na.rm=TRUE))
> div.sum
years variable dividend
1 1999 barc.ln 0.49
2 1999 barn.se 0.84
3 1999 bas.gy 0.35
4 1999 bats.ln 0.00
5 2000 barc.ln 0.00
6 2000 barn.se 0.00
7 2000 bas.gy 0.40
8 2000 bats.ln 0.00
9 2001 barc.ln 0.85
10 2001 barn.se 0.00
11 2001 bas.gy 0.15
12 2001 bats.ln 0.00
You can then use another function from reshape2, dcast, to format your data "wide":
> dcast(div.sum, years ~ variable, value.var='dividend')
years barc.ln barn.se bas.gy bats.ln
1 1999 0.49 0.84 0.35 0
2 2000 0.00 0.00 0.40 0
3 2001 0.85 0.00 0.15 0
I think you can do this pretty easily using by(). Here's how I did it: each block of code is below, with an explanation after it.
dividends <- data.frame(barc_ln=c(0.26,NA,NA,NA,0.23,NA,0.85),
barn_se=c(NA,0.56,NA,NA,0.28,NA,NA),
bas_gy=c(NA,0.35,NA,0.40,NA,NA,0.15),
bats_ln=c(NA,NA,NA,NA,NA,NA,NA),
row.names=c("1999-01-01","1999-01-02","1999-01-03","2000-01-04","1999-01-05","2001-01-06","2001-01-07"))
This just creates the original data frame you gave.
dividends[,"dates"] <- as.Date(row.names(dividends))
dividends <- dividends[order(dividends[,"dates"]),]
dividends[,"year"] <- format(dividends$dates,"%Y")
This takes the row name dates, and then turns them into a new column ("dates") in the data frame. Then, we order the data frame (not necessarily required, but I find it to be more intuitive) by date and extract the year (as a character, mind you) using format.
div_output <- data.frame(row.names=unique(dividends$year))
Next, I create the output data frame that will receive the data. I use the unique() function on the year variable to get the unique vector of years. They're already ordered (one advantage of ordering the data frame).
for(x in 1:4) {
  div_output[,x] <- by(dividends[,x], INDICES=dividends$year, FUN=sum, na.rm=TRUE)
}
names(div_output) <- names(dividends)[1:4]
Using a simple loop, we just go through each of the columns and apply the by() function. The variable is the column, the indices are the year, and we just use the sum function. I tag on na.rm=TRUE so that instead of NAs, you get actual data.
print(div_output)
barc_ln barn_se bas_gy bats_ln
1999 0.49 0.84 0.35 0
2000 0.00 0.00 0.40 0
2001 0.85 0.00 0.15 0
And there's the output I get.
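For completeness, base R's aggregate() condenses the same column-wise summing into a single call; this is just a sketch against the dividends data frame built above (with its year column already added):
aggregate(dividends[1:4], by = list(year = dividends$year), FUN = sum, na.rm = TRUE)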