R Sorting Data Frame by Date - r

I'm working on an R data.frame of stocks' dividends per year (60 stocks in columns and the usual calendar dates in rows). When a dividend is paid I have the figure; otherwise there is an NA.
Here is what my data.frame looks like:
           BARC LN BARN SE BAS GY BATS LN
1999-01-01    0.26      NA     NA      NA
1999-01-02      NA    0.56   0.35      NA
1999-01-03      NA      NA     NA      NA
2000-01-04      NA      NA   0.40      NA
1999-01-05    0.23    0.28     NA      NA
2001-01-06      NA      NA     NA      NA
2001-01-07    0.85      NA   0.15      NA
I would like to get the total dividend paid per year for each stock, in order to compute the dividend yield ratio, and end up with a data.frame like the one below:
     BARC LN BARN SE BAS GY BATS LN
1999      NA      NA     NA      NA
2000      NA      NA     NA      NA
2001      NA      NA     NA      NA
How can I do that?

So, assuming your data is in a data.frame like the one you've posted above called div:
div <- structure(list(barc.ln = c(0.26, NA, NA, NA, 0.23, NA, 0.85),
barn.se = c(NA, 0.56, NA, NA, 0.28, NA, NA), bas.gy = c(NA,
0.35, NA, 0.4, NA, NA, 0.15), bats.ln = c(NA, NA, NA, NA,
NA, NA, NA)), .Names = c("barc.ln", "barn.se", "bas.gy",
"bats.ln"), row.names = c("1999-01-01", "1999-01-02", "1999-01-03",
"2000-01-04", "1999-01-05", "2001-01-06", "2001-01-07"), class = "data.frame")
You can then extract the years from the row names:
div$years <- as.POSIXlt(row.names(div))$year + 1900
The plyr and reshape2 packages work well here and I think make the code particularly clear. Specifically, I'll use melt to make the data long and then ddply to split into groups and sum the dividends:
library(plyr)
library(reshape2)
div.melt <- melt(div, id.vars='years')
div.sum <- ddply(div.melt,
.(years, variable),
summarise,
dividend = sum(value, na.rm=TRUE))
> div.sum
years variable dividend
1 1999 barc.ln 0.49
2 1999 barn.se 0.84
3 1999 bas.gy 0.35
4 1999 bats.ln 0.00
5 2000 barc.ln 0.00
6 2000 barn.se 0.00
7 2000 bas.gy 0.40
8 2000 bats.ln 0.00
9 2001 barc.ln 0.85
10 2001 barn.se 0.00
11 2001 bas.gy 0.15
12 2001 bats.ln 0.00
>
You can then use another function from reshape2, dcast, to reshape the data to "wide" format:
> dcast(div.sum, years ~ variable, value.var='dividend')
years barc.ln barn.se bas.gy bats.ln
1 1999 0.49 0.84 0.35 0
2 2000 0.00 0.00 0.40 0
3 2001 0.85 0.00 0.15 0
>
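If you would rather avoid extra packages, here is a minimal base R sketch of the same per-year sum using aggregate(). The helper sum_or_na is my own hypothetical name, not part of any package: it returns NA instead of 0 for a year in which a stock paid nothing at all, which may be closer to the output asked for in the question.

```r
# Same data as above, rebuilt here so the example is self-contained
div <- data.frame(
  barc.ln = c(0.26, NA, NA, NA, 0.23, NA, 0.85),
  barn.se = c(NA, 0.56, NA, NA, 0.28, NA, NA),
  bas.gy  = c(NA, 0.35, NA, 0.40, NA, NA, 0.15),
  bats.ln = rep(NA_real_, 7),
  row.names = c("1999-01-01", "1999-01-02", "1999-01-03", "2000-01-04",
                "1999-01-05", "2001-01-06", "2001-01-07")
)

# Year of each row, taken from the row names
yr <- format(as.Date(row.names(div)), "%Y")

# Hypothetical helper: NA when every value is missing, otherwise the sum
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)

# One row per year, one column per stock
div.year <- aggregate(div, by = list(year = yr), FUN = sum_or_na)
div.year
```

Swap sum_or_na for sum with na.rm = TRUE if you prefer zeros, as in the plyr version.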

I think you can do this pretty easily using by(). Here's how I did it. Each code block is followed by its explanation.
dividends <- data.frame(barc_ln=c(0.26,NA,NA,NA,0.23,NA,0.85),
barn_se=c(NA,0.56,NA,NA,0.28,NA,NA),
bas_gy=c(NA,0.35,NA,0.40,NA,NA,0.15),
bats_ln=c(NA,NA,NA,NA,NA,NA,NA),
row.names=c("1999-01-01","1999-01-02","1999-01-03","2000-01-04","1999-01-05","2001-01-06","2001-01-07"))
This just creates the original data frame you gave.
dividends[,"dates"] <- as.Date(row.names(dividends))
dividends <- dividends[order(dividends[,"dates"]),]
dividends[,"year"] <- format(dividends$dates,"%Y")
This takes the row name dates, and then turns them into a new column ("dates") in the data frame. Then, we order the data frame (not necessarily required, but I find it to be more intuitive) by date and extract the year (as a character, mind you) using format.
div_output <- data.frame(row.names=unique(dividends$year))
Next, I create the output data frame that will receive the data. I use the unique() function on the year variable to get the unique vector of years. They're already ordered (one advantage of ordering the data frame).
for(x in 1:4) {
div_output[,x] <- by(dividends[,x],INDICES=dividends$year,FUN=sum,na.rm=TRUE)
}
names(div_output) <- names(dividends)[1:4]
Using a simple loop, we just go through each of the columns and apply the by() function. The variable is the column, the indices are the year, and we just use the sum function. I tag on na.rm=TRUE so that instead of NAs, you get actual data.
print(div_output)
barc_ln barn_se bas_gy bats_ln
1999 0.49 0.84 0.35 0
2000 0.00 0.00 0.40 0
2001 0.85 0.00 0.15 0
And there's the output I get.
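The loop can also be folded into a single call. This is only a sketch of one alternative, assuming the same four dividend columns: sapply() walks over the columns and tapply() sums each one by year, which gives a matrix with years as row names.

```r
dividends <- data.frame(
  barc_ln = c(0.26, NA, NA, NA, 0.23, NA, 0.85),
  barn_se = c(NA, 0.56, NA, NA, 0.28, NA, NA),
  bas_gy  = c(NA, 0.35, NA, 0.40, NA, NA, 0.15),
  bats_ln = rep(NA_real_, 7),
  row.names = c("1999-01-01", "1999-01-02", "1999-01-03", "2000-01-04",
                "1999-01-05", "2001-01-06", "2001-01-07")
)

# Year (as character) for every row
year <- format(as.Date(row.names(dividends)), "%Y")

# For each column, sum the non-missing values within each year
div_output <- sapply(dividends, function(col) tapply(col, year, sum, na.rm = TRUE))
div_output
```

Wrap the result in as.data.frame() if you need a data frame rather than a matrix.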

Related

Update dt columns based on named list

Let's say, I have the following my_dt datatable:
neutrons spectrum geography
    2.30     -1.2      KIEL
    2.54     -1.6      KIEL
    2.56     -0.9      JUNG
    2.31     -0.3       ANT
Also I have the following named list (my_list):
> my_list
$particles
[1] "neutrons"
$station
[1] NA
$energy
[1] "spectrum"
$area
[1] "geography"
$gamma
[1] NA
The values of this list correspond to the column names of my dataset where such a column exists, and are NA where it does not.
Based on my dataset and this list, I need to check which columns exist in my_dt and rename them using the names in my_list; for the NA values, I need to create columns filled with NAs.
So, I want to obtain the following dataset:
>final_dt
particles station energy area gamma
     2.30      NA   -1.2 KIEL    NA
     2.54      NA   -1.6 KIEL    NA
     2.56      NA   -0.9 JUNG    NA
     2.31      NA   -0.3  ANT    NA
I tried to implement this using the apply family of functions, but so far I can't get exactly what I want.
I would be grateful for any help!
data.table using lapply
library(data.table)
setDT(my_dt)
setDT(my_list)
final_dt <- setnames( my_list[, lapply( .SD, function(x){
if( x %in% colnames(my_dt)){ my_dt[,x,with=F] }else{ NA } } ) ],
names(my_list) )
final_dt
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
base R using sapply
setDF(my_dt)
setDF(my_list)
data.frame( sapply( my_list, function(x) if(!is.na(x)){ my_dt[,x] }else{ NA } ) )
particles station energy area gamma
1 2.30 NA -1.2 KIEL NA
2 2.54 NA -1.6 KIEL NA
3 2.56 NA -0.9 JUNG NA
4 2.31 NA -0.3 ANT NA
Data
my_dt <- structure(list(neutrons = c(2.3, 2.54, 2.56, 2.31), spectrum = c(-1.2,
-1.6, -0.9, -0.3), geography = c("KIEL", "KIEL", "JUNG", "ANT"
)), class = "data.frame", row.names = c(NA, -4L))
my_list <- list(particles = "neutrons", station = NA, energy = "spectrum",
area = "geography", gamma = NA)
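One caveat with the sapply version: sapply() simplifies its result to a matrix, and because geography is character, the numeric columns come back as character too. A minimal sketch that keeps the column types, assuming the same my_dt and my_list, is to build a list of columns with lapply() and convert that instead (final_df is just an illustrative name):

```r
my_dt <- data.frame(neutrons  = c(2.30, 2.54, 2.56, 2.31),
                    spectrum  = c(-1.2, -1.6, -0.9, -0.3),
                    geography = c("KIEL", "KIEL", "JUNG", "ANT"))
my_list <- list(particles = "neutrons", station = NA, energy = "spectrum",
                area = "geography", gamma = NA)

# One column per list entry: the matching column of my_dt, or all NA
cols <- lapply(my_list, function(x) {
  if (is.na(x)) rep(NA, nrow(my_dt)) else my_dt[[x]]
})

# The list names become the column names; numeric columns stay numeric
final_df <- as.data.frame(cols)
str(final_df)
```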
This may not meet your needs, but since I came up with it separately, I thought I would share it just in case. You can use setnames to rename the columns based on my_list. After that, add the missing column names with values of NA. Finally, you can use setcolorder to reorder the columns based on your list if desired.
library(data.table)
setDT(my_dt) # the := step below needs my_dt to be a data.table
my_vec <- unlist(my_list)
setnames(my_dt, names(my_vec[match(names(my_dt), my_vec)]))
my_dt[, (setdiff(names(my_vec), names(my_dt))) := NA]
setcolorder(my_dt, names(my_vec))
my_dt
Output
particles station energy area gamma
1: 2.30 NA -1.2 KIEL NA
2: 2.54 NA -1.6 KIEL NA
3: 2.56 NA -0.9 JUNG NA
4: 2.31 NA -0.3 ANT NA
I wrote some simple code that should do the job for you:
l = list(c = 'cc', a = 'aa', b = NA) # replace this with your my_list
dt = data.frame(aa = 1:3, cc = 2:4) # replace this with my_dt
dtl = data.frame(l)
# rename each column of dt to the list name whose value matches it
names(dt) = names(l)[match(names(dt), unlist(l))]
# cross-join with the columns of dtl that dt lacks (the NA-valued entries)
m = merge(dt, dtl[!is.element(names(dtl), names(dt))])

Mutate group of columns based on values in other group of columns

I am trying to convert values in one column to NA based on if the values in another corresponding column are NA. I need to do this for two large groups of corresponding columns so I cannot mutate each column one by one.
For example, below, 2002 inflationNext2Years turns to NA since 2002 realReturnNext2Years is NA.
year <- c(2000, 2001, 2002)
realReturnNext1Years <- c(.1,.2,.3)
realReturnNext2Years <- c(.15,.25, NA)
realReturnNext3Years <- c(.45, NA, NA)
inflationNext1Years <- c(.02, .03, .07)
inflationNext2Years <- c(.03, .05, .08)
inflationNext3Years <- c(.04, .06, .09)
data <- data.frame(year, realReturnNext1Years, realReturnNext2Years, realReturnNext3Years, inflationNext1Years, inflationNext2Years, inflationNext3Years)
data
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
1 2000 0.1 0.15 0.45 0.02 0.03 0.04
2 2001 0.2 0.25 NA 0.03 0.05 0.06
3 2002 0.3 NA NA 0.07 0.08 0.09
I am trying to convert data into:
year realReturnNext1Years realReturnNext2Years realReturnNext3Years inflationNext1Years inflationNext2Years inflationNext3Years
2000 0.1 0.15 0.45 0.02 0.03 0.04
2001 0.2 0.25 NA 0.03 0.05 NA
2002 0.3 NA NA 0.07 NA NA
Since I have many columns, I cannot do this one column at a time. I tried to use mutate_at with an ifelse() but was not sure how to test if the number of years lined up.
I have a vector of the realReturn column names and another vector of the inflation column names. I am trying to change the inflation columns to NA if their corresponding realReturnColumn is NA, but keep the inflation column the same if the realReturnColumn is not NA.
We can collect the indices of the "realReturnNext" columns using grep, find the positions of their NAs, and set the corresponding positions in the "inflationNext" columns to NA. This relies on both groups of columns appearing in the same order:
real_cols <- grep("^realReturnNext", colnames(data))
inflation_cols <- grep("^inflationNext", colnames(data))
data[inflation_cols][is.na(data[real_cols])] <- NA
# year realReturnNext1Years realReturnNext2Years realReturnNext3Years
#1 2000 0.1 0.15 0.45
#2 2001 0.2 0.25 NA
#3 2002 0.3 NA NA
# inflationNext1Years inflationNext2Years inflationNext3Years
#1 0.02 0.03 0.04
#2 0.03 0.05 NA
#3 0.07 NA NA
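If the two groups of columns are not guaranteed to be in the same order, a slightly more defensive sketch pairs them by name instead of by position. It assumes the naming pattern from the question, where swapping the "realReturn" prefix for "inflation" gives the partner column:

```r
data <- data.frame(
  year = c(2000, 2001, 2002),
  realReturnNext1Years = c(.1, .2, .3),
  realReturnNext2Years = c(.15, .25, NA),
  realReturnNext3Years = c(.45, NA, NA),
  inflationNext1Years = c(.02, .03, .07),
  inflationNext2Years = c(.03, .05, .08),
  inflationNext3Years = c(.04, .06, .09)
)

real_names <- grep("^realReturnNext", names(data), value = TRUE)
for (r in real_names) {
  # Name of the matching inflation column, e.g. inflationNext2Years
  infl <- sub("^realReturn", "inflation", r)
  if (infl %in% names(data)) {
    # Blank out inflation wherever the real return is missing
    data[[infl]][is.na(data[[r]])] <- NA
  }
}
data
```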

Replace NA with mean value for specific day and grid

I am using a large dataset (286,212 rows, 19 columns), bigger than I am used to, and I am not sure how to go about my problem. The data is made up of values for each day of the year for 782 grid references, and I have this for 15 years. It looks as follows:
Month Day Grid x2004 x2005 x2006 x2007
    1   1  A10 0.091 0.134    NA 0.066
    1   2  A10  0.12  0.10  0.23 0.054
    1   3  A10  0.55    NA    NA  0.08
    1   1  B10    NA 0.134    NA  0.17
    1   2  B10  0.14 0.151    NA  0.21
    1   3  B10  0.43 0.162  0.24    NA
However, some of the days are missing, and I want to insert the mean of that day for that specific grid using values from the other years. So if Grid A10 for day 1 in 2006 is missing, I want to insert the mean for day 1, grid A10 from 2004, 2005 and 2007, in this case 0.097.
I am trying the following code
ind <- which(is.na(data$x2005))
data$x2005[ind] <- sapply(ind, function(i)
with(data, rowMeans(data[c(data$x2004[i], data$x2006[i], data$x2007[i], data$x2008[i], data$x2009[i],
data$x2010[i], data$x2011[i], data$x2012[i],
data$x2013[i], data$x2014[i], data$x2015[i],
data$x2016[i], data$x2017[i]),], na.rm=TRUE)))
and I plan to do that for all years but it is telling me
"Error in rowMeans(data[c(data$x2006[i], data$x2007[i], data$x2012[i]), :
'x' must be numeric"
Although when I check the class, it says that they are all numeric, so I am not sure why x is not numeric. I also don't know whether, once I get the mean part sorted, the code will give me the mean specific to each grid and day.
Please help. Thanks.
Can you adapt this to your code:
for(i in 1:ncol(data)){
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
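Note that the loop above fills each gap with the mean of the whole year column, not the mean for that day and grid. Since every row in the question is already one Month/Day/Grid combination, the day-and-grid mean is simply the row mean across the year columns. The error in the question most likely arises because data[c(...), ] uses the values as row numbers and keeps every column, including the character Grid column. Here is a rough sketch, assuming the year columns are named x followed by four digits:

```r
data <- data.frame(
  Month = 1, Day = rep(1:3, 2), Grid = rep(c("A10", "B10"), each = 3),
  x2004 = c(0.091, 0.12, 0.55, NA, 0.14, 0.43),
  x2005 = c(0.134, 0.10, NA, 0.134, 0.151, 0.162),
  x2006 = c(NA, 0.23, NA, NA, NA, 0.24),
  x2007 = c(0.066, 0.054, 0.08, 0.17, 0.21, NA)
)

# Work only on the numeric year columns
year_cols <- grep("^x[0-9]{4}$", names(data))
m <- as.matrix(data[year_cols])

# Mean over the available years for each Month/Day/Grid row
row_means <- rowMeans(m, na.rm = TRUE)

# Replace each missing cell with the mean of its own row
na_pos <- which(is.na(m), arr.ind = TRUE)
m[na_pos] <- row_means[na_pos[, "row"]]
data[year_cols] <- as.data.frame(m)
data
```

A row with no values in any year would get NaN, so you may want to check for that separately.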

tm package: Output of findAssocs() in a matrix instead of a list in R

Consider the following list:
library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude)
a <- findAssocs(tdm, c("oil", "opec", "xyz"), c(0.7, 0.75, 0.1))
How do I manage to have a data frame with all terms associated with these 3 words in the columns and showing:
The corresponding correlation coefficient (if it exists)
NA if it does not exist for this word (for example, the pair (oil, they) would show NA)
Here's a solution using reshape2 to help reshape the data
library(reshape2)
aa<-do.call(rbind, Map(function(d, n)
cbind.data.frame(
xterm=if (length(d)>0) names(d) else NA,
cor=if(length(d)>0) d else NA,
term=n),
a, names(a))
)
dcast(aa, term~xterm, value.var="cor")
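The same reshaping can be sketched in base R without tm or reshape2, assuming (as the code above does) that findAssocs() returns a named list of named numeric vectors. The list a below is a toy stand-in with made-up numbers, not real output from the crude corpus, and assoc_mat is just an illustrative name:

```r
# Toy stand-in for the list returned by findAssocs(); the numbers are made up
a <- list(
  oil  = c(market = 0.75, opec = 0.87),
  opec = c(meeting = 0.88, oil = 0.87),
  xyz  = numeric(0)
)

# Every associated term that appears for at least one word
xterms <- sort(unique(unlist(lapply(a, names))))

# One row per word; indexing by name yields NA where a term is absent
assoc_mat <- do.call(rbind, lapply(a, function(d) unname(d[xterms])))
colnames(assoc_mat) <- xterms
assoc_mat
```

Use as.data.frame(assoc_mat) if you need a data frame rather than a matrix.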
Or you could use dplyr and tidyr
library(dplyr)
library('devtools')
install_github('hadley/tidyr')
library(tidyr)
a1 <- unnest(lapply(a, function(x) data.frame(xterm=names(x),
cor=x, stringsAsFactors=FALSE)), term)
a1 %>%
spread(xterm, cor) #here it removed terms without any `cor` for the `xterm`
# term 15.8 ability above agreement analysts buyers clearly emergency fixed
#1 oil 0.87 NA 0.76 0.71 0.79 0.70 0.8 0.75 0.73
#2 opec 0.85 0.8 0.82 0.76 0.85 0.83 NA 0.87 NA
# late market meeting prices prices. said that they trying who winter
#1 0.8 0.75 0.77 0.72 NA 0.78 0.73 NA 0.8 0.8 0.8
#2 NA NA 0.88 NA 0.79 0.82 NA 0.8 NA NA NA
Update
aNew <- sapply(tdm$dimnames$Terms, function(i) findAssocs(tdm, i, corlimit=0.95))
aNew2 <- aNew[!!sapply(aNew, function(x) length(dim(x)))]
aNew3 <- unnest(lapply(aNew2, function(x) data.frame(xterm=rownames(x),
cor=x[,1], stringsAsFactors=FALSE)[1:3,]), term)
res <- aNew3 %>%
spread(xterm, cor)
dim(res)
#[1] 1021 160
res[1:3,1:5]
# term ... 100,000 10.8 1.1
#1 ... NA NA NA NA
#2 100,000 NA NA NA 1
#3 10.8 NA NA NA NA

sapply? tapply? ddply? dataframe variable based on rolling index of previous values of another variable

I haven't found something which precisely matches what I need, so I thought I'd post this.
I have a number of functions which basically rely on a rolling index of a variable, with a function, and should naturally flow back into the dataframe they came from.
For example,
data<-as.data.frame(as.matrix(seq(1:30)))
data$V1<-data$V1/100
str(data)
data$V1_MA5d<-NA # rolling 5 day product
for (i in 5:nrow(data)){
start<-i-5
end<-i
data$V1_MA5d[i]<- (prod(((data$V1[start:end]/100)+1))-1)*100
}
data
> head(data,15)
V1 V1_MA5d
1 0.01 NA
2 0.02 NA
3 0.03 NA
4 0.04 NA
5 0.05 0.1500850
6 0.06 0.2101751
7 0.07 0.2702952
8 0.08 0.3304453
9 0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389
But really, I should be able to do something like:
data$V1_MA5d<-sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100
But I'm not sure what that would look like.
Likewise, the count of a variable by another variable:
data$V1_MA5_cat<-NA
data$V1_MA5_cat[data$V1_MA5d<.5]<-0
data$V1_MA5_cat[data$V1_MA5d>.5]<-1
data$V1_MA5_cat[data$V1_MA5d>1.5]<-2
table(data$V1_MA5_cat)
data$V1_MA5_cat_n<-NA
data$V1_MA5_cat_n[data$V1_MA5_cat==0]<-nrow(subset(data,V1_MA5_cat==0))
data$V1_MA5_cat_n[data$V1_MA5_cat==1]<-nrow(subset(data,V1_MA5_cat==1))
data$V1_MA5_cat_n[data$V1_MA5_cat==2]<-nrow(subset(data,V1_MA5_cat==2))
> head(data,15)
V1 V1_MA5d V1_MA5_cat V1_MA5_cat_n
1 0.01 NA NA NA
2 0.02 NA NA NA
3 0.03 NA NA NA
4 0.04 NA NA NA
5 0.05 0.1500850 0 6
6 0.06 0.2101751 0 6
7 0.07 0.2702952 0 6
8 0.08 0.3304453 0 6
9 0.09 0.3906255 0 6
10 0.10 0.4508358 0 6
11 0.11 0.5110762 1 17
12 0.12 0.5713467 1 17
13 0.13 0.6316473 1 17
14 0.14 0.6919780 1 17
15 0.15 0.7523389 1 17
I know there is a better way - help!
You can do this in one of a few ways. It's worth mentioning here that you did write a "correct" for loop in R. You preallocated the vector by assigning data$V1_MA5d <- NA. This way you are filling rather than growing, and it's actually fairly efficient. However, if you want to use the apply family:
sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
[1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766
Notice that my code inside the [] is different from yours. Check out the difference:
i <- 10
i - 5:i
(i-5):i
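The reason is operator precedence: in R the colon operator binds more tightly than binary minus, so i - 5:i is read as i - (5:i). A small sketch with the two results stored so they can be compared:

```r
i <- 10

# Parsed as i - (5:i): subtracts each of 5, 6, ..., 10 from i
without_parens <- i - 5:i

# Parsed as intended: the sequence from i - 5 up to i
with_parens <- (i - 5):i

without_parens  # 5 4 3 2 1 0
with_parens     # 5 6 7 8 9 10
```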
Or you can use rollapply from the zoo package:
library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
[1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569
As per the comment, this will give you a vector of length 26... instead you can add a few arguments to rollapply to make it match with your initial data:
rollapply(data$V1, 5, myfun, fill=NA, align='right')
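The rollapply numbers differ from the loop's because the windows differ: (i-5):i spans six elements, not five, and for i = 5 the index 0 is silently dropped, which is why the first value happens to agree. Here is a minimal base R sketch that makes the window width explicit; roll_compound is a hypothetical helper name of my own, not a zoo function:

```r
# Hypothetical helper: trailing compounded product over `width` values,
# padded with NA at the start so it lines up with the input
roll_compound <- function(x, width) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      window <- x[(i - width + 1):i]
      out[i] <- (prod(window / 100 + 1) - 1) * 100
    }
  }
  out
}

V1 <- (1:30) / 100
five <- roll_compound(V1, 5)  # same windows as rollapply(..., 5, align='right')
six  <- roll_compound(V1, 6)  # same windows as the loop's (i-5):i from row 6 on
head(cbind(V1, five, six), 8)
```

Pick whichever width you actually intended; a true 5-day window corresponds to (i-4):i in the loop.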
In regard to your second question, plyr is handy here.
library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))
But there are many other choices too.
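One of those other choices, sketched in base R: cut() assigns the category, table() counts each category once, and match() looks the count up for every row. The vector vals is a small made-up example standing in for data$V1_MA5d, and the labels 0, 1, 2 mirror the categories in the question:

```r
# Made-up stand-in for data$V1_MA5d, including a leading NA
vals <- c(NA, 0.2, 0.4, 0.7, 1.2, 1.6)

# Category per value: (-Inf, 0.5] -> 0, (0.5, 1.5] -> 1, (1.5, Inf) -> 2
cat_label <- cut(vals, breaks = c(-Inf, 0.5, 1.5, Inf), labels = c("0", "1", "2"))

# Count of rows in each category
counts <- table(cat_label)

# Look up each row's category count; rows with NA category stay NA
cat_n <- as.vector(counts)[match(as.character(cat_label), names(counts))]

data.frame(vals, cat_label, cat_n)
```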

Resources