Assign different values to a large number of columns - r

I have a large set of financial data that has hundreds of columns. I have cleaned and sorted the data based on date. Here is a simplified example:
df1 <- data.frame(matrix(vector(),ncol=5, nrow = 4))
colnames(df1) <- c("Date","0.4","0.3","0.2","0.1")
df1[1,] <- c("2000-01-31","0","0","0.05","0.07")
df1[2,] <- c("2000-02-29","0","0.13","0.17","0.09")
df1[3,] <- c("2000-03-31","0.03","0.09","0.21","0.01")
df1[4,] <- c("2004-04-30","0.05","0.03","0.19","0.03")
df1
Date 0.4 0.3 0.2 0.1
1 2000-01-31 0 0 0.05 0.07
2 2000-02-29 0 0.13 0.17 0.09
3 2000-03-31 0.03 0.09 0.21 0.01
4 2000-04-30 0.05 0.03 0.19 0.03
I assigned individual weights (based on market value from the raw data) as column headers, because I don’t care about the company names and I need the weights for calculating the result.
My ultimate goal is to get: 1. Sum of the weighted returns; and 2. Sum of the weights when returns are non-zero. With that being said, below is the result I want to get:
Date SWeightedR SWeights
1 2000-01-31 0.017 0.3
2 2000-02-29 0.082 0.6
3 2000-03-31 0.082 1
4 2000-04-30 0.07 1
For instance, the SWeightedR for 2000-01-31 = 0.4x0+0.3x0+0.2x0.05+0.1x0.07, and SWeights = 0.2+0.1.
My initial idea was to assign the weights to each column like WCol2 <- 0.4, then use cbind to create new columns and use c(as.matrix() %*% ) to get the sums. Soon I realize that this is impossible as there are hundreds of columns. Any advice or suggestion is appreciated!

Here's a simple solution using matrix multiplications (as you were suggesting yourself).
First of all, your data seem to be of character type and I'm not sure it's the real case with the real data, but I would first convert it to an appropriate type
df1[-1] <- lapply(df1[-1], type.convert)
Next, we will convert the column names to a numeric class too
vec <- as.numeric(names(df1)[-1])
Finally, we could easily create the new columns in two simple steps. This indeed has a to matrix conversion overhead, but maybe you should work with matrices in the first place. Either way, this is fully vectorized
df1["SWeightedR"] <- as.matrix(df1[, -1]) %*% vec
df1["SWeights"] <- (df1[, -c(1, ncol(df1))] > 0) %*% vec
df1
# Date 0.4 0.3 0.2 0.1 SWeightedR SWeights
# 1 2000-01-31 0.00 0.00 0.05 0.07 0.017 0.3
# 2 2000-02-29 0.00 0.13 0.17 0.09 0.082 0.6
# 3 2000-03-31 0.03 0.09 0.21 0.01 0.082 1.0
# 4 2004-04-30 0.05 0.03 0.19 0.03 0.070 1.0
Or, you could convert to a long format first (here's a data.table example), though I believe it will be less efficient as this are basically by row operations
library(data.table)
res <- melt(setDT(df1), id = 1L, variable.factor = FALSE
)[, c("value", "variable") := .(as.numeric(value), as.numeric(variable))]
res[, .(SWeightedR = sum(variable * value),
SWeights = sum(variable * (value > 0))), by = Date]
# Date SWeightedR SWeights
# 1: 2000-01-31 0.017 0.3
# 2: 2000-02-29 0.082 0.6
# 3: 2000-03-31 0.082 1.0
# 4: 2004-04-30 0.070 1.0

Related

Sample variables from two columns in R

I have tried to sample values of two columns that are related (diversification rates of several siter groups), but I have no idea of how to do it. I am trying with sample function, but it limits me so I cannot choose any further condition.
df<-data.frame("M"=c(0.06,0.14,0.05,0.07), "H"=c(0.06,0.08,0.04,0.05))
df
# M H
# 1 0.06 0.06
# 2 0.14 0.08
# 3 0.05 0.04
# 4 0.07 0.05
sample(df,size=1000,replace=TRUE)
When I use this command, it resamples rows and columns:
H M M.1 M.2 M.3
1 0.06 0.06 0.06 0.06 0.06
2 0.08 0.14 0.14 0.14 0.14
3 0.04 0.05 0.05 0.05 0.05
4 0.05 0.07 0.07 0.07 0.07
...
But I want it to only sample one value from each row, and go to the next row with the same condition until the end of the rows. Finally, when there are no more rows, it should start all over again up to size=1000 so I can have a vector of length 1000.
Example of what I want (r = row, c = column): 0.06(r1c1), 0.14(r2c1), 0.05(r3c1), 0.05(r4c2), 0.06(r1c2), 0.14(r2c1),0.03(r3c2), 0.07(r4c1) and so on.
Thank you in advance for your help!
EDITED:
I think that what I am looking for is something like a loop function, but I still do not know how to do it.
You should first create an indexing matrix of two columns (row index and column index), then index the original matrix with it.
idx <- matrix(c(rep(1:4,250), sample(1:2, 1000, replace=T)), ncol=2)
res <- as.matrix(df)[idx]
With your specifications, you'll need to use a custom function.
Here's one small way to do it:
myfunc <- function(dataframe, nsamples = 1000){
rows = ((0:nsamples)%%nrow(df)) + 1 #use the %% to get the row to sample
cols = sample(ncol(df), nsamples, replace = TRUE) #and the cols
sapply(1:nsamples, function(x){df[rows[x],cols[x]]}) #sapply to get as a vector
}
myfunc(df,10)
[1] 0.08 0.05 0.07 0.06 0.08 0.05 0.05 0.06 0.08 0.05

Mean of quartile for multiple columns and multiple dates

I'm trying to find the mean forward return (column fwd_rtn) of each quartile for each column (ie for quartiles for PB, PE, PS) for each date group (1/1/2016...1/4/2016)
head(df)
Date Stock Price PB PE PS fwd_rtn
1 1/1/2016 A 11.90 0.4 0.10 0.57 -0.015
2 1/1/2016 B 3.56 0.8 0.09 0.26 -0.036
3 1/1/2016 C 1.29 1.2 0.18 1.60 0.10
......
4 1/4/2016 A 12.80 0.39 0.13 0.53 -0.01
5 1/4/2016 B 4.03 0.76 0.08 0.23 0.02
6 1/4/2016 C 1.83 0.87 0.14 1.16 0.03
So far i have been able to find the mean return for 1 column for 1 date using this code
df$qPB <- cut(df$PB, breaks = quantile(df$PB, c(0,.25,.5,.75,1)),include.lowest = TRUE)
aggregate(df$fwd_rtn,list(qPB = df$qPB),FUN=mean)
which gave me the right answers. But I'm struggling to do it for the multiple columns. I think I'm supposed to use dplyr and the gather() function but i dont know how.
To get quartiles of a single variable by date you can use the ave function:
df$qPB <- ave(df$PB, df$Date, FUN= function(i) cut(i, breaks = quantile(df$PB,
c(0,.25,.5,.75,1)),include.lowest = TRUE)
# a minor addition to aggregate
aggregate(df$fwd_rtn, list("qPB"=df$qPB, "date"=df$Date), FUN=mean)
You should take a look at using lapply or sapply to move through multiple columns.

Use value from the previous row for manipulation of multiple column at once in R

Hi everyone I need a help.
I have the data-set similar to this which contain multiple rows and columns.
df<- data.frame(A=seq(0.01,0.05,0.01),
B=c(0.01, -0.24, 0, -0.21, 0),
C=seq(0.03,0.07,0.01),
D=c(0.4,0.5,0,0,2))
I used shift command and created another row E.
df[ , E := shift(A)+A]
Now I want to apply similar function to whole data frame df and create row F, G, H similar to E using similar method at once.
Thank you.
If we are using data.table, the shift can take multiple columns at once and output the lag of those. If we are not selecting any particular sets of columns, specifying shift(.SD) (.SD represents the Subset of Data.table) gives the lag of all the columns in the dataset. By assigning (:=) it to different column names (LETTERS[5:8]), we get a new set of lag columns in the original dataset.
library(data.table)
setDT(df)[, LETTERS[5:8] := shift(.SD)+.SD]
df
# A B C D E F G H
#1: 0.01 0.01 0.03 0.4 NA NA NA NA
#2: 0.02 -0.24 0.04 0.5 0.03 -0.23 0.07 0.9
#3: 0.03 0.00 0.05 0.0 0.05 -0.24 0.09 0.5
#4: 0.04 -0.21 0.06 0.0 0.07 -0.21 0.11 0.0
#5: 0.05 0.00 0.07 2.0 0.09 -0.21 0.13 2.0
Or we can loop through lapply
setDT(df)[, LETTERS[5:8] := lapply(.SD, function(x) x+shift(x))]
Here is an alternative for this:
new_cols <- c('E','F','G','H')
old_cols <- colnames(df)
for (i in seq_along(new_cols)){
eval(parse(text = paste0("df[,",new_cols[i],":= shift(",old_cols[i],")+",old_cols[i],"]")))
}

Round multiple vectors in dataframe with plyr

The numbers in this data.frame are rounded to 3 decimal places:
habitats_df <- data.frame(habitat = c("beach", "grassland", "freshwater"), v1 = c(0.000, 0.670, 0.032), v2 = c(0.005, 0.824, 0.012))
habitat v1 v2
1 beach 0.000 0.005
2 grassland 0.670 0.824
3 freshwater 0.032 0.012
I need them rounded to 2 decimal places. I tried to use plyr::l_ply like this:
library(plyr)
l_ply(habitats_df[,2:3], function(x) round(x, 2))
But it didn't work. How can I use plyr:: l_ply to round the numbers in habitats_df?
You don't really need plyr for this, since a simply lapply combined with round does the trick. I provide a solution in base R as well as plyr
Try this in base R:
roundIfNumeric <- function(x, n=1)if(is.numeric(x)) round(x, n) else x
as.data.frame(
lapply(habitats_df, roundIfNumeric, 2)
)
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
And the same with plyr:
library(plyr)
quickdf(llply(habitats_df, roundIfNumeric, 2))
habitat v1 v2
1 beach 0.00 0.00
2 grassland 0.67 0.82
3 freshwater 0.03 0.01
# plyr alternative
library(plyr)
data.frame(habitat = habitats_df$habitat,
numcolwise(.fun = function(x) round(x, 2))(habitats_df))
# habitat v1 v2
# 1 beach 0.00 0.00
# 2 grassland 0.67 0.82
# 3 freshwater 0.03 0.01
# base alternative
data.frame(habitat = habitats_df$habitat,
lapply(habitats_df[ , -1], function(x) round(x, 2)))

Speed up `strsplit` when possible output are known

I have a large data frame with a factor column that I need to divide into three factor columns by splitting up the factor names by a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
data <- readRDS("data.rds")
data.df <- reshape2:::melt.array(data)
head(data.df)
## Time Location Class Replicate Population
##1 1 1 LIDE.1.S 1 0.03859605
##2 2 1 LIDE.1.S 1 0.03852957
##3 3 1 LIDE.1.S 1 0.03846853
##4 4 1 LIDE.1.S 1 0.03841260
##5 5 1 LIDE.1.S 1 0.03836147
##6 6 1 LIDE.1.S 1 0.03831485
Rprof("str.out")
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
Rprof(NULL)
head(data.df)
## Time Location Species SizeClass Infected Replicate Population
##1 1 1 LIDE 1 S 1 0.03859605
##2 2 1 LIDE 1 S 1 0.03852957
##3 3 1 LIDE 1 S 1 0.03846853
##4 4 1 LIDE 1 S 1 0.03841260
##5 5 1 LIDE 1 S 1 0.03836147
##6 6 1 LIDE 1 S 1 0.03831485
summaryRprof("str.out")
$by.self
self.time self.pct total.time total.pct
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"do.call" 0.04 1.49 2.54 94.78
"unique.default" 0.04 1.49 0.04 1.49
"data.frame" 0.02 0.75 0.12 4.48
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
$by.total
total.time total.pct self.time self.pct
"do.call" 2.54 94.78 0.04 1.49
"strsplit" 1.34 50.00 1.34 50.00
"<Anonymous>" 1.16 43.28 1.16 43.28
"cbind" 0.14 5.22 0.00 0.00
"data.frame" 0.12 4.48 0.02 0.75
"as.data.frame.matrix" 0.08 2.99 0.00 0.00
"as.data.frame" 0.08 2.99 0.00 0.00
"as.factor" 0.08 2.99 0.00 0.00
"factor" 0.06 2.24 0.00 0.00
"unique.default" 0.04 1.49 0.04 1.49
"unique" 0.04 1.49 0.00 0.00
"is.factor" 0.02 0.75 0.02 0.75
"match" 0.02 0.75 0.02 0.75
"structure" 0.02 0.75 0.02 0.75
"unlist" 0.02 0.75 0.02 0.75
"[.data.frame" 0.02 0.75 0.00 0.00
"[" 0.02 0.75 0.00 0.00
$sample.interval
[1] 0.02
$sampling.time
[1] 2.68
Is there any way to speed up this operation? I note that there are a small (<5) number of each of the categories "Species", "SizeClass", and "Infected", and I know what these are in advance.
Notes:
stringr::str_split_fixed performs this task, but not any faster
The data frame is actually initially generated by calling reshape::melt on an array in which Class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds at http://dl.getdropbox.com/u/3356641/data.rds
This should probably offer quite an increase:
library(data.table)
DT <- data.table(data.df)
DT[, c("Species", "SizeClass", "Infected")
:= as.list(strsplit(Class, "\\.")[[1]]), by=Class ]
The reasons for the increase:
data.table pre allocates memory for columns
every column assignment in data.frame reassigns the entirety of the data (data.table in contrast does not)
the by statement allows you to implement the strsplit task once per each unique value.
Here is a nice quick method for the whole process.
# Save the new col names as a character vector
newCols <- c("Species", "SizeClass", "Infected")
# split the string, then convert the new cols to columns
DT[, c(newCols) := as.list(strsplit(as.character(Class), "\\.")[[1]]), by=Class ]
DT[, c(newCols) := lapply(.SD, factor), .SDcols=newCols]
# remove the old column. This is instantaneous.
DT[, Class := NULL]
## Have a look:
DT[, lapply(.SD, class)]
# Time Location Replicate Population Species SizeClass Infected
# 1: integer integer integer numeric factor factor factor
DT
You could get a decent increase in speed by just extracting the parts of the string you need using gsub instead of splitting everything up and trying to put it back together:
data <- readRDS("~/Downloads/data.rds")
data.df <- reshape2:::melt.array(data)
# using `strsplit`
system.time({
cl <- which(names(data.df)=="Class")
Classes <- do.call(rbind, strsplit(as.character(data.df$Class), "\\."))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
data.df <- cbind(data.df[,1:(cl-1)],Classes,data.df[(cl+1):(ncol(data.df))])
})
user system elapsed
3.349 0.062 3.411
#using `gsub`
system.time({
data.df$Class <- as.character(data.df$Class)
data.df$SizeClass <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\2", data.df$Class,
perl = TRUE)
data.df$Infected <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\3", data.df$Class,
perl = TRUE)
data.df$Class <- gsub("(\\w+)\\.(\\d+)\\.(\\w+)", "\\1", data.df$Class,
perl = TRUE)
})
user system elapsed
0.812 0.037 0.848
Looks like you have a factor, so work on the levels and then map back. Use fixed=TRUE in strsplit, adjusting to split=".".
Classes <- do.call(rbind, strsplit(levels(data.df$Class), ".", fixed=TRUE))
colnames(Classes) <- c("Species", "SizeClass", "Infected")
df0 <- as.data.frame(Classes[data.df$Class,], row.names=NA)
cbind(data.df, df0)

Resources