Looping arithmetic calculation between tables - r

I have a table that looks like this:
table1
Not Visible Visible <NA>
All 0.29 0.50 0.20
Bowtie 0.24 0.17 0.59
Cola 0.15 0.83 0.02
Squig 0.49 0.51 0.49
I then have 9 other similar tables. Below is an example:
table2
Not Visible Visible <NA>
All 0.28 0.50 0.23
Bowtie 0.11 0.30 0.59
Cola 0.30 0.67 0.03
Squig 0.42 0.51 0.06
I want the result of table1 - table2, as shown below, but I also want table1 minus each of the other 9 tables.
Not Visible Visible <NA>
All 0.01 0.00 -0.03
Bowtie 0.13 -0.13 0.00
Cola -0.15 0.16 -0.01
Squig 0.07 0.00 0.43
How do I do this without writing table1 - table2; table1 - table3; table1 - table4, etc.?
If I try looping with the code below (as an example), I get the "non-numeric argument to binary operator" error:
Tables <- c("table1", "table2") ## as an example
for (r in Tables) {
  yy <- paste(r, "res", sep = "-")
  zz <- table1 - r
  assign(yy, zz)
}
Any ideas?

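In your loop, r is a character string such as "table2", so table1 - r tries to subtract text from numbers. Consider using a list of the tables themselves (not strings holding their names) and then lapply(); the resulting list can be saved as individual tables or bound into a data frame: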
# LIST OF TABLES WITH NAMED ELEMENTS (t1 NOT INCLUDED)
tables <- setNames(list(t2, t3, t4, t5, t6, t7, t8, t9),
                   c("table2", "table3", "table4", "table5",
                     "table6", "table7", "table8", "table9"))
# ITERATIVELY SUBTRACT FROM t1
tableList <- lapply(tables, function(x) t1 - x)
# SAVE EACH TABLE AS SEPARATE OBJECTS
list2env(tableList, envir=.GlobalEnv)
# DATAFRAME BINDING - WIDE FORMAT (INCLUDING t1)
df <- as.data.frame(cbind(t1, do.call(cbind, tableList)))
# DATAFRAME BINDING - LONG FORMAT (INCLUDING t1)
df <- as.data.frame(rbind(t1, do.call(rbind, tableList)))
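If the tables already exist under names such as table2, table3 and so on, the list can also be built from those names with mget(), which looks the objects up instead of passing the strings themselves to the subtraction. A minimal sketch, assuming small 2x2 stand-ins for the real tables (the values in table3 are made up for illustration):

```r
# Small stand-ins for the real tables (table3 values are made up)
dn <- list(c("All", "Bowtie"), c("Not Visible", "Visible"))
table1 <- matrix(c(0.29, 0.24, 0.50, 0.17), nrow = 2, dimnames = dn)
table2 <- matrix(c(0.28, 0.11, 0.50, 0.30), nrow = 2, dimnames = dn)
table3 <- matrix(c(0.20, 0.14, 0.45, 0.27), nrow = 2, dimnames = dn)

# mget() turns a character vector of names into a named list of objects
others <- mget(paste0("table", 2:3))

# Subtract each table from table1; the result keeps the list names
diffs <- lapply(others, function(x) table1 - x)
names(diffs) <- paste0(names(diffs), "_res")

diffs$table2_res
```

The suffix uses an underscore rather than a hyphen so that the names stay syntactically valid if they are later exported with list2env().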

If the tables are data frames with the same column names, you could try this for a single pair without looping:
z=names(table1)
table3 = table1[z]-table2[z]

Related

How to use the `map` family of functions in the **purrr** package to swap columns across rows in a data frame?

Imagine there are 4 cards on the desk and there are several rows of them (e.g., 5 rows in the demo). The value of each card is already listed in the demo data frame. However, the exact position of the card is indexed by the pos columns, see the demo data I generated below.
To restore them, I reorder the cards in each row with [] so that each card's value goes back to its original position. The following code already does this. To avoid the explicit loop, I wonder whether I can get the same effect with a vectorized function from the tidyverse family, e.g. pmap or a related function in purrr?
# 1. data generation ------------------------------------------------------
rm(list=ls())
vect<-matrix(round(runif(20),2),nrow=5)
colnames(vect)<-paste0('card',1:4)
order<-rbind(c(2,3,4,1),c(3,4,1,2),c(1,2,3,4),c(4,3,2,1),c(3,4,2,1))
colnames(order)=paste0('pos',1:4)
dat<-data.frame(vect,order,stringsAsFactors = F)
# 2. data swap ------------------------------------------------------------
for (i in 1:dim(dat)[1]){
  orders=dat[i,paste0('pos',1:4)]
  card=dat[i,paste0('card',1:4)]
  vec<-card[order(unlist(orders))]
  names(vec)=paste0('deck',1:4)
  dat[i,paste0('deck',1:4)]<-vec
}
dat
You could use pmap_dfr:
card_cols <- grep('card', names(dat))
pos_cols <- grep('pos', names(dat))
dat[paste0('deck', seq_along(card_cols))] <- purrr::pmap_dfr(dat, ~{
  x <- c(...)
  as.data.frame(t(unname(x[card_cols][order(x[pos_cols])])))
})
dat
# card1 card2 card3 card4 pos1 pos2 pos3 pos4 deck1 deck2 deck3 deck4
#1 0.05 0.07 0.16 0.86 2 3 4 1 0.86 0.05 0.07 0.16
#2 0.20 0.98 0.79 0.72 3 4 1 2 0.79 0.72 0.20 0.98
#3 0.50 0.79 0.72 0.10 1 2 3 4 0.50 0.79 0.72 0.10
#4 0.03 0.98 0.48 0.06 4 3 2 1 0.06 0.48 0.98 0.03
#5 0.41 0.72 0.91 0.84 3 4 2 1 0.84 0.91 0.41 0.72
One thing to note: make sure the output of the function passed to pmap_dfr does not carry the original column names. If it does, the rows are bound by name, which puts the values back into their original columns, and the output is not in the correct order. I use unname here to remove the names.
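For comparison, the same reordering can be done without purrr by using base R matrix indexing, which is a different technique from the pmap_dfr answer. A minimal sketch, assuming the card values and positions from the printed output above are held in two plain matrices (the names cards, pos, ord, idx and deck are chosen here for illustration):

```r
# Card values and positions copied from the printed output above
cards <- matrix(c(0.05, 0.07, 0.16, 0.86,
                  0.20, 0.98, 0.79, 0.72,
                  0.50, 0.79, 0.72, 0.10,
                  0.03, 0.98, 0.48, 0.06,
                  0.41, 0.72, 0.91, 0.84), nrow = 5, byrow = TRUE)
pos <- rbind(c(2, 3, 4, 1), c(3, 4, 1, 2), c(1, 2, 3, 4),
             c(4, 3, 2, 1), c(3, 4, 2, 1))

# For every row, the column order that sorts the positions
ord <- t(apply(pos, 1, order))

# Pair each row number with the column to pull from that row,
# then index the card matrix with the two-column (row, col) matrix
idx <- cbind(c(row(ord)), c(ord))
deck <- matrix(cards[idx], nrow = nrow(cards))

deck[1, ]
```

Because both c(row(ord)) and c(ord) unroll column by column, refilling with matrix() puts every value back in the right cell.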

Multiply each value of a dataframe by a row of another dataframe searched by id

I'm new to R and I have tried a lot to solve this problem. If anyone could help me I'd be very grateful! This is my problem:
I have two data frames (df1 and df2) and what I need is to multiply each value of df1 by the row of df2 with the same ID. This is an example of what I'm looking for:
df1<-data.frame(ID=c(1,2,3), x1=c(6,3,2), x2=c(2,3,1), x3=c(4,10,7))
df1
df2<-data.frame(ID=c(1,2,3), y1=c(0.01,0.02,0.05), y2=c(0.2,0.03,0.11), y3=c(0.3,0.09,0.07))
df2
#Example of what I need
df1xdf2<- data.frame(ID=c(1,2,3), r1=c(0.06,0.06,0.1), r2=c(1.2,0.09,0.22), r3=c(1.8,0.27,0.14),
r4=c(0.02,0.06,0.05),r5=c(0.4,0.09,0.11),r6=c(0.6,0.27,0.07),r7=c(0.04,0.2,0.35),r8=c(0.8,0.3,0.77),r9=c(1.2,0.9,0.49))
df1xdf2
I've tried loops by row and by column, but I only get a 1x1 multiplication.
My data frames have the same number of rows and columns and the same factor names. My real data frames are much larger, in both rows and columns.
Does anyone know how to solve it?
You could use lapply to multiply every column of df1 by the whole of df2, then cbind the resulting data frames together and rename the columns:
output <- do.call(cbind, lapply(df1[-1], `*`, df2[-1]))
cbind(df1[1], setNames(output, paste0("r", seq_along(output))))
# ID r1 r2 r3 r4 r5 r6 r7 r8 r9
#1 1 0.06 1.20 1.80 0.02 0.40 0.60 0.04 0.80 1.20
#2 2 0.06 0.09 0.27 0.06 0.09 0.27 0.20 0.30 0.90
#3 3 0.10 0.22 0.14 0.05 0.11 0.07 0.35 0.77 0.49
You could use the dplyr package:
# Example with dplyr
library(dplyr)
# First we use merge() to join both data frames by ID
result <- merge(df1, df2, by = "ID") %>%
  mutate(r1 = x1*y1,
         r2 = x1*y2,
         r3 = x1*y3) # and so on for the remaining x/y combinations
Within mutate() you can specify your new column names and formulas.
An option with map
library(tidyverse)
bind_cols(df1[1], map_dfc(df1[-1], `*`, df2[-1]))
Or in base R by replicating the columns and multiplying
out <- cbind(df1[1], df1[-1][rep(seq_along(df1[-1]), each = 3)] *
df2[-1][rep(seq_along(df2[-1]), 3)])
names(out)[-1] <- paste0("r", seq_along(out[-1]))
out
# ID r1 r2 r3 r4 r5 r6 r7 r8 r9
#1 1 0.06 1.20 1.80 0.02 0.40 0.60 0.04 0.80 1.20
#2 2 0.06 0.09 0.27 0.06 0.09 0.27 0.20 0.30 0.90
#3 3 0.10 0.22 0.14 0.05 0.11 0.07 0.35 0.77 0.49
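The answers above rely on the rows of df1 and df2 already being in the same ID order. If that is not guaranteed, one option is to align df2 to df1 with match() first. A sketch of the replicate-and-multiply idea, assuming a hypothetical df2 whose rows are shuffled, and deriving the repeat counts from the column counts rather than hard-coding 3:

```r
df1 <- data.frame(ID = c(1, 2, 3), x1 = c(6, 3, 2), x2 = c(2, 3, 1), x3 = c(4, 10, 7))
# Same values as df2 in the question, but with the rows shuffled (assumption)
df2 <- data.frame(ID = c(3, 1, 2),
                  y1 = c(0.05, 0.01, 0.02),
                  y2 = c(0.11, 0.20, 0.03),
                  y3 = c(0.07, 0.30, 0.09))

# Reorder df2 so that row i has the same ID as row i of df1
df2_aligned <- df2[match(df1$ID, df2$ID), ]
rownames(df2_aligned) <- NULL

nx <- ncol(df1) - 1  # number of value columns in df1
ny <- ncol(df2) - 1  # number of value columns in df2

# Repeat each df1 column ny times, and cycle through the df2 columns nx times
x_rep <- df1[-1][rep(seq_len(nx), each = ny)]
y_rep <- df2_aligned[-1][rep(seq_len(ny), times = nx)]

out <- cbind(df1["ID"], x_rep * y_rep)
names(out)[-1] <- paste0("r", seq_len(nx * ny))
out
```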

apply a function on columns with specific names

I am new to R.
I have hundreds of data frames like this
ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07
This is just an example. The number and names of the Ratio_ columns differ between data frames, but all of them start with Ratio_. I want to apply a function (for example, log(x)) to the Ratio_ columns without specifying the column numbers or the full names.
I know how to do it df by df, for the one in the example:
A <- function(x) log(x)
df_log<-data.frame(df[1:2], lapply(df[3:6], A))
but I have a lot of them, and as I said the number of columns is different in each.
Any suggestion?
Thanks
Place the datasets in a list and then loop over the list elements:
lapply(lst, function(x) {
  i1 <- grep("^Ratio_", names(x))
  x[i1] <- lapply(x[i1], A)
  x
})
NOTE: No external packages are used.
data
lst <- mget(paste0("df", 1:100))
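The list approach can be tried end to end on two small data frames with different Ratio_ columns. A self-contained sketch, assuming made-up values for the second data frame and a hypothetical helper named log_ratios:

```r
# Two data frames with different numbers and names of Ratio_ columns
df1 <- data.frame(ID = c("AA", "AB"), NAME = c("ABCD", "ABCE"),
                  Ratio_A = c(0.09, 0.04), Ratio_B = c(0.67, 0.85))
df2 <- data.frame(ID = c("AC", "AD"), NAME = c("ABCG", "ABCF"),
                  Ratio_X = c(0.43, 0.16), Ratio_Y = c(0.21, 0.62),
                  Ratio_Z = c(0.54, 0.25))  # made-up example
lst <- list(df1 = df1, df2 = df2)

# Hypothetical helper: apply f to every column whose name starts with "Ratio_"
log_ratios <- function(x, f = log) {
  is_ratio <- startsWith(names(x), "Ratio_")
  x[is_ratio] <- lapply(x[is_ratio], f)
  x
}

res <- lapply(lst, log_ratios)
res$df2
```

startsWith() is used here in place of grep("^Ratio_", ...); both select the same columns, and the ID and NAME columns pass through unchanged.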
This type of problem is very easily dealt with using the dplyr package. For example,
df <- read.table(text = 'ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
AA ABCD 0.09 0.67 0.10 0.14
AB ABCE 0.04 0.85 0.04 0.06
AC ABCG 0.43 0.21 0.54 0.14
AD ABCF 0.16 0.62 0.25 0.97
AF ABCJ 0.59 0.37 0.66 0.07',
header = TRUE)
library(dplyr)
df_transformed <- mutate(df, across(starts_with("Ratio_"), log)) # across() needs dplyr >= 1.0.0
df_transformed
# > df_transformed
# ID NAME Ratio_A Ratio_B Ratio_C Ratio_D
# 1 AA ABCD -2.4079456 -0.4004776 -2.3025851 -1.96611286
# 2 AB ABCE -3.2188758 -0.1625189 -3.2188758 -2.81341072
# 3 AC ABCG -0.8439701 -1.5606477 -0.6161861 -1.96611286
# 4 AD ABCF -1.8325815 -0.4780358 -1.3862944 -0.03045921
# 5 AF ABCJ -0.5276327 -0.9942523 -0.4155154 -2.65926004

Aggregating columns

I have a data frame of n columns and r rows. I want to determine which column is correlated most with column 1, and then aggregate these two columns. The aggregated column is then treated as the new column 1, and the column that was correlated most is removed from the set, so the size of the data decreases by one column. I then repeat the process until the result data frame has n columns, with the second column being the aggregation of two columns, the third column being the aggregation of three columns, etc. I am wondering if there is an efficient or quicker way to get to the result I'm going for. I've tried various things, but without success so far. Any suggestions?
n <- 5
r <- 6
> df
X1 X2 X3 X4 X5
1 0.32 0.88 0.12 0.91 0.18
2 0.52 0.61 0.44 0.19 0.65
3 0.84 0.71 0.50 0.67 0.36
4 0.12 0.30 0.72 0.40 0.05
5 0.40 0.62 0.48 0.39 0.95
6 0.55 0.28 0.33 0.81 0.60
This is what result should look like:
> result
X1 X2 X3 X4 X5
1 0.32 0.50 1.38 2.29 2.41
2 0.52 1.17 1.78 1.97 2.41
3 0.84 1.20 1.91 2.58 3.08
4 0.12 0.17 0.47 0.87 1.59
5 0.40 1.35 1.97 2.36 2.84
6 0.55 1.15 1.43 2.24 2.57
I think most of the slowness and the eventual crash come from memory overhead during the loop rather than from the correlations (though those could be improved too, as @coffeeinjunky says). This is most likely a result of the way data.frames are modified in R. Consider switching to data.table and taking advantage of its "assignment by reference" paradigm. For example, below is your code translated into data.table syntax. You can time the two loops, compare performance and comment on the results. Cheers.
library(data.table)
n <- 5L
r <- 6L
result <- setDT(data.frame(matrix(NA,nrow=r,ncol=n)))
temp <- copy(df) # Temporary copy in which the correlations are calculated
set(result, j=1L, value=temp[[1]]) # The first column is the same
for (icol in as.integer(2:n)) {
  mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1]) # Column correlated most with column 1
  set(x=result, i=NULL, j=as.integer(icol), value=(temp[[1]] + temp[[mch]])) # Aggregate and store in result
  set(x=temp, i=NULL, j=1L, value=result[[icol]]) # Set result as new 1st column
  set(x=temp, i=NULL, j=as.integer(mch), value=NULL) # Remove the merged column
}
Try:
for (i in 2:n) {
  maxcor <- names(which.max(sapply(temp[,-1, drop=FALSE], function(x) cor(temp[, 1], x))))
  result[,i] <- temp[,1] + temp[,maxcor]
  temp[,1] <- result[,i] # Set result as new 1st column
  temp[,maxcor] <- NULL # Remove column
}
The error occurred because, in the last iteration, temp[,-1] has only one column left, and by default R drops a one-column selection from a data frame to a plain vector. sapply then iterates over the individual elements of that vector instead of over whole columns. Adding drop=FALSE keeps the selection a data frame.
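The dropping behavior is easy to see on a two-column data frame, which is the state temp is in during the last iteration. A small sketch using the first three rows of X1 and X2 from the question:

```r
temp <- data.frame(X1 = c(0.32, 0.52, 0.84), X2 = c(0.88, 0.61, 0.71))

dropped <- temp[, -1]               # one column left: collapses to a numeric vector
kept    <- temp[, -1, drop = FALSE] # stays a one-column data frame

is.data.frame(dropped)  # FALSE
is.data.frame(kept)     # TRUE

# Iterating over the data frame still yields whole columns, with names
cors <- sapply(kept, function(x) cor(temp[, 1], x))
names(cors)  # "X2"
```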
One more comment: currently, you are using the most positive correlation, not the strongest correlation, which may also be negative. Make sure this is what you want.
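The difference between the two criteria shows up as soon as one correlation is strongly negative. A tiny illustration with made-up correlation values (the vector cors is hypothetical, not computed from the question's data):

```r
# Hypothetical correlations of three columns with column 1
cors <- c(X2 = 0.40, X3 = -0.90, X4 = 0.10)

names(which.max(cors))       # most positive correlation: "X2"
names(which.max(abs(cors)))  # strongest correlation in absolute value: "X3"
```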
To address your question in the comment: note that your old code could be improved by avoiding repeated computation. For instance,
mch <- match(c(max(cor(temp)[-1,1])),cor(temp)[,1])
contains the command cor(temp) twice. This means each and every correlation is computed twice. Replacing it with
cortemp <- cor(temp)
mch <- match(c(max(cortemp[-1,1])),cortemp[,1])
should cut the computational burden of the initial code line in half.
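Putting these suggestions together, the whole procedure can be sketched in base R on a numeric matrix, computing only the correlations with column 1, once per iteration. This is an illustrative sketch on the example data from the question, not the original poster's code; like the original, it picks the most positive correlation:

```r
df <- data.frame(
  X1 = c(0.32, 0.52, 0.84, 0.12, 0.40, 0.55),
  X2 = c(0.88, 0.61, 0.71, 0.30, 0.62, 0.28),
  X3 = c(0.12, 0.44, 0.50, 0.72, 0.48, 0.33),
  X4 = c(0.91, 0.19, 0.67, 0.40, 0.39, 0.81),
  X5 = c(0.18, 0.65, 0.36, 0.05, 0.95, 0.60)
)

temp <- as.matrix(df)  # working copy; columns are removed as they are merged
result <- matrix(NA_real_, nrow = nrow(temp), ncol = ncol(temp),
                 dimnames = dimnames(temp))
result[, 1] <- temp[, 1]

for (i in 2:ncol(df)) {
  # Correlation of column 1 with every remaining column (a 1 x k matrix)
  cors <- cor(temp[, 1], temp[, -1, drop = FALSE])
  best <- which.max(cors) + 1L           # +1 because column 1 was excluded
  result[, i] <- temp[, 1] + temp[, best]
  temp[, 1] <- result[, i]               # aggregated column becomes column 1
  temp <- temp[, -best, drop = FALSE]    # drop the merged column
}

result
```

Whatever order the columns are merged in, the last column ends up equal to the row sums of the original data, which gives a quick sanity check.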

Ordering Table A based on Rank of Table B in R

Pretty newb question here, but I have not been able to track down a solution for some time:
I have an xts object of trading indicators (indicate) for stock data that looks like
A XOM MSFT
2000-11-30 -0.59 0.22 0.10
2000-12-29 0.55 -0.23 0.05
2001-01-30 -0.52 0.09 -0.10
And a table with an identical index for the corresponding period returns (returns) that looks like
A XOM MSFT
2000-11-30 -0.15 0.10 0.03
2000-12-29 0.03 -0.05 0.02
2001-01-30 -0.04 0.02 -0.05
I have sorted the indicator table and had it return the column name with the following code:
indicate.label <- colnames(indicate)
indicate.rank <- t(apply(indicate, 1, function(x) indicate.label[order(-x)]))
indicate.rank <- xts(indicate.rank, order.by = index(returns))
Which gives the table (indicate.rank) of the symbol names ranked by their trading indicator:
1 2 3
2000-11-30 XOM MSFT A
2000-12-29 A MSFT XOM
2001-01-30 XOM A MSFT
I would like to also have a table that gives the period returns based on the indicator rank:
2000-11-30 0.10 0.03 -0.15
2000-12-29 0.03 0.02 -0.05
2001-01-30 0.02 -0.04 -0.05
I cannot figure out how to call the correct symbol for all rows, or how to just sort the returns table based on the order of indicate.
Thank you for any suggestions.
Trevor J
I'm not particularly satisfied with this solution, but it works.
row.rank <- t(apply(indicate, 1, order, decreasing=TRUE))
indicate.rank <- return.rank <- indicate # pre-allocate
for(i in 1:NROW(indicate.rank)) {
  indicate.rank[i,] <- colnames(indicate)[row.rank[i,]]
  return.rank[i,] <- returns[i,row.rank[i,]]
}
It would probably be easier to handle this if the returns and the indicators for each symbol were in the same object, but I don't know how that would fit with the rest of your workflow.
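The row-by-row loop can also be replaced by matrix indexing. A minimal sketch on plain matrices rather than xts objects, using the values from the question and assuming the returns table is called returns; the results could be wrapped back into xts with the original index afterwards:

```r
dates <- c("2000-11-30", "2000-12-29", "2001-01-30")
syms  <- c("A", "XOM", "MSFT")
indicate <- matrix(c(-0.59,  0.22,  0.10,
                      0.55, -0.23,  0.05,
                     -0.52,  0.09, -0.10),
                   nrow = 3, byrow = TRUE, dimnames = list(dates, syms))
returns <- matrix(c(-0.15,  0.10,  0.03,
                     0.03, -0.05,  0.02,
                    -0.04,  0.02, -0.05),
                  nrow = 3, byrow = TRUE, dimnames = list(dates, syms))

# Column order per row, highest indicator first
row.rank <- t(apply(indicate, 1, order, decreasing = TRUE))
ranks <- as.character(seq_len(ncol(indicate)))

# Symbol names in rank order
indicate.rank <- matrix(syms[row.rank], nrow = nrow(indicate),
                        dimnames = list(dates, ranks))

# Returns in the same order, picked with a two-column (row, col) index matrix
idx <- cbind(c(row(row.rank)), c(row.rank))
return.rank <- matrix(returns[idx], nrow = nrow(returns),
                      dimnames = list(dates, ranks))

return.rank["2000-11-30", ]
```

With these inputs the third date ranks as XOM, MSFT, A, because an indicator of -0.10 sorts above -0.52.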

Resources