Merging Embedded Lists with Different Date Formats in R

I need to merge two lists with each other but I am not getting what I want and I think it is because the "Date" column is in two different formats. I have a list called li and in this list there are 12 lists each with the following format:
> tail(li$fxe)
Date fxe
3351 2020-06-22 0.0058722768
3352 2020-06-23 0.0044256216
3353 2020-06-24 -0.0044998220
3354 2020-06-25 -0.0027309539
3355 2020-06-26 0.0002832672
3356 2020-06-29 0.0007552346
I am trying to merge each of these unique lists with a different list called factors which looks like :
> tail(factors)
Date Mkt-RF SMB HML RF
3351 20200622 0.0071 0.83 -1.42 0.000
3352 20200623 0.0042 0.15 -0.56 0.000
3353 20200624 -0.0261 -0.52 -1.28 0.000
3354 20200625 0.0112 0.25 0.50 0.000
3355 20200626 -0.0243 0.16 -1.37 0.000
3356 20200629 0.0151 1.25 1.80 0.000
The reason I need this structure is that I am trying to send them to a function I wrote to do linear regressions. The first line of my function is meant to merge these lists, but when I merge them I end up with a null structure even though my lists clearly have the same number of rows. In my function df is li. The embedded list of li is confusing me. Can someone help please?
Function I want to use:
Bf <- function(df, fac){
  # This function calculates the betas of the Fama-French factors
  # using linear regression
  # Input:  df  = a data frame containing returns of the security
  #         fac = data frame containing excess market return and
  #               the Fama-French 3 factors
  # Output: a vector of betas of the Fama-French model
  temp <- merge(df, fac, by="Date")
  temp <- temp[, !names(temp) %in% "Date"]
  temp[ ,1] <- temp[,1] - temp$RF
  return(lm(temp[,1]~temp[,2]+temp[,3]+temp[,4])$coeff)
}

a: You are dealing with data frames, not lists (li is a list of data frames).
b: If you want to merge them, you need to convert the factors$Date column so that it matches the class of li$fxe$Date.
Try:
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")
This should convert the Date column of factors to the "Date" class. Note that the month code is the lowercase %m; the uppercase %M means minutes.
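To make the fix concrete, here is a minimal sketch using a few rows copied from the question. The small data frames fxe and factors below are stand-ins I built for illustration, and wrapping the merge in lapply() is only one possible way to apply it to every element of li.

```r
# Toy stand-ins for li$fxe and factors (values copied from the question)
fxe <- data.frame(
  Date = as.Date(c("2020-06-22", "2020-06-23", "2020-06-24")),
  fxe  = c(0.0058722768, 0.0044256216, -0.0044998220)
)
factors <- data.frame(
  Date     = c(20200622, 20200623, 20200624),
  `Mkt-RF` = c(0.0071, 0.0042, -0.0261),
  SMB      = c(0.83, 0.15, -0.52),
  HML      = c(-1.42, -0.56, -1.28),
  RF       = c(0, 0, 0),
  check.names = FALSE
)

# Convert the numeric yyyymmdd key to the Date class (%m = month)
factors$Date <- as.Date(as.character(factors$Date), format = "%Y%m%d")

# Merge every data frame in the list with the factors on the common key
li <- list(fxe = fxe)
merged_list <- lapply(li, merge, y = factors, by = "Date")
merged_list$fxe
```

Once both Date columns share the same class, each merged data frame should contain one row per matching date.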

Related

Looping over two dataframes of same size and calculating correlation

Apologies in advance, as I am new to R. I'm attempting to calculate the correlation coefficient for each cortical measurement per ROI from two dataframes. The two dataframes are df3 and df5.
They both look something like this:
> df3
SubjID SeshLab L_Frontal R_Frontal
123456 2016-06-03 2.54 2.78
135791 2016-07-21 2.39 2.49
147036 2016-12-03 2.67 2.39
> df5
SubjID SeshLab L_Frontal R_Frontal
123456 2016-06-03 2.45 2.87
135791 2016-07-21 2.93 2.94
147036 2016-12-03 2.76 2.93
And I'm trying to obtain a dataframe that looks like this:
SubjID SeshLab L_Frontal R_Frontal
123456 2016-06-03 corr.test(2.54 and 2.45) corr.test(2.78 and 2.87)
135791 2016-07-21 corr.test(2.39 and 2.93) corr.test(2.49 and 2.94)
147036 2016-12-03 corr.test(2.67 and 2.76) corr.test(2.39 and 2.93)
I've attempted to write the loop to match indices since the dataframes are the same size.
correlation_results = list()
for (i in 3:df3$SubjID)
{
x = as.numeric(df3[1,i])
y = as.numeric(df5[1,i])
correlation_results <- cor.test(x,y,method="spearman")
}
numerical expression has 110 elements: only the first used
Error in cor.test.default(x, y, method = "spearman") :
not enough finite observations
for (i in df3$SubjID)
{
x = as.numeric(df3[i,i])
y = as.numeric(df5[i,i])
correlation_results <- cor.test(x,y,method="spearman")
}
Error: Can't subset columns that don't exist.
x Column `107516` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
How do I index the two dataframes properly to get this result?

Conditional change of a column in a data frame

Apologies in advance if this has already been asked elsewhere, but I've tried different attempts and nothing has worked so far.
I have a data frame Data containing the measurements of air pollution. The columns "Measuring.Unit" and "Uncertainty.Unit" show that most of the measurements are expressed in "mol/L" but some of them are expressed in "mol/mL".
head(Data)
Locality.Name Chemical Concentration Measuring.Unit Uncertainty Uncertainty.Unit
1 xxxx NH3 0.065 mol/L 0.010 mol/L
2 xxxx CO 0.015 mol/L 0.004 mol/L
3 xxxx CO2 0.056 mol/L 0.006 mol/L
4 xxxx O3 0.67 mol/mL 0.010 mol/mL
5 xxxx H2SO4 0.007 mol/L 0.0008 mol/L
6 xxxx NO 0.89 mol/mL 0.08 mol/mL
Before starting any analysis, I want to convert each value expressed in mol/mL to mol/L using a simple function and, of course, change the associated label "mol/mL" to "mol/L". This should be something like this (but I guess there are much simpler ways using dplyr or tidyverse):
# First step
if (Data$Measuring.Unit == "mol/mL") {Data$Concentration <- Data$Concentration * 1000 }
else {Data$Concentration <- Data$Concentration }
if (Data$Uncertainty.Unit == "mol/mL") {Data$Uncertainty <- Data$Uncertainty * 1000 }
else {Data$Uncertainty <- Data$Uncertainty}
# Second step
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'
You can try:
Data$Concentration <- ifelse(Data$Measuring.Unit == "mol/mL",Data$Concentration * 1000,Data$Concentration)
Data$Uncertainty <- ifelse(Data$Uncertainty.Unit == "mol/mL",Data$Uncertainty * 1000,Data$Uncertainty)
This step looks fine:
Data$Measuring.Unit[Data$Measuring.Unit == 'mol/mL'] <- 'mol/L'
Data$Uncertainty.Unit[Data$Uncertainty.Unit == 'mol/mL'] <- 'mol/L'
if() expects a single TRUE/FALSE condition, while ifelse() is vectorized and tests every element of a column.
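As a small illustration of the vectorized approach, the sketch below uses a cut-down version of the question's data (only three rows and three columns, chosen by me for brevity). It uses logical indexing, which is an equivalent alternative to ifelse(), not the exact code from the answer.

```r
# Cut-down version of the question's data frame
Data <- data.frame(
  Chemical       = c("NH3", "O3", "NO"),
  Concentration  = c(0.065, 0.67, 0.89),
  Measuring.Unit = c("mol/L", "mol/mL", "mol/mL"),
  stringsAsFactors = FALSE
)

# One logical value per row: TRUE where the unit still needs converting
is_mL <- Data$Measuring.Unit == "mol/mL"

# Convert the values first, then relabel the unit, so the test above
# is evaluated before the labels change
Data$Concentration[is_mL]  <- Data$Concentration[is_mL] * 1000
Data$Measuring.Unit[is_mL] <- "mol/L"
Data
```

The same pattern would apply to the Uncertainty and Uncertainty.Unit columns.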

R: Aggregating over several variables and observations (depending on values) and creating a new variable

The data set has the following structure
Key Date Mat Amount
<int> <date> <chr> <dbl>
1 1001056 2014-12-12 10025 0.10
2 1001056 2014-12-23 10025 0.20
3 1001056 2015-01-08 10025 0.10
4 1001056 2015-04-07 10025 0.20
5 1001056 2015-05-08 10025 0.20
6 1001076 2013-10-29 10026 3.00
7 1001140 2013-01-18 10026 0.72
8 1001140 2013-04-11 10026 2.40
9 1001140 2014-10-08 10026 0.24
10 1001237 2015-02-17 10025 2.40
11 1001237 2015-02-17 10026 3.40
Mat takes values in {10001,...,11000}, hence A:=|Mat|=1000.
I would like to accomplish the following goals:
1) (Intermediate step) For each Key-Date combination, I would like to calculate the differences in Amount between all materials that are available at that combination (the set of materials may vary from key to key),
e.g. for combination "1001237 2015-02-17" this would be, for materials 10025 and 10026, 2.40-3.40=-1 (but there might be more combinations). (How to store those values efficiently?)
This step might be skipped.
2) Finally, I would like to construct a new A x A matrix (A=1000) where each entry (i,j) (material combination i and j) contains the average of the values calculated in the step before.
More formally, entry (i,j) is given by
1/|all Key-Date combinations containing Mat i and Mat j| \sum_{all Key-Date combinations containing Mat i and Mat j} (Amount_i - Amount_j)
As the table is quite large efficiency of the computation is very important.
Thank you very much for your help in advance!
I can do it with list columns in tidyverse; the trick is to use group_by to get distinct combinations of Key and Date. Here's the code:
library(dplyr)
library(tidyr)
materials <- unique(x$Mat)
n <- length(materials)
x <- x %>%
  group_by(Key, Date) %>%
  nest() %>%
  # Create a n by n matrix for each combination of Key and Date
  mutate(matrices = lapply(data,
    function(y) {
      out <- matrix(nrow = n, ncol = n,
                    dimnames = list(materials, materials))
      # Only fill in when the pair of materials is present
      # for the date of interest
      mat_present <- as.character(unique(y$Mat))
      for (i in mat_present) {
        for (j in mat_present) {
          # You may want to take an absolute value
          out[i,j] <- y$Amount[y$Mat == i] - y$Amount[y$Mat == j]
        }
      }
      out
    }))
If you really want speed, you can implement the function in lapply with Rcpp. You can use RcppParallel to further speed it up. Now one of the columns of the data frame is a list of matrices. Then, for each element of the matrices, take an average while ignoring NAs:
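# n materials, one n by n slice per Key-Date combination
# (c(2, 2, 10) for the sample data)
x_arr <- array(unlist(x$matrices), dim = c(n, n, nrow(x)))
results <- apply(x_arr, 2, rowMeans, na.rm = TRUE)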
I stacked the list of matrices into a 3D array and found row means slice by slice. For performance, you can also do it in RcppArmadillo, with sum(x_arr, 2), but it's hard to deal with missing values when not all types of materials are represented in a combination of Key and Date.
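For comparison, here is a base R sketch that avoids building one n by n matrix per Key-Date combination by accumulating a running sum and a running count instead. It is not the method from the answer above. It assumes each material appears at most once per Key-Date combination, and the names sums, counts and avg_diff are mine.

```r
# Sample data copied from the question
x <- data.frame(
  Key    = c(rep(1001056, 5), 1001076, rep(1001140, 3), 1001237, 1001237),
  Date   = as.Date(c("2014-12-12", "2014-12-23", "2015-01-08", "2015-04-07",
                     "2015-05-08", "2013-10-29", "2013-01-18", "2013-04-11",
                     "2014-10-08", "2015-02-17", "2015-02-17")),
  Mat    = c(rep("10025", 5), rep("10026", 4), "10025", "10026"),
  Amount = c(0.10, 0.20, 0.10, 0.20, 0.20, 3.00, 0.72, 2.40, 0.24, 2.40, 3.40),
  stringsAsFactors = FALSE
)

materials <- sort(unique(x$Mat))
n <- length(materials)
sums   <- matrix(0, n, n, dimnames = list(materials, materials))
counts <- sums

# Row indices of x, grouped by Key-Date combination
groups <- split(seq_len(nrow(x)), list(x$Key, x$Date), drop = TRUE)

for (idx in groups) {
  m <- x$Mat[idx]
  # All pairwise differences Amount_i - Amount_j within this combination
  d <- outer(x$Amount[idx], x$Amount[idx], "-")
  sums[m, m]   <- sums[m, m] + d
  counts[m, m] <- counts[m, m] + 1
}

# Average difference; NaN where a pair of materials never occurs together
avg_diff <- sums / counts
avg_diff
```

With 1000 materials the two accumulators are 1000 by 1000 matrices, so memory use stays fixed regardless of how many Key-Date combinations the table contains.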

Prevent reshape2 from converting column headings to numbers

I am trying to create a lookup table from an actual distance matrix for US zip code locations using the GoogleMaps model shown on Lars Relund Nielsen's webpage. The Zip codes in the Northeastern US begin with a "0" and therefore get dropped when reshape2 converts the matrix from wide to long as described on his page.
example matrix of distances (km) between 5 Zip codes:
mdat <- matrix(c(0.000, 113.288, 145.986, 126.271, 368.354
,103.988, 0.000, 69.637, 49.922, 294.386
,144.851, 69.285, 0.000, 25.547, 244.024
,124.531, 48.965, 25.245, 0.000, 258.729
,368.346, 295.159, 243.478, 258.598, 0.000)
, nrow = 5
, ncol = 5
, byrow = TRUE,
dimnames = list(c("01014", "01747", "02144", "02453", "04040"),
c("01014", "01747", "02144", "02453", "04040")))
Looks like this (all well and good);
# 01014 01747 02144 02453 04040
#01014 0.000 113.288 145.986 126.271 368.354
#01747 103.988 0.000 69.637 49.922 294.386
#02144 144.851 69.285 0.000 25.547 244.024
#02453 124.531 48.965 25.245 0.000 258.729
#04040 368.346 295.159 243.478 258.598 0.000
But when I reshape the Matrix to a lookup table it converts the row/col names to a number dropping the leading zero from the Zip code.
#reshape into a table of distances
library(reshape2)
dat<-(melt(mdat))
dat
colnames(dat)<-c("from","to","km")
head(dat)
from to km
1 1014 1014 0.000
2 1747 1014 103.988
3 2144 1014 144.851
4 2453 1014 124.531
5 4040 1014 368.346
6 1014 1747 113.288
I am hoping to get;
from to km
1 01014 01014 0.000
2 01747 01014 103.988
3 02144 01014 144.851
4 02453 01014 124.531
5 04040 01014 368.346
6 01014 01747 113.288
Any thoughts on how I can keep reshape2 from converting the Zip codes to numbers?
Include as.is=TRUE in your melt function:
dat<-(melt(mdat, as.is=TRUE))
colnames(dat)<-c("from","to","km")
This will keep the row and column names as strings through the melting process.
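If reshape2 is not available, a similar long table can be sketched in base R with as.table() and as.data.frame(). This is an alternative technique, not the melt() call from the answer, and the 2 by 2 matrix below is a cut-down version of the question's data.

```r
# Cut-down version of the distance matrix from the question
mdat <- matrix(c(0.000,   113.288,
                 103.988, 0.000),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("01014", "01747"),
                               c("01014", "01747")))

# stringsAsFactors = FALSE keeps the dimnames as character strings,
# so the leading zeros survive
dat <- as.data.frame(as.table(mdat),
                     stringsAsFactors = FALSE,
                     responseName = "km")
names(dat)[1:2] <- c("from", "to")
dat
```

The rows come out in column-major order, the same order as the melted output shown in the question.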

How to return the date associated with each member of a sorted list in xts

I have an xts object 'foo' containing the top 6 largest negative percentage changes in the value of the share price of a stock over a certain period. Using sort(foo) produces a list sorted by date, as shown below, but I want to sort the list based on the values.
sort(coredata(foo)) gives me the list I expect, but returns the values without the associated index date value, as shown below. I would like a list in the format:
2008-11-07 -0.150
2008-11-06 -0.145
etc
I feel that some combination of index() and which() might work but haven't been able to produce anything useful. Any pointers gratefully received.
sort(foo)
[,1]
2008-10-08 -0.105
2008-10-16 -0.119
2008-10-27 -0.109
2008-11-06 -0.145
2008-11-07 -0.150
2008-12-12 -0.121
sort(coredata(foo))
[1] -0.150 -0.145 -0.121 -0.119 -0.109 -0.105
Something along the lines of this would give you the dates in the desired order:
index(foo)[ order(coredata(foo)) ]
xts and zoo objects do not like to be displayed out of order, so I think you would need to coerce them to a less "order"-ly class:
> foo3 <- as.data.frame(foo)
> foo3[order(foo3$V1), ]
[1] -0.150 -0.145 -0.121 -0.119 -0.109 -0.105
> foo3[order(foo3$V1), ,drop=FALSE]
V1
2008-11-07 -0.150
2008-11-06 -0.145
2008-12-12 -0.121
2008-10-16 -0.119
2008-10-27 -0.109
2008-10-08 -0.105
As @DWin pointed out, the index must be ordered (zoo stands for "Z's Ordered Observations"), so you can't un-order xts/zoo objects.
You can use the fmt argument to coredata.xts (as described in ?coredata.xts) combined with drop to get a named vector that you could then sort like you want.
sort(drop(coredata(foo, fmt=TRUE)))
# 2008-11-07 2008-11-06 2008-12-12 2008-10-16 2008-10-27 2008-10-08
# -0.150 -0.145 -0.121 -0.119 -0.109 -0.105
I think the print method for xts imposes a date ordering, so you can not solve this by subsetting alone: compare foo[1:2] and foo[2:1].
To get what you want you could try,
ord <- order(foo)
dat <- data.frame(x=coredata(foo)[ord])
rownames(dat) <- index(foo)[ord]
This would not be an xts object however.
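The ordering idea shared by these answers can be tried without xts installed. In the sketch below the plain vectors dates and vals are my stand-ins for index(foo) and coredata(foo), filled with the values from the question.

```r
# Stand-ins for index(foo) and coredata(foo)
dates <- as.Date(c("2008-10-08", "2008-10-16", "2008-10-27",
                   "2008-11-06", "2008-11-07", "2008-12-12"))
vals  <- c(-0.105, -0.119, -0.109, -0.145, -0.150, -0.121)

# order() returns the permutation that sorts the values;
# applying it to both vectors keeps each date with its value
ord <- order(vals)
sorted <- data.frame(Date = dates[ord], change = vals[ord])
sorted
```

The result is an ordinary data frame, so it can be displayed in any row order.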
