Quickly create new columns in dataframe using lists - R - r

I have a data set containing quotations of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time, i.e. (quotation at [TIME] / quotation at yesterday's close) - 1. I began this way (my data is named temp):
listIndexes<-list("CAC","SP","MIB") # there are a lot more
listTime<-list(900,905,910,...1735) # every 5 minutes
for (j in 1:length(listTime)){
  Time<-listTime[j]
  for (i in 1:length(listIndexes)) {
    Index<-listIndexes[i]
    temp[[paste0(Index,"perf",Time)]]<-temp[[paste0(Index,Time)]]/temp[[paste0(Index,"close")]]-1
    # other stuff to do but with the same concept
  }
}
but it is quite slow. Is there a way to get rid of the for loop(s) or to make the creation of those variables quicker? I read about the apply family of functions, but I do not see whether or how they should be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column (more exactly, a new column for each time and each index) added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
and likewise for the following days.
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879

It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
                  stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
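The remaining loop can probably be dropped as well, because data frames support element-wise arithmetic on whole blocks of columns. The following is only a minimal sketch: the toy values, and the names grid, new_cols, value_cols and close_cols, are made up for illustration, and the column names are assumed to follow the <Index><Time> and <Index>close pattern from the question.

```r
# Toy data (made-up values) using the naming pattern from the question.
temp <- data.frame(
  CAC900 = c(4000, 4004), CAC905 = c(4001.2, 4003.5), CACclose = c(4005, 4002),
  SP900  = c(2000, 2005), SP905  = c(2001, 2004),     SPclose  = c(2003, 2002)
)

indexes <- c("CAC", "SP")
times   <- c(900, 905)

# One row per Index-Time combination.
grid <- expand.grid(Index = indexes, Time = times, stringsAsFactors = FALSE)
new_cols   <- paste0(grid$Index, "perf", grid$Time)
value_cols <- paste0(grid$Index, grid$Time)
close_cols <- paste0(grid$Index, "close")

# Divide the block of quotation columns by the matching close columns
# (close columns are repeated as needed), then add all results at once.
temp[new_cols] <- temp[value_cols] / temp[close_cols] - 1
```

Whether this is faster than the loop on the real data would need to be measured; the main gain is that the column bookkeeping happens once, up front.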
An alternative solution would be to reshape your data into a form that makes this transformation much simpler: for instance, a long, tidy format with Date, Index, Time, Value and ClosingValue columns, where you operate directly on just the two relevant columns.
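A rough base-R sketch of that long format is given here, again on made-up toy data. The helper column source_col and the regular expressions are assumptions about how the column names are built (upper-case index name followed by a numeric time), so they would need adapting to the real names.

```r
# Toy wide data (made-up values); names follow <Index><Time> / <Index>close.
temp <- data.frame(
  date = c(20140105, 20140106),
  CAC900 = c(4000, 4004), CAC905 = c(4001.2, 4003.5), CACclose = c(4005, 4002),
  SP900  = c(2000, 2005), SP905  = c(2001, 2004),     SPclose  = c(2003, 2002)
)

# Quotation columns are assumed to be <UPPERCASE index><digits>.
value_cols <- grep("^[A-Z]+[0-9]+$", names(temp), value = TRUE)

# Stack the quotation columns on top of each other: one row per date/index/time.
long <- data.frame(
  date       = rep(temp$date, times = length(value_cols)),
  source_col = rep(value_cols, each = nrow(temp)),
  Value      = unlist(temp[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long$Index <- sub("[0-9]+$", "", long$source_col)
long$Time  <- as.integer(sub("^[A-Z]+", "", long$source_col))

# Look up the matching close for each row via (row, column) matrix indexing.
long$ClosingValue <- temp[cbind(match(long$date, temp$date),
                                match(paste0(long$Index, "close"), names(temp)))]

# The performance is now a single vectorised operation on two columns.
long$Perf <- long$Value / long$ClosingValue - 1
```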

Related

R data frame: Change value in 1 column depending on value in another

I have a data frame called nurse. At the moment it contains several columns but only one (nurse$word) is relevant at the moment. I want to create a new column named nurse$w.frequency which looks at the words in the nurse$word column and if it finds the one specified, I want it to change the corresponding nurse$w.frequency value to a specified integer.
nurse <- read.csv(...)
file word w.frequency
1 determining
2 journey
3 journey
4 serving
5 work
6 journey
... ...
The word frequency for determining and journey, for instance, is 1590 and 4650 respectively. So it should look like the following:
file word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
... ...
I have tried it with an ifelse statement (below), which seems to work; however, every time I change the word and frequency, it overwrites the previous results.
nurse$w.frequency <- ifelse(nurse$word == "determining", nurse$w.frequency[nurse$word["determining"]] <- 1590, "")
You could first initialise an empty column
nurse$w.frequency <- NA
then populate it with the data you want
nurse$w.frequency[nurse$word == "determining"] <- 1590
nurse$w.frequency[nurse$word == "journey"] <- 4650
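When there are many words, repeating one assignment per word gets tedious. One possible variation, sketched here under the assumption that the frequencies are available as a named vector (the name freq is hypothetical), is to look every word up in a single step:

```r
nurse <- data.frame(
  word = c("determining", "journey", "journey", "serving", "work", "journey"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: names are the words, values are the frequencies.
freq <- c(determining = 1590, journey = 4650)

# Index the lookup table by word; words without an entry become NA.
nurse$w.frequency <- unname(freq[as.character(nurse$word)])
```

Words such as "serving" and "work" that are missing from freq end up as NA, and nothing is overwritten when more entries are added to the table later.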
Using dplyr:
library(dplyr)
nurse %>%
  mutate(w.frequency =
    case_when(
      word == "determining" ~ "1590",
      word == "journey" ~ "4650",
      TRUE ~ ""
    ))
Gives us:
word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
Data:
nurse <- data.frame(word = c("determining", "journey", "journey", "serving", "work", "journey"))

New variable: sum of numbers from a list powered by value of different columns

This is my first question on Stack Overflow. I am not new to R, although I sometimes struggle with things that might be considered basic.
I want to calculate the count median diameter (CMD) for each of my rows from a Particle Size Distribution dataset.
My data looks like this (several rows and 53 columns in total):
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
2015-01-01 00:00:00 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2015-01-01 01:00:00 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
Each variable starting with "n" indicates the number of particles for the corresponding size (variable n3.16 = number of particles of median size 3.16 nm). I will divide the values by 100 prior to the calculations, in order to avoid numbers so large that they prevent the computation.
To compute the CMD, I need to do the following calculation:
CMD = (D1^n1*D2^n2...Di^ni)^(1/N)
where Di is the diameter (to be extracted from the column name), ni is the number of particles for diameter Di, and N is the total sum of particles (sum of all the columns starting with "n").
To get the Di, I created a numeric list from the column names that start with n:
D <- as.numeric(gsub("n", "", names(data)[3:54]))
This is my attempt to create a new variable with the calculation of CMD, although it doesn't work.
data$cmd <- for i in 1:ncol(D) {
prod(D[[i]]^data[,i+2])
}
I also tried to use apply, but again, it didn't work:
data$cmd <- for i in 1:ncol(size) {
apply(data,1, function(x) prod(size[[i]]^data[,i+2])
}
I have different datasets from different sites which have different number of columns, so I would like to make code "universal".
Thank you very much
This should work (I had to mutilate your date variable because of read.table, but it is not involved in the calculations, so just ignore that):
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846
N <- sum(df[3:11]) # did you mean the sum of all n.columns over all rows? if not, you'd need to edit this
> N
[1] 7235.488
D <- as.numeric(gsub("n", "", names(df)[3:11]))
> D
[1] 3.16 3.55 3.98 4.47 5.01 5.62 6.31 7.08 7.94
new <- t(apply(df[3:11], 1, function(x, y) (x^y), y = D))
> new
n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94
[1,] 772457.6 41933406 336296640 9957341349 5.167135e+12 1.232886e+15 3.625318e+17 2.054007e+20 3.621747e+23
[2,] 7980615.0 5922074 348176502 25783108893 1.368736e+12 2.305272e+14 9.119184e+16 5.071946e+20 1.129304e+24
df$CMD <- rowSums(new)^(1/N)
> df
date CPC n3.16 n3.55 n3.98 n4.47 n5.01 n5.62 n6.31 n7.08 n7.94 CMD
1 2015-01-01 5263.434 72.988 140.346 138.801 172.473 344.806 484.415 606.430 739.625 927.082 1.007526
2 2015-01-01 4813.182 152.823 80.861 140.017 213.382 264.496 359.455 487.293 840.349 1069.846 1.007684
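Note that the block above raises the counts to the power of the diameters and sums the results, with N taken over all rows, which is not the same as the formula in the question. A sketch that follows the question's formula literally, CMD = (D1^n1 * D2^n2 ... Di^ni)^(1/N) with N as the per-row total, can work in log space, since the product then becomes exp(sum(ni * log(Di)) / N). This is only an illustration on three of the size columns; selecting columns by the "n" prefix is an assumption about the naming convention.

```r
# Three size columns from the question's sample rows.
df <- data.frame(
  date  = c("2015-01-01 00:00:00", "2015-01-01 01:00:00"),
  CPC   = c(5263.434, 4813.182),
  n3.16 = c(72.988, 152.823),
  n3.55 = c(140.346, 80.861),
  n3.98 = c(138.801, 140.017)
)

# Pick the size columns by name so the code adapts to any number of them.
n_cols <- grep("^n[0-9.]+$", names(df), value = TRUE)
D      <- as.numeric(sub("^n", "", n_cols))

counts <- as.matrix(df[n_cols])
N      <- rowSums(counts)                 # total particles per row

# Weighted geometric mean of the diameters, computed in log space.
df$CMD <- exp(as.vector(counts %*% log(D)) / N)
```

Because the sum of logs replaces the huge product, no overflow occurs, and scaling all counts by a constant (such as dividing by 100) leaves the result unchanged.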

Speed-tune a for-loop in R

I have read up on vectorization as a solution for speeding up a for-loop. However, the data structure I am creating within a for-loop seems to need to be a data.frame/table.
Here is the scenario:
I have a large table of serial numbers and timestamps. Several timestamps can apply to the same serial number. I only want the latest timestamp for every serial number.
My approach now is to create a vector of unique serial numbers. Then, on each pass of a loop through this vector, I create a temporary table that holds all serial number/timestamp combinations for one serial number ('temp'). I then take the last entry of this temporary table (using the tail command) and put it into another table that will eventually hold all unique serial numbers and their latest timestamp ('last.pass'). Finally, I remove the rows from the starting table whose serial number/timestamp combination cannot be found in 'last.pass'.
Here is my code:
#create list of unique serial numbers found in merged 9000 table
hddsn.unique <- unique(merge.data$HDDSN)
#create empty data.table to populate
last.pass <- data.table(HDDSN=as.character(1:length(hddsn.unique)),
                        ENDDATE=as.character(1:length(hddsn.unique)))
#populate last.pass with the combination of serial numbers and their latest timestamps
for (i in 1:length(hddsn.unique)) {
  #create temporary table that finds all serial number/timestamp combinations
  temp <- merge.data[merge.data$HDDSN %in% hddsn.unique[i],][,.(HDDSN, ENDDATE)]
  #populate last.pass with the latest timestamp record for every serial number
  last.pass[i,] <- tail(temp, n=1)
}
match <- which(merge.data[,(merge.data$HDDSN %in% last.pass$HDDSN) &
                           (merge.data$ENDDATE %in% last.pass$ENDDATE)]==TRUE)
final <- merge.data[match]
My ultimate question is, how do I maintain the automated nature of this script while speeding it up, say, through vectorization or turning it into a function.
Thank you!!!
How about this. Without a clear idea of what your input data looks like, I took a guess.
# make some dummy data with multiple visits per serial
merge.data <- data.frame(HDDSN = 1001:1020,
timestamps = sample(1:9999, 100))
# create a function to find the final visit for a given serial
fun <- function(serial) {
  this.serial <- subset(merge.data, HDDSN==serial)
  this.serial[which.max(this.serial$timestamps), ]
}
# apply the function to each serial number and clean up the result
final <- as.data.frame(t(sapply(unique(merge.data$HDDSN), fun)))
This data has several ENDDATE for each HDDSN
merge.data <- data.frame(HDDSN = 1001:1100, ENDDATE = sample(9999, 1000))
place it in order, first by HDDSN then by ENDDATE
df = merge.data[do.call("order", merge.data),]
then find the last entry for each HDDSN
df[!duplicated(df[["HDDSN"]], fromLast=TRUE),]
The following illustrates the key steps
> head(df, 12)
HDDSN ENDDATE
701 1001 4
101 1001 101
1 1001 1225
301 1001 2800
201 1001 6051
501 1001 6714
801 1001 6956
601 1001 7894
401 1001 8234
901 1001 8676
802 1002 247
402 1002 274
> head(df[!duplicated(df[["HDDSN"]], fromLast=TRUE),])
HDDSN ENDDATE
901 1001 8676
902 1002 6329
803 1003 9947
204 1004 8825
505 1005 8472
606 1006 9743
If there are composite keys, then look for duplicates on a data.frame rather than a vector, !duplicated(df[, c("key1", "key2")]), as illustrated in the following:
> df = data.frame(k0=c(1:3, 1:6), k1=1:3)
> df[!duplicated(df, fromLast=TRUE),]
k0 k1
1 1 1
2 2 2
3 3 3
7 4 1
8 5 2
9 6 3
(the row numbers are from the original data frame, so rows 4-6 were duplicates). (Some care might need to be taken, especially if one of the columns is numeric, because duplicated.data.frame pastes columns together into a single string and rounding error may creep in).
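As a cross-check on the order-then-deduplicate approach, the same "latest ENDDATE per HDDSN" result can be obtained with aggregate(). This is a small sketch on made-up data, assuming that "latest" means the maximum ENDDATE; the names ord, last_by_dedup and last_by_agg are illustrative only.

```r
set.seed(1)
# Made-up data: 5 serial numbers with 4 distinct timestamps each.
merge.data <- data.frame(HDDSN   = rep(1001:1005, each = 4),
                         ENDDATE = sample(9999, 20))

# Approach 1: sort by serial then timestamp, keep the last row per serial.
ord <- merge.data[order(merge.data$HDDSN, merge.data$ENDDATE), ]
last_by_dedup <- ord[!duplicated(ord$HDDSN, fromLast = TRUE), ]

# Approach 2: take the maximum timestamp within each serial number.
last_by_agg <- aggregate(ENDDATE ~ HDDSN, data = merge.data, FUN = max)
```

Both give one row per serial number. The deduplication route keeps any other columns of the winning row, whereas aggregate() returns only the grouping and aggregated columns.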

R - subtracting multiple columns from multiple columns with 2 data frames

I have two dataframes as below:
> head(VN.GRACE.Int, 4)
DecimDate CSR GFZ JPL
1 2003.000 12.1465164 5.50259937 15.7402752
2 2003.083 1.8492431 0.27744418 3.4811423
3 2003.167 1.5168512 -0.06333961 1.7962201
4 2003.250 -0.2355813 6.16296554 0.7215013
> head(VN.GLDAS, 4)
Decim_Date NOAH_SManom CLM_SManom VIC_SManom SM_Month_Mean
1 2003.000 3.0596372 0.4023805 -0.2175665 1.081484
2 2003.083 -1.4459928 -1.0255955 -3.1338024 -1.868464
3 2003.167 -3.9945788 -1.4646734 -4.2052981 -3.221517
4 2003.250 -0.9737429 0.4213161 -1.0537822 -0.535403
EDIT: I have added an example of what the final data frame will look like.
I want to subtract columns [,2:5] of the VN.GLDAS data frame from EACH of the columns [,2:4] of VN.GRACE.Int and put the results in a separate data frame (the new data frame will have 12 columns), as below:
EXAMPLE <- data.frame(CSR_NOAH=numeric(), CSR_CLM=numeric(), CSR_VIC=numeric(), CSR_SM_Anom=numeric(),
GFZ_NOAH=numeric(), GFZ_CLM=numeric(), GFZ_VIC=numeric(), GFZ_SM_Anom=numeric(),
JPL_NOAH=numeric(), JPL_CLM=numeric(), JPL_VIC=numeric(), JPL_SM_Anom=numeric())
I've looked into 'sweep' as suggested in another post, but I am not sure whether my query would be better suited to a for loop, which I'm a novice at. I also looked at subtracting values in one data frame from another, but I don't believe that answers my query. Thanks in advance.
res <- cbind(VN.GRACE.Int[, 1, drop = FALSE],
             do.call(cbind, lapply(VN.GLDAS[, 2:5],
                                   function(x) VN.GRACE.Int[, 2:4] - x)))
dim(res)
#[1] 4 13
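If the result columns should be ordered and named by pair, as in the EXAMPLE data frame, one option is to enumerate the column pairs first and subtract the two blocks in one step. This sketch uses the first two rows from the question; the name pairs and the use of the full GLDAS column names in the output (e.g. CSR_NOAH_SManom rather than CSR_NOAH) are choices made for illustration.

```r
# First two rows of each data frame from the question.
VN.GRACE.Int <- data.frame(DecimDate = c(2003.000, 2003.083),
                           CSR = c(12.1465164, 1.8492431),
                           GFZ = c(5.50259937, 0.27744418),
                           JPL = c(15.7402752, 3.4811423))
VN.GLDAS <- data.frame(Decim_Date    = c(2003.000, 2003.083),
                       NOAH_SManom   = c(3.0596372, -1.4459928),
                       CLM_SManom    = c(0.4023805, -1.0255955),
                       VIC_SManom    = c(-0.2175665, -3.1338024),
                       SM_Month_Mean = c(1.081484, -1.868464))

# Every GRACE/GLDAS column pair; the GLDAS name varies fastest,
# giving the CSR_*, GFZ_*, JPL_* ordering.
pairs <- expand.grid(gldas = names(VN.GLDAS)[-1],
                     grace = names(VN.GRACE.Int)[-1],
                     stringsAsFactors = FALSE)

# Subtract the two equally sized blocks of columns element-wise.
res <- VN.GRACE.Int[pairs$grace] - VN.GLDAS[pairs$gldas]
names(res) <- paste(pairs$grace, pairs$gldas, sep = "_")
res <- cbind(VN.GRACE.Int["DecimDate"], res)
```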

Merge and fill R from two dataframes on date

I have reproduced my example as follows:
x<-as.Date(c(as.Date('2015-01-01'):as.Date('2016-01-01')), origin='1970-01-01')
dates<-as.Date(c(as.Date('2015-01-01'),as.Date('2015-03-04'),as.Date('2015-07-01')),origin='1970-01-01')
values<-c(3,7,10)
What I would like is an output dataframe where values for a particular date in x are joined to the most recent, but historic, date entry. For example:
x, value
2015-01-01, 3
2015-01-02, 3
....
2015-03-04, 7
2015-03-05, 7
....
2015-07-01, 10
2015-07-02, 10
....
2016-01-01, 10
I've currently implemented this through a for loop, but it feels slow and horrendously inefficient - I'm sure there must be some way in R to do it more automatically?
Try something like this:
x<-data.frame(x=as.Date(c(as.Date('2015-01-01'):as.Date('2016-01-01')), origin='1970-01-01'))
dates<-as.Date(c(as.Date('2015-01-01'),as.Date('2015-03-04'),as.Date('2015-07-01')),origin='1970-01-01')
values<-c(3,7,10)
a=data.frame(dates,values)
y=merge(x,a,by.x='x',by.y='dates',all.x=T)
colnames(y)=c("x","value")
test=function(i){
  if(is.na(y[i,2])){
    if(i==1) return(NA)
    return(test(i-1))
  }else{
    return(y[i,2])
  }
}
y$value=sapply(1:nrow(y),test)
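The recursive lookup above can become slow on long date ranges. A vectorised sketch of the same "most recent earlier date" join uses findInterval(), assuming dates is sorted in increasing order; the names idx and out are illustrative only.

```r
x      <- seq(as.Date("2015-01-01"), as.Date("2016-01-01"), by = "day")
dates  <- as.Date(c("2015-01-01", "2015-03-04", "2015-07-01"))
values <- c(3, 7, 10)

# For each day in x, find the position of the latest entry in dates
# that is on or before it (0 means no such entry exists).
idx <- findInterval(as.numeric(x), as.numeric(dates))
idx[idx == 0] <- NA   # days before the first entry get NA

out <- data.frame(x = x, value = values[idx])
```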
