How can I calculate the difference between rows within different lists, where the lists have different dimensions?
I use the following code:
names(ri1)
[1] "Sedol" "code" "ri" "date"
ri1<-ri1[order(ri1$Sedol,ri1$date),]
sri<-split(ri1,ri1$Sedol)
ri1$r<-as.vector(sapply(seq_along(sri), function(x) diff(c(0, sri[[x]][,3]))))
however, it fails with this error:
"Error in `$<-.data.frame`(`*tmp*`, "r", value = list(c(100, 0.00790000000000646, :
replacement has 1485 rows, data has 4687655"
For example, I have three lists:
date ri
1990 1
1991 2
1992 3
date ri
1990 1
1991 2
1992 3
1993 4
date ri
1990 1
1991 2
I want the results to look like:
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
date ri r
1990 1 0%
1991 2 100%
1992 3 100%
1993 4 100%
date ri r
1990 1 0%
1991 2 100%
Note: r = ri(t+1)/ri(t) - 1
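As a side note, the error in the original attempt most likely occurs because the groups returned by split have different lengths, so sapply returns a list with one element per Sedol (1485 of them) rather than a single vector matching the 4687655 rows. A minimal sketch of the same absolute-difference idea, flattened with unlist (assuming ri1 is already ordered by Sedol and date as above, so the concatenated groups line up with the rows):
# same diff(c(0, ...)) idea as the original attempt, but flattened so its
# length matches nrow(ri1); note this gives absolute differences, not the percent r
ri1$r <- unlist(lapply(sri, function(d) diff(c(0, d$ri))), use.names = FALSE)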
Using diff and lapply you can get something like
# I generate some data
dat1 <- data.frame(date = seq(1990,1999,length.out=5),ri = seq(1,10,length.out=5))
dat2 <- data.frame(date = seq(1990,1999,length.out=5),ri=seq(1,5,length.out=5))
# I put the data.frame in a list
ll <- list(dat1,dat2)
# I use lapply:
ll <- lapply(ll, function(dat){
  # I apply the formula you give in a vector version
  # maybe you need only diff in percent?
  dat$r <- round(c(0, diff(dat$ri)) / dat$ri * 100)
  dat
})
ll
[[1]]
date ri r
1 1990.00 1.00 0
2 1992.25 3.25 69
3 1994.50 5.50 41
4 1996.75 7.75 29
5 1999.00 10.00 22
[[2]]
date ri r
1 1990.00 1 0
2 1992.25 2 50
3 1994.50 3 33
4 1996.75 4 25
5 1999.00 5 20
You should use a combination of head and tail as follows:
r.fun <- function(ri) c(0, tail(ri, -1) / head(ri, -1) - 1)
lapply(sri, transform, r = r.fun(ri))
If your goal is to recombine (rbind) your data afterwards, then know that you can split/apply/combine everything within a single call to ave from the base package, or ddply from the plyr package:
transform(ri1, r = ave(ri, Sedol, FUN = r.fun))
or
library(plyr)
ddply(ri1, "Sedol", transform, r = r.fun(ri))
Edit: If you want the output to be in XX% as in your example, replace r.fun with:
r.fun <- function(ri) paste0(round(100 * c(0, tail(ri, -1) / head(ri, -1) - 1)), "%")
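For instance, applying r.fun to a small hypothetical vector (values chosen so each step doubles) gives:
r.fun(c(1, 2, 4))
# [1] "0%"   "100%" "100%"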
I have a function that takes arguments and I want to apply it over each row in a data.table.
My data looks as follows:
Row Temp Humidity Elevation
1 10 0.5 1000
2 25 1.5 2000
3 28 2.0 1500
and I have a function
myfunc <- function(x, n_features = 3){
  # Here x represents each row of a data table.
  # Feature names are important for me, as my actual function operates on feature names
  return(x[, Temp] + x[, Humidity] + (x[, Elevation] * n_features))
}
What I want my output to look like is
Row Temp Humidity Elevation myfuncout
1 10 0.5 1000 3010.5
2 25 1.5 2000 6026.5
3 28 2.0 1500 4530
I have tried df[, myfuncout := myfunc(x, n_features=3), by=.I] but this didn't work.
Also, not sure if I have to use .SD here to make this work...
Any inputs here on how I can achieve this?
Thanks!
If it is a data.table, we can use
myfunc <- function(dt, n_features = 3) {
  dt[, out := (Temp + Humidity) + (Elevation * n_features)]
  dt
}
myfunc(df, 3)
Output:
df
# Row Temp Humidity Elevation out
#1: 1 10 0.5 1000 3010.5
#2: 2 25 1.5 2000 6026.5
#3: 3 28 2.0 1500 4530.0
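For reference, a minimal sketch reconstructing the example data as a data.table (column values copied from the question), so the call above can be reproduced:
library(data.table)
df <- data.table(Row = 1:3,
                 Temp = c(10, 25, 28),
                 Humidity = c(0.5, 1.5, 2.0),
                 Elevation = c(1000, 2000, 1500))
myfunc(df, 3)   # adds the out column by reference, as shown above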
I have a data frame that is in the following format
df <- data.frame(name=LETTERS[1:5], location=c(2000,2021,4532,1931,3457),
value=c(1,0,1,1,0))
name location value
A 2000 1
B 2021 0
C 4532 1
D 1931 1
E 3457 0
There are approximately a million rows in the data frame. How would I create a new data frame that has the distance between every pair of locations that are within 1000 of each other, and also indicates whether the values are 1 for both locations?
For the above dataset, the new data frame would only have three rows, with values of 21 (absolute value of 2000 - 2021), 69 (absolute value of 2000 - 1931), and 90 (absolute value of 2021 - 1931), because those are the only differences that are less than 1000. It would also have a column of 0 (because the A and B values are not both 1), 1 (because the A and D values are both 1), and 0 (because the B and D values are not both 1). So it would look like:
21 0
69 1
90 0
I've tried using loops but since there are so many rows, it's inefficient. Is there some built in function that I should use to do this faster?
Thanks in advance.
library(sqldf)
sqldf("
select a.location
, b.location
, a.location - b.location as locdiff
, a.value*b.value as value
from df a
inner join df b
on a.location - b.location between 1 and 1000
")
This gives
a.location b.location locdiff value
1 2000 1931 69 1
2 2021 2000 21 0
3 2021 1931 90 0
Or with data.table. This is just @MKR's solution, but adding a column to avoid a large join result. Not sure if it's possible to achieve this without creating a new column.
setDT(df)
df[, loc2 := location - 1000]
df[df
, .( locdiff = i.location - x.location
, locationA = i.location
, locationB = x.location
, value = x.value*i.value)
, on = .(location >= loc2
, location < location)
, nomatch = 0]
gives
locdiff locationA locationB value
1: 69 2000 1931 1
2: 90 2021 1931 0
3: 21 2021 2000 0
I agree with @Gregor's comment that sqldf is the better option in the above scenario, in the sense that it avoids a Cartesian join of a million records.
But I tried to optimize the data.table-based solution by first joining on x.location > i.location and then filtering on diff <= 1000.
df <- data.frame(name=LETTERS[1:5], location=c(2000,2021,4532,1931,3457),
value=c(1,0,1,1,0))
library(data.table)
setDT(df)
df[df,.(name, diff = x.location - i.location, value = x.value*i.value),
on=.(location > location), nomatch=0][diff<=1000]
# name diff value
# 1: B 21 0
# 2: A 69 1
# 3: B 90 0
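For a quick sanity check on the five-row example, a brute-force base-R sketch (fine here, but clearly not suitable for a million rows):
# all pairwise absolute differences, keeping pairs that are between 1 and 1000 apart
d <- abs(outer(df$location, df$location, "-"))
idx <- which(d >= 1 & d <= 1000 & upper.tri(d), arr.ind = TRUE)
data.frame(locdiff = d[idx],
           value = df$value[idx[, 1]] * df$value[idx[, 2]])
#   locdiff value
# 1      21     0
# 2      69     1
# 3      90     0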
I am analysing some data and need help.
Basically, I have a dataset that looks like this:
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
As can be seen, there's a switch column that switches between 1 and 2 every 10 data points. I want to write code that does the following: when the switch column changes its value (from 1 to 2, or 2 to 1), delete the first 5 rows of data after the switch (i.e. leaving the last 5 data points for all 4 variables), average the remaining data points for O2 and CO2, and put them in 2 new columns (avg.O2 and avg.CO2) before the next switch. Then repeat this process until the end.
It's quite easy to do manually on paper or in Excel, but my real dataset comprises thousands of data points and I would like R to do it automatically for me. Does anyone have any ideas that could help me?
Please find my edits, which should work for both regular and irregular switching.
date <- seq(as.Date("2017-04-01"),as.Date("2017-05-09"),length.out=40)
switch <- c(rep(1:2,each=10),rep(1:2,each=10))
O2 <- runif(40,min=21.02,max=21.06)
CO2 <- runif(40,min=0.076,max=0.080)
test.data <- data.frame(date,switch,O2,CO2)
CleanMachineData <- function(Data, SwitchData, UnreliableRows = 5){
  # First, we can properly turn your switch column into a grouping column (1,2,1,2)->(1,2,3,4)
  grouplength <- rle(Data[,"switch"])$lengths
  # mapply lets us input vector arguments into typically one/first-element only argument functions.
  # In this case we create a sequence of lengths (output is a list/vector)
  grouping <- mapply(seq, grouplength)
  # Here we want it to become a single vector representing groups
  groups <- mapply(rep, 1:length(grouplength), each = grouplength)
  # if the frequency was irregular, it will be a list; if regular, it will be a matrix
  # convert either into a vector as follows:
  if(is.list(groups)){
    groups <- unlist(groups)
  } else {
    groups <- as.vector(groups)
  }
  Data$group <- groups
  #
  # vector of the first row of each new switch (except the starting 0)
  switchRow <- c(0, which(abs(diff(SwitchData)) == 1)) + 1
  # I use "as.vector" to turn the matrix output of mapply into a sequence of numbers.
  # "ToRemove" will have all the row numbers to get rid of from your original data,
  # except for what happens before (in this case) row 10
  ToRemove <- c(1:UnreliableRows, as.vector(mapply(seq, switchRow, switchRow + UnreliableRows - 1)))
  # "Keep" concatenates the missing beginning (1,2,3,4,5) with ToRemove via c() and drops them from 1:n with "-"
  # (it is not used below, but shows which rows survive)
  Keep <- seq(nrow(Data))[-c(1:UnreliableRows, ToRemove)]
  # Create the new data (in case you don't know: data[<ROW>, <COLUMN>])
  newdat <- Data[-ToRemove, ]
  # return the results
  newdat
}
dat <- CleanMachineData(test.data, test.data$switch, 5)
dat
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
19 2017-04-18 2 21.03252 0.07960098 2
20 2017-04-19 2 21.04032 0.07892145 2
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
39 2017-05-08 2 21.04136 0.07781525 4
40 2017-05-09 2 21.05375 0.07941123 4
aggregate(cbind(O2,CO2) ~ group, dat, mean)
group O2 CO2
1 1 21.04675 0.07812336
2 2 21.03497 0.07819329
3 3 21.03967 0.07834986
4 4 21.04166 0.07882221
# crazier, irregular switching
test.data2 <- test.data
test.data2$switch <- unlist(mapply(rep, 1:2, times = 1, each = c(10,8,10,5,3,10)))[1:20]
dat2 <- CleanMachineData(test.data2, test.data2$switch, 5)
dat2
date switch O2 CO2 group
6 2017-04-05 1 21.03922 0.07648886 1
7 2017-04-06 1 21.04071 0.07747368 1
8 2017-04-07 1 21.05742 0.07946615 1
9 2017-04-08 1 21.04673 0.07782362 1
10 2017-04-09 1 21.04966 0.07936446 1
16 2017-04-15 2 21.02526 0.07833825 2
17 2017-04-16 2 21.04511 0.07747774 2
18 2017-04-17 2 21.03165 0.07662803 2
24 2017-04-23 1 21.05658 0.07669662 3
25 2017-04-24 1 21.04452 0.07983165 3
26 2017-04-25 1 21.03691 0.07691438 3
27 2017-04-26 1 21.05846 0.07857017 3
28 2017-04-27 1 21.04128 0.07891908 3
29 2017-04-28 1 21.03837 0.07817021 3
30 2017-04-29 1 21.02334 0.07917546 3
36 2017-05-05 2 21.02890 0.07723042 4
37 2017-05-06 2 21.04606 0.07979641 4
38 2017-05-07 2 21.03822 0.07985775 4
# You can try a range of values for UnreliableRows with the following
lapply(5:7, function(x) {
dat <- CleanMachineData(test.data2, test.data2$switch, x)
list(data = dat, means = aggregate(cbind(O2,CO2)~group, dat, mean))
})
Use
test.data[rep(c(FALSE, TRUE), each=5),]
to always select the last five rows from each group of 10 rows.
Then you can use aggregate:
d2 <- test.data[rep(c(FALSE, TRUE), each=5),]
aggregate(cbind(O2, CO2) ~ 1, data=d2, FUN=mean)
If you want the average for every 5-rows-group:
aggregate(cbind(O2, CO2) ~ gl(k=5, n=nrow(d2)/5L), data=d2, FUN=mean)
Here is a generalization for an arbitrary number of rows in test.data:
stay <- rep(c(FALSE, TRUE), each=5, length.out=nrow(test.data))
d2 <- test.data[stay,]
group <- gl(k=5, n=nrow(d2)/5L+1L, length=nrow(d2))
aggregate(cbind(O2, CO2) ~ group, data=d2, FUN=mean)
Here is a variant for mixing the data with the averages:
group <- gl(k=10, n=nrow(test.data)/10L+1L, length=nrow(test.data))
L <- split(test.data, group)
mySummary <- function(x) {
  if (nrow(x) <= 5) return(NULL)
  x <- x[-(1:5), ]
  d.avg <- aggregate(cbind(O2, CO2) ~ 1, data = x, FUN = mean)
  rbind(x, cbind(date = NA, switch = -1, d.avg))
}
lapply(L, mySummary) # as list of dataframes
do.call(rbind, lapply(L, mySummary)) # as one dataframe
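If the goal is literally the avg.O2 and avg.CO2 columns from the question, here is a rough sketch along the same keep-the-last-five lines, using ave (my own addition, assuming the regular 10-row cycle of test.data):
keep <- rep(c(FALSE, TRUE), each = 5, length.out = nrow(test.data))
d2 <- test.data[keep, ]
grp <- cumsum(c(1, diff(d2$switch) != 0))   # new group whenever switch changes
d2$avg.O2 <- ave(d2$O2, grp, FUN = mean)
d2$avg.CO2 <- ave(d2$CO2, grp, FUN = mean)
head(d2)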
I have data like this (derived using the table() function):
dat <- read.table(text = "responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6", header = TRUE)
dat
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
All I want is a totals row: a new row at the bottom that says Total, with 35 in the freq column and 100 in the percent column. I am unable to find a solution. colSums doesn't work because the first column is a string.
One option is converting to a matrix and using addmargins to get the column sums as a separate row at the bottom. But the result will be a matrix.
m1 <- as.matrix(df1[-1])
rownames(m1) <- df1[,1]
res <- addmargins(m1, 1)
res
# freq percent
#A 9 25.7
#B 13 37.1
#C 10 28.6
#D 3 8.6
#Sum 35 100.0
If you want to convert to data.frame
data.frame(responses=rownames(res), res)
Another option is to get the sums with colSums on the numeric columns (df1[-1]) (I think this is where the OP got into trouble, i.e. applying colSums to the entire dataset instead of a subset), create a new data.frame with the responses column, and rbind it with the original dataset.
rbind(df1, data.frame(responses='Total', as.list(colSums(df1[-1]))))
# responses freq percent
#1 A 9 25.7
#2 B 13 37.1
#3 C 10 28.6
#4 D 3 8.6
#5 Total 35 100.0
data
df1 <- structure(list(responses = c("A", "B", "C", "D"), freq = c(9L,
13L, 10L, 3L), percent = c(25.7, 37.1, 28.6, 8.6)),
.Names = c("responses", "freq", "percent"), class = "data.frame",
row.names = c(NA, -4L))
This might be relevant, using the SciencesPo package; see this example:
library(SciencesPo)
tab(mtcars,gear,cyl)
#output
=================================
cyl
--------------------
gear 4 6 8 Total
---------------------------------
3 1 2 12 15
6.7% 13% 80% 100%
4 8 4 0 12
66.7% 33% 0% 100%
5 2 1 2 5
40.0% 20% 40% 100%
---------------------------------
Total 11 7 14 32
34.4% 22% 44% 100%
=================================
Chi-Square Test for Independence
Number of cases in table: 32
Number of factors: 2
Test for independence of all factors:
Chisq = 18.036, df = 4, p-value = 0.001214
Chi-squared approximation may be incorrect
X^2 df P(> X^2)
Likelihood Ratio 23.260 4 0.00011233
Pearson 18.036 4 0.00121407
Phi-Coefficient : NA
Contingency Coeff.: 0.6
Cramer's V : 0.531
@akrun I posted it but you already did the same. Correct me if I'm wrong, but I think we just need this, without creating a new data frame or using as.list.
rbind(df1, c("Total", colSums(df1[-1])))
Output:
responses freq percent
1 A 9 25.7
2 B 13 37.1
3 C 10 28.6
4 D 3 8.6
5 Total 35 100
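One caveat worth noting (my own observation, not part of the answer): binding a character vector onto the data frame coerces the numeric columns to character, which is why percent prints as 100 rather than 100.0. A quick check, and a variant that keeps the numeric classes:
res <- rbind(df1, c("Total", colSums(df1[-1])))
str(res)   # freq and percent are now character
# supplying a one-row data.frame instead preserves the classes
rbind(df1, data.frame(responses = "Total",
                      freq = sum(df1$freq),
                      percent = sum(df1$percent)))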
With sqldf, the classes of the data frame are preserved.
library(sqldf)
sqldf("SELECT * FROM df1
UNION
SELECT 'Total', SUM(freq) AS freq, SUM(percent) AS percent FROM df1")
Alternatively, you can use the margin.table and rbind functions from base R. Two lines and voilà...
PS: The lines here are longer because I am recreating the data, but you know what I mean :-)
Data
df1 <- matrix(c(9,25.7,13,37.1,10,28.6,3,8.6),ncol=2,byrow=TRUE)
colnames(df1) <- c("freq","percent")
rownames(df1) <- c("A","B","C","D")
Creating Total Calculation
Total <- margin.table(df1,2)
Combining Total Calculation to Original Data
df2 <- rbind(df1, Total)
df2
Inelegant, but it gets the job done. Please provide reproducible data frames so we don't have to build them first:
data = data.frame(letters[1:4], c(9,13,10,3), c(25.7,37.1, 28.6, 8.6))
colnames(data) = c("X","Y","Z")
data = rbind(data[,1:3], matrix(c("Sum",lapply(data[,2:3], sum)), nrow = 1)[,1:3])
library(janitor)
dat %>%
adorn_totals("row")
responses freq percent
A 9 25.7
B 13 37.1
C 10 28.6
D 3 8.6
Total 35 100.0
I have this code in R:
corr = function(x, y) {
  sx = sign(x)
  sy = sign(y)
  cond_a = sx == sy && sx > 0 && sy > 0
  cond_b = sx < sy && sx < 0 && sy > 0
  cond_c = sx > sy && sx > 0 && sy < 0
  cond_d = sx == sy && sx < 0 && sy < 0
  cond_e = sx == 0 || sy == 0
  if(cond_a) return('a')
  else if(cond_b) return('b')
  else if(cond_c) return('c')
  else if(cond_d) return('d')
  else if(cond_e) return('e')
}
Its role is to be used in conjunction with the mapply function in R in order to count all the possible sign patterns present in a time series. In this case the pattern has length 2 and all the possible tuples are: (+,+), (+,-), (-,+), (-,-).
I use the corr function this way:
> with(dt['AAPL'], table(mapply(corr, Return[-1], Return[-length(Return)])) /length(Return)*100)
a b c d e
24.6129416 25.4466058 25.4863041 24.0174672 0.3969829
> dt["AAPL",list(date, Return)]
symbol date Return
1: AAPL 2014-08-29 -0.3499903
2: AAPL 2014-08-28 0.6496702
3: AAPL 2014-08-27 1.0987923
4: AAPL 2014-08-26 -0.5235654
5: AAPL 2014-08-25 -0.2456037
I would like to generalize the corr function to n arguments. This means that for every n I would have to write down all the conditions corresponding to all the possible n-tuples. Currently the best thing I can think of is to write a Python script that generates the code string using loops, but there must be a way to do this properly. Do you have an idea how I could generalize the tedious condition writing? Maybe I could try expand.grid, but how would I do the matching then?
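On the expand.grid idea mentioned in the question, the matching could look something like the rough sketch below; the helper name pattern_label is mine and purely illustrative:
# build every possible sign tuple of length n and label the observed window
pattern_label <- function(x, n = length(x)) {
  patterns <- expand.grid(rep(list(c(-1, 1)), n))   # all 2^n sign tuples
  labels <- apply(patterns, 1, paste, collapse = ",")
  key <- paste(sign(x), collapse = ",")             # signs of the observed window
  labels[match(key, labels)]                        # NA if any return is exactly 0
}
# usage on rolling windows of length 2 (embed puts the most recent value first):
# table(apply(embed(Return, 2), 1, pattern_label))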
I think you're better off using rollapply(...) in the zoo package for this. Since you seem to be using quantmod anyway (which loads xts and zoo), here is a solution that does not use all those nested if(...) statements.
library(quantmod)
AAPL <- getSymbols("AAPL",auto.assign=FALSE)
AAPL <- AAPL["2007-08::2009-03"] # AAPL during the crash...
Returns <- dailyReturn(AAPL)
get.patterns <- function(ret,n) {
f <- function(x) { # identifies which row of `patterns` matches sign(x)
which(apply(patterns,1,function(row)all(row==sign(x))))
}
returns <- na.omit(ret)
patterns <- expand.grid(rep(list(c(-1,1)),n))
labels <- apply(patterns,1,function(row) paste0("(",paste(row,collapse=","),")"))
result <- rollapply(returns,width=n,f,align="left")
data.frame(100*table(labels[result])/(length(returns)-(n-1)))
}
get.patterns(Returns,n=2)
# Var1 Freq
# 1 (-1,-1) 22.67303
# 2 (-1,1) 26.49165
# 3 (1,-1) 26.73031
# 4 (1,1) 23.15036
get.patterns(Returns,n=3)
# Var1 Freq
# 1 (-1,-1,-1) 9.090909
# 2 (-1,-1,1) 13.397129
# 3 (-1,1,-1) 14.593301
# 4 (-1,1,1) 11.722488
# 5 (1,-1,-1) 13.636364
# 6 (1,-1,1) 13.157895
# 7 (1,1,-1) 12.200957
# 8 (1,1,1) 10.765550
The basic idea is to create a patterns matrix with 2^n rows and n columns, where each row represents one of the possible patterns (e.g. (1,1), (-1,1), etc.). Then pass the daily returns to this function n-wise using rollapply(...) and identify which row in patterns matches sign(x) exactly. Then use this vector of row numbers as an index into labels, which contains a character representation of the patterns, and use table(...) as you did.
This is general for an n-day pattern, but it ignores situations where any return is exactly zero, so the $Freq column does not necessarily add up to 100. As you can see, this doesn't happen very often.
It's interesting that even during the crash it was (very slightly) more likely to have two up days in succession, than two down days. If you look at plot(Cl(AAPL)) during this period, you can see that it was a pretty wild ride.
This is a little different approach, but it may give you what you're looking for and allows you to use any size of n-tuple. The basic approach is to find the signs of the adjacent changes for each sequential set of n returns, convert the n-length sign changes into n-tuples of 1's and 0's where 0 = negative return and 1 = positive return, and then calculate the decimal value of each n-tuple taken as a binary number. These numbers will clearly be different for each distinct n-tuple. Using a zoo time series for these calculations provides several useful functions, including get.hist.quote() to retrieve stock prices, diff() to calculate returns, and rollapply() to use in calculating the n-tuples and their sums. The code below does these calculations, converts the sum of the sign changes back to n-tuples of binary digits, and collects the results in a data frame.
library(zoo)
library(tseries)
n <- 3 # set size of n-tuple
#
# get stock prices and compute % returns
#
dtz <- get.hist.quote("AAPL","2014-01-01","2014-10-01", quote="Close")
dtz <- merge(dtz, (diff(dtz, arithmetic=FALSE ) - 1)*100)
names(dtz) <- c("prices","returns")
#
# calculate the sum of the sign changes
#
dtz <- merge(dtz, rollapply( data=(sign(dtz$returns)+1)/2, width=n,
FUN=function(x, y) sum(x*y), y = 2^(0:(n-1)), align="right" ))
dtz <- fortify.zoo(dtz)
names(dtz) <- c("date","prices","returns", "sum_sgn_chg")
#
# convert the sum of the sign changes back to an n-tuple of binary digits
#
for( i in 1:nrow(dtz) )
  dtz$sign_chg[i] <- paste(((as.numeric(dtz$sum_sgn_chg[i]) %/% (2^(0:(n-1)))) %% 2), collapse="")
#
# report first part of result
#
head(dtz, 10)
#
# report count of changes by month and type
#
table(format(dtz$date,"%Y %m"), dtz$sign_chg)
An example of possible output is a table showing the count of changes by type for each month.
000 001 010 011 100 101 110 111 NANANA
2014 01 1 3 3 2 3 2 2 2 3
2014 02 1 2 4 2 2 3 2 3 0
2014 03 2 3 0 4 4 1 4 3 0
2014 04 2 3 2 3 3 2 3 3 0
2014 05 2 2 1 3 1 2 3 7 0
2014 06 3 4 3 2 4 1 1 3 0
2014 07 2 1 2 4 2 5 5 1 0
2014 08 2 2 1 3 1 2 2 8 0
2014 09 0 4 2 3 4 2 4 2 0
2014 10 0 0 1 0 0 0 0 0 0
So this would show that in month 1, January of 2014, there was one set of three days with 000, indicating 3 down returns; 3 sets with the 001 change, indicating two down returns followed by one positive return; and so forth. Most months seem to have a fairly random distribution, but May and August show 7 and 8 sets of 3 days of positive returns, reflecting the fact that these were strong months for AAPL.
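To make the sign-to-binary encoding concrete, here is a small standalone illustration with hypothetical returns and n = 3:
n <- 3
ret <- c(0.5, -0.2, 1.1)                  # signs (+, -, +)
bits <- (sign(ret) + 1) / 2               # 0 = negative, 1 = positive  ->  1 0 1
dec <- sum(bits * 2^(0:(n - 1)))          # 1*1 + 0*2 + 1*4 = 5
paste((dec %/% 2^(0:(n - 1))) %% 2, collapse = "")
# [1] "101"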