I have daily stock return data for several companies and need to remove those values that have a specific relation to the previous day's return.
As a formula: (1 + r) * (1 + e) - 1 <= 0.5 (i.e. 50%), where r is the return on the current day and e the return on the previous day, with at least one of r or e greater than 100%.
The data frame DF looks like this:
Date A B C D
01.01.2015 0.15 0.17 0.70 0.65
02.01.2015 1.01 0.75 0.01 -0.18
01.02.2015 -0.50 0.64 1.20 0.1
06.02.2015 0.12 0.54 0.13 1.50
01.03.2016 0.45 0.54 1.89 0.56
After applying this filter, DF should look like this:
Date A B C D
01.01.2015 0.15 0.17 0.70 0.65
02.01.2015 1.01 0.75 0.01 -0.18
01.02.2015 NA 0.64 1.20 0.1
06.02.2015 0.12 0.54 0.13 1.50
01.03.2016 0.45 0.54 1.89 0.56
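For a single pair of returns, the condition can be checked directly; a minimal sketch using the flagged value in column A above:

```r
# column A above: e is the 02.01.2015 return, r the 01.02.2015 return
e <- 1.01
r <- -0.50
# flag the value when the compound two-day return stays at or below 50%
# even though one of the two daily returns exceeds 100%
flag <- (r > 1 | e > 1) & ((1 + r) * (1 + e) - 1 <= 0.5)
flag  # TRUE, so this return is replaced with NA
```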
Thanks for your help!
I would try this:
library(tidyverse)
check_fn <- function(Z) {
  ifelse((lag(Z, n = 1) > 1 | Z > 1) & ((1 + lag(Z, n = 1)) * (1 + Z) <= 1.5),
         NA, Z)
}

Y <- X %>%
  mutate_at(vars(2:5), check_fn)
Y[1, 2:5] <- X[1, 2:5]  # lag() yields NA on the first row, so restore it from X
Y
... which generates:
Date A B C D
1 01.01.2015 0.15 0.17 0.70 0.65
2 02.01.2015 1.01 0.75 0.01 -0.18
3 01.02.2015 NA 0.64 1.20 0.10
4 06.02.2015 0.12 0.54 0.13 1.50
5 01.03.2016 0.45 0.54 1.89 0.56
I hope it helps you.
There is surely a more elegant solution.
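In current dplyr versions, mutate_at() is superseded by across(); a sketch of the equivalent call, using the same check_fn as above on a toy X with two value columns:

```r
library(dplyr)

# check_fn as defined in the answer above
check_fn <- function(Z) {
  ifelse((lag(Z, n = 1) > 1 | Z > 1) & ((1 + lag(Z, n = 1)) * (1 + Z) <= 1.5),
         NA, Z)
}

# toy data in the shape of DF (first two value columns only)
X <- data.frame(Date = c("01.01.2015", "02.01.2015", "01.02.2015"),
                A = c(0.15, 1.01, -0.50),
                B = c(0.17, 0.75, 0.64))

Y <- X %>%
  mutate(across(2:3, check_fn))  # across() replaces mutate_at(vars(...))
Y[1, 2:3] <- X[1, 2:3]           # restore the first row, as above
```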
> M=matrix(c(0.15,0.17,0.70,0.65,1.01,0.75,0.01,-0.18,-0.50,0.64,1.20,0.1,0.12,0.54,0.13,1.50,0.45,0.54,1.89,0.56),nrow = 5, byrow = TRUE)
> M
[,1] [,2] [,3] [,4]
[1,] 0.15 0.17 0.70 0.65
[2,] 1.01 0.75 0.01 -0.18
[3,] -0.50 0.64 1.20 0.10
[4,] 0.12 0.54 0.13 1.50
[5,] 0.45 0.54 1.89 0.56
> ifelse(rbind(c(T,T,T,T), !(((M[2:5,]>1)|(M[1:4,]>1))&(((1+M[2:5,])*(1+M[1:4,])-1)<.5))), M, NA)
[,1] [,2] [,3] [,4]
[1,] 0.15 0.17 0.70 0.65
[2,] 1.01 0.75 0.01 -0.18
[3,] NA 0.64 1.20 0.10
[4,] 0.12 0.54 0.13 1.50
[5,] 0.45 0.54 1.89 0.56
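The hard-coded row indices (2:5 vs. 1:4) can be generalized to any number of rows; a sketch of the same idea on the matrix M:

```r
# same matrix M as above
M <- matrix(c(0.15, 0.17, 0.70, 0.65,
              1.01, 0.75, 0.01, -0.18,
              -0.50, 0.64, 1.20, 0.10,
              0.12, 0.54, 0.13, 1.50,
              0.45, 0.54, 1.89, 0.56), nrow = 5, byrow = TRUE)

cur  <- M[-1, , drop = FALSE]        # current-day returns (rows 2..n)
prev <- M[-nrow(M), , drop = FALSE]  # previous-day returns (rows 1..n-1)
mask <- (cur > 1 | prev > 1) & ((1 + cur) * (1 + prev) - 1 < 0.5)

M2 <- M
M2[-1, ][mask] <- NA  # the first row has no previous day and is never flagged
M2
```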
Sorry, I misread some of your post. I have corrected it so it matches your expected output.
library(data.table)
setDT(dat)

relationship = function(x){
  ifelse(((1 + x) * (1 + shift(x)) - 1) < .5 & !is.na(shift(x)) &
           (x > 1 | shift(x) > 1), NA, x)
}

dat = dat[ , lapply(.SD, relationship), .SDcols = c("A", "B", "C", "D")]
> dat[ , lapply(.SD, relationship), .SDcols = c("A", "B", "C", "D")]
A B C D
1: 0.15 0.17 0.70 0.65
2: 1.01 0.75 0.01 -0.18
3: NA 0.64 1.20 0.10
4: 0.12 0.54 0.13 1.50
5: 0.45 0.54 1.89 0.56
You can cbind the dates back onto the data.table.
I should add, in case there are many more columns this needs to be done on and you don't want to write them all out, that you could do something like this:
Dates = dat$Date
dat[ , "Date" := NULL]
dat = dat[ , lapply(.SD, relationship)]
That will apply the function to every column in the data.table.
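Putting the pieces together, a self-contained sketch of the full pipeline, including binding the dates back on (toy data abbreviated to two value columns):

```r
library(data.table)

relationship <- function(x) {
  ifelse(((1 + x) * (1 + shift(x)) - 1) < .5 & !is.na(shift(x)) &
           (x > 1 | shift(x) > 1), NA, x)
}

# toy data in the shape of the question's DF
dat <- data.table(Date = c("01.01.2015", "02.01.2015", "01.02.2015"),
                  A = c(0.15, 1.01, -0.50),
                  B = c(0.17, 0.75, 0.64))

Dates <- dat$Date
dat[, Date := NULL]
dat <- dat[, lapply(.SD, relationship)]
dat <- cbind(Date = Dates, dat)  # bind the dates back on
```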
# Create a data frame
> df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
> df <- round(abs(df), 2)
>
> df
a b c threshold
1 1.17 0.27 1.26 0.19
2 1.41 1.57 1.23 0.97
3 0.16 0.11 0.35 1.34
4 0.03 0.04 0.10 1.50
5 0.23 1.10 2.68 0.45
6 0.99 1.36 0.17 0.30
7 0.28 0.68 1.22 0.56
>
>
# Replace values in columns a, b, and c with NA if > value in threshold
> df[1:3][df[1:3] > df[4]] <- "NA"
Error in Ops.data.frame(df[1:3], df[4]) :
‘>’ only defined for equally-sized data frames
There may be an obvious solution that I am unable to produce. The intent is to replace values in columns "a", "b", and "c" with NA if the value is larger than the one in "threshold", and I need to do that row by row.
If I had done it right, the df would look like this:
a b c threshold
1 NA NA NA 0.19
2 NA NA NA 0.97
3 0.16 0.11 0.35 1.34
4 0.03 0.04 0.10 1.50
5 0.23 NA NA 0.45
6 NA NA 0.17 0.30
7 0.28 NA NA 0.56
I had also tried the apply() approach, but to no avail. Can you help, please?
I would use dplyr for most such use cases.
One way below:
> set.seed(10)
> df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
> df <- round(abs(df), 2)
> df
a b c threshold
1 0.02 0.36 0.74 2.19
2 0.18 1.63 0.09 0.67
3 1.37 0.26 0.95 2.12
4 0.60 1.10 0.20 1.27
5 0.29 0.76 0.93 0.37
6 0.39 0.24 0.48 0.69
7 1.21 0.99 0.60 0.87
>
> library(dplyr)
> df %>%
+   mutate_at(vars(a:c), ~ifelse(.x > df$threshold, NA, .x))
a b c threshold
1 0.02 0.36 0.74 2.19
2 0.18 NA 0.09 0.67
3 1.37 0.26 0.95 2.12
4 0.60 1.10 0.20 1.27
5 0.29 NA NA 0.37
6 0.39 0.24 0.48 0.69
7 NA NA 0.60 0.87
You can use the apply function across the data frame:
df[, 1:3] <- apply(df[, 1:3, drop = FALSE], 2, function(x) ifelse(x > df[, 4], NA, x))
The problem with your code was the usage of df[4] instead of df[, 4]. The difference is that df[4] returns a data.frame with one column and df[, 4] returns a vector.
That's why
df[1:3] > df[4]
returns
Error in Ops.data.frame(df[1:3], df[4]) :
‘>’ only defined for equally-sized data frames
While this works as expected
df[1:3][df[1:3] > df[, 4]] <- NA
df
# a b c threshold
#1 0.63 0.74 NA 0.78
#2 NA NA 0.04 0.07
#3 0.84 0.31 0.02 1.99
#4 NA NA NA 0.62
#5 NA NA NA 0.06
#6 NA NA NA 0.16
#7 0.49 NA 0.92 1.47
data
set.seed(1)
df <- data.frame(a = rnorm(7), b = rnorm(7), c = rnorm(7), threshold = rnorm(7))
df <- round(abs(df), 2)
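The class difference between df[4] and df[, 4] is easy to verify; a minimal sketch:

```r
d <- data.frame(a = 1:3, b = 4:6)
class(d[2])    # "data.frame": single-bracket indexing keeps a one-column data frame
class(d[, 2])  # "integer": adding the comma drops down to a plain vector
```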
You can use a for-loop like this:
for(i in 1:(ncol(df)-1)){
df[, i] <- ifelse(df[, i] > df[, 4], NA, df[, i])
}
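An equivalent without the explicit loop, using replace() over the first three columns (a sketch assuming the same column layout as the question):

```r
# same data layout as the question: columns a, b, c and a threshold column
df <- data.frame(a = c(1.17, 0.16), b = c(0.27, 0.11),
                 c = c(1.26, 0.35), threshold = c(0.19, 1.34))
df[1:3] <- lapply(df[1:3], function(x) replace(x, x > df$threshold, NA))
df
```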
Data frame df1 has a column with the variable of interest (V1); some of the values in that column correspond to column names in another data frame (df2).
I need to find the overlap between the values (rows) of that column in df1 and the columns of df2.
head(df1)
V1 CHR MAPINFO Pval
a 2 38067017 0.27
c 2 38070880 0.29
d 2 38073394 0.00
e 2 38073443 0.00
f 2 38073564 0.01
head(df2)
a b c d f
-0.09 -0.08 -0.50 0.50 0.35
0.00 0.00 0.40 -0.40 -0.85
0.32 0.30 0.20 0.74 0.42
-0.41 -0.52 -0.72 -0.90 -0.96
1.30 1.30 1.10 1.10 1.20
-1.12 -1.78 -1.40 1.40 1.20
For example, df2 has no "e" and df1 has no "b". How can I keep only the ones that are present both in df1$V1 and in the columns of df2?
In the end I need the intersection of the two data frames (values present in both).
head(df1)
V1 CHR MAPINFO Pval
a 2 38067017 0.27
c 2 38070880 0.29
d 2 38073394 0.00
f 2 38073564 0.01
head(df2)
a c d f
-0.09 -0.50 0.50 0.35
0.00 0.40 -0.40 -0.85
0.32 0.20 0.74 0.42
-0.41 -0.72 -0.90 -0.96
1.30 1.10 1.10 1.20
-1.12 -1.40 1.40 1.20
Since the real number of these columns is over ~1200, I cannot filter them one by one.
Is there an elegant way other than transposing?
Base R solution:
df1 <- subset(df1, df1$V1 %in% names(df2))  # keep rows whose V1 is a column of df2
df2 <- df2[, df1$V1]                        # keep (and order) the matching columns
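A symmetric variant with intersect(), in case df2 might also contain columns that never appear in df1$V1 (a sketch on toy versions of the question's data):

```r
# toy versions of df1 and df2 from the question
df1 <- data.frame(V1 = c("a", "c", "d", "e", "f"),
                  Pval = c(0.27, 0.29, 0.00, 0.00, 0.01))
df2 <- data.frame(a = 1, b = 2, c = 3, d = 4, f = 5)

common <- intersect(df1$V1, names(df2))  # names present in both
df1 <- df1[df1$V1 %in% common, ]
df2 <- df2[, common, drop = FALSE]       # also orders df2 like df1$V1
```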
I am having some problems sorting my dataset into bins based on the numeric value of the data. I tried the shingle function from the lattice package, which seems to split the data accurately.
However, I can't extract the desired output, namely how the data is divided into the predefined bins; I only seem to be able to print it.
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
How do I extract the intervals output by the shingle function, instead of only printing them?
The intervals being this output:
Intervals:
min max count
1 0.38 0.40 0
2 0.42 0.44 6
3 0.46 0.48 46
4 0.50 0.52 251
5 0.54 0.56 697
6 0.58 0.60 1062
7 0.62 0.64 1215
8 0.66 0.68 1227
9 0.70 0.72 1231
10 0.74 0.76 1293
11 0.78 0.80 1330
12 0.82 0.84 1739
13 0.86 0.88 2454
14 0.90 0.92 3048
15 0.94 0.96 8936
16 0.98 1.00 71446
As a variable that can be fed to another function.
The shingle() function stores these values in attributes, which you can inspect with attributes().
The levels are specifically given by attr(bin_1, "levels").
So:
set.seed(1337)
data_1 = runif(100)
bin_interval = matrix(c(0.38,0.42,0.46,0.50,0.54,0.58,0.62,0.66,0.70,0.74,0.78,0.82,0.86,0.90,0.94,0.98,
0.40,0.44,0.48,0.52,0.56,0.60,0.64,0.68,0.72,0.76,0.80,0.84,0.88,0.92,0.96,1.0),
ncol = 2, nrow = 16)
bin_1 = shingle(data_1,intervals = bin_interval)
attr(bin_1,"levels")
This gives:
[,1] [,2]
[1,] 0.38 0.40
[2,] 0.42 0.44
[3,] 0.46 0.48
[4,] 0.50 0.52
[5,] 0.54 0.56
[6,] 0.58 0.60
[7,] 0.62 0.64
[8,] 0.66 0.68
[9,] 0.70 0.72
[10,] 0.74 0.76
[11,] 0.78 0.80
[12,] 0.82 0.84
[13,] 0.86 0.88
[14,] 0.90 0.92
[15,] 0.94 0.96
[16,] 0.98 1.00
Edit
The count information for each interval is only computed within the print.shingle method. Thus, you would need to run the following code:
count.shingle <- function(x) {
  l <- levels(x)
  n <- nlevels(x)
  int <- data.frame(min = numeric(n), max = numeric(n), count = numeric(n))
  for (i in 1:n) {
    int$min[i]   <- l[[i]][1]
    int$max[i]   <- l[[i]][2]
    int$count[i] <- length(x[x >= l[[i]][1] & x <= l[[i]][2]])
  }
  int
}
a <- count.shingle(bin_1)
This gives:
> a
min max count
1 0.38 0.40 0
2 0.42 0.44 1
3 0.46 0.48 3
4 0.50 0.52 1
5 0.54 0.56 2
6 0.58 0.60 2
7 0.62 0.64 2
8 0.66 0.68 4
9 0.70 0.72 1
10 0.74 0.76 3
11 0.78 0.80 2
12 0.82 0.84 2
13 0.86 0.88 5
14 0.90 0.92 1
15 0.94 0.96 1
16 0.98 1.00 2
where a$min is lower range, a$max is upper range, and a$count is the number within the bins.
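If only the per-bin counts are needed, they can also be computed straight from bin_interval without touching the shingle internals; a sketch counting over the same closed intervals as count.shingle above:

```r
set.seed(1337)
data_1 <- runif(100)
bin_interval <- matrix(c(0.38, 0.42, 0.46, 0.50, 0.54, 0.58, 0.62, 0.66,
                         0.70, 0.74, 0.78, 0.82, 0.86, 0.90, 0.94, 0.98,
                         0.40, 0.44, 0.48, 0.52, 0.56, 0.60, 0.64, 0.68,
                         0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.0),
                       ncol = 2, nrow = 16)

# count values falling in each closed interval [min, max]
counts <- sapply(seq_len(nrow(bin_interval)), function(i)
  sum(data_1 >= bin_interval[i, 1] & data_1 <= bin_interval[i, 2]))
```

Note the bins have gaps (e.g. 0.40 to 0.42), so values in between are counted nowhere, exactly as in the printed shingle output.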
I have an input file with the format below:
RN KEY MET1 MET2 MET3 MET4
1 1 0.11 0.41 0.91 0.17
2 1 0.94 0.02 0.17 0.84
3 1 0.56 0.64 0.46 0.7
4 1 0.57 0.23 0.81 0.09
5 2 0.82 0.67 0.39 0.63
6 2 0.99 0.90 0.34 0.84
7 2 0.83 0.01 0.70 0.29
I have to run k-means in R separately for the subset with KEY=1, the subset with KEY=2, and so on.
Afterwards, the final output CSV should look like this:
RN KEY MET1 MET2 MET3 MET4 CLST
1 1 0.11 0.41 0.91 0.17 1
2 1 0.94 0.02 0.17 0.84 1
3 1 0.56 0.64 0.46 0.77 2
4 1 0.57 0.23 0.81 0.09 2
5 2 0.82 0.67 0.39 0.63 1
6 2 0.99 0.90 0.34 0.84 2
7 2 0.83 0.01 0.70 0.29 2
I.e. KEY=1 is to be treated as a separate DF, KEY=2 as another separate DF, and so on.
Finally, the clustering output of each DF is to be combined first with the KEY column (since KEY cannot participate in the clustering) and then with the other DFs for the final output.
In the above example :
DF1 is
KEY MET1 MET2 MET3 MET4
1 0.11 0.41 0.91 0.17
1 0.94 0.02 0.17 0.84
1 0.56 0.64 0.46 0.77
1 0.57 0.23 0.81 0.09
DF2 is
KEY MET1 MET2 MET3 MET4
2 0.82 0.67 0.39 0.63
2 0.99 0.90 0.34 0.84
2 0.83 0.01 0.70 0.29
Please suggest how to achieve this in R.
Pseudo code:
n <- length(unique(mydf$KEY))
finaldf <- NULL
for (i in 1:n) {
  # fetch the partial df for each value of KEY and run k-means
  dummydf <- subset(mydf, mydf$KEY == i)
  KmeansIns <- kmeans(dummydf, 2)
  # combine with the cluster result
  dummydf <- data.frame(dummydf, KmeansIns$cluster)
  # stack each small df into the final global DF
  finaldf <- rbind(finaldf, dummydf)
}
# Now finaldf can be written to a file
I think the easiest way would be to use by. Something along the lines of
by(data = DF, INDICES = DF$KEY, FUN = function(x) {
# your clustering code here
})
where x is a subset of your DF for each KEY.
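A fuller sketch of that approach, collecting the pieces back into one data frame (the column names and centers = 2 follow the question; adjust as needed):

```r
# sample data from the question
DF <- data.frame(RN = 1:7, KEY = c(1, 1, 1, 1, 2, 2, 2),
                 MET1 = c(0.11, 0.94, 0.56, 0.57, 0.82, 0.99, 0.83),
                 MET2 = c(0.41, 0.02, 0.64, 0.23, 0.67, 0.90, 0.01),
                 MET3 = c(0.91, 0.17, 0.46, 0.81, 0.39, 0.34, 0.70),
                 MET4 = c(0.17, 0.84, 0.70, 0.09, 0.63, 0.84, 0.29))

res <- by(DF, DF$KEY, function(x) {
  x$CLST <- kmeans(x[, c("MET1", "MET2", "MET3", "MET4")], centers = 2)$cluster
  x
})
finaldf <- do.call(rbind, res)  # stack the per-KEY results back together
```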
A solution using data.table:
library(data.table)
setDT(DF)[,CLST:=kmeans(.SD, centers=2)$clust, by=KEY, .SDcols=3:6]
DF
# RN KEY MET1 MET2 MET3 MET4 CLST
# 1: 1 1 0.11 0.41 0.91 0.17 2
# 2: 2 1 0.94 0.02 0.17 0.84 1
# 3: 3 1 0.56 0.64 0.46 0.70 1
# 4: 4 1 0.57 0.23 0.81 0.09 2
# 5: 5 2 0.82 0.67 0.39 0.63 2
# 6: 6 2 0.99 0.90 0.34 0.84 2
# 7: 7 2 0.83 0.01 0.70 0.29 1
#Read data
mdf <- read.table("mydat.txt", header=T)
#Convert to list based on KEY column
mls <- split(mdf, f=mdf$KEY)
#Define columns to use in clustering
myv <- paste0("MET", 1:4)
#Cluster each df item in list : modify kmeans() args as appropriate
kls <- lapply(X = mls, FUN = function(x) {
  x$clust <- kmeans(x[, myv], centers = 2)$cluster
  x
})
#Make final "global" df
finaldf <- do.call(rbind, kls)
I want to subset a data frame (df) so that it contains only the maximum value of each row across columns 1 to 10, together with the name of the column the maximum came from.
Example data frame:
0 1 2 3 4
0.01 0.12 0.41 0.11 0.11
0.13 0.12 0.33 0.14 0.07
0.02 0.20 0.11 0.27 0.17
0.11 0.33 0.04 0.09 0.24
0.08 0.07 0.04 0.05 0.58
Currently I'm using this:
new_df[] <- apply(df[, 1:4], 1, max) #get the max value of current row
new_df<- subset(new_df, select = c(1)) #keep only one column
I get this:
0.41
0.33
0.27
0.33
0.58
but I can't get the column name the max value came from.
Desired result:
2 0.41
2 0.33
3 0.27
1 0.33
4 0.58
Thanks in advance for your help.
Try this:
> t(apply(df, 1, function(x) c(which.max(x)-1, max(x))))
[,1] [,2]
[1,] 2 0.41
[2,] 2 0.33
[3,] 3 0.27
[4,] 1 0.33
[5,] 4 0.58
Another alternative:
> t(apply(df, 1, function(x) as.numeric(c(names(which.max(x)), max(x)))))
[,1] [,2]
[1,] 2 0.41
[2,] 2 0.33
[3,] 3 0.27
[4,] 1 0.33
[5,] 4 0.58
As suggested by DWin, another alternative is:
t(apply(df, 1, function(x) as.numeric(c(names(x)[which.max(x)], max(x)))))
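A vectorized alternative with max.col(), which avoids the row-wise apply() entirely (a sketch on the question's data; ties are broken by the first maximum):

```r
# same example data as in the question (columns named "0".."4")
df <- setNames(data.frame(rbind(c(0.01, 0.12, 0.41, 0.11, 0.11),
                                c(0.13, 0.12, 0.33, 0.14, 0.07),
                                c(0.02, 0.20, 0.11, 0.27, 0.17),
                                c(0.11, 0.33, 0.04, 0.09, 0.24),
                                c(0.08, 0.07, 0.04, 0.05, 0.58))),
               as.character(0:4))

m   <- as.matrix(df)
idx <- max.col(m, ties.method = "first")  # column index of each row maximum
res <- data.frame(col = colnames(m)[idx],
                  max = m[cbind(seq_len(nrow(m)), idx)])  # matrix-index the maxima
res
```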