Pull subset of rows of dataframe based on conditions from other columns - r

I have a dataframe like the one below:
x <- data.table(Tickers = c("A","A","A","B","B","B","B","D","D","D","D"),
                Type = c("put","call","put","call","call","put","call","put","call","put","call"),
                Strike = c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other = sample(20,11))
Tickers Type Strike Other
1: A put 35.0 6
2: A call 37.5 5
3: A put 37.5 13
4: B call 10.0 15
5: B call 11.0 12
6: B put 11.0 4
7: B call 12.0 20
8: D put 40.0 7
9: D call 40.0 11
10: D put 42.0 10
11: D call 42.0 1
I am trying to analyze a subset of the data. The subset I would like to take is rows where the ticker and strike are the same. But I also only want to grab these rows if both a put and a call exist under Type. With the data above, for example, I would like to return the following result:
x[c(2,3,5,6,8:11),]
Tickers Type Strike Other
1: A call 37.5 5
2: A put 37.5 13
3: B call 11.0 12
4: B put 11.0 4
5: D put 40.0 7
6: D call 40.0 11
7: D put 42.0 10
8: D call 42.0 1
I'm not sure of the best way to go about this. My thought process is that I should create another id column, like
x$id <- paste(x$Tickers,x$Strike,sep="_")
Then use this vector to only pull values where there are multiple ids.
x[x$id %in% x$id[duplicated(x$id)],]
Tickers Type Strike Other id
1: A call 37.5 5 A_37.5
2: A put 37.5 13 A_37.5
3: B call 11.0 12 B_11
4: B put 11.0 4 B_11
5: D put 40.0 7 D_40
6: D call 40.0 11 D_40
7: D put 42.0 10 D_42
8: D call 42.0 1 D_42
I'm not sure how efficient this is, as my actual data consists of a lot more rows.
Also, this solution does not check the Type condition, i.e. that both a put and a call exist.
Also, the wording of the title could be a lot better; I apologize.
EDIT: Having checked out the post Finding ALL duplicate rows, including "elements with smaller subscripts",
I could also use this solution:
x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id,fromLast=T),]

You could try something like:
x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))), by = .(Tickers, Strike)][which(select)]
# Tickers Type Strike Other select
#1: A call 37.5 17 TRUE
#2: A put 37.5 16 TRUE
#3: B call 11.0 11 TRUE
#4: B put 11.0 20 TRUE
#5: D put 40.0 1 TRUE
#6: D call 40.0 12 TRUE
#7: D put 42.0 6 TRUE
#8: D call 42.0 2 TRUE
Another idea might be a merge:
x[x, on = .(Tickers, Strike), select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),by = .EACHI][which(select)]
I'm not entirely sure how to get around the group-by operations, since you want to make sure each group has both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.

An edit to your data to give a case where both put and call do not exist (I changed the very last "call" to "put"):
x <- data.table(Tickers = c("A","A","A","B","B","B","B","D","D","D","D"),
                Type = c("put","call","put","call","call","put","call","put","call","put","put"),
                Strike = c(35,37.5,37.5,10,11,11,12,40,40,42,42),
                Other = sample(20,11))
Since you are using data.table, you can use the built-in counter .N along with by variables to count groups and subset on that. If counting the distinct values of Type reliably determines that both put and call are present, this could work:
x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]
The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.
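If you don't want the helper columns n and types to linger afterwards, a small follow-up sketch (res is just an illustrative name) is to capture the subset and then drop the helpers by reference:
res <- x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]
# the subset is a copy, so drop the helper columns from both tables by reference
x[, c('n', 'types') := NULL]
res[, c('n', 'types') := NULL]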

I am not a user of package data.table so this code is base R only.
agg <- aggregate(Type ~ Tickers + Strike, data = x, length)
result <- merge(x, subset(agg, Type > 1)[1:2], by = c("Tickers", "Strike"))[, c(1, 3, 2, 4)]
result
# Tickers Type Strike Other
#1: A call 37.5 17
#2: A put 37.5 7
#3: B call 11.0 14
#4: B put 11.0 20
#5: D put 40.0 15
#6: D call 40.0 2
#7: D put 42.0 8
#8: D call 42.0 1
rm(agg) # final clean up
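For comparison, a dplyr sketch of the same group-and-filter logic (assuming x is the data from the question):
library(dplyr)
x %>%
  group_by(Tickers, Strike) %>%
  filter(n() >= 2, all(c("put", "call") %in% Type)) %>%
  ungroup()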

Related

How to lookup by row and column > column names?

I am trying to work out how to look up time data by university name (first row: A, ..., F), field name (first column: Acute, ..., En) and/or graduation time (time) in the following file DS.csv.
I am considering a dplyr approach but could not extend the numerical-ID lookup (from the answer to How to overload function parameters in R?) to a lookup by three variables.
Challenges
How to look up by the first row? Maybe something similar to $1 == "A".
How to expand the university lookup to two columns? In the pseudocode, $1 == "A" refers to the second and third columns, ..., and $1 == "F" to the last two columns.
How to look up by three criteria: the first row (no header), the first column with header Field, and the header time. Pseudocode:
times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
The file DS.csv has the data; the first column denotes the experiment. The data below is in crosstab format:
,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3
and in straight table format:
Field,time,T,Experiment
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F
Pseudocode
library('dplyr')
ow <- options("warn")
DF <- read.csv("/home/masi/CSV/DS.csv", header = T)
# Lookup by first row, Lookup by Field, lookup by Field's first column?
times <- getTimes($1 == "A", Field == "Ane", by = "desc(time)")
Expected output: 9
Expected output generalised: a, b, c, ...
## Data where values marked by small letters a,b,c, ... are wanted
# uni1 uni2 ...
# time T time T ...
#Field1 a c
#Field2 b ...
#... ...
R: 3.3.3 (2017-03-06)
OS: Debian 8.7
Hardware: Asus Zenbook UX303UA
Taking your initial raw data as the starting point:
# read the data & skip 1st & 2nd line which contain only header information
DF <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, skip=2)
# read the first two lines which contain the header information
headers <- read.csv(text=",A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3", header=FALSE, stringsAsFactors=FALSE, nrow=2)
# extract the university names from the 'headers' data.frame
universities <- unlist(headers[1,])
universities <- universities[universities != '']
# create column names from the 'headers' data.frame
vec <- headers[2,][headers[2,] == 'T']
headers[2,][headers[2,] == 'T'] <- paste0(vec, seq_along(vec))
names(DF) <- paste0(headers[2,],headers[1,])
Your data frame now looks as follows:
> DF
Field timeA T1 timeB T2 timeC T3 timeD T4 timeE T5 timeF T6
1: Acute 0.0 0 8.3 1 7.5 1 8.6 2 0.0 0 8.3 4
2: Ane 9.0 120 7.7 26 7.9 43 7.8 77 7.9 60 8.2 326
3: En 15.6 2 12.9 1 0.0 0 0.0 0 14.3 1 14.6 4
4: Fo 9.2 2 0.0 0 5.4 1 0.0 0 0.0 0 7.9 3
It is better to transform your data into long format:
library(data.table)
DT <- melt(setDT(DF), id = 1,
measure.vars = patterns('^time','^T'),
variable.name = 'university',
value.name = c('time','t')
)[, university := universities[university]][]
Now your data looks like:
> DT
Field university time t
1: Acute A 0.0 0
2: Ane A 9.0 120
3: En A 15.6 2
4: Fo A 9.2 2
5: Acute B 8.3 1
6: Ane B 7.7 26
7: En B 12.9 1
8: Fo B 0.0 0
9: Acute C 7.5 1
10: Ane C 7.9 43
11: En C 0.0 0
12: Fo C 5.4 1
13: Acute D 8.6 2
14: Ane D 7.8 77
15: En D 0.0 0
16: Fo D 0.0 0
17: Acute E 0.0 0
18: Ane E 7.9 60
19: En E 14.3 1
20: Fo E 0.0 0
21: Acute F 8.3 4
22: Ane F 8.2 326
23: En F 14.6 4
24: Fo F 7.9 3
Now you can select the required info:
DT[university == 'A' & Field == 'Ane']
which gives:
Field university time t
1: Ane A 9 120
Several dplyr examples to filter the data:
library(dplyr)
DT %>%
filter(Field=="En" & t > 1)
gives:
Field university time t
1 En A 15.6 2
2 En F 14.6 4
Or:
DT %>%
arrange(desc(time)) %>%
filter(time < 14 & t > 3)
gives:
Field university time t
1 Ane A 9.0 120
2 Acute F 8.3 4
3 Ane F 8.2 326
4 Ane C 7.9 43
5 Ane E 7.9 60
6 Ane D 7.8 77
7 Ane B 7.7 26
Change your crosstab
,A,,B,,C,,D,,E,,F,
Field,time,T,time,T,time,T,time,T,time,T,time,T
Acute,0,0,8.3,1,7.5,1,8.6,2,0,0,8.3,4
Ane,9,120,7.7,26,7.9,43,7.8,77,7.9,60,8.2,326
En,15.6,2,12.9,1,0,0,0,0,14.3,1,14.6,4
Fo,9.2,2,0,0,5.4,1,0,0,0,0,7.9,3
into a straight data format
Field,time,T,Experiment
Acut,0,0,A
An,9,120,A
En,15.6,2,A
Fo,9.2,2,A
Acute,8.3,1,B
An,7.7,26,B
En,12.9,1,B
Fo,0,0,B
Acute,7.5,1,C
An,7.9,43,C
En,0,0,C
Fo,5.4,1,C
Acute,8.6,2,D
An,7.8,77,D
En,0,0,D
Fo,0,0,D
Acute,0,0,E
An,7.9,60,E
En,14.3,1,E
Fo,0,0,E
Acute,8.3,4,F
An,8.2,326,F
En,14.6,4,F
Fo,7.9,3,F
where I used the Vim csv plugin and visual-block mode.
Multiple ways to do the selection
Once the data is tidied into a straight table (not a crosstab), this is easy to do in multiple ways; I would prefer SQL. I demonstrate the sqldf package below; it is inefficient with large data, but this data is small, so it will work.
Also, instead of the relatively slow built-in functions such as read.csv, I would prefer the very efficient fread from the data.table package for reading files.
sqldf
> library(data.table)
> library(sqldf)  # sqldf() comes from the sqldf package
> a <- fread("~/DS_straight_table.csv")
> sqldf("select time from a where Experiment='A' and Field='An'")
time
1 9
Alternative without sqldf
> library(data.table)
> a <- fread("~/DS_straight_table.csv")
> a[Experiment=='A' & Field=='An']
Field time T Experiment
1: An 9 120 A
Using the "Tall" (straight table) format and library dplyr. Your data only has one value per Field, Experiment.
library(dplyr)
## this is the more general result
df %>%
  group_by(Field, Experiment) %>%
  top_n(1, wt = -time)
## example function
getTimes <- function(data, field, experiment) {
  data %>%
    filter(Field == field, Experiment == experiment) %>%
    top_n(1, wt = -time)
}
getTimes(df, 'An', 'A')
# Field time T Experiment
# 1 An 9 120 A

Functions without arguments

I'm not sure about this. Here is an example of a function which does not work:
myfunction <- function() {
  library(readxl)  # read_excel() comes from the readxl package
  mydata = read_excel("Square_data.xlsx", sheet = "Data", skip = 0)
  mydata$Dates = as.Date(mydata$Dates, format = "%Y-%m-%d")
  mydata.ts = ts(mydata, start = 2006, frequency = 1)
}
The files do not load. When I execute each command line by line in R the files are loaded, so there's no problem with the commands. My question is, can I run a function such as myfunction to load the files? Thanks.
Last statement in function is an assignment
If the last executed statement in a function is an assignment, then it will not display on the console unless you use print; but if the function result is assigned, then you can print the assigned value later. For example, using the built-in BOD data frame:
> f <- function() bod <- BOD
> f() # no result printed on console because f() was not explicitly printed
> print(f()) # explicitly print
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
> X <- f() # assign and then print the assigned value
> X
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
Last statement in function is an expression producing a result
If the last statement produces a value rather than being an assignment, then the result is printed on the console. For example:
> g <- function() BOD
> g()
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
Thus make sure that the last statement in your function is not an assignment if you want it to display on the console automatically.
Note 1: sourcing code
Also, note that if your code is sourced using a source() statement, or if the code is called by another function, then it also won't print automatically on the console unless you use print.
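A minimal sketch of that sourcing behaviour (assuming a file script.R containing these lines):
# contents of script.R
f <- function() BOD
f()         # auto-printed at the interactive top level,
            # but NOT printed when run via source("script.R")
print(f())  # printed in both cases
Passing print.eval = TRUE to source() would also echo the unprinted value.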
Note 2: two results
Regarding some comments to the question, if you want to output two results then output them in a named list. For example, this outputs a list with components named BOD and BOD2:
h <- function() list(BOD = BOD, BOD2 = 2*BOD)
h()
$BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
$BOD2
Time demand
1 2 16.6
2 4 20.6
3 6 38.0
4 8 32.0
5 10 31.2
6 14 39.6
We could refer to them like this:
> H <- h()
> H$BOD
Time demand
1 1 8.3
2 2 10.3
3 3 19.0
4 4 16.0
5 5 15.6
6 7 19.8
> H$BOD2
Time demand
1 2 16.6
2 4 20.6
3 6 38.0
4 8 32.0
5 10 31.2
6 14 39.6
Note 3: the <<- operator
Regarding the comments to the question: in general, using the <<- operator should be avoided because it undesirably links the internals of your function to the global workspace in an invisible and therefore error-prone way. If you want to return a value, it is normally best to return it as the output of the function. There are some situations where <<- is warranted, but they are relatively uncommon.
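A small sketch of the contrast (the names bad and good are purely illustrative):
# discouraged: writes `result` into the calling environment as a side effect
bad <- function() result <<- BOD
# preferred: return the value and let the caller decide where it goes
good <- function() BOD
result <- good()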
Sure. Just give it a value to be returned:
myfunction <- function() {
  mydata = read_excel("Square_data.xlsx", sheet = "Data", skip = 0)
  mydata$Dates = as.Date(mydata$Dates, format = "%Y-%m-%d")
  ts(mydata, start = 2006, frequency = 1)  # the last expression is returned by an R function
}
so calling dat <- myfunction() will make dat the ts-object that was created inside the function.
P.S.: There is also a return() function in R. As a best practice, only use it if you want to return an object early, e.g. in combination with if.
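A brief sketch of such an early return (the file-existence guard is just an illustrative assumption):
library(readxl)
myfunction2 <- function(path) {
  if (!file.exists(path)) return(NULL)  # early exit when the file is missing
  mydata <- read_excel(path, sheet = "Data", skip = 0)
  ts(mydata, start = 2006, frequency = 1)
}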

R Compute Statistics on Lagged Partitions

I have a data.frame with one column containing categorical data, one column containing dates, and one column containing numeric values. For simplicity, see the sample below:
A B C
1 L 2015-12-01 5.7
2 M 2015-11-30 2.1
3 K 2015-11-01 3.2
4 L 2015-10-05 5.7
5 M 2015-12-05 1.2
6 L 2015-11-15 2.3
7 L 2015-12-03 4.4
I would like to, for each category in A, compute a lagging average (e.g. average of the previous 30 days' values in column C).
I cannot for the life of me figure this one out. I have tried using sapply and a custom function that subsets the data.frame on category and date (or a deep copy of it) and returns the statistic (think mean or sd) and that works fine for single values, but it returns all NA's from inside sapply.
Any help you can give is appreciated.
This could be done more compactly, but here I have drawn it out to make it as easy as possible to understand. The core is the split, the lapply/apply, and then putting it back together. It uses a date window rather than a solution based on sorting, so it is very general. I also put the object back in its original order to enable direct comparison.
# set up the data
set.seed(100)
# create a data.frame with about a two-month period for each category of A
df <- data.frame(A = rep(c("K", "L", "M"), each = 60),
                 B = rep(seq(as.Date("2015-01-01"), as.Date("2015-03-01"), by = "days"), 3),
                 C = round(runif(180) * 6, 1))
head(df)
## A B C
## 1 K 2015-01-01 1.8
## 2 K 2015-01-02 1.5
## 3 K 2015-01-03 3.3
## 4 K 2015-01-04 0.3
## 5 K 2015-01-05 2.8
## 6 K 2015-01-06 2.9
tail(df)
## A B C
## 175 M 2015-02-24 4.8
## 176 M 2015-02-25 2.0
## 177 M 2015-02-26 5.7
## 178 M 2015-02-27 3.9
## 179 M 2015-02-28 2.8
## 180 M 2015-03-01 3.6
# preserve original order
df$originalOrder <- 1:nrow(df)
# randomly shuffle the order
randomizedOrder <- order(runif(nrow(df)))
df <- df[randomizedOrder, ]
# split on A - your own data might need coercion of A to a factor
df.split <- split(df, df$A)
# set the window size
window <- 30
# compute the moving average
listD <- lapply(df.split, function(tmp) {
  apply(tmp, 1, function(x) mean(tmp$C[tmp$B <= as.Date(x["B"]) & tmp$B > (as.Date(x["B"]) - window)]))
})
# combine the result with the original data
result <- cbind(do.call(rbind, df.split), rollingMean = unlist(listD))
# and tidy up:
# return to original order
result <- result[order(result$originalOrder), ]
result$originalOrder <- NULL
# remove the row names
row.names(result) <- NULL
result[c(1:5, 59:65), ]
## A B C rollingMean
## 1 K 2015-01-01 1.8 1.800000
## 2 K 2015-01-02 1.5 1.650000
## 3 K 2015-01-03 3.3 2.200000
## 4 K 2015-01-04 0.3 1.725000
## 5 K 2015-01-05 2.8 1.940000
## 59 K 2015-02-28 3.6 3.080000
## 60 K 2015-03-01 1.3 3.066667
## 61 L 2015-01-01 2.8 2.800000
## 62 L 2015-01-02 3.9 3.350000
## 63 L 2015-01-03 5.8 4.166667
## 64 L 2015-01-04 4.1 4.150000
## 65 L 2015-01-05 2.7 3.860000
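For larger data, a possibly faster sketch uses a data.table non-equi join over the same date window (assuming data.table >= 1.9.8; dt, win and res are illustrative names, and row order does not matter because the join aligns one result to each row):
library(data.table)
dt <- as.data.table(df)  # same A/B/C layout as df above
win <- 30
dt[, start := B - win]
# for each row, average C over rows of the same category A whose date
# falls in the window (B - win, B]
res <- dt[dt, on = .(A, B <= B, B > start),
          .(rollingMean = mean(C)), by = .EACHI]
dt[, rollingMean := res$rollingMean]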

R Programming Calculate Rows Average

How can I use R to calculate row means?
Sample data:
f <- data.frame(
  name = c("apple", "orange", "banana"),
  day1sales = c(2, 5, 4),
  day1sales = c(2, 8, 6),
  day1sales = c(2, 15, 24),
  day1sales = c(22, 51, 13),
  day1sales = c(5, 8, 7)
)
Expected results:
The table will gain more columns over time; the sample above only goes up to day1sales.4, but after running more data it will extend to day1sales.6 and so on. So how can I compute the average across each row, however many columns there are?
With rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of these means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. Also, the function apply will apply a function along the rows or columns of a data frame. In this case you want to apply the mean function to the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
(changed 6 to length(f) since you say you may add more columns).
This will add an AverageSales column to the data frame f with the values that you want:
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
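If some later columns might not be numeric, a defensive sketch is to pick the numeric columns programmatically (this assumes f still holds only name plus the sales columns):
num <- sapply(f, is.numeric)        # which columns are numeric
f$AverageSales <- rowMeans(f[num])  # average across just those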

element-by-element multiplication within a data table - Multiple variables at the same time (R)

Let's say I have the following data table in R:
L3 <- LETTERS[1:3]
(d <- data.table(cbind(x = 1, y = 1:10), fac = sample(L3, 10, replace = TRUE)))
vecfx=c(5.3,2.8)
and I would like to compute two new variables, dot1 and dot2 that are:
d[,dot1:=5.3*x]
d[,dot2:=2.8*y]
But I don't want to compute them this way, as this is a simplification of my problem. In my original problem, vecfx consists of 12 elements and my data table has twelve columns, so I want to avoid writing that out twelve times.
I tried this: vecfx*d[,list(x,y)] but I'm not getting the desired result (it seems like the product is done by rows instead of by columns). Also, I want to create those two new variables within my data table d.
This is also useful when one wants to create several columns at the same time within a data table in R.
Any help is appreciated.
Update: In v1.8.11, FR #2077 is now implemented: set() can now add columns by reference. From NEWS:
set() is able to add new columns by reference now. For example, set(DT, i=3:5, j="bla", 5L) is equivalent to DT[3:5, bla := 5L]. This was FR #2077. Tests added.
with which one would then be able to do (as @MatthewDowle suggests in the comments):
for (j in seq_along(vecfx))
  set(d, i = NULL, j = paste0("dot", j), vecfx[j] * d[[j]])
I think you're looking for ?set. Note that set() also adds by reference and is very fast! Pasting the relevant section from ?set:
Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more flexible than set() because := is intended to be combined with i and by in single queries on large datasets.
for (j in seq_along(vecfx))
  set(d, i = NULL, j = j, vecfx[j] * d[[j]])
x y fac
1: 5.3 2.8 B
2: 5.3 5.6 C
3: 5.3 8.4 C
4: 5.3 11.2 C
5: 5.3 14.0 B
6: 5.3 16.8 B
7: 5.3 19.6 C
8: 5.3 22.4 C
9: 5.3 25.2 C
10: 5.3 28.0 C
It's just a matter of providing the right indices to set.
Arun's answer is good.
The LHS and RHS of := accept multiple items, so another way is:
d[, paste0("dot", 1:2) := mapply("*", vecfx, list(x, y), SIMPLIFY = FALSE)]
d
x y fac dot1 dot2
1: 1 1 C 5.3 2.8
2: 1 2 B 5.3 5.6
3: 1 3 C 5.3 8.4
4: 1 4 C 5.3 11.2
5: 1 5 B 5.3 14.0
6: 1 6 A 5.3 16.8
7: 1 7 A 5.3 19.6
8: 1 8 B 5.3 22.4
9: 1 9 A 5.3 25.2
10: 1 10 A 5.3 28.0
Maybe there's a better way than that. I think Arun's for loop should be faster, though, and maybe easier to read.
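Another sketch, avoiding the explicit loop, is base R's sweep() applied to the selected columns (column names as in the example above):
# multiply column j of the selected columns by vecfx[j]
d[, paste0("dot", 1:2) := as.data.table(sweep(as.matrix(.SD), 2, vecfx, "*")),
  .SDcols = c("x", "y")]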
