Splitting data based on row values and conditions - r

This is what my data looks like:
df <- structure(list(`1` = c(1 , 2 , 3 , 4 , 5 ,6 , 7 , 8 , 9, 10 ,11 ,12, 13, 14 ,15 ,16, 17, 18), `2` = structure(c(4L,5L, 2L, 5L, 2L, 3L, 1L, 6L,4L,5L, 2L, 5L, 2L, 3L, 1L, 6L,4L,5L), .Label=c("a","a","b","b","b","c","c","b","b","b","e","e","f","g","g","g","f","f"),
class="factor"),`3`=c(1,0,1,1,0,1,0,1,1,0,1,0,1,1,0,1,0,0),`4`=c(0,1,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,1), `5` =c(10,5,20,20,5,0,0,10,10,5,10,5,15,5,5,5,2,2)),
.Names = c("N", "Condition", "AOI_hit_b", "AOI_hit_f", "Time"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9","10","11","12","13","14","15","16","17","18"))
I now want to make calculations on Time, depending on whether Condition b is preceded by a or by c, and on whether AOI_hit_b is a 0 or a 1. In addition, only the first 1 in AOI_hit_b within a Condition is relevant for Time, since the same time is repeated for multiple 1's in a condition.
I tried it with ddply from the plyr package, but didn't get the output I wanted.
My output should look something like this:
sum of
Condition Time
b(a) 25
b(c) 10
This is the code I have so far:
hit_cb <- list()
for (i in 2:nrow(df)) {  # start at 2 so that df[i - 1, ] exists
  if (df[i, 2] == "b" && df[i - 1, 2] == "c") {  # parentheses fixed: the whole condition goes inside if ()
    hit_cb[[i]] <- ddply(df, .(AOI_hit_b), summarize, mysum = sum(unique(Time)))
  }
}
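For what it's worth, one way to get the desired sums in base R — a sketch, not a definitive answer, assuming Condition can be coerced to character and that "first hit" means the first row with AOI_hit_b == 1 inside each run of consecutive b's:

```r
# Label each run of consecutive identical Conditions and record
# the condition of the preceding run
r <- rle(as.character(df$Condition))
df$run  <- rep(seq_along(r$values), r$lengths)
df$prev <- rep(c(NA, head(r$values, -1)), r$lengths)

# For each "b" run preceded by "a" or "c", take Time at the first AOI_hit_b == 1
hits <- do.call(rbind, lapply(split(df, df$run), function(d) {
  if (d$Condition[1] != "b" || !d$prev[1] %in% c("a", "c")) return(NULL)
  i <- which(d$AOI_hit_b == 1)
  if (!length(i)) return(NULL)
  data.frame(Condition = paste0("b(", d$prev[1], ")"), Time = d$Time[i[1]])
}))

# Sum the first-hit times by preceding condition
aggregate(Time ~ Condition, hits, sum)
```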


Calculate median based on counts of value

I have a following data frame:
numbers Freq
1 4 2
2 5 1
3 23 2
4 34 2
I know how to calculate the median in R when the frequency is 1 (median(`numbers`, na.rm = TRUE)), but how do I do that when the frequency is different from 1?
We can create a logical index with !=, use that to subset the 'numbers' column, and get the median:
with(df1, median(numbers[Freq != 1], na.rm = TRUE))
#[1] 23
data
df1 <- structure(list(numbers = c(4L, 5L, 23L, 34L), Freq = c(2L, 1L,
2L, 2L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))
Here is another option using tapply along with subset
> with(subset(df,Freq != 1),tapply(numbers,Freq,median,na.rm = TRUE))
2
23
Data
> dput(df)
structure(list(numbers = c(4L, 5L, 23L, 34L), Freq = c(2L, 1L,
2L, 2L)), class = "data.frame", row.names = c("1", "2", "3",
"4"))

Rolling multiple regressions with panel data [duplicate]

This question already has an answer here:
Rolling multiple regression panel data
I am trying to do a rolling multiple regression on a dataset grouped by stock. A sample of the dataset can be seen below. It goes from 1991 to 2019 and contains information on stocks such as returns. What I intend to do is regress the dependent variable ExcessReturn on EPU_Paper for each stock using a 36-month rolling regression window. I also want the stocks to have at least 18 monthly return observations in each 36-month rolling window. ISIN is the identifier of each stock. I also want to include SIZE, INVEST, BM and OP as controls in the regression. The dataset does not contain any NAs except for the beta_monthly column, which is all NAs.
This is the code I have tried to run. I am able to get the for-loop working, but do not get the desired result. Optimally, I would want the coefficient of EPU_Paper to be appended to df_Final in the correct row.
I am open to both completely new solutions and variants of my current attempt.
library(dplyr)

# Create date sequence
date <- seq(as.Date("1991-01-01"), as.Date("2019-12-31"), by = "month")

## Model
v <- 36 # no. of observations in rolling regression
w <- 18 # observations of a stock required in the period
df_Final$beta_monthly <- NA
for (i in 1:(length(date) - v)) {
  beta.tab <- df_Final %>%
    filter(Date >= date[i] & Date < date[i + v]) %>%
    group_by(ISIN) %>%
    filter(n() >= w) %>%
    do(ols.model = lm(formula = ExcessReturn ~ EPU_Paper + SIZE, .)) %>%
    mutate(beta_monthly = coefficients(ols.model)[2],
           Date = date[v + 1])
}
This is the dataframe df_Final. It contains all the data I need.
structure(list(Year = c(2002, 2004, 2011, 2011, 2012, 1993, 2005,
2019, 2005, 1998), Month = c(5, 4, 12, 11, 4, 11, 7, 2, 12, 12
), ISIN = c("NO0003172207", "NO0003072803", "NO0010001118", "NO0010096985",
"NO0010052350", "NO0004031303", "NO0003733800", "NO0003049405",
"NO0003028904", "NO0004684408"), SIZE = c(1143750000, 894618192,
257727844.92, 293346266180.2, 104014912.25, 1312826651.5, 51164845865,
535492777.6, 1.2465e+10, 8815671800), BM = c(2.69336652499494e-06,
6.25913195949328e-07, 3.0680673824874e-07, 9.99841307356348e-07,
3.99901247813628e-06, 3.6136784151303e-06, 6.27009692475242e-07,
6.397720392755e-07, 1.985559566787e-07, 9.2518383241951e-07),
OP = c(-0.0259646808923766, 0.197313839816668, 0.136649432305334,
0.594948150836374, -0.0018535993529254, -0.0801364023870418,
0.130539826349566, 0.0244477246423, 0.620295983086681, 0.103857566765579
), INVEST = c(0.129154816408376, 0.0321275661230328, -0.092547902189399,
0.142434794968375, -0.121033439243494, -0.00124744840099796,
-0.240237999927217, 0.0376008757633188, 0.060294968189705,
0.112664489390554), MonthlyReturn = c(-0.039797852179406,
-0.066030013642565, 0.019230769230769, 0.049271412097704,
-0.12516823687752, -0.02219755826859, 0.057851239669421,
-0.043636363636364, 0.05232436939754, 0.32743529766845),
RiskFreeRate = c(0.00558, 0.00163, 0.00209, 0.00251, 0.00163,
0.00467, 0.00181, 0.00086, 0.00208, 0.00726), ShareTurnover = c(69750L,
5250L, 369135L, 183793926L, 54869L, 2879656L, 7957362L, 367551L,
2478662L, 2245928L), MarketExcessReturn = c(-2.7155, -3.0781,
1.0322, -0.3552, -0.9447, -4.9307, 6.0359, 3.8371, 6.932,
-0.7896), ExcessReturn = c(-4.5377852179406, -6.7660013642565,
1.7140769230769, 4.6761412097704, -12.679823687752, -2.686755826859,
5.6041239669421, -4.4496363636364, 5.024436939754, 32.017529766845
), TradeDate = structure(c(11838, 12538, 15338, 15308, 15460,
8734, 12993, 17955, 13147, 10590), class = "Date"), GR_SIZE = structure(c(3L,
2L, 1L, 3L, 1L, 2L, 3L, 1L, 3L, 3L), .Label = c("1", "2",
"3"), class = "factor"), GR_OP = structure(c(1L, 2L, 2L,
3L, 1L, 1L, 2L, 1L, 3L, 1L), .Label = c("1", "2", "3"), class = "factor"),
GR_BM = structure(c(3L, 2L, 1L, 3L, 3L, 3L, 2L, 2L, 1L, 3L
), .Label = c("1", "2", "3"), class = "factor"), GR_INVEST = structure(c(3L,
2L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1", "2",
"3"), class = "factor"), SIZE_BM = structure(c(9L, 5L, 1L,
9L, 3L, 6L, 8L, 2L, 7L, 9L), .Label = c("11", "12", "13",
"21", "22", "23", "31", "32", "33"), class = "factor"), SIZE_OP = structure(c(7L,
5L, 2L, 9L, 1L, 4L, 8L, 1L, 9L, 7L), .Label = c("11", "12",
"13", "21", "22", "23", "31", "32", "33"), class = "factor"),
SIZE_INVEST = structure(c(9L, 5L, 1L, 9L, 1L, 4L, 7L, 2L,
8L, 8L), .Label = c("11", "12", "13", "21", "22", "23", "31",
"32", "33"), class = "factor"), Date = structure(c(11808,
12509, 15309, 15279, 15431, 8705, 12965, 17928, 13118, 10561
), class = "Date"), EPU_Paper = c(53.995111032374, 68.0510031873012,
150.261825109363, 124.78265498286, 47.2994312059608, 164.273390295025,
27.168222382902, 181.297305839429, 29.292072793154, 139.423199892468
), beta_monthly = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), row.names = c(NA, -10L), class = "data.frame")
You could try a split-apply-combine approach.
.date <- sort(unique(df2$date))
v <- 24; w <- 3  # window length; minimum rows required per date
res <- do.call(rbind, lapply(1:v, function(i) {
  ds <- df2$date %in% .date[0:(v - 1) + i]           # dates in the i-th window
  S <- split(df2[ds, ], df2[ds, "date"])
  dat <- do.call(rbind, unname(S[sapply(S, nrow) >= w]))
  fit <- lm(y ~ x1 + x2, dat)
  rs <- dat$date %in% .date[i]
  beta_monthly <- unname(fit$coef[2])
  if (any(rs)) cbind(dat[rs, ], beta_monthly) else NULL
}))
head(res, 10)
# cl date y x1 x2 beta_monthly
# 6 A 2011-02-01 0.1947135 38.20878 62846231450 -0.01949786
# 7 B 2011-02-01 5.7908162 130.09371 73446134000 -0.01949786
# 8 C 2011-02-01 -1.6335241 63.67381 93917412861 -0.01949786
# 9 D 2011-02-01 -4.8414052 151.70718 76852791458 -0.01949786
# 10 E 2011-02-01 4.1640901 123.10342 16714132588 -0.01949786
# 11 A 2011-03-01 -2.0569659 104.46436 28101485893 -0.01935559
# 12 B 2011-03-01 9.2205063 24.58415 42584043997 -0.01935559
# 13 C 2011-03-01 -0.1572310 65.94721 83745620495 -0.01935559
# 14 D 2011-03-01 5.2782394 25.69336 15235322119 -0.01935559
# 15 E 2011-03-01 3.6096263 163.65887 66618792459 -0.01935559
Data:
set.seed(42)
df2 <- expand.grid(cl = LETTERS[1:5],
                   date = seq(as.Date("2011-01-01"), as.Date("2019-12-31"), by = "month"))
df2 <- df2[-sample(1:nrow(df2), nrow(df2) * .1), ]
n <- nrow(df2)
df2 <- transform(df2,
                 y = rnorm(n, 2, 5),
                 x1 = runif(n, 20, 180),
                 x2 = runif(n, 1e8, 1e11))
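The same windowing idea can be adapted to the poster's per-stock setup. A sketch, not a definitive implementation — it assumes df_Final has the Date, ISIN, ExcessReturn, EPU_Paper and control columns shown in the dput above, and fits one regression per stock per window rather than one pooled fit:

```r
dates <- sort(unique(df_Final$Date))
v <- 36; w <- 18  # window length; minimum observations per stock

betas <- do.call(rbind, lapply(seq_len(length(dates) - v + 1), function(i) {
  win <- df_Final[df_Final$Date %in% dates[i:(i + v - 1)], ]
  # keep only stocks with at least w observations in this window
  keep <- names(which(table(win$ISIN) >= w))
  do.call(rbind, lapply(keep, function(id) {
    d <- win[win$ISIN == id, ]
    fit <- lm(ExcessReturn ~ EPU_Paper + SIZE + INVEST + BM + OP, data = d)
    data.frame(ISIN = id, Date = dates[i + v - 1],
               beta_monthly = unname(coef(fit)["EPU_Paper"]))
  }))
}))
```

The resulting betas table could then be merged back into df_Final by ISIN and Date to fill the beta_monthly column in the correct rows.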

Using lapply to group list of data frames by column

I have a list that contains multiple data frames. I would like to group the data by Category (A) and sum the Frequencies (B) using lapply.
The data is df_list
df_list
$`df.1`
A B
1 Apples 2
2 Pears 5
3 Apples 6
4 Pears 1
5 Apples 3
$`df.2`
A B
1 Oranges 2
2 Pineapples 5
3 Oranges 6
4 Pineapples 1
5 Oranges 3
The desired outcome df_list_2 looks like this:
df_list_2
$`df.1`
A B
1 Apples 11
2 Pears 6
$`df.2`
A B
1 Oranges 11
2 Pineapples 6
I have tried the following code based on lapply:
df_list_2<-df_list[, lapply(B, sum), by = A]
However, I get an error saying that A was not found.
Either I am mistaken about how lapply works in this case or my understanding of how it should work is flawed.
Any help much appreciated.
You need to aggregate in lapply
lapply(df_list, function(x) aggregate(B~A, x, sum))
#[[1]]
# A B
#1 Apples 11
#2 Pears 6
#[[2]]
# A B
#1 Oranges 11
#2 Pineapples 6
Using map from purrr and dplyr it would be
library(dplyr)
purrr::map(df_list, ~.x %>% group_by(A) %>% summarise(sum = sum(B)))
data
df_list <- list(structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L),
.Label = c("Apples", "Pears"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)),
class = "data.frame", row.names = c("1", "2", "3", "4", "5")),
structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("Oranges",
"Pineapples"), class = "factor"), B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame",
row.names = c("1", "2", "3", "4", "5")))
I fear you might not have a clear idea of either lapply or the extract operator ([). Remember that lapply(list, function) applies the specified function to each element of the list you give it. Extract gives you the element you specify:
x <- c('a', 'b', 'c')
x[2]
## "b"
I would imagine that somewhere in your R workspace you have an object named B, which is why you didn't get an error along the lines of
## Error in lapply(B, sum) : object 'B' not found
Conversely, if you had (accidentally or intentionally) defined both A and B, you would see the error
## Error in df_list[, lapply(B, sum), by = A] : incorrect number of dimensions
because that's not at all how to use [; remember, you just pass indexes or booleans to [, along with the occasional optional argument, but by is not one of them.
So without further ado, here's how I would do this (in base R):
# make some data
a <- c(1, 2, 1, 2, 1)
b <- c(2, 5, 6, 1, 3)
df_list <- list(df.1 = data.frame(A = c('Apples', 'Pears')[a], B = b),
                df.2 = data.frame(A = c('Oranges', 'Pineapples')[a], B = b))
# simplify it
df_list_2 <- lapply(df_list, function(x) {
  aggregate(list(B = x$B), list(A = x$A), sum)
})
# the desired result
df_list_2
## $df.1
## A B
## 1 Apples 11
## 2 Pears 6
##
## $df.2
## A B
## 1 Oranges 11
## 2 Pineapples 6
You can take advantage of the fact that a data.frame is just a list and shorten up your code like this:
df_list_2 <- lapply(df_list, function(x) {
  aggregate(x['B'], x['A'], sum)
})
but the first way of writing it should help make clearer what we're doing.
The data.table syntax in the OP's post can be changed to
library(data.table)
lapply(df_list, function(x) as.data.table(x)[, .(B = sum(B)), by = A])
#$df.1
# A B
#1: Apples 11
#2: Pears 6
#$df.2
# A B
#1: Oranges 11
#2: Pineapples 6
data
df_list <- list(df.1 = structure(list(A = structure(c(1L, 2L, 1L, 2L, 1L
), .Label = c("Apples", "Pears"), class = "factor"), B = c(2L,
5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1", "2",
"3", "4", "5")), df.2 = structure(list(A = structure(c(1L, 2L,
1L, 2L, 1L), .Label = c("Oranges", "Pineapples"), class = "factor"),
B = c(2L, 5L, 6L, 1L, 3L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5")))

Remove column names in a DataFrame

In sparkR I have a DataFrame data.
When I type head(data) we get this output
C0 C1 C2 C3
1 id user_id foreign_model_id machine_id
2 1 3145 4 12
3 2 4079 1 8
4 3 1174 7 1
5 4 2386 9 9
6 5 5524 1 7
I want to remove C0, C1, C2, C3 because they give me problems later on. For example, when I use the filter function:
filter(data, data$machine_id == 1)
it can't run because of this.
I have read the data like this
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv")
SparkR made the header into the first row and gave the DataFrame a new header, because the default for the header option is "false". Set the option to header="true" and then you won't have to deal with this problem.
data <- read.df(sqlContext, "/home/ole/.../data", "com.databricks.spark.csv", header="true")
Try
colnames(data) <- unlist(data[1,])
data <- data[-1,]
> data
# id user_id foreign_model_id machine_id
#2 1 3145 4 12
#3 2 4079 1 8
#4 3 1174 7 1
#5 4 2386 9 9
#6 5 5524 1 7
If you wish, you can add rownames(data) <- NULL to correct for the row numbers after the deletion of the first row.
After this manipulation, you can select rows that correspond to certain criteria, like
subset(data, data$machine_id==1)
# id user_id foreign_model_id machine_id
#4 3 1174 7 1
In base R, the function filter() suggested in the OP is part of the stats namespace and is usually reserved for the analysis of time series.
data
data <- structure(list(C0 = structure(c(6L, 1L, 2L, 3L, 4L, 5L),
.Label = c("1", "2", "3", "4", "5", "id"), class = "factor"),
C1 = structure(c(6L, 3L, 4L, 1L, 2L, 5L), .Label = c("1174", "2386",
"3145", "4079", "5524", "user_id"), class = "factor"),
C2 = structure(c(5L, 2L, 1L, 3L, 4L, 1L),
.Label = c("1", "4", "7", "9", "foreign_model_id"), class = "factor"),
C3 = structure(c(6L, 2L, 4L, 1L, 5L, 3L),
.Label = c("1", "12", "7", "8", "9", "machine_id"), class = "factor")),
.Names = c("C0", "C1", "C2", "C3"), class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6"))
Try this:
names <- c()
for (i in seq(along = names(data))) {
  names <- c(names, toString(data[1, i]))
}
names(data) <- names
data <- data[-1, ]
I simply can't use these answers because in SparkR they can't run: object of type 'S4' is not subsettable. I solved the problem this way; however, I think there is a better way to solve it.
data <- withColumnRenamed(data, "C0","id")
data <- withColumnRenamed(data, "C1","user_id")
data <- withColumnRenamed(data, "C2","foreign_model_id")
data <- withColumnRenamed(data, "C3","machine_id")
And now I can successfully use the filter function as I want to.
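As a side note — assuming your SparkR version implements the colnames&lt;-/names&lt;- replacement methods for DataFrames (an assumption, not something confirmed in the post), the four withColumnRenamed() calls could be collapsed into a single assignment:

```r
# Hypothetical condensed version of the four withColumnRenamed() calls;
# assumes SparkR's `colnames<-` method is available in your version
colnames(data) <- c("id", "user_id", "foreign_model_id", "machine_id")
```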

Make Dataframe loop code run faster

DATA AND REQUIREMENTS
The first table (myMatrix1) is from an old geological survey that used different region boundaries (the begin and finish columns) from those of the newer survey.
What I wish to do is match the begin and finish boundaries and then create two tables: one for the new data on sedimentation and one for the new data on bore width, characterised as a boolean.
myMatrix1 <- read.table("/path/to/file")
myMatrix2 <- read.table("/path/to/file")
> head(myMatrix1) # this is the old data
sampleIDs begin finish
1 19990224 4 5
2 20000224 5 6
3 20010203 6 8
4 20019024 29 30
5 20020201 51 52
> head(myMatrix2) # this is the new data
begin finish sedimentation boreWidth
1 0 10 1.002455 0.014354
2 11 367 2.094351 0.056431
3 368 920 0.450275 0.154105
4 921 1414 2.250820 1.004353
5 1415 5278 0.114109 NA
Desired output:
> head(myMatrix6)
sampleIDs begin finish sedimentation #myMatrix4
1 19990224 4 5 1.002455
2 20000224 5 6 1.002455
3 20010203 6 8 2.094351
4 20019024 29 30 2.094351
5 20020201 51 52 2.094351
> head(myMatrix7)
sampleIDs begin finish boreWidthThresh #myMatrix5
1 19990224 4 5 FALSE
2 20000224 5 6 FALSE
3 20010203 6 8 FALSE
4 20019024 29 30 FALSE
5 20020201 51 52 FALSE
CODE
The following code has taken me several hours to run on my dataset (about 5 million data points). Is there any way to change the code to make it run any faster?
# create empty data frame for sedimentation
myMatrix6 <- data.frame(NA, NA, NA, NA)[0, ]
names(myMatrix6) <- letters[1:4]
# create empty data frame for bore
myMatrix7 <- data.frame(NA, NA, NA, NA)[0, ]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2)) {
  # rows of myMatrix1 whose begin falls strictly between
  # myMatrix2$begin[i] and myMatrix2$finish[i]
  myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])), ]
  myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3)) # note the [i], missing in the original
  if (is.na(myMatrix2$boreWidth[i])) {
    myMatrix5 <- rep(NA, nrow(myMatrix3))
  } else if (myMatrix2$boreWidth[i] == 0) {
    myMatrix5 <- rep(TRUE, nrow(myMatrix3))
  } else if (myMatrix2$boreWidth[i] > 0) {
    myMatrix5 <- rep(FALSE, nrow(myMatrix3))
  }
  myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
  myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}
EDIT:
> dput(head(myMatrix2))
structure(list(V1 = structure(c(6L, 1L, 2L, 4L, 5L, 3L), .Label = c("0",
"11", "1415", "368", "921", "begin"), class = "factor"), V2 = structure(c(6L,
1L, 3L, 5L, 2L, 4L), .Label = c("10", "1414", "367", "5278",
"920", "finish"), class = "factor"), V3 = structure(c(6L, 3L,
4L, 2L, 5L, 1L), .Label = c("0.114109", "0.450275", "1.002455",
"2.094351", "2.250820", "sedimentation"), class = "factor"),
V4 = structure(c(5L, 1L, 2L, 3L, 4L, 6L), .Label = c("0.014354",
"0.056431", "0.154105", "1.004353", "boreWidth", "NA"), class = "factor")), .Names = c("V1",
"V2", "V3", "V4"), row.names = c(NA, 6L), class = "data.frame")
> dput(head(myMatrix1))
structure(list(V1 = structure(c(6L, 1L, 2L, 3L, 4L, 5L), .Label = c("19990224",
"20000224", "20010203", "20019024", "20020201", "sampleIDs"), class = "factor"),
V2 = structure(c(6L, 2L, 3L, 5L, 1L, 4L), .Label = c("29",
"4", "5", "51", "6", "begin"), class = "factor"), V3 = structure(c(6L,
2L, 4L, 5L, 1L, 3L), .Label = c("30", "5", "52", "6", "8",
"finish"), class = "factor")), .Names = c("V1", "V2", "V3"
), row.names = c(NA, 6L), class = "data.frame")
First look at these general suggestions on speeding up code: https://stackoverflow.com/a/8474941/636656
The first thing that jumps out at me is that I'd create only one results matrix. That way you're not duplicating the sampleIDs begin finish columns, and you can avoid any overhead that comes with running the matching algorithm twice.
Doing that, you can avoid selecting more than once (although it's trivial in terms of speed as long as you store your selection vector rather than re-calculate).
Here's a solution using apply:
myMatrix1 <- data.frame(sampleIDs = c(19990224, 20000224), begin = c(4, 5), finish = c(5, 6))
myMatrix2 <- data.frame(begin = c(0, 11), finish = c(10, 367), sed = c(1.002, 2.01), boreWidth = c(.014, .056))
glommer <- function(x, myMatrix2) {
  x[4:5] <- as.numeric(myMatrix2[myMatrix2$begin <= x["begin"] & myMatrix2$finish >= x["finish"], c("sed", "boreWidth")])
  names(x)[4:5] <- c("sed", "boreWidth")
  return(x)
}
> t(apply( myMatrix1, 1, glommer, myMatrix2=myMatrix2))
sampleIDs begin finish sed boreWidth
[1,] 19990224 4 5 1.002 0.014
[2,] 20000224 5 6 1.002 0.014
I used apply and stored everything as numeric. Other approaches would be to return a data.frame and have sampleIDs, begin and finish be integers. That might avoid some problems with floating-point error.
This solution assumes there are no boundary cases (i.e. the begin, finish times of each row of myMatrix1 are entirely contained within the begin, finish times of one row of the other). If your data is more complicated, just change the glommer() function. How you want to handle that is a substantive question.
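For a further speed-up, the lookup can be fully vectorized. A minimal sketch, assuming myMatrix2's intervals are sorted and non-overlapping (as in the sample) and that both tables have been read with numeric columns — findInterval() maps every old-survey row to its new-survey interval in one step, avoiding both the loop and the repeated rbind():

```r
# Map each old-survey begin value to the row of myMatrix2 whose
# [begin, finish] interval contains it
# (assumes myMatrix2$begin is sorted and the intervals don't overlap)
idx <- findInterval(myMatrix1$begin, myMatrix2$begin)

# Build both result tables in one shot, no rbind inside a loop
myMatrix6 <- cbind(myMatrix1, sedimentation  = myMatrix2$sedimentation[idx])
myMatrix7 <- cbind(myMatrix1, boreWidthThresh = myMatrix2$boreWidth[idx] == 0)
```

The comparison boreWidth == 0 reproduces the loop's logic: TRUE when the width is exactly zero, FALSE when positive, and NA where boreWidth is NA.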
