Sliding window over data.frame with nested hierarchy - r

Description of the data
My data.frame represents the salary of people living in different cities (city) in different countries (country). The city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.
Goal
For each country and each city, I want to consider a window of salaries of size WindowSize and calculate D = sum(arg1)/sum(arg2) over this window. Then the window slides by WindowStep, D is recalculated, and so on. For example, with WindowSize = 1000 and WindowStep = 10, within each country and each city I would like to get D for the salary ranges 0-1000, 10-1010, 20-1020, and so on.
At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
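For instance, the first few window boundaries under these settings would be (a quick sketch):
# Sketch: the first five window boundaries for WindowSize = 1000, WindowStep = 10
WindowSize = 1000
WindowStep = 10
starts = WindowStep * (0:4)    # 0 10 20 30 40
ends   = starts + WindowSize   # 1000 1010 1020 1030 1040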
Note on performance
I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.
Example data
set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
country city salary arg1 arg2
1 1 1 22791 -1.4606212 1.07084528
2 1 1 34598 0.9244679 1.19519158
3 1 1 76411 0.8288587 0.86737330
4 1 1 76790 1.3013056 0.07380115
5 1 1 87297 -1.4021137 1.62395596
6 1 2 12581 1.3062181 -1.03360620
With this example, if WindowSize = 70000 and WindowStep = 30000, the first values of D are -0.236604 and 0.439462, which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.

Unless I've misunderstood something, the following might be helpful.
Define a simple function that ignores the hierarchical groupings:
ff = function(salary, wSz, wSt, arg1, arg2)
{
    # window start points: multiples of the step, up to the maximum salary
    froms = wSt * (0:ceiling(max(salary) / wSt))
    tos = froms + wSz
    # D = sum(arg1) / sum(arg2) over each window
    Ds = mapply(function(from, to, salaries, args1, args2) {
        inds = salaries > from & salaries < to
        sum(args1[inds]) / sum(args2[inds])
    },
    from = froms, to = tos,
    MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
    list(from = froms, to = tos, D = Ds)
}
Compute on the groups with, for example, data.table:
library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
# country city from to D
# 1: 1 1 0 70000 -0.2366040
# 2: 1 1 30000 100000 0.4394620
# 3: 1 1 60000 130000 0.2838260
# 4: 1 1 90000 160000 NaN
# 5: 1 2 0 70000 1.8112196
# 6: 1 2 30000 100000 0.6134090
# 7: 1 2 60000 130000 0.5959344
# 8: 1 2 90000 160000 NaN
# 9: 1 3 0 70000 1.3216255
#10: 1 3 30000 100000 1.8812397
I.e. a faster equivalent of
lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
       function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))
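Since the question mentions running this roughly 10000 times, the two approaches can also be timed directly. A sketch using the microbenchmark package (actual timings will depend on the real data):
library(microbenchmark)
microbenchmark(
    data.table = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")],
    base = lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
                  function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2))),
    times = 10
)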

Without your expected outcome it is a bit hard to tell whether my result is correct, but it should give you a head start on the first step. From a performance point of view, the data.table package is very fast, much faster than explicit loops.
Using the example data from the question (the set.seed(84) block above):
# For data.table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)
WindowSize <- 10
WindowStep <- 3
dt[, .(D = rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
         rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
   by = .(country, city)]
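One thing to be aware of (an editorial note): rollapply slides over consecutive rows here, not over salary values, so this differs from the salary-range windows described in the question. A tiny illustration of zoo's by argument:
library(zoo)
rollapply(1:5, width = 3, FUN = sum, by = 1)  # 6 9 12: sums of (1,2,3), (2,3,4), (3,4,5)
rollapply(1:5, width = 3, FUN = sum, by = 2)  # 6 12: the window advances two positions at a time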
You can achieve the latter part of your goal by melting the data, writing a custom summary function, and using it to dcast your data back together.

Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0, WindowSize)
WindowStep = 100
for (x in unique(dt$country)) {
    # subset of data for that country
    CountrySubset = dt[dt$country == x, , drop = F]
    for (y in unique(CountrySubset$city)) {
        # subset of data for cities within country
        CitySubset = CountrySubset[CountrySubset$city == y, , drop = F]
        for (z in 1:StepNumber) {
            # shift the window; (z - 1) so that the first window starts at 0
            WinRange = WindowRange + ((z - 1) * WindowStep)
            # subset of salaries within the window for this city and country
            WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
            CalcD = sum(WindowData$arg1) / sum(WindowData$arg2)
            Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
            Table = rbind(Table, Output)
        }
    }
}
Using your example data this should work; it's just a series of nested loops that write to Table. Iterating over unique(dt$country) and unique(CountrySubset$city), rather than over every row's value, keeps rows from being duplicated. Growing Table with rbind on every iteration is slow, though; see the sketch below for the usual fix.
WindowStep is the offset between consecutive windows.
StepNumber is how many steps to take in total; it is best to find the maximum salary first and derive StepNumber from it.
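Since rbind-in-a-loop is the main bottleneck above, here is a minimal sketch of the usual fix: collect the rows in a list and bind once at the end (the placeholder row stands in for the Output vector built in the loop):
# Sketch: accumulate rows in a list, then bind once;
# this avoids the quadratic cost of growing Table with rbind()
rows = vector("list", StepNumber)
for (z in 1:StepNumber) {
    WinRange = WindowRange + ((z - 1) * WindowStep)
    rows[[z]] = data.frame(WinStart = WinRange[1], WinEnd = WinRange[2], D = NA)  # D is a placeholder here
}
Table = do.call(rbind, rows)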

Related

Merging data frames by selecting for correct value

I have a data frame called "ref" that contains information for mapping a gene's entrez ID to the gene's start and end positions. I have another data frame "ori_data" where each row contains a unique mutation from a sample, which gives a genomic position. I am trying to map each position given in "ori_data" to the information in "ref" in order to assign an entrez ID to each mutation. I have tried a for loop to match on chromosome and then select positions in "ori_data" that fall between the coordinates in "ref", but I have not been successful. The "ori_data" dataset is over 1 million rows, so I'm not sure a for loop is an efficient solution. Note that many positions will map to the same entrez ID in my real dataset. "final" is what I want to end up with: it simply adds a column with the entrez ID matched by chromosome/position. TYIA!
ref = data.frame("EntrezID" = c(1, 10, 100, 1000),
                 "Chromosome" = c("19", "8", "20", "18"),
                 "txStarts" = c("58345182", "18391281", "44619518", "27950965"),
                 "txEnds" = c("58353492", "18401215", "44651758", "28177130"))
ori_data = data.frame("Chromosome" = c("19", "8", "20", "18"),
                      "Pos" = c("58345186", "18401213", "44619519", "27950966"),
                      "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"))
final = data.frame("Chromosome" = c("19", "8", "20", "18"),
                   "Pos" = c("58345186", "18401213", "44619519", "27950966"),
                   "Sample" = c("HCC1", "HCC2", "HCC1", "HCC3"),
                   "EntrezID" = c(1, 10, 100, 1000))
I have tried this line of code and I'm unsure as to why it does not work.
for (i in 1:dim(ori_data)[1])
{
    for (j in 1:dim(ref)[1])
    {
        ID = which(ori_data[i, "Chromosome"] == ref[j, "Chromosome"])
        if (length(ID) > 0)
        {
            Pos = ori_data[ID, "POS"]  # note: the column is named "Pos", not "POS"
            IDj = which(Pos >= ref[j, "txStarts"] & Pos <= ref[j, "txEnds"])
            print(IDj)
            if (length(IDj) > 0)
            {
                ori_data = cbind("Entrez" = ref[IDj, "EntrezID"], ori_data)
            }
        }
    }
}
In base R, apply could be used to find matches per row on Chromosome and test whether Pos lies within the range txStarts to txEnds.
ori_data$EntrezID <- apply(ori_data[c("Chromosome", "Pos")], 1, \(x)
    ref$EntrezID[ref$Chromosome == x["Chromosome"] &
                 x["Pos"] >= ref$txStarts & x["Pos"] <= ref$txEnds][1])
ori_data
# Chromosome Pos Sample EntrezID
#1 19 58345186 HCC1 1
#2 8 18401213 HCC2 10
#3 20 44619519 HCC1 100
#4 18 27950966 HCC3 1000
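One caveat worth flagging (an editorial note, not from the original answer): in the example data Pos, txStarts and txEnds are character vectors, so >= and <= compare lexicographically; the type.convert step in the data section at the end makes them numeric. A quick illustration:
"9" > "10"                          # TRUE: characters compare position by position
as.numeric("9") > as.numeric("10")  # FALSE: numeric comparison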
A version which could be faster:
lup <- list2env(split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome))
ori_data$EntrezID <- Map(\(x, y) {
    . <- get(x, envir = lup)
    .$EntrezID[y >= .$txStarts & y <= .$txEnds][1]
}, ori_data$Chromosome, ori_data$Pos)
Or another way but not keeping the original order. (If original order is important, have a look at unsplit.)
# Assuming you have many rows with the same Chromosome
x <- split(ori_data, ori_data$Chromosome)
# Assuming you also have many rows with the same Chromosome here
lup <- split(ref[c("EntrezID", "txStarts", "txEnds")], ref$Chromosome)
# Now sort this by the names of x - try which method is faster
# Method 1:
lup <- lup[names(x)]
# Method 2:
lup <- mget(names(x), list2env(lup))
res <- do.call(rbind, Map(\(a, b) {
    cbind(a, b[1][a$Pos >= b[[2]] & a$Pos <= b[[3]]][1])
}, x, lup))
One option would be to use sqldf, which should also be efficient for a large dataframe.
library(tibble)
library(sqldf)
as_tibble(sqldf("select dna.*, ref.EntrezID from dna
                 join ref on dna.Pos > ref.'txStarts' and
                             dna.Pos < ref.'txEnds'"))
Another option using fuzzy_join:
library(dplyr)
library(fuzzyjoin)
dna %>%
    fuzzy_join(ref %>% select(-Chromosome), by = c("Pos" = "txStarts", "Pos" = "txEnds"),
               match_fun = list(`>`, `<`)) %>%
    select(names(dna), EntrezID)
Output
Chromosome Pos Sample EntrezID
1 19 58345186 HCC1 1
2 8 18401213 HCC2 10
3 20 44619519 HCC1 100
4 18 27950966 HCC3 1000
If the 'Pos', 'txStarts', 'txEnds' are numeric, then we can use non-equi join
library(data.table)
setDT(dna)[ref, EntrezID := i.EntrezID,
           on = .(Chromosome, Pos > txStarts, Pos < txEnds)]
-output
> dna
Chromosome Pos Sample EntrezID
<char> <num> <char> <num>
1: 19 58345186 HCC1 1
2: 8 18401213 HCC2 10
3: 20 44619519 HCC1 100
4: 18 27950966 HCC3 1000
data
dna <- type.convert(ori_data, as.is = TRUE)  # the answers above refer to ori_data as 'dna'
ref <- type.convert(ref, as.is = TRUE)

R Sample By Minimum Cell Size

set.seed(1)
data = data.frame(SCHOOL = rep(1:10, each = 1000),
                  GRADE = sample(7:12, r = T, size = 10000),
                  SCORE = sample(1:100, r = T, size = 10000))
I have 'data' that contains information about student test scores. I wish to count the number of rows for each GRADE within each SCHOOL, and then take the smallest of those counts per GRADE across all SCHOOLs. Like this:
For each SCHOOL count the number of rows for a specific GRADE.
Then for each GRADE find the smallest values across all SCHOOLs.
Finally I wish to take a random sample based on the smallest value found in step 2.
So, in a small example with two SCHOOLs and GRADEs 7 and 8:
SCHOOL 1 has 2 SCOREs for GRADE 7 and SCHOOL 1 has 3 SCOREs for GRADE 8.
SCHOOL 2 has 1 SCORE for GRADE 7 and SCHOOL 2 has 4 SCOREs for GRADE 8.
So the new data contains one SCORE for GRADE 7 from each of SCHOOL 1 and SCHOOL 2, and three SCOREs for GRADE 8 from each of SCHOOL 1 and SCHOOL 2, and the SCOREs that are picked are randomly sampled.
My attempt:
data[, .SD[sample(x = .N, size = min(sum(GRADE), .N))], by = .(SCHOOL, GRADE)]
This follows your description of how to do it step-by-step.
library(data.table)
setDT(data)
data[, N := .N, .(SCHOOL, GRADE)]  # count rows per (SCHOOL, GRADE)
data[, N := min(N), GRADE]         # smallest count per GRADE across all SCHOOLs
data[, .(SCORE = sample(SCORE, N)), .(SCHOOL, GRADE, N)][, -'N']
If you have multiple SCORE-like columns and you want keep the same rows from each then you can use .SD like in your attempt:
data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']
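A quick sanity check (a sketch, not part of the original answer): after sampling, every GRADE should have the same number of rows in each SCHOOL, namely the per-GRADE minimum computed above.
out <- data[, .SD[sample(.N, N)], .(SCHOOL, GRADE, N)][, -'N']
out[, .N, by = .(GRADE, SCHOOL)][, uniqueN(N), by = GRADE]  # should be 1 for every GRADE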

R: create column to determine whether stock is bought, held or sold

I have a dataset of portfolio components that gives information about the 10 stocks held in my portfolio at each rebalancing date (monthly).
An example dataframe with just 4 dates and 4 stocks:
Date <- c(rep(as.Date("2010/12/14"), 4), rep(as.Date("2011/01/13"), 4), rep(as.Date("2011/02/10"), 4),rep(as.Date("2011/03/10"), 4))
Name <- c("A","B","C","D","A","C","F","G","A","B","F","H","A","F","H","I")
df <- data.frame(Date, Name)
The stocks vary between the dates, so I need to compute direct transaction costs based on whether I buy, hold or sell each stock. What I would like is to add a column with values of 0, 1 or 2 depending on this, so that:
value of 0 if the stock is held at time t-1 and time t
value of 1 if the stock is held at time t-1, but sold at time t
value of 1 if the stock is bought at time t-1, and held (not sold) at time t
value of 2 if the stock is bought at time t-1, and sold at time t
With the values being assigned to the stock at t-1.
An example of what this would look like:
Date Stock Status
2010-12-14 A 1
2010-12-14 B 2
2010-12-14 C 1
2010-12-14 D 2
2011-01-13 A 0
2011-01-13 C 1
2011-01-13 F 1
2011-01-13 G 2
2011-02-10 A 0
2011-02-10 B 1
2011-02-10 F 0
2011-02-10 H 1
2011-03-10 A
2011-03-10 F
2011-03-10 H
2011-03-10 I
I think your example data is not enough to show all cases, but this should generally do:
library(data.table)
library(dplyr)  # for case_when

Date <- c(rep(as.Date("2010/12/14"), 4), rep(as.Date("2011/01/13"), 4),
          rep(as.Date("2011/02/10"), 4), rep(as.Date("2011/03/10"), 4))
dummy <- rep(1, 16)
Name <- c("A","B","C","D","A","C","F","G","A","B","F","H","A","F","H","I")
df <- data.frame(Date, Name, dummy)

# build the complete grid of stocks x dates, so absences become explicit
le = LETTERS[1:9]
temp = CJ(Name = le, Date = unique(Date))
df = data.table(df)
setkey(df, Name, Date)
setkey(temp, Name, Date)
df = df[temp]
df[is.na(dummy), dummy := 0]  # assumption added here: absent stock-date combinations count as "not held"

df[, value := case_when(
    dummy == 1 & shift(dummy, type = "lag", n = 1) == 1 & shift(dummy, type = "lag", n = 2) == 1 ~ 0,
    dummy == 0 & shift(dummy, type = "lag", n = 1) == 1 & shift(dummy, type = "lag", n = 2) == 1 ~ 1,
    dummy == 1 & shift(dummy, type = "lag", n = 1) == 1 & shift(dummy, type = "lag", n = 2) == 0 ~ 1,
    dummy == 0 & shift(dummy, type = "lag", n = 1) == 1 & shift(dummy, type = "lag", n = 2) == 0 ~ 2,
    TRUE ~ 88), by = Name][dummy == 1]
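For reference, a small illustration of how shift() with type = "lag" behaves (data.table assumed), since the case_when above leans on it:
library(data.table)
shift(c(1, 0, 1), n = 1, type = "lag")  # NA  1  0
shift(c(1, 0, 1), n = 2, type = "lag")  # NA NA  1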

sort and delete data in data.table

I need to create a function which accepts purchases and returns a table with the columns ordernumber and product_id. The result must be sorted in descending order of price, and rows with negative values of quantity must be removed.
Function example
sample.purchases <- data.table(price = c(100000, 6000, 7000, 5000000),
                               ordernumber = 1:4,
                               quantity = c(1, 2, 1, -1),
                               product_id = 1:4)
ordered.short.purchase.data(sample.purchases)
# ordernumber product_id
#1: 1 1
#2: 3 3
#3: 2 2
I wrote code, but it does not work
ordered.short.purchase.data <- function(purchases) {
    purchases[order(price, quantity >= 0, decreasing = T), list(ordernumber, product_id)]
}
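A sketch of one plausible fix (not the canonical answer): filter out the negative quantities first, then order by descending price.
# Sketch: drop rows with negative quantity, then sort by price descending
ordered.short.purchase.data <- function(purchases) {
    purchases[quantity >= 0][order(price, decreasing = TRUE), list(ordernumber, product_id)]
}
ordered.short.purchase.data(sample.purchases)
# ordernumber product_id
#1: 1 1
#2: 3 3
#3: 2 2
This reproduces the expected output above.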

Generating different iterations of a dataset

I have a dataset made of items (rows) categorized only with integers from 0 to 4 (which represent degrees of my discrete variable). I have two years of data, 1980 and 1996 (columns).
df <- read.table(text = "
1980 1996
1 1
2 4
4 1", header = T)
My goal is to generate data for the intermediate years 1984, 1988 and 1992.
df.new <- data.frame(X1980 = NULL, X1984 = NULL, X1988 = NULL, X1992 = NULL, X1996 = NULL)
However for this virtual data to be reality based, it must follow 3 laws:
items assigned the same integer in 1980 and 1996, remain the same throughout the whole period
items which increase or decrease from 1980 to 1996 can only change by one integer in a given time step (items cannot skip integers).
items can only increase or decrease (items have to be monotonic)
to achieve this I am using:
for (i in 1:nrow(df)) {
    lst <- ifelse(df$X1980[i] > df$X1996[i],
                  list(sort(sample(df$X1980[i]:df$X1996[i], 3, replace = T), decreasing = T)),
                  list(sort(sample(df$X1980[i]:df$X1996[i], 3, replace = T), decreasing = F)))
    lst <- c(df$X1980[i], unlist(lst), df$X1996[i])
    df.new <- rbind(df.new, data.frame(X1980 = lst[1],
                                       X1984 = lst[2],
                                       X1988 = lst[3],
                                       X1992 = lst[4],
                                       X1996 = lst[5]))
}
Which seems to work well, since df.new produces:
X1980 X1984 X1988 X1992 X1996
1 1 1 1 1 1
2 2 3 4 4 4
3 4 4 3 2 1
There are of course multiple variations of this dataset that also follow my 3 laws.
How should I write a loop that generates sim = 1000 law-abiding iterations of this dataset?
And how can I be sure that no item (in any iteration) breaks any of my 3 laws?
Currently I am trying results <- foreach(i = 1:sim, .combine = "df") %dopar% before the loop, but I have been unsuccessful so far.
Any help or advice will be greatly appreciated.
You can do:
library(foreach)
library(doParallel)  # %dopar% needs a registered parallel backend
registerDoParallel()

results <- foreach(i = 1:100) %dopar% {
    foreach(i = 1:nrow(df), .combine = "rbind") %do% {
        lst <- ifelse(df$X1980[i] > df$X1996[i],
                      list(sort(sample(df$X1980[i]:df$X1996[i], 3, replace = T), decreasing = T)),
                      list(sort(sample(df$X1980[i]:df$X1996[i], 3, replace = T), decreasing = F)))
        lst <- c(df$X1980[i], unlist(lst), df$X1996[i])
        data.frame(X1980 = lst[1],
                   X1984 = lst[2],
                   X1988 = lst[3],
                   X1992 = lst[4],
                   X1996 = lst[5])
    }
}
do.call("rbind", results)
foreach works like lapply: it collects whatever your expression returns (its last element) into a list.
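A minimal illustration of that:
library(foreach)
foreach(i = 1:3) %do% i^2  # a list containing 1, 4, 9
As for the second part of the question, verifying that no generated table breaks the three laws, here is a hedged sketch (check_laws is a hypothetical helper, not from the original answer):
# Sketch: TRUE when every row of a generated table obeys all three laws
check_laws <- function(d) {
    all(apply(as.matrix(d), 1, function(r) {
        dr <- diff(r)
        law1 <- if (r[1] == r[length(r)]) all(r == r[1]) else TRUE  # equal endpoints => constant
        law2 <- all(abs(dr) <= 1)                                   # steps of at most one integer
        law3 <- all(dr >= 0) || all(dr <= 0)                        # monotonic
        law1 && law2 && law3
    }))
}
which(!vapply(results, check_laws, logical(1)))  # indices of any law-breaking iterations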
