Compute conditionally across rows in data.table in R

I have a data.table with three relevant columns: id, timepoint and metric (actual size is much larger).
I am trying to calculate the percent change between the metric values at timepoints A and D and use it to create a label (Good metric, Half-decent metric, Subpar metric).
The situation becomes more complicated because if the metric at timepoint D is less than or equal to 2, then the new column should report "Super metric!". If not, the percent difference should be calculated. Based on the percent change, the ids should be classified as "Subpar metric" (< 30%), "Half-decent metric" (30-50%), or "Good metric" (> 50%).
If there is an NA value at timepoint A or D, then returning NA is okay. If timepoint A or D is missing entirely, also return NA.
My initial thought was that I could calculate this in data.table without creating unnecessary columns, but I haven't even been able to get the simpler solution working, where I do the calculations separately and then join them later.
# Example data
library(data.table)
dat <- data.table(id = c(1,1,1,1,2,2,3,3,3,3,4,4,4,6,6,10,10,10,11,11,12,12,14,14),
                  timepoint = c("A","B","C","D","A","D","A","B","C","D","A","B","C","A","D","A","B","D","A","D","A","D","A","D"),
                  metric = c(NA, 3, 3, 4, 4, 2, 3, 3, 2, 1, 4, 3, NA, NA, 4, 1, 5, 2, 5, 3, 5, 5, 6, 3))
Partial solution: first identify the "Super metric" ids. I would like this to class all rows of a "Super metric" id as such (right now it returns "Super metric" only for the timepoint D row).
# Inefficient solution
# Step 1: Identify id's that need to be computed
dat1 <- dat[, Metric_score := if (metric <= 2 & timepoint == "D") "Super metric"
                              else "Calc PC",
            by = .(id, timepoint)]
# id timepoint metric Metric_score
# 1: 1 A NA Calc PC
# 2: 1 B 3 Calc PC
# 3: 1 C 3 Calc PC
# 4: 1 D 4 Calc PC
# 5: 2 A 4 Calc PC # Should be Super metric
# 6: 2 D 2 Super metric
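For reference, a per-group version of step 1 would label every row of a qualifying id (which is what the desired output below asks for). This is only a sketch of that first step; the percent-change classification still remains to be done:
# Label all rows of an id as "Super metric" when its timepoint D metric is <= 2
dat[, Metric_score := if (any(timepoint == "D" & metric <= 2, na.rm = TRUE))
                        "Super metric" else "Calc PC",
    by = id]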
Performing the calculation:
This calculates the percent change for all ID's, regardless of whether or not it needs to be calculated
# Step 2: Calculate percent change between timepoint D and A
dat[ , `:=`(col = (metric[timepoint == "A"] - metric[timepoint == "D"])/metric[timepoint == "A"]*100), by = 'id']
Desired output: class each metric as "Super metric" when the final score (timepoint D) is <= 2; otherwise, calculate the percent change, ((metric at time D - metric at time A) / metric at time A) * 100, and classify based on the result: "Subpar metric" (< 30%), "Half-decent metric" (30-50%), "Good metric" (> 50%).
 id timepoint metric       metric_class
  1         A     NA                 NA
  1         B      3                 NA
  1         C      3                 NA
  1         D      4                 NA
  2         A      4       Super metric
  2         D      2       Super metric
  3         A      3       Super metric
  3         B      3       Super metric
  3         C      2       Super metric
  3         D      1       Super metric
  4         A      4                 NA
  4         B      3                 NA
  4         C     NA                 NA
  6         A     NA                 NA
  6         D      4                 NA
 10         A      1       Super metric
 10         B      5       Super metric
 10         D      2       Super metric
 11         A      5 Half-decent metric
 11         D      3 Half-decent metric
 12         A      5      Subpar metric
 12         D      5      Subpar metric
 14         A      6        Good metric
 14         D      3        Good metric

Using fcase should give you the desired result.
Since 0.5 falls both in the 30-50% band and in the >= 50% band, fcase takes the first matching case in the list, which here is "Good metric"; if you want that changed you can simply change the order.
metrics <- dcast.data.table(dat, id ~ timepoint)
metrics[, metric_class := fcase(D <= 2, "Super metric",
                                abs(D - A) / A < 0.3, "Subpar metric",
                                abs(D - A) / A >= 0.5, "Good metric",
                                between(abs(D - A) / A, 0.3, 0.5), "Half-decent metric")]
dat <- merge(dat, metrics[, .(id, metric_class)], by = "id")
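If you prefer dcast not to guess the value column (it prints a message when it guesses), you can name it explicitly with value.var; this is only a minor tweak to the answer above and changes nothing else:
# Same reshape as above, with the value column stated explicitly
metrics <- dcast.data.table(dat, id ~ timepoint, value.var = "metric")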

Here is another approach that doesn't require dcast.
metric_class <- function(t, m) {
  # If the group has a D timepoint and its metric is <= 2, label every row
  if ("D" %in% t && m[t == "D"] <= 2) return(rep("Super metric", length(t)))
  # Otherwise compute the absolute percent change between A and D
  mvals <- c("a" = m[t == "A"], "d" = m[t == "D"])
  val <- abs((mvals["d"] - mvals["a"]) / mvals["a"])
  rep(fcase(val < 0.3, "Subpar metric",
            val >= 0.5, "Good metric",
            val >= 0.3 & val < 0.5, "Half-decent metric"),
      length(t))
}
setDT(dat)[, metric_class:=metric_class(timepoint, metric), by=id][]
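As a quick spot check, calling the helper directly on one group (id 14 has A = 6 and D = 3, so |3 - 6| / 6 = 0.5) should classify it as "Good metric" for both rows:
metric_class(c("A", "D"), c(6, 3))  # should give "Good metric" for each row of the group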


Match datasets on a subset of conditions

When matching 2 data sets, is it possible to somehow specify the matching such that an observation from the first dataset is matched to the second dataset if at least one of the conditions is met?
Let's say I have the following 2 data.tables:
dt1 <- data.table(c1 = c(rep('a', 2), rep('b', 2), rep('c', 2)),
                  c2 = c('x', 'y', 'x', 'y', 'x', 'z'),
                  c3.min = c(rep(0, 3), rep(-1, 3)),
                  c3.max = c(rep(10, 3), rep(11, 3)),
                  x = 1:6)
dt2 <- data.table(c1 = c(rep('a', 3), rep('b', 3), rep('c', 4)),
                  c2 = c(rep(c('x', 'y'), 5)),
                  c3 = c(-1, 2, 0, 10, 11, -1, 3, 6, 3, 12),
                  y = 1:10)
I have 3 conditions based on which I want to match dt1 to dt2, and the 3rd condition is a range. If I just do a normal merge by these 3 conditions I will get:
> dt2[dt1, on=.(c1,
+ c2,
+ c3 <= c3.max,
+ c3 >= c3.min), nomatch=NA ]
c1 c2 c3 y c3.1 x
1: a x 10 3 0 1
2: a y 10 2 0 2
3: b x 10 NA 0 3
4: b y 11 4 -1 4
5: b y 11 6 -1 4
6: c x 11 7 -1 5
7: c x 11 9 -1 5
8: c z 11 NA -1 6
As you can see, the observations from dt1 with x=3 and x=6 aren't matched. My main concern is to find at least one match for as many observations in dt1 as possible, even if I have to relax some conditions. So I want to know if there is any way to perform a match where dt1 matches dt2 on at least 1 out of the 3 conditions.
I could write a loop, but in reality my 2 datasets are much bigger than this (the first has 10K observations and the 2nd has 300K observations), and I have 4 conditions in total, so I'm looking for a more efficient way.
Thanks!
My first instinct with this type of problem would be to use the sqldf package, since we need to join using OR conditions, not AND conditions.
library(sqldf)
names(dt1) <- c("c1", "c2", "c3_min", "c3_max", "x") # need to get rid of the "."
query1 <- "select * from dt1
left join dt2
on (dt1.c1 = dt2.c1) or (dt1.c2 = dt2.c2) or (dt2.c3 between dt1.c3_min and dt1.c3_max)"
sqldf(query1)
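If you would rather stay inside data.table, one rough way to express the same OR logic is a cross join followed by a filter. This is only a sketch for small tables (with 10K x 300K rows it would materialize roughly 3 billion pairs, so it is not a practical drop-in here), and it assumes the renamed dt1 columns (c3_min, c3_max) from above:
library(data.table)
dt1[, k := 1]; dt2[, k := 1]                         # constant key for a cross join
pairs <- dt1[dt2, on = "k", allow.cartesian = TRUE]  # every dt1 row paired with every dt2 row
# keep the pairs where at least one of the three conditions holds
hits <- pairs[c1 == i.c1 | c2 == i.c2 | between(c3, c3_min, c3_max)]
dt1[, k := NULL]; dt2[, k := NULL]                   # clean up the helper column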

Rolling standard deviation for multiple firm, with different time periods

I have a dataset with monthly stock returns for approximately 100 firms. The firms cover different time periods, because they went on and off the stock exchange at different times.
I have ordered my dataset by Company, Year, Month, and I want the standard deviation to account for this, so that the rolling window for a firm starts after 24 months and ends at that firm's last observation.
This means that the command has to be able to tell the difference between firms, so that the window doesn't transfer over to the next firm.
Year, Month, Company, Return
1990, 1, Company 1, -0,005
1990, 2, Company 1 , 0,003
etc...
1990, 1, Company 2, ...
1990, 2, Company 2, ...
etc...
2017, 6, Company 50, ...
I have been trying with this code, but it just keeps going when the next row contains a new firm, i.e. it just does a rolling standard deviation for the whole dataset.
rolling_sd <- rollapply(Dataset$RETURN, width = 24,
                        FUN = sd, fill = NA, align = "right")
Also it does not align with the right date. If I have no align command, the first row of standard deviation should be 24 rows down, with the "right" it moves 12 down, but still not properly aligned.
How can I make it to take Company name into account?
If you omit the align = "right" argument, the sd values would be centered, as discussed in the question; but since the code shown does use right alignment, the sd values start in row 24. I suspect you are confusing runs made with and without the align= argument.
Using the data shown in the Note at the end, and changing 24 to 3 in order to demonstrate on this smaller dataset, we use ave to apply the rolling sd to each company separately. The r at the end of rollapplyr is a shorter way of specifying align = "right". With right alignment, the sd shown in the ith row is the sd of the width rows ending in row i, i.e. rows i-width+1 to i inclusive.
library(zoo)
roll <- function(x) rollapplyr(x, width = 3, FUN = sd, fill = NA)
transform(Dataset, sd = ave(RETURN, Company, FUN = roll))
giving:
Year Month Company RETURN sd
1 1 1 A -0.042484496 NA
2 1 2 A 0.057661027 NA
3 1 3 A -0.018204616 0.05224021
4 1 4 A 0.076603481 0.05017135
5 2 1 A 0.088093457 0.05833792
6 2 2 A -0.090888700 0.10018338
7 2 3 A 0.005621098 0.08958278
8 2 4 A 0.078483809 0.08496093
9 1 1 B -0.042484496 NA
10 1 2 B 0.057661027 NA
11 1 3 B -0.018204616 0.05224021
12 1 4 B 0.076603481 0.05017135
13 2 1 B 0.088093457 0.05833792
14 2 2 B -0.090888700 0.10018338
15 2 3 B 0.005621098 0.08958278
16 2 4 B 0.078483809 0.08496093
Note
Some data in reproducible form
set.seed(123)
tmp <- data.frame(Year = c(1, 1, 1, 1, 2, 2, 2, 2), Month = 1:4, Company = "A",
                  RETURN = runif(8, -.1, .1))
Dataset <- rbind(tmp, transform(tmp, Company = "B"))
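For completeness, the same per-company rolling sd can be written with data.table instead of ave(); a sketch using the Dataset built in the Note and the same width of 3:
library(data.table)
library(zoo)
# compute the right-aligned rolling sd within each Company
setDT(Dataset)[, sd := rollapplyr(RETURN, width = 3, FUN = sd, fill = NA), by = Company][]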

Compute similarity percentage OR Compute correlation between more than 2 objects

Suppose I have four objects (a, b, c, d) and I ask five people to label them (category 1 or 2) according to their physical appearance or some other criterion. The labels provided by the five people for these objects are shown as
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
In tabular format,
---------
a b c d
---------
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1
----------
Now I want to calculate the percentage of times a group of objects was given the same label (either 1 or 2). For example, objects a, b and d were given the same label by 3 out of 5 people, so the percentage is 3/5 (= 60%), whereas objects a and d were given the same label by all five people, so the percentage is 5/5 (= 100%).
I can calculate this statistic manually, but in my original dataset I have 50 such objects, 30 people, and 4 labels (1, 2, 3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R which can calculate such statistics?
Note: a group can be of any size. In the first example the group consists of a, b and d, while in the second example it consists of a and d.
There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.
Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but you can convert it to base R if you like.
library(tidyverse)
map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>%  # get combinations
  flatten() %>%                                                 # eliminate nesting
  set_names(map_chr(., paste0, collapse = '')) %>%              # add useful names
  # subset df with combination, see if each row has only one unique value
  map(~apply(df[.x], 1, function(x) n_distinct(x) == 1)) %>%
  map_dbl(~sum(.x) / length(.x))                                # calculate TRUE proportion
## ab ac ad bc bd cd abc abd acd bcd abcd
## 0.6 0.2 1.0 0.2 0.6 0.2 0.0 0.6 0.2 0.0 0.0
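Since the answer invites converting to base R, here is a rough equivalent without purrr (a sketch; same combinations and the same row-wise unanimity check on df):
combos <- unlist(lapply(2:ncol(df), function(k) combn(names(df), k, simplify = FALSE)),
                 recursive = FALSE)                # all groups of size 2..ncol(df)
names(combos) <- sapply(combos, paste0, collapse = "")
# proportion of rows where every object in the group got the same label
sapply(combos, function(cols) mean(apply(df[cols], 1, function(x) length(unique(x)) == 1)))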
If you have numeric ratings, you could use diff to check if you consistently have 0 difference between each rater:
f <- function(cols, data) {
  sum(colSums(diff(t(data[cols])) == 0) == (length(cols) - 1)) / nrow(data)
}
Results are as expected when applying the function to example groups:
f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1
With base R functions you could do:
groupVec = c("a","b","d")
transDF = t(as.matrix(df))
subDF = transDF[rownames(transDF) %in% groupVec,]
subDF
# [,1] [,2] [,3] [,4] [,5]
# a 1 2 1 2 1
# b 1 2 2 1 1
# d 1 2 1 2 1
#if length of unique values is 1, it implies match across all objects, count unique values/total columns = match pct
match_pct = sum(sapply(as.data.frame(subDF), function(x) sum(length(unique(x))==1) ))/ncol(subDF)
match_pct
# [1] 0.6
Wrapping it in a custom function:
fn_matchPercent = function(groupVec = c("a", "d")) {
  transDF = t(as.matrix(df))
  subDF = transDF[rownames(transDF) %in% groupVec, ]
  match_pct = sum(sapply(as.data.frame(subDF), function(x) sum(length(unique(x)) == 1))) / ncol(subDF)
  outputDF = data.frame(groups = paste0(groupVec, collapse = ","), match_pct = match_pct)
  return(outputDF)
}
fn_matchPercent(c("a","d"))
# groups match_pct
# 1 a,d 1
fn_matchPercent(c("a","b","d"))
# groups match_pct
# 1 a,b,d 0.6
Try this:
find.unanimous.percentage <- function(df, at.a.time) {
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep = '')
  cols$percent.unanimous <- 100 * colMeans(apply(cols, 1, function(x) apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}
find.unanimous.percentage(df, 2) # take 2 at a time
O1 O2 percent.unanimous
1 a b 60
2 a c 20
3 a d 100
4 b c 20
5 b d 60
6 c d 20
find.unanimous.percentage(df, 3) # take 3 at a time
O1 O2 O3 percent.unanimous
1 a b c 0
2 a b d 60
3 a c d 20
4 b c d 0
find.unanimous.percentage(df, 4)
O1 O2 O3 O4 percent.unanimous
1 a b c d 0
Clustering similarity metrics
It seems that you might want to calculate a substantially different (better?) metric than what you propose now, if your actual problem requires evaluating various ways of clustering the same data.
This http://cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf is a good overview of the problem, and the BCubed precision/recall metrics are commonly used for similar problems in NLP (e.g. http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html).
Try this code. It works for your example and should hold for the extended case.
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
# Find all unique combinations of the column names
group_pairs <- data.frame(t(combn(colnames(df), 2)))
# For each combination calculate the similarity
group_pairs$similarities <- apply(group_pairs, 1, function(x) {
  sum(df[x["X1"]] == df[x["X2"]]) / nrow(df)
})
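For the example df this should reproduce the same pairwise proportions as the first answer:
group_pairs
#   X1 X2 similarities
# 1  a  b          0.6
# 2  a  c          0.2
# 3  a  d          1.0
# 4  b  c          0.2
# 5  b  d          0.6
# 6  c  d          0.2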

Reverse Scoring Items

I have a survey of about 80 items. Primarily the items are valenced positively (higher scores indicate a better outcome), but about 20 of them are valenced negatively, and I need to find a way to reverse score the negatively valenced ones in R. I am completely lost on how to do so. I am definitely an R beginner, and this is probably a dumb question, but could someone point me in the right direction code-wise?
Here's an example with some fake data that you can adapt to your data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat = data.frame(Q1 = sample(1:5, 10, replace = TRUE),
                 Q2 = sample(1:5, 10, replace = TRUE),
                 Q3 = sample(1:5, 10, replace = TRUE))
dat
Q1 Q2 Q3
1 2 2 5
2 2 1 2
3 3 4 4
4 5 2 1
5 2 4 2
6 5 3 2
7 5 4 1
8 4 5 2
9 4 2 5
10 1 4 2
# Say you want to reverse questions Q1 and Q3
cols = c("Q1", "Q3")
dat[ ,cols] = 6 - dat[ ,cols]
dat
Q1 Q2 Q3
1 4 2 1
2 4 1 4
3 3 4 2
4 1 2 5
5 4 4 4
6 1 3 4
7 1 4 5
8 2 5 4
9 2 2 1
10 5 4 4
If you have a lot of columns, you can use tidyverse functions to select multiple columns to recode in a single operation.
library(tidyverse)
# Reverse code columns Q1 and Q3
dat %>% mutate(across(matches("^Q[13]"), ~ 6 - .))
# Reverse code all columns that start with Q followed by one or two digits
dat %>% mutate(across(matches("^Q[0-9]{1,2}"), ~ 6 - .))
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ 6 - .))
If different columns could have different maximum values, you can (adapting #HellowWorld's suggestion) customize the reverse-coding to the maximum value of each column:
# Reverse code columns Q11 through Q20
dat %>% mutate(across(Q11:Q20, ~ max(.) + 1 - .))
Here is an alternative approach using the psych package. If you are working with survey data, this package has lots of good functions. Building on #eipi10's data:
# Fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
original_data = data.frame(Q1 = sample(1:5, 10, replace = TRUE),
                           Q2 = sample(1:5, 10, replace = TRUE),
                           Q3 = sample(1:5, 10, replace = TRUE))
original_data
# Say you want to reverse questions Q1 and Q3. Set those keys to -1 and Q2 to 1.
# install.packages("psych") # Uncomment this if you haven't installed the psych package
library(psych)
keys <- c(-1,1,-1)
# Use the handy function from the pysch package
# mini is the minimum value and maxi is the maximum value
# mini and maxi can also be vectors if you have different scales
new_data <- reverse.code(keys,original_data,mini=1,maxi=5)
new_data
The pro to this approach is that you can recode your entire survey in one function. The con to this is you need a library. The stock R approach is more elegant as well.
FYI, this is my first post on stack overflow. Long time listener, first time caller. So please give me feedback on my response.
Just converting #eipi10's answer using tidyverse:
# Create same fake data: Three questions answered on a 1 to 5 scale
set.seed(1)
dat <- data.frame(Q1 = sample(1:5, 10, replace = TRUE),
                  Q2 = sample(1:5, 10, replace = TRUE),
                  Q3 = sample(1:5, 10, replace = TRUE))
# Reverse scores in the desired columns (Q2 and Q3)
dat <- dat %>%
mutate(Q2Reversed = 6 - Q2,
Q3Reversed = 6 - Q3)
Another example is to use recode in library(car).
#Example data
data = data.frame(Q1=sample(1:5,10, replace=TRUE))
# Say you want to reverse questions Q1
library(car)
data$Q1reversed <- recode(data$Q1, "1=5; 2=4; 3=3; 4=2; 5=1")
data
The psych package has the intuitive reverse.code() function that can be helpful. Using the dataset started by #eipi10 and the same goal of reversing q1 and q3:
set.seed(1)
dat <- data.frame(q1 = sample(1:5, 10, replace = TRUE),
                  q2 = sample(1:5, 10, replace = TRUE),
                  q3 = sample(1:5, 10, replace = TRUE))
You can use the reverse.code() function. The first argument is the keys. This is a vector of 1 and -1. -1 means that you want to reverse that item. These go in the same order as your data.
The second argument, called items, is simply the name of your dataset. That is, where are these items located?
Last, the mini and maxi arguments are the smallest and largest values that a participant could possibly score. You can also leave these arguments as NULL and the function will use the lowest and highest values in your data.
library(psych)
keys <- c(-1, 1, -1)
dat1 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat1
Alternatively, your keys can also contain the specific names of the variables that you want to reverse score. This is helpful if you have many variables to reverse score and yields the same answer:
library(psych)
keys <- c("q1", "q3")
dat2 <- reverse.code(keys = keys, items = dat, mini = 1, maxi = 5)
dat2
Note that, after reverse scoring, reverse.code() slightly modifies the variable name to have a - behind it (i.e., q1 becomes q1- after being reverse scored).
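If you would rather keep the original column names, one small option (a hedged sketch; colnames() works whether reverse.code() hands back a matrix or a data frame) is to strip the trailing marker afterwards:
# drop the trailing "-" that reverse.code() appends to reversed items
colnames(dat2) <- sub("-$", "", colnames(dat2))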
The solutions above assume wide data (one score per column). This reverse scores specific rows in long data (one score per row).
library(dplyr)     # mutate() comes from dplyr
library(magrittr)  # %<>% (the compound assignment pipe) comes from magrittr
max <- 5
df <- data.frame(score = sample(1:max, 20, replace = TRUE))
df <- mutate(df, question = rownames(df))
df
# reverse score only rows 4, 13 and 17
df[c(4, 13, 17), ] %<>% mutate(score = max + 1 - score)
df
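If the reverse-keyed items are identified by question id rather than by row position, the same pattern works with a lookup vector (reverse_items below is a hypothetical example, continuing from the block above):
reverse_items <- c("4", "13", "17")   # hypothetical ids of reverse-keyed questions
df[df$question %in% reverse_items, ] %<>% mutate(score = max + 1 - score)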
Here is another attempt that will generalize to any number of columns. Let's use some made up data to illustrate the function.
# create a df
{
  A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
  B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
  C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
  df1 = data.frame(A, B, C)
  print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The columns to reverse code
# variables to reverse code
vtcode = c("A", "B")
The function to reverse-code the selected columns
reverseCode <- function(data, rev) {
  # get maximum value per desired col: lapply(data[rev], max)
  # subtract values in cols to reverse-code from max value plus 1
  data[, rev] = mapply("-", lapply(data[rev], max), data[, rev]) + 1
  return(data)
}
reverseCode(df1, vtcode)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
This code was inspired by a response from #catastrophic-failure about subtracting the max of a column from all entries in that column in R.

In R, find duplicated dates in a dataset and replace their associated values with their mean

I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean of id#1 is 3 and of id#2 is 2, which would then replace each of the original var's.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check if the ids are duplicated; you don't actually need to, since for non-duplicated ids you can just use the mean of the single var value:
duplicated_ids = unique(z$id[duplicated(z$id)])  # not actually used below
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
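If you want to keep the original column name instead of the default V1, name the aggregate explicitly (the computation is otherwise identical):
z[, .(var = mean(var)), by = id]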
There is no need to treat unique values differently from duplicated values, since the mean of a single value is that value.
zt<-aggregate(var~id,data=z,mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
