Ranking elements within a data.frame

Let's say I have a data frame, like this:
df <- data.frame(
  variable = rep(letters[1:10], 2),
  y2 = 1:10,
  y1 = c(10, 9, 8, 7, 6, 5, 4, 2, 1, 3),
  stat = c(rep(letters[1], 10), rep(letters[2], 10))
)
By "stat", I would like to create three new columns, one that shows a numbered rank for y1 and y2, and another that calculates the change in rank between y1 and y2 (short for year 1 and year 2).
I've been tinkering with ddply, but I can't seem to get it to do what I want. Here's an example of what I've tried (which may also illustrate what I'm attempting to do):
ddply(df, .(stat), function(x) data.frame(
  df,
  y1rank = rank(x$x),
  y2rank = rank(x$y),
  change = rank(x$y) - rank(x$x)
))

You can also use the new mutate function, which saves you from recalculating the columns (later arguments can use columns created by earlier ones):
library(plyr)
ddply(df, .(stat), mutate,
  y1rank = rank(y1),
  y2rank = rank(y2),
  change = y2rank - y1rank
)

Would this work for you?
ddply(df, .(stat), transform,
  y1rank = rank(y1),
  y2rank = rank(y2),
  change = rank(y2) - rank(y1)
)
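For reference, the same grouped ranking can be done without plyr; a minimal dplyr sketch (assuming the df defined above):
library(dplyr)
df %>%
  group_by(stat) %>%
  mutate(
    y1rank = rank(y1),
    y2rank = rank(y2),
    change = y2rank - y1rank
  ) %>%
  ungroup()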

Related

Creating subplot (facets) with custom x,y position of the subplots in ggplot2

How can we customize the position of the panels/subplots in ggplot2?
Concretely, I have a grouped time series and I want to produce one subplot per time series, with custom positions for the subplots, not necessarily in a grid.
The facet_grid() and facet_wrap() functions do not allow full customization of panel positions, as they arrange panels in a grid.
library(tidyverse)
df = data.frame(group = LETTERS[1:5],
                x = c(1, 2, 3, 1.5, 2.5),
                y = c(2, 1, 2, 3, 3),
                stringsAsFactors = F) %>%
  group_by(group) %>%
  expand_grid(time = 1:20) %>%
  ungroup() %>%
  mutate(dv = rnorm(n())) %>%
  arrange(group, time)
## plot in grid
df %>%
  ggplot() +
  geom_line(aes(x = time, y = dv)) +
  facet_grid(~group)
## plot with custom x, y positions
## Is there an equivalent of facet_custom()?
df %>%
  ggplot() +
  geom_line(aes(x = time, y = dv)) +
  facet_custom(~group, x.subplot = x, y.subplot = y)
FYI: this dataset is only an example. My data are EEG data, where each group represents an electrode (up to 64), and I want to plot each electrode's EEG signal according to the position of the electrode on the head.
Well, I guess this would not really be a 'facet plot' any more, so I don't think there is a specific function out there.
But you can use the fantastic patchwork package for this, in particular the layout option of wrap_plots.
As the main package author Thomas Lin Pedersen describes in the vignette, the option below using area() may be a bit verbose, but it gives you full programmatic control over the positioning of all your plots.
library(tidyverse)
library(patchwork)

mydf <- data.frame(
  group = LETTERS[1:5],
  x = c(1, 2, 3, 1.5, 2.5),
  y = c(2, 1, 2, 3, 3),
  stringsAsFactors = F
) %>%
  group_by(group) %>%
  expand_grid(time = 1:20) %>%
  ungroup() %>%
  mutate(dv = rnorm(n())) %>%
  arrange(group, time)

## split by group and build one plot per group
mylist <- mydf %>%
  split(., .$group)
p_list <- map(1:length(mylist), function(i) {
  ggplot(mylist[[i]]) +
    geom_line(aes(x = time, y = dv)) +
    ggtitle(names(mylist)[i])
})

## define where each plot goes
layout <- c(
  area(t = 1, l = 1, b = 2, r = 2),
  area(t = 2, l = 3, b = 3, r = 4),
  area(t = 3, l = 5, b = 4, r = 6),
  area(t = 4, l = 3, b = 5, r = 4),
  area(t = 5, l = 1, b = 6, r = 2)
)
wrap_plots(p_list, design = layout)
#> result not shown, it's the same as below
For a more programmatic approach, one option is to create the required "patch_area" object manually:
t <- 1:5
b <- t + 1
l <- c(1, 3, 5, 3, 1)
r <- l + 1
list_area <- list(t = t, b = b, l = l, r = r)
class(list_area) <- "patch_area"
wrap_plots(p_list, design = list_area)
Created on 2020-04-22 by the reprex package (v0.3.0)
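For the EEG case, the t/b/l/r vectors can also be derived from the electrode coordinates themselves. A hedged sketch, assuming each plot occupies a 2x2 block on a grid coarsened by a factor of 2 (the scaling is my choice, not part of patchwork's API):
coords <- mydf %>% distinct(group, x, y) %>% arrange(group) # same order as p_list
l <- round(2 * coords$x)                       # scale x to integer grid columns
t <- round(2 * (max(coords$y) - coords$y)) + 1 # flip y so larger y sits higher
list_area <- list(t = t, b = t + 1, l = l, r = l + 1)
class(list_area) <- "patch_area"
wrap_plots(p_list, design = list_area)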

Correlation between variables under the for loop

I have an issue that is shown below; I tried to solve it but was not successful. I have a data frame df1 and need to make a table of the correlations between its variables within a for loop, because I do not want the code to become long and complicated.
df1 <- structure(list(a = c(1, 2, 3, 4, 5),
                      b = c(3, 5, 7, 4, 3),
                      c = c(3, 6, 8, 1, 2),
                      d = c(5, 3, 1, 3, 5)),
                 class = "data.frame", row.names = c(NA, -5L))
I tried the code below, using two for loops:
fv <- as.data.frame(combn(names(df1), 2, paste, collapse = "&"))
colnames(fv) <- "ColA"
fv$ColB <- sapply(strsplit(fv$ColA, "\\&"), '[', 1)
fv$ColC <- sapply(strsplit(fv$ColA, "\\&"), '[', 2)
asd <- list()
for (i in fv$ColB) {
  for (j in fv$ColC) {
    asd[i, j] <- as.data.frame(cor(df1[, i], df1[, j]))
  }
}
May I know what I am doing wrong?
We can apply cor directly on the data.frame and convert the result to 'long' format with melt. As the values in the lower triangular part are mirror images of those in the upper triangular part, either one can be set to NA before the melt:
library(reshape2)
out <- cor(df1)
out[lower.tri(out, diag = TRUE)] <- NA
melt(out, na.rm = TRUE)
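As for the original loop: asd[i, j] indexes a list with two subscripts, which is an error, and the nested loops pair every value of ColB with every value of ColC instead of walking the rows of fv together. A hedged fix that keeps the loop-based approach (assuming the fv built above):
asd <- list()
for (k in seq_len(nrow(fv))) {
  i <- fv$ColB[k]
  j <- fv$ColC[k]
  # store each pairwise correlation under its "a&b"-style name
  asd[[paste(i, j, sep = "&")]] <- cor(df1[[i]], df1[[j]])
}
unlist(asd)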

Using foreach to create new observations and deleting erroneous observations in parallel

I am currently trying to clean a very large data set. I have working code to clean it, but it takes about three days to run without any parallelization, so I want to parallelize it. I can't figure out how to do that in R using the doParallel and foreach packages, or any other pre-built ones.
In particular, if I observe two data points that have the same time stamp, they should really be one data point. The non-parallelized code can accurately identify such points, flag them for later deletion, and create a new data point that is correct.
I've tried adapting the existing code by converting the for loops into foreach loops using the %do% operator from the foreach package. That works fine, but changing %do% to %dopar% causes the code to stop working. I understand that this is the incorrect way to use %dopar%, but I don't know how to correctly accomplish my goal.
library(doParallel)
library(foreach)

df1 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(10, 1, 9, 4, 11),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0)) # indicator for problem observations
df2 <- data.frame(ID = c(1, 2, 3, 4, 5),
                  date = c(12, 10, 7, 5, 6),
                  var2 = c(2, 4, 6, 8, 10),
                  var3 = c(2, 4, 6, 8, 10),
                  ind = c(0, 0, 0, 0, 0))

foreach(row1 = 1:nrow(df1)) %dopar% {
  for (row2 in 1:nrow(df2)) {
    if (df1[row1, "date"] == df2[row2, "date"]) { # observations on the same date should be combined
      df1[row1, "ind"] <- 1 # tag problem observations to delete them later
      df2[row2, "ind"] <- 1
      temp_obs <- data.frame(ID = df2[row2, "ID"],
                             date = df1[row1, "date"],
                             var2 = df1[row1, "var2"],
                             var3 = df1[row1, "var3"] + df2[row2, "var3"],
                             ind = 0)
      df1 <- rbind(df1, temp_obs)
      rm(temp_obs)
    }
  }
}
The sample code demonstrates my problem in a simpler context. It loops through all observations in df1 and df2 and identifies observations with the same date. It should add a sixth observation to df1 and change the indicators from 0 to 1 in the first entry of df1 and the second entry of df2 to indicate that they have been matched. As written, this code does not change df1 or df2 at all; it works when %dopar% is replaced with %do%.
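The reason is that under %dopar% each worker evaluates the loop body on its own copies of df1 and df2, so assignments inside the loop never propagate back; foreach only collects each iteration's return value. A hedged sketch of a restructured version (same df1/df2 as above; the two-worker cluster size is an arbitrary choice): have each iteration return the combined observation it creates, and do the cheap flagging sequentially afterwards.
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

## Each iteration returns a data frame (or NULL); .combine = rbind
## stacks the results into one data frame on the master process.
new_obs <- foreach(row1 = 1:nrow(df1), .combine = rbind) %dopar% {
  hits <- which(df2$date == df1$date[row1])
  if (length(hits) == 0) {
    NULL # no same-date observation for this row
  } else {
    data.frame(ID = df2[hits, "ID"],
               date = df1[row1, "date"],
               var2 = df1[row1, "var2"],
               var3 = df1[row1, "var3"] + df2[hits, "var3"],
               ind = 0)
  }
}
stopCluster(cl)

## Flagging matched rows is cheap, so do it sequentially on the master
df1$ind[df1$date %in% df2$date] <- 1
df2$ind[df2$date %in% df1$date] <- 1
df1 <- rbind(df1, new_obs)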

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T),
                   strip1 = sample(1:12, 100, replace = T),
                   strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold; these are stored in two numeric arrays, where each position in the array corresponds to that strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy a strip can register. What I want to do is remove the rows of the data frame where the energy does not clear both strips' thresholds, i.e. keep a row only if the energy meets the threshold of both strips.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row: although strip2's threshold (3) is below the energy of 4, strip1 has a threshold of 6, so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
data <- subset(data, filter1 & filter2)
I'd maybe do...
library(data.table)
setDT(data)

# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
  strip_loc = rep(seq_along(threshes), lengths(threshes)),
  strip_num = unlist(lapply(threshes, seq_along)),
  thresh = unlist(threshes)
)

# loop over strip locations (strip1, strip2, etc),
# marking rows where the threshold is not met
data[, keep := TRUE]
lbDT[, {
  onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
  data[.(s = strip_num, th = thresh), on = onexpr, keep := FALSE]
  NULL
}, by = strip_loc]
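After the marking loop, the rows that met every threshold can then be selected with something like:
data[(keep)] # rows where no threshold rule fired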
What about this? Using dplyr:
library(dplyr)
data2 <- data %>%
  mutate(
    strip1_value = threshold_strip1[strip1],
    strip2_value = threshold_strip2[strip2],
    to_keep = Energy > strip1_value & Energy > strip2_value
  ) %>%
  filter(to_keep)

ffbase: merge on columns X and Y and closest column Z

I would like to accomplish the following using ffdf: merge on columns X and Y and the closest Time, and then merge on the closest column B. However, the procedure I know from smaller samples involves outer merges (as shown below). What is a way around this for a large sample that won't fit in memory (and probably wouldn't work with sqldf), using ffbase? If that is not possible, what would be the best library for this?
As a reproducible example, same as below:
library(ffbase)
set.seed(1)
df.ff <- as.ffdf(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge.ff <- as.ffdf(data.frame(x = c(2, 2, 2, 3, 2),
                                  y = c(1, 1, 1, 5, 4),
                                  time = c(17, 12, 11.6, 22.5, 2),
                                  val = letters[1:5],
                                  stringsAsFactors = F))
I borrow the following example from @ChinmayPatil here to highlight the similar procedure I would like to follow (R - merge dataframes on matching A, B and *closest* C?):
require(data.table)
set.seed(1)
df <- setDT(cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30)))
to.merge <- setDT(data.frame(x = c(2, 2, 2, 3, 2),
                             y = c(1, 1, 1, 5, 4),
                             time = c(17, 12, 11.6, 22.5, 2),
                             val = letters[1:5],
                             stringsAsFactors = F))
## First do a left outer merge
A <- merge(to.merge, df, by = c('x', 'y'), all.x = T)
## Then calculate a diff column
A$diff <- abs(A$time.x - A$time.y)
## Then take the row with the minimum distance within each group
A[, .I[which.min(diff)], by = c('x', 'y')]
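Note that data.table can also do the nearest match in one step with a rolling join, avoiding the outer merge entirely. A sketch using the df and to.merge just defined (roll applies to the last key column):
setkey(df, x, y, time)
setkey(to.merge, x, y, time)
df[to.merge, roll = 'nearest']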
Given that my question got so few views and no answers, I will describe the approach I came up with to solve this problem, in the hope that someone might find it useful (or even as a reminder for my future self):
To me, the most difficult aspect of performing an exact match on some columns followed by a nearest match on another column was that I kept thinking an outer join (as described in the post) was necessary. The solution is pretty simple using data.table and ffdfdply. For the purpose of illustration, assume there is one large ffdf object and one regular data.table that fits in memory:
library(ffbase)
library(data.table)

### Large ffdf object
A <- as.ffdf(data.table(dates.A = seq.Date(as.Date('2008-01-01'), as.Date('2008-01-31'), by = '3 days'),
                        letters.A = LETTERS[1:4], value.A = runif(4)))
### Small data.table that fits in memory
B <- data.table(date.B = seq.Date(as.Date('2008-01-01'), as.Date('2008-01-05'), by = 'days'),
                letters.B = LETTERS[1:4], value.B = runif(4))
Then you can simply define a function that does the merging using data.table and roll = 'nearest':
merge.ff <- function(x) {
  setDT(x)
  x[, ':='(dates.merge = dates.A, letters.merge = letters.A)]
  B[, ':='(dates.merge = date.B, letters.merge = letters.B)]
  setkeyv(x, c('letters.merge', 'dates.merge'))
  setkeyv(B, c('letters.merge', 'dates.merge'))
  as.data.frame(B[x, roll = 'nearest'])
}
and apply it to A:
result <- ffdfdply( A, split = A$dates.A, FUN = merge.ff)
The key was essentially using data.table's roll = 'nearest' join and passing it through ffdfdply. It seemed to be quite efficient.
