Calculate median from data.table columns in R

I am trying to calculate a median value across a number of columns, however my data is a bit funky. It looks like the following example.
library(data.table)
dt <- data.table("ID" = c(1,2,3,4),"none" = c(0,5,5,3),
"ten" = c(3,2,5,4),"twenty" = c(0,2,3,1))
ID none ten twenty
1: 1 0 3 0
2: 2 5 2 2
3: 3 5 5 3
4: 4 3 4 1
In the table, each column represents the number of occurrences of that value (none = 0, ten = 10, twenty = 20). I want to calculate the median occurrence per ID.
For example, for ID = 1,
median(c(10, 10, 10))
is the calculation I want to reproduce.
For ID = 2,
median(c(0, 0, 0, 0, 0, 10, 10, 20, 20))
I have tried using rep() and lapply() with very limited success and would appreciate some clear guidance on how this might be achieved. I understand that for rep() I would have to hard-code the value to be repeated (e.g. rep(0, 2) or rep(10, 2)), and that is expected; I am just struggling to build a list or vector of the repetitions from each column.
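As a side note, the key building block here is that rep() accepts a vector for its times argument, so the whole expanded vector can be built in one call; e.g. for the ID = 2 row:
rep(c(0, 10, 20), times = c(5, 2, 2))
# [1]  0  0  0  0  0 10 10 20 20
median(rep(c(0, 10, 20), times = c(5, 2, 2)))
# [1] 0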

Here's another data.table way (assuming unique ID):
dt[, median(rep(c(0, 10, 20), c(none, ten, twenty))), by=ID]
# ID V1
# 1: 1 10
# 2: 2 0
# 3: 3 10
# 4: 4 10
This is just an attempt to get @eddi's answer without reshaping (which I tend to use as a last resort).

You need a dictionary to translate column names to corresponding numbers, and then it's fairly straightforward:
dict = data.table(name = c('none', 'ten', 'twenty'), number = c(0, 10, 20))
melt(dt, id.var = 'ID')[
dict, on = c(variable = 'name')][, median(rep(number, value)), by = ID]
# ID V1
#1: 1 10
#2: 2 0
#3: 3 10
#4: 4 10

Here's a way that avoids by-row operations and reshaping:
dt[, m := {
  # cumulative sums of the count columns, left to right
  cSD = Reduce(`+`, .SD, accumulate=TRUE)
  # half of each row's total count, rounded down
  k = floor(cSD[[length(.SD)]]/2)
  m = integer(.N)
  # pick the first column whose cumulative count reaches k
  for(i in seq_along(cSD)) {
    left = m == 0L
    if(!any(left)) break
    m[left] = i * (cSD[[i]][left] >= k[left])
  }
  names(.SD)[m]
}, .SDcols=none:twenty]
which gives
ID none ten twenty m
1: 1 0 3 0 ten
2: 2 5 2 2 none
3: 3 5 5 3 ten
4: 4 3 4 1 ten
For the loop, I'm borrowing @alexis_laz's style, e.g. https://stackoverflow.com/a/30513197/
I've skipped translation of the column names, but that's pretty straightforward. You could use c(0,10,20) instead of names(.SD) at the end.
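For example, one way to translate afterwards (a sketch, reusing the m column created above; m_num is just a hypothetical name for the new column):
# named lookup vector mapping each count column to the value it represents
dt[, m_num := unname(c(none = 0, ten = 10, twenty = 20)[m])]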

Here is a rowwise dplyr way:
dt %>% rowwise %>%
do(med = median(c(rep(0, .$none), rep(10, .$ten), rep(20, .$twenty)))) %>%
as.data.frame
med
1 10
2 0
3 10
4 10
Inspired by @Arun's answer, this also works:
dt %>% group_by(ID) %>%
summarise(med = median(rep(c(0, 10, 20), c(none, ten, twenty))))
Source: local data table [4 x 2]
ID med
(dbl) (dbl)
1 1 10
2 2 0
3 3 10
4 4 10

Is there an R function to make x rows equal to a specific row and repeat the operation?

Hello everyone!
Being a beginner with the R software (I think my request is feasible in it), I would like to ask you a question.
In a large Excel-type file, I have a column where the values I am interested in appear only every 193 rows. I would like the previous 192 rows to be set equal to the value in that 193rd position, and so on for every block of 193 rows until the end of the column.
Concretely, here is what I would like to get for this little example:
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 3 2 NA
2 3 NA NA
3 3 NA NA
4 3 NA NA
1 8 4 5
2 8 5 5
3 8 5 5
4 8 5 5
The column "Cluster_ref_INPUT" is partitioned according to the column "Fund_number" (one observation for each fund every month for 4 months). The values that interest me in the INPUT column appear every 4 observations (the value in the 4th month).
Thus, we can see that for each fund number, we find in the column "Expected_output" the values corresponding to the value found in the last line of the column "Cluster_ref_INPUT". (every 4 lines). I think we should partition by "Fund_number" and put that all the lines are equal to the last one... something like that?
Do you have any idea what code I should use to make this work?
I hope that's clear enough. Do not hesitate if I need to clarify.
Thank you very much in advance,
Vanie
Here's a one line solution using data.table:
library(data.table)
exdata <- fread(text = "
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 2 2 NA
2 2 NA NA
3 2 NA NA
4 2 NA NA
1 3 4 5
2 3 5 5
3 3 5 5
4 3 5 5")
# You can read your data directly as a data.table using fread, or convert an existing data.frame with setDT(exdata)
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
> exdata
Month Fund_number Cluster_ref_INPUT Expected_output newvar
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 3 1 1
4: 4 1 1 1 1
5: 1 2 2 NA NA
6: 2 2 NA NA NA
7: 3 2 NA NA NA
8: 4 2 NA NA NA
9: 1 3 4 5 5
10: 2 3 5 5 5
11: 3 3 5 5 5
12: 4 3 5 5 5
There are probably solutions using tidyverse that'll be a lot faster, but here's a solution in base R.
#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
Fund_number = rep(c(1:3), each = 4),
Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
stringsAsFactors = FALSE)
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(), expected_input = c(), stringsAsFactors = FALSE)
#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
  #Subset data pertaining to each unique Fund_number
  curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
  #Take the value of Cluster_ref_Input from the last row
  #And set it as the value for expected_input column for all rows
  curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
  #Append this modified subset to the output container data frame
  outdat <- rbind(outdat, curdat)
  #Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
outdat
# Month Fund_number Cluster_ref_INPUT expected_input
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 3 1
# 4 4 1 1 1
# 5 1 2 2 NA
# 6 2 2 NA NA
# 7 3 2 NA NA
# 8 4 2 NA NA
# 9 1 3 4 5
# 10 2 3 5 5
# 11 3 3 5 5
# 12 4 3 5 5
EDIT: additional solutions + benchmarking
Per OP's comment on this answer, I've presented some faster solutions (dplyr and the data.table solution from the other answer) and also benchmarked them on a 950,004 row simulated dataset similar to the one in OP's example. Code and results below; the entire code-block can be copy-pasted and run directly as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed. (If someone knows a solution based on apply() they're welcome to add it here.)
rm(list = ls())
#Library for benchmarking
library(microbenchmark)
#Dplyr
library(dplyr)
#Data.table
library(data.table)
#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#Data in format for data.table
df_t <- data.table(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#----------------
#Base R solution
#Using a for loop
#Iterate through the list of unique Fund_number values
base_r_func <- function(df) {
  #Create an empty data frame in which the results will be stored
  outdat <- data.frame(Month = c(),
                       Fund_number = c(),
                       Cluster_ref_INPUT = c(),
                       expected_input = c(),
                       stringsAsFactors = FALSE)
  for(i in 1:length(unique(df$Fund_number))){
    #Subset data pertaining to each unique Fund_number
    curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
    #Take the value of Cluster_ref_Input from the last row
    #And set it as the value for expected_input column for all rows
    curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
    #Append this modified subset to the output container data frame
    outdat <- rbind(outdat, curdat)
    #Go to next iteration
  }
  #Remove non-essential looping variables
  rm(curdat, i)
  #This return is needed for the base_r_func function wrapper
  #this code is enclosed in (not necessary otherwise)
  return(outdat)
}
#----------------
#Tidyverse solution
dplyr_func <- function(df){
  df %>% #For actual use, replace this %>% with %<>%
    #and it will write the output back to the input object
    #Group the data by Fund_number
    group_by(Fund_number) %>%
    #Create a new column populated w/ last value from Cluster_ref_INPUT
    mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
  #For this function, we are using
  #df_t (created above)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
dt_func_conv <- function(df){
  #Converting data.frame to data.table format
  df_t <- data.table(df)
  #Logic similar to dplyr solution
  df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
dplyr_func(df),
dt_func(df_t),
dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077 8
# dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905 8
# dt_func(df_t) 38.06384 38.27545 40.94850 38.88269 43.58225 48.04335 8
# dt_func_conv(df) 48.87009 51.13212 69.62772 54.36058 57.68829 181.78970 8
#----------------
As can be seen, using data.table would be the way to go if speed is a necessity. data.table is faster than dplyr and base R even when the overhead of converting a regular data.frame to a data.table is considered (see results of dt_func_conv()).
Edit: following up on Carlos Eduardo Lagosta's comments, using setDT() to coerce the df from a data.frame to a data.table makes the overhead of said coercion close to nil. Code snippet and benchmark values below.
#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
  #Logic similar to dplyr solution
  #setDT() coerces data.frames to the data.table format
  setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163 8
# dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133 8
# dt_func(df_t) 38.21601 38.57787 40.79427 39.53428 43.14732 45.61921 8
# dt_func_conv(df) 41.11210 43.28519 46.72589 46.74063 50.16052 52.32235 8
For the OP specifically: whatever solution you wish to use, the code you're looking for is within the body of the corresponding function. So, for instance, if you want to use the dplyr solution, you would need to take this code and tailor it to your data objects:
df %>% #For actual use, replace this %>% with %<>%
  #and it will write the output back to the input object
  #Group the data by Fund_number
  group_by(Fund_number) %>%
  #Create a new column populated w/ last value from Cluster_ref_INPUT
  mutate(expected_input = last(Cluster_ref_INPUT))
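Finally, regarding the apply()-based note above: one untested base R alternative (not part of the benchmarks, shown only as a sketch against the same df) would be ave(), which applies a function per group via lapply() internally:
# ave() applies FUN to each Fund_number group and writes the result back
# over every row of that group; here FUN returns the group's last value
df$expected_input <- ave(df$Cluster_ref_INPUT, df$Fund_number,
                         FUN = function(v) v[length(v)])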

How do I flag the last observation in a group while maintaining a specific sort order within the group?

This is related to this question. I have data like this:
x t
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
I'd like to flag the last observation in every group (and keep the other observations), defined by x, where the "last" observation is defined by t. I tried this:
dt[order(x, t), flag_last := 1, by = "x", mult = "last"]
but that returns
x t flag_last
1: 1 1 1
2: 1 2 1
3: 1 3 1
4: 2 1 1
5: 2 2 1
6: 2 3 1
The desired output is
x t flag_last
1: 1 1 0
2: 1 2 0
3: 1 3 1
4: 2 1 0
5: 2 2 0
6: 2 3 1
Am I going about this the wrong way?
A couple of caveats:
The actual dataset is roughly 61 GB and there are only a couple of observations per x group, so if possible I'd like to avoid creating another copy with the unique values or creating another copy with dplyr. If that's unavoidable, I'll make do.
Obviously this is simple data. The number of observations within each group is not necessarily the same, and the values for t differ too, so simply picking out t == 3 will not work.
Use the built-in .I like this:
DT[, is.end := .I == last(.I), by = "x"]
dt[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x]
# x t flag_last
# 1: 1 1 0
# 2: 1 2 0
# 3: 1 3 1
# 4: 2 1 0
# 5: 2 2 0
# 6: 2 3 1
One option is to use .N and which.max to check for equality between the row index and the row index at which t is maximized
df[, flag := as.integer(1:.N == which.max(t)), x]
But benchmarking shows replace is faster on my machine for this dataset, and if you don't mind NAs instead of 0s, David Arenburg's suggested method using .I is fastest.
df <- data.table(x = rep(1:1e4, each = 1e4), t = sample(1e4*1e4))
library(microbenchmark)
microbenchmark(
  replace = df[, flag_last := replace(vector(mode = "integer", length = .N), which.max(t), 1L), x],
  use.N = df[, flag := as.integer(1:.N == which.max(t)), x],
  use.max = df[, flag := as.integer(t==max(t)), x],
  use.I = {
    df[, flag := 0L]
    df[df[, .I[which.max(t)], by = x]$V1, flag := 1L]
  },
  use.I.no0 = df[df[, .I[which.max(t)], by = x]$V1, flag := 1L],
  times = 20)
# Unit: seconds
# expr min lq mean median uq max neval cld
# replace 1.228490 1.292348 1.442919 1.443021 1.578300 1.659990 20 b
# use.N 1.439939 1.522782 1.617104 1.574932 1.696046 1.923207 20 c
# use.max 1.405741 1.436817 1.596363 1.502337 1.663895 2.743942 20 c
# use.I 1.497599 1.547276 1.574657 1.564789 1.589066 1.686353 20 bc
# use.I.no0 1.080715 1.115329 1.162752 1.145145 1.182280 1.383989 20 a
This would do the trick, if you create an id variable that you can then use to merge the two datasets together:
library(dplyr)
x <- c(1,1,1,2,2,2)
t <- c(1,2,3,1,2,3)
id <- as.character(c(1,2,3,4,5,6))
data <- data.frame(x,t, id)
You create a sliced dataset with the max value of each group, and then you merge it back to the initial dataframe.
sliced <- data %>%
group_by(x) %>%
slice(which.max(t)) %>%
ungroup() %>%
select(id) %>%
mutate(max = "max_group")
tot <- left_join(data, sliced, c("id"))
The sliced df has only two variables, so it should not be too bad to work with. This is the easiest way that came to my mind.
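If a 0/1 flag column like the one in the question is ultimately wanted, the marker could presumably be converted afterwards (a sketch, reusing the tot object and its max column from above):
# rows that matched the sliced data carry "max_group", all others are NA
tot$flag_last <- as.integer(!is.na(tot$max))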

Compare the first and last observation in each group

I have a dataset like this:
df <- data.frame(group = c(rep(1,3),rep(2,2), rep(3,3),rep(4,3),rep(5, 2)), score = c(30, 10, 22, 44, 50, 5, 20, 1,35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 2 44
5 2 50
6 3 5
7 3 20
8 3 1
9 4 35
10 4 2
11 4 60
12 5 14
13 5 5
I wish to compare the first score and last score in each group; if the last score is smaller than the first score, then output the group number. The expected output should look like:
group 1 3 5
Does anyone have an idea how to realize this?
Here's a data.table approach:
library(data.table)
setDT(df)[, score[1] > score[.N], by = group][V1 == TRUE]
## group V1
## 1: 1 TRUE
## 2: 3 TRUE
## 3: 5 TRUE
Or
setDT(df)[, group[score[1] > score[.N]], by = group]
## group V1
## 1: 1 1
## 2: 3 3
## 3: 5 5
Or
setDT(df)[, .BY[score[1] > score[.N]], by = group]
As per @beginneR's comment, if you don't like V1 you could do
df2 <- as.data.table(df)[, .BY[score[1] > score[.N]], by = group][, V1 := NULL]
df2
## group
## 1: 1
## 2: 3
## 3: 5
This should do the job:
# First split the data frame by group
# This returns a list
df.split <- split(df, factor(df$group))
# Now use sapply on the list to check first and last of each group
# We return the group or NA using ifelse
res <- sapply(df.split,
function(x){ifelse(x$score[1] > x$score[nrow(x)], x$group[1], NA)})
# Finally, filter away the NAs
res <- res[!is.na(res)]
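For the example data, the filtered res should end up as a named numeric vector pointing at groups 1, 3 and 5:
res
# 1 3 5 
# 1 3 5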
This answer assumes that every group has at least 2 observations:
newdf <- merge(rbind(df[diff(df$group) == 1 ,] , df[dim(df)[1], ]),
df[!duplicated(df$group), ],
by="group")
newdf[which(newdf$score.x < newdf$score.y), 'group']
#[1] 1 3 5
df[diff(df$group) == 1 ,] identifies the last observation of each group, except for the last group, which is why I rbind the last entry (i.e. df[dim(df)[1], ]). Then, the first observation of each group is given by df[!duplicated(df$group), ]. We merge these on the group column, then identify which ones meet the criteria.
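For the example data, the two pieces being merged can be inspected directly (shown here only to illustrate the explanation above):
rbind(df[diff(df$group) == 1, ], df[dim(df)[1], ])  # last row of each group
#    group score
# 3      1    22
# 5      2    50
# 8      3     1
# 11     4    60
# 13     5     5
df[!duplicated(df$group), ]                         # first row of each group
#    group score
# 1      1    30
# 4      2    44
# 6      3     5
# 9      4    35
# 12     5    14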
Another option for the merge step:
merge(df[which(!duplicated(df$group))+(rle(df$group)$lengths-1),],
df[!duplicated(df$group), ],
by="group")
One more base R option:
with(df, unique(df$group[as.logical(ave(score, group, FUN = function(x) head(x,1) > tail(x, 1)))]))
#[1] 1 3 5
Or using dplyr:
library(dplyr)
group_by(df, group) %>% filter(first(score) > last(score)) %>% do(head(.,1)) %>%
select(group)
# group
#1 1
#2 3
#3 5
I'm a plyr package fan...
library(plyr)
df1<-ddply(df,.(group),summarise,shown=score[length(group)]<score[1])
subset(df1,shown)
group shown
1 TRUE
3 TRUE
5 TRUE

R: Calculating offset differences between elements in data frame with the same identifier

Below is a subset of my data:
> head(dt)
name start end
1: 1 3195984 3197398
2: 1 3203519 3205713
3: 2 3204562 3207049
4: 2 3411782 3411982
5: 2 3660632 3661579
6: 3 3638391 3640590
dt <- data.frame(name = c(1, 1, 2, 2, 2, 3), start = c(3195984,
3203519, 3204562, 3411782, 3660632, 3638391), end = c(3197398,
3205713, 3207049, 3411982, 3661579, 3640590))
I want to calculate another value: the difference between the end coordinate of line n and the start coordinate of line n+1, but only if both lines share a name. To elaborate, this is what I want the resulting data frame to look like:
name start end dist
1: 1 3195984 3197398
2: 1 3203519 3205713 -6121
3: 2 3204562 3207049
4: 2 3411782 3411982 -204733
5: 2 3660632 3661579 -248650
6: 3 3638391 3640590
The reason I want to do this is that I'm looking for dist values that are positive. One way I've tried this is to offset the start and end coordinates but then I run into a problem where I am comparing things with different names.
How does one do this in R?
A data.table solution may be good here:
library(data.table)
dt <- as.data.table(dt)
dt[, dist := c(NA, end[-(length(end))] - start[-1]) , by=name]
dt
# name start end dist
#1: 1 3195984 3197398 NA
#2: 1 3203519 3205713 -6121
#3: 2 3204562 3207049 NA
#4: 2 3411782 3411982 -204733
#5: 2 3660632 3661579 -248650
#6: 3 3638391 3640590 NA
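As a side note, more recent data.table versions (1.9.6+) also provide a shift() helper that can express the same lag a bit more compactly; a sketch that should give the same result:
# shift(end) is the previous row's end within each name group (NA for the first row)
dt[, dist := shift(end) - start, by = name]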
Assuming your data is sorted, you can also do it with base R functions:
dt$dist <- unlist(
by(dt, dt$name, function(x) c(NA, x$end[-(length(x$end))] - x$start[-1]) )
)
Using dplyr (with credit to @thelatemail for the calculation of dist):
library(dplyr)
dat.new <- dt %>%
  group_by(name) %>%
  mutate(dist = c(NA, end[-(length(end))] - start[-1]))
Here is a different dplyr solution:
dt %>% group_by(name) %>% mutate(dist = lag(end) - start)
giving:
Source: local data frame [6 x 4]
Groups: name
name start end dist
1 1 3195984 3197398 NA
2 1 3203519 3205713 -6121
3 2 3204562 3207049 NA
4 2 3411782 3411982 -204733
5 2 3660632 3661579 -248650
6 3 3638391 3640590 NA

In R, find duplicated dates in a dataset and replace their associated values with their mean

I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean of id #1 is 3 and of id #2 is 2, and these means would then replace each of the original var values.
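Written out, the replaced example would presumably look like this (before any later step that drops duplicate rows):
# id var
#  1   3
#  1   3
#  2   2
#  2   2
#  3   5
#  4   2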
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check whether the ids are duplicated; you don't need to, since for non-duplicated ids you can just use the mean of the single var value:
duplicated_ids = unique(z$id[duplicated(z$id)])  # not actually used by the ddply call below
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    res_df = data.frame(id=df_section$id[1], var=mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
There is no need to treat unique values differently than duplicated values as the mean of a single argument is the argument.
zt<-aggregate(var~id,data=z,mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
