extracting data using dplyr - r

Say I have the following data
set.seed(123)
a <- c(rep(1,30),rep(2,30))
b <- rep(1:30)
c <- sample(20:60, 60, replace = T)
data <- data.frame(a,b,c)
data
Now I want to extract data whereby:
For each unique value of a, extract/match data where the b value is the same and the c value is within a limit of +-5
so a desired output should produce:

You want to compare the c values within each distinct b group (each b value is unique within each a), so you should group by b. Grouping by a would not let you compare across the two a groups, so a possible solution would be
library(dplyr)
data %>%
group_by(b) %>%
filter(abs(diff(c)) <= 5)
With the data.table package this would be something like
library(data.table)
setDT(data)[, .SD[abs(diff(c)) <= 5], b]
Or
data[, if (abs(diff(c)) <= 5) .SD, b]
Or
# .I gives the row numbers of the groups that pass the test
data[data[, .I[abs(diff(c)) <= 5], b]$V1]
In base R it would be something like
data[with(data, !!ave(c, b, FUN = function(x) abs(diff(x)) <= 5)), ]
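To see which groups these filters keep, here is a minimal base R sketch on a small hand-made data frame (toy and its values are made up for illustration and are not the question's data). It uses the same ave() idea: within each b, the two c values are compared, and the whole group is kept only if they differ by at most 5.

```r
# Tiny hand-made data: two values of a, three values of b
toy <- data.frame(a = c(1, 1, 1, 2, 2, 2),
                  b = c(1, 2, 3, 1, 2, 3),
                  c = c(20, 40, 50, 24, 50, 55))

# For each b, flag the group (as 1/0) when its two c values differ by <= 5
keep <- ave(toy$c, toy$b, FUN = function(x) abs(diff(x)) <= 5) == 1

kept <- toy[keep, ]
kept
```

Here b = 1 (gap of 4) and b = 3 (gap of 5) survive, while b = 2 (gap of 10) is dropped.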

Related

Sample from specific rows in a dataframe column [duplicate]

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
My objective is to subset data table dt by sampling one row for each letter a-h in column z.
OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]
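The row-index idea can also be sketched in base R, which may make the data.table one-liner easier to follow. This is only an analogue on assumed toy data (toy_df and v are hypothetical names): split the row numbers by group, draw one row number per group, then subset once.

```r
set.seed(1)
toy_df <- data.frame(z = rep(c("a", "b", "c"), times = c(2, 3, 4)),
                     v = 1:9)

# Row numbers of toy_df, split by the grouping column
rows_by_group <- split(seq_len(nrow(toy_df)), toy_df$z)

# Draw one row number per group; indexing with sample.int() avoids the
# sample(x, 1) pitfall when a group holds a single row number
picked <- vapply(rows_by_group,
                 function(i) i[sample.int(length(i), 1)],
                 integer(1))

one_per_group <- toy_df[picked, ]
one_per_group
```

Because only row numbers are sampled, all other columns come along unchanged when the data frame is subset.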
You can use dplyr
library(dplyr)
dt %>%
group_by(z) %>%
sample_n(1)
I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
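The same shuffle-then-deduplicate idea works on a plain data frame with duplicated(), which keeps the first row seen for each key. A small sketch with made-up data (shuffle_df is a hypothetical name):

```r
set.seed(42)
shuffle_df <- data.frame(z = rep(letters[1:4], times = c(3, 2, 5, 4)),
                         v = 1:14)

# Shuffle the rows, then keep the first row encountered for each z
shuffled <- shuffle_df[sample(nrow(shuffle_df)), ]
first_per_group <- shuffled[!duplicated(shuffled$z), ]
first_per_group
```

Since the row order is random before deduplication, the first row of each group is a random member of that group.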
Below is an example on a bigger data.table, grouping by 3 columns. Compared with @akrun's solution, it seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# @akrun's solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here
Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

Join 2 data frames using data.table with conditions

I have these two data frames:
set.seed(42)
A <- data.table(station = sample(1:10, 1000, replace=TRUE),
hash = sample(letters[1:5], 1000, replace=TRUE),
point = sample(1:24, 1000, replace=TRUE))
B <- data.table(station = sample(1:10, 100, replace=TRUE),
card = sample(letters[6:10], 100, replace=TRUE),
point = sample(1:24, 100, replace=TRUE))
Dataframe A contains more than 1M rows.
I am trying to find the hash (from A) for each card (from B), subject to some conditions: the station and point values in A must lie within a range of those in B (station +- 1, and point up to + 2).
I group B by card and, for each group, run a function that binds the rows of A matching these conditions and returns the hash with the highest frequency.
detect <- function(x){
am0 <- data.frame(station = 0,
hash = 0,
point = 0)
for (i in 1:nrow(x)) {
am1 <- A %>%
filter(station %in% (B$station[i] - 1) : (B$station[i] + 1) &
point > B$point[i] & point < B$point[i] + 2)
am0 <- rbind(am0, am1)
}
t <- as.data.frame(table(am0$hash))
t <- t %>%
arrange(-Freq) %>%
filter(row_number() == 1)
return(t)
}
And then just:
library(dplyr)
B %>%
group_by(card) %>%
do(detect(.)) %>%
ungroup
But I don't know how to apply the function to each group using the indices [i], so I actually get a wrong result.
# A tibble: 5 x 3
card Var1 Freq
<chr> <fctr> <int>
1 f c 46
2 g c 75
3 h c 41
4 i c 64
5 j c 62
I'm a beginner, but I know the best solution for big datasets is to use the data.table library to join two datasets like these. Can you help me find a solution?
I think what you want to do is:
#### Prepare join limits
B[, point_limit := as.integer(point + 2)]
B[, station_lower := as.integer(station - 1)]
B[, station_upper := as.integer(station + 1)]
## Join A on B, creates All combinations of points in A and B fulfilling the conditions
joined_table <- B[A,
, on = .( point_limit >= point, point <= point,
station_lower <= station, station_upper >= station),
nomatch = 0,
allow.cartesian=TRUE]
## Count the occurrences of the combinations
counted_table <- joined_table[,.N, by=.(card,hash)][order(card, -N)]
## Select the top for each group.
counted_table[, head(.SD, 1 ),by = .(card)][order(card)]
This will create a full table with all the information in it and then do the counting on that. It relies purely on data.table in order to take full advantage of the speed gains from that package. The data.table vignette is good if you are unfamiliar with the syntax. The nomatch = 0 argument ensures that we are doing an inner join.
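To check what the join conditions select, here is a brute-force base R sketch on tiny made-up tables (A_small, B_small and their values are hypothetical). It builds every pair of rows, filters with the same inclusive limits as the join, and then counts. This is meant for verifying the logic on small inputs only, not as a replacement for the data.table join on 1M rows.

```r
A_small <- data.frame(station = c(1, 2, 5, 5),
                      hash    = c("a", "a", "b", "c"),
                      point   = c(10, 11, 3, 20))
B_small <- data.frame(station = c(2, 5),
                      card    = c("f", "g"),
                      point   = c(10, 2))

# Cartesian product; shared column names get the .A / .B suffixes
combos <- merge(A_small, B_small, by = NULL, suffixes = c(".A", ".B"))

# Same inclusive limits as the join: station within 1 of B's station,
# and A's point between B's point and B's point + 2
hits <- combos[abs(combos$station.A - combos$station.B) <= 1 &
               combos$point.A >= combos$point.B &
               combos$point.A <= combos$point.B + 2, ]

# Count (card, hash) combinations and keep the most frequent hash per card
counts <- as.data.frame(table(card = hits$card, hash = hits$hash),
                        stringsAsFactors = FALSE)
counts <- counts[order(counts$card, -counts$Freq), ]
top <- counts[!duplicated(counts$card), ]
top
```

In this toy input, card f matches two rows with hash a, and card g matches one row with hash b.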
This will probably be fine if A is only 1M rows and B is kept the same size, depending on the distribution of your data. We can, however, also split B in a way similar to your do statement, using the package purrr. I'm not sure how this interacts with R's garbage collection, however.
frame_list <- purrr::map(unique(B$card),
~ B[card == .x][A,
, on = .(point_limit >= point,
point <= point,
station_lower <= station,
station_upper >= station),
nomatch = 0,
allow.cartesian = TRUE][, .N, by = .(card, hash)])
counted_table_mem <- rbindlist(frame_list )
Something to note here is that I use rbindlist instead of multiple rbind calls. Repeatedly calling rbind is very slow, since new memory has to be allocated each time.
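The same pattern applies outside data.table: collect the pieces in a list and bind once at the end. A minimal base R sketch of the idea, with do.call(rbind, ...) standing in for rbindlist (the pieces list is made up for illustration):

```r
# Build each piece separately and keep them in a list
pieces <- lapply(1:3, function(i) data.frame(id = i, value = i^2))

# Bind once at the end instead of growing a data frame inside a loop
combined <- do.call(rbind, pieces)
combined
```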

Removing infrequent rows in a data frame

Let's say I have a following very simple data frame:
a <- rep(5,30)
b <- rep(4,80)
d <- rep(7,55)
df <- data.frame(Column = c(a,b,d))
What would be the most generic way of removing all rows whose value appears fewer than 60 times?
I know you could say "in this case it's just a", but in my real data there are many more frequencies, so I wouldn't want to specify them one by one.
I was thinking of writing a loop such that if length() of an 'i' is smaller than 60, these rows will be deleted, but perhaps you have other ideas. Thanks in advance.
A solution using dplyr.
library(dplyr)
df2 <- df %>%
group_by(Column) %>%
filter(n() >= 60)
Or a solution from base R
uniqueID <- unique(df$Column)
targetID <- sapply(split(df, df$Column), function(x) nrow(x) >= 60)
df2 <- df[df$Column %in% uniqueID[targetID], , drop = FALSE]
We create a frequency table and then subset the rows based on the 'count' of values in 'Column'
tbl <- table(df$Column) >=60
subset(df, Column %in% names(tbl)[tbl])
Or with ave from base R
df[with(df, ave(Column, Column, FUN = length)>=60),]
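For reference, the frequency-table approach can be traced step by step on the question's data. This is a sketch in base R; freq_df, frequent and kept are names chosen here for illustration.

```r
freq_df <- data.frame(Column = c(rep(5, 30), rep(4, 80), rep(7, 55)))

# How often does each value occur?
counts <- table(freq_df$Column)
counts

# Values that occur at least 60 times (table names are character strings)
frequent <- names(counts)[counts >= 60]

# Keep only the rows holding one of those values
kept <- freq_df[freq_df$Column %in% frequent, , drop = FALSE]
nrow(kept)
```

Only the value 4 occurs at least 60 times here, so the 80 rows holding it are the ones retained.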
Or we use data.table
library(data.table)
setDT(df)[, .SD[.N >= 60], Column]
Or another option with data.table is .I
setDT(df)[df[, .I[.N >=60], Column]$V1]
If there is more than one column to group by, place them in a list (or, more compactly, .())
setDT(df)[df[, .I[.N >=60], by = .(Column1, Column2)]$V1]
If there are many columns, we can also pass as a character string or object
colnms <- paste0("Column", 1:5)
setDT(df)[df[, .I[.N >=60], by = c(colnms)]$V1]
Using data.table
library(data.table)
setDT(df)
df[Column %in% df[, .N, by = Column][N >= 60, Column]]
There is also a variant to Eric Watt's answer which uses a join instead of %in%:
library(data.table)
setDT(df)
df[df[, .N, by = Column][N >= 60, .(Column)], on = "Column"]

How to compare two columns in R data frame and return 0 or 1 in the third column based on the comparison?

I have a dataframe with two columns (both are dates) and a million rows. I have to compare the two dates and return a value in the third column, i.e. if the date in column A is greater than the date in column B, return 1 in column C.
Thanks in advance :)
In base:
DF$C <- as.numeric(DF$A > DF$B)
In dplyr:
DF %>%
mutate(C = as.numeric(A > B))
library(data.table)
dt <- as.data.table(dt)
dt$A <- as.Date(dt$A)
dt$B <- as.Date(dt$B)
Here are two ways you can try:
dt[, C := ifelse(A > B, 1, 0)]
or
dt[, C := 0][A > B, C := 1]
In the second way, you can switch to dt[, C := 1][A <= B, C := 0], depending on which condition matches fewer observations.
Maybe you need to provide a little reproducible example.
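Since the question has no sample data, here is a small made-up example of the base R approach (date_df and its dates are hypothetical). It assumes both columns are already of class Date, so that > compares them chronologically.

```r
date_df <- data.frame(A = as.Date(c("2020-01-05", "2020-03-01", "2020-06-15")),
                      B = as.Date(c("2020-01-01", "2020-03-01", "2020-07-01")))

# TRUE/FALSE from the comparison becomes 1/0
date_df$C <- as.numeric(date_df$A > date_df$B)
date_df
```

Only the first row has A later than B, so C is 1 there and 0 in the other two rows (equal dates count as not greater).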

Mean row by imbricated levels of factors

I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!
Try:
You can use either of the specialized packages dplyr or data.table, or base R. Because you have a lot of columns starting with acc to take the mean of, I chose dplyr. Here, the idea is to first group by id and then use summarise_each to get the mean, by id, of each column that starts_with acc
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
I would rename using paste
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
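summarise_each() and funs() are deprecated in current dplyr releases. A sketch of the same computation with across(), assuming dplyr 1.0.0 or later, is shown below; the data frame scores is made up here to keep the example self-contained, and the .names pattern does the renaming in the same step.

```r
library(dplyr)

set.seed(1)
scores <- data.frame(id   = rep(c("A", "B", "C", "D"), each = 4),
                     sub  = rep(1:4, 4),
                     acc1 = runif(16, 0, 3),
                     acc2 = runif(16, 0, 3),
                     acc3 = runif(16, 0, 3),
                     acc4 = runif(16, 0, 3))

# Mean of every acc* column within each id; "mean{.col}" prefixes the names
score_means <- scores %>%
  group_by(id) %>%
  summarise(across(starts_with("acc"),
                   ~ mean(.x, na.rm = TRUE),
                   .names = "mean{.col}"))
score_means
```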
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD stands for Subset of Data.table, and .SDcols are the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1
(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all the columns of its first argument within groups defined by its second argument:
aggregate(df[-(1:2)], df[1],mean)
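If you want to prepend the letters "mean" to the column names (assuming the aggregate result was stored in df2):
names(df2)[-1] <- paste0("mean", names(df2)[-1]) # [-1] leaves the id column name unchanged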
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)
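Putting the aggregate steps together on a small made-up data frame (acc_df and acc_means are hypothetical names), a sketch might look like this. The pattern is anchored with ^ so that only columns whose names start with acc are selected.

```r
set.seed(1)
acc_df <- data.frame(id   = rep(c("A", "B"), each = 3),
                     sub  = rep(1:3, 2),
                     acc1 = runif(6, 0, 3),
                     acc2 = runif(6, 0, 3))

# Pick the acc columns by name pattern and average them within each id
acc_means <- aggregate(acc_df[grepl("^acc", names(acc_df))],
                       by = acc_df["id"], FUN = mean)

# Prefix only the value columns; the grouping column keeps its name
names(acc_means)[-1] <- paste0("mean", names(acc_means)[-1])
acc_means
```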
Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").
You can do this in base package itself using this:
df$id <- factor(df$id) # levels() needs a factor; since R 4.0.0 data.frame() keeps strings as character
a <- list()
for (i in seq_len(nlevels(df$id)))
{
a[[i]] = colMeans(subset(df, id==levels(df$id)[i])[,c(3,4,5,6)]) ## select the columns of df whose means you want; in your example these are columns 3, 4, 5 and 6
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow=4, ncol=4, byrow=T)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865
