Per group, select the first row and another row which matches a condition in R

Let's say I have the following data.table:
x <- data.table(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
                b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
And, after grouping by b, I want to select the rows which:
- are the first row of the group
- have the highest a in the group
If a single row satisfies both conditions, it should only be selected once (that group will then contribute only one row).
Each of these selections is trivial:
x[, .SD[1], by = b] # selects first row per group
# b a
# 1: 1 1
# 2: 2 2
# 3: 3 10
x[, .SD[which.max(a)], by = b] # selects row with the highest 'a' in the group
# b a
# 1: 1 3
# 2: 2 7
# 3: 3 10
But I can't figure out how to do both at once (obviously .SD[1 | which.max(a)] doesn't work). I could perform them separately and then rbindlist the results (see the sketch after the expected output below), but I'd like to know if there's a simpler way.
For clarity, in the case above, the expected output would be (different order is also acceptable):
b a
1: 1 1
2: 1 3
3: 2 2
4: 2 7
5: 3 10
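The separate-then-combine version I have in mind would be something like this (it works, but needs an explicit deduplication step, hence my question):
res <- rbindlist(list(
  x[, .SD[1], by = b],             # first row per group
  x[, .SD[which.max(a)], by = b]   # row with the highest 'a' per group
))
unique(res)  # drops the row selected twice in group 3 (deduplicates on all columns)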

One option is to concatenate the index 1 (for the first row) with the result of which.max (which also returns a numeric index), take the unique of those indices (in case which.max also returns 1), and use them to subset the data.table (.SD):
x[, .SD[unique(c(1, which.max(a)))], by = b]
# b a
#1: 1 1
#2: 1 3
#3: 2 2
#4: 2 7
#5: 3 10
Or use .I to build the row indices per group and then subset the original table directly, which avoids materializing .SD:
x[x[, .I[unique(c(1, which.max(a)))], by = b]$V1]

Here is how I would do it in dplyr:
library(dplyr)
x <- data.frame(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
                b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
x %>% group_by(b) %>% filter(row_number() == 1 | a == max(a))
Output
# A tibble: 5 x 2
# Groups:   b [3]
      a     b
  <dbl> <dbl>
1     1     1
2     3     1
3     2     2
4     7     2
5    10     3

If you only have those two columns, just take the union of the two aggregated tables:
funion(
  x[, lapply(.SD, max), by = b],
  x[, lapply(.SD, first), by = b]
)
I suspect max is more efficient than your which.max, since it is GForce-optimized (see ?GForce).
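If there are more columns, a sketch of the same union idea applied to whole rows (my own extension, not part of the original answer):
funion(
  x[, .SD[1], by = b],             # first row per group
  x[, .SD[which.max(a)], by = b]   # max-'a' row per group
)
funion itself removes rows that appear in both selections, so no explicit unique() is needed.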

Related

Remove rows of a data frame from another data frame but keep duplicates in R

I'm working in R and I have two dataframes: one is the base dataframe, and the other holds the rows that I need to remove from the base one. But I can't use the setdiff() function, because it removes duplicated rows. Here's an example:
a <- data.frame(var1 = c(1, NA, 2, 2, 3, 4, 5),
                var2 = c(1, 7, 2, 2, 3, 4, 5))
b <- data.frame(id = c(2, 4),
                numero = c(2, 4))
And the result must be:
var1 var2
   1    1
  NA    7
   2    2
   3    3
   5    5
It must be an efficient algorithm, too, because the base dataframe has 3 million rows with 26 columns.
We may need to create a sequence column in both tables before joining, so that duplicated rows are matched one-for-one:
library(data.table)
# number duplicate rows within each value pair, then anti-join
# on the values plus that counter, and drop the counter afterwards
setDT(a)[, rn := rowid(var1, var2)][
  !setDT(b)[, rn := rowid(id, numero)],
  on = .(var1 = id, var2 = numero, rn)][, rn := NULL][]
Output:
var1 var2
<num> <num>
1: 1 1
2: NA 7
3: 2 2
4: 3 3
5: 5 5
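For comparison, the same idea in dplyr (a sketch; row_number() within groups plays the role of rowid()):
library(dplyr)
a2 <- a %>% group_by(var1, var2) %>% mutate(rn = row_number()) %>% ungroup()
b2 <- b %>% group_by(id, numero) %>% mutate(rn = row_number()) %>% ungroup()
anti_join(a2, b2, by = c("var1" = "id", "var2" = "numero", "rn")) %>% select(-rn)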

Transform subject ID across groups that vary in size

A MWE is as follows:
I have 3 groups with 2, 4, and 3 subjects respectively. So I have:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2)
df <- rbind(Group, Subject_ID)
Since the subjects in different groups are different people, I want the subject ID to be unique for each subject in the dataset. What I did was as follows:
Num_Subjects <- c(length(unique(filter(df, Group == 1)$Subject)),
                  length(unique(filter(df, Group == 2)$Subject)),
                  length(unique(filter(df, Group == 3)$Subject)))
# Then I defined a summation function to calculate how many subjects there are in all previous groups.
sumfun <- function(x, start, end) {
  return(sum(x[start:end]))
}
# Then I defined another function that generates a new subject ID for each subject in each group.
SubjIDFn <- function(x, i) {
  x %>% filter(Session == i) %>% mutate(
    Sujbect = Subject + sumfun(Num_Subjects, 1, i - 1)
  )
}
# Then I loop this from group 2 to group 3,
for (i in 2:3) {
  df.Corruption.WithoutS1 <- SubjIDFn(df.Corruption.WithoutS1, i)
}
Then the data set has zero observations. I don't know where it went wrong, and I don't know what a smarter solution to this problem would be. Thanks for your help!
I think you're overcomplicating it a bit... If Subject_ID is unique within groups, you may just go with:
library(dplyr)
Group <- c(1, 1, 2, 2, 2, 2, 3, 3, 3)
Subject_ID <- c(1, 2, 1 ,2, 3, 4, 1, 2, 3)
df <- bind_cols(Group=Group, Subject_ID=Subject_ID)
df %>% mutate(unique_id = paste(Group, Subject_ID, sep="."))
# A tibble: 9 x 3
  Group Subject_ID unique_id
  <dbl>      <dbl> <chr>
1     1          1 1.1
2     1          2 1.2
3     2          1 2.1
4     2          2 2.2
5     2          3 2.3
6     2          4 2.4
7     3          1 3.1
8     3          2 3.2
9     3          3 3.3
Note that I used bind_cols instead of rbind to have a dataframe instead of a matrix.
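If a consecutive numeric ID is preferred (closer to what the summation approach in the question was aiming for), a sketch using cur_group_id() from dplyr >= 1.0.0:
df %>%
  group_by(Group, Subject_ID) %>%
  mutate(unique_id = cur_group_id()) %>% # one integer per (Group, Subject_ID) pair
  ungroup()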

Combining 3 versions of same table together in R

I scraped some data from a website but it was really janky and for some reason had little mistakes in it. So, I scraped the same data 3 times, and produced 3 tables that look like:
library(data.table)
df1 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing = c(2, 1, 3, 4),
                  otherthing = c(2, 1, 3, 4))
df2 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing = c(1, 1, 1, 4),
                  otherthing = c(2, 2, 3, 4))
df3 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                  id = c(1, 2, 3, 4),
                  thing = c(1, 1, 3, 4),
                  otherthing = c(2, 1, 3, 3))
Except I have many more columns. I want to combine the 3 tables together, and when the values for "thing", "otherthing", etc. conflict, I want it to pick the value that appears in at least 2 of the 3 tables, and perhaps return NA if no value does. I'm confident the "name" and "id" fields are good, and they're what I want to merge on.
I was considering setting the names to "thing1", "thing2", and "thing3" in the 3 tables respectively, merging them together, and then writing some loops through the names. Is there a more elegant solution? It needs to work for 300+ value columns, although I'm not super worried about speed.
In this example, I think the solution should be:
final_result <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
                           id = c(1, 2, 3, 4),
                           thing = c(1, 1, 3, 4),
                           otherthing = c(2, 1, 3, 4))
To generalize the approach from @IceCreamToucan, we can use:
library(dplyr)
n_mode <- function(...) {
  x <- table(c(...))
  # return the value that appears at least twice; NA if all three disagree
  if (any(x > 1)) as.numeric(names(x)[which.max(x)])
  else NA
}
bind_rows(df1, df2, df3) %>%
group_by(name, id) %>%
summarise_all(funs(n_mode(.)))
N.B. Be careful with your namespace and how you name the function... preferring something like n_mode() avoids conflicts with base::mode(). Finally, if you extend this to more data.frames, you probably want to put them in a list. If that's not possible/practical, you could replace the bind_rows call with purrr::map_df(ls(pattern = "^df[[:digit:]]+"), get).
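For illustration (examples of mine, not from the original answer), n_mode() returns the value that appears at least twice and NA when all three scrapes disagree:
n_mode(c(1, 1, 2)) # 1: the value 1 appears in 2 of the 3 tables
n_mode(c(1, 2, 3)) # NA: no value reaches the 2/3 majority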
A data.table version of Jason's solution (you should leave his as the accepted answer):
library(data.table)
n_mode <- function(x) {
  x <- table(x)
  # return the value that appears at least twice; NA otherwise
  if (any(x > 1)) as.numeric(names(x)[which.max(x)])
  else NA
}
my_list <- list(df1, df2, df3)
rbindlist(my_list)[, lapply(.SD, n_mode), .(name, id)]
# name id thing otherthing
# 1: adam 1 1 2
# 2: bob 2 1 1
# 3: carl 3 3 3
# 4: dan 4 4 4
Here's the output of rbindlist. Hopefully this makes it clearer why taking n_mode of all the columns, grouped by name and id, gives the output you want.
rbindlist(my_list)[order(name, id)]
# name id thing otherthing
# 1: adam 1 2 2
# 2: adam 1 1 2
# 3: adam 1 1 2
# 4: bob 2 1 1
# 5: bob 2 1 2
# 6: bob 2 1 1
# 7: carl 3 3 3
# 8: carl 3 1 3
# 9: carl 3 3 3
# 10: dan 4 4 4
# 11: dan 4 4 4
# 12: dan 4 4 3

Row-wise sum for columns with certain names

I have the following sample data:
SampleID a b d f ca k l cb
1 0.1 2 1 2 7 1 4 3
2 0.2 3 2 3 4 2 5 5
3 0.5 4 3 6 1 3 9 2
I need to find the row-wise sum of columns which have something in common in their names, e.g. the row-wise sum(a, ca) or sum(b, cb). The problem is that I have a large data.frame, and ideally I would like to specify only what is common in the column headers, so that the code picks just those columns to sum.
Thanks in advance for any assistance.
We can select the columns that have 'a' in their names with grep, subset those columns, and do rowSums; the same works for the 'b' columns. The [-1] and + 1 skip the SampleID column:
rowSums(df1[grep('a', names(df1)[-1]) + 1])
rowSums(df1[grep('b', names(df1)[-1]) + 1])
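Wrapped in a small helper (sum_by_pattern is a hypothetical name, not from the original answer), this generalizes to any pattern, still assuming the ID column comes first:
sum_by_pattern <- function(data, pat) {
  # row-wise sum over all columns (except the first, the ID) whose names match pat
  rowSums(data[grep(pat, names(data)[-1]) + 1])
}
sum_by_pattern(df1, 'a') # sums a and ca
sum_by_pattern(df1, 'b') # sums b and cb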
If you want the output as a data frame, try using dplyr
# Recreating your sample data
df <- data.frame(SampleID = c(1, 2, 3),
                 a = c(0.1, 0.2, 0.5),
                 b = c(2, 3, 4),
                 d = c(1, 2, 3),
                 f = c(2, 3, 6),
                 ca = c(7, 4, 1),
                 k = c(1, 2, 3),
                 l = c(4, 5, 9),
                 cb = c(3, 5, 2))
Process the data
# load dplyr
library(dplyr)
# Sum across columns 'a' and 'ca' (sum(a, ca))
df2 <- df %>%
  select(contains('a'), -SampleID) %>% # choose the columns to sum
  mutate(row_sum = rowSums(.))         # add their row-wise sum as a new column
# (use transmute() instead of mutate() to drop the selected columns)
df2 # have a look
a ca row_sum
1 0.1 7 7.1
2 0.2 4 4.2
3 0.5 1 1.5

How to count the number of times an element appears consecutively in a data.table?

I have a data.table that looks like this
ID, Order, Segment
1, 1, A
1, 2, B
1, 3, B
1, 4, C
1, 5, B
1, 6, B
1, 7, B
1, 8, B
Basically, after ordering the data by the Order column, I would like to count the number of consecutive B's for each ID. Ideally the output I would like is:
ID, Consec
1, 2
1, 4
This is because segment B appears consecutively in rows 2 and 3 (2 times), and then again in rows 5, 6, 7, and 8 (4 times).
The loop solution is quite obvious but would also be very slow.
Are there elegant solutions in data.table that are also fast?
P.S. The data I am dealing with has ~20 million rows.
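For reference, the example data as a data.table (a construction assumed from the table above; DT matches the name used in the answers below):
library(data.table)
DT <- data.table(ID      = rep(1, 8),
                 Order   = 1:8,
                 Segment = c('A', 'B', 'B', 'C', 'B', 'B', 'B', 'B'))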
Try marking each run of identical Segment values with rleid() and counting the rows per run:
library(data.table) # v1.9.5+
DT[order(ID, Order)][, indx := rleid(Segment)][Segment == 'B',
   list(Consec = .N), by = list(indx, ID)][, indx := NULL][]
# ID Consec
#1: 1 2
#2: 1 4
Or, as @eddi suggested:
DT[order(ID, Order)][, .(Consec = .N), by = .(ID, Segment,
rleid(Segment))][Segment == 'B', .(ID, Consec)]
# ID Consec
#1: 1 2
#2: 1 4
A more memory-efficient method would be to use setorder instead of order (as suggested by @Arun):
setorder(DT, ID, Order)[, .(Consec = .N), by = .(ID, Segment,
rleid(Segment))][Segment == 'B', .(ID, Consec)]
# ID Consec
#1: 1 2
#2: 1 4
