sample data.table rows with different conditions - r

I have a data.table with multiple columns. One of these columns currently works as a key (keyb in the example). Another column (say A) may or may not contain data. I would like to supply a vector of keys and, for each key that appears in the vector, randomly sample two rows: one where A contains data and one where it does not.
MRE:
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance, subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to chain a sample of two rows per key, split by whether the A column has data or not, and then merge the results. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), each with two rows per matching key. So the data.table from a would have two rows (one with and one without data in A), and the one from b would have four rows (two with and two without data in A).
Please let me know if I can make this post any clearer

You could add A to the by statement as well, converted to a logical vector with A != "". Combined with a binary join on the key (with nomatch = 0L to drop keys that have no match), this lets you sample one row index (.I) per combination of keyb and A != "", and then use those indices to subset the original data set.
For a single subset case
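# .I[sample(.N, 1L)] picks one row index per group; unlike sample(.I, 1L), it stays correct when a group has a single row
trys[trys[list_try[[2]], nomatch = 0L, .I[sample(.N, 1L)], by = .(keyb, A != "")]$V1]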
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For a more general case, when you want to create separate data sets according to a list of keys, you could easily embed this into lapply
lapply(list_try,
       # .I[sample(.N, 1L)] rather than sample(.I, 1L), so single-row groups are handled correctly
       function(x) trys[trys[x, nomatch = 0L, .I[sample(.N, 1L)], by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
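The one-liner above packs several steps together. The sketch below spells them out separately; it is not the answerer's exact code. It assumes the same trys data, filters the keys with %in% instead of a keyed join, and uses the hypothetical names sample_pair, row_id and has_data purely for illustration.

```r
library(data.table)

# Same data as the question, built directly for a self-contained example
trys <- data.table(
  keyb = rep(c("x", "y"), each = 5),
  A    = c("1", "", "1", "", "", "1", "", "", "1", "")
)

# sample_pair() is a hypothetical helper, not part of data.table
sample_pair <- function(dt, keys) {
  # Step 1: for every (key, has-data) group, pick one original row number.
  # .I holds row numbers in dt; sample(.N, 1L) picks a position within the group.
  idx <- dt[, .(row_id = .I[sample(.N, 1L)]), by = .(keyb, has_data = A != "")]
  # Step 2: keep only the requested keys (unknown keys such as "r" simply drop out)
  rows <- idx[keyb %in% keys, row_id]
  # Step 3: subset the original table by the sampled row numbers
  dt[rows]
}

res <- sample_pair(trys, c("r", "y", "x"))
print(res)
```

Each requested key that exists in the table contributes exactly two rows, one with and one without data in A; which rows are picked varies between runs unless a seed is set.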

Related

How to check if a value exists within a set of columns?

My dataframe looks something like the following, where there are >100 columns that start with "i10_" and many other columns with other data. I would like to create a new variable that is TRUE or FALSE for each row, depending on whether the values C7931 and C7932 appear in that row within only the columns that start with "i10_".
So the output would be c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)
Create a vector with the columns of interest and use rowSums(), i.e.
i1 <- grep('i10_', names(d1))
rowSums(d1[i1] == 'C7931' | d1[i1] == 'C7932', na.rm = TRUE) > 0
where,
d1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F"), i10_a = c(NA,
"C7931", NA, NA, "S272XXA", "R55"), i10_1 = c("C7931", "C7931",
"R079", "S272XXA", "S234sfs", "N179")), class = "data.frame", row.names = c(NA,
-6L))
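If the set of codes grows beyond two, the chain of == comparisons can be swapped for %in%. The following is a minimal sketch under the assumption that d1 looks like the example above; the names codes, hit and present are illustrative. The pattern is anchored with ^ so that only names starting with i10_ match, and because %in% returns FALSE for NA, no na.rm argument is needed.

```r
d1 <- data.frame(
  v1    = c("A", "B", "C", "D", "E", "F"),
  i10_a = c(NA, "C7931", NA, NA, "S272XXA", "R55"),
  i10_1 = c("C7931", "C7931", "R079", "S272XXA", "S234sfs", "N179"),
  stringsAsFactors = FALSE
)

codes <- c("C7931", "C7932")      # values to look for
i1 <- grep("^i10_", names(d1))    # positions of the i10_ columns

# One logical column per i10_ column: TRUE where the cell is one of the codes
hit <- sapply(d1[i1], `%in%`, codes)

# A row is flagged when at least one of its i10_ cells matched
d1$present <- rowSums(hit) > 0
print(d1$present)
```

With this example data only the first two rows are flagged, since they are the only ones containing C7931 in an i10_ column.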
Ideally, you would give us a reproducible example with dput(). Assuming your dataframe is called df, you can do something like this with only base.
df$present <- apply(
  df[, startsWith(names(df), "i10"), drop = FALSE],
  MARGIN = 1,
  FUN = function(x) "C7931" %in% x & "C7932" %in% x)
This goes row by row and checks whether the columns that start with i10 contain both "C7931" and "C7932".
A similar approach with dplyr::across():
library(dplyr)
my_eval <- c("C7932", "C7931")
d1 %>%
  mutate(is_it_here =
    rowSums(across(starts_with("i10_"), ~ . %in% my_eval)) != 0)

Multiply rows of R dataframe by matching named numeric

How can you selectively multiply a row in a dataframe by the corresponding number in a named numeric?
## Create DataFrame
d <- data.frame(a=c(1,1,2), b=c(2,2,1), c=c(3,0,1))
rownames(d) = c("X", "Y", "Z")
## Create Named Numeric
named_numeric = c(2,1,3)
names(named_numeric) = c("X", "Y", "Z")
named_numeric
All values in the first row of the dataframe ("X": 1,2,3) would be multiplied by the value of "X" in the named numeric (2 in this case). The result of the first row would be "X": 2,4,6.
expected_output = data.frame(a=c(2,1,6), b=c(4,2,3), c=c(6,0,3))
rownames(expected_output) = c("X", "Y", "Z")
We can just do
out <- d * named_numeric
identical(expected_output, out)
#[1] TRUE
If the names of named_numeric are not in the same order as the row names, reorder one of them first and then multiply:
d * named_numeric[row.names(d)]
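The plain multiplication works because the vector is recycled down the columns, so element i meets row i; the names themselves are not used for matching. A small sketch of why the reordering matters, assuming a hypothetical vector shuffled whose names are in a different order from the row names:

```r
d <- data.frame(a = c(1, 1, 2), b = c(2, 2, 1), c = c(3, 0, 1),
                row.names = c("X", "Y", "Z"))

# Same multipliers as in the question, but stored in a different order
shuffled <- c(Z = 3, X = 2, Y = 1)

# Positional recycling: row X is multiplied by 3, Y by 2, Z by 1 (not what we want)
wrong <- d * shuffled

# Index by row name first so each multiplier lines up with its row
right <- d * shuffled[rownames(d)]

print(wrong)
print(right)
```

Indexing by rownames(d) costs nothing when the order already matches, so it is a reasonable default whenever the vector comes from elsewhere.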

Extract rows from a single column to form two new columns

Update:
I realized that the dummy data frame I created originally does not reflect the structure of the data frame that I am working with. Allow me to rephrase my question here.
Data frame that I'm starting with:
StudentAndClass <- c("Anthropology College_Name","x","y",
"Geology College_Name","z","History College_Name", "x","y","z")
df <- data.frame(StudentAndClass)
Students ("x","y","z") are enrolled in classes that they are listed under. e.g. "x" and "y" are in Anthropology, while "x", "y", "z" are in History.
How can I create the desired data frame below?
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("Anthropology College_Name", "Anthropology College_Name",
"Geology College_Name", "History College_Name",
"History College_Name", "History College_Name")
df_tidy <- data.frame(Student, Class)
Original post:
I have a data frame with observations of two variables merged in a single column like so:
StudentAndClass <- c("A","x","y","A","B","z","B","C","x","y","z","C")
df <- data.frame(StudentAndClass)
where "A", "B", "C" represent classes, and "x", "y", "z" students who are taking these classes. Notice that observations of students are wedged between observations of classes.
I'm wondering how I can create a new data frame with the following format:
Student <- c("x", "y", "z", "x", "y","z")
Class <- c("A", "A", "B", "C", "C", "C")
df_tidy <- data.frame(Student, Class)
I want to extract the rows containing observations of students and put them in a new column, while making sure that each Student observation is paired with the corresponding Class observation in the Class column.
One option is to create a vector
v1 <- c('x', 'y', 'z')
Then split the data based on a logical vector and cbind the two pieces
setNames(do.call(cbind, split(df, !df[,1] %in% v1)), c('Student', 'Class'))
# Student Class
#2 x A
#3 y A
#6 z B
#9 x B
#10 y C
#11 z C
Or with tidyverse
library(tidyverse)
df %>%
group_by(grp = c('Class', 'Student')[(StudentAndClass %in% v1) + 1]) %>%
mutate(n = row_number()) %>%
spread(grp, StudentAndClass) %>%
select(-n)
# A tibble: 6 x 2
# Class Student
#* <fctr> <fctr>
#1 A x
#2 A y
#3 B z
#4 B x
#5 C y
#6 C z
Update
If we need this based on the elements between each pair of identical class labels (here the first three LETTERS)
grp <- with(df, cummax(match(StudentAndClass, LETTERS[1:3], nomatch = 0)))
do.call(rbind, lapply(split(df, grp), function(x)
  data.frame(Student = x[,1][2:(nrow(x)-1)], Class = x[[1]][1], stringsAsFactors=FALSE)))
Updated
In essence, you just need to find which indexes have college names, use those to get the range of students in each college, then subset the main vector by those ranges. Since students aren't guaranteed to be nested between two similar values, you have to be careful about any "empty" colleges.
college_indices <- which(endsWith(StudentAndClass, 'College_Name'))
colleges <- StudentAndClass[college_indices]
bounds_mat <- rbind(
start = college_indices,
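# + 1 so that, after the end bound is decremented below, the last college keeps its final student
end = c(college_indices[-1], length(StudentAndClass) + 1)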
)
colnames(bounds_mat) <- colleges
bounds_mat['start', ] <- bounds_mat['start', ] + 1
bounds_mat['end', ] <- bounds_mat['end', ] - 1
# This prevents any problems if a college has no listed students
empty_college <- bounds_mat['start', ] > bounds_mat['end', ]
bounds_mat <- bounds_mat[, !empty_college, drop = FALSE]
class_listing <- apply(
bounds_mat,
2,
function(bounds) {
StudentAndClass[bounds[1]:bounds[2]]
}
)
df_tidy <- data.frame(
Student = unlist(class_listing),
Class = rep(names(class_listing), lengths(class_listing)),
row.names = NULL
)
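The same pairing can also be sketched without tracking start and end bounds, by numbering the colleges with cumsum and looking up each student's college by that number. This is an alternative under stated assumptions, not the answerer's method: it assumes every college label ends in "College_Name" and that the vector starts with a college. The names is_college and college_id are illustrative.

```r
StudentAndClass <- c("Anthropology College_Name", "x", "y",
                     "Geology College_Name", "z",
                     "History College_Name", "x", "y", "z")

# TRUE for rows that hold a college name, FALSE for student rows
is_college <- endsWith(StudentAndClass, "College_Name")

# Running count of colleges seen so far: 1 1 1 2 2 3 3 3 3
college_id <- cumsum(is_college)

colleges <- StudentAndClass[is_college]

df_tidy <- data.frame(
  Student = StudentAndClass[!is_college],
  # Each student row picks up the most recent college above it
  Class   = colleges[college_id[!is_college]],
  stringsAsFactors = FALSE
)
print(df_tidy)
```

A college with no students simply contributes no rows here, so no separate empty-college check is needed.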

data.table - group by all except one column

Can I group by all columns except one using data.table? I have a lot of columns, so I'd rather avoid writing out all the colnames.
The reason being I'd like to collapse duplicates in a table, where I know one column has no relevance.
library(data.table)
DT <- structure(list(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A",
"B", "C")), .Names = c("N", "val", "collapse"), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
> DT
N val collapse
1: 1 50 A
2: 2 60 B
3: 2 60 C
That is, given DT, is there something like DT[, print(.SD), by = !collapse] which gives:
> DT[, print(.SD), .(N, val)]
collapse
1: A
collapse
1: B
2: C
without actually having to specify .(N, val)? I realise I can do this by copy and pasting the column names, but I thought there might be some elegant way to do this too.
To group by all columns except one, you can use:
by = setdiff(names(DT), "collapse")
Explanation: setdiff takes the general form setdiff(x, y) and returns all values of x that are not in y. In this case, it returns all column names except that of the collapse column.
Two alternatives for building the same vector of column names:
# with '%in%'
names(DT)[!names(DT) %in% 'collapse']
# with 'is.element'
names(DT)[!is.element(names(DT), 'collapse')]
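Put together with the DT from the question, the setdiff approach might look like the sketch below. The aggregation in j (a row count and a comma-separated string) is only an illustration of collapsing duplicates, and the names group_cols, n_rows and collapsed are assumptions, not part of the original answer.

```r
library(data.table)

DT <- data.table(N = c(1, 2, 2), val = c(50, 60, 60),
                 collapse = c("A", "B", "C"))

# Every column except the one that should not take part in the grouping
group_cols <- setdiff(names(DT), "collapse")

# by accepts a character vector of column names
res <- DT[, .(n_rows = .N, collapsed = toString(collapse)), by = group_cols]
print(res)
```

If the ignored column can be dropped entirely, unique(DT, by = group_cols) is another way to keep one row per group.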

Conditionally split rows based on value in a specific column

I would like to convert the columns below into the format shown after them. Samples are grouped by occurrences of sample type N: each N starts a new group that runs until the next N. For example, the first two rows below form one group, and 7397-DNA_A01 to 7399-DNA_A01 form another.
Sample Sample Type
7393.DNA_A01 N
7394-DNA_A01 T
7395-DNA_A01 N
7396-DNA_A01 T
7397-DNA_A01 N
7398-DNA_A01 T
7399-DNA_A01 LN
7400-DNA_A01 N
7401-DNA_A01 T
7402-DNA_A01 B
desired output
N T B LN
7393.DNA_A01 7394-DNA_A01
7395-DNA_A01 7396-DNA_A01
7397-DNA_A01 7398-DNA_A01 7399-DNA_A01
7400-DNA_A01 7401-DNA_A01 7402-DNA_A01
I'm really not sure how to split the rows when N is encountered and then I suppose I would need to transpose somehow. Please help!
We need to create a grouping index ('indx') based on the occurrence of 'N'. Here, a logical vector is created (SampleType=='N') and passed to cumsum to build 'indx'. To control the order of the columns, it may be useful to convert the 'SampleType' column to a factor with its levels in the order of the column names in the expected result. Then we can use dcast from either reshape2 or data.table.
library(data.table)#v1.9.5+
setDT(df1)[, indx:=cumsum(SampleType=='N')
][, SampleType:= factor(SampleType, levels=c('N', 'T', 'B', 'LN'))]
dcast(df1, indx~SampleType, value.var='Sample', fill='')[,-1,with=FALSE]
# N T B LN
#1: 7393.DNA_A01 7394-DNA_A01
#2: 7395-DNA_A01 7396-DNA_A01
#3: 7397-DNA_A01 7398-DNA_A01 7399-DNA_A01
#4: 7400-DNA_A01 7401-DNA_A01 7402-DNA_A01
If you are using dcast from reshape2, the 'indx' column can be created with base R. You can also convert the 'SampleType' column to a factor using code similar to that above.
df1$indx <- cumsum(df1$SampleType=='N')
library(reshape2)
dcast(df1, indx~SampleType, value.var='Sample', fill='')
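To see what the cumsum index does on its own, the reshaping can also be sketched in base R with matrix indexing. This is an illustration under stated assumptions rather than a replacement for dcast: it assumes each sample type occurs at most once per group, and the names lev and wide are hypothetical.

```r
Sample <- c("7393.DNA_A01", "7394-DNA_A01", "7395-DNA_A01", "7396-DNA_A01",
            "7397-DNA_A01", "7398-DNA_A01", "7399-DNA_A01", "7400-DNA_A01",
            "7401-DNA_A01", "7402-DNA_A01")
SampleType <- c("N", "T", "N", "T", "N", "T", "LN", "N", "T", "B")

# Each 'N' bumps the counter, so rows between two N's share one group number
indx <- cumsum(SampleType == "N")
print(indx)  # 1 1 2 2 3 3 3 4 4 4

# Column order of the wide result
lev <- c("N", "T", "B", "LN")

# Start from an empty character matrix: one row per group, one column per type
wide <- matrix("", nrow = max(indx), ncol = length(lev),
               dimnames = list(NULL, lev))

# A two-column index matrix assigns each sample to its (group, type) cell
wide[cbind(indx, match(SampleType, lev))] <- Sample
print(wide)
```

Cells with no sample stay as empty strings, which mirrors the fill='' argument used with dcast.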
data
df1 <- structure(list(Sample = c("7393.DNA_A01", "7394-DNA_A01",
"7395-DNA_A01",
"7396-DNA_A01", "7397-DNA_A01", "7398-DNA_A01", "7399-DNA_A01",
"7400-DNA_A01", "7401-DNA_A01", "7402-DNA_A01"), SampleType = c("N",
"T", "N", "T", "N", "T", "LN", "N", "T", "B")), .Names = c("Sample",
"SampleType"), class = "data.frame", row.names = c(NA, -10L))
