Sample from specific rows in a dataframe column [duplicate] - r

I'm looking for an efficient way to select rows from a data table such that I have one representative row for each unique value in a particular column.
Let me propose a simple example:
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
dt = as.data.table( z )
my objective is to subset data table dt by sampling one row for each letter a-h in column z.

OP provided only a single column in the example. Assuming that there are multiple columns in the original dataset, we group by 'z', sample 1 row from the sequence of rows per group, get the row index (.I), extract the column with the row index ($V1) and use that to subset the rows of 'dt'.
dt[dt[ , .I[sample(.N,1)] , by = z]$V1]

You can use dplyr
library(dplyr)
dt %>%
group_by(z) %%
sample_n(1)

I think that shuffling the data.table row-wise and then applying unique(...,by) could also work. Groups are formed with by and the previous shuffling trickles down inside each group:
# shuffle the data.table row-wise
dt <- dt[sample(dim(dt)[1])]
# uniqueness by given column(s)
unique(dt, by = "z")
Below is an example on a bigger data.table with grouping by 3 columns. Comparing with #akrun ' solution seems to give the same grouping:
set.seed(2017)
dt <- data.table(c1 = sample(52*10^6),
c2 = sample(LETTERS, replace = TRUE),
c3 = sample(10^5, replace = TRUE),
c4 = sample(10^3, replace = TRUE))
# the shuffling & uniqueness
system.time( test1 <- unique(dt[sample(dim(dt)[1])], by = c("c2","c3","c4")) )
# user system elapsed
# 13.87 0.49 14.33
# #akrun' solution
system.time( test2 <- dt[dt[ , .I[sample(.N,1)] , by = c("c2","c3","c4")]$V1] )
# user system elapsed
# 11.89 0.10 12.01
# Grouping is identical (so, all groups are being sampled in both cases)
identical(x=test1[,.(c2,c3)][order(c2,c3)],
y=test2[,.(c2,c3)][order(c2,c3)])
# [1] TRUE
For sampling more than one row per group check here

Updated workflow for dplyr. I added a second column v that can be grouped by z.
require(data.table)
y = c('a','b','c','d','e','f','g','h')
x = sample(2:10,8,replace = TRUE)
z = rep(y,x)
v <- 1:length(z)
dt = data.table(z,v)
library(dplyr)
dt %>%
group_by(z) %>%
slice_sample(n = 1)

Related

Filter rows in dataset for distinct words in r

Goal: To filter rows in dataset so that only distinct words remain At the moment, I have used inner_join to retain rows in 2 datasets which has made my rows in this dataset duplicate.
Attempt 1: I have tried to use distinct to retain only those rows which are unique, but this has not worked. I may be using it incorrectly.
This is my code so far; output attached in png format:
# join warriner emotion lemmas by `word` column in collocations data frame to see how many word matches there are
warriner2 <- dplyr::inner_join(warriner, coll, by = "word") # join data; retain only rows in both sets (works both ways)
warriner2 <- distinct(warriner2)
warriner2
coll2 <- dplyr::semi_join(coll, warriner, by = "word") # join all rows in a that have a match in b
# There are 8166 lemma matches (including double-ups)
# There are XXX unique lemma matches
You can try :
library(dplyr)
warriner2 <- inner_join(warriner, coll, by = "word") %>%
distinct(word, .keep_all = TRUE)
To even further clarify Ronak's answer, here is an example with some mock data. Note that you can just use distinct() at the end of the pipe to keep distinct columns if that's what you want. Your error might very well have occurred because you performed two operations, and assigned the result to the same name both times (warriner2).
library(dplyr)
# Here's a couple sample tibbles
name <- c("cat", "dog", "parakeet")
df1 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
df2 <- tibble(
x = sample(5, 99, rep = TRUE),
y = sample(5, 99, rep = TRUE),
name = rep(name, times = 33))
# It's much less confusing if you do this in one pipe
p <- df1 %>%
inner_join(df2, by = "name") %>%
distinct()

How can I subtract 1 column from another in R based on 2 aggregate columns

First Data frame 'total_coming_in' column names: 'LocationID','PartNumber',"Quantity"
Second Data frame 'total_going_out' column names: 'LocationID','PartNumber',"Quantity"
I want output as 'total_data' column names: 'LocationID','PartNumber',"Quantity_subtract" where
Quantity_subtract = total_coming_in$Quantity - total_going_out$Quantity grouped for each 'LocationID','PartNumber'
I tried this :-
matchingCols <- c('LocationID','PartNumber')
mergingCols <- names(coming_in)[3]
total_coming_in[total_going_out,on=matchingCols,
lapply(
setNames(mergingCols),
function(x) get(x) - get(paste0("i.", x))
),
nomatch=0L,
by=.EACHI
]
Using data.table as you seem to want to, I would first cleanly merge the two tables and then do the substract operation on just the rows that make sense (i.e. for rows in total_coming_in which have matching values values in total_going_out and vice-versa):
library(data.table)
M <- merge(total_coming_in, total_going_out, by = c('LocationID','PartNumber'))
# i.e. all.x = FALSE, all.y = FALSE,
# thereby eliminating rows in x without matching row in y and vice-versa
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
Now for completenes, as your question might be interpreted as allowing 0 values for Quantity.y in total_going_out for rows of total_coming_in that have no matching values in total_going_out and vice-versa, you could do in this case:
M <- merge(total_coming_in, total_going_out, all = TRUE, by = c('LocationID','PartNumber'))
# i.e. all.x = TRUE, all.y = TRUE,
# thereby allowing rows in x without matching row in y and vice-versa
M[is.na(Quantity.x), Quantity.x := 0]
M[is.na(Quantity.y), Quantity.y := 0]
M[ , Quantity_subtract := Quantity.x - Quantity.y,
by = c('LocationID','PartNumber')]
So you want to have a column that gives you the difference of total_coming_in and total_going_out for each combination of PartNumber and LocationID, correct?
If so, the following will do:
library(dplyr)
matchingCols <- c("LocationID", "PartNumber")
total_data <- full_join(total_coming_in, total_going_out, by=matchingCols)
total_data <- mutate(total_data, Quantity_subtract = Quantity.x - Quantity.y)
total_data <- select(total_data, -Quantity.x, -Quantity.y) #if you want to get rid of these columns
I used this example data:
total_coming_in <- list(LocationID = round(runif(26, 1000, 9000)),
PartNumber = paste(runif(26, 10000, 20000), LETTERS, sep="-"),
Quantity = round(runif(26, 2, 4))
) %>% as_tibble()
random_integers <- sample(1:26,26,FALSE)
total_going_out <- list(LocationID = total_coming_in$LocationID[random_integers],
PartNumber = total_coming_in$PartNumber[random_integers],
Quantity = round(runif(26, 1, 3))
) %>% as_tibble()

Delete rows after a negative value in multiple data frames

I have multiple data frames which are individual sequences, consisting out the same columns. I need to delete all the rows after a negative value is encountered in the column "OnsetTime". So not the row of the negative value itself, but the row after that. All sequences have 16 rows in total.
I think it must be able by a loop, but I have no experience with loops in r and I have 499 data frames of which I am currently deleting the rows of a sequence one by one, like this:
sequence_6 <- sequence_6[-c(11:16), ]
sequence_7 <- sequence_7[-c(11:16), ]
sequence_9 <- sequence_9[-c(6:16), ]
Is there a faster way of doing this? An example of a sequence can be seen here example sequence
Ragarding this example, I want to delete row 7 to row 16
Data
Since the odd web configuration at work prevents me from accessing your data, I created three dataframes based on random numbers
set.seed(123); data_1 <- data.frame( value = runif(25, min = -0.1) )
set.seed(234); data_2 <- data.frame( value = runif(20, min = -0.1) )
set.seed(345); data_3 <- data.frame( value = runif(30, min = -0.1) )
First, you could create a list containing all your dataframes:
list_df <- list(data_1, data_2, data_3)
Now you can go through this list with a for loop. Since there are several steps, I find it convenient to use the package dplyr because it allows for a more readable notation:
library(dplyr)
for( i in 1:length(list_df) ){
min_row <-
list_df[[i]] %>%
mutate( id = row_number() ) %>% # add a column with row number
filter(value < 0) %>% # get the rows with negative values
summarise( min(id) ) %>% # get the first row number
as.numeric() # transform this value to a scalar (not a dataframe)
list_df[[i]] <- list_df[[i]] %>% slice(1:min_row) # get rows 1 to min_row
}
Hope it helps!
We can get the datasets into a list assuming that the object names start with 'sequence' followed by a - and one or more digits. Then use lapply to loop over the list and subset the rows based on the condition
lst1 <- lapply(mget(ls(pattern="^sequence_\\d+$")), function(x) {
i1 <- Reduce(`|`, lapply(x, `<`, 0))
#or use rowSums
#i1 <- rowSums(x < 0) > 0
i2 <- which(i1)[1]
x[seq(i2),]
}
)
data
set.seed(42)
sequence_6 <- as.data.frame(matrix(sample(-1:10, 16 *5, replace = TRUE), nrow = 16))
sequence_7 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))
sequence_9 <- as.data.frame(matrix(sample(-2:10, 16 *5, replace = TRUE), nrow = 16))

R: Concatenated values in column B based on values in column A

QUESTION: Using R, how would you create values in column B prefixed with a constant "1" + n 0's where n is the value in each row in column A?
#R CODE EXAMPLE
df <- as.data.frame(1:3);colnames(df)[1] <- "A";
print(df);
# A
# 1
# 2
# 3
preFixedValue <- 1; repeatedValue <- 0;
#pseudo code: create values in column B with n 0's prefixed with 1
df <- cbind(df,paste(rep(c(preFixedValue,repeatedValue), times = c(1,df[1:nrow(df),])),collapse = ""));
#expected/desired result
# A B
# 1 10
# 2 100
# 3 1000
USE CASE: Real data contains hundreds of rows in column A with random integers, not just three sequential int's as shown in the code above.
Below is an example using Excel to demonstrate what I want to do in R.
The rowwise() function in dplyr lets you make variables from column values in each row.
require(dplyr)
df <- data.frame(A = 1:3, B = NA)
preFixedValue <- 1; repeatedValue <- 0;
df <- df %>%
rowwise() %>%
mutate(B = as.numeric(paste0(c(preFixedValue, rep(repeatedValue, A)), collapse = "")))
For maximum flexibility, i.e. total freedom of choosing prefixed and repeated values as single values or vectors, and for simplicity of the syntax (one single line):
library(stringr)
df$B <- str_pad(preFixedValue, width = df$A, pad = repeatedValue, side = c("right"))
Would something like this work?
B<-10^(df$A)
df<-cbind(df,B)

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a vector of the above values split use strsplit. I then loop through the entire database, getting the ith value from those three vectors, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns, and want to do the split only for the 3 columns, we can specify the .SDcols= 1:3 i.e. column index or the actual column names, assign (:=) the output to new columns and subset the columns that are only needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name so may have to rename it/quote it depending on how exactly r has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( af, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)

Resources