Substr() function within the apply() function in R

I have a data frame with 25 million rows and I need to run a substring function to all 25 million rows of data. Because of the size of the data frame I thought apply would be the most efficient way of doing this.
df <- data.frame( seq_start=c(75, 59, 44),
seq_end=c(151, 135, 120),
sequence=c("NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA", "NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG", "NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA"))
The function call that I thought would be the most efficient:
apply(df,1,substr(sequence,seq_start,seq_end))
I'm not familiar with the apply function, and a loop is way too inefficient to process 25 million lines.

Not 100% sure what you need/want, but it seems that using the dplyr syntax is useful here (more useful than apply, as you're only looking to extract a substring from a single column):
library(dplyr)
df %>%
mutate(substring = substr(sequence,seq_start,seq_end))
seq_start seq_end
1 75 151
2 59 135
3 44 120
sequence
1 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 NTATTACTAAGAGATTTGGTTTTAACTATGAATCCATGATGAAATTATGAACTCTTAATAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACACAAATAGATCGGAAGAG
3 NCCTCTACCAGCCTTTTATTGTTAAAAATTGTGAATTTATGGAAAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATATGGACCATGATCTGATGGGACTACTGGAATCA
substring
1 ATTATTCTCATTCTTAGGTGCATTTTATATGGACCATGATCTGATGGGACTACTGGAATCAGGCTTGGTTCATTTTA
2 TAAATTTAAAAAGACAAGCAACCCAATCAAAAAATGGGCAAAGGATATGAATGGGGAATTCACAGACAAGAAAACAC
3 AAGGTTGTAGGAATAAGTTTCTAATGTATTAATTATTCTCATTCTTAGGTGCATTTTTATCTGGTGTTTGAATATAT
Base R:
df$substring <- substr(df$sequence,df$seq_start,df$seq_end)
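If speed matters at 25 million rows, note that substr is already vectorized, so neither apply nor a loop is needed; a vectorized string library may also help. A minimal sketch, assuming the stringi package is installed (stri_sub is vectorized over the string and both positions):
library(stringi)
df$substring <- stri_sub(df$sequence, from = df$seq_start, to = df$seq_end)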

Related

Adjusting inequalities in a dataframe and changing value to become a numeric based on a given desired value

I'm trying to wrangle the data here by adjusting certain values in a data frame.
I have to use plots that cannot handle character symbols in my data. Is there a way I can replace the data below, where values carry inequality symbols, with numerical values, so that the column in the data frame becomes numeric?
The photo shows my data frame and the goal I want to achieve.
(I apologize for adding a photo; the question box didn't seem to like something I added into it.)
Is there an efficient way to do this in R without having to manually change each data point in Excel or something?
Thank you for your help!
You could use the dplyr package to manipulate the data.frame
with mutate().
You could use a regular expression (regex) to check whether a test_score entry starts with a > or <, e.g. grepl("^>",test_score),
and then use ifelse to
write 100 or 0 into the corresponding cell. If an entry does not start with < or >, you just keep the old value of test_score.
# Create Minimal Reproducible Example
DF1 <- data.frame(SampleID = paste0("Subject",c(1:5)),
test_score=c(">90",">90","<50","<50","67"))
library(dplyr)
DF1 %>%
mutate(test_score_converted = ifelse(grepl("^>",test_score),100,
ifelse(grepl("^<",test_score),0,
test_score))) %>%
mutate(test_score_converted = as.numeric(test_score_converted))
Output:
SampleID test_score test_score_converted
1 Subject1 >90 100
2 Subject2 >90 100
3 Subject3 <50 0
4 Subject4 <50 0
5 Subject5 67 67
Please note that the code above also converts the result to a numeric value (instead of a character). If you have other special characters in the column test_score, the conversion will fail. You can then either additionally remove all non-digit characters with another regex, e.g. gsub("\\D","",DF1$test_score), or just comment out the second %>% mutate() in the example above that does the conversion to numeric values.
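For illustration, a sketch of how that extra cleanup could slot into the second mutate(); note that "\\D" also strips decimal points, so this assumes whole-number scores:
library(dplyr)
DF1 %>%
mutate(test_score_converted = ifelse(grepl("^>",test_score),100,
ifelse(grepl("^<",test_score),0,
test_score))) %>%
# strip anything that is not a digit before the numeric conversion
mutate(test_score_converted = as.numeric(gsub("\\D","",test_score_converted)))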
Convert the values in the test_score column to numbers using parse_number, then use case_when to check for the various conditions and assign the output.
library(dplyr)
df1 %>%
mutate(score_num = readr::parse_number(test_score),
final_score = case_when(grepl('>', test_score) & score_num >= 90 ~ 100,
grepl('<', test_score) & score_num <= 50 ~ 0,
#...add more conditions if needed
TRUE ~ score_num))
# SampleID test_score score_num final_score
#1 Subject1 >90 90 100
#2 Subject2 >90 90 100
#3 Subject3 <50 50 0
#4 Subject4 <50 50 0
#5 Subject5 67 67 67

Assign multiple columns via vector without recycling

I am importing measurement data as a dataframe and want to include the experimental conditions in the data, which are given in the filename. I want to add new columns to the dataframe that represent the conditions, and I want to assign each column the value specified by the filename. Later, this will facilitate comparisons to other experimental conditions once I merge the edited dataframes from each individual sample/file.
Here is an example of my pre-existing dataframe Measurements:
Measurements <- data.frame(
X = 1:4,
Length = c(130, 150, 170, 140)
)
Here are the example vectors of variables and values that would be derived from the filename:
FileVars.vec <- c("Condition", "Plant")
FileInfo.vec <- c("aKG", "1")
Here is one way that I have solved how to do what I want:
for (i in 1:length(FileVars.vec)) {
Measurements[FileVars.vec[i]] <- FileInfo.vec[i]
}
Which gives the desired output:
X Length Condition Plant
1 130 aKG 1
2 150 aKG 1
3 170 aKG 1
4 140 aKG 1
But my (limited) understanding of R is that it is a vectorized language that often avoids the need for using for-loops. I feel like this simpler code should work:
Measurements[FileVars.vec] <- FileInfo.vec
But instead of assigning one value for one entire column, it recycles the values within each column:
X Length Condition Plant
1 130 aKG aKG
2 150 1 1
3 170 aKG aKG
4 140 1 1
Is there any way to do a similar simple assignment but without recycling, i.e. one value assigned to one full column only? I imagine there's a simple formatting fix, but I've searched for a solution for >6 hours and nowhere have I seen an assignment like this. I have also thought of creating a separate dataframe of just the experimental conditions and then merging it into the actual dataframe, but that seems more roundabout to me, especially with more experimental conditions and observations than in these examples.
Also, if there is a more established pipeline/package for taking information from the filename and adding it to the data in a tidy fashion, that would be marvelous as well! The original filename would be something like:
"aKG_1.csv"
Thank you for helping an R noobie! May you receive good coding karma when debugging!
We can convert to a list and then assign, which avoids recycling the values column-wise. As it is a list, each element is treated as a unit, and the assignment recycles each element within its respective column.
Measurements[FileVars.vec] <- as.list(FileInfo.vec)
Output:
Measurements
# X Length Condition Plant
#1 1 130 aKG 1
#2 2 150 aKG 1
#3 3 170 aKG 1
#4 4 140 aKG 1
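Note that with this assignment the new Plant column is stored as character ("1"), not as a number; a quick check:
str(Measurements$Plant)
# chr [1:4] "1" "1" "1" "1"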
If we want to reset the type, use type.convert
Measurements <- type.convert(Measurements, as.is = TRUE)
Note that because FileInfo.vec is created as a vector, it will have a single type, i.e. character. If we instead want multiple types directly, it can be a list:
Measurements[FileVars.vec] <- list("aKG", 1)
For the second part of the question, if we have a string
str1 <- "aKG_1.csv"
and want to create two columns from it, either use read.table or strsplit:
Measurements[FileVars.vec] <- read.table(text = tools::file_path_sans_ext(str1),
sep="_", header = FALSE)

Subsetting several data frames in R using the same condition but fewer lines of code

I have several data frames that have the same columns, but are split up by year. I want to drop rows in the data frames using the same conditions, but want to reduce the number of lines of code it takes me to do that.
df1
lat long ID
44 10 1
43 20 2
42 30 3
45 39 4
df2
lat long ID
47 10 1
44 20 2
46 30 3
43 39 4
For example, I only want to keep the observations where lat is greater than or equal to 44 and less than or equal to 45, and longs that are greater than or equal to 10 and less than or equal to 30 (not actually the data I'm working with, but you get the idea).
I want to avoid a ton of lines of code (a few lines for these two example frames doesn't seem like a lot, but I have 10 different data frames, each with millions of observations, and I would like to keep them separate). I know loops are typically slow in R, so what's the best way to efficiently use the same function to subset several data frames without combining them?
You can put the dataframes in a list and use lapply to subset them.
list_data <- list(df1, df2)
result <- lapply(list_data, subset, lat >= 44 & lat <= 45 & long >= 10 & long <= 30)
A tidyverse solution would be:
library(dplyr)
library(purrr)
result <- map(list_data, ~.x %>% filter(between(lat, 44, 45) & between(long, 10, 30)))
Ronak beat me to it, but here's a slightly different solution in base R
Put all of your dataframes into a list
dfs <- Filter(function(x) is(x, "data.frame"), mget(ls()))
Create a function based on your parameters
your_func <- function(x){
subset(x, lat >= 44 & lat <= 45 & long >= 10 & long <= 30)
}
Apply function to all dataframes in the list
dfs <- lapply(dfs, FUN = function(x) your_func(x))
Move your dataframes from list to global environment
list2env(dfs,globalenv())
We can also use data.table
library(data.table)
list_data <- list(setDT(df1), setDT(df2))
result <- lapply(list_data, function(dt) dt[between(lat, 44, 45) & between(long, 10, 30)])

R parsing large data frame - speed optimization [duplicate]

This question already has answers here:
R Optimizing double for loop, matrix manipulation
(4 answers)
Closed 7 years ago.
Suppose I have an extremely large data frame with 2 columns and .5 mil rows.
For example, a few rows may look like this:
# Start End
# 89 100
# 93 120
# 95 125
# 101 NA
# 115 NA
# 123 NA
# 124 NA
I would like to manipulate this data frame to output a data frame that looks
like this:
# End Start
# 100 89, 93, 95
# 120 101, 115
# 125 123, 124
What would be the absolute quickest way to do this, given that there are
.5 million rows? bgoldst suggested this awesome piece of code:
# m is a large two column data frame
end <- na.omit(m[,'V2']);
out <- data.frame(End=end,
                  Start=unname(sapply(split(m[,'V1'], findInterval(m[,'V1'], end))[as.character(0:(length(end)-1))],
                                      paste, collapse=',')));
However this is taking a little bit too long.
Thanks for the help!
The answers on the possible duplicate post did not address the time issue. bgoldst's answer produced the desired outcome, but was very slow on my computer. I was wondering if there was something further that I could do to make this run faster.
A solution with data.table may be faster:
library(data.table)
dt = setDT(df)[, id:=findInterval(Start, End[!is.na(End)])][,paste(Start,collapse=','),id]
result = data.frame(End = df$End[!is.na(df$End)],Start = dt$V1)
# End Start
#1 100 89,93,95
#2 120 101,115
#3 125 123,124

Identifying duplicate columns in a dataframe

I'm an R newbie and am attempting to remove duplicate columns from a largish dataframe (50K rows, 215 columns). The frame has a mix of discrete, continuous, and categorical variables.
My approach has been to generate a table for each column in the frame into a list, then use the duplicated() function to find entries in the list that are duplicates, as follows:
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
tables=apply(testframe,2,table)
dups=which(duplicated(tables))
testframe <- subset(testframe, select = -c(dups))
This isn't very efficient, especially for large continuous variables. However, I've gone down this route because I've been unable to get the same result using summary (note, the following assumes an original testframe containing duplicates):
summaries=apply(testframe,2,summary)
dups=which(duplicated(summaries))
testframe <- subset(testframe, select = -c(dups))
If you run that code you'll see it only removes the first duplicate found. I presume this is because I am doing something wrong. Can anyone point out where I am going wrong or, even better, point me in the direction of a better way to remove duplicate columns from a dataframe?
How about:
testframe[!duplicated(as.list(testframe))]
You can do with lapply:
testframe[!duplicated(lapply(testframe, summary))]
summary summarizes the distribution while ignoring the order.
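A small sketch of that caveat: because only the distribution is compared, two columns that contain the same values in a different order will be flagged as duplicates of each other:
x <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1))
duplicated(lapply(x, summary))
# FALSE  TRUE   -- b is reported as a duplicate of a even though the columns differ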
Not 100% sure, but I would use digest if the data is huge:
library(digest)
testframe[!duplicated(lapply(testframe, digest))]
A nice trick that you can use is to transpose your data frame and then check for duplicates.
duplicated(t(testframe))
unique(testframe, MARGIN=2)
does not work, though I think it should, so try
as.data.frame(unique(as.matrix(testframe), MARGIN=2))
or if you are worried about numbers turning into factors,
testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))]
which produces
age height gender
1 18 76.1 M
2 19 77.0 F
3 20 78.1 M
4 21 78.2 M
5 22 78.8 F
6 23 79.7 F
7 24 79.9 M
8 25 81.1 M
9 26 81.2 F
10 27 81.8 M
11 28 82.8 F
12 29 83.5 M
It is probably best for you to first find the duplicate column names and treat them accordingly (for example summing the two, taking the mean, first, last, second, mode, etc.). To find the duplicate columns:
names(df)[duplicated(names(df))]
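For illustration, a rough sketch of one such treatment (averaging the columns that share a name); this assumes those columns are numeric and is only meant as a starting point:
dup_names <- unique(names(df)[duplicated(names(df))])
for (nm in dup_names) {
  idx <- which(names(df) == nm)
  df[[idx[1]]] <- rowMeans(df[idx])  # keep the first occurrence, replaced by the row-wise mean
  df <- df[-idx[-1]]                 # drop the remaining copies
}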
What about just:
unique.matrix(testframe, MARGIN=2)
Actually, you would just need to invert the duplicated() result in your code and could stick to using subset (which is more readable than bracket notation, imho):
require(dplyr)
iris %>% subset(., select=which(!duplicated(names(.))))
Here is a simple command that would work if the duplicated columns of your data frame had the same names:
testframe[names(testframe)[!duplicated(names(testframe))]]
If the problem is that dataframes have been merged one time too many using, for example:
testframe2 <- merge(testframe, testframe, by = c('age'))
It is also good to remove the .x suffix from the column names. I applied it here on top of Mostafa Rezaei's great answer:
testframe2 <- testframe2[!duplicated(as.list(testframe2))]
names(testframe2) <- gsub('\\.x$', '', names(testframe2))
Since this Q&A is a popular Google search result but the answer is a bit slow for a large matrix I propose a new version using exponential search and data.table power.
This a function I implemented in dataPreparation package.
The function
dataPreparation::which_are_bijection
which_are_in_double(testframe)
Which return 3 and 4 the columns that are duplicated in your example
Build a data set with wanted dimensions for performance tests
age=18:29
height=c(76.1,77,78.1,78.2,78.8,79.7,79.9,81.1,81.2,81.8,82.8,83.5)
gender=c("M","F","M","M","F","F","M","M","F","M","F","M")
testframe = data.frame(age=age,height=height,height2=height,gender=gender,gender2=gender)
for (i in 1:12){
testframe = rbind(testframe,testframe)
}
# Result in 49152 rows
for (i in 1:5){
testframe = cbind(testframe,testframe)
}
# Result in 160 columns
The benchmark
To perform the benchmark, I use the rbenchmark library, which repeats each computation 100 times:
benchmark(
which_are_in_double(testframe, verbose=FALSE),
duplicated(lapply(testframe, summary)),
duplicated(lapply(testframe, digest))
)
test replications elapsed
3 duplicated(lapply(testframe, digest)) 100 39.505
2 duplicated(lapply(testframe, summary)) 100 20.412
1 which_are_in_double(testframe, verbose = FALSE) 100 13.581
So which_are_in_double is 1.5 to 3 times faster than the other proposed solutions.
NB 1: I excluded from the benchmark the solution testframe[,colnames(unique(as.matrix(testframe), MARGIN=2))] because it was already 10 times slower with 12k rows.
NB 2: Please note that, given the way this data set is constructed, we have a lot of duplicated columns, which reduces the advantage of exponential search. With just a few duplicated columns, which_are_bijection would give much better performance, while the other methods would perform similarly.
