Select 1 column when DF has 2 similar column names in R - r

I have 2 problems. First, I have datasets with 2 column names that are similar. I want to select the first one and not use the second one. The numeric values in the column names are the serial number of the sensor and can vary and they can be in various columns.
How can I select the first column name of the 2 so I can plot it or use it in calculations?
How can I recover those long column names so I can use them? For example how to I get "Depth_456" to use in depthmax2 with out typing it in or making a subset named depth. The problem is the numeric value which is the serial number of the sensor and it changes from instrument to instrument and dataset to dataset. I am trying to write generic code that will work on all the different instruments.
My Data
df1 <- data.frame(Sal_224 = 1:8, Temp_696 = 1:8, Depth_456 = 1:8, Temp_654 = 8:15)
df2<-data.frame(sapply(df1, function(x) as.numeric(as.character(x))))
temp<- df2[grep("Temp", names(df2), value=TRUE)]
depth<- df2[grep("Depth", names(df2), value=TRUE)]
depthmax<- max(depth, na.rm = TRUE)
depthmax2<- max(df2$"Depth_456", na.rm = TRUE)
This doesn't work
depthmax2<- max(df2$grep("Depth", names(df2), value=TRUE), na.rm = TRUE)

We need [[ instead of $.
max(df2[[ grep("Depth", names(df2), value=TRUE)]], na.rm = TRUE)
#[1] 8
Or another option is startsWith
max(df2[[names(df2)[startsWith(names(df2), "Depth")]]], na.rm = TRUE)
#[1] 8
Also, max works on a vector. If there are more than one match, we may have to loop over and get the max
sapply(df2[ grep("Depth", names(df2), value=TRUE)], max, na.rm = TRUE)

Related

Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on a wrong path with these lines, as I am not comparing the single values
lapply(list, function(x){
if(x$"2">(mean(x$"2")) + (2*sd(x$"2"))||x$"2"<(mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
you haven't added an example of your code so I've made a quick and simple example to demonstrate my answer. I think this would be much more straightforward logic if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a simple dplyr pipe, ultimately identifying outliers by 1's in the 'outlier' column:
library(tidyverse)
tibble1 <- tibble(colA = c(seq(1,20,1), 150),
colB = seq(0.1,2.1,0.1),
id = 1:21)
tibble2 <- tibble(colA = c(seq(101,120,1), -150),
colB = seq(21,41,1),
id = 1:21)
# N.B. if you don't have an 'id' column or equivalent
# then it makes it a lot easier if you add one
# The 'id' column is essentially shorthand for an index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
group_by(id) %>%
mutate(meanA = mean(colA),
sdA = sd(colA),
lowThresh = meanA - 2*sdA,
uppThresh = meanA + 2*sdA,
outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))

I wan to replace NA values with median of a column which are of numeric type in Rstudio

Here is what I have done, but with this solution I have new data frame which has numeric columns only but I want to keep my original data frame.
data_without_na <- select_if(new_data,is.numeric)
data_without_na[] <- lapply(
data_without_na,
function(data_without_na) {
data_without_na[is.na(data_without_na)] <- median(data_without_na, na.rm = TRUE)
data_without_na
})
This is what my code is but I would prefer to perform the same operation on my original data frame. The idea is to get the index of columns which are of numeric data type, ind <- which(sapply(new_data, is.numeric)) and get the column number to perform operation on my original data frame, but it's giving me an error
Simulate a dataframe:
d <- data.frame("char1" = sample(letters,100, replace = T),
"char2" = sample(letters,100, replace = T),
"numeric1" = sample(c(NA,seq(1,50,2.5)),100, replace = T),
"numeric2" = sample(c(NA,seq(1,50,2.5)),100, replace = T))
d %>%
mutate_if(is.numeric, ~ifelse(is.na(.x),median(.x, na.rm = T),.x))
We take this data frame and mutate all columns which are numeric. "~" defines an anonymous function and ".x" stands for the variable/column.

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- (as.data.frame(combinations(n = 10, r = 4, v = v1),
stringsAsFactors = FALSE))
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- (as.data.frame(combinations(n = 10, r = 2, v = v1),
stringsAsFactors = FALSE))
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters without repetitions and permutations. The combinations are binned into groups from 1-15. I want to find out how often pairs of the 10 letters (saved in dataframe 'combinations_from_2_letters') are found in each group (basically a frequency table). I started doing a complicated loop looping through both dataframes but I think there must be a more 'R' solution to it, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i,])
Thank you in advance for your help!
I recommend an approach like the following:
# adding dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
mutate(ones = 1)
joined = combinations_from_2_letters %>%
inner_join(combinations_from_4_letters, by = "ones") %>%
# comparison goes here
mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
group_by(comb2) %>%
summarise(freq = sum(within))
You'll probably need to modify to ensure it matches the exact column names and your comparison condition.
Key ideas:
adding filler column so we have a complete cross-join
mutate a new indicator column for whether the two letter pair is within the four letter pair
sum indicators on the two letter pair

Parsing with nested lapply

I have a df with multiple vars containing dates.
Among these vars some report multiple dates separated by formatting symbols.
For each cell in each of the relevant vars, I would like to split the string, reformat as data, and pick the last date.
DATA
data <- data.frame(ex=c(1,2),date_1 = c("30/12/1997\n22/12/1998","15/12/1993"), date_2 = c("21/03/1997\n11/04/1996\n11/04/1996\n11/04/1996\n11/04/1996",NA))
expected <- data.frame(ex=c(1,2),date_1 = c("1998-12-22","1993-12-15"), date_2 = c("1997-03-21",NA))
CODE ATTEMPTED (1) ERROR: ALL ENTRIES GET THE VAR MAX VALUE NOT THE CELL MAX VALUE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(x) max(as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
CODE ATTEMPTED (2) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y, lapply(data[grep("date",names(data),value = T)], function(x)
as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T)))
CODE ATTEMPTED (3) (NESTED LAPPLY) ERROR: CODE BREAKS DOWN SOMEWHERE
data[grep("date",names(data),value = T)] <- lapply(data[grep("date",names(data),value = T)], function(y) max(y,function(x) as.Date(str_split(x,"\n")[[1]],format="%d/%m/%Y"), na.rm = T))
We can use :
data[-1] <- lapply(data[-1], function(y) sapply(strsplit(y ,"\n"),
function(x) max(as.Date(x, "%d/%m/%Y"))))
data[-1] <- lapply(data[-1], as.Date)
data
# ex date_1 date_2
#1 1 1998-12-22 1997-03-21
#2 2 1993-12-15 <NA>
The logic is the same as described for every column (except first) we split the string on "\n", convert to date and return the max value. The inner sapply loop returns numeric representation of dates so we use another lapply to convert the numbers to date.

Efficiently reformat column entries in large data set in R

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about.
The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns.
The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns.
The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.
The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.
My current code does the following: pos, change, and fraction each receive a vector of the above values split use strsplit. I then loop through the entire database, getting the ith value from those three vectors, and creating new columns with the values I want.
Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.
pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
current <- pos[[i]]
mutation <- change[[i]]
af <- fraction[[i]]
total.esp$chrom[i] <- current[1]
total.esp$base[i] <- current [2]
total.esp$ref[i] <- mutation[1]
total.esp$alt[i] <- mutation[2]
total.esp$af[i] <- af[3]
}
Thanks!
Here is a data.table solution. We convert the 'data.frame' to 'data.table' (setDT(df1)), loop over the Subset of Data.table (.SD) with lapply, use tstrsplit and split the columns by specifying the split character, unlist the output with recursive=FALSE.
library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]
# Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1: C G 10 130448 0.02 0.03
#2: A C 2 40483 0.05 0.03
#3: A G 5 30821291 0.02 0.04
# MAFinPercent3
#1: 0.20
#2: 0.04
#3: 0.03
NOTE: I assumed that there are only 3 columns in the dataset. If there are more columns, and want to do the split only for the 3 columns, we can specify the .SDcols= 1:3 i.e. column index or the actual column names, assign (:=) the output to new columns and subset the columns that are only needed in the output.
data
df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'),
NCBI.Base=c('10:130448', '2:40483', '5:30821291'),
MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
stringsAsFactors=FALSE)
You can use tidyr, dplyr and separate:
library(tidyr)
library(dplyr)
total.esp %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent.EA.AA.All., c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)
You'll need to be careful about that last MAFinPercent.EA.AA.All. - you have a horrible column name so may have to rename it/quote it depending on how exactly r has it (this is also a good reason to include at least some data in your question, such as the output of dput(head(total.esp))).
data used to check:
total.esp <- data.frame(Alleles= rep("C>G", 50), NCBI.Base = rep("10:130448", 50), 'MAFinPercent(EA/AA/All)'= rep(".02/.03/.20", 50))
Because we now have a tidyr/dplyr solution, a data.table solution and a base solution, let's benchmark them. First, data from #akrun, 300,000 rows in total:
df1 <- data.frame(Alleles =rep(c('C>G', 'A>C', 'A>G'), 100000),
NCBI.Base=rep(c('10:130448', '2:40483', '5:30821291'), 100000),
MAFinPercent= rep(c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 100000),
stringsAsFactors=FALSE)
Now, the benchmark:
microbenchmark::microbenchmark(
tidyr = {df1 %>% separate(Alleles, c("ref", "alt"), sep=">") %>%
separate(NCBI.Base, c("chrom", "base"), sep=":") %>%
separate(MAFinPercent, c("af1", "af2", "af3"), sep="/") %>%
select(-af1, -af2, af = af3)},
data.table = {setDT(df1)[, unlist(lapply(.SD, tstrsplit,
split='[>:/]', type.convert=TRUE), recursive=FALSE)]},
base = {pos <- strsplit(df1$NCBI.Base, ":");
change <- strsplit(df1$Alleles, ">");
fraction <- strsplit(df1$MAFinPercent, "/");
data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( fraction, "[", 3)
)}
)
Unit: seconds
expr min lq mean median uq max neval
tidyr 1.295970 1.398792 1.514862 1.470185 1.629978 1.889703 100
data.table 2.140007 2.209656 2.315608 2.249883 2.481336 2.666345 100
base 2.718375 3.079861 3.183766 3.154202 3.221133 3.791544 100
tidyr is the winner
Try this (after retaining your first three lines of code):
total.esp <- data.frame( chrom =sapply( pos, "[", 1),
base = sapply( pos, "[", 2),
ref = sapply( change, "[", 1),
alt = sapply(change, "[", 2),
af = sapply( af, "[", 3)
)
I cannot imagine this taking more than a couple of minutes. (I do work with R objects of similar size.)

Resources