I want to perform a set of operations (in R) on a number of data frames located within a list. In particular, for each one I create a "library" column, which is then used to determine which kind of filtering operation to perform. This is the actual code:
sampleList <- list(RNA1 = "data/not_processed/dedup.Bp1R4T2_S2.txt",
RNA2 = "data/not_processed/dedup.Bp1R4T3_S4.txt",
RNA3 = "data/not_processed/dedup.Bp1R5T2_S1.txt",
RNA4 = "data/not_processed/dedup.Bp1R5T3_S2.txt",
RNA5 = "data/not_processed/dedup.Bp1R14T5_S1.txt",
RNA6 = "data/not_processed/dedup.Bp1R14T6_S1.txt",
RNA7 = "data/not_processed/dedup.Bp1R14T6_S2.txt",
RNA8 = "data/not_processed/dedup.Bp1R14T7_S2.txt",
RNA9 = "data/not_processed/dedup.Bp1R14T8_S3.txt",
RNA10 = "data/not_processed/dedup.Bp1R14T9_S3.txt",
RNA11 = "data/not_processed/dedup.Bp1R14T9_S4.txt",
DNA1 = "data/not_processed/dedup.dna10_1_S4.txt",
DNA2 = "data/not_processed/dedup.dna10_2_S5.txt",
DNA3 = "data/not_processed/dedup.dna10_3_S6.txt",
DNA4 = "data/not_processed/dedup.dna50_1_S1.txt",
DNA5 = "data/not_processed/dedup.dna50_2_S2.txt",
DNA6 = "data/not_processed/dedup.dna50_3_S3.txt",
DNA7 = "data/not_processed/dedup.dna50_pcrcocktail_S7.txt")
batch <- lapply(names(sampleList),function(mysample){
aux <- read.table(sampleList[[mysample]], col.names=c(column1, column2, ..., ID, library, column4, etc...))
aux %>% mutate(library = mysample, R = Fw_ref + Rv_ref, A = Fw_alt + Rv_alt) %>% distinct(ID, .keep_all=T)
if (grepl("DNA", aux$library)){
aux %>% filter(aux$R>1 & aux$A>1)
} else {
aux %>% filter((aux$R+aux$A)>7 & aux$Fw_ref>=1 & aux$Rv_ref>=1 & aux$Fw_alt>=1 & aux$Rv_alt>=1)
}
aux
})
batch_file <- do.call(rbind, batch)
write.table(batch_file, "data/batch_file.txt", col.names = T, sep = "\t")
The possible values of the library column are DNA1 to DNA7, and RNA1 to 11. I tried also with "char" %in%, but it gives the same problem:
Error in if (grepl("DNA", aux$library)) { : argument is of length zero
Seems like the if condition is not able to identify the value in library. However, when I tried to apply the if/else condition on the batch_file (not filtered, basically obtained with this code without the if/else part) it worked perfectly.
Many thanks in advance.
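One plausible reading of the error, assuming aux$library is NULL or empty when the if runs (for instance because the result of mutate() is never assigned back to aux): grepl("DNA", NULL) returns logical(0), and if() refuses a zero-length condition. The sketch below tests the sample name itself, which is always a single string, and assigns the result of every pipeline step. The in-memory sample_data list and its values are made-up stand-ins for the files read with read.table().

```r
library(dplyr)

# Hypothetical stand-ins for the tables read from disk (values are invented)
sample_data <- list(
  RNA1 = data.frame(ID = c("a", "a", "b"), Fw_ref = c(5, 5, 0), Rv_ref = c(3, 3, 4),
                    Fw_alt = c(2, 2, 1), Rv_alt = c(1, 1, 1)),
  DNA1 = data.frame(ID = c("c", "d"), Fw_ref = c(1, 0), Rv_ref = c(1, 1),
                    Fw_alt = c(2, 1), Rv_alt = c(0, 0))
)

batch <- lapply(names(sample_data), function(mysample) {
  # Assign the result: dplyr verbs return a new data frame, they do not modify aux
  aux <- sample_data[[mysample]] %>%
    mutate(library = mysample, R = Fw_ref + Rv_ref, A = Fw_alt + Rv_alt) %>%
    distinct(ID, .keep_all = TRUE)

  # mysample is one string, so the condition has length one
  if (grepl("DNA", mysample)) {
    aux <- aux %>% filter(R > 1, A > 1)
  } else {
    aux <- aux %>% filter(R + A > 7, Fw_ref >= 1, Rv_ref >= 1, Fw_alt >= 1, Rv_alt >= 1)
  }
  aux
})

batch_file <- do.call(rbind, batch)
```

Inside filter() the columns can be referred to by bare name, so the aux$ prefix is not needed.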
I am generating dynamic variable names like P_1_Onsets_PRH, P_2_Onsets_PRH, etc. in a for loop. In this same loop, I'd like to read these variable names and generate corresponding matrices P_1_Durations_PRH, etc. having the same number of elements as the respective Onsets matrix.
for (i in 1:nrow(LabviewFiles)){
assign(x = paste("P",i , "Onsets_PRH", sep = "_"), value = t(subset.data.frame(All_Phase, All_Phase$Phase==i) %>%
filter(CONDITIONS == "NULL_TRIAL",
MISC_REWARD == 1,
MISC_PASSIVE_FAILED == 1) %>%
select(Feedback_onset)))
assign(x = paste("P",i , "Durations_PRH", sep = "_"), value = t(rep(0.5, times = length(noquote(paste("P",i , "Onsets_PRH", sep = "_"))))))
}
How do I read the length of matrix 'P_i_Onsets_PRH'?
I'm a newbie to R. Any help is appreciated.
You may use get to do this -
library(dplyr)
for (i in 1:nrow(LabviewFiles)){
assign(x = paste("P",i , "Onsets_PRH", sep = "_"), value = t(subset.data.frame(All_Phase, All_Phase$Phase==i) %>%
filter(CONDITIONS == "NULL_TRIAL",
MISC_REWARD == 1,
MISC_PASSIVE_FAILED == 1) %>%
select(Feedback_onset)))
assign(x = paste("P",i , "Durations_PRH", sep = "_"), value = t(rep(0.5, times = length(get(paste("P",i , "Onsets_PRH", sep = "_"))))))
}
Note that using assign and creating variables in global environment is discouraged. You may also read Why is using assign bad?
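As a rough sketch of the alternative to assign, the matrices can be kept in named lists and looked up by name. The onsets_by_phase list and its values are invented stand-ins for the filtered Feedback_onset values of each phase; only the naming scheme is taken from the question.

```r
# Hypothetical stand-in: one numeric vector of onsets per phase
onsets_by_phase <- list(c(1.2, 3.4, 5.6), c(2.0, 4.1))

# One 1 x n matrix per phase, as t() produces in the loop above
onsets <- lapply(onsets_by_phase, function(x) t(x))

# Durations get the same number of elements as the matching onsets matrix
durations <- lapply(onsets, function(m) t(rep(0.5, times = length(m))))

# Same naming scheme as in the question, but as list names instead of variables
names(onsets) <- paste("P", seq_along(onsets), "Onsets_PRH", sep = "_")
names(durations) <- paste("P", seq_along(durations), "Durations_PRH", sep = "_")

durations[["P_1_Durations_PRH"]]
```

In the original loop, length(noquote(paste(...))) measures the character string holding the name, which always has length 1; get() or a list lookup returns the object itself.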
The function strip() below tries to produce a brief report on the result of its operation via the tee pipe (%T>%). Because this function is in turn being handed to a wrapper function and then to purrr::pwalk, which will supply it with a bunch of dataframes one by one, I want to get a report of its operation on each dataframe along with the dataframe name; which is to say, the name of the actual dataframe that is supplied to correspond to the formal argument tib in the function below. In the example supplied, this would be "tst_df". I don't know the names in advance of running the function, as they are constructed from the filenames read from disk and various other inputs.
Somewhat to my surprise, I actually have almost all of this working, except for getting the name of the supplied dataframe. In the example below, the code that is supposed to do this is enexpr(XX), but I have also tried expr(XX), and both of these expressions applied to tib or the dot (.), with or without a preceding !!. Also deparse(substitute()) on XX, tib, and ., but without the bang bangs.
I see that the name is stripped initially by pass-by-value, and then again, maybe, by each stage of the pipe, including the T, and again, maybe, by (XX = .) in the anonymous function after the T. But I know R + tidyverse will have a way. I just hope it does not involve providing an integer to count backwards up the call stack.
tst_df <- tibble(A = 1:10, B = 11:20, C=21:30, D = 31:40)
tst_df
################################################################################
# The strip function expects a non-anonymous dataframe, from which it removes
# the rows specified in remove_rows and the columns specified in remove_cols. It
# also prints a brief report; just the df name, length and width.
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL){
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
out <- tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>% (function(XX = .){
function(XX = .)print(
paste0("length of ", enexpr(XX), " = ", nrow(XX), " Width = ", ncol(XX)))
cat("\n")
})
out
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
out_tb
Just save the name of tib at the beginning of your function; your reporter function will then find it:
strip <- function(tib, remove_rows = FALSE, remove_cols = NULL) {
remove_rows <- enquo(remove_rows)
remove_cols <- enquo(remove_cols)
tib_name <- as.character(substitute(tib))
report <- function(out) {
cat("output length of", tib_name, "=", nrow(out), ", width =", ncol(out), "\n")
}
tib %>%
filter(! (!! remove_rows)) %>%
select(- !! remove_cols) %T>%
report
}
out_tb <- strip(tib = tst_df, remove_rows = (A < 3 | D > 38), remove_cols = c(C, D))
output length of tst_df = 6 , width = 2
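One caveat, sketched here in base R only: substitute() captures the expression written in the immediate call, so when the function is reached through a wrapper it reports the wrapper's argument name rather than the original data frame name. The helpers report_dims, wrapper and report_named are hypothetical names for this illustration. Passing the name explicitly while iterating over a named list is one way around it.

```r
report_dims <- function(tib) {
  # Captures the expression used for `tib` in the call to report_dims()
  tib_name <- deparse(substitute(tib))
  paste0("length of ", tib_name, " = ", nrow(tib), ", width = ", ncol(tib))
}

tst_df <- data.frame(A = 1:10, B = 11:20)

msg <- report_dims(tst_df)        # reports "tst_df"

wrapper <- function(x) report_dims(x)
msg_wrapped <- wrapper(tst_df)    # reports "x", the wrapper's argument name

# Workaround: carry the name along as data instead of recovering it from the call
report_named <- function(tib, tib_name) {
  paste0("length of ", tib_name, " = ", nrow(tib), ", width = ", ncol(tib))
}
dfs <- list(tst_df = tst_df)
msgs <- Map(report_named, dfs, names(dfs))
```

Since the data frames in the question are built from filenames anyway, keeping them in a named list makes the name available without inspecting the call stack.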
I have done an analysis of my dataset but in long-form and would like to write a function to automate a part which I had to write multiple times over.
#Null_proband outputs for disease tables
null.proband <- dbdata %>%
filter(Proband == TRUE & Fam.state.mainmut.MutGroup == "Null") %>%
group_by(ICD_Grouping) %>%
summarise(n_distinct(PID),
n_distinct(Fam.state.mainmut.FID),
n(),
median(AgeOfOnset))
#Column names for null_proband
colnames(null.proband) = c("ICD_Grouping", "proband_uniq(PID)", "proband_uniq(FID)", "proband_freq(PID)","proband_median(ageofonset)")
I repeat this multiple times for different filtering sets and I was wondering if I can have a base function with a selection of if clauses
f1 <- function(x, pop = c("proband", "nonproband", "carrier", "unknown", "all"), mut = c("null", "mis", "unk", "all"))
That is how I would want to set up the function, followed by a base dplyr process:
{ pop.mut <- dbdata %>%
filter(defined by arguments) %>%
group_by(ICD_Grouping) %>%
summarise(pop_uniqPID = n_distinct(PID),
pop_uniqFID = n_distinct(FID),
pop_freqPID = n(),
pop_medianAgeOfOnset = median(AgeOfOnset))
}
And then if clauses describing what the filter should be, depending on the arguments, e.g.
if (pop == "proband" && mut == "null") { filter(Proband == TRUE & Fam.state.mainmut.MutGroup == "Null") }
Should I write for loops for all the combinations, or can I write for loops for half the filter with arguments, or just filter once for one argument and then filter a second time for the second argument?
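Filtering once per argument does work, and it avoids enumerating every combination. Below is a minimal sketch under several assumptions: the toy dbdata values are invented, the function name summarise_subset is hypothetical, non-probands are assumed to be rows with Proband == FALSE, and "Mis" is an assumed label for the missense group (only "Null" appears in the question). The carrier and unknown options are left out because the question does not show how they are coded.

```r
library(dplyr)

# Toy data; column names follow the question, values are made up
dbdata <- data.frame(
  PID = c(1, 2, 3, 4, 5, 6),
  Fam.state.mainmut.FID = c(10, 10, 11, 12, 12, 13),
  Proband = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  Fam.state.mainmut.MutGroup = c("Null", "Null", "Null", "Mis", "Mis", "Null"),
  ICD_Grouping = c("A", "A", "A", "B", "B", "B"),
  AgeOfOnset = c(30, 40, 50, 35, 45, 60)
)

summarise_subset <- function(data,
                             pop = c("all", "proband", "nonproband"),
                             mut = c("all", "null", "mis")) {
  # match.arg() picks the first choice by default and rejects unknown values
  pop <- match.arg(pop)
  mut <- match.arg(mut)

  # First filter: population (assumed coding of non-probands)
  if (pop == "proband") data <- filter(data, Proband)
  if (pop == "nonproband") data <- filter(data, !Proband)

  # Second filter: mutation group ("Mis" is an assumed label)
  if (mut != "all") {
    wanted <- c(null = "Null", mis = "Mis")[[mut]]
    data <- filter(data, Fam.state.mainmut.MutGroup == wanted)
  }

  data %>%
    group_by(ICD_Grouping) %>%
    summarise(uniq_PID = n_distinct(PID),
              uniq_FID = n_distinct(Fam.state.mainmut.FID),
              freq_PID = n(),
              median_onset = median(AgeOfOnset))
}

null_proband <- summarise_subset(dbdata, pop = "proband", mut = "null")
```

Each argument narrows the data independently, so adding another population or mutation group means adding one branch rather than a new set of combinations.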
I am using the expss package.
df<-read_spss("test.SAV")
It shows the following:
Warning message: In foreign::read.spss(enc2native(file),
use.value.labels = FALSE, : Tally.SAV: Very long string record(s)
found (record type 7, subtype 14), each will be imported in
consecutive separate variables
It shows 4174 variables in the Environment panel. The actual number of variables in the data file is around 400.
Can anyone among you please help me on this.
As mentioned in the comment, foreign::read.spss splits long SPSS character variables (>255 chars) into several columns. If such columns are empty, you can drop them without any issues.
Convenience function for this:
remove_empty_characters_after_foreign = function(data){
empty_chars = vapply(data, FUN = function(column) is.character(column) & all(is.na(column)), FUN.VALUE = logical(1))
additional_chars = grepl("00\\d$", colnames(data), perl = TRUE)
to_remove = empty_chars & additional_chars
if(any(to_remove)){
message(paste0("Removing ", paste(colnames(data)[to_remove], collapse = ", "),"..."))
}
data[,!to_remove, drop = FALSE]
}
df = remove_empty_characters_after_foreign(df)
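To see what the two conditions select, here is a small worked example on a toy data frame. The column names comment001 and comment002 are an assumption about how the continuation columns are named, inferred only from the "00\\d$" pattern above; the values are invented.

```r
# Toy data imitating an import where "comment" was split into several columns
df_toy <- data.frame(
  id = 1:3,
  comment = c("a", "b", NA),
  comment001 = c(NA_character_, NA_character_, NA_character_),
  comment002 = c("x", NA, NA),
  stringsAsFactors = FALSE
)

# Character columns that contain nothing but NA
is_empty_char <- vapply(df_toy,
                        function(column) is.character(column) && all(is.na(column)),
                        logical(1))

# Columns whose name ends in 00 followed by a digit
looks_like_continuation <- grepl("00\\d$", colnames(df_toy))

# Only columns meeting both conditions are dropped
to_remove <- is_empty_char & looks_like_continuation
cleaned <- df_toy[, !to_remove, drop = FALSE]
names(cleaned)
```

Here comment002 is kept because it still holds a value, so no data is lost by the removal.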
I've got a set of functions that I'm trying to work with and I'm struggling to figure out why the assignment isn't working. Here are the functions I'm using:
new_timeline <- function() {
timeline = structure(list(), class="timeline")
timeline$title <- list("text" = list("headline" = NULL, "text" = NULL),
"start_date" = list("year" = NULL, "month" = NULL, "day" = NULL),
"end_date" = list("year" = NULL, "month" = NULL, "day" = NULL))
return(timeline)
}
.add_date <- function(self, date, time_type) {
valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
if (!valid_date) {
stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
}
date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>% parse(text = .) %>% eval()
if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>% parse(text = .) %>% eval()
if (!is.na(date[3])) stringr::str_interp("self$title$${time_type}_date$day <- date[3]") %>% parse(text = .) %>% eval()
return(self)
}
edit_title <- function(self, headline = NULL, text = NULL, start_date = NULL, end_date = NULL) {
if (class(self) != "timeline") stop("The object passed must be a timeline object.")
if (is.null(headline) && is.null(self$title$text$headline)) stop("Headline cannot be empty when adding a new title.")
if (!is.null(headline)) self$title$text$headline <- headline
if (!is.null(text)) self$title$text$text <- text
if (!is.null(start_date)) self <- .add_date(self, date = start_date, time_type = "start")
if (!is.null(end_date)) self <- .add_date(self, date = end_date, time_type = "end")
return(self)
}
EDIT: The above code has been severely reduced per a request in the comments. The code is still sufficient to reproduce the error.
I know that's a bit long-winded, so I apologize. The first function establishes a new timeline object. The third function allows us to change the title of the timeline object and the second function is a helper function that handles dates. The code would be used like this:
library(magrittr)
#devtools::install_github("hadley/stringr")
library(stringr)
tl <- new_timeline()
tl <- tl %>% edit_title(headline = "My Timeline", text = "Example", start_date = "2015-10-18")
The code runs with no errors, but when I call tl$title$start_date$year, it comes back as NULL. Using an answer I got in this previous question I asked, I tried to set envir = globalenv() within the eval function. When I do that, the function returns an error saying that object self cannot be found.
So I'm under the impression that self is held in the parent.frame(). So I add both of these to a list: envir = list(globalenv(), parent.frame()). This causes the function to run without error, but there's still no assignment.
Where am I going wrong? Thanks in advance!
As mentioned in the comments, I think you could probably do away with all of the code parsing and just pass variables in [[ for your assignments. Anyway, when you use the pipe operator a bunch of function wrapping happens so determining how many frames to go back is painful. Here are a couple solutions modifying the .add_date function.
You already found one, using <<-, since it searches back through the parent environments until it finds the variable (or doesn't, and assigns it in the global environment).
Another would be just storing the function environment() and passing that to eval.
A third would be counting how many frames deep you go, and using sys.frame to tell eval which environment to look in.
.add_date <- function(self, date, time_type) {
valid_date <- stringr::str_detect(date, "^[0-9]{4}(-[0-9]{1,2}){0,2}$")
if (!valid_date) {
stringr::str_interp("Your ${time_type} date does not appear to be formatted correctly. It must be of the form 'yyyy-mm-dd'. Only the year is required.") %>% stop()
}
## Examining environments
e <- environment() # current env
efirst <- sys.nframe() # frame number
print(paste("Currently in frame", efirst))
envs <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.frames()} # list of frames
elast <- stringr::str_interp("${date}") %>% parse(text=.) %>% {.; sys.nframe()} # number of last
print(paste("Went", elast, "frames deep."))
## Go back this many frames in eval
goback <- efirst-elast
date_elements <- date %>% as.character() %>% stringr::str_split(" ") %>% unlist()
date <- date_elements[1] %>% stringr::str_split("-") %>% unlist()
## Solution 1: use sys.frame
stringr::str_interp("self$title$${time_type}_date$year <- date[1]") %>%
parse(text = .) %>% eval(envir=sys.frame(goback))
## Solution 2: use environment defined in function
if (!is.na(date[2])) stringr::str_interp("self$title$${time_type}_date$month <- date[2]") %>%
parse(text = .) %>% eval(envir=e)
return(self)
}
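The simpler route mentioned at the start of this answer, dropping the parsing and indexing with [[ instead, could look roughly like the sketch below. It is a base R illustration, not the original function: the name add_date_simple is hypothetical, the date validation is omitted, and strsplit() stands in for the stringr calls.

```r
add_date_simple <- function(self, date, time_type) {
  # "2015-10-18" -> c("2015", "10", "18"); missing parts index as NA
  parts <- strsplit(as.character(date), "-", fixed = TRUE)[[1]]

  # Build the element name ("start_date" or "end_date") and index with [[
  slot <- paste0(time_type, "_date")
  self$title[[slot]]$year <- parts[1]
  if (!is.na(parts[2])) self$title[[slot]]$month <- parts[2]
  if (!is.na(parts[3])) self$title[[slot]]$day <- parts[3]
  self
}

# Minimal timeline object with the same shape as new_timeline() produces
tl <- structure(
  list(title = list(start_date = list(year = NULL, month = NULL, day = NULL),
                    end_date = list(year = NULL, month = NULL, day = NULL))),
  class = "timeline"
)

tl <- add_date_simple(tl, "2015-10-18", "start")
tl <- add_date_simple(tl, "2016", "end")
tl$title$start_date$year
```

Because the assignment happens directly in the function's own frame, there is no need to work out which environment eval() should use.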