I am working with 5 data frames that I want to filter (eliminating rows whose values match a regex). Because all the data frames are similar, with the same variable names, I stored them in a list and I'm iterating over it. However, when I try to save the filtered data for each of the original data frames, I find that the loop creates an object literally named i_filtered (instead of dfName_filtered), so every time the loop runs it gets overwritten.
Here's what I have in the loop:
for (i in list_all){
  i_filtered1 <- i[i$chr != filter1,]
  i_filtered2 <- i[i$chr != filter2,]
  #Write the result filtered table in a csv file
  #Change output directory if needed
  write.csv(i_filtered2, file="/home/tama/Desktop/i_filtered.csv")
}
As I said, filter1 and filter2 are just regexes that I'm using to filter the data in the chr column.
What's the correct way to assign the original name + "_filtered" to the new data frame?
Thanks in advance
Edited to add info:
Each data frame has these variables (but the values can change):
chr    start   end     length
chr1   10400   10669   270
chr10  237646  237836  191
chrX   713884  714414  531
chrUn  713884  714414  531
chr1   762664  763174  511
chr4   805008  805571  564
And I have stored them all in a list:
list_all <- list(heep, oe, st20_n, st20_t, all)
list_all <- lapply(list_all, na.omit)
The filters:
#Get rid of random chromosomes
filter1=".*random"
#Get rid of undefined chromosomes
filter2 = "ĉhrUn.*
The output I'm looking for is:
heep_filtered1
heep_filtered2
oe_filtered1
oe_filtered2
etc
One possibility is to iterate over a sequence of indices (or names), rather than over the list of data frames itself, and access the data frames using the indices.
Another problem is that the != operator doesn't support regular expressions; it only tests for exact, literal (in)equality. You need to use grepl() instead.
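To see the difference on a single value (a quick sketch, not part of the original post):
# != only tests literal equality against the pattern string, so nothing is filtered:
"chr1_random" != ".*random"            # TRUE for every chr value
# grepl() interprets the pattern as a regular expression:
grepl(".*random", "chr1_random")       # TRUE, so !grepl() correctly drops this row
Putting both ideas together: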
names(list_all) <- c("heep", "oe", "st20_n", "st20_t", "all")
filtered <- list()
for (i in names(list_all)){
  df <- list_all[[i]]
  df.1 <- df[!grepl(filter1, df$chr), ]
  df.2 <- df[!grepl(filter2, df$chr), ]
  # Write the resulting filtered table to a csv file
  # Change the output directory if needed
  write.csv(df.2, file = paste0("/home/tama/Desktop/", i, "_filtered.csv"))
  filtered[[paste0(i, "_filtered", 1)]] <- df.1
  filtered[[paste0(i, "_filtered", 2)]] <- df.2
}
The result is a list called filtered that contains the filtered data frames.
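With the names built above, each filtered data frame can then be pulled out of the list by name, for example:
filtered[["heep_filtered1"]]
head(filtered[["oe_filtered2"]])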
The issue is that i is only substituted with the loop's current value when it appears on its own. In your current version you are using it as part of other object names, and as a plain character inside a string, where it is treated literally.
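If you really do want dynamically named objects, assign() can create them from pasted names (a sketch of what the original loop seems to be reaching for; a named list, as below, is usually cleaner):
for (i in names(list_all)){
  df <- list_all[[i]]
  # creates e.g. heep_filtered1 and heep_filtered2 as top-level objects
  assign(paste0(i, "_filtered1"), df[!grepl(filter1, df$chr), ])
  assign(paste0(i, "_filtered2"), df[!grepl(filter2, df$chr), ])
}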
I would suggest naming the list, then using lapply instead of a for loop (note that I also changed the filtering to happen in one step, since it is unclear whether you want to remove both kinds of rows or not; this also makes it easier to add more filters).
filters <- c(".*random", "chrUn.*")
list_all <- list(heep = heep
, oe = oe
, st20_n = st20_n
, st20_t = st20_t
, all = all)
toLoop <- names(list_all)
names(toLoop) <- toLoop # so the output list keeps these names
filtered <- lapply(toLoop, function(thisSet){
  # grepl() with the patterns collapsed by "|" drops rows matching either filter
  tempFiltered <- list_all[[thisSet]][!grepl(paste(filters, collapse = "|"),
                                             list_all[[thisSet]]$chr), ]
  # Write the resulting filtered table to a csv file
  # Change the output directory if needed
  write.csv(tempFiltered, file = paste0("/home/tama/Desktop/", thisSet, "_filtered.csv"))
  # Return the part you care about
  return(tempFiltered)
})
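Because toLoop is a named vector, lapply() returns a named list, so each filtered set is available under its original name, e.g.:
head(filtered[["heep"]])
head(filtered$oe)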
I am trying to find a more efficient way to import a list of data files with a somewhat awkward structure. The files are generated by a software program whose output looks like it was intended to be printed and viewed rather than exported and used. Each file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are lines of tab-delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change between files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed-together functions which gives me what I want, but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency in terms of time (I may import hundreds or thousands of data files, each with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
The steps above result in a dataframe with a column 'raw' containing the cleaned-up data as strings suited for conversion into tabular data (tab-delimited, one line per row).
From there on, we can either keep and manage the future single tables inside the parent table as a so-called list column (Variant A), or proceed by splitting column 'raw' and mapping over it (Variant B, credit to @Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
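To collapse this to a single row (and a single table) per compound, one option is summarise() in place of mutate() (a sketch building on intermediate_result above):
intermediate_result %>%
  group_by(compound) %>%
  summarise(
    tables = list(read.table(text = raw, header = TRUE, sep = '\t'))
  )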
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)
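If the Variant B result is assigned to, say, list_b, each compound's data frame is then available under its name (the names come from the gsub above):
list_b[["Compound 1"]]
list_b[["Compound 2"]]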
I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to append $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate"
class(df1$C2)
[1] "character"
How can I get the numeric result of ABC_D1$estimate into my dataframe, i.e. how can I convert these character strings into named numerics? The 3rd column should consist of the results of $p.value.
As pointed out by @DSGym there are several problems, including that it is not very convenient to have a list of character names; it would be better to have a list of objects instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
  dat_l <- get(dat)  # fetch the object whose name is stored in 'dat'
  dat_l[["estimate"]]
})
cbind(name_list, estimates)
This is not really advisable but given those premises...
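A slightly tidier variant of the same idea fetches all the named objects at once with base R's mget() (still assuming the cor.test results exist in the current environment):
results <- mget(name_list)
estimates <- sapply(results, function(x) x[["estimate"]])
p_values  <- sapply(results, function(x) x[["p.value"]])
data.frame(C1 = name_list, C2 = estimates, C3 = p_values)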
OK, I think I now know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You concatenate the two strings and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
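The third column asked for in the question ($p.value) can be built the same way:
df1$C3 <- map_dbl(paste0(df1$C1, '$p.value'), function(x) eval(parse(text = x)))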
I am trying to filter rows out from a dataframe and write the resulting shorter dataframe to a new file.
I can get this to work on individual files, but trying to use lapply to run the process over multiple files (and give different names to the output files) is proving troublesome.
I'm trying to filter out rows based on whether the values in "aaSeqCDR3" contain "_" or "*".
so far I have:
productseq <-function(x){
#establish filter criteria
filter <- c("\\*", "_")
#Filter data set to new variable
df2 <- df[!grepl(paste(filter, collapse = "|"), df$aaSeqCDR3),]
write.delim(df2, "df2.txt", sep= " ")}
However, when I try to apply it to a vector containing multiple data frame names (names):
nameproduct <- lapply(names, productseq)
I get the error:
error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "character"
I'm very lost at the moment and would appreciate any insight.
An example dataframe is below:
ID allDHitsWithScore allJHitsWithScore allCHitsWithScore aaSeqCDR3
0 290 0.031402274 TGTGCCAGCGGCAGCCCCAATTCACCCCTCCACTTT CASGSPNSPLHF
1 168 0.018191662 TGTGCTCTGAGTGATCAGAATAAGGGCAGGAGAGCACTTACTTTT CALSDQNKGRRALTF
2 49 0.005305902 TGTGCAGTCTCCAAAGCTGCAGGCAACAAGCTAACTTTT CAVSKAAGNKLTF
3 16 0.001732539 TGCAGTGCTAGAGGGCGCTTAGCCAAAAACATTCAGTACTTC CSARGRLAKNIQYF
4 15 0.001624256 TGTGCCTGAAGGAATGCAGGCAAATCAACCTTT CA*RNAGKSTF
5 14 0.001515972 TGCAGTGCTAGAGTTGGACAGGGAGGGTTCTTC CSARVGQGGFF
6 13 0.001407688 TGTGCCAGCAGTTACTTGGGACAGGGGGGAAACATTCAGTACTTC CASSYLGQGGNIQYF
7 12 0.001299404 TGTGCCAGCAGTTTATGGGACTAGCGGGGGGTTCGAGCTCCTACAATGAGCAGTTCTTC CASSLWD*RG_SSSYNEQFF
Because you are passing a character vector of data frame names and not data frame objects themselves, use get inside your function.
Also, note that you are writing to the same file, df2.txt, so it will be overwritten on each iteration. To resolve this, paste the x character value into the text file name. And be sure to return the data frame, rather than letting the write.delim call be the last line of the function.
productseq <- function(x) {
# Retrieve data frame
df <- get(x)
# Establish filter criteria
filter <- c("\\*", "_")
# Filter data set to new variable
df2 <- df[!grepl(paste(filter, collapse = "|"), df$aaSeqCDR3),]
write.delim(df2, paste0(x, ".txt"), sep= " ")
# Return filtered data
return(df2)
}
# LIST OF FILTERED DATA FRAMES EACH EXPORTED TO .txt FILE
nameproduct <- lapply(names, productseq)
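To keep track of which filtered result came from which data frame, the list can be named afterwards (names here being your character vector of data frame names):
names(nameproduct) <- names
nameproduct[["mydata"]]   # "mydata" is a hypothetical entry; use any of your names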
I am pulling 10-Ks off the SEC website using the EDGAR package in R. Fortunately, the text files come with a consistent file naming convention: CIK number (this is a unique filing ID)_File type_Date.
Ultimately I want to analyze these by SIC/industry group, so I think the best way to do this would be to add the SIC industry code to this filename rule.
I am including an image of what I would like to do below. It is kind of like a database join, except my file names would take on the new field. I am not sure how to do that; I am pretty new to R and file scripting.
I am assuming that you have a data.frame with a column filenames. (Or a vector containing all the filenames) See the code below:
# A data.frame with a character column 'filenames'
df$CIK <- sapply(df$filenames, FUN = function(x) {unlist(strsplit(x, split = "_"))[1]})
df$CIK <- as.character(df$CIK)
Now, let us assume that you have another data.frame with two columns: CIK and SIC.
# A data.frame with two character columns: 'CIK' and 'SIC'
# df2.
#
# We add another column to the first data.frame: 'new_filenames'
df$new_filename <- sapply(1:nrow(df), FUN = function(idx, CIK, filenames, df2) {
  SIC <- df2$SIC[which(df2$CIK == CIK[idx])]
  new_filename <- as.character(paste(SIC, "_", filenames[idx], sep = ""))
  new_filename
}, CIK = df$CIK, filenames = df$filenames, df2 = df2)
# Now the new filenames are available in df$new_filename
View(df)
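Since, as you say, this is essentially a database join, a shorter sketch of the same idea uses base R's merge() (assuming the same df and df2 as above):
# join the SIC codes onto the filename table by CIK
df <- merge(df, df2, by = "CIK")
df$new_filename <- paste0(df$SIC, "_", df$filenames)
# optionally, apply the new names on disk:
# file.rename(df$filenames, df$new_filename)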
I have more than 300 csv files in a directory.
The csv files have a following structure
id Date Nitrate Sulfate
id of csv file Some date Some Value Some Value
id of csv file Some date Some Value Some Value
id of csv file Some date Some Value Some Value
I want to count the number of rows in each csv file, excluding the NAs in that file, and store the result in a dataframe with two columns: (1) id and (2) nobs.
Here is my code for that:
complete <-function(directory,id){
filenames <-sprintf("%03d.csv", id)
filenames <-paste(directory,filenames,sep = '/')
dataframe <-data.frame(id=numeric(0),nobs=numeric(0))
for(i in filenames){
data <- read.csv(i)
dataframe[i,dataframe$id]<-data[data$id]
dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])
}
dataframe
}
The problem arises when I try to populate the dataframe inside the loop: it seems like it is not being populated, and I get NULL back. I know that I am doing something stupid.
I usually prefer to add the rows to a pre-allocated list and then bind them together. Here's a working example:
##### fake read.csv function returning random data.frame
# (just to reproduce your case, remove this from your code...)
read.csv <- function(fileName){
stupidHash <- sum(as.integer(charToRaw(fileName)))
if(stupidHash %% 2 == 0){
return(data.frame(id=stupidHash,date='2016-02-28',
nitrate=c(NA,2,3,NA,5),sulfate=c(10,20,NA,NA,40)))
}else{
return(data.frame(id=stupidHash,date='2016-02-28',
nitrate=c(4,2,3,NA,5,9),sulfate=c(10,20,NA,NA,40,50)))
}
}
#####
complete <-function(directory,id){
filenames <-sprintf("%03d.csv", id)
filenames <-paste(directory,filenames,sep = '/')
# here we pre-allocate a list with length = length(filenames)
# where we will put the rows of our future data.frame
rowsList <- vector(mode='list',length=length(filenames))
for(i in 1:length(filenames)){
filename <- filenames[i]
data <- read.csv(filename)
rowsList[[i]] <- data.frame(id=data$id[1],
nobs=sum(!is.na(data$sulfate) & !is.na(data$nitrate)))
}
# here we bind all the previously created rows together into one data.frame
DF <- do.call(rbind.data.frame, rowsList)
return(DF)
}
Usage example:
res <- complete(directory='dir',id=1:3)
> res
id nobs
1 889 4
2 890 2
3 891 4
The problem is in these 2 lines:
dataframe[i,dataframe$id]<-data[data$id]
dataframe[i,dataframe$nobs]<-nrow(data[!is.na(data$sulfate & data$nitrate),])
If you want to extend a dataframe, use the rbind function. But be aware that this is not an efficient approach, because it allocates new memory, copies all the existing data, and adds one new row each time. The efficient way is to allocate a dataframe that is big enough in this line:
dataframe <-data.frame(id=numeric(0),nobs=numeric(0))
Instead of 0, use the expected number of rows.
So the easiest way is:
dataframe <- rbind(dataframe, data.frame(id=data$id[1], nobs=nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])))
A more efficient way is something like this:
dataframe <-data.frame(id=numeric(numberOfRows),nobs=numeric(numberOfRows))
and after that, in the loop:
dataframe[i,]$id<-data$id[1]
dataframe[i,]$nobs<-nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
UPDATE: I changed the values used to populate the dataframe to data$id[1] and nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),]).
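Putting the pre-allocated version together, the whole function might look like this (a sketch assembled from the fragments above):
complete <- function(directory, id){
  filenames <- sprintf("%03d.csv", id)
  filenames <- paste(directory, filenames, sep = '/')
  numberOfRows <- length(filenames)
  # allocate the full data.frame up front
  dataframe <- data.frame(id = numeric(numberOfRows), nobs = numeric(numberOfRows))
  for (i in 1:numberOfRows){
    data <- read.csv(filenames[i])
    dataframe[i,]$id <- data$id[1]
    dataframe[i,]$nobs <- nrow(data[!is.na(data$sulfate) & !is.na(data$nitrate),])
  }
  dataframe
}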