sapply as.numeric not working inside a for loop - r

I'm using a for loop to read every sheet in an Excel file, process it and save it in a list. First iteration does it without problem, but when it gets to i=2, the following error shows up:
Error:
! Assigned data `sapply(EstContratos[cols.name], as.numeric)` must be compatible with existing data.
✖ Existing data has 1 row.
✖ Assigned data has 3 rows.
ℹ Row updates require a list value. Do you need `list()` or `as.list()`?
This is the code:
mysheetlist <- excel_sheets(path="whatever/EstadoContrato.xlsx")
datalist<-list()
i=1
for (i in 1:length(mysheetlist)){
EstContratos <- read_excel(path="whatever/EstadoContrato.xlsx", sheet = mysheetlist[i])
EstContratos<-EstContratos %>%
filter (!row_number() %in% c (1:15))
EstContratos<-Filter(function(x)!all(is.na(x)), EstContratos)
colnames(EstContratos)<-c("Cap","UM","CantContr","VrUni","VrUniTot","VrTotalContr",
"CantEjec","VrTotalEjec","CantXEjec","VrXEjec")
EstContratos<-EstContratos%>%
select(Cap,VrTotalContr,VrTotalEjec,VrXEjec)
EstContratos$VrTotalContr<- gsub(",", "",EstContratos$VrTotalContr)
EstContratos$VrTotalEjec<- gsub(",", "",EstContratos$VrTotalEjec)
EstContratos$VrXEjec<-gsub(",", "",EstContratos$VrXEjec)
EstContratos<-EstContratos %>% drop_na(Cap)
cols.name <- c("VrTotalContr","VrTotalEjec","VrXEjec")
EstContratos[cols.name] <- sapply(EstContratos[cols.name],as.numeric)
EstContratos[c('CapCod', 'Capítulo')] <- str_split_fixed(EstContratos$Cap, ' ', 2)
EstContratos<-EstContratos%>%
select(CapCod,VrTotalContr,VrTotalEjec)
ResumenContratos<-aggregate(EstContratos[,2:3], by = list(EstContratos$CapCod), sum)
datalist[[i]] <- ResumenContratos
}
ConsolidadoContratos = do.call(rbind, datalist)
I'm not sure why it says assigned data has 3 rows when its the same EstContratos dataframe that in the second iteration has 1 row. Maybe it's taking the EstContratos dataframe from the first iteration which does in fact have 3 rows, but I don't know how that's possible.
(Sorry if it's not clear its my first time posting a question)

If I understand correctly and your intention is to transform those columns to numeric type, you can try:
EstContratos[ , cols.name] <- lapply(EstContratos[ , cols.name], as.numeric)
or
EstContratos[ , cols.name] <- apply(EstContratos[ , cols.name], 2, as.numeric)

Related

Read list of files with inconsistent delimiter/fixed width

I am trying to find a more efficient way to import a list of data files with a kind of awkward structure. The files are generated by a software program that looks like it was intended to be printed and viewed rather than exported and used. The file contains a list of "Compounds" and then some associated data. Following a line reading "Compound X: XXXX", there are a lines of tab delimited data. Within each file the number of rows for each compound remains constant, but the number of rows may change with different files.
Here is some example data:
#Generate two data files to be imported
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t25.2",
"\n2\tA4567\tQC\t26.8\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tA1234\tQC\t51.1",
"\n2\tA4567\tQC\t48.6\n",
file = "test1.txt")
cat("Quantify Compound Summary Report\n",
"\nPrinted Mon March 28 14:54:39 2022\n",
"\nCompound 1: One\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t25.2",
"\n2\tC4567\tQC\t26.8",
"\n3\tC8910\tQC\t25.4\n",
"\nCompound 2: Two\n",
"\tName\tID\tResult",
"\n1\tC1234\tQC\t51.1",
"\n2\tC4567\tQC\t48.6",
"\n3\tC8910\tQC\t45.6\n",
file = "test2.txt")
What I want in the end is a list of data frames, one for each "Compound", containing all rows of data associated with each compound. To get there, I have a fairly convoluted approach of smashed together functions which give me what I want but in a very unruly fashion.
library(tidyverse)
## Step 1: ID list of data files
data.files <- list.files(path = ".",
pattern = ".txt",
full.names = TRUE)
## Step 2: Read in the data files
data.list.raw <- lapply(data.files, read_lines, skip = 4)
## Step 3: Identify the "compounds" in the data file output
Hdr.dat <- lapply(data.list.raw, function(x) grepl("Compound", x)) # Scan the file and find the different compounds within it (this can be applied to any Waters output)
grp.dat <- Map(function(x, y) {x[y][cumsum(y)]}, data.list.raw, Hdr.dat)
## Step 4: Unpack the tab delimited parts of the export file, then generate a list of dataframes within a list of imported files
Read <- function(x) read.table(text = x, sep = "\t", fill = TRUE, stringsAsFactors = FALSE)
raw.dat <- Map(function(x,y) {Map(Read, split(x, y))}, data.list.raw, grp.dat)
## Step 5: Curate the list of compounds - remove "Compound X: "
cmpd.list <- lapply(raw.dat, function(x) trimws(substring(names(x), 13)))
## Step 6: Rename the headers for the dataframes, remove the blank rows and recentre
NameCols <- function(z) lapply(names(z), function(i){
x <- z[[ i ]]
colnames(x) <- x[2,]
x[c(-1,-2),]
})
data.list <- Map(function(x,y){setNames(NameCols(x), y)}, raw.dat, cmpd.list)
## Step 7: rbind the data based on the compound
cmpd_names <- unique(unlist(sapply(data.list, names)))
result <- list()
j <- for (n in cmpd_names) {
result[[n]] <- map(data.list, n)
}
list.merged <- map(result, dplyr::bind_rows)
list.merged <- lapply(list.merged, function(x) x %>% filter(Name != ""))
The challenge here is script efficiency as far as time (I can import hundreds or thousands of data files with hundreds of lines of data, which can take quite a while) as well as general "cleanliness", which is why I included tidyverse as a tag here. I also want this to be highly generalizable, as the "Compounds" may change over time. If someone can come up with a clean and efficient way to do all of this I would be forever in your debt.
See one approach below. The whole pipeline might be intimidating at first glance. You can insert a head (or tail) call after each step (%>%) to display the current stage of data transformation. There's a bit of cleanup with regular expressions going on in the gsubs: modify as desired.
intermediate_result <-
data.frame(file_name = c('test1.txt','test2.txt')) %>%
rowwise %>%
## read file content into a raw string:
mutate(raw = read_file(file_name)) %>%
## separate raw file contents into rows
## using newline and carriage return as row delimiters:
separate_rows(raw, sep = '[\\n\\r]') %>%
## provide a compound column for later grouping
## by extracting the 'Compound' string from column raw
## or setting the compound column to NA otherwise:
mutate(compound = ifelse(grepl('^Compound',raw),
gsub('.*(Compound .*):.*','\\1', raw),
NA)
) %>%
## remove rows with empty raw text:
filter(raw != '') %>%
## filling missing compound values (NAs) with last non-NA compound string:
fill(compound, .direction = 'down') %>%
## keep only rows with tab-separated raw string
## indicating tabular data
filter(grepl('\\t',raw)) %>%
## insert a column header 'Index' because
## original format has four data columns but only three header cols:
mutate(raw = gsub(' *\\tName','Index\tName',raw))
Above steps result in a dataframe with a column 'raw' containing the cleaned-up data as string suited for conversion into tabular data (tab-delimited, linefeeds).
From there on, we can either proceed by keeping and householding the future single tables inside the parent table as a so-called list column (Variant A) or proceed with splitting column 'raw' and mapping it (Variant B, credits to #Dorton).
Variant A produces a column of dataframes inside the dataframe:
intermediate_result %>%
group_by(compound) %>%
## the nifty piece: you can store dataframes inside a dataframe:
mutate(
tables = list(read.table(text = raw, header = TRUE, sep = '\t' ))
)
Variant B produces a list of dataframes named with the corresponding compound:
intermediate_result %>%
split(f = as.factor(.$compound)) %>%
lapply(function(x) x %>%
separate(raw,
into = unlist(
str_split(x$raw[1], pattern = "\t"))
)
)

Converting list of Characters to Named num in R

I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
These names in column 1 are a bunch of named results of the cor.test function. The second column should consist of the correlation coefficents I get by writing ABC_D1$estimate, ABC_D2$estimate.
My problem is now that I dont want to add the $estimate manually to every single name of the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesnt work, it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character
How can I get the numeric result for ABC_D1$estimate in my dataframe? How can I convert these characters into Named num? The 3rd column should constist of the results of $p.value.
As pointed out by #DSGym there are several problems, including the it is not very convenient to have a list of character names, and it would be better to have a list of object instead.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
dat_l <- get(dat)
dat_l[["estimate"]]
}
)
cbind(name_list, estimates)
This is not really advisable but given those premises...
Ok I think now i know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You connect the two strings and use the functions parse and eval the get your results.
This it how to do it for your whole data.frame:
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))

Vectorized use of the substring function for a row selection of a dataframe with different length

My dataframe has a column named Code of the type char which goes like b,b1,b110-b139,b110,b1100,b1101,... (1602 entries)
I am trying to select all the entries that match the strings in a vector and all the ones that start with the same string.
So lets say I have the vector
Selection=c("b114","d2")
then i want all codes like b114, b1140, b1141, b1142, ... as well as d2, d200, d2000, d2001, d2002, d2003 etc...
what does work in principle is to create a new dataframe like this:
bTable <- TreeMapTable[substr(TreeMapTable$Code,1,4)=="b114"|substr(TreeMapTable$Code,1,2)=="d2",]
which gives me all the data i want, but requires me to manually type the condition for each entry and i just want to give the script a vector with the strings.
I tried to do it like this:
SelectionL=nchar(Selection)
Beispieltable <- TreeMapTable[substr(TreeMapTable$Code,1,AuswahlL)==Auswahl1,]
but this gives me somehow only half of the required entries and i confess i don't really know what it is doing. I know i could use a for loop but from everything i read so far, loops should be avoided and the problem should be solveable by use of vectors.
sample data
df <- data.frame( Code = c("b114", "b115", "b11456", "d2", "d12", "d200", "db114"),
stringsAsFactors = FALSE)
Selection=c("b114","d2")
answer
library( dplyr )
#create a regex pattern to filter on
pattern <- paste0( "^", Selection, collapse = "|" )
#filter out all rows wher 'Code' dows not start with the entries from 'Selection'
df %>% filter( grepl( pattern, Code, perl = TRUE ) )
# Code
# 1 b114
# 2 b11456
# 3 d2
# 4 d200

Filtering multiple data sets using lapply

I am trying to filter rows out from a dataframe and write the resulting shorter dataframe to a new file.
I can get as far as getting to work on individual files, but trying to use lappy to run this process over multiple files (and giving different names to the output files) is proving troublesome.
Im trying to filter out rows based on whether the values in "aaSeqCDR3" contain "_" or "*"
so far I have:
productseq <-function(x){
#establish filter criteria
filter <- c("\\*", "_")
#Filter data set to new variable
df2 <- df[!grepl(paste(filter, collapse = "|"), df$aaSeqCDR3),]
write.delim(df2, "df2.txt", sep= " ")}
however trying to apply it to a vector containing multiple data frame names (names)
nameproduct <- lapply(names, productiveseq)
i get the error:
error in UseMethod("filter_") :
no applicable method for 'filter_' applied to an object of class "character"
Im very lost at the moment, and would appreciate any insight.
An example dataframe is below:
ID allDHitsWithScore allJHitsWithScore allCHitsWithScore aaSeqCDR3
0 290 0.031402274 TGTGCCAGCGGCAGCCCCAATTCACCCCTCCACTTT CASGSPNSPLHF
1 168 0.018191662 TGTGCTCTGAGTGATCAGAATAAGGGCAGGAGAGCACTTACTTTT CALSDQNKGRRALTF
2 49 0.005305902 TGTGCAGTCTCCAAAGCTGCAGGCAACAAGCTAACTTTT CAVSKAAGNKLTF
3 16 0.001732539 TGCAGTGCTAGAGGGCGCTTAGCCAAAAACATTCAGTACTTC CSARGRLAKNIQYF
4 15 0.001624256 TGTGCCTGAAGGAATGCAGGCAAATCAACCTTT CA*RNAGKSTF
5 14 0.001515972 TGCAGTGCTAGAGTTGGACAGGGAGGGTTCTTC CSARVGQGGFF
6 13 0.001407688 TGTGCCAGCAGTTACTTGGGACAGGGGGGAAACATTCAGTACTTC CASSYLGQGGNIQYF
7 12 0.001299404 TGTGCCAGCAGTTTATGGGACTAGCGGGGGGTTCGAGCTCCTACAATGAGCAGTTCTTC CASSLWD*RG_SSSYNEQFF
Because you are passing a character vector of data frame names and not data frame objects themselves, use get inside your function.
Also, do note you are writing to same file, df2.txt, so this same file will be overwritten with each iteration. To resolve, paste the x character value to text file name. And be sure to return data frame instead of NULL from write.delim call being last line of function.
productseq <- function(x) {
# Retrieve data frame
df <- get(x)
# Establish filter criteria
filter <- c("\\*", "_")
# Filter data set to new variable
df2 <- df[!grepl(paste(filter, collapse = "|"), df$aaSeqCDR3),]
write.delim(df2, paste0(x, ".txt"), sep= " ")
# Return filtered data
return(df2)
}
# LIST OF FILTERED DATA FRAMES EACH EXPORTED TO .txt FILE
nameproduct <- lapply(names, productiveseq)

Creating dataframe in R loop and naming it

I am working with 5 data frames that I want to filter (eliminating some rows if they match a regex). Because all data frames are similar, with the same variable names, I stored them in a list and I'm iterating it. However, when I want to save the filtered data for each of the original data frame, I find that it creates an i_filtered (instead of dfName_filtered) so every time the loop runs, it gets overwritten.
Here's what I have in the loop:
for (i in list_all){
i_filtered1 <- i[i$chr != filter1,]
i_filtered2 <- i[i$chr != filter2,]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(i_filtered2, file="/home/tama/Desktop/i_filtered.csv")
}
As I said, filter1 and filter2 are just regex that I'm using to filter the data in the chr column.
What's the correct way to assign the original name + "_filtered" to the new dataframe?
Thanks in advance
Edited to add info:
Each dataframe has these variables (but values can change)
chr start end length
chr1 10400 10669 270
chr10 237646 237836 191
chrX 713884 714414 531
chrUn 713884 714414 531
chr1 762664 763174 511
chr4 805008 805571 564
And I have stored all them in a list:
list_all <- list(heep, oe, st20_n, st20_t,all)
list_all <- lapply(list_all, na.omit)
The filters:
#Get rid of random chromosomes
filter1=".*random"
#Get rid of undefined chromosomes
filter2 = "ĉhrUn.*
The output I'm looking for is:
heep_filtered1
heep_filtered2
oe_filtered1
oe_filtered2
etc
One possibility is to iterate over a sequence of indices (or names), rather than over the list of data-frames itself, and access the data-frames using the indices.
Another problem is that the != operator doesn't support regular expressions. It only does exact literal matches. You need to use grepl() instead.
names(list_all) <- c("heep", "oe", "st20_n", "st20_t", "all")
filtered <- NULL
for (i in names(list_all)){
df <- list_all[[i]]
df.1 <- df[!grepl(filter1, df$chr), ]
df.2 <- df[!grepl(filter2, df$chr), ]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(df.2, file=paste0("/home/tama/Desktop/", i, "_filtered.csv"))
filtered[[paste0(i, "_filtered", 1)]] <- df.1
filtered[[paste0(i, "_filtered", 2)]] <- df.2
}
The result is a list called filtered that contains the filtered data-frames.
The issue is that i is only interpreted specially when it is alone. You are using it as part of other names, and as a character in the current version.
I would suggest naming the list, then using lapply instead of a for loop (note that I also changed the filter to occur in one step, since right now it is unclear if you are trying to take both things out or not -- this also makes it easier to add more filters).
filters <- c(".*random", "chrUn.*")
list_all <- list(heep = heep
, oe = oe
, st20_n = st20_n
, st20_t = st20_t
, all = all)
toLoop <- names(list_all)
names(toLoop) <- toLoop # renames them in the output list
filtered <- lapply(toLoop, function(thisSet)){
tempFiltered <- list_all[[thisSet]][!(list_all[[thisSet]]$chr %in% filters),]
#Write the result filtered table in a csv file
#Change output directory if needed
write.csv(tempFiltered, file=paste0("/home/tama/Desktop/",thisSet,"_filtered.csv"))
# Return the part you care about
return(tempFiltered)
}

Resources