Issue with combining multiple separate data points in R

I have X number of spreadsheets with information spread over two tabs, and I am looking to combine these into one data frame.
Each file has 3 distinct cells on tab 1 (D6, D9, D12), and tab 2 has a grid (D4:G6); I want to pull these out of each spreadsheet into a single row.
So far I have made the data frame and pulled a list of the files. I have managed to get a for-loop working that pulls out the data from sheet 1 cell D6, and I plan to copy this code for the rest of the cells I need.
file.list <-
  list.files(
    path = "filepath",
    pattern = "\\.xlsx$",
    full.names = TRUE,
    recursive = FALSE
  )
colnames <- c("A", "B", "C", "etc")
output <- matrix(NA, nrow = length(file.list), ncol = length(colnames), byrow = FALSE)
colnames(output) <- colnames
rownames(output) <- file.list
for (i in 1:length(file.list)) {
  filename <- file.list[i]
  data <- read.xlsx(file = filename, sheetIndex = 1, colIndex = 7, rowIndex = 6)
  assign(x = filename, value = data)
}
The issue I have is that R then creates X separate single-value objects, and I am unable to bring these together as one row of multiple data points to insert into the data frame.
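A minimal sketch of one way around this, assuming the xlsx package used above: write each value straight into the preallocated output matrix instead of creating a separate object per file with assign(). The rowIndex/colIndex values below are illustrative guesses (column D assumed to be colIndex 4), and the sketch assumes output was created with enough columns (3 single cells plus 12 grid cells); adjust to the real layout.
for (i in seq_along(file.list)) {
  # sheet 1, single cells D6, D9, D12 (assuming column D = colIndex 4)
  output[i, "A"] <- read.xlsx(file.list[i], sheetIndex = 1,
                              rowIndex = 6, colIndex = 4, header = FALSE)[1, 1]
  output[i, "B"] <- read.xlsx(file.list[i], sheetIndex = 1,
                              rowIndex = 9, colIndex = 4, header = FALSE)[1, 1]
  output[i, "C"] <- read.xlsx(file.list[i], sheetIndex = 1,
                              rowIndex = 12, colIndex = 4, header = FALSE)[1, 1]
  # sheet 2 grid D4:G6, flattened column-wise into the remaining columns
  grid <- unlist(read.xlsx(file.list[i], sheetIndex = 2,
                           rowIndex = 4:6, colIndex = 4:7, header = FALSE))
  output[i, 4:(3 + length(grid))] <- grid
}
output <- as.data.frame(output)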

Related

Loop through files in R and select rows by string

I have a large number of CSV files. I need to extract relevant data from each file, and compile all of the relevant data into a new file.
I have been copying and pasting the code below and changing relevant details (e.g., the file name) to repeat the same process for many CSV files. After that, I use cbind()/write.xlsx() to combine all of the relevant data and write it to an Excel file. I need a more efficient method to accomplish this task.
How can I:
incorporate a loop that imports a large number of CSV files (to replace #1 below)
select relevant rows based on a string instead of entering specific row numbers (to replace #2 below)
combine all of the relevant data into a single data frame with each file's data in one column
library(tidyr)
# 1 - import raw data files
file1 <- read.csv("1.csv", header = FALSE, sep = "\n")
# 2 - select relevant rows
file1 <- as.data.frame(file1[c(41:155), ])
colnames(file1) <- c("file1")
# separate components of each line from raw csv file / isolate data
temp1 <- separate(file1, file1, into = c("Text", "IntNum", "Data"), sep = "\\s")
temp1 <- temp1$Data
temp1 <- as.data.frame(temp1)
If the number of relevant rows in each file is the same, you could do it like this. Option 1 shows a solution using a loop; option 2 shows a solution using sapply.
As a first step, I generate three csv files to make the code reproducible. The start row in each file is marked by "start" and the end row by "end". I then get a list of the names of these files with dir().
#make csv-files, target vector always same length (3)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(x = c(rep(0, sample(1:10, 1)), "start",
                         paste0("dat", i),
                         "end", rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#get list of file names
allFiles <- dir(pattern = glob2rx("*.csv"))
Option 1 - loop
For the loop you could first initialize a result data frame ("outDF") with the number of columns set to the number of csv-files and the number of rows set to the length of the target vector in each file ("start" to "end"). You can then loop over the files and fill the data frame. The start and end rows can be indexed using which().
#initialise result data frame
outDF <- data.frame(matrix(nrow = 3, ncol = length(allFiles),
                           dimnames = list(NULL, allFiles)))
#loop over csv files
for (iFile in allFiles) {
  idat <- read.csv(iFile, stringsAsFactors = FALSE) #read csv
  outDF[, iFile] <- idat[which(idat$x == "start"):which(idat$x == "end"), ]
}
Option 2 - sapply
Instead of a loop you could use sapply with a custom function to extract the relevant rows in each file. This returns a matrix which you could then transform into a dataframe.
out <- sapply(allFiles, FUN = function(x) {
  idat <- read.csv(x, stringsAsFactors = FALSE)
  return(idat[which(idat$x == "start"):which(idat$x == "end"), ])
})
outDF <- as.data.frame(out)
If the number of rows between "start" and "end" differs between files, the above options won't work. In this case you could first use lapply() (similar to option 2) to generate a result list whose elements have different lengths, then pad the shorter elements with NAs before transforming the result into a data frame again.
#make csv-files with target vectors of different lengths (3:12)
set.seed(1)
for (i in 1:3) {
  df <- data.frame(x = c(rep(0, sample(1:10, 1)), "start",
                         rep(paste0("dat", i), sample(1:10, 1)),
                         "end", rep(0, sample(1:10, 1))))
  write.csv(df, file = paste0("file", i, ".csv"), quote = FALSE, row.names = FALSE)
}
#lapply
out <- lapply(allFiles, FUN = function(x) {
  idat <- read.csv(x, stringsAsFactors = FALSE)
  return(idat[which(idat$x == "start"):which(idat$x == "end"), ])
})
#pad shorter elements with NA, then bind columns into a data frame
out <- lapply(out, `length<-`, max(lengths(out)))
outDF <- as.data.frame(do.call(cbind, out))

Trying to rename multiple.csv files using data contained within the file

A machine I use spits out .csv files named by the time, but I need them named after the plate they were read from, which is contained within the file.
I created a list of files:
files <- list.files(path="", pattern="*.csv")
I then tried using a for-loop to first create a data frame from each file containing the 1st row only, then to create a variable from the relevant piece of data (the desired name), and then to rename the files.
for (x in files) {
  y <- read.csv(x, nrow = 1, header = FALSE, stringsAsFactors = TRUE)
  z <- y[2, 2]
  file.rename(x, z)
}
It didn't work. After 7 hours of trying (I am new to R) I am here. Please give simple advice; I have basically zero R experience.
I believe the following for loop does what the question asks, if the new filename is the value in the second column of the first line. If it is not, change nmax to the appropriate column number.
fls <- list.files(pattern = '\\.csv')
for (f in fls) {
  x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
  g <- paste0(x[2], '.csv')
  file.rename(f, g)
}
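As a hedged extension (not part of the original answer), you could guard against two files containing the same plate name, so an earlier output is not clobbered:
for (f in fls) {
  x <- scan(file = f, what = character(), nmax = 2, nlines = 1, sep = ',')
  g <- paste0(x[2], '.csv')
  if (file.exists(g)) {
    # target name already taken: skip rather than overwrite
    warning("Skipping ", f, ": ", g, " already exists")
  } else {
    file.rename(f, g)
  }
}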

How to do a vlookup for a few millions row, without running into memory problems?

I am looking to write code that can perform a vlookup.
I have several Excel files (each containing ~1m rows, the Excel file limit) that act as lookup tables.
I then have an Excel sheet with two columns that I need to look up; these two columns can't be merged, as the results must be kept separated.
First I load the files that contain the tables I want to look up:
All <- lapply(filenames_list, function(filename) {
  print(paste("Merging", filename, sep = " "))
  read.xlsx(filename)
})
df <- do.call(rbind.data.frame, All)
Then I load the files I want to look up:
LookUpID1 <- read.xlsx(paste(current_working_dir, "/LookUpIDs.xlsx", sep = ""),
                       sheet = 1, startRow = 1, colNames = TRUE, cols = 1,
                       skipEmptyRows = TRUE, skipEmptyCols = TRUE)
LookUpID2 <- read.xlsx(paste(current_working_dir, "/LookUpIDs.xlsx", sep = ""),
                       sheet = 1, startRow = 1, colNames = TRUE, cols = 2,
                       skipEmptyRows = TRUE, skipEmptyCols = TRUE)
I need to load the file twice, as I need to perform a lookup on both column 1 and 2.
And then the actual vlookup:
# Matching ID
FoundIDs1 <- merge(df, LookUpID1)
FoundIDs2 <- merge(df, LookUpID2)
FoundIDs <- merge(FoundIDs1, FoundIDs2, by = NULL)
The issue is that my PC runs out of memory when running the last part of the code (the actual vlookup):
Error: cannot allocate vector of size 1715.0 Gb
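No answer is shown here, but one hedged observation: merge(x, y, by = NULL) computes a Cartesian cross join of its inputs, which is what makes the last step explode to terabytes. A sketch of a keyed-join alternative using data.table, assuming the lookup column is named ID in all three tables (that name is an assumption; adjust it to the real data):
library(data.table)

dt <- as.data.table(df)  # the combined lookup table built above

# keyed joins return only matching rows instead of a cross product
FoundIDs1 <- dt[as.data.table(LookUpID1), on = "ID", nomatch = 0]
FoundIDs2 <- dt[as.data.table(LookUpID2), on = "ID", nomatch = 0]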

New variable from list.file

I am creating an object that lists all .csv files in a directory, reading them in according to some specifications, and merging them.
Before merging them, I want to take the first two letters of each file name and create a new column in each table reporting those two letters as a variable.
I got this far:
temp <- list.files(pattern = "\\.csv$")
myfiles <- lapply(temp, function(x) read.csv(x,
                                             header = TRUE,
                                             #sep = ";",
                                             stringsAsFactors = FALSE,
                                             encoding = "UTF-8",
                                             na.strings = c("NA", ""),
                                             colClasses = c("code" = "character")))
myfiles.final <- do.call(rbind, myfiles)
When I try to create the new variable, though, I generate a replacement that has double the rows of the data:
temp.2 <- lapply(temp, function(x) substr(x, start = 1, stop = 2))
myfiles.2 <- lapply(myfiles,
                    function(x) {
                      a <- temp.2[seq_along(myfiles)]
                      x$identifier <- rep(a, nrow(x))
                      return(x)
                    })
In the folder the files are named, for example, AA029893.csv, BB024593.csv, and so on. For the first table I just want a new column called "identifier" that has "AA" for all entries; for the second, "BB"; and so on.
Thanks a lot
lapply is good for iterating along one list (e.g., the myfiles data frames). To add a column to each data frame, you want to iterate in parallel over two lists: the list of data frames and the list of names. Map does this (for an arbitrary number of lists):
myfiles.2 <- Map(function(dd, nn) { dd$identifier <- nn; return(dd) },
                 dd = myfiles, nn = temp.2)
An easier alternative is to add the column post hoc:
myfiles.final <- do.call(rbind, myfiles)
myfiles.final$identifier <- rep(
  sapply(temp, function(x) substr(x, start = 1, stop = 2)),
  times = sapply(myfiles, nrow)
)
The easiest alternative is to use data.table::rbindlist or dplyr::bind_rows, either of which will automatically add an ID column based on the names of your list. Depending on the size of your data, they might be quite a bit faster as well.
names(myfiles) <- sapply(temp, function(x) substr(x, start = 1, stop = 2))
myfiles.2 <- dplyr::bind_rows(myfiles, .id = "identifier")
myfiles.2 <- data.table::rbindlist(myfiles, idcol = "identifier")

Extract Links from Excel Sheet using R-package 'openxlsx'

I'm working with an Excel sheet in which some columns contain hyperlinks that are represented as text that is completely different from the actual address the hyperlinks point to. I want to use some R code to modify and subset the Excel sheet but keep the hyperlinks. I think I can do this by extracting those hyperlinks as an indexed character vector then re-introducing them into a new Excel document using the makeHyperlinkString() and writeFormula() functions. But I cannot figure out how to get a vector of the links themselves.
In case it matters, my intention is to do all the modifying and subsetting on a data.frame version of the Excel sheet rather than a workbook object.
Oh, now I think I get your problem. I thought there were only normal hyperlinks, not Excel hyperlinks.
I think this may help you to get a vector of the hyperlinks, although it's a bit messy.
library(openxlsx)
pathtofile <- "path to .xlsx file"
df1 <- read.xlsx(xlsxFile = pathtofile,
                 sheet = 1, skipEmptyRows = FALSE,
                 colNames = FALSE, rowNames = FALSE,
                 startRow = 1)
## Sheet or Tabelle (depending on the workbook's locale)
Sheet <- "Sheet" ## Or "Tabelle"
## Strip the sheet part of each link, leaving the cell reference (e.g. "B12")
rowIndex <- sub(x = df1[, 1], pattern = paste0("(#'", Sheet, "\\d+'!)"), replacement = "")
## Get the sheet each hyperlink points to
SheetName <- regmatches(df1[, 1], regexpr(text = df1[, 1], pattern = paste0("(", Sheet, "\\d+)")))
## Extract only the sheet number
SheetIndex <- as.numeric(sub(x = SheetName, pattern = Sheet, replacement = ""))
## Get the row indexes as numeric
RowIndexNum <- as.numeric(regmatches(rowIndex, regexpr(text = rowIndex, pattern = "\\d+")))
## Get the column letter as character
RowIndexName <- sub(x = rowIndex, pattern = "\\d+", "")
## Create uppercase letters
myLetters <- toupper(letters[1:26])
## Convert the column letter to a numeric index (based on alphabetical order)
RowIndexNameNum <- match(RowIndexName, myLetters)
## Hyperlinks in one sheet or spread over several sheets
if (length(unique(SheetIndex)) == 1) {
  dfLinks <- read.xlsx(xlsxFile = pathtofile,
                       sheet = unique(SheetIndex),
                       skipEmptyRows = FALSE,
                       colNames = FALSE, rowNames = FALSE,
                       rows = RowIndexNum[1]:tail(RowIndexNum, 1),
                       cols = unique(RowIndexNameNum),
                       startRow = 1)
} else {
  dfLinks <- data.frame()
  for (i in unique(SheetIndex)) {
    dfTmp <- read.xlsx(xlsxFile = pathtofile,
                       sheet = i,
                       skipEmptyRows = FALSE,
                       colNames = FALSE, rowNames = FALSE,
                       rows = RowIndexNum[1]:tail(RowIndexNum, 1),
                       cols = unique(RowIndexNameNum),
                       startRow = 1)
    dfLinks <- rbind(dfLinks, dfTmp)
  }
}
dfLinks
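To re-introduce the extracted links into a new workbook, as the question intends, here is a hedged sketch using writeFormula() with Excel's HYPERLINK() function; links stands for a hypothetical character vector of extracted addresses, and the display texts are made up:
library(openxlsx)

wb <- createWorkbook()
addWorksheet(wb, "Links")

## write each address as a HYPERLINK formula, so the display text
## stays independent of the underlying address
for (i in seq_along(links)) {
  writeFormula(wb, sheet = "Links",
               x = paste0('HYPERLINK("', links[i], '", "link ', i, '")'),
               startCol = 1, startRow = i)
}
saveWorkbook(wb, "links_out.xlsx", overwrite = TRUE)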
This is what my Excel file looks like: (screenshot omitted)
