I have some bloated R code that I'm trying to streamline. I'm trying to read spreadsheets into data frames and then transpose each one.
I have a vector of sheet names as follows:
var <- c("amp_genes.annotated.BLCA.txt","amp_genes.annotated.BRCA.txt")
for (i in var) {
  var[i] <- readWorksheet(wk, sheet="var[i]", header=T)
  var[i] <- as.data.frame(var[i])
  var[i] <- t(var1[i][3:ncol(var1[i]),])
}
The sheet = argument has to have double quotes around the string variable.
Running this just tells me I have an unexpected }.
Maybe try this. I'm not sure it will work since I don't have your spreadsheets, but give it a try and let me know; even if it doesn't work right out of the box, it should hopefully unblock you wherever you're stuck.
library(XLConnect)
wk <- loadWorkbook("workbookname.xls")
sheetnames <- getSheets(object = wk)
content.tr <- list()
# To access sheets by their names
for (sheetname in sheetnames) {
  content <- readWorksheet(wk, sheet=sheetname, header=T)
  content.tr[[sheetname]] <- t(content[3:ncol(content),])
}
# To access sheets by their position
for (pos in c(1, 2)) {
  content <- readWorksheet(wk, sheet=pos, header=T)
  content.tr[[sheetnames[pos]]] <- t(content[3:ncol(content),])
}
To access the dataframes:
names(content.tr)
spreadsheet1 <- content.tr[[1]]
spreadsheet2 <- content.tr[[2]]
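If the worksheets are actually named after the entries in your var vector (that's an assumption on my part), you can also pull them out by name rather than by position:
# assuming the sheet names match the entries of var from the question
blca <- content.tr[["amp_genes.annotated.BLCA.txt"]]
brca <- content.tr[["amp_genes.annotated.BRCA.txt"]]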
I am trying to combine some Excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5, except some only have sheets 2:3, 2:4, etc., which is why I include the try function. I need the range F6:AZ2183, and I am transposing the data.
The issue I am running into is that only the last file is saving into the data frame df.
I've attached the code below. If you have any ideas, I would much appreciate it!
Also, I'm a longtime lurker, first-time poster, so if my etiquette is poor, I apologize.
library(readxl)

df <- data.frame()
for (i in 1:50) {
  for (j in 2:5) {
    try({
      df.temp <- t(read_excel(paste0('FqReport', i, '.xlsx'), sheet = j, range = 'F6:AZ2183'))
      df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
      df <- rbind(df, df.temp)
      rm(df.temp)
      gc()
    }, silent = TRUE)
  }
}
You can read the sheet names available in each Excel file, which avoids the need for try. Also, growing a data frame in a loop is quite inefficient. Try this lapply approach:
library(readxl)

filename <- paste0('FqReport', 1:50, '.xlsx')

df <- do.call(rbind, lapply(filename, function(x) {
  sheet_name <- excel_sheets(x)[-1]
  do.call(rbind, lapply(sheet_name, function(y) {
    df.temp <- t(read_excel(x, y, range = 'F6:AZ2183'))
    df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
  }))
}))
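If you also want to keep track of which file and sheet each block of rows came from, one option (my own addition, not something the question asks for) is to tag each piece with identifier columns before binding:
library(readxl)

filename <- paste0('FqReport', 1:50, '.xlsx')

df <- do.call(rbind, lapply(filename, function(x) {
  sheet_name <- excel_sheets(x)[-1]
  do.call(rbind, lapply(sheet_name, function(y) {
    df.temp <- t(read_excel(x, y, range = 'F6:AZ2183'))
    df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), , drop = FALSE]
    # source_file and source_sheet are illustrative column names
    cbind.data.frame(source_file = x, source_sheet = y, df.temp)
  }))
}))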
Firstly, apologies as this may seem a bit long winded, but I hope to give as much information on this problem as I can...
I have written a script that loops through a set of files defined in a csv file. Each file within this csv listing is an XML file, each one is for a particular event in an application, and all files within this list are of the same event type. However, each file can contain different data. For instance, one could hold an attribute with no child nodes beneath, while others contain nodes.
My script works perfectly fine, but by the time it gets to about XML file 5000 it has slowed down considerably.
The problem is that my code creates a blank data frame initially and then grows it as new columns are detected.
I understand that this is a big NO NO when it comes to writing R for loops, but I am unsure how to get around it, given my smallest file listing is 69,000 files, which makes going through each one in turn and counting the nodes a task in itself.
Are there any ideas on how to get around this?
Pseudocode or actual R code to do this would be great, as would ideas/opinions, since I am unsure of the best approach to this task.
Here is my current code.
library(XML)
library(xml2)
library(plyr)
library(tidyverse)
library(reshape2)
library(foreign)
library(rio)
# Get file data to be used
#
setwd('c:/temp/xml')
headerNames <- c('GUID','EventId','AppId','RequestFile', 'AE_Type', 'AE_Drive')
GetNames <- rowid_to_column(read.csv(file= 'c:/temp/xml/R_EventIdA.csv', fileEncoding="UTF-8-BOM", header = FALSE, col.names = headerNames),'ID')
inputfiles <- as.character(GetNames[,5]) # Gets list of files
# Create empty dataframes
#
df <- data.frame()
transposed.df1 <- data.frame()
allxmldata <- data.frame()
findchildren <- function(nodes, df) {
  numchild <- sapply(nodes, function(x) {length(xml_children(x))})
  xml.value <- xml_text(nodes[numchild==0])
  xml.name <- xml_name(nodes[numchild==0])
  xml.path <- sapply(nodes[numchild==0], function(x) {gsub(', ','_', toString(rev(xml_name(xml_parents(x)))))})
  fieldname <- paste(xml.path, xml.name, sep = '_')
  contents <- sapply(xml.value, function(f) {is.na(f) <- which(f == ''); f})
  if (length(fieldname) > 0) {
    fieldname <- paste(fieldname, xml.value, sep = '_')
    dftemp <- data.frame(fieldname, contents)
    df <- rbind(df, dftemp)
    print(dim(df))
  }
  if (sum(numchild) > 0) {
    findchildren(xml_children(nodes[numchild>0]), df)
  } else {
    return(df)
  }
}
findchildren2 <- function(nodes, df) {
  numchild <- sapply(nodes, function(embeddedinputfile) {length(xml_children(embeddedinputfile))})
  xmlvalue <- xml_text(nodes[numchild==0])
  xmlname <- xml_name(nodes[numchild==0])
  xmlpath <- sapply(nodes[numchild==0], function(embeddedinputfile) {gsub(', ','_', toString(rev(xml_name(xml_parents(embeddedinputfile)))))})
  fieldname <- paste(xmlpath, xmlname, sep = '_')
  contents <- sapply(xmlvalue, function(f) {is.na(f) <- which(f == ''); f})
  if (length(fieldname) > 0) {
    dftemp <- data.frame(fieldname, contents)
    df <- rbind(df, dftemp)
    print(dim(df))
  }
  if (sum(numchild) > 0) {
    findchildren2(xml_children(nodes[numchild>0]), df)
  } else {
    return(df)
  }
}
# Loop all files
#
for (x in inputfiles) {
  df1 <- findchildren(xml_children(read_xml(x)), df)
  ## original xml dataframe
  if (length(df1) > 0) {
    xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents), fix.empty.names = TRUE)
  }
  ##
  xml.df1 %>%
    pluck('Response_RawData') -> rawxml
  if (length(rawxml) > 0) {
    df.rawxml <- data.frame(rawxml)
    export(df.rawxml, 'embedded.xml')
    embeddedinputfile <- as.character('embedded.xml')
    rm(df1)
    df1 <- findchildren2(xml_children(read_xml(embeddedinputfile)), df)
    if (length(df1) > 0) {
      xml.df2 <- spread(df1, key = fieldname, value = contents)
    }
    allxmldata <- rbind.fill(allxmldata, cbind(xml.df1, xml.df2))
  } else {
    allxmldata <- rbind.fill(allxmldata, cbind(xml.df1))
  }
}

if (nrow(allxmldata) == nrow(GetNames)) {
  alleventdata <- cbind(GetNames, allxmldata)
}
dbConn2 <- odbcDriverConnect('driver={SQL Server};server=PC-XYZ;database=Events;trusted_connection=true')
sqlSave(dbConn2, alleventdata, tablename = 'AE_EventA', append = TRUE )
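For what it's worth, a common way around the growing data frame is to build each file's result inside a function, collect the per-file results in a list, and combine them with a single rbind.fill call at the end. The sketch below is untested and simply reuses the findchildren/findchildren2 helpers and inputfiles defined above, so treat it as a starting point rather than a drop-in replacement:
library(xml2)       # read_xml(), xml_children()
library(tidyverse)  # spread(), pluck()
library(plyr)       # rbind.fill()
library(rio)        # export()

# Build the data frame for a single XML file, using the
# findchildren()/findchildren2() helpers defined above
parse_one_file <- function(x) {
  df1 <- findchildren(xml_children(read_xml(x)), data.frame())
  if (length(df1) == 0) return(NULL)
  xml.df1 <- data.frame(spread(df1, key = fieldname, value = contents),
                        fix.empty.names = TRUE)
  rawxml <- pluck(xml.df1, 'Response_RawData')
  if (length(rawxml) > 0) {
    # unpack the embedded XML exactly as the original loop does
    export(data.frame(rawxml), 'embedded.xml')
    df2 <- findchildren2(xml_children(read_xml('embedded.xml')), data.frame())
    if (length(df2) > 0) {
      return(cbind(xml.df1, spread(df2, key = fieldname, value = contents)))
    }
  }
  xml.df1
}

# One list element per file; a single rbind.fill at the end replaces
# the repeated allxmldata <- rbind.fill(allxmldata, ...) inside the loop
allxmldata <- rbind.fill(lapply(inputfiles, parse_one_file))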
I am trying to scrape baseball data from baseball-reference (e.g., https://www.baseball-reference.com/teams/NYY/2017.shtml). I have a huge vector of URLS that I created using a for loop, since the links follow a specific pattern. However, I am having trouble running my code, probably because I have to make too many connections within R. There are over 17000 elements in my vector, and my code stops working once it gets to around 16000. Is there an easier and perhaps a more efficient way to replicate my code?
require(Lahman)
require(XML)   # needed for htmlParse() and readHTMLTable() below

teams <- unique(Teams$franchID)
years <- 1871:2017

urls <- matrix(0, length(teams), length(years))
for (i in 1:length(teams)) {
  for (j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/",
                         teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)
list_of_batting <- list()
list_of_pitching <- list()

for (i in 1:length(url_vector)) {
  url <- url_vector[i]
  res <- try(readLines(url), silent = TRUE)
  ## check if website exists
  if (inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  } else {
    urltxt <- readLines(url)
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    tmp1 <- tables_full$players_value_batting
    tmp2 <- tables_full$players_value_pitching
    list_of_batting[[i]] <- tmp1
    list_of_pitching[[i]] <- tmp2
  }
  print(i)
  closeAllConnections()
}
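I can't test this against the live site, but two things often help with long scraping loops like this: reuse the page you already fetched with try() instead of calling readLines() a second time, and pause briefly between requests so the server is less likely to cut you off. A rough sketch along those lines (scrape_team_page is a helper name I made up, not part of the original code):
library(XML)

# Hypothetical helper: fetch and parse one team-season page,
# returning NULL when the page cannot be read
scrape_team_page <- function(url) {
  res <- try(readLines(url), silent = TRUE)
  if (inherits(res, "try-error")) return(NULL)
  # the value tables are wrapped in HTML comments, so strip the markers
  urltxt <- gsub("-->", "", gsub("<!--", "", res))
  readHTMLTable(htmlParse(urltxt))
}

list_of_batting  <- vector("list", length(url_vector))
list_of_pitching <- vector("list", length(url_vector))

for (i in seq_along(url_vector)) {
  tables_full <- scrape_team_page(url_vector[i])
  list_of_batting[[i]]  <- if (is.null(tables_full)) NA else tables_full$players_value_batting
  list_of_pitching[[i]] <- if (is.null(tables_full)) NA else tables_full$players_value_pitching
  Sys.sleep(1)   # small pause between requests
  if (i %% 100 == 0) print(i)
}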
I've got a problem in R.
I have loaded files from a folder (as a file list) using this method:
ff <- list.files(path=" ", full.names=TRUE)
myfilelist <- lapply(ff, read.table)
names(myfilelist) <- list.files(path=" ", full.names=FALSE)
In myfilelist the data frames are named A1.txt, A2.txt, A3.txt, etc.
Now I would like to use the i-th element of the list to change my data, for example deleting from each data frame the rows whose sum is 0.
I tried:
A1 <- A1[which(rowSums(A1) > 0),]
and it works.
How can I do it for all A[i] at once?
Try this code, which applies the filter to every data frame in the list and stores the result back in myfilelist:
myfilelist <- lapply(myfilelist, function(x) {
  x <- x[which(rowSums(x) > 0), ]
  return(x)
})
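If you then want the cleaned tables back as separate objects named after the files (A1.txt, A2.txt, and so on), you can push the list into the global environment; note this will overwrite any existing objects with those names:
# creates objects A1.txt, A2.txt, ... in the global environment
list2env(myfilelist, envir = .GlobalEnv)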
I want to write some R tables into an Excel file. So I have the following:
data <- list.files(path=getwd())
n <- length(data)

for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.table(outline)
  print(outline)                 # this prints all n tables
  write.csv(outline, 'Test.csv') # this only writes the last table
}
But I only get the last file written into the csv file. Not all of them. How would I fix this?
You're writing to Test.csv every time, so you keep overwriting the same file. You need to change the filename at each step to keep the different files.
Try:
data <- list.files(path=getwd())
n <- length(data)

for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.table(outline)
  print(outline) # this prints all n tables
  name <- paste0(i, "X.csv")
  write.csv(outline, name)
}
Looking at your code, perhaps you want this instead:
data <- list.files(path=getwd())
n <- length(data)

for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[,2]
  outline <- as.data.frame(table(outline))
  print(outline) # this prints all n tables
  name <- paste0(i, "X.csv")
  write.csv(outline, name)
}
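If you would rather name each output after its source file instead of a running counter (just a guess at what you might want), something along these lines works; the "_table.csv" suffix is arbitrary:
data <- list.files(path=getwd())
n <- length(data)

for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- as.data.frame(table(data1[,2]))
  # e.g. "results.csv" becomes "results_table.csv"
  name <- paste0(tools::file_path_sans_ext(data[i]), "_table.csv")
  write.csv(outline, name, row.names = FALSE)
}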