I have multiple files containing data in tabulated form. I want to generate a 3-dimensional array in which the data from each file are stored along the third dimension. For example, if I have 10 files, the data from the first file will be stored in the first layer of the 3D array, the data from the second file in the second layer, and so on.
Here is the dummy code I am using, but it does not work correctly.
# reading data from the file (I have a list of file names as fname)
dataDum <- read.table(fname[i],header = F, sep =';', skip=121, stringsAsFactors = FALSE)
# Assigning data to the array. I have already generated an empty array with the desired dimensions
finaldata[, , i]=dataDum
It is not clear why your code is not working properly, as there is no reproducible example. As written, it should work correctly, given that the inputs are as expected. For example:
arr <- array(data = 0, dim = c(10, 10, 3))
for (i in 1:3) {
  mat <- matrix(rnorm(10^2), ncol = 10)
  arr[,,i] <- mat
}
arr
If an error occurs, it is likely because dataDum is a data.frame; explicitly coercing it with as.matrix(dataDum) would fix such an issue.
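For example, a minimal sketch of the full loop with that coercion applied (nRows and nCols are hypothetical placeholders for the per-file dimensions; every file is assumed to yield a numeric table of the same size):
finaldata <- array(0, dim = c(nRows, nCols, length(fname)))
for (i in seq_along(fname)) {
  dataDum <- read.table(fname[i], header = FALSE, sep = ";", skip = 121,
                        stringsAsFactors = FALSE)
  # coerce the data.frame to a matrix before assigning it to layer i
  finaldata[, , i] <- as.matrix(dataDum)
}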
My goal is to read many files into R and, ultimately, to run a root mean square error (RMSE) function on each pair of columns within each file.
I have this code:
library(readxl)      # for read_excel
library(data.table)  # for rbindlist
#This collects the names of all the matching files
filnames <- dir("~/Desktop/LGsampleHUCsWgraphs/testRSMEs", pattern = "*_45Fall_*")
#This reads each file
read_data <- function(z){
  dat <- read_excel(z, skip = 0)
  return(dat)
}
#This combines them into one list and splits them by the names in the first column
datalist <- lapply(filnames, read_data)
bigdata <- rbindlist(datalist, use.names = T)
splitByHUCs <- split(bigdata, f = bigdata$HUC...1, sep = "\n", lex.order = TRUE)
So far, all is working well. Now I want to apply an RMSE analysis [library(Metrics)] to each of the "splits" created above. I don't know what to call the "splits". Here I have used names, but that is the name of a base R function and doesn't work. I tried the bigdata object, but that didn't work either. I also tried to use splitByHUCs and rMSEs.
rMSEs <- sapply(splitByHUCs, function(x) rmse(names$Predicted, names$Actual))
write.csv(rMSEs, file = "~/Desktop/testRMSEs.csv")
The rmse code works fine when I run it on a single file and create a name for the dataframe:
read_excel("bcc1_45Fall_1010002.xlsm")
bcc1F1010002 <- read_excel("bcc1_45Fall_1010002.xlsm")
rmse(bcc1F1010002$Predicted, bcc1F1010002$Actual)
The "splits" are named by the "splitByHUCs" script, like this:
They are named for the file they came from, appropriately. I need some kind of reference name for the rmse formula and I don't know what it would be. Any ideas? Thanks. I made some small versions of the files, but I don't know how to add them here.
As it is a list, we can loop over it with sapply/lapply as in the OP's code, but names$ is incorrect: the anonymous function's argument is x, which stands for each element of the list in turn (i.e. a data.frame). Therefore, instead of names$, use x$:
sapply(splitByHUCs, function(x) rmse(x$Predicted, x$Actual))
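To see what x stands for, the first iteration is equivalent to running the single-file version on the first split:
rmse(splitByHUCs[[1]]$Predicted, splitByHUCs[[1]]$Actual)
sapply() also keeps the names of splitByHUCs, so the resulting vector is labelled by HUC and can be written out with write.csv() exactly as in the question.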
I am trying to create a loop where I select one file name from a list of file names and use that one file to run read.capthist, then discretize, secr.fit, derived, and finally save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try to select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally, I would also like my outputs to have unique identifiers; since I am simulating data and will have to compare all the results, I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running it separately (this code works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
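A minimal sketch of one way around this (standard R indexing, not a tested secr workflow; simtraps is assumed to be defined as in the question): lst is an empty list, so capt <- lst[i] hands read.capthist a one-element list rather than a single file name. Indexing the character vector files with [i] (or a list with [[i]]) yields one file name, and building the save paths from i keeps each iteration's output separate:
library(secr)
files <- list.files(pattern = "female*")
for (i in seq_along(files)) {
  capt <- files[i]  # a single file name, not a one-element list
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX',
                  method = 'BFGS', trace = FALSE, CL = TRUE)
  # unique file names per iteration, so nothing is overwritten
  save(fit, file = paste0("C:/temp/fit_", i, ".Rdata"))
  D.fit <- derived(fit)
  save(D.fit, file = paste0("C:/temp/D.fit_", i, ".Rdata"))
}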
I have a list of URLs that point to different XML files, and I want to extract some info from them using R and the XML package.
I am trying to do this with a for loop.
I have this code, but it gives me only the last XML (number numtotal). How can I read all of them?
for (i in seq(from = 1, to = numtotal, by = 1)){
  urli <- xmlParse(urls[[i]], useInternalNodes = TRUE)
  top_numberi <- xmlRoot(urli)
  GS = data.frame(GS = xpathSApply(top_numberi, "//a//b", xmlValue))
}
where:
urls is a list of 7 or more URLs
numtotal is the length of another list (numeric value)
Every iteration of your for loop overwrites your GS data frame. Instead of using a data frame, create a list outside the loop:
l <- list()
Then fill in the elements inside the loop, using [[ to assign each result as one list element:
l[[i]] <- xpathSApply(top_numberi, "//a//b", xmlValue)
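A minimal sketch of the full loop with this change (assuming urls and numtotal are as described in the question):
library(XML)
l <- list()
for (i in seq_len(numtotal)) {
  urli <- xmlParse(urls[[i]], useInternalNodes = TRUE)
  top_numberi <- xmlRoot(urli)
  # store this document's values as the i-th list element
  l[[i]] <- xpathSApply(top_numberi, "//a//b", xmlValue)
}
# afterwards, combine all the values into a single data frame
GS <- data.frame(GS = unlist(l))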
As an aside, this is a very basic question. You should read some standard R textbooks before proceeding much further.
As an R noob, I'm currently stumped by what is probably a rather trivial problem. I have data that looks like the second image below: essentially a long sheet of rows with values in three columns. What I need is a way to scan the sheet looking for particular combinations of values in the first and second columns, combinations that are specified in a second spreadsheet of targets (see picture 1). When such a combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
  targets[[i]] <- c(targetsDb[i,1], targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of columns 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
  SID_IA[[n]] <- c(data[n,1], data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
  for(i in targets){
    if(SID_IA[[n]] %in% targets){
      print(SID_IA[[n]], "in sentence", data[n,1], "is ", data[n,3])
    } else {
      print(FALSE)
    }
    SID_IA[[n]] <- NULL
  }
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
OK, I'm attempting an answer that should require a minimum of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows of data appear exactly the same in target, then it is as simple as a merge:
merge(target, data, all = F)
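merge() with all = F performs an inner join on every column name the two data frames share, so it returns exactly the rows of data whose first two columns match a row of target. The join columns can also be named explicitly, for example (SID and IA are hypothetical stand-ins for your two shared column names):
merge(target, data, by = c("SID", "IA"), all = FALSE)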
If, on the other hand, you want to keep the whole data table with the target rows marked, then the easiest way is to make an index column:
data$indx <- 1:nrow(data)             # remembers each row's original position
data
mrg <- merge(target, data, all = F)   # the matching rows, carrying their indx
data$test <- rep("test", nrow(data))
data$test[mrg$indx] <- "target"       # mark the matched rows
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialogue to open data files; check out file.choose().
(Posted on behalf of the OP).
Following on from #R.S.'s suggestion, which didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines and then how to remove them from the original data, outputting both results. The trick is that rbind-ing the extracted rows back onto data makes every target row appear twice, so keeping only the rows that are not duplicated in either direction removes them. I'm leaving this here for future reference and considering this solved.
extracted <- merge(targets, data, all = F)
write.xlsx(extracted, output_file1)
combined <- rbind(data, extracted)
minus.target <- combined[!duplicated(combined, fromLast = FALSE) & !duplicated(combined, fromLast = TRUE), ]
write.xlsx(minus.target, output_file2)
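For reference, the same removal can be expressed as an anti-join; a sketch using dplyr (a different package from the xlsx-based code above, shown only as an alternative):
library(dplyr)
# rows of data that have no matching combination in targets
minus.target <- anti_join(data, targets)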
I have a very large XML file (>70 GB) from which I only need to read some segments. However, I also don't know the structure of the file, and I have failed to extract it due to the file's size.
I don't need to read the full file or convert it to a data frame, only to extract specific parts; but I don't know the specific format of those segments, since I don't have the structure.
I tried using xmlParse, and also using xmlEventParse based on what is suggested here:
How to read large (~20 GB) xml file in R?
The code suggested there returns an empty data frame:
xmlDoc <- "Final.xml"
result <- NULL
# function to use with xmlEventParse
row.sax = function() {
  ROW = function(node){
    children <- xmlChildren(node)
    children[which(names(children) == "text")] <- NULL
    result <<- rbind(result, sapply(children, xmlValue))
  }
  branches <- list(ROW = ROW)
  return(branches)
}
# call the xmlEventParse
xmlEventParse(xmlDoc, handlers = list(), branches = row.sax(),
              saxVersion = 2, trim = FALSE)
# and here is your data.frame
result <- as.data.frame(result, stringsAsFactors = F)
I have little experience working with XML, and so I don't fully understand the solution I tried to use.
Thanks for your help!
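Since the real obstacle is discovering the structure without loading 70 GB, one minimal first step (plain base R; it assumes the file is ordinary text XML) is to peek at the first few hundred lines to learn the element names, which the branches of xmlEventParse must match by name:
con <- file("Final.xml", "r")
head_lines <- readLines(con, n = 200)  # read only the first 200 lines
close(con)
cat(head_lines, sep = "\n")            # inspect which tag names appear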