Saving a group of histograms in R as a data.frame

I am trying to save a histogram for every file in a list. I cannot load more than one file at a time due to their large size. Normally I would use a symbolic object name for each file's histogram and iterate the name for each item in the list. I am having trouble figuring out how to do this in R, so instead I attempted to save each hist as a column of a data.frame. The code is as follows:
filelist <- list.files("dir/")
file.hist <- data.frame(check.rows = FALSE)
for(i in 1:length(filelist)) {
file <- read.csv(capture.output(cat("dir/", filelist[i], sep = "")))
file.hist[[i]] <- hist(file$Value, breaks = 200)
}
The error message that results is:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(breaks = c(0, 200, :
replacement has 6 rows, data has 0
I have googled the error message and it seems like it might be related to how you go about initializing the data frame, although I have to admit that my brain is fried this close to Thanksgiving. Has anyone out there dealt with and solved a similar problem? I am not married to this approach.
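For what it's worth, one way around this is to keep each histogram in a named list rather than a data.frame, since hist() returns a list object that does not fit into a data.frame column. A minimal sketch, assuming each CSV has a Value column as in the loop above:
filelist <- list.files("dir/")
file.hist <- vector("list", length(filelist))
names(file.hist) <- filelist
for (i in seq_along(filelist)) {
  dat <- read.csv(file.path("dir", filelist[i]))                 # read one file at a time
  file.hist[[i]] <- hist(dat$Value, breaks = 200, plot = FALSE)  # store the hist object
}
Each element can then be plotted later, e.g. plot(file.hist[[1]]).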

Related

PupilPre (R package for analyzing pupil data) problem - Prep_data function converts TIMESTAMP values to NAs

We are interested in analyzing our pupil data (only interested in size, not position) recorded with an SR eyelink 1000Hz system.
We exported the files using the SR data viewer as sample reports.
After running ppl_prep_data the TIMESTAMP variable class is converted from character to numeric; however, it returns all NAs and the real timestamp values are lost. The rest of the pipeline is therefore not working.
Does anyone have an idea why it gives us NAs, and if so, how we might work around this?
Below you can find the code that we are using:
#step 1 Load library
library(PupilPre)
#step 2:load data
# change the folder where the data is in the line below
Pupildat <- read.table("DATAXX.txt", header = T, sep = "\t", na.strings = c(".", "NA"))
# after reading in, the first column gets a weird name (something with ?..), so we rename it for the next line of code
names(Pupildat)[1] <- 'RECORDING_SESSION_LABEL'
## Step 3:PupilPre Pipeline ###
# Check classes of columns and reassigns => creates event variable
data_pre <- ppl_prep_data(data = Pupildat, Subject = "RECORDING_SESSION_LABEL", EventColumns = c("Subject", "TRIAL_INDEX"))
align_msg(data_pre, Msg = "Hashtag_1")
#Using the function check_msg_time you can see that the TIMESTAMP values associated with the message are not the same for each event.
#This indicates that alignment is required. Note that a regular expression (regex) can be used here as the message string.
# example below, though we think we want different timings for the events
check_msg_time(data = data_pre, Msg = "Hashtag_1")
### returns NA
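A quick diagnostic sketch, assuming the exported sample report really contains a TIMESTAMP column: inspect the raw values before ppl_prep_data coerces them, since as.numeric() on strings containing stray characters silently produces NAs.
# hypothetical check of the raw TIMESTAMP values before coercion
class(Pupildat$TIMESTAMP)
head(Pupildat$TIMESTAMP)
# which raw values would become NA when coerced to numeric
bad <- is.na(suppressWarnings(as.numeric(as.character(Pupildat$TIMESTAMP))))
head(Pupildat$TIMESTAMP[bad])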

R: use single file while running a for loop on list of files

I am trying to create a loop where I select one file name from a list of file names, and use that one file to run read.capthist and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package they are 'captfile' types), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I can use this code by bringing in each file and running this separately (this code works for non-simulation runs of a couple data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
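For what it's worth, a minimal sketch (untested, and assuming each name in files is itself a valid captfile and simtraps is already in the workspace) that loops over the file names directly and uses them to build unique output names:
library(secr)
files <- list.files(pattern = "female*")
for (i in seq_along(files)) {
  capt <- files[i]   # a single file name, not a one-element list
  femsimCH <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                  method = "BFGS", trace = FALSE, CL = TRUE)
  D.fit <- derived(fit)
  # output names carry the input file name so iterations don't overwrite each other
  save(fit, file = paste0("C:/temp/fit_", files[i], ".Rdata"))
  save(D.fit, file = paste0("C:/temp/D.fit_", files[i], ".Rdata"))
}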

Error merging multiple files within directory while skipping rows and columns

I have multiple text files that I'm trying to merge together into one dataframe.
Within each file I'm attempting to skip the first 10 rows, as well as the first column (there are 15 columns total, including the first one I'm trying to skip)
Here's code I'm currently using based on different pieces found online and on stack overflow:
for (x in list.files(pattern="*.txt", recursive=TRUE))
{
all_content <- readLines(x)
skip = all_content[-c(1:10)]
input <- read.table(textConnection(skip),
header = FALSE,
colClasses = c(rep("NULL", 1),
rep(NA, 14)),
sep="\t", stringsAsFactors = FALSE)
df <- rbind(df, input)
}
However, I'm getting the error "Error in rep(xi, length.out = nvar) : attempt to replicate an object of type 'closure'" and I can't seem to figure out what's causing it. The code was working the last time I tried it... not sure if I accidentally changed something.
Thanks all.
It is because df does not exist before the loop, so inside rbind(df, input) R picks up the built-in function df() instead, and a function (a "closure") cannot be replicated into rows.
That's why it is showing an error about a closure object.
Let me know what happens when you add this before your for loop.
df <- NULL
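For reference, a minimal sketch of the loop with df initialized first (assuming tab-delimited files with 10 header lines and 15 columns, as described):
df <- NULL
for (x in list.files(pattern = "*.txt", recursive = TRUE)) {
  all_content <- readLines(x)
  skip <- all_content[-c(1:10)]   # drop the first 10 rows
  input <- read.table(textConnection(skip),
                      header = FALSE,
                      colClasses = c("NULL", rep(NA, 14)),   # drop the first column, keep 14
                      sep = "\t", stringsAsFactors = FALSE)
  df <- rbind(df, input)   # rbind(NULL, input) simply returns input on the first pass
}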

Searching for target in Excel spreadsheet using R

As an R noob, I'm currently rather stumped by what is probably a rather trivial problem. I have data that looks like in the second image below, essentially a long sheet of rows with values in three columns. What I need is for a way to scan the sheet looking for particular combinations of values in the first and second column - combinations that are specified in a second spreadsheet of targets (see picture 1). When that particular combination is found, I need the script to extract the whole row in question from the data file.
So far, I've managed to read the files without problem:
library(xlsx)
folder <- 'C:\\Users\\...\\Desktop\\R EXCEL test\\'
target_file <- paste(folder,(readline(prompt = "Enter filename for target list:")),sep = "")
data_file <- paste(folder,(readline(prompt = "Enter data file:")),sep = "")
targetsDb <- read.xlsx(target_file, sheetName = "Sheet1")
data <- read.xlsx(data_file, sheetName = "Sheet1")
targets <- vector(mode = "list", length = 3)
for(i in 1:nrow(targetsDb)){
targets[[i]] <- c(targetsDb[i,1],targetsDb[i,2])
}
And with the last command I've managed to save the target combinations as items in a list. However, I run into trouble when it comes to iterating through the file looking for any of those combinations of cell values in the first two columns. My approach was to create a list with one item,
SID_IA <- vector(mode = "list", length = 1)
and to fill it with the values of column 1 and 2 iteratively for each row of the data file:
for(n in 1:nrow(data)){
SID_IA[[n]] <- c(data[n,1],data[n,2])
I would then nest another for loop here, which basically goes through every row in the targets sheet to check if the combination of values currently in the SID_IA list matches any of the target ones. Then at the end of the loop, the list is emptied so it can be filled with the following combination of data values.
for(i in targets){
if(SID_IA[[n]] %in% targets){
print(SID_IA[[n]], "in sentence" , data[n,1], "is ", data[n,3])
}else{
print(FALSE)
}
SID_IA[[n]] <- NULL
}
}
However, if I try to run that last loop, it returns the following output and error:
[1] FALSE
Error in SID_IA[[n]] : subscript out of bounds
In addition: Warning message:
In if (SID_IA[[n]] %in% targets) { :
the condition has length > 1 and only the first element will be used
So, it seems to be doing something for at least one iteration, but then crashes. I'm sure I'm missing something very elementary, but I just can't see it. Any ideas?
EDIT: As requested, I've removed the images and made the test Excel sheets available here and here.
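As an aside, the %in% test in the loop above compares a two-element vector against a whole list, which is what triggers the "condition has length > 1" warning. A loop-free sketch, assuming a target is fully defined by its first two columns, is to paste the two key columns together and match them:
key.data    <- paste(data[, 1], data[, 2])
key.targets <- paste(targetsDb[, 1], targetsDb[, 2])
hits <- data[key.data %in% key.targets, ]   # whole rows whose first two columns match a target
hits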
OK, I'm attempting an answer that should require minimal use of fancy tricks.
data<- xlsx::read.xlsx(file = "Data.xlsx",sheetIndex = 1)
target<- xlsx::read.xlsx(file = "Targets.xlsx",sheetIndex = 1)
head(data)
target
These values are already in data.frame format. If all you want to know is which rows appear exactly the same in data and target, then it is as simple as a merge:
merge(target,data,all = F)
If, on the other hand, you want to keep the data table with a marking of the target rows, then the easiest way is to make an index column:
data$indx<- 1:nrow(data)
data
mrg<- merge(target,data,all = F)
data$test<- rep("test", nrow(data))
data$test[mrg$indx]<- "target"
data
This is like the original image you'd posted.
BTW, if you are on a graphical interface you can also use a file dialogue to open data files; check out file.choose().
(Posted on behalf of the OP).
Following from #R.S.'s suggestion that didn't involve vectors and loops, and after some playing around, I have figured out how to extract the target lines, and then how to remove them from the original data, outputting both results. I'm leaving it here for future reference and considering this solved.
extracted <- merge(targets,data,all = F)
write.xlsx(extracted,output_file1)
combined <-rbind(data,extracted)
minus.target <- combined[!duplicated(combined,fromLast = FALSE)&!duplicated(combined,fromLast = TRUE),]
write.xlsx(minus.target, output_file2)

Convert R read.csv to a readLines batch?

I have a fitted model that I'd like to apply to score a new dataset stored as a CSV. Unfortunately, the new data set is kind of large, and the predict procedure runs out of memory on it if I do it all at once. So, I'd like to convert the procedure that worked fine for small sets below, into a batch mode that processes 500 lines at a time, then outputs a file for each scored 500.
I understand from this answer (What is a good way to read line-by-line in R?) that I can use readLines for this. So, I'd be converting from:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
newdata <- as.data.frame(read.csv('newstuff.csv'), stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=filename)
to something like:
trainingdata <- as.data.frame(read.csv('in.csv'), stringsAsFactors=F)
fit <- mymodel(Y~., data=trainingdata)
con <- file("newstuff.csv", open = "r")
i = 0
while (length(mylines <- readLines(con, n = 500, warn = FALSE)) > 0) {
i = i+1
newdata <- as.data.frame(mylines, stringsAsFactors=F)
preds <- predict(fit,newdata)
write.csv(preds, file=paste(filename,i,'.csv',sep=''))
}
close(con)
However, when I print the mylines object inside the loop, it doesn't get split into columns the way read.csv output is: the headers are still a mess, and the wrapping of the vector into an n-column object that happens under the hood isn't happening.
Whenever I find myself writing barbaric things like cutting the first row and wrapping the columns, I generally suspect R has a better way to do things. Any suggestions for how I can get read.csv-like output from a readLines csv connection?
You can read your data into memory in chunks by using read.csv with the skip and nrows arguments. In pseudo-code:
read_chunk = function(start, n) {
read.csv(file, skip = start, nrows = n)
}
start_indices = (0:no_chunks) * chunk_size + 1
lapply(start_indices, function(x) {
dat = read_chunk(x, chunk_size)
pred = predict(fit, dat)
write.csv(pred)
})
Alternatively, you could put the data into an SQLite database and use the RSQLite package to query the data in chunks. See also this answer, or do some digging with [r] large csv on SO.
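For reference, a slightly fuller sketch along those lines, reading the header once and writing one output file per chunk (the chunk size and output naming are assumptions to adapt):
chunk_size <- 500
header <- names(read.csv("newstuff.csv", nrows = 1))              # column names, read once
n_rows <- length(count.fields("newstuff.csv", sep = ",")) - 1     # data rows, excluding the header
no_chunks <- ceiling(n_rows / chunk_size)
for (i in seq_len(no_chunks)) {
  dat <- read.csv("newstuff.csv",
                  skip = (i - 1) * chunk_size + 1,   # + 1 skips the header line
                  nrows = chunk_size,
                  header = FALSE, col.names = header,
                  stringsAsFactors = FALSE)
  preds <- predict(fit, dat)
  write.csv(preds, file = paste0("preds_", i, ".csv"))
}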
