Read over 13,000 txt files into RStudio

I am trying to load txt files into R. The files are EEG recordings for emotional words: every text file is the recording of one word for one study participant, and for every participant 360 words were recorded.
Every text file contains the complete time course of the EEG recording (0 to 2000 ms) in the rows and electrodes 1 to 58 in the columns.
I have an R script that was worked out with a statistics expert, who is currently unavailable for me to ask again. The script used to work: it read the data and produced output.
Now RStudio runs the script up to a certain point, but every table in the environment shows NA for each data point (data it was able to read before).
Much later R throws an error, but I suspect that is mainly a consequence of the data not being read properly.
I seem to be blind to any mistake in the script, which I have not changed since it first worked.
Installing R on another computer for the first time and running the script there gives the same result: the environment shows NA for all data.
I have looked at several tips on how to read txt files into R, but they do not cover complex data like this.
The code used to summarise the data (max/mean/min per electrode over two time windows) is as follows:
stats <- function(x, tf1 = NA, tf2 = NA) {
  # x: data for one file; tf1/tf2: column indices of the two time windows
  natf1 <- sum(is.na(tf1)) > 0
  natf2 <- sum(is.na(tf2)) > 0
  if (natf1 & natf2) { tf1 <- 1:dim(x)[2] }
  if (!natf1) {
    mx  <- as.numeric(lapply(as.data.frame(t(x[, tf1])), max))
    mn  <- as.numeric(lapply(as.data.frame(t(x[, tf1])), mean))
    min <- as.numeric(lapply(as.data.frame(t(x[, tf1])), min))
  }
  if (!natf2) {
    mx2  <- as.numeric(lapply(as.data.frame(t(x[, tf2])), max))
    mn2  <- as.numeric(lapply(as.data.frame(t(x[, tf2])), mean))
    min2 <- as.numeric(lapply(as.data.frame(t(x[, tf2])), min))
  }
  if (natf2) {
    return(cbind(electrode = 1:58, max1 = mx, mean1 = mn, min1 = min,
                 max2 = NA, mean2 = NA, min2 = NA))
  } else {
    return(cbind(electrode = 1:58,
                 max1 = mx, mean1 = mn, min1 = min,
                 max2 = mx2, mean2 = mn2, min2 = min2))
  }
}
I hope somebody can help me and improve my limited knowledge.
Cheers, Emily
P.S.: the code to read the txt files is the following:
filenames <- dir()           # read out the file names
nfiles <- length(filenames)  # number of files
store <- matrix(rep(NA, nfiles * 9 * 58), nfiles * 58, 9)
ti <- 2010 / 503             # time per sample in ms
t250 <- floor(250 / ti)
t350 <- ceiling(350 / ti)
t350p1 <- t350 + 1
t450 <- ceiling(450 / ti)
for (i in 1:nfiles) {
  data <- read.table(filenames[i], sep = " ", dec = ".")
  vp   <- as.numeric(substr(filenames[i], 3, 4))    # participant number from the file name
  word <- as.numeric(substr(filenames[i], 15, 17))  # word number from the file name
  print(i)
  store[(1 + (i - 1) * 58):(58 + (i - 1) * 58), 1] <- rep(vp, 58)
  store[(1 + (i - 1) * 58):(58 + (i - 1) * 58), 2] <- word
  store[(1 + (i - 1) * 58):(58 + (i - 1) * 58), 3:9] <- stats(data, t250:t350, t350p1:t450)
}
The data look as follows (the output of an EEG recording, listed by time (x) and electrode (y)):
-24.0726 -25.4886 -19.3321 -12.9210 -5.1501 3.1598 7.3684 4.7018 -2.2902 -7.5973 -8.6344 -7.8640 -7.4511 -6.1870 -2.6582 0.8325 1.3330 -0.3912 -1.8508 -3.5361 -5.7567 -6.1500 -5.9328 -6.0740 -5.1535 -3.7834 -0.3229 3.5887 2.1871 -3.7773 -7.5377 -7.9027 -10.2698 -11.9537 -8.7184 -6.0458 -9.0905 -14.2111 -17.0484 -18.7480 -17.6947 -12.7817 -9.0529 -7.9332 -8.9464 -11.4776 -13.9951 -11.9900 -3.6849 -1.1153 -5.2907 -4.8818 -2.8731 -5.9760 -7.7751 -5.4999 -7.4731 -9.3200 ...
error message received:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 5 did not have 2 elements
but I am not sure how to fix this, or in which file to look for the error.
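One way to narrow this down is a small diagnostic sketch (hypothetical, not part of the original script) that checks every file with the same separator the read.table() call uses; count.fields() reports how many fields each line has, so files with inconsistent line lengths stand out:

# List files whose lines do not all have the same number of fields
# when split on a single space (the separator used above).
for (f in filenames) {
  nf <- count.fields(f, sep = " ")
  if (length(unique(nf)) > 1) {
    cat(f, "has lines with", paste(unique(nf), collapse = ", "), "fields\n")
  }
}

If the counts only disagree because of runs of multiple spaces, reading with read.table's default separator (any whitespace, i.e. leaving sep out) instead of sep = " " may be worth trying.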

Related

RStudio read_delim(): intermittently receive error std::bad_alloc upon opening files with unusual delimiter

I received a series of 100+ files from a client. This client received the files as part of litigation, so they didn't have to be transmitted in a convenient fashion; they just all had to be present. In a single .zip file, the files are tracked with names like Folder1.001, Folder1.002, Folder3.001, etc. When these files are unpacked with the 7-Zip program, they don't have a .txt, .csv, or any other file extension, and Windows incorrectly interprets them as a ".001 File" or ".002 File." This is not the issue, because I know that the files are delimited by a ~ and are 118 columns wide. Each file has between 2.5M and 4.9M rows, and each is about 1 GB in size when unzipped.
This is my first ever post here, so please excuse any breach of etiquette.
I am working in a .Rmd file on a virtual machine running Windows. I have R4.2.2 (64-bit), and RStudio 2022.12.0+353. All work is being done within a drive on the virtual machine that has 9+ GB free out of 300 GB total. The size of this virtual drive could be increased, if necessary.
My goal here is to examine one variable in each file, see whether cases fall within a given range for that variable, and save the rows that do. I have been saving them as .rds files using write_rds().
I have been bringing in the files using a read_delim() statement specifying 'delim = "~"'. I created a vector of 120 column names which I use because the columns are not labeled. These commands on their own are not an issue. A successful import looks like the below.
work1 <- read_delim("Data\\Folder1\\File1.001", delim = "~", col_names = vNames1)
Rows: 2577668 Columns: 120
── Column specification ────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
It mishandles the columns named Skip1 and Skip9 as logical values, but those aren't a necessary part of my analysis.
I then filter and write the file using
work1 <- work1 %>% filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
write_rds(work1, "Data\\Working\\Folder1_001.rds")
I have also done this with read_delim() and filter() piped into a single command; that is not the issue. NOTE: before I read in the next file (File1.002), work1 is down to at most 4,000 cases, from millions when it was imported.
Since I have over 100 of these files, I have written multiple code chunks to do a few of these at a time. After one to three read_delim() statements in a row, I get the below error.
work2 <- read_delim("Data\\Folder1\\File1.002", delim = "~", col_names = vNames1)
Error std::bad_alloc
which I understand has to do with memory allocation. I can close RStudio and restart it, which lets me do one or two more import/filter/write cycles, but doing that for over 100 files is far too inefficient.
I condensed my code a step further by writing the read_delim() step within the write_rds() step, which looks like the below.
write_rds(
  read_delim("Data\\Folder1\\File003", delim = "~", col_names = vNames1) %>%
    filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900),
  "Data\\Working\\Folder1_003.rds")
Rows: 2577668 Columns: 120
── Column specification ────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Yet after one or two successful runs, I get the same Error std::bad_alloc message.
Using traceback(), it seems like it is related to vroom::vroom(), but I'm not sure how to check any further.
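One pattern that sometimes helps with out-of-memory errors in long sessions is to wrap each import/filter/write cycle in a function, so the full 1 GB import goes out of scope when the function returns, and to call gc() between files. A minimal sketch, assuming hypothetical vectors file_paths and out_paths plus the vNames1 vector from above:

library(readr)
library(dplyr)

process_one <- function(in_path, out_path, col_names) {
  work <- read_delim(in_path, delim = "~", col_names = col_names,
                     show_col_types = FALSE)
  work <- filter(work,
                 as.numeric(Press_ZIP) > 78900,
                 as.numeric(Press_ZIP) < 99900)
  write_rds(work, out_path)
  invisible(NULL)      # 'work' is freed when the function returns
}

for (i in seq_along(file_paths)) {
  process_one(file_paths[i], out_paths[i], vNames1)
  gc()                 # ask R to release memory before the next 1 GB import
}

If your readr version supports it, read_delim()'s col_select argument can also cut memory use by importing only the columns the filter actually needs.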

Error while using "EpiEstim" and "ggplot2" libraries

First of all, I must say I'm a complete noob in R, so I apologize in advance for asking for help with such a simple task. My task is to plot a graph of COVID-19 cases for a certain period using data from a CSV file. Unfortunately, at the moment I cannot contact the person from the World Health Organization who provided the data and the script. I am left with an error that I cannot fix myself, nor with the help of Google.
script.R
library(EpiEstim)
library(ggplot2)

COVID <- read.csv("dataset.csv")
res_parametric_si <- estimate_R(COVID$I, method = "parametric_si",
                                config = make_config(list(mean_si = 4, std_si = 3)))
plot(res_parametric_si)
dataset.csv
Date,Suspected per day,Total suspected,Discarded/pending,Confirmed per day,Total confirmed,Deaths per day,Deaths Total,Case fatality rate,Daily confirmed,Recovered per day,Recovered total,Active cases,Tested with PCR,# of PCR tests total,average tests/ 7 days,Inf HCW,Inf HCW/d,Vent HCW,Susp per day
01-Jul-20,1239,91172,45285,889,45887,12,1185,2.58%,889,505,20053,24649,11109,676684,10073,6828,63,,1239
02-Jul-20,1249,92421,45658,876,46763,27,1212,2.59%,876,505,20558,24993,13167,689851,9966,6874,46,,1249
03-Jul-20,1288,93709,46032,914,47677,15,1227,2.57%,914,597,21155,25295,11825,701676,9915.7,6937,63,,1288
04-Jul-20,926,94635,46135,823,48500,22,1249,2.58%,823,221,21376,25875,9934,711610,9957,6990,53,,926
05-Jul-20,680,95315,46272,543,49043,13,1262,2.57%,543,327,21703,26078,6696,718306,9963.7,7030,40,,680
06-Jul-20,871,96186,46579,564,49607,21,1283,2.59%,564,490,22193,26131,9343,727649,10303.9,7046,16,,871
07-Jul-20,1170,97356,46942,807,50414,23,1306,2.59%,807,926,23119,25989,13568,741217,10806,7092,46,,1170
Error
Error in process_I(incid) (script.R#4): incid must be a vector or a dataframe with either i) a column called 'I', or ii) 2 columns called 'local' and 'imported'.
The original call fails because the dataset has no column named I (so COVID$I is NULL); the daily counts are in COVID$Daily.confirmed. For the example data there also seems to be a second issue: it covers only 7 data points, and the default configuration assumes it can slide an estimation window over more than 7 days. What worked for me (working in the sense that it does not throw an error) was the following code.
config <- make_config(incid = COVID$Daily.confirmed,
                      method = "parametric_si",
                      list(mean_si = 4, std_si = 3, t_start = c(2, 3), t_end = c(6, 7)))
res_parametric_si <- estimate_R(COVID$Daily.confirmed, method = "parametric_si", config = config)
plot(res_parametric_si)

Uneven numbers of tokens and subscript out of bound errors

I am trying to analyze flow cytometry data with a package that was developed about 10 years ago. It requires a few dependency packages, all of which I was able to install.
Now I am trying to run its first function, which creates a gate frame for a WinList-processed FCS file:
create_gate_frame(frame = archframe1x36x16,
                  inputfile = c("facsdata/TrungTran/Gelfree-8-lane7-5_1_1_A5.fcs"),
                  popdesc = "frames/popdescriptions/array1xpopdesc.txt")
I get the following errors, which I don't know how to solve, so any help would be very much appreciated.
uneven number of tokens: 1013
The last keyword is dropped.
uneven number of tokens: 1013
The last keyword is dropped.
Error in mat[, c(scatters, dims1, dims2, PE)] : subscript out of bounds
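Without knowing the package, one hedged guess is that the out-of-bounds subscript means a channel name expected by the popdesc file is not present in the FCS data. A quick check with flowCore (assumed here to be among the installed dependencies) would be to list the channel names actually stored in the file:

library(flowCore)
ff <- read.FCS("facsdata/TrungTran/Gelfree-8-lane7-5_1_1_A5.fcs",
               transformation = FALSE)
colnames(ff)    # channel/parameter names present in the file

The "uneven number of tokens" lines look like keyword-parsing warnings from reading the FCS text segment; the hard stop is the subscript error, so comparing colnames(ff) against the names in array1xpopdesc.txt seems like a reasonable first step.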

R - read html files within a folder, count frequency, and export output

I'm planning to use R to do some simple text mining tasks. Specifically, I would like to do the following:
Automatically read each html file within a folder, then
For each file, do a frequency count of particular phrases (e.g., "financial constraint", "oil export", etc.), then
Automatically write the output to a .csv file using the following data structure (e.g., file 1 has "financial constraint" showing 3 times and "oil export" 4 times, etc.):
file_name  count_financial_constraint  count_oil_export
        1                           3                 4
        2                           0                 3
        3                           4                 0
        4                           1                 2
Can anyone please let me know where I should start? So far I think I've figured out how to clean the html files and do the count, but I'm still not sure how to automate the process (I really need this, as I have around 5 folders each containing about 1000 html files). Thanks!
Try this:
gethtml <- function(path = ".") {
  files <- list.files(path, full.names = TRUE)
  files <- files[grepl("\\.html$", files)]   # keep only .html files
  htmlcount <- vector()
  for (i in files) {
    htmlcount[i] <- ##### add function that reads html file and counts it
  }
  return(sum(htmlcount))
}
R is not intended for rigorous text parsing, so the tools for such tasks are limited. If you insist on doing it in R, you had better get familiar with regular expressions and have a look at this.
However, I highly recommend using Python with the BeautifulSoup library, which is specifically designed for this task.
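For completeness, here is a minimal base-R sketch of the whole pipeline under two assumptions: the phrases can be counted as fixed strings in crudely tag-stripped HTML, and each folder contains only the files of interest. The function name count_phrases and the folder/phrase values are made up for illustration:

count_phrases <- function(path, phrases) {
  files <- list.files(path, pattern = "\\.html?$", full.names = TRUE)
  out <- data.frame(file_name = basename(files), stringsAsFactors = FALSE)
  for (p in phrases) {
    col <- paste0("count_", gsub("\\s+", "_", p))
    out[[col]] <- sapply(files, function(f) {
      txt <- paste(readLines(f, warn = FALSE), collapse = " ")
      txt <- gsub("<[^>]+>", " ", txt)               # crude tag removal
      hits <- gregexpr(p, txt, fixed = TRUE)[[1]]
      if (hits[1] == -1) 0L else length(hits)
    })
  }
  out
}

res <- count_phrases("html_folder", c("financial constraint", "oil export"))
write.csv(res, "counts.csv", row.names = FALSE)

Repeating this over the 5 folders is then just a loop or an lapply() over the folder paths.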

How do I read multiple binary files in R?

Suppose we have files file1.bin, file2.bin, ..., file1460.bin in the directory C:\R\Data, and we want to read them and loop over them in groups of four, averaging files 1 to 4, then the next four, and so on up to 1460. In the end we will get 360 results.
I tried putting them in a list, but did not know how to make the loop.
How do I read multiple files and manipulate them in R?
I have been wasting countless hours trying to figure it out. Any help?
results <- array(dim = 360)
for (i in 1:360) {
  # average the i-th group of four files (1-4, 5-8, ...)
  results[i] <- mean(unlist(yourlist[((i - 1) * 4 + 1):(i * 4)]))
}
YMMV with the mean() call depending on how each element of yourlist is stored, but that structure is how you could loop through the data once it is loaded.
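For the reading step itself (which the loop above assumes has already happened), a sketch using readBin could look like the following. It assumes each .bin file holds plain double-precision values and that n_values, the number of values per file, is known; both are guesses about your data format:

# Read all 1460 binary files into a list of numeric vectors (assumed format).
files <- file.path("C:/R/Data", sprintf("file%d.bin", 1:1460))
yourlist <- lapply(files, function(f) readBin(f, what = "numeric", n = n_values))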
