Merging a large number of csv datasets - r

Here are 2 sample datasets.
PRISM-APPT_1895.csv
https://copy.com/SOO2KbCHBX4MRQbn
PRISM-APPT_1896.csv
https://copy.com/JDytBqLgDvk6JzUe
I have 100 of these types of data sets that I'm trying to merge into one data frame, export that to csv, and then merge that into another very large dataset.
I need to merge everything by "gridNumber" and "Year", creating a time series dataset.
Originally, I imported all of the annual datasets and then tried to merge them with this:
df <- join_all(list(Year_1895, Year_1896, Year_1897, Year_1898, Year_1899, Year_1900, Year_1901, Year_1902,
Year_1903, Year_1904, Year_1905, Year_1906, Year_1907, Year_1908, Year_1909, Year_1910,
Year_1911, Year_1912, Year_1913, Year_1914, Year_1915, Year_1916, Year_1917, Year_1918,
Year_1919, Year_1920, Year_1921, Year_1922, Year_1923, Year_1924, Year_1925, Year_1926,
Year_1927, Year_1928, Year_1929, Year_1930, Year_1931, Year_1932, Year_1933, Year_1934,
Year_1935, Year_1936, Year_1937, Year_1938, Year_1939, Year_1940, Year_1941, Year_1942,
Year_1943, Year_1944, Year_1945, Year_1946, Year_1947, Year_1948, Year_1949, Year_1950,
Year_1951, Year_1952, Year_1953, Year_1954, Year_1955, Year_1956, Year_1957, Year_1958,
Year_1959, Year_1960, Year_1961, Year_1962, Year_1963, Year_1964, Year_1965, Year_1966,
Year_1967, Year_1968, Year_1969, Year_1970, Year_1971, Year_1972, Year_1973, Year_1974,
Year_1975, Year_1976, Year_1977, Year_1978, Year_1979, Year_1980, Year_1981, Year_1982,
Year_1983, Year_1984, Year_1985, Year_1986, Year_1987, Year_1988, Year_1989, Year_1990,
Year_1991, Year_1992, Year_1993, Year_1994, Year_1995, Year_1996, Year_1997, Year_1998,
Year_1999, Year_2000),
by = c("gridNumber","Year"),type="full")
But R keeps crashing, I think because the merge is a bit too large for it to handle, so I'm looking for something that would work better. Maybe data.table? Or another option.
Thanks for any help you can provide.

Almost nine months later and your question has no answer. I could not find your datasets; however, I will show one way to do the job. It is trivial in awk.
Here is a minimal awk script:
BEGIN {
    for (i = 0; i < 10; i++) {
        filename = "out" i ".csv";
        # getline returns 1 while records remain, 0 at end of file, -1 on error
        while ((getline line < filename) > 0) print line;
        close(filename);
    }
}
The script is run as
awk -f s.awk
where s.awk is the above script in a text file.
This script creates ten filenames, out0.csv, out1.csv, ..., out9.csv, which are assumed to be existing files containing the data. Each file is opened in turn, all of its records are sent to standard output, and the file is closed before the next filename is built and opened. As written, the script has little to offer over a command-line read/redirect; you would typically use awk to process a long list of filenames read from another file, with statements that selectively ignore lines or columns depending on various criteria.
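For completeness, since the question explicitly mentions data.table, here is a minimal sketch of that route in R. It assumes the yearly files sit in one directory, share the same layout, and already carry the gridNumber and Year columns; the file names and the name of the second dataset are illustrative.
library(data.table)
# read and stack every yearly file in one pass
files <- sprintf("PRISM-APPT_%d.csv", 1895:2000)
annual <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)
fwrite(annual, "PRISM-APPT_all_years.csv")
# then merge with the other large dataset on the shared keys
# other <- fread("other_large_dataset.csv")
# combined <- merge(annual, other, by = c("gridNumber", "Year"), all = TRUE)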

Related

speeding up nested for loop in R

I have a nested for loop that I am using to parse a complex and large JSON file, and it takes forever! I am wondering if there is a way to speed this up that I am missing. I can link to a smaller JSON file if that would be useful. Not sure if something from the apply family would be of use?
library(dplyr)  # bind_rows()
library(purrr)  # set_names()

result = NULL   # accumulator for the flattened rows
for(row_n1 in 1:length(json_data$in_network)){
  in_network1 = json_data$in_network[[row_n1]]
  in_network1[["negotiated_rates"]] = NULL
  in_network_df = as.data.frame(in_network1)
  for(row_n2 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates)){
    reporting = NULL
    for(row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups)){
      npi_df = as.data.frame(toString(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$npi)) %>% set_names(nm = "npi")
      tin_df = as.data.frame(t((json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$provider_groups[[row_n3]]$tin)))
      df = merge(npi_df, tin_df)
      df = merge(in_network1, df)
      reporting = rbind(reporting, df)
    }
    negotiated_prices_df = NULL
    for(row_n3 in 1:length(json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices)){
      df = as.data.frame(t((json_data$in_network[[row_n1]]$negotiated_rates[[row_n2]]$negotiated_prices[[row_n3]])))
      negotiated_prices_df = bind_rows(negotiated_prices_df, df)
    }
    r = merge(reporting, negotiated_prices_df)
    result = bind_rows(result, r)
  }
  if(row_n1 %% 100 == 0)
    print(paste(row_n1, Sys.time(), sep = "====="))
}
l = json_data
l$in_network = NULL
result = merge(as.data.frame(l), result)
This isn't a big deal on small files (e.g., < 1,000 rows in the result df), but when I am dealing with 1M+ rows it just takes forever!
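One way to avoid growing result row by row is to build the pieces inside lapply() calls and bind everything once at the end. Below is a rough sketch of that restructuring, assuming the same json_data structure as the loops above; it mirrors the original merges rather than replacing them with a dedicated JSON-flattening tool.
library(dplyr)  # bind_rows()
result_list <- lapply(json_data$in_network, function(item) {
  meta <- item
  meta$negotiated_rates <- NULL
  meta_df <- as.data.frame(meta)
  per_rate <- lapply(item$negotiated_rates, function(rate) {
    providers <- do.call(rbind, lapply(rate$provider_groups, function(pg) {
      merge(data.frame(npi = toString(pg$npi)), as.data.frame(t(pg$tin)))
    }))
    prices <- bind_rows(lapply(rate$negotiated_prices, function(np) {
      as.data.frame(t(np))
    }))
    # same cross-joins as above; meta columns are attached once per rate
    merge(merge(meta_df, providers), prices)
  })
  bind_rows(per_rate)
})
result <- bind_rows(result_list)
meta_all <- json_data
meta_all$in_network <- NULL
result <- merge(as.data.frame(meta_all), result)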

Is there a way to retrieve physiological data from AcqKnowledge and preprocess them in R?

I'm trying to retrieve raw physiological data (hand dynamometer) from AcqKnowledge to be able to preprocess and analyze them in R (or Matlab) in an automated way. Is there a way to do that? I would like to avoid having to copy/paste the data manually from AcqKnowledge to Excel to read them in R.
Then I would like to apply a filter on the data and retrieve the squeezes of interest in R. Is there a way to do that?
Any advice is very welcome, thank you in advance!
I had a similar problem. The key is the load_acq.m function; you can find it here -> https://github.com/munoztd0/fusy-octave-memories/blob/main/load_acq.m
Then you can just loop through the files and save them as .csv or anything else that you can load and work with in R.
As for how to do that, I have put together a little routine:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Create physiological files from AcqKnowledge
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% SETTING THE PATH
path = 'your path here'; % your path here
physiodir = fullfile(path, '/SOURCE/physio');
outdir = fullfile(path, '/DERIVATIVES/physio');
session = "first"; % your session
base = [path 'sub-**_ses-' session '*'];
subj = dir(base); % check all matching subjects for this session
addpath([path '/functions']) % load AcqKnowledge functions
% check here https://github.com/munoztd0/fusy-octave-memories/blob/main/load_acq.m

for i = 1:length(subj)
    cd(physiodir) % go to the folder
    subjO = char(subj(i).name);
    subjX = subjO(end-3:end); % remove .mat extension
    filename = [subjX '.acq']; % add .acq extension
    disp(['****** PARTICIPANT: ' subjX ' **** session ' session ' ****']);

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % OPEN FILE
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    physio = load_acq(filename); % load and transform the AcqKnowledge file
    data = physio.data; % or whatever the name is

    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    % CREATE AND SAVE THE CHANNEL
    %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
    cd(outdir) % go to the folder
    % save the data as a text file in the participant directory
    fid = fopen([subjX '.txt'], 'wt');
    for ii = 1:length(data)
        fprintf(fid, '%g\t', data(ii));
        fprintf(fid, '\n');
    end
    fclose(fid);
end
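On the R side, a minimal sketch for picking those exported files back up, assuming the routine above wrote one tab-delimited .txt file per participant into DERIVATIVES/physio (the paths are illustrative):
outdir <- "DERIVATIVES/physio"
files <- list.files(outdir, pattern = "\\.txt$", full.names = TRUE)
physio <- lapply(files, scan)  # each file holds one numeric sample per line
names(physio) <- tools::file_path_sans_ext(basename(files))
# from here the signals can be filtered (e.g. with signal::butter() and
# signal::filtfilt()) and the squeezes of interest extracted in R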
Hope it helps!

R function to write results into a csv file

Hi, I have created this R function to do some operations and store the results into separate files.
x1=rnorm(100,0,1)
x2=rnorm(100,0,1)
dataaa=data.frame(x1,x2)
func1 = function(dataaa, name1, name2) {
  xqr = dataaa[,1]^2
  xcube = dataaa[,1]^3
  write.csv(xqr, "name1.csv")
  write.csv(xcube, "name2.csv")
}
func1(dataaa, xr, xc)
The function works well, but the file names didn't change; i.e., the names of the two csv files should be xr.csv and xc.csv, but they were saved as name1.csv and name2.csv.
How do I modify this function so that I get the correct file names?
Thank you.
this should work:
func1 = function(dataaa, name1, name2) {
  xqr = dataaa[,1]^2
  xcube = dataaa[,1]^3
  write.csv(xqr, paste0(name1, ".csv"))
  write.csv(xcube, paste0(name2, ".csv"))
}
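Note that with this version the file-name arguments have to be passed as character strings when the function is called, for example:
func1(dataaa, "xr", "xc")  # writes xr.csv and xc.csv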

Converting the argument name of a function into string

I have developed a function which takes a list of files, does some statistical tests, and generates an Excel file. In the last line of the function (the return object) I want the function to return an Excel file with the same name as the input file list. In my example it will give list_file.xlsx. If I enter another list, say tslist_file, it should automatically return tslist_file.xlsx. The function is properly working. Suggest how I can code the last line of the function so that I can generalise it.
newey <- function(list_files){
  # requires readxl (read_excel), xts, lmtest (coeftest), sandwich (NeweyWest),
  # gtools (stars.pval) and a package providing write.xlsx()
  tsmom <- do.call(cbind, lapply(list_files, function(x) read_excel(x)[,2]))
  tsmom <- xts(tsmom[,1:5], order.by = seq(as.Date("2005-02-01"), length = 183, by = "months") - 1)
  names(tsmom) <- c("tsmom121","tsmom123","tsmom126","tsmom129","tsmom1212")
  ## newey west
  newey_west <- function(x){
    model <- lm(x ~ 1)
    newey_west <- coeftest(model, vcov = NeweyWest(model, verbose = T))
    newey_west[c(1,3,4)]
  }
  ## running newey west
  cs_nw_full <- do.call(cbind, lapply(tsmom, newey_west))
  library(gtools)
  p_values <- cs_nw_full[3,]
  cs_nw_full[2,] <- paste0(cs_nw_full[2,], stars.pval(p_values))
  write.xlsx(cs_nw_full, "list_file.xlsx")
}
Try:
write.xlsx(cs_nw_full, paste0(eval(substitute(list_files)), ".xlsx"))
Edit:
@jeetkamal is absolutely right: you need to use
write.xlsx(cs_nw_full, paste0(deparse(substitute(list_files)), ".xlsx"))
here.
I apologize for the mistake. eval would only work if list_files were, e.g., the name of a file, not a list object.
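For reference, a minimal illustration of what deparse(substitute()) captures (the argument name below is just an example):
f <- function(x) deparse(substitute(x))
f(tslist_file)
# [1] "tslist_file"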

Multiple inputs into an R function

So I currently have this code below in a file, pullsec.R
pullsec <- function(session=NULL){
  if(is.null(session)) session <- 1
  stopifnot(is.numeric(session))
  paste("Session", 1:10)[session]
}
In an .Rnw file, I call on this pullsec.R and choose session number 3 by:
source("pullsec.R")
setsec <- pullsec(3)
which would pull all of the rows where the column Session has the value "Session 3".
I would like to add another block to pullsec.R that would allow me to pull data for a second column, Sessions, where the values in that column are "Sessions 1-2", "Sessions 3-4", "Sessions 5-6", etc. But I'm not sure how to modify the pullsec block to accept multiple inputs.
I have tried many solutions, but nothing has worked. My most naive solution is:
pullsec2 <- function(sessions1=NULL, sessions2=NULL){
  if(is.null(sessions1)) sessions1 <- 1
  stopifnot(is.numeric(session1))
  paste("Sessions", 1:10, "-", 1:10)[sessions]
}
either of these would do:
pullsec2 <- function(session1=1, session2=2){
  stopifnot(is.numeric(session1))
  stopifnot(is.numeric(session2))
  paste0("Sessions ", session1, '-', session2)
}
pullsec2(3,4)

pullsec2 <- function(sessions=1){
  stopifnot(is.numeric(sessions))
  paste("Sessions", paste0(sessions, collapse="-"))
}
pullsec2(3:4)
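For reference, both example calls above evaluate to the string "Sessions 3-4", so either version produces the "Sessions 1-2", "Sessions 3-4", ... labels described in the question.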
