I have a question that I can't seem to find the answer to anywhere online; I apologize if it's already been answered, but here goes. I've written a script in R that goes through the process of forecasting for me and returns the best point forecast based on cross-validation and other criteria. I want to save this script as a function so that I don't have to run the full script every time I forecast. The basic setup of my script is the following:
output <- read.csv("C:/Users/data.csv", header = T)
colnames(output)
month_count = length(output[,1]) ##used in calculations throughout code
current_year = output[1,1]
current_month = output[1,2]
months = 5 #months to forecast out
m = 0
data <- ts(output[,3][c(1:(month_count-m))],
frequency = 12, start = c(current_year,current_month))
#runs all the other steps from here on
The function that I'm writing will look like this: it takes various inputs, runs the script, and prints back my forecasts.
forecastMe = function(sourcefile,months,m)
{
#runs the data prints out the result
}
The problem I'm having is that I want to be able to pass a directory and file name such as C:/Users/documents/data1.csv into the function (for the sourcefile argument) and have it picked up at this step of my R script:
output <- read.csv("C:/Users/sourcefile.csv", header = T)
I can't seem to find a way to get this to work. Any ideas or suggestions?
So...
function(sourcefile, etc) {
output <- read.csv(sourcefile, header = T)
etc
}
...that? I don't really see what you're asking exactly.
You were almost there. All you have to do is replace the constants with the variable names you want to pass to the function, and delete the declarations you no longer need.
forecastMe = function(sourcefile,months,m) {
output <- read.csv(sourcefile, header = T)
colnames(output)
month_count = length(output[,1]) ##used in calculations throughout code
current_year = output[1,1]
current_month = output[1,2]
data <- ts(output[,3][c(1:(month_count-m))],
frequency = 12, start = c(current_year,current_month))
#runs all the other steps from here on
}
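You would then call the function with the values that were previously hard-coded, e.g. (path and settings taken from the question):
forecastMe(sourcefile = "C:/Users/documents/data1.csv", months = 5, m = 0)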
Still very new to R, so please excuse me.
I am trying to download CSV data from the Sloan Digital Sky Survey. Within R I do the following:
astro1 <- read.csv("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=596&plateid=4055")
This downloads one CSV spectrum per fibre ID per plate [here, plateid=4055]. However, if there are several hundred fibre IDs, it will be a very long couple of days.
Is there a way to batch download all CSV data for all fibre IDs? I tried fiberid=* (and "", " ", #), but got errors such as "no lines available in input" or "unexpected string constant".
If, for example, there are 100 .csv files per plate, all will have a common x-axis (wavelength) but a different third column (best fit, for the y-axis). Is there a way to combine the downloaded CSV tables into one very large dataset with the common wavelength axis, where the subsequent columns show only the Best Fit columns?
Many thx
The best case would be that you have a list of all the links to the CSV files you want. Since that is seemingly not the case, you know that you want to loop over all the fibre IDs. You know the structure of the link, so we can use it to define
buildFibreIdLink <- function(fibreId) {
  # build the download URL for a given fibre id (mjd and plate are fixed here)
  paste0("https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=", fibreId, "&plateid=4055")
}
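For example, the fibre ID from the question reproduces the original link:
buildFibreIdLink(fibreId = 596)
# [1] "https://dr14.sdss.org/optical/spectrum/view/data/format=csv/spec=full?mjd=55359&fiberid=596&plateid=4055"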
Now I would just loop over all the ids, whatever "all" means in this case: just start at 1 and count up. For that I would use the function
getCsvDataList <- function(startId = 1, endId = 10, maxConsecutiveNulls = 5) {
  dataList <- list()
  consecutiveNullCount <- 0
  for (id in startId:endId) {
    csvLink <- buildFibreIdLink(fibreId = id)
    # read.csv() errors on a non-existent link; treat that as NULL
    newData <- tryCatch(read.csv(csvLink), error = function(e) NULL)
    if (is.null(newData)) {
      consecutiveNullCount <- consecutiveNullCount + 1
    } else {
      dataList <- c(dataList, list(newData))
      consecutiveNullCount <- 0
    }
    if (consecutiveNullCount == maxConsecutiveNulls) {
      print(paste0("reached maxConsecutiveNulls at id ", id))
      break
    }
  }
  return(dataList)
}
Specify the id range you want to read, so that you can really read the CSVs partially. Now the question is: when have you reached the end? My answer would basically be: you have reached the end when there are maxConsecutiveNulls consecutive read-csv failures. I assume a link doesn't exist if it can't be read, so the tryCatch block triggers, and I simply count these triggers until a given maximum is hit.
If you know that the structure of the csvs is always the same, you can merge the list of data.frames together via
dataListFrom1to10 <- getCsvDataList(startId = 1, endId = 10)
merged1to10 <- do.call("rbind",dataListFrom1to10)
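Note that rbind stacks the spectra on top of each other. If you instead want the wide layout from the question (one wavelength column plus one best-fit column per fibre), here is a hedged sketch, assuming each frame has columns named Wavelength and BestFit (the real SDSS column names may differ, so check names(dataListFrom1to10[[1]]) first):
wideData <- Reduce(
  function(a, b) merge(a, b, by = "Wavelength"),
  lapply(seq_along(dataListFrom1to10), function(i) {
    df <- dataListFrom1to10[[i]][, c("Wavelength", "BestFit")]  # hypothetical column names
    names(df)[2] <- paste0("BestFit_", i)  # one best-fit column per spectrum
    df
  })
)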
Update: If you have your vector of needed fibre ids, you can modify the function as follows. Since we didn't know the exact ids, we looped from 1 upwards. Now, knowing the ids, you can replace the startId and endId arguments by, say, fibreIdVector, to get the signature
getCsvDataList <- function(fibreIdVector, maxConsecutiveNulls) .... In the for-loop, replace for(id in startId:endId) by for(id in fibreIdVector). If you know that all your ids are valid, you can remove the error handling to get a much cleaner function. Since you then don't need to know the results of previous iterations, e.g. counting consecutiveNullCount, you can just put everything into an lapply like
allCsvData <- lapply(fibreIdVector, function(id) {
read.csv(buildFibreIdLink(fibreId = id))
})
replacing the whole function.
I am trying to automate the calculation of some animal energy requirements, where I have inputs such as days on feed and daily feed intake. My code first reads the initial data from a CSV, uses it to calculate some starting values outside the loop, runs a loop of each day's energy calculations over the time on feed, stores those results in a data frame, and then writes the final data frame to a CSV.
I have data like this from >300 sheep, on an individual-record basis, and want to automate reading in the files and writing the results to separate CSV files within a specific folder. I know this means a loop within a loop, but I am trying to figure out exactly how to go about it.
I know I need to read in the files using list.files, like this:
files = list.files("C:/Users/Me/Desktop/Sheepfiles/", pattern = "Sheep+.*csv")
but I want each file run through the model as its own data frame, and I need to keep everything separate going in and out.
setwd("C:Users/....../Sheepfiles")
input = read.csv(file = "Sheep131.csv", header = TRUE, sep =",")
#set up initialized values outside loop here
LWT0 = input$LWT[1]
EBW = LWT0*.96*.891
#constants go here
Results = NULL;
timefeed = input$DOF
#now the loop
for (i in timefeed)
{
#differential equations and calculations here
results1 = c(t, NEG, MEI, OldMEI, HPmaint, EBW, ID, TRT)
names(results1) = c("DOF", "NEG", "MEI", "OldMEI","HPmaint", "EBW", "ID", "TRT")
print(results1)
Results = rbind(Results,results1)
#update variables to new values here
}
write.csv(Results, file = "Results131.csv")
What I want is to have files with SheepX in the name, one per sheep, where X is the eartag number; have those read in and run through the model; and then have the results automatically written out to ResultsX.csv. If it helps, the eartag number is in the original input file under the column "ID". So for Sheep 1:150 I'd have Results1:150, etc.
Later on, I'll need to read those result files back in, extract outputs at specific days, and pull those into a data frame for comparison with observations, but that's the next step after I get all these files run through the model.
You need to loop through your filenames and execute your existing code for each file, so a solution could look like this:
setwd("C:Users/....../Sheepfiles")
files = list.files("C:/Users/Me/Desktop/Sheepfiles/", pattern = "Sheep+.*csv")
for (i in files) {
input = read.csv(file = i,
header = TRUE,
sep = ",")
#set up initialized values outside loop here
LWt0 = input$LWT[1]
EBW = LWT0 * .96 * .891
#constants go here
Results = NULL
timefeed = input$DOF
#now the loop
for (i in timefeed)
{
#differential equations and calculations here
results1 = (c(t, NEG, MEI, OldMEI, HPmaint, EBW, ID, TRT))
names(results1) = c("DOF", "NEG", "MEI", "OldMEI", "HPmaint", "EBW", "ID", "TRT")
print((results1))
Results = rbind(Results, results1)
#update variables to new values here
}
# automatically generate filename for results
result.filename <- gsub("Sheep", "Results", i)
write.csv(Results, file = result.filename)
}
So you basically wrap a for-loop around your code, with your file names as the counter variable (the inner day-loop gets its own counter so it doesn't clobber the file name).
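If you would rather not depend on setwd(), here is a hedged variant using full paths (base R only; same folder layout as above, with results written next to the inputs):
files <- list.files("C:/Users/Me/Desktop/Sheepfiles/", pattern = "Sheep.*\\.csv$", full.names = TRUE)
for (f in files) {
  input <- read.csv(f)
  # ... same calculations as above ...
  # build the output name from the bare file name only, so that
  # gsub() does not also rewrite the folder name "Sheepfiles"
  out <- file.path(dirname(f), gsub("Sheep", "Results", basename(f)))
  write.csv(Results, file = out)
}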
I am trying to create a loop where I select one file name from a list of file names and use that file to run read.capthist, and subsequently discretize, fit, derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package these are 'captfile' objects), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try to select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and have used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)
setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files
for (i in 1:length(lst)) {
capt <- lst[i]
femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
save(fit, file="C:/temp/fit.Rdata")
D.fit <- derived(fit)
save(D.fit, file="C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous data output.
I know I can bring in each file and run this code separately (it works for non-simulation runs of a couple of data sets), but as I'm hoping to run 100 simulations, that would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
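One likely culprit, offered as a hedged sketch rather than a tested answer: lst is a list of empty slots named after the files, and lst[i] (single brackets) returns a one-element list rather than the file name itself, which is presumably why read.capthist() complains about not getting a single 'captfile'. Indexing the character vector files directly, and building unique output names with paste0(), might look like this (the rest of the pipeline is assumed to be exactly as in the question):
library(secr)
files <- list.files(pattern = "female*")
for (i in seq_along(files)) {
  capt <- files[i]  # a length-one character vector, not a one-element list
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
  save(fit, file = paste0("C:/temp/fit_", i, ".Rdata"))     # unique file per iteration
  D.fit <- derived(fit)
  save(D.fit, file = paste0("C:/temp/D.fit_", i, ".Rdata"))
}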
I am doing PCA. Here is my code:
### Read .csv file #####
data<-read.csv(file.choose(),header=T,sep=",")
names(data)
data$qcountry
#### for the country-ARGENTINA#######
ar_data<-data[which(data$qcountry=="ar"),]
ar_data$qcountry<-NULL
names(ar_data)
names(ar_data)<-c("01_insufficient_efficacy","02_safety_issues","03_inconvenient_dosage_regimen","04_price_issues"
,"05_not_reimbursed","06_not_inculed_govt","07_insuficient_clinicaldata","08_previously_used","09_prescription_opted_for_some_patients","10_scientific_info_NA","12_involved_in_diff_clinical_trial"
,"13_patient_inappropriate_for_TT","14_patient_inappropriate_Erb","16_patient_over_65","17_Erbitux_alternative","95_Others")
names(ar_data)
ar_data_wdt_zero_columns<-ar_data[, colSums(ar_data != 0) > 0]
####Testing multicollinearity####
vif(ar_data_wdt_zero_columns)
#### Testing appropriatness of PCA ####
KMO(ar_data_wdt_zero_columns)
cortest.bartlett(ar_data_wdt_zero_columns)
#### Run PCA ####
pca<-prcomp(ar_data_wdt_zero_columns,center=F,scale=F)
summary(pca)
#### Compute the loadings for deciding the top4 most correlated variables###
load<-pca$rotation
write.csv(load,"loadings_argentina_2015_Q4.csv")
I have shown the code here for one country; I have done this for 9 countries, and for each country I have to run this code again. I am sure there must be an easier way to automate it. Please suggest!!
Thanks!!
Yes, this is doable for every country. You can write a custom function that takes appropriate parameters, e.g. the country name and the data. You do the magic inside and return an appropriate object (or not). Apply this function to the processed data, which you import and tidy up only once. The code below is not tested but should get you started.
A few comments.
Don't use file.choose(), as it will break your code three days down the line. How do you know which file to choose? Why click every time you run the script when you can make the script work for you? Be lazy in that sense.
You have a lot of clutter in your script. Adhere to some style, and don't leave in random lines you tried out for "shits and giggles". At the very least, use spaces in your code.
Be more imaginative in choosing object names, and check first whether the name already exists as a base function, e.g. load.
myPCA <- function(my.country, my.data) {
  # KMO() and cortest.bartlett() come from the psych package;
  # vif() from whichever package the original script loaded
  country_data <- my.data[my.data$qcountry %in% my.country, ]
  country_data$qcountry <- NULL
  nonzero_data <- country_data[, colSums(country_data != 0) > 0]
  #### Run PCA ####
  pca <- prcomp(nonzero_data, center = FALSE, scale = FALSE)
  #### Write out the loadings for deciding the top 4 most correlated variables ####
  write.csv(pca$rotation, paste("loadings_", my.country, ".csv", sep = "")) # may need tweaking
  return(list(pca = pca, vif = vif(nonzero_data),
              kmo = KMO(nonzero_data),
              correlation = cortest.bartlett(nonzero_data)))
}
data <- read.csv("relative_link_to_file", header = TRUE, sep = ",")
names(data) <- c("01_insufficient_efficacy","02_safety_issues","03_inconvenient_dosage_regimen","04_price_issues"
,"05_not_reimbursed","06_not_inculed_govt","07_insuficient_clinicaldata","08_previously_used","09_prescription_opted_for_some_patients","10_scientific_info_NA","12_involved_in_diff_clinical_trial"
,"13_patient_inappropriate_for_TT","14_patient_inappropriate_Erb","16_patient_over_65","17_Erbitux_alternative","95_Others")
sapply(data$qcountry, FUN = myPCA)
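Each element of results then holds the PCA, VIF, KMO, and Bartlett output for one country; naming the list makes per-country lookup easy:
names(results) <- unique(data$qcountry)
results[["ar"]]$pca  # e.g. the PCA object for Argentina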
I have multiple time series (each in a separate file) which I need to adjust seasonally using the seasonal package in R, storing each adjusted series in a separate file again, in a different directory.
The code works for a single county.
So I tried to use a for-loop, but R is unable to use read.dta with a wildcard.
I'm new to R and usually use Stata, so the question may be quite basic and my code quite messy.
Sorry, and thanks in advance.
Nathan
for(i in 1:402)
{
alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
alo_ts[i] <-ts(alo[i], freq = 12, start = 2007)
m[i] <- seas(alo_ts[i])
original[i]<-as.data.frame(original(m[i]))
adjusted[i]<-as.data.frame(final(m[i]))
trend[i]<-as.data.frame(trend(m[i]))
irregular[i]<-as.data.frame(irregular(m[i]))
County[i] <- data.frame(cbind(adjusted[i],original[i],trend[i],irregular[i], deparse.level =1))
write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely that you're using Stata-like character-string construction that will not work in R. You need to use paste (or paste0, as here) rather than passing the indexing variable directly inside the string as in Stata. Here's some code:
library(foreign)   # read.dta() and write.dta()
library(seasonal)  # seas(), original(), final(), trend(), irregular()

f <- function(i) {
  d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County", i, ".dta"))
  alo_ts <- ts(d, freq = 12, start = 2007)
  m <- seas(alo_ts)
  original <- as.data.frame(original(m))
  adjusted <- as.data.frame(final(m))
  trend <- as.data.frame(trend(m))
  irregular <- as.data.frame(irregular(m))
  County <- cbind(adjusted, original, trend, irregular, deparse.level = 1)
  write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County", i, ".dta"))
  invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
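To keep the returned datasets in your session as well, assign the result and name the list by county number (a small usage sketch):
county_results <- lapply(1:402, f)
names(county_results) <- paste0("County", 1:402)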
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simplify the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible, should someone ever run it on another computer, for example.