Saving code by avoiding writing consecutive numbers in R

Suppose we have these objects:
distM_ref1_matrix <- dist(cbind(distM_ref1$x, distM_ref1$y))
distM_ref2_matrix <- dist(cbind(distM_ref2$x, distM_ref2$y))
distM_ref3_matrix <- dist(cbind(distM_ref3$x, distM_ref3$y))
distM_ref4_matrix <- dist(cbind(distM_ref4$x, distM_ref4$y))
distM_ref5_matrix <- dist(cbind(distM_ref5$x, distM_ref5$y))
distM_ref6_matrix <- dist(cbind(distM_ref6$x, distM_ref6$y))
distM_ref7_matrix <- dist(cbind(distM_ref7$x, distM_ref7$y))
distM_ref8_matrix <- dist(cbind(distM_ref8$x, distM_ref8$y))
distM_ref9_matrix <- dist(cbind(distM_ref9$x, distM_ref9$y))
I have two questions:
How can I save code by creating all these objects automatically with a single line? Consider that I have tons of equivalent objects, not only the 9 in the example.
How can I calculate the mean value of all these objects at the same time?
I think the answer to the second question would look something like this:
mean(c(distM_ref[1:9]_matrix))
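One common approach (a sketch, assuming distM_ref1 through distM_ref9 already exist in the global environment): build the names with paste0(), fetch the objects with mget(), and keep everything in a list instead of 9 separate variables.
# Fetch distM_ref1 ... distM_ref9 by constructed name into a named list
ref_list <- mget(paste0("distM_ref", 1:9))
# Create every distance matrix in one line
matrix_list <- lapply(ref_list, function(d) dist(cbind(d$x, d$y)))
# Mean of all distance values across all objects at once
mean(unlist(matrix_list))
# Only if the separate distM_ref*_matrix objects are really needed:
list2env(setNames(matrix_list, paste0("distM_ref", 1:9, "_matrix")), envir = globalenv())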

Is there a way in R to pass subsets of a dataset into a variable

I'm learning R by using it on a project where I need to extract unique paths from logs.
My workaround (the lower part of the code) works, but I had to split the log into two files and perform the grouping on them separately. When I tried the same on in-memory variables, I got all the data in all three path counts.
Can someone point out what is wrong with the first approach? I doubt that physically writing files to disk is the intended way.
a = read.csv('download-report-06-10-2017.csv')
yesterdays_data <- a[grepl("2017-10-05", a$Download.Time), ]
todays_data <- a[grepl("2017-10-06", a$Download.Time), ]
write.csv(yesterdays_data, "yesterdays.csv")
write.csv(todays_data, "todays.csv")
path_count <- as.data.frame(table(a$Path))
path_count_today <- as.data.frame(table(todays_data$Path))
path_count_yday <- as.data.frame(table(yesterdays_data$Path))
#### path_count, path_count_today & path_count_yday contain the same values and I expect them to be different ???
yd = read.csv('yesterdays.csv')
td = read.csv('todays.csv')
path_count_td <- as.data.frame(table(td$Path))
path_count_yd <- as.data.frame(table(yd$Path))
#### path_count_td and path_count_yd are different, as I'd expect in upper three variables
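A likely explanation (a guess, since the data isn't shown): read.csv converts Path to a factor by default in R versions before 4.0, and subsetting a data frame keeps all factor levels, so table() tabulates every level in all three counts, just with zero frequencies for the absent paths. Writing to CSV and reading back drops the unused levels, which is why the file round-trip behaves differently. Two possible fixes, sketched:
# Fix 1: read strings as plain character vectors
a <- read.csv('download-report-06-10-2017.csv', stringsAsFactors = FALSE)
# Fix 2: drop unused factor levels after subsetting
todays_data <- droplevels(a[grepl("2017-10-06", a$Download.Time), ])
path_count_today <- as.data.frame(table(todays_data$Path))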

Including additional data at each stage of a loop

I am trying to create minimum convex polygons (MCPs) for a set of GPS coordinates. Each day has 32 coordinates, and I want to create an MCP with 1 day's, 2 days', 3 days'... and so on worth of data. For instance, in the first step I want to include rows 1-32, which I have managed:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:32, ]
Currently, to select the data for 2 days at a time, I have written:
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
mydata <- mydata[1:64, ]
Is there a way to automate adding 32 rows at each step (in a loop), rather than running the code manually and changing the amount of data used by hand each time?
I am very new to R, so I do not know whether this is possible. The way I thought it would work was:
n <- 32
for (i in 1:100) {
  mydata <- mydata[1:n, ]
  ## CREATE MCP AND STORE HOME RANGE OUTPUT
  n <- n + 32
}
However, this did not work, and it seems it is not possible to have n represent a row number this way. Is there a way to do this?
Apologies if this is unclear but as I said I am quite new to using R and really would appreciate any help that can be given.
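For what it's worth, n can be used as a row index; the real problem in the loop above is that mydata is overwritten on the first pass, so every later iteration subsets an already-shrunken data frame. A sketch that avoids this by keeping the original intact (the MCP step is a placeholder, since that code isn't shown):
mydata <- read.csv("file.csv", stringsAsFactors = FALSE)
n_days <- nrow(mydata) %/% 32         # number of complete 32-row days
results <- vector("list", n_days)     # one slot per cumulative home range
for (i in seq_len(n_days)) {
  day_subset <- mydata[1:(i * 32), ]  # rows for days 1..i; mydata untouched
  ## CREATE MCP AND STORE HOME RANGE OUTPUT from day_subset, e.g.
  ## results[[i]] <- ...              # placeholder for the original MCP code
}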

Read, manipulate and export multiple .dta files using a for loop in R

I have multiple time series (each in a separate file), which I need to adjust seasonally using the seasonal package in R, and then store each adjusted series in a separate file in a different directory.
The code works for a single county.
So I tried to use a for loop, but read.dta cannot take a wildcard.
I'm new to R and usually use Stata, so the question may be quite basic and my code quite messy.
Sorry and Thanks in advance
Nathan
for (i in 1:402) {
  alo[i] <- read.dta("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County[i]")
  alo_ts[i] <- ts(alo[i], freq = 12, start = 2007)
  m[i] <- seas(alo_ts[i])
  original[i] <- as.data.frame(original(m[i]))
  adjusted[i] <- as.data.frame(final(m[i]))
  trend[i] <- as.data.frame(trend(m[i]))
  irregular[i] <- as.data.frame(irregular(m[i]))
  County[i] <- data.frame(cbind(adjusted[i], original[i], trend[i], irregular[i], deparse.level = 1))
  write.dta(County[i], "/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County[i].dta")
}
This is a good place to use a function and the *apply family. As noted in a comment, your main problem is likely to be that you're using Stata-like character string construction that will not work in R. You need to use paste (or paste0, as here) rather than just passing the indexing variable directly in the string like in Stata. Here's some code:
library(foreign)   # for read.dta() and write.dta()
library(seasonal)  # for seas(), original(), final(), trend(), irregular()

f <- function(i) {
  d <- read.dta(paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/SINGLE_SERIES/County", i, ".dta"))
  alo_ts <- ts(d, freq = 12, start = 2007)
  m <- seas(alo_ts)
  original <- as.data.frame(original(m))
  adjusted <- as.data.frame(final(m))
  trend <- as.data.frame(trend(m))
  irregular <- as.data.frame(irregular(m))
  County <- cbind(adjusted, original, trend, irregular, deparse.level = 1)
  write.dta(County, paste0("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/ADJUSTED_SERIES/County", i, ".dta"))
  invisible(County)
}
# return a list of all of the resulting datasets
lapply(1:402, f)
It would probably also be a good idea to take advantage of relative directories by first setting your working directory:
setwd("/Users/nathanrhauke/Desktop/MA_NH/Data/ALO/SEASONAL_ADJUSTMENT/")
Then you can simplify the above paths to:
d <- read.dta(paste0("./SINGLE_SERIES/County",i,".dta"))
and
write.dta(County, paste0("./ADJUSTED_SERIES/County",i,".dta"))
which will make your code more readable and reproducible should, for example, someone ever run it on another computer.

R for loop to summarize matrix of data

New user to R (like 2 days of use new) coming from MATLAB, and the syntax nuances are driving me a little crazy. If anyone can point me in a direction on this topic, I would really appreciate it. I have a dataset (fl1.back) that has 32 variables (columns) and 513 measurements (rows), and I want to create a table with basic stats for 9 of the 32 columns. There's a separate dataset (fl2.back) that I would also like to pull 1 column of data from for the final table.
Here's the code I used to do the above tasks for 1 of the columns of data (sodium measurements) from fl1.back and fl2.back:
fl1.back <- read.delim("web.flat",comment.char="#",colClasses="character")
fl1.back <- fl1.back[-1,]
fl2.back <- read.delim("web.flat2",comment.char="#",colClasses="character")
fl2.back <- fl2.back[-1,]
head(fl1.back)
head(fl2.back)
#for rep criteria for sodium
back.sod.rep <- fl2.back[fl2.back$P00930!="",]
back.sod.rep$P00930 <- as.numeric(back.sod.rep$P00930)
back.sod.rep$P00930
#for samples...sodium
back.sod <- fl1.back[fl1.back$P00930!="",]
back.sod$P00930 <- as.numeric(back.sod$P00930)
back.sod$P00930
head(back.sod)
back.sod.summ <- data.frame("Sodium")
back.sod.summ
colnames(back.sod.summ) <- "Compound"
back.sod.summ$WQ_crit <- "20 mg/L"
back.sod.summ$n <- nrow(back.sod)
back.sod.summ$n_det <- nrow(back.sod[back.sod$R00930!="<",])
back.sod.summ$min <- min(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$max <- max(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$mean <- mean(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$median <- median(back.sod[back.sod$R00930!="<","P00930"])
back.sod.summ$percent_samp_det <- 100*(back.sod.summ$n_det/back.sod.summ$n)
back.sod.summ$percent_samp_above_crit <- 100*(length(back.sod[back.sod$P00930>20,"P00930"])/back.sod.summ$n)
back.sod.summ$percent_rep_above_crit <- (sum(back.sod.rep$P00930>=20)/(nrow(back.sod.rep)))
back.sod$P00930
length(back.sod[back.sod$P00930>back.sod.summ$WQ_crit,"P00930"])
back.sod.summ
final <- data.frame(back.sod.summ)
Instead of rewriting/copying and pasting the above code to create the data frame final, I would like to loop over the two datasets since I'm looking to repeat the same task, just on different columns of data. I really don't know where to start, and there doesn't seem to be much literature on for loops in R.
Any insight is appreciated!
Here is an example of what I think you want with the iris dataset:
library(plyr)
dlply(iris, .(Species), summary)
This can be extended if you need additional stats. Anyway, you probably should use (as I show above) the "split-apply-combine" approach as implemented in various functions and packages.
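For reference, the same split-apply-combine idea in base R, without plyr (a minimal sketch on the built-in iris data):
# by group: split the rows by Species, then apply summary() to each piece
lapply(split(iris, iris$Species), summary)
# or per column, which is closer to summarizing 9 chosen columns:
lapply(iris[1:4], summary)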

Delete row after row in for loop

I have a large character vector, file, and I need to draw a random sample from it. This works fine. But I need to draw sample after sample. To do that, I want to shorten file by every element that has already been drawn, so that I can draw a new sample without drawing the same element more than once.
I've got a solution of sorts, but I'm interested in anything else that might work faster and, more importantly, correctly.
Here are my tries:
Approach 1
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
a <- data.frame()
for (i in 1:length(rand_no)) {
  a <- rbind(a, which.names(rand_no[i], file))
  file <- file[-a[1, 1]]
}
Problem:
Warning message:
In which.names(rand_no[i], file) : 297 not matched
Approach 2
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
deleter <- function(i) {
  a <- which.names(rand_no[i], file)
  file <- file[-a]
}
lapply(1:length(rand_no), deleter)
Problem:
This doesn't work at all. Maybe I should split the question, because the second problem clearly lies with me not fully understanding lapply.
Thanks for any suggestions.
Edit
I hoped that it would work with numbers, but in fact file looks like this:
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv", "Post-19960101T000000Z-3.tsv","Post-19960101T000000Z-4.tsv", "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv", "Post-19960101T000000Z-7.tsv","Post-19960101T000000Z-9.tsv")
Of course rand_no can't draw over 100 files from such a small sample. Therefore:
rand_no <- sample(file, 2)
Use list instead of c. Then you can set the values to NULL and they will be removed.
file[file %in% rand_no] <- NULL finds all instances from rand_no in file and removes them.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)
library(car) #From poster's code.
file[file %in% rand_no] <- NULL
If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(1:length(file), 2)
library(car) #From poster's code.
file[rand_no] <- NULL
sample() already returns values in a permuted order without replacement (unless you set replace = TRUE), so it will never pick a value twice.
So if you want three sets of 100 samples that don't share any elements, you can use
file <- rep(1:10000)
rand_no <- sample(seq_along(file), 300)
s1<-file[rand_no[1:100]]
s2<-file[rand_no[101:200]]
s3<-file[rand_no[201:300]]
Or if you wanted to decrease the total size by 100 each time, you could do
s1 <- file[-rand_no[1:100]]
s2 <- file[-rand_no[1:200]]
s3 <- file[-rand_no[1:300]]
A simple approach would be to select random indices and then remove those indices:
file <- 1:10000 # Build sample data
ind <- sample(seq(length(file)), 100) # Select random indices
rand_no <- file[ind] # Compute the actual values selected
file <- file[-ind] # Remove selected indices
I think using sample and split could be a nice way of doing this, without having to alter your files variable. I'm not a big fan of mutation, unless you really need to, and this would let you know exactly which files you used for each chunk of the analysis going forward.
files<-paste("file",1:100,sep="_")
randfiles<-sample(files, 50)
randfiles_chunks<-split(randfiles,seq(1,length(randfiles), by=10))
