I have a large character vector (called file) and I need to draw a random sample from it. This works fine. But I need to draw sample after sample. To do that, I want to remove from file every element that has already been drawn, so that a new sample can never contain an element that was drawn before.
I have some attempts at a solution, but I'm interested in anything else that might work faster and, more importantly, correctly.
Here are my tries:
Approach 1
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
a <- data.frame()
for (i in 1:length(rand_no)){
a <- rbind(a, which.names(rand_no[i], file))
file <- file[-a[1,1]]
}
Problem:
Warning message:
In which.names(rand_no[i], file) : 297 not matched
Approach 2
file <- rep(1:10000)
rand_no <- sample(file, 100)
library(car)
deleter <- function(i) {
a <- which.names(rand_no[i], file)
file <- file[-a]
}
lapply(1:length(rand_no), deleter)
Problem:
This doesn't work at all. Maybe I should split the question, because the second problem clearly lies with me not fully understanding lapply.
Thanks for any suggestions.
Edit
I had hoped that it would work with numbers, but of course file actually looks like this:
file <- c("Post-19960101T000000Z-1.tsv", "Post-19960101T000000Z-2.tsv", "Post-19960101T000000Z-3.tsv","Post-19960101T000000Z-4.tsv", "Post-19960101T000000Z-5.tsv", "Post-19960101T000000Z-6.tsv", "Post-19960101T000000Z-7.tsv","Post-19960101T000000Z-9.tsv")
Of course rand_no can't be over 100 files with such a small sample. Therefore:
rand_no <- sample(file, 2)
Use list instead of c. Then you can set elements to NULL and they will be removed.
file[file %in% rand_no] <- NULL finds all elements of file that appear in rand_no and removes them.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(file, 2)
library(car) #From poster's code.
file[file %in% rand_no] <- NULL
If you are working with a large list of files, using %in% to compare strings may bog you down. In that case I would use indexes.
file <- list("Post-19960101T000000Z-1.tsv",
"Post-19960101T000000Z-2.tsv",
"Post-19960101T000000Z-3.tsv",
"Post-19960101T000000Z-4.tsv",
"Post-19960101T000000Z-5.tsv",
"Post-19960101T000000Z-6.tsv",
"Post-19960101T000000Z-7.tsv",
"Post-19960101T000000Z-9.tsv")
rand_no <- sample(1:length(file), 2)
library(car) #From poster's code.
file[rand_no] <- NULL
sample() already returns values in a permuted order without replacement (unless you set replace = TRUE), so it will never pick the same element twice within one call.
So if you want three sets of 100 samples that don't share any elements, you can use
file <- rep(1:10000)
rand_no <- sample(seq_along(file), 300)
s1<-file[rand_no[1:100]]
s2<-file[rand_no[101:200]]
s3<-file[rand_no[201:300]]
Or if you wanted to decrease the total size by 100 each time, you could do
s1<-file[-rand_no[1:100]]
s2<-file[-rand_no[1:200]]
s3<-file[-rand_no[1:300]]
A simple approach would be to select random indices and then remove those indices:
file <- 1:10000 # Build sample data
ind <- sample(seq_along(file), 100) # Select random indices
rand_no <- file[ind] # Compute the actual values selected
file <- file[-ind] # Remove selected indices
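To draw sample after sample, the same two steps can be repeated in a loop. The following is a minimal sketch, assuming a character vector of made-up file names; draw_samples is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: draw n_draws disjoint samples of size k from x
draw_samples <- function(x, k, n_draws) {
  stopifnot(k >= 1, k * n_draws <= length(x))
  out <- vector("list", n_draws)
  for (j in seq_len(n_draws)) {
    ind <- sample(seq_along(x), k)  # random positions in what is left
    out[[j]] <- x[ind]              # keep the drawn elements
    x <- x[-ind]                    # shrink the pool before the next draw
  }
  out
}

file <- paste0("Post-19960101T000000Z-", 1:20, ".tsv")  # made-up names
draws <- draw_samples(file, k = 5, n_draws = 3)
```

Because positions are sampled rather than values, this should also behave sensibly when the vector contains duplicated names.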
I think using sample and split could be a nice way of doing this, without having to alter your files variable. I'm not a big fan of mutation, unless you really need to, and this would let you know exactly which files you used for each chunk of the analysis going forward.
files<-paste("file",1:100,sep="_")
randfiles<-sample(files, 50)
randfiles_chunks<-split(randfiles,seq(1,length(randfiles), by=10))
Related
I'm learning R by using it on one project where I need to extract unique paths from logs.
Now, the workaround in the lower part of the code works, but I had to split the log into two files and perform the grouping on each of them separately. When I tried the same on variables, I got all the data in all three path counts.
Can someone point out what is wrong with the first approach? I doubt that physically writing files to disk is the intended way.
a = read.csv('download-report-06-10-2017.csv')
yesterdays_data <- a[grepl("2017-10-05", a$Download.Time), ]
todays_data <- a[grepl("2017-10-06", a$Download.Time), ]
write.csv(yesterdays_data, "yesterdays.csv")
write.csv(todays_data, "todays.csv")
path_count <- as.data.frame(table(a$Path))
path_count_today <- as.data.frame(table(todays_data$Path))
path_count_yday <- as.data.frame(table(yesterdays_data$Path))
#### path_count, path_count_today & path_count_yday contain the same values and I expect them to be different ???
yd = read.csv('yesterdays.csv')
td = read.csv('todays.csv')
path_count_td <- as.data.frame(table(td$Path))
path_count_yd <- as.data.frame(table(yd$Path))
#### path_count_td and path_count_yd are different, as I'd expect in upper three variables
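One likely explanation, assuming Path was read in as a factor (the default for read.csv before R 4.0.0): subsetting rows keeps all factor levels, so table() still lists every path, with a count of zero for paths absent from the subset. Writing to disk and reading back rebuilds the levels, which would explain why the workaround behaves differently. A small sketch with made-up data:

```r
# Made-up stand-in for the download report
a <- data.frame(
  Download.Time = c("2017-10-05 10:00", "2017-10-05 11:00", "2017-10-06 09:00"),
  Path = factor(c("/a", "/b", "/c"))
)
todays_data <- a[grepl("2017-10-06", a$Download.Time), ]

# All three levels survive the subset, two of them with a count of 0
path_count_all_levels <- as.data.frame(table(todays_data$Path))
nrow(path_count_all_levels)

# Dropping unused levels leaves one row per path actually present
path_count_today <- as.data.frame(table(droplevels(todays_data$Path)))
nrow(path_count_today)
```

Reading the file with stringsAsFactors = FALSE should have the same effect, since table() on a character vector only lists the values that occur.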
I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, what I need to be able to do is to automate this: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v" for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help as your example is not reproducible.
Assuming you're happy with the result of mars.word.v,
you can turn this portion of code into a function that accepts a single argument,
the result of scan.
processing_section <- function(x){
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""] # remove blanks, as in the original code
}
Then, if all .txt files are in the current working directory, you should be able to list them,
and apply this function with:
lf <- list.files(pattern="\\.txt$") # pattern is a regular expression, so escape the dot and anchor the end
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
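To keep track of which result belongs to which file, the list can be named after the files. This is a sketch under stated assumptions: the two text files are made up and written to a temporary directory so that the example runs on its own, and the blank-removal step from the question is included in the function.

```r
processing_section <- function(x) {
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""]  # drop the empty strings left by strsplit
}

# Two made-up text files in a temporary directory
dir <- file.path(tempdir(), "texts_demo")
dir.create(dir, showWarnings = FALSE)
writeLines(c("Hello, World!", "Second line."), file.path(dir, "pupil-1.txt"))
writeLines("Another file here", file.path(dir, "pupil-2.txt"))

lf <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
words_by_file <- lapply(lf, function(path) {
  processing_section(scan(path, what = "character", sep = "\n", quiet = TRUE))
})
names(words_by_file) <- basename(lf)  # one named word vector per file

words_by_file[["pupil-1.txt"]]
```

A list is used rather than a data frame because the word vectors will generally have different lengths.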
I have a dataframe data with information on tiffs, including one column txt describing the content of the tiff. Unfortunately, txt is not always correct and we need to correct the entries by hand. Therefore I want to loop over each row in data, show the tiff and ask for feedback, which is then put into data$txt.cor.
setwd(file.choose())
Some test tiffs (with nonsense inside, but to show the idea...):
txt <- sample(100:199, 5)
for (i in 1:length(txt)){
tiff(paste0(i, ".tif"))
plot(txt[i], ylim = c(100, 200))
dev.off()
}
and the dataframe:
pix.files <- list.files(getwd(), pattern = "*.tif", full.names = TRUE)
pix.file.info <- file.info(pix.files)
data <- cbind(txt, pix.file.info)
data$file <- row.names(pix.file.info)
data$txt.cor <- ""
data$txt[5] <- 200 # wrong one
My feedback function (error handling stripped):
read.number <- function(){
n <- readline(prompt = "Enter the value: ")
n <- as.character(n) #Yes, character. Sometimes we have alphanumerical data or leading zeros
}
Now the loop, for which help would be very much appreciated:
for (i in nrow(data)){
file.show(data[i, "file"]) # show the image file
data[i, "txt.cor"] <- read.number() # ask for the feedback and put it back into the dataframe
}
In my very first attempts I was thinking of the plot.lm idea, where you go through the diagnostic plots after pressing return. I suspect that plot and tiffs are not big friends. file.show turned out to be easier. But now I am having a hard time with that loop...
Your problem is that you don't loop over the data: for (i in nrow(data)) only evaluates the last row. Simply write 1:nrow(data) to iterate over all rows.
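The difference is easy to see on a small data frame (made up here): nrow(data) is a single number, so the loop body runs exactly once.

```r
data <- data.frame(txt = c(101, 150, 199))  # made-up stand-in

visited_wrong <- c()
for (i in nrow(data)) visited_wrong <- c(visited_wrong, i)
visited_wrong   # only the index of the last row

visited_all <- c()
for (i in seq_len(nrow(data))) visited_all <- c(visited_all, i)
visited_all     # every row index
```

seq_len(nrow(data)) is used here instead of 1:nrow(data) because it also behaves correctly for a data frame with zero rows.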
To display your tiff images in R you can use the package rtiff:
library(rtiff)
for (i in 1:nrow(data)){
tif <- readTiff(data[i,"file"]) # read in the tiff data
plot(tif) # plot the image
data[i, "txt.cor"] <- read.number() # ask for the feedback and put it back into the dataframe
}
I want to apply a for-loop to every element of a list (station code of air quality stations) and create a single data.frame for each station with specific data.
My current code looks like this:
for (i in Stations))
{i_PM <- data.frame(PM2.5$DateTime,PM2.5$i)
colnames(i_PM)[1] <- "DateTime"
i_AOT <- subset(MOD2011, MOD2011$Station_ID==i)
i <- merge(i_PM, i_AOT, by="DateTime")}
Stations consists of 28 elements. The result should be a data.frame for every station with the columns DateTime, PM2.5 and several elements from MOD2011.
I just can't get it running as it's supposed to. I'm sure it's my fault, but I couldn't find the specific answer on the internet.
Can you show me my mistake?
Try assign:
for (i in Stations) {
  dat <- data.frame(PM2.5$DateTime, PM2.5[[i]]) # PM2.5$i would look for a column literally named "i"
  colnames(dat)[1] <- "DateTime"
  dat2 <- subset(MOD2011, MOD2011$Station_ID==i)
  assign(paste(i, "_PM", sep=""), dat)
  assign(paste(i, "_AOT", sep=""), dat2)
  assign(i, merge(dat, dat2, by="DateTime"))
}
Note, however, that this is bad coding practice. You should reconsider your algorithm. For instance, use a list instead.
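A list-based version could look like the following sketch. The data are made up; it assumes that PM2.5 has a DateTime column plus one column per station code, and that MOD2011 has DateTime and Station_ID columns, as the question suggests.

```r
# Made-up stand-ins for the poster's objects
Stations <- c("ST01", "ST02")
PM2.5 <- data.frame(DateTime = c("2011-01-01", "2011-01-02"),
                    ST01 = c(10, 12), ST02 = c(20, 22))
MOD2011 <- data.frame(DateTime = c("2011-01-01", "2011-01-02", "2011-01-01"),
                      Station_ID = c("ST01", "ST01", "ST02"),
                      AOT = c(0.1, 0.2, 0.3))

# One merged data frame per station, collected in a named list
merged <- lapply(Stations, function(s) {
  pm <- data.frame(DateTime = PM2.5$DateTime, PM2.5 = PM2.5[[s]])
  aot <- MOD2011[MOD2011$Station_ID == s, ]
  merge(pm, aot, by = "DateTime")
})
names(merged) <- Stations

merged[["ST02"]]
```

Each station's result is then available as merged[["ST01"]] and so on, without creating 28 separate variables in the workspace.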
I am trying to write an input file that requires a single line in the first row telling if the file is sparse and if so how many variable levels there are. I know how to append a single line to the end of a file, but can't find a way to append to the first line of a file. Any suggestions?
library(e1071)
library(caret)
library(Matrix)
library(SparseM)
iris2 <- iris
iris2$sepalOver5 <- ifelse(iris2$Sepal.Length >= 5, 1, -1)
head(iris2)
summary(iris2)
trainRows <- sample(1:nrow(iris2), nrow(iris2) * .66, replace = F)
testRows <- which(!(1:nrow(iris2) %in% trainRows))
sum(testRows %in% trainRows)
sum(trainRows %in% testRows)
vtu1 <- c('Sepal.Width','Petal.Length','Petal.Width','Species')
dv1 <- dummyVars( ~., data = iris2[,vtu1], sparse = T)
train <- iris2[trainRows,]
test <- iris2[testRows,]
trainX <- as.matrix.csr(predict(dv1, train))
testX <- as.matrix.csr(predict(dv1, test))
trainY <- train[,'sepalOver5']
testY <- test[,'sepalOver5']
write.matrix.csr( as(trainX , "matrix.csr"), file= "amz.train" , fac = TRUE)
headString <- paste('sparse ',max(trainX@ja),sep = '')
I'd basically like to insert/append headString into amz.train in the first row. Any suggestions?
It is generally not possible to prepend to the start of a file in place. Where there are ways, they are really inefficient, because everything after the insertion point has to be rewritten. This holds for any programming language.
Three options come to mind:
1. Read in the file, write the other information first, followed by the rest of the content of the file (might also be inefficient).
2. Write the information you want to prepend first, then append the data.
3. If you have a writer that cannot append (write.matrix for instance has no append option), you could try to merge this meta-information with the data frame, and then write it as a whole.
Since you are using a specialized format, I wouldn't recommend storing this meta-information this way.
Your file would look like:
sparse 6
1:3 2:5.2 3:2 6:1
1:3.7 2:1.5 3:0.2 4:1
1:3.2 2:6 3:1.8 6:1
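The first option can be sketched with base R's readLines and writeLines. The file content below is made up to mirror the example, and a temporary file stands in for amz.train.

```r
path <- tempfile(fileext = ".train")   # stand-in for "amz.train"
writeLines(c("1:3 2:5.2 3:2 6:1",
             "1:3.7 2:1.5 3:0.2 4:1"), path)

headString <- "sparse 6"

# Read the existing lines, then rewrite the file with the header first
body <- readLines(path)
writeLines(c(headString, body), path)

result <- readLines(path)
result
```

This rewrites the whole file, so it may be slow for very large files.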
And then there is option 4:
Rather, consider having a meta file which contains information such as file name, whether it is sparse or not and the number of levels. Here you could append, and if you would repeat this process it would be preferable. It will avoid problems of reading in weirdly formatted files.