I want to write large amounts of data to a file in Julia. The data is generated and then stored in lists. Pseudocode:
f = open("test.csv", "w")
for i = 1:3
    position = [j for j = i:(i + 10)]
    string_position = string(position)
    n = length(string_position)
    write(f, string_position[2:(n - 1)]*"\n")
end
close(f)
However, it seems inefficient to compute the length of the string in each iteration and then strip its first and last characters.
Is there a faster way?
One simple optimization is to use
write(f, string_position[2:(n - 1)], "\n")
instead of *. This writes the two objects in succession, instead of first concatenating them and then writing the result.
It might also be faster to use a SubString, which references part of another string in place without copying.
In general, it is also likely to be faster to avoid creating intermediate strings. Instead of first making a string and then writing it, write the individual items. For example
for item in position
    print(f, item, ",")
end
print(f, "\n")
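I should add that older versions of Julia had a writecsv function in the standard library that does this for you. In Julia 1.0 and later, the equivalent is writedlm(f, data, ',') from the DelimitedFiles standard library.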
Trying to learn the ropes in R and already struggling trying to find a replacement for SAS macro.
I'm trying to run a piece of code several times, but I'm having a hard time and came here for help.
First, I'm working with this example file, which has a variable giving the number of rows that I previously analysed in another file (qtde_registros), followed by three variables giving the number of rows that had each type of error.
file <- readRDS(file="file.Rda")
file
  qtde_registros error1 error2 error3
1           1175      0      0      0
After that, I created a list with the errors and another one with the description of each one of them.
Then, using those lists and the file mentioned initially, I wish to create several files (one for each error) that will later be bound together into one last file to form a final report.
As I said, I'm struggling with it, so I made an example code of how it would be forming the first file:
error_list <- list("Error1", "Error2", "Error3")
description_list <- list("Code not found",
                         "Invalid date.",
                         "Negative value.")
error1 <- file
error1$file_name <- "Clients"
error1$error <- error_list[1]
error1$qtde <- error1$error1
error1$desc <- description_list[1]
error1 <- select(error1, file_name, error, qtde, desc)
error1
  file_name  error qtde           desc
1   Clients Error1    0 Code not found
And that leads to my question: how can I make the code above run several times, once for each error on my list?
I'm aware that the whole mentality may not be the best, as the approach to certain things differs depending on the language used, but I have to work with the knowledge I have at the moment.
I'm thinking of using the apply family of functions, but I didn't manage to work it out.
Thanks in advance for the help and sorry for any errors in typing or grammar (English is not my first language).
EDIT: forgot to say that I don't intend to do this via a for or while loop.
In R (and many other languages) you would normally use some form of for-loop. R also provides several loop wrappers with specific outputs in the *apply family. Here's a short (incomplete) list of the *apply family and their input/output:
lapply -> list output
sapply -> list or atomic vector (integer vector, numeric vector, etc.)
mapply -> similar to sapply, but can take more than one input to iterate over (for example, two things to loop over simultaneously)
tapply -> loops over groups defined by INDEX
apply -> loops over an array (either rows or columns) and returns a matrix/vector
And so on.
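As a rough illustration of the input/output differences listed above, here is a small sketch on toy data (the vectors and the sum/product functions are made up purely for illustration and are not from the question):

```r
# Toy input: a named list of integer vectors (illustrative only)
x <- list(a = 1:3, b = 4:6)

l_out <- lapply(x, sum)   # always a list: $a = 6, $b = 15
s_out <- sapply(x, sum)   # simplified to a named vector: a = 6, b = 15

# mapply walks over two inputs in parallel: 1*4, 2*5, 3*6
m_out <- mapply(function(u, v) u * v, 1:3, 4:6)

# tapply sums the values within each group given by the second argument
t_out <- tapply(c(10, 20, 30, 40), c("g1", "g2", "g1", "g2"), sum)

# apply with MARGIN = 1 works row by row on a matrix
a_out <- apply(matrix(1:6, nrow = 2), 1, sum)
```

Printing each of the `*_out` objects is a quick way to see which shape each wrapper returns.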
I am guessing that your example is incomplete, but I'll show 3 examples to get you started. One using a for-loop, one using lapply and one using mapply.
for-loop
A for-loop is the classic method (found in most programming languages). It works by having a for(---) where --- is replaced by something to iterate over. This could be error_list or it could be a numeric vector seq(1, n) or 1:n. Here you have more than 1 thing to iterate over, so a numeric vector makes sense (and we use this to subset the data)
errors <- list() # <== Somewhere to put our results
for(i in 1:length(error_list)){
  error_i <- list(file = file,
                  file_name = "Clients",
                  error = error_list[[i]], # Use i to subset error_list
                  qtde = error_list[[i]], # Maybe this should be something else in your case
                  desc = description_list[[i]]
  )
  # Put into our errors list. Create "error1" using paste and our index
  errors[[paste0('error', i)]] <- error_i
}
By the end, all of your results will be in the errors list, where a single element can be extracted using errors[[1]] or errors[["error1"]] (change the number to your error). The results can then be combined using do.call(rbind, errors) and saved using write.table (or write.csv or similar).
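To make the combine-and-save step concrete, here is a minimal sketch that builds one data frame row per error and binds them into a single report. It assumes, which the question does not confirm, that the count for error i lives in the column named "error<i>" of `file`. The `file` data frame is reconstructed from the printed output in the question, and `rows`, `report` and `out_path` are names invented for this example.

```r
# Data reconstructed from the question's printed output
file <- data.frame(qtde_registros = 1175, error1 = 0, error2 = 0, error3 = 0)
error_list <- list("Error1", "Error2", "Error3")
description_list <- list("Code not found", "Invalid date.", "Negative value.")

rows <- vector("list", length(error_list))  # pre-allocated result list
for (i in seq_along(error_list)) {
  rows[[i]] <- data.frame(
    file_name = "Clients",
    error = error_list[[i]],
    # Assumption: the count for error i is stored in column "error<i>"
    qtde = file[[paste0("error", i)]],
    desc = description_list[[i]],
    stringsAsFactors = FALSE
  )
}

# Bind the per-error rows into one report and save it
report <- do.call(rbind, rows)
out_path <- file.path(tempdir(), "report.csv")  # hypothetical output location
write.csv(report, out_path, row.names = FALSE)
```

Because every element of `rows` is a data frame with the same columns, do.call(rbind, rows) gives a regular data frame with one row per error.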
lapply
For the *apply family, the *apply function takes care of the looping. In exchange, we have to provide a function (a macro in SAS terms) to execute in each iteration. So we wrap the body of the loop above in a function.
macro <- function(i){
  list(file = file,
       file_name = "Clients",
       error = error_list[[i]], # Use i to subset error_list
       qtde = error_list[[i]], # Maybe this should be something else in your case?
       desc = description_list[[i]]
  )
}
errors <- lapply(1:length(error_list), macro)
#set names afterwards
names(errors) <- paste0("error", 1:length(error_list))
And once again we have the data ready to be extracted, saved, etc. This is equivalent to:
errors <- list()
for(i in 1:length(error_list))
  errors[[i]] <- macro(i)
names(errors) <- paste0("error", 1:length(error_list))
mapply
Now in your case you have more than 1 thing to iterate over. An alternative is to use mapply and add these as parameters to your function instead. This way we remove error_list[[i]] and description_list[[i]] from the function and instead add these as parameters
macro_mapply <- function(error, description){
  list(file = file,
       file_name = "Clients",
       error = error, # No need to use i here anymore
       qtde = error, # Maybe this should be something else in your case?
       desc = description
  )
}
errors <- mapply(macro_mapply,
                 # parameters to iterate over come after the function
                 error = error_list,
                 description = description_list,
                 # Avoid simplification (if we want a list returned)
                 SIMPLIFY = FALSE)
names(errors) <- paste0("error", 1:length(error_list))
Note that "mapply" will try to return a vector if possible, so I set SIMPLIFY = FALSE to avoid this.
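A small sketch of what the SIMPLIFY argument changes, using a toy function (`add` is made up for illustration). Base R's Map is a wrapper around mapply that never simplifies, so it can stand in for the SIMPLIFY = FALSE call:

```r
# Toy function with two parameters to iterate over in parallel
add <- function(x, y) x + y

simplified <- mapply(add, 1:3, 4:6)                  # integer vector: 5 7 9
as_list    <- mapply(add, 1:3, 4:6, SIMPLIFY = FALSE) # list of three elements
via_map    <- Map(add, 1:3, 4:6)                      # same list as as_list
```

When each call returns something more complex than a single value (such as the lists built by macro_mapply), keeping the list form usually makes the result easier to work with.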
Things to note:
In the above 3 examples I have not taken into account if you read multiple files, or any other parameters changing. So if you have to read a file in each iteration it will make sense to go with the first 2 examples and add readRDS to the loop or function with appropriate file naming. Also I have used your data, but I am guessing qtde and error should be different in your specific case but this is not clear from your example.
Once you've gotten the hang of your first loops and somewhat understand how the *apply functions work, I would suggest checking out the tidyverse, which provides what many find to be a more "user-friendly" and intuitive interface to data transformation.
I hope this helps you get started on solving your problem.
I have a list of identifiers as follows:
url_num <- c('85054655', '85023543', '85001177', '84988480', '84978776', '84952756', '84940316', '84916976', '84901819', '84884081', '84862066', '84848942', '84820189', '84814935', '84808144')
And from each of these I'm creating a unique variable:
for (id in url_num){
  assign(paste('test_', id, sep = ""), FUNCTION GOES HERE)
}
This leaves me with my variables which are:
test_85054655, test_85023543, etc.
Each of them holds the correct output from the function (I've checked); however, my next step is to combine them into one big vector which holds each of these created variables as a separate element. This is easy enough via:
c(test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144)
However, as I update the original 'url_num' vector with new identifiers, I'd also have to come down to the above chunk and update this too!
Surely there's a more automated way I can setup the above chunk?
Maybe some sort of concat() function in the original for-loop which just adds each created variable straight into an empty vector right then and there?
So far I've just been trying to list all the variable names and somehow get the output to be in an acceptable format to get thrown straight into the c() function.
for (id in url_num){
  cat(as.name(paste('test_', id, ",", sep = "")))
}
...which results in:
test_85054655,test_85023543,test_85001177,test_84988480,test_84978776,test_84952756,test_84940316,test_84916976,test_84901819,test_84884081,test_84862066,test_84848942,test_84820189,test_84814935,test_84808144,
This is close to the output I'm looking for but because it's using the cat() function it's essentially a print statement and its output can't really get put anywhere. Not to mention I feel like this method I've attempted is wrong to begin with and there must be something simpler I'm missing.
Thanks in advance for any help you guys can give me!
Troy
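One possible way to avoid maintaining the c(...) call by hand is to skip the separate variables and collect the results in a named list instead. The following is only a sketch under stated assumptions: `process_id` is a hypothetical placeholder for the real function (here it just counts characters), and the identifier vector is shortened.

```r
url_num <- c("85054655", "85023543", "85001177")  # shortened for the example

# Hypothetical placeholder for the real per-identifier function
process_id <- function(id) nchar(id)

# One list element per identifier, named like the original variables
results <- lapply(url_num, process_id)
names(results) <- paste0("test_", url_num)

# Flatten the list into a single vector
combined <- unlist(results)
```

Adding a new identifier to url_num then updates `combined` automatically. If the test_ variables already exist from the assign() loop, mget(paste0("test_", url_num)) should return them as a named list that can be flattened the same way.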
I'm using base::paste in a for loop:
for (k in 1:length(summary$pro))
{
  if (k == 1)
    mp <- summary$pro[k]
  else
    mp <- paste(mp, summary$pro[k], sep = ",")
}
mp comes out as one big string, where the elements are separated by commas.
For example mp is "1,2,3,4,5,6"
Then, I want to put mp in a file, where each of its elements is added to a separate column in the same row. My code for this is:
write.table(mp, file = recompdatafile, sep = ",")
However, mp just appears in the CSV as one big string as opposed to being divided up. How can I achieve my desired format?
FYI
I've also tried converting mp to a list, and strsplit()-ing it, neither of which have worked.
Once I've added summary$pro to the file, how can I also add summary$me (which has the same format), in one row with multiple columns?
Thanks,
n.i.
If you want to write something to a file, write.table() isn't the only way. If you want to avoid headers and quotes and such, you can use the more direct cat. For example
cat(summary$pro, sep=",", file="filename.txt")
will write out the vector of values from summary$pro separated by commas more directly. You don't need to build a string first. (And building a string one element at a time as you did above is a bad practice anyway. Most functions in R can operate on an entire vector at a time, including paste).
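As a sketch of both points (writing each vector as one comma-separated row, and building the string in a single vectorised paste call), something like the following could work. The `summary` list here is a made-up stand-in for the asker's object, and the temporary file is only for illustration.

```r
# Made-up stand-in for the question's `summary` object
summary <- list(pro = 1:6, me = c(10, 20, 30, 40, 50, 60))
recompdatafile <- tempfile(fileext = ".csv")  # illustrative output file

# First row: summary$pro, one value per column
cat(summary$pro, sep = ",", file = recompdatafile)
cat("\n", file = recompdatafile, append = TRUE)

# Second row: summary$me appended to the same file
cat(summary$me, sep = ",", file = recompdatafile, append = TRUE)
cat("\n", file = recompdatafile, append = TRUE)

# Vectorised replacement for the element-by-element paste loop
mp <- paste(summary$pro, collapse = ",")

written <- readLines(recompdatafile)
```

Here `written` holds the two rows "1,2,3,4,5,6" and "10,20,30,40,50,60", and `mp` is the same string the original loop builds.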
I am trying to create a function that calculates the frequency count of keywords using the tm package. The function works fine if the text pasted into readline is free-form text without a newline. The problem is that when I paste a block of text copied from a spreadsheet, readline treats each row as a new line of input.
keyword <- function() {
  x <- readline(as.character('Input text here: '))
  x <- Corpus(VectorSource(x))
  ...
  tdm <- TermDocumentMatrix(x)
  ...
  tdm
}
Here's the full code: https://github.com/CSCDataAnalytics/PM-Analysis/blob/master/Keyword.R
How can I prevent this from happening or at least consider a bunch of text of every row from the spreadsheet as one vector only?
If I'm understanding you correctly, the problem is when the user pastes the text from another application: the newline is causing R to stop accepting the subsequent lines.
One technique (fragile as it may be) is to look for a specific line, such as an empty line "" or a period ".". It's a little fragile because now you need (1) assurance that the data will "never" include that as a whole line, and (2) it is easily appended by the user.
Try:
endofinput <- ""
totalstr <- ""
while(! endofinput == (x <- readline('prompt (empty string when done): ')))
  totalstr <- paste(totalstr, x)
In this case, the empty string is the catch, and when the while loop is done, totalstr contains all input separated by a space (this can be changed in the paste function).
NB: one problem with this technique is that it is "growing" the vector totalstr, which will eventually cause performance penalties (depending on the size of the input data): on every loop iteration, more memory is allocated and the entire string is copied along with the new line of text. There are more verbose ways to side-step this problem (e.g., pre-allocating a vector larger than your anticipated input data), but if you aren't anticipating 1000s of lines then you may be able to accept this naive programming for simplicity.
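The pre-allocation idea could be sketched roughly as follows. All names here (`collect_input`, `read_line`, `end_marker`, `chunk`) are invented for this example. The line reader is passed in as a function so the logic can be tried without an interactive session, and by default it wraps readline.

```r
collect_input <- function(read_line = function() readline("prompt (empty line when done): "),
                          end_marker = "", chunk = 1000L) {
  buf <- character(chunk)  # pre-allocated storage for the input lines
  n <- 0L                  # number of lines stored so far
  repeat {
    x <- read_line()
    if (identical(x, end_marker)) break
    n <- n + 1L
    # Grow in whole chunks only when the buffer is full
    if (n > length(buf)) buf <- c(buf, character(chunk))
    buf[n] <- x
  }
  # Join the collected lines once, at the end
  paste(buf[seq_len(n)], collapse = " ")
}

# Non-interactive demo with a fake reader that replays fixed lines
fake_reader <- local({
  i <- 0L
  lines <- c("first row", "second row", "")
  function() {
    i <<- i + 1L
    lines[i]
  }
})
demo_text <- collect_input(fake_reader)
```

In an interactive session, calling collect_input() with no arguments would prompt until an empty line is entered. Unlike the while loop, this joins the lines only once, so the result has no leading space.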
Another option would be to have the user save the data to a text file and use file.choose() and readLines() to get your data.
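A minimal sketch of that file-based option, assuming the spreadsheet rows have been saved as plain text with one row per line. `read_pasted_text` is a name made up for this example, and the demo writes a temporary file in place of the interactive file.choose() dialog.

```r
# Read every line of a text file and collapse it into one string.
# With no argument, file.choose() opens a file picker (interactive use only).
read_pasted_text <- function(path = file.choose()) {
  paste(readLines(path), collapse = " ")
}

# Demo with a temporary file standing in for the user's saved data
tmp <- tempfile(fileext = ".txt")
writeLines(c("first row", "second row", "third row"), tmp)
all_text <- read_pasted_text(tmp)
```

The resulting single string could then be passed to VectorSource() in the same way as the readline result.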
Try collapsing the data into a single string after using readline
x <- paste(readline(as.character('Input text here: ')), collapse=' ')
Is there a way to have the object name become the file name character string when using write.table or write.csv?
In the following, a and b are vectors. I will be doing similar comparisons for many other pairs of vectors, and would like to not write out the object name as many times as I have been doing.
unique_downa <- a[!(a %in% b)]
write.csv(unique_downa, file = "unique_downa.csv")
Or if anyone has a suggestion for a better way to do this whole process, I'd be happy to hear it.
The idiomatic approach is to use deparse(substitute(blah)), e.g.:
write.csv.named <- function(x, ...){
  fname <- sprintf('%s.csv', deparse(substitute(x)))
  write.csv(x = x, file = fname, ...)
}
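For a sense of how this behaves in use, here is a hedged sketch of a variant with toy data. The `dir` argument and the name `write_csv_named` are additions made for this example (so the demo can write to a temporary directory) and are not part of the answer above. The vectors `a` and `b` are made up.

```r
# Variant of the function above with a hypothetical `dir` argument
write_csv_named <- function(x, dir = ".", ...) {
  # deparse(substitute(x)) recovers the name of the object passed as x
  fname <- file.path(dir, sprintf("%s.csv", deparse(substitute(x))))
  write.csv(x, file = fname, ...)
  invisible(fname)  # return the path so the caller can check it
}

# Toy vectors standing in for the asker's data
a <- c(1, 2, 3, 4)
b <- c(3, 4, 5)
unique_downa <- a[!(a %in% b)]

out <- write_csv_named(unique_downa, dir = tempdir())
```

The file name comes from the expression at the call site, so this only gives a sensible name when a plain variable is passed. Passing an expression such as a[!(a %in% b)] directly would produce an awkward file name.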
It might be easiest to use the names of elements of a list instead of trying to use object names:
mycomparisons <- list(unique_downa = a[!(a %in% b)], unique_downb = b[!(b %in% a)])
mapply(write.csv, mycomparisons, paste(names(mycomparisons), ".csv", sep = ""))
The best thing to do is probably put your vectors in a list, and then do the comparisons, the naming, and the writing out all inside the same loop, but that depends on how similar these similar comparisons are...