How to add rows to a .csv file - julia

I need to append rows to a .csv file. Each row contains a vector of integers. Vectors to be added are generated on the fly and I cannot tell in advance how many rows will be added.
I tried to do the following, but it doesn't seem to create a .csv file, nor does it contain each element on a separate row.
s = open( "testfile.txt", "a")
for i in 1:5
x = rand(Int8, 10)
writecsv( s, x)
end
close(s)
Any help is appreciated!

So, you know how to create or over-write a csv file using this:
x = [1 2 ; 3 4 ; 5 6]
filePath = "/someAbsolutePath/temp1.csv"
writecsv(filePath, x)
When you want to append a new row (or multiple rows in one hit), you can also use writecsv but first you need to explicitly open the file in append mode, and then set the resulting file-identifier as the first argument to writecsv. Continuing the example above:
fid1 = open(filePath, "a")
writecsv(fid1, [7 8])
close(fid1)
I don't think writecsv will automatically close the file upon completion when the first input is a file-identifier, so the final close statement is very important. Alternatively, you can use the version proposed by #ShaoWeiTeo in the comments to get an automatic close.

Related

Iterate through and conditionally append string values in a Pandas dataframe

I've got a dataframe of research participants whose IDs are stored in the following format "0000.000".
Where the first four digits are their family ID number, and the final three digits are their individual index within the family. The majority of individuals have a suffix of ".000", but some have ".001", ".002", etc.
As a result of some inefficiencies, these numbers are stored as floats. I'm trying to import them as strings so that I can use them in a join to another data frame that is formatted correctly.
Those IDs that end in .000 are imported as "0000", rather than "0000.000". All others are imported correctly.
I'm trying to iterate through the IDs and append ".000" to those that are missing the suffix.
If I were using R, I could do it like this.
df %>% mutate(StudyID = ifelse(length(StudyID)<5,
paste(StudyID,".000",sep=""),
StudyID)
I've found a Python solution (below), but it's pretty janky.
row = 0
for i in df["StudyID"]:
if len(i)<5:
df.iloc[row,3] = i + ".000"
else: df.iloc[row,3] = i
index += 1
I think it'd be ideal to do it as a list comprehension, but I haven't been able to find a solution that lets me iterate through the column, changing a single value at a time.
For example, this solution iterates and checks the logic properly, but it replaces every single value that evaluates True during each iteration. I only want the value currently being evaluated to change.
[i + ".000" if len(i)<5 else i for i in df["StudyID"]]
Is this possible?
As you said, your code is doing the trick. One other way of doing what you want that i could think of is the following :
# Start by creating a mask that gives you the index you want to change
mask = [len(i)<5 for i in df.StudyID]
# Change the value of the dataframe on the mask
df.StudyID.iloc[mask] += ".000"
I think by length(StudyID), you meant nchar(StudyID), as #akrun pointed out.
You can do it in the dplyr way in python using datar:
>>> from datar.all import f, tibble, mutate, nchar, if_else, paste
>>>
>>> df = tibble(
... StudyID = ["0000", "0001", "0000.000", "0001.001"]
... )
>>> df
StudyID
<object>
0 0000
1 0001
2 0000.000
3 0001.001
>>>
>>> df >> mutate(StudyID=if_else(
... nchar(f.StudyID) < 5,
... paste(f.StudyID, ".000", sep=""),
... f.StudyID
... ))
StudyID
<object>
0 0000.000
1 0001.000
2 0000.000
3 0001.001
Disclaimer: I am the author of the datar package.
Ultimately, I needed to do this for a few different dataframes so I ended up defining a function to solve the problem so that I could apply it to each one.
I think the list comprehension idea was going to become too complex and potentially too difficult to understand when reviewing so I stuck with a plain old for-loop.
def create_multi_index(data, col_to_split, sep = "."):
"""
This function loops through the original ID column and splits it into
multiple parts (multi-IDs) on the defined separator.
By default, the function assumes the unique ID is formatted like a decimal number
The new multi-IDs are appended into a new list.
If the original ID was formatted like an integer, rather than a decimal
the function assumes the latter half of the ID to be ".000"
"""
# Take a copy of the dataframe to modify
new_df = data
# generate two new lists to store the new multi-index
Family_ID = []
Family_Index = []
# iterate through the IDs, split and allocate the pieces to the appropriate list
for i in new_df[col_to_split]:
i = i.split(sep)
Family_ID.append(i[0])
if len(i)==1:
Family_Index.append("000")
else:
Family_Index.append(i[1])
# Modify and return the dataframe including the new multi-index
return new_df.assign(Family_ID = Family_ID,
Family_Index = Family_Index)
This returns a duplicate dataframe with a new column for each part of the multi-id.
When joining dataframes with this form of ID, as long as both dataframes have the multi index in the same format, these columns can be used with pd.merge as follows:
pd.merge(df1, df2, how= "inner", on = ["Family_ID","Family_Index"])

Skipping last column in r with read.csv

I was on that post read.csv and skip last column in R but did not find my answer, and try to check directly in Answer ... but that's not the right way (thanks mjuarez for taking the time to get me back on track.
The original question was:
I have read several other posts about how to import csv files with
read.csv but skipping specific columns. However, all the examples I
have found had very few columns, and so it was easy to do something
like:
columnHeaders <- c("column1", "column2", "column_to_skip")
columnClasses <- c("numeric", "numeric", "NULL")
data <- read.csv(fileCSV, header = FALSE, sep = ",", col.names =
columnHeaders, colClasses = columnClasses)
All answer were good, but does not work for what I entended to do. So I asked my self and other:
And in one function, does data <- read_csv(fileCSV)[,(ncol(data)-1)]
could work?
I've tried in one line of R to get on data, all 5 of first 6 columns, so not the last one. To do so, I would like to use "-" in the number of column, do you think it's possible? How can I do that?
Thanks!
In base r it has to be 2 steps operation. Example:
> data <- read.csv("test12.csv")
> data
# 3 columns are returned
a b c
1 1/02/2015 1 3
2 2/03/2015 2 4
# last column is excluded
> data[,-ncol(data)]
a b
1 1/02/2015 1
2 2/03/2015 2
one cannot write data <- read.csv("test12.csv")[,-ncol(data)] in base r.
But if you know max number of columns in your csv (say 3 in my case) then one can write:
df <- read.csv("test12.csv")[,-3]
df
a b
1 1/02/2015 1
2 2/03/2015 2
The right hand side of an assignment is processed first so this line from the question:
data <- read.csv(fileCSV)[,(ncol(data)-1)]
is trying to use data before it is defined. Also note what the above is saying is to take only the 2nd last field. To get all but the last field:
data <- read.csv(fileCSV)
data <- data[-ncol(data)]
If you know the name of the last field, say it is lastField, then this works and unlike the code above does not read the whole file and then remove the last field but rather only reads in fields other than the last. Also it is only one line of code.
read.csv(fileCSV, colClasses = c(lastField = "NULL"))
If you don't know the name of the last field but you do know how many fields there are, say n, then either of these would work:
read.csv(fileCSV)[-n]
read.csv(fileCSV, colClasses = replace(rep(NA, n), n, "NULL"))
Another way to do it without first reading in the last field is to first read in the header and first line to calculate the number of fields (assuming that all records have the same number) and then re-read the file using that.
n <- ncol(read.csv(fileCSV, nrows = 1))
making use of one of the prior two statements involving n.
It's not possible in one line as the data variable is not yet initialized when you call it. So the command ncol(data) will trigger an error.
You would need to use two lines of code to first load your data into the data variable and then remove the last column by either using data[,-ncol(data)] or data[,1:(ncol(data)-1)].
Not a single function, but at least a single line, using dplyr (disclaimer: I never use dplyr or magrittr, so a more optimized solution must exist using these libraries)
library(dplyr)
dat = read.table(fileCSV) %>% select(., which(names(.) != names(.)[ncol(.)]))

Using Loop variable to access and write specific data.frames

I wrote a script, that reads CSV-Data with help of user input. For example when the user enters "20 40 160" the CSV files 1, 2 and 3 are read and saved as the data.frames d20, d40 and d160 in my global enviroment/workspace. The variable vel has the values for the user input.
Now for the actual question:
Im trying to manipulate the read data in a loop with the vel variable. For example:
for (i in vel)
{
newVariable"i" <- d"i"[6]
}
I know thats not the correct syntax for the programming, but what im trying to do ist to write a newVariable with a specific row from a specific data frame d.
The result should be:
newVariable20 = d20[20]
newVariable40 = d40[20]
newVariable160 = d160[20]
So I think the actual question is, how do I use the Loop Variable for calling out the names of the created data frames and for writing new variables.
There are a couple of ways to do this. One is to store all of your dataframes in a list originally. There are a couple ways to do this. Start with an empty list and then put each df into the next position in the list. Note that you have to use list(df) because a dataframe is actually already a list and gets messed up if you don't do this.
list_of_df <- list();
list_of_df[1] <- list(df1);
list_of_df["df20"] <- list(df2)
This makes it easy to loop through the dataframes. If you want column 4 of dataframe 2 you just put in
list_of_df[[2]][,4]
# Same thing different code
list_of_df[["df20"]][,4]
The double brackets [[2]] give you the value that is stored in the list at position 2 (instead of [2] which gives you a list containing the value and metadata). The next [,4] says that from the dataframe we just got the value of, we now want to get every row of the 4th column. Note that this will output a vector and not a dataframe.
Or in a loop:
for(df in list_of_df) {
print(df)
}

printing warnings by R

I have a code on a file that works on a matrix and I read it by using
source("filecode.r")
As the matrix that the code works with must have some specific characteristics, I would like to print a message to remember the user that the input matrix must be formatted with those characteristics.
The code is this:
n<- nrow(aa)
d_ply(aa, 1, function(row){
cu<- dist(as.numeric(row[-1]))
cucu<- as.matrix(cu)
saveRDS(cucu, file = paste0(row$ID, ".rds"))
}, .progress='text', .print = TRUE)
Ideally I would like to add a warning message appearing before the code starts running...like this:
Warning(“1) did you write ‘ID’ in position [1,1] of the input matrix?;
2) is your matrix saved as a .txt?
3) ensure that the matrix file does not have empty rows at the end”)
and receiving also a question like "do you want to go on then?".
Thank you in advance for all suggestions!
Gab
Put that at the beginning of your file:
check <- readline(prompt="Warning!\n(1) did you write 'ID' in position [1,1] of the input matrix? \n(2) is your matrix saved as a .txt?\n(3) ensure that the matrix file does not have empty rows at the end\n\n Do you wish to continue? (y/n)")
if(check == "n") stop("Aborted.")
print(check) #Here would follow your code instead
If you type "y" the following code will be evaluated. If you type "n", the script stops and prints the message inside stop().
You could also make sure that only 'y' and 'n' are accepted by putting the prompt statement inside of a while loop:
check <- NA
while(!(check %in% c('y','n'))) {
check <- readline(prompt="Warning!\n(1) did you write 'ID' in position [1,1] of the input matrix? \n(2) is your matrix saved as a .txt?\n(3) ensure that the matrix file does not have empty rows at the end\n\n Do you wish to continue? (y/n)")
}
if(check == "n") stop("Aborted.")

Reading variable name of dataframe into another dataframe using loops

So using a for loop I was able to break my 1.1 million row dataset in r into 110 tables of approximately 10,000 rows each in hopes of getting r to handle the data better. I now want to run another for loop that assigns the values in each of these tables to a different dataframe name.
My table names are:
Pom_1
Pom_2
Pom_3
...
Pom_110
What I want to do is create a for loop like the following:
for (i in 1:110)
{
Pom <- read.table(paste("Pom",i,sep = "_"))
for (j in 1:nrows(Pom))
{do something}
}
So I want to loop through the array and assign the values of each Pom table to "Pom" so that I can then run a for loop on each subsection of Pom. This problem is the read.table function does not seem to be the right one. Any ideas?
Can you give a more specific example of what you want to do withing each dataframe? You should avoid using the inner loop when possible and if you really need to have a look at ?apply
nrow instead of nrows
This is a generic solution using a example data.frame. The function you're looking for is assign, check it's help page:
Pom = data.frame(x = rnorm(30)) #original data.frame
n.tables = 3 # number of new data.frames you want to creat
Pom.names = paste("Pom",1:3,sep="") # name of all new data.frames
breaks = nrow(Pom)/n.tables * 0:n.tables # breaks of the original data.frame
for (i in 1:n.tables) {
rows = (breaks[i]+1):breaks[i+1] # which rows from Pom are going to be assign to the new data.frame?
assign(Pom.names[i],Pom[rows,]) # create new data.frame
}
ls()
[1] "breaks" "i" "n.tables" "Pom" "Pom.names" "Pom1"
[7] "Pom2" "Pom3" "rows"
I'm willing to bet the problem with your table call is that you aren't specifying the file extension (assuming Pom_1 - Pom_110 are files in your working directory, which I think they are since you're using read.table).
You can fix it by the following
fileExtension<-".xls" #specify your extension, I assume xls
for (i in 1:110)
{
tablename<-paste("Pom",i,sep = "_")
Pom <- read.table(paste(tablename, fileExtension, sep=""))
for (j in 1:nrows(Pom))
{do something}
}
Of course that's assuming a couple things about how everything in your problem is set up, but it's my best guess based on your description and code

Resources