I chunked several novels into a data frame called documents. I want to export each chunk as a separate .txt file.
The data frame that consists of two columns. The first column has the file names for each chunk, and the second column has the actual text that would go into the file.
documents[1,1]
[1] "Beloved.txt_1"
documents[1,2]
[1] "124 was spiteful full of a baby's venom the women......"
class(documents)
[1] "data.frame"
I'm trying to write a for loop that would take each row, make the second column into a .txt file, and make the first column the name of the file. And then to iterate for each row. I've been working with something like this:
for (i in 1:ncol(documents)) {
write(tagged_text, paste("data/taggedCorpus/",
documents[i], ".txt", sep=""))
I've also been reading that maybe the cat function would work well here?
I'm not positive this will work for you (a little more of an example of your input and desired output would help), but one issue you've got is that your for loop is by column rather than by row. If you want to do this once for every row, then it needs to be for (i in 1:nrow(documents) rather than ncol.
Assuming that "documents" is the name of your data.frame and that the column containing the text you want to save is called "tagged_text" and the column with the file name is called "file", try this:
for (i in 1:nrow(documents)) {
write(documents$tagged_text[i], paste0("data/taggedCorpus/",
documents$file[i], ".txt"))
}
Note that you don't need to specify the path every time if you already set it before you start the loop.
Related
If I have a file like this
1:10177[b38]A,AC
1:10235[b38]T,TA
1:10352[b38]T,TA
1:10505[b38]A,T
1:10506[b38]C,G
and another file like this
6:26762428[b38]C,
6:27333733[b38]T,
14:105766248[b38]A,
6:32465058[b38]C,
6:32465058[b38]C,
and I would like to filter the first file such that its beginning matches the second file. So for example if my first file looks like
1:10177[b38]A,AC
1:10235[b38]T,TA
1:10352[b38]T,TA
and my second file look like
1:10177[b38]A,
1:10234[b38]G,
then it should return 1:10177[b38]A,AC because the first row from the first file matches the first row from the second row before comma.
I also tried doing this in R to obtain the boolean index after trimming the first so that only the text before the first comma is kept but it didn't work
second_file <- paste(second_file,collapse="|")
subset_of_first_file <- pbmclapply(first_file, function(x) grepl(x, second_file))
because it returns all false in a small example where there is a match.
I have got this .txt file outputed by a microscope to process.
#read the .txt file generated by microscope, skipping the first 9 lines of garbage information
df <- read.csv("Objects_Population - AllCells.txt", sep="\t", skip = 9,header=TRUE, fill = T)
Then I started looking at the structure of the dataframe, everything seems fine except I now found an extra column in the end of the data frame named "x.1" and all rows of it are NA values. I don't see this column when I open the .txt file in excel. I suspect the problem has something to do with the column names generated by microscope, they contain quite some special characters
Below is the dataframe read by Excel(only showing the last 2 columns since I have 132 columns, and their names are disgustingly long):
AllCells - Cell Contact Area with Neighbors [%] AllCells - Nucleus Nearest Neighbor Distance [µm]
0 4.82083
21.9512 0
15.7895 0
29.4118 0.584611
0 4.21569
0 1.99599
0 3.50767
...
This has happened to me before but I never took it too serious as I was always interested in a subset of my data frame. Now I'm looking at all columns then this starts to bothering me.
Is there any way I can read them correctly without R attaching that additional "X.1" column in the end? Preferably not manually delete or subset out the last column...
Cheers,
ML
If all other column names are correct, you have probably a trailing \t in the text file. R tries to include it and gives it the generic column name X.1.
You could try and read the file first as 'plain text' and remove the trailing \t and only then use read.csv:
file_connection <- file("Objects_Population - AllCells.txt")
content <- readLines(file_connection )
close(file_connection)
Now we try to get rid of these trailing \t (this might need some testing to fit your needs)
sanitized <- gsub("\\t$", "", content)
And then we read this sanitized string as if it was a file (using the argument text)
df <- read.csv(text=paste0(sanitized, collapse="\n"), sep="\t", skip = 9,header=TRUE, fill = T)
Had that problem too. Fixed it by saving the file as "CSV (MS-DOS (*csv)" instead of what I originally had as "CSV (Comma delimited)(*csv)".
This is almost certainly because you've got an extra empty column in your spreadsheet.
In Excel, open your sheet and press Ctrl-End. If you end up in an empty cell outside the range of your data, there's the problem. Select the column (Ctrl-Space), right-click, and choose Delete.
I also encountered similar problem. I found that three extra columns were created (X, X.1, X.2), after I loaded dataset from excel sheet to R studio.
Steps Followed by me:
a) I went to the excel sheet and selected those three extra columns after last column with actual values in excel sheet. Selected extra 3 columns by keeping cursor on top of columns and then right click the mouse and select delete.
b) Again loaded that excel sheet in R. I did not find those 3 columns.
I have a folder with tons of txt files from where I have to extract especific data. The problem is that the format of the file has changed once and the position of the data I need to extract has also changed. So I need to deal with files in different format.
To try to make it more clear, in column 4 I have the name of the variable and in 5 I have the value, but sometimes this is in a different row. Is there a way to find the name of the variable (in which row) and then extract its value?
Thanks in advance
EDITING
In some files I will have the data like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Current--------28
But in some point in life, there was a change in the software to add another variable and the new file iis like this:
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
So I need to deal with these 2 types of data, extracting the same variables which are in different rows.
If these files can't be read with read.table use readLines and then find those lines that start with the keyword you need.
For example:
Sample file 1 (with the dashes included and extra line breaks):
Column 1-------Column 2.
Device ID------A.
Voltage------- 500.
Error------------5.
Current--------28
Sample file2 (with a comma as separator):
Column 1,Column 2.
Device ID,A.
Current,555
Voltage, 500.
Error,5.
For both cases do:
text = readLines(con = file("your filename here"))
curr = text[grepl("^Current", text, ignore.case = T)]
Which returns:
for file 1:
[1] "Current--------28"
for file 2:
[1] "Current,555"
Then use gsub to remove anything that is not a number.
I have 10 data files in my current directory such as data-01, data-02, data-03, data-04 till data-10.
Each of these data files has a few hundred rows with 4 fields. I would like to add a new column name "ID" and keep its ID like 01 (for data file data-01) for all the rows in that file.
A base R solution using a loop would go like this:
df<- c()
for (x in list.files(pattern="*.csv")) {
u<-read.table(x)
u$Label = factor(x)
df <- rbind(df, u)
cat(x, "\n ")
}
This depends on your data files having the same number of columns (though you get get around that inside the loop by selecting which columns you need before rbind) and then you can set whichever filetype you are looking at. The cat is useful because you can better trace read problems (because there are always problems). I bet there is a better way to do this with apply as well.
I have a csv file and am unsure how to get R to interpret it as a table because all the title info is in one cell and all the data relating to the titles is in a separate cell. So all the info I need is in 2 cells but it actually needs to be split up.
The cell A3 has a value called 'Team' , this corresponds to the part in the cell A4 that says 'Visitor'. Then each part after than corresponds to the bit below it. ..sorry I don't know how to describe it, but ultimately it would look like this …
Looks like the field separator in your data is a ;
read.csv has a parameter sep to change the field separator and another parameter header to tell it there is an initial line containing the column names. Use read.csv like this:
data = read.csv(file="/mydir/myfile.csv", sep=";", header=T)
To test you can print out the first 5 lines of the data table with:
head(data,5)