I have a .txt file that contains multiple newspaper articles. Each article has a headline, the author's name, etc. I want to read the whole .txt file into R and remove every line that starts with certain words, together with the 5 lines that follow it. I think gsub plus a regular expression might be the solution, but I do not know how to write it so that not only the matching line is deleted but also the next 5 lines.
Edit:
The .txt file consists of 200 Washington Post articles. Each article ends with:
lydia.depillis#washpost.com
LOAD-DATE: July 14, 2013
LANGUAGE: ENGLISH
PUBLICATION-TYPE: Web Publication
Copyright 2013 Washingtonpost.Newsweek Interactive Company, LLC d/b/a Washington
Post Digital
All Rights Reserved
4 of 200 DOCUMENTS
Washington Post Blogs
In the Loop
June 28, 2013 Friday 3:08 PM EST
Whenever an e-mail address appears, I want to delete everything up to the line where a date appears, so that there is a smooth transition to the next article. I want to run a sentiment analysis and thus don't need these lines.
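Outside R, the filtering described above can be sketched with awk. This is only an illustration under stated assumptions: the sample file and its name (articles.txt) are made up, the e-mail line is assumed to hold nothing but the address (with # or @ between user and host, as in the excerpt), and the date line is assumed to start with a full English month name. The block from the e-mail line through the date line is dropped, date line included. The same two patterns could be applied to the lines returned by readLines() in R.

```shell
#!/bin/sh
# Hypothetical sample file standing in for the real export.
tmpdir=$(mktemp -d)
cat > "$tmpdir/articles.txt" <<'EOF'
Last sentence of article three.
lydia.depillis#washpost.com
LOAD-DATE: July 14, 2013
LANGUAGE: ENGLISH
4 of 200 DOCUMENTS
Washington Post Blogs
June 28, 2013 Friday 3:08 PM EST
First sentence of article four.
EOF

awk '
  # A line that is only an address (user#host or user@host) starts a skipped block.
  !skip && /^[^ ]+[#@][^ ]+\.[a-z]+$/ { skip = 1; next }
  # A line that begins with "Month day, year" ends the block and is dropped as well.
  skip && /^(January|February|March|April|May|June|July|August|September|October|November|December) [0-9]+, [0-9][0-9][0-9][0-9]/ { skip = 0; next }
  # Everything outside a skipped block is kept.
  !skip { print }
' "$tmpdir/articles.txt" > "$tmpdir/cleaned.txt"

cat "$tmpdir/cleaned.txt"
```

Because the date pattern is anchored at the start of the line, "LOAD-DATE: July 14, 2013" does not end the block early. To keep the date line, replace the second "next" with "print; next".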
I want to read a .txt file into R that is written by this line:
cat(sprintf("%s\n",paste0(matematykData)),file=nazwapliku2,append = TRUE)
The line is inside a for loop, so the file is written line by line. The variable matematykData is a one-dimensional table that holds a single record, which is replaced by another record on the next iteration of the loop. It looks like this:
[1] "1884"
The reading method I use in another R script is:
dane2=read.table(file=nazwapliku2,sep="\n",skipNul= FALSE)
From this I get a string without any rows or columns that looks like this:
2962 1847
2963 1866
2964 1906
2965 429
2966 450
2967 450
2968 1910
2969 1900
2970 1889
Here the first "column" is the line number. I want to convert that string to a table so I can reference every row of it simply by using dane2[i], where "i" is the number of the row I'm looking for. I'm not sure whether I should change the way it is saved or read, or just read it and then convert it.
I also have another variable that needs to be converted, and it is more complicated because it contains 3 records per row: full_name, date and place of birth, date and place of death. The method I use for saving it is the same:
cat(sprintf("%s\n%s\n%s\n",paste0(matematyk[1]),paste0(matematyk[2]),paste0(matematyk[3])),file=nazwapliku1,append = TRUE)
For the first case, read.table() returns a data frame, so use dane2[i, 1] to get the value in row i (dane2[i] would select column i instead).
You have to distinguish between records and variables.
In the second case, as I understand it, you have several records, each with the 5 variables
full_name, date of birth, place of birth, date of death, place of death
In that case you need to change the way you save your data. Between the variables, use a separator such as \t, and use \n only at the end of the format string in your sprintf() call to separate the individual records.
I have a data.frame that looks like this:
a=data.frame(c("MARCH3","SEPT9","XYZ","ABC","NNN"),c(1,2,3,4,5))
> a
c..MARCH3....SEPT9....XYZ....ABC....NNN.. c.1..2..3..4..5.
1 MARCH3 1
2 SEPT9 2
3 XYZ 3
4 ABC 4
5 NNN 5
Write into csv: write.csv(a,"test.csv")
I want everything to stay the way it is, but when the file is opened in Excel, MARCH3 and SEPT9 become 3-Mar and 9-Sep. I have tried everything in Excel: formatting as date, text, custom... none of it works; 3-Mar is converted to 42066 and 9-Sep to 42256. In reality, a is a fairly large table, so this can't be fixed manually. Is there a way to coerce a[,1] so that Excel leaves its values alone?
The best way to prevent Excel from autoformatting would probably be to store the data as an Excel file:
library(xlsx)
write.xlsx(a, "test.xlsx")
Your best bet is probably to change the file extension (e.g. make it ".txt" or ".dat" or something like that). When you open such a file in Excel the text import wizard will open. Specify that the file is delimited with commas, then make sure to change the appropriate column from "General" to "Text".
As an example: looking at the data in the question it appears that your CSV file might look like
,,,,MARCH3,,,,1
,,,,SEPT9,,,,2
,,,,XYZ,,,,3
,,,,ABC,,,,4
,,,,NNN,,,,5
If I save this file with a ".csv" extension and open it in Excel I get:
3-Mar 1
9-Sep 2
XYZ 3
ABC 4
NNN 5
with the date values changed as you noted. When I change the file extension to ".dat", making no other changes to the file, and open it in Excel I'm presented with the Text Import Wizard. I tell Excel that the file is "Delimited", choose "Comma" as the delimiter, and in the column with the "MARCH3" and "SEPT9" values I change the Column Data Type to "Text" (instead of "General"). After I clicked the Finish button on the wizard I got the following data in the spreadsheet:
MARCH3 1
SEPT9 2
XYZ 3
ABC 4
NNN 5
I tried putting the MARCH3 and SEPT9 values in double-quotes to see if that would convince Excel to treat these values as text but Excel still converted these cells to dates.
Share and enjoy.
My solution was to append a semicolon to all the gene names. The added character convinces Excel that this column is text, not a date. You can find and replace the semicolon later if you want, but most programs (Perseus, for example) will let you ignore everything after the semicolon, so it's not always a problem...
df$Gene.name <- paste(df$Gene.name, ";", sep="")
I would be interested if anyone has a trick for doing this to just the Sept and March gene names, though...
I need to make 58 replacements in an HTML file.
A for loop runs 29 times.
It contains the sed command below. Every iteration of the loop replaces 2 of the 58 placeholders.
sed "s?$Plc_hldr1?$DateTime?;s?$Plc_hldr2?$Total?" html_format.htm >> html_final.htm
This command reads the original file on every iteration and appends the result to html_final.htm.
Thus html_final.htm ends up with 29 copies of html_format.htm.
I need only 1 copy of html_format.htm, with all 58 placeholder values replaced.
Below is a small example of the table:
01/02/2014 15
%%DDMS2RT%% %%DDMS2C%%
%%DDMS3RT%% %%DDMS3C%%
%%DDMS4RT%% %%DDMS4C%%
%%DDMS5RT%% %%DDMS5C%%
%%DDMS6RT%% %%DDMS6C%%
%%DDMS7RT%% %%DDMS7C%%
After the 2nd iteration of the for loop, html_final.htm contains:
01/02/2014 15
%%DDMS2RT%% %%DDMS2C%%
%%DDMS3RT%% %%DDMS3C%%
%%DDMS4RT%% %%DDMS4C%%
%%DDMS5RT%% %%DDMS5C%%
%%DDMS6RT%% %%DDMS6C%%
%%DDMS7RT%% %%DDMS7C%%
%%DDMS1RT%% %%DDMS1C%%
01/02/2014 817
%%DDMS3RT%% %%DDMS3C%%
%%DDMS4RT%% %%DDMS4C%%
%%DDMS5RT%% %%DDMS5C%%
%%DDMS6RT%% %%DDMS6C%%
%%DDMS7RT%% %%DDMS7C%%
Note that the same table is appended again after the 2nd iteration: the placeholders in the 2nd row are replaced with values, but the values in the 1st row are back to placeholders.
What I would like is the output below, i.e. one single table instead of multiple copies, with all the placeholders replaced within that table:
01/02/2014 15
01/02/2014 817
01/02/2014 512
01/02/2014 765
%%DDMS5RT%% %%DDMS5C%%
%%DDMS6RT%% %%DDMS6C%%
%%DDMS7RT%% %%DDMS7C%%
I tried to use sed -i, but it is not available on AIX.
I really hope I have explained it clearly and that my question is not an XY problem anymore!
The quickest solution would surely be to use > rather than >>.
>> appends standard output to the end of the specified file.
> writes standard output to the specified file, replacing the old file if it already exists.
i.e.
sed "s?$Plc_hldr1?$DateTime?;s?$Plc_hldr2?$Total?" html_format.htm > html_final.htm
cp html_final.htm html_format.htm
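Note that the cp above overwrites the template on every pass. A variant of the same idea that leaves html_format.htm untouched can be sketched as follows: copy the template once, then on each iteration let sed write with > to a temporary file and move that file over html_final.htm. The three-row template, the loop values and the temporary file name html_final.tmp are made up for illustration; only the sed command and the variable names come from the question.

```shell
#!/bin/sh
# Stand-in template with three rows of placeholders (hypothetical sample data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > html_format.htm <<'EOF'
%%DDMS1RT%% %%DDMS1C%%
%%DDMS2RT%% %%DDMS2C%%
%%DDMS3RT%% %%DDMS3C%%
EOF

# Start from one copy of the template; the template itself is never modified.
cp html_format.htm html_final.htm

i=1
for Total in 15 817; do            # made-up totals, one per iteration
    Plc_hldr1="%%DDMS${i}RT%%"
    Plc_hldr2="%%DDMS${i}C%%"
    DateTime="01/02/2014"
    # Read the current result, write to a temporary file with >,
    # then replace the result file only if sed succeeded.
    sed "s?$Plc_hldr1?$DateTime?;s?$Plc_hldr2?$Total?" html_final.htm > html_final.tmp &&
        mv html_final.tmp html_final.htm
    i=$((i + 1))
done

cat html_final.htm
```

Writing to a temporary file first matters because redirecting sed's output straight onto the file it is reading would truncate that file before sed reads it.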
If we imagine that your real problem statement is "how do I piece together an HTML table from fragments I retrieve e.g. from a database", the answer might look something like this.
#!/bin/sh
# Output HTML header
cat <<'____HERE'
<html><head><title>Table</title></head>
<body><table>
<tr><th>Date</th><th>Result</th></tr>
____HERE
# Obtain results from database
sql 'select date, result from table;' |
# Read each record, format an HTML table fragment
while read date result; do
cat <<________HERE
<tr><td>$date</td><td>$result</td></tr>
________HERE
done
# Output HTML footer
cat <<____HERE
</table></body></html>
____HERE
The script prints an HTML page on its standard output. Redirect to a file if you want it in a file.
(Sorry if my HTML skills are rusty. It's been a while...)
I am generating a Word report using ASP.NET.
I am exporting data from the database and substituting it for the placeholders in the Word document.
When I retrieve the data from the database, the string contains carriage returns that show up as a "block character" (□), which I want to eliminate. I have tried replacing chr(11) and chr(13) but have not had any success.
For example, I have the following text:
abc
xyz
Then the current output I am getting is as follows:
abc
□ xyz
instead of
abc
xyz
in the Word report.
Can you please provide me with a solution for the above issue?
A line break can also be a carriage return 0x0D (13) followed by a line feed 0x0A (10).
So also replace chr(10), in addition to your current replacements of chr(11) and chr(13).
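To see which control characters a string really contains before deciding what to replace, a byte dump helps. The sketch below is a shell illustration, not ASP.NET code, and the sample string is made up: it shows a CR LF line break with od -c and then removes only the line feed (10) with tr, leaving the carriage return (13) in place.

```shell
#!/bin/sh
# Made-up sample: "abc", a Windows line break (CR LF), then "xyz".
raw=$(printf 'abc\r\nxyz')

# Dump every byte; the line break appears as \r followed by \n.
printf '%s' "$raw" | od -c

# Remove only the line feed (10) and keep the carriage return (13).
cleaned=$(printf '%s' "$raw" | tr -d '\n')
printf '%s' "$cleaned" | od -c
```

Which of the two bytes Word renders as a box is something to confirm against the real data; the dump makes that check straightforward.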