Separating one file into two using a timestamp - r

Having a txt file in the following format:
2011-01-01 00:00:00 text text text text
2011-01-01 00:01:00 text text text text
text
2011-01-01 00:02:00 text text text text
....
....
....
....
2011-01-02 00:00:00 text text text text
2011-01-02 00:01:00 text text text text
The file contains data for two calendar days.
Is it possible to separate the file into two different files, one for each day?

Read the data in with read.table(); you should then have a data.frame similar to:
df <- data.frame(d = c("2011-01-01 00:00:00", "2011-01-01 00:01:00"), x = 0:1)
Apply split() on the date part of the timestamp (splitting on the full d column would create one group per unique timestamp, not per day):
dfl <- split(df, as.Date(df$d))
Then Map write.table over the pieces:
Map(write.table, dfl, file = paste(names(dfl), "txt", sep = "."), row.names = FALSE, sep = ";")

dat <- readLines(textConnection(" 2011-01-01 00:00:00 text text text text
2011-01-01 00:01:00 text text text text text
2011-01-01 00:02:00 text text text text
2011-01-02 00:00:00 text text text text
2011-01-02 00:01:00 text text text text"))
grouped.lines <- split(dat, substr(dat, 1,11) )
grouped.lines
$` 2011-01-01`
[1] " 2011-01-01 00:00:00 text text text text"
[2] " 2011-01-01 00:01:00 text text text text text"
[3] " 2011-01-01 00:02:00 text text text text"
$` 2011-01-02`
[1] " 2011-01-02 00:00:00 text text text text"
[2] " 2011-01-02 00:01:00 text text text text"
It's more efficient to process these as separate items in a single list. It will create problems if you split them into separate objects. They can be accessed by text names or by numeric reference. (But do note that the leading space would need to be in the name if a leading space was in your text file.)
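The grouped list can then be written back out, one file per day. A minimal sketch following the same readLines()/split() idea as above; trimws() strips the leading space noted earlier so it does not end up in the file name:

```r
# sample lines with a leading space, as in the question
dat <- readLines(textConnection(" 2011-01-01 00:00:00 text text
 2011-01-01 00:01:00 text text
 2011-01-02 00:00:00 text text"))

# group lines by the date prefix (positions 1-11 include the leading space)
grouped.lines <- split(dat, substr(dat, 1, 11))

# write one file per day; trimws() removes the leading space from the name
for (day in names(grouped.lines)) {
  writeLines(grouped.lines[[day]], paste0(trimws(day), ".txt"))
}
```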

You will have to read all the lines of the file; readLines() or read.table() can do that (the reshape package is not needed for this).
Then compare the date at the start of each line and write the lines back out into two new files.

Related

How to substitute section of character values (i.e., string) within a large data frame in R?

I am cleaning and processing text data about school districts. The data frame is organized as follows:
County  District  ID    Text
1       1         01    "Text for District 1 "
1       2         02    "Text for District 2"
1       3         03    "so on"
1       4         04    "and "
2       5         05    "so"
2       6         06    "forth"
...     ...       ...   ...
58      850       0850  "Text for District 58"
The values of text are of concern here and are long sections of text in character form. Contained in some values of the text variable are string sections such as the following: "Funds are being spent.We know". The absence of a space following the period is my purpose for posting. I would like to substitute all instances of a period with a space such that the previous string is converted to "Funds are being spent We know". I have tried the following to substitute the period for an empty space (" "):
Dataframe <- Dataframe %>% str_replace_all(Text, ".", " ")
if (str_detect(Dataframe$Text, ".") {Dataframe$Text <- str_replace_all(Text, ".", " ")}
Dataframe <- Dataframe %>% gsub(".", " ", Text)
All my attempts have left me unsuccessful and frustrated.
Some other issues I have encountered which I will address in other posts (why not mention it here in case someone knows a resolution) are instances in which the Text variable has values which contain a section of text which has been erroneously concatenated such as the following: "Wearefocusedonsolving". I am looking for a way to insert a space in between each word - "We are focused on solving". Another issue is spelling mistakes (not from processing but mistakes in the text data itself) and I am looking for a way to resolve them - e.g., "focuseb" instead of "focused".
You have to escape the period, because in a regular expression the period is a special character that matches any single character.
In R we use \\ to escape a special character:
text <- "Funds are being spent.We know"
library(stringr)
str_replace_all(text, "\\.", " ")
"Funds are being spent We know"

Add quotation marks to every row of a specific column

Having a dataframe in this format:
data.frame(id = c(4,2), text = c("my text here", "another text here"))
How is it possible to add triple quotation marks at the start and end of every value/row in the text column?
Expected printed output:
id text
4 """my text here"""
2 """another text here"""
Without paste() or cat() you can simply run:
data.frame(id = c(4,2), text = c('"""my text here"""', '"""another text here"""'))
id text
1 4 """my text here"""
2 2 """another text here"""

Delimiting Text from Democratic Debate

I am trying to delimit the following data by first name, time stamp, and then the text. Currently, the entire data set is listed in a single data frame column called text. Here is how it looks:
text
First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text
This is what I did so far:
text$specificname = str_split_fixed(text$text, ":", 2)
and it created the following
text specific name
First Name: 00:03 Welcome Back text text text First Name
First Name 2: 00:54 Text Text Text First Name2
First Name 3: 01:24 Text Text Text First Name 3
How do I do the same for the timestamp and text? Is this the best way of doing it?
EDIT 1: This is how I brought in my data
#Specifying the url for desired website to be scraped
url = 'https://www.rev.com/blog/transcript-of-july-democratic-debate-night-1-full-transcript-july-30-2019'
#Reading the HTML code from the website
wp = read_html(url)
#assignging the class to an object
alltext = html_nodes(wp, 'p')
#turn data into text, then dataframe
alltext = html_text(alltext)
text = data.frame(alltext)
Assuming that text is in the form shown in the Note at the end, i.e. a character vector with one component per line, we can surround the timestamp with a separator character and then use read.table:
read.table(text = sub("\\s*(\\d\\d:\\d\\d)\\s*", "#\\1#", text), sep = "#", as.is = TRUE)
giving this data.frame:
V1 V2 V3
1 First Name: 00:03 Welcome Back text text text
2 First Name 2: 00:54 Text Text Text
3 First Name 3: 01:24 Text Text Text
Note
Lines <- "First Name: 00:03 Welcome Back text text text
First Name 2: 00:54 Text Text Text
First Name 3: 01:24 Text Text Text"
text <- readLines(textConnection(Lines))
Update
Regarding the EDIT that was added to the question: define a regular expression pat that matches optional whitespace, 2 digits, a colon, 2 digits, and optional trailing whitespace. Then grep out all lines that match it, giving tt. In each such line, replace the match with the timestamp surrounded by # characters (keeping the captured digits, dropping the whitespace), giving g. Finally, read it in using # as the field separator, giving DF.
pat <- "\\s*(\\d\\d:\\d\\d)\\s*"
tt <- grep(pat, text$alltext, value = TRUE)
g <- sub(pat, "#\\1#", tt)
DF <- read.table(text = g, sep = "#", quote = "", as.is = TRUE)
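Alternatively, base R's strcapture() can pull out all three pieces in one step. A sketch assuming lines shaped like those in the question; the column names speaker, time, and text are made up for illustration:

```r
text <- c("First Name: 00:03 Welcome Back text text text",
          "First Name 2: 00:54 Text Text Text")

# capture the speaker (everything before the colon that precedes the
# timestamp), the mm:ss timestamp, and the remaining text
DF <- strcapture("^(.*):\\s*(\\d\\d:\\d\\d)\\s*(.*)$", text,
                 proto = data.frame(speaker = character(),
                                    time = character(),
                                    text = character()))
```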

How to read a file with more than one tab as separator and where the space is part of column value

I have to read a file with a tab ("\t") separator, and the separator can occur multiple times in a row. The read.table function has the special whitespace separator (sep = "") that treats any run of whitespace (tab or space) as a single separator. The problem is that the space character is part of some column values, so I cannot use the whitespace separator, and when I use "\t" only a single occurrence is treated as a separator.
Here is a toy example of my problem:
text1 <- "
a b c
11 12 13
21 22 23
"
ds <- read.csv(sep = "", text = text1)
Before the element [1,3], i.e. "13", there are two tabs as separator. Then I get:
a b c
1 11 12 13
2 21 22 23
This is the expected result.
Let's say we add a space within the third column's values, between the first and second digit, so they become "1 3" and "2 3". Now we cannot use the whitespace delimiter, because here the space is not a delimiter but part of the column value. When I use "\t" I get this unexpected result:
text3 <- "
a b c
11 12 1 3
21 22 2 3
"
ds <- read.csv(sep = "\t", text = text3)
The string representation of the input text is:
"a\tb\tc\n11\t12\t\t1 3\n21\t22\t2 3\n"
And now the result is misaligned: the doubled tab is read as two separators with an empty field between them, which shifts the first row's values across the columns.
It seems to be simple, but I cannot find a way to do it using the read.table interface, because the input argument sep does not accept a regular expression as delimiter.
I think I found a workaround for this: 1) first replace every run of tabs with a single tab, 2) then read the file/text. For example:
read.csv(text = gsub("[\t]+", "\t", readLines(textConnection(text3)), perl = TRUE), sep = "\t")
and also using a file instead:
temp <- tempfile()
writeLines(text3, temp)
read.csv(text = gsub("[\t]+", "\t", readLines(temp), perl = TRUE), sep = "\t")
The value passed to the text input argument will be:
> text
[1] "a\tb\tc" "11\t12\t1 3" "21\t22\t2 3" ""
and the result of read.csv will be:
a b c
1 11 12 1 3
2 21 22 2 3
This is similar to #Badger suggestion, just in one step.
Okay I think I've got something for you:
write.table( gsub("\\r","", gsub("\t+"," ", readChar( "C:/_Localdata/tab_sep.txt", file.info( "C:/_Localdata/tab_sep.txt" )$size) ) ), "C:/_Localdata/test.txt", sep=" ", quote = F, col.names = T, row.names=F)
## "\t+" matches a run of one or more tabs, so it does not matter whether the separator is 1 or 2 tabs in series -- no alternation like "\t|\t\t" is needed
read.table("C:/_Localdata/test.txt",sep=" ",skip=1,header=T)
Okay, what just happened? First we read the file in as one long character string with readChar(), using file.info()$size to tell R how many bytes to read. We then deal with the tab separators using gsub(). The resulting string contains both \r and \n characters; both are line endings and R sees both within the file, so we drop the \r's. Then we write the table back out (ideally to where it came from). Now the file can be read in with a single space as the separator, skipping the first line: that line is just an X, an artifact of writing out the result of gsub(). Also declare a header and you should be good to go!
Let's say you have 500 files.
Place all your files in a folder, and set the pattern to the file type they are, or just allow R to view them all by removing the pattern call.
for (filename in list.files("C:/_Localdata/", pattern = ".txt", full.names = TRUE)) {
write.table( gsub("\\r","", gsub("\t","", readChar( filename , file.info( filename )$size) ) ), filename , sep=" ", quote = F, col.names = T, row.names=F)
}
Now your files are ready to be read in however you would like.

Exceptions to sep = " " when reading table into R? Dealing with whitespace within fields

I need to import a table into R that is separated by spaces. Unfortunately, within some of the fields, there are spaces which cause R to separate into a new row. Is there any way of making those fields 'stick together'?
For example, the table looks like this:
V1 V2 V3 V4
Text More 0.11 (a)kdfs hdfa ag$
Text More 1.12 a
Text More 0.21 v
Text More 1222 (a)sdfs sdfa->g
Text More 1232 (a)sdfs sdfa->g
But gets turned into this when R reads it (using read.delim)
V1 V2 V3 V4
Text More 0.11 (a)kdfs
hdfa ag$
Text More 1.12 a
Text More 0.21 v
Text More 1222 (a)sdfs
sdfa->g
Text More 1232 (a)sdfs
sdfa->g
Those fields all have weird characters that aren't all shared with the other columns/rows. However, as seen, the spaces aren't flanked by the same characters.
In the original file, the rows are separated properly. Is there a way to do any of the following?
Stop separating by spaces after the fourth column is created
Have fields starting/ending with certain characters be stuck together as a string/add a non-space character where the spaces are
Generically, allow exceptions to sep
Quite new to R so sorry if this is very naive. Here is what my script looks like up to then:
strs <- readLines("file")
dat <- read.delim(text = strs,
skip = 17,
col.names = c("V1", "V2", "V3", "V4"),
sep = " ", header = F)
Is there anything I can add to either read.delim or readLines or in between those to fix this problem? As there is fluff that needs to be cut out (hence the skip) I can't use read.table (correct me if I'm wrong).
Some of the characters around the spaces are shared, so I would be willing to use a more tedious method to put other characters in place of the spaces in between e.g. 's' and 's'. Would that be possible with gsub if there isn't an easier method?
Thanks so much!
EDIT: Flash of insight, would it be possible to make the fourth column a new table (that's of course not separated by spaces), then replace all spaces in that table with something else? How would I go about 'breaking off' the fourth column/columns after the third column?
1) Try this — the loop replaces the first three runs of spaces on each line with commas, so only those become delimiters and everything after them stays in V4:
for(i in 1:3) strs <- sub(" +", ",", strs)
read.csv(text = strs)
The result of the last line is:
V1 V2 V3 V4
1 Text More 0.11 (a)kdfs hdfa ag$
2 Text More 1.12 a
3 Text More 0.21 v
4 Text More 1222.00 (a)sdfs sdfa->g
5 Text More 1232.00 (a)sdfs sdfa->g
2) Here is a second solution, doing the same in a single sub() call with three capture groups of non-space characters:
strs.comma <- sub("^(\\S+) +(\\S+) +(\\S+) +", "\\1,\\2,\\3,", strs)
read.csv(text = strs.comma)
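A self-contained run of approach (1), using made-up lines shaped like the table in the question:

```r
# header plus data lines; the fourth field contains spaces
strs <- c("V1 V2 V3 V4",
          "Text More 0.11 (a)kdfs hdfa ag$",
          "Text More 1222 (a)sdfs sdfa->g")

# turn only the first three runs of spaces on each line into commas;
# sub() replaces the first remaining match per line on each pass
for (i in 1:3) strs <- sub(" +", ",", strs)

dat <- read.csv(text = strs)
```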
