Extracting portions of a string of characters

I have a string of characters (311,522 characters long). It is stored in a .txt file, all on one line. The data file can be found here. I tried to import it into R like this:
eya4_lagan_HM_cp <- read.table("C:/Documents and Settings/SS/Desktop/Sequence Segmentation/eya4_lagan_HM_cp.txt", quote="\"")
But I get warning messages and the file is not imported.
I need to extract portions of this string of characters. That is, I need to extract from 44184 to 44216, meaning the sequence of characters from the 44184th character (inclusive) to the 44216th character (inclusive), then from 151795 to 151844, and so on.
How can I do this?

See https://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string for information on how to read the file into a single string. For example, you would use:
fileName <- "C:/Documents and Settings/SS/Desktop/Sequence Segmentation/eya4_lagan_HM_cp.txt"
theData <- readChar(fileName, file.info(fileName)$size)
Also see the readChar docs.
See substr for information on how to extract substrings.
In your case, you could use for example:
mySubstr <- substr(theData, 44184, 44216)
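Since the question mentions several ranges, here is a minimal sketch of pulling them all out in one call. It relies on base R's substring(), which is vectorized over the start and end positions. The short theData string and the starts/ends vectors are stand-ins made up for illustration; with the real data you would pass the string read by readChar and your own positions (44184, 151795, and so on).

```r
# Stand-in for the real sequence (assumption: a short made-up string).
theData <- paste(rep("ACGT", 10), collapse = "")  # 40 characters

# Hypothetical start and end positions, both inclusive.
starts <- c(1, 5, 38)
ends   <- c(4, 10, 40)

# substring() recycles the string across every start/end pair.
segments <- substring(theData, starts, ends)

# Each piece should have length end - start + 1.
nchar(segments) == ends - starts + 1
```

Note that readChar keeps any trailing newline in the file, so the last position may be a line break rather than a sequence character.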

Related

How recode unicode char like "\xe9" and "<e9>" to "é" in R?

I read a "csv" file where one field has values like "J\xe9rome" or "J<e9>rome" at the same time. How can I read this file so that the values come out like "Jérome", or how can I convert the characters afterwards?
I tried to use
df <- fread(file_name, encoding = "UTF-8")
but it does not work.
Thanks!

How can I extract a pattern (start and end) in a big string, using R?

I have a big string and I want to match/extract a pattern with start and end search pattern. How can this be done in R?
An example of the string:
big_string <- 'read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)'
And I want to extract the url components in this instance. Therefore, the pattern starts with http and ends with .csv or any extension (if possible).
http://company.com/students.csv
http://company.com/students_grades.csv
I have had no luck with many attempts using gregexpr to extract the pattern. Can someone help me come up with a way to do this in R?

The stringr package works very well for this type of application:
library(stringr)
big_string <- 'read.csv(\"http://company.com/students.csv\", header = TRUE)","solution":"# Preview students with str()\nstr(students)\n\n# Coerce Grades to character\nstudents$Grades <- read.csv(\"http://company.com/students_grades.csv\", header = TRUE)'
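results <- unlist(str_extract_all(big_string, "http:.+csv"))
The search pattern matches a string starting with "http:", followed by at least one character, and ending with "csv". Note that ".+" is greedy, so if one line contained two URLs the match would run from the first "http:" to the last "csv" on that line.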
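The question also asks about URLs with any extension. As a rough sketch using only base R, one option is to match from "http" up to the next double quote or whitespace character instead of up to "csv". The sample_text string and the url_pattern name are made up for this illustration, and the pattern assumes every URL is terminated by a quote or whitespace, as in the example string.

```r
# Shortened stand-in for big_string (assumption: URLs end at a quote or space).
sample_text <- 'read.csv("http://company.com/students.csv", header = TRUE)\nread.csv("http://company.com/students_grades.csv")'

# "http" or "https", then "://", then anything that is not a quote or whitespace.
url_pattern <- "https?://[^\"[:space:]]+"

# gregexpr() finds every match; regmatches() extracts the matched text.
urls <- regmatches(sample_text, gregexpr(url_pattern, sample_text))[[1]]
urls
```

Because the character class stops at the first quote, each URL is matched separately even when several appear on one line.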

Regarding reading files which contain UTF-8 character

I have a csv file containing Chinese characters, saved as UTF-8.
项目 价格
电视 5000
The first row is the header and the second row is data. In other words, it is a one-by-two vector.
I read the file as follows:
amatrix<-read.table("test.csv",encoding="UTF-8",sep=",",header=T,row.names=NULL,stringsAsFactors=FALSE)
However, the output includes unknown marks in the header, i.e., X.U.FEFF

That is the byte order mark sometimes found in Unicode text files. I'm guessing you're on Windows, since that's the only popular OS where files can end up with them.
What you can do is read the file using readLines and remove the first two characters of the first line.
txt <- readLines("test.csv", encoding="UTF-8")
txt[1] <- substr(txt[1], 3, nchar(txt[1]))
amatrix <- read.csv(text=txt, ...)
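An alternative that avoids editing the first line by hand is to let R drop the mark while reading. R's file connections accept the encoding name "UTF-8-BOM" for reading, and read.csv passes fileEncoding through to the connection. The sketch below builds a small throwaway file with a byte order mark so it can run on its own; the file contents and the variable names bom_file and with_bom are invented for the demonstration.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (EF BB BF).
bom_file <- tempfile(fileext = ".csv")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("item,price\ntv,5000\n")),
         bom_file)

# "UTF-8-BOM" tells the connection to strip the mark if it is present.
with_bom <- read.csv(bom_file, fileEncoding = "UTF-8-BOM")
names(with_bom)

# Clean up the temporary file.
unlink(bom_file)
```

For the file in the question, the equivalent call would presumably be read.csv("test.csv", fileEncoding = "UTF-8-BOM"), though that has not been checked against the original data.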

How to handle blank items when converting dates in R

I have a csv download of data from a Management Information system. There are some variables which are dates and are written in the csv as strings of the format "2012/11/16 00:00:00".
After reading in the csv file, I convert the date variables into a date using the function as.Date(). This works fine for all variables that do not contain any blank items.
For those which do contain blank items I get the following error message:
"character string is not in a standard unambiguous format"
How can I get R to replace blank items with something like "0000/00/00 00:00:00" so that the as.Date() function does not break? Are there other approaches you might recommend?

If they're strings, does something as simple as
mystr <- c("2012/11/16 00:00:00"," ","")
mystr[grepl("^ *$",mystr)] <- NA
as.Date(mystr)
work? The regular expression "^ *$" matches strings consisting of the start of the string (^), zero or more spaces ( *), followed by the end of the string ($). More generally, I think you could use "^[[:space:]]*$" to capture other kinds of whitespace (tabs etc.).
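To illustrate the more general pattern, here is a small sketch with made-up values that include an empty string, spaces, and a tab. The vector raw_dates and the explicit format string are assumptions for the example; the format matches the "2012/11/16 00:00:00" layout from the question, and as.Date ignores the trailing time part.

```r
# Made-up input: two real dates plus three kinds of blank entry.
raw_dates <- c("2012/11/16 00:00:00", "", "   ", "\t", "2013/01/05 00:00:00")

# Flag entries that are empty or contain only whitespace.
is_blank <- grepl("^[[:space:]]*$", raw_dates)
raw_dates[is_blank] <- NA

# NA entries stay NA; the rest are parsed with an explicit format.
parsed <- as.Date(raw_dates, format = "%Y/%m/%d")
parsed
```

Using NA rather than a placeholder such as "0000/00/00" keeps the missing entries visible to functions like is.na().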

Even better, have the NAs correctly inserted when you read in the CSV:
read.csv(..., na.strings='')
or to specify a vector of all the values which should be read as NA...
read.csv(..., na.strings=c('',' ',' '))

How to make R stop reading rows in a text file at a line containing a specific character?

For example, I want to read lines from the beginning of a text file up to a string with ";" symbol excluding this string.
Thanks a lot.

A very simple approach might be to read the contents of the file using readLines:
content = readLines("data.txt")
And then split the character data on the ;:
split_content = strsplit(content, split = ";")
And then extract the first element, i.e. the text up to the semicolon:
first_element = lapply(split_content, "[[", 1)
The result is a list of all the text in the rows of the data file up to the semicolon.
Ps I'm not entirely sure about the last line...I can't check it as I've got no access to R right now.
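If the goal is instead to keep only the rows that come before the first line containing ";", a rough sketch is to find that line's index and take everything above it. The temporary file and its contents are made up so the example can run on its own, and the names first_hit and kept are hypothetical.

```r
# Throwaway input file (assumption: the third line is the one with ";").
demo_file <- tempfile(fileext = ".txt")
writeLines(c("alpha", "beta", "gamma; stop here", "delta"), demo_file)

content <- readLines(demo_file)

# Index of the first line containing a literal semicolon (NA if none).
first_hit <- grep(";", content, fixed = TRUE)[1]

# Keep everything before that line, or all lines when no semicolon is found.
kept <- if (is.na(first_hit)) content else head(content, first_hit - 1)
kept

# Clean up the temporary file.
unlink(demo_file)
```

This reads the whole file first, which is fine for small inputs; for very large files, reading in chunks from an open connection would avoid loading lines that are discarded anyway.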
