Ambiguity while using readLines() in R

The first line of my dataset contains the names of the columns.
It looks like this --
#"State Code","County Code","Site Num","Parameter Code","POC","Latitude","Longitude","Datum","Parameter Name","Sample Duration","Pollutant Standard","Metric Used","Method Name","Year","Units of Measure","Event Type","Observation Count","Observation Percent","Completeness Indicator","Valid Day Count","Required Day Count","Exceptional Data Count","Null Data Count","Primary Exceedance Count","Secondary Exceedance Count","Certification Indicator","Num Obs Below MDL","Arithmetic Mean","Arithmetic Standard Dev","1st Max Value","1st Max DateTime","2nd Max Value","2nd Max DateTime","3rd Max Value","3rd Max DateTime","4th Max Value","4th Max DateTime","1st Max Non Overlapping Value","1st NO Max DateTime","2nd Max Non Overlapping Value","2nd NO Max DateTime","99th Percentile","98th Percentile","95th Percentile","90th Percentile","75th Percentile","50th Percentile","10th Percentile","Local Site Name","Address","State Name","County Name","City Name","CBSA Name","Date of Last Change"
It is a csv file.
Since I am using Windows I wrote
pm0 <- read.csv("C:/Users/Ad/Desktop/EDA/2010.csv",
                comment.char = "#", header = FALSE, sep = ",", na.strings = "")
to read this CSV file without the first line. Now I want to read the first line so that I can use it to set the column names of my generated data frame. For this I wrote --
cnames<-readLines("C:/Users/Ad/Desktop/EDA/2010.csv",1)
But when I print cnames I get this --
[1] "\"State Code\",\"County Code\",\"Site Num\",\"Parameter Code\",\"POC\",\"Latitude\",\"Longitude\",\"Datum\",\"Parameter Name\",\"Sample Duration\",\"Pollutant Standard\",\"Metric Used\",\"Method Name\",\"Year\",\"Units of Measure\",\"Event Type\",\"Observation Count\",\"Observation Percent\".
I don't understand why \ appears at the start and end of every element of cnames.
Can someone help me remove it?

This is from the Exploratory Data Analysis (EDA) assignment on Coursera, right? I trust you are compliant with the honor code.
What you have in cnames is ONE string. The backslashes are just how R escapes the embedded quotation marks when printing; they are not part of the data.
To get around this, try:
cnames1 <- strsplit(cnames, ",")               # split the single string on commas
gsub("[\"]", "", cnames1[[1]], perl = TRUE)    # strip the quotation marks
This gives a character vector of names:
[1] "State Code" "County Code" "Site Num"
[4] "Parameter Code" "POC" "Latitude"
[7] "Longitude" "Datum" "Parameter Name"
[10] "Sample Duration" "Pollutant Standard" "Metric Used"
[13] "Method Name" "Year" "Units of Measure"
[16] "Event Type" "Observation Count" "Observation Percent"

What I did is this --
pm0 <- read.csv("C:/Users/Ad/Desktop/EDA/2010.csv",
                comment.char = "#", header = TRUE, sep = ",", na.strings = "")
Now the object pm0 uses the first row of the CSV file as its column names.

Related

How to remove hidden characters in R from string imported from Excel?

Using the openxlsx package in R, I am importing data from an Excel file that originated in Brazil. In the character strings, there seem to be hidden characters. As you can see from my code, I remove the white space but strings 2 and 3 are still shown as unique strings. Strings 6 and 7 are also appearing as unique strings when they should be identical.
moths %<>%
mutate(Details = str_trim(Details))
sort(unique(moths$Details))
[1] "Check Outside" "Check Plot"
[3] "Check Plot ​​" "Check Plot ​​ (between treatments)"
[5] "PRX-01GA1-21022.00" "PRX-01GA2-22001​"
[7] "PRX-01GA2-22001" "PRX-01GA2-22002​"
[9] "PRX-01GA2-22002" "PRX-01GA2-22003"
[11] "PRX-01GA2-22004" "SF2.5VP"
[13] "YM001-22" "YM001-PRX-01GA2-22001​ PRX-01GA2-22001​"
Unfortunately, since I can't attach the Excel file that the data are coming from, I can't make a completely reproducible example here, but hopefully someone can still provide some insight.
There may be some non-ASCII characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):
vec <- c("Check Outside", "Check Plot", "Check Plot ​​",
"Check Plot ​​ (between treatments)")
unique(vec)
# [1] "Check Outside" "Check Plot"
# [3] "Check Plot ​​" "Check Plot ​​ (between treatments)"
library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside" "Check Plot" "Check Plot (between treatments)"
So tl;dr this should do what you’re after
library(textclean)
moths <- moths %>%
mutate(Details = replace_non_ascii(str_trim(Details)))
Also, make sure the cleaned data is actually saved back to moths; use either the assignment pipe %<>% or an explicit assignment, but not both:
moths <- moths %>%
  mutate(Details = str_trim(Details))
sort(unique(moths$Details))
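A base-R alternative (a sketch, no extra packages): converting to ASCII with iconv() and deleting anything that cannot be represented also removes zero-width spaces and similar hidden characters. The \u200b below is only an assumption about which hidden character is in your strings:
vec <- c("Check Plot", "Check Plot\u200b\u200b")   # \u200b = zero-width space (assumed)
unique(trimws(iconv(vec, from = "UTF-8", to = "ASCII", sub = "")))
# [1] "Check Plot"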

Remove certain lines (with ---- and empty lines) from txt file using readLines() or read_lines()

I have this text file called textdata.txt:
TREATMENT DATA
------------------------------------
A: Text1
B: Text2
C: Text3
D: Text4
E: Text5
F: Text6
G: Text7
I would like to remove the whole line with --------- and the empty lines using readLines or read_lines:
When I use readLines("textdata.txt") I get:
[1] "TREATMENT DATA"
[2] ""
[3] "------------------------------------"
[4] "A: Text1"
[5] "B: Text2"
[6] ""
[7] "C: Text3"
[8] "D: Text4"
[9] ""
[10] "E: Text5"
[11] "F: Text6"
[12] "G: Text7"
I would like to have, expected output:
[1] "TREATMENT DATA"
[2] "A: Text1"
[3] "B: Text2"
[4] "C: Text3"
[5] "D: Text4"
[6] "E: Text5"
[7] "F: Text6"
[8] "G: Text7"
Background:
I have practically no experience handling files with R. The basic idea is to get a .txt format from which I can load multiple text files stored in a folder into one data frame.
1) read.table If we can assume that the only occurrence of - is where shown in the question, and that ? does not occur anywhere in the file, then this will read in the data, treating every line as a single field and throwing away the header. Since - is the comment character, lines consisting only of - are regarded as blank and thrown away. This reads the file into a one-column data frame, and the [[1]] returns that column as a character vector. If you want to keep the header, omit header=TRUE.
read.table("myfile", sep = "?", comment.char = "-", header = TRUE)[[1]]
2) grep Another possibility is to read in the file and then remove lines that are empty or contain only - characters.
grep("^-*$", readLines("myfile"), invert = TRUE, value = TRUE)
3) pipe We could process the input using a filter and then pipe that into R. On Windows, grep is found in C:\Rtools40\usr\bin if you have Rtools40 installed; if it is not on your path, either use the complete path, or if you don't have it at all, replace grep with findstr. On UNIX/Linux the escaping may vary according to which shell you are using.
readLines(pipe('grep -v "^-*$" myfile'))
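A small sketch combining approach 2 with the background goal: after filtering the lines, the "A: Text1" entries can be split into a two-column data frame (the column names key and value are made up here):
lines <- grep("^-*$", readLines("textdata.txt"), invert = TRUE, value = TRUE)
dat <- read.table(text = lines[-1], sep = ":", strip.white = TRUE,
                  col.names = c("key", "value"))   # [-1] drops the "TREATMENT DATA" line
dat
#   key value
# 1   A Text1
# 2   B Text2
# ...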

Regular Expression in R (removing spacing and punctuation characters)

Suppose I have the following text:
text = c("Initial [kHz]","Initial Value [dB]",
"Min Accept X [kHz]","Min Accept [dB]",
"Cut-Off Frequency [kHz]",
"Min Bandwidth Limit [kHz]","y min [dB]",
"Max Bandwidth Limit [kHz]","y max [dB]",
"Iter: 1 [kHz]","Iter: 1","Value: 55 [dB]",
"Iter: 2 [kHz]","Iter: 2","Value: 59 [dB]")
But what I want is this (with the spacing and punctuation characters removed):
text = c("InitialkHz","InitialValuedB",
"MinAcceptXkHz","MinAcceptdB",
"CutOffFrequencykHz",
"MinBandwidthLimitkHz","ymindB",
"MaxBandwidthLimitkHz","ymaxdB]",
"Iter1kHz","Iter1","Value55dB",
"Iter2kHz","Iter2","Value59dB")
Can anyone help me? Please...
You can choose to keep only alpha numeric values like this:
gsub('[^[:alnum:]]', '', text)
We can use gsub to remove all the punctuation and spaces from text.
gsub("[[:punct:]]| ", "", text)
# [1] "InitialkHz" "InitialValuedB" "MinAcceptXkHz"
# [4] "MinAcceptdB" "CutOffFrequencykHz" "MinBandwidthLimitkHz"
# [7] "ymindB" "MaxBandwidthLimitkHz" "ymaxdB"
#[10] "Iter1kHz" "Iter1" "Value55dB"
#[13] "Iter2kHz" "Iter2" "Value59dB"

Extract title from multiple lines

I have multiple files, each with a different title, and I want to extract the title from each file. Here is an example of one file:
[1] "<START" "ID=\"CMP-001\"" "NO=\"1\">"
[4] "<NAME>Plasma-derived" "vaccine" "(PDV)"
[7] "versus" "placebo" "by"
[10] "intramuscular" "route</NAME>" "<DIC"
[13] "CHI2=\"3.6385\"" "CI_END=\"0.6042\"" "CI_START=\"0.3425\""
[16] "CI_STUDY=\"95\"" "CI_TOTAL=\"95\"" "DF=\"3.0\""
[19] "TOTAL_1=\"0.6648\"" "TOTAL_2=\"0.50487622\"" "BLE=\"YES\""
.
.
.
[789] "TOTAL_2=\"39\"" "WEIGHT=\"300.0\"" "Z=\"1.5443\">"
[792] "<NAME>Local" "adverse" "events"
[795] "after" "each" "injection"
[798] "of" "vaccine</NAME>" "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>"
[801] "</GROUP_LABEL_2>" "<GRAPH_LABEL_1>" "PDV</GRAPH_LABEL_1>"
The expected extracted title is:
Plasma-derived vaccine (PDV) versus placebo by intramuscular route
Note that the title length differs from file to file.
Here is a solution using stringr. It first collapses the vector into one long string and then captures everything between each pair of "<NAME>" and "</NAME>". In the future, people will be able to help you more easily if you provide a reproducible example (e.g., using dput()). Hope this helps!
Note: if you just want the first title you can use str_match() instead of str_match_all().
library(stringr)
str_match_all(paste0(string, collapse = " "), "<NAME>(.*?)</NAME>")[[1]][,2]
[1] "Plasma-derived vaccine (PDV) versus placebo by intramuscular route"
[2] "Local adverse events after each injection of vaccine"
Data:
string <- c("<START", "ID=\"CMP-001\"", "NO=\"1\">", "<NAME>Plasma-derived", "vaccine", "(PDV)", "versus", "placebo", "by", "intramuscular", "route</NAME>", "<DIC", "CHI2=\"3.6385\"", "CI_END=\"0.6042\"", "CI_START=\"0.3425\"", "CI_STUDY=\"95\"", "CI_TOTAL=\"95\"", "DF=\"3.0\"", "TOTAL_1=\"0.6648\"", "TOTAL_2=\"0.50487622\"", "BLE=\"YES\"",
"TOTAL_2=\"39\"", "WEIGHT=\"300.0\"", "Z=\"1.5443\">", "<NAME>Local", "adverse", "events", "after", "each", "injection", "of", "vaccine</NAME>", "<GROUP_LABEL_1>PDV</GROUP_LABEL_1>", "</GROUP_LABEL_2>", "<GRAPH_LABEL_1>", "PDV</GRAPH_LABEL_1>")

Splitting string in r and putting them in new columns in new data frame

I want to do this in R.
I run code to extract the paths of all folders and subfolders and get the list below. I want to apply a set of rules to it:
If 1 "/" is encountered in the whole line then replace that "/" with "/Folder/"
If 2 "/" is encountered in the whole line then do nothing.
If 3 or MORE "/" are encountered then ignore the first and last "/" and replace all remaining "/" with "-"
The code I run to extract the file paths is:
b<-list.files(path="/Users/Mohit/Desktop/Company/Database",recursive=TRUE)
[1] "Accounts/Academic History.pdf" "Accounts/Contract.pdf"
[3] "Accounts/Credit/Analyst/Banking/TFileOutput.txt" "Accounts/Credit/Analyst/untitled.jpg"
[5] "Accounts/Credit/background.jpg" "Accounts/Credit/background.xcf"
[7] "Accounts/Debit/index.html" "Human Resources/RStudio-0.98.1073.dmg"
[9] "Information Technology/Iti.pdf" "Logistics/1610085_10152585224658626_398303669_n.jpg"
[11] "Sales/947309_10152376144413626_1056138683_n.jpg"
I'm not sure which function to use. The stringr package with sapply maybe?
I want to put this in a column with a heading and export it as text file.
Any help will be greatly appreciated.
Thanks very much
Maybe this helps:
library(stringr)
Ct <- str_count(b, "/")
b1 <- ifelse(Ct == 1, gsub("[/]", "/Folder/", b),
        ifelse(Ct >= 3, gsub("(^([^/]+[/])|([/][^/]+)$)(*SKIP)(*F)|[/]", "-", b,
                             perl = TRUE), b))
b1
#[1] "Accounts/Folder/Academic History.pdf"
#[2] "Accounts/Folder/Contract.pdf"
#[3] "Accounts/Credit-Analyst-Banking/TFileOutput.txt"
#[4] "Accounts/Credit-Analyst/untitled.jpg"
#[5] "Accounts/Credit/background.jpg"
#[6] "Accounts/Credit/background.xcf"
#[7] "Accounts/Debit/index.html"
#[8] "Human Resources/Folder/RStudio-0.98.1073.dmg"
#[9] "Information Technology/Folder/Iti.pdf"
#[10] "Logistics/Folder/1610085_10152585224658626_398303669_n.jpg"
#[11] "Sales/Folder/947309_10152376144413626_1056138683_n.jpg"
If you want to create a data.frame and then export it as a .txt file:
dat <- data.frame(b, b1, stringsAsFactors=FALSE)
write.table(dat, file="Mohit.txt", quote=FALSE, row.names=FALSE, sep=",")
Update
If you need to create 3 columns based on b1
datN <- setNames(read.table(text=b1, sep="/"), c("Class", "Title", "File"))
head(datN,2)
# Class Title File
#1 Accounts Folder Academic History.pdf
#2 Accounts Folder Contract.pdf
Now, you can save the file using write.table
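For example (a sketch; the file name is made up):
write.table(datN, file = "MohitSplit.txt", quote = FALSE, row.names = FALSE, sep = ",")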
data
b <- c("Accounts/Academic History.pdf", "Accounts/Contract.pdf", "Accounts/Credit/Analyst/Banking/TFileOutput.txt",
"Accounts/Credit/Analyst/untitled.jpg", "Accounts/Credit/background.jpg",
"Accounts/Credit/background.xcf", "Accounts/Debit/index.html",
"Human Resources/RStudio-0.98.1073.dmg", "Information Technology/Iti.pdf",
"Logistics/1610085_10152585224658626_398303669_n.jpg","Sales/947309_10152376144413626_1056138683_n.jpg"
