Remove Hex Code from String in R [duplicate] - r

This question already has answers here:
Remove special characters from data frame
(2 answers)
Closed 4 years ago.
I have converted a .doc document to .txt, and I have some weird formatting that I cannot remove (from looking at other posts, I think it is in Hex code, but I'm not sure).
My data set is a data frame with two columns, one identifying a speaker and the second column identifying the comments. Some strings now have weird characters. For instance, one string originally said (minus the quotes):
"Why don't we start with a basic overview?"
But when I read it in R after converting it to a .txt, it now reads:
"Why don<92>t we start with a basic overview?"
I've tried:
df$comments <- gsub("<92>", "", df$comments)
However, this doesn't change anything. Furthermore, whenever I do any other substitution within a cell (for instance, changing "start" to "begin"), it changes that special character into a series of weird question marks surrounded by boxes.
Any help would be very appreciated!
EDIT:
I read my text in like this:
df <- read_delim("file.txt", "\n", escape_double = F, col_names = F, trim_ws = T)
It has 2 columns; the first is speaker and the second is comments.

I found the answer here: R remove special characters from data frame
This code worked: gsub("[^0-9A-Za-z///' ]", "", a)
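For context, <92> is how R prints the single byte 0x92; it is not the literal characters "<", "9", "2", ">", which is why a gsub() on "<92>" finds nothing. The regex above works by dropping every character that is not a letter, digit, slash, apostrophe or space, so it also strips punctuation such as "?". A minimal sketch of an alternative, assuming the .txt file was saved in Windows-1252 (a guess based on its .doc origin; in that encoding byte 0x92 is the right single quotation mark), is to convert the encoding instead. The sample string is rebuilt here for illustration.

```r
# Rebuild the problem string: byte 0x92 stands in for the curly apostrophe
x <- "Why don\x92t we start with a basic overview?"

# Assumption: the source file is Windows-1252 ("CP1252" in iconv naming)
fixed <- iconv(x, from = "CP1252", to = "UTF-8")

# Optionally turn the curly apostrophe (U+2019) into a plain ASCII one
fixed <- gsub("\u2019", "'", fixed, fixed = TRUE)
fixed
```

If the assumption holds, the same two calls could be applied to the whole column, e.g. df$comments <- iconv(df$comments, from = "CP1252", to = "UTF-8").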

Related

Importing csv file in R and remove character added in numeric header [duplicate]

This question already has answers here:
Why am I getting X. in my column names when reading a data frame?
(5 answers)
Closed last month.
An extra X is added to each header when I read the csv file, and I want the headers without it. I'm importing the csv file into R using read.csv(); the file is read, but an extra character is included in each header because the header is numeric. How can I remove this extra character?
The extra 'X' are added to the header because the original column names are not syntactically valid variable names. A legitimate column name has to start with a letter or the dot not followed by a number. By default read.csv() will check the names of the variables in the data frame to ensure validity of names. You can switch off this feature through
read.csv(..., check.names = FALSE)
Alternatively, you can rename the columns of the data frame after importing it. If the only thing you want to do is delete the first character of each name, you can do this (let's call your dataset df):
colnames(df) <- substring(colnames(df),2)
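A small sketch of both approaches, using a made-up inline CSV (the column names 2019, 2020, 2021 are assumptions for illustration). It also shows sub("^X", ...), which only removes a leading X and so is a bit safer than substring() when some columns were never prefixed.

```r
# Hypothetical CSV whose headers are numbers
csv_text <- "2019,2020,2021\n1,2,3\n"

# Default behaviour: names are passed through make.names(), which adds "X"
with_check <- read.csv(text = csv_text)
names(with_check)

# Option 1: keep the headers exactly as they appear in the file
no_check <- read.csv(text = csv_text, check.names = FALSE)
names(no_check)

# Option 2: strip only a leading "X" after the fact
cleaned <- sub("^X", "", names(with_check))
cleaned
```

Note that with non-syntactic names you need backticks or [[ ]] to access a column, e.g. no_check[["2019"]].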

Remove multiple instances with a regex expression, but not the text in between instances [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 1 year ago.
In long passages using bookdown, I have inserted numerous images. Having combined the passages into a single character string (in a data frame) I want to remove the markdown text associated with inserting images, but not any text in between those inserted images. Here is a toy example.
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"
str_remove_all(string = text.string, pattern = "!\\[.+\\)")
[1] "writing more writing"
The regex doesn't stop at the first closing parenthesis; it continues until the last one and deletes the "writing to keep" in between.
I tried to apply String manipulation in R: remove specific pattern in multiple places without removing text in between instances of the pattern, which uses gsubfn and gsub but was unable to get the solutions to work.
Please point me in the right direction to solve this problem of a regex removal of designated strings, but not the characters in between the strings. I would prefer a stringr solution, but whatever works. Thank you
You have to use the following regex
"!\\[[^\\)]+\\)"
alternatively you can also use this:
"!\\[.*?\\)"
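Both patterns stop at the first closing parenthesis instead of matching greedily up to the last one, which is the key to your question. The first does so with a negated character class, the second with a lazy quantifier.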
I think you could use the following solution too:
gsub("!\\[[^][]*\\]\\([^()]*\\)", "", text.string)
[1] "writing writing to keep more writing"
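To make the greedy versus lazy difference concrete, here is a minimal base R sketch on the same toy string. The whitespace clean-up step at the end is an addition for illustration, since removing the image markup leaves two spaces where each image used to be.

```r
text.string <- "writing ![Stairway scene](/media/ClothesFairLady.jpg) writing to keep ![Second scene](/media/attire.jpg) more writing"

# Greedy: .+ runs to the LAST ")" and swallows the text in between
greedy <- gsub("!\\[.+\\)", "", text.string, perl = TRUE)

# Lazy: .*? stops at the FIRST ")" after each "!["
lazy <- gsub("!\\[.*?\\)", "", text.string, perl = TRUE)

# Collapse the runs of spaces left behind by the removed markup
lazy_clean <- gsub(" +", " ", lazy)
lazy_clean
```

With stringr loaded, the same lazy pattern should work in str_remove_all(), followed by str_squish() for the whitespace.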

how to get the last part of strings with different lengths ended by ".nc" [duplicate]

This question already has answers here:
Get filename without extension in R
(9 answers)
Find file name from full file path
(4 answers)
Closed 3 years ago.
I have several download links (i.e., strings), and each string has different length.
For example let's say these fake links are my strings:
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
My goals:
A) I want to have only the last part of each string ended by .nc , and get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc
pr_-present_r1i1p1f1_gn_19500101-19591231.nc
B) I want to have only the last part of each string before .nc , and get these results:
pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231
pr_-present_r1i1p1f1_gn_19500101-19591231
I tried to find a way on the net, but I failed. It seems this can be done in Python as documented here:
How to get everything after last slash in a URL?
Does anyone know the same method in R?
Thanks so much for your time.
A shortcut to get last part of the string would be to use basename
basename(My_Link1)
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
and for the second question, if you want to remove the trailing ".nc", we could use sub with the pattern anchored to the end of the string:
sub("\\.nc$", "", basename(My_Link1))
#[1] "pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231"
With some regex, here is another way to get the last part (everything after the final slash):
sub(".*/", "", My_Link1)
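Both goals can also be sketched with functions that ship with R, applied to the two example links at once. tools::file_path_sans_ext() is part of the tools package bundled with R; it removes whatever extension is present, not only ".nc", which may or may not be what you want.

```r
My_Link1 <- "http://esgf-data2.diasjp.net/pr/gn/v20190711/pr_day_MRI-AGCM3-2-H_highresSST_gn_20100101-20141231.nc"
My_Link2 <- "http://esgf-data2.diasjp.net/gn/v20190711/pr_-present_r1i1p1f1_gn_19500101-19591231.nc"
links <- c(My_Link1, My_Link2)

# Goal A: file name including the extension (basename is vectorised)
file_names <- basename(links)

# Goal B: file name without the extension
file_stems <- tools::file_path_sans_ext(file_names)

file_names
file_stems
```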

write_csv - Exporting trailing spaces (no elimination)

I am trying to export a table to CSV format, but one of my columns is special - it's like a number string except that the length of the string needs to be the same every time, so I add trailing spaces to shorter numbers to get it to a certain length (in this case I make it length 5).
library(dplyr)
library(readr)
library(stringr)  # str_pad() comes from stringr
df <- read.table(text="ID Something
22 Red
55555 Red
123 Blue
",header=T)
df <- mutate(df,ID=str_pad(ID,5,"right"," "))
df
ID Something
1 22 Red
2 55555 Red
3 123 Blue
Unfortunately, when I try to do write_csv somewhere, the trailing spaces disappear which is not good for what I want to use this for. I think it's because I am downloading the csv from the R server and then opening it in Excel, which messes around with the data. Any tips?
str_pad() appears to be a function from the stringr package, which is not currently available for R 3.5.0, the version I am using; this may be the cause of your issues as well. If the function actually works for you, please ignore the next step and skip straight to my Excel comments below.
Adding spaces. Here is how I have accomplished this task with base R
# a custom function to add arbitrary number of trailing spaces
SpaceAdd <- function(x, desiredLength = 5) {
  additionalSpaces <- ifelse(nchar(x) < desiredLength,
                             paste(rep(" ", desiredLength - nchar(x)), collapse = ""), "")
  paste(x, additionalSpaces, sep = "")
}
# use the function on your df
df$ID <- mapply(df$ID, FUN = SpaceAdd)
# write csv normally
write.csv(df, "df.csv")
NOTE When you import to Excel, you should use the 'import from text' wizard rather than just opening the .csv. This is because you need to mark your 'ID' column as text in order to keep the spaces.
NOTE 2 I have learned today that having your first column named 'ID' might actually cause further problems with Excel, since it may misinterpret the nature of the file and treat it as a SYLK file instead. So it may be best to avoid this column name if possible.
Here is a wiki tl;dr:
A commonly encountered (and spurious) 'occurrence' of the SYLK file happens when a comma-separated value (CSV) format is saved with an unquoted first field name of 'ID', that is the first two characters match the first two characters of the SYLK file format. Microsoft Excel (at least to Office 2016) will then emit misleading error messages relating to the format of the file, such as "The file you are trying to open, 'x.csv', is in a different format than specified by the file extension..."
details: https://en.wikipedia.org/wiki/SYmbolic_LinK_(SYLK)
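As a shorter base R alternative to a hand-written padding function, formatC() with flag = "-" left-justifies and pads to a fixed width. The sketch below also writes the result to a temporary file and reads it back to check that the trailing spaces survive the CSV round trip on the R side, which would support the suspicion that Excel is what strips them. The temporary file and the read-back step are assumptions added for illustration.

```r
df <- data.frame(ID = c(22, 55555, 123),
                 Something = c("Red", "Red", "Blue"),
                 stringsAsFactors = FALSE)

# Pad on the right to width 5; convert to character first so no
# numeric formatting is applied
df$ID <- formatC(as.character(df$ID), width = 5, flag = "-")

# Round trip through a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(df, tmp, row.names = FALSE)
back <- read.csv(tmp, colClasses = "character", strip.white = FALSE)

nchar(back$ID)  # every ID should still be 5 characters wide
```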

Remove <U+00A0> from values in columns in R [duplicate]

This question already has answers here:
How to remove unicode <U+00A6> from string?
(4 answers)
Closed 4 years ago.
When I read my csv file using read.csv with the encoding parameter, I get some values with <U+00A0> in them.
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "UTF-8")
The dataset looks like
X                                 Y
Met<U+00A0>Expectations           Met<U+00A0>Expectations
Met<U+00A0>Expectations           Met<U+00A0>Expectations
NA                                Met<U+00A0>Expectations
Met<U+00A0>Expectations           Exceeded Expectations
Did<U+00A0>Not Meet Expectations  Met<U+00A0>Expectations
Unacceptable                      Exceeded Expectations
How can I remove the <U+00A0> from these values? If I do not use the "encoding" parameter, when I show these values in the shiny application, it is seen as:
Met<a0>Expectations and Did<a0>Not Meet Expectations
I have no clue on how to handle this.
PS: I have modified the original question with examples of the problem faced.
This problem bothered me for a long time. I searched all around the R communities, but no answer under the "r" tag worked in my situation. Only when I widened the search did I find a working answer under the "java" tag.
For the data frame, the solution is:
application <- as.data.frame(lapply(application, function(x) {
  gsub("\u00A0", "", x)
}))
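One thing to watch with an empty replacement is that it joins the neighbouring words ("MetExpectations"). A variation, sketched below on a small made-up data frame, replaces the non-breaking space with an ordinary space and trims the ends, leaving non-character columns and NA values untouched. The sample values are assumptions modelled on the question.

```r
application <- data.frame(
  X = c("Met\u00A0Expectations", NA, "Did\u00A0Not Meet Expectations"),
  Y = c("Met\u00A0Expectations", "Exceeded Expectations",
        "Met\u00A0Expectations\u00A0"),
  stringsAsFactors = FALSE
)

application[] <- lapply(application, function(col) {
  if (!is.character(col)) return(col)     # skip numeric, logical, etc.
  col <- gsub("\u00A0", " ", col, fixed = TRUE)  # NBSP -> regular space
  trimws(col)                              # drop leading/trailing spaces
})

application
```

Assigning into application[] keeps the object a data frame, so no as.data.frame() wrapper is needed.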
Two options:
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "ASCII")
or with {readr}
application <- read_csv("application.csv", na = c("N/A","","NA"), locale = locale(encoding = "ASCII"))
Converting UTF-8 to ASCII will remove the printed UTF-8 syntax, but the spaces will remain. Beware that if there are extra spaces at the beginning or end of a character string, you may get unwanted unique values. For example "Met Expectations<U+00A0>" converted to ASCII will read "Met Expectations ", which does not equal "Met Expectations".
This isn't a great answer, but to get your csv back into UTF-8 you can open it in Google Sheets and then download it as a .csv. Then import it with trim_ws = T. This will solve the importing problems and won't create any weirdness.
