Remove <U+00A0> from values in columns in R [duplicate]

This question already has answers here:
How to remove unicode <U+00A6> from string?
(4 answers)
Closed 4 years ago.
When I read my csv file using read.csv with the encoding parameter, I get some values with <U+00A0> in them.
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "UTF-8")
The dataset looks like
X Y
Met<U+00A0>Expectations Met<U+00A0>Expectations
Met<U+00A0>Expectations Met<U+00A0>Expectations
NA Met<U+00A0>Expectations
Met<U+00A0>Expectations Exceeded Expectations
Did<U+00A0>Not Meet Expectations Met<U+00A0>Expectations
Unacceptable Exceeded Expectations
How can I remove the <U+00A0> from these values? If I do not use the encoding parameter, when I show these values in the Shiny application, they appear as:
Met<a0>Expectations and Did<a0>Not Meet Expectations
I have no clue how to handle this.
PS: I have modified the original question with examples of the problem faced.

This problem bothered me for a long time. I searched all around the R communities, and no answer under the "r" tag worked in my situation. Only when I expanded my search did I find a working answer under the "java" tag.
Okay, for a data frame, the solution is:
application <- as.data.frame(lapply(application, function(x) {
  gsub("\u00A0", "", x)
}))

Two options:
application <- read.csv("application.csv", na.strings = c("N/A","","NA"), encoding = "ASCII")
or with {readr}
application <- read_csv("application.csv", na = c("N/A", "", "NA"), locale = locale(encoding = "ASCII"))
Converting UTF-8 to ASCII will remove the printed UTF-8 syntax, but the spaces will remain. Beware that if there are extra spaces at the beginning or end of a character string, you may get unwanted unique values. For example "Met Expectations<U+00A0>" converted to ASCII will read "Met Expectations ", which does not equal "Met Expectations".
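If those leftover spaces are a problem, one option is to trim whitespace from every character column after import. This is only a sketch, not part of the original answer; it assumes the application data frame from above and that the leftovers are ordinary spaces:
# trim leading/trailing whitespace from all character columns
application[] <- lapply(application, function(x) {
  if (is.character(x)) trimws(x) else x
})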

This isn't a great answer, but to get your csv back into UTF-8 you can open it in Google Sheets and then download it as a .csv. Then import it with trim_ws = TRUE. This will solve the importing problems and won't create any weirdness.


Remove Hex Code from String in R [duplicate]

This question already has answers here:
Remove special characters from data frame
(2 answers)
Closed 4 years ago.
I have converted a .doc document to .txt, and I have some weird formatting that I cannot remove (from looking at other posts, I think it is in Hex code, but I'm not sure).
My data set is a data frame with two columns, one identifying a speaker and the second column identifying the comments. Some strings now have weird characters. For instance, one string originally said (minus the quotes):
"Why don't we start with a basic overview?"
But when I read it in R after converting it to a .txt, it now reads:
"Why don<92>t we start with a basic overview?"
I've tried:
df$comments <- gsub("<92>", "", df$comments)
However, this doesn't change anything. Furthermore, whenever I do any other substitution within a cell (for instance, changing "start" to "begin"), it changes that special character into a series of weird "?" characters surrounded by boxes.
Any help would be very appreciated!
EDIT:
I read my text in like this:
df <- read_delim("file.txt", "\n", escape_double = FALSE, col_names = FALSE, trim_ws = TRUE)
It has 2 columns; the first is speaker and the second is comments.
I found the answer here: R remove special characters from data frame
This code worked: gsub("[^0-9A-Za-z///' ]", "", a)
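Applied to the data frame from the question (assuming the column is df$comments, as described above), that would look something like:
# keep only letters, digits, slashes, apostrophes and spaces
df$comments <- gsub("[^0-9A-Za-z///' ]", "", df$comments)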

Simplifying characters with ornaments in R [duplicate]

This question already has answers here:
Replace multiple letters with accents with gsub
(11 answers)
Closed 5 years ago.
I have the names of some music artists which I am working with within the Spotify API. I'm having some issues dealing with some strings because of the characters' accents. I don't have much understanding of character encoding.
I'll provide more context a bit further below, but essentially I am wondering if there is a way in R to "simplify" characters with ornaments.
Essentially, I am interested if there is a function which will take c("ë", "ö") as an input, and return c("e", "o"), removing the ornaments from the characters.
I don't think I can create a reproducible example because of the issues with API authentication, but for some context, when I try to run:
artistName <- "Tiësto"
GET(paste0("https://api.spotify.com/v1/search?q=",
           artistName,
           "&type=artist"),
    config(token = token))
The following gets sent to the API:
https://api.spotify.com/v1/search?q=Tiësto&type=artist
This returns a 400 Bad Request error. I am trying to alter the strings I pass to the GET function so I can get some useful output.
Edit: I am not looking for a gsub type solution, as that relies on me anticipating the sorts of accented characters which might appear in my data. I'm interested whether there is a function already out there which does this sort of translation between different character encodings.
Here is what I found, and it may work for you. It is simple and convenient to apply to any form of data.
> artistName <- "Tiësto"
> iconv(artistName, "latin1", "ASCII//TRANSLIT")
[1] "Tiesto"
Based on the answers to this question, you could do this:
artistName <- "Tiësto"
removeOrnaments <- function(string) {
  chartr(
    "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ",
    "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy",
    string
  )
}
removeOrnaments(artistName)
# [1] "Tiesto"

Error in tolower() invalid multibyte string

This is the error that I receive when I try to run tolower() on a character vector from a file that cannot be changed (at least, not manually - too large).
Error in tolower(m) : invalid multibyte string X
It seems to be French company names that are the problem with the É character. Although I have not investigated all of them (also not possible to do so manually).
It's strange, because my thought was that encoding issues would have been identified during read.csv(), rather than during operations after the fact.
Is there a quick way to remove these multibyte strings? Or, perhaps a way to identify and convert? Or even just ignore them entirely?
Here's how I solved my problem:
First, I opened the raw data in a text editor (Geany, in this case), clicked Properties, and identified the encoding type.
After which I used the iconv() function.
x <- iconv(x,"WINDOWS-1252","UTF-8")
To be more specific, I did this for every column of the data.frame from the imported CSV. Important to note that I set stringsAsFactors=FALSE in my read.csv() call.
dat[, sapply(dat, is.character)] <- sapply(
  dat[, sapply(dat, is.character)],
  iconv, "WINDOWS-1252", "UTF-8")
I was getting the same error. However, in my case it wasn't when I was reading the file, but a bit later when processing it. I realised that I was getting the error, because the file wasn't read with the correct encoding in the first place.
I found a much simpler solution (at least for my case) and wanted to share. I simply added encoding as below and it worked.
read.csv(<path>, encoding = "UTF-8")
library(tidyverse)
data_clean = data %>%
  mutate(new_lowercase_col = tolower(enc2utf8(as.character(my_old_column))))
Where new_lowercase_col is the name of the new column I'm making out of the old uppercase one, which was called my_old_column.
I know this has been answered already but thought I'd share my solution to this as I experienced the same thing.
In my case, I used the function str_trim() from package stringr to trim whitespace from start and end of string.
com$uppervar<-toupper(str_trim(com$var))
# to avoid datatables warning: error in tolower(x) invalid multibyte string
# assuming all columns are char
new_data <- as.data.frame(
  lapply(old_data, enc2utf8),
  stringsAsFactors = FALSE
)
My solution to this issue:
library(dplyr)   # pipes
library(stringi) # for stri_enc_isutf8
# Read in csv data
old_data <- read.csv("non_utf_data.csv", encoding = "UTF-8")
# despite specifying UTF-8, the below column is not UTF-8:
all(stri_enc_isutf8(old_data$problem_column))
# The below code uses a regular expression to cleanse. You may need to tinker with the
# last portion, which selects the punctuation to retain.
utf_eight_data <- old_data %>%
  mutate(problem_column = gsub("[^[:alnum:][:blank:]?&/\\-]", "", problem_column)) %>%
  rename(solved_problem = problem_column)
# this column is now UTF-8:
all(stri_enc_isutf8(utf_eight_data$solved_problem))

R - read.table imports half of the dataset - no errors nor warnings

I have a csv file with ~200 columns and ~170K rows. The data has been extensively groomed and I know that it is well-formed. When read.table completes, I see that approximately half of the rows have been imported. There are no warnings or errors. I set options(warn = 2). I'm using the latest 64-bit version of R and I increased the memory limit to 10 GB. Scratching my head here... no idea how to proceed with debugging this.
Edit
When I said half the file, I don't mean the first half. The last observation read is towards the end of the file....so its seemingly random.
You may have a comment character (#) in the file (try setting the option comment.char = "" in read.table). Also, check that the quote option is set correctly.
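A sketch of what that might look like; the file name and separator here are assumptions, so adjust them to your data:
# disable comment handling and set the quote character explicitly
dat <- read.table("data.csv", sep = ",", header = TRUE,
                  comment.char = "",  # don't treat "#" as a comment
                  quote = "\"",       # only double quotes delimit fields
                  stringsAsFactors = FALSE)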
I've had this problem before; the way I approached it was to read in a set number of lines at a time and then combine them after the fact.
df1 <- read.csv(..., nrows = 85000)
# skip the header row plus the 85000 rows already read; read the rest without a header
df2 <- read.csv(..., skip = 85001, header = FALSE, nrows = 85000)
colnames(df2) <- colnames(df1)
df <- rbind(df1, df2)
rm(df1, df2)
I had a similar problem when reading in a large txt file which had a "|" separator. Scattered about the txt file were some text blocks that contained a quote (") which caused the read.xxx function to stop at the prior record without throwing an error. Note that the text blocks mentioned were not encased in double quotes; rather, they just contained one double quote character here and there (") which tripped it up.
I did a global search and replace on the txt file, replacing the double quote (") with a single quote ('), solving the problem (all rows were then read in without aborting).
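If editing the source file isn't an option, a similar effect can sometimes be had from within R by disabling quote handling altogether. This is a sketch assuming a "|"-separated file like the one described, not the exact call that was used:
# hypothetical file name; quote = "" makes stray " characters ordinary text
dat <- read.table("data.txt", sep = "|", header = TRUE,
                  quote = "", comment.char = "",
                  stringsAsFactors = FALSE)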

R: invalid multibyte string [duplicate]

This question already has answers here:
Invalid multibyte string in read.csv
(9 answers)
Closed 8 years ago.
I use read.delim(filename) without any parameters to read a tab delimited text file in R.
df = read.delim(file)
This worked as intended. Now I have a weird error message and I can't make any sense of it:
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) :
invalid multibyte string at '<fd>'
Calls: read.delim -> read.table -> type.convert
Execution halted
Can anybody explain what a multibyte string is? What does <fd> mean? Are there other ways to read a tab-delimited file in R? I have column headers, and some lines do not have data for all columns.
I realize this is pretty late, but I had a similar problem and I figured I'd post what worked for me. I used the iconv utility (e.g., "iconv file.pcl -f UTF-8 -t ISO-8859-1 -c"). The "-c" option skips characters that can't be translated.
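For anyone who prefers to stay in R, a rough equivalent of that shell command is sketched below; it is not from the original answer, and the file name is simply the one mentioned above:
# drop characters that cannot be represented in ISO-8859-1
# (sub = "" plays the role of the command-line -c flag)
lines <- readLines("file.pcl", encoding = "UTF-8")
lines <- iconv(lines, from = "UTF-8", to = "ISO-8859-1", sub = "")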
If you want an R solution, here's a small convenience function I sometimes use to find where the offending (multibyte) character is lurking. Note that the offending character is the one just after what gets printed. This works because print will work fine, but substr throws an error when multibyte characters are present.
find_offending_character <- function(x, maxStringLength = 256){
  print(x)
  for (c in 1:maxStringLength){
    offendingChar <- substr(x, c, c)
    # print(offendingChar)  # uncomment if you want the individual characters printed
    # the next character is the offending multibyte character
  }
}
string_vector <- c("test", "Se\x96ora", "works fine")
lapply(string_vector, find_offending_character)
I fix that character and run this again. Hope that helps someone who encounters the invalid multibyte string error.
I had a similarly strange problem with a file from the program E-Prime (edat -> SPSS conversion), but then I discovered that there are many additional encodings you can use. This did the trick for me:
tbl <- read.delim("dir/file.txt", fileEncoding="UCS-2LE")
This happened to me because I had the copyright symbol (©) in one of my strings! Once it was removed, the problem was solved.
A good rule of thumb: if you are seeing this error, make sure that characters which don't appear on your keyboard are removed.
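One way to find the culprits before removing them is base R's validUTF8(); this is a sketch, not part of the original answer:
# flag elements whose bytes are not valid UTF-8 (e.g. a latin1 copyright symbol)
x <- c("fine", "copyright \xa9 symbol")
which(!validUTF8(x))
# [1] 2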
I found Leafpad to be an adequate and simple text editor for viewing and saving/converting files in certain character sets, at least in the Linux world.
I used it to save the file from Latin-15 to UTF-8 and it worked.
