I have a few XML files that contain Japanese characters. When I convert them to CSV, the Japanese characters are replaced with code points, e.g. <U+FA32>.
I want to keep the characters as they are while converting to CSV or Excel format.
I tried changing the locale and the RStudio settings, but nothing works. The XML files contain a lot of data, and some fields hold email bodies with raw text that includes special characters.
Here is my code for converting XML to CSV:
library(XML)
library(plyr)

for (f in file) {
  doc  <- xmlParse(f, useInternalNodes = TRUE, encoding = "UTF-8")
  xL   <- xmlToList(doc)
  data <- ldply(xL, data.frame)
  write.csv(data, paste0(f, ".csv"), row.names = FALSE, fileEncoding = "UTF-8")
}
Please help with a solution. If the files can be converted to CSV in some other way than R, that would also help.
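One workaround worth trying (a sketch, not a confirmed fix, reusing the data and f objects from the loop above) is to write an xlsx file with the writexl package, or a UTF-8 CSV with a byte order mark via readr::write_excel_csv, so that Excel picks up the encoding instead of the characters being escaped:
library(writexl)
library(readr)

write_xlsx(data, paste0(f, ".xlsx"))      # xlsx output; writexl stores text as UTF-8
write_excel_csv(data, paste0(f, ".csv"))  # UTF-8 csv with a BOM so Excel detects the encoding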
Related
I have a dataset with values that use an apostrophe as the thousands separator, e.g. 3'203.12. I can read those values with read.table, but when I want to plot them, the values above 1000 are converted to NAs because of the apostrophe. How can I prevent this, or alternatively, how can I remove all apostrophes in a text file?
Open the file in a text editor (e.g. on Windows, right-click the file, choose Open With, and select Notepad) and replace all apostrophes with nothing (Ctrl-H in Notepad, put ' under Find What, leave Replace With empty, then click Replace All). Save the file under a different name (e.g. if the file was called dummy.csv, save it as dummy_mod.csv) and then use read.table to load dummy_mod.csv.
If this does not help, then please edit your question and provide a sample of the file you are trying to read and the R code that you wrote to read it.
If you want to remove the apostrophes from within R:
infile  <- file('name-of-original-file.csv')
outfile <- file('apostrophes-gone.csv')

# read the lines, strip every apostrophe, then write the cleaned copy
readLines(infile) |>
  (\(line_in) gsub("'", "", line_in))() |>
  (\(line_out) writeLines(line_out, outfile))()

close(infile)
close(outfile)
Then, read in the cleaned data file with the tool of your choice. I find import() from the {rio} package very convenient: df <- import('apostrophes-gone.csv')
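An alternative sketch that avoids rewriting the file: read everything as character and strip the apostrophes after the fact (the column name value here is just a placeholder for whichever column holds the numbers):
df <- read.table("name-of-original-file.csv", header = TRUE, sep = ",",
                 colClasses = "character")        # keep the raw strings
df$value <- as.numeric(gsub("'", "", df$value))   # drop apostrophes, then convert to numeric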
Good day! I need some help with this. I am trying to do some text mining where I count the word frequencies for every word in a text. It works fine in R with all the different characters, but the problem arises when I export the result to a .csv file. I am working on Hungarian text, and when I save my data frame to .csv, three accented letters (ő, ű, ú) get converted to non-accented ones (o, u and u). This doesn't happen when the file is an .rds, but I need a .csv so one of my consultants (zero knowledge of programming) can look at it in a normal Excel file. I tried some tricks, e.g. making sure Notepad++ is in UTF-8 format and adding an argument (fileEncoding = "UTF-8" or encoding = "UTF-8") when writing the .csv file with the write.csv command, but it doesn't work.
Hope you can help me.
Thank you.
write.csv() works with the three characters you mentioned in the question.
Example
First create a data.frame containing special characters
library(tidyverse)
# Create an RDS file
test_df <- "ő, ű, ú" %>%
  as.data.frame()
test_df
# .
# 1 ő, ű, ú
Save it as an RDS
saveRDS(test_df, "test.RDS")
Now read in the RDS, save as csv, and read it back in:
# Read in the RDS
df_with_special_characters <- readRDS("test.RDS")
write.csv(df_with_special_characters, "first.csv", row.names=FALSE)
first <- read.csv("first.csv")
first
# .
# 1 ő, ű, ú
We can see above that the special characters are still there!
Extra note
If you have even rarer special characters, you could try setting the file encoding, like so:
write.csv(df_with_special_characters, "second.csv", fileEncoding = "UTF-8", row.names=FALSE)
second <- read.csv("second.csv")
# second
# # .
# # 1 ő, ű, ú
With the writexl package you could use write_xlsx(...) to write an xlsx file instead. It should handle unicode just fine.
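For example, a minimal sketch (the output file name third.xlsx is arbitrary):
library(writexl)
write_xlsx(df_with_special_characters, "third.xlsx")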
I am using R on Windows 10 x64. I am trying to read a set of txt files into R to do text analysis. I am using the following code:
library(tm)

setwd(inputdir)
files <- DirSource(directory = inputdir, encoding = "UTF-8")
docs  <- VCorpus(x = files)
writeLines(as.character(docs[[2]]))
The last line is intended to show the content of document #2, which this code shows as empty (as is every other document in the set). I am not sure why. I checked the encoding of the txt documents (open, then choose "Save As") and the encoding is "Unicode". When I manually save any of the files as "ANSI", writeLines(as.character(docs[[2]])) shows the content properly. I therefore thought I should convert all files to ANSI. How can I do that in R for all txt files in my inputdir?
Get all the txt files:
files <- list.files(path = getwd(), pattern = "*.txt", full.names = TRUE, recursive = FALSE)
Loop over the files, convert the encoding, and overwrite each one (the source and target encodings below are only examples; adjust them to match your files):
for (i in seq_along(files)) {
  input <- readLines(files[i])
  converted_input <- iconv(input, from = "UTF-8", to = "CP1252")
  writeLines(converted_input, files[i])
}
The available encodings can be listed with the iconvlist() command.
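Alternatively, if the files are UTF-16 (what Notepad's Save As dialog calls "Unicode"), you can read a file directly with the right encoding instead of rewriting it; a minimal sketch with a placeholder file name:
con <- file("example.txt", encoding = "UTF-16LE")  # open a connection that re-encodes on read
lines <- readLines(con)
close(con)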
I'm trying to read a data file into R, but every time I do, R changes the headers. I can't see any way to control this in the documentation for the read functions.
I have the same data saved as both a csv and a tsv but get the same problem with both.
The headers in the data file look like this when I open it in excel or in the console:
cod name_mun age_class 1985 1985M 1985F 1986 1986M 1986F
But when I read it into R using either read.csv('my_data.csv') or read.delim('my_data.tsv') R changes the headers to this:
> colnames(my_data)
[1] "ï..cod" "name_mun" "age_class" "X1985" "X1985M" "X1985F" "X1986"
[8] "X1986M" "X1986F"
Why does R do this and how can I prevent it from happening?
You are seeing two different things here.
The "ï.." on the first column comes from having a byte order mark at the beginning of your file. Depending on how you created the file, you may be able to save as just ASCII or even just UTF-8 without a BOM to get rid of that.
R does not like to have variable names that begin with a digit. If you look at the help page ?make.names you will see
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
You can get around that when you read in your data by using the check.names argument to read.csv possibly like this.
my_data = read.csv(file.choose(), check.names = FALSE)
That will keep the column names exactly as they appear in the file, so the year columns stay 1985, 1985M, and so on. Note that check.names = FALSE also leaves the byte order mark in the first column name, so it may show up as "ï»¿cod" unless you remove the BOM.
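You can also strip the byte order mark at read time by telling read.csv that the file is UTF-8 with a BOM; a minimal sketch assuming the file name from the question:
my_data <- read.csv("my_data.csv", fileEncoding = "UTF-8-BOM", check.names = FALSE)
# the first column should now be named just "cod", and the year columns keep their original names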
I have a large data frame (df). I want to export it as an Excel file. I am using the WriteXLS function from the WriteXLS library. Everything is fine except for some non-English characters: for "İ", "ı", and some other characters, a strange symbol ("�") is printed in the Excel sheet. I guess it is an encoding issue. The code used is:
WriteXLS("df", "C:/Users/ozgur/Desktop/df.xlsx",Encoding = "UTF-8", col.names = TRUE,perl = "perl")
The base exporting function
write.table(df, "C:/Users/ozgur/Desktop/df.txt", sep = "\t", col.names = TRUE)
does not work well either; it does not keep the data types as they are in R.
How can I overcome this problem? I would be very glad for any help. Thanks a lot.
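As noted in an earlier answer above, the writexl package handles Unicode well, so one alternative worth trying is simply (a sketch, reusing the df object and output path from the question):
library(writexl)
write_xlsx(df, "C:/Users/ozgur/Desktop/df.xlsx")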