Modify encodings of accented characters in value labels - r

I am having a very hard time with accented characters in a Stata file I have to import into R. I solved one problem over here, but there's another problem.
After import, any time I use the look_for() command from the labelled package I get this error.
remotes::install_github("sjkiss/cesdata")
library(cesdata)
data("ces19web")
library(labelled)
look_for(ces19web, "vote")
invalid multibyte string at '<e9>bec Solidaire'
Now I can find one value label that contains that text, but it actually displays properly, so I don't know what is going on.
val_labels(ces19web$pes19_provvote)
But, there are other problematic value labels that cause other problems. For example, the value labels for the 13th variable cause this problem.
library(dplyr)

# This works fine
ces19web %>%
  select(1:12) %>%
  look_for(., "[a-z]")

# This chokes
ces19web %>%
  select(1:13) %>%
  look_for(., "[a-z]")
# See the accented character
val_labels(ces19web[,13])
I have come up with this way of replacing the accented characters of the second type.
names(val_labels(ces19web$cps19_imp_iss_party)) <- iconv(names(val_labels(ces19web$cps19_imp_iss_party)), from = "latin1", to = "UTF-8")
And this even solves the problem for look_for()
# This now works!
ces19web %>%
  select(1:13) %>%
  look_for(., "[a-z]")
But what I need is a way to loop through all of the names of all of the value labels and make this conversion for all the bungled accented characters.
This is so close, but I don't know how to save the results as the new names for the value labels.
library(purrr)

ces19web %>%
  # map onto all the variables and get the value labels
  map(., val_labels) %>%
  # map onto each set of value labels
  map(., ~{
    # Skip if there are no value labels
    if (!is.null(.x)) {
      # If not, convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
    # Return the (possibly modified) labels
    .x
  }) -> out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]
But how do I make that conversion actually stick in the original dataset?
Thank you!

There is a problem with the character variables: all encodings are marked as either "unknown" (i.e. no non-ASCII characters) or UTF-8, but there are strings which are really latin1: for instance, 0xe9 is the latin1 encoding of "é".
Assuming all character variables are actually latin1, you can do this:
ces19web_corr <- as.data.frame(lapply(ces19web, function(v) {
  if (is.character(v)) {
    Encoding(v) <- "latin1"
    v <- iconv(v, from = "latin1", to = "UTF-8")
  } else if (is.factor(v)) {
    lev <- levels(v)
    Encoding(lev) <- "latin1"
    lev <- iconv(lev, from = "latin1", to = "UTF-8")
    levels(v) <- lev
  }
  v
}))
Alternatively, if only some of them have the problem, you will have to select which ones to fix.
Side comment: it might be that you applied my fix from the other post to a data file (or some of its columns) which doesn't have the problem described in your other question. Then you accidentally forced the wrong encoding, and the code above is just forcing back the right one.
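If the value labels themselves need the same treatment, a sketch along these lines might work, assuming every mis-encoded label name is latin1 (val_labels() and its replacement form come from the labelled package):

library(labelled)

# Sketch: convert the names of each variable's value labels in place,
# assuming the mis-encoded label names are latin1
for (v in names(ces19web)) {
  labs <- val_labels(ces19web[[v]])
  if (!is.null(labs)) {
    names(labs) <- iconv(names(labs), from = "latin1", to = "UTF-8")
    val_labels(ces19web[[v]]) <- labs
  }
}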

I don't know if I understand your problem correctly (since the explanations are very verbose), but is it just a matter of reassigning the data frame?
library(magrittr)

ces19web %<>% #### REASSIGN THE DATA FRAME WITH THE %<>% OPERATOR
  # map onto all the variables and get the value labels
  map(., val_labels) %>%
  # map onto each set of value labels
  map(., ~{
    # Skip if there are no value labels
    if (!is.null(.x)) {
      # If not, convert the names as above
      names(.x) <- iconv(names(.x), from = "latin1", to = "UTF-8")
    }
    # Return the (possibly modified) labels
    .x
  }) -> out
#Compare the 16th variable's value labels in the original
ces19web[,16]
#With the 16th set of value labels after the conversion function above
out[[16]]

Related

How to preserve numbers in a character format for Excel with data.table's fwrite

In the example below, when I save the number in a character format (i.e., .10) as a CSV file using data.table's fwrite, Excel displays it as a number rather than a character string.
Is there a way to save the number in a character format that Excel can recognize as a string?
I'm not simply trying to remove the 0 before the decimal point or keep the trailing 0.
I'd like to keep the character strings intact, exactly as they would be displayed in R (e.g., .10 but without the quotation marks).
library(data.table)

dt <- data.table(a = ".10")
fwrite(dt, "example.csv")
# The saved CSV file opened in Excel shows the number 0.1, rather than the string .10
Excel is mostly brain-dead with regards to reading things in as you want. Here's one workaround:
library(data.table)

dat <- data.frame(aschr = c("1", "2.0", "3.00"), asnum = 1:3,
                  hascomma = "a,b", hasnewline = "a\nb", hasquote = 'a"b"')
notnumber <- sapply(dat, function(z) is.character(z) & all(is.na(z) | !grepl("[^-0-9.]", z)))
needsquotes <- sapply(dat, function(z) any(grepl('[,\n"]', z))) & !notnumber
dat[needsquotes] <- lapply(dat[needsquotes], function(z) dQuote(gsub('"', '""', z), FALSE))
dat[notnumber] <- lapply(dat[notnumber], function(z) paste0("=", dQuote(z, FALSE)))
fwrite(dat, "example.csv", quote = FALSE)
In Excel, the result then displays as the intended strings.
Most of this is precautionary: if you know your data well and you know that none of the data contains commas, quotes, or newlines, then you can do away with the needsquotes portions. notnumber identifies the column(s) of interest.
Bottom line, we "trick" Excel into keeping the value a string by forcing it to be an Excel formula. This may not work with other spreadsheet programs (e.g., Calc); I haven't tested.
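Applied to the one-column example from the question, the same trick might look like this (just a sketch; the ="..." form is what makes Excel evaluate the cell back to a text value):

library(data.table)

dt <- data.table(a = ".10")
# Wrap the value in ="..." so Excel treats the cell as a formula returning text
dt[, a := paste0('="', a, '"')]
fwrite(dt, "example.csv", quote = FALSE)
# The CSV cell then contains =".10", which Excel displays as the text .10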

How to adjust read_excel so that the thousands separator is not changed to a decimal point?

Currently, I use the following code to read Excel files (which are stored in a folder on my PC) into a list.
decrease_names <- list.files("4_large_decreases",pattern = ".xlsx",full.names = T)
decrease_list <- sapply(decrease_names,read_excel,simplify = F)
After that, I combine the dataframes into one object by using the following code.
decrease <- decrease_list %>%
keep(function(x) nrow(x) > 0) %>%
bind_rows()
The problem I have is that the Excel files stored in the folder contain decimal points (".") as well as thousands separators (","). I think R (and read_excel() in particular) converts the thousands separators into decimal points, which results in incorrect data.
Although I know that I could remove the thousands separators in Excel first, this would mean a lot of manual work, so I am interested in a solution that recognises the thousands separator and keeps it intact (or removes it; the goal is to keep the data correct).
EDIT: as @dario suggested, I add a snippet of a tibble that is stored in decrease_list after I run the code. The snippet looks like this:
Raised    Avg. change
526.000   2.04
186.000   3.24
...
In the column Raised, the "." used to be a "," (a thousands separator) but has become a "."; the "." in Avg. change was a decimal point already.
Assuming that each Excel file contains data in the same format, we can apply the following code:
library(tidyverse)
library(readxl)

decrease_names <- list.files("4_large_decreases", pattern = ".xlsx", full.names = TRUE)

# 10 columns as written in your comment
decrease_list <- sapply(decrease_names, readxl::read_excel, col_types = rep("text", 10L), simplify = FALSE)

# Not tested
decrease <- decrease_list %>%
  keep(function(x) nrow(x) > 0) %>%
  bind_rows() %>%
  mutate(across(where(is.character), ~ as.numeric(gsub(",", "", .x))))
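As a quick illustration of the final mutate() step, reading everything as text and stripping the commas before converting avoids the separator problem:

# e.g. a "Raised" value read in as text
as.numeric(gsub(",", "", "526,000"))
# > 526000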

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as column names. Everything is separated by commas, and the values are in quotation marks except for the first one, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (by the way, I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first column.
Then with
library(readr)
df <- read_delim("file.csv",
                 delim = ",",
                 quote = "",
                 escape_double = FALSE,
                 escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, and the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless. The file is also 300k lines, so it would be hard to inspect it.
Both read.delim and fread give warning messages; I can include them if they might be useful.
Update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried read.csv with the input you provided and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character needs to be escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with one row and six columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"

blank elements in Excel: how to fill them as 0 in R

I have a CSV file created in Excel to be read into R using the read.csv function. However, in the Excel file, some elements are blank, and those blanks indicate 0s. When I read the file into R, those elements are still blank. How can I fill these elements with 0s in R? It seems that is.na-like functions won't apply to this situation. Thanks!
It depends on how they're being read into R. Blank cells in a numeric column should usually be interpreted as NA, in which case
your_df$your_column[is.na(your_df$your_column)] <- 0
should work. Your question suggests that doesn't work, in which case they might be read in as empty characters. In that case,
your_df$your_column[your_df$your_column==""] <- 0
ought to do it. If you post a reproducible example (e.g. with a link to the file on Dropbox) it will be possible to be more specific.
Like Drew says, the way to get NAs from blanks is when you read the data in. Provide code and example output of the read-in data for better responses. Also play around with str(), as you can see what classes are in the data, which is valuable info.
You may run into some hanky panky (i.e., the column is read as a factor or character rather than a numeric vector) if you have blank cells and the column is numeric. This approach would address that:
## Make up some data
dat <- data.frame(matrix(c(1:3, "", "", 1:3, "", 1:3, rep("", 3), 5), 4))

data.frame(apply(dat, 2, function(x) {
  x[x == ""] <- 0
  as.numeric(x)
}))
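A sketch of the read-time approach mentioned above, using a placeholder file name and assuming the blanks should become NA on import and then 0 (na.strings is a standard read.csv argument):

# Treat empty cells as NA when reading, then replace every NA with 0
your_df <- read.csv("your_file.csv", na.strings = c("", "NA"))
your_df[is.na(your_df)] <- 0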

R: Import CSV with column names that contain spaces

The CSV file looks like this (modified for brevity). Several columns have spaces in their titles, and R can't seem to distinguish them.
Alias;Type;SerialNo;DateTime;Main status; [...]
E1;E-70;781733;01/04/2010 11:28;8; [...]
Here is the code I am trying to execute:
s_data <- read.csv2(file = f_name)
attach(s_data)
s_df = data.frame(
  scada_id = ID,
  plant = PlantNo,
  date = DateTime,
  main_code = Main status,
  seco_code = Additional Status,
  main_text = MainStatustext,
  seco_test = AddStatustext,
  duration = Duration)
detach(s_data)
I have also tried substituting
main_code=Main\ status
and
main_code="Main status"
Unless you specify check.names=FALSE, R will convert column names that are not valid variable names (e.g. contain spaces or special characters or start with numbers) into valid variable names, e.g. by replacing spaces with dots. Try names(s_data). If you do use check.names=FALSE, then use single back-quotes (`) to surround the names.
I would also recommend using rename from the reshape package (or, these days, dplyr::rename).
s_data <- read.csv2( file=f_name )
library(reshape)
s_df <- rename(s_data,ID="scada_id",
PlantNo="plant",DateTime="date",Main.status="main_code",
Additional.status="seco_code",MainStatustext="main_text",
AddStatustext="seco_test",Duration="duration")
For what it's worth, the tidyverse tools (i.e. readr::read_csv) have the opposite default; they don't transform the column names to make them legal R symbols unless you explicitly request it.
s_data <- read.csv( file=f_name , check.names=FALSE)
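With check.names=FALSE the names keep their spaces, so they must be wrapped in back-quotes when referenced; a brief sketch, assuming the remaining headers match the names used in the question:

s_data <- read.csv(file = f_name, check.names = FALSE)

# Column names containing spaces must be back-quoted
s_df <- data.frame(
  main_code = s_data$`Main status`,
  seco_code = s_data$`Additional Status`)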
I believe spaces get replaced by dots "." when importing CSV files. So you'd write e.g. Main.status. You can check by entering names(s_data) to see what the names are.
