R can't fix Umlaut encoding after trimming white spaces

I'm working with data from many different sources, so I'm creating a name bridge and a function to make it easier to join tables. One of the sources uses an umlaut in a value and (I think) the Excel CSV isn't UTF-8 encoded, so I'm getting strange results.
Since I can't control how the other source compiles their data, I'd like to make a universal function that fixes all the weird encoding rules. I'll use Dennis Schröder as an example name.
One particular source uses the umlaut, and when I read it in with read.csv and view the table in RStudio, it shows up as Dennis Schr<f6>der. However, if I index the table at his value (table[i,j]), the console reads Dennis Schr\xf6der.
So in my name-bridge csv, I made a row to map all Dennis Schr\xf6der to Dennis Schroder. I read this name bridge in (with the condition allowEscapes = TRUE), and he shows up exactly the same in my name-bridge table. Great! I should be able to left_join this to the other source to change the name to just Dennis Schroder.
But unfortunately the names still don't map unless I don't trim strings (I have to trim strings in general because other sources introduce white spaces). Here's the general function I use to fix names: dataframe is the other source's table, VarUse is the name column from dataframe that I want to fix, and correctionTable is my name bridge.
nameUpdate <- dataframe %>%
  mutate(name = str_trim(VarUse, 'both')) %>%
  left_join(correctionTable, by = c('name' = 'WrongName'))
When I dig into the results of this mapping, I get the following:
correctionTable[14,1] is my name-bridge input of "Dennis Schr\xf6der".
nameUpdate[29,3] is the original name variable from the other source which reads "Dennis Schr\xf6der".
nameUpdate[29,19] is the mutated name variable from the other source after using str_trim, which also reads "Dennis Schr\xf6der".
However, for some reason the str_trim version is not equal to the name-bridge value, so it won't map.
In writing up this (non-reproducible, sorry) example, I've figured out a work-around by joining both with and without str_trim, but at this point I'm just confused about why the name doesn't get fixed after I use str_trim. The values look exactly the same.
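One likely explanation (hard to verify without a reproducible example): stringr functions return UTF-8 strings, so str_trim silently converts the Latin-1 bytes of "Schr\xf6der" to UTF-8, while the untouched value in the name bridge keeps its original bytes; the two then print identically but are no longer byte-equal. A minimal sketch of a fix along those lines, assuming the affected columns are Latin-1 encoded (dataframe, VarUse, and correctionTable as described above):
library(dplyr)
library(stringr)
# Normalise both sides to UTF-8 before trimming and joining, so the
# comparison happens on identical byte representations.
nameUpdate <- dataframe %>%
  mutate(name = str_trim(iconv(VarUse, from = "latin1", to = "UTF-8"), side = "both")) %>%
  left_join(
    correctionTable %>%
      mutate(WrongName = iconv(WrongName, from = "latin1", to = "UTF-8")),
    by = c("name" = "WrongName")
  )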

Related

Rename a column with R

I'm trying to rename a specific column in my R script using the colnames function, but with no success so far.
I'm kind of new to programming, so it may be something simple to solve.
Basically, I'm trying to rename a column called Reviewer Overall Notes to Nota Final in a data frame called notas, with the code:
colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
and it returns to me:
> colnames(notas$`Reviewer Overall Notes`) <- `Nota Final`
Error: object 'Nota Final' not found
I also found in this post code that goes:
colnames(notas) [13] <- `Nota Final`
But it also returns the same message.
What am I doing wrong?
P.S.: Sorry for any misspelling; English is not my primary language.
You probably want
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
(@Whatif's answer shows how you can do this with the numeric index, but it's probably better practice to do it this way; working with strings rather than column indices makes your code both easier to read [you can see what you're renaming] and more robust [in case the order of columns changes in the future].)
Alternatively,
notas <- notas %>% dplyr::rename(`Nota Final` = `Reviewer Overall Notes`)
Here you do use backticks, because the tidyverse (of which dplyr is a part) prefers its arguments to be passed as symbols rather than strings.
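For instance, a self-contained toy version of both approaches (hypothetical data):
library(dplyr)
# check.names = FALSE keeps the space in the original column name
notas <- data.frame(`Reviewer Overall Notes` = c(7.5, 9.0), check.names = FALSE)
colnames(notas)[colnames(notas) == "Reviewer Overall Notes"] <- "Nota Final"
# or, equivalently: notas <- notas %>% rename(`Nota Final` = `Reviewer Overall Notes`)
names(notas)  # "Nota Final"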
Why use backticks? Use normal quotation marks:
colnames(notas)[13] <- 'Nota Final'
This seems to matter:
df <- data.frame(a = 1:4)
colnames(df)[1] <- `b`
Error: object 'b' not found
You should not use single or double quotes in naming.
I have learned that we should not use spaces in names. A name with spaces still works; it is called a non-syntactic name. According to Hadley Wickham's description in the Advanced R book, this is due to historical reasons:
"You can also create non-syntactic bindings using single or double quotes (e.g. "_abc" <- 1) instead of backticks, but you shouldn’t, because you’ll have to use a different syntax to retrieve the values. The ability to use strings on the left hand side of the assignment arrow is an historical artefact, used before R supported backticks."
To get an overview of what syntactic names are, use ?make.names:
make.names("Nota Final")
[1] "Nota.Final"

R read csv with comma in column

Update 2020-5-14
Working with a different but similar dataset from here, I found read_csv seems to work fine. I haven't tried it with the original data yet though.
Although the replies didn't solve the problem, because my question was not correct, Shan's reply fits the original question I posted best, so I accepted his answer.
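For reference, a minimal sketch of that read_csv route (assuming the readr package is installed; read_csv keeps quoted fields intact by default):
library(readr)
listings <- read_csv("listings.csv")
problems(listings)  # lists any rows and columns that did not parse cleanly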
Update 2020-5-12
I think my original question is not correct. As mentioned in the comments, the data was quoted. Although changing the separator made row 11582 in R look the same as row 11583 in Excel, that doesn't mean it's "right". Maybe there are some incorrect line breaks due to inappropriate encoding or something, causing some of the columns to be displaced. If I open the data with Notepad++, the instance at row 11583 in Excel is at row 11596.
Original question
I am trying to read the listings.csv from this dataset on Kaggle into R. I downloaded the file and wrote the code read.csv('listings.csv'). The first column, the column id, is supposed to be numeric. However, it shows:
listings$id[1:10]
[1] 2015 2695 3176 3309 7071 9991 14325 16401 16644 17409
13129 Levels: Ole Berl穩n!,16736423,Nerea,Mitte,Parkviertel,52.55554132116211,13.340658248460871,Entire home/apt,36,6,3,2018-01-26,0.16,1,279\n17312576,Great 2 floor apartment near Friederich Str MITTE,116829651,Selin,Mitte,Alexanderplatz,52.52349354926847,13.391003496971203,Entire home/apt,170,3,31,2018-10-13,1.63,1,92\n17316675,80簡 m of charm in 3 rooms with office space,116862833,Jon,Neuk繹lln,Schillerpromenade,52.47499080234379,13.427509313575928...
I think it is because there are values with commas in the second column. For example, opening the file with Microsoft Excel, I can see one of the values in the second column is Ole,Ole....
How can I read a csv file into R correctly when some values contain commas?
Since you have access to the data in Excel, you can 'Save As' in Excel with a separator other than a comma (,). First go into Control Panel -> Region and Language -> Additional settings, where you can change the "List separator". The most common one other than a comma is the pipe symbol (|). In R, when you read the file in, specify the separator as '|'.
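Roughly like this (assuming the re-exported file is still called listings.csv):
listings <- read.csv("listings.csv", sep = "|", stringsAsFactors = FALSE)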
You could try this:
listings <- read.csv("listings.csv", stringsAsFactors = FALSE)
listings$name <- gsub(",", "", listings$name)  # removes the commas in the name column
If you don't need the information in the second column, then you can always delete it (in Excel) before importing into R. The read.csv function, which calls scan, can also omit unwanted columns using the colClasses argument. However, the fread function from the data.table package does this much more simply with the drop argument:
library(data.table)
listings <- fread("listings.csv", drop=2)
If you do need the information in that column, then other methods are needed (see other solutions).

Cleaning a column with break spaces that contain last, first names so I can filter it from my data frame

I'm stumped. My issue is that I want to grab specific names from a given column. However, when I try to filter for them, I get most of the names except for a few, even though I can clearly see those names in the original Excel file. I think it has to do with some sort of special characters or spacing in the name column. I am confused about how I can fix this.
I have tried applying Excel's CLEAN() function to the given column. I have tried building an Alteryx flow to clean the data. None of these steps has helped. I am starting to wonder if this is an R issue.
surveyData %>% filter(`Completed By` == "Spencer,(redbox with whitedot in middle)Amy")
surveyData %>% filter(`Completed By` == "Spencer, Amy")
In R, the first line has this red box with a white dot between the comma and the first name. I got the red box with the white dot by copying the name from the data frame into Notepad and then pasting it into R. This actually works and returns what I want. The second case uses a standard space, which doesn't return what I want. So how can I fix this issue without having to copy a name from the data frame into Notepad and then copy the result from Notepad into R to get the red box with the white dot between the comma (,) and the first name?
Expected results is that I get the rows that are attached to what ever name I filter by.
I was able to find the answer: it turns out the space is actually a non-breaking space, Unicode U+00A0, as opposed to the normal space, U+0020. The non-breaking space is not part of ASCII (the American Standard Code for Information Interchange), so R's filter() couldn't grab some names, because they contained non-breaking spaces. I fixed this by substituting the non-breaking space with a normal space across the given column. Example below:
space_fix = gsub("\u00A0", " ", surveyData$`Completed By`, fixed = TRUE)  # replace the non-breaking space (U+00A0) with a normal space in the column of interest
surveyData$`Completed By Clean` = space_fix
Once I applied this, I could easily filter any name!
Thanks everyone!
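For what it's worth, a quick way to spot such characters without the Notepad round-trip (a small sketch against the same column, assuming it is UTF-8 encoded):
grep("\u00A0", surveyData$`Completed By`, fixed = TRUE)  # rows containing a non-breaking space
utf8ToInt(surveyData$`Completed By`[1])  # code points of one value: 160 is U+00A0, 32 a normal space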

R: Extract value and lines after key word (text file mining)

Setting:
I have (simple) .csv and .dat files created by laboratory devices and other programs storing information on measurements or calculations. I have found solutions for other languages, but not for R.
Problem:
Using R, I am trying to extract values to quickly display results without opening the created files. Here I have two typical settings:
a) I need to read a priori unknown values after known key words
b) I need to read lines after known key words or lines
I can't make functions such as scan() and grep() work.
c) Finally, I would like to loop over dozens of files in a folder and get a summary (to make the picture complete: I will manage this part).
I would appreciate any form of help.
OK, it works for the key value (although perhaps not very nicely):
ks <- scan("file.csv", what = character(), sep = "")
This returns a character vector of every whitespace-separated token in the file.
ks[grep("keyword", ks) + 2]  # + 2 because the actual value is stored two tokens ahead
This returns the sought values as character strings.
as.numeric(gsub(",", ".", ks[grep("keyword", ks) + 2]))
For completeness: the values had to be converted to numbers, and the decimal comma had to be replaced with a decimal point (gsub is vectorized, so no lapply is needed).
In one line:
data <- as.numeric(gsub(",", ".", ks[grep("Ks_Boden", ks) + 2]))
Perseverance is not too bad of an asset ;-)
The rest isn't finished yet; I will post once it is.
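In case it helps, a minimal sketch of the looping step (c), assuming all files share the same layout and sit in one folder ("data_folder" is a hypothetical path):
files <- list.files("data_folder", pattern = "\\.(csv|dat)$", full.names = TRUE)
results <- sapply(files, function(f) {
  ks <- scan(f, what = character(), sep = "", quiet = TRUE)
  as.numeric(gsub(",", ".", ks[grep("Ks_Boden", ks) + 2]))
})
results  # one extracted value (or set of values) per file, named by file path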

Is there a way to check the spelling of words in a character vector?

The text to be checked is in Greek, but I would like to know if it can be done for English words too. My initial idea is described here, and I have already found a way to do it using VBA. But I wonder if there's a way to do it using R. If there isn't a way in R, can you think of something better than Excel VBA?
Alternatively, OpenOffice ships with a dictionary whose entries are stored in a text file. You can read that in and remove the word definitions to create your word list.
This was tested on v3.0; the file location may have shifted, and the filename will change depending on which dictionary you want.
library(stringr)
dict <- readLines("C:/Program Files/OpenOffice.org 3/share/uno_packages/cache/uno_packages/174.tmp_/dict-en.oxt/th_en_US_v2.dat")
is_word <- str_detect(dict, "^[^(]")
words <- str_split_fixed(dict[is_word], "\\|", 2)
words <- words[,1]
This list contains some multi-word phrases. You may prefer to split on the first space, and take unique values. You probably also want to write words to file, to save repeating yourself.
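For instance, a small sketch of both suggestions ("wordlist.txt" is a hypothetical file name):
words <- unique(str_split_fixed(words, " ", 2)[, 1])  # keep only the first word of each phrase
writeLines(words, "wordlist.txt")                     # save the cleaned list for reuse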
Once this is done, checking a word is as easy as
c("persnickety", "sqwrzib") %in% words # TRUE FALSE
There exists an open-source GNU spell checker called Aspell with support for various languages. It is a command-line program, which I basically use for scanning bunches of text files at once (the output is then just printed to the console).
But there is also a C API and, perhaps more interestingly for you, a pipe mode which accepts streams of text and writes to standard output.
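For example, a minimal sketch of driving the pipe mode from R (this assumes the aspell binary is on your PATH; "-a" starts pipe mode, and the leading "^" escapes each input line so it is never mistaken for a pipe-mode command):
check_words <- function(words, lang = "en") {
  system2("aspell", args = c("-a", paste0("--lang=", lang)),
          input = paste0("^", words), stdout = TRUE)
}
check_words(c("persnickety", "sqwrzib"))
# Lines starting with "*" mean the word is correct; "&" lines carry suggestions.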
Hope this helps.
