Convert character to numeric without NA in R

I know this question has been asked many times (Converting Character to Numeric without NA Coercion in R, Converting Character\Factor to Numeric without NA Coercion in R, etc.) but I cannot seem to figure out what is going on in this one particular case (Warning message:
NAs introduced by coercion). Here is some reproducible data I'm working with.
#dependencies
library(rvest)
library(dplyr)
library(pipeR)
library(stringr)
library(translateR)
#scrape data from website
url <- "http://irandataportal.syr.edu/election-data"
ir.pres2014 <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="content"]/div[16]/table') %>%
html_table(fill = TRUE)
ir.pres2014<-ir.pres2014[[1]]
colnames(ir.pres2014)<-c("province","Rouhani","Velayati","Jalili","Ghalibaf","Rezai","Gharazi")
ir.pres2014<-ir.pres2014[-1,]
#Get rid of unnecessary rows
ir.pres2014<-ir.pres2014 %>%
subset(province!="Votes Per Candidate") %>%
subset(province!="Total Votes")
#Get rid of commas
clean_numbers = function (x) str_replace_all(x, '[, ]', '')
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
#remove any possible whitespace in string
no_space = function (x) gsub(" ","", x)
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(no_space), -province)
This is where things start going wrong for me. I tried each of the following lines of code but I got all NA's each time. For example, I begin by trying to convert the second column (Rouhani) to numeric:
#First check class of vector
class(ir.pres2014$Rouhani)
#convert character to numeric
ir.pres2014$Rouhani.num<-as.numeric(ir.pres2014$Rouhani)
Above returns a vector of all NA's. I also tried:
as.numeric.factor <- function(x) {seq_along(levels(x))[x]}
ir.pres2014$Rouhani2<-as.numeric.factor(ir.pres2014$Rouhani)
And:
ir.pres2014$Rouhani2<-as.numeric(levels(ir.pres2014$Rouhani))[ir.pres2014$Rouhani]
And:
ir.pres2014$Rouhani2<-as.numeric(paste(ir.pres2014$Rouhani))
All those return NA's. I also tried the following:
ir.pres2014$Rouhani2<-as.numeric(as.factor(ir.pres2014$Rouhani))
That created a list of single digit numbers so it was clearly not converting the string in the way I have in mind. Any help is much appreciated.

The reason is what looks like a leading space before the numbers:
> ir.pres2014$Rouhani
[1] " 1052345" " 885693" " 384751" " 1017516" " 519412" " 175608" …
Just remove that as well before the conversion. The situation is complicated by the fact that this character isn’t actually a space, it’s something else:
mystery_char = substr(ir.pres2014$Rouhani[1], 1, 1)
charToRaw(mystery_char)
# [1] c2 a0
These two bytes are the UTF-8 encoding of a non-breaking space (U+00A0), which web pages often use for padding. It needs to be removed as well:
str_replace_all(x, rawToChar(as.raw(c(0xc2, 0xa0))), '')
Furthermore, you can simplify your code by applying the same transformation to all your columns at once:
mystery_char = rawToChar(as.raw(c(0xc2, 0xa0)))
to_replace = sprintf('[,%s]', mystery_char)
clean_numbers = function (x) as.numeric(str_replace_all(x, to_replace, ''))
ir.pres2014 = ir.pres2014 %>% mutate_each(funs(clean_numbers), -province)
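To illustrate the idea without scraping the page, here is a minimal base R sketch. The values in raw_votes are made up for illustration; only the leading non-breaking space (U+00A0) and the commas mirror what is described above.

```r
# Hypothetical sample values: leading non-breaking space plus thousands separators
raw_votes <- c("\u00a01,052,345", "\u00a0885,693", "384,751")

# Code point of the first character; 160 (hex A0) is the non-breaking space
first_code_point <- utf8ToInt(raw_votes[1])[1]

# Step 1: drop the non-breaking space (fixed = TRUE matches it literally)
no_nbsp <- gsub("\u00a0", "", raw_votes, fixed = TRUE)

# Step 2: drop commas and ordinary whitespace, then convert to numeric
clean_votes <- as.numeric(gsub("[,[:space:]]", "", no_nbsp))
```

Inspecting a suspicious value with utf8ToInt() or charToRaw() before cleaning is a quick way to find out which invisible character is actually present.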

Related

How do I create a loop to change the text encoding of the labels in labelled variables in R

I have imported a stata file that is giving me some encoding problems in the value labels. On import, using labelled::lookfor for any keyword returns this error:
Error in structure(as.character(x), names = names(x)) :
invalid multibyte string at '<e9>bec Solidaire'
Knowing the data set, that is almost certainly a value label.
How do I loop through the data set, fix the encoding problem in the names of the value labels, and then reset them? I think I have found a way to fix the problematic characters, but I don't know how to replace the original names.
library(labelled)
library(dplyr)
library(purrr)
v <- labelled(c(1,2,2,2,3,9,1,3,2,NA), c(yes = 1, "Bloc Qu\xe9b\xe9cois" = 3, "don't know" = 9))
x<- labelled(c(1,2,2,2,3,9,1,3,2,NA), c("Bloc Qu\xe9b\xe9cois" = 1, no = 3, "don't know" = 9))
mydat<-data.frame(v=v, x=x)
glimpse(mydat)
mydat %>%
map(., val_labels)
#This works individually
iconv(names(val_labels(x)), from="latin1", to="UTF-8")
#And this seems to work looping over each variable, but how do I store it?
mydat %>%
map(., function(x) iconv(names(val_labels(x)), from="latin1", to="UTF-8"))
This seems to be a bit tough to do in one simple step, so here I used some helper functions:
conv_names <- function(x) {
setNames(x, iconv(names(x), from="latin1", to="UTF-8"))
}
conv_val_labels <- function(x) {
val_labels(x) <- conv_names(val_labels(x))
x
}
mydat <- map_dfc(mydat, conv_val_labels)
We map the function over each column and reassign the result to the data frame. Note that map_dfc() is used to combine the converted columns back into a data frame.
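The core of the helper is the iconv() plus setNames() step, which can be tried in isolation. The sketch below uses a plain named vector (party_codes, a hypothetical stand-in for a set of value labels) so that it runs without the labelled package.

```r
# Hypothetical named vector standing in for a set of value labels;
# the second name contains latin1 bytes (\xe9)
party_codes <- setNames(c(1, 3, 9), c("yes", "Bloc Qu\xe9b\xe9cois", "don't know"))

# The latin1 name is not valid UTF-8 before conversion
valid_before <- validUTF8(names(party_codes))

# Re-encode the names and store them back with setNames()
fixed_codes <- setNames(party_codes,
                        iconv(names(party_codes), from = "latin1", to = "UTF-8"))
valid_after <- validUTF8(names(fixed_codes))
```

validUTF8() gives a cheap check, before and after, that the conversion touched the names that needed it.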

Extract emails in brackets

I work with gmailR and need to extract email addresses from angle brackets <> (sometimes there are several in one row), but elements without brackets (e.g. name#mail.com) must be kept as they are.
This is an example:
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>","no-reply#cdon.com" ,
"<rikke.hc#hotmail.com>")
I need output like:
[1] "jsmith#company.ch" "abrown#company.ch"
[2] "no-reply#cdon.com"
[3] "rikke.hc#hotmail.com"
I tried this, intending to merge the two results:
library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )
My data code sample:
from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)
gmail_DK <- gmail_DK %>%
mutate(from = unlist(y1)) %>%
mutate(from = unlist(y2))
But when I apply this to my data (only one day of emails) and unlist, I get:
Error in mutate():
! Problem while computing cc = unlist(cc2).
x cc must be size 103 or 1, not 104.
Run rlang::last_error() to see where the error occurred.
I suppose that with data from more days the difference would be bigger, so I would prefer not to go this way.
An answer in R is preferred, but if you know how to do it in, for example, PowerQuery, that would be great too.
We may also use base R - split the strings at the space that follows the > (strsplit) and then capture the substring between the < and > in sub (in the replacement, we specify the backreference (\\1) of the captured group) - [^>]+ - implies one or more characters that are not a >
sub(".*<([^>]+)>", "\\1", unlist(strsplit(x2,
"(?<=>)\\s+", perl = TRUE)))
[1] "jsmith#company.ch" "abrown#company.ch"
[3] "no-reply#cdon.com" "rikke.hc#hotmail.com"
Clunky but OK?
(x2
## split into single words/tokens
%>% strsplit(" ")
%>% unlist()
## find e-mail-like strings, with or without brackets
%>% stringr::str_extract("<?[\\w-.]+#[\\w-.]+>?")
## drop elements with no e-mail component
%>% na.omit()
## strip brackets
%>% stringr::str_remove_all("[<>]")
)
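Both approaches above flatten the result, so the output can be longer than the input, which is what triggers the size error inside mutate(). A possible way around this, sketched in base R, is to keep one result per input row and collapse multiple addresses into a single string. The pattern and the ", " separator are assumptions for illustration, and the addresses are used exactly as they appear in the sample data.

```r
x2 <- c("John Smith <jsmith#company.ch> <abrown#company.ch>",
        "no-reply#cdon.com",
        "<rikke.hc#hotmail.com>")

# One list element per input string; each holds every address-like token
# (runs of characters that are not brackets or whitespace around the separator)
emails <- regmatches(x2, gregexpr("[^<>[:space:]]+#[^<>[:space:]]+", x2))

# Collapse to one string per row so the length matches the input
from_clean <- vapply(emails, paste, character(1), collapse = ", ")
```

Because from_clean has the same length as x2, it can be assigned as a column directly; emails can instead be kept as a list column if the individual addresses are needed later.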

R mutate & replace string with empty pattern or empty string

I am trying to remove some pattern (to_remove) from another string column (entry) inside mutate().
The problem is that both my string and pattern columns contain some empty strings, so vectorized functions such as stringr::str_remove() produce warnings and slow the process down considerably.
I noticed that without the empty strings and patterns (i.e. if they are replaced with some values) it takes less than 1 second to process about 1e5 rows. With the warnings it takes over 10 seconds.
I am wondering whether there is a way to use stringr::str_remove() inside mutate() while skipping those empty rows, so that I still get the speed benefit of vectorization.
Note that I could also use dplyr::rowwise() + gsub(), but rowwise() slows things down a lot as well.
Example code:
library(tidyverse)
library(stringr)
set.seed(123)
temp <- data.frame(
entry = c('A12','JW13','C','')
,to_remove = c('A','W','','D')
) %>%
sample_n(1e5,replace = T)
temp <- temp %>%
mutate(
removed = str_remove(entry,to_remove)
)
Try replacing the blank values with NA:
library(dplyr)
library(stringr)
temp %>%
mutate(to_remove = na_if(to_remove, ''),
removed = str_remove(entry,to_remove))
We can do
library(dplyr)
library(stringr)
temp %>%
mutate(removed = str_remove(entry, replace(to_remove, to_remove == "", NA)))
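If rows with an empty pattern should keep their entry unchanged, one alternative is to loop over the distinct non-empty patterns and apply a vectorized sub() to the matching rows. This is only a base R sketch: remove_per_row is a hypothetical helper, it assumes to_remove contains no NA values, and it treats patterns as literal text (fixed = TRUE) rather than as regular expressions.

```r
# Hypothetical helper: remove the first occurrence of each row's pattern,
# leaving rows with an empty pattern untouched
remove_per_row <- function(entry, to_remove) {
  out <- entry
  for (p in setdiff(unique(to_remove), "")) {
    rows <- to_remove == p
    # sub() is vectorized over the rows that share this pattern
    out[rows] <- sub(p, "", entry[rows], fixed = TRUE)
  }
  out
}

entry     <- c("A12", "JW13", "C", "")
to_remove <- c("A", "W", "", "D")
removed   <- remove_per_row(entry, to_remove)
```

The loop runs once per distinct pattern, not once per row, so it stays fast when the number of different patterns is small compared with the number of rows.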

How to properly convert a vector of numbers written as characters into numeric type?

I have a problem with my R code.
I want to convert numbers written as characters in the vector powpow into real numbers. As usual I used the as.numeric() function, but I have no idea why it doesn't work.
Here is my code. If anyone knows how to solve my problem, please write.
Thanks in advance.
The problematic part starts with the comment "# średnia i kwantyle powierzchni powiatów woj. wlkp." (Polish for "mean and quantiles of the area of the counties in the Wielkopolskie voivodeship").
############################################################
### Zadanie 1 (Task 1) ###
library(rvest)
library(tidyverse)
library(magrittr)
url <- "https://pl.wikipedia.org/wiki/Wojew%C3%B3dztwo_wielkopolskie"
website_html <- url %>% read_html()
tbls <- website_html %>% html_nodes("table")
tabele <- tbls[11] %>% html_table() %>% as.data.frame()
head(tabele)
tabele <- tabele[, -1]
head(tabele)
length(colnames(tabele))
nazwy <- colnames(tabele)
nazwy[1] <- 'powiat'
nazwy[2] <- 'siedziba'
nazwy[3] <- 'ludnosc'
nazwy[4] <- 'powierzchnia'
nazwy[5] <- 'gestosc'
nazwy[6] <- 'urbanizacja'
nazwy[7] <- 'wyd_budzet'
nazwy[8] <- 'doch_budzet'
nazwy[9] <- 'zadluzenie'
nazwy[10] <- 'stopa'
nazwy -> colnames(tabele)
head(tabele)
powiaty <- tabele # rm(tabele)
# średnia i kwantyle powierzchni powiatów woj. wlkp. (mean and quantiles of county areas in the Wielkopolskie voivodeship)
str(powiaty$powierzchnia)
powpow <- powiaty$powierzchnia
str(powpow)
for(i in 1:length(powpow))
{
powpow[i] <- powpow[i] %>% gsub("\\,", "\\.", ., perl=TRUE) %>% as.numeric()
print(str(powpow[i]))
}
What I want is a powpow vector of numbers, not characters.
as.numeric() only recognises "." as the decimal separator, so a value such as "1,5" is converted to NA. Replace "," with "." before converting:
# replace the decimal comma with a dot, then convert the whole vector at once
powpow_numeric <- as.numeric(sub(",", ".", powpow, fixed = TRUE))
If you read the data from a file yourself, functions such as read.table() and type.convert() take a dec argument that sets the decimal separator.
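It may also help to see why the loop in the question never produces numbers: assigning a numeric value into one element of a character vector coerces that value back to character, so the vector keeps its type. A minimal sketch with made-up values (attempt and powpow_num are names introduced here for illustration):

```r
# Hypothetical sample values with a decimal comma
powpow <- c("12,5", "7,25", "100")

# Element-wise assignment: the numeric 12.5 is coerced back to "12.5"
attempt <- powpow
attempt[1] <- as.numeric("12.5")
class(attempt)   # still "character"

# Converting the whole vector and assigning it to a new name changes the type
powpow_num <- as.numeric(sub(",", ".", powpow, fixed = TRUE))
class(powpow_num)   # "numeric"
```

Converting the whole vector in one call avoids the loop entirely and is also faster.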
You have already loaded the tidyverse, so you can use the parse_number() function from readr to get a numeric vector out of powpow:
parse_number(powpow)
as.numeric(powpow) can do the same, but parse_number() also works when the vector contains non-numeric characters, such as letters.
Building on what you have done, I converted the other variables that need changing as follows:
powiaty <- powiaty %>%
mutate(powierzchnia = parse_number(powierzchnia),
urbanizacja = parse_number(urbanizacja),
wyd_budzet = parse_number(wyd_budzet),
doch_budzet = parse_number(doch_budzet),
# for "zadluzenie" and "stopa" we have to replace ',' with dots before parsing
zadluzenie = str_replace(zadluzenie, ",", "\\."),
stopa = str_replace(stopa, ",", "\\."),
zadluzenie = parse_number(zadluzenie),
stopa = parse_number(stopa))
glimpse(powiaty)

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contain white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If I understood you correctly, you want to remove all white space from the entire data frame. I guess the code you are using is only good for removing spaces in the column names. I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix, however. If you want a data frame instead, do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and the trimws function with which = "both" (the default) removes leading and trailing spaces but not those inside the string. Since no input data was provided by the OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply. Note that a solution using apply seems to work but would be very slow. Also, it is not very clear from the question whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of the two situations: in row 2, trimws removes only the leading and trailing blanks, whereas the regex solution removes every blank.
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/character data frames and manipulates only the character columns.
You could use the trimws function (available since R 3.2.0) on a column:
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
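A possible version of that loop is sketched below. The sample data frame is made up, and the whitespace = "[\\h\\v]" argument (documented for trimws() in R 3.6.0 and later) is an optional addition that also strips non-breaking spaces, which the default setting leaves in place.

```r
# Hypothetical data: the first value starts with a non-breaking space (U+00A0)
myData <- data.frame(a = c("\u00a0x ", " y z\t"),
                     n = 1:2,
                     stringsAsFactors = FALSE)

# Trim only character columns so that numeric columns keep their type
for (col in names(myData)) {
  if (is.character(myData[[col]])) {
    myData[[col]] <- trimws(myData[[col]], whitespace = "[\\h\\v]")
  }
}
```

Dropping the whitespace argument gives the default behaviour, which trims spaces, tabs, and line breaks only.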
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
library(data.table)
setDT(df)
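# trim the column contents (not the column name); restrict to character columns so other types are preserved
char_cols <- names(df)[vapply(df, is.character, logical(1))]
for (j in char_cols) set(df, j = j, value = trimws(df[[j]]))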
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such a file size. However, you have 2 options:
Use the ff and ffbase packages:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
# splits (the number of chunks) is not defined in the original answer; choose it yourself
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x)))
Use sed (my preference):
sed -i -r 's/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//' your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
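The point about apply() clobbering classes can be checked on a small example. In this sketch the data frame df_small is made up; it compares the apply() result with an lapply()-based variant that only touches character columns.

```r
# Hypothetical mixed-type data frame
df_small <- data.frame(name = c(" a ", "b "),
                       n = 1:2,
                       stringsAsFactors = FALSE)

# apply() first converts the data frame to a matrix, so every value becomes character
as_matrix <- apply(df_small, 2, trimws)
is.matrix(as_matrix)      # TRUE
is.character(as_matrix)   # TRUE, including the former integer column

# Assigning into df_small[] keeps the data frame and the untouched column types
df_small[] <- lapply(df_small, function(col) {
  if (is.character(col)) trimws(col) else col
})
```

The df_small[] <- lapply(...) idiom is a compact alternative to the explicit for loop and leaves the integer column n as it was.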
I think that a simple approach with sapply also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
Notice that dat$N becomes class character because of '4 ' and '5 ' (you can check with class(dat$N)).
To get rid of the spaces in the numeric column, simply convert it with as.numeric or as.integer:
dat$N<-as.numeric(dat$N)
If you want to trim the leading and trailing spaces in all columns, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on column N (because sapply will convert it to character):
dat.b$N<-as.numeric(dat.b$N)
