Whitespace string can't be replaced with NA in R

I want to substitute whitespace with NA. A simple way could be df[df == ""] <- NA, and that works for most of the cells of my data frame... but not for all of them!
I have the following code:
library(rvest)
library(dplyr)
library(tidyr)
#Read website
htmlpage <- read_html("http://www.soccervista.com/results-Liga_MX_Apertura-2016_2017-844815.html")
#Extract table
df <- htmlpage %>% html_nodes("table") %>% html_table()
df <- as.data.frame(df)
#Set whitespaces into NA's
df[df == ""] <- NA
I figured out that some cells are not truly empty: they contain a single space between the quotation marks
df[11,1]
[1] " "
So my solution was to do the next: df[df == " "] <- NA
However, the problem is still there and the cell still holds that single space! I thought the trimws function would work, but it didn't...
#Trim
df[,c(1:10)] <- sapply(df[,c(1:10)], trimws)
However, the problem won't go away.
Any ideas?

We need to use lapply instead of sapply, as sapply returns a character matrix instead of a list, and assigning a matrix back into the data frame can create problems with the quoting and the column classes.
df[1:10] <- lapply(df[1:10], trimws)
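To see the difference (a quick check on the same columns):
# sapply() simplifies its result to a character matrix
class(sapply(df[1:10], trimws))
# [1] "matrix" "array"
# lapply() keeps a plain list, one element per column, which assigns back cleanly
class(lapply(df[1:10], trimws))
# [1] "list"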
Another option, if we have spaces like " ", is to use gsub to replace those spaces with ""
df[1:10] <- lapply(df[1:10], function(x) gsub("^\\s+|\\s+$", "", x))
and then change the "" to NA
df[df == ""] <- NA
Or, instead of doing the two replacements, we can do this in one go and change the class with type.convert
df[] <- lapply(df, function(x)
type.convert(replace(x, grepl("^\\s*$", trimws(x)), NA), as.is = TRUE))
NOTE: We don't have to specify the column index when all the columns are looped
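For instance, on a small toy data frame (hypothetical values), the one-pass version inserts NA wherever a cell is blank or whitespace-only and converts each column's class:
# toy example with a lone space and an empty string (hypothetical data)
toy <- data.frame(a = c("1", " ", "3"), b = c("x", "", "y"), stringsAsFactors = FALSE)
toy[] <- lapply(toy, function(x)
type.convert(replace(x, grepl("^\\s*$", trimws(x)), NA), as.is = TRUE))
str(toy)
# 'a' becomes integer with NA in place of the lone space; 'b' stays character with NA for ""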

I just spent some time trying to determine a method usable in a pipe.
Here is my method:
df <- df %>%
dplyr::mutate_all(funs(sub("^\\s*$", NA, .)))
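Note that funs() is deprecated in recent dplyr releases; a sketch of the same idea with the newer across() syntax (dplyr >= 1.0):
library(dplyr)
df <- df %>%
mutate(across(everything(), ~ sub("^\\s*$", NA, .x)))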
Hope this helps the next searcher.

Related

How to replace commas in a non-numerical list in R?

I have a data.frame in R (which is, internally, also a list). I want to replace the "," with "." in the numbers. The data.frame is not numeric, but I think it has to be to be able to change the decimal separator.
I tried a lot, but nothing works. I do not want to rearrange or manipulate my data.frame. All I want is to get rid of the "," in the decimal numbers.
df <- data.frame("row1" = c("2,3","6"), "row2" = c("56,0","56,8"), "row3" = c("1","0"))
#trials to make df numeric and change from , to .
as.numeric(str_replace_all(df,",","."))
as.numeric(unlist(df[ ,2:3]))
lapply(df, as.numeric)
as.numeric(gsub(pattern = ",",replacement = ".",df[ ,2:3]))
as.numeric(df$a)
What else can I do about this nasty problem?
I guess you read the data in incorrectly (you can specify dec = "," while reading the data).
You can use gsub to replace commas (,) with dot (.) and turn them to numeric.
df[] <- lapply(df, function(x) as.numeric(gsub(',', '.', x)))
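On values like those in the question, this turns "2,3" into 2.3; anything that still isn't a valid number afterwards becomes NA with a coercion warning:
as.numeric(gsub(",", ".", c("2,3", "56,0", "1")))
# [1]  2.3 56.0  1.0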
We can also use mutate_all
library(dplyr)
library(stringr)
df %>%
mutate_all(~ as.numeric(str_replace(., ",", ".")))
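If the data comes from a delimited file, declaring the decimal mark at read time avoids the conversion step entirely; a minimal sketch, assuming a semicolon-separated file named "data.csv" (hypothetical):
# read.csv2 defaults to sep = ";" and dec = ",", the common European convention
df <- read.csv2("data.csv", stringsAsFactors = FALSE)
# or spell it out with read.table
df <- read.table("data.csv", sep = ";", dec = ",", header = TRUE)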

gsub not working on colnames?

I have a dataframe called df with column names in the following format:
"A Agarwal" "A Agrawal" "A Balachandran"
"A.Brush" "A.Casavant" "A.Chakrabarti"
They are first initial and last name. However, some of them are separated with a space, while others are separated with a period. I need to replace the space with a period. (The first column is called author.ID, and I excluded it from the following code.)
I have tried the following code, but the resulting colnames still do not change.
colnames(df[, -1]) = gsub("\\s", "\\.", colnames(df[, -1]))
colnames(df[, -1]) = gsub(" ", ".", colnames(df[, -1]))
What am I doing wrong?
Thanks.
Note that df[, -1] returns a copy of all rows and columns except the first column (see this reference), so assigning to colnames(df[, -1]) only renames that copy. In order to modify the column names you should assign to colnames(df).
To replace the first literal space with a dot, use
colnames(df) <- sub(" ", ".", colnames(df), fixed=TRUE)
If there can be more than one whitespace, use a regex:
colnames(df) <- sub("\\s+", ".", colnames(df))
If you need to replace all whitespace sequences (not just the first) with a single dot in the column names, use gsub:
colnames(df) <- gsub("\\s+", ".", colnames(df))
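As a quick check on a toy vector of names (hypothetical, mirroring the question):
nms <- c("A Agarwal", "A.Brush", "A  Balachandran")
gsub("\\s+", ".", nms)
# [1] "A.Agarwal"       "A.Brush"         "A.Balachandran"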

Inserting NA in blank values from web scraping

I am working on scraping some data into a data frame and am getting some empty fields, where I would prefer to have NA instead. I have tried na.strings, but I am either placing it in the wrong spot or it just isn't working, and I tried to gsub away anything that was whitespace from the beginning of the line to the end, but that didn't work either.
htmlpage <- read_html("http://www.gourmetsleuth.com/features/wine-cheese-pairing-guide")
sugPairings <- html_nodes(htmlpage, ".meta-wrapper")
suggestions <- html_text(sugPairings)
suggestions <- gsub("\\r\\n", '', suggestions)
How can I sub out the blank fields with NA, either once the data is added to the data frame or before adding it?
rvest::html_text has a built-in trimming option: set trim=TRUE.
After you have done this you can use e.g. ifelse to test for an empty string (=="") or use nzchar.
In full, you could do this:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% ifelse(. == "", NA, .)
or this:
res <- html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE)
res[!nzchar(res)] <- NA_character_
# Richard Scriven's improvement:
html_nodes(htmlpage, ".meta-wrapper") %>% html_text(trim=TRUE) %>% replace(!nzchar(.), NA)
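For reference, nzchar() returns TRUE for strings of non-zero length, so !nzchar(.) flags exactly the blanks:
nzchar(c("pairing", "", "cheese"))
# [1]  TRUE FALSE  TRUE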

Remove rows where a particular column has NA [duplicate]

I am working on a large dataset, with some rows with NAs and others with blanks:
df <- data.frame(ID = c(1:7),
home_pc = c("","CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB","MP9 7GH","KN4 5GH"),
start_pc = c(NA,"Home", "FC5 7YH","Home", "CB3 5TH", "BV6 5PB",NA),
end_pc = c(NA,"CB5 4FG","Home","","Home","",NA))
How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? I have in the past used:
df<- df[-which(is.na(df$start_pc)), ]
... to remove the NAs - is there a similar command to remove the blanks?
df[!(is.na(df$start_pc) | df$start_pc==""), ]
It is the same construct - simply test for empty strings rather than NA:
Try this:
df <- df[-which(df$start_pc == ""), ]
In fact, looking at your code, you don't need the which; using negation instead, you can simplify it to:
df <- df[!(df$start_pc == ""), ]
df <- df[!is.na(df$start_pc), ]
And, of course, you can combine these two statements as follows:
df <- df[!(df$start_pc == "" | is.na(df$start_pc)), ]
And simplify it even further with with:
df <- with(df, df[!(start_pc == "" | is.na(start_pc)), ])
You can also test for non-zero string length using nzchar.
df <- with(df, df[nzchar(start_pc) & !is.na(start_pc), ])
Disclaimer: I didn't test any of this code. Please let me know if there are syntax errors anywhere
An elegant solution with dplyr would be:
df %>%
# recode empty strings "" by NAs
na_if("") %>%
# remove NAs
na.omit
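A filter() variant of the same idea, keeping rows where start_pc is neither NA nor blank (a sketch; adjust the column names as needed):
library(dplyr)
df %>% filter(!is.na(start_pc), start_pc != "")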
An alternative solution is to remove the rows with blanks in one particular variable:
df <- subset(df, VAR != "")
An easy approach would be making all the blank cells NA and only keeping complete cases. You might also look for na.omit examples. It is a widely discussed topic.
df[df == ""] <- NA
df <- df[complete.cases(df), ]
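Applied to the df above, complete.cases() drops every row with an NA in any column; if only start_pc and end_pc matter, restrict the check to those columns:
df <- df[complete.cases(df[, c("start_pc", "end_pc")]), ]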
