Using dplyr, remove all strings from a data frame

I have a data frame with 300 columns which has a string variable somewhere that I am trying to remove. I found the solution below on Stack Overflow, which uses lapply and does what I want, but I would like to do it with the dplyr package. I have tried the mutate_each function but can't seem to make it work.
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.

dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
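For instance, a sketch that also drops numeric columns containing NAs, using the newer where() helper (dplyr >= 1.0; select_if() accepts a plain predicate function in much the same way):
library(dplyr)
# keep only numeric columns with no NAs (drops X4 in the example above)
df %>% select(where(function(x) is.numeric(x) && !anyNA(x)))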

If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
  as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.))))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
  if (is.numeric(.)) return(.)
  as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
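Note that mutate_each() has since been retired; on current dplyr (>= 1.0) the same guarded conversion can be sketched with across():
library(dplyr)
# convert only the columns that are not already numeric
df %>% mutate(across(!where(is.numeric), ~ as.numeric(as.character(.x))))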

mutate_all works here; simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting so much as trawling for non-integers.)
StrScrub <- function(x) {
  as.integer(gsub("^\\D+$", NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson
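On current dplyr releases where funs() is deprecated, the same call can be written with across(); a sketch, assuming the StrScrub helper above:
ScrubbedDF <- data %>% mutate(across(everything(), StrScrub))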

Related

How to omit rows with a value contained in a separate vector [duplicate]

I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID = c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
You can do this with filter from dplyr
library(tidyverse)
df2 <- df %>%
  filter(!ID %in% vec)
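An equivalent dplyr spelling uses anti_join(), which can read more naturally when the exclusion list is itself a table; a sketch:
library(dplyr)
# drop every row whose ID appears in vec
df2 <- anti_join(df, data.frame(ID = vec), by = "ID")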
If you create this as a data.table instead (loading the data.table package):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
data.table is likely to run a bit quicker than base R, so it is very useful with large datasets. Microbenchmarking a large dataset with base R, the tidyverse, and data.table shows data.table to be a bit quicker than the tidyverse and a lot faster than base R.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec), ], df %>% filter(!ID %in% vec))

Remove special characters from entire dataframe in R

Question:
How can you use R to remove all special characters from a dataframe, quickly and efficiently?
Progress:
This SO post details how to remove special characters. I can apply the gsub function to single columns, but not to the entire dataframe.
Problem:
My dataframe consists of 100+ columns of integers, strings, etc. When I run gsub on the whole dataframe it doesn't return the output I desire; instead of a cleaned dataframe, I get a character vector with one mangled string per column.
df <- read.csv("C:/test.csv")
dfa <- gsub("[[:punct:]]", "", df$a) #this works on a single column
dfb <- gsub("[[:punct:]]", "", df$b) #this works on a single column
df_all <- gsub("[[:punct:]]", "", df) #this does not work on the entire df
View(df_all)
df is the original dataframe; dfa is gsub applied to column a, which works; df_all is gsub applied to the entire dataframe, which does not.
Summary:
Is there a way to gsub an entire dataframe? Or should an apply function be used instead?
Here is a possible solution using dplyr:
# Example data
bla <- data.frame(a = c(1,2,3),
                  b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                  c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use mutate_all from dplyr
bla %>%
  mutate_all(funs(gsub("[[:punct:]]", "", .)))
  a           b    c
1 1        fefa     
2 2         fes     
3 3 gDEEwfseges gdgd
Another solution is to convert the data frame to a matrix first then run the gsub and then convert back to a data frame as follows:
as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
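One caveat with the matrix route: as.matrix() coerces every column to character, so numeric columns come back as strings. A sketch that restores the column types afterwards (type.convert() handles whole data frames in recent versions of R):
cleaned <- as.data.frame(gsub("[[:punct:]]", "", as.matrix(df)))
# undo the character coercion introduced by as.matrix()
cleaned <- type.convert(cleaned, as.is = TRUE)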
I like Ryan's answer using dplyr. As mutate_all and funs are now deprecated, here is my suggested updated solution using mutate and across:
# Example data
df <- data.frame(a = c(1,2,3),
                 b = c("fefa%^%", "fes^%#$%", "gD%^E%Ewfseges"),
                 c = c("%#%$#^#", "%#$#%#", ",.,gdgd$%,."))
# Use across() from dplyr
df %>%
  mutate(across(everything(), ~gsub("[[:punct:]]", "", .x)))
  a           b    c
1 1        fefa     
2 2         fes     
3 3 gDEEwfseges gdgd

R dplyr: mutate a column for specific row range

I am trying to modify the values of a column for rows in a specific range. This is my data:
df = data.frame(names = c("george","michael","lena","tony"))
and I want to do the following using dplyr:
df[2:3,] = "elsa"
My attempt at it is the following, but it doesn't seem to work:
df = cbind(df, rows = as.integer(rownames(df)))
dplyr::mutate(df, ifelse(rows %in% c(2,3), names = "elsa" , names = names))
which gives the result:
Error: unused arguments (names = "elsa", names = c(1, 3, 2, 4))
Thanks for any advice.
This question is a little vague, but I think the OP is just trying to replace certain values in a data frame using indexing. As the comment above noted, the example dataframe's column is a factor, which makes replacing values behave differently than you might expect. There are two ways to get around this.
The first (more verbose) way is to force df$names to be a character variable instead of a factor, then use indexing to select the values you'd like to change and replace them:
df$names = as.character(df$names)
df$names[c(2,3)] = "elsa"
Alternatively, you can set stringsAsFactors = FALSE when creating the data frame and proceed as above.
df = data.frame(names = c("george","michael","lena","tony"), stringsAsFactors = FALSE)
df$names[c(2:3)] = "elsa"
names
1 george
2 elsa
3 elsa
4 tony
Definitely check out ?data.frame to get a fuller explanation.
The factor answers are faster, but you can do it with dplyr like this (notice that the column must be of type character and not factor):
df <- data.frame(names = c("george","michael","lena","tony"), stringsAsFactors=F)
oldnames <- c("michael", "lena")
df <- mutate(df, names=ifelse(names %in% oldnames, "elsa", names))
Another way is to do something like
oldnames <- c("michael", "lena")
df$names[df$names %in% oldnames] <- "elsa"
Convert names to a character vector explicitly and use replace:
df %>% mutate(names = replace(as.character(names), 2:3, "elsa"))
Note: If names were already a character vector we could have done just:
df %>% mutate(names = replace(names, 2:3, "elsa"))
We can do this using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df)), specify the row index as i, and assign (:=) 'elsa' to 'names'. For a large dataset, := from data.table will be extremely fast.
library(data.table)
setDT(df)[2:3, names := 'elsa']
df
# names
#1: george
#2: elsa
#3: elsa
#4: tony
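For completeness, a dplyr-only sketch of the same row-range update, using row_number() in place of the data.table conversion:
library(dplyr)
# assumes names is a character column, as noted above
df %>% mutate(names = ifelse(row_number() %in% 2:3, "elsa", names))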

How can I apply a function to all dataframe variables?

I have a dataframe with something like 90 variables and over 1 million observations. I want to calculate the percentage of NA rows in each variable. I have the following code:
sum(is.na(dataframe$variable)) / nrow(dataframe) * 100
My question is, how can I apply this function to all 90 variables, without having to type all variable names in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:
library(dplyr)
df %>%
  summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
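summarise_each() has since been superseded; on current dplyr (>= 1.0) the same one-liner can be sketched with across() (multiply by 100 if you want a percentage rather than a proportion):
df %>% summarise(across(everything(), ~ mean(is.na(.x))))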
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA

Determine the number of NA values in a column

I want to count the number of NA values in a data frame column. Say my data frame is called df, and the name of the column I am considering is col. The way I have come up with is following:
sapply(df$col, function(x) sum(length(which(is.na(x)))))
Is this a good/most efficient way to do this?
You're over-thinking the problem:
sum(is.na(df$col))
If you are looking for NA counts for each column in a dataframe then:
na_count <- sapply(x, function(y) sum(length(which(is.na(y)))))
should give you a named vector with the counts for each column (here x is your data frame).
na_count <- data.frame(na_count)
should output the data nicely in a dataframe like:
         na_count
column_1    count
Try the colSums function
df <- data.frame(x = c(1,2,NA), y = rep(NA, 3))
colSums(is.na(df))
#x y
#1 3
A quick and easy tidyverse solution to get an NA count for all columns is summarise_all(), which I think makes a much easier-to-read solution than using purrr or sapply:
library(tidyverse)
# Example data
df <- tibble(col1 = c(1, 2, 3, NA),
             col2 = c(NA, NA, "a", "b"))
df %>% summarise_all(~ sum(is.na(.)))
#> # A tibble: 1 x 2
#> col1 col2
#> <int> <int>
#> 1 1 2
Or using the more modern across() function:
df %>% summarise(across(everything(), ~ sum(is.na(.))))
If you are looking to count the number of NAs in the entire dataframe you could also use
sum(is.na(df))
The summary() output also counts NAs, so one can use it to see NA counts for several variables at once.
A tidyverse way to count the number of NAs in every column of a dataframe:
library(tidyverse)
library(purrr)
df %>%
  map_df(function(x) sum(is.na(x))) %>%
  gather(feature, num_nulls) %>%
  print(n = 100)
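Note that gather() is superseded; with tidyr >= 1.0 (loaded via the tidyverse above), the same reshape can be sketched with pivot_longer():
df %>%
  map_df(~ sum(is.na(.x))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "num_nulls")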
This form, slightly changed from Kevin Ogoros's one:
na_count <- function(x) sapply(x, function(y) sum(is.na(y)))
returns the NA counts as a named integer vector when called on a data frame:
na_count(df)
Try this:
length(df$col[is.na(df$col)])
User rrs's answer is right, but it only tells you the number of NA values in the particular column you pass in. To get the number of NA values for every column of the data frame, try this:
apply(df, 2, function(x) sum(is.na(x)))  # MARGIN = 2 applies over columns
This does the trick
I read a csv file from a local directory. The following code works for me:
# number of rows where the column is NA
sum(is.na(df[, columnName]))
# number of rows where the column is not NA
sum(!is.na(df[, columnName]))
# here columnName is your desired column name
Similar to hute37's answer but using the purrr package. I think this tidyverse approach is simpler than the answer proposed by AbiK.
library(purrr)
map_dbl(df, ~sum(is.na(.)))
Note: the tilde (~) creates an anonymous function, and the '.' refers to the input of that anonymous function, in this case each column of the data.frame df.
If you want the NA count for each column printed one after the other, you can use this simple solution:
lapply(df, function(x) { length(which(is.na(x)))})
Another option using complete.cases like this:
df <- data.frame(col = c(1,2,NA))
df
#> col
#> 1 1
#> 2 2
#> 3 NA
sum(!complete.cases(df$col))
#> [1] 1
Created on 2022-08-27 with reprex v2.0.2
You can use this to count the number of NAs or blanks in every column:
colSums(is.na(data_set_name)|data_set_name == '')
In the interests of completeness, you can also use the useNA argument in table. For example, table(df$col, useNA = "always") will count all of the non-NA values as well as the NAs.
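For instance, with the small df from the complete.cases example above, the NA count gets its own cell (a sketch of the output):
table(df$col, useNA = "always")
#>    1    2 <NA>
#>    1    1    1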
