I want have a dataframe with something like 90 variables, and over 1 million observations. I want to calculate the percentage of NA rows on each variable. I have the following code:
sum(is.na(dataframe$variable) / nrow(dataframe) * 100)
My question is, how can I apply this function to all 90 variables, without having to type all variable names in the code?
Use lapply() with your method:
lapply(df, function(x) sum(is.na(x))/nrow(df)*100)
If you want to return a data.frame rather than a list (via lapply()) or a vector (via sapply()), you can use summarise_each from the dplyr package:
library(dplyr)
df %>%
summarise_each(funs(sum(is.na(.)) / length(.)))
or, even more concisely:
df %>% summarise_each(funs(mean(is.na(.))))
data
df <- data.frame(
x = 1:10,
y = 1:10,
z = 1:10
)
df$x[c(2, 5, 7)] <- NA
df$y[c(4, 5)] <- NA
Related
I am trying to work out how to create a user defined function to perform a calculation on a series of columns in a dataframe, and add the answer as an additional column to the same dataframe. To keep things simple, the test example I have been using is to calculate percentage growth from one year to the next, but the goal is to be able to create more elaborate calculations that are too cumbersome and repetitive to manually calculate.
The practice data I have been using is...
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c(2001, 2002))
Which will look like...
2001 2002
a 10 12
b 11 9
The manual calculation I have been using is...
df$PercGrowth <- (df$`2002` - df$`2001`) / df$`2001` * 100
Which returns:
2001 2002 PercGrowth
a 10 12 20.00000
b 11 9 -18.18182
How do I turn this into a user defined function where I can specify the columns to perform the calculation, and then have the answer added to the dataframe as a derived value?
What I initially thought might work was...
pg <- function(data, c1, c2)
df <- mutate(data, PercGrowth = ((df[c2] -df[c1]) / df[c1] * 100))
pg(df, 1, 2)
However I keep getting the error message:
Error: Column PercGrowth is of unsupported class data.frame
How do I get this to work?
This is actually more complicated than it looks - you need to use dplyr pronouns and quasiquotation in order to pass the column names as arguments in the function. The following code works:
library(dplyr)
a <- c(10, 12)
b <- c(11, 9)
df <- t(data.frame(a, b))
df <- data.frame(df)
colnames(df) <- c("year1", "year2")
pg <- function(df, col1, col2) {
quo_col1 <- enquo(col1)
quo_col2 <- enquo(col2)
df %>%
mutate(pct_growth = (!! quo_col2 - !! quo_col1) / !! quo_col1 * 100)
}
pg(df, year1, year2)
I renamed the columns to strings so they are easier to work with. You can read more at this link: https://dplyr.tidyverse.org/articles/programming.html
Another option could be to use some kind of string matching on the column names you're interested in, perform operations using those columns, and then join the result back to the main data frame.
This question already has answers here:
How to delete multiple values from a vector?
(9 answers)
Closed 3 years ago.
I have a vector of values and a data frame.
I would like to filter out the rows of the data frame which contain (in specific column) any of the values in my vector.
I'm trying to figure out if a person in the survey has a child who was also questioned in the survey - if so I would like to remove them from my data frame.
I have a list of respondent IDs, and vectors of mother/father personal IDs. If the ID appears in the mother/father column I would like to remove it.
df <- data.frame(ID= c(101,102,103,104,105), Name = (Martin, Sammie, Reg, Seamus, Aine)
vec <- c(103,105,108,120,150)
Output should be a dataframe with three rows - Martin, Sammie, Seamus.
ID Name
1 101 Martin
2 102 Sammie
3 104 Seamus
df[!(df$ID %in% vec), ] # Or subset(df, !(ID %in% vec))
# ID Name
# 1 101 Martin
# 2 102 Sammie
# 4 104 Seamus
Data
df <- data.frame(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
You can do this with filter from dplyr
library(tidyverse)
df2 <- df%>%
filter(!ID %in% vec)
If you create this as a data.table (and load data.table package, and fix the errors in the example data):
library(data.table)
df <- data.table(ID= c(101,102,103,104,105), Name = c("Martin", "Sammie", "Reg", "Seamus", "Aine"))
vec <- c(103,105,108,120,150)
# solution, slightly different from base R
df[!(ID %in% vec)]
Data.table is likely going to run a bit quicker than base R so very useful with large datasets. Microbenchmarking with a large dataset using base R, tidyverse and data.table shows data.table to be a bit quicker than tidyverse and a lot faster than base.
library(tidyverse)
library(data.table)
library(microbenchmark)
n <- 10000000
df <- data.frame("ID" = c(1:n), "Name" = sample(LETTERS, size = n, replace = TRUE))
dt <- data.table(df)
vec <- sample(1:n, size = n/10, replace = FALSE)
microbenchmark(dt[!(ID %in% vec)], df[!(df$ID %in% vec),], df%>% filter(!ID %in% vec))
I would like to convert a dataframe filled with frequencies into a dataframe filled with percentage by row using dplyr.
My data set has the particularity to get filled with others variables and I just want to calculate the percentage for a set of columns defined by a vector of names. Plus, I want to use the dplyr library.
sim_dat <- function() abs(floor(rnorm(26)*10))
df <- data.frame(a = letters, b = sim_dat(), c = sim_dat(), d = sim_dat()
, z = LETTERS)
names_to_transform <- names(df)[2:4]
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_each(function(x) x / sum_freq_codpos, names_to_transform)
# does not work
Any idea on how to do it? I have tried with mutate_at and mutate_each but I can't get it to work.
you're almost there!:
df2 <- df %>%
mutate(sum_freq_codpos = rowSums(.[names_to_transform])) %>%
mutate_at(names_to_transform, funs(./sum_freq_codpos))
the dot . roughly translates to "the object i am manipulating here", which in this call is "the focal variable in names_to_transform".
I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that for input x will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (ex. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R and have been struggling with this for a week and have searched manuals, this site, Google, everything. Can someone help please?
counter <- 1
ID <- vector("numeric") ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
r <- subset(table_wo_nas, ID %in% x) ## create subset for rows of ID=x
ID[counter] <- x ## add x to ID
nobs[counter] <- nrow(r) ## add nrow to nobs
counter <- counter + 1 } ## loop
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into data.frame
nobs_df <- data.frame(ID = names(tmp), nobs = tmp)
Alternately, coerce the table into a data.frame directly, and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
If you only want certain rows, subset:
nobs_df[c(2, 4, 8), ]
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
Seems pretty straightforward if you just use table again:
tbl <- table( table_wo_nas[ , 'ID'] )
data.frame( IDs = names(tbl), nobs= tbl)
Could also get a quick answer although with different column names using:
as.data.frame(table( table_wo_nas[ , 'ID'] ))
Try this.
x=c(2,4,8)
count_of_id=0
#df is your data frame table_wo_nas
count_of<-function(x)
{for(i in 1 : length(x))
{count_of_id[i]<-length(which(df$id==x[i])) #find out the n of rows for each unique value of x
}
df_1<-cbind(id,count_of_id)
return(df_1)
}
I have a data frame with 300 columns which has a string variable somewhere which I am trying to remove. I have found this solution in stack overflow using lapply (see below), which is what I want to do, but using the dplyr package. I have tried using the mutate_each function but cant seem to make it work
"If your data frame (df) is really all integers except for NAs and garbage then then the following converts it.
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
You'll have a warning about NAs introduced by coercion but that's just all those non numeric character strings turning into NAs.
dplyr 0.5 now includes a select_if() function.
For example:
person <- c("jim", "john", "harry")
df <- data.frame(matrix(c(1:9,NA,11,12), nrow=3), person)
library(dplyr)
df %>% select_if(is.numeric)
# X1 X2 X3 X4
#1 1 4 7 NA
#2 2 5 8 11
#3 3 6 9 12
Of course you could add further conditions if necessary.
If you want to use this line of code:
df2 <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
with dplyr (by which I assume you mean "using pipes") the easiest would be
df2 = df %>% lapply(function(x) as.numeric(as.character(x))) %>%
as.data.frame
To "translate" this into the mutate_each idiom:
mutate_each(df, funs(as.numeric(as.character(.)))
This function will, of course, convert all columns to character, then to numeric. To improve efficiency, don't bother doing two conversions on columns that are already numeric:
mutate_each(df, funs({
if (is.numeric(.)) return(.)
as.numeric(as.character(.))
}))
Data for testing:
df = data.frame(v1 = 1:10, v2 = factor(11:20))
mutate_all works here, and simply wrap the gsub in a function. (I also assume you aren't necessarily string hunting, so much as trawling for non-integers.
StrScrub <- function(x) {
as.integer(gsub("^\\D+$",NA, x))
}
ScrubbedDF <- mutate_all(data, funs(StrScrub))
Example dataframe:
library(dplyr)
options(stringsAsFactors = F)
data = data.frame("A" = c(2:5),"B" = c(5,"gr",3:2), "C" = c("h", 9, "j", "1"))
with reference/help from Tony Ladson