R: column profiling

R: column profiling - r

In R I'm trying to profile the columns of a data frame. This is the data frame:
> library(MASS)
> data<-iris[1:5,1:4]
> data
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5.0 3.6 1.4 0.2
I want the result of the profiling to look something like this:
min max mean
Sepal.Length 4.6 5.1 5
Sepal.Width 3.0 3.6 5
Petal.Length 1.3 1.5 3
Petal.Width 0.2 0.2 1
There could be many more functions I want to apply to the columns.
I'm able to get the data I want with this command:
library(dplyr)
data %>% summarise_all(funs(min, max, mean))
However, neither the shape nor the row/column names are as desired. Is there an elegant way of achieving what I want?

Oneliner with base R:
t(sapply(data, summary))[, c('Min.', 'Max.', 'Mean')]

library(plyr)
t(sapply(data, each(min,max,mean)))

Using dplyr to allow use of any functions
library(dplyr)
library(tidyr)
data %>%
gather() %>%
group_by(key) %>%
summarise_all(funs(min, max, mean))

Related

filter rows based on their Sum in dplyR R

I am trying to get things right using the dplyr package in R.
Imagine that I have the iris dataset, which looks like this
library(tidyverse)
iris=iris[,1:4]
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 5.1 3.5 1.4 0.2
2 4.9 3.0 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
I want to keep only the rows whose sum is bigger or equal (>=) 10. With baseR i can do it
like this
iris[rowSums(iris) >= 10, , drop = FALSE]
How could do I do this using dplyR and the rowSums function

You can use -
library(dplyr)
iris1=iris[,1:4]
iris1 %>% filter(rowSums(.) >= 10)

Operating on list of strings representing column names?

I'm currently trying to automate a data task that requires taking in a list of column names in string format, then summing those columns (rowwise). i.e., suppose there is some list as follows:
> list
[1] "colname1" "colname2" "colname3"
How would I go about passing in this list to some function like sum() in tidyverse? That is, I would like to run something like the following:
df <- df %>%
rowwise %>%
mutate(new_var = sum(list))
Any suggestions would be greatly, greatly appreciated. Thanks.

You could use rowSums here. For example:
library(dplyr)
mycols <- colnames(iris)[3:4]
mycols
[1] "Petal.Length" "Petal.Width"
Then:
iris %>%
mutate(new_var = rowSums(.[, mycols])) %>%
head()
Result:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
1 5.1 3.5 1.4 0.2 setosa 1.6
2 4.9 3.0 1.4 0.2 setosa 1.6
3 4.7 3.2 1.3 0.2 setosa 1.5
4 4.6 3.1 1.5 0.2 setosa 1.7
5 5.0 3.6 1.4 0.2 setosa 1.6
6 5.4 3.9 1.7 0.4 setosa 2.1

You can pass the vector of column names in c_across.
library(dplyr)
df <- df %>% rowwise() %>% mutate(new_var = sum(c_across(list)))
df

Reference column by column index

have a dataset like an iris, any help will be appreciated,
iris %>% head %>% mutate(sum = .[[1]] + .[[2]]) #works
iris %>% head %>% mutate(max = max(.[1], .[2])) #doesnt work
Expected answer, find the max(1st column, 2nd column)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species max
1 5.1 3.5 1.4 0.2 setosa 5.1
2 4.9 3.0 1.4 0.2 setosa 4.9
3 4.7 3.2 1.3 0.2 setosa 4.7
4 4.6 3.1 1.5 0.2 setosa 4.6
5 5.0 3.6 1.4 0.2 setosa 5.0
6 5.4 3.9 1.7 0.4 setosa 5.4
many thanks in advance

We need elementwise max and this can be achieved with pmax
iris %>%
head %>%
mutate(max= pmax(.[[1]] , .[[2]]) )
The issue with max is that its usage is
max(..., na.rm = FALSE)
Here, the ... signifies
numeric or character arguments
So, it is taking the max value of all the columns passed into the function, rather than the elementwise max of the columns
The + is a different function and it is always elementwise, but if we do sum (which would be a corresponding candidate to check with max), it also does the same behavior as max

Succinct subsetting across multiple columns in R

Say I have a massive dataframe and in multiple columns I have an extremely large list of unique codes and I want to use these codes to select certain rows to subset the original dataframe. There are around 1000 codes and the codes I want all follow after each other. For example I have about 30 columns that contain codes and I only want to take rows that have codes 100 to 120 in ANY of these columns .
There's a long way to do this which is something like
new_dat <- df[which(df$codes==100 | df$codes==101 | df$codes1==100
and I repeat this for every single possible code for everyone of the columns that can contain these codes. Is there a way to do this in a more convenient fashion?
I want to try solving this with dplyr's select function, but I'm having trouble seeing if it works for my case out of the box
Take the iris dataset
Say I wanted all rows that contain the value 4.0-5.0 in any columns that contains the word Sepal in the column name.
#this only goes for 4.0
brand_new_df <- select(filter(iris, Sepal.Length ==4.0 | Sepal.Width == 4.0))
but what I want is something like
brand_new_df <- select(filter(iris, contains(Sepal) == 4.0:5.0))
Is there a dplyr way to do this?

A corresponding across() version from #RonakShah's answer:
library(dplyr)
iris %>% filter(rowSums(across(contains('Sepal'), ~ between(., 4, 5))) > 0)
or
iris %>% filter(rowSums(across(contains('Sepal'), between, 4, 5)) > 0)
From vignette("colwise"):
Previously, filter() was paired with the all_vars() and any_vars() helpers. Now, across() is equivalent to all_vars(), and there’s no direct replacement for any_vars().
So you need something like rowSums(...) > 0 to achieve the effect of any_vars().

You can use filter_at :
library(dplyr)
iris %>% filter_at(vars(contains('Sepal')), any_vars(between(., 4, 5)))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 4.9 3.0 1.4 0.2 setosa
#2 4.7 3.2 1.3 0.2 setosa
#3 4.6 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
#5 4.6 3.4 1.4 0.3 setosa
#6 5.0 3.4 1.5 0.2 setosa
#7 4.4 2.9 1.4 0.2 setosa
#....

Base R:
# Subset:
cols <- grep("codes", names(df2), value = TRUE)
df2[rowSums(sapply(cols,
function(x) {
df2[, x] >= 100 & df2[, x] <= 120
})) == length(cols), ]
# Data:
tmp <- data.frame(x1 <- rnorm(999, mean = 100, sd = 2))
df <-
setNames(data.frame(tmp[rep(1, each = 80)]), paste0("codes", 1:80))
df2 <- cbind(id = 1:nrow(df), df)

One option could be:
iris %>%
filter(Reduce(`|`, across(contains("Sepal"), ~ between(.x, 4, 5))))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 4.6 3.4 1.4 0.3 1
6 5.0 3.4 1.5 0.2 1
7 4.4 2.9 1.4 0.2 1
8 4.9 3.1 1.5 0.1 1
9 4.8 3.4 1.6 0.2 1
10 4.8 3.0 1.4 0.1 1

library(dplyr)
df <- iris
# value to look for
val <- 4
# find columns
cols <- which(colSums(df == val , na.rm = TRUE) > 0L)
# filter rows
iris %>% filter_at(cols, any_vars(.==val))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.8 4.0 1.2 0.2 setosa
2 5.5 2.3 4.0 1.3 versicolor
3 6.0 2.2 4.0 1.0 versicolor
4 6.1 2.8 4.0 1.3 versicolor
5 5.5 2.5 4.0 1.3 versicolor
6 5.8 2.6 4.0 1.2 versicolor

Use column names from vector in for loop in dplyr

this should probably be quite straightforward, but I am struggling to get it to work. I currently have a vector of column names:
columns <- c('product1', 'product2', 'product3', 'support4')
I now want to use dplyr in a for loop to mutate some columns, but I am struggling to make it recognize that it is a column name, not a variable.
for (col in columns) {
cross.sell.val <- cross.sell.val %>%
dplyr::mutate(col = ifelse(col == 6, 6, col)) %>%
dplyr::mutate(col = ifelse(col == 5, 6, col))
}
Can I use %>% in these situations? Thanks..

You should be able to do this without using a for loop at all.
Because you didn't provide any data, I am going to use the builtin iris dataset. The top of it looks like:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
First, I am saving the columns to analyze:
columns <- names(iris)[1:4]
Then, use mutate_at for each column, along with that particular rule. In each, the . represents the vector for each column. Your example implies that the rules are the same for each column, though if that is not the case, you may need more flexibility here.
mod_iris <-
iris %>%
mutate_at(columns, funs(ifelse(. > 5, 6, .))) %>%
mutate_at(columns, funs(ifelse(. < 1, 1, .)))
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 6.0 3.5 1.4 1 setosa
2 4.9 3.0 1.4 1 setosa
3 4.7 3.2 1.3 1 setosa
4 4.6 3.1 1.5 1 setosa
5 5.0 3.6 1.4 1 setosa
6 6.0 3.9 1.7 1 setosa
If you wanted to, you could instead write a function to make all of your changes for the column. This could also allow you to set the cutoffs differently for each column. For example, you may want to set the bottom and top portions of the data to be equal to that threshold (to reign in outliers for some reason), or you may know that each variable uses a dummy value as a placeholder (and that value is different by column, but is always the most common value). You could easily add in any arbitrary rule of interest this way, and it gives you a bit more flexibility than chaining together separate rules (e.g., if you use the mean, the mean changes when you change some of the values).
An example function:
modColumns <- function(x){
botThresh <- quantile(x, 0.25)
topThresh <- quantile(x, 0.75)
dummyVal <- as.numeric(names(sort(table(x)))[1])
dummyReplace <- NA
x <- ifelse(x < botThresh, botThresh, x)
x <- ifelse(x > topThresh, topThresh, x)
x <- ifelse(x == dummyVal, dummyReplace, x)
return(x)
}
And in use:
iris %>%
mutate_at(columns, modColumns) %>%
head
returns:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.3 1.6 0.3 setosa
2 5.1 3.0 1.6 0.3 setosa
3 5.1 3.2 1.6 0.3 setosa
4 5.1 3.1 1.6 0.3 setosa
5 5.1 3.3 1.6 0.3 setosa
6 5.4 3.3 1.7 0.4 setosa

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

R: column profiling - r

Oneliner with base R: t(sapply(data, summary))[, c('Min.', 'Max.', 'Mean')]

library(plyr) t(sapply(data, each(min,max,mean)))

Using dplyr to allow use of any functions library(dplyr) library(tidyr) data %>% gather() %>% group_by(key) %>% summarise_all(funs(min, max, mean))

Related

filter rows based on their Sum in dplyR R

Operating on list of strings representing column names?

Reference column by column index

Succinct subsetting across multiple columns in R

Use column names from vector in for loop in dplyr

Categories

Resources