count the number of occurrences for each variable using dplyr (R)

Here is my data frame (tibble) df:
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
<dbl> <dbl> <dbl> <dbl> <dbl>
1 61 0 70 0 0
2 0 0 127 0 0
3 318 0 2 0 0
4 1 0 0 0 0
5 1 0 67 0 0
6 0 0 0 139 0
7 0 0 0 0 0
8 113 0 0 0 0
9 0 0 1 0 0
10 0 0 0 1 0
For each column/variable, I would like to count the number of rows with value greater than 10. In this case, column 1 would be 3, column 2 would be zero, etc. This is a test data frame, and I would like to do this for many columns.

We can use colSums on a logical matrix:
colSums(df > 10, na.rm = TRUE)
Or using dplyr:
library(dplyr)
df %>%
  summarise_all(~ sum(. > 10, na.rm = TRUE))
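In dplyr 1.0 and later, summarise_all() is superseded; the across() equivalent is sketched below on a small stand-in tibble (the values are hypothetical, since the original df isn't reproduced here):

```r
library(dplyr)

# Small stand-in for df (hypothetical values, two columns)
df <- tibble(ENSG00000000003 = c(61, 0, 318, 1),
             ENSG00000000005 = c(0, 0, 0, 0))

# across() is the current idiom for summarising every column
df %>%
  summarise(across(everything(), ~ sum(.x > 10, na.rm = TRUE)))
```
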

I think
library(dplyr)
df %>% summarise_all(~ sum(. > 10))
will do what you want.

Related

Permute labels in a dataframe but for pairs of observations

Not sure if the title is clear, but I want to shuffle a column in a dataframe, not for every individual row (which is very simple to do using sample()), but for pairs of observations from the same sample.
For instance, I have the following dataframe df1:
>df1
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 1 1 1 1 0 0 0
386 1 0 0 0 1 0 0
438 2 1 0 0 0 1 1
438 2 0 1 1 0 0 0
582 2 0 0 0 0 0 0
582 2 1 0 0 0 1 0
597 1 0 1 0 0 0 1
597 1 0 0 0 0 0 0
I want to randomly shuffle the labels here for groupID for each sample, not observation, so that the result looks like:
>df2
sampleID groupID A B C D E F
438 1 1 0 0 0 0 0
438 1 0 0 0 0 1 1
386 2 1 1 1 0 0 0
386 2 0 0 0 1 0 0
438 1 1 0 0 0 1 1
438 1 0 1 1 0 0 0
582 1 0 0 0 0 0 0
582 1 1 0 0 0 1 0
597 2 0 1 0 0 0 1
597 2 0 0 0 0 0 0
Notice that in column 2 (groupID), sample 386 is now 2 (for both observations).
I have searched around but haven't found anything that works the way I want. What I have now is just shuffling the second column. I tried to use dplyr as follows:
df2 <- df1 %>%
  group_by(sampleID) %>%
  mutate(groupID = sample(df1$groupID, size = 2))
But of course that only takes all the group IDs and randomly selects 2.
Any tips or suggestions would be appreciated!
One technique would be to extract the unique combinations so you have one row per sampleID, then shuffle the labels and merge them back to the main table. Here's what that would look like:
library(dplyr)
df1 %>%
  distinct(sampleID, groupID) %>%
  mutate(shuffle_groupID = sample(groupID)) %>%
  inner_join(df1)
Using dplyr nest_by and tidyr unnest:
library(dplyr)
library(tidyr)
df1 |>
  nest_by(sampleID, groupID) |>
  mutate(groupID = sample(groupID, n())) |>
  unnest(cols = c(data))
# A tibble: 10 x 3
# Groups: sampleID, groupID [4]
sampleID groupID A
<dbl> <int> <dbl>
1 386 1 1
2 386 1 0
3 438 1 0
4 438 1 0
5 438 1 0
6 438 1 1
7 582 2 0
8 582 2 0
9 597 1 1
10 597 1 0
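For comparison, here is a base R sketch of the pair-wise shuffle, under the assumption (as in the question) that rows arrive in consecutive pairs from the same sample; the toy data and the `pair` helper are illustrative stand-ins:

```r
# Toy stand-in for df1: three consecutive pairs of rows
df1 <- data.frame(
  sampleID = c(438, 438, 386, 386, 582, 582),
  groupID  = c(1, 1, 1, 1, 2, 2),
  A        = c(1, 0, 1, 0, 0, 1)
)

set.seed(42)
# Index each consecutive pair of rows
pair <- rep(seq_len(nrow(df1) / 2), each = 2)
# Take one groupID per pair, shuffle those labels across pairs,
# then write the shuffled label back to both rows of each pair
new_id <- sample(df1$groupID[!duplicated(pair)])
df1$groupID <- new_id[pair]
```

Both rows of every pair end up with the same (shuffled) label, and the multiset of labels is preserved.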

How to create multiple columns of dummies for intervals in R

data$Distance_100 <- 0
data$Distance_100[data$Distance < 100] <- 1
data$Distance_200 <- 0
data$Distance_200[data$Distance >= 101 & data$Distance < 200] <- 1
data$Distance_300 <- 0
data$Distance_300[data$Distance >= 201 & data$Distance < 300] <- 1
data$Distance_400 <- 0
data$Distance_400[data$Distance >= 301 & data$Distance < 400] <- 1
data$Distance_500 <- 0
data$Distance_500[data$Distance >= 401 & data$Distance < 500] <- 1
The outcome must be multiple columns. This code creates just one column:
data$DistanceCut5 <- cut(data$Distance, breaks = c(0, 100, 200, 300, 400, 500))
cut will create a single column, but if you want 1 column for each cut level you could do something like this:
Example
Libraries
library(tidyverse)
Code
# Vector with a sequence from 0 to 500 by 100
seq_0_500 <- seq(0,500,100)
# Example data.frame
tibble(
  # Variable distance = sequence from 1 to 500 by 1
  distance = 1:500
) %>%
  mutate(
    # Create a categorical variable by 100: `(0,100]` `(100,200]` `(200,300]` `(300,400]` `(400,500]`
    distance_cut = cut(distance, seq_0_500, labels = paste0("Distance_", seq_0_500[-1])),
    # Auxiliary variable
    aux = 1
  ) %>%
  # Pivot data to make one column for each cut level
  pivot_wider(names_from = distance_cut, values_from = aux) %>%
  # Replace every NA with 0
  replace(is.na(.), 0)
Output
# A tibble: 500 x 6
distance Distance_100 Distance_200 Distance_300 Distance_400 Distance_500
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 0 0 0
2 2 1 0 0 0 0
3 3 1 0 0 0 0
4 4 1 0 0 0 0
5 5 1 0 0 0 0
6 6 1 0 0 0 0
7 7 1 0 0 0 0
8 8 1 0 0 0 0
9 9 1 0 0 0 0
10 10 1 0 0 0 0
# ... with 490 more rows
Here is another approach. First provide reproducible data:
set.seed(42)
var <- round(runif(50, 0, 500))
dummy <- cut(var, breaks=c(0, 100, 200, 300, 400, 500))
table(dummy)
# dummy
# (0,100] (100,200] (200,300] (300,400] (400,500]
# 7 6 9 10 18
Now create columns for each value:
dumvar <- table(row(as.matrix(dummy)), dummy)
head(dumvar); tail(dumvar)
# dummy
# (0,100] (100,200] (200,300] (300,400] (400,500]
# 1 0 0 0 0 1
# 2 0 0 0 0 1
# 3 0 1 0 0 0
# 4 0 0 0 0 1
# 5 0 0 0 1 0
# 6 0 0 1 0 0
# dummy
# (0,100] (100,200] (200,300] (300,400] (400,500]
# 45 0 0 1 0 0
# 46 0 0 0 0 1
# 47 0 0 0 0 1
# 48 0 0 0 1 0
# 49 0 0 0 0 1
# 50 0 0 0 1 0
If you want to rename the columns:
dimnames(dumvar)$dummy <- paste0("Distance_", seq(100, 500, by=100))
Here's a nice approach: first cut your data, then use model.matrix() to create dummy variables.
data <- data.frame(Distance = runif(20, 0, 500))
DistanceCut5 = cut(data$Distance, breaks=c(0,100,200,300,400,500))
dummies <- model.matrix(~ DistanceCut5 + 0) # + 0 so we don't have a column of 1s
data <- cbind(data, dummies)
Make sure you don't have any NAs in DistanceCut5. Otherwise you'll get too few rows in your matrix of dummies.
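A compact base R variant of the same cut-then-expand idea loops over the levels with sapply (a sketch; the Distance_* column names are carried over from the question):

```r
set.seed(42)
data <- data.frame(Distance = round(runif(20, 0, 500)))
brks <- seq(0, 500, by = 100)
# include.lowest = TRUE so an exact 0 is kept rather than becoming NA
cuts <- cut(data$Distance, breaks = brks, include.lowest = TRUE)
# One 0/1 column per interval level
dummies <- sapply(levels(cuts), function(l) as.integer(cuts == l))
colnames(dummies) <- paste0("Distance_", brks[-1])
data <- cbind(data, dummies)
```

Each row lands in exactly one interval, so the dummy columns sum to 1 per row.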

Stacking multiple columns in R

I am trying to convert a data frame into long form in R.
This is example data for surveys conducted in 'id' grids over 9 days, where the variable of interest was either detected ('1') or not detected ('0').
I want to convert this data frame so that the number of surveys is reduced from 9 to 3,
but each survey period now contains 3 visits.
I am trying to do so by stacking three columns at a time, so that survey visits 'v1' to 'v9' get converted to v1, v2, v3, with an added column called 'visit_no' that gives the visit number within the survey period.
The data frame in its current form was shown in a linked image; below is the code to generate the data.
Code to generate data:
id <- c(240, 220, 160)
v1 <- rep(0, 9)
v2 <- c(rep(0, 3), 1, rep(0, 5))
v3 <- c(1, rep(0, 8))
v <- as.data.frame(rbind(v1, v2, v3))
survey <- cbind(id, v)
survey
The data frame I need was shown in a linked image (Reference data-frame).
One way is using reshape in base R:
reshape(survey, direction="long", idvar="id",
varying=list(c("V1","V4","V7"), c("V2","V5","V8"), c("V3","V6","V9")),
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no")
id visit_no Visit1 Visit2 Visit3
240.1 240 1 0 0 0
220.1 220 1 0 0 0
160.1 160 1 1 0 0
240.2 240 2 0 0 0
220.2 220 2 1 0 0
160.2 160 2 0 0 0
240.3 240 3 0 0 0
220.3 220 3 0 0 0
160.3 160 3 0 0 0
If you want it sorted by id, then add arrange from dplyr
%>% dplyr::arrange(id)
id visit_no Visit1 Visit2 Visit3
1 160 1 1 0 0
2 160 2 0 0 0
3 160 3 0 0 0
4 220 1 0 0 0
5 220 2 1 0 0
6 220 3 0 0 0
7 240 1 0 0 0
8 240 2 0 0 0
9 240 3 0 0 0
If your original variable names were in a consistent format, then the reshape command is even simpler because it will correctly guess the times from the names. For example,
names(survey)[2:10] <- paste0(names(survey)[2:10], ".", rep(1:3, 3))
head(survey)
id V1.1 V2.2 V3.3 V4.1 V5.2 V6.3 V7.1 V8.2 V9.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", idvar="id",
varying=2:10, # Can just give the indices now.
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no") %>%
arrange(id)
Although the times are in a consistent format, the original variable names are not, so R cannot guess the names for the long format (Visit1, Visit2, Visit3), and these need to be supplied in the v.names argument.
If they were in a consistent format, then the reshape is even simpler.
names(survey)[2:10] <- paste0("Visit", rep(1:3, each=3), ".", rep(1:3, 3))
head(survey)
id Visit1.1 Visit1.2 Visit1.3 Visit2.1 Visit2.2 Visit2.3 Visit3.1 Visit3.2 Visit3.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", varying=2:10, timevar="visit_no") %>%
arrange(id)
The tidyr version would probably involve two reshapes; one to get everything in very long form, and again to get it back to a wider form (what I call the 1 step back, 2 steps forward method).
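That two-step tidyr route might look like the following sketch: pivot everything long, derive the period and within-period visit from the V index, then pivot the visits back out (the helper column names `idx` and `visit` are assumptions):

```r
library(dplyr)
library(tidyr)

id <- c(240, 220, 160)
v1 <- rep(0, 9)
v2 <- c(rep(0, 3), 1, rep(0, 5))
v3 <- c(1, rep(0, 8))
survey <- cbind(id, as.data.frame(rbind(v1, v2, v3)))

long <- survey %>%
  pivot_longer(-id, names_to = "col", values_to = "value") %>%
  mutate(idx      = as.integer(sub("V", "", col)),
         visit_no = (idx - 1) %/% 3 + 1,   # survey period (1..3)
         visit    = (idx - 1) %%  3 + 1)   # visit within the period

long %>%
  select(id, visit_no, visit, value) %>%
  pivot_wider(names_from = visit, values_from = value,
              names_prefix = "Visit") %>%
  arrange(id)
```
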
You can change the name of the columns based on the sequence that you want.
names(survey)[-1] <- paste(rep(paste0("visit", 1:3), each =3), 1:3, sep = "_")
names(survey)
#[1] "id" "visit1_1" "visit1_2" "visit1_3" "visit2_1" "visit2_2" "visit2_3"
# "visit3_1" "visit3_2" "visit3_3"
And then use pivot_longer from tidyr to get data in different columns.
tidyr::pivot_longer(survey, cols = -id, names_to = c(".value", "visit_no"),
names_sep = "_") %>%
type.convert(as.is = TRUE)
# A tibble: 9 x 5
# id visit_no visit1 visit2 visit3
# <int> <int> <int> <int> <int>
#1 240 1 0 0 0
#2 240 2 0 0 0
#3 240 3 0 0 0
#4 220 1 0 1 0
#5 220 2 0 0 0
#6 220 3 0 0 0
#7 160 1 1 0 0
#8 160 2 0 0 0
#9 160 3 0 0 0

How to use column numbers in the dplyr filter function

How do I use the dplyr::filter() function with column numbers, rather than column names?
As an example, I'd like to pick externally selected columns and return the rows that are all zeros. For example, for a data frame like this
> test
# A tibble: 10 x 4
C001 C007 C008 C020
<dbl> <dbl> <dbl> <dbl>
1 -1 -1 0 0
2 0 0 0 0
3 1 1 0 0
4 -1 -1 0 0
5 0 0 0 -1
6 0 0 0 1
7 0 1 1 0
8 0 0 -1 -1
9 1 1 0 0
10 0 0 0 0
and a vector S = c(1,3,4). How would I pick all the rows in test where the selected columns are all zero? I can do this with test[apply(test[, S], 1, function(x) all(x == 0)), ], but I'd like to use this as part of a %>% pipeline.
I have not been able to figure out the filter() syntax to use column numbers rather than names. The real data has many more columns (>100) and rows and the column numbers are supplied by an external algorithm.
Use filter_at with all_vars:
library(dplyr)
test %>% filter_at(c(1, 3, 4), all_vars(. == 0))
C001 C007 C008 C020
1 0 0 0 0
2 0 0 0 0
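filter_at() is superseded in dplyr 1.0+; the current spelling uses if_all() over the names at those positions. A sketch on a reduced stand-in for the question's tibble (the values are illustrative):

```r
library(dplyr)

# Reduced stand-in for the question's tibble
test <- tibble(C001 = c(-1, 0, 1, 0),
               C007 = c(-1, 0, 1, 1),
               C008 = c( 0, 0, 0, 0),
               C020 = c( 0, 0, 0, 0))
S <- c(1, 3, 4)  # externally supplied column numbers

# Keep rows where every selected column equals zero
test %>%
  filter(if_all(all_of(names(test)[S]), ~ .x == 0))
```
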

Function using conditions

I am trying to create a function that considers a given number of samples (rivers), each with a given number of observations. Given 10 samples, each with 12 observations drawn from a lognormal distribution with mean = 4 and sd = 1.4, I would like to count how often a particular value (6, a standard threshold for water quality measurement) is exceeded.
The following is the code for one experiment, with "limit" being the maximum number of observations allowed to exceed 6.
set.seed(1001)
nobs <- 12
limit <- round(0.10 * nobs, digits = 0)
h2o <- as.data.frame(matrix(rnorm(10 * 12, mean = 4, sd = 1.4), ncol = 12))
rownames(h2o) <- paste0("Riv", 1:nrow(h2o))
colnames(h2o) <- paste0("Obs", 1:ncol(h2o))
# How many rivers are declared impaired, under the assumption that a river is impaired when 2 or more of its observations reach the threshold?
ifelse(h2o >= 6, 1, 0)
h2o$Test <- rowSums(ifelse(h2o >= 6, 1, 0))
length(h2o$Test[h2o$Test > 1])
The function should reproduce the steps above and work for different numbers of observations and samples.
Thanks
Here's a function using dplyr and tidyr from the tidyverse.
library(tidyverse)
test_h2o <- function(data, threshold_quality = 6, limit = 1) {
  table <- data %>%
    rownames_to_column("river") %>%
    gather(observation, value, -river) %>%
    mutate(over_lim = value > threshold_quality)
  table_wide <- table %>%
    select(river, observation, over_lim) %>%
    mutate(over_lim = as.integer(over_lim)) %>%
    spread(observation, over_lim)
  summary <- table %>%
    group_by(river) %>%
    summarize(over_lim_count = sum(over_lim))
  result <- summary %>%
    summarize(num_impaired = sum(over_lim_count > limit))
  list(table_wide, summary, result)
}
Here's the output, meant to show the steps in your example:
> test_h2o(h2o)
[[1]]
river Obs1 Obs10 Obs11 Obs12 Obs2 Obs3 Obs4 Obs5 Obs6 Obs7 Obs8 Obs9
1 Riv1 0 0 0 0 0 0 0 0 0 0 0 0
2 Riv10 0 0 0 0 0 0 0 0 0 0 0 0
3 Riv2 0 0 1 0 0 0 0 0 0 0 0 0
4 Riv3 1 0 0 0 0 0 0 0 0 0 0 0
5 Riv4 0 0 0 0 0 0 0 0 0 0 0 1
6 Riv5 0 1 0 0 0 0 0 0 0 0 0 0
7 Riv6 0 1 1 0 0 1 0 0 0 0 0 0
8 Riv7 0 0 0 0 0 0 0 0 0 0 0 0
9 Riv8 0 0 0 0 0 0 0 1 0 1 0 0
10 Riv9 1 0 0 0 0 0 0 0 0 0 0 0
[[2]]
# A tibble: 10 x 2
river over_lim_count
<chr> <int>
1 Riv1 0
2 Riv10 0
3 Riv2 1
4 Riv3 1
5 Riv4 1
6 Riv5 1
7 Riv6 3
8 Riv7 0
9 Riv8 2
10 Riv9 1
[[3]]
# A tibble: 1 x 1
num_impaired
<int>
1 2
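For comparison, the same impairment count can be sketched in a few lines of base R (using >= as in the question's original check; the function and column names are assumptions):

```r
count_impaired <- function(data, threshold = 6, limit = 1) {
  # Observations per river at or above the threshold
  over <- rowSums(data >= threshold)
  # Rivers with more than `limit` such observations
  sum(over > limit)
}

# Deterministic toy data: river 1 exceeds twice, river 2 once
h2o <- data.frame(Obs1 = c(7, 7), Obs2 = c(8, 1), Obs3 = c(1, 1))
count_impaired(h2o)  # 1
```
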
