R Standard deviation across columns and rows by id - r

I have several data frames that look similar to the following data frame (with many more columns):
id col1 col2 col3 col4 col5
1 4 3 5 4 A
1 3 5 4 9 Z
1 5 8 3 4 H
2 6 9 2 1 B
2 4 9 5 4 K
3 2 1 7 5 J
3 5 8 4 3 B
3 6 4 3 9 C
I want to calculate the standard deviation across specific columns (let's say col2 to col4) grouped by the id. I do not know the column index in every data frame. I only know the names for the columns I want to calculate the standard deviation for.
Is there a way I could do that easily? My original data frames contain around 20 columns and I only want the standard deviation for 10 columns with specific column names grouped by the id.
On top of that, it would be nice if I could add the calculated standard deviations directly to my data frame as a new column, matched by id, looking like this:
id col1 col2 col3 col4 col5 SD
1 4 3 5 4 A SD1
1 3 5 4 9 Z SD1
1 5 8 3 4 H SD1
2 6 9 2 1 B SD2
2 4 9 5 4 K SD2
3 2 1 7 5 J SD3
3 5 8 4 3 B SD3
3 6 4 3 9 C SD3

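You can try:
library(dplyr)
# Pool all values of col2:col4 within each id and take one standard deviation per group.
# Note: cur_data() is deprecated as of dplyr 1.1.0; there, pick(col2:col4)
# can replace select(cur_data(), col2:col4).
df %>%
  group_by(id) %>%
  mutate(SD = sd(unlist(select(cur_data(), col2:col4))))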
# id col1 col2 col3 col4 col5 SD
# <int> <int> <int> <int> <int> <chr> <dbl>
#1 1 4 3 5 4 A 2.12
#2 1 3 5 4 9 Z 2.12
#3 1 5 8 3 4 H 2.12
#4 2 6 9 2 1 B 3.41
#5 2 4 9 5 4 K 3.41
#6 3 2 1 7 5 J 2.62
#7 3 5 8 4 3 B 2.62
#8 3 6 4 3 9 C 2.62

Using data.table
library(data.table)
setDT(df)[, SD := sd(unlist(.SD)), id, .SDcols = col2:col4]
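When only the column names are known, a base R variant can take them from a character vector. This is a minimal sketch and not part of the answers above: `sd_cols` is a hypothetical vector of names, and the data frame is rebuilt from the example in the question.

```r
# Example data from the question
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  col1 = c(4, 3, 5, 6, 4, 2, 5, 6),
  col2 = c(3, 5, 8, 9, 9, 1, 8, 4),
  col3 = c(5, 4, 3, 2, 5, 7, 4, 3),
  col4 = c(4, 9, 4, 1, 4, 5, 3, 9),
  col5 = c("A", "Z", "H", "B", "K", "J", "B", "C")
)

# Hypothetical: names of the columns to pool (their positions are not needed)
sd_cols <- c("col2", "col3", "col4")

# One standard deviation per id, pooling every value in the chosen columns
sd_by_id <- sapply(split(df[sd_cols], df$id), function(g) sd(unlist(g)))

# Map each group's value back onto its rows
df$SD <- unname(sd_by_id[as.character(df$id)])
df
```

Because the columns are selected by name, the same code should work on data frames where the columns sit at different positions, as long as the names in `sd_cols` exist.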

Related

Loop through data frame and get counts. Output to other data frame

I have a 30 * 9 data frame filled with integers 1-9. Each integer can appear multiple times in a column, or not at all.
I basically want to count how many times each number appears, generating a column of 9 counts for each column of the original data frame, so that I end up with a 9 * 9 data frame of counts. I also want a 0 wherever a number does not appear in a particular column.
So far I tried multiple approaches with for loops, tapply, functions etc. But I cannot seem to end up with a result which can be stored directly into a new data frame in a loop.
for (i in seq_along(columnHeaderQuosureList)) {
original_data_frame %>%
group_by(!! columnHeaderQuosureList[[i]]) %>% # Unquote with !!
count(!! columnHeaderQuosureList[[i]]) %>%
print()
}
This works and prints each count for each column. I tried replacing print() with return() and then trying to cbind the returned output with the result_data_frame.
Unfortunately I am getting nowhere, and I do not think my approach is feasible.
Does anyone have any better ideas please?
The function to count instances of unique values in R is table.
# simulated data
df <- as.data.frame(matrix(sample(1:9, 30*9, TRUE), ncol=9))
# use stack to turn column name into a factor column (long format)
table(stack(df))
ind
values V1 V2 V3 V4 V5 V6 V7 V8 V9
1 4 0 1 2 6 3 6 3 2
2 2 2 5 4 4 4 3 3 2
3 4 4 3 7 2 7 1 4 2
4 1 5 1 3 3 5 6 3 5
5 3 4 4 4 4 2 2 3 6
6 7 5 3 2 3 0 1 3 3
7 4 5 3 3 2 1 3 3 1
8 3 2 3 3 5 2 6 4 3
9 2 3 7 2 1 6 2 4 6
Edit: forgot the tricky bit. The output of table is a table object, which looks like a matrix but gets turned into a long format if you try to do as.data.frame. To turn your result into a 9x9 df, use
as.data.frame.matrix(table(stack(df)))
Caveat: if for some reason one of the 9 digits doesn't appear anywhere in the original df, then that row will be skipped (instead of being filled with 0s).
This is very similar to the behaviour of tabulate(), so you can do:
#Create example data
df <- as.data.frame(matrix(sample(1:9, 30*9, TRUE), ncol=9))
#Counts of digits 1-9
as.data.frame(sapply(df, tabulate, nbins=9))
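The caveat about table() skipping a digit that never occurs can also be handled by declaring the levels up front. The following is a sketch, assuming that 1 to 9 is the full set of possible values; the simulated data deliberately omits 9, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated data in which the digit 9 never appears
df <- as.data.frame(matrix(sample(1:8, 30 * 9, TRUE), ncol = 9))

# Fixing the factor levels makes table() report a 0 for absent digits
counts <- as.data.frame(
  sapply(df, function(x) table(factor(x, levels = 1:9)))
)
counts
```

The result should have one row per digit, including a row of zeros for 9.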
Here's a tidyverse solution that is robust with respect to the number of columns and the number of distinct values they contain. I avoid recourse to loops and non-standard evaluation by making the data tidy by pivoting.
First, create some test data
library(tidyverse)
# For reproducibility
set.seed(123)
d <- tibble(
c1=floor(runif(30, 1, 10)),
c2=floor(runif(30, 1, 10)),
c3=floor(runif(30, 1, 10)),
c4=floor(runif(30, 1, 10)),
c5=floor(runif(30, 1, 10)),
c6=floor(runif(30, 1, 10)),
c7=floor(runif(30, 1, 10)),
c8=floor(runif(30, 1, 10)),
c9=floor(runif(30, 1, 10))
)
d
# A tibble: 30 × 9
c1 c2 c3 c4 c5 c6 c7 c8 c9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3 9 6 2 6 8 8 5 5
2 8 9 1 6 3 5 3 3 6
3 4 7 4 4 3 4 7 2 2
4 8 8 3 6 2 3 3 7 6
5 9 1 8 3 4 1 6 1 3
6 1 5 5 2 9 4 5 7 7
7 5 7 8 8 2 6 3 4 4
8 9 2 8 1 1 2 6 4 9
9 5 3 8 5 2 5 9 8 9
10 5 3 4 5 7 2 9 9 7
# … with 20 more rows
Now solve the problem
d %>%
pivot_longer(
everything(),
names_to="Column",
values_to="Value"
) %>%
group_by(Column, Value) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Column,
values_from=N,
id_cols=Value,
values_fill=0
)
# A tibble: 9 × 10
Value c1 c2 c3 c4 c5 c6 c7 c8 c9
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 3 2 3 2 3 1 0 3 1
2 2 1 7 3 4 4 4 3 3 3
3 3 4 4 2 3 6 2 7 3 3
4 4 1 5 6 3 3 7 3 5 4
5 5 5 2 2 5 1 5 3 3 5
6 6 4 1 3 5 3 5 6 2 4
7 7 3 3 3 1 4 3 1 5 3
8 8 2 3 6 1 3 3 2 3 3
9 9 7 3 2 6 3 0 5 3 4

Create a dataframe with all observations unique for one specific column of a dataframe in R

I have a dataframe that I would like to reduce in size by extracting the unique observations. However, I would like to only select the unique observations of one column, and preserve the rest of the dataframe. Because there are certain other columns that have repeat values, I cannot simply put the entire dataframe in the unique function. How can I do this and produce the entire dataframe?
For example, with the following dataframe, I would like to only reduce the dataframe by unique observations of variable a (column 1):
a b c d e
1 2 3 4 5
1 2 3 4 6
3 4 5 6 8
4 5 2 3 6
Therefore, I only remove row 2, because "1" is repeated. The other rows/columns repeat values, but these observations are maintained, because I only assess the uniqueness of column 1 (a).
Desired outcome:
a b c d e
1 2 3 4 5
3 4 5 6 8
4 5 2 3 6
How can I process this and then retrieve the entire dataframe? Is there a configuration for the unique function to do this, or do I need an alternative?
base R
dat[!duplicated(dat$a),]
# a b c d e
# 1 1 2 3 4 5
# 3 3 4 5 6 8
# 4 4 5 2 3 6
dplyr
dplyr::distinct(dat, a, .keep_all = TRUE)
# a b c d e
# 1 1 2 3 4 5
# 2 3 4 5 6 8
# 3 4 5 2 3 6
Another option: per-group, pick a particular value from the duplicated rows.
library(dplyr)
dat %>%
group_by(a) %>%
slice(which.max(e)) %>%
ungroup()
# # A tibble: 3 x 5
# a b c d e
# <int> <int> <int> <int> <int>
# 1 1 2 3 4 6
# 2 3 4 5 6 8
# 3 4 5 2 3 6
library(data.table)
as.data.table(dat)[, .SD[which.max(e),], by = .(a) ]
# a b c d e
# <int> <int> <int> <int> <int>
# 1: 1 2 3 4 6
# 2: 3 4 5 6 8
# 3: 4 5 2 3 6
As for unique, it does have an incomparables argument, but it is not yet implemented for data frames:
unique(dat, incomparables = c("b", "c", "d", "e"))
# Error: argument 'incomparables != FALSE' is not used (yet)
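If uniqueness should depend on more than one column, duplicated() can be given a sub-data-frame of key columns. A minimal sketch on the example data; `keys` is a hypothetical choice of columns.

```r
dat <- data.frame(
  a = c(1, 1, 3, 4),
  b = c(2, 2, 4, 5),
  c = c(3, 3, 5, 2),
  d = c(4, 4, 6, 3),
  e = c(5, 6, 8, 6)
)

# Hypothetical: columns that together define a "duplicate"
keys <- c("a", "b")

# Keep the first row for each distinct combination of the key columns
dedup <- dat[!duplicated(dat[keys]), ]
dedup
```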

R way to select unique columns from column name?

My issue is that I have a big database of 283 columns, some of which have the same name (for example, "uncultured").
Is there a way to select columns avoiding those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care for their contribution, I'd just like to take the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!
Something like this should work:
df[, !duplicated(colnames(df))]
This way you automatically keep the first column for each distinct name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
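Both snippets above keep the first column of each repeated name. If the goal is instead to drop every column whose name occurs more than once, a stricter filter could look like the sketch below; `repeated` and `df_strict` are names introduced here for illustration.

```r
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")

nm <- colnames(df)

# TRUE for every column whose name appears more than once (first occurrence included)
repeated <- nm %in% nm[duplicated(nm)]

# drop = FALSE keeps a data frame even if only one column survives
df_strict <- df[, !repeated, drop = FALSE]
colnames(df_strict)
```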

r - dedupe the rows with value in dataframe

How can I keep only the rows that have a value in a particular column, among rows that are duplicates based on other columns?
Example:
df
A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5
In the above dataset, the first four rows are duplicates based on columns A and C, so among them I want to keep only the rows that have a value in column B.
Desired output,
A B C D
1 5 8 9
2 6 5 8
3 NA 8 5
Thanks.
Using dplyr:
library(dplyr)
df <- read.table(text="A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5", header=TRUE)
# Keep a row if it is the only one in its (A, C) group or if B is not missing
df %>%
  group_by(A,C) %>%
  filter(n()==1|!is.na(B))
A B C D
<int> <int> <int> <int>
1 1 5 8 9
2 2 6 5 8
3 3 NA 8 5
Duplicates back or forwards and not missing on B; or not a duplicate:
anydup <- duplicated(df[c("A","C")]) | duplicated(df[c("A","C")], fromLast=TRUE)
df[(anydup & (!is.na(df$B))) | (!anydup),]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Or use ave to check the length per group as per @HubertL's dplyr answer:
df[!is.na(df$B) | ave(df$B, df[c("A","C")], FUN=length)==1,]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Here is one option with data.table
library(data.table)
setDT(df)[df[, .I[.N==1 | complete.cases(B)] , .(A, C)]$V1]
# A B C D
#1: 1 5 8 9
#2: 2 6 5 8
#3: 3 NA 8 5
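One edge case the answers above do not cover: if a duplicated (A, C) group has no value in B at all, every row of that group is dropped. Whether that is wanted depends on the data. The base R sketch below assumes the first row of such a group should be kept; the rows with A = 4 are hypothetical additions to the example.

```r
df <- data.frame(
  A = c(1, 1, 2, 2, 3, 4, 4),
  B = c(NA, 5, 6, NA, NA, NA, NA),
  C = c(8, 8, 5, 5, 8, 1, 1),
  D = c(7, 9, 8, 6, 5, 2, 3)
)

grp <- interaction(df$A, df$C, drop = TRUE)

has_value    <- !is.na(df$B)
# TRUE for every row whose group contains at least one non-missing B
any_value    <- ave(has_value, grp, FUN = any)
# TRUE for the first row of each group
first_in_grp <- !duplicated(grp)

# Rows with a value in B, plus the first row of groups that have none
res <- df[has_value | (!any_value & first_in_grp), ]
res
```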

I need column sums: sum12 is the sum of columns 1 and 2, and sum34 is the sum of columns 3 and 4

col1 col2 col3 col4
1 4 1 4
2 4 2 5
4 5 3 6
5 6 5 7
I need column sum like
col1 col2 col3 col4 sum12 sum34
1 4 1 4 5 5
2 4 2 5 6 7
4 5 3 6 9 9
5 6 5 7 11 12
We can use transform
transform(df, sum12 = col1 + col2, sum34 = col3 + col4)
Or another option is
df[c("sum12", "sum34")] <- df[c(1,3)] + df[c(2,4)]
df
# col1 col2 col3 col4 sum12 sum34
#1 1 4 1 4 5 5
#2 2 4 2 5 6 7
#3 4 5 3 6 9 9
#4 5 6 5 7 11 12
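With more than two columns per sum, rowSums over columns selected by name may scale better than writing out each addition. A sketch, where `groups` is a hypothetical mapping from each new column name to the columns it adds up:

```r
df <- data.frame(
  col1 = c(1, 2, 4, 5),
  col2 = c(4, 4, 5, 6),
  col3 = c(1, 2, 3, 5),
  col4 = c(4, 5, 6, 7)
)

# Hypothetical mapping: new column name -> columns to add up
groups <- list(
  sum12 = c("col1", "col2"),
  sum34 = c("col3", "col4")
)

for (nm in names(groups)) {
  # rowSums returns a named vector; unname() keeps the new column plain
  df[[nm]] <- unname(rowSums(df[groups[[nm]]]))
}
df
```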
