R way to select columns with unique names?

My issue is that I have a big database of 283 columns, some of which share the same name (for example, "uncultured").
Is there a way to select columns while skipping those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care about their contribution; I'd just like to keep the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!

Something like this should work:
df[, !duplicated(colnames(df))]

This way you automatically keep the first column for each distinct name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
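As a quick sanity check on the toy df above, the two answers can be compared directly. This is only a sketch; the names kept_a and kept_b are made up for illustration.

```r
# Toy data from the reproducible example above
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")

# Approach 1: logical mask that is FALSE for every repeated name
kept_a <- df[, !duplicated(colnames(df))]

# Approach 2: index by name; a name lookup returns the first matching column
kept_b <- df[unique(colnames(df))]

colnames(kept_a)
colnames(kept_b)
```

Both keep the first occurrence of each name, so the later col2 and col1 columns are dropped.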

Related

Subsetting list by values in a column in r

I want to subset a list of dataframes so that it returns the list in the same structure, but excluding rows from each dataframe that meet a condition in one column.
Say I have the following list:
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
a <- data.frame(col1, col2, col3)
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
b <- data.frame(col1, col2, col3)
col1<- round(rnorm(5, mean = 5), digits = 0)
col2<- round(rnorm(5, mean = 5), digits = 0)
col3<- round(rnorm(5, mean = 5), digits = 0)
c <- data.frame(col1, col2, col3)
my_list <- list(a,b,c)
names(my_list)<-c("df1", "df2", "df3")
This provides a list:
> my_list
$df1
col1 col2 col3
1 3 6 5
2 5 4 4
3 6 5 6
4 5 3 6
5 4 4 4
$df2
col1 col2 col3
1 6 5 5
2 6 5 5
3 5 6 6
4 5 4 5
5 6 5 5
$df3
col1 col2 col3
1 6 7 5
2 6 5 5
3 5 6 4
4 4 6 5
5 5 6 4
Say I want to remove all rows that have values in col3 that are less than 5 producing:
> my_list
$df1
col1 col2 col3
1 3 6 5
3 6 5 6
4 5 3 6
$df2
col1 col2 col3
1 6 5 5
2 6 5 5
3 5 6 6
4 5 4 5
5 6 5 5
$df3
col1 col2 col3
1 6 7 5
2 6 5 5
4 4 6 5
I have tried using lapply to no avail:
result <- lapply(my_list, function(x) {
return(x[x$'col3' < 5])
}
)
> result
$df1
[1] FALSE TRUE FALSE FALSE TRUE
$df2
[1] FALSE FALSE FALSE FALSE FALSE
$df3
[1] FALSE FALSE TRUE FALSE TRUE
Any help would be greatly appreciated!
Base R:
set.seed(1)
l <- lapply(my_list, function(x) subset(x, col3 >= 5))
l
#> $df1
#> col1 col2 col3
#> 1 5 5 5
#> 2 5 5 5
#> 3 4 4 5
#>
#> $df2
#> col1 col2 col3
#> 1 6 5 7
#> 2 3 6 5
#> 4 5 5 5
#>
#> $df3
#> col1 col2 col3
#> 4 4 5 7
#> 5 7 4 7
do.call(rbind, l)
#> col1 col2 col3
#> df1.1 5 5 5
#> df1.2 5 5 5
#> df1.3 4 4 5
#> df2.1 6 5 7
#> df2.2 3 6 5
#> df2.4 5 5 5
#> df3.4 4 5 7
#> df3.5 7 4 7
Created on 2021-02-05 by the reprex package (v1.0.0)
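For what it's worth, the lapply attempt in the question most likely went wrong for two reasons: x[cond] without a comma indexes columns rather than rows, and the condition was inverted (< 5 keeps the rows that should be dropped). A minimal base R sketch with a small hand-made list (the values are invented for illustration, not the question's random data):

```r
my_list <- list(
  df1 = data.frame(col1 = c(3, 5, 6), col2 = c(6, 4, 5), col3 = c(5, 4, 6)),
  df2 = data.frame(col1 = c(6, 5),    col2 = c(5, 4),    col3 = c(3, 5))
)

# The comma matters: x[rows, ] filters rows, while x[cond] would index columns.
# drop = FALSE keeps the result a data frame.
result <- lapply(my_list, function(x) x[x$col3 >= 5, , drop = FALSE])
result
```

The list keeps its names and structure; only the rows with col3 below 5 are gone.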
Here's a tidyverse solution:
library(tidyverse)
result <- function(x) {
x %>%
filter(col3 < 6)
}
map(my_list, result)
This returns a list of data.frames where col3 is less than 6. To match the question (keep rows where col3 is at least 5), use filter(col3 >= 5) instead.
$df1
col1 col2 col3
1 5 4 4
2 4 4 4
$df2
col1 col2 col3
1 6 7 5
$df3
col1 col2 col3
1 6 5 5
2 5 5 5
3 5 5 3
You can combine the results into a single data.frame by using map_df (superseded in purrr 1.0.0 by map() followed by list_rbind()):
map_df(my_list, result)
This gives us:
> map_df(my_list, result)
col1 col2 col3
1 5 4 4
2 4 4 4
3 6 7 5
4 6 5 5
5 5 5 5
6 5 5 3

R Standard deviation across columns and rows by id

I have several data frames that look similar to the following data frame (with many more columns):
id col1 col2 col3 col4 col5
1 4 3 5 4 A
1 3 5 4 9 Z
1 5 8 3 4 H
2 6 9 2 1 B
2 4 9 5 4 K
3 2 1 7 5 J
3 5 8 4 3 B
3 6 4 3 9 C
I want to calculate the standard deviation across specific columns (let's say col2 to col4) grouped by the id. I do not know the column index in every data frame. I only know the names for the columns I want to calculate the standard deviation for.
Is there a way I could do that easily? My original data frames contain around 20 columns and I only want the standard deviation for 10 columns with specific column names grouped by the id.
On top of that, it would be nice if I could directly add the calculated standard deviations to my data frame as a new column according to the id, looking like this:
id col1 col2 col3 col4 col5 SD
1 4 3 5 4 A SD1
1 3 5 4 9 Z SD1
1 5 8 3 4 H SD1
2 6 9 2 1 B SD2
2 4 9 5 4 K SD2
3 2 1 7 5 J SD3
3 5 8 4 3 B SD3
3 6 4 3 9 C SD3
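You can try:
library(dplyr)
# cur_data() is deprecated since dplyr 1.1.0; there, pick(col2:col4)
# replaces select(cur_data(), col2:col4)
df %>%
  group_by(id) %>%
  mutate(SD = sd(unlist(select(cur_data(), col2:col4))))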
# id col1 col2 col3 col4 col5 SD
# <int> <int> <int> <int> <int> <chr> <dbl>
#1 1 4 3 5 4 A 2.12
#2 1 3 5 4 9 Z 2.12
#3 1 5 8 3 4 H 2.12
#4 2 6 9 2 1 B 3.41
#5 2 4 9 5 4 K 3.41
#6 3 2 1 7 5 J 2.62
#7 3 5 8 4 3 B 2.62
#8 3 6 4 3 9 C 2.62
Using data.table
library(data.table)
setDT(df)[, SD := sd(unlist(.SD)), id, .SDcols = col2:col4]
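If neither dplyr nor data.table is available, the same idea can be sketched in base R. This assumes the example data from the question, typed in by hand; cols and sd_by_id are illustrative names, not part of either answer.

```r
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3, 3, 3),
  col1 = c(4, 3, 5, 6, 4, 2, 5, 6),
  col2 = c(3, 5, 8, 9, 9, 1, 8, 4),
  col3 = c(5, 4, 3, 2, 5, 7, 4, 3),
  col4 = c(4, 9, 4, 1, 4, 5, 3, 9),
  col5 = c("A", "Z", "H", "B", "K", "J", "B", "C")
)

# Columns are selected by name, so their positions do not need to be known
cols <- c("col2", "col3", "col4")

# One SD per id, pooling all values of the selected columns in that group
sd_by_id <- sapply(split(df[cols], df$id), function(g) sd(unlist(g)))

# Map each row's id back to its group SD
df$SD <- unname(sd_by_id[as.character(df$id)])
df
```

Like the answers above, this pools every value in col2 to col4 within an id into a single standard deviation, rather than computing one per column.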

How to combine rows in two dataframes into one dataframe in R?

This is what I want to do:
Combine
df1
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
and
df2
Col1 Col2 Col3
4 6 3
5 7 8
9 1 2
into this:
df3
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
4 6 3
5 7 8
9 1 2
And I'm sorry if someone has asked this already.
Thanks!
You can just do:
df3 <- rbind(df1, df2)
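A small sketch with the data from the question. Note that rbind on data frames matches columns by name, so both inputs need the same column names.

```r
df1 <- data.frame(Col1 = c(7, 6, 3), Col2 = c(1, 2, 6), Col3 = c(8, 9, 3))
df2 <- data.frame(Col1 = c(4, 5, 9), Col2 = c(6, 7, 1), Col3 = c(3, 8, 2))

# Stack the rows of df2 underneath the rows of df1
df3 <- rbind(df1, df2)
df3
```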

I need column sums: sum12 is the sum of columns 1 and 2, and sum34 is the sum of columns 3 and 4

col1 col2 col3 col4
1 4 1 4
2 4 2 5
4 5 3 6
5 6 5 7
I need column sums like this:
col1 col2 col3 col4 sum12 sum34
1 4 1 4 5 5
2 4 2 5 6 7
4 5 3 6 9 9
5 6 5 7 11 12
We can use transform
transform(df, sum12 = col1 + col2, sum34 = col3 + col4)
Or another option is
df[c("sum12", "sum34")] <- df[c(1,3)] + df[c(2,4)]
df
# col1 col2 col3 col4 sum12 sum34
#1 1 4 1 4 5 5
#2 2 4 2 5 6 7
#3 4 5 3 6 9 9
#4 5 6 5 7 11 12
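If the columns are easier to address by name than by position, rowSums is another option. A sketch using the question's data:

```r
df <- data.frame(
  col1 = c(1, 2, 4, 5),
  col2 = c(4, 4, 5, 6),
  col3 = c(1, 2, 3, 5),
  col4 = c(4, 5, 6, 7)
)

# rowSums adds the selected columns row by row
df$sum12 <- rowSums(df[c("col1", "col2")])
df$sum34 <- rowSums(df[c("col3", "col4")])
df
```

This scales more comfortably than writing col1 + col2 + ... by hand when a sum spans many columns.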

Arithmetic operation on selective rows in R

I am very new to R, so this question may seem stupid, but please bear with me. Here's what my data looks like:
col1 col2
1 2 9
2 2 2
3 1 8
4 1 1
5 2 4
6 2 5
7 2 3
8 1 10
9 1 6
10 2 7
which can be reproduced with
data <- data.frame(col1 = sample(c(1,2), 10, replace = TRUE),
                   col2 = as.factor(sample(10)))
I want to have all rows in col2 multiplied by 2, if the corresponding value in col1 is "1". So the end result should be like:
col1 col2
1 2 9
2 2 2
3 1 16
4 1 2
5 2 4
6 2 5
7 2 3
8 1 20
9 1 12
10 2 7
Any ideas? Thanks in advance for your help.
If the data were numeric, you could assign to a slice with a simple computation:
> d[d$col1==1,2] <- 2*d[d$col1==1,2]
> d
col1 col2
1 2 9
2 2 2
3 1 16
4 1 2
5 2 4
6 2 5
7 2 3
8 1 20
9 1 12
10 2 7
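With a factor, this becomes problematic as you cannot do the substitution in-place (the existing factor doesn't have the appropriate levels). Instead, you must create a new factor with the desired levels. Convert via as.character first, because as.numeric on a factor returns the internal level codes rather than the displayed values:
vals <- as.numeric(as.character(d$col2))
d$col2 <- as.factor(ifelse(d$col1==1, 2*vals, vals))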
Assuming that the columns are numeric:
transform(df1, col2 = (1 + (col1 == 1)) * col2)
Here's another possibility:
data$col2 <- as.numeric(as.character(data$col2)) * (1 + (data$col1==1))
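The factor pitfall is easy to see on a tiny example. The sketch below first shows the difference between level codes and displayed values, then applies the doubling to the question's data typed in by hand (f and vals are illustrative names):

```r
# as.numeric() on a factor returns level codes, not the displayed values
f <- factor(c(10, 20, 30))
as.numeric(f)                 # level codes
as.numeric(as.character(f))   # displayed values

# The question's data, with col2 stored as a factor
data <- data.frame(
  col1 = c(2, 2, 1, 1, 2, 2, 2, 1, 1, 2),
  col2 = factor(c(9, 2, 8, 1, 4, 5, 3, 10, 6, 7))
)

# Convert to real numbers first, then double the rows where col1 is 1
vals <- as.numeric(as.character(data$col2))
data$col2 <- ifelse(data$col1 == 1, 2 * vals, vals)
data
```

In the question's own data the levels happen to be 1 to 10, so codes and values coincide and the bug would go unnoticed; with any other set of values it would not.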
