Get all the rows with rownames starting with ABC111 - r

We have the following data frame :
col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1
How can I get all the rows whose rownames start with "ABC111", as follows:
ABC111001 12 12 13
ABC111002 3 4 5

Given the sample data:
data <- read.table(header=TRUE, row.names=1, sep=" ", text="x col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1")
... you can select matching rows using grep:
> data[grep('^ABC111', rownames(data)),]
col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5

You could use e.g. substr or grepl:
df <- read.table(header=TRUE, row.names=1, sep=" ", text="col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1")
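needle <- "ABC111"
# Option 1: compare the leading characters of each row name with the needle
i <- substr(row.names(df), 1, nchar(needle)) == needle
# Option 2: anchored regular expression (gives the same logical vector)
i <- grepl(paste0("^", needle), row.names(df))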
df[i,]
# col1 col2 col3
# ABC111001 12 12 13
# ABC111002 3 4 5
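As a further option, base R (3.3.0 and later) also provides startsWith(), which tests a literal prefix without building a regular expression. This is only a minimal sketch on the same sample data; the names df, needle and res are illustrative:

```r
# Same sample data as in the answers above; the first column becomes the row names
df <- read.table(header = TRUE, row.names = 1, text = "x col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1")

needle <- "ABC111"
# startsWith() returns TRUE for every row name that begins with the needle
res <- df[startsWith(rownames(df), needle), ]
print(res)
```

Because the prefix is matched literally, characters such as "." or "[" in the needle need no escaping, which can matter if the prefix comes from user input.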

Related

Remove rows with all NA values after groupby r

I want to remove rows that have all NAs after using group_by. Here is a sample dataset:
df=data.frame(Col1=c("B","B","C","D",
"P1","P2","P3")
,Col2=c(NA,8,NA,9,10,8,9)
,Col3=c(NA,7,6,8,NA,7,8)
,Col4=c(NA,NA,7,7,NA,7,7))
I want to group by Col1 and remove rows whose remaining column values are all NA.
So the desired output is:
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7
Any help would be really appreciated.
You don't need group_by, you can use if_any.
library(dplyr)
filter(df, if_any(-Col1, ~ !is.na(.)))
# Col1 Col2 Col3 Col4
# 1 B 8 7 NA
# 2 C NA 6 7
# 3 D 9 8 7
# 4 P1 10 NA NA
# 5 P2 8 7 7
# 6 P3 9 8 7
There is no need for group_by: keep only the rows that have at least one non-NA value, excluding Col1:
df[ rowSums(!is.na(df[, -1])) > 0, ]
# Col1 Col2 Col3 Col4
# 2 B 8 7 NA
# 3 C NA 6 7
# 4 D 9 8 7
# 5 P1 10 NA NA
# 6 P2 8 7 7
# 7 P3 9 8 7
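To make the row-wise test above more visible, the following sketch prints the per-row count of non-NA values before filtering. It assumes the sample data from the question; non_na_count, keep and result are illustrative names only:

```r
df <- data.frame(Col1 = c("B", "B", "C", "D", "P1", "P2", "P3"),
                 Col2 = c(NA, 8, NA, 9, 10, 8, 9),
                 Col3 = c(NA, 7, 6, 8, NA, 7, 8),
                 Col4 = c(NA, NA, 7, 7, NA, 7, 7))

# Count the non-NA values in each row, ignoring the key column Col1
non_na_count <- rowSums(!is.na(df[, -1]))
print(non_na_count)

# A row is kept as soon as it has at least one non-NA value
keep <- non_na_count > 0
result <- df[keep, ]
print(result)
```

Only the first row has a count of zero, so it is the only row dropped. The test looks at each row on its own, which is why grouping by Col1 does not change the outcome.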
With the help of r2evans's answer, I was able to remove the all-NA rows after grouping with group_by. Thanks to everyone who answered.
library(dplyr)
df %>% group_by(Col1) %>% filter(if_any(everything(), ~ !is.na(.)))
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7

R way to select unique columns from column name?

My issue is that I have a big database of 283 columns, some of which have the same name (for example, "uncultured").
Is there a way to select columns avoiding those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care for their contribution, I'd just like to take the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!
Something like this should work:
df[, !duplicated(colnames(df))]
This way you automatically keep the first column for each unique name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
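Both answers keep the first occurrence of each repeated name. If the goal were instead to drop every column whose name occurs more than once (one possible reading of "avoiding those with repeated names"), a sketch along these lines might work. The variable names first_kept, repeated and none_repeated are illustrative:

```r
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")

# Variant 1 (as in the answers): keep the first column of each name
first_kept <- df[, !duplicated(colnames(df))]

# Variant 2 (assumption: all copies of a repeated name should go)
repeated <- unique(colnames(df)[duplicated(colnames(df))])
none_repeated <- df[, !(colnames(df) %in% repeated)]

print(names(first_kept))
print(names(none_repeated))
```

The difference is that duplicated() only flags the second and later occurrences, whereas the %in% test flags every column that shares a repeated name.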

Renaming Column Names in R

I would like to rename my columns from my data frame which have duplicated column names where my columns are a,b and c.
> df
a b c a b c
1 6 11 1 4 4
2 7 12 2 8 12
3 8 13 3 7 7
4 9 14 5 7 11
5 10 15 44 2 13
I could change the columns name by taking out column 1:3 as df1, but is there a way to loop it if I have 1000 column names to change?
df1 <- df[,1:3]
colnames(df1) <- paste(colnames(df1), "test1" , sep = '_')
If you know that the same 3 column names repeat in the same order, you could just use rep here with the each option (df is the data frame defined under Data below):
namenums <- rep(1:(ncol(df)/3), each=3)
colnames(df) <- paste0(colnames(df), "_test", namenums)
df
a_test1 b_test1 c_test1 a_test2 b_test2 c_test2
1 1 6 11 1 6 11
2 2 7 12 2 7 12
3 3 8 13 3 8 13
4 4 9 14 4 9 14
5 5 10 15 5 10 15
Data:
df <- data.frame(a=c(1:5), b=c(6:10), c=c(11:15), a=c(1:5), b=c(6:10), c=c(11:15))
names(df) <- c("a", "b", "c", "a", "b", "c")
We can use make.unique
names(df) <- make.unique(names(df))
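For comparison, here is a sketch of what make.unique produces on the sample names, together with a variant that numbers each occurrence without assuming the names repeat in a fixed order. The ave() approach and the names nm, occurrence and numbered_names are illustrative additions, not part of the answers above:

```r
# check.names = FALSE keeps the duplicated column names as given
df <- data.frame(a = 1:5, b = 6:10, c = 11:15, a = 1:5, b = 6:10, c = 11:15,
                 check.names = FALSE)
nm <- names(df)

# make.unique leaves the first occurrence alone and appends .1, .2, ... to repeats
unique_names <- make.unique(nm)
print(unique_names)

# Number every occurrence of each name: 1 for the first "a", 2 for the second, ...
occurrence <- ave(seq_along(nm), nm, FUN = seq_along)
numbered_names <- paste0(nm, "_test", occurrence)
print(numbered_names)
```

Either vector can then be assigned with names(df) <- ..., depending on which naming scheme is wanted.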

How to combine rows in two dataframes into one dataframe in R?

This is what I want to do:
Combine
df1
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
and
df2
Col1 Col2 Col3
4 6 3
5 7 8
9 1 2
into this:
df3
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
4 6 3
5 7 8
9 1 2
And I'm sorry if someone has asked this already.
Thanks!
You can just do:
df3 <- rbind(df1, df2)
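A self-contained sketch with the sample values from the question, assuming both data frames have the same column names (rbind on data frames matches columns by name):

```r
df1 <- data.frame(Col1 = c(7, 6, 3), Col2 = c(1, 2, 6), Col3 = c(8, 9, 3))
df2 <- data.frame(Col1 = c(4, 5, 9), Col2 = c(6, 7, 1), Col3 = c(3, 8, 2))

# Stack the rows of df2 underneath the rows of df1
df3 <- rbind(df1, df2)
print(df3)
```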

Use a column in a data.frame to select another

How do I use one variable in a data.frame to refer to another?
say I have:
col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8
and I want:
col col1 col2 answer
"col1" 1 5 1
"col2" 2 6 6
"col1" 3 7 3
"col2" 4 8 8
I tried
df$answer = df[,df$col]
but it isn't working, and a for loop is taking forever.
I know it's already answered, but I thought another approach might be useful:
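df <- read.table(text='col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8', header=TRUE)
# Index with a two-column (row, column) matrix: one cell per row.
# Matrix-indexing a data frame with mixed column types returns character values, hence as.integer().
df$answer <- as.integer(df[ cbind(seq_len(nrow(df)), match(df$col, names(df))) ])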
df
# col col1 col2 answer
# 1 col1 1 5 1
# 2 col2 2 6 6
# 3 col1 3 7 3
# 4 col2 4 8 8
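A possible variation on the same idea, sketched here under the assumption that all value columns share one numeric type: converting only those columns to a matrix first avoids the detour through character values. The names value_cols, vals and idx are illustrative:

```r
df <- data.frame(col = c("col1", "col2", "col1", "col2"),
                 col1 = 1:4, col2 = 5:8,
                 stringsAsFactors = FALSE)

# Only the value columns go into the matrix, so it stays integer
value_cols <- setdiff(names(df), "col")
vals <- as.matrix(df[value_cols])

# One (row, column) pair per row: row i, and the column named in df$col[i]
idx <- cbind(seq_len(nrow(df)), match(df$col, value_cols))
df$answer <- vals[idx]
print(df)
```

The lookup is a single vectorised indexing step, so it avoids the per-row loop that the question describes as slow.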
This doesn't look very hard, but the solution I found isn't very elegant, and there are probably better ways. You can use match and then subset according to the match:
dat <- read.table(text="col col1 col2
col1 1 5
col2 2 6
col1 3 7
col2 4 8", header = T, stringsAsFactors = FALSE)
cols <- unique(dat$col)
matches <- match(dat$col, cols)
dat$answer <- sapply(seq_along(matches), function (i) {
dat[i,cols[matches[i]]]
})
And the result:
> dat
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Edit
Actually, here's an already much better approach:
dat$answer <- sapply(1:nrow(dat), function(r) {
dat[r,dat$col[r]]
})
This is apparently what you have tried, but using sapply instead of unlist(lapply(...)), so I'm not sure whether it helps.
In this case, with only 2 columns, ifelse might be the fastest and most straightforward solution.
df$answer <- ifelse(df[,1] == "col1", df[,"col1"], df[,"col2"])
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Addition: N8TRO asked in a comment for a more general solution.
A simple switch might be all that is needed:
for(i in 1:nrow(df)) df$ans[i] <- switch(df[i,1],df[i,df[i,1]])
or without a "for" loop:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
example:
df <- data.frame(col=sample(paste0('col',1:5),10,replace=T),col1=1:10,col2=11:20,col3=21:30,col4=31:40,col5=41:50,stringsAsFactors = F)
select the elements:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
df
col col1 col2 col3 col4 col5 ans
1 col1 1 11 21 31 41 1
2 col1 2 12 22 32 42 2
3 col5 3 13 23 33 43 43
4 col2 4 14 24 34 44 14
5 col3 5 15 25 35 45 25
6 col4 6 16 26 36 46 36
7 col5 7 17 27 37 47 47
8 col3 8 18 28 38 48 28
9 col1 9 19 29 39 49 9
10 col5 10 20 30 40 50 50

Resources