Use a column in a data.frame to select another - r

How do I use one variable in a data.frame to refer to another?
Say I have:
col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8
and I want:
col col1 col2 answer
"col1" 1 5 1
"col2" 2 6 6
"col1" 3 7 3
"col2" 4 8 8
This attempt:
df$answer = df[,df$col]
isn't working, and a for loop is taking forever.

I know it's already answered, but I thought another approach might be useful:
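df <- read.table(text = 'col col1 col2
"col1" 1 5
"col2" 2 6
"col1" 3 7
"col2" 4 8', header = TRUE)
# Index with a two-column (row, column) matrix. df mixes character and
# numeric columns, so the lookup returns character; hence as.integer().
df$answer <- as.integer(df[cbind(seq_len(nrow(df)), match(df$col, names(df)))])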
df
# col col1 col2 answer
# 1 col1 1 5 1
# 2 col2 2 6 6
# 3 col1 3 7 3
# 4 col2 4 8 8
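
A variation on the matrix-indexing idea above, as a minimal sketch: restricting the lookup to the value columns keeps the matrix numeric, so no conversion back from character is needed. The names value_cols and idx are made up for illustration, and the sketch assumes every entry of df$col names one of the value columns.

```r
df <- data.frame(col = c("col1", "col2", "col1", "col2"),
                 col1 = 1:4, col2 = 5:8,
                 stringsAsFactors = FALSE)

# Hypothetical helper name: the columns that hold the values to pick from.
value_cols <- c("col1", "col2")

# One (row, column) pair per row of df.
idx <- cbind(seq_len(nrow(df)), match(df$col, value_cols))

# as.matrix() on integer-only columns gives an integer matrix,
# so matrix indexing returns integers directly.
df$answer <- as.matrix(df[value_cols])[idx]
df$answer
```

If some entry of df$col does not match a value column, match returns NA and the corresponding answer is NA as well.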

This doesn't look very hard, but the solution I found isn't very elegant, and there are probably better ways. You can use match and then subset according to the match:
dat <- read.table(text="col col1 col2
col1 1 5
col2 2 6
col1 3 7
col2 4 8", header = T, stringsAsFactors = FALSE)
cols <- unique(dat$col)
matches <- match(dat$col, cols)
dat$answer <- sapply(seq_along(matches), function (i) {
dat[i,cols[matches[i]]]
})
And the result:
> dat
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Edit
Actually, here's a much better approach:
dat$answer <- sapply(1:nrow(dat), function(r) {
dat[r,dat$col[r]]
})
This is apparently what you already tried, only with sapply instead of unlist(lapply(...)), so I'm not sure it helps.

In this case, with only 2 columns, ifelse might be the fastest and most straightforward solution.
df$answer <- ifelse(df[,1] == "col1",df[,"col1"],df[,"col2"])
col col1 col2 answer
1 col1 1 5 1
2 col2 2 6 6
3 col1 3 7 3
4 col2 4 8 8
Addition, since N8TRO asked in a comment for a more general solution:
A simple switch might be all that is needed:
for(i in 1:nrow(df)) df$ans[i] <- switch(df[i,1],df[i,df[i,1]])
or without a "for" loop:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
example:
df <- data.frame(col=sample(paste0('col',1:5),10,replace=T),col1=1:10,col2=11:20,col3=21:30,col4=31:40,col5=41:50,stringsAsFactors = F)
select the elements:
df$ans <- sapply(1:nrow(df),function(i) switch(df[i,1],df[i,df[i,1]]))
df
col col1 col2 col3 col4 col5 ans
1 col1 1 11 21 31 41 1
2 col1 2 12 22 32 42 2
3 col5 3 13 23 33 43 43
4 col2 4 14 24 34 44 14
5 col3 5 15 25 35 45 25
6 col4 6 16 26 36 46 36
7 col5 7 17 27 37 47 47
8 col3 8 18 28 38 48 28
9 col1 9 19 29 39 49 9
10 col5 10 20 30 40 50 50

Related

Remove rows with all NA values after groupby r

I want to remove rows that contain only NAs after using group_by. Here is a sample dataset:
df=data.frame(Col1=c("B","B","C","D",
"P1","P2","P3")
,Col2=c(NA,8,NA,9,10,8,9)
,Col3=c(NA,7,6,8,NA,7,8)
,Col4=c(NA,NA,7,7,NA,7,7))
I want to group by Col1 and remove rows whose values are all NA.
So the desired output is:
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7
Any help would be really appreciated.
You don't need group_by, you can use if_any.
library(dplyr)
filter(df, if_any(-Col1, ~ !is.na(.)))
# Col1 Col2 Col3 Col4
# 1 B 8 7 NA
# 2 C NA 6 7
# 3 D 9 8 7
# 4 P1 10 NA NA
# 5 P2 8 7 7
# 6 P3 9 8 7
There is no need for group_by; keep only the rows with at least one non-NA value, excluding Col1:
df[ rowSums(!is.na(df[, -1])) > 0, ]
# Col1 Col2 Col3 Col4
# 2 B 8 7 NA
# 3 C NA 6 7
# 4 D 9 8 7
# 5 P1 10 NA NA
# 6 P2 8 7 7
# 7 P3 9 8 7
With the help of r2evans' answer, I was able to remove the all-NA rows after grouping with group_by. Thanks to everyone who answered.
library(dplyr)
df %>% group_by(Col1) %>% filter(if_any(everything(), ~ !is.na(.)))
Col1 Col2 Col3 Col4
B 8 7 NA
C NA 6 7
D 9 8 7
P1 10 NA NA
P2 8 7 7
P3 9 8 7

R way to select unique columns from column name?

My issue is that I have a big database of 283 columns, some of which have the same name (for example, "uncultured").
Is there a way to select columns avoiding those with repeated names? Those (bacteria) normally have a very small abundance, so I don't really care for their contribution, I'd just like to take the columns with unique names.
My database is something like
Samples col1 col2 col3 col4 col2 col1....
S1
S2
S3
...
and I'd like to select every column but the second col2 and col1.
Thanks!
Something like this should work:
df[, !duplicated(colnames(df))]
This way you automatically keep the first column for each unique name:
df[unique(colnames(df))]
#> col1 col2 col3 col4 S1 S2 S3
#> 1 1 2 3 4 7 8 9
#> 2 1 2 3 4 7 8 9
#> 3 1 2 3 4 7 8 9
#> 4 1 2 3 4 7 8 9
#> 5 1 2 3 4 7 8 9
Reproducible example
df is defined as:
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")
df
#> col1 col2 col3 col4 col2 col1 S1 S2 S3
#> 1 1 2 3 4 5 6 7 8 9
#> 2 1 2 3 4 5 6 7 8 9
#> 3 1 2 3 4 5 6 7 8 9
#> 4 1 2 3 4 5 6 7 8 9
#> 5 1 2 3 4 5 6 7 8 9
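
If the goal is instead to drop every column whose name occurs more than once (including the first occurrence), a minimal sketch could look like this. That reading of the question is an assumption, and the names nm, repeated, and only_unique are illustrative.

```r
df <- as.data.frame(matrix(rep(1:9, 5), ncol = 9, byrow = TRUE))
colnames(df) <- c("col1", "col2", "col3", "col4", "col2", "col1", "S1", "S2", "S3")

nm <- colnames(df)

# TRUE for every column whose name appears more than once,
# not just for the second and later occurrences.
repeated <- nm %in% nm[duplicated(nm)]

only_unique <- df[, !repeated, drop = FALSE]
colnames(only_unique)
```

Here duplicated alone flags only the later copies, which is why the names it flags are looked up again with %in%.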

Splitting dataframe into chunks by the rows that satisfy the given condition

I have a dataframe similar to:
col1 col2
1 10
1 30
2 60
3 20
3 12
3 51
3 11
I want to divide this dataframe into chunks whenever the value in col2 is bigger than 50:
dataframe #1
col1 col2
1 10
1 30
2 60
dataframe #2
col1 col2
3 20
3 12
3 51
dataframe #3
col1 col2
3 11
I have tried the split function, but on its own it does not serve this task. Is there a generic function to achieve this?
You can use cumsum in split, with several calls to rev so that the rows where col2 > 50 are included in the previous group:
rev(split(df, rev(cumsum(rev(df$col2 > 50)))))
## joran's method (same result, except for names):
split(df, cumsum(df$col2 > 50) - (df$col2 > 50))
Output:
# $`2`
# col1 col2
# 1: 1 10
# 2: 1 30
# 3: 2 60
#
# $`1`
# col1 col2
# 1: 3 20
# 2: 3 12
# 3: 3 51
#
# $`0`
# col1 col2
# 1: 3 11
Without all the revs, each row where col2 > 50 starts a new group instead of closing the previous one:
split(df, cumsum(df$col2 > 50))
# $`0`
# col1 col2
# 1: 1 10
# 2: 1 30
#
# $`1`
# col1 col2
# 1: 2 60
# 2: 3 20
# 3: 3 12
#
# $`2`
# col1 col2
# 1: 3 51
# 2: 3 11
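
To see why subtracting the condition works, here is a self-contained sketch on a plain data.frame built from the question's data (the printed output above appears to come from a data.table). The names hit, grp, and chunks are illustrative only.

```r
df <- data.frame(col1 = c(1, 1, 2, 3, 3, 3, 3),
                 col2 = c(10, 30, 60, 20, 12, 51, 11))

hit <- df$col2 > 50        # TRUE on the rows that should close a chunk
grp <- cumsum(hit) - hit   # a closing row keeps the id of the chunk it ends
grp                        # 0 0 0 1 1 1 2

chunks <- split(df, grp)
chunks
```

The plain cumsum increments on the closing row itself, so subtracting hit pulls that row back into the chunk it ends.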

How to combine rows in two dataframes into one dataframe in R?

This is what I want to do:
Combine
df1
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
and
df2
Col1 Col2 Col3
4 6 3
5 7 8
9 1 2
into this:
df3
Col1 Col2 Col3
7 1 8
6 2 9
3 6 3
4 6 3
5 7 8
9 1 2
And I'm sorry if someone has asked this already.
Thanks!
You can just do:
df3 <- rbind(df1, df2)
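
A self-contained sketch using the question's data. Note that rbind on data frames matches columns by name, so both inputs need the same column names.

```r
# Values copied from the question.
df1 <- data.frame(Col1 = c(7, 6, 3), Col2 = c(1, 2, 6), Col3 = c(8, 9, 3))
df2 <- data.frame(Col1 = c(4, 5, 9), Col2 = c(6, 7, 1), Col3 = c(3, 8, 2))

# Stack the rows of df2 underneath the rows of df1.
df3 <- rbind(df1, df2)
df3
```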

Get all the rows with rownames starting with ABC111

We have the following data frame :
col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1
How do I get all the rows whose rownames start with "ABC111", as follows:
ABC111001 12 12 13
ABC111002 3 4 5
Given the sample data:
data <- read.table(header=TRUE, row.names=1, sep=" ", text="x col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1")
... you can select matching rows using grep:
> data[grep('^ABC111', rownames(data)),]
col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
You could use e.g. substr or grepl:
df <- read.table(header=TRUE, row.names=1, sep=" ", text="col1 col2 col3
ABC111001 12 12 13
ABC111002 3 4 5
ABC000111 7 6 1
ABC000112 9 23 1")
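needle <- "ABC111"
# Option 1: compare the leading characters with substr
i <- substr(row.names(df), 1, nchar(needle))==needle
# Option 2: anchored regular expression with grepl (same result)
i <- grepl(paste0("^", needle), row.names(df))
df[i,]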
# col1 col2 col3
# ABC111001 12 12 13
# ABC111002 3 4 5
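
Since the needle is a literal prefix rather than a pattern, base R's startsWith (available since R 3.3.0) is another option. This is a sketch on the same data, built with data.frame instead of read.table; the name hits is illustrative.

```r
df <- data.frame(col1 = c(12, 3, 7, 9),
                 col2 = c(12, 4, 6, 23),
                 col3 = c(13, 5, 1, 1),
                 row.names = c("ABC111001", "ABC111002", "ABC000111", "ABC000112"))

# startsWith does a plain string comparison, so no regex escaping is needed.
hits <- df[startsWith(rownames(df), "ABC111"), ]
hits
```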
