Adding a column based on values of other columns - r

The variable Jaehrlichkeit is basically a factor with 3 levels: HQ30, HQ100, HQ300. I want R to read Jaehrlichkeit. If Jaehrlichkeit = HQ30, then copy the value from the column intHQ30 in the corresponding row and paste it into the newly created column Intensitaet. Repeat this for HQ100 and HQ300.
I was trying to combine the mutate function with nested ifelse but keep getting errors. Can someone please help me out, or maybe suggest an easier solution?

We can do this with row/column indexing. First, use grep to get the names of the columns that start with 'int' followed by 'HQ' and one or more digits (\\d+). Then get the column index for each row by matching 'Jaehrlichkeit' against the names in 'v1' with the 'int' prefix removed, cbind it with the row sequence, and use the resulting matrix to extract the values from the intHQ columns and assign them to the new 'Intensitaet' column.
v1 <- grep("^intHQ\\d+", names(sub1), value = TRUE)
sub1$Intensitaet <- sub1[v1][cbind(1:nrow(sub1),
match(sub1$Jaehrlichkeit, sub("int", "", v1)))]
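To see the matrix indexing at work, here is a minimal sketch on made-up data. Only the column names come from the question; the values are assumptions chosen so the result is easy to check by eye.

```r
# Hypothetical sample data; only the column names come from the question.
sub1 <- data.frame(
  Jaehrlichkeit = factor(c("HQ30", "HQ100", "HQ300", "HQ30")),
  intHQ30  = c(1, 2, 3, 4),
  intHQ100 = c(10, 20, 30, 40),
  intHQ300 = c(100, 200, 300, 400)
)

# Names of the intHQ columns, in data frame order
v1 <- grep("^intHQ\\d+", names(sub1), value = TRUE)

# For each row, the position in v1 of the column named by Jaehrlichkeit
col_idx <- match(sub1$Jaehrlichkeit, sub("int", "", v1))

# Two-column matrix of (row, column) pairs used to pick one cell per row
idx <- cbind(seq_len(nrow(sub1)), col_idx)
sub1$Intensitaet <- sub1[v1][idx]

sub1$Intensitaet
```

Each row of idx names exactly one cell, so the extraction returns one value per row without any ifelse.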

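Another option would be to split by Jaehrlichkeit and apply a function to each piece. The column is looked up by its exact name, because a pattern such as HQ30 would also match intHQ300. Note that the rows come back grouped by level, i.e.
do.call(rbind, lapply(split(df, df$Jaehrlichkeit), function(i) {
# pick the column named "int" plus this group's level, e.g. intHQ30
i$Intensitaet <- i[[paste0("int", i$Jaehrlichkeit[1])]]; i
}))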

Since Jaehrlichkeit is of type factor, you could do this vectorized:
r <- sub1[,match(paste0("int", levels(sub1$Jaehrlichkeit)), names(sub1))]
sub1$Intensitaet <- r[cbind(seq(nrow(r)), as.numeric(sub1$Jaehrlichkeit))]
First you get the value of columns intHQ100, intHQ30 and intHQ300 in your data frame in the order of levels(sub1$Jaehrlichkeit).
Then you generate the indices and create the Intensitaet column.
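The role of the level order can be checked with a small sketch. The data values below are assumptions; the point is that r is built in the order of levels(sub1$Jaehrlichkeit), so the integer codes of the factor line up with its columns.

```r
# Hypothetical sample data; only the column names come from the question.
sub1 <- data.frame(
  Jaehrlichkeit = factor(c("HQ30", "HQ100", "HQ300")),
  intHQ30  = c(1, 2, 3),
  intHQ100 = c(10, 20, 30),
  intHQ300 = c(100, 200, 300)
)

# Column names in level order (not in data frame order)
lvl_cols <- paste0("int", levels(sub1$Jaehrlichkeit))
r <- sub1[, match(lvl_cols, names(sub1))]

# The factor's integer codes index the columns of r directly
codes <- as.integer(sub1$Jaehrlichkeit)
sub1$Intensitaet <- r[cbind(seq_len(nrow(r)), codes)]

sub1$Intensitaet
```

Because both the columns of r and the integer codes follow the same level order, the lookup stays correct even if the levels are later reordered.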

Related

Find match of exactly the same string in R column that has both numeric and character items

I have a column that contains both numbers and strings. I'd like to find only those rows in which every entry contains a particular string. In this case, I only need rows whose entries are all SE entries.
df :
names
SE123, FE43, SA67
SE167, SE24, SE56, SE34
SE23
FE36, KE90, LS87
DG20, SE34, LP47
SE57, SE39
Result df
names
SE167, SE24, SE56, SE34
SE23
SE57, SE39
My code
df[grep("^SE", as.character(df$names)),]
But this selects every row that has SE. Would somebody please help in achieving the result df? Thanks.
Looking at your expected output it looks like you want to select those rows where every element starts with "SE" where each element is a word between two commas.
Using base R, one method would be to split the strings on "," and select rows where every element startsWith "SE"
df[sapply(strsplit(df$names, ","), function(x)
all(startsWith(trimws(x), "SE"))), , drop = FALSE]
# names
#2 SE167, SE24, SE56, SE34
#3 SE23
#6 SE57, SE39
If you want to find presence of "SE" irrespective of position maybe grepl is a better choice.
df[sapply(strsplit(df$names, ","), function(x)
all(grepl("SE", trimws(x)))), , drop = FALSE]
Make sure names is a character column before calling strsplit; if it is a factor, first run
df$names <- as.character(df$names)
names[!grepl("[A-Z]",gsub("SE","",names))]
[1] "SE167, SE24, SE56, SE34" "SE23" "SE57, SE39"
Here names is the character vector df$names. You can remove SE from all strings and then look for any remaining capital letter. Strings whose entries are all SE entries will not contain any other letter and are thus kept by the filter.
(This also works if you have entries such as 25SE.)
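As a quick check, the split-based filter and the gsub-based filter can be run side by side on the sample data from the question. This is only a sketch; keep_split and keep_gsub are illustrative names.

```r
# Sample data from the question, stored as character (not factor)
df <- data.frame(
  names = c("SE123, FE43, SA67", "SE167, SE24, SE56, SE34", "SE23",
            "FE36, KE90, LS87", "DG20, SE34, LP47", "SE57, SE39"),
  stringsAsFactors = FALSE
)

# Approach 1: split on commas and require every element to start with "SE"
parts <- strsplit(df$names, ",")
keep_split <- sapply(parts, function(x) all(startsWith(trimws(x), "SE")))

# Approach 2: delete "SE" and keep rows with no capital letter left over
keep_gsub <- !grepl("[A-Z]", gsub("SE", "", df$names))

# Both filters select the same rows on this data
stopifnot(identical(keep_split, keep_gsub))
df[keep_split, , drop = FALSE]
```

The two filters agree here, but they are not equivalent in general: the first requires SE at the start of each element, the second accepts SE anywhere.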

if function for rowSums_Modify the code

I want to sum over several columns and store the result in a new column. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes df contains only one matching column, and in that case I get an error. Since this call is part of a longer procedure, I was wondering how to write an if condition so that, when df[,grep("y", names(df))] contains just one column, Sum equals that column, and when it contains at least two columns, Sum is the row-wise sum over them?
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1<-seq(20:69)
#cars$y2<-seq(30:79)
df<-cars
df$Sum <-rowSums(df[,grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <-rowSums(df[,grep("y", names(df)), drop = FALSE])
This keeps the subset a data frame even when only one column is selected, so rowSums still works and no if condition is needed.
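The difference is easy to see on a small example. The data frame below is made up for illustration and has a single column whose name contains y.

```r
# Hypothetical data with exactly one column matching "y"
df <- data.frame(speed = c(4, 7, 8), dist = c(2, 4, 16), y1 = c(1, 2, 3))
y_cols <- grep("y", names(df))

# Without drop = FALSE, a single selected column collapses to a vector
collapsed <- df[, y_cols]
# With drop = FALSE, the result stays a one-column data frame
kept <- df[, y_cols, drop = FALSE]

is.data.frame(collapsed)  # FALSE
is.data.frame(kept)       # TRUE

# rowSums needs at least two dimensions, so only the second form works here
df$Sum <- rowSums(kept)
```

With one matching column, Sum is simply a copy of that column; with several, the same line sums across them.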

How to apply ifelse function by column names?

I know there are many similar questions around but I'm afraid couldn't get my head around this particular one, though obviously it is very simple!
I am trying to write a simple ifelse function to be applied over a series of columns in a data frame by using column names (rather than numbers). What I try to do is to create a single u_all variable as shown below without typing column names repeatedly.
dat <- data.frame(id=c(1:20),u1 = sample(c(0:1),20,replace=T) , u2 = sample(c(0:1),20,replace=T) , u3 = sample(c(0:1),20,replace=T))
dat<-within(dat,u_all<-ifelse (u1==1 | u2==1 |u3==1,1,0))
dat
I tried many variants of apply but clearly I'm not on the right track as those grouping functions replicate the ifelse function on each column separately.
dat2 <- data.frame(id=c(1:20),u1 = sample(c(0:1),20,replace=T) , u2 = sample(c(0:1),20,replace=T) , u3 = sample(c(0:1),20,replace=T))
dat2<-cbind(dat2,sapply(dat2[,grepl("^u\\d{1,}",colnames(dat2))],
function(x){ u_all<-ifelse(x==1 & !is.na(x),1,0)}))
dat2
This line from the OP
dat<-within(dat,u_all<-ifelse (u1==1 | u2==1 |u3==1,1,0))
can instead be written as
dat$u_all <- +Reduce("|", dat[, c("u1", "u2", "u3")])
How it works, in terms of intermediate objects:
D = dat[, c("u1", "u2", "u3")] uses the names of the columns to subset the data frame.
r = Reduce("|", D) collapses the data by putting | between each pair of columns. The result is a logical (TRUE/FALSE) vector.
To convert r to a 0/1 integer vector, you could use ifelse(r,1L,0L) or as.integer(r) (since TRUE/FALSE converts to 1/0 by default) or just the unary +, like +r.
If you want to avoid using column names (it's really not clear to me from the post), you can construct D = dat[-1] to exclude the first column instead.
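A small sketch with fixed values (instead of sample) makes the intermediate objects easy to inspect. Selecting the columns by pattern rather than typing each name is an assumption about what the question is after.

```r
# Fixed, hypothetical data so the result is reproducible
dat <- data.frame(
  id = 1:4,
  u1 = c(0, 1, 0, 0),
  u2 = c(0, 0, 0, 1),
  u3 = c(0, 1, 0, 0)
)

# Select the u columns by name pattern instead of listing them
u_cols <- grep("^u\\d+$", names(dat), value = TRUE)

# Reduce applies | pairwise across the columns: (u1 | u2) | u3
r <- Reduce("|", dat[u_cols])

# Unary plus turns TRUE/FALSE into 1/0
dat$u_all <- +r
dat
```

Adding a u4 column later needs no change to this code, since the pattern picks it up.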
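You were almost there. Here's a solution that applies a function over rows and uses any and all to reduce each row's tests to a single 0/1 value. The outer parentheses make sure the multiplication by 1 applies to the combined test.
dat2$u_all <- apply(dat2[,-1], MARGIN=1, FUN=function(x){
(any(x==1) & all(!is.na(x))) * 1
}
)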
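One detail worth checking in an expression like this is operator precedence: in R, * binds more tightly than &. The sketch below, on a made-up row x, compares the two groupings.

```r
# A hypothetical row of 0/1 values
x <- c(0, 1, 0)

# Parsed as any(x == 1) & (all(!is.na(x)) * 1): the result is logical
a <- any(x == 1) & all(!is.na(x)) * 1

# Explicit grouping: the combined test is converted to a number
b <- (any(x == 1) & all(!is.na(x))) * 1

class(a)
class(b)
```

Only the explicitly grouped form yields a numeric 0/1 value; the ungrouped form yields TRUE/FALSE.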

automatic column prefix with cbind and just one column

I have some trouble with a script which uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast this column to a data frame.
Is there a way to get around this behaviour?
In my example, it works fine for columns starting with a but not for b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))
A solution would be to create an intermediate data frame with the log values and rename its columns:
logb = log(df[grep('^b', names(df))])
colnames(logb) = paste0('log.',names(logb))
cbind(df, logb)
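The rename step can be wrapped in a small function so that one and several columns are handled the same way. This is only a sketch: add_log_cols and its prefix argument are hypothetical names, not part of base R.

```r
df <- data.frame(a1 = c(1, 2, 3), a2 = c(3, 4, 5), b1 = c(6, 7, 8))

# add_log_cols is a hypothetical helper, not a base R function.
add_log_cols <- function(data, pattern, prefix = "log.") {
  # Single-bracket subsetting without a comma always returns a data frame
  sel <- data[grep(pattern, names(data))]
  logs <- log(sel)
  # Set the prefix explicitly instead of relying on cbind's naming
  names(logs) <- paste0(prefix, names(sel))
  cbind(data, logs)
}

names(add_log_cols(df, "^a"))
names(add_log_cols(df, "^b"))
```

Setting the names explicitly avoids depending on how cbind names its arguments, which differs between the one-column and multi-column cases.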
What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns).
The column names only come out wrong when just one column is selected.

Filtering a dataframe in row names from a column values

Basically I have a data frame with two columns (target_id and fpkm). I want to keep only those rows whose name in the first column is not duplicated.
For example, in the data frame below there are two rows with almost the same name, comp267138_c0_seq1 and comp267138_c0_seq2. Of these two I want to keep only one, comp267138_c0_seq2, because it has the higher value in column 2.
target_id fpkm
comp247393_c0_seq1 3.197885
comp257058_c0_seq4 1.624577
comp242590_c0_seq1 1.750319
comp77911_c0_seq1 1.293059
comp241426_c0_seq1 1.626589
comp288413_c0_seq1 14.828853
comp294436_c0_seq1 11.555596
comp63603_c0_seq1 1.982386
comp267138_c0_seq1 8.594494
comp267138_c0_seq2 11.134958
comp321623_c0_seq1 6.934149
It appears you only want to consider part of the target_id (the first two components, splitting by _)
If your data.frame is called DT
# create column without the _seqx part
DT$new_id <- sapply(lapply(strsplit(as.character(DT[['target_id']]), '_'), head, 2),
paste, collapse = '_')
library(plyr)
ddply(DT, .(new_id), function(x) x[which.max(x$fpkm),])
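If you would rather not load plyr, the same idea can be sketched in base R. This is a swapped-in alternative, not the method above, and it assumes every target_id ends in _seq followed by digits. Only a few rows of the question's data are used.

```r
# A few rows from the question's data
DT <- data.frame(
  target_id = c("comp247393_c0_seq1", "comp267138_c0_seq1",
                "comp267138_c0_seq2", "comp321623_c0_seq1"),
  fpkm = c(3.197885, 8.594494, 11.134958, 6.934149),
  stringsAsFactors = FALSE
)

# Assumption: the part to ignore is a trailing "_seq" plus digits
DT$new_id <- sub("_seq\\d+$", "", DT$target_id)

# Sort so the highest fpkm comes first within each new_id,
# then keep the first row of each group
ord <- DT[order(DT$new_id, -DT$fpkm), ]
best <- ord[!duplicated(ord$new_id), ]
best
```

Ties in fpkm are broken by the original row order, since order is stable.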
