automatic column prefix with cbind and just one column - r

I have some trouble with a script that uses cbind to add columns to a data frame. I select these columns by regular expression, and I love that cbind automatically provides a prefix if you add more than one column. But this does not work if you append just one column, even if I cast this column as a data frame.
Is there a way to get around this behaviour?
In my example, it works fine for the columns starting with a but not for the b1 column.
df <- data.frame(a1=c(1,2,3),a2=c(3,4,5),b1=c(6,7,8))
cbind(df, log=log(df[grep('^a', names(df))]))
cbind(df, log=log(df[grep('^b', names(df))]))
cbind(df, log=as.data.frame(log(df[grep('^b', names(df))])))

A solution would be to create an intermediate data frame with the log values and rename the columns:
logb = log(df[grep('^b', names(df))])
colnames(logb) = paste0('log.',names(logb))
cbind(df, logb)
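The rename approach above can be wrapped in a small helper so that one column and several columns are handled the same way. This is only a sketch: the function name add_log and its prefix argument are made up for illustration and are not part of the original answer.

```r
df <- data.frame(a1 = c(1, 2, 3), a2 = c(3, 4, 5), b1 = c(6, 7, 8))

# Hypothetical helper: log-transform the columns matching `pattern`
# and always prefix the new column names explicitly.
add_log <- function(data, pattern, prefix = "log.") {
  cols <- grep(pattern, names(data), value = TRUE)  # selected column names
  logged <- log(data[cols])                         # stays a data frame
  names(logged) <- paste0(prefix, cols)             # explicit prefix
  cbind(data, logged)
}

res_a <- add_log(df, "^a")  # two columns selected
res_b <- add_log(df, "^b")  # one column selected
names(res_b)
```

Because the names are set before cbind is called, the result does not depend on how many columns the pattern happens to select.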

What about
cbw <- c("a","b") # columns beginning with
cbw_pattern <- paste0("^",cbw, collapse = "|")
cbind(df, log=log(df[grep(cbw_pattern, names(df))]))
This way you select both patterns at once (all three columns).
The column names only fail to get the prefix when just one column is selected.

Related

subset R data frame using only exact matches of character vector

I would like to subset a data frame (Data) by column names. I have a character vector with column name IDs I want to exclude (IDnames).
What I do normally is something like this:
Data[ ,!colnames(Data) %in% IDnames]
However, I am facing the problem that there is a name "X-360" and another one "X-360.1" in the columns. I only want to exclude the "X-360" (which is also in the character vector), but not "X-360.1" (which is not in the character vector, but extracted anyway). - So I want only exact matches, and it seems like this does not work with %in%.
It seems such a simple problem but I just cannot find a solution...
Update:
Indeed, the problem was that I had duplicated names in my data.frame! It took me a while to figure this out, because when I looked at the subsetted columns with
Data[ ,colnames(Data) %in% IDnames]
it showed "X-360" and "X-360.1" among the names, as stated above.
But it seems the ".1" suffix only appeared when subsetting the data. Before that there were simply two columns with the same name ("X-360"), which happened because the data frame was built from matrices with cbind.
Here is a demonstration of what happened:
D1 <-matrix(rnorm(36),nrow=6)
colnames(D1) <- c("X-360", "X-400", "X-401", "X-300", "X-302", "X-500")
D2 <-matrix(rnorm(36),nrow=6)
colnames(D2) <- c("X-360", "X-406", "X-403", "X-300", "X-305", "X-501")
D <- cbind(D1, D2)
Data <- as.data.frame(D)
IDnames <- c("X-360", "X-302", "X-501")
Data[ ,colnames(Data) %in% IDnames]
       X-360      X-302    X-360.1      X-501
1 -0.3658194 -1.7046575  2.1009329  0.8167357
2 -2.1987411 -1.3783129  1.5473554 -1.7639961
3  0.5548391  0.4022660 -1.2204003 -1.9454138
4  0.4010191 -2.1751914  0.8479660  0.2800923
5 -0.2790987  0.1859162  0.8349893  0.5285602
6  0.3189967  1.5910424  0.8438429  0.1142751
Learned another thing to be careful about when working with such data in the future...
One regex based solution here would be to form an alternation of exact keyword matches:
regex <- paste0("^(?:", paste(IDnames, collapse="|"), ")$")
Data[ , !grepl(regex, colnames(Data))]
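As a quick check of the update above, the following sketch (using a shortened, made-up set of column names) shows that %in% already matches exactly, and that the "X-360.1" name is what duplicated names look like once they are made unique. Checking for duplicates before subsetting may help to spot the problem earlier.

```r
# Shortened, illustrative column names with one duplicate
nms <- c("X-360", "X-400", "X-360", "X-501")
IDnames <- c("X-360", "X-501")

# %in% compares whole strings, so both "X-360" columns are hits
hits <- nms %in% IDnames

# A suffixed name is not matched by %in%
suffixed_hit <- "X-360.1" %in% IDnames

# Position of the first duplicated name (0 would mean no duplicates)
first_dup <- anyDuplicated(nms)

# The same kind of suffix that showed up in the subsetted data frame
unique_nms <- make.unique(nms)
```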

Lookup Comma Seperating Values in R

I am new to this community and currently working on an R project in which I need to look up each comma-separated element of one data frame in any of the columns of another data frame. Here is an example:
#DataFrame1
a=c("AA,BB","BB,CC,FF","CC,DD,GG,FF","GG","")
df1=as.data.frame(a)
#DataFrame2
x=c("AA","XX","BB","YY","ZZ","MM","YY","CC")
y=c("DD","VV","NN","XX","CC","AA","WW","FF")
z=c("CC","AA","YY","GG","HH","OO","PP","QQ")
df2=data.frame(x,y,z)
What I need to do is find whether any of the elements is present in any of the columns (x, y, z) of df2. Take "AA,BB" (the first cell in column a of df1) as an example: "AA" is one element and "BB" is another. If a match is found, I need to identify the matching row or rows; more than one row of df2 may match. I hope I was able to explain the problem well. Experts, please help.
Here is a solution in 2 steps:
# load tidyverse
library(tidyverse)
Step 1: Split the elements separated by comma from df1 in a new data frame new_df
1a) To do this, we first identify the number of columns to be generated
(as the maximum number of elements separated by ,; that is: maximum number of , + 1)
number_new_columns <- max(sapply(df1$a, function(x) str_count(x, ","))) + 1
1b) Generate the new data frame new_df
new_df <- df1 %>%
separate(a, c(as.character(seq_len(number_new_columns)))) # missing pieces will be filled with NA
# Above, we used c(as.character(seq_len(number_new_columns))) to generate column names as numbers -- not very creative :)
Step 2: Identify the position of each unique element from new_df in df2
(hope I understood correctly this second part of the question)
2a) Get the unique elements (from new_df)
unique_elements <- unlist(new_df) %>%
unique()
2b) Get a list whose components contain the positions of each unique element within df2
output <- lapply(unique_elements, function(x) {
which(df2 == x, arr.ind=TRUE)
})
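If the goal is the matching rows of df2 for each cell of df1 (one possible reading of the question), a base R variant without tidyverse could look like the sketch below. The name matching_rows is made up here, and the data frames are rebuilt with the typos from the question corrected.

```r
df1 <- data.frame(a = c("AA,BB", "BB,CC,FF", "CC,DD,GG,FF", "GG", ""),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  x = c("AA", "XX", "BB", "YY", "ZZ", "MM", "YY", "CC"),
  y = c("DD", "VV", "NN", "XX", "CC", "AA", "WW", "FF"),
  z = c("CC", "AA", "YY", "GG", "HH", "OO", "PP", "QQ"),
  stringsAsFactors = FALSE
)

# Split every cell of df1$a on commas: one character vector per cell
elements <- strsplit(df1$a, ",", fixed = TRUE)

# For each cell, the row numbers of df2 that contain at least one element
matching_rows <- lapply(elements, function(el) {
  unname(which(apply(df2, 1, function(r) any(r %in% el))))
})
names(matching_rows) <- df1$a
```

Here matching_rows[[1]] holds the rows of df2 that contain "AA" or "BB" in any column, and the empty cell of df1 yields an empty result.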

How to use laply match to lookup value and append in each row?

I have two data tables as below:
library(data.table)
x <- data.table(id = c(1,1,1,2,2,2,3,3,3,4,4,4), date = as.Date(c("2015-5-26","2015-6-15","2015-4-03","2015-5-26","2015-6-15","2015-4-03","2015-5-26","2015-6-15","2015-4-03","2015-5-26","2015-6-15","2015-4-03")))
y <- data.table(id=c(1,2,3,4),new_id=c(10,20,30,40))
I want to append the new_id column to the data table x and then drop the column id.
I can do this by
merge(x,y,by="id")
But I wanted to try lapply.
So I tried
x[,new_id:=0]
nm <- c("new_id")
x[nm] <- lapply(nm, function(z) y[[z]][match(y$id, x$id)])
It does not seem to match the column.
Also, which method would be more efficient if I have many columns and more rows?
Any help is appreciated.
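One likely cause is the argument order of match: match(y$id, x$id) returns one position per row of y, whereas the lookup needs one position per row of x, that is match(x$id, y$id). The sketch below shows the lookup on plain data frames with a shortened version of the data, so it runs without extra packages; no lapply is needed for a single column.

```r
# Shortened version of the example data, as plain data frames
x <- data.frame(id = c(1, 1, 2, 2, 3, 4),
                date = as.Date(c("2015-05-26", "2015-06-15", "2015-05-26",
                                 "2015-06-15", "2015-05-26", "2015-04-03")))
y <- data.frame(id = c(1, 2, 3, 4), new_id = c(10, 20, 30, 40))

# For each row of x, find the position of its id in y, then pick new_id
x$new_id <- y$new_id[match(x$id, y$id)]

# Drop the old id column
x$id <- NULL
```

With data.table objects, the same lookup can be written as x[, new_id := y$new_id[match(id, y$id)]] followed by x[, id := NULL].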

Adding a column based on values of other columns

The variable Jaehrlichkeit is basically a factor with 3 levels: HQ30, HQ100, HQ300. I want R to read Jaehrlichkeit. If Jaehrlichkeit = HQ30, then copy the value from the column intHQ30 in the corresponding row and paste it into the newly created column Intensitaet. Repeat this for HQ100 and HQ300.
I was trying to combine the mutate function with nested ifelse but keep getting errors. Can someone please help me out, or maybe suggest an easier solution?
We can do this with row/column indexing. Get the names of the columns that start with 'int' followed by 'HQ' and some numbers (\\d+) using grep. Then, get the column index for each row by matching the 'Jaehrlichkeit' with the substring of 'v1', cbind with the row sequence and use that to extract the values from the intHQ columns and assign it to create the 'Intensitaet'
v1 <- grep("^intHQ\\d+", names(sub1), value = TRUE)
sub1$Intensitaet <- sub1[v1][cbind(1:nrow(sub1),
match(sub1$Jaehrlichkeit, sub("int", "", v1)))]
Another option would be to split, and apply, i.e.
do.call(rbind, lapply(split(df, df$Jaehrlichkeit), function(i) {
i$Intensitaet <- i[[grep(i$Jaehrlichkeit[1], names(i))]]; i
}))
Since Jaehrlichkeit is of type factor, you could do this vectorized:
r <- sub1[,match(paste0("int", levels(sub1$Jaehrlichkeit)), names(sub1))]
sub1$Intensitaet <- r[cbind(seq(nrow(r)), as.numeric(sub1$Jaehrlichkeit))]
First you get the value of columns intHQ100, intHQ30 and intHQ300 in your data frame in the order of levels(sub1$Jaehrlichkeit).
Then you generate the indices and create the Intensitaet column.
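Since the question does not include data, here is a small self-contained sketch of the row/column indexing idea. The values in sub1 are invented purely for illustration; only the column names follow the question.

```r
# Invented example data; only the column names come from the question
sub1 <- data.frame(
  Jaehrlichkeit = factor(c("HQ30", "HQ100", "HQ300", "HQ30")),
  intHQ30  = c(1, 2, 3, 4),
  intHQ100 = c(10, 20, 30, 40),
  intHQ300 = c(100, 200, 300, 400)
)

# Names of the intensity columns
v1 <- grep("^intHQ\\d+", names(sub1), value = TRUE)

# Column position (within v1) that each row should read from
col_idx <- match(paste0("int", sub1$Jaehrlichkeit), v1)

# Index the numeric matrix with (row, column) pairs
sub1$Intensitaet <- as.matrix(sub1[v1])[cbind(seq_len(nrow(sub1)), col_idx)]
```

Matching on the pasted column name, rather than on the integer codes of the factor, avoids depending on the order of the factor levels.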

if function for rowSums_Modify the code

I want to sum over several columns and store the result in a new column. So I use
df$Sum <-rowSums(df[,grep("y", names(df))])
But sometimes df includes just one matching column, and in this case I get an error. Since this function is part of a long programming procedure, I was wondering how I can write an if condition such that, if df[,grep("y", names(df))] includes just one column, the sum is equal to df[,grep("y", names(df))], and otherwise, if df[,grep("y", names(df))] has at least two columns, the sum is taken over them.
suppose:
require(stats); require(graphics)
attach(cars)
cars$y1<-seq(20:69)
#cars$y2<-seq(30:79)
df<-cars
df$Sum <-rowSums(df[,grep("y", names(df))])
You can use drop = FALSE when subsetting:
df$Sum <-rowSums(df[,grep("y", names(df)), drop = FALSE])
This keeps the subset a data frame even if you are selecting only one column, so rowSums still works.
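The effect of drop = FALSE can be seen on a small made-up data frame with a single matching column. The anchored pattern "^y" is an assumption here; the question uses the unanchored "y".

```r
# Made-up data with exactly one column whose name starts with "y"
df <- data.frame(speed = c(4, 7, 8), dist = c(2, 4, 16), y1 = c(1, 2, 3))
y_cols <- grep("^y", names(df))

# Without drop = FALSE a single column collapses to a plain vector
dropped <- df[, y_cols]
# With drop = FALSE the result stays a one-column data frame
kept <- df[, y_cols, drop = FALSE]

# rowSums needs something with two dimensions, which `kept` provides
df$Sum <- rowSums(kept)
```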
