Add specific value to a data.frame column by matching a pattern - r

I have two data.frames:
pattern <- data.frame(pattern = c("A", "B", "C", "D"), val = c(1, 1, 2, 2))
match <- data.frame(match = c("A", "C"))
I want to add another column called new_val to my data.frame pattern and assign "X" to each row where the value of column pattern is in the data.frame match, and "Y" otherwise:
is.element(pattern$pattern, match$match)
[1] TRUE FALSE TRUE FALSE
So, the resulting data.frame should look like:
pattern val new_val
1 A 1 X
2 B 1 Y
3 C 2 X
4 D 2 Y
I managed to do it with an ugly for-loop, but I am sure this can be done in a one-line R command using fancy stuff :-)
Is anyone able to help?
Many thanks!

I'm only really posting this since Tyler said "if you wanted a one liner data.table would likely do it" and I knew it was definitely possible with a one-liner in base. I am also assuming match has been replaced by the vector mat <- c("A", "C"), as in Tyler's answer.
pattern$new_val <- c("Y", "X")[(pattern$pattern %in% mat)+1]
pattern
# pattern val new_val
#1 A 1 X
#2 B 1 Y
#3 C 2 X
#4 D 2 Y
pattern$pattern %in% mat checks which elements of pattern$pattern are in mat, returning TRUE where an element is in mat and FALSE where it is not. Adding 1 turns this into a numeric vector of 1s and 2s that can be used for indexing. We then use it to index the self-defined vector c("Y", "X"); since the index is always 1 or 2, we always grab one of the two elements. So in this case we get "Y" if the pattern wasn't in mat and "X" if it was, which is what you wanted.
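For what it's worth, the steps described above can be broken out one at a time. This is only a sketch: the intermediate names in_mat, idx and new_val_alt are introduced here for illustration, and mat is assumed to be the plain vector c("A", "C"). The last line shows ifelse() as an equivalent alternative to the indexing trick.

```r
pattern <- data.frame(pattern = c("A", "B", "C", "D"), val = c(1, 1, 2, 2))
mat <- c("A", "C")  # assumption: match converted to a plain vector

# Step 1: logical vector, TRUE where pattern$pattern is found in mat
in_mat <- pattern$pattern %in% mat

# Step 2: TRUE/FALSE are coerced to 1/0, so adding 1 gives indices 2/1
idx <- in_mat + 1

# Step 3: index the two-element lookup vector ("Y" is element 1, "X" is element 2)
new_val <- c("Y", "X")[idx]

# Equivalent result with ifelse()
new_val_alt <- ifelse(in_mat, "X", "Y")
```

The indexing version generalizes to more than two labels if the index is built some other way; ifelse() is probably easier to read for the two-label case.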

Here's one way (I renamed your match to mat since there's a pretty important base function named match that you could actually use to solve this problem; in fact %in% is a form of match):
pattern <- data.frame(pattern = c("A", "B", "C", "D"), val = c(1, 1, 2, 2))
mat <- c("A", "C")
pattern$new_val <- "Y" #pre allot everything to be Y
pattern$new_val[pattern$pattern %in% mat] <- "X" #replace any A or C with an X
pattern
PS if you wanted a one liner data.table would likely do it.
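To illustrate the remark that %in% is a form of match, here is a small sketch using base match() directly. The names x, pos, via_match, via_in and new_val are made up for this example; x stands in for pattern$pattern.

```r
x <- c("A", "B", "C", "D")   # stands in for pattern$pattern
mat <- c("A", "C")

# match() returns the position of each element of x in mat, or NA if absent
pos <- match(x, mat)

# An element is "in" mat exactly when match() found a position for it
via_match <- !is.na(pos)
via_in <- x %in% mat

# So the new column could also be built from match() directly
new_val <- ifelse(is.na(pos), "Y", "X")
```

Here pos is 1 NA 2 NA, and via_match agrees with via_in element by element.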
If you wanted something a little more complicated you could use a function from a package I'm working on:
library(qdap)
#original problem
pattern$new_val <- text2color(pattern$pattern, list(c("A", "C")), c("X", "Y"))
#extending it
#makes D a 5
text2color(pattern$pattern, list(c("A", "C"), "D"), c("X", 5, "Y"))
This function really is designed to do something else but if you want to grab the essential parts of it you can look at the source code.

Related

Rename suffix part of column name but keep the rest the same

For now I am redoing a merge because I poorly named the columns, however, I would like to know how to match on a suffix of a column name and rename that part of the column, keeping the rest the same.
For example, if I have a data.frame (could be a data.table too, doesn't matter - I could convert it):
d <- data.frame("ID" = c(1, 2, 3),
"Attribute1.prev" = c("A", "B", "C"),
"Attribute1.cur" = c("D", "E", "F"))
Now imagine that there are hundreds of columns similar to columns 2 & 3 from my sample DT. How would I go through and detect all columns ending in ".prev" change to ".1" and all columns ending in ".cur" change to ".2"?
So, the new column names would be: ID (unchanged), Attribute1.1, Attribute1.2 and so on for as many columns that match.
With base R we may do
names(d) <- sub("\\.prev", ".1", sub("\\.cur", ".2", names(d)))
d
# ID Attribute1.1 Attribute1.2
# 1 1 A D
# 2 2 B E
# 3 3 C F
With the stringr package you could also use
library(stringr)
names(d) <- str_replace_all(names(d), c("\\.prev" = ".1", "\\.cur" = ".2"))
If, instead of Attribute1 and Attribute2, some of your names contain other dots or spaces, you could also change the "\\.prev" and "\\.cur" patterns to "\\.prev$" and "\\.cur$" to make sure they only match at the end of the column names.
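A quick sketch of why the $ anchor matters, using made-up column names (nms is hypothetical, not from the question) where .prev and .cur also appear in the middle of a name:

```r
# Hypothetical names where the suffix text also occurs mid-name
nms <- c("ID", "Attr.prev.x.prev", "Attr.cur.y.cur")

# Unanchored: sub() replaces the first occurrence, which is not the suffix here
unanchored <- sub("\\.prev", ".1", sub("\\.cur", ".2", nms))

# Anchored with $: only a match at the very end of the name is replaced
anchored <- sub("\\.prev$", ".1", sub("\\.cur$", ".2", nms))
```

With these names, unanchored gives "Attr.1.x.prev" and "Attr.2.y.cur", while anchored gives "Attr.prev.x.1" and "Attr.cur.y.2".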
Here's an idea using dplyr & stringr syntax
library(dplyr); library(stringr)
# escape the dot and anchor at the end so only the suffix is replaced
names(d) <-
d %>% names() %>%
str_replace("\\.prev$", ".1") %>%
str_replace("\\.cur$", ".2")
Cheers!
Here is an option with gsubfn
library(gsubfn)
names(d) <- gsubfn("(\\w+)", list(prev = 1, cur = 2), names(d))
names(d)
#[1] "ID" "Attribute1.1" "Attribute1.2"

One liner wanted: Create data frame and give colnames: R data.frame(..., colnames = c("a", "b", "c"))

Is there an easier (i.e. one line of code instead of two!) way to do the following:
results <- as.data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2))
names(results) <- c("a", "b")
Something like:
results <- data.frame(str_split_fixed(c("SampleID_someusefulinfo.countsA" , "SampleID_someusefulinfo.countsB" , "SampleID_someusefulinfo.counts"), "\\.", n=2), colnames = c("a", "b"))
I do this a lot, and would really love to have a way to have this in one line of code.
/data.table works too, if it's easier to do there than in base data.frame/
Clarifying:
My expected output (which is achieved by running the two lines of code at the top - AND I WANT IT TO BE ONE - THAT's IT!!!) is a result data frame of the structure:
results
a b
1 SampleID_someusefulinfo countsA
2 SampleID_someusefulinfo countsB
3 SampleID_someusefulinfo counts
What I would like to do is:
CREATE the data frame from a matrix or with some content (for example the toy code of matrix(c(1,2,3,4),nrow=2,ncol=2) I provided in the first example I wrote)
SPECIFY IN THAT SAME LINE what I would like the column names of my data frame to be
Use setNames() around a data.frame
setNames(data.frame(matrix(c(1,2,3,4),nrow=2,ncol=2)), c("a","b"))
# a b
#1 1 3
#2 2 4
?setNames:
a convenience function that sets the names on an object and returns the object
> setNames
function (object = nm, nm)
{
names(object) <- nm
object
}
We can use the dimnames option in matrix as the OP was using matrix to create the data.
data.frame(matrix(1:4, 2, 2, dimnames=list(NULL, c("a", "b"))))
Or
`colnames<-`(data.frame(matrix(1:4, 2, 2)), c('a', 'b'))
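Applied to the strings from the question, the setNames() idea might look like the sketch below. To keep it free of package dependencies I've swapped str_split_fixed() for base strsplit() plus rbind, which assumes every string contains exactly one dot (true for the three examples, but an assumption in general). The name ids is made up here.

```r
ids <- c("SampleID_someusefulinfo.countsA",
         "SampleID_someusefulinfo.countsB",
         "SampleID_someusefulinfo.counts")

# Split on the literal dot, bind the pieces into a 3 x 2 matrix,
# convert to a data frame and name the columns, all in one expression
results <- setNames(
  as.data.frame(do.call(rbind, strsplit(ids, ".", fixed = TRUE))),
  c("a", "b")
)
```

The same wrapper should work around the original as.data.frame(str_split_fixed(...)) call if you prefer to keep stringr.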

Systematic replace part of variable name with 1st element of an associated R vector

I have a dataframe in which the 1st element of an associated 'name' vector is related to subsequent named numerical vectors. I am attempting to replace the meaningless number with the 1st element of the associated name vector.
Here is an example dataframe:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2))
Each number.name vector is associated with a construct (either A or B in this case) and each number.time is associated with a time dimension. So, data.0.one_minute_ago is actually the number of A's you had one_minute_ago.
What I would like to do (because I have a large dataset with lots of the transformations) is to replace the number.dimension with the construct.dimension, and of course do that for each number. from 0:9
I've written some grep code to begin this task, but to no avail (I am stuck on retaining everything after the number).
grep( "data.[0-9].name" ,names(df), perl=TRUE)
as.character(df[1, 1])
as.character(df[1, 4])
as.character(names(df[2]))
as.character(names(df[3]))
as.character(names(df[5]))
as.character(names(df[6]))
df.1 <- (df[1, grep( "data.[0-9].name" ,names(df))])
df.1 <- data.frame(lapply(df.1, as.character), stringsAsFactors=FALSE)
constructs <- as.character(df.1[1,c(1:2)])
Here the 1st and 2nd element of constructs are the constructs associated with 0.name/0.dimension and 1.name/1.dimension respectively.
constructs [1]
constructs [2]
From there, I'm fairly certain the code would involve some names(df)[] <- but am uncertain on where to go from here.
Any and all help appreciated.
EDIT: here is the desired variable name output, simply changing the variable names (and of course retaining the values associated with them):
data.A.name data.A.one_minute_ago data.A.one_hour_ago data.B.name data.B.one_minute_ago data.B.one_hour_ago
EDIT 2: In my true dataset, the number of repetitions per dimensions (i.e., one_minute_ago, one_hour_ago, one_day_ago) can vary across construct (i.e, two dimensions for one construct and 3 for another, and 9 for another). I would like the solution to take that into account.
Here is a modified sample dataset to reflect this subtlety:
df <- data.frame(data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1,2,1), data.0.one_hour_ago = c(2,2,3),
data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3,3,2), data.1.one_hour_ago = c(5,6,2),
data.2.name = c("C", "C", "C"), data.2.one_minute_ago = c(3,3,2), data.2.one_hour_ago = c(5,6,2), data.2.one_day_ago = c(3,2,3))
We create a grouping index 'indx' based on the number in the column names, then split the column names by 'indx' ('lst'). We get one element from each column having 'name' as suffix ('r1'), then use 'Map' and gsub to replace the number in each element of 'lst' with the corresponding element of 'r1'. Finally, unsplit puts the renamed columns back in their original order.
indx <- gsub('[^0-9]+', '', names(df))
lst <- split(names(df), indx)
r1 <- as.character(unlist(df[1,grep('name', names(df))]))
lst2 <- Map(function(x,y) gsub('[0-9]+', y, x), lst, r1)
names(df) <- unsplit(lst2, indx)
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago"
#[4] "data.B.name" "data.B.one_minute_ago" "data.B.one_hour_ago"
#[7] "data.C.name" "data.C.one_minute_ago" "data.C.one_hour_ago"
#[10] "data.C.one_day_ago"
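As a cross-check of the approach above, here is an alternative sketch that builds a named lookup vector from number to construct instead of using split/unsplit. The names num, name_cols, lookup and new_names are made up for this example, and it assumes every column name has the form data.<number>.<something>, as in the sample data.

```r
df <- data.frame(
  data.0.name = c("A", "A", "A"), data.0.one_minute_ago = c(1, 2, 1), data.0.one_hour_ago = c(2, 2, 3),
  data.1.name = c("B", "B", "B"), data.1.one_minute_ago = c(3, 3, 2), data.1.one_hour_ago = c(5, 6, 2),
  data.2.name = c("C", "C", "C"), data.2.one_minute_ago = c(3, 3, 2), data.2.one_hour_ago = c(5, 6, 2),
  data.2.one_day_ago = c(3, 2, 3),
  stringsAsFactors = FALSE
)

# Extract the number from each column name (assumes the data.<number>.<rest> layout)
num <- sub("^data\\.([0-9]+)\\..*$", "\\1", names(df))

# Columns holding the construct, and a lookup from number to construct
name_cols <- grep("\\.name$", names(df))
lookup <- setNames(as.character(unlist(df[1, name_cols])), num[name_cols])

# Rebuild each name: "data." + construct + everything after the number
new_names <- paste0("data.", lookup[num], sub("^data\\.[0-9]+", "", names(df)))
names(df) <- new_names
```

Because the lookup is keyed by the number itself, it does not matter how many columns each construct has.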
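I think this works (it reuses constructs from your code and assumes exactly three columns per construct, as in your first example):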
library(stringr)
splits <- str_split(names(df), "\\.")
trailing_name <- sapply(splits, "[[", 3)
constructs <- rep(constructs, each = 3)
constructs
# [1] "A" "A" "A" "B" "B" "B"
names(df) <- str_c("data", constructs, trailing_name, sep=".")
names(df)
# [1] "data.A.name" "data.A.one_minute_ago" "data.A.one_hour_ago" "data.B.name"
# [5] "data.B.one_minute_ago" "data.B.one_hour_ago"

How to get row index number for particular name(s)?

How can one determine the row index-numbers corresponding to particular row names? I have a vector of row names, and I would like to use these to obtain a vector of the corresponding row indices in a matrix.
I tried row() and as.integer(rownames(matrix.object)), but neither seems to work.
In addition to which, you can look at match:
m <- matrix(1:25, ncol = 5, dimnames = list(letters[1:5], LETTERS[1:5]))
vec <- c("e", "a", "c")
match(vec, rownames(m))
# [1] 5 1 3
Try which:
which(rownames(matrix.object) %in% c("foo", "bar"))
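One difference between the two suggestions is worth seeing side by side. In this small sketch (by_match, by_which and missing are names made up here), match() returns indices in the order of the lookup vector, while which() with %in% returns them in row order:

```r
m <- matrix(1:25, ncol = 5, dimnames = list(letters[1:5], LETTERS[1:5]))
vec <- c("e", "a", "c")

# match(): one index per element of vec, in the order of vec
by_match <- match(vec, rownames(m))

# which() + %in%: indices of the matching rows, in the order of the matrix rows
by_which <- which(rownames(m) %in% vec)

# A name that is not a row name gives NA with match(); which() would just drop it
missing <- match("z", rownames(m))
```

Here by_match is 5 1 3 and by_which is 1 3 5, so which one you want depends on whether the order of your name vector matters.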

subsetting in r based on a vector of conditions

This is a restatement of my poorly worded previous question. (To those who replied to it, I appreciate your efforts, and I apologize for not being as clear with my question as I should have been.) I have a large dataset, a subset of which might look like this:
a<-c(1,2,3,4,5,1)
b<-c("a","b","a","b","c","a")
c<-c("m","f","f","m","m","f")
d<-1:6
e<-data.frame(a,b,c,d)
If I want the sum of the entries in the fourth column based on a specific condition, I could do something like this:
attach(e)
total<-sum(e[which(a==3 & b=="a"),4])
detach(e)
However, I have a "vector" of conditions (call it condition_vector), the first four elements of which look more like this:
a==3 & b == "a"
a==2
a==1 & b=="a" & c=="m"
c=="f"
I'd like to create a "generalized" version of the "total" formula above that produces a results_vector of totals by reading in the condition_vector of conditions. In this example, the first four entries in the results_vector would be calculated conceptually as follows:
results_vector[1]<-sum(e[which(a==3 & b=="a"),4])
results_vector[2]<-sum(e[which(a==2),4])
results_vector[3]<-sum(e[which(a==1 & b=="a" & c=="m"),4])
results_vector[4]<-sum(e[which(c=="f"),4])
My actual data set has more than 20 variables. So each record in the condition_vector can contain anywhere from 1 to more than 20 conditions (as opposed to between 1 and 3 conditions, used in this example).
Is there a way to accomplish this other than using an eval(parse(text= ... approach (which takes a long time to run on a relatively small dataset)?
Thanks in advance for any help you can provide (and again, I apologize that I wasn't as clear as I should have been last time around).
Spark
Here is a solution using eval(parse(text=..)), even though you obviously find it slow:
cond <- c('a==3 & b == "a"','a==2','a==1 & b=="a" & c=="m"','c=="f"')
names(cond) <- cond
# evaluate each condition inside the data.frame e and sum column d over the matching rows
results_vector <- lapply(cond,function(x)
sum(e[eval(parse(text=x), envir = e),"d"]))
results_vector
$`a==3 & b == "a"`
[1] 3
$`a==2`
[1] 2
$`a==1 & b=="a" & c=="m"`
[1] 1
$`c=="f"`
[1] 11
The advantage of naming your conditions vector is that you can access your results by condition.
results_vector[cond[2]]
$`a==2`
[1] 2
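If the conditions can be written down as R expressions rather than text, one possible way to skip the parse() step is to store them with quote() and evaluate them inside the data frame. This is only a sketch under that assumption (conds and totals are names made up here), and I have not timed it against the eval(parse()) version, so any speed benefit is unverified.

```r
e <- data.frame(a = c(1, 2, 3, 4, 5, 1),
                b = c("a", "b", "a", "b", "c", "a"),
                c = c("m", "f", "f", "m", "m", "f"),
                d = 1:6)

# Conditions stored as unevaluated expressions instead of strings
conds <- list(
  quote(a == 3 & b == "a"),
  quote(a == 2),
  quote(a == 1 & b == "a" & c == "m"),
  quote(c == "f")
)

# Evaluate each condition with the columns of e in scope, then sum d over matching rows
totals <- vapply(conds, function(cond) sum(e$d[eval(cond, e)]), numeric(1))
```

For the four example conditions this gives the same totals as above: 3, 2, 1 and 11.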
Here is a function that takes as arguments the condition for each column (NA if a column has no condition) and sums a selected column of a selected data.frame over the rows that meet all conditions:
conds.by.col <- function(..., sumcol, DF) # pass NA where a column has no condition
{
conds.ls <- list(...)
res.ls <- vector("list", length(conds.ls))
for(i in 1: length(conds.ls))
{
res.ls[[i]] <- which(DF[,i] == conds.ls[[i]])
}
res.ls <- res.ls[which(lapply(res.ls, length) != 0)]
which_rows <- Reduce(intersect, res.ls)
return(sum(DF[which_rows , sumcol]))
}
Test:
a <- c(1,2,3,4,5,1)
b <- c("a", "b", "a", "b", "c", "a")
c <- c("m", "f", "f", "m", "m", "f")
d <- 1:6
e <- data.frame(a, b, c, d)
conds.by.col(3, "a", "f", sumcol = 4, DF = e)
#[1] 3
For multiple conditions, mapply:
#all conditions in a data.frame:
myconds <- data.frame(con1 = c(3, "a", "f"),
con2 = c(NA, "a", NA),
con3 = c(1, NA, "f"),
stringsAsFactors = FALSE)
mapply(conds.by.col, myconds[1,], myconds[2,], myconds[3,], MoreArgs = list(sumcol = 4, DF = e))
#con1 con2 con3
# 3 10 6
I guess "efficiency" isn't the first word that comes to mind when looking at this, though...

Resources