Which() for the whole dataset - r

I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)

Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] "character"
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
library(data.table)
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because the order of your data has changed, but note that the answer is still correct.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.

The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
# [1] 2 3 1
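For comparison, here is a minimal base R sketch that needs no extra package. It pastes the columns of each row into one composite key and looks the keys up with match(). The names case_key, data_key and row_idx are made up for illustration, and the approach assumes the separator character never occurs in the data itself.

```r
crit1 <- c(1, 1, 2)
crit2 <- c("yes", "no", "no")
data1 <- c(1, 2, 1)
data2 <- c("no", "no", "yes")

# One composite key per row; "\r" is assumed not to appear in any value
case_key <- paste(crit1, crit2, sep = "\r")
data_key <- paste(data1, data2, sep = "\r")

# For each data row: index of the first matching row in the cases (NA if none)
row_idx <- match(data_key, case_key)
row_idx
# [1] 2 3 1
```

Because match() returns only the first hit, this is only suitable when the rows of the cases table are unique.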


data.table assignment by reference modifies wrong object

I experience some unexpected behavior when using grouped modification of a column in a data.table:
# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1
# copying data to data_temp
data_temp <- data
# assigning some random value to data_temp so that it should no longer be a
# copy of "data"
data_temp[1, "random_value"] <- rnorm(1)
# converting data_temp to data.table
setDT(data_temp)
# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]
data_temp comes out as expected with only the "C" sequence entries remaining. However, I would also expect the "data" object to remain unchanged. This is not the case. The "data" object looks as follows:
sequence trim random_value
1 A 2 NA
2 A 2 NA
3 B 2 NA
4 B 2 NA
5 B 2 NA
6 C 0 NA
7 C 0 NA
8 C 0 NA
9 D 1 NA
10 D 1 NA
So the assignment by reference of the "trim" variable also happened in the original data.frame.
I am using data.table_1.11.4 and R version 3.4.3 for compatibility reasons.
Is the error a result of using old versions or am I doing something wrong / do I need to change the code to avoid that error?
As @Roland kindly pointed out in his comment on the original question, you need to use the copy() function to copy an object explicitly before modifying it by reference with data.table. A plain assignment such as data_temp <- data does not duplicate the underlying column vectors, so a later := can modify columns that both objects still share. As @Imo checked, only columns that are changed in just one of the two data.frames by a regular (non-reference) assignment (e.g. "random_value" in the example) are actually copied / unlinked.
The issue can be easily fixed by using the copy() function:
library(data.table)
# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1
# copying data to data_temp explicitly
data_temp <- copy(data)
# assigning some random value to data_temp so that it should no longer be a
# copy of "data" - if the copy() function isn't used, that just unlinks the
# "random_value" column, but not the others
data_temp[1, "random_value"] <- rnorm(1)
# converting data_temp to data.table
setDT(data_temp)
# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]
So it's necessary to use the copy() function whenever you don't want data.table modifications by reference on the copied table to affect the original table (or vice versa), even if the objects are not (yet) data.table objects at the time you copy them.
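A stripped-down sketch of the same point, using a small made-up data.frame (original and work are illustrative names, not from the question). It only shows the safe side: after copy(), a grouped := on the copy leaves the original untouched.

```r
library(data.table)

original <- data.frame(g = c("A", "A", "B"), v = c(1, 0, 0))

# Explicit deep copy: no column vectors are shared with `original`
work <- copy(original)
setDT(work)

# Grouped assignment by reference, applied to the copy only
work[, v := sum(v), by = g]

original$v
# [1] 1 0 0
work$v
# [1] 1 1 0
```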

R data.table intersection of all groups

I want to have the intersection of all groups of a data table. So for the given data:
data.table(a=c(1,2,3, 2, 3,2), myGroup=c("x","x","x", "y", "z","z"))
I want to have the result:
2
I know that
Reduce(intersect, list(c(1,2,3), c(2), c(3,2)))
will give me the desired result but I didn't figure out how to produce a list of groups of a data.table query.
I would try using Reduce in the following way (assuming dt is your data)
Reduce(intersect, dt[, .(list(unique(a))), myGroup]$V1)
## [1] 2
Here's one approach.
nGroups <- length(unique(dt[,myGroup]))
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
And here it is with some explanatory comments.
## Mark down the number of groups in your data set
nGroups <- length(unique(dt[,myGroup]))
## Then, use `by="a"` to examine in turn subsets formed by each value of "a".
## For subsets having the full complement of groups
## (i.e. those for which `length(unique(myGroup))==nGroups)`,
## return the value of "a" (stored in .BY).
## For the other subsets, return NULL.
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
If that code and the comments aren't clear on their own, a quick glance at the following might help. Basically, the approach above is just looking for and reporting the value of a for those groups that return x,y,z in column V1 below.
dt[,list(list(unique(myGroup))), by="a"]
# a V1
# 1: 1 x
# 2: 2 x,y,z
# 3: 3 x,z
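If a list of per-group vectors is all that is needed, a base R sketch with split() might also do. It assumes the same columns a and myGroup as in the question and should behave the same on a data.table, since it only uses dt$a and dt$myGroup. The names groups and common are illustrative.

```r
dt <- data.frame(a = c(1, 2, 3, 2, 3, 2),
                 myGroup = c("x", "x", "x", "y", "z", "z"))

# One vector of `a` values per group
groups <- split(dt$a, dt$myGroup)

# Fold intersect() over the list of groups
common <- Reduce(intersect, groups)
common
# [1] 2
```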

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and new values..
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match only works with first match, so this throws the error number of items to replace is not a multiple of replacement length. Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)] # Replacement values (NA where there is no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Values at matches (leaving alone for NAs)
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]
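A variant that does not rely on the row pattern, sketched under the assumption that each x appears at most once in eds: look up every row of df in eds with match(), then overwrite y only where a match exists. idx and hit are illustrative names.

```r
df  <- data.frame(x = rep(letters[1:4], 5), y = 1:20)
eds <- data.frame(x = c("c", "d"), val = c(101, 102))

# Position in eds for every row of df (NA where df$x has no new value)
idx <- match(df$x, eds$x)
hit <- !is.na(idx)

# Overwrite only the matched rows; unmatched rows keep their old y
df$y[hit] <- eds$val[idx[hit]]
head(df$y, 5)
# [1]   1   2 101 102   5
```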

compare values of data frames with different number of rows

I defined the following function, which takes two data frames, DF_TAGS_LIST and DF_epc_list. Each data frame has a single column, and the two have different numbers of rows. I want to search for each value of DF_TAGS_LIST in DF_epc_list and, if it is found, store it in another data frame.
One example of DF_TAGS_LIST:
TAGS_LIST
3036029B539869100000000B
3036029B537663000000002A
3036029B5398694000000009
3036029B539869400000000C
3036029B5398690000000006
3036029B5398692000000007
And one example of DF_epc_list:
EPC
3036029B539869100000000B
3036029B537663000000002A
3036029B5398690000000006
3036029B5398692000000007
3036029B5398691000000006
3036029B5376630000000034
3036029B53986940000000WF
3036029B5398694000000454
3036029B5398690000000234
3036029B53986920000000FG
In this case, I would like one dataframe output that had the following values:
FOUND_TAGS
3036029B5398690000000006
3036029B5398692000000007
3036029B539869100000000B
3036029B537663000000002A
My function is:
FOUND_COMPARE_TAGS<-function(DF_TAGS_LIST, DF_epc_list){
DF_epc_list<-toString(DF_epc_list)
DF_TAGS_LIST<-toString(DF_TAGS_LIST)
DF_found_epc_tags <- data.frame(DF_found_epc_tags=intersect(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list)); setdiff(union(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list), DF_found_epc_tags$DF_found_epc_tags)
#DF_found_epc_tags <- data.frame(DF_found_epc_tags = DF_TAGS_LIST[unique(na.omit(match(DF_epc_list$DF_epc_list, DF_TAGS_LIST$DF_TAGS_LIST))),])
return(DF_found_epc_tags)
}
It now returns an empty data frame with two columns. I have only recently started programming in R.
You can use %in% or (as I mentioned in my comment) intersect:
DF_TAGS_LIST[DF_TAGS_LIST$TAGS_LIST %in% DF_epc_list$EPC, , drop = FALSE]
# TAGS_LIST
# 1 3036029B539869100000000B
# 2 3036029B537663000000002A
# 5 3036029B5398690000000006
# 6 3036029B5398692000000007
intersect(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
# [1] "3036029B539869100000000B" "3036029B537663000000002A"
# [3] "3036029B5398690000000006" "3036029B5398692000000007"
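Another option is to stack both columns and keep the values that then appear twice (this assumes neither list contains duplicates of its own):
FOUND_TAGS <- data.frame(FOUND_TAGS = c(as.character(DF_TAGS_LIST$TAGS_LIST), as.character(DF_epc_list$EPC)))
FOUND_TAGS <- FOUND_TAGS[duplicated(FOUND_TAGS), , drop = FALSE]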
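Wrapped up as a function in the spirit of the one in the question, a minimal sketch could look like this. found_compare_tags and its arguments are made-up names, and the function takes the two columns as vectors rather than whole data frames. The sample values are shortened from the question.

```r
found_compare_tags <- function(tags, epc) {
  # intersect() keeps each common value once, in the order of `tags`
  data.frame(FOUND_TAGS = intersect(as.character(tags), as.character(epc)))
}

DF_TAGS_LIST <- data.frame(TAGS_LIST = c("3036029B539869100000000B",
                                         "3036029B5398694000000009",
                                         "3036029B5398690000000006"))
DF_epc_list <- data.frame(EPC = c("3036029B5398690000000006",
                                  "3036029B539869100000000B",
                                  "3036029B5376630000000034",
                                  "3036029B5398694000000454"))

found <- found_compare_tags(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
found$FOUND_TAGS
# [1] "3036029B539869100000000B" "3036029B5398690000000006"
```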

Remove an entire column from a data.frame in R

Does anyone know how to remove an entire column from a data.frame in R? For example if I am given this data.frame:
> head(data)
chr genome region
1 chr1 hg19_refGene CDS
2 chr1 hg19_refGene exon
3 chr1 hg19_refGene CDS
4 chr1 hg19_refGene exon
5 chr1 hg19_refGene CDS
6 chr1 hg19_refGene exon
and I want to remove the 2nd column.
You can set it to NULL.
> Data$genome <- NULL
> head(Data)
chr region
1 chr1 CDS
2 chr1 exon
3 chr1 CDS
4 chr1 exon
5 chr1 CDS
6 chr1 exon
As pointed out in the comments, here are some other possibilities:
Data[2] <- NULL # Wojciech Sobala
Data[[2]] <- NULL # same as above
Data <- Data[,-2] # Ian Fellows
Data <- Data[-2] # same as above
You can remove multiple columns via:
Data[1:2] <- list(NULL) # Marek
Data[1:2] <- NULL # did not work in older versions of R
Be careful with matrix-subsetting though, as you can end up with a vector:
Data <- Data[,-(2:3)] # vector
Data <- Data[,-(2:3),drop=FALSE] # still a data.frame
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data-frame
df <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)
to remove just the a column you could do
df <- subset( df, select = -a )
and to remove the b and d columns you could do
df <- subset( df, select = -c(d, b ) )
You can remove all columns between d and b with:
df <- subset( df, select = -c( d : b ) )
As I said above, this syntax works only when the column names are known. It won't work when say the column names are determined programmatically (i.e. assigned to a variable). I'll reproduce this Warning from the ?subset documentation:
Warning:
This is a convenience function intended for use interactively.
For programming it is better to use the standard subsetting
functions like '[', and in particular the non-standard evaluation
of argument 'subset' can have unanticipated consequences.
(For completeness) If you want to remove columns by name, you can do this:
cols.dont.want <- "genome"
cols.dont.want <- c("genome", "region") # if you want to remove multiple columns
data <- data[, !names(data) %in% cols.dont.want, drop = FALSE]
Including drop = FALSE ensures that the result will still be a data.frame even if only one column remains.
The posted answers are very good when working with data.frames. However, these tasks can be pretty inefficient from a memory perspective. With large data, removing a column can take an unusually long amount of time and/or fail due to out of memory errors. Package data.table helps address this problem with the := operator:
library(data.table)
> dt <- data.table(a = 1, b = 1, c = 1)
> dt[,a:=NULL]
b c
[1,] 1 1
I should put together a bigger example to show the differences. I'll update this answer at some point with that.
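In the meantime, a small sketch of removing several columns by reference when the names are held in a variable. The vector cols is an illustrative name, and wrapping it in parentheses tells data.table to use its contents rather than create a column called cols.

```r
library(data.table)

dt <- data.table(a = 1, b = 1, c = 1)

# Column names determined at run time
cols <- c("a", "b")

# Drop them in place; no copy of the remaining columns is made
dt[, (cols) := NULL]
names(dt)
# [1] "c"
```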
There are several options for removing one or more columns with dplyr::select() and some helper functions. The helper functions can be useful because some do not require naming all the specific columns to be dropped. Note that to drop columns using select() you need to use a leading - to negate the column names.
Using the dplyr::starwars sample data for some variety in column names:
library(dplyr)
starwars %>%
select(-height) %>% # a specific column name
select(-one_of('mass', 'films')) %>% # any columns named in one_of()
select(-(name:hair_color)) %>% # the range of columns from 'name' to 'hair_color'
select(-contains('color')) %>% # any column name that contains 'color'
select(-starts_with('bi')) %>% # any column name that starts with 'bi'
select(-ends_with('er')) %>% # any column name that ends with 'er'
select(-matches('^v.+s$')) %>% # any column name matching the regex pattern
select_if(~!is.list(.)) %>% # not by column name but by data type
head(2)
# A tibble: 2 x 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
You can also drop by column number:
starwars %>%
select(-2, -(4:10)) # column 2 and columns 4 through 10
With this you can remove the column and store the result in another variable.
df = subset(data, select = -c(genome) )
Using dplyr, the following works:
data <- select(data, -genome)
as per documentation found here https://www.marsja.se/how-to-remove-a-column-in-r-using-dplyr-by-name-and-index/#:~:text=select(starwars%2C%20%2Dheight)
I just thought I'd add one in that wasn't mentioned yet. It's simple but also interesting because in all my perusing of the internet I did not see it, even though the highly related %in% appears in many places.
df <- df[ , -which(names(df) == 'removeCol')]
Also, I didn't see anyone post grep alternatives. These can be very handy for removing multiple columns that match a pattern.
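A rough sketch of what that could look like, on a made-up data frame whose temporary columns share the prefix tmp_. The grepl() form is the safer of the two: negative indexing with -which() or -grep() drops every column when nothing matches, because the index is then empty.

```r
df <- data.frame(id = 1:3, tmp_a = 4:6, tmp_b = 7:9, value = 10:12)

# Logical index: TRUE for the columns to keep; safe when nothing matches
kept <- df[, !grepl("^tmp_", names(df)), drop = FALSE]
names(kept)
# [1] "id"    "value"

# Index-based variant; guard against an empty match before negating
drop_idx <- grep("^tmp_", names(df))
kept2 <- if (length(drop_idx) > 0) df[, -drop_idx, drop = FALSE] else df
names(kept2)
# [1] "id"    "value"
```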
