SQL attribute FROM and WHERE in R data.frames

How can I select data from an R data.frame using SQL-like features?
Let's say I have the following data.frame:
Names Numbers
A 1
B 2
C 3
How can I select the number 2 using the strings "B" and "Numbers" rather than data[2,2]? I would like to use something like data["B", "Numbers"], but it doesn't work.

You can use [ or subset() with data.frames. Note that [ has a drop argument (TRUE by default), which coerces the result to an atomic vector when a single value or column is returned.
DF <- data.frame(Names = LETTERS[1:3], Numbers = 1:3)
subset(DF, Names == 'B', select = Numbers)
## Numbers
## 2 2
DF[DF$Names == 'B', 'Numbers']
## [1] 2
DF[DF$Names == 'B', 'Numbers', drop = FALSE]
## Numbers
## 2 2
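The data["B", "Numbers"] form from the question can also be made to work in base R if the Names column is used as row names. This is only a minimal sketch, and it assumes the names are unique, since row names have to be:

```r
DF <- data.frame(Names = LETTERS[1:3], Numbers = 1:3)

# Use the Names column as row names (assumes no duplicated names)
rownames(DF) <- DF$Names

# Character row and column indices now work as the question hoped
value <- DF["B", "Numbers"]
value
```

The row-name lookup relies on exact matches, so it is best suited to small lookup tables like this one.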
I like data.table. FAQ 2.16 describes the similarities between SQL and data.table syntax:
library(data.table)
DT <- data.table(DF)
DT[Names == 'B', Numbers]
## [1] 2
# using keys
setkey(DT,Names)
DT['B'][,list(Numbers)]
## Numbers
## 1: 2
Alternatively, there is sqldf, which lets you run SQL against data.frames:
library(sqldf)
sqldf("select Numbers from DF where Names = 'B'")
## Numbers
## 1 2

Related

How to coerce a character column to a list column

I am trying to bind the rows of several data frames. After aggregation, some of the data frames have a list column, but in others the same column is character, and I can't find a way to bind them. I tried converting the character column using as.list(), but that didn't work.
library(dplyr)
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
> df1
a b
1 1 1, 2
2 2 4
3 3 5, 6
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
> df2
a b
1 4 9
2 5 12
dplyr::bind_rows(df2,df1)
Error in bind_rows_(x, .id) :
Column `b` can't be converted from character to list
I don't know the dplyr library well, but using base R's rbind() below seems to be working:
df1 <- data.frame(a = c(1,2,3),stringsAsFactors = F)
df1$b <- list(c("1","2"),"4",c("5","6"))
df2 <- data.frame(a=c(4,5),b=c("9","12"),stringsAsFactors = F)
result <- rbind(df1, df2)
class(result$a)
[1] "numeric"
class(result$b)
[1] "list"
If you wanted to get this working with bind_rows(), start by looking at the error message. It looks like dplyr doesn't like that one data frame has character data while the other has list data. You could try converting the character column to list and then call bind_rows, e.g.
df2$b <- as.list(df2$b)
dplyr::bind_rows(df2,df1)
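Before binding, it can help to confirm that the conversion produced one list element per row. The following is a base R sketch only (it does not call dplyr), and the name combined is just illustrative:

```r
df1 <- data.frame(a = c(1, 2, 3), stringsAsFactors = FALSE)
df1$b <- list(c("1", "2"), "4", c("5", "6"))
df2 <- data.frame(a = c(4, 5), b = c("9", "12"), stringsAsFactors = FALSE)

# Convert the character column to a list column, one element per row
df2$b <- as.list(df2$b)

# Sanity checks: both columns are now lists of the right length
is.list(df1$b) && is.list(df2$b)
length(df2$b) == nrow(df2)

# With matching column types, the rows can be combined
combined <- rbind(df1, df2)
```

Assigning with df2$b <- as.list(df2$b) keeps the column as a single list column; calling as.list() on the whole data frame would instead return a plain list of columns.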

R - Selecting columns from data table with for loop issue [duplicate]

How can we select multiple columns using a vector of their numeric indices (position) in data.table?
This is how we would do with a data.frame:
df <- data.frame(a = 1, b = 2, c = 3)
df[ , 2:3]
# b c
# 1 2 3
For versions of data.table >= 1.9.8, the following all just work:
library(data.table)
dt <- data.table(a = 1, b = 2, c = 3)
# select single column by index
dt[, 2]
# b
# 1: 2
# select multiple columns by index
dt[, 2:3]
# b c
# 1: 2 3
# select single column by name
dt[, "a"]
# a
# 1: 1
# select multiple columns by name
dt[, c("a", "b")]
# a b
# 1: 1 2
For versions of data.table < 1.9.8, numerical column selection required with = FALSE, e.g. dt[, 2:3, with = FALSE]. See also NEWS on v1.9.8, POTENTIALLY BREAKING CHANGES, point 3.
It's a bit verbose, but I've gotten used to using the special .SD variable together with .SDcols:
b<-data.table(a=1,b=2,c=3,d=4)
b[,.SD,.SDcols=c(1:2)]
It's a bit of a hassle, but as far as I know you don't lose any other data.table features, so you should still be able to use other important operations such as joins.
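One reason the .SD form is worth the extra typing is that the selected columns can be fed straight into a computation. A small sketch, using a made-up three-row table and column sums as the example operation:

```r
library(data.table)

# Hypothetical example table
dt <- data.table(a = 1:3, b = 4:6, c = 7:9)

# Select the first two columns by position
first_two <- dt[, .SD, .SDcols = c(1, 2)]

# Apply a function to only those columns
sums <- dt[, lapply(.SD, sum), .SDcols = c(1, 2)]
```

Here first_two holds columns a and b unchanged, while sums holds one row with the total of each selected column.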
If you want to use column names to select the columns, simply use .(), which is an alias for list():
library(data.table)
dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
dt[ , .(b, c)] # select the columns b and c
# Result:
# b c
# 1: 2 3
# 2: 3 4
From v1.10.2 onwards, you can also use the .. prefix to select the columns named in a character vector:
dt <- data.table(a=1:2, b=2:3, c=3:4)
keep_cols = c("a", "c")
dt[, ..keep_cols]
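For comparison, the .. prefix should give the same result as the older with = FALSE form. A short sketch that checks this on the same table:

```r
library(data.table)

dt <- data.table(a = 1:2, b = 2:3, c = 3:4)
keep_cols <- c("a", "c")

# The .. prefix looks keep_cols up in the calling scope
by_prefix <- dt[, ..keep_cols]

# Older spelling: treat j as a vector of column names
by_with <- dt[, keep_cols, with = FALSE]

# Both should contain columns a and c only
isTRUE(all.equal(by_prefix, by_with))
```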
@Tom, thank you very much for pointing out this solution.
It works great for me.
I was looking for a way to exclude just one column when printing. Building on the example above, you can exclude the second column either with a negative index or by listing the columns to keep:
library(data.table)
dt <- data.table(a=1:2, b=2:3, c=3:4)
dt[,.SD,.SDcols=-2]
dt[,.SD,.SDcols=c(1,3)]

R: change column order in data.table for only some columns

I want to sort two columns to the front of my data.table (id and time in my case). Say I have:
library(data.table)
Data <- as.data.table(iris)
and say I want the order of the columns to be:
example <- copy(Data)
setcolorder(example,c("Species","Petal.Length","Sepal.Length",
"Sepal.Width","Petal.Width"))
but my actual data table has many more variables, so I would like to address this as:
setcolorder(Data, c("Species","Petal.Length",
...all other variables in their original order...))
I played around with something like:
setcolorder(Data,c("Species","Petal.Length",
names(Data)[!c("Species","Petal.Length")]))
but I have a problem subsetting the character vector names(Data) by name. Also, I'm sure I can avoid this workaround with some neat data.table function, no?
We can use setdiff to get all the column names that are not in the vector of names to move (here 'nm1'), then concatenate 'nm1' with that result inside setcolorder:
nm1 <- c("Species", "Petal.Length")
setcolorder(Data, c(nm1, setdiff(names(Data), nm1)))
names(Data)
#[1] "Species" "Petal.Length" "Sepal.Length" "Sepal.Width" "Petal.Width"
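The attempt in the question fails because ! cannot be applied to a character vector. A base R sketch of the name arithmetic, using the built-in iris column names, shows that a logical %in% test gives the same remainder as setdiff (the variable names here are illustrative):

```r
nm1 <- c("Species", "Petal.Length")
all_names <- names(iris)

# Remaining columns, in their original order, two equivalent ways
rest_setdiff <- setdiff(all_names, nm1)
rest_in <- all_names[!all_names %in% nm1]

# Full column order: chosen columns first, then everything else
new_order <- c(nm1, rest_in)
new_order
```

Either remainder can be passed to setcolorder; setdiff is shorter, while the %in% form makes the logical subsetting explicit.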
A convenience function for this is:
setcolfirst = function(DT, ...){
nm = as.character(substitute(c(...)))[-1L]
setcolorder(DT, c(nm, setdiff(names(DT), nm)))
}
setcolfirst(Data, Species, Petal.Length)
The columns are passed without quotes here, but extension to a character vector is easy.
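One possible character-vector variant is sketched below. The name setcolfirst_chr is hypothetical (it is not part of data.table), and the input check is an added assumption about how bad column names should be handled:

```r
library(data.table)

# Hypothetical helper: move the named columns to the front, by reference
setcolfirst_chr <- function(DT, cols) {
  stopifnot(is.character(cols), all(cols %in% names(DT)))
  setcolorder(DT, c(cols, setdiff(names(DT), cols)))
}

Data <- as.data.table(iris)
setcolfirst_chr(Data, c("Species", "Petal.Length"))
names(Data)
```

Because setcolorder works by reference, Data itself is reordered and no assignment is needed.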
You can just do
setcolorder(Data,c("Species","Petal.Length"))
which is similar to using xcols in kdb+/q. ?setcolorder says:
If ‘length(neworder) < length(x)’, the
specified columns are moved in order to the "front" of ‘x’.
My version of data.table is 1.11.4, but it might have been available for earlier versions too.
This is totally a riff off of Akrun's solution, using a bit more functional decomposition and an anaphoric macro because, well why not.
I'm no expert in writing R macros, so this is probably a naive solution.
> toFront <- function(vect, ...) {
c(..., setdiff(vect, c(...)))
}
> withColnames <- function(tbl, thunk) {
.CN = colnames(tbl)
eval(substitute(thunk))
}
> vect = c('c', 'd', 'e', 'a', 'b')
> tbl = data.table(1,2,3,4,5)
> setnames(tbl, vect)
> tbl
c d e a b
1: 1 2 3 4 5
> withColnames(tbl, setcolorder(tbl, toFront(.CN, 'a', 'b') ))
> tbl
a b c d e
1: 4 5 1 2 3

Intersecting multiple columns between two data frames

I have two data frames with 2 columns in each. For example:
df.1 = data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"), col.2 = c("b","c","d","e","c","d","e","d","e","e"))
df.2 = data.frame(col.1 = c("b","b","b","a","a","e"), col.2 = c("a","c","e","c","e","c"))
and I'm looking for an efficient way to look up the row index in df.2 of every col.1 col.2 row pair of df.1. Note that a row pair in df.1 may appear in df.2 in reverse order (for example df.1[1,], which is "a","b" appears in df.2[1,] as "b","a"). That doesn't matter to me. In other words, as long as a row pair in df.1 appears in any order in df.2 I want its row index in df.2, otherwise it should return NA. One more note, row pairs in both data frames are unique - meaning each row pair appears only once.
So for these two data frames the return vector would be:
c(1,4,NA,5,2,NA,3,NA,6,NA)
Maybe something using the dplyr package. First build a reference frame from df.2: use row_number() to record each row's index, and use select() to "flip" the two columns so that both orderings of every pair are covered.
The two halves:
library(dplyr)
df_ref_top <- df.2 %>% mutate(n=row_number())
df_ref_btm <- df.2 %>% select(col.1=col.2, col.2=col.1) %>% mutate(n=row_number())
Then bind them together:
df_ref <- rbind(df_ref_top,df_ref_btm)
Finally, left join df.1 to the reference frame and pull out n to get your answer:
left_join(df.1,df_ref)$n
# Per @thelatemail's comment, here's a more elegant approach:
match(apply(df.1,1,function(x) paste(sort(x),collapse="")),
apply(df.2,1,function(x) paste(sort(x),collapse="")))
# My original answer, for reference:
# Check for matches with both orderings of df.2's columns
match.tmp = cbind(match(paste(df.1[,1],df.1[,2]), paste(df.2[,1],df.2[,2])),
match(paste(df.1[,1],df.1[,2]), paste(df.2[,2],df.2[,1])))
# Convert to single vector of match indices
match.index = apply(match.tmp, 1,
function(x) ifelse(all(is.na(x)), NA, max(x, na.rm=TRUE)))
[1] 1 4 NA 5 2 NA 3 NA 6 NA
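One caveat with building keys via paste(..., collapse="") is that multi-character values can collide: the pair "ab","c" and the pair "a","bc" both collapse to "abc". A sketch of the same idea with a separator, where pair_key is a hypothetical helper and the tab separator is assumed not to occur in the data:

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# Hypothetical helper: order-insensitive key for each row
pair_key <- function(df, sep = "\t") {
  apply(as.matrix(df), 1, function(x) paste(sort(x), collapse = sep))
}

# Row index in df.2 for each row of df.1 (NA when absent)
idx <- match(pair_key(df.1), pair_key(df.2))

# Collision check on two made-up one-row frames
p <- data.frame(x = "ab", y = "c", stringsAsFactors = FALSE)
q <- data.frame(x = "a", y = "bc", stringsAsFactors = FALSE)
same_without_sep <- pair_key(p, "") == pair_key(q, "")
same_with_sep <- pair_key(p) == pair_key(q)
```

For the single-letter values in the question the separator makes no difference, so idx agrees with the result shown above.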
Here's a little function that tests a few of the looping options in R (which was not really intentional, but it happened).
check.rows <- function(data1, data2)
{
df1 <- as.matrix(data1);df2 <- as.matrix(data2);ll <- vector('list', nrow(df1))
for(i in seq(nrow(df1))){
ll[[i]] <- sapply(seq(nrow(df2)), function(j) df2[j,] %in% df1[i,])
}
h <- sapply(ll, function(x) which(apply(x, 2, all)))
sapply(h, function(x) ifelse(is.double(x), NA, x))
}
check.rows(df.1, df.2)
## [1] 1 4 NA 5 2 NA 3 NA 6 NA
And here's a benchmark when row dimensions are increased for both df.1 and df.2. Not too bad I guess, considering the 24 checks on each of 40 rows.
> dim(df.11); dim(df.22)
[1] 40 2
[1] 24 2
> f <- function() check.rows(df.11, df.22)
> microbenchmark(f())
## Unit: milliseconds
## expr min lq median uq max neval
## f() 75.52258 75.94061 76.96523 78.61594 81.00019 100
1) sort/merge First sort the two values within each row of df.2, creating df.2.s, and append a row number column. Then merge this new data frame with df.1 (whose rows are already sorted in the question, so the merged output comes back in df.1's order):
df.2.s <- replace(df.2, TRUE, t(apply(df.2, 1, sort)))
df.2.s$row <- 1:nrow(df.2.s)
merge(df.1, df.2.s, all.x = TRUE)$row
The result is:
[1] 1 4 NA 5 2 NA 3 NA 6 NA
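Since merge sorts its output by the key columns, the result lines up with df.1 only when df.1 is itself sorted. A sketch of one way to drop that assumption is to carry an index column through the merge and reorder afterwards. The names df.1.rev, df.1.idx and idx are illustrative, and the example still assumes each pair in df.1 has col.1 <= col.2:

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# Deliberately reverse df.1 so its rows are no longer in sorted order
df.1.rev <- df.1[nrow(df.1):1, ]

# Sort each pair of df.2 with pmin/pmax and record the original row number
df.2.s <- data.frame(col.1 = pmin(df.2$col.1, df.2$col.2),
                     col.2 = pmax(df.2$col.1, df.2$col.2),
                     row = seq_len(nrow(df.2)),
                     stringsAsFactors = FALSE)

# Carry the position in df.1.rev through the merge, then restore that order
df.1.idx <- cbind(df.1.rev, idx = seq_len(nrow(df.1.rev)))
merged <- merge(df.1.idx, df.2.s, all.x = TRUE)
result <- merged$row[order(merged$idx)]
```

Here result follows the row order of df.1.rev, so it is the reverse of the vector shown above.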
2) sqldf Since dot is an SQL operator, rename the data frames to df1 and df2. Note that, for the same reason, the column names will be transformed to col_1 and col_2 when df1 and df2 are automatically uploaded to the backend database. We sort each pair of df2 using min and max and left join it to df1 (which is already sorted):
df1 <- df.1
df2 <- df.2
library(sqldf)
sqldf("select b.rowid row
from df1
left join
(select min(col_1, col_2) col_1, max(col_1, col_2) col_2 from df2) b
using (col_1, col_2)")$row

