How to grep two terms at the same time in R

I have a dataframe as follows
chr Type
1 Tum,B,B,Tum
2 B,B
3 Tum,Tum
4 B,B,B,Tum
I would like to only select those rows which have BOTH Tum and B to be inserted into a new dataframe with the following result:
chr Type
1 Tum,B,B,Tum
4 B,B,B,Tum
I have tried the following
PusungMix <- as.data.frame(Pusung[grep("Barr"&"Tum", Pusung$Type])
but I get the error
Error in "Barr" & "Tum" :
operations are possible only for numeric, logical or complex types

We can call grepl twice to create two logical indices and combine them with &, so that only the rows where both are TRUE are kept. The result can be used to subset the rows of 'df1'.
indx <- grepl('B', df1$Type) & grepl('Tum', df1$Type)
df1[indx,]
# chr Type
#1 1 Tum,B,B,Tum
#4 4 B,B,B,Tum
Or, as @Gaurav suggested in the comments, subset is another option if we don't want to use [. Inside subset we can drop the df1$ prefix, and we don't have to worry about dimensions being dropped: subset defaults to drop=FALSE, whereas [ on a data frame drops the result to a vector when a single column is selected, unless we explicitly specify drop=FALSE.
subset(df1, grepl('B', Type) & grepl('Tum', Type))
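A self-contained sketch of the two approaches above, rebuilding the question's data as `df1`. The `fixed = TRUE` argument is my addition (it treats the patterns as literal text rather than regular expressions) and is not part of the original answer.

```r
# Rebuild the example data from the question
df1 <- data.frame(
  chr  = 1:4,
  Type = c("Tum,B,B,Tum", "B,B", "Tum,Tum", "B,B,B,Tum"),
  stringsAsFactors = FALSE
)

# One logical vector per term; & keeps rows where both are TRUE
has_b   <- grepl("B",   df1$Type, fixed = TRUE)
has_tum <- grepl("Tum", df1$Type, fixed = TRUE)

res_bracket <- df1[has_b & has_tum, ]
res_subset  <- subset(df1, grepl("B", Type) & grepl("Tum", Type))

# Selecting a single column with [ drops to a vector unless drop = FALSE
type_vec <- df1[has_b & has_tum, "Type"]
type_df  <- df1[has_b & has_tum, "Type", drop = FALSE]
```

Both `res_bracket` and `res_subset` should contain rows 1 and 4 only; `type_vec` is a character vector while `type_df` stays a one-column data frame.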

Or with a single regex, without needing two grepl calls:
indx <- grepl("Tum.*B|B.*Tum", df1$Type)
df1[indx, ]
# chr Type
# 1 1 Tum,B,B,Tum
# 4 4 B,B,B,Tum
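With more than two terms the alternation pattern grows quickly, because every ordering has to be listed. A minimal sketch of a more general approach is to run one grepl per term and combine the results with Reduce. The helper name `contains_all` is hypothetical and not from the answers above.

```r
# Hypothetical helper: TRUE where x contains every term in `terms`
contains_all <- function(x, terms) {
  # One logical vector per term (literal matching, not regex)
  hits <- lapply(terms, grepl, x = x, fixed = TRUE)
  # Element-wise AND across all the logical vectors
  Reduce(`&`, hits)
}

Type <- c("Tum,B,B,Tum", "B,B", "Tum,Tum", "B,B,B,Tum")

both_reduce <- contains_all(Type, c("B", "Tum"))
both_regex  <- grepl("Tum.*B|B.*Tum", Type)
```

Note that either approach matches substrings, so "B" would also match inside a longer value such as "Barr".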

Related

How to remove only numbers from string

I have following dataframe in R
ID Village_Name
1 23
2 Name-23
3 34
4 Vasai2
5 23
I only want to remove numbers from Village_Name, my desired dataframe would be
ID Village_Name
1 Name-23
2 Vasai2
How can I do it in R?
We can use grepl to match strings that consist only of digits, from the start (^) to the end ($), and negate (!) the result so that all-digit elements become FALSE and the others TRUE.
i1 <- !grepl("^[0-9]+$", df1$Village_Name)
df1[i1, ]
Based on the OP's post, it could be also
data.frame(ID = head(df1$ID, sum(i1)), Village_Name = df1$Village_Name[i1])
# ID Village_Name
#1 1 Name-23
#2 2 Vasai2
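Another option is to convert the column to numeric, which turns non-numeric elements into NA (with a coercion warning), and then use is.na to get the logical vector. The as.character call guards against Village_Name being a factor.
df1[is.na(as.numeric(as.character(df1$Village_Name))),]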
Here is another option using sub:
df1[nchar(sub("\\d+", "", df1$Village_Name)) > 0, ]
The basic idea is to remove the first run of digits from each Village_Name entry and then check that at least one character remains, which implies that the entry is not entirely numeric.
But I would probably go with the grepl option given by @akrun in practice.
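The three filters above can be compared side by side. This is a sketch on the question's data, assuming Village_Name is stored as character; `suppressWarnings` is my addition to silence the expected coercion warning.

```r
df1 <- data.frame(
  ID = 1:5,
  Village_Name = c("23", "Name-23", "34", "Vasai2", "23"),
  stringsAsFactors = FALSE
)

# 1. Regex: keep entries that are NOT digits from start to end
keep_grepl <- !grepl("^[0-9]+$", df1$Village_Name)

# 2. sub: remove the first run of digits, keep entries with something left
keep_sub <- nchar(sub("\\d+", "", df1$Village_Name)) > 0

# 3. Coercion: keep entries that cannot be parsed as a number
keep_num <- is.na(suppressWarnings(as.numeric(df1$Village_Name)))

kept <- df1[keep_grepl, ]
```

The approaches agree on this data, but they are not equivalent in general: as.numeric also parses strings such as "2.5" or "1e3", which the digit-only regex treats as non-numeric.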

Can I programmatically update the type of a set of columns (to factors) in data.table?

I would like to modify a set of columns inside a data.table to be factors. If I knew the names of the columns in advance, I think this would be straightforward.
library(data.table)
dt1 <- data.table(a = (1:4), b = rep(c('a','b')), c = rep(c(0,1)))
dt1[,class(b)]
dt1[,b:=factor(b)]
dt1[,class(b)]
But I don't, and instead have a list of the variable names
vars.factors <- c('b','c')
I can apply the factor function to them without a problem ...
lapply(vars.factors, function(x) dt1[,class(get(x))])
lapply(vars.factors, function(x) dt1[,factor(get(x))])
But I don't know how to re-assign or update the original column in the data table.
This fails ...
lapply(vars.factors, function(x) dt1[,x:=factor(get(x))])
# Error in get(x) : invalid first argument
As does this ...
lapply(vars.factors, function(x) dt1[,get(x):=factor(get(x))])
# Error in get(x) : object 'b' not found
NB. I tried the answer proposed here without any luck.
Yes, this is fairly straightforward:
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols=vars.factors]
In the LHS (of := in j), we specify the names of the columns. If a column already exists, it'll be updated, else, a new column will be created. In the RHS, we loop over all the columns in .SD (which stands for Subset of Data), and we specify the columns that should be in .SD with the .SDcols argument.
Following up on comment:
Note that we need to wrap LHS with () for it to be evaluated and fetch the column names within vars.factors variable. This is because we allow the syntax
DT[, col := value]
when there's only one column to assign, by specifying the column name as a symbol (without quotes), purely for convenience. This creates a column named col and assigns value to it.
To tell these two cases apart, we need the (). Wrapping the LHS in () is enough to signal that we want the values stored in the variable.
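A short sketch contrasting the two forms of the LHS, assuming the data.table package is installed. The table `dt2` and the value "x" are made up for illustration.

```r
library(data.table)

dt1 <- data.table(a = 1:4, b = c("a", "b", "a", "b"), c = c(0, 1, 0, 1))
vars.factors <- c("b", "c")

# Parenthesised LHS: update the columns whose names are stored in vars.factors
dt1[, (vars.factors) := lapply(.SD, as.factor), .SDcols = vars.factors]
classes <- sapply(dt1, class)

# Bare symbol LHS: creates a new column literally named "vars.factors"
dt2 <- data.table(a = 1:4)
dt2[, vars.factors := "x"]
new_names <- names(dt2)
```

After running this, `classes` should report b and c as factors, while `dt2` gains a column called vars.factors rather than columns b and c.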
Using a data frame:
> df1 = data.frame(dt1)
> # lapply keeps each column as a factor; sapply would return a character matrix
> df1[vars.factors] = lapply(df1[vars.factors], factor)
> dt1 = data.table(df1)
> dt1
a b c
1: 1 1 b
2: 2 2 c
3: 3 3 b
4: 4 4 c
> str(dt1)
Classes ‘data.table’ and 'data.frame': 4 obs. of 3 variables:
$ a: int 1 2 3 4
$ b: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ c: Factor w/ 2 levels "b","c": 1 2 1 2
- attr(*, ".internal.selfref")=<externalptr>
Can also do
for (col in vars.factors)
  set(dt1, j=col, value=as.factor(dt1[[col]]))
vars.factors may be a vector of integers or character names specifying the columns to modify.
See https://stackoverflow.com/a/33000778/4241780 for more info.
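A sketch of the set() loop using integer column positions instead of names, assuming the data.table package is installed. The positions 2 and 3 are simply where b and c sit in this example table.

```r
library(data.table)

dt1 <- data.table(a = 1:4, b = c("a", "b", "a", "b"), c = c(0, 1, 0, 1))

# Integer positions of the columns to convert (b and c in this example)
cols <- c(2L, 3L)

# set() modifies dt1 by reference, one column per iteration
for (col in cols) {
  set(dt1, j = col, value = as.factor(dt1[[col]]))
}

classes <- sapply(dt1, class)
```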

What is the exact difference between addressing a column by $mycol, [[mycol]] and [, mycol, with=FALSE]?

After reading data.table FAQ (section 1.5), I had an impression that all three ways of addressing the column are more or less equivalent. But at least the output of [, mycol, with=FALSE] is quite different from $mycol and [[mycol]]:
dt1 <- fread(
" id,colA,colB
id1,3,xxx
id2,0,zzz
id3,NA,yyy
id4,0,aaa
")
dt1$colA <- factor(dt1$colA)
myvar="colA"
dt1$colA
# [1] 3 0 <NA> 0
# Levels: 0 3
dt1[[myvar]]
# [1] 3 0 <NA> 0
# Levels: 0 3
dt1[, myvar, with=FALSE]
# colA
# 1: 3
# 2: 0
# 3: NA
# 4: 0
So, what is the exact difference between those three approaches? Can I assume that $mycol and [[mycol]] are always identical? Why does [, mycol, with=FALSE] "lose" factors?
Thanks in advance.
The first part of your question, the difference between $ and [[, has been covered before in this question:
Indexing by [ is similar to atomic vectors and selects a list of the
specified element(s).
Both [[ and $ select a single element of the list. The main
difference is that $ does not allow computed indices, whereas [[
does. x$name is equivalent to x[["name", exact = FALSE]]. Also,
the partial matching behavior of [[ can be controlled using the
exact argument.
The notation dt1[, ..myvar] in data.table produces a data table containing the columns named in myvar. The result is a one-column data table, and the class of that column is still factor. The factor is not lost; the printed data table simply does not show the levels.
The data frame equivalent would be: as.data.frame(dt1)[, myvar, drop = FALSE].
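To check that the factor survives, one can inspect the column inside the one-column result. This is a sketch that builds the question's table directly with data.table() instead of fread, assuming the data.table package is installed.

```r
library(data.table)

dt1 <- data.table(
  id   = paste0("id", 1:4),
  colA = factor(c(3, 0, NA, 0)),
  colB = c("xxx", "zzz", "yyy", "aaa")
)
myvar <- "colA"

# $ and [[ both return the column itself (a factor vector)
same_vec <- identical(dt1$colA, dt1[[myvar]])

# ..myvar and with = FALSE both return a one-column data.table
one_col     <- dt1[, ..myvar]
one_col_old <- dt1[, myvar, with = FALSE]

# The column inside that data.table is still a factor with its levels
col_class  <- class(one_col[[1]])
col_levels <- levels(one_col[[1]])
```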

Remove quotes from vector element in order to use it as a value

Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8: "Believing it does as intended", one of the "string not the name" sub-sections.... You might also find some explanation in the line The main difference is that $ does not allow computed indices, whereas [[ does. from the help page at ?Extract.
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
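A brief sketch of why `$` fails here and how `[[` and `[` behave with a computed name, using the same `mydf` and `x` as above. The matrix `m` is added for illustration.

```r
mydf <- data.frame(A = 1:2, B = 3:4)
x <- c("A", "B")

# $ takes the text after it literally: this looks for a column named "x"
by_dollar <- mydf$x

# [[ evaluates its argument first, so x[1] becomes "A"
by_double <- mydf[[x[1]]]

# [row, column] indexing also works on a matrix, where $ is not available
m <- as.matrix(mydf)
by_matrix <- m[, x[2]]
```

Here `by_dollar` is NULL because there is no column called x, while `by_double` returns column A and `by_matrix` returns column B.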

How to reference column names that start with a number, in data.table

If the column names in data.table are in the form of number + character, for example: 4PCS, 5Y etc, how could this be referenced as j in x[i,j] so that it is interpreted as an unquoted column name.
I assume this would solve my original problem. I wanted to add up several columns in a 'data.table' whose names are in the form number + character.
M <- data.table('4PCS'=1:4,'5Y'=4:1,X5Y=2:5)
> M[,4PCS+5Y]
Error: unexpected symbol in "M[,4PCS"
The new column should be the sum of 4PCS and 5Y.
Is there a way to refer to them in data.table in unquoted form? If these columns are referred to in data.table with the quoted "logic" of data.frame:
> M[,'5Y',with=FALSE]
5Y
[1,] 4
[2,] 3
[3,] 2
[4,] 1
then there will be a limitation in functionality of such reference. The addition would not work as it does not work in data.frame:
> M[,'4PCS'+'5Y',with=FALSE]
Error in "4PCS" + "5Y" : non-numeric argument to binary operator
The data.table functionality would allow to operate over the columns. I would like to find a solution in the new data.table logic hence I can use its ability to transform the columns by column name referencing.
The question is:
How can I quote a column name that starts with a number so that data.table understands that it is a column name?
I think, this is what you're looking for, not sure. data.table is different from data.frame. Please have a look at the quick introduction, and then the FAQ (and also the reference manual if necessary).
require(data.table)
dt <- data.table("4PCS" = 1:3, y=3:1)
#   4PCS y
# 1:    1 3
# 2:    2 2
# 3:    3 1
# access column 4PCS
dt[, "4PCS"]
# returns a data.table
# 4PCS
# 1: 1
# 2: 2
# 3: 3
# to access multiple columns by name
dt[, c("4PCS", "y")]
Alternatively, if you need to access the column and not result in a data.table, rather a vector, then you can access using the $ notation:
dt$`4PCS` # notice the ` because the variable begins with a number
# [1] 1 2 3
# alternatively, as mnel mentioned under comments:
dt[, `4PCS`]
# [1] 1 2 3
Or if you know the column number you can access using [[.]] as follows:
dt[[1]] # 4PCS is the first column here
# [1] 1 2 3
Edit:
Thanks @joran. I think you're looking for this:
dt[, `4PCS` + y]
# [1] 4 4 4
Fundamentally the issue is that 4PCS is not a valid variable name in R (try 4PCS <- 1 and you'll get the same "unexpected symbol" error). So to refer to it, we have to use backticks (compare `4PCS` <- 1).
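The same backtick rule applies outside data.table. A minimal base R sketch, where `check.names = FALSE` is needed so that data.frame keeps the non-syntactic names as given:

```r
# Backticks let a non-syntactic name be used as an ordinary variable
`4PCS` <- 1

# Keep the original column names instead of letting data.frame alter them
df <- data.frame(`4PCS` = 1:4, `5Y` = 4:1, check.names = FALSE)

# Refer to the columns by backticked name inside an expression
total <- with(df, `4PCS` + `5Y`)

# $ also accepts a backticked name
first_col <- df$`4PCS`
```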
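If the column names were sanitized when the data was created (for example by data.frame or read.csv with the default check.names = TRUE, which applies make.names), R will already have put an 'X' immediately before any name that starts with a number. In that case the column is no longer called 4PCS but X4PCS, which is a valid name and needs no quoting.
So e.g. when calling 4PCS use X4PCS
as in
mydata$X4PCS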
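The 'X' prefix comes from make.names, which base R applies to column names when check.names is TRUE. A small sketch showing when the prefixed name exists and when it does not; the data frames `sanitized` and `raw` are made up for illustration:

```r
# make.names turns non-syntactic names into valid ones by prepending "X"
fixed_names <- make.names(c("4PCS", "5Y"))

# Default check.names = TRUE: the column is stored as X4PCS
sanitized <- data.frame(`4PCS` = 1:4)
sanitized_names <- names(sanitized)

# check.names = FALSE: the column keeps its original name, so X4PCS does not exist
raw <- data.frame(`4PCS` = 1:4, check.names = FALSE)
raw_names <- names(raw)
```

So the X prefix only helps when the names were sanitized on the way in; otherwise backticks are still required.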
