Query COUNT statement with multiple clauses doesn't work

I have three columns, say A, B and C, and I'd like to count the cells in column A that are not empty, but only if column B or column C contains a name.
I use this statement:
=QUERY(Foglio1!A:V;"SELECT COUNT(R) WHERE R <>'' AND M CONTAINS "Alex" OR V CONTAINS "Mary"")
The problem is that it counts how many times "Alex" and "Mary" appear in columns B and C, even when R is empty.
What's wrong?

Group the two CONTAINS conditions in brackets so that they are evaluated together:
=QUERY(Foglio1!A:V;"SELECT COUNT(R) WHERE R <>'' AND (M CONTAINS 'Alex' OR V CONTAINS 'Mary')")
Your original left the V CONTAINS 'Mary' hanging, so it could be interpreted as (R <>'' AND M CONTAINS 'Alex') OR V CONTAINS 'Mary', which lets any row where V contains 'Mary' pass the filter even if R is empty.

assign values to vector in a dataframe based on conditions from other vectors in R

I have a partially empty data frame and I need to assign values to a vector, call it "C", based on the values of another vector, "B". In some cases the values assigned to C are taken from a third vector, "A". How can I do that? I tried if statements and for loops but don't know how to write them properly. Here is a summary of my problem.
Thank you for your answers.
A = (1,2,3,4,5,6,7,8,9,10)
B = (1,0,1,2,1,0,0,2,0,1)
C = 0 if B = 0
C = A (same row) if B = 1
C = -A (same row) if B = -1
One option is to take the sign of B and multiply it by A. sign() returns -1, 0 or 1, which lines up with the three cases:
C <- sign(B) * A
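As a rough check of the sign() approach, the sketch below compares it with an explicit rule-by-rule version. The B used here is hypothetical: it contains -1 values so that all three rules are exercised, whereas the B posted in the question contains 2 instead.

```r
A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
# Hypothetical B (an assumption for illustration): uses -1 where the
# question's B shows 2, so that every stated rule is covered.
B <- c(1, 0, 1, -1, 1, 0, 0, -1, 0, 1)

# Approach from the answer: sign(B) is -1, 0 or 1
C_sign <- sign(B) * A

# Explicit version of the stated rules; any other value of B becomes NA
C_rules <- ifelse(B == 0, 0,
           ifelse(B == 1, A,
           ifelse(B == -1, -A, NA)))

print(C_sign)
print(isTRUE(all.equal(C_sign, C_rules)))
```

Note that the two versions differ for values of B outside -1, 0 and 1: sign(2) is 1, so sign(B) * A returns A for such rows, while the explicit version returns NA.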

Identifying if one field is partially derived from another in R

Given a pattern, say two letters, and the position of that pattern (prefix, suffix or middle), I need to identify whether one field is partially derived from another. For example, given the following dataset:
data.V1 data.V2
1 GH GH1001
2 FD FD2002
3 TH 2345TH
4 ED ED56763
5 US 4345US
6 FG F6736tG
Let LL be the pattern for column one, where LL stands for two letters. If the pattern for column two is LL#, the two letters must appear at the start of each element of column two. In the dataset above, rows 1, 2 and 4 obey the pattern.
I have tried if-then statements, but these did not work when the pattern was in the middle (#LL#). I have also tried the function regmatches, but that did not work either.
apply(df,1,function(x) grepl(paste0("^",x["data.V1"]),x["data.V2"]))
For every row (that's the 1 in apply), this checks whether the contents of data.V1 appear right at the beginning of data.V2 (^ marks the beginning of a string in regular expressions).
Result:
1 2 3 4 5 6
TRUE TRUE FALSE TRUE FALSE FALSE
To test other positions, replace the first argument of grepl with one of the following:
End of string: paste0(x["data.V1"],"$")
Middle of string: paste0(".+",x["data.V1"],".+")
After n characters (n defined elsewhere): paste0(".{",n,"}",x["data.V1"])
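(For the last one, "{n}" means the preceding element is repeated exactly n times. Since it is preceded by ".", it matches any n characters. Without a leading "^", the pattern matches wherever at least n characters come before it; prepend "^" to require exactly n.)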
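The prefix, suffix and middle checks can be sketched together as follows. The data frame is rebuilt from the question's sample, and match_position is a hypothetical helper name introduced only for this illustration. It uses mapply over the two columns instead of apply over rows. The sketch assumes data.V1 contains no regex metacharacters; otherwise the values would need escaping.

```r
df <- data.frame(
  data.V1 = c("GH", "FD", "TH", "ED", "US", "FG"),
  data.V2 = c("GH1001", "FD2002", "2345TH", "ED56763", "4345US", "F6736tG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: wraps each key in the given regex prefix/suffix
# and tests it against the value in the same row.
match_position <- function(df, prefix = "", suffix = "") {
  unname(mapply(
    function(key, value) grepl(paste0(prefix, key, suffix), value),
    df$data.V1, df$data.V2
  ))
}

is_prefix <- match_position(df, prefix = "^")                # key at the start
is_suffix <- match_position(df, suffix = "$")                # key at the end
is_middle <- match_position(df, prefix = ".+", suffix = ".+")  # key strictly inside

print(data.frame(df, is_prefix, is_suffix, is_middle))
```

On this sample, rows 1, 2 and 4 are prefix matches, rows 3 and 5 are suffix matches, and row 6 matches nothing because "FG" never appears as a contiguous pair in "F6736tG".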

subset() with grepl() using REGEX for filtering a dataframe in R

I am learning R and experimenting with subset() and grepl() with Regex for filtering a dataframe. I have created a very small dataframe to play with:
x y z w
1 10 a k
2 12 b l
3 14 c m
4 16 d n
5 18 e o
My code is the following:
subset(df14, grepl('^c | [l - n]', c(df14$z , df14$w) ), grepl('[yz]', colnames(df14)) )
In my mind, the second argument should return the indices of the rows found by grepl() to match the pattern in the columns with names: 'z' or 'w'. However, this is not what happens (returns an empty dataframe with columns y and z).
I would expect it to return rows 2, 3 and 4, since column 'w' contains the letters l, m, n specified in the [l-n] regex pattern, and the columns z and w, since these names match the regex [yz] in the third argument of subset().
(I suspect that it is looking for a match in the names of the columns rather than the contents of the columns, which is what interests me.)
Obviously, I am not interested in the result per se. This is an experiment to understand how the functions work. So, what I am looking for is an explanation and a method to correct the specific code -- not an alternative solution.
Your advice will be appreciated.
There are a variety of problems.
One issue is the extra spaces in your patterns. Drop them or use the free-spacing modifier (?x) with perl = TRUE. Either way, you have to get rid of the spaces in the character class: [l-n] matches "m" and [l - n] does not, even with (?x), because free-spacing mode ignores whitespace outside character classes but not inside them.
Another issue is that in your first grepl, you're searching within a vector (character vector? we can't tell from the example) of length 10. What would a TRUE in the 6th position mean for a 5 row data.frame? It doesn't make sense to return the 6th row of a 5 row data frame. Instead, you can see if your pattern is found for column "w" or (|) column "z". Look within each column, not a concatenation of columns.
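The two issues above can be checked in isolation. This is a small sketch, assuming df14 has character columns z and w as in the question's printout (the data frame is rebuilt here for illustration).

```r
# Rebuilt from the question's printout; column types are an assumption
df14 <- data.frame(
  x = 1:5,
  y = c(10, 12, 14, 16, 18),
  z = c("a", "b", "c", "d", "e"),
  w = c("k", "l", "m", "n", "o"),
  stringsAsFactors = FALSE
)

# Spaces are literal by default, inside and outside a character class
range_ok     <- grepl("[l-n]", "m")        # "m" is in the range l..n
range_spaces <- grepl("[l - n]", "m")      # class holds "l", " " and "n" only
alt_spaces   <- grepl("^c | [l-n]", "c")   # needs "c " or " l", " m", " n"

# Free-spacing mode ignores the spaces around the alternation
alt_free <- grepl("(?x)^c | [l-n]", "c", perl = TRUE)

# Concatenating two columns gives a vector longer than the data frame
combined_length <- length(c(df14$z, df14$w))

print(c(range_ok, range_spaces, alt_spaces, alt_free))
print(c(combined_length, nrow(df14)))
```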
Another issue is in your second grepl: "w" is not a match for [yz]. If you want to select the columns with a name containing a "w" or a "z", one way would be with [wz]. There is no need for the ^ anchor since all your strings contain a single character, but I'll leave it in anyway:
subset(df14,
       subset = grepl('^c|[l-n]', df14$z) |
                grepl('^c|[l-n]', df14$w),
       select = grepl('[wz]', colnames(df14)))
# z w
#2 b l
#3 c m
#4 d n
Or with the free-spacing mode modifier and a different pattern ([wz] vs w|z) for the second grepl:
subset(df14,
       subset = grepl('(?x)^c | [l-n]', df14$z, perl = TRUE) |
                grepl('(?x)^c | [l-n]', df14$w, perl = TRUE),
       select = grepl('w|z', colnames(df14)))
# z w
#2 b l
#3 c m
#4 d n
The '^c | [l - n]' search expression can't find anything in those columns. Also, a more intuitive approach is to use [ , ] for this type of subsetting. See http://adv-r.had.co.nz/Subsetting.html.
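A minimal sketch of the [ , ] form, assuming the same df14 as in the question (rebuilt here) and the corrected patterns without spaces. The names rows and cols are introduced only for this example.

```r
df14 <- data.frame(
  x = 1:5,
  y = c(10, 12, 14, 16, 18),
  z = c("a", "b", "c", "d", "e"),
  w = c("k", "l", "m", "n", "o"),
  stringsAsFactors = FALSE
)

# One logical value per row: pattern found in z or in w
rows <- grepl("^c|[l-n]", df14$z) | grepl("^c|[l-n]", df14$w)

# One logical value per column: name contains "w" or "z"
cols <- grepl("[wz]", colnames(df14))

result <- df14[rows, cols]
print(result)
```

Because rows has one element per row and cols has one element per column, the lengths always line up with the data frame, which avoids the length-10 problem of the concatenated vector.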

Extract indices from array meeting a condition in R

Say I have d<-c(1,2,3,4,5,6,6,7). How can I select the indices from d that meet a certain condition such as x>3 and x<=6 (i.e. d[4], d[5], d[6], d[7])?
Use which():
> which(d>3 & d<=6)
[1] 4 5 6 7
Minor: c() creates a vector, which is similar to but not exactly an array.
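You can also create a logical vector and use it to index d directly, which returns the matching values rather than their indices.
d[d>3 & d<=6] # the comparison operators return logical vectors; [] extracts
# only the elements where the vector is TRUE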
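The difference between the two answers can be seen side by side in this short sketch. The variable names keep, idx and vals are introduced only for illustration.

```r
d <- c(1, 2, 3, 4, 5, 6, 6, 7)

keep <- d > 3 & d <= 6   # logical vector, one element per entry of d
idx  <- which(keep)      # integer positions where keep is TRUE
vals <- d[keep]          # the values themselves; same as d[idx]

print(idx)
print(vals)
```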

How can I compare two strings to find the number of characters that match in R, using substitution distance?

In R, I have two character vectors, a and b.
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
I want a function that counts the character mismatches between each element of a and the corresponding element of b. Using the example above, such a function should return c(2,3,1). There is no need to align the strings.
I need to compare each pair of strings character-by-character and count matches and/or mismatches in each pair. Does any such function exist in R?
Or, to ask the question in another way, is there a function to give me the edit distance between two strings, where the only allowed operation is substitution (ignore insertions or deletions)?
One way, using mapply over the split characters:
mapply(function(x,y) sum(x!=y),strsplit(a,""),strsplit(b,""))
#[1] 2 3 1
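Another option is to use adist, which computes the approximate string distance between character vectors. Note that adist also allows insertions and deletions, so it only equals the substitution count when substitutions alone give the cheapest edit, as they do here: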
mapply(adist,a,b)
abcdefg hijklmnop qrstuvwxyz
2 3 1
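The mapply approach can be wrapped in a small function. This is a sketch: substitution_distance is a hypothetical name, and the equal-length check is an added assumption based on the question's statement that no alignment is needed.

```r
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")

# Hypothetical helper: counts positions where paired strings differ.
# Assumes each pair has the same number of characters.
substitution_distance <- function(a, b) {
  stopifnot(length(a) == length(b), all(nchar(a) == nchar(b)))
  mapply(function(x, y) sum(x != y), strsplit(a, ""), strsplit(b, ""))
}

mismatches <- substitution_distance(a, b)
matches <- nchar(a) - mismatches   # matching characters per pair

print(mismatches)
print(matches)
```

The check on nchar() matters because comparing character vectors of different lengths with != would recycle the shorter one rather than fail.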
