R - Combining Columns to String Based on Logical Match

In a previous question (Setting Value Based on Matching Column) I was trying to take a string, split it into elements, and create one logical column per element. That was answered brilliantly.
But after a couple of months' work, I now need something along the lines of the inverse.
Given...
df <- data.frame(E1=FALSE,E11=TRUE,E20=FALSE,E30=FALSE,E31=TRUE,E100=FALSE,E300=FALSE,E313=TRUE,ECAT=TRUE)
I need to produce a string containing all the column names that have a TRUE match - which would hopefully yield something like...
> df[1,]
     E1  E11   E20   E30  E31  E100  E300 E313 ECAT            Topics
1 FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE E11,E31,E313,ECAT
In reality I have 3,270 rows and 102 topics, so I need something that, for each row, concatenates the names of the topic columns that are TRUE.
My attempts have yielded nothing that works. Who will volunteer an answer, or a link to a duplicate question/answer (one probably exists; it is an R question after all)?

You can try the following. Note that toString() joins the names with ", " (a comma followed by a space):
df$Topics <- apply(df, 1, function(x) toString(names(x)[x]))

You can use apply to do this.
df$Topics = apply(df,1,function(x) paste0(colnames(df)[x],collapse=','))
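Both answers pass the whole data frame to apply(). If the frame also holds non-logical columns (including a Topics column left over from an earlier run), apply() coerces every row to character and the logical indexing stops working. A minimal sketch of one way around that, assuming the topic columns are exactly the logical columns; the two-row df and the name topic_cols are made up for illustration:

```r
# Hypothetical two-row example with three topic columns
df <- data.frame(E1   = c(FALSE, TRUE),
                 E11  = c(TRUE, FALSE),
                 ECAT = c(TRUE, TRUE))

# Assumption: the topic columns are exactly the logical columns
topic_cols <- names(df)[vapply(df, is.logical, logical(1))]

# Row-wise: keep the names of the columns that are TRUE and join them
df$Topics <- apply(df[topic_cols], 1,
                   function(x) paste(topic_cols[x], collapse = ","))

df$Topics
```

Because Topics is a character column, it is skipped when topic_cols is computed, so the statement can be rerun safely.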

Related

grepl in R not working properly while subsetting

I have a big dataset with some columns representing amounts in Decimal(5,2) format:
DF
Name|Salary|State
Joe|12345.34|AZ
Mac|3423.67|CT
Lilly|12342.345|CA
Clearly only Joe meets the criterion, so after subsetting I should get the records that do NOT match it on the Salary column. Thus the result should be
Name|Salary|State
Mac|3423.67|CT
Lilly|12342.345|CA
I want to use subset function:
subset(grepl("^[[:digit:]]{,5}\\.[[:digit:]]{,2}$",DF$Salary)
OR
subset(grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",DF)
subset(grepl("[[:digit:]]{,5}[.][[:digit:]]{,2}",DF)
None of these gives me the correct result.
On further investigation I found that grepl itself doesn't seem to work properly.
Example:
x <- "12345.45"
grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",x) # TRUE
grepl("[[:digit:]]{,4}\\.[[:digit:]]{,2}",x) # TRUE
grepl("[[:digit:]]{,4}[.][[:digit:]]{,2}",x) # TRUE
Floating-point comparisons aren't exact; read "Why are these numbers not equal?".
However, in this case you can use:
subset(DF, !grepl('\\d{5}\\.\\d\\d$', Salary))
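The TRUE results in the question are probably not a grepl bug. An unanchored pattern only has to match some substring, so "2345.45" inside "12345.45" already satisfies a four-digit limit. A small sketch, assuming Salary is stored as character and the goal is exactly five digits before the point and two after; the names loose, strict, and kept are illustrative:

```r
DF <- data.frame(
  Name   = c("Joe", "Mac", "Lilly"),
  Salary = c("12345.34", "3423.67", "12342.345"),  # assumed to be character
  State  = c("AZ", "CT", "CA"),
  stringsAsFactors = FALSE
)

# Unanchored: any substring may match, so every row is TRUE
loose <- grepl("[[:digit:]]{1,4}\\.[[:digit:]]{1,2}", DF$Salary)

# Anchored with ^ and $: the whole string must be 5 digits, a dot, 2 digits
strict <- grepl("^[[:digit:]]{5}\\.[[:digit:]]{2}$", DF$Salary)

# Keep the rows that do NOT match the strict pattern
kept <- DF[!strict, ]
kept$Name
```

Here only Joe matches the anchored pattern, so kept holds Mac and Lilly.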

How to match a substring at the end of a string in R?

I'm using grepl to detect if a string includes a substring. For example:
grepl("-B4","P6-B4")
which obviously returns TRUE. Now I want to avoid cases which have characters after the "-B4" substring. For example, I want to see FALSE from the following:
grepl("-B4","P6-B41A")
As you can see the reason I want to avoid it is because 4 is different from 41 and I don't want to detect 41.
Thanks
grepl("-B4$",c("P6-B41A", "P6-B4"))
#[1] FALSE TRUE
This seems like the perfect time to use endsWith(). It determines if a string ends with a specific character or series of characters.
endsWith(c("P6-B41A", "P6-B4"), "-B4")
# [1] FALSE TRUE
And according to help(endsWith), it's also more efficient than grepl().
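One more difference that may matter: endsWith() compares the suffix literally, while grepl() interprets it as a regular expression unless fixed = TRUE is set. A short sketch with a made-up suffix ".B4" (not from the question) shows the effect:

```r
codes <- c("P6.B4", "P6-B4", "P6.B41A")  # hypothetical values

# In a regex, "." matches any character, so "-B4" also counts as a hit
regex_hits <- grepl(".B4$", codes)

# Escaping the dot restores the literal meaning
escaped_hits <- grepl("\\.B4$", codes)

# endsWith() always treats the suffix as a plain string
literal_hits <- endsWith(codes, ".B4")
```

With the "-B4" suffix from the question this does not matter, because "-" and "B4" contain no regex metacharacters.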
Another option would be to extract the last 3 characters and do a ==
substr(v1, nchar(v1)-2, nchar(v1)) == "-B4"
#[1] FALSE TRUE
data
v1 <- c("P6-B41A", "P6-B4")

Couldn't reduce the looping variable inside the "for" loop in R

I have a for loop that does a matrix manipulation in R. When certain checks are true I need to process the same row again, which means i needs to be reduced by 1.
for (i in 1:10) {
  if (some_check) {  # placeholder for the actual condition
    i = i - 1
  }
}
However, i is not actually reduced. For example, if I reduce i to 4 in the 5th iteration, the next iteration should come back as 5, but it comes as 6.
Please advise.
My intention is:
I check the first-column values of a matrix. If I find a duplicate value, I take its second-column value, append it to the second column of the first row with that value, and remove the duplicate row. So when I remove a row, I do not need to increase i in the loop. (This is just a map-reduce approach: append the values that share a key.)
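Loop variables in R for loops are effectively read-only. You can assign to i inside the body, but the loop overwrites it with the next element of the sequence at the start of the next iteration, so the assignment never changes which rows are visited. What you have written would be solved completely differently in normal R code. The exact solution depends on the actual problem; there isn't a generic, direct replacement (except by replacing the whole thing with a while loop, but this is both ugly and probably unnecessary).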
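The behaviour of the loop variable can be checked with a small sketch. The names visited, j, steps, and repeated are made up for illustration, and the while loop is only there to show what explicit index control looks like, not as a recommendation:

```r
# for: the sequence 1:5 is fixed up front, so changing i has no lasting effect
visited <- integer(0)
for (i in 1:5) {
  visited <- c(visited, i)
  i <- i - 1  # overwritten when the next iteration starts
}
visited  # still 1 2 3 4 5

# while: the index is an ordinary variable, so an iteration can be repeated
j <- 1
steps <- 0
repeated <- FALSE
while (j <= 3) {
  steps <- steps + 1
  if (j == 2 && !repeated) {
    repeated <- TRUE  # revisit index 2 exactly once
    next              # skip the increment below
  }
  j <- j + 1
}
steps  # 4 passes for 3 indices, because index 2 was visited twice
```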
To illustrate this, consider these two typical examples.
Assume you want to filter all duplicated elements from a list. Instead of looping over the list and copying the elements you want to keep, you can use the duplicated function, which tells you, for each element, whether it's a duplicate.
Then you use standard R subsetting syntax to select just those elements which are not duplicates:
x = x[! duplicated(x)]
(This example works on a one-dimensional vector or list, but it can be generalised to more dimensions.)
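For the two-column case described in the question (append the values that share a key, then drop the duplicate rows), one possible loop-free sketch follows. The matrix m and its column names key and value are assumptions, since the question does not show the real data:

```r
# Hypothetical key/value matrix; the real data is not shown in the question
m <- cbind(key   = c("a", "b", "a", "c", "b"),
           value = c("1", "2", "3", "4", "5"))

# Generalising duplicated() to rows: keep the first row for each key
first_rows <- m[!duplicated(m[, "key"]), , drop = FALSE]

# Append the values of equal keys; the result is named by key, in sorted order
merged <- tapply(m[, "value"], m[, "key"], paste, collapse = ",")
merged
```

tapply() groups the second column by the first and pastes each group together, which replaces both the index juggling and the row removal.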
For a more complex case, let’s say that you have a vector of numbers and, for every even number in the vector, you want to double the preceding number (this is highly artificial but in signal processing you might face similar problems). In other words:
input = c(1, 3, 2, 5, 6, 7, 1, 8)
output = ???
output
# [1] 1 6 2 10 6 7 2 8
… we want to fill in ???. In the first step, we check which numbers are even:
even = input %% 2 == 0
# [1] FALSE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
Next, we shift the result down – because we want to know whether the next number is even – by removing the first element, and appending a dummy element (FALSE) at the end.
even = c(even[-1], FALSE)
# [1] FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
And now we can multiply just these inputs by two:
output = input
output[even] = output[even] * 2
There, done.

Recursive %in% function in R?

I am sure this is a simple question that has been asked many times, but this is one of those times when I find it difficult to know which terms to search for in order to find the solution. I have a simple list of lists, such as the one below:
sets <- list(S1=NA, S2=1L, S3=2:5)
> sets
$S1
[1] NA
$S2
[1] 1
$S3
[1] 2 3 4 5
And I have a scalar variable val which can take the value of any integer in sets (but will never be NA). Suppose val <- 4. What is a quick way to return a vector of TRUE/FALSE values, one per element of sets, where TRUE means val is in that element and FALSE means it is not? In this case I would want something like
[1] FALSE FALSE TRUE
I was hoping there would be some recursive form of %in% but I haven't had luck searching for it. Thank you!
Like this:
sapply(sets, `%in%`, x = val)
# S1 S2 S3
# FALSE FALSE TRUE
I had to look at the help page ?"%in%" to find out that the first argument to %in% is named x. And for your curiosity (not needed here), the second one is named table.
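An equivalent spelling, sketched here with an anonymous function and vapply(), avoids having to remember the argument names and guarantees a logical result of length one per element:

```r
sets <- list(S1 = NA, S2 = 1L, S3 = 2:5)
val <- 4

# For each element s of sets, test whether val occurs in it
hits <- vapply(sets, function(s) val %in% s, logical(1))
hits
```

The result keeps the names S1, S2, and S3, just like the sapply() version.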

remove first occurrence data frame R

So I've been playing around with a data frame in R, although I'm still thinking too much in Python and cannot seem to find a solution for my problem.
I have a data frame and one of the columns is a user id. I would like to remove the first occurrence of each number, for instance:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
Where the first time the user_id appears I would remove it but keep all the others even if repeated.
With python I would probably use enumerate or loop over it. For R, I've seen some functions that seem cool but I'm not sure how to use it with the data frame, like rle.
Any pointers will be really helpful since right now I'm a bit lost about the best approach for this problem.
Thank you all
The function duplicated() is going to be helpful here:
x <- c(1,2,3,4,3,4,2,1,3,4,6,7,7)
> x[duplicated(x)]
[1] 3 4 2 1 3 4 7
This works because duplicated() returns a logical vector indicating whether that element is, well, duplicated:
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
You then use this logical vector to subset (extract) the values you want from x. But notice that in the extraction I keep all of the duplicated values rather than removing them.
To remove all of the duplicated values (not what you want, but I illustrate regardless), try the negation:
x[!duplicated(x)]
[1] 1 2 3 4 6 7
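Since the question is about a data frame rather than a bare vector, the same logical vector can be used to select rows. A minimal sketch, assuming the id column is called user_id as in the question; the frame visits and its visit column are invented for illustration:

```r
# Hypothetical data frame; only user_id comes from the question
visits <- data.frame(
  user_id = c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7),
  visit   = seq_len(13)
)

# Keep every row whose user_id has already appeared further up
later_visits <- visits[duplicated(visits$user_id), ]
later_visits$user_id
```

The row index goes before the comma, so all columns of the matching rows are kept.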
