grepl in R not working properly while subsetting - r

I have a big dataset with some columns representing amount with Decimal(5,2) format:
DF
Name|Salary|State
Joe|12345.34|AZ
Mac|3423.67|CT
Lilly|12342.345|CA
Only Joe's salary matches the Decimal(5,2) pattern, so after subsetting I should get the records that do NOT match it on the Salary column. Thus the result should be
Name|Salary|State
Mac|3423.67|CT
Lilly|12342.345|CA
I want to use subset function:
subset(grepl("^[[:digit:]]{,5}\\.[[:digit:]]{,2}$",DF$Salary)
OR
subset(grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",DF)
subset(grepl("[[:digit:]]{,5}[.][[:digit:]]{,2}",DF)
None of these give me correct result.
On further investigation I found that grepl itself doesn't work properly.
Example:
x <- "12345.45"
grepl("[[:digit:]]{,5}\\.[[:digit:]]{,2}",x) # TRUE
grepl("[[:digit:]]{,4}\\.[[:digit:]]{,2}",x) # TRUE
grepl("[[:digit:]]{,4}[.][[:digit:]]{,2}",x) # TRUE

Floating-point comparisons aren't accurate; read "Why are these numbers not equal?".
However, in this case you can use:
subset(DF, !grepl('\\d{5}\\.\\d\\d$', Salary))
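A likely reason the patterns in the question all return TRUE is that they are unanchored and use an upper bound only, so grepl is free to match a shorter substring such as "2345.45" inside "12345.45". The sketch below spells the bound out as {0,4} (assumed here to be what {,4} means in R's default regex engine) and uses a toy copy of DF with Salary stored as character.

```r
# Toy copy of the data from the question (Salary kept as character)
DF <- data.frame(
  Name = c("Joe", "Mac", "Lilly"),
  Salary = c("12345.34", "3423.67", "12342.345"),
  State = c("AZ", "CT", "CA"),
  stringsAsFactors = FALSE
)

x <- "12345.45"

# Unanchored: matches the substring "2345.45", so the result is TRUE
unanchored <- grepl("[[:digit:]]{0,4}\\.[[:digit:]]{0,2}", x)

# Anchored with ^ and $: the whole string must fit, so the result is FALSE
anchored <- grepl("^[[:digit:]]{0,4}\\.[[:digit:]]{0,2}$", x)

# Keep the rows whose Salary is NOT exactly 5 digits, a dot, and 2 digits
kept <- subset(DF, !grepl("^[[:digit:]]{5}\\.[[:digit:]]{2}$", Salary))
```

With the anchors in place, kept contains only the Mac and Lilly rows.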

Related

R mean of one column based on another [duplicate]

I have a dataset named bwght which contains the variable cigs (cigarettes smoked per day)
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get a number 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 do not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non smokers = 0
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first index selects rows and the second selects columns. So, you can subset using dataframe[1,2] to get the first row, second column. You can also use a logical condition in the row position. By using bwght$cigs>0 as the row index, you keep only the rows where cigs is greater than zero.
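As a small illustration of logical row selection, here is a sketch on a made-up six-row stand-in for bwght (the values are invented for the example, not taken from the real dataset):

```r
# Hypothetical stand-in for bwght: four non-smokers and two smokers
bwght <- data.frame(cigs = c(0, 0, 10, 0, 20, 0))

# Mean over everyone, non-smokers included
mean_all <- mean(bwght$cigs)

# Mean over the rows where cigs > 0 only
mean_smokers <- mean(bwght[bwght$cigs > 0, "cigs"])

# Equivalent form that subsets the column vector directly
mean_smokers_vec <- mean(bwght$cigs[bwght$cigs > 0])
```

Here mean_all is 5 while both smoker-only versions give 15.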
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison. You're asking for the TRUE / FALSE result of bwght$cigs OR bwght$cigs>0, and then taking the mean of it. mean() does accept logical vectors (TRUE counts as 1 and FALSE as 0), so what you get back is the proportion of women who smoke, not the average number of cigarettes.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
Same problem: the | operator returns a logical vector, so at best you would get a proportion again. On top of that, = is not a comparison (that would be ==), so R rejects this line with a syntax error.
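A quick check on a toy vector (values invented for illustration) suggests what mean() does with the result of |: the logical values are treated as 0 and 1, so the mean is the share of TRUE entries.

```r
# Made-up cigarette counts: two smokers out of six
cigs <- c(0, 0, 10, 0, 20, 0)

# `|` coerces the numeric side to logical (non-zero becomes TRUE)
is_smoker <- cigs | cigs > 0

# The mean of a logical vector is the proportion of TRUE values
share_smokers <- mean(is_smoker)
```

share_smokers comes out as 2/6 here, which is a proportion rather than an average number of cigarettes.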
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to type at first. Basically, if() doesn't work the same way in R as it does in SAS. In that example, you are using bwght$cigs > 0 as the if condition, which won't work because if() expects a single TRUE or FALSE. Given a vector, older versions of R only look at the first element (with a warning), and R 4.2.0 and later stop with an error. R handles looping differently from SAS, so check out vectorised subsetting and functions like lapply, tapply, and so on.
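For a row-wise condition like this one, a vectorised form usually replaces the if() block. A minimal sketch on an invented vector:

```r
# Made-up cigarette counts
cigs <- c(0, 0, 10, 0, 20, 0)

# Sum only the positive entries; no if() or explicit loop needed
total_smoked <- sum(cigs[cigs > 0])

# ifelse() is the element-wise counterpart of if/else
status <- ifelse(cigs > 0, "smoker", "non-smoker")
```

The condition cigs > 0 is evaluated for every element at once, and the resulting logical vector does the selecting.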
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
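As far as I can tell, as.numeric() has no rm argument, so the extra argument is silently ignored (with or without the quotes). x is then just a numeric copy of cigs, and mean(x) gives the overall mean again.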
mean(bwght[bwght$cigs>0,"cigs"])
I found the statement failed, returning "argument is not numeric or logical: returning NA"
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))
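One possible cause of that warning, assuming the subset came back as a one-column data frame rather than a vector (which can happen when bwght is a tibble, or when drop = FALSE is in effect), can be reproduced in base R. The data below are invented for the sketch.

```r
# Hypothetical stand-in for bwght
bwght <- data.frame(cigs = c(0, 0, 10, 0, 20, 0))

# drop = FALSE keeps the result as a one-column data frame
as_frame <- bwght[bwght$cigs > 0, "cigs", drop = FALSE]

# mean() on a data frame warns and returns NA
mean_of_frame <- suppressWarnings(mean(as_frame))

# Subsetting the column vector works regardless of the container type
mean_of_vector <- mean(bwght$cigs[bwght$cigs > 0])
```

Subsetting bwght$cigs directly avoids the problem without the conversion to a matrix.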

How to check for the presence of multiple strings for each value of a particular column in R Dataframe?

How do we identify all those row entries in a particular column that contain a specific set of keywords?
For example, I have the following dataframe:
test <- data.frame(nom = 1:5, name = c("ser bla", "onlybla", "inspectiongfa serdafds", "inspection", "serbla blainspection"))
My keywords of interest are "ser" & "inspection"
What I'm looking for is to list all the values of the second column (i.e. name) in which both keywords are present together.
So basically, my output should list the name values of rows 3 and 5, viz. "inspectiongfa serdafds" & "serbla blainspection"
What I have tried is the following:
I first generate a truth table that records the presence of each keyword for each row in the dataframe as follows:
as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
Once I get this, all I have to do is identify the rows where the values are a TRUE TRUE pair, since those are the cases where both keywords of interest are present. Here it's the same rows 3 & 5.
But I'm not able to figure out how to identify the rows with the TRUE TRUE pair, and I wonder whether this whole process is overkill and can be done in a much more efficient manner.
Any help would be appreciated. Thanks!
You're almost there :)
Here's a solution extending what you have done:
# store your logic test outcomes
conditions_df <- as.data.frame(sapply(c("ser", "inspection"), grepl, test$name))
# FALSE counts as 0 and TRUE as 1, so rowSums gives the number of keywords found per row.
# Rows with a total of 2 (TRUE + TRUE) contain both keywords; which() returns their indices.
locate_rows <- which(rowSums(conditions_df) == 2)
test$name[locate_rows]
[1] "inspectiongfa serdafds" "serbla blainspection"
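A slightly more direct variant, sketched here as one option rather than the only way, combines one grepl result per keyword with a logical AND. fixed = TRUE is an assumption that the keywords are plain substrings rather than regular expressions.

```r
test <- data.frame(
  nom = 1:5,
  name = c("ser bla", "onlybla", "inspectiongfa serdafds",
           "inspection", "serbla blainspection"),
  stringsAsFactors = FALSE
)

keywords <- c("ser", "inspection")

# One logical vector per keyword, then AND them together element-wise
has_all <- Reduce(`&`, lapply(keywords, grepl, x = test$name, fixed = TRUE))

# Names of the rows that contain every keyword
matches <- test$name[has_all]
```

This scales to any number of keywords without hard-coding the row total.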

R - Combining Columns to String Based on Logical Match

In a previous question (Setting Value Based on Matching Column) I was trying to take a string, split it into elements and create a column per element with a logical statement. This was answered brilliantly.
But after a couple of months' work on things I now need something along the lines of an inverse.
Given...
df <- data.frame(E1=FALSE,E11=TRUE,E20=FALSE,E30=FALSE,E31=TRUE,E100=FALSE,E300=FALSE,E313=TRUE,ECAT=TRUE)
I need to produce a string containing all the column names that have a TRUE match - which would hopefully yield something like...
> df[1,]
E1 E11 E20 E30 E31 E100 E300 E313 ECAT Topics
1 FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE E11,E31,E313,ECAT
In reality I have 3,270 rows and there are actually 102 topics so really I need something that will for each row provide a concatenation of those TRUE topic codes.
My attempts have yielded nothing working, who will volunteer up an answer OR a link to duplicate question/answer (as they probably exist - it is an R question after all)?
You can try
df$Topics <- apply(df, 1, function(x) toString(names(x)[x]))
You can use apply to do this.
df$Topics = apply(df,1,function(x) paste0(colnames(df)[x],collapse=','))
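For the multi-row case, a sketch along the same lines is shown below. It uses a cut-down, invented two-row frame and restricts apply to the logical columns, so the code can be re-run after Topics has already been added; topic_cols is a name introduced for the example.

```r
# Invented two-row example with three of the topic columns
df <- data.frame(
  E1 = c(FALSE, TRUE),
  E11 = c(TRUE, FALSE),
  ECAT = c(TRUE, TRUE)
)

# Only the logical columns are topic flags
topic_cols <- names(df)[vapply(df, is.logical, logical(1))]

# For each row, join the names of the columns that are TRUE
df$Topics <- apply(df[topic_cols], 1,
                   function(x) paste(topic_cols[x], collapse = ","))
```

Each row of df$Topics then holds the comma-separated codes for that row.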

Types and comparisons in R

I've been working with R for a month or so, and my comprehension of some subtleties is still quite superficial.
I have had an issue, which I managed to solve (details below), but I still can't explain precisely why it did not work with the first solution.
Note that the example below makes no practical sense for I have simplified it as much as possible so that the problem is quite clear.
ISSUE :
Given a data frame with 4 columns (email, first, last, company) :
> users <- data.frame(matrix(vector(), 0, 4, dimnames=list(c(), c("email", "first", "last", "company"))), stringsAsFactors=F)
> users[1,] <- c("robert#redford.com", "Robert", "Redford", "Paramount")
> users[2,] <- c("julia#roberts.com", "Erin", "B.", "Hinkley")
> users[3,] <- c("matt#damon.com", "Will", "H.", "Stanford")
> users[4,] <- c("john#malkovitch.com", "John", "M.", "JM")
I take one particular row :
> user <- users[3,]
When I try to subset the dataframe on a criterion that should have returned the previously mentioned row, it returns no result.
> users[users$email == user["email"],]
[1] email first last company
<0 rows> (or 0-length row.names)
I instantly thought it was a casting issue (sorry for this bad one)
> users[users$email == as.character(user["email"]),]
email first last company
3 matt#damon.com Will H. Stanford
However, when I tried to figure out where exactly the issue was, and tried this :
> users[users$email == "matt#damon.com",]
email first last company
3 matt#damon.com Will H. Stanford
> user["email"] == "matt#damon.com"
email
3 TRUE
> users[3,]$email == user$email
[1] TRUE
I got quite confused :
First, I thought about it as a math problem : if A == B and B == C, then A == C (according to Captain Obvious). So, just replacing a member A by another member B which is supposed to be equal to A (given the "TRUE" statement) in some expression should have no impact on the result of this expression.
3 TRUE != [1] TRUE. I think [1] TRUE is a logical vector of size 1 whose first element is TRUE, while 3 TRUE is a (1x1) matrix row whose "email" column value is TRUE.
My problem is with consistency : either two objects of equal content but different types should be equal, or they should be different. I have a problem with "Sometimes there is type inference, and sometimes not". Is there a rule I can't see beyond this behavior ? (I guess there is one)
Another expression of the behavior I'd like to get is this one :
> unique(users$email) == "matt#damon.com"
[1] FALSE FALSE TRUE FALSE
> unique(users$email) == user["email"]
email
3 FALSE
Obviously R does get what I want (considering the fact that it gives me the matching row). But I can't explain (nor use) the result of the second statement.
Any explanations / thoughts?
In normal list situations you would extract the element itself with [[ ]]:
users$email == user[["email"]]
However, in data.frames indexing gets inconsistent / a lot worse!
tdf=data.frame(matrix(1:100,10,10))
tdf[] # returns data.frame everything
tdf[1] # returns data.frame first column
tdf[1,1] # returns the single element itself (here an integer)
tdf[,1] # returns a vector of the first column
tdf[1,] # returns a data.frame of the first row # eeeeeugh... that is odd....
tdf[2:4] # returns a data.frame with 3 columns
tdf[1,2:4] # returns a data.frame of the first row of 3 columns
tdf[2:4,2:4] # returns a 3x3 data.frame
tdf[2:4,1] # returns a vector of rows 2:4 of the 1st column
tdf[,2:4] # returns a data.frame with 3 columns
then there is also the double [[ ]]
do note that in data.frames things get horribly annoying and fugly
tdf[[1]] # gives the first column as a vector
tdf[[1,1]] # gives the first element
and pretty much all other combinations give errors
and assigning stuff to a data.frame or matrix is an even bigger mess!
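Applied to the question, the difference seems to come down to what each indexing form returns: user["email"] is still a one-column data frame, while user[["email"]] (or user$email) is a plain character vector that compares element-wise. A minimal sketch on a reduced copy of users:

```r
users <- data.frame(
  email = c("robert#redford.com", "julia#roberts.com", "matt#damon.com"),
  first = c("Robert", "Erin", "Will"),
  stringsAsFactors = FALSE
)
user <- users[3, ]

# Single brackets keep the data.frame wrapper
class_single <- class(user["email"])

# Double brackets extract the underlying vector
class_double <- class(user[["email"]])

# Comparing vector with vector gives one TRUE/FALSE per row
hit <- users[users$email == user[["email"]], ]
```

Here class_single is "data.frame", class_double is "character", and hit is the single matching row.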

remove first occurrence data frame R

So I've been playing around with a data frame in R, although I'm still thinking too much in Python and cannot seem to find a solution for my problem.
I have a data frame and one of the columns is a user id. I would like to remove the first occurrence of each number, for instance:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
Where the first time the user_id appears I would remove it but keep all the others even if repeated.
With python I would probably use enumerate or loop over it. For R, I've seen some functions that seem cool but I'm not sure how to use it with the data frame, like rle.
Any pointers will be really helpful since right now I'm a bit lost about the best approach for this problem.
Thank you all
The function duplicated() is going to be helpful here:
x <- c(1,2,3,4,3,4,2,1,3,4,6,7,7)
x[duplicated(x)]
[1] 3 4 2 1 3 4 7
This works because duplicated() returns a logical vector indicating whether that element is, well, duplicated:
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
You then use this logical vector to subset (extract) the values you want from x. But notice that in the extraction I keep all of the duplicated values, not remove them.
To remove all of the duplicated values (not what you want, but I illustrate regardless), try the negation:
x[!duplicated(x)]
[1] 1 2 3 4 6 7
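Since the question is about a data frame column, the same logical vector can go in the row position. A sketch, assuming a column called user_id and a made-up second column visit added only to show that whole rows are kept:

```r
# Hypothetical data frame: user_id from the question plus an invented visit counter
df <- data.frame(
  user_id = c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7),
  visit = seq_len(13)
)

# Keep every row whose user_id has already appeared earlier,
# which drops the first occurrence of each id
repeats <- df[duplicated(df$user_id), ]
```

repeats$user_id is then 3, 4, 2, 1, 3, 4, 7, matching the output asked for.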
