R: "Which" function gives unexpected (and unwanted) results when using a comparison (>=)

I do not understand why the "which" function is not giving the right result here. I just want to select airports in Pennsylvania with more than 5,000 commercial service operations, but the result always contains airports with fewer than 5,000 commercial service operations, as seen below.
I have read several questions indicating issues with "which", but I have seen none of that sort, and I have not had this issue before using this function.
Test4<-Airports[which(Airports$LAN_FA_TY=="AIRPORT" &
Airports$STATE_NAME=="PENNSYLVANIA" &
Airports$COMM_SERV>= "5000")
, ]
Test4$COMM_SERV
# [1] 77680 73 71

In the snippet Airports$COMM_SERV >= "5000", the quotes around 5000 turn the number into a character string. The comparison is then done on text rather than on numbers, so "73" and "71" count as greater than "5000" because "7" sorts after "5". Just remove the "" and it should work as expected. (The use of comparison operators for character objects is allowed and meaningful. See ?'>='. It's just that the results may not be what were expected.)
Looking at your code, you could also benefit from using with() to reduce the typing and improve readability:
with(Airports, Airports[which(LAN_FA_TY == "AIRPORT" & STATE_NAME == "PENNSYLVANIA" & COMM_SERV >= 5000),])
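To see why the quoted threshold misbehaves, here is a minimal sketch using a made-up vector (the values are illustrative and not taken from the Airports data). It compares the same numbers against "5000" and against 5000:

```r
# Hypothetical operation counts; not from the real Airports data
comm_serv <- c(77680, 73, 71, 4999, 12345)

# Quoted threshold: the numbers are coerced to strings and compared as text
as_text <- comm_serv >= "5000"

# Unquoted threshold: ordinary numeric comparison
as_number <- comm_serv >= 5000

comm_serv[as_text]    # 77680 73 71
comm_serv[as_number]  # 77680 12345
```

The text comparison only looks at the characters from left to right, which is why 12345 is dropped and 73 is kept.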

Related

R empty "" rows cannot be removed

I am scraping comments from Reddit and trying to remove empty rows/comments.
A number of rows appear empty, though I cannot seem to remove them. When I use is_empty they do not appear empty.
> Reddit[25,]
[1] "​"
> is_empty(Reddit$text[25])
[1] FALSE
> Reddit <- subset(Reddit, text != "")
> Reddit[25,]
[1] "​"
Am I missing something? I've tried a couple of other methods to remove these rows and they haven't worked either.
Edit:
Included dput example in answer to comments:
RedditSample <- data.frame(text=
c("I liked coinbase, used it before. But the fees are simply too much. If they were to take 1% instead 2.5% I would understand. It's much simpler and long term it doesn't matter as much.",
"But Binance only charges 0.1% so making the switch is worth it fairly quickly. They also have many more coins. Approval process took me less than 10 minutes, but always depends on how many register at the same time.",
"​", "Here's a 10%/10% referal code if you chose to register: KHELMJ94",
"What is a spot wallet?"))
Actually, the data you shared doesn't contain an empty string; it contains a Unicode zero-width space character (U+200B). You can see that with
charToRaw(RedditSample$text[3])
# [1] e2 80 8b
You could make sure there is a non-space character using a regular expression that matches a "word" character
subset(RedditSample, grepl("\\w", text))
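As a rough illustration of the point above, the following sketch uses a small made-up vector (not the Reddit data) to show that a zero-width space has length 1, and two ways of dropping such entries. Stripping U+200B explicitly with gsub() is an alternative I am assuming you might want; it is not part of the answer above.

```r
# Illustrative vector: real text, a zero-width space, an empty string, blanks
text <- c("What is a spot wallet?", "\u200b", "", "   ")

nchar(text)   # the zero-width space counts as one character
text != ""    # so it survives a comparison with the empty string

# Option 1: keep entries that contain at least one "word" character
kept_by_regex <- text[grepl("\\w", text)]

# Option 2: strip zero-width spaces and surrounding whitespace, then drop empties
cleaned <- trimws(gsub("\u200b", "", text, fixed = TRUE))
kept_by_strip <- cleaned[nzchar(cleaned)]
```

Option 1 also discards entries made up only of punctuation or emoji, so option 2 may be the safer choice if those comments matter.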
You could use the string length functions, although these only drop truly empty strings (a zero-width space still has length 1). For example, with the tidyverse, which includes the stringr package:
library(tidyverse)
Reddit %>%
filter(str_length(text) > 0)
Or base R:
Reddit[ nchar(Reddit$text) >0, ]

Function to use "-" as text

I'm having a problem in R.
I have some data where I have different regions. Below is an example; these are the actual names, not values.
D1$Regions = c(Ab, Ba, Da-Cd, Db-a, Da-Aa)
If I only want to use the region Da-Cd and trying to select that one only I get an error.
D-C<-D1[D1$Regions =="Da-Cd",]
I get the following error:
Error in D - C <- D1[D1$Region == "Da-Cd", ] :
object 'Da' not found
I assume it's because it's trying to subtract C from D, but in this case the actual name of the region is D-C. What can I do to only select that region?
Can I do it without having to rename the region?
In my data I have several regions with a "-" between letters. It'll be fine if the "-" is removed for all the different regions.
I have tried to use D1$Region as character and as factor but that doesn't help.
Thanks.
A hyphen shouldn't be used in an object name because R parses it as the subtraction operator, which makes the name non-syntactic. But you could get around it by surrounding the name that contains the illegal symbol with backticks (``), although in some (all?) contexts the more familiar "" will suffice.
`D-C` <- D1[D1$Regions =="Da-Cd",]
R reads Da-Cd as Da minus Cd. I'd suggest using character strings as the region names, i.e. c("Ab", "Ba", "Da-Cd", "Db-a", "Da-Aa")
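Putting both answers together, here is a small sketch with a made-up D1 (only the region names come from the question; the data frame itself is an assumption). It shows a syntactic object name, a backtick-quoted one, and removing the hyphens from the values:

```r
# Hypothetical data frame built from the region names in the question
D1 <- data.frame(Regions = c("Ab", "Ba", "Da-Cd", "Db-a", "Da-Aa"),
                 stringsAsFactors = FALSE)

# Simplest fix: give the result a syntactic name
DaCd <- D1[D1$Regions == "Da-Cd", , drop = FALSE]

# Backticks allow a name containing "-", at the cost of quoting it on every use
`D-C` <- D1[D1$Regions == "Da-Cd", , drop = FALSE]
nrow(`D-C`)  # 1

# If the hyphens are not needed, strip them from the values
D1$Regions_clean <- gsub("-", "", D1$Regions, fixed = TRUE)
```

Note that the hyphen inside the quoted string "Da-Cd" is never a problem; only the object name on the left of <- is parsed as code.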

Comparing Strings for match in a vectorized way

I have a large data frame, which contains two columns containing strings. When these columns are unequal, I want to do an operation.
The problem is that when I use a simple != operator, it gives incorrect results. I.e. apparently, 'Tout_Inclus' & 'Tout_Inclus' are unequal.
This leads me to string comparison functions, like strcmp from pracma package. However, this is not vectorised - my dataframe has 9.6M rows, therefore I think this would crash/take ages if I looped through.
Has anyone got any vectorised methods for comparing strings?
My dataframe looks like this:
City_Break City_Break
City_Break City_Break
Court_Break Court_Break
Petit_Budget Petit_Budget
Pas_Cher Pas_Cher
Deals Deals_Pas_Chers
Vacances Vacances_Éco
Hôtel_Vol Hôtel_Vol
Dernière_Minute Dernière_Minute
Formule Formule_Éco
Court_Séjour Court_Séjour
Voyage Voyage_Pas_Cher
Séjour Séjour_Pas_Cher
Congés Congés_Éco
when I do something like df[colA != colB,] it gives incorrect results, where strings (by looking at them) are equal.
I've ensured encoding is UTF-8, strings are not factors, and I also tried removing special characters before doing the comparison.
By the way, these strings are from multiple languages.
edit: I've already trimmed whitespaces, and still no luck
Try removing leading/trailing whitespace from both columns, and then compare:
df[trimws(df$colA, "both") != trimws(df$colB, "both"), ]
If everything else is fine (trimming, etc.), yours could be an encoding problem. In UTF-8 the same accented character can be represented by different byte sequences: either as a single precomposed code point or as a base letter followed by a combining accent. That would be very strange for 'Tout_Inclus', however, since it has no accents.
Just to have a check, from stringi package try this:
stringi::stri_compare(df$colA,df$colB, "fr_FR")
What's the output?
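To illustrate the byte-sequence issue, here is a minimal sketch with two hand-built strings (they are assumptions for the demo, not rows from the asker's data). Both display as "Éco", yet they compare as unequal. The normalization step uses stringi::stri_trans_nfc and only runs if stringi is installed.

```r
composed   <- "\u00c9co"    # É as a single precomposed code point
decomposed <- "E\u0301co"   # E followed by a combining acute accent

composed == decomposed      # FALSE, although both print as "Éco"
nchar(composed)             # 3
nchar(decomposed)           # 4
charToRaw(composed)         # c3 89 63 6f
charToRaw(decomposed)       # 45 cc 81 63 6f

# Normalizing both sides to NFC makes the comparison behave; this is vectorized
if (requireNamespace("stringi", quietly = TRUE)) {
  equal_after_nfc <- stringi::stri_trans_nfc(composed) ==
    stringi::stri_trans_nfc(decomposed)
}
```

On a large data frame the same idea would be applied to both columns before the != comparison. Comparing nchar() of the two columns is a cheap first check for rows that look equal but are not.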

Error checking an inputed value

I've got a function that uses readline to ask people to enter data. But I'm at a loss as to the best method to ensure that the data entered meet my criteria. I'm figuring "if" statements may be the best way to go to check for errors, but I'm not sure how to incorporate them. My attempt at using them is obviously flawed (see below).
As a simple example, two of the most likely problems I'm going to run into: I'd like to ensure that at least some value is entered for x (and that, if a value is entered for x, it is a number), and that V1 and V2 contain the same number of values.
fun<-function(){
T<-readline("What is x" )
if(T=="" | typeof(x)!=numeric)
{print("Input non-aceptable")
T<-readline("What is x ")}
else
V<-readline("Enter 4 values" )
V2<-readline("Enter 4 more values ")
if(length(V1)!=length(V2))
{print("V1 & V2 do not contain equal # of values")
V<-readline("Enter 4 values ")
V<-readline("Enter 4 more values ")}
else
T<-as.numeric(T)
V<-as.numeric(V)
V2<-as.numeric(V2)
return(list(x,V1,V2)
}
As you can see, my hope is to try and spot potential errors before they cause an actual error to happen, and then to give the person an opportunity to re-enter the data. If "if" statements are the way to go, can I get some help on using them correctly?
Thanks!
In R the boolean types TRUE and FALSE can also be represented by T and F. So first off try changing the variables that you have named T to something sensible... like x maybe???
Secondly, your typeof(x) test refers to x, but you called the variable T, so that won't work. In addition, there were no quotes around numeric. Note also that readline always returns a character string, so a type check cannot tell you whether a number was typed; try if(is.na(as.numeric(x))) instead.
Thirdly, your variables are inconsistently named: V and V, and then V1 and V2. Aside from being hard to read, it also just won't work.
Lastly, your return statement needs a second closing parenthesis, and the function code block needs a closing curly brace.
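One possible way to structure the re-prompting is a loop rather than nested if/else. The sketch below is only an illustration: ask_until_valid, parse_number and parse_values are hypothetical helper names, and the read argument exists solely so the function can be tried without a keyboard.

```r
# Hypothetical helper: keep asking until `validate` returns a non-NULL value
ask_until_valid <- function(prompt, validate, read = readline) {
  repeat {
    answer <- read(prompt)
    value <- validate(answer)
    if (!is.null(value)) return(value)
    message("Input not acceptable, please try again.")
  }
}

# Accept a single number; empty or non-numeric input gives NULL
parse_number <- function(answer) {
  value <- suppressWarnings(as.numeric(answer))
  if (length(value) == 1 && !is.na(value)) value else NULL
}

# Build a validator that accepts exactly n numbers separated by spaces or commas
parse_values <- function(n) {
  function(answer) {
    parts <- strsplit(trimws(answer), "[ ,]+")[[1]]
    values <- suppressWarnings(as.numeric(parts))
    if (length(values) == n && !anyNA(values)) values else NULL
  }
}

fun <- function() {
  x  <- ask_until_valid("What is x? ", parse_number)
  V1 <- ask_until_valid("Enter 4 values: ", parse_values(4))
  V2 <- ask_until_valid("Enter 4 more values: ", parse_values(4))
  list(x = x, V1 = V1, V2 = V2)
}
```

Because each vector is forced to have exactly four numbers, V1 and V2 necessarily end up the same length, so no separate length check is needed.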

Variable name restrictions in R

What are the restrictions as to what characters (and maybe other restrictions) can be used for a variable name in R?
(This screams of general reference, but I can't seem to find the answer)
You might be looking for the discussion from ?make.names:
A syntactically valid name consists of letters, numbers and the dot or
underline characters and starts with a letter or the dot not followed
by a number. Names such as ".2way" are not valid, and neither are the
reserved words.
In the help file itself, there's a link to a list of reserved words, which are:
if else repeat while function for in next break
TRUE FALSE NULL Inf NaN NA NA_integer_ NA_real_ NA_complex_
NA_character_
Other good notes from the comments include James's pointer to the R FAQ entry addressing this issue and Josh's pointer to a related SO question dealing with checking for syntactically valid names.
Almost NONE! You can use 'assign' to make ridiculous variable names:
assign("1",99)
ls()
# [1] "1"
Yes, that's a variable called '1'. Digit 1. Luckily it doesn't change the value of integer 1, and you have to work slightly harder to get its value:
1
# [1] 1
get("1")
# [1] 99
The "syntactic restrictions" some people might mention are purely imposed by the parser. Fundamentally, there's very little you can't call an R object. You just can't do it via the '<-' assignment operator unless you wrap the name in backticks. "get" will set you free :)
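A short sketch of both routes to a non-syntactic name follows. The scratch environment and the name `my var` are made up for the demo; using a separate environment simply avoids cluttering the workspace.

```r
scratch <- new.env()               # hypothetical environment for the demo
assign("1", 99, envir = scratch)   # a variable literally named 1
get("1", envir = scratch)          # 99

`my var` <- 5                      # backticks let <- accept a name with a space
`my var` + 1                       # 6
exists("my var")                   # TRUE
```

Such names are legal but awkward, since every later use needs get() or backticks.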
The following may not directly address your question but is of great help.
Try the exists() function to check whether a name is already in use; that way you know which system names to avoid for your own variables or functions.
Example...
> exists('for')
[1] TRUE
> exists('myvariable')
[1] FALSE
Using the make.names() function from the built in base package may help:
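is_valid_name <- function(x)
{
  # Maximum name length; the limit was raised in R 2.13.0
  max_length <- if(getRversion() < "2.13.0") 256L else 10000L
  is_short_enough <- nchar(x) <= max_length
  # make.names() leaves a string unchanged only if it is already a valid name
  is_syntactic <- make.names(x) == x
  # Use & rather than && so the check also works on a vector of names
  is_short_enough & is_syntactic
}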
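As a quick illustration of the make.names() comparison on its own, the sketch below checks a handful of made-up candidate names:

```r
# Candidate names chosen for the demo
candidates <- c("my_var", ".2way", "2way", "for", "Da-Cd", ".hidden")

# A name is syntactically valid when make.names() leaves it unchanged
valid <- make.names(candidates) == candidates

data.frame(name = candidates, fixed = make.names(candidates), valid = valid)
# valid is TRUE only for "my_var" and ".hidden"
```

The fixed column shows what make.names() would turn each invalid name into, for example "for." for the reserved word and "Da.Cd" for the hyphenated name.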
