Comparing Character Vectors Per Row in R if Same True/False - r

I'm missing something very simple here.
I tried referring to This Post
I'm trying to compare 2 character vectors to see if they equal each other on a per row basis, just expecting a TRUE / FALSE result per row. (Trying to find the FALSE)
all(data.frame(dBase$process_name == dBase$import_process))
When I run the above I receive the result:
Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables
I tried this as well but it seems to be overall, not per row.
identical(dBase$process_name, dBase$import_process)
So is there an alternative way to compare characters / strings to see if they are the same, and to pull up the rows where FALSE occurs?

If dBase is the name of an existing data.frame, then you can do the following
to see all the rows where process_name and import_process are not identical.
Notice that I use the != to get only those where they are not equal.
dBase[ dBase$process_name != dBase$import_process, ]
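To see the per-row TRUE/FALSE result itself, and what happens when a value is missing, here is a minimal sketch. The dBase frame below is made-up sample data, and treating an NA comparison as a mismatch is an assumption about what you want:

```r
# Hypothetical sample data; only the two column names come from the question.
dBase <- data.frame(
  process_name   = c("load", "clean", "export", NA),
  import_process = c("load", "Clean", "export", "merge"),
  stringsAsFactors = FALSE
)

# == is vectorised, so this gives one TRUE/FALSE (or NA) per row.
same <- dBase$process_name == dBase$import_process
print(same)
# [1]  TRUE FALSE  TRUE    NA

# Keep rows that differ; rows where the comparison is NA are kept as well.
mismatch <- dBase[which(!same | is.na(same)), ]
print(mismatch)
```

Note that the comparison is case-sensitive ("clean" vs "Clean"), and that indexing directly with a logical vector containing NA returns an all-NA row, which is why which() is used here.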

Related

Detecting "/' in R

I am working in RStudio with a data frame that has a column containing dates. Some rows contain years such as 1687, others contain dates in a format such as 12/12/23, and others contain text such as "First half of 19e century". I am trying to extract numerical values from this column.
My approach is to extract numerical values from all rows except the ones that contain "/", because I want to keep those as they are. Here is my code:
detect<-str_detect("/", MR_all$period)
MR_all$period[detect=FALSE]
MR_all['period_3']=dplyr::case_when(MR_all$period[detect=FALSE]~gsub("\\D", "",MR_all$period))
This does not work because the detect function fails to detect the slash and prints all observations. I'd really appreciate your help in writing a function that detects '/'.
1. You're using a single = where you mean a comparison, as in [detect=FALSE], which is going to return nothing, c.f., mtcars$cyl[detect=FALSE] (with or without detect previously assigned, this does not check nor use it if it exists).
Why? Inside a call, including [..], a single = names an argument rather than comparing anything, so detect=FALSE simply passes the single value FALSE to [..]; the argument name is ignored and no object named detect is created or changed. Effectively, MR_all$period[FALSE] is what you're telling R you want to do, and it happily returns a 0-length vector (likely character(0), but I'm guessing ... see #2).
2. Since you are using str_detect("/", MR_all$period), this suggests that MR_all$period is a column of strings (character class), so your next use of MR_all$period[detect==FALSE] ~ gsub(..) seems wrong: the left-hand side (LHS) of the ~ pairs in case_when must resolve to logical, so a_string ~ gsub(..) is wrong.
3. Further, even if MR_all$period[detect==FALSE] did resolve to something logical so that case_when could continue, we then have the problem that this vector is shorter than the number of rows in the original frame, yet you're assigning this shorter vector back into the entire MR_all['period_3']. If you're lucky this will warn or fail, but if you aren't it will silently recycle data into your frame (which is a logical problem and a chief complaint of mine about recycling arguments).
Ultimately, I think you need one of the following:
Assign the gsubed version of period only for those rows where detect is FALSE, returning NA for all other rows:
MR_all['period_3'] <- dplyr::case_when(
detect == FALSE ~ gsub("\\D", "", MR_all$period)
)
Same as above, but defaulting to the original period for all other rows:
MR_all['period_3'] <- dplyr::case_when(
detect == FALSE ~ gsub("\\D", "", MR_all$period),
TRUE ~ MR_all$period
)
or the preferred method in dplyr 1.1.0 and later:
MR_all['period_3'] <- dplyr::case_when(
detect == FALSE ~ gsub("\\D", "", MR_all$period),
.default = MR_all$period
)
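All of the versions above assume that detect is correct. Note that stringr::str_detect() takes the string first and the pattern second, so str_detect("/", MR_all$period) has the two swapped, which would explain why the slash is never detected; str_detect(MR_all$period, "/") is the intended order. Here is a minimal base-R sketch of the whole idea, using a made-up period vector and swapping in grepl() and ifelse() for str_detect() and case_when():

```r
# Hypothetical sample values modelled on the question.
period <- c("1687", "12/12/23", "First half of 19e century")

# grepl(pattern, x) takes the pattern first; fixed = TRUE matches "/" literally.
has_slash <- grepl("/", period, fixed = TRUE)

# Keep values with a slash as they are; strip non-digits from everything else.
period_3 <- ifelse(has_slash, period, gsub("\\D", "", period))
print(period_3)
# [1] "1687"     "12/12/23" "19"
```

Because has_slash has one element per row, the result has the same length as the input and can be assigned back as a new column without recycling.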

ID names should all have the same number of characters. How do I filter for data without the appropriate number of characters and then delete that data?

I have a dataset where ID names are all supposed to have 16 characters. How do I filter out all of the data that does not have exactly 16 characters so I can delete it from my dataset? I am working in RStudio.
I've tried both of these in an attempt to get R to retrieve the data that did not have exactly 16 characters, but neither worked. I'm new to R so I'm still figuring it out.
length(all_trips$ride_id != 16)
length(nchar(all_trips$ride_id !=16))
You are getting closer and you are on the right track with nchar().
I assume you have a data frame all_trips with a character column ride_id.
Your first attempt:
length(all_trips$ride_id != 16)
translates as "compare each value of ride_id with 16, then find the length of the resulting TRUE/FALSE vector". This returns a single number (the number of rows), which is not what we want.
Your second attempt:
length(nchar(all_trips$ride_id !=16))
translates as "compare each value of ride_id with 16, count the characters in each TRUE/FALSE result, then find the length of that vector". Again, this is just the number of rows and not what we want.
What you want to do is:
"retain only the subset of all_trips where ride_id contains 16 characters"
Which you can do like this:
all_trips_filtered <- all_trips[nchar(all_trips$ride_id) == 16, ]
Or another way using subset, where you can just specify the column name:
all_trips_filtered <- subset(all_trips, nchar(ride_id) == 16)
See ?Extract or ?subset for more help.
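As a quick check of the idea, here is a small sketch with made-up data (the ride_id values and the duration column are invented for illustration). It also counts the bad IDs before dropping them:

```r
# Hypothetical sample data; only all_trips and ride_id come from the question.
all_trips <- data.frame(
  ride_id  = c("A1B2C3D4E5F6G7H8", "SHORT", "1234567890ABCDEF", "TOO_LONG_RIDE_ID_123"),
  duration = c(10, 5, 22, 7),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row: does the ID have exactly 16 characters?
id_ok <- nchar(all_trips$ride_id) == 16

# How many rows would be removed?
n_bad <- sum(!id_ok)
print(n_bad)
# [1] 2

# Keep only the rows with a valid ID.
all_trips_filtered <- all_trips[id_ok, ]
print(all_trips_filtered)
```

The key difference from the attempts in the question is that nchar() wraps only the column, and the comparison with 16 happens outside it.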

":" operator in data table ( This is not regarding the := but only : can anyone please suggest.)

Using the ":" operator I'm trying to add columns in the j argument of a data.table. These are simple 6-month, 12-month ... 36-month aggregations.
OrderQty36M[,':='(Stat6M=sum(M14:M19)),(Stat12M=sum(M14:M25))]
Can the ":" operator be used as a sequence operator in data.table, or is there some other way?
in [.data.table(OrderQty36M, , :=(Stat6M = sum(M14:M19)), (Stat12M = sum(M14:M25))) :
The items in the 'by' or 'keyby' list are length (1). Each must be same length as rows in x or number of rows returned by i (36703).
In addition: Warning messages:
1: In M14:M25 :
numerical expression has 36703 elements: only the first used
2: In M14:M25 :
numerical expression has 36703 elements: only the first used
There are a few things confused in how you are trying to use data.table here:
Your calculation for Stat12M is in the by argument, which I don't believe is intended. The brackets from the Stat6M calculation should be extended around the Stat12M calculation.
I don't think that data.table can select columns with the syntax M14:M19. Fortunately, dplyr can, using the select function.
I think you are intending to sum across rows rather than down the columns, so you want to use rowSums rather than sum.
My corrected version of the code is below.
OrderQty36M[,':='(Stat6M=rowSums(select(.SD,M14:M19)), Stat12M=rowSums(select(.SD,M14:M25)))][]
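To illustrate the row-wise sum over a range of columns without data.table or dplyr, here is a plain base-R sketch. The data frame and the col_range() helper are hypothetical, made up for this example; it is not the data.table syntax itself:

```r
# Hypothetical data: 2 rows, columns M14 ... M25 filled with 1..24 column by column.
OrderQty <- setNames(
  as.data.frame(matrix(1:24, nrow = 2)),
  paste0("M", 14:25)
)

# Hypothetical helper: positions of all columns from `from` to `to`, by name.
col_range <- function(df, from, to) {
  match(from, names(df)):match(to, names(df))
}

# rowSums() adds across the selected columns, giving one total per row.
OrderQty$Stat6M  <- rowSums(OrderQty[col_range(OrderQty, "M14", "M19")])
OrderQty$Stat12M <- rowSums(OrderQty[col_range(OrderQty, "M14", "M25")])

print(OrderQty[c("Stat6M", "Stat12M")])
#   Stat6M Stat12M
# 1     36     144
# 2     42     156
```

By contrast, sum() would collapse everything into a single number, and M14:M19 applied to the column vectors themselves builds a sequence from their first elements, which is what the warnings in the question are about.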

How can I convert a factor variable with missing values to a numeric variable?

I loaded my dataset (original.csv) to R:
original <- read.csv("original.csv")
str(original) showed that my dataset has 16 variables (14 factors, 2 integers). 14 variables have missing values. That was fine, but 3 variables that are actually numeric were read as factors.
I searched the web and found this command: as.numeric(as.character(original$Tumor_Size))
(Tumor_Size is one of the variables that was read as a factor.)
By the way, missing values in my dataset are marked as dot (.)
After running as.numeric(as.character(original$Tumor_Size)), the values of Tumor_Size were listed, followed by a warning message: “NAs introduced by coercion”.
I expected the variable to be converted to numeric after running the command above, but a second str(original) showed that I was wrong: Tumor_Size and the other two variables were still factors. Below is a sample of my dataset:
[image: a piece of my dataset]
How can I solve my problem?
The crucial information here is how missing values are encoded in your data file. The corresponding argument in read.csv() is called na.strings. So if dots are used:
original <- read.csv("original.csv", na.strings = ".")
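A small self-contained check of this, using inline text in place of original.csv (the Patient column and the values are made up for illustration):

```r
# Hypothetical file contents; the real original.csv is not shown in the question.
csv_text <- "Patient,Tumor_Size
1,2.5
2,.
3,4.0"

# na.strings = "." turns fields that consist of a single dot into NA,
# so the column can be read as numeric straight away.
original <- read.csv(text = csv_text, na.strings = ".")

str(original)
print(is.numeric(original$Tumor_Size))
# [1] TRUE
```

Only fields that are exactly "." become NA; the decimal point in 2.5 is unaffected.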
I'm not 100% sure what your problem is but maybe this will help....
original<-read.csv("original.csv",header = TRUE,stringsAsFactors = FALSE)
original$Tumor_Size<-as.numeric(original$Tumor_Size)
This will introduce NAs because the dot (.) cannot be converted to a numeric value. If you replace the NAs with a dot again, the column becomes character. To do that, you can use:
original$Tumor_Size[is.na(original$Tumor_Size)]<-"."
Hope this helps.
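One possible reason the second str(original) still showed factors, assuming the command was run exactly as written in the question, is that the converted values were only printed and never assigned back to the column. A minimal sketch with made-up values:

```r
# Hypothetical stand-in for the data as read in the question (factor column).
original <- data.frame(Tumor_Size = factor(c("2.5", ".", "4.0")))

# Running the conversion on its own prints a numeric vector
# but leaves the data frame unchanged.
suppressWarnings(as.numeric(as.character(original$Tumor_Size)))
print(is.factor(original$Tumor_Size))
# [1] TRUE

# Assigning the result back is what actually changes the column.
# The "NAs introduced by coercion" warning is expected here (the dot becomes NA).
original$Tumor_Size <- suppressWarnings(
  as.numeric(as.character(original$Tumor_Size))
)
print(is.numeric(original$Tumor_Size))
# [1] TRUE
```

Going through as.character() first matters: calling as.numeric() directly on a factor returns the internal level codes, not the original numbers.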

r programming - check for every value in a vector if it is numeric

I am trying to create a logical vector in R, which will indicate for every value of a complete vector, if it is numeric or not.
I am trying to use the function is.numeric, but it only checks whether the vector as a whole is numeric, like this:
vec<-c(1,2,3,"lol")
t<-is.numeric(vec)
t
will produce FALSE
I looked here, but it only tells how to check the entire vector and get a single value.
I looked here, but the issue is not finite vs infinite.
I am trying to take a data set in which some values are numbers and others are a string that implies there is no value, and find the minimum of the numeric values only. For that I am trying to create a logical vector that says, for every entry of the vector, whether it is numeric or not. It is important for me to create that vector, and I am trying to avoid an explicit loop to construct it if possible.
We can use numeric coercion to our advantage. R will warn us to be sure that we meant to change the strings to NA. In this case, it is exactly what we are looking for:
!is.na(as.numeric(vec))
#[1] TRUE TRUE TRUE FALSE
#Warning message:
#NAs introduced by coercion
We can use grepl to get a logical vector. The pattern matches strings made up only of digits and dots from start (^) to end ($), with an optional leading minus sign, so negative and floating-point numbers are allowed as well.
grepl('^-?[0-9.]+$', vec)
#[1] TRUE TRUE TRUE FALSE
NOTE: There will be no warning messages.
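Putting the coercion approach together with the original goal (a minimum over the numeric entries only), a minimal sketch could look like this. The names nums and is_num are made up for the example, and suppressWarnings() is used on the assumption that the coercion warning is expected and safe to hide:

```r
vec <- c(1, 2, 3, "lol")   # mixing types makes this a character vector

# Coerce once; non-numeric strings become NA. The warning is expected here.
nums   <- suppressWarnings(as.numeric(vec))
is_num <- !is.na(nums)
print(is_num)
# [1]  TRUE  TRUE  TRUE FALSE

# Minimum over the numeric entries only.
print(min(nums[is_num]))
# [1] 1

# The two approaches can disagree on edge cases:
# the regex accepts "1.2.3" but rejects scientific notation such as "1e5".
print(grepl('^-?[0-9.]+$', c("1.2.3", "1e5")))
# [1]  TRUE FALSE
print(suppressWarnings(!is.na(as.numeric(c("1.2.3", "1e5")))))
# [1] FALSE  TRUE
```

If the data can contain values like these, the coercion approach follows R's own idea of what parses as a number, while the regex would need to be tightened.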
