Adding characters/letters to value based on condition - r

Edit: The suggested code worked, so in total:
MyData$Numbers_2=with(MyData, ifelse(Numbers %in% MyVector,
paste0("L_N", Numbers),
paste0("L_", Numbers)))
This adds a column with the new (character) values.
I'd like to add a specific string of characters to every value in a column of a dataframe, based on the condition that the value is found in a vector.
#Create Dataframe
MyData=data.frame(Numbers=c(85,84,7,9,82,5,81,80,10))
MyVector=c("80","81","82","83","84","85")
# for Loop with if else condition
for (Nr in MyVector) {MyData$Numbers_2=as.character(MyData$Numbers)
if(is.element(Nr,MyData$Numbers_2)) {MyData$Numbers_2= paste("L_N",MyData$Numbers_2, sep = "")}
else {MyData$Numbers_2= paste("L_",MyData$Numbers_2, sep = "")}
}
Basically, I'd like to have a result like this:
L_N85, L_N84, L_7, L_9 etc.
But what I get is: L_N85, L_N84, L_N7, L_N9.
It adds L_N to every value in the column, not only to the specific ones. How do I have to change my code to get the desired result?

Instead of a for loop or if/else we can do this with vectorized ifelse
with(MyData, ifelse(Numbers %in% MyVector, paste0("L_N", Numbers), paste0("L_", Numbers)))
#[1] "L_N85" "L_N84" "L_7" "L_9" "L_N82" "L_5" "L_N81" "L_N80" "L_10"
Or with case_when from dplyr
library(dplyr)
MyData %>%
mutate(Numbers_2 = case_when(Numbers %in% MyVector ~ paste0("L_N", Numbers),
TRUE ~ paste0("L_", Numbers)))
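To see why the loop in the question prefixes every value: if() evaluates a single condition for the whole column, and is.element(Nr, MyData$Numbers_2) only asks whether Nr occurs anywhere in it, so paste() is then applied to every row. A minimal sketch of the element-wise version in base R, where in_vector and prefix are illustrative names that are not in the original code:

```r
MyData <- data.frame(Numbers = c(85, 84, 7, 9, 82, 5, 81, 80, 10))
MyVector <- c("80", "81", "82", "83", "84", "85")

# One logical value per row: is this row's number listed in MyVector?
# (%in% coerces the numbers to character before matching.)
in_vector <- MyData$Numbers %in% MyVector

# Choose the prefix row by row, then paste it on in one vectorized call
prefix <- ifelse(in_vector, "L_N", "L_")
MyData$Numbers_2 <- paste0(prefix, MyData$Numbers)

MyData$Numbers_2
```

Because the condition has one element per row, each row gets its own prefix and no loop is needed.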

Related

ifelse function on a vector

I am using the ifelse function to obtain either a vector of NA if all the values of this vector are NA, or a vector with all the values not equal to "NA_NA". In my example, I would like to obtain this result
[1] "14_mter" "78_ONHY"
but I am obtaining this
[1] "14_mter"
my example:
vect=c("NA_NA", "14_mter", "78_ONHY")
out=ifelse(all(is.na(vect)), vect, vect[which(vect!="NA_NA")])
What is wrong in this function ?
ifelse is vectorized and its result is as long as the test argument. all(is.na(vect)) always has length one, hence the single-element result. A regular if/else clause is fine here.
vect <- c("NA_NA", "14_mter", "78_ONHY")
if (all(is.na(vect))) {
out <- vect
} else {
out <- vect[vect != "NA_NA"]
}
out
#> [1] "14_mter" "78_ONHY"
Additional note: there is no need for which() here.
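The length rule can be checked directly. A small sketch using the vector from the question (test and res are illustrative names):

```r
vect <- c("NA_NA", "14_mter", "78_ONHY")

# The condition collapses the whole vector to a single logical value
test <- all(is.na(vect))
length(test)

# ifelse() returns something shaped like `test`, so only the first
# element of the "no" branch survives
res <- ifelse(test, vect, vect[vect != "NA_NA"])
length(res)
res
```

Since test has length one, res does too, which matches the single "14_mter" seen in the question.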
The ifelse help file, referring to its three arguments test, yes and no, says:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
so if the test has a length of 1, which is the case for the code in the question, then the result will also have length 1. Instead try one of these.
1) Use if instead of ifelse. if returns the value of the chosen leg so just assign that to out.
out <- if (all(is.na(vect))) vect else vect[which(vect != "NA_NA")]
2) The collapse package has an allNA function so a variation on (1) is:
library(collapse)
out <- if (allNA(vect)) vect else vect[which(vect != "NA_NA")]
3) Although not recommended if you really wanted to use ifelse it could be done by wrapping each leg in list(...) so that the condition and two legs all have the same length, i.e. 1.
out <- ifelse(all(is.na(vect)), list(vect), list(vect[which(vect != "NA_NA")])) |>
unlist()
If the NA value is always the string NA_NA, this works:
grep("NA_NA", vect, value = TRUE, invert = TRUE)
[1] "14_mter" "78_ONHY"
While the pattern matches the NA_NA value, the invert = TRUE argument negates the match(es) and produces the unmatched values
Data:
vect=c("NA_NA", "14_mter", "78_ONHY")

R: Define ranges from text using regex

I need a way to call defined variables depending on a string within a text.
Let's say I have five variables (r010, r020, r030, r040, r050).
If there is a given text in that form "r010-050" I want to have the sum of values from all five variables.
The whole text would look like "{r010-050} == {r060}"
The first part of that equation needs to be replaced by the sum of the five variables and since r060 is also a variable the result (via parsing the text) should be a logical value.
I think regex will help here again.
Can anyone help?
Thanks.
Define the inputs: the variables r010 etc. which we assume are scalars and the string s.
Then define a pattern pat which matches the {...} part and a function Sum which accepts the 3 capture groups in pat (i.e. the strings matched to the parts of pat within parentheses) and performs the desired sum.
Use gsubfn to match the pattern, passing the capture groups to Sum and replacing the match with the output of Sum. Then evaluate it.
In the example the only variables in the global environment whose names are between r010 and r050 inclusive are r010 and r020 (it would have used more had they existed) and since they sum to r060 it returned TRUE.
library(gsubfn)
# inputs
r010 <- 1; r020 <- 2; r060 <- 3
s <- "{r010-050} == {r060}"
pat <- "[{](\\w+)(-(\\w+))?[}]"
Sum <- function(x1, x2, x3, env = .GlobalEnv) {
x3 <- if(x3 == "") x1 else paste0(gsub("\\d", "", x1), x3)
lst <- ls(env)
sum(unlist(mget(lst[lst >= x1 & lst <= x3], envir = env)))
}
eval(parse(text = gsubfn(pat, Sum, s)))
## [1] TRUE
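If adding gsubfn is not an option, roughly the same idea can be sketched in base R with gregexpr and regmatches. This is a variation, not the method above: it assumes the values live in a named numeric vector (vals) rather than in separate global variables, and range_sum is a hypothetical helper name.

```r
# Assumption: values are kept in a named vector instead of global variables
vals <- c(r010 = 1, r020 = 2, r060 = 3)
s <- "{r010-050} == {r060}"

# Hypothetical helper: sum every entry of `vals` whose name lies in the
# range described by a token such as "{r010-050}" or "{r060}"
range_sum <- function(token, vals) {
  parts <- strsplit(gsub("[{}]", "", token), "-", fixed = TRUE)[[1]]
  from <- parts[1]
  # "050" inherits the non-digit prefix of the start name, giving "r050"
  to <- if (length(parts) == 2) paste0(gsub("\\d", "", from), parts[2]) else from
  nms <- names(vals)
  sum(vals[nms >= from & nms <= to])
}

# Locate each {...} token and replace it with its sum
m <- gregexpr("[{][^}]+[}]", s)
tokens <- regmatches(s, m)[[1]]
sums <- vapply(tokens, range_sum, numeric(1), vals = vals)
regmatches(s, m) <- list(as.character(sums))

s
result <- eval(parse(text = s))
result
```

As in the gsubfn version, the range is resolved by comparing names as strings, so this relies on zero-padded names of equal width.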

R: Remove all data frames from work space that have 0 rows (i.e. are empty)

I know this should be easy, but I am baffled on how to solve this problem.
I have a bunch of data frames, some are empty (0 rows, 42 variables), some have information in them (x rows, 42 variables) from a previous working step. I now simply want to delete all those with 0 rows.
First, I get all DF by
alldfnames <- which(unlist(eapply(.GlobalEnv,is.data.frame)))
Second, I tried to write a function to distinguish between the data frames:
isFullDF <- function(x) dim(x)[1] > 0
Third, I tried to
for (i in seq_along(alldfnames)) {
if(isFullDF(alldfnames[i]) == FALSE){
rm(alldfnames[i])
} else {
# do nothing
}
}
But this gives me (for hours now) an error:
Error in if (isFullDF(alldfnames[i]) == FALSE) { :
argument is of length zero
Any idea?
First if you look at alldfnames you'll see it's a vector of integers where names(alldfnames) are the names of the variables you are after. So alldfnames[i] is just a number. So you need
alldfnames <- names(alldfnames)
which is a character vector of df names.
Next, when you call dim(x) and (e.g.) you have a data frame called df in your environment, x is the character string "df", not the data frame. So you need to retrieve the object itself. You can use get for that.
isFullDF <- function(x) nrow(get(x)) > 0
And then when you call rm, you need to tell R that what you are passing are character strings holding the names of the objects to remove, as opposed to removing an object literally called alldfnames[i], i.e.
rm(list=alldfnames[i])
(as an aside, you don't need the else { } if it's empty).
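Putting the three fixes together gives something like the following sketch. To avoid deleting anything from a real workspace, it uses a scratch environment env as a stand-in for .GlobalEnv; env and the object names in it are made up for illustration.

```r
# Scratch environment standing in for .GlobalEnv (assumption for the demo)
env <- new.env()
assign("full_df", data.frame(a = 1:3), envir = env)
assign("empty_df", data.frame(a = integer(0)), envir = env)
assign("not_a_df", 1:10, envir = env)

# Names (not positions) of all data frames in the environment
is_df <- unlist(eapply(env, is.data.frame))
alldfnames <- names(is_df)[is_df]

# Fetch each object by name, and remove it by name if it has no rows
for (nm in alldfnames) {
  if (nrow(get(nm, envir = env)) == 0) {
    rm(list = nm, envir = env)
  }
}

ls(env)
```

For the actual workspace, replace env with .GlobalEnv (or drop the envir arguments when running at top level).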
Using Filter:
alldfnames = names(which(unlist(eapply(.GlobalEnv,is.data.frame))))
rowCounts = sapply(alldfnames,function(x) ifelse(nrow(get(x))==0,1,0))
emptyDF = names(Filter(function(f) f==1, rowCounts))
rm(list = emptyDF)
Try:
x <- eapply(.GlobalEnv,is.data.frame)
alldfnames <- names(x[x==T])
Now alldfnames contains all data frame names in your environment, then use the following function:
isFullDF <- function(nm) nrow(get(nm))>0
And then this one-line code instead of your for loop:
rm(list = alldfnames[!sapply(alldfnames, isFullDF)])

Count number of rows matching a criteria

I am looking for a command in R which is equivalent of this SQL statement. I want this to be a very simple basic solution without using complex functions OR dplyr type of packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
so essentially I would be counting number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
1. sum(mydata$sCode == "CA"), as suggested in the comments; because TRUE is interpreted as 1 and FALSE as 0, this returns the number of TRUE values in your vector.
2. length(which(mydata$sCode == "CA")); the which() function returns a vector of the indices where the condition is met, the length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each row where the condition is met (in this case, rows 1 and 2 of the data frame). The length() of this vector is the number of occurrences.
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
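These counts agree when sCode has no missing values, but they can diverge once an NA is present. A small sketch on made-up data (the column n and the name is_ca are only for illustration):

```r
# Made-up data with one missing state code
dat <- data.frame(sCode = c("CA", "CA", "AC", NA), n = 1:4)

is_ca <- dat$sCode == "CA"   # TRUE TRUE FALSE NA

count_sum   <- sum(is_ca, na.rm = TRUE)    # NA ignored explicitly
count_which <- length(which(is_ca))        # which() drops the NA
count_rows  <- nrow(dat[which(is_ca), ])   # safe row subsetting
count_naive <- nrow(dat[is_ca, ])          # NA index adds an all-NA row

c(count_sum, count_which, count_rows, count_naive)
```

The first three give the intended count; the last one is inflated by the row that the NA index produces, and sum(is_ca) without na.rm = TRUE would return NA.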
mydata$sCode is a vector, it's why nrow output is NULL.
mydata[mydata$sCode == 'CA',] returns data.frame where sCode == 'CA'. sCode includes character. That's why sum gives you the error.
subset(mydata, sCode='CA', select=c(sCode)): you should use sCode=='CA' instead of sCode='CA'. subset then returns a one-column data frame with the rows where sCode equals CA (rows where the condition is NA are dropped), so count its rows instead of summing it:
nrow(subset(mydata, sCode=='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the solutions provided here gave me same error as multi-sam but that one worked.
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
To get the number of observations, counting the rows of the subsetted dataset would be more valid:
nrow(dat[dat$sCode == "CA",])
grep command can be used
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42

dropping current row if the value of current row is the same as previous row in R

As title, I have drafted my script as below:
set.seed(1)
temp <- data.frame(cola=sample(1:10,100,replace=TRUE),
stay=TRUE)
for (loop in (2:nrow(temp))) {
temp[loop,'stay'] <- ifelse(temp[loop,'cola']==temp[loop-1,'cola'],FALSE,TRUE)
}
temp <- temp[temp[,'stay']==TRUE,]
I don't like the for-loop there, can we somehow vectorize it?
Here, we can check whether the current row, i.e. temp$cola[-1], is equal (==) to the previous row, temp$cola[-nrow(temp)], then concatenate (c) the result with FALSE so that its length matches nrow(temp), and use that as an index to drop the rows. This also works with characters.
indx <- with(temp, c(FALSE,cola[-1]==cola[-nrow(temp)]))
temp[!indx,]
How about:
temp <- temp[c(TRUE, diff(temp$cola) != 0),]
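As a sanity check, both vectorized indices can be compared against the loop from the question. This is only a sketch on the question's simulated data; keep_cmp, keep_diff and stay are illustrative names.

```r
set.seed(1)
temp <- data.frame(cola = sample(1:10, 100, replace = TRUE))

# Shifted comparison: keep a row when it differs from the previous one
keep_cmp <- c(TRUE, temp$cola[-1] != temp$cola[-nrow(temp)])

# diff() variant: same idea, but only works for numeric columns
keep_diff <- c(TRUE, diff(temp$cola) != 0)

# Reference: the explicit loop from the question
stay <- rep(TRUE, nrow(temp))
for (i in 2:nrow(temp)) {
  stay[i] <- temp$cola[i] != temp$cola[i - 1]
}

identical(keep_cmp, keep_diff)
identical(keep_cmp, stay)

# drop = FALSE keeps a data frame even though there is only one column here
result <- temp[keep_cmp, , drop = FALSE]
```

Note that the diff() version assumes a numeric column, while the shifted comparison also handles character data.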

Resources