I am looking for a command in R that is the equivalent of this SQL statement. I want a very simple, basic solution that does not use complex functions or dplyr-type packages.
Select count(*) as number_of_states
from myTable
where sCode = "CA"
So essentially I would be counting the number of rows matching my where condition.
I have imported a csv file into mydata as a data frame. So far I have tried these, to no avail.
nrow(mydata$sCode == "CA") ## ==>> returns NULL
sum(mydata[mydata$sCode == 'CA',], na.rm=T) ## ==>> gives Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(subset(mydata, sCode='CA', select=c(sCode)), na.rm=T) ## ==>> FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
sum(mydata$sCode == "CA", na.rm=T) ## ==>> returns count of all rows in the entire data set, which is not the correct result.
and some variations of the above samples. Any help would be appreciated! Thanks.
mydata$sCode == "CA" will return a logical vector, with a TRUE value everywhere that the condition is met. To illustrate:
> mydata = data.frame(sCode = c("CA", "CA", "AC"))
> mydata$sCode == "CA"
[1] TRUE TRUE FALSE
There are a couple of ways to deal with this:
1. sum(mydata$sCode == "CA"), as suggested in the comments; because
TRUE is interpreted as 1 and FALSE as 0, this should return the
number of TRUE values in your vector.
2. length(which(mydata$sCode == "CA")); the which() function
returns a vector of the indices where the condition is met, the
length of which is the count of "CA".
Edit to expand upon what's happening in #2:
> which(mydata$sCode == "CA")
[1] 1 2
which() returns a vector identifying each position where the condition is met (in this case, elements 1 and 2 of the sCode column, i.e. rows 1 and 2 of the data frame). The length() of this vector is the number of occurrences.
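One difference between the two approaches shows up when the column contains missing values. A minimal sketch, assuming a made-up vector codes that stands in for mydata$sCode:

```r
# Hypothetical toy vector with one missing state code
codes <- c("CA", "CA", NA, "AC")

n_plain <- sum(codes == "CA")                # NA: NA == "CA" is NA, and sum() propagates it
n_narm  <- sum(codes == "CA", na.rm = TRUE)  # 2
n_which <- length(which(codes == "CA"))      # 2: which() drops NA positions on its own
```

So if the real column may contain NA, either pass na.rm = TRUE to sum() or go through which().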
sum is used to add elements; nrow is used to count the number of rows in a rectangular array (typically a matrix or data.frame); length is used to count the number of elements in a vector. You need to apply these functions correctly.
Let's assume your data is a data frame named "dat". Correct solutions:
nrow(dat[dat$sCode == "CA",])
length(dat$sCode[dat$sCode == "CA"])
sum(dat$sCode == "CA")
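A caveat that may matter for the first form: with logical row indexing, an NA in the condition produces an all-NA row, which nrow() then counts. A small sketch on an invented data frame (the pop column is only there for illustration):

```r
# Hypothetical data with one missing sCode
dat <- data.frame(sCode = c("CA", "CA", NA, "AC"), pop = c(1, 2, 3, 4))

n_index  <- nrow(dat[dat$sCode == "CA", ])         # 3: the NA condition yields an extra all-NA row
n_which  <- nrow(dat[which(dat$sCode == "CA"), ])  # 2: which() drops the NA
n_subset <- nrow(subset(dat, sCode == "CA"))       # 2: subset() treats NA as FALSE
n_sum    <- sum(dat$sCode == "CA", na.rm = TRUE)   # 2
```

If the column has no missing values, all four give the same count.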
mydata$sCode is a vector, not a data frame, and so is the comparison mydata$sCode == "CA"; that's why nrow returns NULL.
mydata[mydata$sCode == 'CA',] returns a data.frame of the rows where sCode == 'CA'. sCode is a character column, which is why sum gives you the error.
In subset(mydata, sCode='CA', select=c(sCode)), you should use sCode=='CA' instead of sCode='CA'. subset then returns a one-column data frame of the rows where sCode equals CA (rows where the condition is NA are dropped), so count its rows instead of summing it:
nrow(subset(mydata, sCode=='CA', select=c(sCode)))
Or you can try this: sum(na.omit(mydata$sCode) == "CA")
With the dplyr package, use
nrow(filter(mydata, sCode == "CA"))
All the other solutions provided here gave me the same error as multi-sam, but this one worked.
Just give a try using subset
nrow(subset(data,condition))
Example
nrow(subset(myData,sCode == "CA"))
To get the number of observations, counting the rows of the filtered data set is the more valid approach:
nrow(dat[dat$sCode == "CA",])
grep command can be used
CA = mydata[grep("CA", mydata$sCode), ]
nrow(CA)
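Keep in mind that grep matches a pattern anywhere in the string, so it can count more than exact "CA" values. A quick sketch with an invented vector of codes:

```r
# Hypothetical codes, including one that merely contains "CA"
codes <- c("CA", "CAN", "AC", "ca")

hits_substring <- grep("CA", codes)     # 1 2: "CAN" matches as well
hits_anchored  <- grep("^CA$", codes)   # 1: anchors force a whole-string match
hits_equal     <- which(codes == "CA")  # 1: plain equality, no regular expression
```

For an exact comparison like the SQL where clause, == (or an anchored pattern) is probably the safer choice.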
Call nrow passing as argument the name of the dataset:
nrow(dataset)
I'm using this short function to make it easier using dplyr:
countc <- function(.data, ..., preserve = FALSE){
return(nrow(filter(.data, ..., .preserve = preserve)))
}
With this you can just use it like filter. For example:
countc(data, active == TRUE)
[1] 42
I am using the ifelse function to obtain either a vector of NA, if all the values of the vector are NA, or a vector with all the values not equal to "NA_NA". In my example, I would like to obtain this result
[1] "14_mter" "78_ONHY"
but I am obtaining this
[1] "14_mter"
my example:
vect=c("NA_NA", "14_mter", "78_ONHY")
out=ifelse(all(is.na(vect)), vect, vect[which(vect!="NA_NA")])
What is wrong in this function ?
ifelse is vectorized and its result is as long as the test argument. all(is.na(vect)) always has length one, hence the result. A regular if/else clause is fine here.
vect <- c("NA_NA", "14_mter", "78_ONHY")
if (all(is.na(vect))) {
out <- vect
} else {
out <- vect[vect != "NA_NA"]
}
out
#> [1] "14_mter" "78_ONHY"
additional note: no need for the which() here
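To see the length rule in isolation, here is a small sketch using the vector from the question; the second call uses made-up values purely for contrast:

```r
vect <- c("NA_NA", "14_mter", "78_ONHY")

test <- all(is.na(vect))   # FALSE, a single value
len_test <- length(test)   # 1

# One-element test, so only the first element of the chosen branch comes back
short <- ifelse(test, vect, vect[vect != "NA_NA"])      # "14_mter"

# Two-element test, so two elements are returned
long <- ifelse(c(FALSE, FALSE), "a", c("x", "y"))       # "x" "y"
```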
The ifelse help file, referring to its three arguments test, yes and no, says:
ifelse returns a value with the same shape as test which is filled
with elements selected from either yes or no depending on whether the
element of test is TRUE or FALSE.
so if the test has a length of 1, which is the case for the code in the question, then the result will also have length 1. Instead try one of these.
1) Use if instead of ifelse. if returns the value of the chosen leg so just assign that to out.
out <- if (all(is.na(vect))) vect else vect[which(vect != "NA_NA")]
2) The collapse package has an allNA function so a variation on (1) is:
library(collapse)
out <- if (allNA(vect)) vect else vect[which(vect != "NA_NA")]
3) Although not recommended if you really wanted to use ifelse it could be done by wrapping each leg in list(...) so that the condition and two legs all have the same length, i.e. 1.
out <- ifelse(all(is.na(vect)), list(vect), list(vect[which(vect != "NA_NA")])) |>
unlist()
If the NA value is always the string NA_NA, this works:
grep("NA_NA", vect, value = TRUE, invert = TRUE)
[1] "14_mter" "78_ONHY"
While the pattern matches the NA_NA value, the invert = TRUE argument negates the match(es) and produces the unmatched values
Data:
vect=c("NA_NA", "14_mter", "78_ONHY")
How can I identify a column in an R data frame using a variable? In the following code, I used paste0 to build the column names from variables. Is there any alternative?
if ((leadsnp4[[paste0('Z_in_',trait1)]] > 0) & (leadsnp4[[paste0('Z_in_',trait2)]] > 0))
{leadsnp4$ConcordEffect='Yes'} else if ((leadsnp4[[paste0('Z_in_',trait1)]] < 0) & (leadsnp4[[paste0('Z_in_',trait2)]] < 0))
{leadsnp4$ConcordEffect='Yes'} else if ((leadsnp4[[paste0('Z_in_',trait1)]] > 0) & (leadsnp4[[paste0('Z_in_',trait2)]] < 0))
{leadsnp4$ConcordEffect='No'} else if ((leadsnp4[[paste0('Z_in_',trait1)]] < 0) & (leadsnp4[[paste0('Z_in_',trait2)]] > 0))
{leadsnp4$ConcordEffect='No'}
leadsnp4 is a dataframe. trait1 and trait2 are user defined variables. The above code is giving me warning : The condition has length > 1 and only the first element will be used. Also not getting the expected output.
Not sure what is wrong here. Maybe there are other alternatives for the above if else statements. Any help?
The way you're selecting columns is fine. Using df[[col_name]] (list context) is the same as df[, col_name] -- each returns a vector copy of column col_name. You can save the column name as a variable instead of using paste0 directly in the selection.
The reason you're getting the warning is that if is not vectorized and you're giving it a vector with length > 1. In this case, if uses only the first value in the vector, but warns that it's doing so (since R 4.2.0 this is an error rather than a warning). ifelse is the vectorized version in base R (there's also dplyr::if_else). If I understand your code, the below should be close to what you're looking for.
t1 <- paste0('Z_in_', trait1)
t2 <- paste0('Z_in_', trait2)
# a single boolean vector indicating if trait1 and trait2 are
# both positive or both negative
same_sign <- ((leadsnp4[, t1] > 0) & (leadsnp4[, t2] > 0)) |
((leadsnp4[, t1] < 0) & (leadsnp4[, t2] < 0))
leadsnp4$ConcordEffect <- ifelse(same_sign, "Yes", "No")
Note that rows where the trait1 and/or trait2 column is equal to 0 make same_sign FALSE, so they will be assigned "No". You'll need to modify the logic if this is not the desired behavior.
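If zeros should not count as discordant, one possible variation is to compare the product of the signs and leave zero rows as NA. This is only a sketch: the data frame contents and the trait names "A" and "B" are invented for illustration.

```r
# Hypothetical data: two Z-score columns, one row with a zero
leadsnp4 <- data.frame(Z_in_A = c(1.2, -0.5, 0, 2.0),
                       Z_in_B = c(0.3, -1.1, 0.7, -0.4))
trait1 <- "A"
trait2 <- "B"
t1 <- paste0("Z_in_", trait1)
t2 <- paste0("Z_in_", trait2)

# sign() gives -1, 0 or 1; the product is 1 for same sign, -1 for opposite, 0 if either is 0
prod_sign <- sign(leadsnp4[[t1]]) * sign(leadsnp4[[t2]])

leadsnp4$ConcordEffect <- ifelse(prod_sign > 0, "Yes",
                                 ifelse(prod_sign < 0, "No", NA_character_))
```

Here the third row ends up NA instead of "No"; whether that is the right treatment depends on what a zero effect means in your data.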
Here is an explanation for why pasting will not work for creating a column reference and one suggestion for what you can do instead: Dynamically select data frame columns using $ and a character value
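A short illustration of the difference, using a made-up one-column data frame:

```r
# Hypothetical data frame and a column name stored in a variable
df <- data.frame(Z_in_A = 1:3)
col <- paste0("Z_in_", "A")

by_dollar   <- df$col     # NULL: `$` looks for a column literally named "col"
by_brackets <- df[[col]]  # 1 2 3: `[[` evaluates the variable first
by_comma    <- df[, col]  # 1 2 3
```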
Edit: The suggested code worked, so in total:
MyData$Numbers_2=with(MyData, ifelse(Numbers %in% MyVector,
paste0("L_N", Numbers),
paste0("L_", Numbers)))
This adds a column with the new (character) values.
I'd like to add a specific string of characters to every value in a column of a dataframe, based on the condition that the value is found in a vector.
#Create Dataframe
MyData=data.frame(Numbers=c(85,84,7,9,82,5,81,80,10))
MyVector=c("80","81","82","83","84","85")
# for Loop with if else condition
for (Nr in MyVector) {MyData$Numbers_2=as.character(MyData$Numbers)
if(is.element(Nr,MyData$Numbers_2)) {MyData$Numbers_2= paste("L_N",MyData$Numbers_2, sep = "")}
else {MyData$Numbers_2= paste("L_",MyData$Numbers_2, sep = "")}
}
Basically, I'd like to have a result like this:
L_N85, L_N84, L_7, L_9 etc.
But what I get is: L_N85, L_N84, L_N7, L_N9.
It adds L_N to every value in the column, and not only to specific ones. How do I have to change my code for that to happen?
Instead of a for loop or if/else we can do this with vectorized ifelse
with(MyData, ifelse(Numbers %in% MyVector, paste0("L_N", Numbers), paste0("L_", Numbers)))
#[1] "L_N85" "L_N84" "L_7" "L_9" "L_N82" "L_5" "L_N81" "L_N80" "L_10"
Or with case_when from dplyr
library(dplyr)
MyData %>%
mutate(Numbers_2 = case_when(Numbers %in% MyVector ~ paste0("L_N", Numbers),
TRUE ~ paste0("L_", Numbers)))
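For context on why the original loop fails: is.element(Nr, MyData$Numbers_2) is a single TRUE or FALSE for the whole column, so each pass prefixes every value the same way, and since the column is reset at the top of the loop only the last pass ("85") survives. A base R sketch of the vectorized idea, choosing only the prefix per row (in_vec is a helper name introduced here):

```r
MyData <- data.frame(Numbers = c(85, 84, 7, 9, 82, 5, 81, 80, 10))
MyVector <- c("80", "81", "82", "83", "84", "85")

# One TRUE/FALSE per row; %in% compares the numbers and strings as character
in_vec <- MyData$Numbers %in% MyVector

# Pick the prefix per row, then paste it in front of the number
MyData$Numbers_2 <- paste0(ifelse(in_vec, "L_N", "L_"), MyData$Numbers)
```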
I'm currently researching a matching-to-sample task in monkeys. I want to evaluate how often a certain stimulus was chosen, regardless of correctness of the choice.
To do so, I have a dataframe df with 6288 rows and 6 columns ("Monkey", "Session", "Sample", "Match", "Foil", "Success"), of which only the last three are important now.
The data in df$Match and df$Foil are the names of the stimuli (string) and df$Success is binary. df$Match and df$Foil are made up of 65 distinct stimuli names, which I included in a vector Match.Foil.
Now I want to count how often a picture (part of the vector Match.Foil) is clicked in all 6288 trials. That is, every time the name is either part of df$Match & df$Success == "1" OR part of df$Foil & df$Success == "0".
I tried to build a vector with the number of times clicked for each part of Match.Foil like this:
Pic.clicked= vector(mode="numeric", length= length(Match.Foil))
for (i in 1:length(Match.Foil)){
Pic.clicked[i] = ifelse(
(df$Match == Match.Foil[i] & df$Success == "1")|
(df$Foil== Match.Foil[i] & df$Success == "0"),
Pic.clicked[i] +1,
Pic.clicked[i] +0)
}
So, as you see I wanted to use the functions Pic.clicked + 1 and Pic.clicked + 0 as the returns if the statement is TRUE or FALSE. It does not work and gives me the warning:
In Pic.clicked[i] = ifelse((df$Match == Match.Foil[i] & ... :
number of items to replace is not a multiple of replacement length
Does anybody have an idea, how to build an appropriate counter? I thought about using switch, but I don't have any experience with that function and it seems not to work like I need it. I also tried running it for 6288 loops, but that produces the same warning.
You can use sum(), which on a logical vector counts each TRUE as 1:
for (i in seq_along(Match.Foil)) {
Pic.clicked[i] = sum((df$Match == Match.Foil[i] & df$Success == "1") |
(df$Foil == Match.Foil[i] & df$Success == "0"))
}
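A loop-free variation, in case it is useful: first work out which picture was clicked in each trial, then tabulate. This is a sketch on a tiny invented data set; it assumes Match and Foil are character columns (not factors), and clicked is a helper name introduced here.

```r
# Hypothetical stand-in for the real 6288-row data frame
df <- data.frame(Match   = c("a", "b", "a", "c"),
                 Foil    = c("b", "c", "c", "a"),
                 Success = c("1", "0", "1", "1"),
                 stringsAsFactors = FALSE)
Match.Foil <- c("a", "b", "c")

# The clicked picture is the match on success and the foil otherwise
clicked <- ifelse(df$Success == "1", df$Match, df$Foil)

# factor() with explicit levels keeps pictures that were never clicked (count 0)
Pic.clicked <- table(factor(clicked, levels = Match.Foil))
```

The result is a named table, so Pic.clicked["a"] looks up a single picture.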
I have several dataframes in which I want to delete each row that matches a certain string. I used the following code to do it:
df[!(regexpr("abc", df$V4) ==1),]
How can I delete the row that is following, e.g. if I delete row n as specified by the code above, how can I additionally delete row n+1?
My first try was to simply find out the indices of the desired rows, but that won't work, as I need to delete rows in different dataframes which are of different lengths. So the indices vary.
Thanks!
I would suggest taking out and manipulating the logical vector directly. Suppose we have the vector:
x = c(5,0,1, 4, 3)
and we want to do:
x[x > 3]
First, note that:
R> (s_n = x>3)
[1] TRUE FALSE FALSE TRUE FALSE
So
R> (s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)])))
[1] TRUE TRUE FALSE TRUE TRUE
Hence,
x[s_n1]
gives you what you want.
In your particular example, something like:
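s_n = regexpr("abc", df$V4) == 1
s_n1 = as.logical(s_n + c(F, s_n[1:(length(s_n)-1)]))
df[!s_n1, ]
should work: s_n marks the matching rows, s_n1 additionally marks the row after each match, and !s_n1 keeps everything else.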
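A small end-to-end check of the shifted logical vector idea on a made-up data frame (hit and after_hit are names introduced for this sketch):

```r
# Hypothetical data: rows 1 and 4 start with "abc"
df <- data.frame(V4 = c("abc1", "after1", "x", "abc2", "after2", "y"),
                 stringsAsFactors = FALSE)

hit <- regexpr("abc", df$V4) == 1        # TRUE for rows that start with "abc"
after_hit <- c(FALSE, head(hit, -1))     # the same vector shifted down by one row

# Drop each matching row together with the row that follows it
result <- df[!(hit | after_hit), , drop = FALSE]
```

Because everything is computed from the data frame's own column, the same lines work for data frames of different lengths.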
Use which() on your logical expression and then you can just add 1 to the result.
sel <- which(regexpr("abc", df$V4) == 1)
sel <- c(sel, sel+1)
df[-sel,]
df[-(rep(which(regexpr("abc", df$V4) == 1), each = 2) + c(0, 1)), ]
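One thing to watch with negative indexing: if nothing matches, the index vector is empty, and indexing with an empty vector returns zero rows rather than all of them. A sketch of the pitfall and one possible guard, on an invented data frame with no matches:

```r
# Hypothetical data frame in which no row starts with "abc"
df <- data.frame(V4 = c("x", "y", "z"), stringsAsFactors = FALSE)

sel <- which(regexpr("abc", df$V4) == 1)    # integer(0)
sel <- c(sel, sel + 1)

n_negative <- nrow(df[-sel, , drop = FALSE])  # 0: -integer(0) selects nothing

# Guard: build the rows to keep explicitly instead of negating the rows to drop
keep <- setdiff(seq_len(nrow(df)), sel)
n_keep <- nrow(df[keep, , drop = FALSE])      # 3
```

setdiff() also quietly ignores an index one past the last row, which can occur when the final row matches.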