I would like to count all rows where two criteria are matched. My approach was:
a <- c(2,1,2,2,3,4,2,1,9)
b <- c(2,1,2,1,4,4,5,6,7)
c <- data.frame(a,b)
sum((c['a']==2) && (c['b']==2))
but for some reason this gives 1 instead of 2. How can I count rows where multiple criteria are matched?
I think you are using the wrong ampersand operator. Try this:
sum(c['a']==2 & c['b']==2)
[1] 2
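For completeness: && is the scalar operator. Older versions of R silently compared only the first elements of the two logical vectors (TRUE && TRUE gave a single TRUE, hence the 1), and recent versions raise an error when either side has length greater than one. A minimal sketch of the difference, reusing the question's data but under the name dat (my own choice, since c shadows the base function):
a <- c(2, 1, 2, 2, 3, 4, 2, 1, 9)
b <- c(2, 1, 2, 1, 4, 4, 5, 6, 7)
dat <- data.frame(a, b)
# `&` is vectorised: one TRUE/FALSE per row, which sum() then counts
sum(dat$a == 2 & dat$b == 2)
# [1] 2
# `&&` is scalar: it looks only at the first elements (or errors in recent R versions),
# so it can never count rows.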
If you might have NAs in column a or b, you could also try:
length(intersect(which(c['a']==2), which(c['b']==2)))
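A small sketch of why that helps (a_na and d are hypothetical names for a variant of the question's data with an NA in the first row):
a_na <- c(NA, 1, 2, 2, 3, 4, 2, 1, 9)
d <- data.frame(a = a_na, b = b)
sum(d$a == 2 & d$b == 2)                             # NA: the NA row propagates through `&` and sum()
sum(d$a == 2 & d$b == 2, na.rm = TRUE)               # [1] 1
length(intersect(which(d$a == 2), which(d$b == 2)))  # [1] 1, since which() simply drops the NAs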
You can also subset the data.frame and then count the rows of the result.
nrow(c[a==2 & b==2, ])
# [1] 2
P.S.: It is advisable not to use c as a variable name, since it is also a base R function.
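A minimal sketch of the same count using subset() and a name that doesn't clash with base::c (dat is my own choice):
dat <- data.frame(a, b)
nrow(subset(dat, a == 2 & b == 2))
# [1] 2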
My data frame df looks as follows:
Variable A   Variable B   Variable C
9            2            1
2            0            don't know
maybe        1            1
?            0            3
I need to remove all rows where non-numerical values are used. Afterwards it should look like this:
Variable A   Variable B   Variable C
9            2            1
I thought about something like
df[! grepl(*!= numerical*, df),]
or
df[! df %in% *!= numerical*, ]
but I can't find anything I could use to express "take all rows that don't match numerical values". Could you please help me?
Thanks a lot!
One option would be to loop through the columns with lapply, convert each to numeric so that all non-numeric elements become NA, check for NA with is.na, negate it (!), combine the per-column logical vectors element-wise with Reduce and &, and use the result to subset the rows.
df[Reduce(`&`, lapply(df, function(x) !is.na(as.numeric(x)))),]
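A small sketch on the question's sample data (the data frame construction below is my own; suppressWarnings() just silences the "NAs introduced by coercion" warning, and as.character() guards against factor columns in older R versions):
df <- data.frame(
  VariableA = c("9", "2", "maybe", "?"),
  VariableB = c("2", "0", "1", "0"),
  VariableC = c("1", "don't know", "1", "3"),
  stringsAsFactors = FALSE
)
keep <- Reduce(`&`, lapply(df, function(x)
  !is.na(suppressWarnings(as.numeric(as.character(x))))))
df[keep, ]
#   VariableA VariableB VariableC
# 1         9         2         1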
This might not be the best way to do it, but it works.
Here s is the data frame that contains your data:
library(magrittr)  # provides the %>% pipe used below
contains <- lapply(seq_len(nrow(s)), function(i){
  yes <- grep("[^0-9.]", s[i, ])  # regex: does row i contain any non-digit (and non-dot) character?
  ifelse(identical(yes, integer(0)), FALSE, TRUE)
}) %>% unlist
s <- s[which(!contains), ]
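Applied to the question's sample data (e.g. s <- df, using the df built in the sketch above), this likewise keeps only the first row.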
Thanks!
I am having trouble subsetting my data. I want to subset the data on column x, keeping the rows whose first 3 characters are G45.
My data frame:
x <- c("G448", "G459", "G479", "G406")
y <- c(1:4)
My.Data <- data.frame(x, y)
I have tried:
subset (My.Data, x=="G45*")
But I am unsure how to use wildcards. I have also tried grep() to find the indices:
grep ("G45*", My.Data$x)
but it returns all 4 rows, rather than just those beginning with G45, probably again because I am unsure how to use wildcards.
It's pretty straightforward using [ to extract:
grep will give you the positions at which your search pattern matched (unless you use value = TRUE, in which case it returns the matching values themselves).
grep("^G45", My.Data$x)
# [1] 2
Since you're searching within the values of a single column, that actually corresponds to the row index. So, use that with [ (where you would use My.Data[rows, cols] to get specific rows and columns).
My.Data[grep("^G45", My.Data$x), ]
# x y
# 2 G459 2
The help-page for subset shows how you can use grep and grepl with subset if you prefer using this function over [. Here's an example.
subset(My.Data, grepl("^G45", My.Data$x))
# x y
# 2 G459 2
As of R 3.3, there's now also the startsWith function, which you can again use with subset (or with any of the other approaches above). According to the help page for the function, it's considerably faster than using substring or grepl.
subset(My.Data, startsWith(as.character(x), "G45"))
# x y
# 2 G459 2
You may also use the stringr package, together with dplyr for filter and the pipe:
library(dplyr)
library(stringr)
My.Data %>% filter(str_detect(x, '^G45'))
For this particular data you do not strictly need the '^' (starts with) anchor to obtain the results you need, but it makes the "begins with G45" intent explicit.
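A quick sketch of what the '^' anchor changes (the vector extra and the value "AG459" are hypothetical, added only for illustration):
library(stringr)
extra <- c("G448", "G459", "G479", "G406", "AG459")  # "AG459" contains G45 but does not start with it
str_detect(extra, "G45")   # [1] FALSE  TRUE FALSE FALSE  TRUE
str_detect(extra, "^G45")  # [1] FALSE  TRUE FALSE FALSE FALSE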
I have this example data
a<-c(1,5,7,8,10,15)
b<-c(2,6,7,9,10,20,31)
I need to find the duplicated values (the values which are in both vectors) and create a new vector which includes these numbers. It should look like
c<-c(7,10)
Because the vectors have different lengths, I tried putting them into a list of vectors
l<-list(a=a,b=b)
and tried
duplicated(l)
or
duplicated(a,b)
but it gives nonsense output. I'm looking for the correct solution but still cannot find it. Any advice?
Looks like a job for intersect()
a<-c(1,5,7,8,10,15)
b<-c(2,6,7,9,10,20,31)
c<-intersect(a,b)
c
[1] 7 10
c(a, b)[duplicated(c(a, b))]
produces:
[1] 7 10
duplicated applied to a vector returns a logical vector of the same length, with TRUE for every value that has already appeared earlier in the vector. You can use that to subset the original vector.
Note that if values may be repeated within a single vector (and you don't want such internal duplicates reported as common to both), you should first drop them with unique:
a.b <- c(unique(a), unique(b))
a.b[duplicated(a.b)]
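A small sketch of why the unique() step matters (the vectors x and y below are hypothetical, not the question's data):
x <- c(7, 7, 1)  # 7 is repeated within x ...
y <- c(2, 3)     # ... but never appears in y
c(x, y)[duplicated(c(x, y))]  # [1] 7       -- the internal duplicate is wrongly reported as common
xy <- c(unique(x), unique(y))
xy[duplicated(xy)]            # numeric(0)  -- correctly, no common values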
Keeping within the scope of the original question, you could use match:
> a[!is.na(match(a, b))]
# [1] 7 10
match(a, b) gives, for each element of a, its position in b (or NA if there is none), so !is.na() flags the elements of a that also occur in b. Or, more simply, %in%:
> a[a %in% b]
# [1] 7 10
Solved by creating a function like
library(parallel)  # for mclapply
duplicated_values <- function(x){
  if(x %in% b){
    return(x)
  }
}
# apply over the values of a (not over 1:length(a)) and drop the NULLs with unlist()
values <- unlist(mclapply(a, duplicated_values))
Suppose I have a vector data <- c(1,2,2,1) and a reference table, say: ref <- cbind(c(1,1,2,2,2,2,4,4), c(1,2,3,4,5,6,7,8))
I would like my code to return the following vector: result <- c(1,2,3,4,5,6,3,4,5,6,1,2). It's like using the R function match(), but match() only returns the first matching position in the reference vector. The same goes for %in%.
I have tried functions like merge() and join(), but I would like something using only a combination of the rep() and seq() R functions.
You can try
ref[ref[, 1] %in% data, 2]
to return the second-column values for every row whose first-column value appears in data. Note, however, that this returns each matching row only once and in the order of ref; to repeat the lookup for every element of data, wrap it in lapply:
unlist(lapply(data, function(x) ref[ref[, 1] == x, 2]))
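which, for the sample data and ref above, returns the requested vector:
# [1] 1 2 3 4 5 6 3 4 5 6 1 2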
You can get the indices you are looking for like this:
indices <- sapply(data, function(xx) which(ref[, 1] == xx))
Of course, that is a list, since the number of hits will be different for each entry of data. So you just unlist() this:
ref[unlist(indices),2]
[1] 1 2 3 4 5 6 3 4 5 6 1 2