Finding matching consecutive rows in R

Is there a one-line solution possible for this example?
df = data.frame('First' = c('T','T','V','V','A','E'),'Last' = c(rep('Ng',3),'Smith','Wolf','Wolf'))
matches = (df$First[-1] == df$First[-nrow(df)])
which(matches)
# [1] 1 3
I want the indices, but would rather not use a temporary variable.

Perhaps you could use the rleid function from data.table in combination with diff, like this:
library(data.table)
which(diff(rleid(df$First)) == 0)
# [1] 1 3
You could argue that it is the 2nd and 4th elements of df$First that match the previous value (rather than the 1st and 3rd). In that case, which(c(FALSE, diff(rleid(df$First)) == 0)) might be more appropriate, which yields: [1] 2 4
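A base R alternative that avoids both the temporary variable and the data.table dependency is to compare the vector without its last element against the vector without its first element. This is only a sketch on the example data from the question; the name first_of_pair is made up for illustration.

```r
# Same example data as in the question.
df <- data.frame(First = c("T", "T", "V", "V", "A", "E"),
                 Last = c(rep("Ng", 3), "Smith", "Wolf", "Wolf"),
                 stringsAsFactors = FALSE)

# Compare each element with its successor: drop the last element on one side
# and the first on the other, so both vectors have length nrow(df) - 1.
first_of_pair <- which(head(df$First, -1) == tail(df$First, -1))
first_of_pair      # 1 3

# Shift by one to index the second row of each matching pair instead.
first_of_pair + 1  # 2 4
```

Because both sides have the same length, no recycling takes place in the comparison.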


Extracting single row from data.frame without loss of names [duplicate]

I am simply extracting a single row from a data.frame. Consider for example
d=data.frame(a=1:3,b=1:3)
d[1,] # returns a data.frame
# a b
# 1 1 1
The output matched my expectation. However, the result was not what I expected when dealing with a data.frame that contains a single column.
d=data.frame(a=1:3)
d[1,] # returns an integer
# [1] 1
Indeed, here the extracted data is no longer a data.frame but an integer! To me, it seems a little strange that the same function on the same data type returns different data types. One issue with this conversion is the loss of the column name.
To solve the issue, I did
extractRow = function(d,index)
{
    if (ncol(d) > 1)
    {
        return(d[index,])
    } else
    {
        d2 = as.data.frame(d[index,])
        names(d2) = names(d)
        return(d2)
    }
}
d=data.frame(a=1:3,b=1:3)
extractRow(d,1)
# a b
# 1 1 1
d=data.frame(a=1:3)
extractRow(d,1)
# a
# 1 1
But it seems unnecessarily cumbersome. Is there a better solution?
Just subset with the drop = FALSE option:
extractRow = function(d, index) {
    return(d[index, , drop=FALSE])
}
R tries to simplify data.frame subsets by default. The same thing happens with columns:
d[, "a"]
# [1] 1 2 3
Alternatives are:
d[1, , drop = FALSE]
tibble::tibble which has drop = FALSE by default
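As a quick check of the drop = FALSE behaviour described above, the following sketch compares the default and non-dropping forms on the single-column data frame from the question. The variable names (row_default, row_kept, col_kept) are invented for this example.

```r
d <- data.frame(a = 1:3)

# Default subsetting simplifies a single-column row to a bare vector.
row_default <- d[1, ]
class(row_default)   # "integer"

# drop = FALSE keeps the data.frame structure and the column name.
row_kept <- d[1, , drop = FALSE]
class(row_kept)      # "data.frame"
names(row_kept)      # "a"

# The same argument works when selecting a single column.
col_kept <- d[, "a", drop = FALSE]
class(col_kept)      # "data.frame"
```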
I can't tell you why that happens - it seems weird. One workaround would be to use slice from dplyr (although using a library seems unnecessary for such a simple task).
library(dplyr)
slice(d, 1)
a
1 1
Data frames will simplify to vectors or scalars with base subsetting ([ , ]).
If you want to avoid that, you can use tibbles instead:
tibble(a=1:2)[1,]
# A tibble: 1 x 1
a
<int>
1 1
tibble(a=1:2)[1,] %>% class
[1] "tbl_df" "tbl" "data.frame"

Is there a way to determine how many rows in a dataset have the same categorical variable for multiple conditions (columns)?

For example, I have the dataset below, where 1 = yes and 0 = no, and I need to figure out how many calls were made by landline that lasted under 10 minutes.
Image of example dataset
You can also explicitly specify the values you're looking for in each column when computing the sum. (This will help if you need to count rows with values other than 1 in a column.)
sum(df$landline == 1 & df$`under 10 minutes` == 1)
We can use sum
sum(df1[, "under 10 minutes"])
If two columns are needed
colSums(df1[, c("landline", "under 10 minutes")])
If we are checking both columns, use rowSums
sum(rowSums(df1[, c("landline", "under 10 minutes")], na.rm = TRUE) == 2)
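The grep function finds the rows where landline = 1. We then select only those rows and sum the under 10 minutes column (the code assumes these are columns 1 and 4 of df). Note that grep(1, ...) matches any value containing the digit 1, so this relies on the column holding only 0 and 1.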
sum( df[ grep(1,df[,1]) ,4] )
R will conveniently treat 1 and 0 as if they mean TRUE and FALSE, so we can apply logical Boolean operations like AND (&) and OR (|) on them.
df <- data.frame(x = c(1, 0, 1, 0),
                 y = c(0, 0, 1, 1))
sum(df$x & df$y)
# [1] 1
sum(df$x | df$y)
# [1] 3
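Since the original data was only posted as an image, here is a small sketch on made-up data showing that the explicit comparison and the rowSums approach give the same count. The column names landline and under 10 minutes are taken from the answers above; the values are invented.

```r
# Hypothetical data: 1 = yes, 0 = no. check.names = FALSE keeps the
# column name with spaces exactly as written.
df <- data.frame(landline = c(1, 0, 1, 1, 0),
                 `under 10 minutes` = c(1, 1, 0, 1, 0),
                 check.names = FALSE)

# Rows where both conditions hold, written as an explicit comparison.
n_both <- sum(df$landline == 1 & df$`under 10 minutes` == 1)

# The same count via rowSums: both flags set means the row sums to 2.
n_both_rowsums <- sum(rowSums(df[, c("landline", "under 10 minutes")],
                              na.rm = TRUE) == 2)

n_both          # 2
n_both_rowsums  # 2
```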
For future questions, you should look up how to use functions like dput or other ways to give an example data set instead of using an image.

Count occurrences in a cell, with a condition - R studio

I have a string like this one:
0|294|314|20|314|SC49TST57ASG75A|1428.0
Using R, I want to extract only the data between two | characters (for example, SC49TST57ASG75A) and then count only the numbers that are greater than 20. In this case the numbers are 49, 57 and 75, so the code needs to return 3.
I want to apply this to a column in a data frame.
Eventually, I want a new column that specifies, for each row, how many numbers greater than 20 there are inside the |....|.
Thanks!
You can try strsplit with split = '\\|'. If you only want to count between two pipes, you should also exclude the first and last elements. Since you want elements greater than 20, the solution uses the > sign.
I am assuming here that your columns have the same structure as given in your question.
st <- '0|294|314|20|314|SC5GSC12ASG266T|1428.0'
Solution:
lapply(strsplit(st, '\\|'), function(x)sum(as.numeric(x[2:(length(x)-1)]) > 20, na.rm=TRUE))
I am not sure if this is what you are looking for, otherwise please tell me what is your expected result.
cnt <- Map(function(x) sum(as.numeric(x) > 20),
           regmatches(r <- unlist(regmatches(s, gregexpr("(?<=\\|).*?(?=\\|)", s, perl = TRUE))),
                      gregexpr("\\d+\\.?\\d+?", r)))
such that
> cnt
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 1
[[5]]
[1] 1
DATA
s <- "0|294|314|20|314|SC5GSC12ASG266T|1428.0"
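Both answers above count per pipe-delimited field. If the goal is instead, as the question describes, to count the numbers greater than 20 inside the one alphanumeric token, something like the following sketch might work. It assumes the token is always the sixth pipe-delimited field; the function name count_over_20 and the column names raw and n_over_20 are hypothetical.

```r
# Count the numbers above a threshold inside one pipe-delimited field.
# Assumption: the token of interest is always field number `field`.
count_over_20 <- function(strings, field = 6, threshold = 20) {
  # Split on a literal pipe and keep the requested field of each string.
  fields <- vapply(strsplit(strings, "|", fixed = TRUE),
                   function(parts) parts[field], character(1))
  # Pull out every run of digits from that field.
  numbers <- regmatches(fields, gregexpr("[0-9]+", fields))
  # Count how many of those numbers exceed the threshold.
  vapply(numbers, function(n) sum(as.numeric(n) > threshold), integer(1))
}

df <- data.frame(raw = c("0|294|314|20|314|SC49TST57ASG75A|1428.0",
                         "0|294|314|20|314|SC5GSC12ASG266T|1428.0"),
                 stringsAsFactors = FALSE)
df$n_over_20 <- count_over_20(df$raw)
df$n_over_20  # 3 1
```

The first string is the one from the question (49, 57 and 75 exceed 20); the second is the one used in the answers (only 266 exceeds 20).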

Get the number of matching characters between two strings in R

I have a data frame where I need to compare two columns and find the number of matching characters between two elements.
For example, x and y are two elements to be compared, which look like this:
x<- "1/2"
y<-"2/3"
I split them by '/' and unlisted the result, as below:
unlist(strsplit(x,"/"))->a
unlist(strsplit(y,"/"))->b
Then I used pmatch:
pmatch(a,b,nomatch =0)
[1] 0 1
I used sum() to find how many characters match:
sum(pmatch(a,b,nomatch =0))
[1] 1
However, when the comparison is done the other way:
pmatch(b,a,nomatch = 0)
[1] 2 0
Since there is only one match between the two strings, why does it show 2? It may be an index. I need the number of characters the strings have in common, regardless of whether a is compared with b or b with a.
Could someone help me get this?
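Per ?pmatch, pmatch seeks matches for the elements of its first argument among those of its second, and returns the positions of those matches in the second argument (not a count).
For example, "2" in the first vector matches the second element of the second vector, so the result contains a 2:
pmatch(c("2", "1"), c("3", "2"), nomatch = 0)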
# [1] 2 0
One way to find the number of matched elements is to count the non-zero entries:
sum(pmatch(c("2", "1"),c("3","2"),nomatch =0) != 0)
# [1] 1
Both
sum(pmatch(b, a, nomatch = 0) != 0) # 1
sum(pmatch(a, b, nomatch = 0) != 0) # 1
return the same value.
Another option could be
sum(b %in% a)
[1] 1
sum(a %in% b)
[1] 1
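To make the symmetry explicit, here is a short sketch on the x and y values from the question. The names n_ab, n_ba and n_common are made up for illustration; length(intersect(a, b)) is shown as one more order-independent option, with the caveat that intersect drops duplicates.

```r
x <- "1/2"
y <- "2/3"
a <- unlist(strsplit(x, "/"))
b <- unlist(strsplit(y, "/"))

# pmatch returns positions in its second argument, so summing it directly
# adds up indices rather than counting matches.
pmatch(b, a, nomatch = 0)       # 2 0

# Counting membership gives the same answer in either direction here.
n_ab <- sum(a %in% b)
n_ba <- sum(b %in% a)

# Number of distinct shared elements, independent of argument order.
n_common <- length(intersect(a, b))
```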

Operations on elements of column vectors

I have a column vector containing 1's. I also have another numeric column containing numbers.
Example:
day_eq day
1 1
1 5
1 3
1 2
I now want to say:
If an element of day is smaller than its corresponding element in day_eq,
set invalid (an element of another column vector) to 5.
This is my code:
for (i in 1:nrow(setin)){
    if (setin[[i,"day"]]<setin[[i,"day_eq"]]){
        setin[[i,"valid"]] = 0
        setin[[i,"invalid_code"]] = 5
    }
}
It isn't working. It keeps saying:
Error in if (setin[[i, "day"]] < setin[[i, "day_eq"]]) { :
missing value where TRUE/FALSE needed
or
In if (test.ID1$day_eq > test.ID1$day) { :
the condition has length > 1 and only the first element will be used
where test.ID1 is the name of the data set.
You don't need a loop for that. I'm not sure exactly what you are doing... but ifelse should be able to help you...
setin$valid <- ifelse(setin$day < setin$day_eq, 0, NA)
setin$invalid_code <- ifelse(setin$day < setin$day_eq, 5, NA)
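The "missing value where TRUE/FALSE needed" error is what if() raises when its condition evaluates to NA, which suggests that day or day_eq contains missing values in the real data (an assumption, since the full data set is not shown). A vectorized sketch that tolerates NA, using made-up values and the hypothetical helper name is_early, could look like this:

```r
# Made-up data with one missing value to reproduce the NA situation.
setin <- data.frame(day_eq = c(1, 1, 1, 1),
                    day    = c(0, 5, NA, 2))

# Start with empty result columns.
setin$valid <- NA
setin$invalid_code <- NA

# TRUE only where both values are present and day is smaller than day_eq;
# rows with a missing value become FALSE instead of NA.
is_early <- !is.na(setin$day) & !is.na(setin$day_eq) &
  setin$day < setin$day_eq

# Logical indexing updates just the flagged rows, with no loop.
setin$valid[is_early] <- 0
setin$invalid_code[is_early] <- 5
```

Only the first row is flagged here; the row with the missing day is left untouched rather than causing an error.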
Your data is
day_eq <- c(1, 1, 1, 1)
day <- c(1, 5, 3, 2)
setin <- data.frame(day_eq, day)
The solution using dplyr is
library(dplyr)
setin %>% mutate(invalid = ifelse(day < day_eq, 5, 0))
I used setin as the data set name; since you also use test.ID1, just replace the name as needed.
