grepl function - Combining a start and end string - r

For the last 2 hours I've been trying to figure this out, but I can't.
I have a variable that starts with 4 letters and ends with 2 digits.
Now I want to subset only the values that start with KJHB and end with a number between 20 and 33.
The function I'm trying is:
df <- mydata
x <- seq(20,33)
df2 <- subset(df, grepl('^KJHB & x$', col1))
Any idea?

Alright, I came up with a not entirely correct answer, but it's working for me.
x <- paste("KJHB",seq(20,33), sep = "")
x <- as.data.frame(table(x))
df2 <- subset(df, col1 %in% x$x)
Not the most correct way, but it did the job, and the code is simple enough that a novice like me can understand it. xD
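For reference, the same subset can also be done with a single regular expression; here is a sketch that builds one alternation pattern from the allowed numbers:
# Pattern becomes ^KJHB(20|21|...|33)$
pattern <- paste0("^KJHB(", paste(20:33, collapse = "|"), ")$")
df2 <- subset(df, grepl(pattern, col1))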

You could try stringr. This doesn't exactly check that it's the beginning or end of the string, but if it's a uniform pattern this may be useful:
library(stringr)

my_match <- function(string, start_string, num_seq) {
  # TRUE where the start pattern occurs and any of the numbers occurs
  str_detect(string, start_string) &
    str_detect(string, paste(num_seq, collapse = "|"))
}
is_matched <- my_match(df$your_col, "KJHB", 20:33)
df2 <- df[is_matched, ]
There is probably something smarter that can be done with str_locate too.

Related

Get the number of lines of code in a R script

I would like to know if there is a way to count the number of lines in an R script, ignoring comment lines.
I didn't find a solution on the Internet, but maybe I missed something.
Example script tester.R with 8 lines, one commented:
x <- 3
x+1
x+2
#x+4
x*x
Function to count lines without comments:
library(readr)

foo <- function(path) {
  rln <- read_lines(path)
  # drop comment lines, then blank lines
  rln <- rln[!grepl('^#', trimws(rln))]
  rln <- rln[trimws(rln) != '']
  return(length(rln))
}
Test run:
> foo('tester.R')
[1] 7
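As an aside (my own sketch, not part of the original answer): the same count can be done with base R only, so no package is needed:
foo_base <- function(path) {
  rln <- trimws(readLines(path))
  # count lines that are neither blank nor comments
  sum(rln != "" & !startsWith(rln, "#"))
}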
You could try this:
library(magrittr)
library(stringr)
library(readr)
number_of_lines_of_code <- function(file_path){
  file <- readr::read_file(file_path)
  file_lines <- file %>% stringr::str_split("\n")
  first_character_of_lines <- file_lines %>%
    lapply(function(line) stringr::str_replace_all(line, " ", "")) %>%
    lapply(function(line) stringr::str_sub(line, 1, 1)) %>%
    unlist
  # count lines that are not comments, carriage returns, or empty
  sum(!first_character_of_lines %in% c("#", "\r", ""))
}
number_of_lines_of_code("your/file/path.R")
That doesn't seem like very useful information, but you can do this:
script <- "#assign
a <- 1
b <- 2
"
nrow(read.table(text = script, sep = "°"))
[1] 2
I use ° as the separator, because it's an unlikely character in most R scripts. Adjust that as needed.
Of course, this could be done much more efficiently outside of R.
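A quick check of that approach (my own sketch): read.table skips blank lines as well as '#' comments by default, so scripts containing empty lines are handled too.
script2 <- "# setup
x <- 1

y <- 2
"
nrow(read.table(text = script2, sep = "°"))
[1] 2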

How to build a subset query using a loop in R?

I'm trying to subset a big table across a number of columns, keeping all the rows where State_2009, State_2010, State_2011, etc. do not equal the value "Unknown".
My instinct was to do something like this (coming from a JS background), where I either build the query in a loop or continually subset the data in a loop, referencing the year as a variable.
mysubset <- data
for(i in 2009:2016){
  mysubset <- subset(mysubset, paste("State_", i, " != Unknown", sep = ""))
}
But this doesn't work, at least because paste returns a string, giving me the error 'subset' must be logical.
Is there a better way to do this?
Using dplyr with the filter_ function should get you the correct output:
library(dplyr)

mysubset <- data
for(i in 2009:2016){
  mysubset <- mysubset %>%
    filter_(paste("State_", i, " != \"Unknown\"", sep = ""))
}
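A side note, assuming a current dplyr version (my addition, not from the answer): filter_ has since been deprecated, and the whole loop can be replaced with if_all:
library(dplyr)

mysubset <- data %>%
  filter(if_all(starts_with("State_"), ~ .x != "Unknown"))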
To add to Matt's answer, you could also do it like this:
cols <- paste0("State_", 2009:2016)
# row indices that contain "Unknown" in any of the State_ columns
inds <- which(mysubset[, cols] == "Unknown", arr.ind = TRUE)[, 1]
# assumes at least one "Unknown" exists; -integer(0) would drop every row
mysubset <- mysubset[-unique(inds), ]
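Another loop-free option, as a base-R sketch (data is assumed to be the question's data frame, with no NAs in the State_ columns):
cols <- paste0("State_", 2009:2016)
# keep rows where none of the State_ columns equal "Unknown"
keep <- rowSums(data[cols] == "Unknown") == 0
mysubset <- data[keep, ]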

R: Remove consecutive duplicates from a comma-separated string

I'm having issues removing just the right amount of information from the following data:
18,14,17,2,9,8
17,17,17,14
18,14,17,2,1,1,1,1,9,8,1,1,1
I'm applying !duplicated in order to remove the duplicates.
SplitFunction <- function(x) {
  b <- unlist(strsplit(x, '[,]'))
  c <- b[!duplicated(b)]
  return(paste(c, collapse = ","))
}
The problem is that this removes all duplicates, not only the consecutive ones. The result below is what I'm getting:
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8
The data below is what I want to obtain.
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8,1
Can you suggest a way to perform this? Ideally a vectorized approach...
Thanks,
Miguel
You can use the rle function to solve this question.
xx <- c("18,14,17,2,9,8", "17,17,17,14", "18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx, ",")
sapply(zz, function(x) rle(x)$values)
And you can refer to this link:
How to remove/collapse consecutive duplicate values in sequence in R?
We can use rle
sapply(strsplit(x, ','), function(x) paste(inverse.rle(within.list(rle(x),
    lengths <- rep(1, length(lengths)))), collapse = ","))
#[1] "18,14,17,2,9,8" "17,14" "18,14,17,2,1,9,8,1"
data
x <- c('18,14,17,2,9,8', '17,17,17,14', '18,14,17,2,1,1,1,1,9,8,1,1,1')
Great rle answers. This is just to add an alternative without rle. It gives a list of character vectors but can of course easily be expanded to return strings:
numbers <- c("18,14,17,2,9,8", "17,17,17,14", "14,17,18,2,9,8,1", "18,14,17,11,8,9,8,8,22,13,6", "14,17,2,9,8", "18,14,17,2,1,1,1,1,1,1,1,1,9,8,1,1,1,1")
result <- sapply(strsplit(numbers, ","), function(x) x[x != c(x[-1], Inf)])
print(result)
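For completeness, a sketch of that expansion back to comma-separated strings (the Inf sentinel simply guarantees the last element of each vector is kept):
result_strings <- sapply(strsplit(numbers, ","),
                         function(x) paste(x[x != c(x[-1], Inf)], collapse = ","))
print(result_strings)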

Delete rows with blank values in one particular column

I am working on a large dataset, with some rows with NAs and others with blanks:
df <- data.frame(ID = c(1:7),
                 home_pc = c("", "CB4 2DT", "NE5 7TH", "BY5 8IB", "DH4 6PB", "MP9 7GH", "KN4 5GH"),
                 start_pc = c(NA, "Home", "FC5 7YH", "Home", "CB3 5TH", "BV6 5PB", NA),
                 end_pc = c(NA, "CB5 4FG", "Home", "", "Home", "", NA))
How do I remove the NAs and blanks in one go (in the start_pc and end_pc columns)? I have in the past used:
df <- df[-which(is.na(df$start_pc)), ]
... to remove the NAs - is there a similar command to remove the blanks?
It is the same construct - simply test for empty strings as well as NA:
df[!(is.na(df$start_pc) | df$start_pc == ""), ]
Try this:
df <- df[-which(df$start_pc == ""), ]
In fact, looking at your code, you don't need the which; use negation instead, so you can simplify it to:
df <- df[!(df$start_pc == ""), ]
df <- df[!is.na(df$start_pc), ]
And, of course, you can combine these two statements as follows:
df <- df[!(df$start_pc == "" | is.na(df$start_pc)), ]
And simplify it even further with with:
df <- with(df, df[!(start_pc == "" | is.na(start_pc)), ])
You can also test for non-zero string length using nzchar (note that nzchar(NA) is TRUE, so the NA test is still needed):
df <- with(df, df[nzchar(start_pc) & !is.na(start_pc), ])
Disclaimer: I didn't test any of this code. Please let me know if there are syntax errors anywhere
An elegant solution with dplyr would be:
df %>%
  # recode empty strings "" as NAs
  na_if("") %>%
  # remove NAs
  na.omit
An alternative solution is to remove the rows with blanks in one variable:
df <- subset(df, VAR != "")
An easy approach would be to make all the blank cells NA and keep only complete cases. You might also look for na.omit examples; it is a widely discussed topic.
df[df == ""] <- NA
df <- df[complete.cases(df), ]
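Note that complete.cases and na.omit drop rows with NAs in any column; here is a sketch restricted to the two columns from the question (df2 is just an illustrative name):
# keep rows where start_pc and end_pc are neither NA nor blank
# (%in% matches NA like an ordinary value, so no separate is.na test is needed)
df2 <- df[!(df$start_pc %in% c(NA, "") | df$end_pc %in% c(NA, "")), ]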
