Using grepl() to remove values from a dataframe in R

I have a data.frame with 1 column and an arbitrary number of rows.
This column contains strings, and some of the strings contain a substring, let's say "abcd".
I want to remove any strings from the data frame that contain that substring. For example, I may have five strings that are "123 abcd", and I want those to be removed.
I am currently using grepl() to try to remove these values, but it is not working. I am trying:
data.frame[!grepl("abcd", dataframe)]
but it returns an empty data frame.

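The attempt fails because grepl() is given the whole data frame rather than the column, and because indexing with a single vector and no comma selects columns, not rows. Instead, we can use grepl on the column to get a logical vector, negate (!) it, and use that to subset the rows of 'data':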
data[!grepl("abcd", data$Col),,drop = FALSE]
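A minimal sketch of this approach on made-up data. The data frame `data` and the column name `Col` are assumptions carried over from the answer, and `fixed = TRUE` is an optional addition that treats "abcd" as a literal string rather than a regular expression. It also illustrates why `drop = FALSE` is there: without it, subsetting a one-column data frame returns a plain vector.

```r
# Hypothetical example data; the column name `Col` is assumed.
data <- data.frame(Col = c("123 abcd", "xyz", "abcd 9", "hello"),
                   stringsAsFactors = FALSE)

# TRUE for every string that contains the literal substring "abcd"
has_abcd <- grepl("abcd", data$Col, fixed = TRUE)

# Keep the rows that do NOT match; drop = FALSE preserves the data frame
kept <- data[!has_abcd, , drop = FALSE]

# Without drop = FALSE, a one-column data frame collapses to a vector
as_vector <- data[!has_abcd, ]

print(kept)
```

Here `kept` is still a data frame with the two non-matching rows, while `as_vector` is a character vector.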

Related

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble finding a way to extract the last letter from the sbp.id column and use it to add a new column called "sex" to the data frame. I initially tried grepl to separate the rows ending in F from those ending in M, but couldn't figure out how to use that to create a new column containing just M or F, depending on the last letter of each value in sbp.id.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
If you know you need the last character of the string from every row, irrespective of whether the other characters are letters or digits and of whether the elements have different lengths, you can use substr:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because substr and nchar are both vectorized, so each element gets its own start and stop position.
Using a regex, you can extract the last letter from sbp.id:
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Another way is to remove the digits:
df$sex <- sub('\\d+', '', df$sbp.id)
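To compare the three approaches side by side, here is a small sketch that rebuilds the excerpt from the question as a data frame (the variable names `sex_substr`, `sex_regex` and `sex_nodigits` are only illustrative) and checks that they agree on this data.

```r
# Rebuild the sample from the question
df <- data.frame(sbp.id = c("125F", "13000M", "13120M", "13260M", "13480M"),
                 stringsAsFactors = FALSE)

# 1. Last character by position
n <- nchar(df$sbp.id)
sex_substr <- substr(df$sbp.id, n, n)

# 2. Capture the final capital letter with a regex
sex_regex <- sub('.*([A-Z])$', '\\1', df$sbp.id)

# 3. Strip the first run of digits
sex_nodigits <- sub('\\d+', '', df$sbp.id)

df$sex <- sex_substr
print(df)
```

The digit-stripping variant only gives the same answer when each ID is a single run of digits followed by one letter. For an ID such as "12A5F" it would return "A5F", whereas the other two approaches return "F".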

R how to remove rows in a data frame based on the first character of a column

I have a big data frame and I want to remove certain rows from it based on whether the first character of a column is a letter or a number. A sample of my data frame looks like this:
y<-c('34TA912','JENAR','TEST','34CC515')
z<-c('23.12.2015','24.12.2015','24.12.2015','25.12.2015')
abc<-data.frame(y,z)
Based on the sample above, I would like to remove the second and third rows because their values in the y column start with a letter instead of a number. The values in the y column could be anything, so the only way I can filter is by checking the first character, without using any predefined value. If I use grep with a specific character, I could remove other rows as well, since they also contain letters. Can you assist?
We can use grep. In the regex, ^ anchors the match to the beginning of the string, so '^[0-9]' matches values in the 'y' column that start with a digit. grep returns a numeric index of the matching elements, which we use to subset the rows of 'abc'.
abc[grep('^[0-9]', abc$y),]
# y z
#1 34TA912 23.12.2015
#4 34CC515 25.12.2015
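As a variation on the same idea, here is a sketch that uses grepl instead of grep, which yields a logical vector that can be negated directly to get the complementary rows. The names `starts_with_digit`, `digit_rows` and `letter_rows` are illustrative only.

```r
y <- c('34TA912', 'JENAR', 'TEST', '34CC515')
z <- c('23.12.2015', '24.12.2015', '24.12.2015', '25.12.2015')
abc <- data.frame(y, z, stringsAsFactors = FALSE)

# TRUE where the value in y begins with a digit
starts_with_digit <- grepl('^[0-9]', abc$y)

# Rows whose y starts with a digit, and the complementary set
digit_rows  <- abc[starts_with_digit, ]
letter_rows <- abc[!starts_with_digit, ]

print(digit_rows)
```

A logical index also behaves sensibly when nothing matches: negating it keeps every row, whereas a negated empty grep result would select no rows at all.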

Count of Comma separated values in r

I have a column named subcat_id in which the values are stored as comma separated lists. I need to count the number of values and store the counts in a new column. The lists also have Null values that I want to get rid of.
I would like to store the counts in the n column.
We can try
nchar(gsub('[^,]+', '', gsub(',(?=,)|(^,|,$)', '',
gsub('(Null){1,}', '', df1$subcat_id), perl=TRUE)))+1L
#[1] 6 4
Or
library(stringr)
str_count(df1$subcat_id, '[0-9.]+')
#[1] 6 4
data
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78',
'1,2,3,15,Null,Null'), stringsAsFactors=FALSE)
You can do
sapply(strsplit(subcat_id,","),FUN=function(x){length(x[x!="Null"])})
strsplit(subcat_id,",") will return a list of each item in subcat_id split on commas. sapply will apply the specified function to each item in this list and return us a vector of the results.
Finally, the function that we apply will take just the non-null entries in each list item and count the resulting sublist.
For example, if we have
subcat_id <- c("1,2,3","23,Null,4")
Then running the above code returns c(3,2), which you can assign to your column.
If you are running this on a data frame, the character column may have been interpreted as a factor, in which case the error "non-character argument" will be thrown. To fix this, force interpretation as a character vector with the as.character function, changing the command to
sapply(strsplit(as.character(frame$subcat_id),","),FUN=function(x){length(x[x!="Null"])})
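For completeness, a short sketch that applies the strsplit approach to the `df1` sample data shown earlier and stores the result in the `n` column the question asks for. Using `sum(x != "Null")` in place of `length(x[x != "Null"])` is just an equivalent shorthand.

```r
df1 <- data.frame(subcat_id = c('1,2,3,15,16,78',
                                '1,2,3,15,Null,Null'),
                  stringsAsFactors = FALSE)

# Split each entry on commas; as.character guards against factor columns
parts <- strsplit(as.character(df1$subcat_id), ",", fixed = TRUE)

# Count the pieces in each entry that are not the literal text "Null"
df1$n <- sapply(parts, function(x) sum(x != "Null"))

print(df1)
```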

How to combine the values in a column of dataframe

I have a dataframe with two columns. I want to concatenate the values in the second column and return a single string. How can I do this in R?
You can use paste with the collapse argument set to the delimiter you want. Here, I am using ''. You can set it to -, _ or anything else.
paste(df$Col2, collapse="")
If there are NAs, you could use na.omit to drop them first:
paste(na.omit(df$Col2), collapse="")
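A brief sketch of what these calls produce, using a made-up two-column data frame (the column names `Col1` and `Col2` and the values are assumptions for illustration). It shows that paste turns an NA into the text "NA" unless the NAs are dropped first.

```r
# Hypothetical data with a missing value in the second column
df <- data.frame(Col1 = 1:4,
                 Col2 = c("a", NA, "b", "c"),
                 stringsAsFactors = FALSE)

# NA is converted to the string "NA" and ends up in the result
with_na <- paste(df$Col2, collapse = "")

# Dropping NAs first gives only the real values
without_na <- paste(na.omit(df$Col2), collapse = "")

# Any other delimiter works the same way
dashed <- paste(na.omit(df$Col2), collapse = "-")

print(c(with_na, without_na, dashed))
```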

How to remove duplicated rows by a column in an R matrix

I am trying to remove duplicated rows by one column (e.g. the 1st column) in an R matrix. How can I extract the unique set by one column from a matrix? I've used
x_1 <- x[unique(x[,1]),]
While the size is correct, all of the values are NA. So instead, I tried
x_1 <- x[-duplicated(x[,1]),]
But the dimensions were incorrect.
I think you're confused about how subsetting works in R. unique(x[,1]) will return the set of unique values in the first column. If you then try to subset using those values R thinks you're referring to rows of the matrix. So you're likely getting NAs because the values refer to rows that don't exist in the matrix.
Your other attempt runs afoul of the fact that duplicated returns a boolean vector, not a vector of indices. So putting a minus sign in front of it converts it to a vector of 0's and -1's, which again R interprets as trying to refer to rows.
Try replacing the '-' with a '!' in front of duplicated, which is the boolean negation operator. Something like this:
m <- matrix(runif(100),10,10)
m[c(2,5,9),1] <- 1
m[!duplicated(m[,1]),]
As you need the indices of the unique rows, use duplicated as you tried. The problem was using - instead of !, so try:
x[!duplicated(x[,1]),]
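The difference between `!` and `-` can be seen on a small deterministic matrix. This is only an illustrative sketch with made-up values: `x` has five rows, and its first column contains two repeated values.

```r
# First column has duplicates: rows 3 and 5 repeat earlier values
x <- matrix(c(1, 2, 1, 3, 2,
              10, 20, 30, 40, 50), ncol = 2)

# Logical vector marking rows whose first-column value was seen before
is_dup <- duplicated(x[, 1])

# Logical negation keeps the first occurrence of each value
unique_rows <- x[!is_dup, , drop = FALSE]

# Arithmetic negation gives c(0, 0, -1, 0, -1), which only drops row 1
wrong <- x[-is_dup, , drop = FALSE]

print(unique_rows)
```

With `!`, three rows remain (one per distinct value in column 1). With `-`, four rows remain, because the zeros are ignored and the -1 entries simply exclude the first row.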
