Remove rows with a specific number in R

I want to remove rows containing the text "student2". However, I don't want to remove rows like "student22", "student23", etc.
For example:
  Student.Code     Values
1 canada.student12      2
2 canada.student2       3   # remove
3 canada.student23      5   # keep
4 US.student2           6   # remove
5 US.student32          2
6 Aus.student87       645
7 Turkey.student25      4   # keep
I used the code grepl("student2", example$Student.Code, fixed = TRUE), but it also finds (removes) rows like "student23".

We can use grepl("student2$", example$Student.Code), where $ anchors the pattern to the end of the string.
library(tidyverse)
example <- tibble::tribble(
  ~Student.Code, ~Values,
  "canada.student12", 2L,
  "canada.student2", 3L,
  "canada.student23", 5L,
  "US.student2", 6L,
  "US.student32", 2L,
  "Aus.student87", 645L,
  "Turkey.student25", 4L
)
grepl("student2$", example$Student.Code)
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE
example %>%
  filter(!grepl("student2$", Student.Code))
# A tibble: 5 x 2
Student.Code Values
<chr> <int>
1 canada.student12 2
2 canada.student23 5
3 US.student32 2
4 Aus.student87 645
5 Turkey.student25 4
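If you prefer the stringr helpers that ship with the tidyverse, str_detect() accepts the same end-of-string anchor; a minimal sketch, assuming stringr is installed:

```r
library(stringr)

# str_detect() is stringr's counterpart to grepl(); the "$" anchor
# keeps "student23" and "student12" but drops the exact "student2" suffix.
codes <- c("canada.student12", "canada.student2", "US.student2", "US.student32")
codes[!str_detect(codes, "student2$")]
# [1] "canada.student12" "US.student32"
```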

Data:
df <- data.frame(
  Student = c("canada.student12", "canada.student2", "canada.student23",
              "US.student2", "US.student32", "Aus.student87", "Turkey.student25"),
  Value = c(2, 3, 5, 6, 2, 654, 5)
)
Solution (in base R):
The idea is to use grepl to match the values where the number 2 occurs at a word boundary (in regex, \\b), and to exclude those strings with the negation operator !:
df[!grepl("student2\\b", df$Student),]
Student Value
1 canada.student12 2
3 canada.student23 5
5 US.student32 2
6 Aus.student87 654
7 Turkey.student25 5
Alternatively, you can also go the opposite way and match those patterns that you want to keep:
df[grepl("student(?=\\d{2,})", df$Student, perl = TRUE),]
Here, the idea is to use a positive lookahead to match values containing student if and only if it is immediately followed by at least two digits (\\d{2,}). (Note that when using lookahead or lookbehind you need to set perl = TRUE.)
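The difference between the literal fixed = TRUE search and the word-boundary pattern can be checked directly on a few of the values (base R only):

```r
x <- c("canada.student12", "canada.student2", "canada.student23")

# Literal substring search: "student23" contains "student2", so it matches too.
grepl("student2", x, fixed = TRUE)
# [1] FALSE  TRUE  TRUE

# Word boundary: "student2" must not be followed by another word character.
grepl("student2\\b", x)
# [1] FALSE  TRUE FALSE
```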

If you have a variable with an exact value you want to remove, don't use grep or grepl.
example <- tibble::tribble(
  ~Student.Code, ~Values,
  "canada.student12", 2L,
  "canada.student2", 3L,
  "canada.student23", 5L,
  "US.student2", 6L,
  "US.student32", 2L,
  "Aus.student87", 645L,
  "Turkey.student25", 4L
)
example <- example[example$Student.Code != "canada.student2",]
# or, in dplyr
example <- filter(example, Student.Code != "canada.student2")
# for multiple values
example <- filter(example, !(Student.Code %in% c("canada.student2", "US.student2")))
fixed = TRUE is not working because all it means is 'search for this literal string anywhere in the input', not 'match only when the whole value equals this string'.

Related

Regex expression exceptions in subsetting data with grepl

I'm trying to subset data in R by certain characters in a field and cannot find the correct regex logic to get what I need. I need to subset records for which the ID contains either:
Just "AB"
"AB" and "ABC"
But NOT fields with ONLY "ABC"
These patterns fall within any part of the field (beginning, middle, end) in this data set and have no certain separators.
Example dataset TEST:
Record ID value
1 blueAB_ABC 7
2 green_ABCblue 9
3 ABC_green 45
4 green_AB 23
5 CD_red 45
So for this example I would want to subset records 1 and 4.
I've gotten as far as returning those with just AB and excluding ABC, but cannot seem to find the proper regex to get all with "AB" and potentially "ABC".
AB_set <- subset(TEST, grepl("*AB", ID) & !grepl("*ABC", ID) )
Record ID value
4 green_AB 23
What I'm hoping to get:
Record ID value
1 blueAB_ABC 7
4 green_AB 23
EDIT: Just to clarify, I updated the dataset to show that the pattern in question may fall next to other characters than an underscore, or may not necessarily occur at the beginning/end (as previously noted, "no certain separators").
You can get this by specifying that "AB" must be surrounded by either an underscore or a word boundary.
df[grepl("(\\b|_)AB(\\b|_)", df$ID),]
Record ID value
1 1 blue_AB_ABC 7
4 4 green_AB 23
A separate "ABC" pattern is not needed, because matching "AB" is already required. The following matches AB only if it is delimited by an underscore or starts or ends the ID:
AB_set <- subset(TEST, grepl("(^|_)AB(_|$)", TEST$ID))
Result:
Record ID value
1 1 blue_AB_ABC 7
4 4 green_AB 23
Data:
TEST = structure(list(
  Record = 1:5,
  ID = structure(c(2L, 5L, 1L, 4L, 3L),
                 .Label = c("ABC_green", "blue_AB_ABC", "CD_red",
                            "green_AB", "green_ABC_blue"),
                 class = "factor"),
  value = c(7L, 9L, 45L, 23L, 45L)
), .Names = c("Record", "ID", "value"),
   class = "data.frame", row.names = c(NA, -5L))
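As a quick sanity check, the accepted pattern can be run against the IDs from the dput above (base R only):

```r
ids <- c("blue_AB_ABC", "green_ABC_blue", "ABC_green", "green_AB", "CD_red")

# "AB" must be delimited by an underscore, the start, or the end of the
# string; a bare "ABC" never qualifies because "AB" is followed by "C".
grepl("(^|_)AB(_|$)", ids)
# [1]  TRUE FALSE FALSE  TRUE FALSE
```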

How can I group based on similarity in strings?

I have data like this:
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to take the longest name (in letters), find how many shorter, similar names there are, and assign them all to one group; then move on to the next-longest name and assign its matches to another group, and so on until no names are left.
First I calculate the length of each label:
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings share letters with this longest string. We have these possibilities:
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible prefixes, but the order of the letters should stay the same (from left to right): for example it can be Afghan but cannot be fAhg.
So we have only two other strings that are similar to this one:
Afghanestan
Afghanestankabol
That is because they must match exactly, without even a letter of difference (apart from being shorter than the largest string), to be assigned to the same group.
The desired output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
Why is indiaAfghanestan a separate group? Because it is not completely contained in another name (it only partially overlaps one); to join a group it would have to be part of a bigger name.
I tried to use this one: Find similar strings and reconcile them within one dataframe, which did not help me at all.
I found something else that may help:
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
  tolower %>%
  trimws %>%
  stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
  as.dist %>%
  `attr<-`("Labels", df$label) %>%
  hclust %T>%
  plot %T>%
  rect.hclust(h = 0.3) %>%
  cutree(h = 0.3) %>%
  print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.
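For readers unfamiliar with magrittr's tee operator %T>% and the inline `attr<-`() call, the same steps can be written one at a time (plotting omitted; a sketch assuming the stringdist package is installed, on a small stand-in for df$label):

```r
library(stringdist)

# A few labels in the same shape as the question's data.
labels <- c("Afghanestan ", "Afghanestankabol", " holand", " holandindia", "USA")

# Normalise case/whitespace, then compute pairwise Jaro-Winkler distances;
# with a single input vector, stringdistmatrix() returns a "dist" object.
d <- stringdistmatrix(trimws(tolower(labels)), method = "jw", p = 0.1)
attr(d, "Labels") <- labels

# Cluster and cut the tree at the same height used in the pipeline above.
groups <- cutree(hclust(d), h = 0.3)
```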

Data manipulations in R

As part of a project, I am currently using R to analyze some data. I am stuck on retrieving a few values from an existing dataset that I imported from a CSV file.
For my analysis, I want to create another column that holds the difference between the current value of x and its previous value; however, in the first row of every unique i, the new column should just keep the current value of x. I am new to R and have been trying various approaches for some time, but still cannot figure out a way to do this. I would appreciate suggestions on an approach to achieve this.
My data structure (the file's contents) is:
structure(list(t = 1:10, x = c(34450L, 34469L, 34470L, 34483L,
34488L, 34512L, 34530L, 34553L, 34575L, 34589L), y = c(268880.73342868,
268902.322359863, 268938.194698248, 268553.521856105, 269175.38273083,
268901.619719038, 268920.864512966, 269636.604121984, 270191.206593437,
269295.344751692), i = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L)), .Names = c("t", "x", "y", "i"), row.names = c(NA, 10L), class = "data.frame")
You can use the package data.table to obtain what you want:
library(data.table)
setDT(MyData)[, x_diff := c(x[1], diff(x)), by=i]
MyData
# t x i x_diff
# 1: 1 34287 1 34287
# 2: 2 34789 1 502
# 3: 3 34409 1 -380
# 4: 4 34883 1 474
# 5: 5 34941 1 58
# 6: 6 34045 2 34045
# 7: 7 34528 2 483
# 8: 8 34893 2 365
# 9: 9 34551 2 -342
# 10: 10 34457 2 -94
Data:
set.seed(123)
MyData <- data.frame(t=1:10, x=sample(34000:35000, 10, replace=T), i=rep(1:2, e=5))
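A dplyr equivalent of the data.table call above, as a sketch on the same simulated data (assumes dplyr is installed; the sampled x values depend on the R version's RNG):

```r
library(dplyr)

set.seed(123)
MyData <- data.frame(t = 1:10,
                     x = sample(34000:35000, 10, replace = TRUE),
                     i = rep(1:2, each = 5))

# Within each group i, keep the first x as-is and take successive
# differences for the remaining rows.
MyData <- MyData %>%
  group_by(i) %>%
  mutate(x_diff = c(x[1], diff(x))) %>%
  ungroup()
```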
You can use the diff() function. Note that diff() returns a vector one element shorter than the column it is applied to, so it cannot be assigned to a new column directly. In your case you can try this:
# if your data frame is called MyData
MyData$newX = c(NA, diff(MyData$x))
That puts an NA value as the first entry in the new column; the remaining values are the differences between consecutive values in your "x" column.
UPDATE:
You can create a simple loop by subsetting on every unique value of "i" and then calculating the difference between your x values:
# initialize a new data frame
newdf = NULL
values = unique(MyData$i)
for (i in 1:length(values)) {
  data1 = MyData[MyData$i == values[i], ]
  data1$newX = c(NA, diff(data1$x))
  newdf = rbind(newdf, data1)
}
# and then, if you want, overwrite your original data frame with newdf
MyData = newdf
# remove some temporary variables
rm(data1, newdf, values)
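The loop can also be replaced by a single vectorized call to ave() in base R; a sketch using made-up data in the same shape:

```r
MyData <- data.frame(t = 1:10,
                     x = c(34450, 34469, 34470, 34483, 34488,
                           34512, 34530, 34553, 34575, 34589),
                     i = rep(1:2, each = 5))

# ave() applies the function within each level of i and returns a vector
# of the original length, so it can be assigned as a column directly.
MyData$x_diff <- ave(MyData$x, MyData$i, FUN = function(v) c(v[1], diff(v)))
MyData$x_diff
# [1] 34450    19     1    13     5 34512    18    23    22    14
```

Unlike the loop, ave() preserves the original row order, so no rbind() step is needed.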

is.element on column of lists in data frame

I have a data frame with a column that contains some elements that are lists. I would like to find out which rows of the data frame contain a keyword in that column.
The data frame, df, looks a bit like this
idstr tag
1 wl
2 other.to
3 other.from
4 c("wl","other.to")
5 wl
6 other.wl
7 c("ll","other.to")
The goal is to assign all of the rows with 'wl' in their tag to a new data frame. In this example, I would want a new data frame that looks like:
idstr tag
1 wl
4 c("wl","other.to")
5 wl
I tried something like this
df_wl <- df[which(is.element('wl',df$tag)),]
but this only returns the first row of the data frame (whether or not it contains 'wl'). I think the trouble lies in iterating through the rows when applying the is.element function. Here are two calls to the function and their results:
is.element('wl',df$tag[[4]]) > TRUE
is.element('wl',df$tag[4]) > FALSE
How do you suggest I iterate through the data frame to assign df_wl its proper values?
PS: Here's the dput:
structure(list(idstr = 1:7, tag = structure(c(6L, 5L, 4L, 2L, 6L, 3L, 1L), .Label = c("c(\"ll\",\"other.to\")", "c(\"wl\",\"other.to\")", "other.wl", "other.from", "other.to", "wl"), class = "factor")), .Names = c("idstr", "tag"), row.names = c(NA, -7L), class = "data.frame")
Based on your dput data, this may work. The regular expression (^wl$)|(\"wl\") matches either wl as the entire string, or any occurrence of "wl" wrapped in double quotes:
df[grepl("(^wl$)|(\"wl\")", df$tag),]
# idstr tag
# 1 1 wl
# 4 4 c("wl","other.to")
# 5 5 wl
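If tag really were a list-column, rather than factor strings that merely look like calls to c(), is.element could be applied row by row; a sketch with hypothetical data:

```r
df2 <- data.frame(idstr = 1:4)
df2$tag <- list("wl", "other.to", c("wl", "other.to"), c("ll", "other.to"))

# sapply() walks the list-column one element at a time, so is.element()
# is evaluated against each row's own character vector.
keep <- sapply(df2$tag, function(x) is.element("wl", x))
df_wl <- df2[keep, ]
df_wl$idstr
# [1] 1 3
```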

Replacing values in a data frame column

Given a large data frame with a column that has unique values
(ONE, TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT)
I want to replace some of the values. For example, every occurrence of 'ONE' should be replaced by '1' and
'FOUR' -> '2SQUARED'
'FIVE' -> '5'
'EIGHT' -> '2CUBED'
Other values should remain as they are.
An if/else chain would take forever to run. How can I apply a vectorized solution? Is match() the correct way to go?
Using @rnso's data set:
library(plyr)
transform(data, vals = mapvalues(vals,
                                 c('ONE', 'FOUR', 'FIVE', 'EIGHT'),
                                 c('1', '2SQUARED', '5', '2CUBED')))
# vals
# 1 1
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
Try the following using base R:
data = structure(list(vals = structure(c(4L, 8L, 7L, 3L, 2L, 6L, 5L, 1L),
  .Label = c("EIGHT", "FIVE", "FOUR", "ONE", "SEVEN", "SIX", "THREE", "TWO"),
  class = "factor")), .Names = "vals", class = "data.frame",
  row.names = c(NA, -8L))
initial = c('ONE', 'FOUR', 'FIVE', 'EIGHT')
final = c('1', '2SQUARED', '5', '2CUBED')
myfn = function(ddf, init, fin) {
  refdf = data.frame(init, fin)
  ddf$new = refdf[match(ddf$vals, init), 'fin']
  ddf$new = as.character(ddf$new)
  ndx = which(is.na(ddf$new))
  ddf$new[ndx] = as.character(ddf$vals[ndx])
  ddf
}
myfn(data, initial, final)
vals new
1 ONE 1
2 TWO TWO
3 THREE THREE
4 FOUR 2SQUARED
5 FIVE 5
6 SIX SIX
7 SEVEN SEVEN
8 EIGHT 2CUBED
Your column is probably a factor. Give this a try. Using rnso's data, I'd recommend you first create two vectors of the values to change from and the values to change to:
from <- c("FOUR", "FIVE", "EIGHT")
to <- c("2SQUARED", "5", "2CUBED")
Then replace the relevant factor levels with
levels(data$vals)[match(from, levels(data$vals))] <- to
This gives
data
# vals
# 1 ONE
# 2 TWO
# 3 THREE
# 4 2SQUARED
# 5 5
# 6 SIX
# 7 SEVEN
# 8 2CUBED
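Yet another base-R option is a named lookup vector combined with ifelse(), which leaves unmatched values untouched; a sketch on plain character values:

```r
vals <- c("ONE", "TWO", "FOUR", "FIVE", "EIGHT")
map  <- c(ONE = "1", FOUR = "2SQUARED", FIVE = "5", EIGHT = "2CUBED")

# Replace only the values present in the lookup table; everything
# else falls through to the original value.
ifelse(vals %in% names(map), map[vals], vals)
# "1" "TWO" "2SQUARED" "5" "2CUBED"
```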
