Is there a way in R to convert the following character variable? - r

I have the following dataframe with a character variable that represents the number of lanes on a highway, can I replace this vector with a similar vector that has numbers instead of letter?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))

One way would be to count number of "|" in each string. We subtract it with - 1 since there is an additional "|".
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would to be count exact words in the string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))

Similar to Ronak's solution you could also do:
stringr:str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr:str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr:str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.

Not as succinct as #Ronak Shah's, but another method in Base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))

Related

Merging two columns into one based on value

I have a dataset with two columns containing the following: an indicator number and a hashcode
The only problem is that the columns have the same name, but the value can switch columns.
Now I want to merge the columns and keep the number (I don't care about the hashcode)
I saw this question: Merge two columns into one in r
and I tried the coalesce() function, but that is only for having NA values. Which I don't have. I looked at the unite function, but according to the cheat sheet documentation documentation here that doesn't what I'm looking for
My next try was the filter_at and other filter functions from the dplyr package Documentation here
But that only leaves 150 data points while at the start I have 61k data points.
Code of filter_at I tried:
data <- filter_at(data,vars("hk","hk_1"),all_vars(.>0))
I assumed that a #-string shall not be greater than 0, which seems to be true, but it removes more than intented.
I would like to keep hk or hk_1 value which is a number. The other one (the hash) can be removed. Then I want a new column which only contains those numbers.
Sample data
My data looks like this:
HK|HK1
190|#SP0839
190|#SP0340
178|#SP2949
#SP8390|177
#SP2240|212
What I would like to see:
HK
190
190
178
177
212
I hope this provides an insight into the data. There are more columns like description, etc which makes that 190 at the start are not doubles.
We can replace all the values that start with "#" to NA and then use coalesce to select non-NA value between HK and HK1.
library(dplyr)
df %>%
mutate_all(~as.character(replace(., grepl("^#", .), NA))) %>%
mutate(HK = coalesce(HK, HK1)) %>%
select(HK)
# HK
#1 190
#2 190
#3 178
#4 177
#5 212
data
df <- structure(list(HK = structure(c(4L, 4L, 3L, 2L, 1L), .Label = c("#SP2240",
"#SP8390", "178", "190"), class = "factor"), HK1 = structure(c(2L,
1L, 3L, 4L, 5L), .Label = c("#SP0340", "#SP0839", "#SP2949",
"177", "212"), class = "factor")), class = "data.frame", row.names = c(NA, -5L))

how to assign list of values of attribute to single attribute in R

Am at beginner stage of R programming, please help me in below issue.
I have different desc values assigned to the same sol attribute in different rows. I want to make all desc values of sol attribute in single row as mentioned below
My data is as follows:
sol desc
1 fry, toast
1 frt,grt,gty
1 ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi
4 opu,yut,yut
I want the output as follows :
sol desc
1 fry,toast,frt,grt,gty,ytr,uyt,yir
6 hyt, ytr,oiu
4 hyg,hyu,loi,opu,yut,yut
Note: you can input any values in desc as per your convenience.
aggregate() is what you are looking for. Try this:
aggregate(desc ~ sol, data = df, paste, collapse = ",")
sol desc
1 1 fry, toast,frt,grt,gty,ytr,uyt,ytr
2 4 hyg,hyu,loi,opu,yut,yut
3 6 hyt, ytr,oiu
Data
df <- structure(list(sol = c(1L, 1L, 1L, 6L, 4L, 4L), desc = c("fry, toast",
"frt,grt,gty", "ytr,uyt,ytr", "hyt, ytr,oiu", "hyg,hyu,loi",
"opu,yut,yut")), .Names = c("sol", "desc"), class = "data.frame", row.names = c(NA,
-6L))

Delete rows using conditional [duplicate]

This question already has answers here:
Regular expressions (RegEx) and dplyr::filter()
(2 answers)
Closed 4 years ago.
I have a data.frame like this:
Client Product
1 VV_Brazil_Jul
2 VV_Brazil_Mar
5 VV_US_Jul
1 VV_JP_Apr
3 VV_CH_May
6 VV_Brazil_Aug
I would like to delete all rows with "Brazil".
You can do this using the grepl function and the ! to find the cases that are not matched:
# Create a dataframe where some cases have the product with Brazil as part of the value
df <- structure(list(Client = c(1L, 2L, 5L, 1L, 3L, 6L),
Product = c("VV_Brazil_Jul", "VV_Brazil_Mar", "VV_US_Jul", "VV_JP_Apr", "VV_CH_May", "VV_Brazil_Aug")),
row.names = c(NA, -6L), class = c("data.table", "data.frame"))
# Display the original dataframe in the Console
df
# Limit the dataframe to cases which do not have Brazil as part of the product
df <- df[!grepl("Brazil", df$Product, ignore.case = TRUE),]
# Display the revised dataframe in the Console
df
You can do the same thing with the tidyverse collection
dplyr::slice(df, -stringr::str_which(df$Product, "Brazil"))

how can I group based on similarity in strings

I have a data like this
df <-structure(list(label = structure(c(5L, 6L, 7L, 8L, 3L, 1L, 2L,
9L, 10L, 4L), .Label = c(" holand", " holandindia", " Holandnorway",
" USAargentinabrazil", "Afghanestan ", "Afghanestankabol", "Afghanestankabolindia",
"indiaAfghanestan ", "USA", "USAargentina "), class = "factor"),
value = structure(c(5L, 4L, 1L, 9L, 7L, 10L, 6L, 3L, 2L,
8L), .Label = c("1941029507", "2367321518", "2849255881",
"2913128511", "2927576083", "4550996370", "457707181.9",
"637943892.6", "796495286.2", "89291651.19"), class = "factor")), .Names = c("label",
"value"), class = "data.frame", row.names = c(NA, -10L))
I want to get the largest name (in letter) and then see how many smaller and similar names are and assign them to a group
then go for another next large name and assign them to another group
until no group left
at first I calculate the length of each so I will have the length of them
library(dplyr)
dft <- data.frame(names=df$label,chr=apply(df,2,nchar)[,1])
colnames(dft)[1] <- "label"
df2 <- inner_join(df, dft)
Now I can simply find which string is the longest
df2[which.max(df2$chr),]
Now I should see which other strings have the letters similar to this long string . we have these possibilities
Afghanestankabolindia
it can be
A
Af
Afg
Afgh
Afgha
Afghan
Afghane
.
.
.
all possible combinations but the order of letter should be the same (from left to right) for example it should be Afghand cannot be fAhg
so we have only two other strings that are similar to this one
Afghanestan
Afghanestankabol
it is because they should be exactly similar and not even a letter different (more than the largest string) to be assigned to the same group
The desire output for this is as follows:
label value group
Afghanestan 2927576083 1
Afghanestankabol 2913128511 1
Afghanestankabolindia 1941029507 1
indiaAfghanestan 796495286.2 2
Holandnorway 457707181.9 3
holand 89291651.19 3
holandindia 4550996370 3
USA 2849255881 4
USAargentina 2367321518 4
USAargentinabrazil 637943892.6 4
why indiaAfghanestan is a seperate group? because it does not completely belong to another name (it has partially name from one or another). it should be part of a bigger name
I tried to use this one Find similar strings and reconcile them within one dataframe which did not help me at all
I found something else which maybe helps
require("Biostrings")
pairwiseAlignment(df2$label[3], df2$label[1], gapOpening=0, gapExtension=4,type="overlap")
but still I don't know how to assign them into one group
You could try
library(magrittr)
df$label %>%
tolower %>%
trimws %>%
stringdist::stringdistmatrix(method = "jw", p = 0.1) %>%
as.dist %>%
`attr<-`("Labels", df$label) %>%
hclust %T>%
plot %T>%
rect.hclust(h = 0.3) %>%
cutree(h = 0.3) %>%
print -> df$group
df
# label value group
# 1 Afghanestan 2927576083 1
# 2 Afghanestankabol 2913128511 1
# 3 Afghanestankabolindia 1941029507 1
# 4 indiaAfghanestan 796495286.2 2
# 5 Holandnorway 457707181.9 3
# 6 holand 89291651.19 3
# 7 holandindia 4550996370 3
# 8 USA 2849255881 4
# 9 USAargentina 2367321518 4
# 10 USAargentinabrazil 637943892.6 4
See ?stringdist::'stringdist-metrics' for an overview of the string dissimilarity measures offered by stringdist.

R: Need to perform multiple matches for each row in data frame

I have a data frame where for each Filename value, there is a set of values for Compound. Some compounds have a value for IS.Name, which is a value that is one of the Compound values for a Filename.
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
For each set of rows with the same Filename value in my data frame, I want to match the IS.Name value with the corresponding Compound value, and put the Chrom.1.RT value from the matched row into the IS.RT cell. For example, in the table above I want to take the Chrom.1.RT value from row 2 for Compound=IS-1 and put it into IS.RT on row 1 like this:
,Batch,Index,Filename,Sample.Name,Compound,Chrom.1.Name,Chrom.1.RT,IS.Name,IS.RT
1,Batch1,1,Batch1-001,Sample001,Compound1,1,0.639883333,IS-1,0.61
2,Batch1,1,Batch1-001,Sample001,IS-1,IS1,0.61,NONE,0
If possible I need to do this in R. Thanks in advance for any help!
EDIT: Here is a larger, more detailed example:
Filename Compound Chrom.1.RT IS.Name IS.RT
1 Sample-001 IS-1 1.32495 NONE NA
2 Sample-001 Compound-1 1.344033333 IS-1 NA
3 Sample-001 IS-2 0.127416667 NONE NA
4 Sample-001 Compound-2 0 IS-2 NA
5 Sample-002 IS-1 1.32495 NONE NA
6 Sample-002 Compound-1 1.344033333 IS-1 NA
7 Sample-002 IS-2 0.127416667 NONE NA
8 Sample-002 Compound-2 0 IS-2 NA
This is chromatography data. For each sample, four compounds are being analyzed, and each compound has a retention time value (Chrom.1.RT). Two of these compounds are references that are used by the other two compounds. For example, compound-1 is using IS-1, while IS-1 does not have a reference (IS). Within each sample I am trying to match up the IS Name to the compound row for it to grab the CHrom.1.RT and put it in the IS.RT field. So for Compound-1, I want to find the Chrom.1.RT value for the Compound with the same name as the IS.Name field (IS-1) and put it in the IS.RT field for Compound-1. The tables I'm working with list all of the compounds together and don't match up the values for the references, which I need to do for the next step of calculating the difference between Chrom.1.RT and IS.RT for each compound. Does that help?
EDIT - Here's the code I found that seems to work:
sampleList<- unique(df1$Filename)
for (i in sampleList){
SampleRows<-which(df1$Filename == sampleList[i])
RefRows <- subset(df1, Filename== sampleList[i])
df1$IS.RT[SampleRows]<- RefRows$Chrom.1.RT[ match(df1$IS.Name[SampleRows], RefRows$Compound)]
}
I'm definitely open to any suggestions to make this more efficient though.
First of all, I suggest in the future you provide your example as the output of dput(df1) as it makes it a lot easier to read it into R instead of the space delimited table you provided
That being said, I've managed to wrangle it into R with the "help" of MS Excel.
df1=structure(list(Filename = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Sample-001", "Sample-002"), class = "factor"),
Compound = structure(c(3L, 1L, 4L, 2L, 3L, 1L, 4L, 2L), .Label = c("Compound-1",
"Compound-2", "IS-1", "IS-2"), class = "factor"), Chrom.1.RT = c(1.32495,
1.344033333, 0.127416667, 0, 1.32495, 1.344033333, 0.127416667,
0), IS.Name = structure(c(3L, 1L, 3L, 2L, 3L, 1L, 3L, 2L), .Label = c("IS-1",
"IS-2", "NONE"), class = "factor"), IS.RT = c(NA, NA, NA,
NA, NA, NA, NA, NA)), .Names = c("Filename", "Compound",
"Chrom.1.RT", "IS.Name", "IS.RT"), class = "data.frame", row.names = c(NA,
-8L))
The code below is severely clunky but it does the job.
library("dplyr")
df1=tbl_df(df1)
left_join(df1,left_join(df1%>%select(-Compound),df1%>%group_by(Compound)%>%summarise(unique(Chrom.1.RT)),c("IS.Name"="Compound")))%>%select(-IS.RT)%>%rename(IS.RT=`unique(Chrom.1.RT)`)
Unless I got i wrong, this is what you need?

Resources