dput(abc)
structure(list(Comparison = structure(1:15, .Label = c("C1_C2 ",
"C1_C3 ", "C1_C4 ", "C1_C5 ", "C1_C6 ", "C2_C3 ", "C2_C4 ", "C2_C5 ",
"C2_C6 ", "C3_C4 ", "C3_C5 ", "C4_C5 ", "C4_C6 ", "C5_C6 ", "C6_C5 "
), class = "factor")), row.names = c(NA, -15L), class = "data.frame")
This is my data-frame
Comparison
1 C1_C2
2 C1_C3
3 C1_C4
4 C1_C5
5 C1_C6
6 C2_C3
7 C2_C4
8 C2_C5
9 C2_C6
10 C3_C4
11 C3_C5
12 C4_C5
13 C4_C6
14 C5_C6
15 C6_C5
When I filter for C1 it works fine: I get C1_C2, C1_C3, C1_C4, C1_C5 and C1_C6.
But when I grep for C2, it also finds rows such as C1_C2, which I don't want. I only want the rows that start with C2: C2_C3, C2_C4, C2_C5, C2_C6. The same goes for C3, C4, C5 and C6.
My code to filter
C1 <- abc %>% filter(str_detect(Comparison, "C1"))
If you are looking for a base R alternative, you can use startsWith. Note that your Comparison column is a factor and startsWith doesn't work on factors, so you need to wrap the column in as.character. Change the prefix to 'C2', 'C3', etc. to select the other groups:
data[startsWith(as.character(data$Comparison), 'C1'),,drop=FALSE]
You can also work with dplyr using startsWith, like below:
data %>%
filter(startsWith(as.character(Comparison), 'C1'))
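To make the factor caveat concrete, here is a minimal base R sketch. The four-row data frame is a made-up stand-in for the question's data, and the names err, keep and c2 are only illustrative.

```r
# Toy data mirroring the question: a factor column with paired labels.
abc <- data.frame(Comparison = factor(c("C1_C2", "C1_C3", "C2_C3", "C2_C4")))

# startsWith() only accepts character vectors, so a factor raises an error.
# The error message is captured here instead of stopping the script.
err <- tryCatch(startsWith(abc$Comparison, "C1"),
                error = function(e) conditionMessage(e))

# After converting to character, the prefix test works as expected.
keep <- startsWith(as.character(abc$Comparison), "C2")
c2 <- abc[keep, , drop = FALSE]
```

drop = FALSE keeps the result a data frame even though only one column is present.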
Use ^ to indicate start of the string.
subset(abc, grepl('^C1', Comparison))
# Comparison
#1 C1_C2
#2 C1_C3
#3 C1_C4
#4 C1_C5
#5 C1_C6
With C2 :
subset(abc, grepl('^C2', Comparison))
# Comparison
#6 C2_C3
#7 C2_C4
#8 C2_C5
#9 C2_C6
In dplyr :
library(dplyr)
library(stringr)
abc %>% filter(str_detect(Comparison, '^C2'))
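The difference between the anchored and unanchored patterns can be checked on a small made-up vector. This is only a sketch in base R; the last step, which builds one subset per prefix with split, is an addition that is not in the answers above.

```r
abc <- data.frame(Comparison = c("C1_C2", "C1_C3", "C2_C3", "C2_C4", "C3_C4"))

# Unanchored: "C2" may occur anywhere in the string, so C1_C2 is included.
loose <- abc$Comparison[grepl("C2", abc$Comparison)]

# Anchored: "^C2" only matches at the start of the string.
strict <- abc$Comparison[grepl("^C2", abc$Comparison)]

# One subset per prefix, keyed by the text before the underscore.
by_prefix <- split(abc, sub("_.*$", "", abc$Comparison))
```

If the labels could include something like C10, a pattern such as "^C1_" would avoid matching it when filtering for C1.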
Related
I have a column in a dataframe in R that contains values such as
C22/00556,
C21/00445,
B22/00111,
C22-00679, etc.
I would like to split this into 2 columns named initial and number. The delimiter being "-" or "/".
As a result I would expect a column containing C22, C21, B22, etc and another column containing 00556, 00445 etc.
I am trying to use the separate function but I am struggling with the sep= part.
I have tried using sep= c("/","-") but this is not working and throws an error.
You could use separate from tidyr with a regular expression that matches either / or - (| means "or"), like this:
df <- data.frame(V1 = c("C22/00556", "C21/00445", "B22/00111", "C22-00679"))
library(tidyr)
df %>%
separate(V1, c("initial", "number"), sep = "/|-")
#> initial number
#> 1 C22 00556
#> 2 C21 00445
#> 3 B22 00111
#> 4 C22 00679
Created on 2023-01-05 with reprex v2.0.2
Using base R
read.table(text = chartr("/", "-", df$V1), sep = "-", header = FALSE,
col.names = c("initial", "number"), colClasses = "character")
initial number
1 C22 00556
2 C21 00445
3 B22 00111
4 C22 00679
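Another base R option, shown here only as a sketch, is to split on a character class that contains both delimiters. The variable names v1, parts and out are illustrative.

```r
v1 <- c("C22/00556", "C21/00445", "B22/00111", "C22-00679")

# "[/-]" matches a single "/" or "-"; each element splits into two pieces.
parts <- do.call(rbind, strsplit(v1, "[/-]"))

# Both columns stay character, so leading zeros in `number` are preserved.
out <- data.frame(initial = parts[, 1], number = parts[, 2])
```

This assumes every value contains exactly one delimiter; rows with more or fewer pieces would need extra handling.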
I came from a Python background and I am working in R with this data df.
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a
Using R, how can I remove the a at the end of the age values and modify the data in place?
(just like Python using inplace=True)
Can I attain it using
df$Age[which(df$Age == `a pattern`)] <- ""
This is a perfect use case for parse_number from the readr package (it is part of the tidyverse):
library(dplyr)
library(readr)
df %>%
mutate(age = parse_number(age))
name age
1 Anon1 52
2 Anon2 62
3 Anon3 44
4 Anon4 30
5 Anon5 110
data:
df <- structure(list(name = c("Anon1", "Anon2", "Anon3", "Anon4", "Anon5"
), age = c("52a", "62", "44a", "30", "110a")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5"))
You could use sub here:
df$age <- sub("a$", "", df$age)
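For the `$` to act as an end-of-string anchor, the pattern has to be interpreted as a regular expression. The short check below, on a made-up vector, compares that with fixed = TRUE, where "a$" is taken literally.

```r
age <- c("52a", "62", "44a", "30", "110a")

# Regular expression: an "a" at the very end of the string is removed.
cleaned <- sub("a$", "", age)

# Literal matching: the two-character text "a$" never occurs, so nothing changes.
unchanged <- sub("a$", "", age, fixed = TRUE)
```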
#A tidy solution
library(dplyr)
library(stringr)
df <- data.frame(name=c("anon1","anon2"),age=c("52","37a"))
df <- df %>%
mutate(age = str_extract(age,"^\\d+"))
df
name age
1 anon1 52
2 anon2 37
Here are two approaches. No packages are used.
1) We remove all non-digit characters; in a regular expression \D means non-digit. If we knew that only a could appear as a non-digit we could instead use "a" as the first argument to gsub, and if we knew it only appears once we could use sub instead of gsub.
Also it is easier to debug code if you don't overwrite variables since then you always know that a particular variable is in its original state. Instead assign the result to a new variable.
transform(DF, age = as.numeric(gsub("\\D", "", age)))
This could also be written using pipes:
transform(DF, age = age |> gsub(pattern = "\\D", replacement = "") |> as.numeric())
2) We can use scan specifying that a is a comment character.
transform(DF, age = scan(text = age, comment.char = "a", quiet = TRUE))
Note
Lines <- "
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a"
DF <- read.table(text = Lines)
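As a quick sanity check, both approaches can be run on a plain character vector. The vector below is copied from the age column; the last line is an added caution about what \D removes, not part of the original answer.

```r
age <- c("52a", "62", "44a", "30", "110a")

# Approach 1: strip every non-digit, then convert to numeric.
via_gsub <- as.numeric(gsub("\\D", "", age))

# Approach 2: let scan() treat "a" as the start of a comment.
via_scan <- scan(text = age, comment.char = "a", quiet = TRUE)

# Caution: \D also removes signs and decimal points.
stripped <- gsub("\\D", "", "-5.5")
```

Here stripped is "55", so approach 1 is only safe when the values are non-negative whole numbers.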
The inplace modifier in Python refers to making a change without creating a copy. The data.table package in R allows for this (called assignment by reference).
df <- read.table(text="
name age
1 Anon1 52a
2 Anon2 62
3 Anon3 44a
4 Anon4 30
5 Anon5 110a")
library(data.table)
library(stringi)
setDT(df)[, age:=stri_extract(age, regex='^\\d+')]
df
The first clause (setDT(df)) converts df to a data.table by reference (i.e., without making a copy), and the second clause ([, age:=...]) replaces the values in column age with ... also by reference.
I have a column in my data frame as shown below.
I want to keep the data in the pattern "\\d+Zimmer" and remove all the digits from the column such as "9586" and "927" in the picture.
I tried following gsub function.
gsub("[^\\d+Zimmer]", "", flat_cl_one$rooms)
But it removes all the digits, as below.
What Regex can I use to get the correct result? Thank You in Advance
as.numeric turns any value that contains non-numeric characters into NA (with a coercion warning), so the rows that are not NA are the purely numeric ones, and we replace those with blanks.
library(dplyr)
flat_cl_one %>%
mutate(rooms = ifelse(!is.na(as.numeric(rooms)), "", rooms))
Or we can use str_detect from stringr:
library(stringr)
flat_cl_one %>%
mutate(rooms = ifelse(str_detect(rooms, "Zimmer", negate = TRUE), "", rooms))
Output
rooms
1 647Zimmer
2 394Zimmer
3
4
5 38210Zimmer
We could do the same thing with filter if you wanted to actually remove those rows.
flat_cl_one %>%
filter(is.na(as.numeric(rooms)))
# rooms
#1 647Zimmer
#2 394Zimmer
#3 38210Zimmer
Data
flat_cl_one <- structure(list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389",
"38210Zimmer")), class = "data.frame", row.names = c(NA, -5L))
Just replace strings that don't contain the word "Zimmer"
flat_cl_one$room[!grepl("Zimmer", flat_cl_one$room)] <- ""
flat_cl_one
#> room
#> 1 3Zimmer
#> 2 2Zimmer
#> 3 2Zimmer
#> 4 3Zimmer
#> 5
#> 6
#> 7 3Zimmer
#> 8 6Zimmer
#> 9 2Zimmer
#> 10 4Zimmer
Data
flat_cl_one <- data.frame(room = c("3Zimmer", "2Zimmer", "2Zimmer", "3Zimmer",
"9586", "927", "3Zimmer", "6Zimmer",
"2Zimmer", "4Zimmer"))
Another possible solution, using stringr::str_extract (I am using @AndrewGillreath-Brown's data, for which I thank them):
library(tidyverse)
df <- structure(
list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")),
class = "data.frame",
row.names = c(NA, -5L))
df %>%
mutate(rooms = str_extract(rooms, "\\d+Zimmer"))
#> rooms
#> 1 647Zimmer
#> 2 394Zimmer
#> 3 <NA>
#> 4 <NA>
#> 5 38210Zimmer
The pattern [^\\d+Zimmer] is a negated bracket expression: it matches any single character that is not in the listed set, so it does not describe the sequence "digits followed by Zimmer".
Using gsub with perl = TRUE, you can use a negative lookahead (?!...) to assert that the string does not start with \\d+Zimmer, and then match one or more digits only if that assertion is true.
gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", flat_cl_one$rooms, perl = TRUE)
See an R demo.
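The lookahead version can be tried on the sample values from the earlier answer. The second line is a simpler alternative that is not from the original post: it keeps values matching the wanted pattern and blanks everything else.

```r
rooms <- c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")

# Negative lookahead: remove a leading run of digits only when the string
# does not start with digits followed by "Zimmer".
via_lookahead <- gsub("^(?!\\d+Zimmer\\b)\\d+\\b", "", rooms, perl = TRUE)

# Alternative sketch: keep whole-string matches, blank the rest.
via_ifelse <- ifelse(grepl("^\\d+Zimmer$", rooms), rooms, "")
```

The two agree on this data, but they differ in general: the ifelse version blanks anything that is not exactly digits plus Zimmer, while the gsub version only removes digit-only prefixes.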
How to separate a column into many, based on a symbol "|" and any additional spaces around this symbol if any:
input <- tibble(A = c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3"))
output <- tibble(B = c("Ae1 tt1", "Be1") , C = c("Ae2 tt2", "Be2"), D = c(NA, "Be3"))
I tried :
input %>%
separate(A, c("B","C","D"))
#separate(A, c("B","C","D"), sep = "|.")
#mutate(B = str_split(A, "*|")) %>% unnest
What is the syntax with regex ?
Solution from R - separate with specific symbol, vertical bare, | (and tidyr::separate() producing unexpected results) does not provide expected output and produces a warning:
input %>% separate(col=A, into=c("B","C","D"), sep = '\\|')
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 "Ae1 tt1 " " Ae2 tt2" <NA>
2 "Be1 " " Be2 " " Be3"
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
Using separate from tidyr with different length vectors does not seem related unfortunately.
You can use
output <- input %>%
separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
R test:
> input %>% separate(col=A, into=c("B","C","D"), sep="\\s*\\|\\s*", fill="right")
# A tibble: 2 x 3
B C D
<chr> <chr> <chr>
1 Ae1 tt1 Ae2 tt2 <NA>
2 Be1 Be2 Be3
The \s*\|\s* pattern matches a pipe char with any zero or more whitespace chars on both ends of the pipe.
The fill="right" argument fills with missing values on the right.
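The same pattern also works with base R's strsplit. The sketch below pads shorter rows with NA on the right, roughly mimicking fill = "right"; the column names B, C, D are taken from the question, and the hard-coded names assume at most three pieces per row.

```r
a <- c("Ae1 tt1 | Ae2 tt2", "Be1 | Be2 | Be3")

# Split on a pipe together with any whitespace around it.
pieces <- strsplit(a, "\\s*\\|\\s*")

# Pad every row to the widest length with NA on the right.
width <- max(lengths(pieces))
padded <- t(sapply(pieces, function(p) c(p, rep(NA_character_, width - length(p)))))

colnames(padded) <- c("B", "C", "D")
out <- as.data.frame(padded)
```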
I am having some trouble cleaning up my data. It consists of a list of sold houses. It is made up of the sell price, no. of rooms, m2 and the address.
As seen below the address is in one string.
head(DF, 3)
Address Price m2 Rooms
Petersvej 1772900 Hoersholm 10.000 210 5
Annasvej 2B2900 Hoersholm 15.000 230 4
Krænsvej 125800 Lyngby C 10.000 210 5
A Mivs Alle 119800 Hjoerring 1.300 70 3
The syntax for the address column is: road name, road no., followed by a 4-digit postal code and the city name (sometimes two words).
I also need to extract the postal code. I have been looking at the 'stringi' package but haven't been able to find any examples.
Any pointers are very much appreciated.
1) Use separate from tidyr to split Address into three fields, merging anything left over into the last one, then use separate again to split off the last 4 digits of the Number column created by the first separate.
library(dplyr)
library(tidyr)
DF %>%
separate(Address, into = c("Road", "Number", "City"), extra = "merge") %>%
separate(Number, into = c("StreetNo", "Postal"), sep = -4)
giving:
Road StreetNo Postal City Price m2 Rooms CITY
1 Petersvej 77 2900 Hoersholm 10 210 5 Hoersholm
2 Annasvej 121B 2900 Hoersholm 15 230 4 Hoersholm
3 Krænsvej 12 5800 Lyngby C 10 210 5 C
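The sep = -4 step, which cuts four characters from the right, can be sketched in base R with substr and nchar. The number vector below holds the intermediate Number values implied by the output above; the variable names are illustrative.

```r
number <- c("772900", "121B2900", "125800")
n <- nchar(number)

# Everything except the last four characters is the street number ...
street_no <- substr(number, 1, n - 4)

# ... and the last four characters are the postal code.
postal <- substr(number, n - 3, n)
```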
2) Alternately, insert commas between the subfields of Address and then use separate to split the subfields out. It gives the same result as (1) on the input shown in the Note below.
DF %>%
mutate(Address = sub("(\\S.*) +(\\S+)(\\d{4}) +(.*)", "\\1,\\2,\\3,\\4", Address)) %>%
separate(Address, into = c("Road", "Number", "Postal", "City"), sep = ",")
Note
The input DF in reproducible form is:
DF <-
structure(list(Address = structure(c(3L, 1L, 2L), .Label = c("Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C", "Petersvej 772900 Hoersholm"), class = "factor"),
Price = c(10, 15, 10), m2 = c(210L, 230L, 210L), Rooms = c(5L,
4L, 5L), CITY = structure(c(2L, 2L, 1L), .Label = c("C",
"Hoersholm"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))
Update
Added and fixed (2).
Check out the cSplit function from the splitstackshape package
library(splitstackshape)
df_new <- cSplit(df, splitCols = "Address", sep = " ")
#This will split your address column into 4 different columns split at the space
#you can then add an ifelse block to combine the last 2 columns to make up the city like
df_new$City <- ifelse(is.na(df_new$Address_4), as.character(df_new$Address_3), paste(df_new$Address_3, df_new$Address_4, sep = " "))
One way to do this is with regex.
In this instance you may use a simple regular expression which will match all alphabetical characters and space characters which lead to the end of the string, then trim the whitespace off.
library(stringr)
DF <- data.frame(Address=c("Petersvej 772900 Hoersholm",
"Annasvej 121B2900 Hoersholm",
"Krænsvej 125800 Lyngby C"))
DF$CITY <- str_trim(str_extract(DF$Address, "[a-zA-Z ]+$"))
This will give you the following output:
Address CITY
1 Petersvej 772900 Hoersholm Hoersholm
2 Annasvej 121B2900 Hoersholm Hoersholm
3 Krænsvej 125800 Lyngby C Lyngby C
The stringr package is a convenient choice for this kind of task because functions such as str_match return every capture group, which in this example could allow you to separate each component of the address with a single expression.
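As a sketch of the single-expression idea, the example below uses base R's regexec and regmatches to pull out all four components at once (stringr::str_match would return a similar matrix). The pattern is an assumption based on the address layout described in the question, and the column names are illustrative.

```r
# "\u00e6" is the letter ae in "Kraensvej", written as an escape for portability.
address <- c("Petersvej 772900 Hoersholm",
             "Annasvej 121B2900 Hoersholm",
             "Kr\u00e6nsvej 125800 Lyngby C")

# Groups: road name, street number, 4-digit postal code, city.
pattern <- "^(.*) +(\\S+)(\\d{4}) +(.*)$"

# regmatches() returns the full match followed by each capture group.
m <- regmatches(address, regexec(pattern, address))

# Drop the full match and stack the groups into a character matrix.
parts <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(parts) <- c("Road", "Number", "Postal", "City")
```

This assumes every address matches the pattern; a non-matching row would yield an empty element and would need separate handling.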