Rename columns of R dataframe with tidyselect and regular expression

I have a data frame whose column names combine a numbering code with some long text:
A1. Good day
A1a. Have a nice day
......
Z7d. Some other titles
I want to keep only the codes ("A1.", "A1a.", "Z7d."), removing both the preceding number and the trailing text. How can I do this with tidyselect and a regex?

You can use this regex -
names(df) <- sub('\\d+\\.\\s+([A-Za-z0-9]+).*', '\\1', names(df))
names(df)
#[1] "A1" "A1a" "Z7d"
The same regex can also be used in rename_with if you want a tidyverse answer.
library(dplyr)
df %>% rename_with(~sub('\\d+\\.\\s+([A-Za-z0-9]+).*', '\\1', .))
# A1 A1a Z7d
#1 0.5755992 0.4147519 -0.1474461
#2 0.1347792 -0.6277678 0.3263348
#3 1.6884930 1.3931306 0.8809109
#4 -0.4269351 -1.2922231 -0.3362182
#5 -2.0032113 0.2619571 0.4496466
data
df <- structure(list(`1. A1. Good day` = c(0.575599213383783, 0.134779160673435,
1.68849296209512, -0.426935114884432, -2.00321125417319), `2. A1a. Have a nice day` = c(0.414751904860513,
-0.627767775889949, 1.39313055331098, -1.29222310608057, 0.261957078465535
), `99. Z7d. Some other titles` = c(-0.147446140558093, 0.326334824433201,
0.880910933597998, -0.336218174873965, 0.449646567320979)),
class = "data.frame", row.names = c(NA, -5L))
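The question asks to keep the trailing dot ("A1."), whereas the output above drops it. A minimal sketch, assuming every column name follows the "number. code. text" layout of the example data; old_names and new_names are illustrative names that do not appear in the original answer:

```r
# Column names copied from the example data above
old_names <- c("1. A1. Good day", "2. A1a. Have a nice day",
               "99. Z7d. Some other titles")

# Moving the literal dot inside the capture group keeps it in the result
new_names <- sub("^\\d+\\.\\s+([A-Za-z0-9]+\\.).*$", "\\1", old_names)
print(new_names)
```

The same pattern should also work as the function passed to rename_with, although that variant is not tested here.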

We can use str_extract
library(stringr)
names(df) <- str_extract(names(df), "(?<=\\.\\s)[^.]+")
names(df)
#[1] "A1" "A1a" "Z7d"
data
df <- structure(list(`1. A1. Good day` = c(0.575599213383783, 0.134779160673435,
1.68849296209512, -0.426935114884432, -2.00321125417319), `2. A1a. Have a nice day` = c(0.414751904860513,
-0.627767775889949, 1.39313055331098, -1.29222310608057, 0.261957078465535
), `99. Z7d. Some other titles` = c(-0.147446140558093, 0.326334824433201,
0.880910933597998, -0.336218174873965, 0.449646567320979)),
class = "data.frame", row.names = c(NA, -5L))
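If stringr is not available, a similar extraction can be sketched in base R with regexpr and regmatches. This assumes every name contains a match, because regmatches silently drops elements without one; old_names is an illustrative stand-in for names(df):

```r
# Column names copied from the example data above
old_names <- c("1. A1. Good day", "2. A1a. Have a nice day",
               "99. Z7d. Some other titles")

# perl = TRUE is required for the lookbehind (?<=...)
first_match <- regexpr("(?<=\\.\\s)[^.]+", old_names, perl = TRUE)
codes <- regmatches(old_names, first_match)
print(codes)
```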

Related

how to convert a price character vector to a numeric vector

I should say up front that I am a beginner.
I have a single-column (character) data frame for which I would like to find the minimum, maximum
and average price. The min() and max() functions also work on a character vector, but mean()
and median() need a numeric vector. I tried replacing the comma with a period,
but that breaks down for prices in the thousands, where the period is the thousands separator. How can I do this?
>price
Price
1 1.651
2 2.229,00
3 1.899,00
4 2.160,50
5 1.709,00
6 1.723,86
7 1.770,99
8 1.774,90
9 1.949,00
10 1.764,12
This is the data frame. Thanks in advance to anyone who can help.

Remove the . (thousands separator) first, then replace , with . and convert the values to numeric.
In base R using gsub -
df <- transform(df, Price = as.numeric(gsub(',', '.',
gsub('.', '', Price, fixed = TRUE), fixed = TRUE)))
# Price
#1 1651.00
#2 2229.00
#3 1899.00
#4 2160.50
#5 1709.00
#6 1723.86
#7 1770.99
#8 1774.90
#9 1949.00
#10 1764.12
You can also use the parse_number function from readr.
library(readr)
df$Price <- parse_number(df$Price,
locale = locale(grouping_mark = ".", decimal_mark = ','))
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
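Once the column is numeric, the summary statistics from the question follow directly. A minimal base R sketch, assuming every price uses "." for thousands and "," for decimals; the helper name parse_eu_price is hypothetical:

```r
# Hypothetical helper: convert strings such as "2.160,50" to numeric
parse_eu_price <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)   # drop thousands separators first
  x <- sub(",", ".", x, fixed = TRUE)   # then turn the decimal comma into a dot
  as.numeric(x)
}

prices <- c("1.651", "2.229,00", "2.160,50")
values <- parse_eu_price(prices)

price_min  <- min(values)
price_max  <- max(values)
price_mean <- mean(values)
```

The order of the two replacements matters: swapping the comma first would leave two kinds of dots that can no longer be told apart.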

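library(rvest)  # provides read_html(), html_nodes(), html_text() and the pipe
url <- "https://www.shoppydoo.it/prezzi-notebook-mwp72t$2fa.html?src=user_search"
page <- read_html(url)
price <- page %>% html_nodes(".price") %>% html_text() %>% data.frame()
colnames(price) <- "Price"
price$Price <- gsub("da ", "", price$Price)
price$Price <- gsub("€", "", price$Price)
# fixed = TRUE treats "." literally; as a regex it would match every character
price$Price <- gsub(".", "", price$Price, fixed = TRUE)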
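A note on removing dots with gsub: in a regular expression an unescaped "." matches any character, so the pattern has to be escaped or passed with fixed = TRUE. A small sketch on a single illustrative value:

```r
x <- "1.651"

as_regex   <- gsub(".", "", x)                # "." matches every character
as_literal <- gsub(".", "", x, fixed = TRUE)  # "." is taken literally
escaped    <- gsub("\\.", "", x)              # escaped dot, same effect
```

Here as_regex ends up as an empty string, while the other two keep the digits.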

We could use chartr in base R
df$Price <- as.numeric(sub(",", "", chartr(".,", ",.", df$Price)))
data
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
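To see what chartr contributes, the intermediate step can be inspected separately. A short sketch on two of the prices; note that sub removes only the first comma, so this assumes no value reaches the millions:

```r
x <- c("1.651", "2.229,00")

# chartr swaps characters one-for-one: "." becomes "," and "," becomes "."
swapped <- chartr(".,", ",.", x)

# Remove the (now comma) thousands separator and convert
values <- as.numeric(sub(",", "", swapped, fixed = TRUE))
```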

What is the best way to retrieve a column of a data.frame with its column name?

If I use df$columnName, I only get a vector, which no longer carries the name of the column.
That is, names(df$columnName) returns NULL.
I could use df["columnName"], but this is somewhat inconvenient because I have to pass the column name as a character string.

I prefer dplyr's select().
library('dplyr')
data = df %>% select(columnName)
This returns a one-column data frame.

Turning @Edo's great comment into an answer: we may use subset.
subset(x=dat, subset=, select=X1)
or short:
subset(dat,,X1)
# X1
# 1 1
# 2 2
# 3 3
Data:
dat <- structure(list(X1 = 1:3, X2 = 4:6, X3 = 7:9, X4 = 10:12), class = "data.frame", row.names = c(NA,
-3L))
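For comparison, here is a base R sketch of how the different ways of indexing behave with respect to the column name. It uses a smaller illustrative data frame than the one above; drop = FALSE is what keeps the result a data frame in the two-index form:

```r
dat <- data.frame(X1 = 1:3, X2 = 4:6)

col_vec <- dat$X1                     # plain vector, name is lost
col_df1 <- dat["X1"]                  # one-column data frame
col_df2 <- dat[, "X1", drop = FALSE]  # same result with row/column indexing

names(col_vec)  # NULL
names(col_df1)  # "X1"
```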

subset df according nested list while there is a white space

I have a data frame and I would like to subset it according to specific values. When I tried to do it, I ran into a problem caused by the white space inside the values in sample_df$mentions.
I used this script for subsetting the data frame:
sample_list <- list()
for (i in colnames(sample_name)){
sample_list <- sapply(sample_df$mentions, function(x)any(x %in% sample_name[[i]]))
new_sample_df <- sample_df[sample_list,]
}
I tried the strsplit function to get rid of the space, but it created other problems.
sample_df$mentions <- strsplit(as.character(sample_df$mentions),"[[:space:]]")
Thank you for your help in advance.
My expected outcome should be like this:
mentions screen_name
5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
sample_name reproducible data:
sample_name <- structure(list(Name = structure(2:1, .Label = c("hamzayerlikaya",
"SSSBBL777"), class = "factor")), row.names = c(NA, -2L), class = "data.frame")
sample_df reproducible data:
sample_df <- structure(list(mentions = list(character(0), "srgnsnmz92", character(0),
"Berivan_Aslan_", c("islambey1453", " hamzayerlikaya", " tahaayhan",
" hidoturkoglu15"), character(0), "themarginale", character(0),
character(0), c("nurhandnci", " SSSBBL777", " serkanacar007",
" Chequevera06", " kubilayy81")), screen_name = c("SaadetYakar",
"beraydogru", "EL_Turco_DLC", "hebunagel", "ak_Furkan54", "zaferakyol011",
"melmitem", "mobbingabla", "BekarKronik", "tanrica_gaia")), row.names = c(NA,
10L), class = "data.frame")

We can loop over the values of 'Name', pass each one to grepl as the pattern, Reduce the resulting logical vectors into a single one with `|`, and use it to subset the rows of 'sample_df'
sample_df[Reduce(`|`, lapply(as.character(sample_name$Name),
grepl, x = sample_df$mentions)),]
# mentions screen_name
#5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
NOTE: This would work with any length of 'Name' column
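The lapply/Reduce combination can be illustrated on a plain character vector. This is a sketch with made-up values (texts and patterns are illustrative names): each pattern yields one logical vector, and Reduce folds them together with an element-wise OR.

```r
texts    <- c("alpha beta", "gamma", "beta delta")
patterns <- c("alpha", "delta")

# One logical vector per pattern; x is named so each pattern fills `pattern`
hits <- lapply(patterns, grepl, x = texts)

# Element-wise OR across all vectors, regardless of how many patterns there are
keep <- Reduce(`|`, hits)
selected <- texts[keep]
```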
Another option is regex_inner_join
library(fuzzyjoin)
library(tidyverse)
regex_inner_join(sample_df, sample_name, by = c("mentions" = "Name")) %>%
select(mentions, screen_name)
# mentions screen_name
#1 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#2 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia

Since mentions is a list we can use sapply and select only those rows in sample_df where any of the mentions has Name in it.
sample_df[sapply(sample_df$mentions, function(x) any(grepl(pattern, x))), ]
# mentions screen_name
#5 islambey1453, hamzayerlikaya, tahaayhan, hidoturkoglu15 ak_Furkan54
#10 nurhandnci, SSSBBL777, serkanacar007, Chequevera06, kubilayy81 tanrica_gaia
where pattern is
pattern = paste0("\\b", sample_name$Name, "\\b", collapse = "|")
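Since the original problem was the leading white space, another possibility is to trim each element and compare with %in% for exact matches. This is a sketch on a shortened, illustrative version of the data, not a replacement for the answers above; mentions and wanted are stand-ins for sample_df$mentions and sample_name$Name:

```r
mentions <- list(character(0),
                 c("islambey1453", " hamzayerlikaya"),
                 "themarginale",
                 c("nurhandnci", " SSSBBL777"))
wanted <- c("SSSBBL777", "hamzayerlikaya")

# trimws removes the leading space; any() of an empty vector is FALSE
keep <- vapply(mentions, function(x) any(trimws(x) %in% wanted), logical(1))
```

With the real data the same logical vector could be used as sample_df[keep, ], after converting the factor Name with as.character.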

Remove unnecessary symbols in the data in R

That's my dataset
1.abc
2.def
3.2354
4.. $.?,
How can I delete the observations that contain only digits, only symbols (such as points and commas), or any mix of symbols and digits (e.g. 1#5??%)? I also want to drop words in the text that have fewer than two letters.

We can use str_count to count the letters in each value and subset the dataset, keeping rows with more than two letters
library(stringr)
library(dplyr)
df1 %>%
filter(str_count(v1, "[[:alpha:]]") > 2)
Or with gsub to remove any character that is not a letter and count the number of characters with nchar to create a logical index for subsetting
subset(df1, nchar(gsub("[^[:alpha:]]+", "", v1))>2)
# v1
#1 1.abc
#2 2.def
data
df1 <- structure(list(v1 = c("1.abc", "2.def", "3.2354", "4.. $.?,")),
.Names = "v1", class = "data.frame", row.names = c(NA, -4L))
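The per-row letter counts that drive the filter can also be computed in base R, which makes the threshold easier to check. A minimal sketch using the same four values; letter_counts and kept are illustrative names:

```r
v1 <- c("1.abc", "2.def", "3.2354", "4.. $.?,")

# gregexpr finds every letter; regmatches returns them; lengths counts per element
letter_counts <- lengths(regmatches(v1, gregexpr("[[:alpha:]]", v1)))

kept <- v1[letter_counts > 2]
```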

Removing the special symbols in data.frame column values

I have two data frame each with a column Name
df1:
name
#one2
!iftwo
there_2_go
come&go
df1 = structure(list(name = c("#one2", "!iftwo", "there_2_go", "come&go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
df2:
name
One2
IfTwo#
there-2-go
come.go
df2 = structure(list(name = c("One2", "IfTwo#", "there-2-go", "come.go")),.Names = c("name"), row.names = c(NA, -4L), class = "data.frame")
Comparing the two data frames with %in% is cumbersome because of the special symbols. Removing the symbols with stringr could help, but how exactly can we combine stringr functions with %in% and display the mismatches between them?
I have already used mutate() with tolower() to convert everything to lowercase, as follows:
df1<-mutate(df1,name=tolower(df1$name))
df2<-mutate(df2,name=tolower(df2$name))
Current output of comparison:
df2[!(df2 %in% df1),]
[1] "one2" "iftwo#" "there-2-go" "come.go"
Expected output as essentially the contents are same but with special symbols:
df2[!(df2 %in% df1),]
character(0)
Question : How do we ignore the symbols in the contents of the Frame

Here it is in a function,
f1 <- function(df1, df2){
i1 <- tolower(gsub('[[:punct:]]', '', df1$name))
i2 <- tolower(gsub('[[:punct:]]', '', df2$name))
d1 <- sapply(i1, function(i) grepl(paste(i2, collapse = '|'), i))
return(!d1)
}
f1(df1, df2)
# one2 iftwo there2go comego
# FALSE FALSE FALSE FALSE
#or use it for indexing,
df2[f1(df1, df2),]
#character(0)
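Because grepl matches substrings, the function above would also treat "one2" as found inside a longer name such as "one23". If exact matches are wanted, a sketch with %in% on the normalized names may be closer to the question; the helper name normalize is hypothetical:

```r
# Hypothetical helper: strip punctuation and lowercase
normalize <- function(x) tolower(gsub("[[:punct:]]", "", x))

n1 <- normalize(c("#one2", "!iftwo", "there_2_go", "come&go"))
n2 <- normalize(c("One2", "IfTwo#", "there-2-go", "come.go"))

# Names in the second set that have no exact counterpart in the first
mismatch <- n2[!(n2 %in% n1)]
```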
