Convert letters in a column of strings to numbers in R

I am trying to solve the following: here is the top of my df
Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B ab c
5 B AB12
Col2 is a string column. I now want to convert the strings to unique numbers, based on the content of each string.
Like this:
Col1 Col2 Col3
1 Basic ABC 123
2 B ABCD 1234
3 B abc 272829
4 B ab c 2728029
5 B AB12 1212
...
As you can see, there can be CAPITAL LETTERS, numbers, lower-case letters, and spaces, which need to be converted to a specific numeric value. It doesn't matter what numbers are generated; they only need to be unique.
The difficult part is that I need static numeric IDs, but my df is dynamic.
Meaning: strings can be added or removed over time, but if, for example, the string "dog" is added, it gets an ID (e.g. "789") that was never and will never be used by another string. So the generated IDs are not influenced by the size of Col2, the position of strings in that column, or any ordering; only by the content of the string itself.
Help is much appreciated

If you are just mapping characters within some master vector, then perhaps this:
# master vector: a character's ID is its position in this vector
chrs <- c(LETTERS, letters, 0:9)
# split each string into characters, map each to its position, and paste the numbers back together
quux$Col3 <- sapply(strsplit(quux$Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = ""))
quux
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
or a dplyr variant, if you're already using it (but this varies very little):
library(dplyr)
quux %>%
  mutate(Col3 = sapply(strsplit(Col2, ""), function(z) paste(match(z, chrs, nomatch = 0L), collapse = "")))
# Col1 Col2 Col3
# 1 Basic ABC 123
# 2 B ABCD 1234
# 3 B abc 272829
# 4 B ab c 2728029
# 5 B AB12 125455
However, as MrFlick suggested, perhaps what you really need is a hashing function?
sapply(quux$Col2, digest::digest, algo = "sha256")
# ABC
# "8fe32130ce14a3fb071473f9b718e403752f56c0f13081943d126ffb28a7b923"
# ABCD
# "942a0d444d8cf73354e5316517909d5f34b17963214a8f5b271375fe1da43013"
# abc
# "9f7b8da9f3abe2caaf5212f6b224448706de57b3c7b5dda916ee8d6005d9f24b"
# ab c
# "a3f1c49979af0fffa22f68028a42d302e0a675798ac4ac8a76bed392880af8f2"
# AB12
# "9ffbe9825833ab3c6b183f9986ab194a7aefcc06f5c940549a2c799dd4cd15b1"
Data
quux <- structure(list(Col1 = c("Basic", "B", "B", "B", "B"), Col2 = c("ABC", "ABCD", "abc", "ab c", "AB12")), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")

Ah, r2 beat me, but same concept:
dd <- read.table(header = TRUE, text = "a Col1 Col2
1 Basic ABC
2 B ABCD
3 B abc
4 B 'ab c'
5 B AB12")
dd$Col2
f <- function(x) {
  # split each string into single characters
  x <- strsplit(x, '')
  # map each character to a number via factor labels: space -> 0, A-Z -> 1:26,
  # a-z -> 27:52, and digits keep their own value (so "AB12" becomes "1212")
  sapply(x, function(y)
    factor(y, c(' ', LETTERS, letters, 0:9), c(0, 1:26, 27:52, 0:9)) |>
      as.character() |> paste0(`...` = _, collapse = ''))
}
f(dd$Col2)
# [1] "123" "1234" "272829" "2728029" "1212"

Related

Selecting all but the first element of a vector in data frame

I have some data that looks like this:
X1
A,B,C,D,E
A,B
A,B,C,D
A,B,C,D,E,F
I want to generate one column that holds the first element of each vector ("A"), and another column that holds all the rest of the values ("B","C" etc.):
X1 Col1 Col2
A,B,C,D,E A B,C,D,E
A,B A B
A,B,C,D A B,C,D
A,B,C,D,E,F A B,C,D,E,F
I have tried the following:
library(dplyr)
testdata <- data.frame(X1 = c("A,B,C,D,E",
                              "A,B",
                              "A,B,C,D",
                              "A,B,C,D,E,F")) %>%
  mutate(Col1 = sapply(strsplit(X1, ","), "[", 1),
         Col2 = sapply(strsplit(X1, ","), "[", -1))
However I cannot seem to get rid of the pesky vector brackets around the values in Col2. Any way of doing this?
You can use tidyr::separate with extra = "merge":
testdata %>%
  tidyr::separate(X1, into = c("Col1","Col2"), sep = ",", extra = "merge", remove = F)
X1 Col1 Col2
1 A,B,C,D,E A B,C,D,E
2 A,B A B
3 A,B,C,D A B,C,D
4 A,B,C,D,E,F A B,C,D,E,F
A possible solution, using tidyr::separate:
library(tidyverse)
df <- data.frame(
  stringsAsFactors = FALSE,
  X1 = c("A,B,C,D,E", "A,B", "A,B,C,D", "A,B,C,D,E,F")
)
df %>%
  separate(X1, into = str_c("col", 1:2), sep = "(?<=^.),", remove = F)
#> X1 col1 col2
#> 1 A,B,C,D,E A B,C,D,E
#> 2 A,B A B
#> 3 A,B,C,D A B,C,D
#> 4 A,B,C,D,E,F A B,C,D,E,F
Try the base R code below using sub + read.table
cbind(
  df,
  read.table(
    text = sub(",", " ", df$X1)
  )
)
which gives
X1 V1 V2
1 A,B,C,D,E A B,C,D,E
2 A,B A B
3 A,B,C,D A B,C,D
4 A,B,C,D,E,F A B,C,D,E,F
You can use the str_sub() function as follows:
> df
# A tibble: 4 x 1
X1
<chr>
1 A,B,C,D,E
2 A,B
3 A,B,C,D
4 A,B,C,D,E,F
> df %>% mutate(X2 = str_sub(X1, 1,1), X3 = str_sub(X1, 3))
# A tibble: 4 x 3
X1 X2 X3
<chr> <chr> <chr>
1 A,B,C,D,E A B,C,D,E
2 A,B A B
3 A,B,C,D A B,C,D
4 A,B,C,D,E,F A B,C,D,E,F
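A base R alternative not shown above (a minimal sketch I am adding, assuming the testdata frame from the question): sub() can peel off the text before and after the first comma directly, with no splitting.
# first element = text before the first comma; rest = everything after it
testdata$Col1 <- sub(",.*", "", testdata$X1)
testdata$Col2 <- sub("^[^,]*,", "", testdata$X1)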

How to remove all the variables named .x, .y?

I have a list of data.frames (lst1). Each data.frame in lst1 has some variables with names like test.x, test.y, try.x, try.y, etc.
I want to filter out those variables, which were created by merging datasets without first dropping the original variables (try, test, etc.). How should I filter them out now?
Thanks.
You can also try this:
#Data
List <- list(A = data.frame(a = 1, b = 5, test.x = NA, test.y = 5),
             B = data.frame(a = 5, b = 6, test.x = NA, try.x = 7))
#Remove
myfun <- function(x)
{
  # match column names ending in .x or .y (the dot is escaped so it is literal)
  i <- which(grepl('\\.x$|\\.y$', names(x)))
  x <- x[, -i]
  return(x)
}
#Apply
List2 <- lapply(List, myfun)
Output:
List2
$A
a b
1 1 5
$B
a b
1 5 6
Here's a tidyverse approach:
We can use the dplyr::select function to select only the columns we want. matches() allows us to select columns using regular expressions. \\.[xy]$ matches columns that contain a period followed by x or y and $ anchors the match to the end of the string.
The purrr::map function allows us to apply the selection to each list element. ~ defines a formula which is automatically converted to a function.
library(tidyverse)
lst2 <- lst1 %>%
  map(~dplyr::select(., -matches("\\.[xy]$")))
map(lst2, head, 2)
#[[1]]
# ID name
#1 1 A
#2 2 B
#[[2]]
# ID name
#1 1 A
#2 2 B
#[[3]]
# ID name
#1 1 A
#2 2 B
#[[4]]
# ID name
#1 1 A
#2 2 B
#[[5]]
# ID name
#1 1 A
#2 2 B
Sample Data:
lst1 <- replicate(5,data.frame(ID = 1:15, name = LETTERS[1:15], test.x = runif(15), test.y = runif(15)),simplify = FALSE)
map(lst1, head, 2)
#[[1]]
# ID name test.x test.y
#1 1 A 0.03772391 0.2630905
#2 2 B 0.11844048 0.2929392
#[[2]]
# ID name test.x test.y
#1 1 A 0.398029 0.5151159
#2 2 B 0.348489 0.9534869
#[[3]]
# ID name test.x test.y
#1 1 A 0.7447383 0.6862136
#2 2 B 0.3623562 0.7542699
#
#[[4]]
# ID name test.x test.y
#1 1 A 0.9341495 0.8660333
#2 2 B 0.8383039 0.6299427
#[[5]]
# ID name test.x test.y
#1 1 A 0.02662444 0.04502225
#2 2 B 0.29855214 0.46189116
In base R, we can use endsWith
lapply(List, function(x) x[!(endsWith(names(x), '.x') | endsWith(names(x), '.y'))])
-output
#$A
# a b
#1 1 5
#$B
# a b
#1 5 6
data
List <- list(A = structure(list(a = 1, b = 5, test.x = NA, test.y = 5), class = "data.frame", row.names = c(NA,
-1L)), B = structure(list(a = 5, b = 6, test.x = NA, try.x = 7), class = "data.frame", row.names = c(NA,
-1L)))

How can sentences contained in a cell be split into different rows in R? [closed]

I tried several times and it does not work.
How can I split sentences contained in a cell into different rows while keeping the rest of the values?
Example:
Dataframe df has 20 columns.
Row j, Column i contains some comments which are separated by " | "
I want to have a new dataframe df2 which increases the number of rows depending on the number of sentences.
This means, if cell j,i has Sentence A | Sentence B
Row j, Column i has Sentence A
Row j+1, Column i has Sentence B
Columns 1 to i-1 and i+1 to 20 have the same value in rows j and j+1.
I do not know if this has an easy solution.
Thank you very much.
We could use cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'col3', sep="\\|", "long", fixed = FALSE)
# col1 col2 col3
#1: a 1 fitz
#2: a 1 buzz
#3: b 2 foo
#4: b 2 bar
#5: c 3 hello world
#6: c 3 today is Thursday
#7: c 3 its 2:00
#8: d 4 fitz
data
df <- structure(list(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, 3,
4), col3 = c("fitz|buzz", "foo|bar", "hello world|today is Thursday | its 2:00",
"fitz")), class = "data.frame", row.names = c(NA, -4L))
Here is a solution using 3 tidyverse packages that accounts for an unknown maximum number of comments
library(dplyr)
library(tidyr)
library(stringr)
# Create a function to calculate the max number of comments per observation
# within df$col3 and create a string of unique "names"
cols <- function(x) {
  cmts <- str_count(x, "([|])")
  max_cmts <- max(cmts, na.rm = TRUE) + 1
  features <- c(sprintf("V%02d", seq(1, max_cmts)))
}
# Create the data
df1 <- data.frame(col1 = c("a", "b", "c", "d"),
                  col2 = c(1, 2, 3, 4),
                  col3 = c("fitz|buzz", NA,
                           "hello world|today is Thursday | its 2:00|another comment|and yet another comment",
                           "fitz"),
                  stringsAsFactors = FALSE)
# Generate the desired output
df2 <- separate(df1, col3, into = cols(x = df1$col3),
                sep = "([|])", extra = "merge", fill = "right") %>%
  pivot_longer(cols = cols(x = df1$col3), values_to = "comments",
               values_drop_na = TRUE) %>%
  select(-name)
Which results in
df2
# A tibble: 8 x 3
col1 col2 comments
<chr> <dbl> <chr>
1 a 1 "fitz"
2 a 1 "buzz"
3 c 3 "hello world"
4 c 3 "today is Thursday "
5 c 3 " its 2:00"
6 c 3 "another comment"
7 c 3 "and yet another comment"
8 d 4 "fitz"

Turning a text column into a vector in r

I want to see whether the text column has elements outside the specified values of "a" and "b"
specified_value <- c("a", "b")
df <- data.frame(key = c(1, 2, 3, 4), text = c("a,b,c", "a,d", "1,2", "a,b"))
df_out <- data.frame(key = c(1, 2, 3, 4), text = c("c", "d", "1,2", NA))
This is what I have tried:
df <- df %>% mutate(text_vector = strsplit(text, split = ","),
                    extra = text_vector[which(!text_vector %in% specified_value)])
But this doesn't work, any suggestions?
We can split 'text' by the delimiter , with separate_rows, then, grouped by 'key', get the elements that are not in 'specified_value' with setdiff and paste them together (toString), and finally do a join to get the other columns from the original dataset.
library(dplyr) # >= 1.0.0
library(tidyr)
df %>%
  separate_rows(text) %>%
  group_by(key) %>%
  summarise(extra = toString(setdiff(text, specified_value))) %>%
  left_join(df) %>%
  mutate(extra = na_if(extra, ""))
# A tibble: 4 x 3
# key extra text
# <dbl> <chr> <chr>
#1 1 c a,b,c
#2 2 d a,d
#3 3 1, 2 1,2
#4 4 <NA> a,b
Using setdiff.
df$outside <- sapply({
  x <- lapply(strsplit(df$text, ","), setdiff, specified_value)
  replace(x, lengths(x) == 0, NA)},
  paste, collapse = ",")
df
# key text outside
# 1 1 a,b,c c
# 2 2 a,d d
# 3 3 1,2 1,2
# 4 4 a,b NA
Data:
df <- structure(list(key = c(1, 2, 3, 4), text = c("a,b,c", "a,d",
"1,2", "a,b")), class = "data.frame", row.names = c(NA, -4L))
specified_value <- c("a", "b")
Use stringi::stri_split_fixed:
library(stringi)
!all(stri_split_fixed("a,b", ",", simplify=T) %in% specified_value) #FALSE
!all(stri_split_fixed("a,b,c", ",", simplify=T) %in% specified_value) #TRUE
An option using regex without splitting the data on comma :
#Collapse the specified_value in one string and remove from text
df$text1 <- gsub(paste0(specified_value, collapse = "|"), '', df$text)
#Remove extra commas
df$text1 <- gsub('(?<![a-z0-9]),', '', df$text1, perl = TRUE)
df
# key text text1
#1 1 a,b,c c
#2 2 a,d d
#3 3 1,2 1,2
#4 4 a,b

Remove period and spaces within column headings nested in a list of data frames

I have a list of data frames:
mylist <- list(df1 = data.frame(var1 = c("a","b","c"), var.2 = c("a","b","c")),
               df2 = data.frame(var1 = c("a","b","c"), var..2 = c("a","b","c")))
I would like to remove periods and spaces within the column headings of each data frame within the list. The output would look like:
mylist <- list(df1 = data.frame(var1 = c("a","b","c"), var2 = c("a","b","c")),
               df2 = data.frame(var1 = c("a","b","c"), var2 = c("a","b","c")))
I have tried the following:
cleandf <- lapply(ldf, function(x) x[(colnames(x) <- gsub(".", "",
colnames(x), fixed = TRUE))])
With Base R setNames:
lapply(mylist, function(x) setNames(x, gsub("\\.", "", names(x))))
or with tidyverse:
library(tidyverse)
map(mylist, ~rename_all(.x, str_replace_all, "\\.", ""))
Output:
$df1
var1 var2
1 a a
2 b b
3 c c
$df2
var1 var2
1 a a
2 b b
3 c c
I rename the columns in each data frame and then return the data frame. As explained here, double backslashes are needed as escape characters for the period.
lapply(mylist, function(x){names(x) <- gsub("\\.", "", names(x));x})
# $`df1`
# var1 var2
# 1 a a
# 2 b b
# 3 c c
#
# $df2
# var1 var2
# 1 a a
# 2 b b
# 3 c c
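The question also mentions spaces in the headings, while the answers above only strip periods. A small variation of the base R approach (a sketch I am adding, assuming a character class covers the intent) removes both:
# drop both periods and spaces from the column names of every data frame
lapply(mylist, function(x) setNames(x, gsub("[. ]", "", names(x))))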
