Subset a dataframe if a $ symbol exists in a string column (R)

I have a dataframe with a time column and a string column. I want to subset this dataframe so that I only keep the rows in which the string column contains a $ symbol somewhere in it.
After subsetting, I want to clean the string column so that it only contains the characters after the $ symbol, up to the next space or symbol.
df <- data.frame("time" = c(1:10),
                 "string" = c("$ABCD test", "test", "test $EFG test",
                              "$500 test", "$HI/ hello", "test $JK/",
                              "testing/123", "$MOO", "$abc", "123"))
I want the final output to be:
Time string
1 ABCD
3 EFG
4 500
5 HI
6 JK
8 MOO
9 abc
It only keeps rows that have a $ in the string column, and then keeps just the characters after the $ symbol, up to the next space or symbol.
I have had some success with sub simply to pull out the string, but haven't been able to apply that to the df and subset it. Thanks for your help.

Until someone comes up with pretty regex solutions, here is my take:
# subset for $ signs and convert to character class
res <- df[grepl("$", df$string, fixed = TRUE), ]
res$string <- as.character(res$string)

# split on non-alphanumeric and non-$ characters, grab the token with $, then remove the $
res$clean <- sapply(strsplit(res$string, split = "[^a-zA-Z0-9$']", perl = TRUE),
                    function(i){
                      x <- i[grepl("$", i, fixed = TRUE)]
                      # in case there is more than one $:
                      # x <- i[grepl("$", i, fixed = TRUE)][1]
                      gsub("$", "", x, fixed = TRUE)
                    })
res
# time string clean
# 1 1 $ABCD test ABCD
# 3 3 test $EFG test EFG
# 4 4 $500 test 500
# 5 5 $HI/ hello HI
# 6 6 test $JK/ JK
# 8 8 $MOO MOO
# 9 9 $abc abc

We can do this with regexpr/regmatches, extracting only the substring that follows a $:
i1 <- grep("$", df$string, fixed = TRUE)
transform(df[i1,], string = regmatches(string, regexpr("(?<=[$])\\w+", string, perl = TRUE)))
# time string
#1 1 ABCD
#3 3 EFG
#4 4 500
#5 5 HI
#6 6 JK
#8 8 MOO
#9 9 abc
Or with the tidyverse syntax
library(tidyverse)
df %>%
  filter(str_detect(string, fixed("$"))) %>%
  mutate(string = str_extract(string, "(?<=[$])\\w+"))
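Since the question mentions having some success with sub, the same result can also be reached with a single sub() call on the filtered rows. This is just a sketch, assuming the token of interest is the first run of letters/digits after the first $:
# keep only the rows that contain a literal "$"
res <- df[grepl("$", df$string, fixed = TRUE), ]
# drop everything up to the first "$", keep the following alphanumeric run, drop the rest
res$string <- sub(".*?\\$([[:alnum:]]+).*", "\\1", res$string, perl = TRUE)
res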

Related

R using grepl across multiple dataframes in a list

I have a list of dataframes that each contain multiple of the same columns. In one of the columns, there are multiple instances where a row just contains "[]". My goal is to replace these instances with a blank.
I've attempted to do so via the map function and grepl. While it runs, there is no change to the output. Am I going in the right direction here?
Please note that I differentiate between "[]" and "[value]": I only want to replace the empty brackets with blanks.
My code below:
first_column <- c("1", "2", "3","4")
second_column <- c("value1", "value2","[]","[value]")
first_column_2 <- c("5", "6", "7","8")
second_column_2 <- c("value1", "[]","[]","[value2]")
first_column_3<- c("9", "10", "11","12")
second_column_3 <- c("[]", "[value2]","[]","[]")
df_1 <- data.frame(first_column,second_column)
df_2 <- data.frame(first_column_2,second_column_2)
df_3 <- data.frame(first_column_3,second_column_3)
df_list <- list(df_1,df_2,df_3)
var <- c(2)
df_list <- map(df_list, ~.x[!grepl("[[]",var),])
Thanks!
We can use lapply and gsub to accomplish this. grepl returns true or false if the pattern matches, whereas gsub allows you to replace matches with something else. Note that instead of specifying an empty string (''), you could just as easily specify NA, but that will depend on your definition of "blank".
Here I use base R's lapply, which in this case is equivalent to purrr::map (even the syntax is interchangeable here).
library(dplyr)  # needed for mutate() and across() used inside the anonymous function

data <- lapply(df_list, function(x) {
  x %>%
    mutate(across(where(is.character), ~ gsub('\\[\\]', '', .x)))
})
data
[[1]]
first_column second_column
1 1 value1
2 2 value2
3 3
4 4 [value]
[[2]]
first_column_2 second_column_2
1 5 value1
2 6
3 7
4 8 [value2]
[[3]]
first_column_3 second_column_3
1 9
2 10 [value2]
3 11
4 12
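The lapply call above still relies on dplyr for mutate()/across(). If you prefer to stay fully in base R, here is a sketch of the same idea applied to every character column:
data <- lapply(df_list, function(x) {
  # find the character columns and run gsub on each of them
  chr_cols <- vapply(x, is.character, logical(1))
  x[chr_cols] <- lapply(x[chr_cols], function(col) gsub("\\[\\]", "", col))
  x
})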
You've got a few issues:
(a) You say you want to replace "[]" with "", but your code drops whole rows instead of replacing values. Use sub instead of grepl for replacing, or better yet, since you are matching a whole string, don't use regex at all.
(b) You are running grepl on the number 2: you have var <- 2, so your command grepl("[[]", var) is grepl("[[]", 2), which is always FALSE because the string "2" contains no brackets.
(c) Your grepl pattern matches any string that contains a [ anywhere, so even after fixing (a) and (b) you would still match strings like "[value1]".
As I said in (a), when you're matching a full string, you don't need regex at all. I'd do it like this:
df_list <- map(df_list, ~ {
  .x[[var]][.x[[var]] == "[]"] <- ""
  .x
})
df_list
# [[1]]
# first_column second_column
# 1 1 value1
# 2 2 value2
# 3 3
# 4 4 [value]
#
# [[2]]
# first_column_2 second_column_2
# 1 5 value1
# 2 6
# 3 7
# 4 8 [value2]
#
# [[3]]
# first_column_3 second_column_3
# 1 9
# 2 10 [value2]
# 3 11
# 4 12
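For completeness, since (a) mentions sub: a fixed-string sub() on the target column gives the same result here. This is a sketch that keeps the numeric var index from the question; note it replaces the substring "[]" rather than testing the whole value:
library(purrr)

df_list <- map(df_list, ~ {
  # replace the literal "[]" with "" in column number `var`
  .x[[var]] <- sub("[]", "", .x[[var]], fixed = TRUE)
  .x
})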

Remove repetitive strings in a sequence and keep only the last occurrence of each

I have a dictionary of words. I also have a dataframe column containing sequences built from combinations of the words in the dictionary.
I want to remove repeated words and keep only the last occurrence of each one in the sequence, so that every unique word appears exactly once, at the position of its last occurrence. For example, if dictionary <- c("A","B","C") and my sequence is mySeq <- "ABCCBCA", I want the result to be "BCA".
Let's try it with the following data:
dic<- c("AA","BB","c","p")
df<-read.table(text="
id mySequece
1 AAcAABBcPAA
2 AABBAA
3 AABBAABB
4 AAcBBc
5 cBBAABBBBBBBB
6 cBBAABBBBcBB
7 ccp
8 ccppcc",header=T,stringsAsFactors = F)
desired result:
id My_new_sequence
1 BBcPAA
2 BBAA
3 AABB
4 AABBc
5 cAABB
6 AAcBB
7 cp
8 pc
How can I do it in R?
We can extract the elements based on 'dic', then use duplicated with fromLast = TRUE to remove the duplicates from the end, and paste the result back together:
library(dplyr)
library(stringr)
library(purrr)
df %>%
  mutate(mySequece = str_extract_all(mySequece, str_c(dic, collapse = "|")) %>%
           map_chr(~ str_c(.x[!duplicated(.x, fromLast = TRUE)], collapse = "")))
# id mySequece
#1 1 BBcAA
#2 2 BBAA
#3 3 AABB
#4 4 AABBc
#5 5 cAABB
#6 6 AAcBB
#7 7 cp
#8 8 pc
Or using base R
sapply(regmatches(df$mySequece, gregexpr(paste(dic, collapse = "|"), df$mySequece)),
       function(x) paste(x[!duplicated(x, fromLast = TRUE)], collapse = ""))
#[1] "BBcAA" "BBAA" "AABB" "AABBc" "cAABB" "AAcBB" "cp" "pc"
data
df <- structure(list(id = 1:8, mySequece = c("AAcAABBcPAA", "AABBAA",
"AABBAABB", "AAcBBc", "cBBAABBBBBBBB", "cBBAABBBBcBB", "ccp",
"ccppcc")), class = "data.frame", row.names = c(NA, -8L))

How do I read a CSV file into R when a column's rows contain lists in the form ["123", "456", "789"]?

I am trying to upload a CSV file that has various data in normal format (column name and then either numeric or string values) as well as a column that holds lists of numbers of varying length in ["x"] format (e.g. row 1 = ["111", "222"], row 2 = ["333"], row 3 = ["555", "666", "777"]). How do I load that data so that I can analyse it?
When I read it in as a character string, the data came back as "[\"x\"]". When I read it in as a factor, it looked like the format in the CSV. Either way, I can't do anything with the [" present.
Hi, you can use the stringr package to grab the digits out of the square brackets. I think the \ shows up because it is the escape character protecting the inner set of quotes. Anyway, this will simplify it.
I made some ugly data
df <- data.frame(x = c(1, 2, 3),
y = c('[\\"111\\", \\"222\\"]', '[\\"333\\"]', '[\\"555\\", \\"666\\", \\"777\\"]'))
df
x y
1 1 [\\"111\\", \\"222\\"]
2 2 [\\"333\\"]
3 3 [\\"555\\", \\"666\\", \\"777\\"]
Now, using some regex and stringr::str_extract_all, we grab all occurrences of one or more digits in succession.
df$y <- stringr::str_extract_all(df$y, "(\\d+)")
(\\d+) is simply saying I want to grab groups of 1 or more digits.
This yields a nested list without the \ included.
x y
1 1 111, 222
2 2 333
3 3 555, 666, 777
They are still strings, so if you want to evaluate the numbers you need to do stuff like:
> eval(parse(text = df$y[[1]][1])) / 111
[1] 1
For the whole data frame, you may consider unnesting it and making a new column (or overwriting the original) to change the data type and turn the strings into evaluable expressions. For this we can use some of the tidyverse (tidyr::unnest and dplyr::mutate):
library(dplyr)  # also provides the %>% pipe

df %>%
  tidyr::unnest(y) %>%
  dplyr::rowwise() %>%
  dplyr::mutate(numeric_y = eval(parse(text = y)))
# A tibble: 6 x 3
x y numeric_y
<dbl> <chr> <dbl>
1 1 111 111
2 1 222 222
3 2 333 333
4 3 555 555
5 3 666 666
6 3 777 777
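If the extracted values are plain integers, a simpler route than eval(parse()) is as.numeric on the unnested column. A quick sketch reusing the df built above:
library(dplyr)
library(tidyr)

df %>%
  unnest(y) %>%                        # one row per extracted digit string
  mutate(numeric_y = as.numeric(y))    # direct conversion, no eval/parse needed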

Paste string values from df column into a function

I have a dataset in R organized like so:
x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2
Then, I have a function like
fun <- function(number, df1, string, df2){
  NormC <- as.numeric(df1[string, "normc"])
  df2$NormC <- rep(NormC)
}
How can I iterate through my df and insert each value of "x" into the function?
I think the problem is that this part of the function (which has 4 input variables) is structured like so: NormC <- as.numeric(df[string, "normc"])
As explained by @duckmayr, you don't need to iterate through column x. Here is an example creating a new variable.
df <- read.table(text = " x freq
1 PRODUCT10000 6
2 PRODUCT10001 20
3 PRODUCT10002 11
4 PRODUCT10003 4
5 PRODUCT10004 1
6 PRODUCT10005 2", header = TRUE)
fun <- function(string){paste0(string, "X")} # example
# option 1
df$new.col1 <- fun(df$x) # see duckmayr's comment
# option 2
library(data.table)
setDT(df)[, new.col2 := fun(x)]
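As a quick check, option 1 simply applies fun to the whole x column at once, so the new column is every product code with "X" appended:
df$new.col1
# [1] "PRODUCT10000X" "PRODUCT10001X" "PRODUCT10002X" "PRODUCT10003X"
# [5] "PRODUCT10004X" "PRODUCT10005X"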

Subset a dataframe using a string of column names

I need to subset a dataframe (df) by a string of column names that I have created, but I'm not sure how to inject this into a subset.
For example, colstoKeep is a character string:
"col1", "col2", "col3", "col4"
How do I push this into a subset function?
df <- df[colstoKeep]
I'm sure this must be easy, but the above doesn't work.
df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
colsToKeep <- "\"A\", \"C\""
If I understand your question correctly, your colsToKeep variable is a string as given above. In order to extract the variables, you will have to convert that into a vector. If I've used the right format, you can do that with the following code.
library(magrittr)

colsToKeepVector <- strsplit(colsToKeep, ",") %>%
  unlist() %>%
  trimws() %>%
  gsub("\"", "", .)

df[colsToKeepVector]
However, if I'm also understanding that you had a vector that you collapsed to a string (paste(..., collapse = ", ")?), I would strongly advise you not to do that.
(Edited to match the string format in the question)
df <- data.frame(A=seq(1:5),B=seq(5:1),C=seq(1:5))
df
A B C
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
cols_to_keep <- c("A","C")
df[,cols_to_keep]
A C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
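One small gotcha worth noting with the matrix-style df[, cols_to_keep] form: if the vector ever contains just one name, the result drops to a plain vector; drop = FALSE keeps it a data frame:
df[, "A"]                         # numeric vector
df[, "A", drop = FALSE]           # one-column data frame
df[, cols_to_keep, drop = FALSE]  # always a data frame, however many names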
