How can I do a string replace on one column, but with multiple conditions?
I have the following data
strings <- as_tibble(c("string.a","string.a", "string.b", "string.c"))
# A tibble: 4 x 1
value
<chr>
1 string.a
2 string.a
3 string.b
4 string.c
and the following replacements
replacements <- c("alice", "bob", "joe")
conditions <- c(".a", ".b", ".c")
The resulting data would be
result <- as_tibble(c("string_alice", "string_alice", "string_bob", "string_joe"))
# A tibble: 4 x 1
value
<chr>
1 string_alice
2 string_alice
3 string_bob
4 string_joe
I have considered a mapping table of some sort, but it is not clear to me how to feed a mapping table to a string replace function.
nm = setNames(replacements, gsub("\\.", "", conditions))
sapply(strsplit(strings$value, "\\."), function(x){
paste(c(x[1], nm[x[2]]), collapse = "_")  # "_" matches the requested output, e.g. "string_alice"
})
Data
strings = structure(list(value = c("string.a", "string.a", "string.b",
"string.c")), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
We can use gsubfn
library(gsubfn)
sub("\\.", "_", gsubfn("(\\w+)$", setNames(as.list(replacements),
sub("\\.", "", conditions)), strings$value))
#[1] "string_alice" "string_alice" "string_bob" "string_joe"
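The mapping-table idea from the question can also be sketched in base R alone. This is only an illustration, assuming every condition is a suffix of the form ".x" and that unmatched strings should be left unchanged; the helper names suffix, prefix, idx and out are made up for this example.

```r
strings      <- c("string.a", "string.a", "string.b", "string.c")
replacements <- c("alice", "bob", "joe")
conditions   <- c(".a", ".b", ".c")

# Pull the trailing ".x" part and the part before it out of each string
suffix <- sub("^.*(\\.[^.]+)$", "\\1", strings)
prefix <- sub("\\.[^.]+$", "", strings)

# Look each suffix up in the mapping table; NA means "no rule for this suffix"
idx <- match(suffix, conditions)

# Rebuild the string where a rule exists, keep the original otherwise
out <- ifelse(is.na(idx), strings, paste0(prefix, "_", replacements[idx]))
out
```

Because match() does the lookup, adding a rule only means appending one element to conditions and one to replacements.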
Related
I have a dataset which looks something like this:
print(animals_in_zoo)
// I only know the name of the first column, the second one is dynamic/based on a previously calculated variable
animals | dynamic_column_name
// What the data looks like
elefant x
turtle
monkey
giraffe x
swan
tiger x
What I want is to collect the rows in which the second column's value is equal to "x".
What I want to do is something like:
SELECT * from data where col2 == "x";
After that, I want to grab only the first column and create a string object like "elefant giraffe tiger", but that is the easy part.
You can reference that column by its index and use that to get the animals you want:
df1 <- structure(list(animal = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), dynamic_column = c("x", NA, NA, "x", NA, "x"
)), row.names = c(NA, -6L), class = "data.frame")
df1[, 1][df1[, 2] == "x" & !is.na(df1[, 2])]
#> [1] "elefant" "giraffe" "tiger"
We could use filter with grepl which searches for a pattern 'x' in the string:
# the data frame
df <- read.table(header = TRUE, text =
'my_col
"elefant x"
turtle
monkey
"giraffe x"
swan
"tiger x"'
)
library(dplyr)
df %>%
filter(grepl('x', my_col))
my_col
1 elefant x
2 giraffe x
3 tiger x
Use [: the first argument refers to the rows. You want the rows where the second column is "x". The second argument is the column you need in the end, and you want the column named "animals":
dat[dat[2] == "x", "animals"]
#[1] "elefant" "giraffe" "tiger"
data
dat <- structure(list(animals = c("elefant", "turtle", "monkey", "giraffe",
"swan", "tiger"), V2 = c("x", "", "", "x", "", "x")), row.names = c(NA,
-6L), class = "data.frame")
# animals V2
# 1 elefant x
# 2 turtle
# 3 monkey
# 4 giraffe x
# 5 swan
# 6 tiger x
I guess you have a dataframe?
If so, something like df[df$col2 == 'x',] should work.
With base functions, you can do it like this:
# Option 1
your_dataframe[your_dataframe$col2 == "x", ]
# Option 2
your_dataframe[your_dataframe[,2] == "x", ]
With dplyr functions, you can do it like this:
library(dplyr)
your_dataframe %>%
filter(col2 == "x")
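Since the question says the second column's name is only known at run time, here is a minimal sketch that looks the column up by a name stored in a variable and then builds the final string. The column name "flagged" and the variable col_name are placeholders for whatever the earlier calculation produces.

```r
animals_in_zoo <- data.frame(
  animals = c("elefant", "turtle", "monkey", "giraffe", "swan", "tiger"),
  flagged = c("x", NA, NA, "x", NA, "x"),
  stringsAsFactors = FALSE
)

col_name <- "flagged"  # hypothetical: the dynamically computed column name

# [[ ]] accepts a column name held in a variable;
# %in% returns FALSE for NA, so no separate is.na() check is needed
keep <- animals_in_zoo[[col_name]] %in% "x"

# Collapse the matching animals into a single space-separated string
result <- paste(animals_in_zoo$animals[keep], collapse = " ")
result
```

The same lookup works for empty strings instead of NA, because "" is simply not "x".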
I should say up front that I am a beginner.
I have a single-column (character) data frame for which I would like to find the minimum, maximum
and average price. The min() and max() functions also accept a character vector (although they compare
the values as text), but mean() and median() need a numeric vector. I have tried replacing the comma
with a period, but the problem becomes more complex when the prices are in the thousands. How can I do this?
>price
Price
1 1.651
2 2.229,00
3 1.899,00
4 2.160,50
5 1.709,00
6 1.723,86
7 1.770,99
8 1.774,90
9 1.949,00
10 1.764,12
This is the dataframe. I thank anyone who wants to help me in advance
Remove the . thousands separators first, then replace the decimal , with . and convert the values to numeric.
In base R using gsub -
df <- transform(df, Price = as.numeric(gsub(',', '.',
gsub('.', '', Price, fixed = TRUE), fixed = TRUE)))
# Price
#1 1651.00
#2 2229.00
#3 1899.00
#4 2160.50
#5 1709.00
#6 1723.86
#7 1770.99
#8 1774.90
#9 1949.00
#10 1764.12
You can also use the parse_number function from readr.
library(readr)
df$Price <- parse_number(df$Price,
locale = locale(grouping_mark = ".", decimal_mark = ','))
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
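library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
url <- "https://www.shoppydoo.it/prezzi-notebook-mwp72t$2fa.html?src=user_search"
page <- read_html(url)
price <- page %>% html_nodes(".price") %>% html_text() %>% data.frame()
colnames(price) <- "Price"
price$Price <- gsub("da ", "", price$Price)
price$Price <- gsub("€", "", price$Price)
# fixed = TRUE treats "." as a literal dot; as a regex it would match (and delete) every character
price$Price <- gsub(".", "", price$Price, fixed = TRUE)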
We could use chartr in base R
# chartr() swaps "." and ","; gsub() then drops every thousands separator
df$Price <- as.numeric(gsub(",", "", chartr(".,", ",.", df$Price), fixed = TRUE))
data
df <- structure(list(Price = c("1.651", "2.229,00", "1.899,00", "2.160,50",
"1.709,00", "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12"
)), class = "data.frame", row.names = c(NA, -10L))
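To tie the conversion back to the original goal (minimum, maximum and average), here is a small base R sketch. The helper name to_number is invented for this example, and it assumes every value uses "." for thousands and "," for decimals. It also shows why min() and max() on the raw character column are not reliable: text is compared character by character, not by value.

```r
price <- c("1.651", "2.229,00", "1.899,00", "2.160,50", "1.709,00",
           "1.723,86", "1.770,99", "1.774,90", "1.949,00", "1.764,12")

# Hypothetical helper: drop thousands dots, then turn the decimal comma into a dot
to_number <- function(x) {
  as.numeric(gsub(",", ".", gsub(".", "", x, fixed = TRUE), fixed = TRUE))
}

p <- to_number(price)

# Numeric summaries now behave as expected
stats <- c(min = min(p), max = max(p), mean = mean(p), median = median(p))
stats

# Character comparison is lexicographic, so "9" sorts after "10"
max(c("9", "10"))
```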
Say I have the following data frame:
# S/N a b
# 1 L1-S2 <blank>
# 2 T1-T3 <blank>
# 3 T1-L2 <blank>
How do I turn the above data frame into this:
# S/N a b
# 1 L1-S2 LS
# 2 T1-T3 T
# 3 T1-L2 TL
I am thinking of writing a loop, where
For x in column a,
If first character in x == L AND 4th character in x == S,
fill the corresponding cell in b with LS
and so on...
However, I am not sure how to implement it, or if there is a more elegant way of doing this.
We can extract the upper case letters and remove the repeated ones
library(stringr)
library(dplyr)
df1 %>%
mutate(b = str_replace(str_replace(a, "^([A-Z])\\d+-([A-Z])\\d+",
"\\1\\2"), "(.)\\1+", "\\1"))
-output
# S_N a b
#1 1 L1-S2 LS
#2 2 T1-T3 T
#3 3 T1-L2 TL
Or another option is str_extract_all to extract the upper case letters, loop over the list with map, paste the unique elements
library(purrr)
df1 %>%
mutate(b = str_extract_all(a, "[A-Z]") %>%
map_chr(~ str_c(unique(.x), collapse="")))
Or using a corresponding base R option for the first tidyverse option
df1$b <- sub("(.)\\1+", "\\1", gsub("[0-9-]+", "", df1$a))
Or with strsplit
df1$b <- sapply(strsplit(df1$a, "[0-9-]+"),
function(x) paste(unique(x), collapse=""))
data
df1 <- structure(list(S_N = 1:3, a = c("L1-S2", "T1-T3", "T1-L2"),
b = c(NA,
NA, NA)), class = "data.frame", row.names = c(NA, -3L))
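A further base R variant, offered only as a sketch: regmatches() with gregexpr() pulls out the capital letters directly, and vapply() pastes the distinct ones together. It assumes the region codes are always single upper-case ASCII letters.

```r
a <- c("L1-S2", "T1-T3", "T1-L2")

# One character vector of capital letters per element of a
letters_found <- regmatches(a, gregexpr("[A-Z]", a))

# Keep each letter once, in order of first appearance, and glue them together
b <- vapply(letters_found,
            function(x) paste(unique(x), collapse = ""),
            character(1))
b
```

vapply() is used instead of sapply() so that the result is guaranteed to be a character vector with one entry per row.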
Given this data.frame:
library(dplyr)
library(stringr)
ml.mat2 <- structure(list(value = c("a", "b", "c"), ground_truth = c("label1, label3",
"label2", "label1"), predicted = c("label1", "label2,label3",
"label1")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L))
glimpse(ml.mat2)
Observations: 3
Variables: 3
$ value <chr> "a", "b", "c"
$ ground_truth <chr> "label1, label3", "label2", "label1"
$ predicted <chr> "label1", "label2,label3", "label1"
I want to measure the size of the union of ground_truth and predicted for each row, after splitting the repeated labels on ",".
In other words, I would expect a result of length 3 with values of 2 2 1.
I wrote a function to do this, but it only seems to work outside of sapply:
m_fn <- function(x,y) length(union(unlist(sapply(x, str_split,",")),
unlist(sapply(y, str_split,","))))
m_fn(ml.mat2$ground_truth[1], y = ml.mat2$predicted[1])
[1] 2
m_fn(ml.mat2$ground_truth[2], y = ml.mat2$predicted[2])
[1] 2
m_fn(ml.mat2$ground_truth[3], y = ml.mat2$predicted[3])
[1] 1
Rather than iterating through the rows of the data set manually like this or with a loop, I would expect to be able to vectorize the solution with sapply like this:
sapply(ml.mat2$ground_truth, m_fn, ml.mat2$predicted)
However, the unexpected results are:
label1, label3 label2 label1
4 3 3
Since you are iterating over rows of the same data frame, you can generate an index of row numbers and use it in your sapply call:
sapply(1:nrow(ml.mat2), function(i) m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
#[1] 2 2 1
or with seq_len:
sapply(seq_len(nrow(ml.mat2)), function(i)
m_fn(x = ml.mat2$ground_truth[i], y = ml.mat2$predicted[i]))
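The reason the original sapply call returned 4 3 3 is that the whole predicted vector was passed as y for every element of ground_truth. As an alternative sketch, mapply() walks both vectors in parallel, which avoids the explicit row index. The helper split_labels is made up for this example and also trims the space after the comma, which the original str_split call keeps.

```r
ground_truth <- c("label1, label3", "label2", "label1")
predicted    <- c("label1", "label2,label3", "label1")

# Hypothetical helper: split one string on commas and strip surrounding spaces
split_labels <- function(s) trimws(strsplit(s, ",", fixed = TRUE)[[1]])

# mapply() calls the function on (ground_truth[i], predicted[i]) pairs
n_union <- mapply(
  function(x, y) length(union(split_labels(x), split_labels(y))),
  ground_truth, predicted,
  USE.NAMES = FALSE
)
n_union

# The same pattern gives the size of the intersection instead
n_common <- mapply(
  function(x, y) length(intersect(split_labels(x), split_labels(y))),
  ground_truth, predicted,
  USE.NAMES = FALSE
)
n_common
```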
My df
name age
tom 21
mary 42
How can I combine each row to something like
name:tom,age:21
name:mary,age:42
the output can be a list of strings.
A more general approach using apply.
apply(df1, 1, function(x) {n <- names(df1); paste0(n[1],":",x[1],",", n[2],":",x[2], collapse = "")})
Here is a more general version that works for any number of columns:
df1<-
structure(list(name = c("tom", "mary"), age = c(21L, 42L), cool = c("yes",
"no")), row.names = c(NA, -2L), class = "data.frame")
apply(
apply(df1, 1, function(x) {n <- names(df1); paste0(paste(n,x, sep = ":"))}),
2,
paste0, collapse = ","
)
# "name:tom,age:21,cool:yes" "name:mary,age:42,cool:no"
Try this paste combination:
df$new.col <- paste(paste(colnames(df)[1], df$name, sep = ":"),
paste(colnames(df)[2], df$age, sep = ":"),
sep = ",")
# output
# name age new.col
#1 tom 21 name:tom,age:21
#2 mary 42 name:mary,age:42
I have some sample data, such as:
name=c("ali","asgar","ahmad","aslam","alvi")
age=c(12,33,23,16,34)
mydf=data.frame(name,age)
The data frame looks like this:
> mydf
name age
1 ali 12
2 asgar 33
3 ahmad 23
4 aslam 16
5 alvi 34
Now make a list object and fill it.
mylist=list()
for(i in 1:nrow(mydf))
{
a=as.integer(mydf$age[i])
n=as.character(mydf$name[i])  # as.String() is not a base R function; as.character() is
mylist[i]=paste(paste(paste("name",n,sep = ":"),"age",sep = ","),a,sep = ":")
}
Finally, the result is
> mylist
[[1]]
[1] "name:ali,age:12"
[[2]]
[1] "name:asgar,age:33"
[[3]]
[1] "name:ahmad,age:23"
[[4]]
[1] "name:aslam,age:16"
[[5]]
[1] "name:alvi,age:34"
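For comparison, here is a loop-free sketch that works column by column instead of row by row. It relies on paste() being vectorised and on Map() to prefix each column with its own name, so it handles any number of columns. One design reason to prefer this over apply(): apply() first converts the data frame to a character matrix, which can pad numbers of different widths with spaces. The variable names pairs and out are only illustrative.

```r
df <- data.frame(name = c("tom", "mary"), age = c(21L, 42L),
                 stringsAsFactors = FALSE)

# For every column, build "colname:value" strings (one per row)
pairs <- Map(function(col, nm) paste(nm, col, sep = ":"), df, names(df))

# Paste the per-column pieces together row-wise, separated by commas
out <- do.call(paste, c(unname(pairs), sep = ","))
out

# as.list(out) gives a list of strings if a list is required
```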