I need to define a function f(x,y) such that:
x = "col1,1,2,3,4"
y = "col2,a,b,c,d"
becomes:
# A tibble: 4 x 2
col1 col2
<int> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
Any thoughts? Thanks.
The most obvious idea that comes to mind is to split the input by comma, use paste to combine the output into a single string, and read that using read_csv.
Example:
paste(do.call(paste, c(strsplit(c(x, y), ","), sep = ", ")), collapse = "\n")
# [1] "col1, col2\n1, a\n2, b\n3, c\n4, d"
library(tidyverse)
read_csv(paste(do.call(paste, c(strsplit(c(x, y), ","), sep = ", ")), collapse = "\n"))
# # A tibble: 4 x 2
# col1 col2
# <int> <chr>
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
From there, I hope you're able to convert the approach to a function.
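As a rough sketch of that last step, the split-and-paste idea can be wrapped in a function. The name f comes from the question; using base R's read.csv(text = ...) instead of read_csv is an assumption made here so the example runs without extra packages, and it returns a plain data.frame rather than a tibble.

```r
# Sketch only: read.csv() is swapped in for read_csv() (an assumption),
# so the result is a data.frame, not a tibble.
f <- function(x, y) {
  # Split each input string on commas
  parts <- strsplit(c(x, y), ",", fixed = TRUE)
  # Paste the i-th piece of each string together to form one CSV row
  rows <- do.call(paste, c(parts, sep = ","))
  # Join the rows with newlines and parse; the first row becomes the header
  read.csv(text = paste(rows, collapse = "\n"), stringsAsFactors = FALSE)
}

x <- "col1,1,2,3,4"
y <- "col2,a,b,c,d"
res <- f(x, y)
```

If a tibble is required, wrapping the result in tibble::as_tibble() should be enough.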
I have
x<-"1, A | 2, B | 10, C "
x is always formatted this way: | denotes a new row, the first value in each row is variable1, and the second value is variable2.
I would like to convert it to a data.frame
variable1 variable2
1 1 A
2 2 B
3 10 C
I haven't found any package that treats | as a row separator.
How can I convert it to data.frame?
We may use read.table from base R to read the string into two columns after replacing the | with \n
read.table(text = gsub("|", "\n", x, fixed = TRUE), sep=",",
header = FALSE, col.names = c("variable1", "variable2"), strip.white = TRUE )
-output
variable1 variable2
1 1 A
2 2 B
3 10 C
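A small sketch of why fixed = TRUE is used in the gsub call: | is the alternation operator in a regular expression, so it has to be matched either literally (fixed = TRUE) or escaped. The variable names literal and escaped are made up for this illustration.

```r
x <- "1, A | 2, B | 10, C "

# Match the pipe as a literal string ...
literal <- gsub("|", "\n", x, fixed = TRUE)
# ... or escape it so the regex engine does not treat it as alternation
escaped <- gsub("\\|", "\n", x)

# Both forms give the same newline-separated text, which read.table can parse
df <- read.table(text = literal, sep = ",", strip.white = TRUE,
                 col.names = c("variable1", "variable2"))
```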
Or use fread from data.table
library(data.table)
fread(gsub("|", "\n", x, fixed = TRUE), col.names = c("variable1", "variable2"))
variable1 variable2
1: 1 A
2: 2 B
3: 10 C
Or using tidyverse - separate_rows to split the column and then create two columns with separate
library(tidyr)
library(dplyr)
tibble(x = trimws(x)) %>%
separate_rows(x, sep = "\\s*\\|\\s*") %>%
separate(x, into = c("variable1", "variable2"), sep=",\\s+", convert = TRUE)
# A tibble: 3 × 2
variable1 variable2
<int> <chr>
1 1 A
2 2 B
3 10 C
Here's a way using scan().
x <- "1, A | 2, B | 10, C "
do.call(rbind.data.frame,
strsplit(scan(text=x, what="A", sep='|', quiet=T, strip.white=T), ', ')) |>
setNames(c('variable1', 'variable2'))
# variable1 variable2
# 1 1 A
# 2 2 B
# 3 10 C
Note: R version 4.1.2 (2021-11-01).
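One thing to be aware of with a split-based approach is that every column comes back as character. A possible variant, assuming that converting the columns afterwards with type.convert is acceptable, could look like this (the intermediate names pieces and rows are only for illustration):

```r
x <- "1, A | 2, B | 10, C "

# Split into rows on the literal pipe, then trim surrounding whitespace
pieces <- trimws(strsplit(x, "|", fixed = TRUE)[[1]])
# Split each row into its two fields
rows <- strsplit(pieces, ",\\s*")

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- c("variable1", "variable2")

# Convert each column to its natural type (variable1 becomes integer)
df[] <- lapply(df, type.convert, as.is = TRUE)
```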
I want to paste a number and some letters together to index them. The columns of my dataframe are as follows:
When CNTR is NA, I want it to be the booking number plus an index, so for booking 202653, for example, I want 202653A and 202653B. I have already managed to paste the booking numbers into the CNTR column where it is empty with:
dfUNIT$CNTR <- ifelse(is.na(dfUNIT$CNTR), dfUNIT$BOOKING, dfUNIT$CNTR)
which gives me the following table;
But as I said, I need unique CNTR values. My dataframe contains thousands of rows and changes frequently, is there a way to 'index' them the way I want (A, B, C etc)? Thank you in advance
I'll make up some data,
dat <- data.frame(B=c(202658,202654,202653,202653),C=c("TCLU","KOCU",NA,NA))
dplyr
library(dplyr)
dat %>%
group_by(B) %>%
mutate(C = if_else(is.na(C), paste0(B, LETTERS[row_number()]), C))
# # A tibble: 4 x 2
# # Groups: B [3]
# B C
# <dbl> <chr>
# 1 202658 TCLU
# 2 202654 KOCU
# 3 202653 202653A
# 4 202653 202653B
A fundamental risk here is a booking with more than 26 rows: LETTERS[27] is NA, so paste0() would append the literal text NA instead of a letter. An alternative is to append a number instead (e.g., paste0(B, "_", row_number())) or to add some other safeguard.
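A minimal base R sketch of the numeric-suffix alternative, using the made-up dat from above. The helper variable idx is hypothetical, and, like row_number() in the dplyr version, it counts every row of a booking, not only the rows where C is NA.

```r
dat <- data.frame(B = c(202658, 202654, 202653, 202653),
                  C = c("TCLU", "KOCU", NA, NA),
                  stringsAsFactors = FALSE)

# Running index within each booking (1, 2, 3, ... with no upper limit)
idx <- ave(seq_along(dat$B), dat$B, FUN = seq_along)

# Only fill in rows where C is missing
dat$C <- ifelse(is.na(dat$C), paste0(dat$B, "_", idx), dat$C)
```

A numeric suffix never runs out, whereas indexing LETTERS past 26 returns NA.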
base R alternatives
do.call(rbind, by(dat, dat[,"B",drop=FALSE],
FUN = function(z) transform(z,
C = ifelse(is.na(C), paste0(B, LETTERS[seq_along(z$C)]), C)
)
))
or
append <- ave(dat$C, dat$B, FUN = function(z) ifelse(is.na(z), LETTERS[seq_along(z)], ""))
append
# [1] "" "" "A" "B"
dat$C <- paste0(ifelse(is.na(dat$C), dat$B, dat$C), append)
dat
# B C
# 1 202658 TCLU
# 2 202654 KOCU
# 3 202653 202653A
# 4 202653 202653B
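If you don't insist on using letters as the index, here's a rough-and-ready dplyr solution based on rleid from the data.table package. Note that rows in the same run share one suffix (both rows for 13 become 13_2 below), so on its own this does not make the values unique: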
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(B)) %>%
mutate(CNTR_new = if_else(is.na(CNTR), paste0(B, "_", grp), CNTR))
# A tibble: 7 x 4
# Groups: grp [5]
B CNTR grp CNTR_new
<dbl> <chr> <int> <chr>
1 12 TCU 1 TCU
2 13 NA 2 13_2
3 13 NA 2 13_2
4 15 NA 3 15_3
5 1 PVDU 4 PVDU
6 1 NA 4 1_4
7 5 NA 5 5_5
Data:
df <- data.frame(
B = c(12,13,13,15,1,1,5),
CNTR = c("TCU", NA, NA, NA, "PVDU", NA, NA)
)
I want to aggregate a data.frame with two columns: in one column I have "num", which is an identifier number and in the other I have text. It is important that the aggregated text has a space between the individual parts. My code is this:
data_aggr <- aggregate(
x = data_aggr,
FUN = paste,
by = list(data_aggr$num)
)
I have tried the obvious with FUN = paste(collapse = " ") and
FUN = paste,
collapse = " ",
but that doesn't work. How do I need to do this?
Aggregate can be used to paste together the rows with the same value of num as follows:
data_aggr <- data.frame(num=c(1,1,1,2,2), letters=letters[1:5])
aggregate(data_aggr$letters, list(data_aggr$num), FUN=paste, collapse= " ")
# Group.1 x
# 1 1 a b c
# 2 2 d e
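The attempts in the question fail because FUN has to be a function, while paste(collapse = " ") is a call that is evaluated before aggregate ever sees it. Extra arguments such as collapse are passed on to FUN through aggregate's ... argument, as above. An equivalent way to write it, sketched here with the formula interface and an anonymous function:

```r
data_aggr <- data.frame(num = c(1, 1, 1, 2, 2), letters = letters[1:5])

# Wrap paste() in a function so collapse is fixed inside FUN itself
res <- aggregate(letters ~ num, data = data_aggr,
                 FUN = function(z) paste(z, collapse = " "))
```

With the formula interface the result keeps the original column names (num and letters) instead of Group.1 and x.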
A dplyr solution: the idea is to create a new column holding the row number so that the operation can be carried out on each row.
> library(dplyr)
> df.ask <- data.frame('Num' = 1:10,
+ 'Text' = letters[1:10])
>
> df.ask %>%
+ mutate(row_num = row_number()) %>%
+ group_by(row_num) %>%
+ mutate(together = paste(Num, Text, collapse = ' ')) %>%
+ ungroup() %>%
+ select(-row_num)
# A tibble: 10 x 3
Num Text together
<int> <fct> <chr>
1 1 a 1 a
2 2 b 2 b
3 3 c 3 c
4 4 d 4 d
5 5 e 5 e
6 6 f 6 f
7 7 g 7 g
8 8 h 8 h
9 9 i 9 i
10 10 j 10 j
I have a dataframe and I want to replicate the input of a single cell n times dependent on the input of the next cell and display it in a new cell.
My dataframe looks like this:
data <- data.frame(c(1,1,2,3,4,4,4), c("A","B","A","C","D","E","A"), c(2,1,1,3,2,1,3))
colnames(data) <- c("document number", "term", "count")
data
This is my desired result:
datanew <- data.frame(c(1,2,3,4), c("A A B", "A", "C C C", "D D E A A A"))
colnames(datanew) <- c("document number", "term")
# document number term
# 1 1 A A B
# 2 2 A
# 3 3 C C C
# 4 4 D D E A A A
So basically, I would like to repeat the content of the term cell as many times as the corresponding count cell says. Does anyone have an idea how to code this in R?
We can use rep to replicate term count times and paste the data together.
library(dplyr)
data %>%
group_by(`document number`) %>%
summarise(new = paste(rep(term, count), collapse = " "))
# A tibble: 4 x 2
# `document number` new
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
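To see what happens inside each group, here is a tiny illustration on two plain vectors (the values are taken from document number 1; the name out is only for this example). When the second argument to rep is a vector of the same length as the first, each element is repeated its own number of times.

```r
term  <- c("A", "B")
count <- c(2, 1)

# Each element of term is repeated by the matching element of count
repeated <- rep(term, count)

# Collapse the repeated terms into a single space-separated string
out <- paste(repeated, collapse = " ")
```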
Similarly with data.table
library(data.table)
setDT(data)[, .(new = paste(rep(term, count), collapse = " ")),
by = `document number`]
We can do this with tidyverse methods
library(dplyr)
library(tidyr)
library(stringr)
data %>%
uncount(count) %>%
group_by(`document number`) %>%
summarise(term = str_c(term, collapse=' '))
# A tibble: 4 x 2
# `document number` term
# <dbl> <chr>
#1 1 A A B
#2 2 A
#3 3 C C C
#4 4 D D E A A A
Or with data.table
library(data.table)
setDT(data)[rep(seq_len(.N), count)][, .(term =
paste(term, collapse=' ')), `document number`]
Or using base R with aggregate
aggregate(term ~ `document number`, data[rep(seq_len(nrow(data)),
data$count),], FUN = paste, collapse= ' ')
I have a large dataset grouped by agent and date, the variable I want to clean is a string type variable. For instance, for the following dataset
agent_id<-c("1","1","1","2","2","2","2")
date<-c("2007-02-01","2007-02-02","2007-02-05","2000-05-01","2000-05-02","2000-05-10","2000-05-20")
office<-c("A","A","B","C","D","C","C")
mydata<-data.frame(agent_id,date,office)
I want to replace an outlier within the office vector if it differs from both the previous and the next observation within each agent_id. For instance, for agent_id=1, I don't want to replace anything. For agent_id=2, I want to replace "D" with "C" in office because I observe C both before and after. Is there any way to do that with dplyr? Additionally, it would be better if I could define the cutoff for replacing outliers, i.e. only replace if I observe n identical values before and n identical values after.
You could do:
library(dplyr)
mydata %>%
group_by(agent_id) %>%
mutate(
office = replaceOutliers(x = office, window = 1)
)
Where replaceOutliers is a custom function:
replaceOutliers <- function(x, window = 1, fixed_wind = FALSE) {
x <- as.character(x)
flag_Outl <- c(FALSE, sapply(2:(length(x) - 1), function(y) length(setdiff(x[pmax(1, y - window):pmax(1, y - 1)],
x[pmin(length(x) - 1, y + 1):pmin(length(x) - 1, y + window)])) == 0), FALSE)
if (fixed_wind) {
len_Lag <- sapply(1:length(x), function(y) length(x[pmax(1, y - window):pmax(1, y - 1)]))
len_Lead <- sapply(1:length(x), function(y) length(x[pmin(length(x), y + 1):pmin(length(x), y + window)]))
x <- sapply(1:length(flag_Outl), function(y) ifelse(flag_Outl[y] & len_Lag[y] == window & len_Lead[y] == window, x[y - 1], x[y]))
}
else x <- sapply(1:length(flag_Outl), function(y) ifelse(flag_Outl[y], x[y - 1], x[y]))
return(x)
}
Output:
# A tibble: 7 x 3
# Groups: agent_id [2]
agent_id date office
<fct> <fct> <chr>
1 1 2007-02-01 A
2 1 2007-02-02 A
3 1 2007-02-05 B
4 2 2000-05-01 C
5 2 2000-05-02 C
6 2 2000-05-10 C
7 2 2000-05-20 C
As you will see I've included a fixed_wind parameter - basically you can decide whether you always need to have the exact number of observations before and after to consider something an outlier.
By default this is FALSE, and when you increase the window to 2 in your example, it'll still replace D, but if you put it to TRUE, it'll keep it as it is (as there is only 1 observation before it in the group):
mydata %>%
group_by(agent_id) %>%
mutate(
office2 = replaceOutliers(x = office, window = 2),
office3 = replaceOutliers(x = office, window = 2, fixed_wind = TRUE)
)
Output:
# A tibble: 7 x 5
# Groups: agent_id [2]
agent_id date office office2 office3
<fct> <fct> <fct> <chr> <chr>
1 1 2007-02-01 A A A
2 1 2007-02-02 A A A
3 1 2007-02-05 B B B
4 2 2000-05-01 C C C
5 2 2000-05-02 D C D
6 2 2000-05-10 C C C
7 2 2000-05-20 C C C
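For the simplest case in the question (one observation before and one after), the rule can also be sketched directly in base R. This is a separate, hypothetical helper (replace_single_outliers), not the replaceOutliers function above, and it assumes office contains no NA values.

```r
# Hypothetical helper for the window = 1 case only
replace_single_outliers <- function(x) {
  x <- as.character(x)
  n <- length(x)
  if (n < 3) return(x)          # no interior positions to check
  i <- 2:(n - 1)
  # Interior positions whose two neighbours agree with each other
  # but differ from the value in between
  hit <- x[i - 1] == x[i + 1] & x[i] != x[i - 1]
  j <- i[which(hit)]
  x[j] <- x[j - 1]
  x
}

agent_id <- c("1", "1", "1", "2", "2", "2", "2")
office   <- c("A", "A", "B", "C", "D", "C", "C")

# Apply the rule separately within each agent
cleaned <- ave(office, agent_id, FUN = replace_single_outliers)
```

The same function can be used inside group_by(agent_id) %>% mutate(...) in place of ave.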