Splitting a column on a delimiter in R

I have a data frame as below. I want to split the last column into two. The split should happen only at the first ":"; any later colons don't matter.
The new data frame will have 4 columns: the 3rd column will be (a, b, d) and the 4th column will be (1, 2:3, 3:4:4).
Any suggestions? The 4th line of my code doesn't work :(. I am okay with a completely new solution or with corrections to line 4.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
as.data.frame(do.call(rbind, strsplit(df,":")))
--------------------update1
The solutions below work well, but I need a modified solution: I just realized that some cells in column 3 won't contain ":". In that case I want the text in that cell to appear only in the first of the two new columns.
employee <- c('John Doe','Peter Gynn','Jolie Hope')
salary <- c(3, 2, 1)
df <- data.frame(employee, salary, originalColumn = c("a :1", "b", "d: 3:4:4"))

You could use cSplit. On your updated data frame,
library(splitstackshape)
cSplit(df, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b NA
# 3: Jolie Hope 1 d 3:4:4
And on your original data frame,
df1 <- data.frame(employee, salary,
originalColumn = c("a :1", "b :2:3", "d: 3:4:4"))
cSplit(df1, "originalColumn", sep = ":{1}")
# employee salary originalColumn_1 originalColumn_2
# 1: John Doe 3 a 1
# 2: Peter Gynn 2 b 2:3
# 3: Jolie Hope 1 d 3:4:4
Note: I'm using splitstackshape version 1.4.2. I believe the sep argument has been changed from version 1.4.0

You could use extract from tidyr to split originalColumn into two columns. In the code below, I create 3 columns and then drop the unwanted one from the result.
library(tidyr)
pat <- "([^ :])( ?:|: ?|)(.*)"
extract(df, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Using the updated data frame (named df1 here to tell it apart from df):
extract(df1, originalColumn, c("Col1", "ColN", "Col2"), pat)[,-4]
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b
#3 Jolie Hope 1 d 3:4:4
Or without creating an extra column (mutate() comes from dplyr):
library(dplyr)
extract(df, originalColumn, c("Col1", "Col2"), "(.)[ :](.*)") %>%
mutate(Col2 = gsub("^[ :]", "", Col2)) # drop the leftover leading ':' or space
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
Based on the pattern in df, the code below also works. Here, the regex used to extract the first column is (.): the dot matches a single character at the beginning of the string, and because it is inside parentheses it is captured as Col1. Then .{2} matches the two characters that follow, which are discarded, and the rest, captured by (.*), forms Col2.
extract(df, originalColumn, c("Col1", "Col2"), "(.).{2}(.*)")
# employee salary Col1 Col2
#1 John Doe 3 a 1
#2 Peter Gynn 2 b 2:3
#3 Jolie Hope 1 d 3:4:4
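To see what each capture group of that last pattern picks up, the same regex can be tried in base R with sub() and backreferences. This is only an illustrative sketch: x, pat, col1 and col2 are names made up here, and the ^ and $ anchors are added so that sub() replaces the whole string.

```r
x <- c("a :1", "b :2:3", "d: 3:4:4")

# Group 1 = first character, .{2} = two skipped characters, group 2 = the rest
pat <- "^(.).{2}(.*)$"

col1 <- sub(pat, "\\1", x)  # keep only what group 1 captured
col2 <- sub(pat, "\\2", x)  # keep only what group 2 captured

data.frame(Col1 = col1, Col2 = col2)
```

Because the pattern is purely positional, it only suits data where the separator always occupies exactly the second and third characters, as in df.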
or using strsplit
as.data.frame(do.call(rbind, strsplit(as.character(df$originalColumn), " :|: ")))
# V1 V2
#1 a 1
#2 b 2:3
#3 d 3:4:4
For df1, here is a solution using strsplit
lst <- strsplit(as.character(df1$originalColumn), " :|: ")
as.data.frame(do.call(rbind,lapply(lst,
`length<-`, max(sapply(lst, length)))) )
# V1 V2
#1 a 1
#2 b <NA>
#3 d 3:4:4

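You were close. Here's a solution:
library(stringr)
# str_split_fixed() already returns a matrix, so do.call(rbind, ...) is not needed.
# n = 2 splits on the first ':' only; \\s* also drops the spaces around it.
df[, c('Col1','Col2')] <- str_split_fixed(df$originalColumn, "\\s*:\\s*", n = 2)
df$originalColumn <- NULL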
employee salary Col1 Col2
1 John Doe 3 a 1
2 Peter Gynn 2 b 2:3
3 Jolie Hope 1 d 3:4:4
Notes:
stringr::str_split_fixed() is more convenient than base::strsplit() here because you don't have to call as.character() first, and its n = 2 argument limits the split to the first ':'.
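For completeness, splitting on the first ':' only can also be sketched in base R without any packages. The snippet below is an assumption-laden illustration rather than one of the posted answers: has_sep is a helper name invented here, and rows without a ':' are given NA in the second column, which is one possible reading of the update in the question.

```r
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(3, 2, 1)
df <- data.frame(employee, salary,
                 originalColumn = c("a :1", "b", "d: 3:4:4"),
                 stringsAsFactors = FALSE)

x <- df$originalColumn
has_sep <- grepl(":", x, fixed = TRUE)   # does the cell contain a ':' at all?

# Everything before the first ':' (or the whole cell when there is none)
df$Col1 <- trimws(sub(":.*$", "", x))
# Everything after the first ':'; NA when there is nothing to split
df$Col2 <- ifelse(has_sep, trimws(sub("^[^:]*:", "", x)), NA_character_)

df$originalColumn <- NULL
df
```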

Related

How to rename columns in R with dplyr using a character object?

My data frame is as such:
#generic dataset
datatest <- data.frame(col1 = c(1,2,3,4), col2 = c('A', 'B', 'C', 'D'))
#character objects
name1 <- 'A'
name2 <- 'B'
I want to rename my columns using the name1 and name2 objects. These dynamically change in the code so I can't use the following:
#I DON'T WANT THIS
datatest %>% rename(A = col1, B = col2)
I want to use this:
datatest %>% rename(name1 = col1, name2 = col2)
but then the data table columns end up becoming 'name1' and 'name2' respectively, when they should be A and B. Here is the data table at the moment.
name1 (I want this to be A)    name2 (I want this to be B)
1                              A
2                              B
3                              C
4                              D
Any help is hugely appreciated. I have the same issue with kable tables too.
Thanks in advance!
Couple of options -
Using rename_with -
library(dplyr)
name1 <- 'A'
name2 <- 'B'
datatest %>% rename_with(~c(name1, name2), c(col1, col2))
#If there are only two columns in datatest
datatest %>% rename_with(~c(name1, name2))
# A B
#1 1 A
#2 2 B
#3 3 C
#4 4 D
Use a named vector
name <- c(A = 'col1', B = 'col2')
datatest %>% rename(!!name)
You may try
datatest %>% rename({{name1}} := col1, {{name2}} := col2)
A B
1 1 A
2 2 B
3 3 C
4 4 D
Here is one more option using !!!setNames
datatest %>%
rename(!!!setNames(names(.), c(name1, name2)))
A B
1 1 A
2 2 B
3 3 C
4 4 D
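Since name1 and name2 change at run time, the named-vector idea can also be built dynamically. The sketch below assumes dplyr 1.0 or later, where rename() accepts all_of() with a named character vector; lookup and renamed are names chosen here for illustration.

```r
library(dplyr)

datatest <- data.frame(col1 = c(1, 2, 3, 4), col2 = c("A", "B", "C", "D"))
name1 <- "A"
name2 <- "B"

# Names of the vector are the new column names, values are the old ones
lookup <- setNames(c("col1", "col2"), c(name1, name2))

renamed <- datatest %>% rename(all_of(lookup))
names(renamed)
```

all_of() raises an error if one of the old column names is missing, which tends to surface typos early.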

Find Groups by multiple cascadingly related conditions

Problem
I have some data. I would like to flag the same instance (e.g. a person, company, machine, whatever) in my data with a unique ID. The data actually has some IDs, but they are either not always present or one instance has several different IDs.
What I am trying to achieve is to use these IDs along with individual information to find the same instance and assign a unique ID to it.
I found a solution, but it is highly inefficient. I would appreciate both tips to improve the performance of my code and, probably more promising, another approach.
Code
Example Data
library(data.table)
dt1 <- data.table(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"))
dt1
#> id1 id2 surname firstname
#> 1: 1 A Smith John
#> 2: 1 B Smith John
#> 3: 2 A Smith Joe
#> 4: 3 C Smith Joe
#> 5: 4 D Smith Jack
Current Solution
find_grp <- function(dt,
                     by) {
  # keep necessary variables only
  dtx <- copy(dt)[, .SD, .SDcols = c(unique(unlist(by)))]
  # unique data.table to improve performance
  dtx <- unique(dtx)
  # assign a row id column
  dtx[, ID := .I]
  # for every row and every by group, find all rows that match each row
  # on at least one condition
  res <- lapply(X = dtx$ID,
                FUN = function(i){
                  unique(unlist(lapply(X = by,
                                       FUN = function(by_sub) {
                                         merge(dtx[ID == i, ..by_sub],
                                               dtx,
                                               by = by_sub,
                                               all = FALSE)$ID
                                       }
                  )))
                })
  res
  print("merge done")
  # keep all unique matching rows
  l <- unique(res)
  # combine matching rows together, if there is at least one overlap between
  # two groups.
  # repeat until all row-groups are completely disjoint from one another
  repeat{
    l1 <- l
    iterator <- seq_len(length(l1))
    for (i in iterator) {
      for (ii in iterator[-i]) {
        # is there any overlap between both row-groups
        if (length(intersect(l1[[i]], l1[[ii]])) > 0) {
          l1[[i]] <- sort(union(l1[[i]], l1[[ii]]))
        }
      }
    }
    if (isTRUE(all.equal(l1, l))) {
      break
    } else {
      l <- unique(l1)
    }
  }
  print("repeat done")
  # use result to assign a groupId to the helper data.table
  Map(l,
      seq_along(l),
      f = function(ll, grp) dtx[ID %in% ll, ID_GRP := grp])
  # remove helper Id
  dtx[, ID := NULL]
  # assign the groupId to the original data.table
  dt_out <- copy(dt)[dtx,
                     on = unique(unlist(by)),
                     ID_GRP := ID_GRP]
  return(dt_out[])
}
Result
find_grp(dt1, by = list("id1",
"id2"
, c("surname", "firstname"))
)
#> [1] "merge done"
#> [1] "repeat done"
#> id1 id2 surname firstname ID_GRP
#> 1: 1 A Smith John 1
#> 2: 1 B Smith John 1
#> 3: 2 A Smith Joe 1
#> 4: 3 C Smith Joe 1
#> 5: 4 D Smith Jack 2
As you can see, ID_GRP is identified because:
- the first two rows share id1;
- since the id2 values for id1 = 1 contain A, row 3 with id2 = A belongs to the same group;
- all Joe Smith rows belong to the same group as well, because that is the name in row 3;
- and so on;
- only row 5 is completely unrelated.
{data.table} solutions are preferred
This might help you, although I'm not sure I've completely understood your question. I've written a function, gen_grp(), that takes a data table d and a vector of variables v. It steps through each unique id1 and replaces id1 values if matches of certain types are found.
gen_grp <- function(d,v) {
for(id in unique(d$id1)) {
d[id2 %in% d[id1==id,id2], id1:=id]
k=unique(d[id1==id, ..v])[,t:=id]
d = k[d, on=v][!is.na(t), id1:=t][, t:=NULL]
}
d[, grp:=rleid(id1)]
return(d[])
}
Usage:
gen_grp(dt1,c("surname","firstname"))
Output:
surname firstname id1 id2 grp
<char> <char> <num> <char> <int>
1: Smith John 1 A 1
2: Smith John 1 B 1
3: Smith Joe 1 A 1
4: Smith Joe 1 C 1
5: Smith Jack 4 D 2
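Another way to look at the problem is as finding connected components: rows are nodes, and two rows are linked whenever they agree on any one of the key sets. The sketch below uses a small union-find in base R to illustrate that idea. It is not the approach of either post above; link_groups() is a hypothetical helper name, and it assumes that missing values should be treated like ordinary values (paste() turns NA into the string "NA"), which may not be what you want if IDs are often absent.

```r
# Hypothetical helper (not from the thread): label rows that are linked,
# directly or transitively, by sharing a value in any key set listed in `by`.
link_groups <- function(df, by) {
  df <- as.data.frame(df)            # also accepts a data.table
  n <- nrow(df)
  parent <- seq_len(n)               # union-find forest: one node per row
  find <- function(i) {              # follow parent links up to the root
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (cols in by) {
    key <- do.call(paste, c(df[cols], sep = "\r"))  # composite key per row
    first <- match(key, key)         # first row that carries the same key
    for (i in seq_len(n)) {
      a <- find(i)
      b <- find(first[i])
      if (a != b) parent[max(a, b)] <- min(a, b)    # merge the two trees
    }
  }
  roots <- vapply(seq_len(n), find, integer(1))
  match(roots, unique(roots))        # relabel roots as 1, 2, 3, ...
}

dt1 <- data.frame(id1 = c(1, 1, 2, 3, 4),
                  id2 = c("A", "B", "A", "C", "D"),
                  surname = "Smith",
                  firstname = c("John", "John", "Joe", "Joe", "Jack"),
                  stringsAsFactors = FALSE)

dt1$ID_GRP <- link_groups(dt1, by = list("id1", "id2", c("surname", "firstname")))
dt1
```

Each key set is scanned once, so the pairwise comparison of row groups in the repeat loop is avoided. Path compression is left out to keep the sketch short.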

how to separate a column into multiple columns and change the results from characters to numbers

  id       initiativen
1 abc      2a
2 cde      2b
3 efd      a
4 geh      c
5 jytd     5v
6 jydjytd  e
Hello, I have something similar to this, just a lot bigger, and I was wondering what the most efficient way is to divide the column initiativen into two columns, one containing the numbers (2,2,5,4) and one containing the letters or the blank space. It has to be a general formula, as the data frame I need to apply it to is quite big. The letters correspond to a particular initiative number, but the first initiative number is not indicated, so "a" corresponds to initiative number 2.
I would love it to look something like this, with the letters substituted by numbers (blank=1, a=2, b=3 etc.):
id        initiativen  question
abc       2            2
cde       3            2
efd       2            N/A
geh       4            N/A
jytd      23           5
jydjytd   6            N/A
bfdhslbf  1            3
I have tried to use "separate" but it doesn't really work and doesn't solve the problem of the first initiative having no corresponding letter.
Any help or suggestion would be extremely welcomed and helpful.
Thank you so much:)
How about the following tidyverse solution?
library(tidyverse);
df %>%
separate(initiativen, into = c("p1", "p2"), sep = "(?<=[0-9])(?=[a-z])") %>%
mutate(
initiativen = case_when(
str_detect(p1, "[a-z]") ~ p1,
str_detect(p2, "[a-z]") ~ p2),
question = case_when(
str_detect(p1, "[0-9]") ~ p1,
str_detect(p2, "[0-9]") ~ p2)) %>%
mutate(initiativen = ifelse(is.na(initiativen), 1, match(initiativen, letters) + 1)) %>%
select(-p1, -p2)
# id initiativen question
#1 abc 2 2
#2 cde 3 2
#3 efd 2 <NA>
#4 geh 4 <NA>
#5 jytd 23 5
#6 jydjytd 6 <NA>
#7 vbdjfkb 1 4
Note that the warning can be safely ignored, as it stems from the fields that are missing when separate() finds nothing to split.
Explanation: We use a positive look-behind and look-ahead to split entries in initiativen into two parts p1 and p2; we then fill initiativen and question with entries from p1 or p2 depending on whether they contain a number "[0-9]" or a character "[a-z]"; convert characters to numbers with match(initiativen, letters) and finally clean the data.frame.
Sample data
df <- read.table(text =
" id initiativen
1 abc 2a
2 cde 2b
3 efd a
4 geh c
5 jytd 5v
6 jydjytd e
7 vbdjfkb 4", row.names = 1)
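The split-then-map logic from the explanation can also be sketched in base R, this time converting question to an actual number as the title asks. This is an illustration under the assumption that every entry is at most some digits followed by one lowercase letter; letter and digits are helper names invented here.

```r
df <- data.frame(
  id = c("abc", "cde", "efd", "geh", "jytd", "jydjytd", "vbdjfkb"),
  initiativen = c("2a", "2b", "a", "c", "5v", "e", "4"),
  stringsAsFactors = FALSE
)

letter <- gsub("[0-9]", "", df$initiativen)  # letter part, "" when absent
digits <- gsub("[a-z]", "", df$initiativen)  # digit part, "" when absent

# blank = 1, a = 2, b = 3, ... (match() gives the position in the alphabet)
df$initiativen <- ifelse(nzchar(letter), match(letter, letters) + 1L, 1L)
# numeric question number, NA when no digits were present
df$question <- ifelse(nzchar(digits), as.integer(digits), NA_integer_)

df
```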
Using data.table
# Step one
setDT(df)
df[, ":="(
question = gsub("[a-z]", "", initiativen),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2
4: geh 4
5: jytd 23 5
6: jydjytd 6
7: vbdjfkb 1 4
# Then some tidying
df[, question := ifelse(nzchar(question), question, NA)]
df
id initiativen question
1: abc 2 2
2: cde 3 2
3: efd 2 <NA>
4: geh 4 <NA>
5: jytd 23 5
6: jydjytd 6 <NA>
7: vbdjfkb 1 4
Data
df <- data.frame(
id = c("abc", "cde", "efd", "geh", "jytd", "jydjytd", "vbdjfkb"),
initiativen = c("2a", "2b", "a", "c", "5v", "e", "4"),
stringsAsFactors = FALSE
)
Edit
Can also be done in one step:
df[, question := gsub("[a-z]", "", initiativen)
][, ":="(
question = ifelse(nzchar(question), question, NA),
initiativen = match(gsub("[0-9]", "", initiativen), letters, nomatch = 0) + 1L
)
]
For the second column you can use regular expression to only keep numeric values:
df$initiativen <- gsub("[^0-9]", "", df$initiativen)

R function that merges rows and introduces a new merge variable

I have a data set like this....
ID Brand
--- --------
1 Cokacola
2 Pepsi
3 merge with 1
4 merge with 2
5 merge with 1
6 Fanta
And I want to write an R function that merges the rows and introduces a new variable according to ID, like the following...
ID Brand merge
---- -------- --------
1 Cokacola 1,3,5
2 Pepsi 2,4
6 Fanta 6
Your data:
dat <- data.frame(
id = 1:6,
brand = c('Cokacola', 'Pepsi', 'merge with 1', 'merge with 2', 'merge with 1', 'Fanta'))
Inelegant-but-functional code:
repeats <- grepl('^merge with', dat$brand)
groups <- ifelse(repeats, gsub('merge with ', '', dat$brand), dat$id)
merge <- sapply(unique(groups), function(x) paste(dat$id[groups==x], collapse=','))
dat <- dat[!repeats,]
dat$merge <- merge
dat
## id brand merge
## 1 1 Cokacola 1,3,5
## 2 2 Pepsi 2,4
## 6 6 Fanta 6
There are most certainly ways to make this more elegant, depending on the consistency and makeup of the data.
You could try
library(reshape2)
indx <- !grepl('merge', df$Brand)
df1 <- df[indx,]
val <- as.numeric(sub('[^0-9]+', '', df[!indx, 'Brand']))
ml <- melt(tapply(which(!indx), val, FUN=toString))
df2 <- merge(df1, ml, by.x='ID', by.y='Var1', all=TRUE)
df2$merge <- with(df2, ifelse(!is.na(value),
paste(ID, value, sep=', '), ID))
df2[-3]
# ID Brand merge
#1 1 Cokacola 1, 3, 5
#2 2 Pepsi 2, 4
#3 6 Fanta 6
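Both answers rely on the merged values coming out in the same order as the remaining rows. A variant that looks the merged string up by id instead is sketched below in base R. It assumes every "merge with N" row points at an id that exists as a real brand row; is_ref, target, merged and out are names made up for this example.

```r
dat <- data.frame(
  id = 1:6,
  brand = c("Cokacola", "Pepsi", "merge with 1", "merge with 2",
            "merge with 1", "Fanta"),
  stringsAsFactors = FALSE
)

is_ref <- grepl("^merge with ", dat$brand)

# id of the row each record belongs to: its own id, or the referenced one
target <- dat$id
target[is_ref] <- as.integer(sub("^merge with ", "", dat$brand[is_ref]))

# collapse the ids per target, then look the result up by id
merged <- tapply(dat$id, target, paste, collapse = ",")
out <- dat[!is_ref, ]
out$merge <- as.vector(merged[as.character(out$id)])
out
```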

Multiple Values in One Cell using R

Suppose, there are 2 data.frames, for instance:
dat1 <- read.table("[path_dat1]", header=TRUE, sep=",")
id name age
1 Jack 21
2 James 40
dat2 <- read.table("[path_dat2]", header=TRUE, sep=",")
id interests
1 football
1 basketball
1 soccer
2 pingpang ball
How do I join table 1 and table 2 into a data.frame like the one below?
id name age interests
1 1 Jack 21 (football, basketball, soccer)
2 2 James 40 (pingpang ball)
How can I join these using plyr in the simplest way?
I can't tell you how to solve this in plyr but can in base:
dat3 <- aggregate(interests~id, dat2, paste, collapse=",")
merge(dat1, dat3, "id")
EDIT: If you really want the parentheses you could use:
ppaste <- function(x) paste0("(", gsub("^\\s+|\\s+$", "", paste(x, collapse = ",")), ")")
dat3 <- aggregate(interests~id, dat2, ppaste)
merge(dat1, dat3, "id")
Using Tyler's example:
dat1$interests <- ave(dat1$id, dat1$id,
FUN=function(x) paste(dat2[ dat2$id %in% x, "interests"], collapse=",") )
> dat1
id name age interests
1 1 Jack 21 football, basketball, soccer
2 2 James 40 pingpang ball
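Since the question asked about plyr specifically, here is a minimal sketch of how it might look with plyr::ddply() and plyr::join(). The two data frames are typed in by hand because the original file paths are placeholders; agg and res are names chosen here, and the parentheses and ", " separator follow the desired output in the question.

```r
dat1 <- data.frame(id = c(1, 2), name = c("Jack", "James"), age = c(21, 40),
                   stringsAsFactors = FALSE)
dat2 <- data.frame(id = c(1, 1, 1, 2),
                   interests = c("football", "basketball", "soccer",
                                 "pingpang ball"),
                   stringsAsFactors = FALSE)

# One row per id, with all interests pasted into a single string
agg <- plyr::ddply(dat2, "id", plyr::summarise,
                   interests = paste0("(", paste(interests, collapse = ", "), ")"))

# Left join keeps every row of dat1, even ids without any interests
res <- plyr::join(dat1, agg, by = "id", type = "left")
res
```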

Resources