Separate values in a column into multiple columns name and column value - r

I would like to split a certain format of data from one column into multiple columns. Below are my sample data:
df = data.frame(id=c(1,2),data=c('apple:A%1^B%2^C%3_orange:A%1^B%2',
'apple:A%1^B%2^D%3_orange:A%3^B%2'))
# id data
# 1 apple:A%1^B%2^C%3_orange:A%1^B%2
# 2 apple:A%1^B%2^D%3_orange:A%3^B%2
which should then give the following output:
id data_apple_A data_apple_B data_apple_C data_apple_D data_orange_A data_orange_B
1 1 2 3 1 2
2 1 2 3 1 2
I have been able to do this, but my method loops through each row, applies str_split for each separator to extract that row's data, and appends the result to the final output data frame. This is very slow, considering I will have 500k rows and 20 input columns.
I don't think my for loop is the proper R way to code this use case. Any help will be appreciated.

We can use cSplit with str_extract
library(splitstackshape)
library(zoo)
library(stringr)
dt <- cSplit(df, 'data', "\\^|_", fixed = FALSE, "long")[, c('grp', 'grp2', 'val')
:= .(na.locf(str_extract(data, "^[A-Za-z]+(?=:)")),
str_extract(data, "[A-Z](?=[%])"), as.numeric(str_extract(data, "\\d+"))) ][]
dcast(dt, id ~ paste0("data_", grp) + grp2, value.var = 'val', sep = "_", fill = 0)
# id data_apple_A data_apple_B data_apple_C data_apple_D data_orange_A data_orange_B
#1: 1 1 2 3 0 1 2
#2: 2 1 2 0 3 3 2
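For comparison, here is a minimal base R sketch of the same split-and-reshape idea, assuming the exact separator scheme from the question ("_" between groups, ":" after the group name, "^" between entries, "%" between key and value). The helper name parse_one is made up for this illustration. It still visits each row through lapply, so it is not claimed to be faster than the cSplit version, and it leaves missing combinations as NA instead of 0.

```r
df <- data.frame(
  id = c(1, 2),
  data = c("apple:A%1^B%2^C%3_orange:A%1^B%2",
           "apple:A%1^B%2^D%3_orange:A%3^B%2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn one string into a named numeric vector,
# e.g. c(data_apple_A = 1, data_apple_B = 2, ...)
parse_one <- function(s) {
  groups <- strsplit(s, "_", fixed = TRUE)[[1]]
  out <- lapply(groups, function(g) {
    parts <- strsplit(g, ":", fixed = TRUE)[[1]]          # group name, entries
    kv <- strsplit(strsplit(parts[2], "^", fixed = TRUE)[[1]], "%", fixed = TRUE)
    vals <- as.numeric(vapply(kv, `[`, character(1), 2))  # the values
    names(vals) <- paste("data", parts[1],
                         vapply(kv, `[`, character(1), 1), sep = "_")
    vals
  })
  unlist(out)
}

parsed <- lapply(df$data, parse_one)

# Union of all column names seen in any row
cols <- sort(unique(unlist(lapply(parsed, names))))

# Index each parsed row by the full column set; absent names become NA
wide <- t(vapply(parsed, function(p) p[cols], numeric(length(cols))))
colnames(wide) <- cols

result <- cbind(df["id"], wide)
result
```

Replacing the NA cells with 0 afterwards (result[is.na(result)] <- 0) would mirror the fill = 0 behaviour of the dcast call above.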

Related

Find a set of column names and replace them with new names using dplyr

I have below data frame
library(dplyr)
data = data.frame('A' = 1:3, 'CC' = 1:3, 'DD' = 1:3, 'M' = 1:3)
Now let define a vectors of strings which represents a subset of column names of above data frame
Target_Col = c('CC', 'M')
Now I want to find the column names in data that match with Target_Col and then replace them with
paste0('Prefix_', Target_Col)
I prefer to do it using dplyr chain rule.
Is there any direct function available to perform this?
vars<-cbind.data.frame(Target_Col,paste0('Prefix_', Target_Col))
data <- data %>%
rename_at(vars$Target_Col, ~ vars$`paste0("Prefix_", Target_Col)`)
or
data %>% rename_with(~ paste0('Prefix_', Target_Col), all_of(Target_Col))
We may use
library(stringr)
library(dplyr)
data %>%
rename_with(~ str_c('Prefix_', .x), all_of(Target_Col))
A Prefix_CC DD Prefix_M
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
With dplyr's rename_with
library(dplyr)
rename_with(data, function(x) ifelse(x %in% Target_Col, paste0("Prefix_", x), x))
A Prefix_CC DD Prefix_M
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
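If a dplyr chain is not a hard requirement, the same rename can be sketched in base R by modifying names() directly. This is only an illustration using the question's own data and Target_Col; the variable hit is introduced here.

```r
data <- data.frame(A = 1:3, CC = 1:3, DD = 1:3, M = 1:3)
Target_Col <- c("CC", "M")

# Logical mask of the columns whose names appear in Target_Col
hit <- names(data) %in% Target_Col

# Prefix only the matching names; all other columns are left untouched
names(data)[hit] <- paste0("Prefix_", names(data)[hit])

names(data)
```

Because the mask is built from the data frame's own names, entries of Target_Col that do not exist in data are silently ignored rather than raising an error.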

Split a column list into columns

Suppose I have a DT as -
id values valid_types
1 2|3 100|200
2 4 200
3 2|1 500|100
The valid_types column tells me which types are valid for each row. There are four types in total (100, 200, 500, 2000). Each entry lists its valid types and the corresponding values as |-separated character strings.
I want to transform this to a DT which has the types as columns and their corresponding values.
Expected:
id 100 200 500
1 2 3 NA
2 NA 4 NA
3 1 NA 2
I thought I could take both the columns and split them on | which would give me two lists. I would then combine them by setting the keys as names of the types list and then convert the final list to a DT.
But the idea I came up with is very convoluted and not really working.
Is there a better/easier way to do this ?
Here is another data.table approach:
dcast(
DT[, lapply(.SD, function(x) strsplit(x, "\\|")[[1L]]), by = id],
id ~ valid_types, value.var = "values"
)
Using tidyr library you can use separate_rows with pivot_wider :
library(tidyr)
df %>%
separate_rows(values, valid_types, sep = '\\|', convert = TRUE) %>%
pivot_wider(names_from = valid_types, values_from = values)
# id `100` `200` `500`
# <int> <int> <int> <int>
#1 1 2 3 NA
#2 2 NA 4 NA
#3 3 1 NA 2
A data.table way would be :
library(data.table)
library(splitstackshape)
setDT(df)
dcast(cSplit(df, c('values', 'valid_types'), sep = '|', direction = 'long'),
id~valid_types, value.var = 'values')
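The split-then-widen idea can also be sketched in base R without extra packages, assuming that values and valid_types always contain the same number of |-separated items per row (as in the sample). The names long, all_types and wide are introduced only for this illustration.

```r
DT <- data.frame(
  id = 1:3,
  values = c("2|3", "4", "2|1"),
  valid_types = c("100|200", "200", "500|100"),
  stringsAsFactors = FALSE
)

# Split both columns on the literal "|" character
vals  <- strsplit(DT$values, "|", fixed = TRUE)
types <- strsplit(DT$valid_types, "|", fixed = TRUE)

# Long format: one row per (id, type, value) triple
long <- data.frame(
  id    = rep(DT$id, lengths(vals)),
  type  = unlist(types),
  value = as.integer(unlist(vals)),
  stringsAsFactors = FALSE
)

# Empty id x type matrix, then fill the observed cells by (row, column) index
all_types <- sort(unique(long$type))
wide <- matrix(NA_integer_, nrow = nrow(DT), ncol = length(all_types),
               dimnames = list(NULL, all_types))
wide[cbind(match(long$id, DT$id), match(long$type, all_types))] <- long$value

result <- data.frame(id = DT$id, wide, check.names = FALSE)
result
```

Types that never occur in the data (2000 in the sample) do not get a column here, which matches the expected output in the question.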

Dummify character column, BUT with unequal number of categories in each row [duplicate]

I have a dataframe with the following structure
test <- data.frame(col = c('a; ff; cc; rr;', 'rr; a; cc; e;'))
Now I want to create a data frame from this with one named column for each unique value in the test data frame. Values are terminated by the ';' character and separated by a space, which is not part of the value. Then, for each row, I wish to fill these dummy columns with either a 1 or a 0, as shown below:
data.frame(a = c(1,1), ff = c(1,0), cc = c(1,1), rr = c(1,1), e = c(0,1))
a ff cc rr e
1 1 1 1 1 0
2 1 0 1 1 1
I tried creating a df using for loops and the unique values in the column, but it's getting too messy. I have a vector available containing the unique values of the column. The problem is how to create the ones and zeros. I tried a mutate_all() function with grep(), but this did not work.
I'd use splitstackshape and mtabulate from qdapTools packages to get this as a one liner,
i.e.
library(splitstackshape)
library(qdapTools)
mtabulate(as.data.frame(t(cSplit(test, 'col', sep = ';', 'wide'))))
# a cc ff rr e
#V1 1 1 1 1 0
#V2 1 1 0 1 1
It can also be done entirely in splitstackshape, as @A5C1D2H2I1M1N2O1R2T1 mentions in the comments,
cSplit_e(test, "col", ";", mode = "binary", type = "character", fill = 0)
Here's a possible data.table implementation. First we split the rows into columns, melt them into a single column, and then spread it wide while counting the events for each row
library(data.table)
test2 <- setDT(test)[, tstrsplit(col, "; |;")]
dcast(melt(test2, measure = names(test2)), rowid(variable) ~ value, length)
# variable a cc e ff rr
# 1: 1 1 1 0 1 1
# 2: 2 1 1 1 0 1
Here's a base R approach:
x <- strsplit(as.character(test$col), ";\\s?") # split the strings
lvl <- unique(unlist(x)) # get unique elements
x <- lapply(x, factor, levels = lvl) # convert to factor
t(sapply(x, table)) # count elements and transpose
# a ff cc rr e
#[1,] 1 1 1 1 0
#[2,] 1 0 1 1 1
We can do this with tidyverse
library(tidyverse)
rownames_to_column(test, 'grp') %>%
separate_rows(col) %>%
filter(col!="") %>%
count( grp, col) %>%
spread(col, n, fill = 0) %>%
ungroup() %>%
select(-grp)
# A tibble: 2 × 5
# a cc e ff rr
#* <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 0 1 1
#2 1 1 1 0 1
Here is a base R solution. First remove the spaces, then collect all the unique values. Split each row and check which of the unique values (stored in cols) are present in it. This gives a logical matrix, which can easily be converted to integers.
test=as.data.frame(apply(test,2,function(x)gsub('\\s+', '',x)))
cols=unique(unlist(strsplit(as.character(test$col), split = ';')))
yy=strsplit(as.character(test$col), split = ';')
z=as.data.frame(do.call(rbind, lapply(yy, function(x) cols %in% x)))
names(z)=cols
z=as.data.frame(lapply(z, as.integer))
Another approach with tidytext and tidyverse
library(tidyverse)
library(tidytext) #for unnest_tokens()
df <- test %>%
unnest_tokens(word, col) %>%
rownames_to_column(var="row") %>%
mutate(row = floor(parse_number(row)),
val = 1) %>%
spread(word, val, fill = 0) %>%
select(-row)
df
# a cc e ff rr
#1 1 1 0 1 1
#2 1 1 1 0 1
Another simple solution without any extra packages:
x = c('a; ff; cc; rr;', 'rr; a; cc; e;')
G = lapply(strsplit(x,';'), trimws)
dict = sort(unique(unlist(G)))
do.call(rbind, lapply(G, function(g) 1*sapply(dict, function(d) d %in% g)))

determine duplicate rows whose at least one row has different value in a column [duplicate]

I have data with a grouping variable ("from") and values ("number"):
from number
1 1
1 1
2 1
2 2
3 2
3 2
I want to subset the data and select groups which have two or more unique values. In my data, only group 2 has more than one distinct 'number', so this is the desired result:
from number
2 1
2 2
Several possibilities, here's my favorite
library(data.table)
setDT(df)[, if(+var(number)) .SD, by = from]
# from number
# 1: 2 1
# 2: 2 2
Basically, for each group we check whether there is any variance; if there is, we return the group's rows
With base R, I would go with
df[as.logical(with(df, ave(number, from, FUN = var))), ]
# from number
# 3 2 1
# 4 2 2
Edit: for non-numeric data you could try the new uniqueN function from the devel version of data.table (or use length(unique(number)) > 1 instead)
setDT(df)[, if(uniqueN(number) > 1) .SD, by = from]
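The length(unique(number)) > 1 test can also be written in base R. The sketch below is one possible way, using ave() over row positions so that it works for non-numeric values too; the sample data is rebuilt from the question and n_unique is a name introduced here.

```r
df <- data.frame(from = c(1, 1, 2, 2, 3, 3),
                 number = c(1, 1, 1, 2, 2, 2))

# For every row, count the distinct 'number' values within its 'from' group.
# ave() is applied to row positions, so 'number' may be of any type.
n_unique <- ave(seq_along(df$number), df$from,
                FUN = function(i) length(unique(df$number[i])))

# Keep only rows belonging to groups with two or more distinct values
result <- df[n_unique > 1, ]
result
```

Unlike the variance check, this does not depend on number being numeric, and a single-row group simply yields a count of 1 rather than an NA variance.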
You could try
library(dplyr)
df1 %>%
group_by(from) %>%
filter(n_distinct(number)>1)
# from number
#1 2 1
#2 2 2
Or using base R
indx <- rowSums(!!table(df1))>1
subset(df1, from %in% names(indx)[indx])
# from number
#3 2 1
#4 2 2
Or
df1[with(df1, !ave(number, from, FUN=anyDuplicated)),]
# from number
#3 2 1
#4 2 2
Using the concept of variance shared by David, but doing it the dplyr way:
library(dplyr)
df %>%
group_by(from) %>%
mutate(variance=var(number)) %>%
filter(variance!=0) %>%
select(from,number)
#Source: local data frame [2 x 2]
#Groups: from
#from number
#1 2 1
#2 2 2

Assigning values to patterns of letters in character strings using R

I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
We create a named vector ('nm1'), split the 'shotchart' to extract only 'M' and then use the named vector to change the values to get the sum
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and score each run as its count of "M" characters minus 1, as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
score = NA_integer_,
stringsAsFactors = FALSE)
# sapply (rather than lapply) returns a numeric vector instead of a list column
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
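The run-extraction idea (pull out every run of "M", then look each run up in a named vector) can also be sketched in base R with gregexpr() and regmatches(). The lookup vector points below plays the role of nm1 and is named differently only to keep this example self-contained.

```r
shotchart <- c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM")

# Value of each run pattern, as given in the question
points <- c(MM = 1, MMM = 2, MMMM = 3)

# Extract every maximal run of consecutive "M" characters per string
runs <- regmatches(shotchart, gregexpr("M+", shotchart))

# Look up each run; runs without an assigned value (e.g. a single "M")
# become NA and are dropped by na.rm = TRUE
score <- vapply(runs, function(r) sum(points[r], na.rm = TRUE), numeric(1))
score
# 4 4 3 5
```

Note that the two strategies differ for runs longer than four: the lookup ignores them because they have no assigned value, while the nchar - 1 rule keeps extending the scale.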
