Combine multiple columns keeping variable name as part of data

I have data as below
df=data.frame(
Id=c("001","002","003","004"),
author=c('John Cage','Thomas Carlyle'),
circa=c('1988', '1817'),
quote=c('I cant understand why people are frightened of new ideas. Im frightened of the old ones.',
'My books are friends that never fail me.')
)
df
I would like to combine 3 columns to obtain the data frame below
df2 = data.frame(
Id=c("001","002"),
text = c(
'Author:
John Cage
Circa:
1988
quote:
I cant understand why people are frightened of new ideas. Im frightened of the old ones.
',
'Author:
Thomas Carlyle
Circa:
1817
quote:
My books are friends that never fail me.
'
)
)
df2
I am aware I can use paste or unite from tidyr, but how can I pass the column names to be within the new created column?

You can get the data in long format and then paste by group.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = -Id) %>%
group_by(Id) %>%
summarise(text = paste(name, value, sep = ":", collapse = "\n"))
# A tibble: 4 x 2
# Id text
# <fct> <chr>
#1 001 "author:John Cage\ncirca:1988\nquote:I cant understand why people are f…
#2 002 "author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that nev…
#3 003 "author:John Cage\ncirca:1988\nquote:I cant understand why people are f…
#4 004 "author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that nev…

Here is a base R solution that uses paste0(). The inner apply() prefixes each value with its column name, and the outer apply() joins the pieces of each row with newlines.
res <- cbind(df[1],text = apply(apply(df[-1], 1, function(v) paste0(names(df[-1]),": ",v)), 2, paste0, collapse = "\n"))
which gives
> res
Id text
1 001 author: John Cage\ncirca: 1988\nquote: I cant understand why people are frightened of new ideas. Im frightened of the old ones.
2 002 author: Thomas Carlyle\ncirca: 1817\nquote: My books are friends that never fail me.
DATA
df <- structure(list(Id = structure(1:2, .Label = c("001", "002"), class = "factor"),
author = structure(1:2, .Label = c("John Cage", "Thomas Carlyle"
), class = "factor"), circa = structure(2:1, .Label = c("1817",
"1988"), class = "factor"), quote = structure(1:2, .Label = c("I cant understand why people are frightened of new ideas. Im frightened of the old ones.",
"My books are friends that never fail me."), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))

We can use melt in data.table
library(data.table)
melt(setDT(df), id.var = 'Id')[, .(text = paste(variable,
value, sep=":", collapse="\n")), Id]
# Id text
#1: 001 author:John Cage\ncirca:1988\nquote:I cant understand why people are frightened of new ideas. Im frightened of the old ones.
#2: 002 author:Thomas Carlyle\ncirca:1817\nquote:My books are friends that never fail me.
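
The answers above put the name and the value on the same line (author:John Cage). If the label should sit on its own line, as in the df2 requested in the question, a minimal base R sketch could look like the following. It assumes character (not factor) columns and keeps the column names as they are (lower case); the helper names cols and pieces are only illustrative.

```r
df <- data.frame(
  Id     = c("001", "002"),
  author = c("John Cage", "Thomas Carlyle"),
  circa  = c("1988", "1817"),
  quote  = c("I cant understand why people are frightened of new ideas. Im frightened of the old ones.",
             "My books are friends that never fail me."),
  stringsAsFactors = FALSE
)

# every column except the identifier goes into the text
cols <- setdiff(names(df), "Id")

# one character vector per column, each element "<name>:\n<value>"
pieces <- Map(function(nm, val) paste0(nm, ":\n", val), cols, df[cols])

# join the per-column pieces row by row, separated by newlines
df2 <- data.frame(
  Id   = df$Id,
  text = do.call(paste, c(unname(pieces), sep = "\n")),
  stringsAsFactors = FALSE
)

cat(df2$text[2])
```

Because cols is derived from names(df), the same code should carry over to additional columns without changes.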

Related

Using str_count for multiple strings for "check all that apply" in survey data

Qualtrics' check all that applies produces per respondent a comma-separated cell with all the options each person clicked on. If I make a count table I get something like this, but with 10 unique choices:
Combinations                                  Count
for news                                      32
for news, talk to friends                     7
for news, find information                    14
for news, talk to friends, find information   5
talk to friends                               55
I want to count each option so that for news = 32+7+14+5, talk to friends = 7+5+55 etc.
I can do df$Var %>% str_count("for news") %>% table() but this would have to be done 10 times. When I try to use mutate instead, I get an error (Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "factor"):
df$Var%>%mutate(news = str_count("for news"),
friends = str_count("friends"),
info = str_count("information"))
Do I have to str_extract using the comma, creating individual columns, and then pivot_long or is there a way to make different str_counts in one go?

If every option is separated by a comma followed by a space (that is, for news, talk to friends, find information rather than for news, talk to friends,find information), and your data is called df,
library(dplyr)
library(stringr)
x <- rep(df$Combinations, df$Count)
x %>% str_split(., ", ") %>% unlist %>% table
find information         for news  talk to friends
              19               58               67
or by using str_count,
y <- c("for news", "talk to friends", "find information")
sapply(y, function(X) str_count(x, X)) %>% colSums()
        for news  talk to friends find information
              58               67               19

Split the comma-separated strings into new rows using separate_rows, and for each value of Combinations sum its Count.
library(dplyr)
library(tidyr)
df %>%
separate_rows(Combinations, sep = ',\\s*') %>%
group_by(Combinations) %>%
summarise(Count = sum(Count))
# Combinations Count
# <chr> <int>
#1 find information 19
#2 for news 58
#3 talk to friends 67
data
It is easier to help if you provide data in a reproducible format -
df <- structure(list(Combinations = c("for news", "for news, talk to friends",
"for news, find information", "for news, talk to friends,find information",
"talk to friends"), Count = c(32L, 7L, 14L, 5L, 55L)), row.names = c(NA,
-5L), class = "data.frame")

df %>%
separate_rows(Var, sep = ',\\s*') %>%
group_by(Var) %>% tally()
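
The mutate() error in the question arises because df$Var is a single column (a factor), not a data frame, and str_count() also needs the string as its first argument. For respondent-level data, all options can be checked in one call. Below is a small base R sketch of that idea. The data frame responses and the short names in opts are made up for illustration, and the substring match assumes that no option text is contained in another.

```r
# hypothetical respondent-level data: one comma-separated cell per person
responses <- data.frame(
  Var = factor(c("for news",
                 "for news, talk to friends",
                 "talk to friends",
                 "for news, find information"))
)

# hypothetical short names mapped to the option text
opts <- c(news = "for news", friends = "talk to friends", info = "find information")

# logical matrix: one row per respondent, one column per option
picked <- sapply(opts, function(o) grepl(o, as.character(responses$Var), fixed = TRUE))

# number of respondents who ticked each option
totals <- colSums(picked)
print(totals)
```

With ten options only the opts vector grows; the counting code stays the same.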

Pivot_longer: Rotating multiple columns of data with same data types

I'm trying to rotate multiple columns of data into single, data-type consistent columns.
I've created a minimum example below.
library(tibble)
library(dplyr)
library(tidyr) # needed for pivot_longer() and pivot_wider()
# I have data like this
df <- tibble(contact_1_prefix=c('Mr.','Mrs.','Dr.'),
contact_2_prefix=c('Dr.','Mr.','Mrs.'),
contact_1 = c('Bob Johnson','Robert Johnson','Bobby Johnson'),
contact_2 = c('Tommy Two Tones','Tommy Three Tones','Tommy No Tones'),
contact_1_loc = c('Earth','New York','Los Angeles'),
contact_2_loc = c('London','Geneva','Paris'))
# My attempt at a solution:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)") %>%
pivot_wider(names_from='dat',values_from='contact')
#What I want is to widen that data to achieve a tibble with these two example lines
df_desired <- tibble(name=c('Bob Johnson','Tommy Two Tones'),
loc =c('Earth','London'),
prefix=c('Mr.','Dr.'))
I want all names under name, all locations under loc, and all prefixes under prefix.
If I use just this snippet from the middle statement:
df %>% rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols=c(matches('_[12]_')),
names_to=c('.value','dat'),
names_pattern = "(.*)_[1-2]_(.*)")
The dput of the output is:
structure(list(dat = c("prefix", "prefix", "name", "name", "loc",
"loc", "prefix", "prefix", "name", "name", "loc", "loc", "prefix",
"prefix", "name", "name", "loc", "loc"), contact = c("Mr.", "Dr.",
"Bob Johnson", "Tommy Two Tones", "Earth", "London", "Mrs.",
"Mr.", "Robert Johnson", "Tommy Three Tones", "New York", "Geneva",
"Dr.", "Mrs.", "Bobby Johnson", "Tommy No Tones", "Los Angeles",
"Paris")), row.names = c(NA, -18L), class = c("tbl_df", "tbl",
"data.frame"))
From that, I thought for sure pivot_wider was the solution, but there is a name conflict.
I assume a single pivot_longer statement will achieve the task. I studied "Gathering wide columns into multiple long columns using pivot_longer" carefully but can't quite figure this out. I have to admit I don't quite understand what the names_to = c(".value", "group") argument does.
In any event, any help is appreciated.
Thanks

You were on the right path. The rename is needed because the name columns are the only ones without a suffix to identify them. In names_to, the special value .value means that the matched part of each original column name becomes the name of an output column instead of being stored as data. If you drop everything up to the last underscore, the part that remains (prefix, name, loc) is exactly the set of new column names, and you can capture it with a regex in names_pattern.
library(dplyr)
library(tidyr)
df %>%
rename(contact_1_name=contact_1,
contact_2_name=contact_2) %>%
pivot_longer(cols = everything(),
names_to = '.value',
names_pattern = '.*_(\\w+)')
# prefix name loc
# <chr> <chr> <chr>
#1 Mr. Bob Johnson Earth
#2 Dr. Tommy Two Tones London
#3 Mrs. Robert Johnson New York
#4 Mr. Tommy Three Tones Geneva
#5 Dr. Bobby Johnson Los Angeles
#6 Mrs. Tommy No Tones Paris
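
On the question about names_to = c(".value", "group"): each capture group in names_pattern is matched to one entry of names_to. The part matched to ".value" becomes output column names, while the part matched to any other entry is stored as data in a column of that name. A small sketch, assuming the name columns have already been renamed and using a made-up column name contact_no for the contact number:

```r
library(tidyr)

df <- data.frame(
  contact_1_prefix = c("Mr.", "Mrs."),
  contact_2_prefix = c("Dr.", "Mr."),
  contact_1_name   = c("Bob Johnson", "Robert Johnson"),
  contact_2_name   = c("Tommy Two Tones", "Tommy Three Tones"),
  contact_1_loc    = c("Earth", "New York"),
  contact_2_loc    = c("London", "Geneva"),
  stringsAsFactors = FALSE
)

# first capture group -> data column "contact_no"
# second capture group -> names of the output columns (.value)
long <- pivot_longer(
  df,
  cols = everything(),
  names_to = c("contact_no", ".value"),
  names_pattern = "contact_(\\d)_(.*)"
)
print(long)
```

Keeping contact_no is optional; it is useful when you later need to know whether a row came from the first or the second contact.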

Here is a solution using split.default
data.table::rbindlist(
lapply( split.default( df, gsub( "[^0-9]+", "", names(df) ) ),
data.table::setnames,
new = c("prefix", "name", "loc" ) ) )
# prefix name loc
# 1: Mr. Bob Johnson Earth
# 2: Mrs. Robert Johnson New York
# 3: Dr. Bobby Johnson Los Angeles
# 4: Dr. Tommy Two Tones London
# 5: Mr. Tommy Three Tones Geneva
# 6: Mrs. Tommy No Tones Paris

Simplifying tables (squashing them!) in R- basic q

I have a basic q I would like a quick R solution in...
I have a tab delimited table with multiple rows, but I want to "squash" all rows into one... for example:
name  day  red  blue  orange  black
bill  1    yes
bill  2         yes
bill  3                       yes
bill  4               no
But I want the output to be independent of day:
name red blue orange black
bill yes yes no yes
So essentially I am squashing the table down to include all answers regardless of the day. NB: There are never any overlaps i.e. Bill will select only one colour per day.
I could do this in excel, but I'd prefer to find an R solution... happy for guidance even wrt which libraries would be useful :).
Go easy on me, I'm a clinician not a bioinformatician!

Here is an option with dplyr. If the missing values are "", group by 'name' and, inside summarise, use across to keep for each column only the elements that are not blank (.[. != ""])
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[.!= '']))
Or if the missing values are NA
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[!is.na(.)]))
If a group has more than one non-missing element, the code above no longer returns a single value per column. Instead, we can paste the elements together with toString
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(.[!is.na(.)])))
If there are both NA and "", one option is to convert the "" to NA with na_if and then drop the NAs with na.omit (is.na or complete.cases would also work)
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(na.omit(na_if(., "")))))

In base R, you could use aggregate and select non-blank values for each name.
aggregate(cbind(red,blue,orange,black)~name, df, function(x) toString(x[x!='']))
# name red blue orange black
#1 bill yes yes no yes
data
df <- structure(list(name = c("bill", "bill", "bill", "bill"), day = 1:4,
red = c("yes", "", "", ""), blue = c("", "yes", "", ""),
orange = c("", "", "", "no"), black = c("", "", "yes", ""
)), class = "data.frame", row.names = c(NA, -4L))
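
If blanks and NA values can both occur, note that the formula interface of aggregate() drops rows containing NA by default (na.action = na.omit). A sketch that sidesteps this uses the data frame method with a small helper. The helper name squash and the NA values in the sample data are assumptions for illustration.

```r
# sample data with a mix of NA and "" as missing markers (made up)
df <- data.frame(
  name   = rep("bill", 4),
  day    = 1:4,
  red    = c("yes", NA, NA, NA),
  blue   = c("", "yes", "", ""),
  orange = c(NA, NA, NA, "no"),
  black  = c("", "", "yes", ""),
  stringsAsFactors = FALSE
)

# keep only real answers, then collapse them into one string
squash <- function(x) toString(x[!is.na(x) & x != ""])

res <- aggregate(
  df[c("red", "blue", "orange", "black")],
  by  = list(name = df$name),
  FUN = squash
)
print(res)
```

If a person ever gave two answers for the same colour, toString() would return them comma-separated rather than silently dropping one.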

Parsing a string with multiple brackets

I have a dataset dt with column "subject", that I need to parse. For example,
ID subject
1 USA(Texas)(Austin)
2 USA(California)(Sacramento)
As a result, I want to get the following table:
ID subject Country State Capital
1 USA(Texas)(Austin) USA Texas Austin
2 USA(California)(Sacramento) USA California Sacramento
How can I do it?

Since you have multiple brackets to extract data from, you need to make your regex lazy (note the .*? in the second group).
library(dplyr)
library(tidyr)
extract(dt, subject, into = c("Country", "State", "Capital"),
regex = "(.*)\\((.*?)\\)\\((.*)\\)", remove = FALSE)
# ID subject Country State Capital
#1 1 USA(Texas)(Austin) USA Texas Austin
#2 2 USA(California)(Sacramento) USA California Sacramento
Another option with a simpler regex is to replace the round brackets with spaces using gsub and then use separate with whitespace as the sep argument. Note that this assumes the names themselves contain no spaces.
dt %>%
mutate(subject = trimws(gsub('[()]', ' ', subject))) %>%
separate(subject, into = c("Country", "State", "Capital"), sep = "\\s+")
data
dt <- structure(list(ID = 1:2, subject = structure(2:1,
.Label = c("USA(California)(Sacramento)", "USA(Texas)(Austin)"),
class = "factor")), class = "data.frame", row.names = c(NA, -2L))
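
A base R sketch of the same split, for comparison: strsplit() on runs of brackets keeps spaces inside names intact (the "New York" row below is made up to show this). It assumes every subject has exactly three parts.

```r
dt <- data.frame(
  ID = 1:2,
  subject = c("USA(Texas)(Austin)", "USA(New York)(Albany)"),
  stringsAsFactors = FALSE
)

# split on one or more consecutive brackets, e.g. "USA", "Texas", "Austin"
parts <- do.call(rbind, strsplit(as.character(dt$subject), "[()]+"))
colnames(parts) <- c("Country", "State", "Capital")

# keep the original subject column and append the three new ones
res <- cbind(dt, parts)
print(res)
```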

dataframe column containing lists extracted as columns of the same dataframe

I have a dataframe with 3 columns. One of the columns (the second) contains a list of values per cell. Here dput sample data:
df <- structure(list(column1 = c("HEATER", "COOLER"), column2 = list(structure(list(
insidelist = structure(list(es = list("1"), en = list("00"), la = list(
"01")), .Names = c("es", "en", "la"))), .Names = "insidelist"),
structure(list(insidelist = structure(list(es = list("1"), en = list(
"01"), la = list("01")), .Names = c("es", "en", "la"))), .Names = "insidelist")),
column3 = c("88", "31")), .Names = c("column1", "column2", "column3"
), row.names = c(NA, -2L), class = "data.frame")
Giving this df:
column1 column2 column3
1 HEATER 1, 00, 01 88
2 COOLER 1, 01, 01 31
How to get that list of values from second column as columns of the original dataframe?
Desired output:
column1 column2 column3 column4 column5
1 HEATER 1 00 01 88
2 COOLER 1 01 01 31

Don't get me wrong, I love the tidy way of doing things as much as everyone here, and thanks to it many people find an easier path into R programming, but I think that sometimes, when you have a hammer, everything looks like a nail.
The tidyverse has a lot of virtues, but some drawbacks too; one of them is that it tends to hide the basics of the R language. In this case the most powerful and "human readable" solution (imho) is to mix approaches in a readable way.
Let’s take a look.
First we get rid of the nested lists by converting them to a data frame:
df$column2 <- data.frame(matrix(unlist(df$column2), nrow=nrow(df), byrow=TRUE))
> df
column1 column2.X1 column2.X2 column2.X3 column3
1 HEATER 1 00 01 88
2 COOLER 1 01 01 31
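Then extract the inner data frame (column2) and put it side by side with the rest of df (select and %>% come from dplyr). The result is stored under a new name so that df stays available for the next step:
library(dplyr)
df_wide <- cbind(select(df,-column2), df$column2)
Selecting/Renaming columns is a trivial task. Here, binding and renaming are done in one step:
df <- cbind(df, df$column2) %>%
select(Column1=1, Column2=4, Column3=5, Column4=6, Column5=3)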
This gives us the desired output:
> df
Column1 Column2 Column3 Column4 Column5
1 HEATER 1 00 01 88
2 COOLER 1 01 01 31
Plunging into tidy code sometimes ends in a not-so-tidy solution. I know many people are learning R this way, but wise programmers should be wary of where it can lead if you reach for the tidyverse for every problem without taking base R into account.

We can do
library(tidyverse)
df %>%
mutate(out = map(column2, ~ .x %>%
transpose %>%
unlist %>%
as.list %>%
as_tibble)) %>%
unnest(out) %>% # name the list column explicitly; tidyr >= 1.0 warns when cols is missing
select(-column2)

Here's my go - not as concise as akrun and AntoniosK, but maybe a little more readable:
df %>%
unnest(column2) %>%
mutate(lengths = map_int(column2, ~ length(unlist(.x))),
column2 = map_chr(column2, ~ paste(unlist(.x), collapse = ',') )) %>% # paste() instead of glue::collapse(), which was later renamed glue_collapse()
separate(column2, sep = ',', into = paste('temp', seq(1,max(.$lengths)), sep = '_')) %>%
select(column1, starts_with('temp'), column3) %>%
setNames(paste0("column", 1:ncol(.)))
Just a note - it looks like the answers in comments run a bit faster, so if you're working with a large dataset - it may be wise to go with those.
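
For completeness, a base R sketch that also keeps the inner names (es, en, la) as column names. It assumes every cell of column2 has the same structure; the sample data is a hand-built copy of the dput above.

```r
df <- data.frame(
  column1 = c("HEATER", "COOLER"),
  column3 = c("88", "31"),
  stringsAsFactors = FALSE
)
# list column with the same nesting as in the question
df$column2 <- list(
  list(insidelist = list(es = list("1"), en = list("00"), la = list("01"))),
  list(insidelist = list(es = list("1"), en = list("01"), la = list("01")))
)

# flatten each cell to a named character vector and stack the rows
flat <- do.call(rbind, lapply(df$column2, unlist))

# names arrive as "insidelist.es" etc.; strip the common prefix
colnames(flat) <- sub("^insidelist\\.", "", colnames(flat))

res <- data.frame(df["column1"], flat, df["column3"], stringsAsFactors = FALSE)
print(res)
```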
