When I try to separate a column with (long) string values:
df <- tibble::tibble(value = c("Indian | Londen", "Greek | Amsterdam", "Hamburger and BBQ | Paris du Nord"))
df <- separate(df, col = value, into = c("var1","var2"), sep = " | ")
I get a warning message saying that there are too many values at three locations, and when I look at the altered data frame I don't get the desired result:
# A tibble: 3 × 2
var1 var2
* <chr> <chr>
1 Indian |
2 Greek |
3 Hamburger and
It seems to split at each space, does anyone know a way to work around this? var2 should contain the city or area name, thanks.
separate() interprets the sep argument as a regular expression when it is a character string. The pipe | is a special character in regex (alternation, "or"), so the pattern " | " means "a space or a space", which is the same as a single space. That is why your strings are split at every space. Escape the pipe to match it literally:
df <- separate(df, col = value, into = c("var1","var2"), sep = " \\| ")
df
# A tibble: 3 × 2
# var1 var2
#* <chr> <chr>
#1 Indian Londen
#2 Greek Amsterdam
#3 Hamburger and BBQ Paris du Nord
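The difference can also be checked outside of separate() with base R's strsplit(), which likewise treats its split argument as a regular expression by default. This is only a sketch; the vector x is made up for illustration.

```r
# Example strings in the same shape as the question's data
x <- c("Indian | Londen", "Hamburger and BBQ | Paris du Nord")

# Unescaped: " | " is the regex "space OR space", so every space is a split point
unescaped <- strsplit(x, " | ")

# Escaped pipe: only the literal " | " separator matches
escaped <- strsplit(x, " \\| ")

# fixed = TRUE switches regex interpretation off entirely
literal <- strsplit(x, " | ", fixed = TRUE)

print(unescaped[[1]])
print(escaped[[1]])
```

Either escaping the pipe or asking for a literal match gives the same two pieces per string here.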
The pipe has a special meaning in regex (it means "OR"), so you have to escape it first. You can also put it inside a character class, [|], to get the same result:
df1 <- separate(df, col = value, into = c("var1","var2"), sep = "\\|")
OR
df1 <- separate(df, col = value, into = c("var1","var2"), sep = "[|]")
BASE R way:
dfx<- data.frame(do.call("rbind",strsplit(df$value,split="\\|")))
Output:
> dfx
X1 X2
1 Indian Londen
2 Greek Amsterdam
3 Hamburger and BBQ Paris du Nord
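One detail worth checking: splitting on the bare pipe leaves the blanks around it in the pieces. A small base R sketch of two possible ways to deal with that (the vector x and the column names var1/var2 are just taken from the question for illustration):

```r
x <- c("Indian | Londen", "Greek | Amsterdam")

# Splitting on the bare pipe keeps the blanks around it
raw <- strsplit(x, "\\|")

# Option 1: make the optional whitespace part of the separator
clean <- strsplit(x, "\\s*\\|\\s*")

# Option 2: split on the pipe, then trim each piece
trimmed <- lapply(raw, trimws)

# Bind the pieces row-wise into a two-column data frame
dfx <- data.frame(do.call(rbind, clean), stringsAsFactors = FALSE)
names(dfx) <- c("var1", "var2")
print(dfx)
```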
I am rather new to R, and I have been trying to write code that will find and concatenate multiple-choice question responses when the data is in long format. The data needs to be pivoted wide, but that fails unless the duplicate IDs that result from these multiple-choice responses are resolved first. I want to combine the extra multiple-choice response with the distinct ID number, so that it looks like "affiliation 1, affiliation 2" for the individual respondent, in long format. I would prefer not to use row numbers, as the data is recollected on a monthly basis and row numbers may not stay constant. I need to identify the duplicate ID caused by the multiple-choice question and attach its secondary answer to the other response.
I have tried various versions of aggregate, grouping and summarizing, filter, unique, and distinct, but haven't been able to solve the problem.
Here is an example of the data:
ID Question Response
1 question 1 affiliation x
1 question 2 course 1
2 question 1 affiliation y
2 question 2 course 1
3 question 1 affiliation x
3 question 1 affiliation z
4 question 1 affiliation y
I want the data to look like this:
ID Question Response Text
1 question 1 affiliation x
1 question 2 course 1
2 question 1 affiliation y
2 question 2 course 1
3 question 1 affiliation x, affiliation z
4 question 1 affiliation y
so that it is prepared for pivot_wider.
Some example code that I've tried:
library(tidyverse)
course1 <- all_surveys %>%
filter(`Survey Title`=="course 1") %>%
aggregate("ID" ~ "Response Text", by(`User ID`, Question), FUN=sum) %>%
pivot_wider(id_cols = c("ID", `Response Date`),
names_from = "Question",
values_from = "Response Text") %>%
select([questions to be retained from Question])
I have also tried
group_by(question_new, `User ID`) %>%
summarize(text = str_c("Response Text", collapse = ", "))
as well as
aggregate(c[("Response Text" ~ "question_new")],
by = list(`User ID` = `User ID`, `Response Date` = `Response Date`),
function(x) unique(na.omit(x)))
and a bunch of different iterations of the above.
Thank you very much, in advance!
We can try to pivot_wider using values_fn = toString:
df %>% pivot_wider(names_from = Question,
values_from = response,
values_fn = toString)
A minimal example:
df<-tibble(ID = c(1,1,2,2), Question = c('question 1', 'question 2', 'question 1', 'question 1'), response = c('affiliation x', 'course 1', 'affiliation x', 'affiliation y'))
# A tibble: 4 × 3
ID Question response
<dbl> <chr> <chr>
1 1 question 1 affiliation x
2 1 question 2 course 1
3 2 question 1 affiliation x
4 2 question 1 affiliation y
Output:
# A tibble: 2 × 3
ID `question 1` `question 2`
<dbl> <chr> <chr>
1 1 affiliation x course 1
2 2 affiliation x, affiliation y NA
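If the combined responses are wanted in long format first, as the question describes, the same toString idea can be applied per ID and question. Note that in the summarize() attempt above, str_c("Response Text", ...) collapses the literal string "Response Text" rather than the column, because the name is in quotes instead of backticks. Below is a base R sketch with aggregate(); the data frame long and its simplified column names are assumptions based on the question's example.

```r
# Example data from the question (column names simplified)
long <- data.frame(
  ID = c(1, 1, 2, 2, 3, 3, 4),
  Question = c("question 1", "question 2", "question 1", "question 2",
               "question 1", "question 1", "question 1"),
  Response = c("affiliation x", "course 1", "affiliation y", "course 1",
               "affiliation x", "affiliation z", "affiliation y"),
  stringsAsFactors = FALSE
)

# One row per ID/Question pair; multiple responses are joined with ", "
combined <- aggregate(Response ~ ID + Question, data = long, FUN = toString)

# Re-sort by ID so the rows read like the desired output
combined <- combined[order(combined$ID, combined$Question), ]
print(combined)
```

The result has one row per ID and question, so it can be passed to pivot_wider without duplicate-value warnings.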
I have a string that I'm trying to parse so that I can create 3 columns from the data I extract. From what I've seen, stringr doesn't really cover this case, and the gsub approach I've used so far is excessive: it involves making multiple columns, parsing from those new columns, and then removing them, which seems really inefficient.
The format is this:
"blah, grabbed by ???-??-?????."
I need this:
???-??-?????
I've used placeholders here, but this is how the string typically looks
"blah, grabbed by PHI-80-J.Matthews."
or
"blah, grabbed by NE-5-J.Mills."
and sometimes there is text after the name like this:
"blah, grabbed by KC-10-T.Hill. Blah blah blah."
This is what I would like the end result to be:
Place Number Name
PHI 80 J.Matthews
NE 5 J.Mills
KC 10 T.Hill
Edit for further explanation:
Most strings include other people in the same format, so "grabbed by" needs to be incorporated in some way to make sure it is grabbing the right name.
Ex.
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
Desired Output:
Place Number Name
KC 10 T.Hill
This solution simply extracts the components based on the logic the OP described, i.e. it captures the required characters as three groups: 1) one or more uppercase letters ([A-Z]+) followed by a dash (-), 2) one or more digits (\\d+) followed by another dash, and 3) the non-whitespace characters (\\S+) that follow, up to the closing period.
library(tidyr)
extract(df1, col1, into = c("Place", "Number", "Name"),
".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*", convert = TRUE)
Output:
# A tibble: 4 x 3
Place Number Name
<chr> <int> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
4 KC 10 T.Hill
Or do this in base R
read.table(text = sub(".*grabbed by\\s((\\w+-){2}\\S+)\\..*", "\\1",
df1$col1), header = FALSE, col.names = c("Place", "Number", "Name"), sep='-')
Place Number Name
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
data
df1 <- structure(list(col1 = c("blah, grabbed by PHI-80-J.Matthews.",
"blah, grabbed by NE-5-J.Mills.", "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
"Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
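To inspect what each capture group picks up, the same pattern can be run through base R's regexec() and regmatches(). This is only an illustrative sketch: the vector x repeats three of the strings from df1, and perl = TRUE is my own choice here, not part of the answer.

```r
x <- c(
  "blah, grabbed by PHI-80-J.Matthews.",
  "blah, grabbed by KC-10-T.Hill. Blah blah blah.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)

pattern <- ".*grabbed by\\s([A-Z]+)-(\\d+)-(\\S+)\\..*"

# regexec() reports the whole match plus one entry per capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Drop the whole-match column and keep the three groups
groups <- do.call(rbind, m)[, 2:4, drop = FALSE]
res <- data.frame(
  Place = groups[, 1],
  Number = as.integer(groups[, 2]),
  Name = groups[, 3],
  stringsAsFactors = FALSE
)
print(res)
```

Because the pattern is anchored on "grabbed by", the other players in the third string are ignored.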
This solution actually does what you say in the title, namely first remove the text around the target substring, then split it into columns:
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(col1 = str_extract(col1, "\\w+-\\w+-\\w\\.\\w+")) %>%
separate(col1,
into = c("Place", "Number", "Name"),
sep = "-")
# A tibble: 3 x 3
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
Here, we make use of the fact that the character class \\w is for letters irrespective of case and for digits (and also for the underscore).
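One caveat with an unanchored pattern like this: when a string mentions several players, the first one in the string is matched, which is not necessarily the one after "grabbed by". A possible way around it, sketched in base R with a lookbehind (the vector x is illustrative, and the lookbehind is my addition rather than part of the answer):

```r
x <- c(
  "blah, grabbed by NE-5-J.Mills.",
  "Throw by OAK-4-D.Carr, snap by PHI-62-J.Kelce, grabbed by KC-10-T.Hill. Penalty on OAK-4-D.Carr"
)

# Without an anchor the first player in the string wins
first_hit <- regmatches(x, regexpr("\\w+-\\w+-\\w\\.\\w+", x))

# A lookbehind (needs perl = TRUE) ties the match to "grabbed by "
anchored <- regmatches(
  x,
  regexpr("(?<=grabbed by )\\w+-\\w+-\\w\\.\\w+", x, perl = TRUE)
)

# Split the matched substring into its three parts
parts <- do.call(rbind, strsplit(anchored, "-", fixed = TRUE))
colnames(parts) <- c("Place", "Number", "Name")
print(parts)
```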
Here is an alternative: sub() with the regex "([A-Za-z]+\\.[A-Za-z]+).*" and the replacement "\\1" removes everything after the name. Then separate() splits the string at " by ", and a second separate() produces the desired columns.
library(dplyr)
library(tidyr)
df1 %>%
mutate(test1 = sub("([A-Za-z]+\\.[A-Za-z]+).*", "\\1", col1)) %>%
separate(test1, c('remove', 'keep'), sep = " by ") %>%
separate(keep, c("Place", "Number", "Name"), sep = "-") %>%
select(Place, Number, Name)
Output:
Place Number Name
<chr> <chr> <chr>
1 PHI 80 J.Matthews
2 NE 5 J.Mills
3 KC 10 T.Hill
I have a couple of unstructured sentences like the ones below. Description is the column name.
Description
Automatic lever for a machine
Vaccum chamber with additional spare
Glove box for R&D
The Mini Guage 5 sets
Vacuum chamber only
Automatic lever only
I want to split each sentence into Col1 to Col5 and count their occurrences, like below:
Col1 Col2 Col3 Col4
Automatic_lever lever_for for_a a_machine
Vaccum_chamber chamber_with with_additional additional_spare
Glove_box box_for for_R&D R&D
The_Mini Mini_Guage Guage_5 5_sets
Vacuum_chamber chamber_only only
Automatic_lever lever_only only
Also, from the columns above, can I get the number of occurrences of these word pairs? For example, Vaccum_chamber and Automatic_lever are repeated twice here. Similarly for the other pairs?
Here is a tidyverse option
library(tidyverse)
df %>%
rowid_to_column("row") %>%
mutate(words = map(str_split(Description, " "), function(x) {
if (length(x) %% 2 == 0) words <- c(words, "")
idx <- 1:(length(words) - 1)
map_chr(idx, function(i) paste0(x[i:(i + 1)], collapse = "_"))
})) %>%
unnest(words) %>%
group_by(row) %>%
mutate(
words = str_replace(words, "_NA", ""),
col = paste0("Col", 1:n())) %>%
filter(words != "NA") %>%
spread(col, words, fill = "")
## A tibble: 6 x 6
## Groups: row [6]
# row Description Col1 Col2 Col3 Col4
# <int> <fct> <chr> <chr> <chr> <chr>
#1 1 Automatic lever for a mac… Automatic_… lever_for for_a a_machine
#2 2 Vaccum chamber with addit… Vaccum_cha… chamber_w… with_addi… additional…
#3 3 Glove box for R&D Glove_box box_for for_R&D R&D
#4 4 The Mini Guage 5 sets The_Mini Mini_Guage Guage_5 5_sets
#5 5 Vacuum chamber only Vacuum_cha… chamber_o… only ""
#6 6 Automatic lever only Automatic_… lever_only only ""
Explanation: We split the sentences in Description on a single space " ", then concatenate every two adjacent words with a sliding-window approach, making sure that there is always an odd number of words per sentence; the rest is just a long-to-wide transformation.
Not pretty, but it reproduces your expected output; instead of the manual sliding-window approach you could also use zoo::rollapply.
Sample data
df <- read.table(text =
"Description
'Automatic lever for a machine'
'Vaccum chamber with additional spare'
'Glove box for R&D'
'The Mini Guage 5 sets'
'Vacuum chamber only'
'Automatic lever only'", header = T)
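The question also asks how often each word pair occurs, which the code above does not cover. A minimal base R sketch of the counting step follows; the helper name bigrams is hypothetical, and unlike the output above it keeps only true pairs, without a trailing single word.

```r
desc <- c(
  "Automatic lever for a machine",
  "Vaccum chamber with additional spare",
  "Glove box for R&D",
  "The Mini Guage 5 sets",
  "Vacuum chamber only",
  "Automatic lever only"
)

# All adjacent word pairs of one sentence, joined with "_"
bigrams <- function(s) {
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1), sep = "_")
}

# Pool the pairs of every sentence and tabulate them
all_pairs <- unlist(lapply(desc, bigrams))
counts <- sort(table(all_pairs), decreasing = TRUE)
print(head(counts))
```

Note that "Vaccum" and "Vacuum" are spelled differently in the sample data, so Vaccum_chamber and Vacuum_chamber are counted as separate pairs unless the spelling is normalized first.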
I am using RStudio for data analysis in R. I currently have a dataframe which is in a long format. I want to convert it into the wide format.
An extract of the dataframe (df1) is shown below. I have converted the first column into a factor.
Extract:
df1 <- read.csv("test1.csv", stringsAsFactors = FALSE, header = TRUE)
df1$Respondent <- factor(df1$Respondent)
df1
Respondent Question CS Imp LOS Type Hotel
1 1 Q1 Fully Applied High 12 SML ABC
2 1 Q2 Optimized Critical 12 SML ABC
I want a new dataframe (say, df2) to look like this:
Respondent Q1CS Q1Imp Q2CS Q2Imp LOS Type Hotel
1 Fully Applied High Optimized Critical 12 SML ABC
How can I do this in R?
Additional notes: I have tried looking at the tidyr package and its spread() function but I am having a hard time implementing it to this specific problem.
This can be achieved with a gather-unite-spread approach
library(dplyr)
library(tidyr)
df %>%
group_by(Respondent) %>%
gather(k, v, CS, Imp) %>%
unite(col, Question, k, sep = "") %>%
spread(col, v)
# Respondent LOS Type Hotel Q1CS Q1Imp Q2CS Q2Imp
#1 1 12 SML ABC Fully Applied High Optimized Critical
Sample data
df <- read.table(text =
" Respondent Question CS Imp LOS Type Hotel
1 1 Q1 'Fully Applied' High 12 SML ABC
2 1 Q2 'Optimized' Critical 12 SML ABC", header = T)
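In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(), and the same reshaping can probably be done in a single pivot_wider() call. A sketch, assuming a reasonably recent tidyr; the data frame is rebuilt here so that the block is self-contained:

```r
library(tidyr)

df <- data.frame(
  Respondent = c(1, 1),
  Question = c("Q1", "Q2"),
  CS = c("Fully Applied", "Optimized"),
  Imp = c("High", "Critical"),
  LOS = c(12, 12),
  Type = c("SML", "SML"),
  Hotel = c("ABC", "ABC"),
  stringsAsFactors = FALSE
)

# names_glue builds each new name from the Question value and the source column
wide <- pivot_wider(
  df,
  names_from = Question,
  values_from = c(CS, Imp),
  names_glue = "{Question}{.value}"
)
print(wide)
```

The columns not named in names_from or values_from (Respondent, LOS, Type, Hotel) are used as identifiers, so they are carried over unchanged.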
In data.table, this can be done in a one-liner....
dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")[DT, `:=`(LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"][]
Respondent CSQ1 CSQ2 ImpQ1 ImpQ2 LOS Type Hotel
1: 1 Fully Applied Optimized High Critical 12 SML ABC
explained step by step
create sample data
DT <- fread("Respondent Question CS Imp LOS Type Hotel
1 Q1 'Fully Applied' High 12 SML ABC
1 Q2 'Optimized' Critical 12 SML ABC", quote = '\'')
Cast a part of the datatable to desired format by question
colnames might not be what you want... you can always change them using setnames().
result.from.dcast <- dcast(DT, Respondent ~ Question, value.var = c("CS", "Imp"), sep = "")
result.from.dcast
# Respondent CSQ1 CSQ2 ImpQ1 ImpQ2
# 1: 1 Fully Applied Optimized High Critical
Then join by reference on the original DT, to get the rest of the columns you need...
result.from.dcast[DT, `:=`( LOS = i.LOS, Type = i.Type, Hotel = i.Hotel), on = "Respondent"]
I have a tibble.
library(tidyverse)
df <- tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df
I want to separate the genres at "|" and have the empty columns filled with NA.
This is what I did:
df %>%
separate(genres, into = c("genre1", "genre2", "genre3", "genre4", "genre5"), sep = "|")
However, it's being separated after each letter.
I think you haven't included into:
df <- tibble::tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df %>% tidyr::separate(genres, into = c("genre1", "genre2", "genre3",
"genre4", "genre5"))
Result:
# A tibble: 4 x 6
     id genre1    genre2    genre3    genre4   genre5
* <int> <chr>     <chr>     <chr>     <chr>    <chr>
1     1 Action    Adventure Science   Fiction  Thriller
2     2 Adventure Science   Fiction   Thriller <NA>
3     3 Action    Crime     Thriller  <NA>     <NA>
4     4 Family    Animation Adventure Comedy   Action
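Edit: Or, as RichScriven wrote in the comments, df %>% tidyr::separate(genres, into = paste0("genre", 1:5)). Note that the default sep splits on any run of non-alphanumeric characters, so "Science Fiction" is split into two columns in the result above. To separate on | exactly, use sep = "\\|".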
Well, this is what helped, writing regex properly.
df %>%
separate(genres, into = paste0("genre", 1:5), sep = "\\|")
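Why the unescaped "|" splits after every letter: as a regex, a lone pipe is an alternation between two empty patterns, so it matches the empty string between any two characters. The base R sketch below shows the same effect with strsplit() and one possible way to pad shorter rows with NA; the genres vector is a shortened copy of the question's data.

```r
genres <- c("Action|Adventure|Science Fiction|Thriller",
            "Action|Crime|Thriller",
            "Family|Animation|Adventure|Comedy|Action")

# As a regex, a lone "|" matches the empty string, so the split
# happens between characters
per_char <- strsplit("Action|Crime", "|")[[1]]

# Treat the pipe literally instead
parts <- strsplit(genres, "|", fixed = TRUE)

# Pad every row to the longest length with NA and bind into a matrix
n <- max(lengths(parts))
mat <- t(sapply(parts, `length<-`, n))
colnames(mat) <- paste0("genre", seq_len(n))
print(mat)
```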