I have a tibble.
library(tidyverse)
df <- tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df
I want to split genres on "|" into separate columns, with any empty columns filled with NA.
This is what I did:
df %>%
separate(genres, into = c("genre1", "genre2", "genre3", "genre4", "genre5"), sep = "|")
However, it's being separated after each letter.
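The into argument is fine; the problem is sep. separate() interprets sep as a regular expression, and an unescaped | means "or", so the pattern matches the empty string between every pair of characters. If you leave sep out, the default splits on runs of non-alphanumeric characters: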
df <- tibble::tibble(
id = 1:4,
genres = c("Action|Adventure|Science Fiction|Thriller",
"Adventure|Science Fiction|Thriller",
"Action|Crime|Thriller",
"Family|Animation|Adventure|Comedy|Action")
)
df %>% tidyr::separate(genres, into = c("genre1", "genre2", "genre3",
"genre4", "genre5"))
Result:
# A tibble: 4 x 6
     id genre1    genre2    genre3    genre4   genre5
* <int> <chr>     <chr>     <chr>     <chr>    <chr>
1     1 Action    Adventure Science   Fiction  Thriller
2     2 Adventure Science   Fiction   Thriller <NA>
3     3 Action    Crime     Thriller  <NA>     <NA>
4     4 Family    Animation Adventure Comedy   Action
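Edit: As RichScriven wrote in the comments, the column names can be generated with df %>% tidyr::separate(genres, into = paste0("genre", 1:5)). Note that the default separator also splits on the space in "Science Fiction"; to split on | only, escape it: sep = "\\|".
What helped in the end was escaping the regex properly: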
df %>%
separate(genres, into = paste0("genre", 1:5), sep = "\\|")
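To illustrate the escaping point, here is a minimal base R sketch of three ways to make | match literally. The string x and the variable names are made up for the example; only strsplit() from base R is used.

```r
# Hypothetical input string, modelled on the genres column above
x <- "Action|Science Fiction|Thriller"

# Three equivalent ways to treat "|" as a literal character
by_escape <- strsplit(x, "\\|")[[1]]              # escape the regex metacharacter
by_class  <- strsplit(x, "[|]")[[1]]              # put it in a character class
by_fixed  <- strsplit(x, "|", fixed = TRUE)[[1]]  # skip regex matching entirely

by_escape
```

The first two patterns can be passed to separate() as sep unchanged. separate() also accepts fill = "right", which should silence the warning about rows that have fewer than five genres.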
I am trying to do something that I think is straightforward but I am having an issue with.
I have several medication-related columns (med_1, med_2, med_3, for example). These are character variables, so they contain the names of medications as text.
I want to combine them into a single variable, anymed, using OR logic, so that I can use anymed to look at any medication reported across all the medication-related fields.
I am trying the following, for dataset FinalData.
FinalData <- FinalData %>% mutate(anymed = med_1 | med_2 | med_3)
I am receiving this error:
Error: Problem with `mutate()` column `anymed`.
ℹ `anymed = |...`.
x operations are possible only for numeric, logical or complex types
Could someone help explain what code I should use instead since these are characters? Do I need to convert to factors?
Are you looking for this kind of solution?
library(dplyr)
# data:
df <- tibble(med_1 = "A", med_2 = "B", med_3 = "C")
# paste() is vectorised, so sep = " | " joins the columns row by row
df %>%
  mutate(any_med = paste(med_1, med_2, med_3, sep = " | "))
  med_1 med_2 med_3 any_med
  <chr> <chr> <chr> <chr>
1 A     B     C     A | B | C
You want to use pivot_longer from tidyverse to get them all in the same column. I also dropped the column name (i.e., col), but you could remove that line if you want to know what column the medication came from. I'm unsure what your data looks like, so I just made a small example to show how to do it.
library(tidyverse)
FinalData %>%
pivot_longer(-ind, names_to = "col", values_to = "anymed") %>%
select(-col)
Output
# A tibble: 6 × 2
    ind anymed
  <dbl> <chr>
1     1 meda
2     1 meda
3     1 meda
4     2 medb
5     2 medb
6     2 medb
It's a little unclear what your expected output is. But if you are wanting to combine all medications in each row, then you can also use unite.
FinalData %>%
unite("any_med", c("med_1", "med_2", "med_3"), sep = " | ")
Output
  ind            any_med
1   1 meda | meda | meda
2   2 medb | medb | medb
Data
FinalData <-
structure(
list(
ind = c(1, 2),
med_1 = c("meda", "medb"),
med_2 = c("meda",
"medb"),
med_3 = c("meda", "medb")
),
class = "data.frame",
row.names = c(NA,-2L)
)
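If the goal is a logical "any medication reported" flag rather than a combined string, one possible sketch uses dplyr's if_any(). The data frame meds is hypothetical, and it assumes a missing medication is stored as NA; adjust the condition if empty strings are used instead.

```r
library(dplyr)

# Hypothetical data: NA means no medication was reported in that field
meds <- tibble(
  ind   = 1:3,
  med_1 = c("meda", NA, NA),
  med_2 = c(NA, "medb", NA),
  med_3 = c("medc", NA, NA)
)

# TRUE when at least one med_* column is non-missing in that row
meds_flagged <- meds %>%
  mutate(anymed = if_any(starts_with("med_"), ~ !is.na(.x)))

meds_flagged$anymed
```

The | operator in the original attempt only works on logical or numeric values, which is why each character column has to be turned into a TRUE/FALSE test first.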
I am rather new to R, and I have been trying to write code that finds and concatenates multiple-choice question responses when the data is in long format. The data needs to be pivoted wide, but that fails until the duplicate IDs produced by these multiple-choice responses are resolved. I want to attach the extra response to the same ID, so that it reads "affiliation 1, affiliation 2" for the individual respondent while still in long format. I would prefer not to use row numbers, because the data is recollected monthly and row numbers may not stay constant. I need to identify the IDs duplicated by the multiple-choice question and attach the secondary answer to the first response.
I have tried various versions of aggregate, grouping and summarizing, filter, unique, and distinct, but haven't been able to solve the problem.
Here is an example of the data:
ID  Question    Response
1   question 1  affiliation x
1   question 2  course 1
2   question 1  affiliation y
2   question 2  course 1
3   question 1  affiliation x
3   question 1  affiliation z
4   question 1  affiliation y
I want the data to look like this:
ID  Question    Response Text
1   question 1  affiliation x
1   question 2  course 1
2   question 1  affiliation y
2   question 2  course 1
3   question 1  affiliation x, affiliation z
4   question 1  affiliation y
so that it is prepared for pivot_wider.
Some example code that I've tried:
library(tidyverse)
course1 <- all_surveys %>%
filter(`Survey Title`=="course 1") %>%
aggregate("ID" ~ "Response Text", by(`User ID`, Question), FUN=sum) %>%
pivot_wider(id_cols = c("ID", `Response Date`),
names_from = "Question",
values_from = "Response Text") %>%
select([questions to be retained from Question])
I have also tried
group_by(question_new, `User ID`) %>%
summarize(text = str_c("Response Text", collapse = ", "))
as well as
aggregate(c[("Response Text" ~ "question_new")],
by = list(`User ID` = `User ID`, `Response Date` = `Response Date`),
function(x) unique(na.omit(x)))
and a bunch of different iterations of the above.
Thank you very much, in advance!
We can try to pivot_wider using values_fn = toString:
df %>% pivot_wider(names_from = Question,
values_from = response,
values_fn = toString)
small minimal example
df <- tibble(
  ID = c(1, 1, 2, 2),
  Question = c("question 1", "question 2", "question 1", "question 1"),
  response = c("affiliation x", "course 1", "affiliation x", "affiliation y")
)
# A tibble: 4 × 3
     ID Question   response
  <dbl> <chr>      <chr>
1     1 question 1 affiliation x
2     1 question 2 course 1
3     2 question 1 affiliation x
4     2 question 1 affiliation y
output
# A tibble: 2 × 3
     ID `question 1`                 `question 2`
  <dbl> <chr>                        <chr>
1     1 affiliation x                course 1
2     2 affiliation x, affiliation y NA
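If the collapsed responses are needed while the data is still in long format, as the question asks, a group-and-summarise step is one option. This is only a sketch: the tibble surveys is a hypothetical copy of the example data in the question. Note that the column is referenced with backticks; "Response Text" in double quotes, as in the attempted str_c() call, is a string literal rather than the column.

```r
library(dplyr)

# Hypothetical copy of the example data from the question
surveys <- tibble(
  ID = c(1, 1, 2, 2, 3, 3, 4),
  Question = c("question 1", "question 2", "question 1", "question 2",
               "question 1", "question 1", "question 1"),
  `Response Text` = c("affiliation x", "course 1", "affiliation y", "course 1",
                      "affiliation x", "affiliation z", "affiliation y")
)

# One row per ID/Question pair; multiple answers are joined with ", "
collapsed <- surveys %>%
  group_by(ID, Question) %>%
  summarise(`Response Text` = toString(`Response Text`), .groups = "drop")

collapsed
```

The result has one row per respondent and question, so it can be passed to pivot_wider() without a values_fn.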
I want to extract part of the data in one column and paste it into another column using R:
My data looks like this:
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NULL,+3579514862,NULL,+5554848123)
data <- data.frame(names,country,mobile)
data
> data
  names            country      mobile
1   Sia London +1234567890        NULL
2  Ryan              Paris +3579514862
3     J Sydney +0123458796        NULL
4 Ricky              Delhi +5554848123
I would like to separate phone number from country column wherever applicable and put it into mobile.
The output should be,
> data
  names country      mobile
1   Sia  London +1234567890
2  Ryan   Paris +3579514862
3     J  Sydney +0123458796
4 Ricky   Delhi +5554848123
You can use the separate function from tidyr, which is part of the tidyverse.
Note that I use NA rather than NULL inside the mobile vector, because c() silently drops NULL elements.
I also set stringsAsFactors = FALSE when creating the data frame to avoid working with factors.
names <- c("Sia","Ryan","J","Ricky")
country <- c("London +1234567890","Paris", "Sydney +0123458796", "Delhi")
mobile <- c(NA, "+3579514862", NA, "+5554848123")
data <- data.frame(names, country, mobile, stringsAsFactors = FALSE)
library(tidyverse)
data %>% as_tibble() %>%
separate(country, c("country", "number"), sep = " ", fill = "right") %>%
mutate(mobile = coalesce(mobile, number)) %>%
select(-number)
# A tibble: 4 x 3
  names country mobile
  <chr> <chr>   <chr>
1 Sia   London  +1234567890
2 Ryan  Paris   +3579514862
3 J     Sydney  +0123458796
4 Ricky Delhi   +5554848123
EDIT
If you want to do this without the pipes (which I would not recommend because the code becomes much harder to read) you can do this:
select(mutate(separate(as_tibble(data), country, c("country", "number"), sep = " ", fill = "right"), mobile = coalesce(mobile, number)), -number)
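Splitting on a single space assumes the place name is one word. As an alternative sketch, the number can be pulled out with a regular expression instead, using str_extract() and str_remove() from stringr. The pattern assumes a phone number is a + followed by digits at the end of the string; the data frame contacts is a hypothetical copy of the data above.

```r
library(dplyr)
library(stringr)

# Hypothetical copy of the example data
contacts <- tibble(
  names   = c("Sia", "Ryan", "J", "Ricky"),
  country = c("London +1234567890", "Paris", "Sydney +0123458796", "Delhi"),
  mobile  = c(NA, "+3579514862", NA, "+5554848123")
)

cleaned <- contacts %>%
  mutate(
    number  = str_extract(country, "\\+\\d+$"),      # trailing +digits, or NA
    country = str_remove(country, "\\s*\\+\\d+$"),   # drop the number and the space before it
    mobile  = coalesce(mobile, number)               # keep an existing mobile, else use the extracted one
  ) %>%
  select(-number)

cleaned
```

Because the pattern is anchored at the end of the string, a value such as "New York +1112223333" would keep "New York" intact.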
I have a df in R with only one column, containing food ratings from Amazon.
head(food_ratings)
product.productId..B001E4KFG0
1 review/userId: A3SGXH7AUHU8GW
2 review/profileName: delmartian
3 review/helpfulness: 1/1
4 review/score: 5.0
5 review/time: 1303862400
6 review/summary: Good Quality Dog Food
The pattern repeats: rows 7 through 12 hold the same six fields for another user (starting at row 7), and so on many times.
Therefore, I need to have every group of 6 rows distributed in one row with 6 columns, so that later I can subset, for instance, the review/summary according to their review/score.
I'm using RStudio 1.0.143
EDIT: I was asked to show the output of dput(head(food_ratings, 24)) but it was too big regardless of the number used.
Thanks a lot
I have taken your data and added 2 more fake users to it. Using tidyr and dplyr you can create new columns and collapse the data into a nice data.frame. You can use select from dplyr to drop the id column if you don't need it or to rearrange the order of the columns.
library(tidyr)
library(dplyr)
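df %>%
  separate(product.productId..B001E4KFG0, into = c("details", "data"), sep = ": ") %>%
  # strip the "review/" prefix so the field names become clean column names
  mutate(details = sub("review/", "", details)) %>%
  group_by(details) %>%
  # number each occurrence of a field; the n-th occurrence belongs to the n-th review
  mutate(id = row_number()) %>%
  spread(details, data)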
# A tibble: 3 x 7
     id helpfulness profileName score summary                 time       userId
  <int> <chr>       <chr>       <chr> <chr>                   <chr>      <chr>
1     1 1/1         delmartian  5.0   Good Quality Dog Food   1303862400 A3SGXH7AUHU8GW
2     2 1/1         martian2    1.0   Good Quality Snake Food 1303862400 123456
3     3 2/5         martian3    5.0   Good Quality Cat Food   1303862400 123654
data:
df <- structure(list(product.productId..B001E4KFG0 = c("review/userId: A3SGXH7AUHU8GW",
"review/profileName: delmartian", "review/helpfulness: 1/1",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Dog Food",
"review/userId: 123456", "review/profileName: martian2", "review/helpfulness: 1/1",
"review/score: 1.0", "review/time: 1303862400", "review/summary: Good Quality Snake Food",
"review/userId: 123654", "review/profileName: martian3", "review/helpfulness: 2/5",
"review/score: 5.0", "review/time: 1303862400", "review/summary: Good Quality Cat Food"
)), class = "data.frame", row.names = c(NA, -18L))
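The same reshaping can be sketched with pivot_wider(), the newer replacement for spread(). This version assumes every review starts with a review/userId line and uses that to build the id, so it does not depend on each review having all six fields. The data frame reviews and its column raw are hypothetical stand-ins for the data above.

```r
library(dplyr)
library(tidyr)

# Hypothetical two-review subset of the data above
reviews <- data.frame(
  raw = c("review/userId: A3SGXH7AUHU8GW", "review/profileName: delmartian",
          "review/helpfulness: 1/1", "review/score: 5.0",
          "review/time: 1303862400", "review/summary: Good Quality Dog Food",
          "review/userId: 123456", "review/profileName: martian2",
          "review/helpfulness: 1/1", "review/score: 1.0",
          "review/time: 1303862400", "review/summary: Good Quality Snake Food"),
  stringsAsFactors = FALSE
)

wide <- reviews %>%
  # extra = "merge" keeps any later ": " inside the value instead of dropping it
  separate(raw, into = c("details", "data"), sep = ": ", extra = "merge") %>%
  mutate(
    details = sub("review/", "", details, fixed = TRUE),
    id = cumsum(details == "userId")   # assumption: userId opens each review
  ) %>%
  pivot_wider(names_from = details, values_from = data)

wide
```

After this, something like filter(wide, score == "5.0") would subset the summaries by score; the score column is still character at this point.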
I have two data frames: City and Country. I am trying to find the most populous city per country. City and Country share a common field, City.CountryCode and Country.Code. These two data frames were merged into one called CityCountry. I have tried the aggregate command like so:
aggregate(Population.x~CountryCode, CityCountry, max)
This aggregate command only shows the CountryCode and Population.x columns. How would I show the name of the Country and the name of the City? Is aggregate the wrong command to use here?
You could also use dplyr to group by Country, then filter the rows where Population.x equals max(Population.x).
library(dplyr)
set.seed(123)
CityCountry <- data.frame(Population.x = sample(1000:2000, 10, replace = TRUE),
CountryCode = rep(LETTERS[1:5], 2),
Country = rep(letters[1:5], 2),
City = letters[11:20],
stringsAsFactors = FALSE)
CityCountry %>%
group_by(Country) %>%
filter(Population.x == max(Population.x)) %>%
ungroup()
# A tibble: 5 x 4
  Population.x CountryCode Country City
         <int> <chr>       <chr>   <chr>
1         1287 A           a       k
2         1789 B           b       l
3         1883 D           d       n
4         1941 E           e       o
5         1893 C           c       r
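A closely related sketch uses slice_max(), available in dplyr 1.0.0 and later. The data frame cities below is hypothetical, with fixed values so the result is easy to check. Setting with_ties = FALSE returns exactly one row per country even when two cities share the maximum, whereas the filter() approach keeps all tied rows.

```r
library(dplyr)

# Hypothetical data: two cities in each of two countries
cities <- tibble(
  Country      = c("a", "a", "b", "b"),
  City         = c("k", "p", "l", "q"),
  Population.x = c(100L, 200L, 300L, 150L)
)

# Keep the single most populous city in each country
top_city <- cities %>%
  group_by(Country) %>%
  slice_max(Population.x, n = 1, with_ties = FALSE) %>%
  ungroup()

top_city
```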