I have a basic question that I would like a quick R solution for.
I have a tab-delimited table with multiple rows, but I want to "squash" all rows into one. For example:
name  day  red  blue  orange  black
bill  1    yes
bill  2         yes
bill  3                       yes
bill  4               no
But I want the output to be independent of day:
name red blue orange black
bill yes yes no yes
So essentially I am squashing the table down to include all answers regardless of the day. NB: There are never any overlaps i.e. Bill will select only one colour per day.
I could do this in excel, but I'd prefer to find an R solution... happy for guidance even wrt which libraries would be useful :).
Go easy on me, I'm a clinician not a bioinformatician!
Here is an option with dplyr. If the missing values are blank strings (""), group by 'name', then summarise by looping across the colour columns and keeping the elements that are not blank (.[. != ""])
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[.!= '']))
Or if the missing values are NA
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[!is.na(.)]))
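If a group has more than one non-missing element, the above will no longer return a single row per name (depending on the dplyr version, summarise returns one row per element, warns, or fails when the columns yield different numbers of elements). Instead, we can paste the values together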
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(.[!is.na(.)])))
If there are both NA and "", an option is to convert the "" to NA with na_if and then drop the NA values with is.na, complete.cases, or na.omit
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(na.omit(na_if(., "")))))
In base R, you could use aggregate and select non-blank values for each name.
aggregate(cbind(red,blue,orange,black)~name, df, function(x) toString(x[x!='']))
# name red blue orange black
#1 bill yes yes no yes
data
df <- structure(list(name = c("bill", "bill", "bill", "bill"), day = 1:4,
red = c("yes", "", "", ""), blue = c("", "yes", "", ""),
orange = c("", "", "", "no"), black = c("", "", "yes", ""
)), class = "data.frame", row.names = c(NA, -4L))
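Since the question starts from a tab-delimited file, here is a minimal base R sketch of the whole round trip, assuming blank cells mark the missing answers. The inline `lines` vector is a hypothetical stand-in for the real file, and the names `tab`, `colour_cols` and `squashed` are only illustrative.

```r
# Hypothetical stand-in for the tab-delimited file; with a real file,
# pass its path to read.delim() instead of `text = lines`.
lines <- c(
  "name\tday\tred\tblue\torange\tblack",
  "bill\t1\tyes\t\t\t",
  "bill\t2\t\tyes\t\t",
  "bill\t3\t\t\t\tyes",
  "bill\t4\t\t\tno\t"
)

# Read every column as character and turn blank cells into NA
tab <- read.delim(text = lines, colClasses = "character", na.strings = "")

# Keep the non-missing answers in each colour column, one row per name
colour_cols <- c("red", "blue", "orange", "black")
squashed <- aggregate(
  tab[colour_cols],
  by = list(name = tab$name),
  FUN = function(x) toString(x[!is.na(x)])
)
squashed
```

For the sample rows this should leave a single row for bill. If a name ever had two answers in the same colour, toString would join them with a comma rather than drop one.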
My data is more complex, but simplified it basically looks like this. Each row has only one Yes, in one of the multiple Color:* columns. The "Color" before the colon needs to become the new column name, with the named color after the colon ("Blue", "Red", or "Green") pivoted under "Color" against the correct "IDENTITY". I forgot to mention that there are a variable number of columns before and after the columns I want to pivot into one. The columns I want to pivot all have the same name before the colon.
IDENTITY  Color:Blue  Color:Red  Color:Green
1                                Yes
2         Yes
3                     Yes
and what I would like is to pivot to this
IDENTITY  Color
1         Green
2         Blue
3         Red
I am not sure if this is a pivot problem. I have read through the tidyr pivot documentation at
https://tidyr.tidyverse.org/articles/pivot.html
I do not see a similar example there, and I cannot identify one of the solutions that might work with my data.
Can anyone help me with a code chunk I can follow to solve what seems on the face of it a simple problem, but one that eludes my limited proficiency with R? Thank you and Season's Blessings
We can convert the blanks ("") to NA in the character columns by looping across them, then reshape to 'long' format, selecting the columns whose names start with "Color:" (starts_with) and dropping the NA values
library(tidyr)
library(dplyr)
df1 %>%
mutate(across(where(is.character), ~ na_if(.x, ""))) %>%
pivot_longer(cols = starts_with("Color:"), names_to = c(".value", "Color2"),
names_sep = ":", values_drop_na = TRUE) %>%
select(IDENTITY, Color = Color2)
-output
# A tibble: 3 × 2
IDENTITY Color
<int> <chr>
1 1 Green
2 2 Blue
3 3 Red
You would need to use the pivot_longer function. This gathers the selected columns and turns them into rows.
Note: I had to wrap the column names in backticks (``) because : is an operator in R, so names containing it are not syntactic. Avoid it as a separator in column names where you can ( _ could be a useful substitution). You can't always control this, but it's something to look out for.
library(tidyverse)
x <- tribble( ~IDENTITY, ~`Color:Blue`, ~`Color:Red`, ~`Color:Green`,
1L, "", "", "Yes",
2L, "Yes", "", "",
3L, "" , "Yes", "")
x %>%
rename_with(stringr::str_replace,
pattern = "Color:",
replacement = "") %>%
pivot_longer(!IDENTITY,
names_to = "Color") %>%
filter(value != "") %>%
select(!value)
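To illustrate the point about non-syntactic names without any packages, here is a rough base R sketch. It assumes each row has exactly one "Yes" among the Color:* columns (a row with none would silently be assigned the first colour column, so check for that on real data). The names `color_cols`, `hit` and `result` are only illustrative.

```r
# check.names = FALSE keeps "Color:Blue" etc. as-is; the backticks are
# needed because ":" makes these names non-syntactic
x <- data.frame(
  IDENTITY      = 1:3,
  `Color:Blue`  = c("", "Yes", ""),
  `Color:Red`   = c("", "", "Yes"),
  `Color:Green` = c("Yes", "", ""),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Pick the columns to collapse by their shared prefix
color_cols <- grep("^Color:", names(x), value = TRUE)

# For each row, find the position of the column holding "Yes"
is_yes <- (x[color_cols] == "Yes") * 1
hit <- max.col(is_yes, ties.method = "first")

# Strip the prefix so only the colour name remains
result <- data.frame(
  IDENTITY = x$IDENTITY,
  Color = sub("^Color:", "", color_cols[hit]),
  stringsAsFactors = FALSE
)
result
```

Any other columns before or after the Color:* block are untouched by this and could be carried over with cbind.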
If I have a df:
Class sentence
1 Yes there is p beaker on the table
2 Yes they t the frown
3 Yes so Z it was asleep
How do I remove the length-one strings within "sentence" column to remove things like "t" "p" and "Z", and then do a final clean using the stop_words list in tidytext to get the below?
Class sentence
1 Yes beaker table
2 Yes frown
3 Yes asleep
If we want to use tidytext, first create a sequence column (row_number()) so the rows can be put back together, then apply unnest_tokens on the sentence column. Do an anti_join with the default data from get_stopwords(), filter out the words that are only one character long, and finally group by the sequence column and paste the 'word' column back together to create the 'sentence'
library(dplyr)
library(tidytext)
library(stringr)
df %>%
mutate(rn = row_number()) %>%
unnest_tokens(word, sentence) %>%
anti_join(get_stopwords(), by = "word") %>%
filter(nchar(word) > 1) %>%
group_by(rn, Class) %>%
summarise(sentence = str_c(word, collapse = ' '), .groups = 'drop') %>%
select(-rn)
-Output
# A tibble: 3 x 2
Class sentence
<chr> <chr>
1 Yes beaker table
2 Yes frown
3 Yes asleep
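For comparison, the same two filters (drop one-character tokens, drop stop words) can be sketched in base R. The `stop_list` vector below is a hypothetical, hand-picked stand-in and not the tidytext stop_words data; `clean_sentence` is likewise just an illustrative helper name.

```r
df <- data.frame(
  Class = c("Yes", "Yes", "Yes"),
  sentence = c("there is p beaker on the table",
               "they t the frown",
               "so Z it was asleep"),
  stringsAsFactors = FALSE
)

# Hypothetical tiny stop list; swap in a real one for actual use
stop_list <- c("there", "is", "on", "the", "they", "so", "it", "was")

clean_sentence <- function(s) {
  # Lower-case and split on whitespace
  words <- strsplit(tolower(s), "\\s+")[[1]]
  # Keep words longer than one character that are not stop words
  keep <- nchar(words) > 1 & !(words %in% stop_list)
  paste(words[keep], collapse = " ")
}

df$sentence <- vapply(df$sentence, clean_sentence, character(1),
                      USE.NAMES = FALSE)
df
```

Unlike unnest_tokens, this does not strip punctuation, so it is only a rough equivalent for clean input like the sample.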
Data
df <- structure(list(Class = c("Yes", "Yes", "Yes"), sentence = c("there is p beaker on the table",
"they t the frown", "so Z it was asleep")),
class = "data.frame", row.names = c("1",
"2", "3"))
I have an example dataframe as below.
pr_id   product                  name
id_234  onion,bean               chris
id_34d  apple                    tom
id_87t  plantain, potato, apple  tex
I want to access the product column and create a new column and assign 1 if apple is in the list and 0 if not.
So i expect a result like this:
pr_id   product                  name   result
id_234  onion,bean               chris  0
id_34d  apple                    tom    1
id_87t  plantain, potato, apple  tex    1
I thought of something like this:
my_df$result <- ifelse(my_df$product == 'apple', 1,0)
but this only works for rows 1 and 2, not for the last row, which has multiple elements.
How do I go about this?
With dplyr, dataframe kindly taken from p. Paccioretti
Thanks to AnilGoyal for stringr::str_detect
# dplyr provides %>%, mutate and case_when
library(dplyr)
# construct the dataframe
pr_id = c("id_234", "id_34d", "id_87t")
product = c("onion,bean",
"apple", "plantain, potato, apple")
name = c("chris", "tom","tex")
my_df <- data.frame(pr_id, product, name)
# check with case_when and str_detect if apple is in product
my_df <- my_df %>%
mutate(result = case_when(stringr::str_detect(product, "apple") ~ 1,
TRUE ~ 0)
)
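You can use agrepl, which searches for approximate matches within a string; == only matches when the whole string is exactly equal. Because agrepl is fuzzy, it can also flag close misspellings; grepl('apple', my_df$product, fixed = TRUE) is the stricter substring test.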
my_df <-
structure(
list(
pr_id = c("id_234", "id_34d", "id_87t"),
product = c("onion,bean",
"apple", "plantain, potato, apple"),
name = c("chris", "tom",
"tex")
),
class = "data.frame",
row.names = c(NA, -3L)
)
my_df$result <- ifelse(agrepl('apple', my_df$product), 1,0)
Or a tidyverse approach
library(dplyr)
my_df <-
my_df %>%
mutate(result = as.numeric(agrepl('apple', product)))
my_df
#> pr_id product name result
#> 1 id_234 onion,bean chris 0
#> 2 id_34d apple tom 1
#> 3 id_87t plantain, potato, apple tex 1
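Using str_count. Note that this returns the number of matches, so a product list that mentions apple twice would get 2 rather than 1.
library(dplyr)
library(stringr)
my_df %>%
mutate(result = str_count(product, 'apple'))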
I would use the str_detect option in stringr (tidyverse option).
my_df <- my_df %>%
mutate(result = ifelse(str_detect(product, "apple"), 1, 0))
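One caveat with the substring-based answers above is that "apple" would also be found inside a longer word such as "pineapple". If that matters, a possible sketch is to split each product string on commas and test for an exact element. `has_item` is a hypothetical helper name, and the "pineapple, kiwi" string is made up purely to show the difference.

```r
my_df <- data.frame(
  pr_id = c("id_234", "id_34d", "id_87t"),
  product = c("onion,bean", "apple", "plantain, potato, apple"),
  name = c("chris", "tom", "tex"),
  stringsAsFactors = FALSE
)

# Returns 1 when `item` is one of the comma-separated entries, else 0
has_item <- function(products, item) {
  parts <- strsplit(products, ",", fixed = TRUE)
  vapply(parts, function(p) as.integer(item %in% trimws(p)), integer(1))
}

my_df$result <- has_item(my_df$product, "apple")
my_df

# A substring test would say yes here; the exact-element test says no
has_item("pineapple, kiwi", "apple")
```

trimws takes care of the optional space after each comma, so "onion,bean" and "plantain, potato, apple" are handled the same way.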
I have an R data frame (actually an excel sheet which I have read into R) in the format below:
ID  Text
1   This is a red
    car. Its electric
    and has 4 wheels.
2   This is a van with
    six wheels.
I want to reshape it into the following format
ID Text
1 This is a red car. Its electric and has 4 wheels.
2 This is a van with six wheels
Essentially between the two ID numbers my text has been broken into multiple lines. I want to combine it to look like the output above.
Grouping by the numeric ID with group_by did not work, because it gets rid of the lines without an ID.
Any thoughts on how I can achieve this type of output?
Thanks!
Here is one option with the tidyverse. Convert the blanks ("") in 'ID' to NA (na_if), use fill from tidyr to replace each NA with the previous non-NA value, group by 'ID', and then paste the 'Text' elements together into a single string with collapse
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(ID = na_if(ID, "")) %>%
fill(ID) %>%
group_by(ID) %>%
summarise(Text = str_c(Text, collapse=' '))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
Or build a running index with cumsum on the logical vector ID != "", use it to look up the non-blank 'ID' for every row, and group by that to summarise the 'Text' column
df1 %>%
group_by(ID = ID[ID != ""][cumsum(ID != "")]) %>%
summarise(Text = str_c(Text, collapse=" "))
# A tibble: 2 x 2
# ID Text
# <chr> <chr>
#1 1 This is a red car. Its electric and has 4 wheels.
#2 2 This is a van with six wheels.
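To make the cumsum trick a little less opaque, here is a small base R sketch that prints the intermediate vectors before aggregating. It assumes the first row always carries an ID; `grp`, `ids`, `filled` and `combined` are illustrative names only.

```r
df1 <- data.frame(
  ID = c("1", "", "", "2", ""),
  Text = c("This is a red", "car. Its electric", "and has 4 wheels.",
           "This is a van with", "six wheels."),
  stringsAsFactors = FALSE
)

# TRUE marks the rows that start a new ID; cumsum turns that into a
# running group number: 1 1 1 2 2
grp <- cumsum(df1$ID != "")
grp

# Index the non-blank IDs by group number to fill the gaps: "1" "1" "1" "2" "2"
ids <- df1$ID[df1$ID != ""][grp]
ids

# Paste the text fragments of each ID into one string
filled <- transform(df1, ID = ids)
combined <- aggregate(Text ~ ID, data = filled, FUN = paste, collapse = " ")
combined
```

If the data could start with a blank ID, `grp` would begin at 0 and the lookup would drop those rows, so that case needs handling separately.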
data
df1 <- structure(list(ID = c("1", "", "", "2", ""), Text = c("This is a red",
"car. Its electric", "and has 4 wheels.", "This is a van with",
"six wheels.")), row.names = c(NA, -5L), class = "data.frame")
I have a data frame that looks like this:
df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"),
Answer = c("Yes", "", "No", ""))
Question Answer
1 This is the start Yes
2 of a question
3 This is a second No
4 question
This is dummy data, but the real data is being pulled from PDF via tabulizer. Any time there is a line break in Question in the source document, that question gets split into multiple lines. How do I concatenate back based on the condition that Answer is blank?
The desired result is simply:
Question Answer
1 This is the start of a question Yes
2 This is a second question No
The logic is simply, if Answer[x] is blank, concatenate Question[x] and Question[x-1] and remove row x.
This could no doubt be improved, but if you're happy to use the tidyverse, perhaps an approach like this could work?
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
fill(id) %>% group_by(id) %>%
summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))
#> # A tibble: 2 x 3
#> id Question Answer
#> <int> <chr> <fctr>
#> 1 1 This is the start of a question Yes
#> 2 3 This is a second question No
The following should do, if I follow your logic:
# test data
dff <- data.frame(Question=c("This is the start",
"of a question",
"This is a second",
"question",
"This is a third",
"question",
"and more space",
"yet even more space",
"This is actually another question"),
Answer = c("Yes",
"",
"No",
"",
"Yes",
"",
"",
"",
"No"),
stringsAsFactors = FALSE)
# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))
# Question Answer
# 1 This is the start of a question Yes
# 2 This is a second question No
# 3 This is a third question and more space yet even more space Yes
# 4 This is actually another question No
The idea is to use cumsum on the expression nchar(dff$Answer)>0. This should create a grouping vector to use with the split function. Upon splitting on your grouping vector, you should be able to create smaller dataframes with the results of the split operation, by concatenating values from the Question column and taking the first value of the Answer column. Subsequently, you can rbind the resulting dataframes.
I hope this helps.
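As a compact illustration of the grouping vector described above, here is a sketch on the four-row data from the question, using tapply instead of split and rbind. It assumes the first row always has a non-blank Answer; `grp` and `merged` are illustrative names.

```r
df1 <- data.frame(
  Question = c("This is the start", "of a question",
               "This is a second", "question"),
  Answer = c("Yes", "", "No", ""),
  stringsAsFactors = FALSE
)

# Each non-blank Answer starts a new group: 1 1 2 2
grp <- cumsum(nchar(df1$Answer) > 0)
grp

merged <- data.frame(
  # Join the Question fragments within each group
  Question = as.vector(tapply(df1$Question, grp, paste, collapse = " ")),
  # The first row of each group holds the Answer
  Answer = df1$Answer[!duplicated(grp)],
  stringsAsFactors = FALSE
)
merged
```

Because the grouping only depends on where the non-blank answers are, this handles questions that wrap over any number of lines.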
..another (very similar) approach using dplyr. It pairs each blank-Answer row with the row directly above it, so it assumes every question spans exactly two lines.
library(dplyr)
df1 %>% mutate(id = cumsum(!Answer %in% c('Yes', 'No')),
Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
filter(Q2 != "") %>%
select(id, Question = Q2, Answer = A2)