R: How do I concatenate a string broken into multiple lines?

I have a data frame that looks like this:
df1 <- data.frame(Question=c("This is the start", "of a question", "This is a second", "question"),
Answer = c("Yes", "", "No", ""))
Question Answer
1 This is the start Yes
2 of a question
3 This is a second No
4 question
This is dummy data, but the real data is being pulled from PDF via tabulizer. Any time there is a line break in Question in the source document, that question gets split into multiple lines. How do I concatenate back based on the condition that Answer is blank?
The desired result is simply:
Question Answer
1 This is the start of a question Yes
2 This is a second question No
The logic is simply: if Answer[x] is blank, append Question[x] to Question[x-1] and remove row x.

This could no doubt be improved, but if you're happy to use the tidyverse, perhaps an approach like this could work?
library(dplyr)
library(tidyr)
library(stringr)
df1 %>%
mutate(id = if_else(Answer != "", row_number(), NA_integer_)) %>%
fill(id) %>% group_by(id) %>%
summarise(Question = str_c(Question, collapse = " "), Answer = first(Answer))
#> # A tibble: 2 x 3
#> id Question Answer
#> <int> <chr> <fctr>
#> 1 1 This is the start of a question Yes
#> 2 3 This is a second question No

The following should do, if I follow your logic:
# test data
dff <- data.frame(Question=c("This is the start",
"of a question",
"This is a second",
"question",
"This is a third",
"question",
"and more space",
"yet even more space",
"This is actually another question"),
Answer = c("Yes",
"",
"No",
"",
"Yes",
"",
"",
"",
"No"),
stringsAsFactors = FALSE)
# solution
do.call(rbind, lapply(split(dff, cumsum(nchar(dff$Answer)>0)), function(x) {
data.frame(Question=paste0(x$Question, collapse=" "), Answer=head(x$Answer,1))
}))
# Question Answer
# 1 This is the start of a question Yes
# 2 This is a second question No
# 3 This is a third question and more space yet even more space Yes
# 4 This is actually another question No
The idea is to apply cumsum to the expression nchar(dff$Answer) > 0. The running count goes up by one at every row that has an answer, so each question and its continuation lines share one group number. split uses that vector to break dff into one small data frame per question; within each piece, the Question values are pasted together and the first Answer is kept. rbind then stacks the pieces back into a single data frame.
I hope this helps.
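To see why the cumsum trick works, it can help to print the grouping vector on its own. The following is a minimal base R sketch using the same test values as above; the names questions, answers, grp and merged are only for illustration.

```r
# Same values as the dff test data above
questions <- c("This is the start", "of a question",
               "This is a second", "question",
               "This is a third", "question",
               "and more space", "yet even more space",
               "This is actually another question")
answers <- c("Yes", "", "No", "", "Yes", "", "", "", "No")

# TRUE where a new question starts; the running sum labels each block
grp <- cumsum(nchar(answers) > 0)
grp
# [1] 1 1 2 2 3 3 3 3 4

# Paste the pieces of each block back together
merged <- as.vector(tapply(questions, grp, paste, collapse = " "))
merged
```

Rows that share a group number belong to the same question, which is what split relies on in the solution above.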

Another (very similar) approach using dplyr. Note that lag() only looks one row back, so this handles questions split across exactly two lines:
library(dplyr)
df1 %>% mutate(id = cumsum(!Answer %in% c('Yes', 'No')),
Q2 = ifelse(Answer == "", paste(lag(Question), Question), ""),
A2 = ifelse(Answer == "", as.character(lag(Answer)), "")) %>%
filter(Q2 != "") %>%
select(id, Question = Q2, Answer = A2)

Related

R - Splitting a dataframe by using strsplit, but keep delimiter [duplicate]

I have a dataframe like the following:
ref = c("ab/1bc/1", "dd/1", "cc/1", "2323")
text = c("car", "train", "mouse", "house")
data = data.frame(ref, text)
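Which produces this:
ref text
1 ab/1bc/1 car
2 dd/1 train
3 cc/1 mouse
4 2323 house
If the cell within the ref column has /1 in it, I want to split it and duplicate the row.
I.e. the table above should look like this:
ref text
1 ab/1 car
2 bc/1 car
3 dd/1 train
4 cc/1 mouse
5 2323 house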
I have the following code, which splits the cell by the /1, but it also removes it. I thought about adding /1 back onto every ref, but not all refs have it.
data1 = data %>%
mutate(ref = strsplit(as.character(ref), "/1")) %>%
unnest(ref)
Some of the other answers use regex for when people split by things like &/,. etc, but not /1. Any ideas?
With separate_rows and look-behind:
library(tidyr)
library(dplyr)
data %>%
separate_rows(ref, sep = "(?<=/1)") %>%
filter(ref != "")
output
# A tibble: 5 × 2
ref text
<chr> <chr>
1 ab/1 car
2 bc/1 car
3 dd/1 train
4 cc/1 mouse
5 2323 house
Or with strsplit:
data %>%
mutate(ref = strsplit(ref, "(?<=/1)", perl = TRUE)) %>%
unnest(ref)
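If look-behind feels opaque, the same result can be reached in base R by first inserting a marker after every /1 and then splitting on the marker. This is a different technique from the answers above, sketched under the assumption that the marker (a newline here) never occurs in ref; the names pieces and result are made up for illustration.

```r
ref  <- c("ab/1bc/1", "dd/1", "cc/1", "2323")
text <- c("car", "train", "mouse", "house")

# Put a newline after each "/1", then split on the newline;
# strsplit drops the empty piece after a trailing separator
pieces <- strsplit(gsub("/1", "/1\n", ref, fixed = TRUE), "\n", fixed = TRUE)

# Repeat each text value once per piece it produced
result <- data.frame(ref = unlist(pieces),
                     text = rep(text, lengths(pieces)))
result
```

Because the delimiter is kept by construction, no filter for empty strings is needed afterwards.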

Simplifying tables (squashing them!) in R- basic q

I have a basic q I would like a quick R solution in...
I have a tab delimited table with multiple rows, but I want to "squash" all rows into one... for example:
name  day  red  blue  orange  black
bill  1    yes
bill  2         yes
bill  3                       yes
bill  4               no
But I want the output to be independent of day:
name red blue orange black
bill yes yes no yes
So essentially I am squashing the table down to include all answers regardless of the day. NB: There are never any overlaps i.e. Bill will select only one colour per day.
I could do this in excel, but I'd prefer to find an R solution... happy for guidance even wrt which libraries would be useful :).
Go easy on me, I'm a clinician not a bioinformatician!
Here is an option with dplyr. If the missing values are "", after grouping over 'name', summarise by looping across the columns and get the elements that are not a blank (.[. != ""])
library(dplyr)
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[.!= '']))
Or if the missing values are NA
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ .[!is.na(.)]))
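If there is more than one non-missing element per group, these summaries are no longer length one, so the result is not a single row per name (dplyr may return several rows or complain about incompatible sizes, depending on the version). Instead, we can paste the values together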
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(.[!is.na(.)])))
If there are both NA and "", an option is to convert the "" to NA and then use is.na or complete.cases or with na.omit
df1 %>%
group_by(name) %>%
summarise(across(red:black, ~ toString(na.omit(na_if(., "")))))
In base R, you could use aggregate and select non-blank values for each name.
aggregate(cbind(red,blue,orange,black)~name, df, function(x) toString(x[x!='']))
# name red blue orange black
#1 bill yes yes no yes
data (called df here; the dplyr snippets above refer to the same table as df1)
df <- structure(list(name = c("bill", "bill", "bill", "bill"), day = 1:4,
red = c("yes", "", "", ""), blue = c("", "yes", "", ""),
orange = c("", "", "", "no"), black = c("", "", "yes", ""
)), class = "data.frame", row.names = c(NA, -4L))

ggplot. I have a bunch of codes that need a 0 number before the actual data in the dataframe. How I can put it? [duplicate]

I am plotting a data frame onto a shapefile map, but I need to repair a bunch of these codes.
The map has 3 areas, identified by codes. Two areas work fine with the codes as they are, but none of the codes that start with 3 work. These codes need to start with 03.
So if I have 3002 in the data frame, I need to change it to 03002.
Is it possible to do this in R?
This is a bit of my code. The column that needs a 0 in front of the leading 3s is called "code2":
dfcsv1 <- read.csv("https://dadesobertes.gva.es/datastore/dump/e23bf332-be3e-4a3a-a07b-300db3d9a7be?bom=True", encoding = "UTF-8", header = TRUE, sep = ",")
colnames(dfcsv1) <- c("code", "code2", "Municipio", "PCR", "TasaPCR", "PCR14", "TasaPCR14", "Muertos", "TasaMuertos")
dfcsv1$TasaMuertos = as.numeric(gsub(",","\\.",dfcsv1$TasaMuertos))
dfcsv1$TasaPCR = as.numeric(gsub(",","\\.",dfcsv1$TasaPCR))
dfcsv1$TasaPCR14 = as.numeric(gsub(",","\\.",dfcsv1$TasaPCR14))
dfcsv1 <- dfcsv1 %>%
mutate(
municipio = stringr::str_c(code2)
)
mapa_df <- mapa_df %>%
left_join(
y = dfcsv1 %>% select(municipio, TasaPCR14),
by = "municipio"
)
thank you all in advance!
A slightly more concise method would be stringr::str_pad: (Note the default is to 'pad' to the left. see ?str_pad)
library(dplyr)
library(stringr)
t <- tibble(CODE = c("4444", "55555","333"))
t %>%
dplyr::mutate(CODE = str_pad(CODE, 5, pad = '0'))
# A tibble: 3 x 1
CODE
<chr>
1 04444
2 55555
3 00333
And a small example using case_when:
library(dplyr)
t <- tibble(CODE = c("4444", "55555","3333"))
t %>%
dplyr::mutate(CODE = case_when(nchar(CODE) == 3 ~ paste0("00", CODE),
nchar(CODE) == 4 ~ paste0("0", CODE),
TRUE ~ CODE))
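Base R can do the same padding without extra packages. The sketch below assumes the codes are whole numbers stored as text, as in the examples above; the vector names codes and padded are made up for illustration.

```r
codes <- c("4444", "55555", "3002")

# %05d pads an integer with leading zeros to a total width of 5
padded <- sprintf("%05d", as.integer(codes))
padded
# [1] "04444" "55555" "03002"
```

Applied to the question's data, this would be something like dfcsv1$code2 <- sprintf("%05d", as.integer(dfcsv1$code2)), provided every value in code2 is numeric.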

Count number of times a word appears (dplyr)

Simple question here, perhaps a duplicate of this?
I'm trying to figure out how to count the number of times a word appears in a vector. I know I can count the number of rows a word appears in, as shown here:
temp <- tibble(idvar = 1:3,
response = (c("This sounds great",
"This is a great idea that sounds great",
"What a great idea")))
temp %>% count(grepl("great", response)) # lots of ways to do this line
# answer = 3
The answer in the code above is 3 since "great" appears in three rows. However, the word "great" appears 4 different times in the vector "response". How do I find that instead?
We could use str_count from stringr to get the number of instances having 'great' in each row and then get the sum of that count
library(tidyverse)
temp %>%
mutate(n = str_count(response, 'great')) %>%
summarise(n = sum(n))
# A tibble: 1 x 1
# n
# <int>
#1 4
Or using regmatches/gregexpr from base R
sum(lengths(regmatches(temp$response, gregexpr('great', temp$response))))
#[1] 4
Off the top of my head, this should solve your problem:
library(tidyverse)
temp$response %>%
str_extract_all('great') %>%
unlist %>%
length
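One thing to keep in mind with all of these is that the pattern 'great' also matches inside longer words such as "greatest". If only the whole word should count, a word-boundary pattern can be used. A minimal base R sketch, with one made-up sentence added to show the difference; count_matches is a hypothetical helper, not part of any package:

```r
response <- c("This sounds great",
              "This is a great idea that sounds great",
              "What a great idea",
              "The greatest idea")  # made-up extra row

# Number of matches of `pattern` in each element of `x`
count_matches <- function(pattern, x) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

sum(count_matches("great", response))        # substring matches
# [1] 5
sum(count_matches("\\bgreat\\b", response))  # whole-word matches only
# [1] 4
```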

Dplyr pipe (%>%) within mutate()?

The piping in dplyr is cool and sometimes I want to clean up one column by applying multiple commands to it. Is there a way to use the pipe within the mutate() command? I notice this most when using regex and it comes up also in other contexts. In the example below, I can clearly see the different manipulations I am applying to the column "Clean" and I am curious if there is a way to do something that mimics %>% within mutate().
library(dplyr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = F)
phone2 <- phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers), #remove extensions
Clean = gsub("[^0-9]", "", Clean), #remove parentheses, dashes, etc
Clean = substr(Clean, nchar(Clean)-9, nchar(Clean)), #grab the right 10 characters
Clean = gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", Clean)) #format
phone2
I know there might be a better gsub() command but for the purposes of this question, I want to know if there is a way to pipe these gsub() elements together so that I don't have to keep writing Clean = gsub(...) but also not have to use the method where I embed these inside each other.
It would be fine with me if you answer this question using a simpler example.
Don't fall into the trap of endless pipes. Do the right thing for readability and efficiency: write a function.
phone %>% mutate(Clean = cleanPhone(Numbers))
# Numbers Clean
# 1 1234567890 (123)456-7890
# 2 555-3456789 (555)345-6789
# 3 222-222-2222 (222)222-2222
# 4 5131831249 (513)183-1249
# 5 123.321.1234 (123)321-1234
# 6 (333)444-5555 (333)444-5555
# 7 +1 123-223-3234 (123)223-3234
# 8 555-666-7777 x100 (555)666-7777
Custom function (define it before running the pipeline above):
cleanPhone <- function(x) {
x1 <- gsub("[A-Za-z].*", "", x) # remove extensions such as " x100"
x2 <- gsub("[^0-9]", "", x1) # keep digits only
x3 <- substr(x2, nchar(x2)-9, nchar(x2)) # grab the right 10 digits
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x3) # format as (xxx)xxx-xxxx
}
I guess you need
phone %>%
mutate(Clean = gsub("[A-Za-z].*", "", Numbers) %>%
gsub("[^0-9]", "", .) %>%
substr(., nchar(.)-9, nchar(.)) %>%
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .))
# Numbers Clean
#1 1234567890 (123)456-7890
#2 555-3456789 (555)345-6789
#3 222-222-2222 (222)222-2222
#4 5131831249 (513)183-1249
#5 123.321.1234 (123)321-1234
#6 (333)444-5555 (333)444-5555
#7 +1 123-223-3234 (123)223-3234
#8 555-666-7777 x100 (555)666-7777
Even though the question is answered, consider this method, which pipes the column directly with magrittr and does not need mutate():
library(magrittr)
phone <- data.frame(Numbers = c("1234567890", "555-3456789", "222-222-2222",
"5131831249", "123.321.1234","(333)444-5555",
"+1 123-223-3234", "555-666-7777 x100"),
stringsAsFactors = FALSE)
phone
cleanchain <- phone$Numbers %>%
gsub("[A-Za-z].*", "", .) %>% # remove extensions
gsub("[^0-9]", "", .) %>% # keep digits only
substr(., nchar(.)-9, nchar(.)) %>% # grab the right 10 digits
gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", .) # format
cleanchain
data.frame(old=phone$Numbers,new=cleanchain, stringsAsFactors = FALSE)
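If the goal is mainly to avoid repeating Clean = gsub(...), another option is to keep the steps in a list and apply them in order with Reduce. This is only a sketch of that idea in base R, not something proposed in the answers above; steps and clean_phone are hypothetical names.

```r
# Each step takes a character vector and returns a character vector
steps <- list(
  function(x) gsub("[A-Za-z].*", "", x),          # remove extensions
  function(x) gsub("[^0-9]", "", x),              # keep digits only
  function(x) substr(x, nchar(x) - 9, nchar(x)),  # right 10 digits
  function(x) gsub("(^\\d{3})(\\d{3})(\\d{4}$)", "(\\1)\\2-\\3", x)  # format
)

# Feed the result of each step into the next one, starting from x
clean_phone <- function(x) Reduce(function(acc, f) f(acc), steps, x)

clean_phone(c("555-666-7777 x100", "+1 123-223-3234"))
# [1] "(555)666-7777" "(123)223-3234"
```

The resulting function can be used inside mutate() like any other vectorised function, for example mutate(Clean = clean_phone(Numbers)).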

Resources