How to combine two specific cells using a comma separator in R

I have the following data set.
I'd like cell [2,6] to retain its current content with the addition of the cell below it, separated by a ",". I have found paste() functions for concatenating columns into new columns, but I can't find an answer for combining specific cells. Any help appreciated.
For example, I'd like the highlighted cell above to read John, Mary.

Here is one way to paste a cell's value together with the value in the row below it.
# Load the required library
library(dplyr)

# Create sample data
example <- data.frame(Type = c("Director", NA_character_),
                      Name = c("John", "Mary"))

example %>%
  # Get the value of the row below
  mutate(new_col = lead(Name, 1),
         # If Type is "Director", paste the names together; otherwise NA
         new_col = ifelse(Type == "Director",
                          paste(Name, new_col, sep = ", "),
                          NA_character_))
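With the sample data above, the result should look along these lines:
#       Type Name    new_col
# 1 Director John John, Mary
# 2     <NA> Mary       <NA>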

Related

How to merge tables and format appropriately?

So I have the following in cityzone.txt:
"earth/city/somerset/forest/somerset-test.txt#53497",
"earth/city/nottingham/forest/nighthill.txt#53498",
"earth/city/bury/town/bishop-zone1.mp3#53695",
And the following in areasize.txt:
planet\mars\red\crater.txt;56,
pluto\distant\dwarfmoon.txt;181,
mars\hot\red\redmoon.txt;43,
earth\city\somerset\forest\somerset-test.txt;205,
earth\city\bury\town\bishop-zone1.mp3;499,
So what I need is for a new table to be created and written to an output file.
What should happen is - for each row in cityzone.txt, the title for that row should be looked up in areasize.txt. If the title exists, the areasize number from areasize.txt should be appended to the cityzone row like this:
"title#id#areasize",
With quotes and comma accordingly.
So for cityzone.txt above, the output should be thus:
"earth/city/somerset/forest/somerset-test.txt#53497#205",
"earth/city/bury/town/bishop-zone1.mp3#53695#499",
And then it should be output to a file with quotes and commas as shown.
So only 2 of the 3 cityzone.txt rows are included in the results because only 2 of the 3 rows exist in areasize.txt.
My starter code for this is really a continuation from this question:
How do I merge partial data and format it in R?
So I will add the code for this to the code in that question.
Thank you.
You can do:
library(dplyr)
library(tidyr)

# Read the text files and keep only the 1st column of cityzone
cityzone <- read.table('cityzone.txt')[1]
areasize <- read.table('areasize.txt', sep = ';')

# Clean the areasize dataframe, separate the cityzone column on "#" and join
cityzone %>%
  separate(V1, c('V1', 'V2'), sep = '#') %>%
  inner_join(areasize %>%
               mutate(V1 = gsub('\\\\', '/', V1),
                      V2 = sub(',$', '', V2)),
             by = 'V1') -> result

# Combine the output in the required format and write it
cat(sprintf('"%s#%s#%s",', result$V1, result$V2.x, result$V2.y),
    file = 'output.lua', sep = '\n')
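To see what the two clean-up calls inside the join do, here they are applied to a single areasize value (my own illustration, using one of the sample rows):
gsub('\\\\', '/', 'earth\\city\\bury\\town\\bishop-zone1.mp3')
# [1] "earth/city/bury/town/bishop-zone1.mp3"
sub(',$', '', '499,')
# [1] "499"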

How to make purrr recognize colnames of df in a list?

I am trying to unite first and last names in each dataframe in a list of dataframes. The problem is that purrr doesn't seem to recognize colnames within each df.
Each df in data$authors_list looks something like this:
authid  surname  given-name
12345   Smith    John
85858   Scott    Jane
I want to unite the "surname" and "given-name" columns into a column called AuN.
data <- data %>%
  mutate(authors_list = map(authors_list,
                            unite(col = AuN,
                                  c(`given-name`,
                                    surname),
                                  sep = " ")))
However, I get the following error.
Error in unite(col = AuN, c(`given-name`, surname), sep = " ") :
object 'given-name' not found
I am new to using purrr, and I haven't been able to find solutions to a similar problem online. Any help would be appreciated!
I think this is what you're after. You need to use .x in the unite() call to stand in for each data frame in the list. For each one, it will apply unite() with the parameters you specified.
library(tidyverse)

# Set up the data (but please, in the future, include sample data in your question)
df <- tibble(authid = c(12345, 85858),
             surname = c("Smith", "Scott"),
             `given-name` = c("John", "Jane"))
list_df <- list(df, df, df)

list_df_unite <- map(list_df, ~ unite(.x, AuN, c(`given-name`, surname), sep = " "))
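Applied back to the asker's data (assuming data$authors_list really is a list-column of data frames shaped like the table above), the same .x pattern slots into the original mutate() call:
data <- data %>%
  mutate(authors_list = map(authors_list,
                            ~ unite(.x, AuN, c(`given-name`, surname), sep = " ")))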

How can I have R identify text in one column and then use it to create data in another column?

I have materials lists from vendors, and I would like to expand the description into other columns so I can use the filter function in Excel to more easily find products based on their description. Here's an example of a description I receive from a vendor:
2 SS 150LB 304L SLIP ON FLANGE
I would like to take this description and have R identify certain bits of text, and based on that text, add data to another column. For instance: if the string "SS" is in this cell, then put the word "STAINLESS" in a Materials column. If the string "BLK" is found in this cell, then put the word "BLACK" in the Materials column. If the string "FLANGE" is found in this cell, then put the word "FLANGE" in another column called Part_Type.
Here is one simple approach which looks for certain character sequences to use as a trigger to add strings to other columns.
library(tidyverse)

df <- tibble(x = c('2 SS 150LB 304L SLIP ON FLANGE',
                   '3 BLK ON FLANGE'))

# Add new columns filled with NA
df <- df %>%
  add_column(Materials = NA_character_) %>%
  add_column(Part_Type = NA_character_)

df %>%
  mutate(Materials = if_else(str_detect(x, 'SS'), 'STAINLESS', Materials)) %>%
  mutate(Materials = if_else(str_detect(x, 'BLK'), 'BLACK', Materials)) %>%
  mutate(Part_Type = if_else(str_detect(x, 'FLANGE'), 'FLANGE', Part_Type))
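With the Part_Type capitalisation kept consistent as above, the pipeline should print something like:
# A tibble: 2 x 3
  x                              Materials Part_Type
  <chr>                          <chr>     <chr>
1 2 SS 150LB 304L SLIP ON FLANGE STAINLESS FLANGE
2 3 BLK ON FLANGE                BLACK     FLANGE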
Can an item be both 'stainless steel' and 'black'? i.e. do we want to add multiple strings to one column? In that case, it would be necessary to append rather than overwrite. Here's one approach to that problem.
my_nrow <- 2
df <- tibble(x = c('2 SS 150LB 304L SLIP ON FLANGE',
                   '3 BLK SS ON FLANGE'),
             Materials = vector('character', my_nrow),
             Part_type = vector('character', my_nrow))
df

df %>%
  mutate(Materials = if_else(str_detect(x, 'SS'), str_c(Materials, 'STAINLESS '), Materials)) %>%
  mutate(Materials = if_else(str_detect(x, 'BLK'), str_c(Materials, 'BLACK '), Materials)) %>%
  mutate(Part_type = if_else(str_detect(x, 'FLANGE'), str_c(Part_type, 'FLANGE', sep = ' '), Part_type))
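One caveat with the append approach: str_c() leaves stray leading/trailing spaces in Materials and Part_type. A small clean-up step with str_squish() (my addition, not part of the original answer) tidies that up:
# Trim the padding spaces left by str_c() (assumes the pipeline above was
# assigned back to df)
df <- df %>%
  mutate(across(c(Materials, Part_type), str_squish))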

Subset strings in R

One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
The rest of the data in the vector follows the same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\n\\((.*)\\)', '\\1', df$location1)
I was able to make it work by looking at other code online, but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full, I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subsetting with regex, using the scan() function, and many other variations. Now I am all confused. Thanks.
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)

# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
                  stringsAsFactors = FALSE)
dat
#                                          Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)

# Separate the Data column
dat2 <- dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]")
dat2
#      City State   Zip  Latitude  Longitude
# 1 Potomac    MD 20854 39.038266 -77.203413
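As a small aside (not part of the original answer): separate() also takes convert = TRUE if you would rather have Zip, Latitude and Longitude come back as numeric columns instead of character:
dat %>%
  separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
           sep = ", |\\n\\(|\\)|[[:space:]]", convert = TRUE)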
You can try strsplit or data.table::tstrsplit (strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to separate state and zip, and city names may contain spaces, you can try a two-step approach:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
Here is an option using base R
read.table(text = trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
           header = FALSE,
           col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
           stringsAsFactors = FALSE)
#      City State   Zip Latitude Longitude
# 1 Potomac    MD 20854 39.03827 -77.20341
This process might be a little longer, but for me it makes things clear. Rather than splitting on delimiters, below I identify each value with its own specific regex. I make a vector of regexes to extract each value and a vector of variable names, then loop over them to extract the values and build the data frame.
library(stringi)
library(dplyr)
library(purrr)

# The string to parse (the example from the question)
value <- 'Potomac, MD 20854\n(39.038266, -77.203413)'

# One regex per value to extract
rgexVec <- c("[\\w\\s-]+(?=,)",
             "[A-Z]{2}",
             "\\d+(?=\\n)",
             "[\\d-\\.]+(?=,)",
             "[\\d-\\.]+(?=\\))")

varNames <- c("city",
              "state",
              "zip",
              "lat",
              "long")

map2_dfc(varNames, rgexVec, function(vn, rg) {
  extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
  names(extractedVal) <- vn
  extractedVal %>% as_tibble()
})
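For the example string, this should return a one-row tibble roughly like:
# A tibble: 1 x 5
  city    state zip   lat       long
  <chr>   <chr> <chr> <chr>     <chr>
1 Potomac MD    20854 39.038266 -77.203413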
\\1 is a backreference in regex. Rather than acting like a wildcard, it refers back to whatever the first capturing group, the (.*) between the escaped parentheses, matched. In the replacement string it means "insert that captured text here", so the gsub() call above replaces the whole string with just the captured latitude/longitude part.
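As a minimal illustration with the example string from the question:
gsub('.*\\n\\((.*)\\)', '\\1', 'Potomac, MD 20854\n(39.038266, -77.203413)')
# [1] "39.038266, -77.203413"
Here (.*) captures the text between the parentheses, and \\1 pastes that capture back in as the replacement.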

R: Conditional Formatting across excel files

I am trying to highlight rows of an excel file based on a match from the columns in a separate excel file. Pretty much, I want to highlight a row in file1 if a cell in that row matches a cell in file2.
I saw that the "conditionalFormatting" function (from the openxlsx package) has some of this functionality, but I cannot figure out how to use it.
The pseudo-code I have in mind would look something like this:
file1 <- read_excel("file1")
file2 <- read_excel("file2")
conditionalFormatting(file1, sheet = 1, cols = 1:end, rows = 1:22,
                      rule = "number in file1 is found in a specific column of file2")
Please let me know if this makes sense or if i need to clarify something.
Thanks!
The conditionalFormatting() function embeds active conditional formatting into the excel document but is likely more complicated than you need for a one-time highlight. I'd suggest loading each file into a dataframe, determining which rows contain a matching cell, creating a highlight style (yellow background), loading the file as a workbook object, setting the appropriate rows to the highlight style, and saving the updated workbook object.
The following function is used to determine which rows have a match. The magrittr package provides the %>% pipe and the data.table package provides the transpose() function.
find_matched_rows <- function(df1, df2) {
  require(magrittr)
  require(data.table)

  # The dataframe object treats each column as a list, making it much easier
  # and faster to search by column than by row. Transpose the original file1
  # dataframe to treat the rows as columns.
  df1_transposed <- data.table::transpose(df1)

  # Assuming that the location of the match in the second file is irrelevant,
  # unlist the file2 dataframe so that each value in file1 can be searched in
  # a vector
  df2_as_vector <- unlist(df2)

  # Determine which columns contain a match. If one or more matches are found,
  # mark the row as TRUE in the output vector to be used to subset the row
  # numbers
  match_map <- lapply(df1_transposed, FUN = `%in%`, df2_as_vector) %>%
    as.data.frame(stringsAsFactors = FALSE) %>%
    sapply(function(x) sum(x) > 0)

  # Make a vector of row numbers using the logical match_map vector to subset
  matched_rows <- seq(1:nrow(df1))[match_map]
  return(matched_rows)
}
The following code loads the data, finds the matched rows, applies the highlight, and saves over the original file1.xlsx. The second set of tst_df1 and tst_df2 assignments provides an easy way to test the find_matched_rows() function. As expected, it finds that the 1st and 3rd rows of the first dataframe contain a cell that matches a cell in the second dataframe.
# Used to ensure that the correct rows are highlighted: unlike Excel, the
# dataframe does not include the header as an independent row.
file1_header_row <- 1
file2_header_row <- 1

tst_df1 <- openxlsx::read.xlsx("./file1.xlsx", startRow = file1_header_row)
tst_df2 <- openxlsx::read.xlsx("./file2.xlsx", startRow = file2_header_row)

# Example data for testing
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
                      lname = c("Smith", "Johnson", "Samson"),
                      wage = c(10, 15.23, 137.38),
                      stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
                      b = c("Billy", "Bill", "Billy-Bob"),
                      c = c("Samson", "Johansson", NA),
                      stringsAsFactors = FALSE)

df_matched_rows <- find_matched_rows(tst_df1, tst_df2)

# Any color found in colours() can be used here, or a hex color beginning with "#"
highlight_style <- openxlsx::createStyle(fgFill = "yellow")

file1_wb <- openxlsx::loadWorkbook(file = "./file1.xlsx")

openxlsx::addStyle(wb = file1_wb,
                   sheet = 1,
                   style = highlight_style,
                   rows = file1_header_row + df_matched_rows,
                   cols = 1:ncol(tst_df1),
                   stack = TRUE,
                   gridExpand = TRUE)

openxlsx::saveWorkbook(wb = file1_wb,
                       file = "./file1.xlsx",
                       overwrite = TRUE)
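For the test dataframes above, the matching step should give:
df_matched_rows
# [1] 1 3
so, after adding the file1_header_row offset, rows 2 and 4 of the spreadsheet are the ones that get highlighted.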
