Separating cells with several delimiters (splitstackshape) - r

I am working with a database whose values need to be split on several delimiters. The most common are semicolons and a period followed by a slash: './'.
How do I complete the code below so that it applies both delimiters?
library(tidyverse)
library(splitstackshape)
values <- c("cat; dog; mouse", "cat ./ dog ./ mouse")
data <- data.frame(cbind(values))
separated <- cSplit(data.frame(data), "values", sep = ";", drop = TRUE)
I tried a vector solution but without much success.

I'm not exactly sure what your final output structure should be, but one approach could be to start with tidyr::separate, which would put each of your animals in a separate column:
df <- tidyr::separate(data, col = values,
into = c("Animal1", "Animal2", "Animal3"),
sep = c(";|./"))
#   Animal1 Animal2 Animal3
#1 cat dog mouse
#2 cat dog mouse
Without a pre-defined number of elements in each string, you could also try:
# Add in a third value to data with only 2 animals
values <- c("cat; dog; mouse", "cat ./ dog ./ mouse", "frog; squirrel")
data <- data.frame(cbind(values))
data_clean <- gsub(";|./", ";", data$values)
separated <- splitstackshape::cSplit(data.frame(values = data_clean),
"values", sep = ";", drop = TRUE)
# values_1 values_2 values_3
# 1: cat dog mouse
# 2: cat dog mouse
# 3: frog squirrel <NA>
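A small hedged caveat (not part of the original answer): in the pattern ";|./" the dot is a regex metacharacter, so it matches any character before the slash. Escaping it, and trimming the surrounding whitespace in the same pass, gives a slightly stricter sketch:
# Sketch: match only a literal "./" and strip the spaces around either delimiter
data_clean2 <- gsub("\\s*(;|\\./)\\s*", ";", data$values)
separated2 <- splitstackshape::cSplit(data.frame(values = data_clean2), "values", sep = ";", drop = TRUE)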

Related

Create new columns with information extracted from another column on R

I have a data frame where one of the columns has several pieces of information separated by ";", like the following:
DF = data.frame(a = c(1,1,1,2,2), b = c('aaa','aaa','aba','abc','ccc'),
extra_info = c(
'animal=horse;color=orange;shape=circle',
'animal=monkey;shape=square;value=532',
'animal=horse;color=blue;shape=square;value=321',
'animal=dog;color=green;value=678',
'color=pink;shape=triangle'
))
I can't use read.table because I'm already using a different function to read the data (and also the content of each row in the extra_info column differs, so the columns would get mixed up). What I want to do is separate all this information into different columns and assign proper names accordingly, such as:
a b animal color shape value
1 aaa horse orange circle NA
1 aaa monkey NA square 532
1 aba horse blue square 321
2 abc dog green NA 678
2 ccc NA pink triangle NA
So far, I've tried:
new_cols = DF %>% separate(extra_info, c(LETTERS[1:4]), sep = ";")
new_cols %>% separate(A, c("key","value"), sep = '=') %>%
separate(B, c("key","value"), sep = '=') %>%
separate(C, c("key","value"), sep = '=') %>%
separate(D, c("key","value"), sep = '=') %>%
pivot_wider(names_from = c("key"), values_from = c("value"))
But it doesn't work as expected.
Here's an approach where I change the syntax of your key-value pairs into valid JSON syntax and use jsonlite::fromJSON to parse it:
library(purrr)
library(dplyr)
library(stringr)
library(jsonlite)
DF %>%
mutate(
json = str_replace_all(extra_info, pattern = "\\b", replacement = '"'),
json = str_replace_all(json, pattern = fixed("="), replacement = ":"),
json = str_replace_all(json, pattern = fixed(";"), replacement = ","),
json = paste("{", json, "}"),
) %>%
pull(json) %>%
map(jsonlite::fromJSON) %>%
map(as.data.frame) %>%
bind_rows %>%
cbind(DF, .)
# a b extra_info animal color shape value
# 1 1 aaa animal=horse;color=orange;shape=circle horse orange circle <NA>
# 2 1 aaa animal=monkey;shape=square;value=532 monkey <NA> square 532
# 3 1 aba animal=horse;color=blue;shape=square;value=321 horse blue square 321
# 4 2 abc animal=dog;color=green;value=678 dog green <NA> 678
# 5 2 ccc color=pink;shape=triangle <NA> pink triangle <NA>
Here is a base R option using gsub + eval + str2expression
v <- DF$extra_info
# quote each value and swap ";" for "," so every row becomes valid c(key = 'value', ...) syntax
p <- gsub(";", ",", gsub("(?<=\\=)(\\w+)", "'\\1'", v, perl = TRUE))
# collect all key names (the words preceding "=") across rows
nms <- unique(unlist(regmatches(v, gregexpr("\\w+(?=\\=)", v, perl = TRUE))))
# evaluate each row as a named vector and align it to the full set of keys
q <- unname(Map(function(x) setNames(eval(str2expression(x))[nms], nms), sprintf("c(%s)", p)))
cbind(DF[c("a","b")], type.convert(data.frame(do.call(rbind, q)), as.is = TRUE))
which gives
a b animal color shape value
1 1 aaa horse orange circle NA
2 1 aaa monkey <NA> square 532
3 1 aba horse blue square 321
4 2 abc dog green <NA> 678
5 2 ccc <NA> pink triangle NA
It's a bit neater with the stringr package, but if you just want base R you can use the following. In the pattern (?<=animal=)\\w+(?=\\b), the \\w+ is what's actually returned: any word character (\\w), with at least one required (+). This is swapped for \\d+ in the 'value' pattern, since digits are required there. Alternatively, you could replace both with [:alnum:]+.
The (?<=animal=) part specifies that the match must be preceded by "animal=", and (?=\\b) requires that it be followed by a word boundary (\\b). You could get a bit more specific and replace \\b with (,|;|$), which stands for comma, semicolon, or end of line (EDIT: the original question had commas in some places). There might be a nice way of writing a loop over the four words that creates the variable names and patterns dynamically; one possible sketch follows the code below.
pattern <- "(?<=animal=)\\w+(?=\\b)"
DF$animal <- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=color=)\\w+(?=\\b)"
DF$color<- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=shape=)\\w+(?=\\b)"
DF$shape<- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=value=)\\d+(?=\\b)"
DF$value <- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
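A sketch of the loop idea mentioned above (assuming the four keys are known in advance; \\w+ is used for all four patterns, which also covers the digits in 'value'):
keys <- c("animal", "color", "shape", "value")
for (key in keys) {
  pattern <- paste0("(?<=", key, "=)\\w+")
  DF[[key]] <- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info, perl = TRUE)), "[", 1)
}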
If you're happy to use tidyverse/stringr, here is the code.
DF <- DF %>%
mutate(animal = str_extract(extra_info, "(?<=animal=)\\w+(?=\\b)" )) %>%
mutate(color = str_extract(extra_info, "(?<=color=)\\w+(?=\\b)" )) %>%
mutate(shape = str_extract(extra_info, "(?<=shape=)\\w+(?=\\b)" )) %>%
mutate(value = str_extract(extra_info, "(?<=value=)\\d+(?=\\b)" ))
For more info on string manipulation and regular expressions, see the stringr cheat sheet here: https://github.com/rstudio/cheatsheets/blob/master/strings.pdf
library(stringr)
col_names <- unlist(str_extract_all(DF$extra_info[3], "(?<=^|;)\\w+"))
DF %>%
mutate(animal = str_extract(extra_info, paste0("(?<=", col_names[1], "=)\\w+")),
color = str_extract(extra_info, paste0("(?<=", col_names[2], "=)\\w+")),
shape = str_extract(extra_info, paste0("(?<=", col_names[3], "=)\\w+")),
value = str_extract(extra_info, paste0("(?<=", col_names[4], "=)\\w+"))
a b extra_info animal color shape value
1 1 aaa animal=horse;color=orange;shape=circle horse orange circle <NA>
2 1 aaa animal=monkey;shape=square;value=532 monkey <NA> square 532
3 1 aba animal=horse;color=blue;shape=square;value=321 horse blue square 321
4 2 abc animal=dog;color=green;value=678 dog green <NA> 678
5 2 ccc color=pink;shape=triangle <NA> pink triangle <NA>
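For completeness, a tidyr-only sketch along the lines the question attempted (the temporary row column just keeps duplicate a/b combinations apart before pivoting):
library(dplyr)
library(tidyr)
DF %>%
  mutate(row = row_number()) %>%
  separate_rows(extra_info, sep = ";") %>%
  separate(extra_info, into = c("key", "value"), sep = "=") %>%
  pivot_wider(names_from = key, values_from = value) %>%
  select(-row)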

How to extract all matching patterns (words in a string) in a dataframe column?

I have two data frames. One (txt.df) has a column with text I want to extract phrases from (text). The other (wrd.df) has a column with the phrases (phrase). Both are big data frames with complex texts and strings, but let's say:
txt.df <- data.frame(id = c(1, 2, 3, 4, 5),
text = c("they love cats and dogs", "he is drinking juice",
"the child is having a nap on the bed", "they jump on the bed and break it",
"the cat is sleeping on the bed"))
wrd.df <- data.frame(label = c('a', 'b', 'c', 'd', 'e', 'd'),
phrase = c("love cats", "love dogs", "juice drinking", "nap on the bed", "break the bed",
"sleeping on the bed"))
What I finally need is txt.df with another column that contains the labels of the phrases detected.
What I tried was creating a column in wrd.df in which I tokenized the phrases like this:
wrd.df$token <- sapply(wrd.df$phrase, function(x) unlist(strsplit(x, split = " ")))
and then tried to write a custom function to sapply over the token column with grepl/str_detect and get the names (labels) of those which were all TRUE:
Extract.Fun <- function(text, df, label, token){
for (i in token) {
truefalse[i] <- sapply(token[i], function (x) grepl(x, text))
truenames[i] <- names(which(truefalse[i] == T))
removedup[i] <- unique(truenames[i])
return(removedup)
  }
}
and then sapply this custom function on my txt.df$text to have a new column with the labels.
txt.df$extract <- sapply(txt.df$text, function (x) Extract.Fun(x, wrd.df, "label", "token"))
but I'm not good with custom functions and am really stuck. I would appreciate any help.
P.S. It would be very good if I could also have partial matches like "drink juice" and "broke the bed"... but it's not a priority; I'm fine with the original ones.
If you need to match the exact phrases, regex_join() from the fuzzyjoin package is what you need.
fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "phrase"), mode = "left" )
id text label phrase
1 1 they love cats and dogs a love cats
2 2 he is drinking juice <NA> <NA>
3 3 the child is having a nap on the bed d nap on the bed
4 4 they jump on the bed and break it <NA> <NA>
5 5 the cat is sleeping on the bed d sleeping on the bed
If you want to match all words, I guess you can build a regex out of the phrases that covers such behaviour...
update
#build regex for phrases
#done by splitting the phrases to individual words, and then paste the regex together
wrd.df$regex <- unlist( lapply( lapply( strsplit( wrd.df$phrase, " "),
function(x) paste0( "(?=.*", x, ")", collapse = "" ) ),
function(x) paste0( "^", x, ".*$") ) )
fuzzyjoin::regex_join( txt.df, wrd.df, by = c(text = "regex"), mode = "left" )
id text label phrase regex
1 1 they love cats and dogs a love cats ^(?=.*love)(?=.*cats).*$
2 1 they love cats and dogs b love dogs ^(?=.*love)(?=.*dogs).*$
3 2 he is drinking juice c juice drinking ^(?=.*juice)(?=.*drinking).*$
4 3 the child is having a nap on the bed d nap on the bed ^(?=.*nap)(?=.*on)(?=.*the)(?=.*bed).*$
5 4 they jump on the bed and break it e break the bed ^(?=.*break)(?=.*the)(?=.*bed).*$
6 5 the cat is sleeping on the bed d sleeping on the bed ^(?=.*sleeping)(?=.*on)(?=.*the)(?=.*bed).*$
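If the goal is a single column of labels on txt.df rather than one row per match, one possible follow-up (a sketch, assuming dplyr is available) is to collapse the matched labels per text:
library(dplyr)
joined <- fuzzyjoin::regex_join(txt.df, wrd.df, by = c(text = "regex"), mode = "left")
joined %>%
  group_by(id, text) %>%
  summarise(labels = paste(na.omit(label), collapse = ", "), .groups = "drop")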

Replace lowercase in names, not in surnames

I have a problem with a database of people's names. I want to abbreviate the first names but not the last names. The last name is separated from the first name by a comma, and different people are separated from each other by a semicolon, as in this example:
Michael, Jordan; Bird, Larry;
If the name is a single word, the code would be like this:
breve$autor <- str_replace_all(breve$autor, "[:lower:]{1,}\\;", ".\\;")
Result with this code:
Michael, J.; Bird, L.;
The problem is in compound names. With this code, the name:
Jordan, Michael Larry;
It would be:
Jordan, Michael L.;
Could someone tell me how to remove all the lowercase letters between the comma and the semicolon, so that it looks like this:
Jordan, M.L.;
Here is another solution:
x1 <- 'Michael, Jordan; Bird, Larry;'
x2 <- 'Jordan, Michael Larry;'
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x1, perl = TRUE)
# [1] "Michael, J.; Bird, L.;"
gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE)
# [1] "Jordan, M. L.;"
Surnames are followed by ",", while the parts of the names are followed by " " or ";". Here I use (?=[ ;]) to make sure that the character following the matched pattern is a space or a semicolon.
To remove the space between M. and L., an additional step is needed:
gsub('\\. ', '.', gsub('([A-Z])[a-z]+(?=[ ;])', '\\1.', x2, perl = TRUE))
# [1] "Jordan, M.L.;"
There must be a regular expression that will do this, of course. But that magic is a little beyond me. So here is an approach with simple string manipulation in a data frame using tidyverse functions.
library(stringr)
library(dplyr)
library(tidyr)
ballers <- "Michael, Jordan; Bird, Larry;"
mj <- "Jordan, Michael Larry"
c(ballers, mj) %>%
#split the players
str_split(., ";", simplify = TRUE) %>%
# remove white space
str_trim() %>%
#transpose to get players in a column
t %>%
#split again into last name and first + middle (if any)
str_split(",", simplify = TRUE) %>%
# convert to a tibble
as_tibble() %>%
# remove more white space
mutate(V2=str_trim(V2)) %>%
# remove empty rows (these can be avoided by different manipulation upstream)
filter(!V1 == "") %>%
# name the columns
rename("Last"=V1, "First_two"=V2) %>%
# separate the given names into first and middle (if any)
separate(First_two, into = c("First", "Middle"), sep = " ") %>%
# abbreviate to first letter
mutate(First_i=abbreviate(First, 1)) %>%
# abbreviate, but take into account that middle name might be missing
mutate(Middle_i=ifelse(!is.na(Middle), paste0(abbreviate(Middle, 1), "."), "")) %>%
# combine the First and middle initals
mutate(Initials=paste(First_i, Middle_i, sep=".")) %>%
# make the desired Last, F.M. vector
mutate(Final=paste(Last, Initials, sep=", "))
# A tibble: 3 x 7
Last First Middle First_i Middle_i Initials Final
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Michael Jordan NA J "" J. Michael, J.
2 Jordan Michael Larry M L. M.L. Jordan, M.L.
3 Bird Larry NA L "" L. Bird, L.
Warning message:
Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [1, 3].
Much longer than a regex.
There will probably be a better way to do this, but I managed to get it to work using the stringr and tibble packages.
library(stringr)
library(tibble)
names <- 'Jordan, Michael; Bird, Larry; Obama, Barack; Bush, George Walker'
df <- as_tibble(str_split(unlist(str_split(names, '; ')), ', ', simplify = TRUE))
df[, 2] <- gsub('[a-z]+', '.', pull(df[, 2]))
This code generates the tibble df, which has the following contents:
# A tibble: 4 x 2
V1 V2
<chr> <chr>
1 Jordan M.
2 Bird L.
3 Obama B.
4 Bush G. W.
The names are first split into first and last names and stored in a data frame so that the gsub() call does not operate on the last names. Then, gsub() searches for any run of lowercase letters in the first names and replaces each with a single "."
Then, you can call str_c(str_c(pull(df[, 1]), ', ', pull(df[, 2])), collapse = '; ') (or str_c(pull(unite(df, full, c('V1', 'V2'), sep = ', ')), collapse = '; ') if you already have the tidyr package loaded) to return the string "Jordan, M.; Bird, L.; Obama, B.; Bush, G. W.".
...also, did you mean Michael Jordan, not Jordan Michael? lol
Here's one that uses gsub twice. The inner one is for names with no middle names and the outer is for names that have a middle name.
x = c("Michael, Jordan; Jordan, Michael Larry; Bird, Larry;")
gsub(", ([A-Z])[a-z]+ ([A-Z])[a-z]+;", ", \\1.\\2.;", gsub(", ([A-Z])[a-z]+;", ", \\1.;", x))
#[1] "Michael, J.; Jordan, M.L.; Bird, L.;"

Extracting Column data from .csv and turning every 10 consecutive rows into corresponding columns

Below is the code I am trying to implement. I want to take every 10 consecutive row values and turn them into corresponding columns.
This is how data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x != limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}
Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this though. You can simply remove the empty lines, split the vector in chunks of size 10 (as explained here) and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
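Since each block has exactly 10 values, a plain matrix reshape is another base R sketch of the same idea:
vec <- myData$V1[nchar(myData$V1) > 0]
as.data.frame(matrix(vec, nrow = 10))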
We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
Judging from the posted data, the 'V1' column can also serve as the column names for the values in 'V2'. If that is the case, we can 'transpose' 'res1' except the first column and set the column names of the output to the first column of 'res1' (setNames(...)):
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1!=''][, N:= paste0('Col', 1:.N) ,V1], V1~N,
value.var='V2')
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")

R: How to select rows from a csv that match rows of another csv?

I have one csv file which contains 3 columns (species name, longitude, and latitude) and a second csv file which contains a column with species names. What I want to do is extract from the first csv the species (plus the long and lat columns) that match the species in the second csv file.
How can I do this in R?
## Read csv files
file1 = read.csv(paste(path, "file1.csv", sep = ""), stringsAsFactors = FALSE, na.strings = "NA")
file2 = read.csv(paste(path, "file2.csv", sep = ""), stringsAsFactors = FALSE, na.strings = "NA")
#> file1
# Species Longitude Latitude
#1 Cat 0.4300052 0.04554442
#2 Dog 0.6568743 0.53359425
#3 Fish 0.8218709 0.20328321
#4 Deer 0.4601183 0.93191142
#5 Cow 0.9975495 0.02349226
#> file2
# Species
#1 Fish
#2 Dog
## Get subset in first file of species in second file
result = file1[file1$Species %in% file2$Species,]
You get:
#> result
# Species Longitude Latitude
#2 Dog 0.6568743 0.5335943
#3 Fish 0.8218709 0.2032832
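An equivalent sketch with dplyr, if a join-based approach is preferred:
library(dplyr)
result <- semi_join(file1, file2, by = "Species")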
