I have two datasets that I want to merge:
df1 <- data.frame( title =
c("residence mozart",
"les hesperides auteuil mirabeau",
"chaillot",
"jouvenet",
"retraite dosne"))
df2 <- data.frame(title = c("terrasses mozart", "chaillot",
"villa jules janin", "retraites dosne"))
And I would like to have something like this:
1 residence mozart NA (or terrasses mozart)
2 les hesperides auteuil mirabeau NA
3 chaillot chaillot
4 jouvenet NA
5 retraite dosne retraites dosne
Here is what I did:
x = data.frame(title_df2 = matrix(ncol = 1, nrow = nrow(df1)))
for (i in nbr){
x[i, ] <- grep(df1$title[i], df2$title, value = T)
}
It does not work at all, even though grep(df1$title[3], df2$title, value = T) works and returns "chaillot"!
If I understand correctly, you are looking for a fuzzy join:
df1 <- data.frame( title =
c("residence mozart",
"les hesperides auteuil mirabeau",
"chaillot",
"jouvenet",
"retraite dosne"))
df2 <- data.frame(title = c("terrasses mozart", "chaillot",
"villa jules janin", "retraites dosne"))
library(dplyr)
library(fuzzyjoin)
stringdist_left_join(x = df1, y = df2, method = "jw", distance_col = "d") %>%
filter(d < 0.25) %>%
right_join(df1, by = c("title.x" = "title"))
#> Joining by: "title"
#> title.x title.y d
#> 1 residence mozart terrasses mozart 0.23863636
#> 2 chaillot chaillot 0.00000000
#> 3 retraite dosne retraites dosne 0.09206349
#> 4 les hesperides auteuil mirabeau <NA> NA
#> 5 jouvenet <NA> NA
Created on 2021-04-19 by the reprex package (v2.0.0)
The issue is that grep returns a vector of length 0 when there is no match.
grep('a', 'hello', value = TRUE)
#character(0)
If we want to keep the same for loop, we can adjust the code to return NA wherever there is no match:
nbr <- seq_len(nrow(df1))
for (i in nbr){
x[i, ] <- c(grep(df1$title[i], df2$title, value = TRUE), NA_character_)[1]
}
-output
x
# title_df2
#1 <NA>
#2 <NA>
#3 chaillot
#4 <NA>
#5 retraites dosne
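The same "first match or NA" idea can also be written without an explicit loop. This is only a sketch: first_match() is a hypothetical helper name rather than a base R function, and it uses fixed = TRUE on the assumption that the titles should be treated as literal text rather than as regular expressions.

```r
df1 <- data.frame(title = c("residence mozart",
                            "les hesperides auteuil mirabeau",
                            "chaillot", "jouvenet", "retraite dosne"))
df2 <- data.frame(title = c("terrasses mozart", "chaillot",
                            "villa jules janin", "retraites dosne"))

# first_match() is a hypothetical helper (not part of base R).
# It returns the first candidate that contains `pattern` as a substring,
# or NA when grep() finds nothing.
first_match <- function(pattern, candidates) {
  hits <- grep(pattern, candidates, value = TRUE, fixed = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# vapply() guarantees exactly one character value per row of df1
df1$title_df2 <- vapply(as.character(df1$title), first_match, character(1),
                        candidates = as.character(df2$title),
                        USE.NAMES = FALSE)
```

Because vapply() checks that every call returns a single string, a row with no match can never silently shorten the result the way a zero-length grep() result does.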
You could do:
a <-Vectorize(agrep, "pattern")(df1$title, df2$title, value=TRUE)
is.na(a)<- lengths(a) == 0
cbind(df1,df2_title=unlist(a, use.names = FALSE))
title df2_title
1 residence mozart <NA>
2 les hesperides auteuil mirabeau <NA>
3 chaillot chaillot
4 jouvenet <NA>
5 retraite dosne retraites dosne
To achieve your goal, you need to match on each individual word of the strings in df1$title.
As used in your example, grep returns a result only if the full df1 string is found inside a df2 title.
Instead, you need to grep for any of the words of a df1 title that also occur in df2. This can be done by joining the whole words of each string with an or (|) condition.
nbr <- 1:nrow(x)
for (i in nbr){
# Build a regex that checks whether any word of df1$title[i] also occurs in df2;
# the \\b on each side makes sure only complete words match.
pattern <- paste("\\b", unlist(strsplit(as.character(df1$title[i]), " ")), "\\b", collapse = "|", sep = "")
# grep df2 with the constructed regex
fitInDataFrame <- grep(pattern, as.character(df2$title), value = T)
x[i, ] <- ifelse(length(fitInDataFrame) == 0, NA, fitInDataFrame)
}
Here the output:
> x
title_df2
1 terrasses mozart
2 <NA>
3 chaillot
4 <NA>
5 retraites dosne
You can do a left_join(df1, df2, by = c('title' = 'title'), keep = TRUE), specifying keep = TRUE so it doesn't drop df2's join column.
Or, for this particular case, you could do this:
df1$newcol <- ifelse(df1$title %in% df2$title, df1$title, NA)
This adds a new column to df1. For each title in df1, it checks whether that title is in df2; if so, it writes the title in the new column, and if not, it writes NA. You could choose to put something else there instead, like:
df1$newcol <- ifelse(df1$title %in% df2$title, 'Title in DF2', 'Not in DF2')
As a consequence of a misformatted file (which is unfortunately the only file available), I have a few thousand columns, each with 9 rows of data in them. Unfortunately, the actual values are in a different order in each column.
I need to extract matched locus_tag= and gene= and/or product= values for each column whilst keeping the column order intact so that these do not get mismatched. Another complication is that these are formatted as "gene=ltas", so I had thought some kind of grepl would be useful.
However, I also need them ordered so that each row contains either the correct value (e.g. gene=) or NA:
Column A                 | Column B
gene = ltas              | NA
NA                       | product = hypothetical protein
locus_tag = RAS_R12345   | locus_tag = RAS_R14053
Here is an example of the data that I am working with:
header 1 | header 2
Parent=gene-SAS_RS00035 | Name=hutH
gbkey=CDS | gene_biotype=protein_coding
inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1 | locus_tag=SAS_RS00040
Dbxref=Genbank:WP_000449218.1 | gbkey=Gene
locus_tag=SAS_RS00035 | old_locus_tag=SAS0008
Name=WP_000449218.1 | gene=hutH
cds-WP_000449218.1 | gene-SAS_RS00040
protein_id=WP_000449218.1 |
product=NAD(P)H-hydrate dehydratase |
I'm not sure where to start with coding this as it is so disordered and poorly formatted, so any advice would be very welcome.
How about this:
dat <- structure(list(`header 1` = c("Parent=gene-SAS_RS00035", "gbkey=CDS",
"inference=COORDINATES: similar to AA sequence:RefSeq:WP_002461649.1",
"Dbxref=Genbank:WP_000449218.1", "locus_tag=SAS_RS00035", "Name=WP_000449218.1",
"cds-WP_000449218.1", "protein_id=WP_000449218.1", "product=NAD(P)H-hydrate dehydratase"
), `header 2` = c("Name=hutH", "gene_biotype=protein_coding",
"locus_tag=SAS_RS00040", "gbkey=Gene", "old_locus_tag=SAS0008",
"gene=hutH", "gene-SAS_RS00040", "", "")), row.names = c(NA,
9L), class = "data.frame")
prod <- apply(dat, 2, function(x){
prod_ind <- grep("^product", x)
if(length(prod_ind) == 1){
out <- gsub("(product=.*)", "\\1", x[prod_ind])
}else{
out <- NA
}
out
})
gene <- apply(dat, 2, function(x){
gene_ind <- grep("^gene=", x)
if(length(gene_ind) == 1){
out <- gsub("(gene=.*)", "\\1", x[gene_ind])
}else{
out <- NA
}
out
})
locus <- apply(dat, 2, function(x){
locus_tag_ind <- grep("^locus_tag=", x)
if(length(locus_tag_ind) == 1){
out <- gsub("(locus_tag=.*)", "\\1", x[locus_tag_ind])
}else{
out <- NA
}
out
})
dplyr::bind_rows(gene, prod, locus)
#> # A tibble: 3 × 2
#> `header 1` `header 2`
#> <chr> <chr>
#> 1 <NA> gene=hutH
#> 2 product=NAD(P)H-hydrate dehydratase <NA>
#> 3 locus_tag=SAS_RS00035 locus_tag=SAS_RS00040
Created on 2022-04-05 by the reprex package (v2.0.1)
The example above does this in three parts, each time searching for one of the things you're interested in. Then, it combines all the results together.
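The three near-identical blocks can also be folded into one function that is applied to each key. The following is a minimal sketch on a shortened version of the data: extract_key() and the keys vector are hypothetical names introduced here, and the rule "exactly one hit, otherwise NA" is an assumption about what is wanted.

```r
# Shortened stand-in for the real data (three rows instead of nine)
dat <- data.frame(
  `header 1` = c("locus_tag=SAS_RS00035", "gbkey=CDS",
                 "product=NAD(P)H-hydrate dehydratase"),
  `header 2` = c("Name=hutH", "locus_tag=SAS_RS00040", "gene=hutH"),
  check.names = FALSE, stringsAsFactors = FALSE
)

keys <- c("gene", "product", "locus_tag")

# extract_key() is a hypothetical helper: for every column of `data` it
# returns the single entry starting with "<key>=", or NA when there is
# not exactly one such entry.
extract_key <- function(key, data) {
  vapply(data, function(col) {
    hit <- grep(paste0("^", key, "="), col, value = TRUE)
    if (length(hit) == 1) hit else NA_character_
  }, character(1))
}

# One row per key, one column per original column
res <- t(sapply(keys, extract_key, data = dat))
```

The ^ anchor keeps "locus_tag" from also matching "old_locus_tag=", and the explicit "=" keeps "gene" from matching entries such as "gene_biotype=".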
I have a data frame where one of the columns holds several pieces of information separated by ";", like the following:
DF = data.frame(a = c(1,1,1,2,2), b = c('aaa','aaa','aba','abc','ccc'),
extra_info = c(
'animal=horse;color=orange;shape=circle',
'animal=monkey;shape=square;value=532',
'animal=horse;color=blue;shape=square;value=321',
'animal=dog;color=green;value=678',
'color=pink;shape=triangle'
))
I can't use read.table because I'm already using a different function to read the data (and also the content of each row in the column extra_info is different, and the columns would be messed up). What I wish to do is separate all this information to different columns, and assign proper names accordingly, such as:
a b animal color shape value
1 aaa horse orange circle NA
1 aaa monkey NA square 532
1 aba horse blue square 321
2 abc dog green NA 678
2 ccc NA pink triangle NA
So far, I've tried:
new_cols = DF %>% separate(extra_info, c(LETTERS[1:4]), sep = ";")
new_cols %>% separate(A, c("key","value"), sep = '=') %>%
separate(B, c("key","value"), sep = '=') %>%
separate(C, c("key","value"), sep = '=') %>%
separate(D, c("key","value"), sep = '=') %>%
pivot_wider(names_from = c("key"), values_from = c("value"))
But it doesn't work as expected.
Here's an approach where I change the syntax of your key-value pairs into valid JSON syntax and use jsonlite::fromJSON to parse it:
library(purrr)
library(dplyr)
library(stringr)
library(jsonlite)
DF %>%
mutate(
json = str_replace_all(extra_info, pattern = "\\b", replacement = '"'),
json = str_replace_all(json, pattern = fixed("="), replacement = ":"),
json = str_replace_all(json, pattern = fixed(";"), replacement = ","),
json = paste("{", json, "}"),
) %>%
pull(json) %>%
map(jsonlite::fromJSON) %>%
map(as.data.frame) %>%
bind_rows %>%
cbind(DF, .)
# a b extra_info animal color shape value
# 1 1 aaa animal=horse;color=orange;shape=circle horse orange circle <NA>
# 2 1 aaa animal=monkey;shape=square;value=532 monkey <NA> square 532
# 3 1 aba animal=horse;color=blue;shape=square;value=321 horse blue square 321
# 4 2 abc animal=dog;color=green;value=678 dog green <NA> 678
# 5 2 ccc color=pink;shape=triangle <NA> pink triangle <NA>
Here is a base R option using gsub + eval + str2expression
v <- DF$extra_info
p <- gsub(";", ",", gsub("(?<=\\=)(\\w+)", "'\\1'", v, perl = TRUE))
nms <- unique(unlist(regmatches(v, gregexpr("\\w+(?=\\=)", v, perl = TRUE))))
q <- unname(Map(function(x) setNames(eval(str2expression(x))[nms], nms), sprintf("c(%s)", p)))
cbind(DF[c("a","b")], type.convert(data.frame(do.call(rbind, q)), as.is = TRUE))
which gives
a b animal color shape value
1 1 aaa horse orange circle NA
2 1 aaa monkey <NA> square 532
3 1 aba horse blue square 321
4 2 abc dog green <NA> 678
5 2 ccc <NA> pink triangle NA
It's a bit neater with the stringr package, but if you just want base R you can use the following. In the pattern (?<=animal=)\\w+(?=\\b), the \\w+ is what is actually returned: one or more (+) word characters (\\w). For 'value' this is swapped with \\d+, since digits are required. Alternatively you could replace both with [[:alnum:]]+.
Then the (?<=animal=) structure is used to specify that it must be preceded by "animal=", and the (?=\\b) structure indicates that it has to be followed by a word boundary (\\b). You could get a bit more specific and replace \\b with (,|;|$) which stands for comma or semicolon or end of line (EDIT: the original question had commas in some places). There might be a nice way of writing a loop over the four words that creates the variable names and patterns dynamically.
pattern <- "(?<=animal=)\\w+(?=\\b)"
DF$animal <- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=color=)\\w+(?=\\b)"
DF$color<- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=shape=)\\w+(?=\\b)"
DF$shape<- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
pattern <- "(?<=value=)\\d+(?=\\b)"
DF$value <- sapply(regmatches(DF$extra_info, regexec(pattern, DF$extra_info , perl=T)), "[", 1)
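One way the loop over the four words might look is sketched below. It assumes the key names are known up front (the keys vector is introduced here for illustration), uses \\w+ for every key including value, and drops the trailing (?=\\b) because \\w+ already stops at the first non-word character.

```r
DF <- data.frame(a = c(1, 1, 1, 2, 2),
                 b = c("aaa", "aaa", "aba", "abc", "ccc"),
                 extra_info = c(
                   "animal=horse;color=orange;shape=circle",
                   "animal=monkey;shape=square;value=532",
                   "animal=horse;color=blue;shape=square;value=321",
                   "animal=dog;color=green;value=678",
                   "color=pink;shape=triangle"),
                 stringsAsFactors = FALSE)

keys <- c("animal", "color", "shape", "value")  # assumed to be known in advance

for (key in keys) {
  # Same lookbehind pattern as above, built from the key name
  pattern <- paste0("(?<=", key, "=)\\w+")
  m <- regexpr(pattern, DF$extra_info, perl = TRUE)
  found <- as.integer(m) > 0          # regexpr() gives -1 where nothing matched
  DF[[key]] <- NA_character_          # start with NA in every row
  DF[[key]][found] <- regmatches(DF$extra_info, m)  # fill only matched rows
}
```

All new columns are character; type.convert(DF, as.is = TRUE) could be applied afterwards if value should be numeric.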
If you're happy to use tidyverse/stringr, here is the code.
DF <- DF %>%
mutate(animal = str_extract(extra_info, "(?<=animal=)\\w+(?=\\b)" )) %>%
mutate(color = str_extract(extra_info, "(?<=color=)\\w+(?=\\b)" )) %>%
mutate(shape = str_extract(extra_info, "(?<=shape=)\\w+(?=\\b)" )) %>%
mutate(value = str_extract(extra_info, "(?<=value=)\\d+(?=\\b)" ))
For more info on string manipulation and regular expressions, see the stringr cheat sheet here: https://github.com/rstudio/cheatsheets/blob/master/strings.pdf
library(stringr)
col_names <- unlist(str_extract_all(DF$extra_info[3], "(?<=^|;)\\w+"))
DF %>%
mutate(animal = str_extract(extra_info, paste0("(?<=", col_names[1], "=)\\w+")),
color = str_extract(extra_info, paste0("(?<=", col_names[2], "=)\\w+")),
shape = str_extract(extra_info, paste0("(?<=", col_names[3], "=)\\w+")),
value = str_extract(extra_info, paste0("(?<=", col_names[4], "=)\\w+")))
a b extra_info animal color shape value
1 1 aaa animal=horse;color=orange;shape=circle horse orange circle <NA>
2 1 aaa animal=monkey;shape=square;value=532 monkey <NA> square 532
3 1 aba animal=horse;color=blue;shape=square;value=321 horse blue square 321
4 2 abc animal=dog;color=green;value=678 dog green <NA> 678
5 2 ccc color=pink;shape=triangle <NA> pink triangle <NA>
I have a dataframe with a column that's really a list of integer vectors (not just single integers).
# make example dataframe
starting_dataframe <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
starting_dataframe$player_indices <-
list(as.integer(1),
as.integer(c(2, 5)),
as.integer(3),
as.integer(4),
as.integer(c(6, 7)))
I want to replace the integers with character strings according to a second concordance dataframe.
# make concordance dataframe
example_concord <-
data.frame(last_names = c("Rapinoe",
"Wambach",
"Naeher",
"Morgan",
"Dahlkemper",
"Mitts",
"O'Reilly"),
player_ids = as.integer(c(1,2,3,4,5,6,7)))
The desired result would look like this:
# make dataframe of desired result
desired_result <-
data.frame(first_names = c("Megan",
"Abby",
"Alyssa",
"Alex",
"Heather"))
desired_result$player_indices <-
list(c("Rapinoe"),
c("Wambach", "Dahlkemper"),
c("Naeher"),
c("Morgan"),
c("Mitts", "O'Reilly"))
I can't for the life of me figure out how to do it and failed to find a similar case here on Stack Overflow. How do I do it? I wouldn't mind a dplyr-specific solution in particular.
I suggest creating a "lookup dictionary" of sorts, and lapply across each of the ids:
example_concord_idx <- setNames(as.character(example_concord$last_names),
example_concord$player_ids)
example_concord_idx
# 1 2 3 4 5 6
# "Rapinoe" "Wambach" "Naeher" "Morgan" "Dahlkemper" "Mitts"
# 7
# "O'Reilly"
starting_dataframe$result <-
lapply(starting_dataframe$player_indices,
function(a) example_concord_idx[a])
starting_dataframe
# first_names player_indices result
# 1 Megan 1 Rapinoe
# 2 Abby 2, 5 Wambach, Dahlkemper
# 3 Alyssa 3 Naeher
# 4 Alex 4 Morgan
# 5 Heather 6, 7 Mitts, O'Reilly
(Code golf?)
Map(`[`, list(example_concord_idx), starting_dataframe$player_indices)
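One caveat: indexing a named vector with integers selects by position, not by name, so the lookup above relies on player_ids being 1, 2, 3, ... in row order. If the ids were arbitrary, a match()-based lookup would be a safer variant. The sketch below uses made-up ids (10, 20, 50) purely for illustration; they are not from the question.

```r
# Hypothetical concordance whose ids are NOT 1..n
concord <- data.frame(last_names = c("Rapinoe", "Wambach", "Dahlkemper"),
                      player_ids = c(10L, 20L, 50L),
                      stringsAsFactors = FALSE)

# Hypothetical list-column of ids to translate
indices <- list(10L, c(20L, 50L))

# match() finds the row of each id, so the lookup is by value, not by position
result <- lapply(indices, function(ids) {
  concord$last_names[match(ids, concord$player_ids)]
})
```

Ids that are missing from the concordance come back as NA rather than raising an error.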
For tidyverse enthusiasts, I adapted the second half of the accepted answer by r2evans to use map() and %>%:
require(tidyverse)
starting_dataframe <-
starting_dataframe %>%
mutate(
result = map(.x = player_indices, .f = function(a) example_concord_idx[a])
)
Definitely won't win code golf, though!
Another way is to unlist the list-column, and relist it after modifying its contents:
df1$player_indices <- relist(df2$last_names[unlist(df1$player_indices)], df1$player_indices)
df1
#> first_names player_indices
#> 1 Megan Rapinoe
#> 2 Abby Wambach, Dahlkemper
#> 3 Alyssa Naeher
#> 4 Alex Morgan
#> 5 Heather Mitts, O'Reilly
Data
## initial data.frame w/ list-column
df1 <- data.frame(first_names = c("Megan", "Abby", "Alyssa", "Alex", "Heather"), stringsAsFactors = FALSE)
df1$player_indices <- list(1, c(2,5), 3, 4, c(6,7))
## lookup data.frame
df2 <- data.frame(last_names = c("Rapinoe", "Wambach", "Naeher", "Morgan", "Dahlkemper",
"Mitts", "O'Reilly"), stringsAsFactors = FALSE)
NB: I set stringsAsFactors = FALSE to create character columns in the data.frames, but it works just as well with factor columns instead.
I have a data set in RStudio (Aud) that looks like the following. Both ID and Function are of type character.
ID Function
F04 FZ000TTY WB002FR088DR011
F05 FZ000AGH WZ004ABD
F06 FZ0005ABD
My goal is to extract only "FZ", "TTY", "WB", "FR", "WZ" and "ABD" from all the rows in the data set and place each in its own new column, so that I have something like the following:
ID Function SUBFUN1 SUBFUN2 SUBFUN3 SUBFUN4 SUBFUN5
F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
I want to separate the functions because each represents a certain behavior; that way I can plot, per ID, the behaviors or functions which occur most often over time.
I tried the following:
Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))
Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))
I get the error message below in my attempts for subfun1 & subfun2:
Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) :
replacement has 0 rows, data has 343456
Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) :
replacement has 0 rows, data has 343456
I also tried substring() but substring seems to require a start and an end for the character range that needs to be captured in the new column. This is not ideal as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different parts of the function string
Any help would be greatly appreciated with this
Using data.table:
library(data.table)
Aud <- data.frame(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
stringsAsFactors = FALSE
)
setDT(Aud)
cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
ID Function V1 V2 V3 V4 V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY WB FR DR
2: F05 FZ000AGH WZ004ABD FZ AGH WZ ABD <NA>
3: F06 FZ0005ABD FZ ABD <NA> <NA> <NA>
Staying in base R one could do something like the following:
our_split <- strsplit(Aud$Function, "[0-9]+| ")
cbind(
Aud,
do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
One can use tidyr::separate to split the Function column into multiple columns, using a regex as the separator.
library(tidyverse)
df %>%
separate(Function, into = paste("V",1:5, sep=""),
sep = "([^[:alpha:]]+)", fill="right", extra = "drop")
# ID V1 V2 V3 V4 V5
# 1 F04 FZ TTY WB FR DR
# 2 F05 FZ AGH WZ ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>
([^[:alpha:]]+) : separate on anything other than alphabetic characters
Data:
df <- read.table(text=
"ID Function
F04 'FZ000TTY WB002FR088DR011'
F05 'FZ000AGH WZ004ABD'
F06 FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)
A tidyverse way that makes use of stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads into the wide format you have as your desired output. If you were extracting any sets of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If putting the regex becomes cumbersome, say if you have a vector of many of these IDs, you could paste it together and collapse by |.
library(tidyverse)
Aud <- tibble(
ID = c("F04", "F05", "F06"),
Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)
search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"
Aud %>%
mutate(code = str_extract_all(Function, search_terms)) %>%
select(-Function) %>%
unnest(code) %>%
group_by(ID) %>%
mutate(subfun = row_number()) %>%
spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID subfun1 subfun2 subfun3 subfun4
#> <chr> <chr> <chr> <chr> <chr>
#> 1 F04 FZ TTY WB FR
#> 2 F05 FZ WZ ABD <NA>
#> 3 F06 FZ ABD <NA> <NA>
Created on 2018-07-11 by the reprex package (v0.2.0).
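If the codes live in a vector, the search term can be assembled with paste() as described above. A minimal sketch follows; the codes vector is an assumption about how the IDs might be stored, and base R's regmatches()/gregexpr() stand in for str_extract_all() so the snippet runs without extra packages.

```r
codes <- c("FZ", "TTY", "WB", "FR", "WZ", "ABD")  # assumed vector of known IDs

# Collapse the codes into a single alternation: "(FZ|TTY|WB|FR|WZ|ABD)"
search_terms <- paste0("(", paste(codes, collapse = "|"), ")")

funs <- c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")

# One character vector of matched codes per input string
found <- regmatches(funs, gregexpr(search_terms, funs))
```

The resulting search_terms string can be passed unchanged to str_extract_all() in the pipeline above. If any code contained regex metacharacters, it would need escaping first.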
Below is the code I am trying to implement. I want to extract each run of 10 consecutive row values and turn them into corresponding columns.
This is how data looks like: https://drive.google.com/file/d/0B7huoyuu0wrfeUs4d2p0eGpZSFU/view?usp=sharing
I have been trying, but temp1 and temp2 come out empty. Please help.
library(Hmisc) #for increment function
myData <- read.csv("Clothing_&_Accessories.csv",header=FALSE,sep=",",fill=TRUE) # reading the csv file
extract<-myData$V2 # extracting the desired column
x<-1
y<-1
temp1 <- NULL #initialisation
temp2 <- NULL #initialisation
data.sorted <- NULL #initialisation
limit<-nrow(myData) # Calculating no of rows
while (x! = limit) {
count <- 1
for (count in 11) {
if (count > 10) {
inc(x) <- 1
break # gets out of for loop
}
else {
temp1[y]<-data_mat[x] # extracting by every row element
}
inc(x) <- 1 # increment x
inc(y) <- 1 # increment y
}
temp2<-temp1
data.sorted<-rbind(data.sorted,temp2) # turn rows into columns
}
Your code is too complex. You can do this using only one for loop, without external packages, like this:
myData <- as.data.frame(matrix(c(rep("a", 10), "", rep("b", 10)), ncol=1), stringsAsFactors = FALSE)
newData <- data.frame(row.names=1:10)
for (i in 1:((nrow(myData)+1)/11)) {
start <- 11*i - 10
newData[[paste0("col", i)]] <- myData$V1[start:(start+9)]
}
You don't actually need all this though. You can simply remove the empty lines, split the vector in chunks of size 10 (as explained here) and then turn the list into a data frame.
vec <- myData$V1[nchar(myData$V1)>0]
as.data.frame(split(vec, ceiling(seq_along(vec)/10)))
# X1 X2
# 1 a b
# 2 a b
# 3 a b
# 4 a b
# 5 a b
# 6 a b
# 7 a b
# 8 a b
# 9 a b
# 10 a b
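To see why the split works, it may help to look at the grouping index on its own. The sketch below reuses the toy "a"/"b" vector and the chunk size of 10 from above; chunk_id, chunks and wide are names introduced here for illustration.

```r
vec <- c(rep("a", 10), rep("b", 10))

# Positions 1..10 map to chunk 1, positions 11..20 to chunk 2, and so on
chunk_id <- ceiling(seq_along(vec) / 10)

# split() groups the values by chunk; each list element becomes one column
chunks <- split(vec, chunk_id)
wide <- as.data.frame(chunks, stringsAsFactors = FALSE)
```

This relies on every chunk having the same length; if the last chunk were shorter, as.data.frame() would stop with an error about differing numbers of rows, so the vector would need padding with NA first.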
We could create a numeric index based on the '' values in the 'V2' column, split the dataset, use Reduce/merge to get the columns in the wide format.
indx <- cumsum(myData$V2=='')+1
res <- Reduce(function(...) merge(..., by= 'V1'), split(myData, indx))
res1 <- res[order(factor(res$V1, levels=myData[1:10, 1])),]
colnames(res1)[-1] <- paste0('Col', 1:3)
head(res1,3)
# V1 Col1 Col2 Col3
#2 ProductId B000179R3I B0000C3XXN B0000C3XX9
#4 product_title Amazon.com Amazon.com Amazon.com
#3 product_price unknown unknown unknown
From the p1.png, the 'V1' column can also be the column names for the values in 'V2'. If that is the case, we can 'transpose' the 'res1' except the first column and change the column names of the output with the first column of 'res1' (setNames(...))
res2 <- setNames(as.data.frame(t(res1[-1]), stringsAsFactors=FALSE),
res1[,1])
row.names(res2) <- NULL
res2[] <- lapply(res2, type.convert, as.is = TRUE)
head(res2)
# ProductId product_title product_price userid
#1 B000179R3I Amazon.com unknown A3Q0VJTU04EZ56
#2 B0000C3XXN Amazon.com unknown A34JM8F992M9N1
#3 B0000C3XX9 Amazon.com unknown A34JM8F993MN91
# profileName helpfulness reviewscore review_time
#1 Jeanmarie Kabala "JP Kabala" 7/7 4 1182816000
#2 M. Shapiro 6/6 5 1205107200
#3 J. Cruze 8/8 5 120571929
# review_summary
#1 Periwinkle Dartmouth Blazer
#2 great classic jacket
#3 Good jacket
# review_text
#1 I own the Austin Reed dartmouth blazer in every color
#2 This is the second time I bought this jacket
#3 This is the third time I bought this jacket
I guess this is just a reshaping issue. In that case, we can use dcast from data.table to convert from long to wide format
library(data.table)
DT <- dcast(setDT(myData)[V1!=''][, N:= paste0('Col', 1:.N) ,V1], V1~N,
value.var='V2')
data
myData <- structure(list(V1 = c("ProductId", "product_title",
"product_price",
"userid", "profileName", "helpfulness", "reviewscore", "review_time",
"review_summary", "review_text", "", "ProductId", "product_title",
"product_price", "userid", "profileName", "helpfulness",
"reviewscore",
"review_time", "review_summary", "review_text", "", "ProductId",
"product_title", "product_price", "userid", "profileName",
"helpfulness",
"reviewscore", "review_time", "review_summary", "review_text"
), V2 = c("B000179R3I", "Amazon.com", "unknown", "A3Q0VJTU04EZ56",
"Jeanmarie Kabala \"JP Kabala\"", "7/7", "4", "1182816000",
"Periwinkle Dartmouth Blazer",
"I own the Austin Reed dartmouth blazer in every color", "",
"B0000C3XXN", "Amazon.com", "unknown", "A34JM8F992M9N1",
"M. Shapiro",
"6/6", "5", "1205107200", "great classic jacket",
"This is the second time I bought this jacket",
"", "B0000C3XX9", "Amazon.com", "unknown", "A34JM8F993MN91",
"J. Cruze", "8/8", "5", "120571929", "Good jacket",
"This is the third time I bought this jacket"
)), .Names = c("V1", "V2"), row.names = c(NA, 32L),
class = "data.frame")