I have a list containing a number of other lists, each of which contains a varying number of character vectors with varying numbers of elements. I want to create a dataframe where each inner list is represented as a row and each character vector within that list is a column. Where a character vector has more than one element, the elements should be concatenated and separated by a "+" sign, so that they can be stored as one string. The data looks like this:
fruits <- list(
list(c("orange"), c("pear")),
list(c("pear", "orange")),
list(c("lemon", "apple"),
c("pear"),
c("grape"),
c("apple"))
)
The expected output is like this:
fruits_df <- data.frame(col1 = c("orange", "pear + orange", "lemon + apple"),
col2 = c("pear", NA, "pear"),
col3 = c(NA, NA, "grape"),
col4 = c(NA, NA, "apple"))
There is no limit on the number of character vectors that can be contained in a list, so the solution needs to dynamically create columns, leading to a df where the number of columns is equal to the length of the list containing the largest number of character vectors.
For every list in fruits you can create a one row dataframe and bind the data.
dplyr::bind_rows(lapply(fruits, function(x) as.data.frame(t(sapply(x,
function(y) paste0(y, collapse = "+"))))))
# V1 V2 V3 V4
#1 orange pear <NA> <NA>
#2 pear+orange <NA> <NA> <NA>
#3 lemon+apple pear grape apple
This is a bit messy but here is one way
cols <- lapply(fruits, function(x) sapply(x, paste, collapse=" + "))
ncols <- max(lengths(cols))
dd <- do.call("rbind.data.frame", lapply(cols, function(x) {length(x) <- ncols; x}))
names(dd) <- paste0("col", 1:ncol(dd))
dd
# col1 col2 col3 col4
# 1 orange pear <NA> <NA>
# 2 pear + orange <NA> <NA> <NA>
# 3 lemon + apple pear grape apple
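or another strategy, building the data frame column by column (the bounds check keeps missing cells as NA instead of empty strings):
ncols <- max(lengths(fruits))
dd <- data.frame(lapply(seq.int(ncols), function(x) sapply(fruits, function(y) if (x <= length(y)) paste(y[[x]], collapse=" + ") else NA_character_)))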
names(dd) <- paste0("col", 1:ncols)
dd
But really you need to either build each column or row from your list and then combine them together.
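To make the "build each row, then combine" idea concrete, here is a minimal base R sketch. The names `rows`, `width` and `padded` are just illustrative helpers, not part of any answer above, and the input is copied from the question.

```r
# Example input, copied from the question
fruits <- list(
  list(c("orange"), c("pear")),
  list(c("pear", "orange")),
  list(c("lemon", "apple"), c("pear"), c("grape"), c("apple"))
)

# Step 1: collapse every character vector into a single string
rows <- lapply(fruits, function(x) vapply(x, paste, character(1), collapse = " + "))

# Step 2: pad each row with NA up to the length of the widest row
width <- max(lengths(rows))
padded <- lapply(rows, function(r) c(r, rep(NA_character_, width - length(r))))

# Step 3: stack the rows into a matrix and convert it to a data frame
fruits_df <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(fruits_df) <- paste0("col", seq_len(width))
fruits_df
```

Padding explicitly with `NA_character_` keeps every column a character column, so the number of columns simply follows the longest inner list.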
Another approach that melts the list to a data.frame using rrapply::rrapply and then casts it to the required format using data.table::dcast:
library(rrapply)
library(data.table)
## melt to long data.frame
long <- rrapply(fruits, f = paste, how = "melt", collapse = " + ")
## cast to wide data.table
setDT(long)
dcast(long[, .(L1, L2, value = unlist(value))], L1 ~ L2)[, !"L1"]
#> ..1 ..2 ..3 ..4
#> 1: orange pear <NA> <NA>
#> 2: pear + orange <NA> <NA> <NA>
#> 3: lemon + apple pear grape apple
How do you replace values in a column when the value fulfils certain conditions in R?
Here I have two data frames.
Fruits <- c("Apple", "Grape Fruits", "Lemon", "Peach", "Banana", "Orange", "Strawberry", "Apple")
df1 <- data.frame(Fruits)
df1
Fruits
Apple
Grape Fruits
Lemon
Peach
Banana
Orange
Strawberry
Apple
Name <- c("Apple", "Orange", "Lemon", "Grape", "Peach","Pinapple")
Rename <- c("Manzana", "Naranja", "Limon", "Uva", "Melocoton", "Anana")
df2 <- data.frame(Name, Rename)
df2
Name Rename
Apple Manzana
Orange Naranja
Lemon Limon
Grape Uva
Peach Melocoton
Pinapple Anana
I want to replace the values in df1$Fruits to corresponding values in df2$Rename, only when each value in df1$Fruits matches that in df2$Name.
So the designated data frame would be like this.
Fruits
Manzana
Grape Fruits
Limon
Melocoton
Banana
Naranja
Strawberry
Manzana
Does anybody know how to do this? Thank you very much for your help.
Using mapvalues from plyr:
library(plyr)
new.fruits <- mapvalues(Fruits, from = Name, to = Rename)
df <- data.frame(Fruits=new.fruits)
You can use merge and then replace each NA in Rename with the original fruit name.
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = TRUE)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
If you need to keep the order:
df1$id <- 1:nrow(df1)
df3 <- merge(df1,df2, by.x = "Fruits", by.y = "Name", all.x = TRUE)
df3$Rename[is.na(df3$Rename)] <- df3$Fruits[is.na(df3$Rename)]
df3 <- df3[order(df3$id),]
data.frame(Fruits = df3[,"Rename"])
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
Shorter match solution from @Wen below
df1$new=df2$Rename[match(df1$Fruits,df2$Name)]
df1$new[is.na(df1$new)] <- df1$Fruits[is.na(df1$new)]
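For reference, the same match idea can be written as a self-contained sketch. It assumes the columns are plain character vectors (hence `stringsAsFactors = FALSE`), and `idx` and `result` are illustrative names only.

```r
df1 <- data.frame(
  Fruits = c("Apple", "Grape Fruits", "Lemon", "Peach",
             "Banana", "Orange", "Strawberry", "Apple"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name   = c("Apple", "Orange", "Lemon", "Grape", "Peach", "Pinapple"),
  Rename = c("Manzana", "Naranja", "Limon", "Uva", "Melocoton", "Anana"),
  stringsAsFactors = FALSE
)

# Position of each fruit in df2$Name, or NA when there is no exact match
idx <- match(df1$Fruits, df2$Name)

# Take the translation where a match exists, otherwise keep the original value
result <- data.frame(
  Fruits = ifelse(is.na(idx), df1$Fruits, df2$Rename[idx]),
  stringsAsFactors = FALSE
)
result
```

Because match only accepts exact matches, "Grape Fruits" is not translated to "Uva", which is what the question asks for.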
Using apply with an exact comparison against df2$Name can also produce the desired output.
df1$Fruits <- apply(df1,1,function(x){
matched = (df2$Name == x)
if(any(matched)){
as.character(df2$Rename[matched])
} else {
x
}})
df1
# Fruits
# 1 Manzana
# 2 Grape Fruits
# 3 Limon
# 4 Melocoton
# 5 Banana
# 6 Naranja
# 7 Strawberry
# 8 Manzana
I have a dataframe that looks like this (I simplify):
df <- data.frame(rbind(c(1, "dog", "cat", "rabbit"), c(2, "apple", "peach", "cucumber")))
colnames(df) <- c("ID", "V1", "V2", "V3")
## ID V1 V2 V3
## 1 1 dog cat rabbit
## 2 2 apple peach cucumber
I would like to create a column containing all possible combinations of variables V1:V3 two by two (order doesn't matter), but keeping a link with the original ID. So something like this.
## ID bigrams
## 1 1 dog cat
## 2 1 cat rabbit
## 3 1 dog rabbit
## 4 2 apple peach
## 5 2 apple cucumber
## 6 2 peach cucumber
My idea: use combn(), mutate() and separate_rows().
library(tidyr)
library(dplyr)
df %>%
mutate(bigrams=paste(unlist(t(combn(df[,2:4],2))), collapse="-")) %>%
separate_rows(bigrams, sep="-") %>%
select(ID,bigrams)
The result is not what I expected... I guess that concatenating a matrix (the result of combn()) is not as easy as that.
I have two questions about this: 1) How do I debug this code? 2) Is this a good way to do this kind of thing? I'm new to R but I have an OpenRefine background, so concatenating and splitting multivalued cells makes a lot of sense to me. But is this also the right method in R?
Thanks in advance for any help.
We can do this with data.table. Convert the 'data.frame' to 'data.table' (setDT(df)), melt it to 'long' format, grouped by 'ID', get the combn of 'value' and paste it together
library(data.table)
dM <- melt(setDT(df), id.var = "ID")[, combn(value, 2, FUN = paste, collapse=' '), ID]
setnames(dM, 2, 'bigrams')[]
# ID bigrams
#1: 1 dog cat
#2: 1 dog rabbit
#3: 1 cat rabbit
#4: 2 apple peach
#5: 2 apple cucumber
#6: 2 peach cucumber
I recommend @akrun's "melt first" approach, but just for fun, here are more ways to do it:
library(tidyverse)
df %>%
mutate_all(as.character) %>%
transmute(ID = ID, bigrams = pmap(
list(V1, V2, V3),
function(a, b, c) combn(c(a, b, c), 2, paste, collapse = " ")
))
# ID bigrams
# 1 1 dog cat, dog rabbit, cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber
(mutate_all(as.character) just because you gave us factors, and factor to character conversion can be surprising).
df %>%
mutate_all(as.character) %>%
nest(-ID) %>%
mutate(bigrams = map(data, combn, 2, paste, collapse = " ")) %>%
unnest(data) %>%
as.data.frame()
# ID bigrams V1 V2 V3
# 1 1 dog cat, dog rabbit, cat rabbit dog cat rabbit
# 2 2 apple peach, apple cucumber, peach cucumber apple peach cucumber
(as.data.frame() just for a prettier printing)
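On the debugging question: in the original attempt, `df[,2:4]` inside mutate() refers to the whole data frame, not the current row, and paste(..., collapse = "-") returns a single string that is recycled to every row. A base R sketch that works one row at a time and produces the long ID/bigrams layout from the question could look like this (`pairs_per_row` and `bigrams_df` are illustrative names):

```r
df <- data.frame(
  ID = c(1, 2),
  V1 = c("dog", "apple"),
  V2 = c("cat", "peach"),
  V3 = c("rabbit", "cucumber"),
  stringsAsFactors = FALSE
)

# For each row, paste every unordered pair of V1:V3 and keep the row's ID
pairs_per_row <- lapply(seq_len(nrow(df)), function(i) {
  values <- unlist(df[i, c("V1", "V2", "V3")], use.names = FALSE)
  data.frame(
    ID = df$ID[i],
    bigrams = combn(values, 2, FUN = paste, collapse = " "),
    stringsAsFactors = FALSE
  )
})

# Stack the per-row results into one long data frame
bigrams_df <- do.call(rbind, pairs_per_row)
bigrams_df
```

This is slower than the data.table version on large data, but each step can be inspected on its own, which helps when debugging.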
I need to match the values in col1 with those in col2 and col3 and, where they match, add up their frequencies. The result should show, for each unique value, the total count across freq1, freq2 and freq3.
col1 freq1 col2 freq2 col3 freq3
apple 3 grapes 4 apple 1
grapes 5 apple 2 orange 2
orange 4 banana 5 grapes 2
guava 3 orange 6 banana 7
I need my output like this
apple 6
grapes 11
orange 12
guava 3
banana 12
I'm a beginner. How do I code this in R?
We can use melt from data.table with patterns specified in the measure argument to convert the 'wide' format to 'long' format, then grouped by 'col', we get the sum of 'freq' column
library(data.table)
melt(setDT(df1), measure = patterns("^col", "^freq"),
value.name = c("col", "freq"))[,.(freq = sum(freq)) , by = col]
# col freq
#1: apple 6
#2: grapes 11
#3: orange 12
#4: guava 3
#5: banana 12
If it is alternating 'col', 'freq', columns, we can just unlist the subset of 'col' columns and 'freq' columns separately to create a data.frame (using c(TRUE, FALSE) to recycle for subsetting columns), and then use aggregate from base R to get the sum grouped by 'col'.
aggregate(freq~col, data.frame(col = unlist(df1[c(TRUE, FALSE)]),
freq = unlist(df1[c(FALSE, TRUE)])), sum)
# col freq
#1 apple 6
#2 banana 12
#3 grapes 11
#4 guava 3
#5 orange 12
I think the easiest approach for a newbie to understand would be creating 3 separate dataframes (I assumed here that your dataframe is named df):
df1 <- data.frame(df$col1, df$freq1)
colnames(df1) <- c("fruit", "freq")
df2 <- data.frame(df$col2, df$freq2)
colnames(df2) <- c("fruit", "freq")
df3 <- data.frame(df$col3, df$freq3)
colnames(df3) <- c("fruit", "freq")
Then bind all dataframes by rows:
df <- rbind(df1, df2, df3)
And at the end group by fruit and sum frequencies using dplyr library.
library(dplyr)
df <- df %>%
group_by(fruit)%>%
summarise(sum(freq))
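The same stack-then-sum steps can be sketched in base R only. The data frame below is typed in from the table in the question, and `long` and `totals` are illustrative names.

```r
df <- data.frame(
  col1 = c("apple", "grapes", "orange", "guava"),  freq1 = c(3, 5, 4, 3),
  col2 = c("grapes", "apple", "banana", "orange"), freq2 = c(4, 2, 5, 6),
  col3 = c("apple", "orange", "grapes", "banana"), freq3 = c(1, 2, 2, 7),
  stringsAsFactors = FALSE
)

# Stack the three (name, frequency) pairs into one long data frame
long <- data.frame(
  fruit = c(df$col1, df$col2, df$col3),
  freq  = c(df$freq1, df$freq2, df$freq3),
  stringsAsFactors = FALSE
)

# Sum the frequencies within each fruit
totals <- aggregate(freq ~ fruit, data = long, FUN = sum)
totals
```

Note that aggregate sorts the groups alphabetically, so the row order differs from the order in which the fruits first appear.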
I have this data.table with strings:
dt = tbl_dt(data.table(x=c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")))
x
1 book|ball|apple
2 flower|orange|cup
3 banana|bandana|pen
..and I also have a reference string which I would like to match with the one in the data.table, extracting the word if it's in there, like so..
fruits = "apple|banana|orange"
str_match(fruits, "flower|orange|cup")
>"orange"
How do I do this for the entire data.table?
require(dplyr)
require(stringr)
dt %>%
mutate (fruit = str_match(fruits, x))
Error in rep(NA_character_, n) : invalid 'times' argument
In addition: Warning message:
In regexec(c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen" :
argument 'pattern' has length > 1 and only the first element will be used
What I would like:
x fruit
1 book|ball|apple apple
2 flower|orange|cup orange
3 banana|bandana|pen banana
Or, using a plain data.table (preferable to tbl_dt here, since it avoids the warnings):
dt[, fruits := mapply(str_match, fruits, x)]
dt
## x fruits
## 1: book|ball|apple apple
## 2: flower|orange|cup orange
## 3: banana|bandana|pen banana
Or you could do something similar to @akrun's answer, such as
dt[, fruits := lapply(x, str_match, fruits)]
dt$fruit <- unlist(lapply(dt$x, str_match, fruits))
dt
#Source: local data table [3 x 2]
#
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
A solution using base R and without str_match (ddf here is assumed to be a plain data.frame holding the same x column as dt):
fruit=NULL
reflist = unlist(strsplit(fruits, '\\|'))
for(xx in ddf$x){
ss = unlist(strsplit(xx,'\\|'))
for(s in ss) if(s %in% reflist) fruit[length(fruit)+1]=s
}
ddf$fruit = fruit
ddf
# x fruit
#1 book|ball|apple apple
#2 flower|orange|cup orange
#3 banana|bandana|pen banana
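A vectorised base R alternative is to treat `fruits` as a regular expression and pull the first match out of each string with regexpr() and regmatches(). This is only a sketch: it uses a plain character vector `x` in place of the data.table column, and it returns the first fruit found when a string contains several.

```r
x <- c("book|ball|apple", "flower|orange|cup", "banana|bandana|pen")
fruits <- "apple|banana|orange"

# regexpr() gives the position of the first match in each string, or -1
m <- regexpr(fruits, x)

# Start from NA so that strings without any fruit stay NA
fruit <- rep(NA_character_, length(x))
fruit[m > 0] <- regmatches(x, m)

data.frame(x = x, fruit = fruit, stringsAsFactors = FALSE)
```

Here the pattern is `fruits` and the strings being searched are the rows of `x`, so the whole column is handled in one call with no loop.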
I have two data.frames that I want to merge, replacing the values of certain columns of df1 with the values from df2. In this working example there are only 3 columns, but in the original data there are about 20 columns that should remain in the final data.frame.
NO <- c(2, 4, 7, 18, 25, 36, 48)
WORD <- c("apple", "peach", "plum", "orange", "grape", "berry", "pear")
CLASS <- c("p", "x", "x", "n", "x", "p", "n")
ColA <- c("hot", "warm", "sunny", "rainy", "windy", "cloudy", "snow")
df1 <- data.frame(NO, WORD, CLASS, ColA)
df1
# NO WORD CLASS ColA
# 1 2 apple p hot
# 2 4 peach x warm
# 3 7 plum x sunny
# 4 18 orange n rainy
# 5 25 grape x windy
# 6 36 berry p cloudy
# 7 48 pear n snow
NO <- c(4, 18, 36)
WORD <- c("patricia", "oliver", "bob")
CLASS <- c("p", "n", "x")
df2 <- data.frame(NO, WORD, CLASS)
df2
# NO WORD CLASS
# 1 4 patricia p
# 2 18 oliver n
# 3 36 bob x
I want to merge the two data.frames and replace the values of WORD and CLASS in df1 with the values of WORD and CLASS from df2 wherever NO matches.
My data.frame should look like this:
# NO WORD CLASS ColA
# 1 2 apple p hot
# 2 4 patricia p warm
# 3 7 plum x sunny
# 4 18 oliver n rainy
# 5 25 grape x windy
# 6 36 bob x cloudy
# 7 48 pear n snow
Try this
auxind<-match(df2$NO, df1$NO) # Stores the repeated rows in df1
dfuni<-(rbind(df1[,1:3],df2)[-auxind,]) # Stacks both data.frames and drops the df1 rows that df2 replaces (first three columns only)
dfuni<-dfuni[order(dfuni$NO),] # Sorts the new data.frame
df1[,1:3]<-dfuni
This approach could work as well, though it is more a matter of playing around than the best answer to the question:
library(qdap); library(qdapTools)
df1[, 2] <- as.character(df1[, 2])
trms <- strsplit(df1[, 1] %lc% colpaste2df(df2, 2:3, keep.orig = FALSE), "\\.")
df1[sapply(trms, function(x) !all(is.na(x))), 2:3] <-
do.call(rbind, trms[sapply(trms, function(x) !all(is.na(x)))])
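A more direct base R sketch is to look up the matching keys with match() and overwrite only the shared columns in place, which leaves ColA (and any other columns) untouched. It assumes NO is unique in df1 and that the columns are character, not factor; `idx`, `found` and `cols` are illustrative names.

```r
df1 <- data.frame(
  NO    = c(2, 4, 7, 18, 25, 36, 48),
  WORD  = c("apple", "peach", "plum", "orange", "grape", "berry", "pear"),
  CLASS = c("p", "x", "x", "n", "x", "p", "n"),
  ColA  = c("hot", "warm", "sunny", "rainy", "windy", "cloudy", "snow"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  NO    = c(4, 18, 36),
  WORD  = c("patricia", "oliver", "bob"),
  CLASS = c("p", "n", "x"),
  stringsAsFactors = FALSE
)

# Row of df1 that corresponds to each row of df2 (NA if the key is absent)
idx <- match(df2$NO, df1$NO)
found <- !is.na(idx)  # guard against keys in df2 that are missing from df1

# Overwrite only the shared columns for the matched rows
cols <- c("WORD", "CLASS")
df1[idx[found], cols] <- df2[found, cols]
df1
```

With about 20 columns in the real data, only `cols` needs to list the columns to replace; everything else is carried over unchanged.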