This question already has answers here:
Split character column into several binary (0/1) columns
(7 answers)
Closed 2 years ago.
I have a column in a dataframe that contains multiple values, like this:
fruits
1 apple,banana
2 banana,peaches
3 peaches
4 mango
Is there a way to create a dictionary of unique values for fruits, which would create a new column fruits with the values:
fruits = apple,banana,peaches,mango
UPDATE: I need the values as columns and not just a list of unique values, so that I can create a final dataframe like the following:
  fruits         fruit_apple fruit_banana fruit_mango fruit_peaches
1 apple,banana             1            1           0             0
2 banana,peaches           0            1           0             1
3 peaches                  0            0           0             1
4 mango                    0            0           1             0
We can do this easily with cSplit_e from splitstackshape
library(splitstackshape)
cSplit_e(df1, "fruits", ",", type = "character", fill = 0)
# fruits fruits_apple fruits_banana fruits_mango fruits_peaches
#1 apple,banana 1 1 0 0
#2 banana,peaches 0 1 0 1
#3 peaches 0 0 0 1
#4 mango 0 0 1 0
data
df1 <- structure(list(fruits = c("apple,banana", "banana,peaches", "peaches",
"mango")), .Names = "fruits", class = "data.frame", row.names = c("1",
"2", "3", "4"))
Do you want the new column to be that concatenated list repeated? Sorry, it's not particularly clear. Assuming that's the case, and that your data.frame consists of strings rather than factors:
df <- read.delim(
text="fruits
apple,banana
banana,peaches
peaches
mango",
sep="\n",
header=TRUE,
stringsAsFactors=FALSE)
df
#> fruits
#> 1 apple,banana
#> 2 banana,peaches
#> 3 peaches
#> 4 mango
df$uniquefruits <- paste0(unique(unlist(strsplit(df$fruits, split=","))), collapse=",")
df
#> fruits uniquefruits
#> 1 apple,banana apple,banana,peaches,mango
#> 2 banana,peaches apple,banana,peaches,mango
#> 3 peaches apple,banana,peaches,mango
#> 4 mango apple,banana,peaches,mango
Or do you mean taking only the values from your first fruits column that are not duplicated elsewhere?
Update: Based on comments (and starting again from the original df, before the uniquefruits column was added), I think this is what you're after:
uniquefruits <- unique(unlist(strsplit(df$fruits, split=",")))
uniquefruits
#> [1] "apple" "banana" "peaches" "mango"
df2 <- cbind(df,
sapply(uniquefruits,
function(y) apply(df, 1,
function(x) as.integer(y %in% unlist(strsplit(x, split=","))))))
df2
#> fruits apple banana peaches mango
#> 1 apple,banana 1 1 0 0
#> 2 banana,peaches 0 1 1 0
#> 3 peaches 0 0 1 0
#> 4 mango 0 0 0 1
In theory, you could do this with dplyr, but I can't figure out how to automate the column processing for the rowwise mutate (anyone know how?). One possible way to automate it is sketched after the output below.
library(dplyr)
df %>% rowwise() %>% mutate(apple = as.integer("apple" %in% unlist(strsplit(fruits, ","))),
banana = as.integer("banana" %in% unlist(strsplit(fruits, ","))),
peaches = as.integer("peaches" %in% unlist(strsplit(fruits, ","))),
mango = as.integer("mango" %in% unlist(strsplit(fruits, ","))))
#> Source: local data frame [4 x 5]
#> Groups: <by row>
#>
#> # A tibble: 4 x 5
#> fruits apple banana peaches mango
#> <chr> <int> <int> <int> <int>
#> 1 apple,banana 1 1 0 0
#> 2 banana,peaches 0 1 1 0
#> 3 peaches 0 0 1 0
#> 4 mango 0 0 0 1
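As a hedged sketch (an addition, not part of the original answer), one way to automate the per-fruit columns is to split the strings, unnest them, and pivot to wide format; this assumes dplyr and tidyr (>= 1.0) are available:
library(dplyr)
library(tidyr)

df %>%
  mutate(row = row_number(),                # keep track of the original row
         fruit = strsplit(fruits, ",")) %>% # list-column of individual fruits
  unnest(fruit) %>%                         # one row per (row, fruit) pair
  mutate(value = 1L) %>%
  pivot_wider(names_from = fruit, values_from = value, values_fill = 0L) %>%
  select(-row)
#> # A tibble: 4 x 5
#>   fruits         apple banana peaches mango
#>   <chr>          <int>  <int>   <int> <int>
#> 1 apple,banana       1      1       0     0
#> 2 banana,peaches     0      1       1     0
#> 3 peaches            0      0       1     0
#> 4 mango              0      0       0     1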
With base R:
fruits <- sort(unique(unlist(strsplit(as.character(df$fruits), split=','))))
cols <- as.data.frame(matrix(rep(0, nrow(df)*length(fruits)), ncol=length(fruits)))
names(cols) <- fruits
df <- cbind.data.frame(df, cols)
df <- as.data.frame(t(apply(df, 1, function(x){
  fruits <- strsplit(x['fruits'], split=',')
  x[unlist(fruits)] <- 1
  x
})))
df
fruits apple banana mango peaches
1 apple,banana 1 1 0 0
2 banana,peaches 0 1 0 1
3 peaches 0 0 0 1
4 mango 0 0 1 0
You can use the following steps:
1) Split the column on commas using the strsplit function.
2) Unlist the resulting list of vectors into a single character vector.
3) Take the unique values of that vector.
Here is the solution:
# DataFrame of fruits
f <- c("apple,banana","banana,peaches","peaches","mango")
fruits <- as.data.frame(f)
# fruits dataframe
fruits
#               f
#1   apple,banana
#2 banana,peaches
#3        peaches
#4          mango
list.fruits <- unlist(strsplit(f,split=","))
unique.fruits <- unique(list.fruits)
# Result
unique.fruits
[1] "apple" "banana" "peaches" "mango"
Related
I have one column of names of children who have teamed up in class together over multiple projects / activities, like so:
Note: This is ONE column.
Names
Tom,Jack,Meave
Tom,Arial
Arial,Tim,Tom
Neena,Meave
Meave
Tim,Meave
I want to use R so that I can see how many times two children have been paired over the projects they have done:
So:
Pair Counts
Meave,Jack 1
Tom,Jack 1
Meave,none 1
Tom,Arial 2
.
.
.
How do I go about doing this? A tidy-friendly solution would be appreciated.
(Ultimately, I would like to use this data to make a circle-network graph, but that is for another question.)
In Base R:
a <- tcrossprod(table(stack(setNames(strsplit(df$Names,","), rownames(df)))))
a
values
values Arial Jack Meave Neena Tim Tom
Arial 2 0 0 0 1 2
Jack 0 1 1 0 0 1
Meave 0 1 4 1 1 1
Neena 0 0 1 1 0 0
Tim 1 0 1 0 2 1
Tom 2 1 1 0 1 3
You could make the above look like the data you want, e.g.:
subset(as.data.frame.table(a),
as.character(values) > as.character(values.1) & Freq>0)
values values.1 Freq
5 Tim Arial 1
6 Tom Arial 2
9 Meave Jack 1
12 Tom Jack 1
16 Neena Meave 1
17 Tim Meave 1
18 Tom Meave 1
30 Tom Tim 1
In tidyverse:
library(tidyverse)

df %>%
rownames_to_column()%>%
separate_rows(Names)%>%
table()%>%
crossprod()%>%
as.data.frame.table()%>%
filter(Freq>0 & as.character(Names) > as.character(Names.1))
Names Names.1 Freq
1 Tim Arial 1
2 Tom Arial 2
3 Meave Jack 1
4 Tom Jack 1
5 Neena Meave 1
6 Tim Meave 1
7 Tom Meave 1
8 Tom Tim 1
Data:
df <- structure(list(Names = c("Tom,Jack,Meave", "Tom,Arial", "Arial,Tim,Tom",
"Neena,Meave", "Meave", "Tim,Meave")), class = "data.frame", row.names = c(NA,
-6L))
Here is one tidyverse approach...
If df is
Names
<chr>
Tom,Jack,Meave
Tom,Arial
Arial,Tim,Tom
Neena,Meave
Meave
Tim,Meave
Then
library(tidyverse)

df2 <- df %>%
mutate(ref = row_number(),
Names = ifelse(str_count(Names, ",") == 0, #add nobody if only one
paste0(Names, ",nobody"),
Names),
Names = str_split(Names, ",")) %>%
unnest(Names) %>%
nest(data = ref) %>% #creates a list of refs for each name
mutate(Names2 = list(Names)) %>% #add a column of second names for the pairs
unnest(Names2) %>%
filter(Names != Names2) %>% #remove self-pairs
left_join({.} %>% select(Names2 = Names, data2 = data) %>%
distinct()) %>% #create data for second column of names
mutate(paired = map2_dbl(data, data2, ~length(intersect(.x$ref, .y$ref)))) %>%
select(-data, -data2) %>%
filter(paired > 0, #remove non-occurring combinations
Names > Names2) #remove duplicates
Which gives...
> df2
# A tibble: 9 × 3
Names Names2 paired
<chr> <chr> <dbl>
1 Tom Jack 1
2 Tom Meave 1
3 Tom Arial 2
4 Tom Tim 1
5 Meave Jack 1
6 Tim Meave 1
7 Tim Arial 1
8 Neena Meave 1
9 nobody Meave 1
The code changes the dataframe from a list of names for each value of ref to a list of refs for each name. It then creates a column of other names (i.e. the second of a pair) and left-joins the refs to these other names. Note that the {.} in the left_join refers to the piped dataframe at that point, creating a left join with itself.
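As a minimal, self-contained illustration of that {.} idiom (made-up data, not from the original answer), this joins a small pipeline with itself:
library(dplyr)

d <- tibble(x = 1:3)
d %>%
  mutate(y = x * 2) %>%
  left_join({.} %>% transmute(x, z = y + 1), by = "x")
#> # A tibble: 3 x 3
#>       x     y     z
#>   <int> <dbl> <dbl>
#> 1     1     2     3
#> 2     2     4     5
#> 3     3     6     7
Here the dot inside the left_join call is the output of the mutate step, so the join is against a transformed copy of the current data.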
dat1 <- data.frame(id1 = c(1, 1, 2),
pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
description = c("apple is sweet", "bananass are not"),
description2 = c("melon", "bananas, sweet yes"))
> dat1
id1 pattern
1 1 apple
2 1 applejack
3 2 bananas, sweet
> dat2
id2 description description2
1 1174 apple is sweet melon
2 1231 bananass are not bananas, sweet yes
I have two data.frames, dat1 and dat2. I would like to take each pattern in dat1 and search for them in dat2's description and description2 using the regular expression, \\b[pattern]\\b.
Here is my attempt and the desired final output:
description_match <- description2_match <- vector()
for(i in 1:nrow(dat1)){
for(j in 1:nrow(dat2)){
search_pattern <- paste0("\\b", dat1$pattern[i], "\\b")
description_match <- c(description_match, ifelse(grepl(search_pattern, dat2[j, "description"]), 1, 0))
description2_match <- c(description2_match, ifelse(grepl(search_pattern, dat2[j, "description2"]), 1, 0))
}
}
final_output <- data.frame(id1 = rep(dat1$id1, each = nrow(dat2)),
pattern = rep(dat1$pattern, each = nrow(dat2)),
id2 = rep(dat2$id2, length = nrow(dat1) * nrow(dat2)),
description_match = description_match,
description2_match = description2_match)
> final_output
id1 pattern id2 description_match description2_match
1 1 apple 1174 1 0
2 1 apple 1231 0 0
3 1 applejack 1174 0 0
4 1 applejack 1231 0 0
5 2 bananas, sweet 1174 0 0
6 2 bananas, sweet 1231 0 1
This approach is slow and not efficient if dat1 and dat2 have many rows. What's a quicker way to do this so that I can avoid a for loop?
Using outer() with a Vectorize()d grepl():
# one 0/1 match matrix per text column; transpose before flattening so the rows
# line up with dat1 repeated pattern-by-pattern in the cbind() below
r <- sapply(dat2[-1], \(x) as.vector(t(+outer(dat1$pattern, x, Vectorize(grepl)))))
cbind(dat1[rep(seq_len(nrow(dat1)), each=nrow(dat2)), ], id2=dat2$id2, r)
# id1 pattern id2 description description2
# 1 1 apple 1174 1 0
# 1.1 1 apple 1231 0 0
# 2 1 applejack 1174 0 0
# 2.1 1 applejack 1231 0 0
# 3 2 bananas, sweet 1174 0 0
# 3.1 2 bananas, sweet 1231 0 1
A tidyverse solution with:
- tidyr::crossing producing all combinations of dat1 and dat2
- stringr::str_detect pairwise detecting the presence of a pattern in a string
library(tidyverse)
crossing(dat1, dat2) %>%
mutate(across(contains('description'), ~ +str_detect(.x, sprintf('\\b%s\\b', pattern))))
# A tibble: 6 × 5
id1 pattern id2 description description2
<dbl> <chr> <dbl> <int> <int>
1 1 apple 1174 1 0
2 1 apple 1231 0 0
3 1 applejack 1174 0 0
4 1 applejack 1231 0 0
5 2 bananas, sweet 1174 0 0
6 2 bananas, sweet 1231 0 1
Another option, though it may be slower than jay.sf's option above.
Your data frames:
dat1 <- data.frame(id1 = c(1, 1, 2),
pattern = c("apple", "applejack", "bananas, sweet"))
dat2 <- data.frame(id2 = c(1174, 1231),
description = c("apple is sweet", "bananass are not"),
description2 = c("melon", "bananas, sweet yes"))
Add a column with the pattern you'd like to use for matching:
dat1$pattern_grep = paste0("\\b", dat1$pattern, "\\b")
Perform a Cartesian join (i.e. join every row of dat2 to each row of dat1):
cj = merge(dat1, dat2, all = T, by = c())
Perform your grepl now:
cj$description_match <- mapply(grepl, cj$pattern_grep, cj$description)*1
cj$description2_match <- mapply(grepl, cj$pattern_grep, cj$description2)*1
Think of the mapply as performing the grepl on each row of your data frame; multiplying by 1 converts the logical result to 1/0.
Keep relevant columns:
cj = cj[, c("id1", "pattern", "id2", "description_match", "description2_match")]
id1 pattern id2 description_match description2_match
1 1 apple 1174 1 0
2 1 applejack 1174 0 0
3 2 bananas, sweet 1174 0 0
4 1 apple 1231 0 0
5 1 applejack 1231 0 0
6 2 bananas, sweet 1231 0 1
I am trying to expand on the answer to the question Take Sum of a Variable if Combination of Values in Two Other Columns are Unique, but because I am new to Stack Overflow I can't comment directly on that post, so here is my problem:
I have a dataset like the following, but with about 100 columns of binary data like the "ani1" and "bni2" columns shown here.
Locations <- c("A","A","A","A","B","B","C","C","D", "D","D")
seasons <- c("2", "2", "3", "4","2","3","1","2","2","4","4")
ani1 <- c(1,1,1,1,0,1,1,1,0,1,0)
bni2 <- c(0,0,1,1,1,1,0,1,0,1,1)
df <- data.frame(Locations, seasons, ani1, bni2)
Locations seasons ani1 bni2
1 A 2 1 0
2 A 2 1 0
3 A 3 1 1
4 A 4 1 1
5 B 2 0 1
6 B 3 1 1
7 C 1 1 0
8 C 2 1 1
9 D 2 0 0
10 D 4 1 1
11 D 4 0 1
I am attempting to sum all the columns by location and season, so that I get a total for each column from the third onward for each unique combination of location and season.
The problem is that not all the columns have a 1 value for every combination of location and season, and they all have different names.
I would like something like this:
Locations seasons ani1 bni2
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
Here is my attempt using a for loop:
df2 <- 0
for(i in 3:length(df)){
testdf <- data.frame(t(apply(df[1:2], 1, sort)), df[i])
df2 <- aggregate(i~., testdf, FUN=sum)
}
I get the following error:
Error in model.frame.default(formula = i ~ ., data = testdf) :
variable lengths differ (found for 'X1')
Thank you!
You can use dplyr::summarise and across after group_by; inside summarise, across(everything()) covers every non-grouping column, whatever it is named.
library(dplyr)
df %>%
group_by(Locations, seasons) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
ungroup()
Another option is to reshape the data to long format using functions from the tidyr package. This avoids the issue of having to select columns 3 onwards.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = -c(Locations, seasons)) %>%
group_by(Locations, seasons, name) %>%
summarise(Sum = sum(value, na.rm = TRUE)) %>%
ungroup() %>%
pivot_wider(names_from = "name", values_from = "Sum")
Result:
# A tibble: 9 x 4
Locations seasons  ani1  bni2
<chr>     <chr>   <dbl> <dbl>
1 A 2 2 0
2 A 3 1 1
3 A 4 1 1
4 B 2 0 1
5 B 3 1 1
6 C 1 1 0
7 C 2 1 1
8 D 2 0 0
9 D 4 1 2
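For completeness, a minimal base R sketch under the same assumption (sum every column after Locations and seasons) avoids the loop entirely; this is an addition, not one of the answers above:
# the formula interface of aggregate sums all remaining columns per group
aggregate(. ~ Locations + seasons, data = df, FUN = sum)
The result contains the same nine Locations/seasons combinations, although the row order may differ from the tibble output above.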
This question already has answers here:
R - How to one hot encoding a single column while keep other columns still?
(5 answers)
Closed 2 years ago.
The original table is like this:
id  food
1   fish
2   egg
2   apple
For each id, there should be a 1 or 0 value for its food, so the table should look like this:
id  food   fish  egg  apple
1   fish   1     0    0
2   egg    0     1    0
2   apple  0     0    1
A proposal using the dcast() function from the reshape2 package:
df1 <- read.table(header = TRUE, text = "
id food
1 fish
2 egg
2 apple
")
###
df2 <- reshape2::dcast(data = df1,
formula = id+food ~ food,
fun.aggregate = length,
value.var = "food")
df2
#> id food apple egg fish
#> 1 1 fish 0 0 1
#> 2 2 apple 1 0 0
#> 3 2 egg 0 1 0
###
df3 <- reshape2::dcast(data = df1,
formula = id+factor(food, levels=unique(food)) ~
factor(food, levels=unique(food)),
fun.aggregate = length,
value.var = "food")
names(df3) <- c("id", "food", "fish", "egg", "apple")
df3
#> id food fish egg apple
#> 1 1 fish 1 0 0
#> 2 2 egg 0 1 0
#> 3 2 apple 0 0 1
# Created on 2021-01-29 by the reprex package (v0.3.0.9001)
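A minimal base R sketch of the same one-hot encoding (added for comparison, not part of the original answer), using outer() to build the 0/1 indicator matrix from df1 as defined above:
one_hot <- +outer(df1$food, unique(df1$food), "==")  # logical match matrix coerced to 0/1
colnames(one_hot) <- unique(df1$food)
cbind(df1, one_hot)
#>   id  food fish egg apple
#> 1  1  fish    1   0     0
#> 2  2   egg    0   1     0
#> 3  2 apple    0   0     1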
I'm trying to do some complex calculations, and part of the code requires that I parse a comma-separated entry and count the number of values that are more than 0.
Example input data:
a <- c(0,0,3,0)
b <- c(4,4,0,1)
c <- c("3,4,3", "2,1", 0, "5,8")
x <- data.frame(a, b, c)
x
a b c
1 0 4 3,4,3
2 0 4 2,1
3 3 0 0
4 0 1 5,8
The column that I need to parse, c, is a factor and all other columns are numeric. The number of comma-separated values will vary; in this example it varies from 0 to 3.
The desired output would look like this:
x$c_occur <- c(3, 2, 0, 2)
x
a b c c_occur
1 0 4 3,4,3 3
2 0 4 2,1 2
3 3 0 0 0
4 0 1 5,8 2
Where c_occur lists the number of occurrences > 0 in the c column.
I was thinking something like this would work... but I can't figure it out.
library(dplyr)
x_desired <- x %>%
mutate(c_occur = count(strsplit(c, ","), > 0))
We can make use of str_count
library(stringr)
library(dplyr)
x %>%
mutate(c_occur = str_count(c, '[1-9]\\d*'))
# a b c c_occur
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2
After splitting 'c', we can get the count by looping over the list output of strsplit and summing the logical vector for each element:
library(purrr)
x %>%
mutate(c_occur = map_int(strsplit(as.character(c), ","),
~ sum(as.integer(.x) > 0)))
# a b c c_occur
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2
Or we can separate the rows with separate_rows and do a group_by summarise
library(tidyr)
x %>%
mutate(rn = row_number()) %>%
separate_rows(c, convert = TRUE) %>%
group_by(rn) %>%
summarise(c_occur = sum(c >0)) %>%
select(-rn) %>%
bind_cols(x, .)
# A tibble: 4 x 4
# a b c c_occur
# <dbl> <dbl> <fct> <int>
#1 0 4 3,4,3 3
#2 0 4 2,1 2
#3 3 0 0 0
#4 0 1 5,8 2