I am trying to spread two-column data into a format where there will be some NA values.
dataframe:
df <- data.frame(Names = c("TXT","LSL","TXT","TXT","TXT","USL","LSL"), Values = c("apple", -2, "orange", "banana", "pear", 10, -1), stringsAsFactors = FALSE)
If a row's Name is TXT, the following rows with LSL or USL belong to that TXT row.
For example:
In the first row the Name is TXT and the Value is apple; the next row is LSL, so its value is apple's LSL, and since there is no USL before the next TXT, apple's USL will be NA.
If a TXT row is followed immediately by another TXT, then the LSL and USL values for that row will both be NA.
I am trying to create a wide table with one row per TXT value and columns TXT, LSL and USL.
I tried using spread with row numbers as unique identifier but that's not what I want:
df %>% group_by(Names) %>% mutate(row = row_number()) %>% spread(key = Names,value = Values)
I guess I need to first create the full table with NAs and then spread, but I couldn't figure out how.
We can expand the dataset with complete after creating a grouping index based on the occurrence of 'TXT'.
library(dplyr)
library(tidyr)
df %>%
  group_by(grp = cumsum(Names == 'TXT')) %>%
  complete(Names = unique(.$Names)) %>%
  ungroup %>%
  spread(Names, Values) %>%
  select(TXT, LSL, USL)
# A tibble: 4 x 3
# TXT LSL USL
# <chr> <chr> <chr>
#1 apple -2 <NA>
#2 orange <NA> <NA>
#3 banana <NA> <NA>
#4 pear -1 10
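For reference, here is what the grouping index cumsum(Names == 'TXT') evaluates to on this data (a quick check, not part of the original answer); it starts a new group at every TXT row:
cumsum(df$Names == 'TXT')
#[1] 1 1 2 3 4 4 4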
In data.table, we can use dcast:
library(data.table)
dcast(setDT(df), cumsum(Names == 'TXT')~Names, value.var = 'Values')[, -1]
# LSL TXT USL
#1: -2 apple <NA>
#2: <NA> orange <NA>
#3: <NA> banana <NA>
#4: -1 pear 10
I have a dataframe in R in which I want to remove all rows of a particular group if two or more specific groups are present. In the example below, I want to remove all the rows related to 'bilberry' if both bilberry and blackberry are present. I got to the point where I can identify whether my data has two or more kinds of berries, but I am not sure about the next steps. I prefer a solution with dplyr.
library(stringr)
library(dplyr)
data(fruit)
my.df <- data.frame(
  "Name" = rep(fruit[1:7], each = 2),
  "Value" = 1:14
)
UniqueFruits <- unique(my.df$Name)
sum(grepl("berry", UniqueFruits))>1
Maybe you are trying for:
library(dplyr)
unique_berries <- grep('berry', my.df$Name, value = TRUE)
if(n_distinct(unique_berries) > 1) my.df <- my.df %>% filter(Name != 'bilberry')
my.df
# Name Value
#1 apple 1
#2 apple 2
#3 apricot 3
#4 apricot 4
#5 avocado 5
#6 avocado 6
#7 banana 7
#8 banana 8
#9 bell pepper 9
#10 bell pepper 10
#11 blackberry 13
#12 blackberry 14
So here is a way that is "pure" dplyr
library(stringr)
library(dplyr)
data(fruit)
my.df <- data.frame(
  "Name" = rep(fruit[1:7], each = 2),
  "Value" = 1:14
)
my.df %>%
  mutate(keepMe = case_when(
    n_distinct(Name[grepl("berry", Name)]) > 1 & Name == "bilberry" ~ FALSE,
    TRUE ~ TRUE)
  ) %>%
  filter(keepMe != FALSE)
It has no if statements as such. Not sure I really like it! But it is what you asked for - a tidyverse solution.
Like this?
my.df %>%
  mutate(berry = grepl("berry", Name)) %>%
  filter(berry == FALSE)
I have a tibble with a character column. The character in each row is a set of words like this: "type:mytype,variable:myvariable,variable:myothervariable:asubvariableofthisothervariable". Things like that. I want to either convert this into columns in my tibble (a column "type", a column "variable", and so on; but then I don't really know what to do with my 3rd-level words), or convert it to a list column x, so that x has a structure of sublists: x$type, x$variable, x$variable$myothervariable.
I'm not sure which is the best approach, and I also don't know how to implement the two approaches I suggest here. I should add that I have at most 3 levels, and more 1st-level words than just "type" and "variable".
Small Reproducible Example:
library(tibble)
df <- tibble(
  id = 1:3,
  keywords = c(
    "type:novel,genre:humor:black,year:2010",
    "type:dictionary,language:english,type:bilingual,otherlang:french",
    "type:essay,topic:philosophy:purposeoflife,year:2005"
  )
)
# expected would be in idea 1:
colnames(df)
# n, keywords, type, genre, year,
# language, otherlang, topic
# on idea 2:
colnames(df)
# n, keywords, keywords.as.list
We can use separate_rows from tidyr to split the 'keywords' column at the ',', then use cSplit from splitstackshape to split 'keywords' into multiple columns at the ':', reshape to 'long' format with pivot_longer, and finally reshape back to 'wide' with pivot_wider.
library(dplyr)
library(tidyr)
library(data.table)
library(splitstackshape)
df %>%
  separate_rows(keywords, sep = ",") %>%
  cSplit("keywords", ":") %>%
  pivot_longer(cols = keywords_2:keywords_3, values_drop_na = TRUE) %>%
  select(-name) %>%
  mutate(rn = rowid(id, keywords_1)) %>%
  pivot_wider(names_from = keywords_1, values_from = value) %>%
  select(-rn) %>%
  type.convert(as.is = TRUE)
-output
# A tibble: 6 x 7
# id type genre year language otherlang topic
# <int> <chr> <chr> <int> <chr> <chr> <chr>
#1 1 novel humor 2010 <NA> <NA> <NA>
#2 1 <NA> black NA <NA> <NA> <NA>
#3 2 dictionary <NA> NA english french <NA>
#4 2 bilingual <NA> NA <NA> <NA> <NA>
#5 3 essay <NA> 2005 <NA> <NA> philosophy
#6 3 <NA> <NA> NA <NA> <NA> purposeoflife
data
df <- structure(list(id = 1:3, keywords = c("type:novel,genre:humor:black,year:2010",
"type:dictionary,language:english,type:bilingual,otherlang:french",
"type:essay,topic:philosophy:purposeoflife,year:2005")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
I wish to create a new column in the dataframe below that is contingent on certain strings - in this case, "next section".
library(tidyverse)
set.seed(123)
df1 <- tibble(text = c(sample(fruit, sample(1:3)), "next", "section", sample(fruit, sample(1:3))),
              article = "df1")
df2 <- tibble(text = c(sample(fruit, sample(1:3)), "next", "section", sample(fruit, sample(1:3))),
              article = "df2")
df3 <- tibble(text = c(sample(fruit, sample(1:3)), "next", "section", sample(fruit, sample(1:3))),
              article = "df3")
final_df <- df1 %>%
  bind_rows(df2) %>%
  bind_rows(df3)
To be clear, this is the output I'd like to achieve:
final_df %>%
  mutate(label = c("first","first","first","first","first","second","second",
                   "first","first","first","first","second",
                   "first","first","first","first","second","second"))
# A tibble: 18 x 3
text article label
<chr> <chr> <chr>
1 cantaloupe df1 first
2 quince df1 first
3 kiwi fruit df1 first
4 next df1 first
5 section df1 first
6 cantaloupe df1 second
7 date df1 second
8 rambutan df2 first
9 passionfruit df2 first
10 next df2 first
11 section df2 first
12 rock melon df2 second
13 blood orange df3 first
14 guava df3 first
15 next df3 first
16 section df3 first
17 strawberry df3 second
18 cherimoya df3 second
I'm thinking I could start with group_by(article), followed by mutate(label = case_when()), but I'm stuck beyond this. Specifically, how do I populate the rows before and including the strings "next" and "section"?
We can use lag to get the text from the previous row and cumsum to increment the count whenever we observe 'section' in the current row and 'next' in the previous row, for each article.
library(dplyr)
final_df %>%
  group_by(article) %>%
  mutate(label = lag(cumsum(text == 'section' & lag(text) == 'next'),
                     default = 0) + 1)
# text article label
# <chr> <chr> <dbl>
# 1 cantaloupe df1 1
# 2 quince df1 1
# 3 kiwi fruit df1 1
# 4 next df1 1
# 5 section df1 1
# 6 cantaloupe df1 2
# 7 date df1 2
# 8 rambutan df2 1
# 9 passionfruit df2 1
#10 next df2 1
#11 section df2 1
#12 rock melon df2 2
#13 blood orange df3 1
#14 guava df3 1
#15 next df3 1
#16 section df3 1
#17 strawberry df3 2
#18 cherimoya df3 2
The same logic can be translated to data.table using shift.
library(data.table)
setDT(final_df)[, label := shift(cumsum(text == 'section' & shift(text) == 'next'),
                                 fill = 0) + 1, article]
You can replace 1, 2 with 'first', 'second' if you need the output in that form.
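For example, here is a minimal sketch applied to the dplyr version above (not part of the original answer, and assuming at most two groups per article, as in the data shown) that maps the numeric index to those labels directly:
library(dplyr)
final_df %>%
  group_by(article) %>%
  mutate(label = lag(cumsum(text == 'section' & lag(text) == 'next'),
                     default = 0) + 1,
         # index a character vector by the group number: 1 -> 'first', 2 -> 'second'
         label = c('first', 'second')[label]) %>%
  ungroup()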
I have a large dataframe and I would like to split a column into many columns based on two conditions: the caret character ^ and the letter following IMM-. Based on the data below, Column1 would be split into columns named IMM-A, IMM-B, IMM-C, and IMM-W. I tried the separate function, but it only works if you specify the column names, and because my data is not uniform I don't always know in advance what the column names should be.
SampleId Column1
1 IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333
The expected output would be;
SampleId IMM-A IMM-B IMM-C IMM-W
1 IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
2 IMM-A*010306+IMM-A*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
3 IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
Not clear about the expected output. Based on the description, we may need
library(tidyverse)
map(strsplit(df$Column1, "[*+^]"), ~
      stack(setNames(as.list(.x[c(FALSE, TRUE)]), .x[c(TRUE, FALSE)])) %>%
        group_by(ind) %>%
        mutate(rn = row_number()) %>%
        spread(ind, values)) %>%
  set_names(df$SampleId) %>%
  bind_rows(.id = 'SampleId') %>%
  select(-rn)
# A tibble: 6 x 5
# SampleId `IMM-A` `IMM-B` `IMM-C` `IMM-W`
# <chr> <chr> <chr> <chr> <chr>
#1 1 010306 6900 1212 <NA>
#2 1 0209 779999 3333 <NA>
#3 2 010306 <NA> 6900 1212
#4 2 0209 <NA> 779999 3333
#5 3 <NA> 010306 6900 1212
#6 3 <NA> 0209 779999 3333
Update
Based on the OP's expected output, we expand the data by splitting 'Column1' at the ^ delimiter, then separate 'Column1' into 'colA' and 'colB' at the * delimiter, remove 'colB', and spread to 'wide' format.
df %>%
  separate_rows(Column1, sep = "\\^") %>%
  separate(Column1, into = c("colA", "colB"), remove = FALSE, sep = "[*]") %>%
  select(-colB) %>%
  spread(colA, Column1, fill = "")
#SampleId IMM-A IMM-B IMM-C IMM-W
#1 1 IMM-A*010306+IMM-A*0209 IMM-B*6900+IMM-B*779999 IMM-C*1212+IMM-C*3333
#2 2 IMM-A*010306+IMM-A*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
#3 3 IMM-B*010306+IMM-B*0209 IMM-C*6900+IMM-C*779999 IMM-W*1212+IMM-W*3333
data
df <- structure(list(SampleId = 1:3, Column1 =
c("IMM-A*010306+IMM-A*0209^IMM-B*6900+IMM-B*779999^IMM-C*1212+IMM-C*3333",
"IMM-A*010306+IMM-A*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333",
"IMM-B*010306+IMM-B*0209^IMM-C*6900+IMM-C*779999^IMM-W*1212+IMM-W*3333"
)), class = "data.frame", row.names = c(NA, -3L))
This question may sound similar to others, but I hope it is different enough.
I want to take a specific list of values and count how often they appear in another list of values, where non-occurring values are returned as '0'.
I have a Data Frame (df1) with the following values:
Items <- c('Carrots','Plums','Pineapple','Turkey')
df1<-data.frame(Items)
>df1
Items
1 Carrots
2 Plums
3 Pineapple
4 Turkey
And a second Data Frame (df2) that contains a column called 'Thing':
> head(df2,n=10)
ID Date Thing
1 58150 2012-09-12 Potatoes
2 12357 2012-09-28 Turnips
3 50788 2012-10-04 Oranges
4 66038 2012-10-11 Potatoes
5 18119 2012-10-11 Oranges
6 48349 2012-10-14 Carrots
7 23328 2012-10-16 Peppers
8 66038 2012-10-26 Pineapple
9 32717 2012-10-28 Turnips
10 11345 2012-11-08 Oranges
I know the word 'Turkey' only appears in df1 NOT in df2. I want to return a frequency table or count of the items in df1 that appears in df2 and return '0' for the count of Turkey.
How can I summarize the values of one data frame column using the values from another? The closest I got was:
df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)
But this returns a list of items present in both df1 and df2, so 'Turkey' gets excluded. So close!
> df2 %>% count(Thing) %>% filter(Thing %in% df1$Items)
# A tibble: 3 x 2
Thing n
<fctr> <int>
1 Carrots 30
2 Pineapple 30
3 Plums 38
I want my output to look like this:
1 Carrots 30
2 Pineapple 30
3 Plums 38
4 Turkey 0
I am newish to R and completely new to dplyr.
I use this sort of thing all the time. I'm sure there's a more savvy way to code it, but it's what I got:
item <- vector()
count <- vector()
items <- list(unique(df1$Items))
for (i in 1:length(items)){
  item[i] <- items[i]
  count[i] <- sum(df2$Thing == item)
}
df3 <- data.frame(cbind(item, count))
Hope this helps!
Stephen's solution worked with a slight modification: adding [i] to item at the end of the count[i] line. See below:
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
  item[i] <- Items[i]
  count[i] <- sum(df2$Thing == item[i])
}
df3 <- data.frame(cbind(item, count))
> df3
item count
1 Carrots 30
2 Plums 38
3 Pineapple 30
4 Turkey 0
dplyr drops 0 count rows, and you have the added complication that the possible categories of Thing are different between your two datasets.
If you add the factor levels from df1 to df2, you can use complete from tidyr, which is a common way to add 0 count rows.
I'm adding the factor levels from df1 to df2 using a convenience function from package forcats called fct_expand.
library(dplyr)
library(tidyr)
library(forcats)
df2 %>%
  mutate(Thing = fct_expand(Thing, as.character(df1$Items))) %>%
  count(Thing) %>%
  complete(Thing, fill = list(n = 0)) %>%
  filter(Thing %in% df1$Items)
A different approach is to aggregate df2 first, to right join with df1 (to pick all rows of df1), and to replace NA by zero.
library(dplyr)
df2 %>%
  count(Thing) %>%
  right_join(unique(df1), by = c("Thing" = "Items")) %>%
  mutate(n = coalesce(n, 0L))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Plums 0
3 Pineapple 1
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same approach in data.table:
library(data.table)
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][is.na(N), N := 0L][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Note that in both implementations unique(df1) is used to avoid unintended duplicate rows after the join.
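To see why, here is a small sketch (df1_dup is a hypothetical copy of df1 with a duplicated 'Carrots' row, not part of the original data); without unique(), the join returns two 'Carrots' rows:
library(dplyr)
df1_dup <- rbind(df1, data.frame(Items = "Carrots"))  # hypothetical duplicated row

df2 %>%
  count(Thing) %>%
  right_join(df1_dup, by = c("Thing" = "Items")) %>%  # 'Carrots' now appears twice in the result
  mutate(n = coalesce(n, 0L))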
Edit 2019-06-22:
With development version 1.12.3, data.table has gained an fcoalesce() function, so the above statement can be written as
setDT(df2)[, .N, by = Thing][unique(setDT(df1)), on = .(Thing = Items)][, N := fcoalesce(N, 0L)][]
If df2 is large and df1 contains only a few Items it might be more efficient to join first and then to aggregate:
library(dplyr)
df2 %>%
  right_join(unique(df1), by = c("Thing" = "Items")) %>%
  group_by(Thing) %>%
  summarise(n = sum(!is.na(ID)))
# A tibble: 4 x 2
Thing n
<chr> <int>
1 Carrots 1
2 Pineapple 1
3 Plums 0
4 Turkey 0
Warning message:
Column `Thing`/`Items` joining factors with different levels, coercing to character vector
The same in data.table syntax:
library(data.table)
setDT(df2)[unique(setDT(df1)), on = .(Thing = Items)][, .(N = sum(!is.na(ID))), by = Thing][]
Thing N
1: Carrots 1
2: Plums 0
3: Pineapple 1
4: Turkey 0
Edit 2019-06-22: The above can be written more concisely by aggregating in a join:
setDT(df2)[setDT(df1), on = .(Thing = Items), .N, by = .EACHI]