Simple question:
I've provided two examples below with code and output. Why does one work and the other doesn't? I'm having trouble understanding the Key/Value inputs (when they need to be explicitly defined, and what it means to pass them as strings).
library(tidyverse)
dat <- data.frame(one = c("x", "x", "x"), two = c("x", "", "x"),
three = c("", "", ""), type = c("chocolate", "vanilla", "strawberry"))
dat %>%
na_if("") %>%
gather("Key", "Val", -type,na.rm=TRUE) %>%
rowid_to_column %>%
spread(Key, Val,fill = "") %>%
select(-1) # works well
dat %>%
na_if("") %>%
gather("Key", "Val", -type,na.rm=TRUE)
Error: Strings must match column names. Unknown columns: Val
Extra credit: if someone could explain the effect of rowid_to_column() and spread(), that'd be helpful.
Perhaps I'm missing something, but I can't reproduce your error.
dat %>%
na_if("") %>% # Replace "" with NA
gather("Key", "Val", -type, na.rm = TRUE) %>% # wide -> long
rowid_to_column() %>% # Sequentially number rows
spread(Key, Val, fill = "") %>% # long -> wide
select(-1) # remove the row-number column
#        type one two
#1  chocolate   x
#2    vanilla   x
#3 strawberry   x
#4  chocolate       x
#5 strawberry       x
dat %>%
na_if("") %>% # Replace "" with NA
gather("Key", "Val", -type, na.rm = TRUE) # wide -> long
#        type Key Val
#1  chocolate one   x
#2    vanilla one   x
#3 strawberry one   x
#4  chocolate two   x
#6 strawberry two   x
Explanation:
na_if("") replaces "" entries with NA.
gather("Key", "Val", -type, na.rm = TRUE) turns a wide table into a long "key-value" table, by storing entries in all columns except type in two columns Key (i.e. the column name) and Val (i.e. the entry). na.rm = TRUE removes rows with NA values.
rowid_to_column sequentially numbers the rows.
spread(Key, Val, fill = "") turns a long "key-value" table into a wide table, with as many columns as there are unique keys in Key. Entries are taken from column Val, if an entry is missing it's filled with "".
select(-1) removes the first column.
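To make the effect of rowid_to_column() visible, here is a minimal sketch using pivot_longer() and pivot_wider(), which tidyr now recommends over gather() and spread(). It is not the original code: the na_if() step is wrapped in across() as an assumption, because as far as I know recent dplyr versions no longer accept a whole data frame there, and pivot_longer() returns the long rows in a different order than gather().

```r
library(dplyr)
library(tidyr)
library(tibble)

dat <- data.frame(one = c("x", "x", "x"), two = c("x", "", "x"),
                  three = c("", "", ""),
                  type = c("chocolate", "vanilla", "strawberry"),
                  stringsAsFactors = FALSE)

# wide -> long, turning "" into NA first and dropping the NA rows
long <- dat %>%
  mutate(across(everything(), ~ na_if(.x, ""))) %>%
  pivot_longer(-type, names_to = "Key", values_to = "Val",
               values_drop_na = TRUE)

# Without a row id, rows that share a `type` collapse into one row per type
wide_by_type <- long %>%
  pivot_wider(names_from = Key, values_from = Val, values_fill = "")

# With a row id, every long row keeps its own line in the wide table
wide_by_row <- long %>%
  rowid_to_column() %>%
  pivot_wider(names_from = Key, values_from = Val, values_fill = "") %>%
  select(-rowid)

nrow(wide_by_type)  # 3
nrow(wide_by_row)   # 5
```

So the row id only serves as an extra identifier that keeps the reshaping step from merging rows with the same type.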
Context
I have created a small sample dataframe to explain my problem. The original one is larger, as it has many more columns. But it is formatted in the same way.
df = data.frame(Case1.1.jpeg.text="the",
Case1.1.jpeg.text.1="big",
Case1.1.jpeg.text.2="DOG",
Case1.1.jpeg.text.3="10197",
Case1.2.png.text="framework",
Case1.3.jpg.text="BE",
Case1.3.jpg.text.1="THE",
Case1.3.jpg.text.2="Change",
Case1.3.jpg.text.3="YOUWANTTO",
Case1.3.jpg.text.4="SEE",
Case1.3.jpg.text.5="in",
Case1.3.jpg.text.6="theWORLD",
Case1.4.png.text="09.80.56.60.77")
The dataframe consists of output from a text detection ML model based on a certain number of input images.
The output format makes each word for each image a separate column, thereby creating a very wide dataset.
Desired Output
I am looking to create a cleaner version of it, with one column containing the image name (e.g. Case1.2.png) and the second with the concatenation of all possible words that the model finds in that particular image (the number of words varies from image to image).
result = data.frame(Case=c('Case1.1.jpeg','Case1.2.png','Case1.3.jpg','Case1.4.png'),
Text=c('thebigDOG10197','framework','BETHEChangeYOUWANTTOSEEintheWORLD','09.80.56.60.77'))
I have tried many approaches based on similar questions found on Stack Overflow, but none seem to give me the exact output I'm looking for.
Any help on this would be greatly appreciated.
library(tidyr)
library(dplyr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text.*)",
names_to = c("Case", NA)) %>%
group_by(Case) %>%
summarize(value = paste(value, collapse = ""), .groups = "drop")
Alternatively, this can be accomplished using just the pivot functions from tidyr:
library(tidyr)
library(stringr)
df %>%
pivot_longer(cols = everything(),
names_pattern = "(.*)\\.(text).*",
names_to = c("Case", "cols")) %>%
pivot_wider(id_cols = Case,
values_from = value,
names_from = cols,
values_fn = str_flatten)
Output
Case value
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
A possible solution:
library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(name = str_remove(name, "\\.text\\.*\\d*")) %>%
group_by(name) %>%
summarise(text = str_c(value, collapse = ""))
#> # A tibble: 4 x 2
#> name text
#> <chr> <chr>
#> 1 Case1.1.jpeg thebigDOG10197
#> 2 Case1.2.png framework
#> 3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
#> 4 Case1.4.png 09.80.56.60.77
An option in base R is to stack the data into a two-column data.frame with stack() and then paste the values by group with aggregate():
aggregate(cbind(Text = values) ~ Case, transform(stack(df),
Case = trimws(ind, whitespace = "\\.text.*")), FUN = paste, collapse = "")
Case Text
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
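If the one-liner is hard to follow, the same base R idea can be split into steps. This is only a sketch on a shortened version of the sample data; the intermediate names long and result are mine, and sub() is swapped in for trimws() to strip the ".text" suffix.

```r
# Shortened sample: two images, three detected words
df <- data.frame(Case1.1.jpeg.text = "the",
                 Case1.1.jpeg.text.1 = "big",
                 Case1.2.png.text = "framework",
                 stringsAsFactors = FALSE)

# Step 1: stack() gives one row per original column,
# with the word in `values` and the column name in `ind`
long <- stack(df)

# Step 2: drop everything from ".text" onwards to recover the image name
long$Case <- sub("\\.text.*$", "", as.character(long$ind))

# Step 3: paste the words together within each image
result <- aggregate(values ~ Case, data = long, FUN = paste, collapse = "")
result
```

The words are pasted in the order of the original columns, because stack() keeps that order.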
You can use pivot_longer(everything()), manipulate the "Case" column, group, and paste together:
pivot_longer(df,everything(),names_to="Case") %>%
mutate(Case = str_remove_all(Case, "\\.text.*")) %>%
group_by(Case) %>% summarize(Text=paste(value, collapse=""))
Output:
Case Text
<chr> <chr>
1 Case1.1.jpeg thebigDOG10197
2 Case1.2.png framework
3 Case1.3.jpg BETHEChangeYOUWANTTOSEEintheWORLD
4 Case1.4.png 09.80.56.60.77
This is my dataframe:
df <- data.frame(option_1 = c("Box 1", "", ""), option_2 = c("", 4, ""), Width = c("","",3))
I want to get this data frame:
option_1
1 Box 1
2 4
3 3
I'm doing this on a much bigger dataframe with 5+ columns I'm merging on blanks with respect to the option_1 column. I have tried using coalesce, but some of the columns won't "merge" on the blanks. For example:
df %>%
mutate(option_value_1 = coalesce(option_value_1, option_value_2, option_value_3, option_value_4, option_value_5, option_value_6, option_value_7))
option_value_5 wouldn't come together with option_value_1 on the blanks, but the other option values did. Should I put the vectors in a list then use coalesce?
We convert the blanks ("") to NA and coalesce with the splice (!!!) operator, also known as big-bang. According to ?"!!!":
The big-bang operator !!! forces-splice a list of objects. The elements of the list are spliced in place, meaning that they each become one single argument.
library(dplyr)
df %>%
na_if("") %>%
transmute(option_1 = coalesce(!!! .))
-output
option_1
1 Box 1
2 4
3 3
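As far as I know, recent dplyr versions (1.1.0 and later) no longer accept a whole data frame in na_if(), so the pipeline above may error there. A minimal sketch of the same idea under that assumption: convert the blanks column by column with across(), then splice the columns into coalesce(). The names cleaned, cols and merged are only for illustration.

```r
library(dplyr)

df <- data.frame(option_1 = c("Box 1", "", ""),
                 option_2 = c("", "4", ""),
                 Width = c("", "", "3"),
                 stringsAsFactors = FALSE)

# Replace "" with NA in every column
cleaned <- df %>%
  mutate(across(everything(), ~ na_if(.x, "")))

# !!! splices the list so that each column becomes its own argument,
# i.e. this is coalesce(col1, col2, col3)
cols <- unname(as.list(cleaned))
merged <- coalesce(!!!cols)

result <- data.frame(option_1 = merged)
result
```

Since coalesce() only skips NA values, a column that still contains "" or " " instead of NA will not "merge", which may be what happened with option_value_5 in the question.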
If we are interested only in the 'option' columns, subset the columns (we can also use invoke with coalesce):
library(purrr)
df %>%
na_if("") %>%
mutate(option_1 = invoke(coalesce,
across(starts_with("option"))), .keep = "unused")
With a base R approach:
df <- data.frame(option_1 = apply(df, 1, \(x) paste(x, collapse = "")))
df
#> option_1
#> 1 Box 1
#> 2 4
#> 3 3
Or using tidyverse:
df %>%
rowwise %>%
transmute(option_1 = str_c(c_across(everything()), collapse = "")) %>%
ungroup
I got help with this question a while ago:
How to replace multiple values in a string depending on a list of keys
Now I need to take into account that some keys are not to be "translated". In this case I want key1-4 to be translated into code1-4, but the code should also handle keys that aren't in the key_code translation table. If I add a key that is missing from key_code, say keyx, to a row that already has another valid key, I can just filter out the NA that appears when joining with key_code. But if an id has only the keyx value, that whole row disappears, and I want to keep it (it can show up as NA, for example). Any ideas on how to solve that?
library(dplyr)
library(tidyr)
library(stringr)
values = tibble(id = 1:4, values = c("key1;keyx", "key3;key4;key1", "key2;key1", "keyx"))
key_code = tibble(key = c("key1", "key2", "key3", "key4"), code = c("code1", "code2", "code3", "code4"))
values %>%
separate_rows(values) %>%
left_join(key_code, by = c("values" = "key")) %>%
group_by(id) %>%
filter(!is.na(code)) %>%
summarise(code = str_c(code, collapse=";"))
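We could use an if/else condition inside summarise: if all the elements of 'code' in a group are NA, return NA; otherwise replace the remaining NA elements with "" and paste everything together. Note that an untranslated key then leaves an empty slot, hence the trailing ";" for id 1.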
library(dplyr)
library(tidyr)
library(stringr)
values %>%
separate_rows(values) %>%
left_join(key_code, by = c("values" = "key")) %>%
group_by(id) %>%
summarise(code = if(all(is.na(code))) NA_character_ else
str_c(str_replace_na(code, ""), collapse=";"), .groups = 'drop')
-output
# A tibble: 4 x 2
# id code
# <int> <chr>
#1 1 code1;
#2 2 code3;code4;code1
#3 3 code2;code1
#4 4 <NA>
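If the trailing ";" for id 1 is not wanted, one possible variation is to drop the NA codes before pasting instead of replacing them with "". This is a sketch of that change only; the explicit sep = ";" is my addition to make the split character visible.

```r
library(dplyr)
library(tidyr)
library(stringr)

values <- tibble(id = 1:4,
                 values = c("key1;keyx", "key3;key4;key1", "key2;key1", "keyx"))
key_code <- tibble(key = c("key1", "key2", "key3", "key4"),
                   code = c("code1", "code2", "code3", "code4"))

result <- values %>%
  separate_rows(values, sep = ";") %>%
  left_join(key_code, by = c("values" = "key")) %>%
  group_by(id) %>%
  summarise(code = if (all(is.na(code))) NA_character_ else
              # keep only the keys that were translated
              str_c(code[!is.na(code)], collapse = ";"),
            .groups = "drop")

result
```

With this version id 1 gives "code1" and id 4 is still kept as NA.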
I have a data frame DF1 to which I want to add a new column called Stage by comparing its Style values with the columns Col1, Col2, Col3, Col4, Col5, Col6 of the data frame DF. Below is my sample data:
Col1=c("ABCD","","","","wxyz","")
Col2=c("","","MTNL","","","")
Col3=c("","PQRS","","","","")
Col4=c("","","","","","")
Col5=c("","","","","","")
Col6=c("","","","","","EFGH")
DF=data.frame(Col1,Col2,Col3,Col4,Col5,Col6)
Style=c("ABCD","WXYZ","PQRS","EFGH")
DF1=data.frame(Style)
Stage=c(1,1,3,6)
DFR=data.frame(Style,Stage)
DFR would be my resulting data frame.
Can someone help me solve this?
A tidyverse method:
library(tidyverse)
DFR <- DF %>%
mutate(across(everything(), ~na_if(., ""))) %>%
pivot_longer(cols = everything(),
names_to = "Stage",
values_to = "Style",
values_drop_na = TRUE) %>%
filter(Style %in% c("ABCD","WXYZ","PQRS","EFGH"))%>%
mutate(Stage = as.integer(gsub("Col", "", Stage)))
The first mutate call replaces your blank values with NA. Then I pivot your table to long format and drop NA values, before filtering for only the Style values you're interested in (these can be saved in a vector instead to make the code cleaner, but here the column and your vector are named the same so I didn't want to make it confusing). The second mutate call is optional, it removes "Col" from each of your Stage values and converts the column to the type integer.
You can join the data after getting it into long format.
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(cols = everything()) %>%
right_join(DF1, by = c('value' = 'Style'))
# name value
# <chr> <chr>
#1 Col1 ABCD
#2 Col3 PQRS
#3 Col6 EFGH
#4 NA WXYZ
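WXYZ comes back as NA in the join above because DF stores it in lowercase ("wxyz"). The desired Stage of 1 for WXYZ suggests the match should ignore case, although the question does not say so. Assuming that, a minimal sketch is to upper-case the values before joining; the name long is only for illustration.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(Col1 = c("ABCD", "", "", "", "wxyz", ""),
                 Col2 = c("", "", "MTNL", "", "", ""),
                 Col3 = c("", "PQRS", "", "", "", ""),
                 Col4 = "", Col5 = "",
                 Col6 = c("", "", "", "", "", "EFGH"),
                 stringsAsFactors = FALSE)
DF1 <- data.frame(Style = c("ABCD", "WXYZ", "PQRS", "EFGH"),
                  stringsAsFactors = FALSE)

long <- DF %>%
  pivot_longer(everything(), names_to = "Stage", values_to = "value") %>%
  filter(value != "") %>%
  # assumption: matching is case-insensitive, so compare in upper case
  mutate(Style = toupper(value),
         Stage = as.integer(sub("Col", "", Stage))) %>%
  select(Style, Stage)

# left_join keeps the row order of DF1
DFR <- left_join(DF1, long, by = "Style")
DFR
```

If a Style can occur in more than one column, this returns one row per occurrence, so an extra rule (for example the smallest Stage) would be needed.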
I tried to solve this the following way, and it works:
DF <- DF %>%
mutate(across(everything(), ~na_if(., "")))
DFR=DF1
DFR$Stage=ifelse(is.na(DF1$Style),NA,ifelse(DF1$Style %in% DF$Col1,1,
ifelse(DF1$Style %in% DF$Col2,2,
ifelse(DF1$Style %in% DF$Col3,3,
ifelse(DF1$Style %in% DF$Col4,4,
ifelse(DF1$Style %in% DF$Col5,5,
ifelse(DF1$Style %in% DF$Col6,6,NA)))))))
I have a lookup data frame (DF1 below) that serves as a dictionary of strings, and a Raw_file data frame whose rows contain combinations of those strings. The lookup holds the Types that should be populated in the Result data frame based on the Items in the raw file.
Item1=c("Banana","Tomato","Potato","Palak")
Item2=c("","Orange","Onion","Mango")
Type1=c("Fruit","Vegetable","Vegetable","Leaves")
Type2=c("","Fruit","Vegetable","Fruit")
DF1=data.frame(Item1,Item2,Type1,Type2)
Items=c("Onion,Potato,Ginger","Tomato","Banana","Palak,Mango","Onion,Capsicum","Orange,Sweet_potato")
Raw_file=data.frame(Items)
Result_Type1=c("Vegetable","Vegetable","Fruit","Leaves","","")
Result_Type2=c("Vegetable","","","Fruit","Vegetable","Fruit")
Result=data.frame(Items,Result_Type1,Result_Type2)
My output data frame should look like Result.
I tried something with str_detect in a case statement but was not able to get it to work. Can someone help me out with this?
Maybe you can do a join between these two tables (similar to your other question).
First, put DF1 in long format. For Raw_file, use separate_rows to get a single item in each row before the join.
library(tidyverse)
DF1_long <- DF1 %>%
pivot_longer(cols = everything(),
names_to = c(".value", "number"),
names_pattern = "(\\w+)(\\d+)$")
Raw_file %>%
mutate(value = Items) %>%
separate_rows(value) %>%
inner_join(DF1_long, by = c("value" = "Item")) %>%
group_by(Items) %>%
distinct(Items, number, .keep_all = TRUE) %>%
pivot_wider(id_cols = Items,
names_from = number,
values_from = Type,
names_prefix = "Result_Type")
Output
Items Result_Type2 Result_Type1
<chr> <chr> <chr>
1 Onion,Potato,Ginger Vegetable Vegetable
2 Tomato NA Vegetable
3 Banana NA Fruit
4 Palak,Mango Fruit Leaves
5 Onion,Capsicum Vegetable NA
6 Orange,Sweet_potato Fruit NA
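To see what the ".value" trick in the first step does, here is a small sketch on just the first two rows of the lookup table (the shortened data is only for illustration). The regex splits each column name into a stem (Item or Type) and a trailing number; ".value" tells pivot_longer to turn each stem into its own column, and the number goes into the number column.

```r
library(tidyr)

# First two rows of the lookup table only
DF1 <- data.frame(Item1 = c("Banana", "Tomato"),
                  Item2 = c("", "Orange"),
                  Type1 = c("Fruit", "Vegetable"),
                  Type2 = c("", "Fruit"),
                  stringsAsFactors = FALSE)

DF1_long <- pivot_longer(DF1,
                         cols = everything(),
                         names_to = c(".value", "number"),
                         names_pattern = "(\\w+)(\\d+)$")

# One row per (original row, number) pair, with columns number, Item, Type
DF1_long
```

Each Item now sits next to its Type, which is what makes the later join on Item possible.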