R dataframe with values in the wrong columns - r

I have a dataframe like this one:
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear British Yellow
Strawberries French Red
Blackberry Blue Austrian
As you can see, a characteristic can end up in a different column depending on the row. I would like to obtain a dataframe where each column contains only the values of one specific characteristic.
Name Characteristic_1 Characteristic_2
Apple Yellow Italian
Pear Yellow British
Strawberries Red French
Blackberry Blue Austrian
My idea is to use the case_when function, but I would like to know if there are faster ways to achieve the same result.
Example data:
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

I suspect there is an easier way of solving the issue, but here is one potential solution:
# Load the libraries
library(tidyverse)
# Load the data
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
# R has 657 built in colour names. You can see them using the `colours()` function.
# Chances are your colours are contained in this list.
# The `str_to_title()` function capitalizes every colour in the list
list_of_colours <- str_to_title(colours())
# If your colours are not contained in the list, add them using e.g.
# `list_of_colours <- c(list_of_colours, "Octarine")`
# Create a new dataframe ("df2") by taking the original dataframe ("df")
df2 <- df %>%
# Create two new columns called "Colour" and "Origin" using `mutate()` with
# `ifelse` used to identify whether each word is in the list of colours.
# If the word is in the list of colours, add it to the "Colour" column, if
# it isn't, add it to the "Origin" column.
mutate(Colour = ifelse(!is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
Characteristic_1, Characteristic_2),
Origin = ifelse(is.na(str_extract(Characteristic_1, paste(list_of_colours, collapse = "|"))),
Characteristic_1, Characteristic_2)) %>%
# Then select the columns you want
select(Name, Colour, Origin)
df2
# A tibble: 4 x 3
# Name Colour Origin
# <chr> <chr> <chr>
#1 Apple Yellow Italian
#2 Pear Yellow British
#3 Strawberries Red French
#4 Blackberry Blue Austrian

There is probably a better way of achieving this, but for now this is the one solution that came to mind:
library(dplyr)
library(stringr)
df <- structure(list(Name = c("Apple", "Pear", "Strawberries", "Blackberry"
), Characteristic_1 = c("Yellow", "British", "French", "Blue"
), Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
df %>%
mutate(char_1 = if_else(str_to_lower(Characteristic_1) %in% colours(distinct = TRUE),
Characteristic_1, Characteristic_2),
char_2 = if_else(Characteristic_1 == char_1, Characteristic_2, Characteristic_1)) %>%
select(-c(Characteristic_1, Characteristic_2))
# A tibble: 4 x 3
Name char_1 char_2
<chr> <chr> <chr>
1 Apple Yellow Italian
2 Pear Yellow British
3 Strawberries Red French
4 Blackberry Blue Austrian
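Since the question asks about faster alternatives to case_when, a base R sketch that swaps the two columns in place with a logical mask may be worth considering. It assumes, as both answers above do, that every colour in the data appears in R's built-in colours() list. The name is_swapped is only illustrative.

```r
# Same example data as in the question, built as a plain data.frame
df <- data.frame(
  Name = c("Apple", "Pear", "Strawberries", "Blackberry"),
  Characteristic_1 = c("Yellow", "British", "French", "Blue"),
  Characteristic_2 = c("Italian", "Yellow", "Red", "Austrian"),
  stringsAsFactors = FALSE
)

# Assumption: all colours of interest are listed in colours() (lower case)
known_colours <- colours()

# A row is "swapped" when the colour sits in column 2 instead of column 1
is_swapped <- !(tolower(df$Characteristic_1) %in% known_colours) &
  tolower(df$Characteristic_2) %in% known_colours

# Exchange the two values only in the swapped rows
tmp <- df$Characteristic_1[is_swapped]
df$Characteristic_1[is_swapped] <- df$Characteristic_2[is_swapped]
df$Characteristic_2[is_swapped] <- tmp

df
```

Rows where neither value is a known colour are left untouched, so they can be inspected separately.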

Related

Is there a way to "CountIF" in R based on two conditions

I know how to do this in Excel, but I am trying to translate it into R and create a new column. In R I have a data frame called CleanData. I want to see how many times the value in each row of column A shows up anywhere in column B. In Excel it would read like this:
=COUNTIF(B:B,A2)>0
The second part would be a nested IF/AND statement. In Excel it would look like this:
=IF(AND(COUNTIF(B:B,A2)>0,C="Purple"),"Yes", "No")
Anyone know where to start?
I have tried mutating and also this:
sum(CleanData$colA == CleanData$colB)
and am getting no values
You don't need any extra packages. Here is a solution using the base R function ifelse, which is frequently useful and worth learning. An example:
set.seed(7*11*13)
DF <- data.frame(cond=rnorm(100), X= sample(c("Yes","No"), 100, replace=TRUE))
with(DF, sum(ifelse( (cond>0)&(X=="Yes"), 1, 0)))
I think this will capture your if/countif scenario:
library(dplyr)
CleanData %>%
mutate(YesOrNo = case_when(Color != "Purple" ~ "No", is.na(LABEL1) | !nzchar(LABEL1) ~ "No", !LABEL1 %in% LABEL2 ~ "No", TRUE ~ "Yes"))
# LABEL1 LABEL2 Color YesOrNo
# 1 HELLO <NA> Purple Yes
# 2 <NA> HELLO!!! Blue No
# 3 HELLO$$ <NA> Purple Yes
# 4 <NA> HELLO Blue No
# 5 HELLOOO <NA> Purple Yes
# 6 <NA> <NA> Purple No
# 7 <NA> HELLOOO Blue No
# 8 <NA> HELLO$$ Blue No
# 9 <NA> HELLO Yellow No
Data
CleanData <- structure(list(LABEL1 = c("HELLO", NA, "HELLO$$", NA, "HELLOOO", NA, NA, NA, NA), LABEL2 = c(NA, "HELLO!!!", NA, "HELLO", NA, NA, "HELLOOO", "HELLO$$", "HELLO"), Color = c("Purple", "Blue", "Purple", "Blue", "Purple", "Purple", "Blue", "Blue", "Yellow")), class = "data.frame", row.names = c(NA, -9L))
or programmatically,
CleanData <- data.frame(LABEL1=c("HELLO",NA,"HELLO$$",NA,"HELLOOO",NA,NA,NA,NA), LABEL2=c(NA,"HELLO!!!",NA,"HELLO",NA,NA,"HELLOOO","HELLO$$","HELLO"),Color=c("Purple","Blue","Purple","Blue","Purple","Purple","Blue","Blue","Yellow"))
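For comparison, here is a base R sketch of the same COUNTIF/IF logic, using the CleanData example above. The attempt sum(CleanData$colA == CleanData$colB) compares the two columns row by row rather than each value against the whole column, which is why it does not behave like COUNTIF. The column name counts is an assumption made for illustration.

```r
CleanData <- data.frame(
  LABEL1 = c("HELLO", NA, "HELLO$$", NA, "HELLOOO", NA, NA, NA, NA),
  LABEL2 = c(NA, "HELLO!!!", NA, "HELLO", NA, NA, "HELLOOO", "HELLO$$", "HELLO"),
  Color  = c("Purple", "Blue", "Purple", "Blue", "Purple", "Purple", "Blue", "Blue", "Yellow"),
  stringsAsFactors = FALSE
)

# COUNTIF(B:B, A2): for each LABEL1 value, count exact matches in all of LABEL2.
# na.rm = TRUE makes a missing LABEL1 count as zero matches.
CleanData$counts <- vapply(
  CleanData$LABEL1,
  function(x) sum(CleanData$LABEL2 == x, na.rm = TRUE),
  integer(1),
  USE.NAMES = FALSE
)

# IF(AND(COUNTIF(...) > 0, C = "Purple"), "Yes", "No")
CleanData$YesOrNo <- ifelse(CleanData$counts > 0 & CleanData$Color == "Purple", "Yes", "No")

CleanData
```

Counting explicitly avoids a pitfall of `%in%`: NA %in% a vector that contains NA is TRUE, so missing labels would otherwise need to be excluded by hand.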

Classifying columns based on str_detect

I am currently working with a data frame that looks like this:
Example <- structure(list(ID = c(12301L, 12301L, 15271L, 11888L, 15271L,
15271L, 15271L), StationOwner = c("Brian", "Brian", "Simon",
"Brian", "Simon", "Simon", "Simon"), StationName = c("Red", "Red",
"Red", "Green", "Yellow", "Yellow", "Yellow"), Parameter = c("Rain - Daily",
"Temperature -Daily", "VPD - Daily", "Rain - Daily", "Rain - Daily",
"Temperature -Daily", "VPD - Daily")), class = "data.frame", row.names = c(NA,
-7L))
I am looking into using str_detect to filter, for example, all the observations that start with "Rain -" and to add what comes after the dash to a new column called "Rain". I have been able to keep only the values that start with "Rain" using str_detect, but have not found a way to assign them automatically. Is there a specific function that would help with this? I appreciate any pointers, thanks!
Example of desired output that I am trying to achieve:
Desired <- structure(list(ID = c(12301L, 15271L, 12301L, 15271L
), StationOwner = c("Brian", "Simon", "Brian", "Simon"), StationName = c("Red",
"Red", "Green", "Yellow"), Rain = c("Daily", NA, "Daily", "Daily"
), Temperature = c("Daily", NA, NA, "Daily"), VPD = c(NA, "Daily",
NA, "Daily")), class = "data.frame", row.names = c(NA, -4L))
Directly using pivot_wider:
library(tidyr)
library(stringr)
pivot_wider(Example, names_from = Parameter, values_from = Parameter,
names_repair = ~str_remove(., ' .*'), values_fn = ~str_remove(., '.*- ?'))
# A tibble: 4 x 6
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
This doesn't use str_detect, but you can achieve Desired with:
library(dplyr)
library(tidyr)
Example %>%
separate(Parameter, c('a', 'b'), sep = "-") %>%
mutate(across(where(is.character), ~trimws(.x))) %>%
pivot_wider(id_cols = c("ID","StationOwner", "StationName"), names_from = "a", values_from = "b")
ID StationOwner StationName Rain Temperature VPD
<int> <chr> <chr> <chr> <chr> <chr>
1 12301 Brian Red Daily Daily NA
2 15271 Simon Red NA NA Daily
3 11888 Brian Green Daily NA NA
4 15271 Simon Yellow Daily Daily Daily
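If the goal is mainly to see how the text before and after the dash can be pulled apart, a base R sketch of just that step might look like the following. It uses grepl as a stand-in for str_detect, keeps the data in long format, and the column names Type and Freq are assumptions made for illustration.

```r
Example <- data.frame(
  ID = c(12301L, 12301L, 15271L, 11888L, 15271L, 15271L, 15271L),
  StationOwner = c("Brian", "Brian", "Simon", "Brian", "Simon", "Simon", "Simon"),
  StationName = c("Red", "Red", "Red", "Green", "Yellow", "Yellow", "Yellow"),
  Parameter = c("Rain - Daily", "Temperature -Daily", "VPD - Daily", "Rain - Daily",
                "Rain - Daily", "Temperature -Daily", "VPD - Daily"),
  stringsAsFactors = FALSE
)

# Text before the dash (the measurement type), with stray spaces trimmed
Example$Type <- trimws(sub("-.*$", "", Example$Parameter))

# Text after the dash (the frequency), with stray spaces trimmed
Example$Freq <- trimws(sub("^.*-", "", Example$Parameter))

# grepl() plays the role of str_detect(): fill "Rain" only where the
# parameter starts with "Rain", and leave NA elsewhere
Example$Rain <- ifelse(grepl("^Rain", Example$Parameter), Example$Freq, NA_character_)

Example
```

This does not collapse the rows per station; the pivot_wider answers above handle that part.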

Make many rows to single using an index

According to this input:
structure(list(mid = c("text11", "text12", "text21", "text22",
"text23"), term = c("test", "text", "section", "2", "sending"
)), class = "data.frame", row.names = c(NA, -5L))
How can I use mid to collapse these rows into single rows? In mid, the prefix (text1, text2, ...) identifies the output row, and the final digit gives the position of the term within that row. The terms of each row should be merged, separated by spaces.
Example output dataframe:
data.frame(mid = c("text1", "text2"), term = c("test text", "section 2 sending"
))
This should work
library(dplyr)
library(stringr)
df <- structure(list(mid = c("text11", "text12", "text21", "text22",
"text23"), term = c("test", "text", "section", "2", "sending"
)), class = "data.frame", row.names = c(NA, -5L))
df %>%
mutate(mid = str_extract(mid, "text\\d")) %>%
group_by(mid) %>%
summarise(term = paste(term, collapse=" "))
# # A tibble: 2 x 2
# mid term
# <chr> <chr>
# 1 text1 test text
# 2 text2 section 2 sending
EDIT - to address comment
Addressing the question in the comment, the functions below will work for any case where all of the digits except the last one identify the group (i.e., 1 and 12 in the example below).
df <- structure(list(mid = c("text11", "text12", "text121", "text122", "text123"), term = c("test", "text", "section", "2", "sending")), class = "data.frame", row.names = c(NA, -5L))
df %>%
mutate(mid = str_sub(mid, 1, (nchar(mid)-1))) %>%
group_by(mid) %>%
summarise(term = paste(term, collapse=" "))
# # A tibble: 2 x 2
# mid term
# <chr> <chr>
# 1 text1 test text
# 2 text12 section 2 sending
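The same grouping idea can also be sketched in base R with aggregate, for readers who prefer not to load dplyr. As in the edited answer above, this assumes the last digit of mid is the position within the row and everything before it identifies the group. The column name group is illustrative.

```r
df <- data.frame(
  mid = c("text11", "text12", "text21", "text22", "text23"),
  term = c("test", "text", "section", "2", "sending"),
  stringsAsFactors = FALSE
)

# Drop the final digit so that only the group identifier remains
df$group <- sub("\\d$", "", df$mid)

# Paste the terms of each group together, separated by a space;
# `collapse` is forwarded to paste() through aggregate()'s `...`
out <- aggregate(term ~ group, data = df, FUN = paste, collapse = " ")

out
```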

Check if dataframe column value is present in list in R

I have a master list of colors, shown below:
master <- list("Beige" = c("light brown", "light golden", "skin"),
"off-white" = c("off white", "cream", "light cream", "dirty white"),
"Metallic" = c("steel","silver"),
"Multi-colored" = c("multi color", "mixed colors", "mix", "rainbow"),
"Purple" = c("lavender", "grape", "jam", "raisin", "plum", "magenta"),
"Red" = c("cranberry", "strawberry", "raspberry", "dark cherry", "cherry","rosered"),
"Turquoise" = c("aqua marine", "jade green"),
"Yellow" = c("fresh lime")
)
and this is the dataframe column that I have:
df$color <- c('multi color','purple','steel','metallic','off white','raisin','strawberry','magenta','skin','Beige','Jade Green','cream','multi-colored','offwhite','rosered',"light cream")
Now I want to check whether each value in the column matches either a list key or one of the list values.
For example:
1) If the column value is off white, it should first be compared against the list keys (Beige, off-white, Metallic, ...); if it matches a key, that key is the result.
2) It should also be compared against all the values stored under those keys; for example, light cream is one of the values of off-white, so it should be classified as off-white.
3) Matching should ignore case (OffWhITe == offwhite) and spaces (off white == offwhite).
OUTPUT
This should be the expected output
df$output <- c("Multi-colored","Purple","Metallic","Metallic","off-white","Purple","Red","Purple","Beige","Beige","Turquoise","off-white","Multi-colored","off-white","Red","off-white")
EDIT
Any value in c("multi color", "mixed colors", "mix", "rainbow","multicolored","MultI-cOlored","multi-colored","MultiColORed","Multi-colored") should be treated as Multi-colored.
Maybe we can do a stringdist_right_join after stacking the list into a single data.frame:
library(dplyr)
library(tidyr)
library(fuzzyjoin)
library(tibble)
enframe(master, value = 'color') %>%
unnest(c(color)) %>%
type.convert(as.is = TRUE) %>%
stringdist_right_join(df %>%
mutate(rn = row_number()), max_dist = 3) %>%
transmute(color = color.y, output = coalesce(name, color.y))
# A tibble: 19 x 2
# color output
# <chr> <chr>
# 1 multi color Multi-colored
# 2 purple purple
# 3 steel Metallic
# 4 metallic metallic
# 5 off white off-white
# 6 raisin Purple
# 7 strawberry Red
# 8 strawberry Red
# 9 magenta Purple
#10 skin Beige
#11 skin Multi-colored
#12 Beige Beige
#13 Jade Green Turquoise
#14 cream off-white
#15 cream Purple
#16 multi-colored Multi-colored
#17 offwhite off-white
#18 rosered Red
#19 light cream off-white
data
df <- structure(list(color = c("multi color", "purple", "steel", "metallic",
"off white", "raisin", "strawberry", "magenta", "skin", "Beige",
"Jade Green", "cream", "multi-colored", "offwhite", "rosered",
"light cream")), class = "data.frame", row.names = c(NA, -16L
))
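The fuzzy join above leaves a few values unmapped or mapped more than once. As an alternative, an exact lookup on normalised strings can be sketched in base R. It assumes that lower-casing and stripping everything except letters and digits is an acceptable reading of rules 1 to 3. The helper normalise and the vector lookup are hypothetical names introduced here.

```r
master <- list(
  "Beige" = c("light brown", "light golden", "skin"),
  "off-white" = c("off white", "cream", "light cream", "dirty white"),
  "Metallic" = c("steel", "silver"),
  "Multi-colored" = c("multi color", "mixed colors", "mix", "rainbow"),
  "Purple" = c("lavender", "grape", "jam", "raisin", "plum", "magenta"),
  "Red" = c("cranberry", "strawberry", "raspberry", "dark cherry", "cherry", "rosered"),
  "Turquoise" = c("aqua marine", "jade green"),
  "Yellow" = c("fresh lime")
)

color <- c("multi color", "purple", "steel", "metallic", "off white", "raisin",
           "strawberry", "magenta", "skin", "Beige", "Jade Green", "cream",
           "multi-colored", "offwhite", "rosered", "light cream")

# Hypothetical helper: lower-case and drop spaces, hyphens and other punctuation
normalise <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Named vector: normalised key or value -> master key
lookup <- c(
  setNames(names(master), normalise(names(master))),
  setNames(rep(names(master), lengths(master)),
           normalise(unlist(master, use.names = FALSE)))
)

# Values with no match come back as NA
output <- unname(lookup[normalise(color)])

data.frame(color, output)
```

Unlike a string-distance join, this only accepts values that match a key or synonym after normalisation, so misspellings would show up as NA rather than being guessed.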

colored categories in r wordclouds

Using the wordcloud package in R I would like to color different words according to a categorical variable in the dataset. Say my data is as follows:
name weight group
1 Aba 10 x
2 Bcd 20 y
3 Cde 30 z
4 Def 5 x
And here as a dput:
dat <- structure(list(name = c("Aba", "Bcd", "Cde", "Def"), weight = c(10,
20, 30, 5), group= c("x", "y", "z", "x")), .Names = c("name",
"weight", "group"), row.names = c(NA, -4L), class = "data.frame")
Is there a way in wordcloud() to color the names by their group (x, y, z) or should I use different software/packages?
By default, wordcloud() picks colors from the colors vector based on word frequency. If ordered.colors = TRUE is specified, colors are instead assigned to words in order, one color per word.
library(wordcloud)
name = c("Aba","Bcd","Cde","Def")
weight = c(10,20,30,5)
colorlist = c("red","blue","green","red")
wordcloud(name, weight, colors=colorlist, ordered.colors=TRUE)
The example above works for standalone vectors. In a data frame, your color specification may be stored as a factor (the default for character columns before R 4.0), in which case it has to be converted to text by wrapping it in as.character, like this:
wordcloud(df$name, df$weight, colors=as.character(df$color), ordered.colors=TRUE)
If you just have factors and not a list of colors, you can generate a parallel colorlist with a couple of lines.
group = c("x","y","z","x")
# general solution for any number of categories
basecolors = rainbow(length(unique(group)))
# solution for known categories
basecolors = c("red","green","blue")
# find position of group in list of groups, and select that matching color...
colorlist = basecolors[ match(group,unique(group)) ]
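Applied to the dat data frame from the question, the same idea might be sketched as follows. The group-to-color mapping (group_colors) is an arbitrary assumption, and the wordcloud call is guarded so the snippet still runs where the package is not installed.

```r
dat <- data.frame(
  name = c("Aba", "Bcd", "Cde", "Def"),
  weight = c(10, 20, 30, 5),
  group = c("x", "y", "z", "x"),
  stringsAsFactors = FALSE
)

# Assumed mapping from group label to color; adjust as needed
group_colors <- c(x = "red", y = "blue", z = "green")

# One color per word, in the same order as dat$name
# (as.character() guards against `group` being stored as a factor)
colorlist <- unname(group_colors[as.character(dat$group)])

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(dat$name, dat$weight, min.freq = 1,
                       colors = colorlist, ordered.colors = TRUE)
}
```

Using a named vector keeps the mapping explicit, which can be easier to read than relying on the order of unique(group).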
