Rename variables in a column - r

I have the following data and I want to rename the three variable-names inside the column Categories from:

Sim_long1 %>%
  mutate(Categories = case_when(
    Categories == "Ave...C" ~ "Ave",
    Categories == "Min...C" ~ "Min",
    Categories == "Max...C" ~ "Max"
  ))
Hope this is what you want

A simple answer is to just remove the ...C.. part from each of the items. If they are all the same, you can relabel like this:
Sim_long1$Categories <- gsub("...C..", "", Sim_long1$Categories)
Then plot as normal. If you want a more general format, you could use this type of syntax:
Sim_long1$Categories <- gsub('\\.*C*', '', Sim_long1$Categories)
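One caveat with the general pattern: `\\.*C*` removes every run of dots and every capital C anywhere in a value, not only the suffix. A minimal sketch of a stricter, end-anchored alternative follows. The label values here are hypothetical, since the real contents of Sim_long1$Categories are not shown.

```r
# Hypothetical labels; the real values in Sim_long1$Categories may differ
categories <- c("Ave...C..", "Min...C..", "Max...C..", "Count.C")

# Loose pattern from above: drops every run of dots and every capital C
loose <- gsub("\\.*C*", "", categories)

# Anchored pattern: only a trailing "<dots>C<dots>" suffix is removed
strict <- sub("\\.+C\\.*$", "", categories)

print(loose)
print(strict)
```

With these made-up values the two patterns agree on the first three labels and differ only on "Count.C", where the loose pattern also eats the leading C.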

Related

Purrr package add species names to each output name in SSDM

I have a list of species and I am running an ensemble SDM modelling function on the dataset, filtering by each species, to give an ensemble SDM per species.
I have used the purrr package to get it running, and the code works fine when no naming convention is added. However, the Ensemble.SDM output for every species is named "ensemble.sdm", so I cannot stack them.
I would like to be able to name each output of the model something different, ideally linked to the species name picked out in the line: data <- Occ_full %>% filter(NAME == .x)
The working code is written below:
# Return unique values
list_of_species <- unique(unlist(Occ_full$NAME))
output <- purrr::map(limit_list_of_species, ~ {
  data <- Occ_full %>% filter(NAME == .x)
  ensemble_modelling(c('GAM'), data, Env_Vars,
                     Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1)
})
The code I tried for naming the outputs is below, but it does not work: each model ends up named with many repetitions of the row number.
output <- purrr::map(limit_list_of_species, ~ {
  data <- Occ_full %>% filter(NAME == .x)
  label <- as.character(data)
  ensemble_modelling(c('GAM'), data, Env_Vars,
                     Xcol = 'LONGITUDE', Ycol = 'LATITUDE', rep = 1, name = label)
})
Could anyone help me please? I simply want each "output" to be named with the species name specified in the filter. Thank you
Try using split with imap:
list_of_species <- split(Occ_full, Occ_full$NAME)
output <- purrr::imap(list_of_species, ~ {
  ensemble_modelling(c('GAM'), .x, Env_Vars, Xcol = 'LONGITUDE',
                     Ycol = 'LATITUDE', rep = 1, name = .y)
})
split would ensure that the list_of_species is named which can be used in imap.
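To see the naming mechanism in isolation, here is a small sketch on made-up data. `fit_stub` is a hypothetical stand-in for ensemble_modelling() (it only records the label and the row count), and the `occ` data frame is invented for illustration.

```r
library(purrr)

# Toy occurrence data; column names mirror the question, values are made up
occ <- data.frame(
  NAME = c("sp_a", "sp_b", "sp_a", "sp_c"),
  LONGITUDE = c(1, 2, 3, 4),
  LATITUDE = c(10, 20, 30, 40)
)

# Hypothetical stand-in for ensemble_modelling(): it just records the label
# and the number of rows it received
fit_stub <- function(data, name) {
  list(name = name, n_rows = nrow(data))
}

# split() returns a list named by species, so imap() sees each name as .y
by_species <- split(occ, occ$NAME)
output <- imap(by_species, ~ fit_stub(.x, name = .y))

names(output)
output$sp_a$n_rows
```

The result list keeps the species names, so each element can later be picked out or stacked by name.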

How to use a List Item in fct_recode

I want to rename factor levels with fct_recode by using items I created beforehand.
I first create some labels and save them into a list:
#Creating the Labels:
LabelsWithN <- c(
  sprintf("Man(%s)", FreqGender["Man", "Freq"]),
  sprintf("Woman(%s)", FreqGender["Woman", "Freq"]),
  sprintf("Non-Binary(%s)", FreqGender["Non-Binary", "Freq"]),
  sprintf("Other(%s)", FreqGender["Other", "Freq"]),
  sprintf("Prefer Not To Disclose(%s)", FreqGender["Prefer not to disclose", "Freq"])
)
This creates a chr list with items like "Man(105)", "Woman(51)" etc.
Now I want to relabel the factor levels in the original DataSet (i.e. "Man" --> "Man(105)") in order to label a graph. I want to use either the list item (i.e., LabelsWithN[1]) or directly the function creating the string (i.e., sprintf("Man(%s)", FreqGender["Man","Freq"])).
I then try to enter either the list item or the function into fct_recode:
#Using the Labels:
DataSet %>%
  mutate(`Gender. What_is_your_ge.._` = fct_recode(`Gender. What_is_your_ge.._`, LabelsWithN[1] = "Man", sprintf("Woman(%s)", FreqGender["Woman","Freq"]) = "Woman")) %>%
  # This is just the code for the graph:
  ggplot(aes(x = `Gender. What_is_your_ge.._` , y = `Age. How_old_are_you?`, main = "Age Distribution By Gender")) +
  geom_boxplot() +
  xlab("Gender (n)") +
  ylab("Age")
However, this yields:
"Unexpected '=' in:
"DataSet %>%
mutate(`Gender. What_is_your_ge.._` = fct_recode(`Gender. What_is_your_ge.._`, LabelsWithN[1] ="
It doesn't matter if I use the function or the list item.
The vector is a factor and the list is filled with characters. If I manipulate the code to rename the factor "man" to "cat" ("cat" = "Man") the code works fine.
How can I address the list item/enter the function into fct_recode so that it works?
Also, can somebody explain to me what the problem here is? If I print out LabelsWithN[1] I get the correct string printed out.
Thank you and Bw,
Jan
Perhaps this might help?
library(forcats)
x <- factor(c("apple", "bear", "banana", "dear")) # what you want to recode
levels <- c(fruit = "apple", fruit = "banana") # how you want to recode it (new = "old")
x <- fct_recode(x, !!!levels) # recoding it
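As for why the original call fails: in R, the name on the left of `=` inside a function call has to be written out literally, so an expression such as LabelsWithN[1] or a sprintf() call is a syntax error there. Building a named vector first and splicing it with `!!!` avoids that. A minimal sketch applied to the gender labels follows; the `freq` counts are hypothetical stand-ins for FreqGender[, "Freq"].

```r
library(forcats)

gender <- factor(c("Man", "Woman", "Man", "Other"))

# Hypothetical counts standing in for FreqGender[, "Freq"]
freq <- c(Man = 105, Woman = 51)

# New labels computed at run time
new_labels <- sprintf("%s(%s)", names(freq), freq)

# fct_recode() expects new_name = "old_name", so the new labels become the names
recode_map <- setNames(names(freq), new_labels)

gender <- fct_recode(gender, !!!recode_map)
levels(gender)
```

Levels without an entry in the map (here "Other") are left unchanged.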

Change column values depending on other column in R

I have a problem with my data frame.
I have a dataframe with 2 columns, 'word' and 'word_categories'. I created different variables which include the different words, e.g. 'noun' which includes all the nouns of the word column. I now want to change the labels in the word_categories column to the corresponding variable. So if the word in the word column is included in the object 'noun', I want the word_categories column to display 'noun'.
df <- read.csv("palm.csv")
noun <- c("house", ...)
adj <- c("hard", ...)
...
The data frame looks like the following. It includes other columns but they are fine.
word word_categories
house
car
hard
...
I now want to look, if the words are in any of the created objects and if so, I want the corresponding label printed in the word_categories column. So for 'house' the column should show noun, for 'hard' it should show adjective. If the word is in none of the objects, it should show nothing or 'NA'.
I tried it with the following:
palm$word_categories <- ifelse(palm$word == noun, "noun",
ifelse(palm$word == adj, "adjective", "")))
This, however, doesn't work at all and I have 7 Objects in total so the statement becomes ridiculously long. How do I do it properly?
If the dataframe is called palm (you first call it df but later you use palm) and noun and adj are vectors as you define above, I would do:
library(dplyr)
palm <- palm %>%
  mutate(word_categories = case_when(
    word %in% noun ~ "noun",
    word %in% adj ~ "adjective",
    TRUE ~ NA_character_
  ))
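The reason the ifelse() attempt fails is that `==` compares the two vectors position by position (recycling the shorter one), whereas `%in%` tests membership. A small sketch with made-up words shows the difference:

```r
word <- c("house", "car", "hard", "tree")
noun <- c("car", "house")   # made-up dictionary

# == pairs word[1] with noun[1], word[2] with noun[2], and then recycles noun
by_eq <- word == noun

# %in% asks, for each word, whether it occurs anywhere in noun
by_in <- word %in% noun

print(by_eq)
print(by_in)
```

Here `==` finds no matches at all because the words never line up by position, while `%in%` flags "house" and "car".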
One way would be to create a named vector of your noun/adjective dictionaries to select each element. The name would be the word and the corresponding data would be noun, adjective etc. You didn't really supply any data so I made some up.
df <- data.frame(
  stringsAsFactors = FALSE,
  word = c("dog", "short", "bird", "cat", "short", "man")
)
nounName <- c('dog', 'cat', 'bird')
adjName <- c('quick', 'brown', 'short')
noun <- rep('noun', length(nounName))
adj <- rep('adjective', length(adjName))
names(noun) <- nounName
names(adj) <- adjName
partsofspeech <- c(noun, adj)
df$word_categories <- partsofspeech[df$word]
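With seven dictionaries, the same lookup vector can be built in one step from a named list instead of one rep()/names() pair per category. This is a sketch under the assumption that each word belongs to a single category; the dictionary contents are made up.

```r
# Made-up dictionaries; in the question there are seven such vectors
dict <- list(
  noun = c("dog", "cat", "bird"),
  adjective = c("quick", "brown", "short")
)

# One entry per word: names are the words, values are the category
lookup <- setNames(rep(names(dict), lengths(dict)), unlist(dict))

words <- c("dog", "short", "man")
categories <- unname(lookup[words])
print(categories)
```

Words that appear in no dictionary (here "man") come back as NA, which matches the behaviour asked for in the question.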

Reorder panels in facet_wrap/facet_grid based on another factor, with multiple occurrences

Consider this example. I want to create a custom label for my panels by joining two columns into a string.
The panels created through faceting are ordered alphabetically, but actually, I want them to be ordered by src, so SRC01 should come first, then SRC02, etc.
library(tidyverse)
tibble::tibble(
  src = rep(c("SRC03", "SRC04", "SRC01", "SRC02"), 2),
  data = runif(8)
) %>%
  mutate(
    foo = case_when(src %in% c("SRC01", "SRC02") ~ "foo", TRUE ~ "bar"),
    label = paste(foo, src)
  ) %>%
  ggplot(aes(x = data)) +
  geom_density() +
  facet_wrap(~label)
Created on 2019-05-22 by the reprex package (v0.3.0)
I know that this order depends on the order of underlying factor levels, but this question shows how to manually specify the levels, which I do not want (there are many more SRC values and I don't want to type all of them…).
I found a solution using fct_reorder, in which I could specify:
mutate(label = fct_reorder(label, src, .fun = identity))
But this only works when there is one line per src/label combination. If I add data (i.e., more than one data point per combination), it fails with:
Error: `fun` must return a single value per group
What would be the most succinct way to achieve what I need?
You can use the numeric part of src, and then use reorder():
tibble::tibble(
  src = rep(c("SRC03", "SRC04", "SRC01", "SRC02"), 2),
  data = runif(8)
) %>%
  mutate(
    foo = case_when(src %in% c("SRC01", "SRC02") ~ "foo", TRUE ~ "bar"),
    label = paste(foo, src)
  ) %>%
  # use str_extract() to find the "01" inside "SRC01" and convert it to numeric
  mutate(label_order = as.numeric(str_extract(src, "\\d+"))) %>%
  ggplot(aes(x = data)) +
  geom_density() +
  # use reorder() to order the panels by those numbers
  facet_wrap(~reorder(label, label_order))
A note about str_extract(): it works on your example because
str_extract("SRC01", "\\d+") gives "01", which is then converted to 1. But
str_extract("2SRC01", "\\d+") would return "2", which is probably not what you want.
Luckily there are plenty of ways to use regex to extract what you need.
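For instance, anchoring the digits to the end of the string avoids the "2SRC01" problem. The sketch below uses base R only; `trailing_digits` is a hypothetical helper name, and it assumes every src value ends in digits. It also shows why reorder() copes with several rows per label: it summarises the ordering variable per level (mean by default).

```r
src <- rep(c("SRC03", "SRC04", "SRC01", "SRC02"), 2)
foo <- ifelse(src %in% c("SRC01", "SRC02"), "foo", "bar")
label <- paste(foo, src)

# Hypothetical helper: take only the digits at the end of the string,
# so "2SRC01" gives "01" rather than "2"
trailing_digits <- function(x) regmatches(x, regexpr("[0-9]+$", x))
label_order <- as.numeric(trailing_digits(src))

# reorder() summarises label_order per label (mean by default), so several
# rows per label are fine
label_f <- reorder(label, label_order)
levels(label_f)
```

The resulting factor can be used directly in facet_wrap(~label_f).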

Using paste to create logical expression for data frame subset

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(
  RateeGender = sample(c("Male", "Female"), size = 1500, replace = TRUE),
  RateeAgeGroup = sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
  Relationship = sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
  X = rnorm(n = 1500, mean = 0, sd = 1),
  y = rnorm(n = 1500, mean = 0, sd = 1),
  z = rnorm(n = 1500, mean = 0, sd = 1)
)
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
                      RateeAgeGroup = c("18-39", "40-49", "50+"),
                      Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels):
for (i in 1:nrow(remove)) {
  selection <- character()
  # For each column of remove (particular selection):
  for (j in 1:(ncol(remove) - 1)) {
    add <- paste0("dat$", names(remove)[j], ' == "', remove[i, j], '" & ')
    selection <- paste0(selection, add)
  }
  selection <- sub(' & $', '', selection) # Remove trailing ampersand
  cat(selection, sep = "\n") # What does selection string look like?
  tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", and if I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct", ], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!
You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
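To make the first point concrete: a character string passed to `[` is treated as a row name, not as code, which is why the loop returns rows of NA. The sketch below, on a small made-up data frame with hypothetical column names `g` and `a`, shows the eval(parse()) route mentioned above and a string-free alternative that builds the logical index directly.

```r
set.seed(1)
dat <- data.frame(
  g = sample(c("Male", "Female"), 20, replace = TRUE),
  a = sample(c("18-39", "40-49"), 20, replace = TRUE),
  x = rnorm(20),
  stringsAsFactors = FALSE
)

sel <- 'dat$g == "Male" & dat$a == "18-39"'

# A string index is looked up as a row name; no such row exists, so NA comes back
na_row <- dat[sel, ]

# Option 1: parse and evaluate the whole expression
idx <- eval(parse(text = sel))
by_eval <- dat[idx, ]

# Option 2 (no strings): compare each key column to its wanted value,
# then AND the resulting logical vectors together
want <- list(g = "Male", a = "18-39")
idx2 <- Reduce(`&`, Map(function(col, val) dat[[col]] == val, names(want), want))

identical(idx, idx2)
```

Option 2 generalises to a row of remove by taking `want` from that row's factor columns.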
I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is an elegant solution to a different question, so I'm not reversing my vote. You could also use a similar split strategy, as illustrated below, with that data.frame as the input.)
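# Split dat into one data frame per combination of factor levels
groups <- split(dat, with(dat, paste(RateeGender, RateeAgeGroup,
                                     Relationship, sep = "_")))
sub <- lapply(groups, function(d) {
  # Look up how many cases to draw for this combination
  n <- with(remove, remove[RateeGender == d$RateeGender[1] &
                           RateeAgeGroup == d$RateeAgeGroup[1] &
                           Relationship == d$Relationship[1],
                           "cases"])
  cat(n)
  # Sample rows with replacement (sample(d, ...) would sample columns)
  d[sample(nrow(d), n, replace = TRUE), ]
})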
