I am trying to form a cross table for two items in my data frame, but they are not conveniently laid out in two columns; rather, they are values inside the columns that have to be filtered out before I can build the cross table.
e.g.
column titles: Gender, Favourite Fruit
column 1: F,M,M,M,F,M,F,M,M,F
column 2: apple, pear, pear, grapes, apple, banana, peach, apple, pear, grapes
I would like to make a cross-table for female and apple, to see if there is a relationship. How should I go about doing this?
Thank you!
Emmy
There are lots of ways to do this, but the workhorse is the table() function.
Here is some fake data:
set.seed(123)
df <- data.frame(gender = sample(c("M", "F"), 1000, replace = T ),
fruit = sample(c("apple", "grapes", "banana", "pear"), 1000, replace = T) )
The table() function is a great way to create cross tabulations. For example:
table(df)
      fruit
gender apple banana grapes pear
     F   134    122    128  109
     M   114    131    127  135
You can do a lot with this function. To get something like what you want, you can create named logical vectors right in the arguments of the function.
table(Female = df$gender == "F", Apple = df$fruit == "apple")
       Apple
Female  FALSE TRUE
  FALSE   393  114
  TRUE    359  134
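Since the question asks whether there is a relationship, one possible next step is to pass the 2x2 table to a test of independence. The sketch below rebuilds the same fake data and uses chisq.test() from base R's stats package; the variable names tab and res are only illustrative, and whether a chi-squared test suits the real data is an assumption.

```r
# Rebuild the fake data from the answer above
set.seed(123)
df <- data.frame(gender = sample(c("M", "F"), 1000, replace = TRUE),
                 fruit  = sample(c("apple", "grapes", "banana", "pear"), 1000, replace = TRUE))

# 2x2 table: is the respondent female, and is the favourite fruit apple?
tab <- table(Female = df$gender == "F", Apple = df$fruit == "apple")

# Chi-squared test of independence (for 2x2 tables, chisq.test applies
# Yates' continuity correction by default)
res <- chisq.test(tab)
res$p.value
```

A small p-value would suggest that being female and preferring apples are not independent. For small counts, fisher.test(tab) is a common alternative.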
Related
I have a data frame, df, having one column of different names. I have variable data frames, e.g. search_df or search_df1 containing search words which I would like to search via regex in the name column.
If the word has been found write it into a new column, e.g. df_final$which_word_search_df.
If more than one word has been found I would like to paste the results together.
The result should look like df_final.
# load packages
pacman::p_load(tidyverse)
# words I would like to search for
search_df <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))
# data frame which is the basis for my search
df <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"))
# how I expect the final result to look like
df_final <- data.frame(name = c("apple123", "applepeach", "peachtime", "peachab", "bananarrr", "bananaxy"),
which_word_search_df = c("apple", "apple; peach", "peach", "peach", NA, NA),
which_word_search_df1 = c(NA, NA, "peach", "peach", "banana", "banana"))
That is my current solution but as you can see it is not dynamic. I type in manually every search word instead of automatically going through all the search words.
df_trial <- df %>%
mutate(which_search_word_trial = ifelse(grepl("apple", name, ignore.case = T), "apple", ""),
which_search_word_trial = ifelse(grepl("peach", name, ignore.case = T),
paste(which_search_word_trial, "peach", sep = ";"), which_search_word_trial)
)
The example I am sharing is just a minimal one. For the actual use case df will have ~200k rows and my search_df will have ~1k rows.
We can do the following.
library(dplyr)
library(stringr)
df %>%
mutate(which_word_search_df = str_extract_all(name,str_c(search_df$search_words, collapse = '|')),
which_word_search_df1 = str_extract_all(name, str_c(search_df1$search_words, collapse = '|')))
# name which_word_search_df which_word_search_df1
# 1 apple123 apple
# 2 applepeach apple, peach peach
# 3 peachtime peach peach
# 4 peachab peach peach
# 5 bananarrr banana
# 6 bananaxy banana
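The columns above are list columns. To get plain character columns in the "apple; peach" form the question asks for, with NA where nothing matched, the matches can be collapsed per row. This is a minimal sketch that uses base R's gregexpr() and regmatches() in place of str_extract_all(); the helper name collapse_matches is hypothetical, and it assumes the search words contain no regex metacharacters.

```r
df <- data.frame(name = c("apple123", "applepeach", "peachtime",
                          "peachab", "bananarrr", "bananaxy"))
search_df  <- data.frame(search_words = c("apple", "peach"))
search_df1 <- data.frame(search_words = c("strawberry", "peach", "banana"))

# Hypothetical helper: find every search word in each element of x and
# paste the distinct hits together; rows without a hit become NA
collapse_matches <- function(x, words) {
  pattern <- paste(words, collapse = "|")
  hits <- regmatches(x, gregexpr(pattern, x))
  vapply(hits, function(h) {
    if (length(h) == 0) NA_character_ else paste(unique(h), collapse = "; ")
  }, character(1))
}

df$which_word_search_df  <- collapse_matches(df$name, search_df$search_words)
df$which_word_search_df1 <- collapse_matches(df$name, search_df1$search_words)
df
```

Because all words are combined into one alternation, each name is scanned once per search data frame rather than once per word.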
Using your df as input (not the df_final): Here is an "automatic" way to do it by providing the name of the search dataframes:
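n <- c('search_df', 'search_df1')
for (i in n) {
  # use the full column name; search_word only worked through partial matching of $
  words <- get(i)$search_words
  # for each search word, find the rows of df$name that contain it
  a <- lapply(words, function(j) grep(j, df$name))
  # stack into a data frame: values = row index, ind = search word
  a <- stack(setNames(a, words))
  df[, paste0('which_word_', i)] <- NA
  df[a$values, paste0('which_word_', i)] <- as.character(a$ind)
}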
The output is directly stored in df but you can change this easily by copying df to final_df and then use this one in the two last lines.
output:
name which_word_search_df which_word_search_df1
1 apple123 apple <NA>
2 applebum apple <NA>
3 peachtime peach peach
4 peachab peach peach
5 bananarrr <NA> banana
6 bananaxy <NA> banana
Lemme know if it worked for you
I have a dataframe containing a column of strings, and I want to use filter() (or another pipeable function) to return only rows containing strings that contain any of the values in another vector of strings. I have looked at previous questions and answers but can't find anything that's quite what I'm looking for.
For example:
title <- c("apple pie", "fish pie", "peach strudel", "banana split", "chocolate cake", "pasta", "peaches and cream", "baked apples")
recipes <- data.frame(cbind(c(1:8), title))
fruits <- c("apple", "banana", "peach", "orange")
How do I filter recipes to return only the rows in which recipes$title contains anything from fruits?
We can use str_detect with filter after creating a single string from 'fruits' collapsed by | (OR)
library(dplyr)
library(stringr)
recipes %>%
filter(str_detect(title, str_c(fruits, collapse="|")))
# V1 title
#1 1 apple pie
#2 3 peach strudel
#3 4 banana split
#4 7 peaches and cream
#5 8 baked apples
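The same idea also works without any packages. As a rough sketch, base R's grepl() accepts the collapsed pattern directly; here the first column is named id for readability, which is an assumption and not part of the original data, and the fruit names are assumed to contain no regex metacharacters.

```r
title <- c("apple pie", "fish pie", "peach strudel", "banana split",
           "chocolate cake", "pasta", "peaches and cream", "baked apples")
recipes <- data.frame(id = 1:8, title = title)
fruits <- c("apple", "banana", "peach", "orange")

# Collapse the vector into one regex: "apple|banana|peach|orange"
pattern <- paste(fruits, collapse = "|")

# Keep only the rows whose title matches any of the fruits
fruit_recipes <- recipes[grepl(pattern, recipes$title), ]
fruit_recipes
```

Note that this is substring matching, so "peaches" and "apples" are kept because they contain "peach" and "apple".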
I have a dataframe df like this:
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"))
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
esp = c("manzana", "naranja", "uva"))
>df
fruits
apple
orange
pineapple
banana
grape
>df_rep
eng esp
apple manzana
orange naranja
grape uva
I want to replace the values in the fruits column of df using df_rep. If a value in the fruits column of df appears in the eng column of df_rep, I want to replace it with the corresponding value in the esp column of df_rep. So the result should look like this:
>df
fruits
manzana
naranja
pineapple
banana
uva
Point: I don't want to use ifelse, as my real data frame has more than 100 replacement pairs; the example here is simplified for easy understanding. I also don't want a for loop, as my data frame contains more than 40,000 rows. I am looking for a simple, single-step solution.
Thank you very much for your help!
We can use the merge function (to simulate a SQL left join) and then the ifelse function to replace the fruits with non-NA esp values:
df2 <- merge(df, df_rep, by.x = 'fruits', by.y = 'eng', all.x = TRUE)
df2$fruits <- ifelse(is.na(df2$esp), df2$fruits, df2$esp)
# fruits esp
# 1 manzana manzana
# 2 banana <NA>
# 3 uva uva
# 4 naranja naranja
# 5 pineapple <NA>
Data
It's important to set stringsAsFactors = FALSE when creating the data:
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"),
stringsAsFactors = FALSE)
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
esp = c("manzana", "naranja", "uva"),
stringsAsFactors = FALSE)
Another option is coalesce from dplyr to replace the NAs that result from match with the respective values from df$fruits.
library(dplyr)
df$fruits2 <- coalesce(df_rep$esp[match(df$fruits, df_rep$eng)], df$fruits)
df
# fruits fruits2
#1 apple manzana
#2 orange naranja
#3 pineapple pineapple
#4 banana banana
#5 grape uva
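If dplyr is not available, the match() lookup can be combined with a single vectorised ifelse() call in base R. This is a sketch under the assumption that both columns are character vectors (hence stringsAsFactors = FALSE); unlike merge(), it keeps the original row order.

```r
df <- data.frame(fruits = c("apple", "orange", "pineapple", "banana", "grape"),
                 stringsAsFactors = FALSE)
df_rep <- data.frame(eng = c("apple", "orange", "grape"),
                     esp = c("manzana", "naranja", "uva"),
                     stringsAsFactors = FALSE)

# Position of each fruit in the lookup table, NA if it is not listed
idx <- match(df$fruits, df_rep$eng)

# Take the translation where one exists, otherwise keep the original value
df$fruits <- ifelse(is.na(idx), df$fruits, df_rep$esp[idx])
df
```

The single ifelse() here is applied once to the whole column, so the number of replacement pairs in df_rep does not change the code.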
I would like to create a new column in my dataframe that corresponds to values in list variables.
My dataframe includes many rows with a 'product names' column. My intention is to create a new column that allows me to sort products into categories.
Sample code -
library(dplyr)
products <- c('Apple', 'orange', 'pear',
'carrot', 'cabbage',
'strawberry', 'blueberry')
df <- data.frame(products)
ls <- list(Fruit = c('Apple', 'orange', 'pear'),
Veg = c('carrot', 'cabbage'),
Berry = c('strawberry', 'blueberry'))
test <- df %>%
mutate(category = products %in% ls)
I hope that illustrates what I'm trying to do. By creating the list, I've basically got a register of products and their categories which could change over time.
Is there a solution to this using a list, or am I over-complicating it and not seeing the wood for the trees?
edit - It might help to let you know that I'm working with 100s of products.
stack the list and then join with the data frame:
df %>%
left_join(stack(ls), by = c('products' = 'values')) %>%
rename(category = ind)
# products category
#1 Apple Fruit
#2 orange Fruit
#3 pear Fruit
#4 carrot Veg
#5 cabbage Veg
#6 strawberry Berry
#7 blueberry Berry
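To see why this works, it can help to look at what stack() produces: a two-column data frame with the list elements in values and the list names in ind. The sketch below does the same lookup with base R's match() instead of left_join(); the name lookup is only illustrative.

```r
products <- c('Apple', 'orange', 'pear',
              'carrot', 'cabbage',
              'strawberry', 'blueberry')
df <- data.frame(products)
ls <- list(Fruit = c('Apple', 'orange', 'pear'),
           Veg = c('carrot', 'cabbage'),
           Berry = c('strawberry', 'blueberry'))

# One row per product: column "values" holds the product,
# column "ind" holds the name of the list element it came from
lookup <- stack(ls)

# Look up each product's category; unlisted products would get NA
df$category <- as.character(lookup$ind[match(df$products, lookup$values)])
df
```

When the register of products changes, only the list needs updating; the lookup code stays the same.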
I just do not understand why my data frame keeps being a factor even if I try to change it to a character.
Here I have a data frame list_attribute$affrete.
affrete
<chr>
Fruits
Apple
Grape Fruits
Lemon
Peach
Banana
Orange
Strawberry
Apple
I then applied a function to replace some values in list_attribute$affrete with other values, using another data frame, renaming, which has two columns (Name and Rename).
affrete <- plyr::mapvalues(x = unlist(list_attribute$affrete, use.names = F),
from = as.character(renaming$Name),
to = as.character(renaming$Rename))
affrete <- as.character(affrete)
list_attribute$affrete <- data.frame(affrete)
The data frame renaming looks like this;
Name Rename
<fctr> <fctr>
Apple Manzana
Orange Naranja
Lemon Limon
Grape Uva
Peach Melocoton
Pinapple Anana
And here is list_attribute$affrete after applying these processes above.
affrete
<fctr>
Manzana
Grape Fruits
Limon
Melocoton
Banana
Naranja
Strawberry
Manzana
Why is this column still a factor? I tried the methods discussed here, but none of them work. Why? I'd appreciate any help!
By default (before R 4.0.0), data.frame() has the argument stringsAsFactors = TRUE, so calling data.frame(affrete) converts characters to factors. You can either:
Call data.frame(affrete, stringsAsFactors = FALSE) instead
Set this behaviour off permanently for your session with options(stringsAsFactors = FALSE)
Fix after the fact once it's already in the list with list_attribute$affrete$affrete <- as.character(list_attribute$affrete$affrete)
Use tbls from the tidyverse, so call tibble(affrete) instead. These never convert characters, among other benefits.
I think the problem comes from list_attribute$affrete <- data.frame(affrete): the default behavior of data.frame() is stringsAsFactors = TRUE.
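The effect is easy to check on a small vector. In this sketch the argument is set explicitly in both calls, so the result does not depend on the R version's default; the names as_factor_df and as_char_df are only illustrative.

```r
affrete <- c("Manzana", "Grape Fruits", "Limon")

# Strings converted to a factor column
as_factor_df <- data.frame(affrete, stringsAsFactors = TRUE)
class(as_factor_df$affrete)   # "factor"

# Strings kept as a character column
as_char_df <- data.frame(affrete, stringsAsFactors = FALSE)
class(as_char_df$affrete)     # "character"

# Converting an existing factor column back to character
as_factor_df$affrete <- as.character(as_factor_df$affrete)
class(as_factor_df$affrete)   # "character"
```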