How to aggregate data several times on the same table

I am working in R and have a list with 3 columns:
Fruit Drawer Amount
Banana Top 1
Peach Top 2
Apple Top 3
Banana Mid 4
Peach Mid 5
Apple Mid 6
Banana Bottom 7
Peach Bottom 8
Apple Bottom 9
and I want to compute the ratio of each fruit type (e.g. bananas) in each drawer (e.g. Top) to the total amount of that fruit (all the bananas).
I am using table:
x <- table(fruits)
but I get a type of data that I don't know how to work with.
Ultimately I want to get "bananas per drawer" divided by the "total bananas" in all the drawers. I guess I could do it column by column but I am sure there are better ways to go about it. Any suggestion?
Sorry for any etiquette mishaps, I haven't been programming for long.
Thanks.

Do you want something like this:
library(dplyr)
fruit__drawer =
"Fruit Drawer Amount
Banana Top 1
Peach Top 2
Apple Top 3
Banana Mid 4
Peach Mid 5
Apple Mid 6
Banana Bottom 7
Peach Bottom 8
Apple Bottom 9" %>%
  read.table(text = ., header = TRUE)
# Total per fruit, and each fruit's share of all fruit
fruit =
  fruit__drawer %>%
  group_by(Fruit) %>%
  summarize(Amount.fruit = sum(Amount)) %>%
  mutate(Proportion.overall = Amount.fruit / sum(Amount.fruit))
# Share of each fruit within its drawer, relative to its overall share
result =
  fruit__drawer %>%
  left_join(fruit, by = "Fruit") %>%
  group_by(Drawer) %>%
  mutate(Proportion = Amount / sum(Amount),
         Proportion.ratio = Proportion / Proportion.overall)
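If the goal is literally "bananas per drawer" divided by "total bananas", a shorter route may be enough. The following is a minimal base R sketch, not part of the answer above; the data frame name fruits comes from the question, while the column name Share and the object share_wide are assumptions introduced here.

```r
# Example data from the question
fruits <- data.frame(
  Fruit  = rep(c("Banana", "Peach", "Apple"), times = 3),
  Drawer = rep(c("Top", "Mid", "Bottom"), each = 3),
  Amount = 1:9,
  stringsAsFactors = FALSE
)

# ave() returns the per-fruit total, repeated on every row of that fruit
fruit_total <- ave(fruits$Amount, fruits$Fruit, FUN = sum)

# Share (hypothetical column name): amount in this drawer / total of that fruit
fruits$Share <- fruits$Amount / fruit_total

# Wide view: one row per fruit, one column per drawer
share_wide <- xtabs(Share ~ Fruit + Drawer, data = fruits)
print(round(share_wide, 3))
```

Each row of share_wide sums to 1, because every fruit's amounts are divided by that fruit's own total.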

Related

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code runs, but match() returns only the first matching value for each key, so the rest are ignored.

Another solution using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles

I don't think you even need df1 in this case; you can do it all based on df2:
df1 <- df2 %>% group_by(`<V1>`) %>% summarise(`<V2>`=paste0(`<V2>`, collapse = ", "))
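If df1 does need to be kept (for example because only some of its rows are missing), the collapsed values can be looked up by key. This is a hedged base R sketch assuming the columns are plainly named V1 and V2; the helper names collapsed and is_missing are introduced here for illustration.

```r
df1 <- data.frame(V1 = c("apple", "cheese", "butter"),
                  V2 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(V1 = rep(c("apple", "cheese", "butter"), times = 3),
                  V2 = c("jacks", "whiz", "scotch",
                         "turnover", "sliders", "chicken",
                         "sauce", "doodles", "milk"),
                  stringsAsFactors = FALSE)

# One comma-separated string per key; the result is named by key
collapsed <- tapply(df2$V2, df2$V1, paste, collapse = ", ")

# Fill only the rows of df1 that are still missing, looking up by name
is_missing <- is.na(df1$V2)
df1$V2[is_missing] <- as.vector(collapsed[df1$V1[is_missing]])
print(df1)
```

Because the lookup is by name, the row order of df1 is preserved even though tapply() sorts its groups alphabetically.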

How to change values of a column according to some conditions?

I have a data frame named data1.
> data1 <- data.frame(name = c("apple","apple","pine","pine",
"apple","apple", "pine","pine","banana","banana"),
characters = c("red","green","yellow","brown",
"big","sweet","delicious","medium","soft", "long"))
> data1
name characters
1 apple red
2 apple green
3 pine yellow
4 pine brown
5 apple big
6 apple sweet
7 pine delicious
8 pine medium
9 banana soft
10 banana long
And I want to change identical values of the name variable to different values according to the values of the characters column.
Just like data2:
> data2
name characters
1 colapple red
2 colapple green
3 colorpine yellow
4 colorpine brown
5 othapple big
6 othapple sweet
7 despine delicious
8 despine medium
9 banana soft
10 banana long
In fact, data1 is very big, and I need to change identical values in data1$name into specific values, so I need a general way to do it. I tried to use an if statement, but I got errors. How can I do it?

As I said in a comment on the question, I don't see the relation between the two columns. Aren't the prefixes changing by groups of the first column?
If so, the code below will do what the question asks for. It creates an index k with a standard R cumsum trick, then pastes the prefixes selected by k onto column data1$name.
pref <- c("col", "color", "oth", "des")
k <- cumsum(c(1, abs(diff(data1$name == "apple")) > 0))
data2 <- data.frame(name = paste0(pref[k], data1$name),
characters = data1$characters)
data2
# name characters
#1 colapple red
#2 colapple green
#3 colapple white
#4 colorpine yellow
#5 colorpine brown
#6 colorpine black
#7 othapple big
#8 othapple sweet
#9 othapple small
#10 despine delicious
#11 despine medium
#12 despine ache
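The cumsum trick above tests against "apple" only, so it relies on the names alternating between apple and one other value. A sketch of a variant that starts a new group whenever the name changes, whatever the names are, follows; starts_run, run_id and the trailing empty prefix are assumptions made here to reproduce the data2 shown in the question.

```r
name <- c("apple", "apple", "pine", "pine",
          "apple", "apple", "pine", "pine", "banana", "banana")

# TRUE wherever a value differs from the one before it (the first row starts run 1)
starts_run <- c(TRUE, name[-1] != name[-length(name)])
run_id <- cumsum(starts_run)

# One prefix per run, in order of appearance; the last run keeps its name as-is
pref <- c("col", "color", "oth", "des", "")
new_name <- paste0(pref[run_id], name)
print(new_name)
```

The prefix vector still has to be supplied by hand, one entry per run, since the question does not state a rule that derives the prefixes from the data.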
Edit
With the new data set posted after the answer and following the discussion in comments, here is a solution with setNames and match.
pref3 <- c(rep("col", 2), rep("color", 2), rep("oth", 2), rep("des", 2), rep("", 2))
pref3 <- setNames(pref3, data3$characters)
k <- match(data3$characters, names(pref3))
data3$name <- paste0(pref3[k], data3$name)
Data
data1 <- data.frame(name = c("apple","apple","apple", "pine","pine","pine",
"apple","apple","apple", "pine","pine","pine"),
characters = c("red","green","white","yellow","brown","black",
"big","sweet","small","delicious","medium","ache"))
data3 <- data.frame(name = c("apple","apple","pine","pine",
"apple","apple", "pine","pine","banana","banana"),
characters = c("red","green","yellow","brown",
"big","sweet","delicious","medium","soft", "long"))

How to create a new column based on the object name given in another column and derive a further column from it

I have the data below, which I imported from Excel.
Item_code Price Raw_Material
1. 10001jk10002 20 Made with Apple
2. 10001jk10002 20 Made with Grapes
3. 10001jk10002 30 Made with Banana
4. 10011jk10022 60 Made with Grapes
5. 10011jk10022 60 Made with Grapes
The result I am looking for, with a new column:
Item_code Price Raw_Material Fruit Used
1. 10001jk10002 20 Made with Apple Apple
2. 10001jk10002 20 Made with Grapes Grapes
3. 10001jk10002 30 Made with Banana Banana
4. 10011jk10022 60 Made with Grapes Grapes
5. 10011jk10022 60 Made with Grapes Grapes
From the new column I want to derive one more new column, 'Final Fruit':
Item_code Price Raw_Material Fruit Used Final Fruit
1. 10001jk10002 20 Made with Apple Apple Banana
2. 10001jk10002 20 Made with Grapes Grapes Banana
3. 10001jk10002 30 Made with Banana Banana Banana
4. 10011jk10022 60 Made with Grapes Grapes Grapes
5. 10011jk10022 60 Made with Grapes Grapes Grapes
As you can see, my first 3 rows share the same Item_code. First I want to derive the Fruit Used column from the Raw_Material column; the fruit name appears somewhere in the sentence (its position can be random). Then I want to derive another column, Final Fruit, from the Fruit Used column: no matter which fruits appear in the other rows, I want to return Banana in the new column.
The actual list of preferred fruits goes up to 10, so I am looking for a dynamic solution. Can anyone suggest how I can get the desired result?

We can extract the last word with
library(stringi)
df1$Fruite_used <- stri_extract_last(df1$Raw_Material, regex = "\\w+")

library(readxl)
library(dplyr)
library(magrittr)
library(stringr)
fruity <- read_excel("fruity.xlsx")
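# Number the rows within each item and take the last word of raw_material
fruity <- fruity %>%
  group_by(item_code) %>%
  mutate(id = row_number()) %>%
  mutate(fruit_used = word(raw_material, -1))
# Keep the last row of each item; its fruit becomes final_fruit
tmp <- fruity %>%
  group_by(item_code) %>%
  top_n(1, id) %>%
  select(item_code, fruit_used) %>%
  set_colnames(c('item_code', 'final_fruit'))
# Attach final_fruit to every row and drop the helper column
fruity <- fruity %>%
  left_join(tmp, by = 'item_code') %>%
  select(-"id")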
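The question mentions a list of preferred fruits, which neither answer uses directly. Below is a rough base R sketch of one way to read that requirement: for each Item_code, pick the fruit that ranks highest in a priority vector. The vector priority and its ordering are assumptions for illustration (the question does not give the actual list), as are the column names Fruit_Used and Final_Fruit.

```r
df <- data.frame(
  Item_code    = c("10001jk10002", "10001jk10002", "10001jk10002",
                   "10011jk10022", "10011jk10022"),
  Price        = c(20, 20, 30, 60, 60),
  Raw_Material = c("Made with Apple", "Made with Grapes", "Made with Banana",
                   "Made with Grapes", "Made with Grapes"),
  stringsAsFactors = FALSE
)

# Hypothetical preference order, most preferred first
priority <- c("Banana", "Grapes", "Apple")

# Find the first known fruit anywhere in the sentence (whole words only)
pattern <- paste0("\\b(", paste(priority, collapse = "|"), ")\\b")
m <- regexpr(pattern, df$Raw_Material, perl = TRUE)
df$Fruit_Used <- NA_character_
df$Fruit_Used[m > 0] <- regmatches(df$Raw_Material, m)

# Rank each fruit by its position in priority, then take the best rank per item
fruit_rank <- match(df$Fruit_Used, priority)
best_rank <- ave(fruit_rank, df$Item_code, FUN = min)
df$Final_Fruit <- priority[best_rank]
print(df)
```

Extending the preference list to 10 fruits only means lengthening priority. Rows whose sentence contains no listed fruit would need extra handling, which this sketch leaves out.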

Filtering for multiple strings within the same column in r

My large data set (Groceries) has a column in it containing character data (Fruits) all of which is lower case and all of which contains no punctuation.
It looks a bit like this:
# Groceries Data Frame
Id Groceries$Fruits
1 apple orange banana lemon grapefruit
2 grapes tomato passion fruit
3 strawberry orange kiwi
4 lemon orange passion fruit grapefruit lime
5 lemon orange passion fruit grapefruit lime peach
...
I'm trying to select all the rows (of which there are 3,320) from the Fruits column that contain 5 specific fruits (orange, lime, lemon, grapefruit & passion fruit). Initially I'm only interested in the rows that contain all 5 of these fruits and no additional Fruits. Thus, the only row out of these 5 that should be filtered/subsetted would be row 4. The fruits do not have to be in any particular order.
The data is actually answers to a test, so eventually I'm interested in determining who got 0/5 fruits, who got 1/5, 2/5 and so on...
I've tried 2 methods so far, both to no avail.
Firstly I tried using grep(), but no rows were stored in the resulting data frame.
# 1st attempt with grep()
CorrectFruits <- Groceries[grep("orange, lemon, lime, passion fruit, grapefruit", Groceries$Fruits), ]
Then I tried using filter(), but the selected rows don't contain just the 5 fruits I'm looking for; it selects all rows that contain any of the 5 fruits.
# 2nd attempt with filter
library(dplyr)
library(stringr)
CorrectFruits <- c("lemon", "orange", "passion fruit", "grapefruit",
"lime")
filter <- Groceries %>%
select(Id, Fruits) %>%
filter(str_detect(tolower(Fruits), pattern = CorrectFruits))
The result I'm after initially is a new DF containing all the columns in the Groceries table, but only the rows of those people who got all 5 of the chosen fruits correct.
Next, it would be cool to select the opposite; everyone who didn't get all 5 correct.
Finally, I'd love to be able to subset those who got a specific proportion correct. I.e. row 1 got 3 correct, row 2 only got 1 correct and row 3 only got 1 correct.
Any help would be greatly appreciated!
Here's an example of what some of the columns look like:
# Groceries
Id Age Nationality Colour question Fruits question
1 26-35 Canadian Red apple orange banana lemon grapefruit
2 26-35 US Blue grapes tomato passion fruit
3 46-55 Canadian Red strawberry orange kiwi
4 55+ US Red lemon orange passion fruit grapefruit lime
5 36-45 British Green lemon orange passion fruit grapefruit lime peach

Might need more clarification on what you intend on doing with answers that have all 5 fruits with some extra, but this should help you out. I substituted all instances of "passion fruit" with "passionfruit" to make it easier:
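library(stringr)
# Make "passion fruit" a single word
df$Fruits <- gsub("passion fruit", "passionfruit", df$Fruits)
CorrectFruits <- c("lemon", "orange", "passionfruit", "grapefruit", "lime")
# Count occurrences of any correct fruit in each answer
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
# All 5 correct but extra words present: reset the count to 0
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)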
which gives
ID Fruits Count
1 apple orange banana lemon grapefruit 3
2 grapes tomato passionfruit 1
3 strawberry orange kiwi 1
4 lemon orange passionfruit grapefruit lime 5
5 lemon orange passionfruit grapefruit lime peach 0
The first line does the passionfruit substitution, and then str_count counts all occurrences of correct fruits in df$Fruits. Finally, if all 5 fruits are correct but there are extras, Count resets to 0.

Here is my answer after seeing others' genius solutions.
library(dplyr)
library(stringr)
ID <- c(1:5)
Age <- c("26-35", "26-35", "46-55", "55+", "56-45")
Nationality <- c("Canadian", "US", "Canadian", "US", "British")
Color <- c("Correct", "Incorrect", "Incorrect", "Correct", "Correect")
Fruits <- c("pineapple",
"apple",
"apple orange kiwi fifth",
"orange apple pineapple kiwi fifth",
"pineapple orange apple fifth kiwi"
)
df <- data.frame(ID, Age, Nationality, Color, Fruits)
df
heds1's response looks great. However, you want to be careful with unanchored string matching such as grepl, because it can also match inside compound words. For example, consider the word pineapple; it contains pine and apple. Notice here that searching for apple also returns pineapple.
filter(df, grepl("apple", Fruits))
ID Age Nationality Color Fruits
1 1 26-35 Canadian Correct pineapple
2 2 26-35 US Incorrect apple
3 3 46-55 Canadian Incorrect apple orange kiwi fifth
4 4 55+ US Correct orange apple pineapple kiwi fifth
5 5 56-45 British Correect pineapple orange apple fifth kiwi
The answer provided by sumshyftw is awesome. And I love that I am learning something from sumshyftw. But to demonstrate my point that an unrestrained string search could mess up your count:
CorrectFruits <- c("apple")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 2
5 5 56-45 British Correect pineapple orange apple fifth kiwi 2
Notice that it counted pineapple as a correct answer even though the only correct fruit is apple. To overcome this, wrap your words in \\b (a word boundary).
CorrectFruits <- c("\\bapple\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 0
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 1
4 4 55+ US Correct orange apple pineapple kiwi fifth 1
5 5 56-45 British Correect pineapple orange apple fifth kiwi 1
R no longer counts pineapple as an apple.
But for the record, sumshyftw deserves the credit for working out the hard part in my example:
CorrectFruits <- c("\\bapple\\b", "\\borange\\b", "\\bpineapple\\b", "\\bfifth\\b", "\\bkiwi\\b")
df$Count <- str_count(df$Fruits, paste(CorrectFruits, collapse = '|'))
df$Count <- ifelse((df$Count == 5 & str_count(df$Fruits, '\\w+') > 5), 0, df$Count)
df
ID Age Nationality Color Fruits Count
1 1 26-35 Canadian Correct pineapple 1
2 2 26-35 US Incorrect apple 1
3 3 46-55 Canadian Incorrect apple orange kiwi fifth 4
4 4 55+ US Correct orange apple pineapple kiwi fifth 5
5 5 56-45 British Correect pineapple orange apple fifth kiwi 5
To show only those with all five fruits:
df2 <- filter(df, Count == 5)
df2
ID Age Nationality Color Fruits Count
1 4 55+ US Correct orange apple pineapple kiwi fifth 5
2 5 56-45 British Correect pineapple orange apple fifth kiwi 5
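Applied to the data in the question, the same word-boundary idea can also separate "exactly these five" from "these five plus extras" without resetting the count to 0. This is a sketch in base R rather than stringr; the names fruits, targets, hits, n_correct, has_extra and exact are assumptions introduced here.

```r
fruits <- c("apple orange banana lemon grapefruit",
            "grapes tomato passion fruit",
            "strawberry orange kiwi",
            "lemon orange passion fruit grapefruit lime",
            "lemon orange passion fruit grapefruit lime peach")
targets <- c("lemon", "orange", "passion fruit", "grapefruit", "lime")

# One logical column per target: does the answer contain it as whole word(s)?
hits <- sapply(targets, function(target) {
  grepl(paste0("\\b", target, "\\b"), fruits, perl = TRUE)
})
n_correct <- rowSums(hits)

# Remove every target; anything other than whitespace left over is an extra
leftover <- fruits
for (target in targets) {
  leftover <- gsub(paste0("\\b", target, "\\b"), "", leftover, perl = TRUE)
}
has_extra <- grepl("\\S", leftover)

# Rows with all five targets and nothing else
exact <- n_correct == length(targets) & !has_extra
print(data.frame(n_correct, has_extra, exact))
```

Keeping n_correct and has_extra as separate columns means "all five correct", "all five plus extras" and "the opposite of exact" (!exact) can each be selected without recomputing anything.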

Here's one way using grepl with a target list of keywords.
df <- structure(list(v1 = structure(1:4, .Label = c("row1", "row2",
"row3", "row4"), class = "factor"), v2 = structure(c(2L, 4L,
1L, 3L), .Label = c("another invalid row", "apple banana mandarin orange pear",
"banana apple mandarin pear orange", "not a valid row"), class = "factor")), class = "data.frame", row.names = c(NA,
-4L))
targets <- c("banana", "apple", "orange", "pear", "mandarin")
bool_df <- as.data.frame(sapply(targets, grepl, df$v2))
match_rows <- which(rowSums(bool_df) == 5)
df <- df[match_rows,]
You can then change the criterion for match_rows by changing the 5 to, for example, 4 for four fruit matches.
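For the "who got 0/5, 1/5, 2/5 and so on" part of the question, the row sums can be tabulated directly. A small base R sketch using the example answers from the question follows; the names answers, score, score_counts and by_score are assumptions, and plain substring matching (fixed = TRUE) carries the pineapple/apple caveat described in the previous answer.

```r
answers <- c("apple orange banana lemon grapefruit",
             "grapes tomato passion fruit",
             "strawberry orange kiwi",
             "lemon orange passion fruit grapefruit lime",
             "lemon orange passion fruit grapefruit lime peach")
targets <- c("lemon", "orange", "passion fruit", "grapefruit", "lime")

# Number of target fruits found in each answer
score <- rowSums(sapply(targets, grepl, answers, fixed = TRUE))

# How many answers achieved each possible score, including scores nobody got
score_counts <- table(factor(score, levels = 0:length(targets)))
print(score_counts)

# Row indices grouped by score, e.g. by_score[["5"]] for full marks
by_score <- split(seq_along(answers), score)
```

Wrapping score in factor() with explicit levels keeps the empty categories (0, 2 and 4 here) in the table instead of dropping them.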

How to get the row and col position by hovering over a cell in a data table using R Shiny?

I have data in the following format:
Fruits ID
Apple 1
Orange 2
Banana 3
Melon 4
Guava 5
Mango 6
Avocado 7
Grape 8
I am trying to plot a table of the IDs as shown below with different colors assigned to the cells based on the ID Number.
When the user hovers over a cell, the corresponding fruit should be displayed, preferably in a table or at least as a tooltip.