Count the most frequent word in each row in R [duplicate] - r

I have the table shown below:
Name Mon Tue Wed Thu Fri Sat Sun
1 John Apple Orange Apple Banana Apple Apple Orange
2 Ricky Banana Apple Banana Banana Banana Banana Apple
3 Alex Apple Orange Orange Apple Apple Orange Orange
4 Robbin Apple Apple Apple Apple Apple Banana Banana
5 Sunny Banana Banana Apple Apple Apple Banana Banana
I want to find the most frequent fruit for each person, along with its count, and add both values as new columns.
For example:
Name Mon Tue Wed Thu Fri Sat Sun Max_Acc Count
1 John Apple Orange Apple Banana Apple Apple Orange Apple 4
2 Ricky Banana Apple Banana Banana Banana Banana Apple Banana 5
3 Alex Apple Orange Orange Apple Apple Orange Orange Orange 4
4 Robbin Apple Apple Apple Apple Apple Banana Banana Apple 5
5 Sunny Banana Banana Apple Apple Apple Banana Banana Banana 4
I am having trouble doing this across rows. I can find the frequencies within a column using the table() function:
>table(df$Mon)
Apple Banana
3 2
But here I want the name of the most frequent fruit for each row in a new column.

If we need both the "Count" and the "Names" corresponding to the maximum count, we can loop through the rows of the dataset (using apply with MARGIN = 1), use table to get the frequencies, extract the maximum value and the name corresponding to it, rbind the per-row results, and cbind them to the original dataset. If several fruits tie for the maximum, which.max returns the first one.
cbind(df1, do.call(rbind, apply(df1[-1], 1, function(x) {
  x1 <- table(x)
  data.frame(Count = max(x1), Names = names(x1)[which.max(x1)])
})))
# Name Mon Tue Wed Thu Fri Sat Sun Count Names
#1 John Apple Orange Apple Banana Apple Apple Orange 4 Apple
#2 Ricky Banana Apple Banana Banana Banana Banana Apple 5 Banana
#3 Alex Apple Orange Orange Apple Apple Orange Orange 4 Orange
#4 Robbin Apple Apple Apple Apple Apple Banana Banana 5 Apple
#5 Sunny Banana Banana Apple Apple Apple Banana Banana 4 Banana
Or we can use data.table:
library(data.table)
setDT(df1)[, c("Names", "Count") := {
  tbl <- table(unlist(.SD))
  .(names(tbl)[which.max(tbl)], max(tbl))
}, by = Name]
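Both snippets above assume a data frame named df1 that the question never defines. As a minimal sketch, assuming the columns are plain character vectors, the data can be rebuilt from the table in the question and the row-wise table() idea checked end to end. The vapply-based variant below is an illustrative alternative to the apply/rbind code, not the answer's own method, and row_tables is a hypothetical helper name.

```r
# Rebuild the example data from the question (assumption: character columns).
df1 <- data.frame(
  Name = c("John", "Ricky", "Alex", "Robbin", "Sunny"),
  Mon  = c("Apple", "Banana", "Apple", "Apple", "Banana"),
  Tue  = c("Orange", "Apple", "Orange", "Apple", "Banana"),
  Wed  = c("Apple", "Banana", "Orange", "Apple", "Apple"),
  Thu  = c("Banana", "Banana", "Apple", "Apple", "Apple"),
  Fri  = c("Apple", "Banana", "Apple", "Apple", "Apple"),
  Sat  = c("Apple", "Banana", "Orange", "Banana", "Banana"),
  Sun  = c("Orange", "Apple", "Orange", "Banana", "Banana"),
  stringsAsFactors = FALSE
)

# One frequency table per row, computed on the weekday columns only.
days <- as.matrix(df1[-1])
row_tables <- lapply(seq_len(nrow(days)), function(i) table(days[i, ]))

# Name and count of the most frequent value; ties go to the first entry.
df1$Names <- vapply(row_tables, function(tb) names(tb)[which.max(tb)], character(1))
df1$Count <- vapply(row_tables, function(tb) as.integer(max(tb)), integer(1))

df1[c("Name", "Names", "Count")]
```

vapply is used here only because it enforces the type of each per-row result; the logic is otherwise the same table/which.max combination.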

Another approach is to loop over the unique fruits and count, for each one, how often it occurs in every row:
fruits_unique <- unique(unlist(dat[-1]))
occurrence <- sapply(fruits_unique, function(x) rowSums(dat[, -1] == x))
# Use these counts to create the resulting columns
ind <- apply(occurrence, 1, which.max)
dat$Names <- fruits_unique[ind]
dat$Count <- occurrence[cbind(seq_along(ind), ind)]
Result:
Name Mon Tue Wed Thu Fri Sat Sun Names Count
1 John Apple Orange Apple Banana Apple Apple Orange Apple 4
2 Ricky Banana Apple Banana Banana Banana Banana Apple Banana 5
3 Alex Apple Orange Orange Apple Apple Orange Orange Orange 4
4 Robbin Apple Apple Apple Apple Apple Banana Banana Apple 5
5 Sunny Banana Banana Apple Apple Apple Banana Banana Banana 4
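The count lookup in the code above relies on indexing a matrix with a two-column matrix of (row, column) pairs, which may not be obvious at first glance. The sketch below isolates that step on a small made-up count matrix; the numbers and the names m and picked are illustrative assumptions, not data from the question.

```r
# Made-up counts: one row per person, one column per fruit.
m <- matrix(c(4, 2, 1,
              2, 5, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("Apple", "Banana", "Orange")))

# Column position of the maximum in each row.
ind <- apply(m, 1, which.max)

# A two-column index matrix selects one cell per row: m[1, ind[1]], m[2, ind[2]].
picked <- m[cbind(seq_along(ind), ind)]

colnames(m)[ind]  # most frequent fruit per row
picked            # its count
```

Indexing with m[seq_along(ind), ind] instead would return a whole submatrix, which is why the cbind form is needed.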

Related

Collapse rows with same identifier and columns and retain all values in r

I have a dataframe that contains several fields related to an identifier but some are disjointed:
id store manager fruit vegetable
1 Grocery1 Joe apple NA
1 Grocery1 Joe lemon NA
1 Grocery1 Joe NA zucchini
2 Grocery2 Amy orange NA
2 Grocery2 Amy NA asparagus
2 Grocery2 Amy NA spinach
3 Grocery3 Bill NA NA
I want the dataframe to look like:
id store manager fruit vegetable
1 Grocery1 Joe apple zucchini
1 Grocery1 Joe lemon zucchini
2 Grocery2 Amy orange asparagus
2 Grocery2 Amy orange spinach
3 Grocery3 Bill NA NA
Is there a way to easily do this?
You can use tidyr::fill to fill the NA values within each group, and then keep only the non-duplicated rows using distinct.
library(dplyr)
library(tidyr)
df %>%
  group_by(store, manager) %>%
  fill(fruit, vegetable, .direction = "updown") %>%
  distinct()
# A tibble: 5 × 5
# Groups: store, manager [3]
id store manager fruit vegetable
<int> <chr> <chr> <chr> <chr>
1 1 Grocery1 Joe apple zucchini
2 1 Grocery1 Joe lemon zucchini
3 2 Grocery2 Amy orange asparagus
4 2 Grocery2 Amy orange spinach
5 3 Grocery3 Bill NA NA
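To see why distinct is needed, it can help to look at the intermediate result after fill. The sketch below rebuilds the question's data (assuming character columns and an integer id) and keeps the filled table in a separate variable; filled and result are hypothetical names introduced for illustration.

```r
library(dplyr)
library(tidyr)

# Example data rebuilt from the question (assumption: NA marks missing values).
df <- data.frame(
  id        = c(1L, 1L, 1L, 2L, 2L, 2L, 3L),
  store     = c("Grocery1", "Grocery1", "Grocery1",
                "Grocery2", "Grocery2", "Grocery2", "Grocery3"),
  manager   = c("Joe", "Joe", "Joe", "Amy", "Amy", "Amy", "Bill"),
  fruit     = c("apple", "lemon", NA, "orange", NA, NA, NA),
  vegetable = c(NA, NA, "zucchini", NA, "asparagus", "spinach", NA),
  stringsAsFactors = FALSE
)

# Step 1: fill within each store/manager group, first downwards, then upwards.
filled <- df %>%
  group_by(store, manager) %>%
  fill(fruit, vegetable, .direction = "updown") %>%
  ungroup()

# Step 2: filling makes some rows identical, so drop the repeats.
result <- distinct(filled)

nrow(filled)  # still one row per original row
nrow(result)  # fewer rows once duplicates are removed
```

Groups in which a column is entirely NA, such as Grocery3, are left as NA because there is nothing to fill from.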

Combining two dataframes based on presence of values of various different columns

I want to add columns to a data frame by checking whether a value appears in any of several of its columns and, where it does, attaching the matching row of a second data frame. Since that description is rather abstract, here is an example dataset:
newDf <- data.frame(c("Juice 1", "Juice 2", "Juice 3", "Juice 4","Juice 5"),
c("Banana", "Banana", "Orange", "Pear", "Apple"),
c("Blueberry", "Mango", "Rasberry", "Spinach", "Pear"),
c("Kale", NA, "Cherry", NA, "Peach"))
colnames(newDf) <- c("Juice", "Fruit 1", "Fruit 2", "Fruit 3")
dfChecklist <- data.frame(c("Banana", "Cherry"),
c("100", "80"),
c("5", "3"),
c("4", "5"))
colnames(dfChecklist) <- c("FruitCheck", "NutritionalValue", "Deliciousness", "Difficulty")
This gives the following dataframes:
Juice Fruit 1 Fruit 2 Fruit 3
1 Juice 1 Banana Blueberry Kale
2 Juice 2 Banana Mango <NA>
3 Juice 3 Orange Rasberry Cherry
4 Juice 4 Pear Spinach <NA>
5 Juice 5 Apple Pear Peach
FruitCheck NutritionalValue Deliciousness Difficulty
1 Banana 100 5 4
2 Cherry 80 3 5
I want to combine the two and make the result to be like this:
Juice Fruit 1 Fruit 2 Fruit 3 FruitCheck NutritionalValue Deliciousness Difficulty
1 Juice 1 Banana Blueberry Kale Banana 100 5 4
2 Juice 2 Banana Mango <NA> Banana 100 5 4
3 Juice 3 Orange Rasberry Cherry Cherry 80 3 5
4 Juice 4 Pear Spinach <NA> <NA> <NA> <NA> <NA>
5 Juice 5 Apple Pear Peach <NA> <NA> <NA> <NA>
The dataset above is only an example; my own dataset is much larger and more complex.
Thanks so much in advance for your help!
First, find the matching fruit for each row (if several fruits in a row match, the one listed earliest in dfChecklist is used):
tmp <- unlist(
  apply(
    newDf[, grepl("Fruit", colnames(newDf))],
    1,
    function(x) {
      y <- as.vector(x)
      y <- y[which.min(match(y, dfChecklist$FruitCheck))]
      ifelse(length(y) == 0, NA, y)
    }
  )
)
Then add this to your original data frame and do a simple merge:
newDf$FruitCheck <- tmp
merge(
  newDf,
  dfChecklist,
  by = "FruitCheck",
  all.x = TRUE
)
resulting in:
FruitCheck Juice Fruit 1 Fruit 2 Fruit 3 NutritionalValue Deliciousness Difficulty
1 Banana Juice 1 Banana Blueberry Kale 100 5 4
2 Banana Juice 2 Banana Mango <NA> 100 5 4
3 Cherry Juice 3 Orange Rasberry Cherry 80 3 5
4 <NA> Juice 4 Pear Spinach <NA> <NA> <NA> <NA>
5 <NA> Juice 5 Apple Pear Peach <NA> <NA> <NA>
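The per-row matching step is the least obvious part of this answer, so here it is in isolation. This is a sketch of the same match/which.min idea applied to single rows; pick_checklist_fruit is a hypothetical helper name, and the third call uses a made-up row to show what happens when two fruits match.

```r
checklist <- c("Banana", "Cherry")

# Hypothetical helper: return the fruit from one row that appears in the
# checklist, or NA when none does.
pick_checklist_fruit <- function(row_fruits, checklist) {
  pos <- match(row_fruits, checklist)  # position in checklist, NA if absent
  hit <- row_fruits[which.min(pos)]    # which.min() ignores NA; empty if all NA
  if (length(hit) == 0) NA_character_ else hit
}

pick_checklist_fruit(c("Orange", "Rasberry", "Cherry"), checklist)  # "Cherry"
pick_checklist_fruit(c("Pear", "Spinach", NA), checklist)           # NA
pick_checklist_fruit(c("Cherry", "Banana", "Kale"), checklist)      # "Banana"
```

Because which.min works on checklist positions, the order of the checklist decides between several matches, not the order of the fruit columns.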

Join dataframes in dplyr by characters

So I have two dataframes:
DF1
X Y ID
banana 14 1
orange 20 2
pineapple 1 3
guava 300 4
grapes 1 5
DF2
Store State ID
Walmart NY 1
Sears AL 1;2
Target DC 3
Old Navy PA 3
Popeye's HA 5
Footlocker NJ 4;5
I join with the following and get:
DF1 %>%
  inner_join(DF2, by = "ID")
X Y ID Store State
banana 14 1 Walmart NY
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
grapes 1 5 Popeye's HA
But because of the semicolons, the join does not capture those data points. The end result should look like this:
X Y ID Store State
banana 14 1 Walmart NY
banana 14 1 Sears AL
orange 20 2 Sears AL
pineapple 1 3 Target DC
pineapple 1 3 Old Navy PA
guava 300 4 Footlocker NJ
grapes 1 5 Popeye's HA
grapes 1 5 Footlocker NJ
Using separate_rows from tidyr in combination with dplyr will get you there.
I called the first table fruit and the second stores.
library(dplyr)
library(tidyr)
fruit %>%
inner_join(separate_rows(stores, ID) %>% mutate(ID = as.integer(ID)))
Joining, by = "ID"
X Y ID Store State
1 banana 14 1 Walmart NY
2 banana 14 1 Sears AL
3 orange 20 2 Sears AL
4 pineapple 1 3 Target DC
5 pineapple 1 3 Old Navy PA
6 guava 300 4 Footlocker NJ
7 grapes 1 5 Popeye's HA
8 grapes 1 5 Footlocker NJ
With base R, we can use strsplit with merge
lst1 <- strsplit(DF2$ID, ";")
merge(DF1, transform(DF2[rep(seq_len(nrow(DF2)),
lengths(lst1)), 1:2], ID = unlist(lst1)))
# ID X Y Store State
#1 1 banana 14 Walmart NY
#2 1 banana 14 Sears AL
#3 2 orange 20 Sears AL
#4 3 pineapple 1 Target DC
#5 3 pineapple 1 Old Navy PA
#6 4 guava 300 Footlocker NJ
#7 5 grapes 1 Popeye's HA
#8 5 grapes 1 Footlocker NJ
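Neither answer shows how the inputs were built, and the base R one-liner is dense. The sketch below rebuilds both tables from the question (assuming ID is an integer in DF1 and a character string in DF2) and splits the base R approach into separate steps; ids and expanded are hypothetical intermediate names.

```r
DF1 <- data.frame(
  X  = c("banana", "orange", "pineapple", "guava", "grapes"),
  Y  = c(14, 20, 1, 300, 1),
  ID = 1:5,
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Store = c("Walmart", "Sears", "Target", "Old Navy", "Popeye's", "Footlocker"),
  State = c("NY", "AL", "DC", "PA", "HA", "NJ"),
  ID    = c("1", "1;2", "3", "3", "5", "4;5"),
  stringsAsFactors = FALSE
)

# Split each ID string on ";" to get one vector of IDs per store.
ids <- strsplit(DF2$ID, ";", fixed = TRUE)

# Repeat each store row once per ID it carries, then attach the single IDs.
expanded <- DF2[rep(seq_len(nrow(DF2)), lengths(ids)), c("Store", "State")]
expanded$ID <- as.integer(unlist(ids))

# With one ID per row, an ordinary merge on ID works.
result <- merge(DF1, expanded, by = "ID")
result
```

Converting the split IDs to integer keeps the key the same type in both tables, which also matters for dplyr joins.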

Collapsing group of strings into one string using an if statement within a for loop in R

I have a dataframe with a column "Food."
dataframe <- data.frame(Color = c("red","red","red","red","red","blue","blue","blue","blue","blue","green","green","green","green","green","orange","orange","orange","orange","orange"),
Food = c("banana","apple","potato","orange","egg","strawberry","cheese","yogurt","kiwi","butter","kale","sugar","carrot","celery","radish","cereal","milk","blueberry","squash","lemon"), Count = c(2,5,4,8,10,7,5,6,9,11,1,8,5,3,7,9,2,3,6,4))
Every time a fruit appears I want to replace the name of the fruit with "fruit."
I've tried making a vector of the fruit names. Then I go through each row in the dataframe and where the string matches the fruit, I want to replace the fruit name with "fruit."
fruit_list <- c("banana","apple","orange","strawberry","kiwi","blueberry","lemon")
for (r in 1:nrow(dataframe)) {
for (i in 1:length(fruit_list)){
if (length(grep(fruit_list[i], dataframe$Food[r])) != 0) {
dataframe$Food[r] <- paste("fruit")
}
}
}
How do I use this general format so that dataframe$Food doesn't just end up filled with NA?
With dplyr:
library(dplyr)
dataframe %>%
  mutate(Food = as.character(Food),
         Food = ifelse(Food %in% fruit_list, "Fruit", Food)) # "Fruit" can be changed to "fruit"
Result:
Color Food Count
1 red Fruit 2
2 red Fruit 5
3 red potato 4
4 red Fruit 8
5 red egg 10
6 blue Fruit 7
7 blue cheese 5
8 blue yogurt 6
9 blue Fruit 9
10 blue butter 11
11 green kale 1
12 green sugar 8
13 green carrot 5
14 green celery 3
15 green radish 7
16 orange cereal 9
17 orange milk 2
18 orange Fruit 3
19 orange squash 6
20 orange Fruit 4
Using base R only:
dataframe$Food <-
sapply(dataframe$Food,
function(x,fruit_list) ifelse(x %in% fruit_list, "fruit", as.character(x) ),
fruit_list = fruit_list )
You don't necessarily need dplyr for this.
Just use:
dataframe$Food <- ifelse(dataframe$Food %in% fruit_list, "Fruit", as.character(dataframe$Food))
You can do this in one line using the data.table package:
library(data.table)
setDT(dataframe)[, Food := ifelse(Food %in% fruit_list, "fruit", as.character(Food))][]
Color Food Count
1: red fruit 2
2: red fruit 5
3: red potato 4
4: red fruit 8
5: red egg 10
6: blue fruit 7
7: blue cheese 5
8: blue yogurt 6
9: blue fruit 9
10: blue butter 11
11: green kale 1
12: green sugar 8
13: green carrot 5
14: green celery 3
15: green radish 7
16: orange cereal 9
17: orange milk 2
18: orange fruit 3
19: orange squash 6
20: orange fruit 4
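All of the answers above call as.character on Food, which hints at the likely cause of the NA values in the original loop. Assuming Food was a factor (the default for strings in data.frame() before R 4.0.0), assigning "fruit" fails because it is not an existing level. The sketch below reproduces that on a small made-up vector; food, broken and fixed are illustrative names.

```r
fruit_list <- c("banana", "apple", "orange")

# Assumed situation: the column is a factor rather than a character vector.
food <- factor(c("banana", "potato", "egg"))

# Assigning a value that is not an existing level yields NA (with a warning,
# suppressed here to keep the output clean).
broken <- food
suppressWarnings(broken[1] <- "fruit")
is.na(broken[1])  # TRUE

# Converting to character first avoids the problem.
fixed <- as.character(food)
fixed[fixed %in% fruit_list] <- "fruit"
fixed             # "fruit" "potato" "egg"
```

If the column is already a character vector, the conversion is harmless, so it is safe to keep either way.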

Give each id the same column value in R

I want to give each unique id the same column value for first.date based on their first.date for fruit=='apple'.
This is what I have:
names dates fruit first.date
1 john 2010-07-01 kiwi <NA>
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana <NA>
4 john 2010-12-01 orange <NA>
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange <NA>
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange <NA>
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
This is what I want:
names dates fruit first.date
1 john 2010-07-01 kiwi 2010-09-01
2 john 2010-09-01 apple 2010-09-01
3 john 2010-11-01 banana 2010-09-01
4 john 2010-12-01 orange 2010-09-01
5 john 2011-01-01 apple 2010-09-01
6 mary 2010-05-01 orange 2010-07-01
7 mary 2010-07-01 apple 2010-07-01
8 mary 2010-07-01 orange 2010-07-01
9 mary 2010-09-01 apple 2010-07-01
10 mary 2010-11-01 apple 2010-07-01
This is my disastrous attempt:
getdates$first.date[is.na]<-getdates[getdates$first.date & getdates$fruit=='apple',]
Thank you in advance.
Reproducible data frame:
names<-as.character(c("john", "john", "john", "john", "john", "mary", "mary","mary","mary","mary"))
dates<-as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01", "2011-01-01", "2010-05-01", "2010-07-01", "2010-07-01", "2010-09-01", "2010-11-01"))
fruit<-as.character(c("kiwi","apple","banana","orange","apple","orange","apple","orange", "apple", "apple"))
first.date<-as.Date(c(NA, "2010-09-01",NA,NA, "2010-09-01", NA, "2010-07-01", NA, "2010-07-01","2010-07-01"))
getdates<-data.frame(names,dates,fruit, first.date)
It's unclear what you want to do when there are several first.date entries for apple (for a given name); this will just take the first one:
library(data.table)
dt = data.table(getdates)
dt[, first.date := first.date[fruit == 'apple'][1], by = names]
dt
# names dates fruit first.date
# 1: john 2010-07-01 kiwi 2010-09-01
# 2: john 2010-09-01 apple 2010-09-01
# 3: john 2010-11-01 banana 2010-09-01
# 4: john 2010-12-01 orange 2010-09-01
# 5: john 2011-01-01 apple 2010-09-01
# 6: mary 2010-05-01 orange 2010-07-01
# 7: mary 2010-07-01 apple 2010-07-01
# 8: mary 2010-07-01 orange 2010-07-01
# 9: mary 2010-09-01 apple 2010-07-01
#10: mary 2010-11-01 apple 2010-07-01
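For readers without data.table, a base R sketch of the same idea follows. Note the assumption: instead of reusing the existing first.date column, it derives each person's earliest apple date directly from dates, which gives the same result on this data but is not the answer's own method. apples and first_apple are hypothetical names.

```r
names <- c("john", "john", "john", "john", "john",
           "mary", "mary", "mary", "mary", "mary")
dates <- as.Date(c("2010-07-01", "2010-09-01", "2010-11-01", "2010-12-01",
                   "2011-01-01", "2010-05-01", "2010-07-01", "2010-07-01",
                   "2010-09-01", "2010-11-01"))
fruit <- c("kiwi", "apple", "banana", "orange", "apple",
           "orange", "apple", "orange", "apple", "apple")
getdates <- data.frame(names, dates, fruit, stringsAsFactors = FALSE)

# Keep only the apple rows, ordered by date, then the first row per name.
apples <- getdates[getdates$fruit == "apple", ]
apples <- apples[order(apples$dates), ]
first_apple <- apples[!duplicated(apples$names), c("names", "dates")]

# Look up each row's name in the per-person table and copy the date across.
getdates$first.date <- first_apple$dates[match(getdates$names, first_apple$names)]
getdates
```

A name with no apple rows would get NA here, since match finds nothing to look up.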
