Imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code runs without error, but match() returns only the first matching row for each key, so it pulls only the first value and ignores the rest.

Another solution, using only base R:
aggregate(df2$V2, list(df2$V1), c, simplify = FALSE)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles

I don't think you even need df1 in this case; you can build the result from df2 alone:
library(dplyr)
df1 <- df2 %>% group_by(V1) %>% summarise(V2 = paste0(V2, collapse = ", "))
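If df1 does need to be kept (for example because only some of its V2 values are missing), a minimal base R sketch is to collapse df2 per key first and then fill only the NA entries by name lookup. The data below is retyped from the question, and the helper names `collapsed` and `missing` are made up for illustration.

```r
# Sample data retyped from the question (column names V1/V2 as used in its code)
df1 <- data.frame(V1 = c("apple", "cheese", "butter"),
                  V2 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  V1 = rep(c("apple", "cheese", "butter"), times = 3),
  V2 = c("jacks", "whiz", "scotch",
         "turnover", "sliders", "chicken",
         "sauce", "doodles", "milk"),
  stringsAsFactors = FALSE
)

# One comma-separated string per key; the result is a named character vector
collapsed <- vapply(split(df2$V2, df2$V1), paste, character(1), collapse = ", ")

# Fill only the missing entries, looking each key up by name
missing <- is.na(df1$V2)
df1$V2[missing] <- unname(collapsed[df1$V1[missing]])

print(df1)
```

Keys in df1 that never occur in df2 would simply stay NA with this approach.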

Related

R: how to aggregate rows by count

This is my data frame
ID=c(1,2,3,4,5,6,7,8,9,10,11,12)
favFruit=c('apple','lemon','pear',
'apple','apple','pear',
'apple','lemon','pear',
'pear','pear','pear')
surveyDate = c('1/1/2005','1/1/2005','1/1/2005',
'2/1/2005','2/1/2005','2/1/2005',
'3/1/2005','3/1/2005','3/1/2005',
'4/1/2005','4/1/2005','4/1/2005')
df<-data.frame(ID,favFruit, surveyDate)
I need to aggregate it so I can plot a line graph in R for count of favFruit by date split by favFruit but I am unable to create an aggregate table. My data has 45000 rows so a manual solution is not possible.
surveyYear favFruit count
1/1/2005 apple 1
1/1/2005 lemon 1
1/1/2005 pear 1
2/1/2005 apple 2
2/1/2005 lemon 0
2/1/2005 pear 1
... etc
I tried this but R printed an error
df2 <- aggregate(df, favFruit, FUN = sum)
and I tried this, another error
df2 <- aggregate(df, date ~ favFruit, sum)
I checked for solutions online, but their data generally included a column of quantities, which I don't have, and the solutions were overly complex. Is there an easy way to do this? Thanks in advance. Thank you to whoever suggested the link as a possible duplicate, but that question has only a date and a number of rows. My question needs the number of rows by date and favFruit (one more column).
Update:
Ronak Shah's solution worked. Thanks!
The solution provided by Ronak is very good.
In case you prefer to keep the zero counts in your dataframe, you could use the table function:
data.frame(with(df, table(favFruit, surveyDate)))
Output:
favFruit surveyDate Freq
1 apple 1/1/2005 1
2 lemon 1/1/2005 1
3 pear 1/1/2005 1
4 apple 2/1/2005 2
5 lemon 2/1/2005 0
6 pear 2/1/2005 1
7 apple 3/1/2005 1
8 lemon 3/1/2005 1
9 pear 3/1/2005 1
10 apple 4/1/2005 0
11 lemon 4/1/2005 0
12 pear 4/1/2005 3
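For comparison, when the zero-count rows are not needed, one way to count rows per date and fruit is aggregate() with a formula and length as the function. This is only a sketch and not necessarily the solution credited above; ID serves merely as a column to count, and the name `counts` is illustrative.

```r
# Data retyped from the question
df <- data.frame(
  ID = 1:12,
  favFruit = c("apple", "lemon", "pear",
               "apple", "apple", "pear",
               "apple", "lemon", "pear",
               "pear", "pear", "pear"),
  surveyDate = rep(c("1/1/2005", "2/1/2005", "3/1/2005", "4/1/2005"), each = 3),
  stringsAsFactors = FALSE
)

# Count rows for each (surveyDate, favFruit) combination that actually occurs
counts <- aggregate(ID ~ surveyDate + favFruit, data = df, FUN = length)
names(counts)[3] <- "count"

print(counts)
```

Combinations that never occur (such as lemon on 2/1/2005) are absent from this result, which is the main difference from the table() approach.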

Only filter values in a column based on a condition

Let's say I have the following dataframe:
my_basket = data.frame(ITEM_GROUP = c("Fruit","Fruit","Fruit","Fruit","Fruit","Vegetable","Vegetable","Vegetable","Vegetable","Dairy","Dairy","Dairy","Dairy","Dairy"),
ITEM_NAME = c("Apple","Banana","Orange","Mango","Papaya","Carrot","Potato","Brinjal","Raddish","Milk","Curd","Cheese","Milk","Paneer"),
Price = c(100,80,80,90,65,70,60,70,25,60,40,35,50,NA),
Tax = c(2,4,5,6,2,3,5,1,3,4,5,6,4,NA))
This then yields:
> my_basket
ITEM_GROUP ITEM_NAME Price Tax
1 Fruit Apple 100 2
2 Fruit Banana 80 4
3 Fruit Orange 80 5
4 Fruit Mango 90 6
5 Fruit Papaya 65 2
6 Vegetable Carrot 70 3
7 Vegetable Potato 60 5
8 Vegetable Brinjal 70 1
9 Vegetable Raddish 25 3
10 Dairy Milk 60 4
11 Dairy Curd 40 5
12 Dairy Cheese 35 6
13 Dairy Milk 50 4
14 Dairy Paneer NA NA
What I now would like to do, is make a list of fruits I want to keep and then filter those, so:
fruitlist = c("Apple", "Banana")
How would I go about using tidyverse to filter the data in my data.frame to only keep the fruits in my fruitlist, but also all my Vegetables and Dairy? Normally I'd do:
my_basket %<>% filter(ITEM_NAME %in% fruitlist)
But then I'd also lose all the vegetables and dairy, which is not what I want. I've been trying to make something work with case_when but can't seem to make it work. There must be something obvious I'm missing here.
EDIT: Seconds after posting my question I finally realised:
my_basket %<>% filter(ITEM_NAME %in% fruitlist | ITEM_GROUP != "Fruit")
That solves it. If I had to filter multiple groups like this, I think piping the filter command repeatedly would work too.
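You could use grepl with a regex alternation, combined with the same group condition so that the non-fruit rows are kept:
fruitlist <- c("Apple", "Banana")
regex <- paste0("^(?:", paste0(fruitlist, collapse="|"), ")$")
my_basket %<>% filter(grepl(regex, ITEM_NAME) | ITEM_GROUP != "Fruit")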
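The same keep-condition can be checked without any packages. The sketch below uses base R logical indexing on the data from the question; the name `kept` is illustrative.

```r
# Data retyped from the question
my_basket <- data.frame(
  ITEM_GROUP = c(rep("Fruit", 5), rep("Vegetable", 4), rep("Dairy", 5)),
  ITEM_NAME = c("Apple", "Banana", "Orange", "Mango", "Papaya",
                "Carrot", "Potato", "Brinjal", "Raddish",
                "Milk", "Curd", "Cheese", "Milk", "Paneer"),
  Price = c(100, 80, 80, 90, 65, 70, 60, 70, 25, 60, 40, 35, 50, NA),
  Tax = c(2, 4, 5, 6, 2, 3, 5, 1, 3, 4, 5, 6, 4, NA),
  stringsAsFactors = FALSE
)
fruitlist <- c("Apple", "Banana")

# Keep a row if it is a listed fruit OR if it is not a fruit at all
kept <- my_basket[my_basket$ITEM_NAME %in% fruitlist |
                    my_basket$ITEM_GROUP != "Fruit", ]

print(kept)
```

The row with NA Price and Tax (Paneer) is kept, since the condition only looks at ITEM_NAME and ITEM_GROUP.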

How to split words up in a cell by a comma in R? [duplicate]

I have the following dataframe:
id food drink
1 chip coke, wine, punch
2 eggs pepsi, water
3 pie water, wine, orange juice
I want to know how I can get the following dataframe instead:
id food drink
1 chip coke
1 chip wine
1 chip punch
2 eggs pepsi
2 eggs water
3 pie water
3 pie wine
3 pie orange juice
I would like to use something from the tidyverse, such as the stringr package, but I am stuck at the moment.
Any ideas how to do this in R?
We can use separate_rows
library(tidyverse)
separate_rows(df1, drink, sep=", ")
# id food drink
#1 1 chip coke
#2 1 chip wine
#3 1 chip punch
#4 2 eggs pepsi
#5 2 eggs water
#6 3 pie water
#7 3 pie wine
#8 3 pie orange juice
data
df1 <- structure(list(id = 1:3, food = c("chip", "eggs", "pie"),
drink = c("coke, wine, punch",
"pepsi, water", "water, wine, orange juice")), class = "data.frame",
row.names = c(NA, -3L))
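Another way to solve this problem is with tidytext:
library(tidytext)
unnest_tokens(df1, drink, drink, token = "regex", pattern = ", ")
This will split the drink column at each comma. The default word tokenizer would also split "orange juice" into two rows, so the regex tokenizer is used here. Note that unnest_tokens lowercases the text by default.
unnest_tokens can do other kinds of unnesting as well.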
See more info here: https://www.tidytextmining.com/tidytext.html#the-unnest_tokens-function
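For reference, the same reshaping can be sketched in base R with strsplit(), repeating the other columns once per split piece. This is an illustrative alternative rather than something proposed in the answers above; `parts` and `long` are made-up names.

```r
# Data as given in the separate_rows answer
df1 <- data.frame(
  id = 1:3,
  food = c("chip", "eggs", "pie"),
  drink = c("coke, wine, punch", "pepsi, water", "water, wine, orange juice"),
  stringsAsFactors = FALSE
)

# Split each drink string at ", " into a list of character vectors
parts <- strsplit(df1$drink, ", ", fixed = TRUE)

# Repeat id and food once per piece, then flatten the pieces
long <- data.frame(
  id = rep(df1$id, lengths(parts)),
  food = rep(df1$food, lengths(parts)),
  drink = unlist(parts),
  stringsAsFactors = FALSE
)

print(long)
```

Splitting on the literal ", " keeps multi-word entries such as "orange juice" intact.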

How to aggregate data several times on the same table

I am working in R and have a list with 3 columns:
Fruit Drawer Amount
Banana Top 1
Peach Top 2
Apple Top 3
Banana Mid 4
Peach Mid 5
Apple Mid 6
Banana Bottom 7
Peach Bottom 8
Apple Bottom 9
and I want to create the smallest ratio of fruit type (ex. bananas) in each drawer (ex. Top) to total fruit (all the bananas).
I am using table:
x <- table(fruits)
but I get a type of data that I don't know how to work with.
Ultimately I want to get "bananas per drawer" divided by the "total bananas" in all the drawers. I guess I could do it column by column but I am sure there are better ways to go about it. Any suggestion?
Sorry for any etiquette mishaps, I haven't been programming for long.
Thanks.
Do you want something like this?
library(dplyr)
fruit__drawer =
"Fruit Drawer Amount
Banana Top 1
Peach Top 2
Apple Top 3
Banana Mid 4
Peach Mid 5
Apple Mid 6
Banana Bottom 7
Peach Bottom 8
Apple Bottom 9" %>%
read.table(text = . , header = TRUE)
fruit =
fruit__drawer %>%
group_by(Fruit) %>%
summarize(Amount.fruit = sum(Amount)) %>%
mutate(Proportion.overall = Amount.fruit / sum(Amount.fruit))
result =
fruit__drawer %>%
left_join(fruit, by = "Fruit") %>%
group_by(Drawer) %>%
mutate(Proportion= Amount/sum(Amount),
Proportion.ratio = Proportion/Proportion.overall)
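If the goal is read literally as "amount of a fruit in one drawer divided by the total amount of that fruit", a shorter base R sketch is possible with ave(), which returns the per-fruit total aligned to each row. This is one interpretation of the question, not the answerer's method; the column name `Share` and the object `smallest` are made up for illustration.

```r
# Data retyped from the question
fruits <- read.table(text = "Fruit Drawer Amount
Banana Top 1
Peach Top 2
Apple Top 3
Banana Mid 4
Peach Mid 5
Apple Mid 6
Banana Bottom 7
Peach Bottom 8
Apple Bottom 9", header = TRUE, stringsAsFactors = FALSE)

# Per-row share: amount in this drawer / total amount of the same fruit
fruits$Share <- fruits$Amount / ave(fruits$Amount, fruits$Fruit, FUN = sum)

# Smallest share for each fruit across drawers
smallest <- aggregate(Share ~ Fruit, data = fruits, FUN = min)

print(fruits)
print(smallest)
```

Within each fruit the shares sum to 1, which is a quick sanity check on the result.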

Inserting new rows to dataframe without losing format

I am trying to create a large empty data.frame and insert groups of rows into it. I have seen a few similar questions on numerous forums, but I have been unable to apply any of them successfully to the specific formatting issue I am having.
I started with rbind(df, allic) (allic is the data frame I would like to insert into df); however, given the size of my dataset, the operation takes 5 1/2 minutes to complete. I understand that creating the data frame up front and replacing rows improves efficiency, but I have been unable to make it work for my problem. The code is as follows:
Initial data:
Order.ID Product
1 193505 Onion Rings
2 193505 Pineapple Cheddar Burger
3 193623 Fountain Soda
4 193623 French Fries
5 193623 Hamburger
6 193623 Hot Dog
7 193631 French Fries
8 193631 Hamburger
9 193631 Milkshake
The products won't match to below, however this being a formatting issue I figured it best to show the formatting that brought me to where I am now.
nb$Order.ID <- as.factor(nb$Order.ID)
plist <- aggregate(nb$Product,list(nb$Order.ID),list)
allp <- unique(unlist(plist$x))
allic <- expand.grid(plist$x[[1]], Var2=plist$x[[1]], Var3=1)
Var1 Var2 Var3
1 Onion Rings Onion Rings 1
2 Pineapple Cheddar Burger Onion Rings 1
3 Onion Rings Pineapple Cheddar Burger 1
4 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
Now I create an empty dataframe (df) using:
df <- data.frame(factor=rep(NA, rcnt), factor=rep(NA,rcnt), stringsAsFactors=FALSE)
rcnt being a large, arbitrary number which I plan to trim once the operation is complete. My issue comes when I try to insert these lines using:
df[1:4,] <- allic
head(df, n=10)
factor factor.1
1 47 47
2 51 47
3 47 51
4 51 51
5 NA NA
6 NA NA
7 NA NA
8 NA NA
How can I insert rows in a dataframe without losing the format of my values? I would greatly appreciate any help I can get at this point.
EDIT Per comment below:
>df[i] <- for(i in 1:nrow(plist)) {
> allic <- expand.grid(plist$x[[i]], Var2=plist$x[[i]], Var3=1)
> df[i:nrow(allic),] <- sapply(allic, as.character)
I'm still very new with R, however this was working when I was using df <- rbind(df,allic). nrow(df) is 4096.
Try converting each column of allic to character with sapply, as follows:
df[1:4,] <- sapply(allic, as.character)
> df
factor factor.1
1 Onion Rings Onion Rings
2 Pineapple Cheddar Burger Onion Rings
3 Onion Rings Pineapple Cheddar Burger
4 Pineapple Cheddar Burger Pineapple Cheddar Burger
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
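The numbers 47 and 51 in the question are the integer codes of factor levels: expand.grid() creates factors by default, and the codes are what end up in the NA-filled columns. A sketch that avoids both the factor issue and the slow repeated rbind() is to build every per-order grid in a list and bind once at the end. It assumes the nb data shown in the question; `by_order`, `pairs`, and `all_pairs` are illustrative names.

```r
# Data retyped from the question
nb <- data.frame(
  Order.ID = c(193505, 193505, 193623, 193623, 193623, 193623,
               193631, 193631, 193631),
  Product = c("Onion Rings", "Pineapple Cheddar Burger",
              "Fountain Soda", "French Fries", "Hamburger", "Hot Dog",
              "French Fries", "Hamburger", "Milkshake"),
  stringsAsFactors = FALSE
)

# Products grouped by order
by_order <- split(nb$Product, nb$Order.ID)

# One grid of product pairs per order; keep strings as character, not factor
pairs <- lapply(by_order, function(p) {
  expand.grid(Var1 = p, Var2 = p, Var3 = 1, stringsAsFactors = FALSE)
})

# Bind all grids in a single call instead of growing a data frame in a loop
all_pairs <- do.call(rbind, pairs)

print(head(all_pairs, 4))
```

Because the final size is known only after the grids are built, no arbitrary preallocation or trimming is needed with this approach.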
