Create data frame for each unique row in another data frame - r

For an assignment for my graduate program, I have been asked to extract data from datasets of English Premier League results (located here). I am very close to being done but need help on the last two outputs.
We must create a function that can receive two arguments, a date and a season. The function must return a data frame with the table of the respective season on that date. It must include wins, losses, home record, away record, etc. The only ones I have not managed to figure out are W/L streak and the results of the last 10 matches.
Here is an example of what the initial dataset looks like:
e.Date e.HomeTeam e.AwayTeam e.FTHG e.FTAG e.FTR
1 2015-08-08 Bournemouth Aston Villa 0 1 A
2 2015-08-08 Chelsea Swansea 2 2 D
3 2015-08-08 Everton Watford 2 2 D
4 2015-08-08 Leicester Sunderland 4 2 H
5 2015-08-08 Man United Tottenham 1 0 H
My plan was to get Home and Away data sorted out for each club then merge them together before doing the analysis to find streak and last 10 results.
I manipulated the data to look like this:
HomeTeam FTR Date freq
1 Arsenal L 2015-08-09 1
2 Arsenal D 2015-08-24 1
3 Arsenal W 2015-09-12 1
4 Aston Villa L 2015-08-14 1
5 Aston Villa L 2015-09-19 1
6 Aston Villa D 2015-08-29 1
And now I'm kinda lost. My idea was to run some kind of loop (for? ddply? data.table?) to create a data frame for each club with their results in it and then loop again to do whatever calculations to get the desired variables (streak and last 10) and somehow push those back into the main data frame where I am housing all of the other outputs.
I don't want to be told the answer outright since it's important I learn this on my own. However, if someone could point me in the right direction that would be great. Thanks so much.

I created some dummy data just to demonstrate a few commands and maybe give you some ideas.
set.seed(321)
dat <- data.frame(team = sample(letters[1:3], 20, replace=TRUE),
season = rep("season1", 20),
time = rnorm(20),
win_loss = sample(c("win", "loss"), 20, replace=TRUE))
Problem 1. Find win/loss streak
Take a look at the rle function example below
# 1. find wl streak of team 'a'
tmp <- dat[dat$team == "a", ]
tmp <- tmp[order(tmp$time), ]
> tmp
team season time win_loss
19 a season1 -1.12032742 loss
14 a season1 -1.07223880 loss
16 a season1 0.09500072 loss
3 a season1 0.18832552 loss
8 a season1 0.42033257 loss
4 a season1 2.44325982 win
# shows runs of 5 consecutive losses, then 1 consecutive win
rle(tmp$win_loss == "win")
Run Length Encoding
lengths: int [1:2] 5 1
values : logi [1:2] FALSE TRUE
Here's a very helpful post on rle: "How can I count runs in a sequence?"
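As a hint at how the rle output can be turned into a streak value, here is a minimal sketch. The helper name current_streak is made up for illustration, and it assumes the results vector is already sorted from oldest to newest match.

```r
# Hypothetical helper: describe the most recent run in a vector of results
# that is already in chronological order (oldest first).
current_streak <- function(results) {
  runs <- rle(as.character(results))
  last <- length(runs$lengths)          # index of the final (current) run
  paste0(runs$values[last], runs$lengths[last])
}

results <- c("loss", "loss", "loss", "loss", "loss", "win")
streak <- current_streak(results)
print(streak)
```

The last element of lengths and values always describes the run the team is currently on, so no explicit loop over matches is needed.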
Problem 2. Last 3 results
I reversed the order of time and then picked the top 3 results.
# 2. find last 3 matches for team 'b'
tmp <- dat[dat$team == "b", ]
tmp <- tmp[rev(order(tmp$time)), ]
> tmp[1:3, ]
team season time win_loss
11 b season1 0.9172555 loss
9 b season1 0.5775845 win
7 b season1 0.4560691 loss
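To do the same for every team at once instead of one team at a time, one possible direction is split followed by lapply. This is only a sketch on a small made-up data frame; the helper last_n and the one-letter "form" string are assumptions for illustration, not something the assignment prescribes.

```r
# Small made-up data set: two teams, results ordered by a numeric time stamp.
dat <- data.frame(
  team     = c("a", "b", "a", "b", "a", "b", "a"),
  time     = c(1, 2, 3, 4, 5, 6, 7),
  win_loss = c("win", "loss", "loss", "win", "win", "win", "loss"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the n most recent rows of one team's results.
last_n <- function(d, n = 3) {
  d <- d[order(d$time), ]   # oldest first
  tail(d, n)                # keep the n newest rows
}

# One data frame per team, then the same helper applied to each.
recent <- lapply(split(dat, dat$team), last_n, n = 3)

# Collapse each team's recent results into a short string, oldest to newest.
form <- sapply(recent, function(d) paste(substr(d$win_loss, 1, 1), collapse = ""))
print(form)
```

The result is a named vector with one entry per team, which can be merged back into a main table by team name.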

Related

Multithread computation with R: how to get all different random numbers?

Does anyone know how to make all the random numbers different in the following code, e.g. with the doRNG package? I don't care about reproducibility.
Edit: Duplicates by pure chance are accepted.
rm(list = ls())
set.seed(666)
cat("\014")
library(plyr)
library(dplyr)
library(doRNG)
# ====== Data Preparation ======
dt = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
# ====== Set Parallel Computing ======
library(foreach)
library(doParallel)
cl = makeCluster(3, outfile = "")
registerDoParallel(cl)
# ====== SIMULATION ======
nsim = 1000 # number of simulations
iterChunk = 100 # split nsim into this many chunks
out = data.frame() # prepare output DF
for(iter in 1:ceiling(nsim/iterChunk)){
strt = Sys.time()
out_iter =
foreach(i = 1:iterChunk, .combine = rbind, .multicombine = TRUE, .maxcombine = 100000, .inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
# simulation number
id_sim = iterChunk * (iter - 1) + i
## Generate random numbers
tmp_sim = is.na(dt$HG) # no results yet
dt$random[tmp_sim] = runif(sum(tmp_sim))
dt$HG[tmp_sim] = 3
# Save Results
dt$id_sim = id_sim
dt$iter = iter
dt$i = i
print(Sys.time())
return(dt)
}#i;sim_forcycle
out = rbind.data.frame(out,subset(out_iter, !is.na(random)))
fnsh = Sys.time()
cat(" [",iter,"] ",fnsh - strt, sep = "")
}#iter
# ====== Stop Parallel Computing ======
stopCluster(cl)
# ====== Distinct Random Numbers ======
length(unique(out$random)) # expectation: 6000
I have been struggling with this for 2 days. I asked this question earlier and received only a general response about random numbers.
Here I would like to ask how to set doRNG package options (or those of a similar package) so that all the random numbers are different across all the loops.
I have tried tons of doRNG settings and I still can't get it to work. I tried R versions 3.5.3 and 3.6.3 on two different computers.
UPDATE Following discussion with @Limey
The purpose of the code is to simulate football matches. As the simulation is large, I use iterChunk to "split" the simulation into manageable parts, and after each iter I send the data to a PostgreSQL database so the simulation doesn't overload RAM. Some matches already have real-world results and have HG (home goals) filled in. I want to simulate the rest.
When iterChunk is set to 1, everything is fine. Increasing iterChunk leads to the same numbers being generated within an iter. For example, with nsim set to 100 and iterChunk to 10 (all matches simulated 100 times, 10 times in each of 10 loops), I expect 600 random numbers (each match simulated independently across all the loops). However, I only get 180, following the logic 3 cores * 6 matches * 10 iterChunks. Using 2 workers, I get 120 distinct random numbers (2 * 6 * 10).
Furthermore, if I exclude dt$HG[tmp_sim] = 3, all the random numbers are different regardless of the settings.
To understand the problem, I suggest:
1. Run the code as is (possibly setting nsim to 100 and iterChunk to 10). You will get 180 different random numbers. With lower values of nsim and iterChunk, things may work as expected.
2. Comment out dt$HG[tmp_sim] = 3. You will get 6000 different random numbers (600 if you change nsim and iterChunk).
The code in 2nd step assigns goals scored by home team. It looks like some kind of bug I can't get over. Even information that someone gets the same result and doesn't know why will be helpful - it will lift the weight of my own stupidity out of me.
Thank you, I highly appreciate any effort.
I realised what the problem with OP's code was whilst I was in the shower. It's simple, and obvious in retrospect: all the loops and parallel processes are working on the same object - the dt data frame. So they're constantly overwriting the changes that each makes, and at the end of the outer loop, you just have multiple copies of the changes made by the last loop to complete. The solution is equally simple: work on a copy of the dt data frame.
To minimise the changes, I renamed dt to baseDT
# ====== Data Preparation ======
baseDT = data.frame(id = 1:10,
part = rep("dt",10),
HG = c(1,3,6,NA,NA,2,NA,NA,NA,NA),
random = NA)
and then took a copy of it at the top of the foreach loop
out_iter = foreach(i = 1:iterChunk,
.combine = rbind, .multicombine = TRUE, .maxcombine = 100000,
.inorder = FALSE, .verbose = FALSE,
.packages = c("plyr", "dplyr")) %dopar% {
dt <- baseDT
This gives
> length(unique(out$random)) # expectation: 6000
[1] 6000
as expected.
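The effect of reusing one object can be reproduced without any parallel code. The sketch below is a sequential stand-in, not the original foreach loop: the names base_dt, shared and the counters are made up, and it only assumes that the object modified in one pass is still there in the next.

```r
# Made-up miniature of the match table: two known results, two unknown.
base_dt <- data.frame(id = 1:4, HG = c(1, NA, NA, 2), random = NA_real_)

# Variant 1: every pass modifies the same object.
shared <- base_dt
n_filled_shared <- integer(0)
for (i in 1:3) {
  todo <- is.na(shared$HG)                  # rows still without a result
  shared$random[todo] <- runif(sum(todo))
  shared$HG[todo] <- 3                      # after pass 1 nothing is NA any more
  n_filled_shared <- c(n_filled_shared, sum(todo))
}

# Variant 2: every pass starts from a fresh copy of the base table.
n_filled_copy <- integer(0)
for (i in 1:3) {
  dt <- base_dt
  todo <- is.na(dt$HG)
  dt$random[todo] <- runif(sum(todo))
  dt$HG[todo] <- 3
  n_filled_copy <- c(n_filled_copy, sum(todo))
}

print(n_filled_shared)  # new random numbers drawn per pass, shared object
print(n_filled_copy)    # new random numbers drawn per pass, fresh copy
```

With the shared object, only the first pass finds any NA values to fill, so later passes just carry the earlier random numbers along. That matches the observation that commenting out the HG assignment makes the duplicates disappear.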
Modifying the "Hello World" example in the "getting started with doParallel" vignette to generate random numbers, I came up with:
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
myFunc <- function(n) {runif(n)}
foreach(i=1:3) %dopar% myFunc(10)
[[1]]
[1] 0.18492375 0.13388278 0.65455450 0.93093066 0.41157625 0.89479764 0.14736529 0.47935995 0.03062963 0.16110714
[[2]]
[1] 0.89245145 0.20980791 0.83828019 0.04411547 0.38184303 0.48110619 0.51509058 0.93732055 0.40159834 0.81414140
[[3]]
[1] 0.74393129 0.66999730 0.44411989 0.85040773 0.80224527 0.72483644 0.64566262 0.22546420 0.14526819 0.05931329
This suggests that getting random numbers across threads is straightforward. Indeed, the examples on pages 2 and 3 of the doRNG reference manual say the same thing.
In fact, if I understand you correctly, the purpose of doRNG is to do precisely the opposite of what you want: to make random processes reproducible across threads.
Of course, this doesn't guarantee that all numbers are different across all threads. But it makes duplication very unlikely. A guarantee of no duplicates would mean some degree of determinism in the process: a completely random process might produce duplicates by chance.
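To make the difference between "reproducible" and "distinct" concrete, here is a small sketch using only base R and the parallel package rather than doRNG itself. The helper draw_from and the variable names are made up for illustration; the L'Ecuyer-CMRG generator and nextRNGStream are the standard base tools for separate random streams.

```r
# Switch to the generator that supports multiple independent streams.
RNGkind("L'Ecuyer-CMRG")
set.seed(666)

stream1 <- .Random.seed                      # seed state for a first worker
stream2 <- parallel::nextRNGStream(stream1)  # separate stream for a second worker

# Hypothetical helper: draw n uniforms starting from a given stream state.
draw_from <- function(seed, n) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(n)
}

a <- draw_from(stream1, 5)
b <- draw_from(stream2, 5)
a_again <- draw_from(stream1, 5)

print(a)
print(b)
print(identical(a, a_again))  # same stream state gives the same numbers
```

Each stream is reproducible on its own, while different streams give different numbers. Nothing here forbids a chance duplicate; it is just extremely unlikely.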
Update
Following on from our conversation in the comments...
We've established that the problem is in your program logic, not the parallelisation per se. So we need to refocus the question: what are you trying to do? I'm afraid it's not at all clear to me. So that means we need to simplify.
I set nsim to 5 and iterChunk to 1. I get 5 data frames which look like
id part HG random id_sim iter i
1 1 dt 1 NA 1 1 1
2 2 dt 3 NA 1 1 1
3 3 dt 6 NA 1 1 1
4 4 dt 3 0.6919744 1 1 1
5 5 dt 3 0.5413398 1 1 1
6 6 dt 2 NA 1 1 1
7 7 dt 3 0.3983175 1 1 1
8 8 dt 3 0.3342174 1 1 1
9 9 dt 3 0.6126020 1 1 1
10 10 dt 3 0.4185468 1 1 1
In each, the values of id_sim and iter are always the same, and run from 1 in the first data frame to 5 in the fifth. i is 1 for all rows in all data frames. Values in random do appear to be random, and different between data frames. But the NAs are all in the same positions in every data frame: the 1st, 2nd, 3rd and 6th rows. The values of HG are as shown above for all five data frames.
Is that what you would expect? If not, what do you expect? Given we know the problem is not the parallelisation, you need to give us more information.
Update 2
Do you know Arduan? They posted a related question over the weekend...
I'm not going to tell you what's wrong with your code. I'll show you how I would approach your problem. I hope you'll agree it's more readable, if nothing else.
So, we're simulating some football matches. I'll assume it's a league format and use the English Premier League as an example. Start by generating the fixture list for a single season.
library(tidyverse)
teams <- c("Arsenal", "Aston Villa", "Bournemouth", "Brighton & Hove Albion",
"Burnley", "Chelsea", "Crystal Palace", "Everton", "Leicester City",
"Liverpool", "Manchester City", "Manchester United", "Newcastle United",
"Norwich City", "Sheffield United", "Southampton", "Tottenham Hotspur",
"Watford", "West Ham United", "Wolverhampton Wanderers")
fixtures <- tibble(HomeTeam=teams, AwayTeam=teams) %>%
complete(HomeTeam, AwayTeam) %>%
filter(HomeTeam != AwayTeam) # A team can't play itself
fixtures %>% head(5)
# A tibble: 5 x 2
HomeTeam AwayTeam
<chr> <chr>
1 Arsenal Aston Villa
2 Arsenal Bournemouth
3 Arsenal Brighton & Hove Albion
4 Arsenal Burnley
5 Arsenal Chelsea
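A quick way to sanity-check the size of the fixture list, sketched here in base R with placeholder team names (the names and the use of expand.grid are assumptions for illustration): 20 teams each hosting the 19 others should give 20 * 19 = 380 fixtures.

```r
# Placeholder team names; any 20 distinct names behave the same way.
teams <- paste("Team", 1:20)

# Every ordered (home, away) pair, then drop the pairs where a team meets itself.
all_pairs <- expand.grid(HomeTeam = teams, AwayTeam = teams,
                         stringsAsFactors = FALSE)
fixtures_check <- all_pairs[all_pairs$HomeTeam != all_pairs$AwayTeam, ]

n_fixtures <- nrow(fixtures_check)
print(n_fixtures)
```

The row counts in the simulated output should then be multiples of this number, one full fixture list per simulated season.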
Suppose we know some results. I'll use yesterday's matches as an illustration.
knownResults <- tribble(~HomeTeam, ~AwayTeam, ~HomeGoals, ~AwayGoals,
"Burnley", "Sheffield United", 1, 1,
"Newcastle United", "West Ham United", 2, 2,
"Liverpool", "Aston Villa", 2, 0,
"Southampton", "Manchester City", 1, 0)
resultsSoFar <- fixtures %>%
left_join(knownResults, by=c("HomeTeam", "AwayTeam"))
resultsSoFar %>% filter(!is.na(HomeGoals))
# A tibble: 4 x 4
HomeTeam AwayTeam HomeGoals AwayGoals
<chr> <chr> <dbl> <dbl>
1 Burnley Sheffield United 1 1
2 Liverpool Aston Villa 2 0
3 Newcastle United West Ham United 2 2
4 Southampton Manchester City 1 0
Now some utility functions. You could certainly combine them, but I think it's clearer to keep them separate so you can see exactly what each one is doing.
First, a function to simulate the results of all matches whose results are unknown. The details of how you simulate the scores are entirely arbitrary. I've assumed that home teams score an average of 1.5 goals a game, away teams score 1.2 goals per game. Later on, I'm going to use this to simulate many seasons in one go, so I'll add a variable (Iteration) to index the season.
simulateResults <- function(i=NA, data) {
n <- nrow(data)
data %>%
add_column(Iteration=i, .before=1) %>%
mutate(
# Give the home team a slight advantage
HomeGoals=ifelse(is.na(HomeGoals), floor(rexp(n, rate=1/1.5)), HomeGoals),
AwayGoals=ifelse(is.na(AwayGoals), floor(rexp(n, rate=1/1.2)), AwayGoals)
)
}
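One detail of this scoring model is worth checking. Flooring an exponential draw gives a geometric distribution whose mean is 1/(exp(rate) - 1), which for rate = 1/1.5 is about 1.06 rather than 1.5. The sketch below compares the simulated mean with that formula; the Poisson line is only an assumed alternative for anyone who wants the average to land on 1.5, and the variable names are made up.

```r
set.seed(42)
lambda <- 1 / 1.5                       # rate used for home goals

# Floored exponential draws, as in the home-goals simulation.
draws <- floor(rexp(100000, rate = lambda))
simulated_mean <- mean(draws)

# Mean of a floored exponential (a geometric distribution on 0, 1, 2, ...).
theoretical_mean <- 1 / (exp(lambda) - 1)

# Assumed alternative: Poisson counts with mean 1.5.
poisson_mean <- mean(rpois(100000, lambda = 1.5))

print(c(simulated = simulated_mean,
        theoretical = theoretical_mean,
        poisson = poisson_mean))
```

Since the scoring details are arbitrary anyway, this does not change the structure of the simulation; it only affects how the 1.5 and 1.2 parameters should be read.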
Use it, and check that we haven't overwritten known results:
simulateResults(1, resultsSoFar) %>% filter(HomeTeam=="Burnley", AwayTeam=="Sheffield United")
# A tibble: 1 x 5
Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <chr> <chr> <dbl> <dbl>
1 1 Burnley Sheffield United 1 1
I'm going to parallelise the overall simulation, so now let's have a function to simulate a chunk of simulations. Again, create an index column to identify the chunk.
simulateChunk <- function(chunkID=NA, n) {
bind_rows(lapply(1:n, simulateResults, data=resultsSoFar)) %>%
add_column(Chunk=chunkID, .before=1)
}
simulateChunk(chunkID=1, n=3)
# A tibble: 1,140 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<dbl> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 0 0
3 1 1 Arsenal Brighton & Hove Albion 2 0
4 1 1 Arsenal Burnley 2 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 2 3
8 1 1 Arsenal Leicester City 2 0
9 1 1 Arsenal Liverpool 0 1
10 1 1 Arsenal Manchester City 4 0
OK. Now I'm ready to do the main simulation work. I'll run 10 chunks of 100 simulations each, to give 1000 simulated seasons in total, the same as you had.
library(doParallel)
cl <- makeCluster(3)
registerDoParallel(cl)
chunkSize <- 100
nChunks <- 10
startedAt <- Sys.time()
x <- bind_rows(foreach(i=1:nChunks, .packages=c("tidyverse")) %dopar% simulateChunk(i, n=chunkSize))
finishedAt <- Sys.time()
print(finishedAt - startedAt)
Time difference of 6.772928 secs
stopCluster(cl)
> x
# A tibble: 380,000 x 6
Chunk Iteration HomeTeam AwayTeam HomeGoals AwayGoals
<int> <int> <chr> <chr> <dbl> <dbl>
1 1 1 Arsenal Aston Villa 2 0
2 1 1 Arsenal Bournemouth 3 1
3 1 1 Arsenal Brighton & Hove Albion 0 1
4 1 1 Arsenal Burnley 3 0
5 1 1 Arsenal Chelsea 1 0
6 1 1 Arsenal Crystal Palace 0 0
7 1 1 Arsenal Everton 1 2
8 1 1 Arsenal Leicester City 0 0
9 1 1 Arsenal Liverpool 0 0
10 1 1 Arsenal Manchester City 0 0
Let's check I've got sensible results. As a basic check, I'll look at the results of Arsenal vs Aston Villa:
x %>%
filter(HomeTeam == "Arsenal", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
values_from="N", names_prefix="AwayGoals",
names_sep="", names_from=AwayGoals
)
# A tibble: 8 x 10
HomeGoals AwayGoals0 AwayGoals1 AwayGoals2 AwayGoals3 AwayGoals4 AwayGoals5 AwayGoals6 AwayGoals8 AwayGoals7
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 299 129 57 19 12 7 NA NA NA
2 1 135 63 25 6 4 4 1 2 NA
3 2 75 21 12 9 4 1 NA NA 1
4 3 30 13 10 1 NA NA NA NA NA
5 4 21 7 1 1 NA NA NA NA NA
6 5 11 2 1 NA 2 NA NA NA NA
7 6 4 2 2 NA NA NA NA NA NA
8 7 4 1 1 NA NA NA NA NA NA
That looks reasonable. Now confirm that the matches with known results don't vary. For example:
x %>%
filter(HomeTeam == "Liverpool", AwayTeam=="Aston Villa") %>%
group_by(HomeGoals, AwayGoals) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(values_from="N", names_prefix="AwayGoals", names_sep="", names_from=AwayGoals)
HomeGoals AwayGoals0
<dbl> <int>
1 2 1000
All good.
So, that's 23 statements to generate the fixtures, take account of known results, simulate the remainder of the matches and do some basic sanity checking. I could easily get that down to under 20 statements if I had to. That's about a third less than you were using just to try to simulate the unknown results. [The actual simulation takes fewer than 10 statements.] I think my approach is easier to understand: by using tidy verbs the code is almost self-documenting.

For loops in R with added names to i

I've been having some trouble adding in titles to created data frames with for loops. This is the basic structure of the data:
> head(year2019)
date day_of_week away_team away_team_game_number home_team home_team_game_number away_score home_score day_night park game_length away_AB away_H away_2B
1 2019-03-20 Wed SEA 1 OAK 1 9 7 N OAK01 204 31 7 1
2 2019-03-21 Thu SEA 2 OAK 2 5 4 N OAK01 267 43 9 4
3 2019-03-28 Thu PIT 1 CIN 1 3 5 D CIN09 174 31 5 0
4 2019-03-28 Thu ARI 1 LAN 1 5 12 D LOS03 169 33 9 4
5 2019-03-28 Thu COL 1 MIA 1 6 3 D MIA02 175 36 9 5
6 2019-03-28 Thu SLN 1 MIL 1 4 5 D MIL06 156 32 5 0
With this code I've been able to create dataframes for each team:
for (i in teams) {
assign(i, year2019 %>% filter(away_team == i | home_team == i))
}
With teams <- c("ANA", "ARI", "ATL", ...)
However I want to run this with creating both home and away teams. I've tried some of the following but nothing has worked so far:
for(i in teams) {
i_home <- i %>% filter(home_team == i)
i_away <- i %>% filter(away_team == i)
}
Or
for (i in teams) {
i1 <- filter(year2019, home_team == i)
i2 <- filter(year2019, away_team == i)
}
Any advice on how to properly introduce added names for i in this?
Avoid flooding your global environment with many separate team data frames. Instead, use one list of many data frame elements which is easily index-able and searchable. For this, consider by or split.
# by() and split() are alternatives; use one or the other
home_team_dfs <- by(year2019, year2019$home_team, identity)
home_team_dfs <- split(year2019, year2019$home_team)
# RUN SELECT OPERATIONS ON DATA FRAMES
head(home_team_dfs$ANA)
tail(home_team_dfs$ARI)
summary(home_team_dfs$ATL)
# same choice of by() or split() for the away games
away_team_dfs <- by(year2019, year2019$away_team, identity)
away_team_dfs <- split(year2019, year2019$away_team)
# RUN SELECT OPERATIONS ON DATA FRAMES
head(away_team_dfs$ANA)
tail(away_team_dfs$ARI)
summary(away_team_dfs$ATL)
Note that a data frame loses no functionality when it is stored in a list, so any operation you need (e.g., head, tail, summary) is still available. You can also run consistent, repeatable operations across the list with the apply family of functions, working on one, several, or all of the underlying data frames.
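As a rough illustration of running the same operation over every element of such a list, here is a sketch on a tiny made-up stand-in for year2019 (four games, only the columns needed). The summary variables home_games and mean_home_score are hypothetical names.

```r
# Tiny made-up stand-in for the real year2019 data.
year2019 <- data.frame(
  away_team  = c("SEA", "SEA", "PIT", "OAK"),
  home_team  = c("OAK", "OAK", "CIN", "SEA"),
  away_score = c(9, 5, 3, 2),
  home_score = c(7, 4, 5, 6),
  stringsAsFactors = FALSE
)

# One list of home data frames and one of away data frames, keyed by team.
home_team_dfs <- split(year2019, year2019$home_team)
away_team_dfs <- split(year2019, year2019$away_team)

# The same operation applied to every team's home data frame.
home_games      <- sapply(home_team_dfs, nrow)
mean_home_score <- sapply(home_team_dfs, function(d) mean(d$home_score))

print(home_games)
print(mean_home_score)
```

Individual teams stay reachable by name, e.g. home_team_dfs$OAK or home_team_dfs[["OAK"]], so nothing is lost compared with separate global variables.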

Frequency count with multiple conditions R

Have a data frame
Date Team Opponent Weather Outcome
2017-05-01 All Stars B Stars Rainy 1
2017-05-02 All Stars V Stars Rainy 1
2017-05-03 All Stars M Trade Sunny 0
.
.
2017-05-11 All Stars Vdronee Sunny 0
Where Outcome 1 indicates a win. I have used the table function to get the frequency and applied condition.
table(df$Outcome, df$Team == "All Stars")
Returns me this
FALSE TRUE
0 1005 30
1 1323 57
So frequency of win is 57/87 =0.655
Two Questions:
Rather than calculating the win frequency manually, how do I embed this directly in a formula?
and
How do I filter based on the x most recent observations? i.e something like
table(df$Outcome, df$Team == "All Stars" & df$date = filtering for the 5 most recent observations)
thanks
An option is to use data.table:
library(data.table)
dt <- data.table(df)
dt[, .(prop = sum(Outcome) / .N), by = Team]
To get the 5 most recent observations per team, you can do the following:
dt[order(Date), tail(.SD, 5), by = Team][, .(prop = sum(Outcome) / .N), by = Team]
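The same two quantities can also be computed in base R. The sketch below uses a small made-up data frame with the question's column names; the helper win_rate and its last_n argument are hypothetical. Because Outcome is coded 0/1, the mean of the column is the win proportion.

```r
# Made-up data in the shape described in the question.
df <- data.frame(
  Date    = as.Date("2017-05-01") + c(0:6, 0:2),
  Team    = c(rep("All Stars", 7), rep("B Stars", 3)),
  Outcome = c(1, 1, 0, 1, 0, 0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: win proportion for one team, optionally restricted
# to that team's last_n most recent games.
win_rate <- function(df, team, last_n = NULL) {
  games <- df[df$Team == team, ]
  games <- games[order(games$Date), ]           # oldest first
  if (!is.null(last_n)) games <- tail(games, last_n)
  mean(games$Outcome)                           # mean of 0/1 = proportion of wins
}

overall <- win_rate(df, "All Stars")
recent  <- win_rate(df, "All Stars", last_n = 5)
print(c(overall = overall, recent = recent))
```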

How to select specific rows from a split list in R based on column condition

I am new to R and to programming in general and am looking for feedback on how to approach what is probably a fairly simple problem in R.
I have the following dataset:
df <- data.frame(county = rep(c("QU","AN","GY"), 3),
park = (c("Downtown","Queens", "Oakville","Squirreltown",
"Pinhurst", "GarbagePile","LottaTrees","BigHill",
"Jaynestown")),
hectares = c(12,42,6,18,92,6,4,52,12))
df<-transform(df, parkrank = ave(hectares, county,
FUN = function(x) rank(x, ties.method = "first")))
Which returns a dataframe looking like this:
county park hectares parkrank
1 QU Downtown 12 2
2 AN Queens 42 1
3 GY Oakville 6 1
4 QU Squirreltown 18 3
5 AN Pinhurst 92 3
6 GY GarbagePile 6 2
7 QU LottaTrees 4 1
8 AN BigHill 52 2
9 GY Jaynestown 12 3
I want to use this to create a two-column data frame that lists each county and the park name corresponding to a specific rank (e.g. if I pass 2 as an argument when calling my function, it shows the second biggest park in each county).
I am very new to R and programming and have spent hours looking over the built in R help files and similar questions here on stack overflow but I am clearly missing something. Can anyone give a simple example of where to begin? It seems like I should be using split then lapply or maybe tapply, but everything I try leaves me very confused :(
Thanks.
Try,
df2 <- function(A,x) {
# A is the name of the data.frame() and x is the rank No
df <- A[A[,4]==x,]
return(df)
}
> df2(df,2)
county park hectares parkrank
1 QU Downtown 12 2
6 GY GarbagePile 6 2
8 AN BigHill 52 2
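If only the two columns from the question are wanted, the same idea can select columns by name. This is a sketch on the question's data; the function name park_at_rank and argument k are made up for illustration.

```r
df <- data.frame(
  county   = rep(c("QU", "AN", "GY"), 3),
  park     = c("Downtown", "Queens", "Oakville", "Squirreltown",
               "Pinhurst", "GarbagePile", "LottaTrees", "BigHill",
               "Jaynestown"),
  hectares = c(12, 42, 6, 18, 92, 6, 4, 52, 12),
  stringsAsFactors = FALSE
)
df <- transform(df, parkrank = ave(hectares, county,
                FUN = function(x) rank(x, ties.method = "first")))

# Hypothetical helper: county and park name for the park ranked k in each county.
park_at_rank <- function(data, k) {
  out <- data[data$parkrank == k, c("county", "park")]
  rownames(out) <- NULL    # drop the original row numbers
  out
}

second <- park_at_rank(df, 2)
print(second)
```

Referring to columns by name rather than by position (A[,4]) keeps the function working if the column order changes.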

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
"dunk on brayden",
"record deal",
"fame and fortune",
NA,
"female attention",
NA,NA,NA,NA)
toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
complete = na.omit(toy.df)
toy.df[is.na(toy.df$category), c("category", "description")] =
complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
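For reuse, the row-wise infill can be wrapped in a small function and checked afterwards. This is a sketch with made-up names (infill_rows, donors, picks); it samples whole donor rows, so each filled category keeps its matching description.

```r
toy.df <- data.frame(
  prjnumber   = 1:10,
  category    = c("based", "trill", "lit", "cold", NA, "epic", NA, NA, NA, NA),
  description = c("skip class", "dunk on brayden", "record deal",
                  "fame and fortune", NA, "female attention", NA, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill incomplete rows with values copied from randomly
# chosen complete rows, copying all of `cols` from the same donor row.
infill_rows <- function(data, cols) {
  missing <- !complete.cases(data[, cols])
  donors  <- which(!missing)
  picks   <- donors[sample.int(length(donors), sum(missing), replace = TRUE)]
  data[missing, cols] <- data[picks, cols]
  data
}

set.seed(1)
filled <- infill_rows(toy.df, c("category", "description"))
print(filled)
```

Sampling positions with sample.int, rather than calling sample on the donor indices directly, avoids the surprise sample gives when handed a single number.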
You could try
library(dplyr)
toy.df %>%
mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
mutate(indx= replace(row_number(), is.na(category),
sample(row_number()[!is.na(category)], replace=TRUE))) %>%
mutate_each(funs(.[indx]), 2:3) %>%
select(-indx)
Using base R to fill in a single field at a time, use something like (not preserving the correlation between the fields):
fields <- c('category','description')
for(field in fields){
missings <- is.na(toy.df[[field]])
toy.df[[field]][missings] <- sample(toy.df[[field]][!missings],sum(missings),T)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[,fields],
1,
function(x)any(is.na(x)))
toy.df[missings,fields] <- toy.df[!missings,fields][sample(sum(!missings),
sum(missings),
T),]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[,fields]))
