R: how to aggregate rows by count

This is my data frame
ID=c(1,2,3,4,5,6,7,8,9,10,11,12)
favFruit=c('apple','lemon','pear',
'apple','apple','pear',
'apple','lemon','pear',
'pear','pear','pear')
surveyDate = c('1/1/2005','1/1/2005','1/1/2005',
'2/1/2005','2/1/2005','2/1/2005',
'3/1/2005','3/1/2005','3/1/2005',
'4/1/2005','4/1/2005','4/1/2005')
df <- data.frame(ID, favFruit, surveyDate)
I need to aggregate it so I can plot a line graph in R of the count of favFruit by date, with one line per fruit, but I am unable to create the aggregate table. My data has 45,000 rows, so a manual solution is not possible. The result should look like this:
surveyDate favFruit count
1/1/2005 apple 1
1/1/2005 lemon 1
1/1/2005 pear 1
2/1/2005 apple 2
2/1/2005 lemon 0
2/1/2005 pear 1
... etc
I tried this, but R printed an error:
df2 <- aggregate(df, favFruit, FUN = sum)
and I tried this, which gave another error:
df2 <- aggregate(df, date ~ favFruit, sum)
I checked for solutions online, but their data generally included a column of quantities, which I don't have, and the solutions were overly complex. Is there an easy way to do this? Thanks in advance. Thank you to whoever suggested the link as a possible duplicate, but that question only counts rows by date. My question needs the number of rows by date and favFruit (one more column).
Update:
Ronak Shah's solution worked. Thanks!
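For readers landing here: the accepted answer is not reproduced above, but counting rows per group is a one-liner in base R. A minimal sketch (not necessarily the accepted answer's exact code; note it drops groups with zero rows, such as lemon on 2/1/2005):
# length() counts the rows in each (surveyDate, favFruit) group
df2 <- aggregate(ID ~ surveyDate + favFruit, data = df, FUN = length)
names(df2)[3] <- 'count'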

The solution provided by Ronak is very good. In case you prefer to keep the zero counts in your data frame, you could use the table function:
data.frame(with(df, table(favFruit, surveyDate)))
Output:
favFruit surveyDate Freq
1 apple 1/1/2005 1
2 lemon 1/1/2005 1
3 pear 1/1/2005 1
4 apple 2/1/2005 2
5 lemon 2/1/2005 0
6 pear 2/1/2005 1
7 apple 3/1/2005 1
8 lemon 3/1/2005 1
9 pear 3/1/2005 1
10 apple 4/1/2005 0
11 lemon 4/1/2005 0
12 pear 4/1/2005 3
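Since the original goal was a line graph, the Freq column above plugs straight into ggplot2. A sketch, assuming ggplot2 is installed (surveyDate comes back as a factor, so the group aesthetic is needed to draw lines across a discrete axis):
library(ggplot2)
df2 <- data.frame(with(df, table(favFruit, surveyDate)))
ggplot(df2, aes(x = surveyDate, y = Freq, colour = favFruit, group = favFruit)) +
geom_line()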

Related

imputing missing values in R dataframe

I am trying to impute missing values in my dataset by matching against values in another dataset.
This is my data:
df1 %>% head()
<V1> <V2>
1 apple NA
2 cheese NA
3 butter NA
df2 %>% head()
<V1> <V2>
1 apple jacks
2 cheese whiz
3 butter scotch
4 apple turnover
5 cheese sliders
6 butter chicken
7 apple sauce
8 cheese doodles
9 butter milk
This is what I want df1 to look like:
<V1> <V2>
1 apple jacks, turnover, sauce
2 cheese whiz, sliders, doodles
3 butter scotch, chicken, milk
This is my code:
df1$V2[is.na(df1$V2)] <- df2$V2[match(df1$V1,df2$V1)][which(is.na(df1$V2))]
This code works, however it only pulls the first matching value for each row and ignores the rest.
Another solution using just base R:
aggregate(df2$V2, list(df2$V1), c, simplify=F)
Group.1 x
1 apple jacks, turnover, sauce
2 butter scotch, chicken, milk
3 cheese whiz, sliders, doodles
I don't think you even need to import df1 in this case; you can do it all based on df2:
library(dplyr)
df1 <- df2 %>% group_by(V1) %>% summarise(V2 = paste0(V2, collapse = ", "))
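If you do need to fill the existing df1 rather than rebuild it, a base-R sketch that combines aggregate with the match call from the question (assuming plain V1/V2 column names):
# Collapse df2 to one comma-separated string per V1 value
ag <- aggregate(V2 ~ V1, data = df2, FUN = paste, collapse = ', ')
# Look up each df1$V1 in the collapsed table
df1$V2 <- ag$V2[match(df1$V1, ag$V1)]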

Find the favorite and analyse sequence questions in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analyses. First, I want to create a dataframe of who has chosen whom the most times. (I know that the result depends on whether a participant was already nominated, and therefore cannot be nominated again on that day; I will handle that later, but for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code shall I use for it without having to manually put the names in a different dataframe?
2. How shall I handle a tie, for example if Josh chose Veronica and Tim the same number of times? Later I want to visualise the results and I have no idea how to handle ties.
I also would like to analyse the results to visualise strong connections.
Like to show that there are people who usually chose each other, etc.
Is there a good package that is specialised for these? Or how should I get to it?
I do not need DNA sequence analysis, only simple sequences like these, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurrences of who chose whom as next speaker. I added a fourth day to have some counts that are not 1. There are ties in the result; choosing the first pair of each group by speaker ('who') may be a solution:
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
library(dplyr) # for lead(), filter(), group_by(), summarise(), arrange()
purrr::map(colnames(df)[-1],
function (x) {
# speakers in the order they spoke on day x; absentees (NA) are dropped
who <- df$Name[order(df[[x]], na.last=NA)]
# pair each speaker with the person who spoke next (their nominee)
data.frame(who, lead(who), stringsAsFactors=FALSE)
}
) %>%
replyr::replyr_bind_rows() %>%
filter(!is.na(lead.who.)) %>%
group_by(who, lead.who.) %>% summarise(n=n()) %>%
arrange(who, desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
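For the visualisation part of the question, the who/lead.who./n table above is already an edge list, so a graph package is a natural fit. A sketch, assuming the igraph package and that the summarised tibble from the pipeline is stored in res:
library(igraph)
# The first two columns become the directed edges; n becomes an edge attribute
g <- graph_from_data_frame(res, directed = TRUE)
# Thicker edges mark pairs who chose each other more often
plot(g, edge.width = E(g)$n, edge.arrow.size = 0.5)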

Complex merging in R with duplicate matching values in y set producing problems

So I'm trying to merge two dataframes. Dataframe x looks something like:
Name ParentID
Steve 1
Kevin 1
Stacy 1
Paula 4
Evan 7
Dataframe y looks like:
ParentID OtherStuff
1 things
2 stuff
3 item
4 ideas
5 short
6 help
7 me
The dataframe I want would look like:
Name ParentID OtherStuff
Steve 1 things
Kevin 1 things
Stacy 1 things
Paula 4 ideas
Evan 7 me
Using a left merge gives me substantially more observations than I want, with many duplicates. Any idea how to merge these so that rows of y are repeated where appropriate to match x?
I'm working with databases set up similarly to the example. x has 5013 observations, while y has 6432. Using the merge function as described by Joel and thelatemail gives me 1627727 observations.
We can use match from base R (with x as df1 and y as df2):
df1$OtherStuff <- with(df1, df2$OtherStuff[match(ParentID, df2$ParentID)])
df1
# Name ParentID OtherStuff
#1 Steve 1 things
#2 Kevin 1 things
#3 Stacy 1 things
#4 Paula 4 ideas
#5 Evan 7 me
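For completeness, merge itself also gives exactly this result as long as ParentID is unique in df2; an explosion to 1,627,727 rows usually means the key is duplicated on the y side. A hedged sketch:
# Left join: keep every row of df1, look up OtherStuff from df2
merged <- merge(df1, df2, by = 'ParentID', all.x = TRUE)
# If rows still multiply, deduplicate the lookup table first
merged <- merge(df1, unique(df2[c('ParentID', 'OtherStuff')]), by = 'ParentID', all.x = TRUE)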

fill missing group data with 0s in a data.table [duplicate]

This question already has answers here:
add missing rows to a data table
(2 answers)
Closed 6 years ago.
This isn't a dupe of this. That question deals with rows which already have NAs in them; my question deals with missing rows for which there should be a data point of 0.
Let's say I have this data.table
library(data.table)
dt <- data.table(id=c(1,2,4,5,6,1,3,4,5,6),
varname=c(rep('banana',5),rep('apple',5)),
thedata=runif(10,1,10))
What's the best way to add, for each varname, the missing ids with a 0 for thedata?
At the moment I dcast with fill=0 and then melt again but this doesn't seem very efficient.
melt(dcast.data.table(dt, id~varname, value.var='thedata', fill=0),
id.var='id', variable.factor=FALSE,
variable.name='varname', value.name='thedata')
I also just thought of doing it this way, but it gets a little clunky filling in the NAs at the end:
merge(dt[,CJ(id=unique(id),varname=unique(varname))], dt,
by=c('varname','id'), all=TRUE)[
,.(varname,id,thedata=ifelse(!is.na(thedata),thedata,0))]
In this example, I only used one id column but any additional suggestion should be extensible to having more than one id column.
EDIT
I did a system.time on each approach with a largish data set: the melt/dcast approach took 2-3 seconds, while the merge/CJ approach took 12-13.
EDIT2
Roland's CJ approach is much better than mine: it took only 4-5 seconds with my dataset.
Is there a better way to do this?
setkey(dt, varname, id)
dt[CJ(unique(varname), unique(id))]
# id varname thedata
# 1: 1 apple 9.083738
# 2: 2 apple NA
# 3: 3 apple 7.332652
# 4: 4 apple 3.610315
# 5: 5 apple 7.113414
# 6: 6 apple 9.046398
# 7: 1 banana 3.973751
# 8: 2 banana 9.907012
# 9: 3 banana NA
#10: 4 banana 9.308346
#11: 5 banana 1.572314
#12: 6 banana 7.753611
Then substitute NA with 0 if you must (usually not appropriate).
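If the zeros are required, the NA rows the CJ join introduces can be filled in place by reference. A minimal sketch following Roland's approach:
setkey(dt, varname, id)
dt2 <- dt[CJ(unique(varname), unique(id))]
# Replace the NAs created for the missing (varname, id) pairs with 0
dt2[is.na(thedata), thedata := 0]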

Inserting new rows to dataframe without losing format

I am trying to create a large empty data.frame and insert a groups of row. I have seen a few similar questions on numerous forums, however I have been unable to apply any of them successfully to the specific formatting issue I am having.
I started with rbind(df, allic) (allic is the data frame I would like to insert into df); however, given the size of my dataset, the operation takes 5 1/2 minutes to complete. I understand that creating the data frame at the beginning and replacing rows improves efficiency, however I have been unable to make that work for my problem. Code is as follows:
Initial data:
Order.ID Product
1 193505 Onion Rings
2 193505 Pineapple Cheddar Burger
3 193623 Fountain Soda
4 193623 French Fries
5 193623 Hamburger
6 193623 Hot Dog
7 193631 French Fries
8 193631 Hamburger
9 193631 Milkshake
The products won't match those below; however, this being a formatting issue, I figured it best to show the steps that brought me to where I am now.
nb$Order.ID <- as.factor(nb$Order.ID)
plist <- aggregate(nb$Product,list(nb$Order.ID),list)
allp <- unique(unlist(plist$x))
allic <- expand.grid(plist$x[[1]], Var2=plist$x[[1]], Var3=1)
Var1 Var2 Var3
1 Onion Rings Onion Rings 1
2 Pineapple Cheddar Burger Onion Rings 1
3 Onion Rings Pineapple Cheddar Burger 1
4 Pineapple Cheddar Burger Pineapple Cheddar Burger 1
Now I create an empty dataframe (df) using:
df <- data.frame(factor=rep(NA, rcnt), factor=rep(NA,rcnt), stringsAsFactors=FALSE)
rcnt being a large, arbitrary number which I plan to trim once the operation is complete. My issue comes when I try to insert these lines using:
df[1:4,] <- allic
head(df, n=10)
factor factor.1
1 47 47
2 51 47
3 47 51
4 51 51
5 NA NA
6 NA NA
7 NA NA
8 NA NA
How can I insert rows in a dataframe without losing the format of my values? I would greatly appreciate any help I can get at this point.
EDIT Per comment below:
for(i in 1:nrow(plist)) {
allic <- expand.grid(plist$x[[i]], Var2=plist$x[[i]], Var3=1)
df[i:nrow(allic),] <- sapply(allic, as.character)
}
I'm still very new to R; however, this was working when I was using df <- rbind(df, allic). nrow(df) is 4096.
Try wrapping allic in as.character as follows:
df[1:4,] <- sapply(allic, as.character)
> df
factor factor.1
1 Onion Rings Onion Rings
2 Pineapple Cheddar Burger Onion Rings
3 Onion Rings Pineapple Cheddar Burger
4 Pineapple Cheddar Burger Pineapple Cheddar Burger
5 <NA> <NA>
6 <NA> <NA>
7 <NA> <NA>
8 <NA> <NA>
9 <NA> <NA>
10 <NA> <NA>
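A sketch of an alternative that attacks the root cause: expand.grid builds factor columns by default, and the preallocated df had logical NA columns, so factor codes leaked through during assignment. Creating both with character columns avoids the conversion step entirely (the column names and two-column shape here are illustrative):
# Keep the product names as character, not factor
allic <- expand.grid(Var1 = plist$x[[1]], Var2 = plist$x[[1]],
stringsAsFactors = FALSE)
# Preallocate with character columns instead of NA (logical) ones
df <- data.frame(Var1 = character(rcnt), Var2 = character(rcnt),
stringsAsFactors = FALSE)
df[seq_len(nrow(allic)), ] <- allic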
