Let's say that I have the following data frame with an id column and three interest columns.
data = data.frame(id=c(1:10), interest_1=c("food","","","drugs","beer","soda","","","drugs","sports"),
interest_2=c("fruits","car","jeans","","","","soda","shoes","","drugs"),
interest_3=c("","","","","soda","sports","","","",""))
data
I want to count how many times each distinct combination of interests occurs.
For example, the combination where interest_1 is food, interest_2 is fruits, and interest_3 is empty occurs only once.
  id interest_1 interest_2 interest_3
1  1       food     fruits
The combination where interest_1 is drugs and both interest_2 and interest_3 are empty occurs twice.
id interest_1 interest_2 interest_3
 4      drugs
 9      drugs
I want to count the number of times that each combination occurs. How would I go about doing this?
Output should look like:
interest_1 interest_2 interest_3 count
      food     fruits                1
                  car                1
                jeans                1
     drugs                           2
> aggregate(id~.,data,length)
  interest_1 interest_2 interest_3 id
1      drugs                        2
2                   car             1
3     sports      drugs             1
4       food     fruits             1
5                 jeans             1
6                 shoes             1
7                  soda             1
8       beer                  soda  1
9       soda                sports  1
Basically, this means: apply the function length to the vector of id values for each combination of the other columns. The resulting id column therefore holds the count for each combination.
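To make that description concrete, here is a minimal sketch that runs the same aggregate call on the question's data and cross-checks one count by hand. The names counts, key and manual are only illustrative, and renaming id to count is an optional extra step, not part of the original answer.

```r
# Same data as in the question; stringsAsFactors = FALSE keeps the columns as plain strings.
data <- data.frame(
  id = 1:10,
  interest_1 = c("food", "", "", "drugs", "beer", "soda", "", "", "drugs", "sports"),
  interest_2 = c("fruits", "car", "jeans", "", "", "", "soda", "shoes", "", "drugs"),
  interest_3 = c("", "", "", "", "soda", "sports", "", "", "", ""),
  stringsAsFactors = FALSE
)

# One row per distinct combination; the id column holds the group size.
counts <- aggregate(id ~ ., data, length)
names(counts)[names(counts) == "id"] <- "count"  # optional rename

# Cross-check by hand: build one key per row and count the ids under each key.
key <- paste(data$interest_1, data$interest_2, data$interest_3, sep = "|")
manual <- tapply(data$id, key, length)

print(counts)
print(unname(manual["drugs||"]))  # 2, matching the "drugs" row above
```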
require(plyr)
ddply(data, .(interest_1, interest_2, interest_3), c("nrow"))
How do you find the row numbers associated with a specific text string?
I need the function to return [1] 1 2 3 4 from the dataset below.
RowNumber Name
1 Diet Pepsi
2 Pepsi Max
3 Diet Pepsi 20 oz
4 Pepsi
5 Coca-Cola
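So you're looking for "Pepsi"? grepl returns TRUE for every Name that contains the string, and which converts those to row numbers: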
which(grepl("Pepsi", dataset$Name))
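As a quick check, the sketch below rebuilds the sample data (the data frame construction is my own reconstruction of the table in the question) and runs the answer's call. It also shows grep, which returns the matching indices directly; fixed = TRUE treats the pattern as a literal string rather than a regular expression.

```r
# Reconstruction of the example table from the question.
dataset <- data.frame(
  RowNumber = 1:5,
  Name = c("Diet Pepsi", "Pepsi Max", "Diet Pepsi 20 oz", "Pepsi", "Coca-Cola"),
  stringsAsFactors = FALSE
)

# The answer's approach: logical match per row, then positions of the TRUEs.
idx_which <- which(grepl("Pepsi", dataset$Name))

# Equivalent shortcut: grep returns the matching positions itself.
idx_grep <- grep("Pepsi", dataset$Name, fixed = TRUE)

print(idx_which)  # 1 2 3 4
print(idx_grep)   # 1 2 3 4
```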
We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the speaking order for each day.
The columns are day1, day2, day3, etc.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name day1 day2 day3 day4 ...
Albert 1 3 1 ...
Josh 2 2 NA
Veronica 3 5 3
Tim 4 1 2
Stew 5 4 4
...
I want to create two analyses. First, I want a dataframe showing whom each person has chosen most often. (I know that the result depends on whether a participant was already nominated earlier that day and therefore cannot be nominated again. I will handle that later; for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code should I use for this without having to manually put the names in a different dataframe?
2. How should I handle a tie, for example when Josh chose Veronica and Tim the same number of times? Later I want to visualise it and I have no idea how to handle ties.
Second, I would also like to analyse the results to visualise strong connections,
for example to show that there are people who usually choose each other.
Is there a good package that is specialised for this? Or how should I approach it?
I do not need DNA sequences, only simple ones like these, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurrences of who chose whom as the next speaker. I added a fourth day so that some counts are greater than 1. There are ties in the result; taking the first pair in each group by speaker ('who') may be a solution:
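library(dplyr)  # provides %>%, lead(), filter(), group_by(), summarise(), arrange(); the replyr package must also be installed
df <- read.table(textConnection(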
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
purrr::map(colnames(df)[-1],
function (x) {
who <- df$Name[order(df[x],na.last=NA)]
data.frame(who,lead(who),stringsAsFactors=FALSE)
}
) %>%
replyr::replyr_bind_rows() %>%
filter(!is.na(lead.who.)) %>%
group_by(who,lead.who.) %>% summarise(n=n()) %>%
arrange(who,desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
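The tie question can be sketched in base R as follows. This is only one possible approach, not the answerer's code: it rebuilds the same pair counts without dplyr or replyr and then keeps every partner that reaches the per-speaker maximum, so ties stay visible instead of being dropped arbitrarily. The names pairs, counts, top and favorites are illustrative.

```r
df <- read.table(text = "Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5", header = TRUE, sep = ",", stringsAsFactors = FALSE)

# For each day, sort the names by speaking order (absent people are dropped)
# and pair every speaker with the person who spoke next.
pairs <- do.call(rbind, lapply(names(df)[-1], function(day) {
  who <- df$Name[order(df[[day]], na.last = NA)]
  data.frame(who = head(who, -1), chosen = tail(who, -1),
             stringsAsFactors = FALSE)
}))

# Count each (who, chosen) pair and drop pairs that never occurred.
counts <- as.data.frame(table(pairs), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Keep every partner that reaches the speaker's maximum count (ties included).
top <- ave(counts$Freq, counts$who, FUN = max)
favorites <- counts[counts$Freq == top, ]
favorites <- favorites[order(favorites$who, favorites$chosen), ]
print(favorites)
```

With this data Veronica keeps three tied favorites (Josh, Stew and Tim, once each). Taking the first row per speaker, as the answer suggests, would instead pick one of them by sort order.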
In R, I can return the count results using the specific column names I am interested in as an array as below.
require("plyr")
bevs <- data.frame(cbind(name = c("Bill", "Llib"), drink = c("coffee", "tea", "cocoa", "water"), cost = seq(1:8)))
count(bevs, c("name", "drink"))
# produces
name drink freq
1 Bill cocoa 2
2 Bill coffee 2
3 Llib tea 2
4 Llib water 2
How can I get the count result of two specific column names in a matrix which has columns: all unique drinks, rows: all unique names and cells: freqs (like below)?
cocoa coffee tea water
Bill 2 2 0 0
Llib 0 0 2 2
P.S: Obviously, the solution does not need to use plyr.
You want a contingency table, which you can create using table:
table(bevs[, c("name", "drink")])
# drink
#name cocoa coffee tea water
# Bill 2 2 0 0
# Llib 0 0 2 2
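A self-contained sketch of the same idea follows. It builds bevs directly with rep instead of the question's cbind recycling (which should give the same name and drink pairs), and also shows xtabs and as.data.frame.matrix as optional variants in case a formula interface or a plain data frame is preferred.

```r
# Same name/drink pairs as the question's recycled cbind() construction.
bevs <- data.frame(
  name  = rep(c("Bill", "Llib"), times = 4),
  drink = rep(c("coffee", "tea", "cocoa", "water"), times = 2),
  cost  = 1:8,
  stringsAsFactors = FALSE
)

# Contingency table: rows are names, columns are drinks, cells are counts.
tab <- table(bevs[, c("name", "drink")])

# Formula-based equivalent.
tab_x <- xtabs(~ name + drink, data = bevs)

# Plain data frame with the same layout, if a table object is inconvenient.
freq_df <- as.data.frame.matrix(tab)

print(tab)
print(freq_df)
```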
I'm trying to make a cross tabulation in R, and having its output resemble as much as possible what I'd get in an Excel pivot table. So, given this code:
set.seed(2)
df <- data.frame(
  ministry   = paste("ministry ", sample(1:3, 20, replace = TRUE)),
  department = paste("department ", sample(1:3, 20, replace = TRUE)),
  program    = paste("program ", sample(letters[1:20], 20, replace = FALSE)),
  budget     = runif(20) * 1e6
)
library(tables)
library(dplyr)
arrange(df,ministry,department,program)
tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
which yields:
                               Avg    Total
ministry   department   Count  budget budget
ministry 1 department 1 5      479871 2399356
           department 2 1      770028  770028
           department 3 1      184673  184673
ministry 2 department 1 2      170818  341637
           department 2 1      183373  183373
           department 3 3      415480 1246440
ministry 3 department 1 0         NaN       0   <---- LOOK HERE
           department 2 5      680102 3400509
           department 3 2      165118  330235
How do I get the output to hide the rows with zero frequencies?
I'm using tables::tabular, but any other package is fine (as long as there's a way, even an indirect one, of outputting to HTML). This is for generating HTML or LaTeX using R Markdown and displaying the table of my script's results as Excel would, in a pivot-table-like form as in the example above, but without the superfluous row.
Thanks!
Why not just use dplyr?
df %>%
group_by(ministry, department) %>%
summarise(count = n(),
avg_budget = mean(budget, na.rm = TRUE),
tot_budget = sum(budget, na.rm = TRUE))
ministry department count avg_budget tot_budget
1 ministry 1 department 1 5 479871.1 2399355.6
2 ministry 1 department 2 1 770027.9 770027.9
3 ministry 1 department 3 1 184673.5 184673.5
4 ministry 2 department 1 2 170818.3 341636.5
5 ministry 2 department 2 1 183373.2 183373.2
6 ministry 2 department 3 3 415479.9 1246439.7
7 ministry 3 department 2 5 680101.8 3400508.8
8 ministry 3 department 3 2 165117.6 330235.3
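For readers without dplyr, a roughly equivalent base R sketch is below. It uses a small made-up data frame (the values are hypothetical, chosen so the result is easy to check) rather than the random data from the question. As with group_by, aggregate only returns combinations that actually occur, so no zero-count rows appear.

```r
# Hypothetical toy data: note there is no row for ministry "m1" with department "d2".
df <- data.frame(
  ministry   = c("m1", "m1", "m2", "m3", "m3"),
  department = c("d1", "d1", "d2", "d2", "d3"),
  budget     = c(100, 300, 50, 10, 20),
  stringsAsFactors = FALSE
)

# aggregate() stores the three statistics in one matrix column;
# do.call(data.frame, ...) flattens it into budget.count, budget.avg, budget.total.
summary_tbl <- do.call(data.frame, aggregate(
  budget ~ ministry + department, df,
  function(x) c(count = length(x), avg = mean(x), total = sum(x))
))

print(summary_tbl)
```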
While I don't understand exactly how the tabular object is built (it says it's a list but seems to behave like a data frame), you can select cells as usual. Keeping only the rows whose first column (Count) is non-zero gives:
> results <-tabular(ministry*department~((Count=budget)+(Avg=(mean*budget))+(Total=(sum*budget))),data=df)
> results[results[,1]!=0,]
                               Avg    Total
ministry   department   Count  budget budget
ministry 1 department 1 5      479871 2399356
           department 2 1      770028  770028
           department 3 1      184673  184673
ministry 2 department 1 2      170818  341637
           department 2 1      183373  183373
           department 3 3      415480 1246440
ministry 3 department 2 5      680102 3400509
           department 3 2      165118  330235
That's the solution.
I found it thanks to a reply by this user on another question: https://stackoverflow.com/users/516548/g-grothendieck