summarize multiple binary variables in a single column - r

In a survey I conducted, I asked about the education level of the participants. The results are spread over several columns as binary variables. I would appreciate efficient ways to combine the results into a single variable. The tables below show the current and desired data formats.
Current format:

ID  high school  college  PhD
1   high school  -1       -1
2   -1           college  -1
3   -1           -1       PhD
4   high school  -1       -1

Desired format:

ID  Educational background
1   high school
2   college
3   PhD
4   high school

To answer your specific question using the tidyverse, with a test dataset created by the code at the end of this post:
library(tidyverse)

df %>%
  mutate(
    across(-ID, function(x) ifelse(x == "-1", NA, x)),
    EducationalBackground = coalesce(high_school, college, PhD)
  )
ID high_school college PhD EducationalBackground
1 1 high_school <NA> <NA> high_school
2 2 <NA> college <NA> college
3 3 <NA> <NA> PhD PhD
4 4 high_school <NA> <NA> high_school
The code works by converting the text values of "-1" in your columns, which I take to be missing value flags, to true missing values. Then I use coalesce to find the first non-missing value in the three columns that contain survey data and place it in the new summary column. This assumes that there will be one and only one non-missing value in each row of the data frame.
That said, my preference would be to avoid the problem altogether by adapting your workflow earlier in the piece. But you haven't given any details of that workflow, so I can't make any specific suggestions.
Test data
df <- read.table(textConnection("ID high_school college PhD
1 high_school -1 -1
2 -1 college -1
3 -1 -1 PhD
4 high_school -1 -1"), header=TRUE)
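Not part of the original answer, but if you want to verify the one-answer-per-row assumption before relying on coalesce, a quick check on this test data:

library(dplyr)

# Convert the "-1" flags to NA, then count non-missing answers per row;
# every row should have exactly one answer, so the table should only show the value 1
df_na <- df %>%
  mutate(across(-ID, function(x) ifelse(x == "-1", NA, x)))

table(rowSums(!is.na(df_na[, c("high_school", "college", "PhD")])))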

Related

Find the favorite and analyse sequence questions in R

We have a daily meeting when participants nominate each other to speak. The first person is chosen randomly.
I have a dataframe that consists of names and the order of speech every day.
I have a day1, a day2 ,a day3 , etc. in the columns.
The data in the rows are numbers, meaning the order of speech on that particular day.
NA means that the person did not participate on that day.
Name      day1  day2  day3  day4 ...
Albert    1     3     1     ...
Josh      2     2     NA
Veronica  3     5     3
Tim       4     1     2
Stew      5     4     4
...
I want to create two analyses. First, I want to create a data frame of who has chosen whom the most times. (I know that the result depends on whether a participant was nominated earlier that day and therefore cannot be nominated again; I will handle that later, but for now this is enough.)
It should look like this:
Name Favorite
Albert Stew
Josh Veronica
Veronica Tim
Tim Stew
...
My questions (feel free to answer only one if you can):
1. What code should I use for this without having to manually put the names in a different data frame?
2. How should I handle a tie, for example if Josh chose Veronica and Tim first the same number of times? Later I want to visualise it, and I have no idea how to handle ties.
I would also like to analyse the results to visualise strong connections, for example to show that there are people who usually choose each other.
Is there a good package that specialises in this, or how should I approach it?
I do not need DNA sequence analysis, only these simple sequences, but I have not found a suitable package yet.
Thanks for your help!
If I am not misunderstanding your problem, here is some code to get the number of occurrences of who chose whom as the next speaker. I added a fourth day to have some counts that are not 1. There are ties in the result; choosing the first pair of each group by speaker ('who') may be a solution:
df <- read.table(textConnection(
"Name,day1,day2,day3,day4
Albert,1,3,1,3
Josh,2,2,,2
Veronica,3,5,3,1
Tim,4,1,2,4
Stew,5,4,4,5"),header=TRUE,sep=",",stringsAsFactors=FALSE)
library(dplyr)

purrr::map(colnames(df)[-1],
           function(x) {
             who <- df$Name[order(df[[x]], na.last = NA)]
             data.frame(who, lead(who), stringsAsFactors = FALSE)
           }) %>%
  replyr::replyr_bind_rows() %>%   # dplyr::bind_rows() would also work here
  filter(!is.na(lead.who.)) %>%
  group_by(who, lead.who.) %>%
  summarise(n = n()) %>%
  arrange(who, desc(n))
Input:
Name day1 day2 day3 day4
1 Albert 1 3 1 3
2 Josh 2 2 NA 2
3 Veronica 3 5 3 1
4 Tim 4 1 2 4
5 Stew 5 4 4 5
Result:
# A tibble: 12 x 3
# Groups: who [5]
who lead.who. n
<chr> <chr> <int>
1 Albert Tim 2
2 Albert Josh 1
3 Albert Stew 1
4 Josh Albert 2
5 Josh Veronica 1
6 Stew Veronica 1
7 Tim Stew 2
8 Tim Josh 1
9 Tim Veronica 1
10 Veronica Josh 1
11 Veronica Stew 1
12 Veronica Tim 1
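To go from that count table to one favorite per speaker while keeping ties visible (the asker's second question), a possible follow-up; counts is just a name for the summarised tibble above, not something defined in the original answer:

library(dplyr)

# counts = the who / lead.who. / n tibble produced by the pipeline above
favorites <- counts %>%
  group_by(who) %>%
  filter(n == max(n)) %>%                                  # keep every tied favourite
  summarise(Favorite = paste(lead.who., collapse = ", "))  # list ties in one cell

favorites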

Show proportion with multiple conditions in R

I have:
> dataframe
GENDER  CITY   NUMBER
Male    NY     1
Female  Paris  2
Male    Paris  1
Female  NY
Female  NY     2
Male    Paris  2
Male    Paris
Male    Paris  1
Female  NY     2
Female  Paris  1
And I would like to return the proportion of Male and Female in each city (for example in NY) who have 2 in the NUMBER column (the data frame is much longer than my example), knowing that there are empty rows in the NUMBER column.
Technically speaking I want to show a proportion with two conditions (and more conditions in the future).
I tried:
prop.table(table(dataframe$GENDER, dataframe$CITY == 'NY' & dataframe$NUMBER == 2))
But this gives me the wrong results.
The expected output (or anything close to this):
NY
Male 0
Female 20
Do you have any idea how I can get this?
The best would be to have a column per city
Use the data.table package; it makes your life much easier. Its syntax is SQL-like and it is very fast if your data grows. The code should be:
library(data.table)
df <- data.table(yourdataframe)
df[, summary(GENDER), by = CITY]
The output should give you the count of each value of GENDER within each CITY.
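That only summarises GENDER by CITY, though. For the proportion the question actually asks for (the share of each gender, per city, with NUMBER equal to 2), a sketch in the same data.table style; the reconstruction of dataframe below is an assumption based on the question's example:

library(data.table)

# Reconstructed from the question's example; the blank NUMBER cells become NA
dataframe <- data.frame(
  GENDER = c("Male", "Female", "Male", "Female", "Female",
             "Male", "Male", "Male", "Female", "Female"),
  CITY   = c("NY", "Paris", "Paris", "NY", "NY",
             "Paris", "Paris", "Paris", "NY", "Paris"),
  NUMBER = c(1, 2, 1, NA, 2, 2, NA, 1, 2, 1)
)

dt <- as.data.table(dataframe)

# Proportion of rows with NUMBER == 2 for each GENDER/CITY combination,
# ignoring the empty (NA) NUMBER rows
dt[!is.na(NUMBER), .(prop_2 = mean(NUMBER == 2)), by = .(GENDER, CITY)]

# One column per city, as the asker suggested
dcast(dt[!is.na(NUMBER)], GENDER ~ CITY, value.var = "NUMBER",
      fun.aggregate = function(x) mean(x == 2))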

Combining 2 columns in R, prioritizing one of them

I know nothing of R. I have a data.frame with 2 columns, both of them about the sex of the animals, but one of them has some corrections and the other doesn't.
My desired data.frame would be like this:
id sex father mother birth.date farm
0 1 john ray 05/06/94 1
1 1 doug ana 18/02/93 NA
2 2 bryan kim 21/03/00 3
But i got to this data.frame by using merge on 2 others data.frames
id sex.x father mother birth.date sex.y farm
0 2 john ray 05/06/94 1 1
1 1 doug ana 18/02/93 NA NA
2 2 bryan kim 21/03/00 2 3
data.frame 1 or Animals (Has the wrong sex for some animals)
id sex father mother birth.date
0 2 john ray 05/06/94
1 1 doug ana 18/02/93
2 2 bryan kim 21/03/00
data.frame 2 or Farm (Has the correct sex):
id farm sex
0 1 1
2 3 2
The code I used was: Animals_Farm <- merge(Animals, Farm, by="id", all.x=TRUE)
I need to combine the 2 sex columns into one, prioritizing sex.y. How do I do that?
If I correctly understand your example, you have a situation similar to what I show below, based on the example from the merge function's documentation.
> (authors <- data.frame(
surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
nationality = c("US", "Australia", "US", "UK", "Australia"),
deceased = c("yes", rep("no", 3), "yes")))
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia yes
> (books <- data.frame(
name = I(c("Tukey", "Venables", "Tierney",
"Ripley", "Ripley", "McNeil", "R Core")),
title = c("Exploratory Data Analysis",
"Modern Applied Statistics ...", "LISP-STAT",
"Spatial Statistics", "Stochastic Simulation",
"Interactive Data Analysis",
"An Introduction to R"),
deceased = c("yes", rep("no", 6))))
name title deceased
1 Tukey Exploratory Data Analysis yes
2 Venables Modern Applied Statistics ... no
3 Tierney LISP-STAT no
4 Ripley Spatial Statistics no
5 Ripley Stochastic Simulation no
6 McNeil Interactive Data Analysis no
7 R Core An Introduction to R no
> (m1 <- merge(authors, books, by.x = "surname", by.y = "name"))
surname nationality deceased.x title deceased.y
1 McNeil Australia yes Interactive Data Analysis no
2 Ripley UK no Spatial Statistics no
3 Ripley UK no Stochastic Simulation no
4 Tierney US no LISP-STAT no
5 Tukey US yes Exploratory Data Analysis yes
6 Venables Australia no Modern Applied Statistics ... no
Here authors might represent your first data frame and books your second, and deceased is the value that exists in both data frames but is only up to date in one of them (authors).
The easiest way to only include the correct value of deceased would be to simply exclude the incorrect one from the merge.
> (m2 <- merge(authors, books[names(books) != "deceased"],
by.x = "surname", by.y = "name"))
surname nationality deceased title
1 McNeil Australia yes Interactive Data Analysis
2 Ripley UK no Spatial Statistics
3 Ripley UK no Stochastic Simulation
4 Tierney US no LISP-STAT
5 Tukey US yes Exploratory Data Analysis
6 Venables Australia no Modern Applied Statistics ...
The expression books[names(books) != "deceased"] simply subsets the data frame books to remove its deceased column, leaving only the correct deceased column from authors in the final merge.
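In your case, though, Farm does not cover every animal (id 1 has no farm row), so simply dropping Animals' sex before the merge would leave NAs for those animals. An alternative that matches your desired output is to keep both columns from the merge you already have and prefer sex.y where it is present, falling back to sex.x. A sketch using dplyr::coalesce, with Animals_Farm reconstructed from the question:

library(dplyr)

# Reconstructed from the merged data frame shown in the question
Animals_Farm <- data.frame(
  id = 0:2,
  sex.x = c(2, 1, 2),
  father = c("john", "doug", "bryan"),
  mother = c("ray", "ana", "kim"),
  birth.date = c("05/06/94", "18/02/93", "21/03/00"),
  sex.y = c(1, NA, 2),
  farm = c(1, NA, 3)
)

# Take sex.y where available, otherwise fall back to sex.x
Animals_Farm %>%
  mutate(sex = coalesce(sex.y, sex.x)) %>%
  select(id, sex, father, mother, birth.date, farm)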

Create data frame for each unique row in another data frame

For an assignment for my graduate program, I have been asked to extract data from datasets of English Premier League results (located here). I am very close to being done but need help on the last two outputs.
We must create a function that can receive two arguments, a date and a season. The function must return a data frame with the table of the respective season on that date. It must include wins, losses, home record, away record, etc. The only ones I have not managed to figure out are W/L streak and the results of the last 10 matches.
Here is an example of what the initial dataset looks like:
e.Date e.HomeTeam e.AwayTeam e.FTHG e.FTAG e.FTR
1 2015-08-08 Bournemouth Aston Villa 0 1 A
2 2015-08-08 Chelsea Swansea 2 2 D
3 2015-08-08 Everton Watford 2 2 D
4 2015-08-08 Leicester Sunderland 4 2 H
5 2015-08-08 Man United Tottenham 1 0 H
My plan was to get Home and Away data sorted out for each club then merge them together before doing the analysis to find streak and last 10 results.
I manipulated the data to look like this:
HomeTeam FTR Date freq
1 Arsenal L 2015-08-09 1
2 Arsenal D 2015-08-24 1
3 Arsenal W 2015-09-12 1
4 Aston Villa L 2015-08-14 1
5 Aston Villa L 2015-09-19 1
6 Aston Villa D 2015-08-29 1
And now I'm kinda lost. My idea was to run some kind of loop (for? ddply? data.table?) to create a data frame for each club with their results in it, then loop again to do the calculations for the desired variables (streak and last 10), and somehow push those back into the main data frame where I am housing all of the other outputs.
I don't want to be told the answer outright since it's important I learn this on my own. However, if someone could point me in the right direction that would be great. Thanks so much.
I created some dummy data just to demonstrate a few commands and maybe give you some ideas.
set.seed(321)
dat <- data.frame(team = sample(letters[1:3], 20, replace = TRUE),
                  season = rep("season1", 20),
                  time = rnorm(20),
                  win_loss = sample(c("win", "loss"), 20, replace = TRUE))
Problem 1. Find win/loss streak
Take a look at the rle function example below
# 1. find wl streak of team 'a'
tmp <- dat[dat$team == "a", ]
tmp <- tmp[order(tmp$time), ]
> tmp
team season time win_loss
19 a season1 -1.12032742 loss
14 a season1 -1.07223880 loss
16 a season1 0.09500072 loss
3 a season1 0.18832552 loss
8 a season1 0.42033257 loss
4 a season1 2.44325982 win
# shows runs of 5 consecutive losses, then 1 consecutive win
rle(tmp$win_loss == "win")
Run Length Encoding
lengths: int [1:2] 5 1
values : logi [1:2] FALSE TRUE
Here's a very helpful post on rle: How can I count runs in a sequence?
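If you need the streak as a single value (the run that ends at the most recent match), one way to condense the rle output; current_streak is just an illustrative helper, not part of the original answer:

# Current streak = length and direction of the last run in the time-ordered results
current_streak <- function(win_loss) {
  r <- rle(win_loss == "win")
  n <- length(r$lengths)
  paste0(ifelse(r$values[n], "W", "L"), r$lengths[n])  # e.g. "W3" or "L5"
}

current_streak(tmp$win_loss)   # "W1" for team 'a' above: one win after five losses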
Problem 2. Last 3 results
I reversed the order of time and then picked the top 3 results.
# 2. find last 3 matches for team 'b'
tmp <- dat[dat$team == "b", ]
tmp <- tmp[rev(order(tmp$time)), ]
> tmp[1:3, ]
team season time win_loss
11 b season1 0.9172555 loss
9 b season1 0.5775845 win
7 b season1 0.4560691 loss
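To do the same for every team at once (the real task needs the last 10 matches per team), a dplyr sketch on the same dummy data; slice_head() requires dplyr >= 1.0:

library(dplyr)

# Most recent results first within each team, then keep the top n
# (use n = 10 on the real data)
dat %>%
  group_by(team) %>%
  arrange(desc(time), .by_group = TRUE) %>%
  slice_head(n = 3)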

How do I infill non-adjacent rows with sample data from previous rows in R?

I have data containing a unique identifier, a category, and a description.
Below is a toy dataset.
prjnumber <- c(1,2,3,4,5,6,7,8,9,10)
category <- c("based","trill","lit","cold",NA,"epic", NA,NA,NA,NA)
description <- c("skip class",
                 "dunk on brayden",
                 "record deal",
                 "fame and fortune",
                 NA,
                 "female attention",
                 NA, NA, NA, NA)

toy.df <- data.frame(prjnumber, category, description)
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 <NA> <NA>
6 6 epic female attention
7 7 <NA> <NA>
8 8 <NA> <NA>
9 9 <NA> <NA>
10 10 <NA> <NA>
I want to randomly sample the 'category' and 'description' columns from rows that have been filled in to use as infill for rows with missing data.
The final data frame would be complete and would only rely on the initial 5 rows which contain data. The solution would preserve between-column correlation.
An expected output would be:
> toy.df
prjnumber category description
1 1 based skip class
2 2 trill dunk on brayden
3 3 lit record deal
4 4 cold fame and fortune
5 5 lit record deal
6 6 epic female attention
7 7 based skip class
8 8 based skip class
9 9 lit record deal
10 10 trill dunk on brayden
# Rows that already have data, to sample from
complete = na.omit(toy.df)

# Overwrite both columns of the NA rows with randomly sampled complete rows
toy.df[is.na(toy.df$category), c("category", "description")] =
  complete[sample(1:nrow(complete), size = sum(is.na(toy.df$category)), replace = TRUE),
           c("category", "description")]
toy.df
# prjnumber category description
# 1 1 based skip class
# 2 2 trill dunk on brayden
# 3 3 lit record deal
# 4 4 cold fame and fortune
# 5 5 lit record deal
# 6 6 epic female attention
# 7 7 cold fame and fortune
# 8 8 based skip class
# 9 9 epic female attention
# 10 10 epic female attention
Though it would seem a little more straightforward if you didn't start with the unique identifiers filled out for the NA rows...
You could try
library(dplyr)
toy.df %>%
  mutate_each(funs(replace(., is.na(.), sample(.[!is.na(.)]))), 2:3)
Based on new information, we may need a numeric index to use in the funs.
toy.df %>%
  mutate(indx = replace(row_number(), is.na(category),
                        sample(row_number()[!is.na(category)], replace = TRUE))) %>%
  mutate_each(funs(.[indx]), 2:3) %>%
  select(-indx)
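mutate_each() and funs() have since been deprecated in dplyr. A roughly equivalent version of that second pipeline using across() (dplyr >= 1.0) would look like this; same idea, with the sampled row index reused for both columns so the pairing is preserved:

library(dplyr)

toy.df %>%
  mutate(indx = replace(row_number(), is.na(category),
                        sample(row_number()[!is.na(category)],
                               sum(is.na(category)), replace = TRUE))) %>%
  mutate(across(c(category, description), ~ .x[indx])) %>%
  select(-indx)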
Using base R to fill in a single field at a time, use something like this (not preserving the correlation between the fields):
fields <- c('category', 'description')
for (field in fields) {
  missings <- is.na(toy.df[[field]])
  toy.df[[field]][missings] <- sample(toy.df[[field]][!missings], sum(missings), replace = TRUE)
}
and to fill them in simultaneously (preserving the correlation between the fields) use something like:
missings <- apply(toy.df[, fields], 1,
                  function(x) any(is.na(x)))
toy.df[missings, fields] <- toy.df[!missings, fields][sample(sum(!missings),
                                                             sum(missings),
                                                             replace = TRUE), ]
and of course, to avoid the implicit for loop in the apply(x,1,fun), you could use:
rowAny <- function(x) rowSums(x) > 0
missings <- rowAny(is.na(toy.df[, fields]))
