I'm struggling with multiple-response questions in R and hoping to find an easy way to tackle them with dplyr and tidyr. Below is a sample multiple-response data frame. I'm trying to do a couple of things. First, create percentages (% of cats, % of dogs, etc.), where the percentages are of overall responses. My usual way of calculating percentages,
group_by(_)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
doesn't seem to cut it in this situation. Maybe I have to use summarise_each or a more specialized function? I'm still new to R and really new to dplyr and tidyr. I also tried tidyr's "unite" function, which works, but it includes NAs, which I will have to recode away. Even then, I can't seem to calculate the percentages of the united column.
Any suggestions would be great! First, how do I unite the multiple-response columns with "unite" into all possible combinations and then calculate the percentage of each? Second, how do I simply calculate the percentage of each binary column as a proportion of overall responses? Hope this makes sense! I'm sure there's a simple and elegant answer that I'm overlooking.
library(dplyr)
library(tidyr)
Cats<-c("Cat",NA,"Cat",NA,NA,NA,"Cat",NA)
Dogs<-c(NA,NA,"Dog","Dog",NA,"Dog",NA,"Dog")
Fish<-c(NA,NA,"Fish",NA,NA,NA,"Fish","Fish")
Pets<-data.frame(Cats,Dogs,Fish)
Pets<-Pets%>%unite(Combined,Cats,Dogs,Fish,sep=",",remove=FALSE)
Pets%>%group_by(Combined)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
Based on my understanding of your question, it sounds like what you want can be done with tidyr's gather() function rather than unite().
library(dplyr)
library(tidyr)
Pets %>%
  gather(animal, type, na.rm = TRUE) %>%
  group_by(animal) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count))
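Note that the gather() version divides by the total number of selections, not by the number of respondents. If "proportion of overall responses" means the share of respondents who ticked each option, a minimal sketch along these lines may help. It assumes dplyr 1.0 or later (for across()) and rebuilds the question's data with quoted strings; the names respondent_share and combo_share are just illustrative.

```r
library(dplyr)

# Toy data mirroring the question: NA means the respondent did not pick that pet
Pets <- data.frame(
  Cats = c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA),
  Dogs = c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog"),
  Fish = c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish")
)

# Share of all respondents (rows) who selected each option
respondent_share <- Pets %>%
  summarise(across(everything(), ~ mean(!is.na(.x))))

# Share of respondents giving each combination of answers;
# count() treats NA as its own value, so no unite() or recoding is needed
combo_share <- Pets %>%
  count(Cats, Dogs, Fish) %>%
  mutate(percent = n / sum(n))
```

Counting on the three original columns sidesteps the "NA" strings that unite() pastes into the combined column.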
I'm running into an issue that I feel should be simple but cannot figure out. I have searched the board for comparable questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and want to know the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code:
data<-data %>% group_by(Name) %>%
mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, the group_by function doesn't appear to be working: it just provides the values across the whole dataset rather than for each individual vehicle group.
Any help would be appreciated.
Thanks,
Using with(data, ...) bypasses the grouping set by group_by() and computes the index on the whole dataset. Instead, use the tidyverse approach, i.e. remove the with(data, ...) wrapper. Note that in the tidyverse we don't need any of the base R extraction methods ($, [[, or with()); just specify the unquoted column name.
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
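To see the per-group behaviour on something small, here is a sketch with made-up data. The vehicles data frame and its values are hypothetical; only the column names follow the question.

```r
library(dplyr)

# Hypothetical toy data: two vehicles, three observations each
vehicles <- data.frame(
  Name      = c("A", "A", "A", "B", "B", "B"),
  avg.Speed = c(10, 20, 30, 40, 50, 60),
  top.Speed = c(15, 35, 25, 55, 65, 70),
  avg.Accel = c(1, 2, 3, 4, 5, 6),
  top.Accel = c(5, 9, 7, 8, 6, 4)
)

result <- vehicles %>%
  group_by(Name) %>%
  # which.max() is evaluated within each vehicle's rows, not the whole table
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)]) %>%
  ungroup()
```

Each vehicle gets its own pair of values repeated across its rows; swapping mutate() for summarise() would give one row per vehicle instead.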
Apologies if I have missed where this has been asked before; I couldn't find it. I am learning about tibbles and the verb arrange. Is there a more efficient way to arrange rows in descending order of their total NA count than what I have done below? I am using the nycflights13 dataset.
library(nycflights13)
library(tidyverse)
options(tibble.width = Inf)
flights$na_count <- flights %>% is.na %>% rowSums
arrange(flights, desc(na_count))
Part of result:
I checked the help centre and this appears to be on topic though I know working code is often a candidate for code review.
I'm not sure it is much more efficient, but you can rewrite it as:
flights %>%
  arrange(desc(rowSums(is.na(.))))
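If you want to keep the count as a column so the ordering is easy to check, a small sketch on a made-up tibble (toy and its values are hypothetical, not from nycflights13) could look like this:

```r
library(dplyr)

# Hypothetical toy tibble with a different number of NAs per row
toy <- tibble(
  id = 1:4,
  x  = c(1, NA, NA, 4),
  y  = c(NA, NA, "c", "d"),
  z  = c(TRUE, NA, TRUE, TRUE)
)

sorted <- toy %>%
  # `.` is the piped-in data frame (magrittr pipe), so is.na(.) is a logical matrix
  mutate(na_count = rowSums(is.na(.))) %>%
  arrange(desc(na_count))
```

The same idiom should carry over to flights unchanged, since it only relies on is.na() and rowSums().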
library(tidyverse)
library(ggmosaic)  # for the "happy" dataset
I feel like this should be a somewhat simple thing to achieve, but I'm having difficulty with percentages when using purrr::map together with table(). Using the "happy" dataset, I want to create a list of frequency tables for each factor variable. I would also like to have rounded percentages instead of counts, or both if possible.
I can create frequency percentages for each factor variable separately with the code below.
with(happy,round(prop.table(table(marital)),2))
However, I can't seem to get the percentages to work correctly when using table() with purrr::map. The code below doesn't work:
happy%>%select_if(is.factor)%>%map(round(prop.table(table)),2)
The second method I tried was using tidyr::gather, and calculating the percentage with dplyr::mutate and then splitting the data and spreading with tidyr::spread.
TABLE<-happy%>%select_if(is.factor)%>%gather()%>%group_by(key,value)%>%summarise(count=n())%>%mutate(perc=count/sum(count))
However, since there are different factor variables, I would have to split the data by "key" before spreading using purrr::map and tidyr::spread, which came close to producing some useful output except for the repeating "key" values in the rows and the NA's.
TABLE%>%split(TABLE$key)%>%map(~spread(.x,value,perc))
So any help on how to make both of the above methods work would be greatly appreciated...
You can use an anonymous function or a formula to get your first option to work. Here's the formula option.
happy %>%
  select_if(is.factor) %>%
  map(~round(prop.table(table(.x)), 2))
In your second option, removing the NA values and then removing the count variable prior to spreading helps. The order in the result has changed, however.
TABLE <- happy %>%
  select_if(is.factor) %>%
  gather() %>%
  filter(!is.na(value)) %>%
  group_by(key, value) %>%
  summarise(count = n()) %>%
  mutate(perc = round(count/sum(count), 2), count = NULL)
TABLE %>%
  split(.$key) %>%
  map(~spread(.x, value, perc))
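For anyone without ggmosaic installed, the formula approach can be tried on a small made-up data frame. The survey data below is hypothetical, and select(where(is.factor)) is used as the newer spelling of select_if(is.factor), assuming dplyr 1.0 or later.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for "happy": two factor columns and one numeric column
survey <- data.frame(
  marital = factor(c("married", "single", "married", "divorced")),
  sex     = factor(c("f", "m", "f", "f")),
  age     = c(30, 41, 25, 58)
)

# One rounded proportion table per factor column, returned as a named list
tables <- survey %>%
  select(where(is.factor)) %>%
  map(~ round(prop.table(table(.x)), 2))
```

The result is a list with one element per factor, e.g. tables$marital, and the numeric column is dropped before tabulating.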
I know how to take a random sample from each group of a data frame using sample_n or sample_frac in dplyr, like this:
dataset %>%
  group_by(user_id) %>%
  sample_n(10)
However, I have a slightly different question. I want to take a random sample from the whole dataset. It should be as simple as this one,
sample_n(dataset,10)
But because I used group_by on the dataset earlier, the grouping still seems to take effect here, so the second command is equivalent to the first.
How can I remove the effect of group_by and get a random sample from the whole dataset?
We can use ungroup() to remove any grouping variables and then apply sample_n():
dataset %>%
  group_by(user_id) %>%
  ungroup() %>%
  sample_n(10)
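The difference is easy to see on a small made-up dataset. This sketch uses slice_sample(), the newer replacement for sample_n() (assuming dplyr 1.0 or later); the dataset itself is hypothetical.

```r
library(dplyr)

# Hypothetical data: 5 users with 4 rows each
dataset <- data.frame(user_id = rep(1:5, each = 4), value = 1:20)
grouped <- dataset %>% group_by(user_id)

set.seed(1)

# Still grouped: 2 rows are drawn per user, so 10 rows in total
per_group <- grouped %>% slice_sample(n = 2)

# Grouping removed first: 2 rows are drawn from the whole dataset
overall <- grouped %>% ungroup() %>% slice_sample(n = 2)
```

If the grouped object is no longer needed, storing the ungrouped result back (dataset <- dataset %>% ungroup()) avoids the surprise in later steps.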
I have a question about dplyr.
Let's say I want to update certain values in a data frame. Can I do this?
mtcars %>% filter(mpg>20) %>% select(hp)=1000
(The example is nonsensical: all cars with mpg greater than 20 have hp set to 1000.)
I get an error, so I am guessing the answer is no: I can't use %>% and the dplyr verbs on the left-hand side of an assignment. But the dplyr syntax is a lot cleaner than:
mtcars[mtcars$mpg>20,"hp"]=1000
especially in more complex cases, so I wanted to ask whether there is any way to use the dplyr syntax here.
Edit: it looks like mutate is the verb I want, so now my question is: can I dynamically change the name of the variable in the mutate statement, like so:
for (i in c("hp","wt")) {mtcars<-mtcars %>% filter(mpg>20) %>% mutate(i=1000) }
This example just creates a column named "i" with value 1000, which isn't what I want.
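One possible approach, offered as a sketch rather than the only way: with dplyr 1.0 or later, across() together with all_of() accepts column names stored in a character vector, and if_else() applies the condition without dropping rows the way filter() would. The names cols and updated are just illustrative.

```r
library(dplyr)

# Column names held as strings, as in the for loop from the question
cols <- c("hp", "wt")

updated <- mtcars %>%
  # For each named column: 1000 where mpg > 20, the original value otherwise
  mutate(across(all_of(cols), ~ if_else(mpg > 20, 1000, .x)))
```

Because nothing is filtered out, updated keeps all rows of mtcars, which matches what mtcars[mtcars$mpg>20,"hp"]=1000 does in base R.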