I have a simple syntax question for an absolute beginner. I have been searching and experimenting and I can't figure it out. I need to only plot values from the variable SIZE that are greater than 0.8, but less than seven. I am using the with() expression along with plot(). Can someone tell me how I should write this?
with(dat[SIZE <7 | SIZE > 0.8 ,], plot(SP.RICH~SIZE))
Thank You.
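Selecting only certain rows is called filtering. Note that your condition needs & (and), not | (or): every non-missing SIZE is either below 7 or above 0.8, so | keeps all of the rows.
One way is to use dplyr, which offers a nicer idiom: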
library(dplyr)
dat %>% filter(SIZE > 0.8 & SIZE < 7) %>%
  plot(SP.RICH ~ SIZE, data = .)
Another is the data.table package.
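For comparison, the original with() call can also be repaired without any extra package. The sketch below is base R rather than data.table, and the small dat data frame is made up for illustration; only the column names SIZE and SP.RICH come from the question.

```r
# Hypothetical data standing in for the asker's `dat`
dat <- data.frame(
  SIZE    = c(0.5, 0.9, 2.0, 6.5, 7.0, 9.0),
  SP.RICH = c(3, 5, 8, 12, 14, 15)
)

# subset() looks SIZE up inside dat; & requires both conditions to hold
kept <- subset(dat, SIZE > 0.8 & SIZE < 7)

# Plot only the retained rows
with(kept, plot(SP.RICH ~ SIZE))
```

In dat[SIZE < 7, ] the bare name SIZE is looked up in the calling environment, not in dat, which is why the subsetting has to happen through subset() or through dat$SIZE.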
I'm running into an issue that feels like it should be simple, but I cannot figure it out. I have searched the board for comparable questions but was unable to find an answer.
In short, I have data from a variety of motor vehicles and am looking to find the average speed of each vehicle when it is at maximal acceleration. I also want the opposite: the average acceleration at top speed.
I am able to do this for the whole dataset using the following code:
data<-data %>% group_by(Name) %>%
mutate(speedATaccel= with(data, avg.Speed[which.max(top.Accel)]),
accelATspeed= with(data, avg.Accel[which.max(top.Speed)]))
However, group_by doesn't appear to be working: it just provides the values across the whole dataset rather than for each individual vehicle group.
Any help would be appreciated.
Thanks,
Wrapping the expressions in with(data, ...) bypasses the grouping set by group_by: with() evaluates against the whole data frame, so which.max returns the index over the entire dataset. Instead, use tidyverse methods, i.e. remove the with(data, ...) wrapper. Note that in tidyverse verbs we don't need any of the base R extraction methods such as $, [[ or with(); just specify the unquoted column name.
library(dplyr)
data %>%
  group_by(Name) %>%
  mutate(speedATaccel = avg.Speed[which.max(top.Accel)],
         accelATspeed = avg.Accel[which.max(top.Speed)])
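To see the per-group behaviour on something small, here is a minimal sketch with made-up data. The vehicles data frame and its values are hypothetical; only the column names follow the question. It uses summarise() rather than mutate() so that each vehicle gets a single row, which is a choice of this sketch and not part of the answer above.

```r
library(dplyr)

# Hypothetical data: two vehicles, a few observations each
vehicles <- data.frame(
  Name      = c("car", "car", "car", "truck", "truck"),
  avg.Speed = c(50, 60, 55, 40, 45),
  top.Accel = c(2.0, 3.5, 3.0, 1.5, 1.0),
  avg.Accel = c(1.0, 2.0, 1.5, 0.8, 0.5),
  top.Speed = c(80, 95, 100, 70, 75)
)

# Within each Name, pick the row with the largest top.Accel / top.Speed
per_vehicle <- vehicles %>%
  group_by(Name) %>%
  summarise(
    speedATaccel = avg.Speed[which.max(top.Accel)],
    accelATspeed = avg.Accel[which.max(top.Speed)]
  )

print(per_vehicle)
```

Because the column names are evaluated inside each group, which.max() only ever sees the rows of one vehicle at a time.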
I am trying to write some code using dplyr and a yeast dataset.
I read in the data with this code:
gdat <- read_csv(file = "brauer2007_tidy1.csv")
I omitted NAs by using this:
gdat <- na.omit(gdat)
library(ggplot2)
Then I tried to filter some genes by their value in the "symbol" column and used ggplot to make a plot:
filter(gdat, symbol=="QRI7", symbol== "CFT2", symbol== "RIB2",
symbol=="EDC3", symbol=="VPS5", symbol=="AMN1" & rate=.05) %>%
ggplot(aes(x=rate,
y=expression,
group=1,
colour=nutrient)) +
geom_line(lwd=1.5) +
facet_wrap(~nutrient)
facet_wrap(~nutrient) is used to separate each gene's rate vs. expression graphs according to the nutrient which is depleted, but this error keeps coming up:
error: Faceting variables must have at least one value
I checked with the filter function whether all of these genes could be displayed in R, and they were when I filtered them individually, but when I combine multiple genes with ggplot I get this error.
Also, when I use "&rate=.05" I can't get only the values which are at rate = .05.
Does anyone know how I can fix this problem? I have a deadline tomorrow at 17.30, and I would be very glad if somebody could help me. Thanks.
I downloaded what I assume is the same dataset like this:
library(readr)
library(dplyr)
library(ggplot2)
gdat <- read_csv("https://raw.githubusercontent.com/bioconnector/workshops/master/data/brauer2007_tidy.csv")
So the first problem is your filter. If you are looking for any of those gene symbols, you need to use %in%. And rate requires a double equals ==:
gdat %>%
  filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1"),
         rate == 0.05)
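The reason the original filter returns nothing is that filter() combines comma-separated conditions with AND, and a single row can never have two different symbols at once. A minimal sketch of the difference, using a made-up toy data frame (the values are hypothetical; only the column names match the dataset):

```r
library(dplyr)

# Hypothetical miniature of the gene table
toy <- data.frame(
  symbol = c("QRI7", "CFT2", "RIB2", "OTHER"),
  rate   = c(0.05, 0.05, 0.10, 0.05)
)

# Comma-separated conditions are ANDed: no row is both QRI7 and CFT2
and_rows <- filter(toy, symbol == "QRI7", symbol == "CFT2")

# %in% keeps rows whose symbol matches any of the listed values
in_rows <- filter(toy, symbol %in% c("QRI7", "CFT2", "RIB2"), rate == 0.05)

nrow(and_rows)  # no rows survive the ANDed equality tests
nrow(in_rows)   # QRI7 and CFT2 at rate 0.05
```

An empty data frame passed on to ggplot is also what triggers the "Faceting variables must have at least one value" error, since there is nothing left to facet on.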
I don't think you want to filter for one rate and then use geom_line, because you will just get one vertical line at one value of x (rate).
Neither do I think you want to use geom_line for multiple values of rate, because there are several values for expression at each rate and a line will generate a nasty-looking zigzag.
And as you are faceting on nutrient, there's no need to color by nutrient. Perhaps you want to color by gene?
So you need to think about what makes a good visualisation of this data. Here's a simple example to get you started.
gdat %>%
  filter(symbol %in% c("QRI7", "CFT2", "RIB2", "EDC3", "VPS5", "AMN1")) %>%
  ggplot(aes(x = rate,
             y = expression,
             color = symbol)) +
  geom_line() +
  facet_wrap(~nutrient)
May I know what is wrong with the code below? I am trying to extract the distinct values of Species in iris but am not getting them. I am trying to write this without %>%.
iris[,c(distinct("Species"))]
I am guessing you want to do this:
library(dplyr)
distinct(iris, Species)
You do not need %>% to begin with, but if you mean that you don't want to use the dplyr package, you can try what @sm925 suggested in a comment: as.character(unique(iris$Species))
This will give you a vector with all unique species:
unique(iris$Species)
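The base R variants differ mainly in the type of object they return. A short sketch on the built-in iris data (the variable names are just for illustration):

```r
# Factor holding the three distinct species (factor levels are kept)
species_factor <- unique(iris$Species)

# Plain character vector with the same values
species_chr <- as.character(species_factor)

# One-column data frame, similar in shape to dplyr::distinct(iris, Species)
species_df <- unique(iris["Species"])

print(species_chr)
```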
I'm struggling with multiple response questions in R. I'm hoping to find an easy way to tackle this with dplyr and tidyr. Below is a sample multiple response data frame. I'm trying to do a couple of things. First, create percentages: % of cats, % of dogs, etc. Percentages will be of overall responses. My usual way of calculating percentages,
group_by(_)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
doesn't seem to cut it in this situation. Maybe I have to use summarise_each or a more specialized function? I'm still new to R and really new to dplyr and tidyr. I also tried to use tidyr's "unite" function, which works, but it includes NAs, which I will have to recode away. But I still can't seem to calculate the percentages of the united column.
Any suggestions would be great! First, how to unite the multiple response columns using "unite" into all possible combinations and then calculating percentages of each, and also how to simply calculate the percentage of each binary column as a proportion of overall responses? Hope this makes sense! I'm sure there's a simple and elegant answer that I'm overlooking.
Cats<-c("Cat",NA,"Cat",NA,NA,NA,"Cat",NA)
Dogs<-c(NA,NA,"Dog","Dog",NA,"Dog",NA,"Dog")
Fish<-c(NA,NA,"Fish",NA,NA,NA,"Fish","Fish")
Pets<-data.frame(Cats,Dogs,Fish)
Pets<-Pets%>%unite(Combined,Cats,Dogs,Fish,sep=",",remove=FALSE)
Pets%>%group_by(Combined)%>%summarise(count=n())%>%mutate(percent=count/sum(count))
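Based on my understanding of your question, it sounds like what you're trying to do can be done with the gather() function from tidyr instead of unite(). Apply it to the original Pets data frame (Cats, Dogs and Fish only), before the Combined column is added, because gather() without a column selection reshapes every column.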
library(dplyr)
library(tidyr)
Pets %>%
  gather(animal, type, na.rm = TRUE) %>%
  group_by(animal) %>%
  summarize(count = n()) %>%
  mutate(percentage = count / sum(count))
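The pipeline above divides each count by the total number of non-missing answers. If "overall responses" instead means the number of respondents (rows), the denominator changes. The following base R sketch shows both readings, plus the share of each combination of pets without NA strings; it rebuilds the sample data with quoted values, and the names answered, combos and the "None" label are assumptions of this sketch.

```r
# Sample data from the question, with the values quoted
Pets <- data.frame(
  Cats = c("Cat", NA, "Cat", NA, NA, NA, "Cat", NA),
  Dogs = c(NA, NA, "Dog", "Dog", NA, "Dog", NA, "Dog"),
  Fish = c(NA, NA, "Fish", NA, NA, NA, "Fish", "Fish")
)

# Logical matrix: TRUE where a respondent ticked that animal
answered <- !is.na(Pets)

# Reading 1: share of all ticks (same denominator as the gather() pipeline)
share_of_mentions <- colSums(answered) / sum(answered)

# Reading 2: share of respondents who ticked each animal
share_of_respondents <- colMeans(answered)

# Combination per respondent, built from column names so no "NA" text appears
combos <- apply(answered, 1, function(r) {
  if (any(r)) paste(names(r)[r], collapse = ",") else "None"
})
combo_share <- prop.table(table(combos))

print(share_of_mentions)
print(share_of_respondents)
print(combo_share)
```

With multiple-response data the second reading does not sum to 1, because one respondent can tick several animals.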
I'm working with a data frame that looks very similar to the below:
Image here, unfortunately don't have enough reputation yet
This is a 600,000-row data frame. For every repeated instance within the same date, I'd like to divide the cost by the total number of repeated instances. I would also like to consider only those rows falling under the "Sales" tactic.
So for example, on 1/1/16, there are 2 "Help Packages" that are also under the "Sales" tactic. Because there are 2 instances within the same date, I'd like to divide the cost of each by 2 (so the cost would come out as $5 for each).
This is the code I have:
for(i in 1:length(dfExample$Date)){
if(dfExample$Tactic) == "Sales"){
list = agrep(dfExample$Package[i], dfExample$Package)
for(i in list){
date_repeats = agrep(i, dfExample$Date)
dfExample$Cost[date_repeats] = dfExample$Package[i]/length(date_repeats)
}
}
}
It is incredibly inefficient and slow. I know there's got to be a better way to achieve this. Any help would be much appreciated. Thank you!
ave() can give a solution without additional packages:
with(dfExample, Cost / ave(Cost, Date, Package, Tactic, FUN=length))
Using dplyr:
library(dplyr)
dfExample %>%
  group_by(Date, Package, Tactic) %>%
  mutate(Cost = Cost / n())
I'm a little unclear what you mean by "instance". This (pretty clearly) groups by Date, Package, and Tactic, and so will treat each unique combination of those columns as a group. If you don't include Tactic in the definition of an "instance", then you can remove it to group only by Date and Package.
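Neither snippet above restricts the adjustment to the "Sales" tactic, which the question also asks for. One way to add that, sketched in base R with ave(): the toy dfExample below is made up to mirror the description in the question (the original data was only shown as an image), and the AdjCost column name is an assumption of this sketch.

```r
# Hypothetical data mirroring the question's description
dfExample <- data.frame(
  Date    = c("1/1/16", "1/1/16", "1/1/16", "1/2/16"),
  Package = c("Help Package", "Help Package", "Support", "Help Package"),
  Tactic  = c("Sales", "Sales", "Marketing", "Sales"),
  Cost    = c(10, 10, 20, 10)
)

# Number of rows sharing the same Date, Package and Tactic
n_in_group <- ave(dfExample$Cost, dfExample$Date, dfExample$Package,
                  dfExample$Tactic, FUN = length)

# Split the cost only for Sales rows; leave other tactics unchanged
is_sales <- dfExample$Tactic == "Sales"
dfExample$AdjCost <- ifelse(is_sales, dfExample$Cost / n_in_group,
                            dfExample$Cost)

print(dfExample)
```

Both ave() and ifelse() are vectorised, so this avoids the nested loops of the original attempt and should scale to a 600,000-row data frame far better.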