I hope to use glob_wildcards for my two wildcards: "sample" and "ID"
but in the path: config["path"]+ "{sample}/map/filtered.{sample}.R1.clean.fq.gz.split/filtered.{sample}.R1.clean.id_{ID}.fq.gz"
{sample} appears multiple times, so what can I do?
The other issue is that each sample has a different number of IDs. How can I write the Snakefile so that it ignores the sample/ID combinations that don't exist?
Thank you very much!
I have one .csv file which I have imported into R. It contains a column of locations; some locations are repeated depending on how many times that location has been surveyed. There is another column with the total no. of plastic items.
I would like to add together the number of plastic items for locations that appear more than once, and create a separate column with the total no. of plastic items and another column with the no. of times each location appeared.
I am unsure how to do this, any help will be much appreciated.
Using dplyr:
library(dplyr)

data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
And here's a base solution that does pretty much the same thing:
data[c("TOTloc","TOTitem")]<-t(sapply(data$location, function(x)
c(TOTloc=sum(data$location==x),
TOTitem=sum(data$items[data$location==x]))))
Note that in neither case do you need to sort anything. In dplyr, group_by makes each subsequent action apply only to the rows of the data set that belong to the group determined by the contents of the given column. In the base solution, I loop over the locations with sapply and recompute TOTloc and TOTitem for every row, which may not be very efficient. A better solution would probably use split, but for some reason I couldn't make it work with my made-up dataset, so maybe someone else can suggest how best to do that.
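For illustration, here is the dplyr version run on a small made-up dataset (hypothetical location names and item counts, just to show the shape of the result):

library(dplyr)

# hypothetical data: "Beach A" surveyed twice, "Beach B" once
data <- data.frame(location = c("Beach A", "Beach A", "Beach B"),
                   items    = c(10, 5, 3))

data %>%
  group_by(location) %>%
  mutate(TOTlocation = n(), TOTitems = sum(items))
# both "Beach A" rows get TOTlocation = 2 and TOTitems = 15;
# the "Beach B" row gets TOTlocation = 1 and TOTitems = 3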
I am trying to find the genes shared between two columns so that I can later work with just those common genes. Below is my code:
top100_1Beta <- data.frame(grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta<- data.frame(grp2_Beta$my_data.SYMBOL[1:100])
common100_Beta <- semi_join(top100_1Beta, top100_2Beta)
When I run the code I get the following error:
Error: by required, because the data sources have no common variables
This is wrong, since when I open top100_1Beta and top100_2Beta I can see that at least the first few rows list exactly the same genes: ATP2A1, SLMAP, MEOX2, ...
I am confused about why it then reports that there are no common variables.
Any help would be greatly appreciated.
Thanks!
I don't think you need any form of *_join here; instead it seems you're looking for intersect
intersect(grp1_Beta$my_data.SYMBOL[1:100], grp2_Beta$my_data.SYMBOL[1:100])
This returns a vector of the common entries among the first 100 entries of grp1_Beta$my_data.SYMBOL and grp2_Beta$my_data.SYMBOL.
Without a full working example, I'm guessing that your top100_1Beta and top100_2Beta dataframes do not have the same column names. They are probably grp1_Beta.my_data.SYMBOL.1.100. and grp2_Beta.my_data.SYMBOL.1.100.. This means the semi_join function doesn't know where to match the dataframes up. Renaming the columns should fix the issue.
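For example (a sketch only, with SYMBOL as an assumed column name of your choosing), naming the column explicitly in both data frames and passing by lets semi_join find the matching variable:

library(dplyr)

# give both data frames the same column name (SYMBOL is just a chosen name)
top100_1Beta <- data.frame(SYMBOL = grp1_Beta$my_data.SYMBOL[1:100])
top100_2Beta <- data.frame(SYMBOL = grp2_Beta$my_data.SYMBOL[1:100])

# now both data frames share a variable to join on
common100_Beta <- semi_join(top100_1Beta, top100_2Beta, by = "SYMBOL")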
I'm trying to find an easy formula to do the following:
=IF(AND(H6="OK";H7="OK";H8="OK";H9="OK";H10="OK";H11="OK";);"OK";"X")
This actually works. But I want to apply it to a range of cells within a column (H6:H11) instead of having to create a rule for each and every one of them... Trying it as a range:
=IF(AND(H6:H11="OK";);"OK";"X")
Does not work.
Any insights?
Thanks.
=ArrayFormula(IF(AND(H6:H11="OK");"OK";"X"))
also works
Array formulas work the same way they do in Excel... they just need an ArrayFormula() wrapper to work (it is added automatically when you press Ctrl+Shift+Enter, like confirming an array formula in Excel).
In google sheets the formula is:
=ArrayFormula(IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X"))
in excel:
=IF(SUM(IF(H6:H11="OK";1;0))=6;"OK";"X")
And confirm with Ctrl-Shift-Enter
This basically counts how many cells in the range are equal to the criterion and compares that count to the number it should be. So if the range grows, increase the number 6 to match.
I know the function combn can generate all possible combinations. However, if the total number of members is large, this is really time-consuming and memory-consuming.
My goal is to randomly pick combinations from all the possible ones. For example, I want 5000 distinct triples of members from a pool of 3000 members. I don't think I need to generate all possible combinations and choose 5000 from them, but it seems that R doesn't have a ready-to-use function to do this. How should I deal with this problem?
This is not exactly what you need but perhaps it can get you started:
library(data.table)  # to make the table easier
members <- 1:3000
X <- data.table(RUN = 1:5000)
X <- X[, as.list(sample(members, 3)), by = RUN]
This creates 3 new columns per run, each randomly drawn from the members vector; think of them as the IDs of the chosen members.
I would then check how many rows are duplicated using:
X[duplicated(X, by=c('V1','V2','V3'))]
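If you need exactly 5000 distinct triples (treating {a, b, c} as the same triple regardless of order), a minimal base-R sketch would be to sort each sampled triple, drop duplicates, and keep drawing until enough remain:

# sort each triple so order doesn't matter, then top up until 5000 unique rows remain
picks <- matrix(nrow = 0, ncol = 3)
while (nrow(picks) < 5000) {
  new_picks <- t(replicate(5000, sort(sample(3000, 3))))
  picks <- unique(rbind(picks, new_picks))
}
picks <- picks[1:5000, ]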
Is this helping you at all?
So I'm trying to manipulate a simple Qualtrics CSV, and I want to use colSums on certain columns of data, given a certain filter.
For example: within the .csv file called data, I want to get the sum of a few columns, and print them with certain labels (say choice1, choice2 etc). That is easy enough by itself:
firstqn<-data.frame(choice1=data$Q7_2,choice2=data$Q7_3,choice3=data$Q7_4);
secondqn<-data.frame(choice1=data$Q8_6,choice2=data$Q8_7,choice3=data$Q8_8)
print(colSums(firstqn)); print(colSums(secondqn))
The problem comes when I want to repeat the above steps with different filters - say, only the rows where gender == 2.
The only way I know how is to create a new dataset data2 and replace data$ with data2$ in every line of the above code, such as:
data2<-(data[data$Q2==2,])
firstqn<-data.frame(choice1=data2$Q7_2,choice2=data2$Q7_3,choice3=data2$Q7_4);
However, I have 6 choices for each of 5 questions and am planning to apply about 5-10 different filters, so I don't relish the thought of copy/pasting data2, data3, etc. hundreds of times.
So my question is: Is there any way of getting R to reference data by default without using data$ in front of every variable name?
I can probably use attach() to achieve this, but I really don't want to:
data2<-(data[data$Q2==2,])
attach(data2)
firstqn<-data.frame(choice1=Q7_2,choice2=Q7_3,choice3=Q7_4);
detach(data2)
Is there a command like attach() that would let me avoid typing data$ in front of every variable name for a specified stretch of code? Then whenever I wanted to apply a new filter, I could just copy/paste the same code and change only the first command (defining a new dataset).
I guess I'm looking for some command like with(data2, *insert multiple commands here*)
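Something like this sketch, perhaps (reusing the column names from above), if with() can take several commands in a braced block:

data2 <- data[data$Q2 == 2, ]
with(data2, {
  # columns of data2 are visible here as plain variable names;
  # note that objects assigned inside the braces do not persist afterwards
  firstqn  <- data.frame(choice1 = Q7_2, choice2 = Q7_3, choice3 = Q7_4)
  secondqn <- data.frame(choice1 = Q8_6, choice2 = Q8_7, choice3 = Q8_8)
  print(colSums(firstqn)); print(colSums(secondqn))
})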
Alternatively, if anyone has a better way to do the above entirely differently, please enlighten me - I'm not very proficient at R (yet).