Are unplaced genomic scaffolds unique compared to actual chromosomes?

I used UCSC BLAT to search for a horse genomic sequence. Three results were returned: two on unplaced scaffolds and one on chr1. All had 100% identity to my query (gagttcctagacaccaaatacaacgtgggaatacacaacctactggcctatgtgaaacacctgaaaggccagaatgaggaagccctgaagagcttgagagaagctgaagacttaatccaggaagaacatggtgaccaatcaggcat).
My question is, are there three copies of this gene in the horse, or could the scaffolds belong to chr1? For what it's worth, there is only one copy of the gene in mouse.

The answer turns out to be that all three results are distinct places in the genome. Unplaced scaffolds are assembled sequences that are still being worked on and have not yet been assigned to a chromosome in the reference genome.

Related

Determining whether values between rows match for certain columns

I am working with some observation data and have run into a bit of an issue beyond my current capabilities. I surveyed different polygons (the column "PolygonID" in the screenshot) for lizards two times during a survey season. I want to determine the total search effort (shown in the column "Effort") for each individual polygon within each survey round. Problem is, the software I was using to collect the data sometimes creates unnecessary repeats for polygons within a survey round. There is an example of this in the screenshot for the rows with PolygonID P3.
Most of the time it does not affect the effort calculations because the start and end time for the rows (the fields used to calculate effort) are the same, and I know how to filter the dataset so it only shows one line per polygon per survey, but I have reason to be concerned there might be some lines where the software glitched and assigned incorrect start and end times for one of the repeat lines. Is there a way I can test whether start and end time match for any such repeats with R, rather than manually going through all the data?
Thank you!
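The question asks for R, but the check itself is language-neutral: group the rows by polygon and survey round, then flag any group that holds more than one distinct (start, end) pair. Below is a minimal Python sketch of that idea. The rows and the column names `Round`, `Start`, and `End` are made up for illustration; only `PolygonID` comes from the question.

```python
from collections import defaultdict

# Hypothetical survey rows; every value here is invented for the example.
rows = [
    {"PolygonID": "P1", "Round": 1, "Start": "08:00", "End": "08:30"},
    {"PolygonID": "P3", "Round": 1, "Start": "09:00", "End": "09:45"},
    {"PolygonID": "P3", "Round": 1, "Start": "09:00", "End": "09:45"},  # harmless repeat
    {"PolygonID": "P4", "Round": 2, "Start": "10:00", "End": "10:20"},
    {"PolygonID": "P4", "Round": 2, "Start": "10:00", "End": "10:50"},  # conflicting repeat
]

def mismatched_repeats(rows):
    """Return the (PolygonID, Round) groups whose repeats disagree on times."""
    times = defaultdict(set)
    for r in rows:
        # A set keeps each distinct (start, end) pair once per group.
        times[(r["PolygonID"], r["Round"])].add((r["Start"], r["End"]))
    # More than one distinct pair means the repeated rows do not match.
    return {key: sorted(pairs) for key, pairs in times.items() if len(pairs) > 1}

flagged = mismatched_repeats(rows)
print(flagged)
```

Only the groups in `flagged` need a manual look; identical repeats such as the P3 rows are left out.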

Recoding a column based off of a map/index that defines subsets of variables in that column

Up front, please excuse my terminology. I'm an MD/PhD student with little computer sci background and some experience with SPSS/R/Command line.
I have a set of patient encounter numbers and am trying to associate their surgeon identifier (AKA provider in the images below) with the encounter number. The Excel file I'm working with has numerous rows for a given encounter number. Each row represents an individual item that was charged for during the patient's surgery (75,000 rows with ~1900 unique encounter #'s, example in picture 1 below).
encounter numbers in excel file, without surgeon identifiers
I have another excel file that has the appropriate surgeon identifier on the same row as each unique encounter number (picture 2 below). My problem is that my first file does not have the appropriate surgeon identifier next to the encounter numbers. So, I could manually paste over the right encounter numbers from file 2, but it would take me hours/days to do all 1900.
surgeons associated with specific encounter numbers
I thought that I would be able to use my second file as a lookup table/map of which surgeon identifier links to which encounter number, so that I could generate a new variable or transform a copy of the encounter numbers into the correct surgeon identifier. To start on that, I concatenated the encounter numbers associated with each surgeon's identifier. Each list is comma-separated and sits in a single Excel cell per surgeon identifier, as below.
I can't find any question that addresses this. Any insight on how I could proceed?
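The lookup-table idea described above can work directly from the second file, without concatenating anything: build a map from encounter number to surgeon identifier, then fill in a new column row by row. Here is a minimal Python sketch under that assumption. The encounter numbers, surgeon identifiers, and field names are all hypothetical.

```python
# Hypothetical contents of file 2: one (encounter, surgeon) pair per encounter.
lookup_rows = [
    ("E1001", "SurgeonA"),
    ("E1002", "SurgeonB"),
]

# Hypothetical contents of file 1: many charge rows per encounter.
charge_rows = [
    {"encounter": "E1001", "item": "suture"},
    {"encounter": "E1001", "item": "drape"},
    {"encounter": "E1002", "item": "implant"},
    {"encounter": "E1003", "item": "gauze"},  # not present in the lookup
]

# The map: encounter number -> surgeon identifier.
surgeon_by_encounter = dict(lookup_rows)

for row in charge_rows:
    # .get() leaves None where the lookup has no entry, so gaps stay visible.
    row["provider"] = surgeon_by_encounter.get(row["encounter"])

print([row["provider"] for row in charge_rows])
```

This is the same operation as a left join of file 1 onto file 2 by encounter number, which most data tools (R included) provide as a single merge step.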

Extract list of genes that appear in both List 1 and List 2

I have two lists of differentially expressed genes, one from knockout A vs control and the other from knockout B vs control cells. However, the lists show slightly different genes being expressed in KOA than in KOB, so I want to create a final list containing only those genes which appear in BOTH KOA and KOB, leaving out any gene found in only one of them. Both lists of genes have the same column names and are contained within .csv files.
How would I go about this in R? I am completely lost.
Thanks,
-Yaseen
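What is being asked for is a set intersection on the gene identifier column. The question wants R; as a language-neutral illustration, here is a minimal Python sketch. The CSV contents, the column name `gene`, and the `log2FC` values are invented for the example, and inline strings stand in for the two .csv files.

```python
import csv
import io

# Hypothetical stand-ins for the two .csv files (same column names in both).
koa_csv = "gene,log2FC\nTP53,1.8\nMYC,-2.1\nGAPDH,0.9\n"
kob_csv = "gene,log2FC\nMYC,-1.7\nTP53,2.2\nACTB,1.1\n"

def gene_set(text, column="gene"):
    """Collect the values of one column from CSV text into a set."""
    return {row[column] for row in csv.DictReader(io.StringIO(text))}

# '&' is set intersection: genes present in BOTH lists.
shared = sorted(gene_set(koa_csv) & gene_set(kob_csv))
print(shared)
```

With real files, `io.StringIO(text)` would be replaced by an opened file handle; the intersection step stays the same.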

Set vertex names

I have a network in R and I have to attach names to all vertices that have more than 3 related ties (or better, that have degree >= 2, that is, 2 or more adjacent edges). In one case I have a network made of firms that collaborated with one another, and I need to assign to all vertices with degree >= 3 the corresponding firm's name (which I have in the csv dataset in the column Project Company).
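The labeling rule can be sketched independently of any network package: count each vertex's degree from the edge list, then give a vertex its name as a label only when the degree reaches the cutoff. The Python sketch below is an illustration under assumptions. The firm names and edges are hypothetical, and the cutoff is a parameter because the question mentions more than one threshold.

```python
from collections import Counter

# Hypothetical collaboration edges between firms (undirected).
edges = [
    ("FirmA", "FirmB"),
    ("FirmA", "FirmC"),
    ("FirmA", "FirmD"),
    ("FirmB", "FirmC"),
]

def labels_for_high_degree(edges, min_degree):
    """Map each vertex to its name if degree >= min_degree, else to ''."""
    degree = Counter()
    for u, v in edges:
        # Each undirected edge adds one to the degree of both endpoints.
        degree[u] += 1
        degree[v] += 1
    return {node: (node if d >= min_degree else "") for node, d in degree.items()}

labels = labels_for_high_degree(edges, min_degree=3)
print(labels)
```

An empty-string label is a common way to leave a vertex unlabeled when plotting; the resulting vector of labels would then be passed to whatever plotting function is in use.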

How can I query a genetics database of SNPs (preferably with R)?

Starting with a few human single nucleotide polymorphisms (SNPs), how can I query a database of all known SNPs such that I can generate a list (data.table or csv file) of the 1000 or so closest SNPs, whether or not each SNP is a tagSNP, what the minor allele frequency (MAF) is, and how many bases it is away from the starting SNPs?
I would prefer to do this in R (although it does not have to be). Which database should I use? My only starting point would be listing the starting SNPs (e.g. rs3091244, rs6311, etc).
I am certain there is a nice simple Bioconductor package that could be my starting point. But what? Have you ever done it? I imagine it can be done in about 3 to 5 lines of code.
Again, this is off topic, but you can actually do all of the things you mention through this web-based tool from the Broad Institute:
http://www.broadinstitute.org/mpg/snap/ldsearch.php
You just input a SNP and it gives you the surrounding window of SNPs, and you can export to csv as well.
Good luck with your genetics project!
