Sequence similarity search

This topic explores online tools for sequence similarity search, mainly BLAST at
  1. NCBI
  2. EBI

Example output for EPO, accession X02157 (cDNA) searched with BLASTN using default parameters:

  1. NCBI: RID-3598A6B101R
  2. EBI: jobID ncbiblast-I20161121-075026-0519-99904616-es
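
A search like the one above can also be prepared programmatically. The following is a minimal sketch, assuming the parameter names of NCBI's BLAST URL API as commonly documented (CMD, PROGRAM, DATABASE, QUERY) and the "nt" database; the helper name build_blast_submission is hypothetical. It only builds the request and sends nothing, and it does not claim to reproduce the exact defaults of the web form.

```python
from urllib.parse import urlencode

# Endpoint of the NCBI BLAST URL API (assumed; check the current NCBI documentation).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query, program="blastn", database="nt", **extra):
    """Return (url, form_body) for a BLAST 'Put' request without sending it.

    query    -- an accession (e.g. X02157) or a raw/FASTA sequence
    program  -- BLAST program name, e.g. blastn, blastp, blastx
    database -- target database name (assumption: "nt" for nucleotide searches)
    extra    -- any further URL API parameters, passed through unchanged
    """
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,
        "DATABASE": database,
        "QUERY": query,
    }
    params.update(extra)
    return BLAST_URL, urlencode(params)


url, body = build_blast_submission("X02157")
print(url)
print(body)
```

The returned body could then be sent as a POST request to the URL; the server's reply contains the request ID (RID) under which results can be retrieved later.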

Exercises:

  1. Take one of the overrepresented sequences from the SRR515298 sample, as listed in the FastQC report, and try to find out where it might come from:
    • >overrepresented1
      ACGGTTATAAATCAACACATTGATTTATAAGCATGGAAATCCCCTGAGTG
      
    • >overrepresented2
      TTTCCCTGGTGTTGGCGCAGTATTCGCGCACCCCGGTCAAACCGGGGTCA
      
    • >overrepresented3
      TTTTAGCGCACGGCTCTCTCCCAAGAGCCATTTCCCTGGACCGAATACAG
      
    • >overrepresented4
      CTTTTAGCGCACGGCTCTCTCCCAAGAGCCATTTCCCTGGACCGAATACA
      
    • >overrepresented5
      CCTGTTCGGTACGACATTGCTCACATTGCTTCCAGTATTATTTGCCCGCC
      
    • >overrepresented6
      ATCCGAAGCGAAAGCGTCGGGATAATAATAACGATGAAATTCCTCTTTGA
      
    • Some example results:
      EBI: ncbiblast-I20161121-113449-0758-58478693-pg
      NCBI: 35HMDEJ7015
  2. The default BLAST search at NCBI does not produce any hits for this sequence. Try to improve the sensitivity and figure out what it might be related to.
    BLASTN: 35J815RF015
    BLASTX: 35JBC2B1014
  3. Search a stretch of DNA from the human genome and try to find out where the exons are.
    Result: 35JG843D014
  4. Analyse a 9 kb genomic yeast sequence for coding genes (NCBI). Which feature is obscuring the results, and what parameter can be used to prevent this?
    Max matches in a query range set to 2: 35JXM61Y015
  5. Moving on to protein searches: What potential paralogues of HOXA1 (AAB35423.2, local copy) can you identify in the human genome by using NCBI's BLASTP server?
    Result: 35MDB80X014 or 35SJ9VG6014?
  6. Try the same search on the EBI server. In the 'Result Summary', what does 'Query-anchored showing identities' produce? How does this compare to the 'Flat query-anchored showing identities'? (result: ncbiblast-I20161121-103552-0292-36501782-pg)
  7. Here is the rhodopsin protein sequence from the zebrafish: danio_rerio.aa. Do a BLASTP search and report the percentage identity to rhodopsin in human, mouse and orca.
  8. Here is the rhodopsin protein sequence from the dolphin: dolphin.aa. Do a BLASTP search and report the percentage identity to rhodopsin in human, mouse and orca.
  9. Which are the next closest homologues to TLR4 (NP_612564.1, local copy)? Try the tree view in NCBI!
    Result: 35KUF6PR015
  10. A 2009 paper by Aoife McLysaght's group identified three protein-coding genes that were found only in humans:
    >sp|Q5K131|CLLU1_HUMAN Chronic lymphocytic leukemia up-regulated protein 1 OS=Homo sapiens GN=CLLU1 PE=2 SV=1
    MFNKCSFHSSIYRPAADNSASSLCAIICFLNLVIECDLETNSEINKLIIYLFSQNNRIRF
    SKLLLKILFYISIFSYPELMCEQYVTFIKPGIHYGQVSKKHIIYSTFLSKNFKFQLLRVC
    W
    
    >sp|P86434|AAS1_HUMAN Putative uncharacterized protein ADORA2A-AS1 OS=Homo sapiens GN=ADORA2A-AS1 PE=5 SV=1
    MEQDWQPGEEVTPGPEPCSKGQAPLYPIVHVTELKHTDPNFPSNSNAVGTSSGWNRIGTG
    CSHTWDWRFSCTQQALLPLLGAWEWSIDTEAGGGRREQSQKPCSNGGPAAAGEGRVLPSP
    CFPWSTCQAAIHKVCRWQGCTRPALLAPSLATLKEHSYP
    
    >sp|P0CZ25|D10OS_HUMAN Uncharacterized protein DNAH10OS OS=Homo sapiens GN=DNAH10OS PE=2 SV=1
    MHSLPRSGSIRRTHSDTQATGWPPPQRIGDSPGPSPAFLSCPPSLCGGAAQTGDPVALPH
    GPEKWVWGGGLSPRNPHSWGIKAHGLRPPWAPRLERCMVPESEWAPWQPQLPCEPKWLGS
    RKSKPHRESGLRGGGPSRCAKRGTHSCGPRESGGPDTCHLPCH
    
    Carry out BLAST searches to see how unique these sequences are.
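
For the percentage-identity exercises (7 and 8), it can help to see how such a figure is derived from a pairwise alignment. The sketch below assumes the common convention of counting identical columns and dividing by the full alignment length, gaps included. The function name percent_identity and the two short aligned strings are made up for illustration; they are not real rhodopsin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of alignment columns holding the same residue in both rows.

    Both arguments are rows of one pairwise alignment, so they must have the
    same length. Gap columns count towards the length but never as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Toy alignment (hypothetical data): 8 columns, one gap, one substitution.
query_row = "MKT-AYLV"
subject_row = "MKTQAYIV"
print(percent_identity(query_row, subject_row))  # 6 of 8 columns match -> 75.0
```

When comparing values across species, note that the figure depends on the length of the aligned region, so identities reported for alignments covering different parts of the query are not directly comparable.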



kahokamp@tcd.ie