About ASSEMBLER (ver 1.0)

ASSEMBLER was developed to facilitate the alignment of overlapping DNA sequencing reads and the construction of the final sequence. To minimize processing time, the program makes no attempt to "fine-tune" the alignment of nucleotides in regions of ambiguity. This is probably a job better suited to human intelligence anyway.

Index

Strategy
Levels of Use
Adjustable Parameters
Substitution Codes for Ambiguous Bases
Handling Output
The Script

Strategy

At this time (6/1/00) the program first searches for an alignment between each read and the BASE sequence. The BASE is either 1) the current consensus sequence or 2) if a consensus sequence is not submitted, sequence read 1. Sequences that show too little similarity to the BASE are then examined, in a secondary search, for similarity to the sequences that have already been aligned. A tertiary search, although potentially informative, is not carried out.

The choice of BASE sequence can be important, especially when few sequences are submitted. In general, the longer the BASE the better; hence the use of the current consensus sequence. The best results are probably obtained by downloading the output, fine-tuning the CONSENSUS generated by the program, and resubmitting the same sequences together with the edited consensus.

Note: ALL sequences must consist only of the letters A, C, G, T (U) and N. Gaps are not permitted (use N).
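
The search order described in this section can be sketched in Python as follows. This is only an illustration, not the program's actual Perl code: the function names (two_pass_search, shares_word) are hypothetical, and the similarity test is a stand-in supplied by the caller, here a toy check for one shared 6-letter word.

```python
from typing import Callable, List, Optional, Tuple

Read = Tuple[str, str]  # (name, sequence)


def shares_word(a: str, b: str, k: int = 6) -> bool:
    """Toy similarity test (an assumption): True if a and b share one k-letter word."""
    return any(a[i:i + k] in b for i in range(len(a) - k + 1))


def two_pass_search(
    reads: List[Read],
    similar: Callable[[str, str], bool] = shares_word,
    consensus: Optional[str] = None,
):
    """Return (primary, secondary, unaligned) lists of reads."""
    # BASE is the submitted consensus if there is one, otherwise read 1.
    if consensus:
        base, candidates = consensus, list(reads)
    else:
        base, candidates = reads[0][1], list(reads[1:])

    # Primary search: every read against the BASE.
    primary, leftover = [], []
    for name, seq in candidates:
        (primary if similar(seq, base) else leftover).append((name, seq))

    # Secondary search: leftovers against the reads aligned in the primary search.
    secondary, unaligned = [], []
    for name, seq in leftover:
        if any(similar(seq, aligned) for _, aligned in primary):
            secondary.append((name, seq))
        else:
            unaligned.append((name, seq))

    # No tertiary search: reads placed in the secondary search are not used as targets.
    return primary, secondary, unaligned


reads = [
    ("r1", "AAAACCCCGGGG"),      # becomes the BASE (no consensus submitted)
    ("r2", "CCCCGGGGTTTTAAAC"),  # overlaps r1
    ("r3", "TTTTAAACGTGTGT"),    # overlaps r2 only
    ("r4", "GTGTGTCACACA"),      # overlaps r3 only
]
primary, secondary, unaligned = two_pass_search(reads)
print([n for n, _ in primary], [n for n, _ in secondary], [n for n, _ in unaligned])
# ['r2'] ['r3'] ['r4']
```

In this toy run r4 is left unaligned because it overlaps only r3, which was itself placed in the secondary search. This is the kind of read a tertiary search would recover, and it shows why a longer BASE (such as an edited consensus) helps.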

TWO LEVELS of use are permitted:

1) SIMPLE: The user enters up to 10 sequencing reads and their names. A consensus sequence may optionally also be submitted to guide the program, especially through ambiguous areas.

2) ADVANCED: The user may specify parameters that determine either program performance or output style. These parameters are described below. ADVANCED also permits the entry of up to 25 sequencing reads.
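
Before submitting, the limits above and the alphabet rule from the Strategy section can be checked locally. The Python sketch below is an illustration only; check_submission and its arguments are hypothetical names, and it assumes that lower-case letters are acceptable, which this document does not state.

```python
MAX_READS = {"simple": 10, "advanced": 25}  # read limits for the two levels of use
ALLOWED = set("ACGTUN")                     # letters permitted in submitted sequences


def check_submission(reads, level="simple", consensus=None):
    """Return a list of problems found; an empty list means the input looks acceptable.

    reads is a list of (name, sequence) pairs.
    """
    problems = []
    if len(reads) > MAX_READS[level]:
        problems.append(
            f"{len(reads)} reads submitted; {level.upper()} allows {MAX_READS[level]}"
        )
    to_check = list(reads)
    if consensus:
        to_check.append(("consensus", consensus))
    for name, seq in to_check:
        # Assumption: case is ignored when checking the alphabet.
        bad = sorted(set(seq.upper()) - ALLOWED)
        if bad:
            problems.append(f"{name}: characters not permitted: {' '.join(bad)}")
    return problems


print(check_submission([("read1", "ACGTNNACGT"), ("read2", "ACG-TRA")]))
# ['read2: characters not permitted: - R']
```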

Parameters

i) BITE SIZE is the length (in nt) of sequence identity required to constitute 1 hit (default = 6).
ii) HIT NUMBER is the minimum number of hits with the same offset required to align two sequences (default = 4). The hits need not be contiguous.
iii) LINE LENGTH refers to the number of nucleotides shown per line of output (default = 50).
iv) CUTOFF 1 is the frequency required for a nucleotide to be shown in the consensus. The default (0.5) means that more than half of the sequences aligned at a position must show the same nucleotide for it to be used in the consensus. Set this value to 1 if all the sequences aligned at a position must show the same nucleotide for it to appear in the consensus.
v) CUTOFF 2 is the combined frequency required for two nucleotides to be specified in the consensus. Thus, A and T together must occur in more than 0.7 (the default) of the sequences for the letter W to appear at that position in the consensus. (See Table I, below).
vi) CUTOFF 3 is the combined frequency required for three nucleotides to be specified in the consensus (default = 0.9).
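
How BITE SIZE and HIT NUMBER interact can be illustrated with a short Python sketch. It is not the program's Perl code. The function name best_offset is hypothetical, and the sketch assumes that bites are taken at every position of the read (overlapping windows); the program may sample them differently.

```python
from collections import Counter, defaultdict


def best_offset(read, base, bite_size=6, hit_number=4):
    """Return the offset of read relative to base, or None if too few hits agree.

    The offset is the position in base at which position 0 of read would lie.
    """
    # Index every bite_size-long word of the base by its start position.
    positions = defaultdict(list)
    for j in range(len(base) - bite_size + 1):
        positions[base[j:j + bite_size]].append(j)

    # One hit = one identical word; hits are tallied by the offset they imply.
    hits = Counter()
    for i in range(len(read) - bite_size + 1):
        for j in positions.get(read[i:i + bite_size], ()):
            hits[j - i] += 1

    if not hits:
        return None
    offset, count = hits.most_common(1)[0]
    # Hits sharing an offset need not be contiguous; only their number matters.
    return offset if count >= hit_number else None


base = "GGATCCTTAGACGTACGGTCAAGC"
read = "ACGTACGGTCAAGCTTGA"
print(best_offset(read, base))                 # 10
print(best_offset(read, base, hit_number=10))  # None
```

A larger BITE SIZE makes each hit more specific, and a larger HIT NUMBER demands more agreeing hits before two sequences are aligned. In this example nine words agree on offset 10, so the default HIT NUMBER of 4 is met but a value of 10 is not.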

TABLE I.

Single letter codes for uncertain nucleotides.

=====================================
R = A or G           H = A, C or T
Y = C or T           V = A, C or G
M = A or C           B = C, G or T
K = G or T           D = A, G or T
S = C or G
W = A or T           N = A, C, G or T
=====================================

NOTE: These codes are used in the consensus sequence produced by ASSEMBLER to assist the user in defining areas of ambiguity. With the exception of "N" they MAY NOT BE USED in sequences (or consensus sequence) submitted to ASSEMBLER.
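
The way CUTOFF 1, 2 and 3 select a letter from Table I can be sketched in Python. This is an illustration under stated assumptions, not the program's own code: consensus_letter and consensus are hypothetical names; the two (or three) nucleotides tested are taken to be the most frequent ones; N and unaligned positions in the reads are ignored when frequencies are computed; U is counted as T; and a cutoff of 1 is treated as met when the frequency is exactly 1.

```python
from collections import Counter

# Table I codes for two and three nucleotides.
TWO_BASE = {frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("AC"): "M",
            frozenset("GT"): "K", frozenset("CG"): "S", frozenset("AT"): "W"}
THREE_BASE = {frozenset("ACT"): "H", frozenset("ACG"): "V",
              frozenset("CGT"): "B", frozenset("AGT"): "D"}


def exceeds(freq, cutoff):
    # "More than" the cutoff; assumption: a cutoff of 1 is met by a frequency of 1.
    return freq > cutoff or (cutoff >= 1 and freq >= 1)


def consensus_letter(column, cutoff1=0.5, cutoff2=0.7, cutoff3=0.9):
    """Consensus letter for one alignment column, given as a string of letters."""
    # Assumptions: U counts as T; N and padding characters are ignored.
    counts = Counter(b for b in column.upper().replace("U", "T") if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return "N"
    ranked = counts.most_common()  # most frequent nucleotides first
    if exceeds(ranked[0][1] / total, cutoff1):
        return ranked[0][0]
    if len(ranked) >= 2 and exceeds(sum(n for _, n in ranked[:2]) / total, cutoff2):
        return TWO_BASE[frozenset(b for b, _ in ranked[:2])]
    if len(ranked) >= 3 and exceeds(sum(n for _, n in ranked[:3]) / total, cutoff3):
        return THREE_BASE[frozenset(b for b, _ in ranked[:3])]
    return "N"


def consensus(rows, **cutoffs):
    """Consensus of equal-length, already aligned rows (pad with spaces)."""
    return "".join(consensus_letter("".join(col), **cutoffs) for col in zip(*rows))


print(consensus(["AAGA", "ATGC", "AAGT", "TTGG"]))  # AWGN
```

In the example, the second column holds two A and two T: neither exceeds 0.5 alone, but together they exceed 0.7, so W is shown. The last column holds all four nucleotides in equal numbers and falls through to N.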

Handling Output

At this time, output cannot be edited on the screen. To edit the output, save it as a TXT document. As mentioned above, resubmitting the sequencing reads along with an edited version of the CONSENSUS sequence should improve the results. The output as it appears on the screen can also be saved as an HTML document for future reference.

The Script

This program was written in Perl to run in a Linux environment.

last modified 14/1/00
Comments or suggestions: bawill@molecularworkshop.com