- '\" t
- .TH faidx 5 "August 2013" "htslib" "Bioinformatics formats"
- .SH NAME
- faidx \- an index enabling random access to FASTA files
- .SH SYNOPSIS
- .IR file.fa .fai,
- .IR file.fasta .fai
- .SH DESCRIPTION
- Using an \fBfai index\fP file in conjunction with a FASTA file containing
- reference sequences enables efficient access to arbitrary regions within
- those reference sequences.
- The index file typically has the same filename as the corresponding FASTA
- file, with \fB.fai\fP appended.
- .P
- An \fBfai index\fP file is a text file consisting of lines each with
- five TAB-delimited columns:
- .TS
- lbl.
- NAME	Name of this reference sequence
- LENGTH	Total length of this reference sequence, in bases
- OFFSET	Offset within the FASTA file of this sequence's first base
- LINEBASES	The number of bases on each line
- LINEWIDTH	The number of bytes in each line, including the newline
- .TE
- .P
- The \fBNAME\fP and \fBLENGTH\fP columns contain the same
- data as would appear in the \fBSN\fP and \fBLN\fP fields of a
- SAM \fB@SQ\fP header for the same reference sequence.
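- .P
- As a minimal sketch of how a consumer might read these five columns, the
- following Python fragment parses the text of an index into records.
- The names \fBFaiRecord\fP and \fBparse_fai\fP are illustrative only and are
- not part of htslib or samtools.
- .RS
- .nf
```python
from typing import NamedTuple

TAB = chr(9)  # the column separator used by fai index files


class FaiRecord(NamedTuple):
    name: str        # NAME: name of the reference sequence
    length: int      # LENGTH: total bases in the sequence
    offset: int      # OFFSET: byte offset of the first base
    linebases: int   # LINEBASES: bases per full sequence line
    linewidth: int   # LINEWIDTH: bytes per full line, terminator included


def parse_fai(text):
    """Parse the text of a five-column fai index into a list of records."""
    records = []
    for line in text.splitlines():
        if not line:
            continue  # tolerate a trailing blank line
        fields = line.split(TAB)
        if len(fields) != 5:
            raise ValueError("expected 5 TAB-delimited columns: " + repr(line))
        name, *numbers = fields
        records.append(FaiRecord(name, *map(int, numbers)))
    return records
```
- .fi
- .RE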
- .P
- The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes
- starting from zero, of the first base of this reference sequence, i.e., of
- the character following the newline at the end of the "\fB>\fP" header line.
- The lines of a \fBfai index\fP file usually appear in the order in which the
- reference sequences appear in the FASTA file, so \fB.fai\fP files are typically
- sorted according to this column.
- .P
- The \fBLINEBASES\fP column contains the number of bases in each of the sequence
- lines that form the body of this reference sequence, apart from the final line
- which may be shorter.
- The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
- the sequence lines (except perhaps the final line), thus differing from
- \fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
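- .P
- Taken together, \fBOFFSET\fP, \fBLINEBASES\fP, and \fBLINEWIDTH\fP allow the
- file position of any base to be computed without scanning the sequence.
- The following Python sketch illustrates the arithmetic, assuming 0-based
- positions and a FASTA file opened in binary mode.
- The function names \fBbase_byte_offset\fP and \fBfetch\fP are hypothetical
- and do not belong to htslib or samtools.
- .RS
- .nf
```python
def base_byte_offset(offset, linebases, linewidth, pos):
    """Return the byte offset in the FASTA file of the base at 0-based pos."""
    full_lines, within_line = divmod(pos, linebases)
    return offset + full_lines * linewidth + within_line


def fetch(fasta, length, offset, linebases, linewidth, start, end):
    """Read bases [start, end) of one sequence from a binary file object."""
    if not 0 <= start <= end <= length:
        raise ValueError("region lies outside the sequence")
    if start == end:
        return ""
    first = base_byte_offset(offset, linebases, linewidth, start)
    last = base_byte_offset(offset, linebases, linewidth, end - 1)
    fasta.seek(first)
    raw = fasta.read(last - first + 1)
    # Discard the line terminators that fall inside the byte range.
    return "".join(raw.decode("ascii").split())
```
- .fi
- .RE
- .P
- With \fBOFFSET\fP 5, \fBLINEBASES\fP 30, and \fBLINEWIDTH\fP 31, for instance,
- the base at 0-based position 30 is the first base of the second sequence line
- and lies at byte 5 + 1 * 31 + 0 = 36.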
- .SS FASTA Files
- In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
- file of the form
- .LP
- .RS
- .RI > name
- .RI [ description ...]
- .br
- ATGCATGCATGCATGCATGCATGCATGCAT
- .br
- GCATGCATGCATGCATGCATGCATGCATGC
- .br
- ATGCAT
- .br
- .RI > name
- .RI [ description ...]
- .br
- ATGCATGCATGCAT
- .br
- GCATGCATGCATGC
- .br
- [...]
- .RE
- .LP
- In particular, each reference sequence must be "well-formatted", i.e., all
- of its sequence lines must be the same length, apart from the final sequence
- line which may be shorter.
- (While this sequence line length must be the same within each sequence,
- it may vary between different reference sequences in the same FASTA file.)
- .P
- This also means that although the FASTA file may use Unix-style, Windows-style,
- or other line termination, the line terminators must be consistent,
- at least within each reference sequence.
- .P
- The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
- line text (i.e., up to the first whitespace character) as the \fBNAME\fP column.
- At present, no whitespace is permitted between the
- ">" character and the \fIname\fP.
- .SH EXAMPLE
- For example, given this FASTA file
- .LP
- .RS
- >one
- .br
- ATGCATGCATGCATGCATGCATGCATGCAT
- .br
- GCATGCATGCATGCATGCATGCATGCATGC
- .br
- ATGCAT
- .br
- >two another chromosome
- .br
- ATGCATGCATGCAT
- .br
- GCATGCATGCATGC
- .br
- .RE
- .LP
- formatted with Unix-style (LF) line termination, the corresponding fai index
- would be
- .RS
- .TS
- lnnnn.
- one	66	5	30	31
- two	28	98	14	15
- .TE
- .RE
- .LP
- If the FASTA file were formatted with Windows-style (CR-LF) line termination,
- the fai index would be
- .RS
- .TS
- lnnnn.
- one	66	6	30	32
- two	28	103	14	16
- .TE
- .RE
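- .P
- Both tables above can be reproduced with a short program.
- The following Python function is a sketch only, not the \fBsamtools\fP
- implementation: the name \fBbuild_fai\fP is hypothetical, the whole file is
- assumed to fit in memory as bytes, and a blank line is treated as ending the
- sequence, which is stricter than an indexer needs to be.
- It rejects sequences whose lines are not of consistent length.
- .RS
- .nf
```python
def build_fai(data):
    """Return (name, length, offset, linebases, linewidth) rows for FASTA bytes."""
    rows = []
    name = None
    position = 0  # byte offset of the start of the current line
    for raw, bare in zip(data.splitlines(keepends=True), data.splitlines()):
        if bare.startswith(b">"):
            if name is not None:
                rows.append((name, length, offset, linebases, linewidth))
            words = bare[1:].split()
            if not words or bare[1:2].isspace():
                raise ValueError("header line lacks a name directly after '>'")
            name = words[0].decode("ascii")  # first word of the header text
            offset = position + len(raw)     # first byte after the header line
            length = linebases = linewidth = 0
            short_seen = False
        elif name is not None:
            if bare and short_seen:
                raise ValueError("sequence line follows a short line in " + name)
            if not bare or (linebases and len(bare) < linebases):
                short_seen = True  # only the final line may be shorter
            elif linebases == 0:
                linebases, linewidth = len(bare), len(raw)
            elif len(bare) > linebases or len(raw) not in (linewidth, len(bare)):
                raise ValueError("inconsistent line length in " + name)
            length += len(bare)
        position += len(raw)
    if name is not None:
        rows.append((name, length, offset, linebases, linewidth))
    return rows
```
- .fi
- .RE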
- .SH SEE ALSO
- .IR samtools (1)
- .TP
- http://en.wikipedia.org/wiki/FASTA_format
- Further description of the FASTA format