'\" t
.TH faidx 5 "August 2013" "htslib" "Bioinformatics formats"
.SH NAME
faidx \- an index enabling random access to FASTA files
.SH SYNOPSIS
.IR file.fa .fai,
.IR file.fasta .fai
.SH DESCRIPTION
Using an \fBfai index\fP file in conjunction with a FASTA file containing
reference sequences enables efficient access to arbitrary regions within
those reference sequences.
The index file typically has the same filename as the corresponding FASTA
file, with \fB.fai\fP appended.
.P
An \fBfai index\fP file is a text file consisting of lines each with
five TAB-delimited columns:
.TS
lbl.
NAME	Name of this reference sequence
LENGTH	Total length of this reference sequence, in bases
OFFSET	Offset within the FASTA file of this sequence's first base
LINEBASES	The number of bases on each line
LINEWIDTH	The number of bytes in each line, including the newline
.TE
.P
The \fBNAME\fP and \fBLENGTH\fP columns contain the same
data as would appear in the \fBSN\fP and \fBLN\fP fields of a
SAM \fB@SQ\fP header for the same reference sequence.
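.P
As a minimal sketch of this correspondence, the following Python fragment
turns the first two columns of each index line into a SAM \fB@SQ\fP header line.
The function name \fBfai_to_sq_lines\fP is hypothetical; it is an illustration
only and not part of samtools or htslib.
.RS
.nf
.eo
```python
def fai_to_sq_lines(fai_text):
    """Build SAM @SQ header lines from the text of an fai index."""
    sq_lines = []
    for line in fai_text.splitlines():
        if not line:
            continue  # tolerate a trailing blank line
        # Only the first two of the five TAB-delimited columns are needed.
        name, length = line.split("\t")[:2]
        sq_lines.append("@SQ\tSN:" + name + "\tLN:" + str(int(length)))
    return sq_lines


# Index text for a hypothetical two-sequence FASTA file.
example_fai = "one\t66\t5\t30\t31\ntwo\t28\t98\t14\t15\n"
sq_header = fai_to_sq_lines(example_fai)
```
.ec
.fi
.RE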
.P
The \fBOFFSET\fP column contains the offset within the FASTA file, in bytes
starting from zero, of the first base of this reference sequence, i.e., of
the character following the newline at the end of the "\fB>\fP" header line.
The lines of an \fBfai index\fP file usually appear in the order in which the
reference sequences appear in the FASTA file, so \fB.fai\fP files are typically
sorted according to this column.
.P
The \fBLINEBASES\fP column contains the number of bases in each of the sequence
lines that form the body of this reference sequence, apart from the final line
which may be shorter.
The \fBLINEWIDTH\fP column contains the number of \fIbytes\fP in each of
the sequence lines (except perhaps the final line), thus differing from
\fBLINEBASES\fP in that it also counts the bytes forming the line terminator.
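.P
The three numeric layout columns are what make random access possible:
a base position can be converted to a file offset with integer arithmetic alone.
The Python sketch below illustrates one way to do this, assuming 0-based
positions and a FASTA file already held in memory as bytes.
The names \fBFaiEntry\fP, \fBparse_fai_line\fP, \fBbyte_offset\fP and
\fBfetch\fP are hypothetical and do not describe the samtools implementation.
.RS
.nf
.eo
```python
from collections import namedtuple

# One record per index line; the fields mirror the five columns.
FaiEntry = namedtuple("FaiEntry", "name length offset linebases linewidth")


def parse_fai_line(line):
    """Split one TAB-delimited index line into a FaiEntry."""
    name, length, offset, linebases, linewidth = line.rstrip("\r\n").split("\t")
    return FaiEntry(name, int(length), int(offset), int(linebases), int(linewidth))


def byte_offset(entry, pos):
    """File offset of the base at 0-based position pos of the sequence."""
    if not 0 <= pos < entry.length:
        raise ValueError("position outside the reference sequence")
    # Each complete line before pos costs LINEWIDTH bytes, not LINEBASES.
    full_lines, within_line = divmod(pos, entry.linebases)
    return entry.offset + full_lines * entry.linewidth + within_line


def fetch(fasta_bytes, entry, start, end):
    """Return the bases in the half-open range [start, end)."""
    if not 0 <= start <= end <= entry.length:
        raise ValueError("range outside the reference sequence")
    if start == end:
        return b""
    first = byte_offset(entry, start)
    last = byte_offset(entry, end - 1) + 1
    # Drop the line terminators that fall inside the requested range.
    return fasta_bytes[first:last].replace(b"\r", b"").replace(b"\n", b"")


# A small FASTA file with LF line termination and its index line.
example_fasta = (b">one\n"
                 b"ATGCATGCATGCATGCATGCATGCATGCAT\n"
                 b"GCATGCATGCATGCATGCATGCATGCATGC\n"
                 b"ATGCAT\n")
one = parse_fai_line("one\t66\t5\t30\t31")
across_line_break = fetch(example_fasta, one, 28, 32)
```
.ec
.fi
.RE
.P
Here the requested range spans the end of the first sequence line, so the
slice read from the file contains a line terminator that is removed before
the bases are returned.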
.SS FASTA Files
In order to be indexed with \fBsamtools faidx\fP, a FASTA file must be a text
file of the form
.LP
.RS
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
.RI > name
.RI [ description ...]
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
[...]
.RE
.LP
In particular, each reference sequence must be "well-formatted", i.e., all
of its sequence lines must be the same length, apart from the final sequence
line which may be shorter.
(While this sequence line length must be the same within each sequence,
it may vary between different reference sequences in the same FASTA file.)
.P
This also means that although the FASTA file may have Unix- or Windows-style
or other line termination, the newline characters present must be consistent,
at least within each reference sequence.
.P
The \fBsamtools\fP implementation uses the first word of the "\fB>\fP" header
line text (i.e., up to the first whitespace character) as the \fBNAME\fP column.
At present, no whitespace is permitted between the
">" character and the \fIname\fP.
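.P
The requirements above can be illustrated with a small Python sketch that
derives the five index columns from FASTA data held in memory and rejects
sequences that are not well-formatted.
The function \fBbuild_fai\fP is hypothetical; it is not the samtools
implementation, and its handling of blank lines (treated as ending the
sequence body) is an assumption made for this example.
.RS
.nf
.eo
```python
def build_fai(fasta_bytes):
    """Return one (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH) tuple per sequence."""
    rows = []
    name = None           # None until the first ">" header line is seen
    length = offset = linebases = linewidth = 0
    last_line_seen = False
    position = 0          # byte offset of the start of the current line

    for line in fasta_bytes.splitlines(keepends=True):
        bases = len(line.rstrip(b"\r\n"))
        if line.startswith(b">"):
            if name is not None:
                rows.append((name, length, offset, linebases, linewidth))
            words = line[1:].split()
            if not words or line[1:2].isspace():
                raise ValueError("header line has no name directly after '>'")
            name = words[0].decode("ascii")
            length = linebases = linewidth = 0
            offset = position + len(line)   # first byte after the header line
            last_line_seen = False
        elif bases == 0:
            last_line_seen = True           # assumption: a blank line ends the body
        elif name is None:
            raise ValueError("sequence data before the first header line")
        elif last_line_seen or bases > linebases > 0:
            raise ValueError("sequence %s is not well-formatted" % name)
        else:
            if linebases == 0:              # the first line fixes the layout
                linebases, linewidth = bases, len(line)
            if (bases, len(line)) != (linebases, linewidth):
                last_line_seen = True       # only allowed on the final line
            length += bases
        position += len(line)

    if name is not None:
        rows.append((name, length, offset, linebases, linewidth))
    return rows


# Two sequences with different line lengths, written with LF and with CR-LF.
fasta_lines = [b">one",
               b"ATGCATGCATGCATGCATGCATGCATGCAT",
               b"GCATGCATGCATGCATGCATGCATGCATGC",
               b"ATGCAT",
               b">two another chromosome",
               b"ATGCATGCATGCAT",
               b"GCATGCATGCATGC"]
unix_rows = build_fai(b"".join(text + b"\n" for text in fasta_lines))
windows_rows = build_fai(b"".join(text + b"\r\n" for text in fasta_lines))
```
.ec
.fi
.RE
.P
A line whose length or terminator differs from the first sequence line is
accepted only as the last line of its sequence, so a change of line
termination part-way through a sequence is reported as an error.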
.SH EXAMPLE
For example, given this FASTA file
.LP
.RS
>one
.br
ATGCATGCATGCATGCATGCATGCATGCAT
.br
GCATGCATGCATGCATGCATGCATGCATGC
.br
ATGCAT
.br
>two another chromosome
.br
ATGCATGCATGCAT
.br
GCATGCATGCATGC
.br
.RE
.LP
formatted with Unix-style (LF) line termination, the corresponding fai index
would be
.RS
.TS
lnnnn.
one	66	5	30	31
two	28	98	14	15
.TE
.RE
.LP
If the FASTA file were formatted with Windows-style (CR-LF) line termination,
the fai index would be
.RS
.TS
lnnnn.
one	66	6	30	32
two	28	103	14	16
.TE
.RE
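.P
The \fBOFFSET\fP values in these two indexes differ only by the extra
terminator byte on each preceding line.
The short Python sketch below works through that arithmetic for sequence
\fBtwo\fP; the helper name \fBoffset_after\fP is hypothetical.
.RS
.nf
.eo
```python
# Every line that precedes the first base of "two", without line terminators.
preceding_lines = [">one",
                   "ATGCATGCATGCATGCATGCATGCATGCAT",
                   "GCATGCATGCATGCATGCATGCATGCATGC",
                   "ATGCAT",
                   ">two another chromosome"]


def offset_after(lines, terminator_bytes):
    """Bytes occupied by the given lines, each followed by its terminator."""
    return sum(len(line) + terminator_bytes for line in lines)


offset_two_lf = offset_after(preceding_lines, 1)    # 93 + 5 * 1 = 98
offset_two_crlf = offset_after(preceding_lines, 2)  # 93 + 5 * 2 = 103
```
.ec
.fi
.RE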
.SH SEE ALSO
.IR samtools (1)
.TP
http://en.wikipedia.org/wiki/FASTA_format
Further description of the FASTA format