A TXT-type anviāo artifact. This artifact is typically provided by the user for anviāo to import into its databases, process, and/or use.
š To the main page of anviāo programs and artifacts.
There are no anviāo tools that generate this artifact, which means it is most likely provided to the anviāo ecosystem by the user.
This artifact is a TAB-delimited file that describes a set of enzymes.
The user can generate this file to define an arbitrary set of enzymes that they want to estimate metabolism on, using the program anvi-estimate-metabolism.
Each row (besides the header) in this file represents one enzyme in the set. At minimum, the file must contain three columns:
gene_id
column containing a unique value to identify a gene for the enzyme. The value can be either a string (like a gene name) or an integer (like a gene callers id), but it has to be unique because sometimes multiple genes can have the same enzyme annotation.enzyme_accession
column containing the accession of the enzyme, such as a KEGG Ortholog accession for KOfams, a COG accession for NCBI COGs, a Pfam, etcsource
column containing the name of the database that would be used to annotate the enzyme. For example, āKOfamā, āCOG20_FUNCTIONā, āPfamā, etc.Ideally, all annotation sources in this column would match to those used to define the metabolic pathways you are estimating completeness for (whether those are KEGG Modules or user-defined modules as in user-modules-data, but in practice, we donāt currently check for this. If you include some enzymes that are not part of any metabolic modules, they simply will not contribute to the completeness scores of any pathways, and you would therefore only see them in āhitsā mode output files.
So the source
column is (at this time) mostly for you to make sure you know which database these enzymes are coming from and that at least some (hopefully most) will actually be part of the metabolic pathways you are interested, because otherwise the results from anvi-estimate-metabolism might not make much sense. However, you do have complete freedom to define the āsourceā value arbitrarily, if you want. But please keep in mind that this may change in the very near future - one day these source
values might actually matter for the functioning of anvi-estimate-metabolism (in which case this documentation will be updated to reflect that). So it is best to get used to setting them properly. :)
Here is an example file with the minimum set of columns:
gene_id | enzyme_accession | source |
---|---|---|
aad:TC41_3038 | K02886 | KOfam |
aca:ACP_1744 | K02626 | KOfam |
aco:Amico_1604 | K00606 | KOfam |
ade:Adeh_0623 | K02669 | KOfam |
If you want downstream programs like anvi-estimate-metabolism to have access to the coverage and detection data for each enzyme (well, technically, its gene), then you can add two additional columns to this file:
coverage
column should contain the numerical coverage value for the gene encoding the enzymedetection
column should contain the numerical detection value for the gene encoding the enzymeIf these columns are included, you can use the --add-coverage
flag with anvi-estimate-metabolism so that this data is included in the output for each metabolic pathway and/or enzyme. However, you do need to include both of the columns - that program does not currently support adding just coverage or just detection.
gene_id | enzyme_accession | source | coverage | detection |
---|---|---|---|---|
aad:TC41_3038 | K02886 | KOfam | 4.44 | 0.7862318840579711 |
aca:ACP_1744 | K02626 | KOfam | 4.522875816993464 | 0.7790055248618785 |
aco:Amico_1604 | K00606 | KOfam | 2.63953488372093 | 0.8063380281690141 |
ade:Adeh_0623 | K02669 | KOfam | 2.011764705882353 | 0.6639344262295082 |
Edit this file to update this information.