ERROR LOADING HTML FROM SOURCE (http://ncf.sobek.ufl.edu//design/skins/UFDC/html/header_item.html)

Local Algebraic Invariant Statistics for a Heuristic to Compare Phylogenetic Trees

Permanent Link: http://ncf.sobek.ufl.edu/NCFE004268/00001

Material Information

Title: Local Algebraic Invariant Statistics for a Heuristic to Compare Phylogenetic Trees
Physical Description: Book
Language: English
Creator: Haywood, Ian
Publisher: New College of Florida
Place of Publication: Sarasota, Fla.
Creation Date: 2010
Publication Date: 2010

Subjects

Subjects / Keywords: Invariant
Statistic
Phylogenetics
Genre: bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: A central problem of evolutionary biology involves constructing the evolutionary pathway of observed species. Phylogenetic trees are natural structures used to model such pathways. We develop a method for comparing phylogenetic trees with a notion towards constructing a best fit tree for a given data set. Our method involves defining a local geometry on the space of trees. This geometry involves a comparison of local algebraic invariants. We use this geometry to construct an algorithm for producing best fit phylogenetic trees. Numerical experiments suggest this as an interesting avenue of research.
Statement of Responsibility: by Ian Haywood
Thesis: Thesis (B.A.) -- New College of Florida, 2010
Electronic Access: RESTRICTED TO NCF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE
Bibliography: Includes bibliographical references.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The New College of Florida, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Local: Faculty Sponsor: McDonald, Patrick

Record Information

Source Institution: New College of Florida
Holding Location: New College of Florida
Rights Management: Applicable rights reserved.
Classification: local - S.T. 2010 H4
System ID: NCFE004268:00001

Permanent Link: http://ncf.sobek.ufl.edu/NCFE004268/00001

Material Information

Title: Local Algebraic Invariant Statistics for a Heuristic to Compare Phylogenetic Trees
Physical Description: Book
Language: English
Creator: Haywood, Ian
Publisher: New College of Florida
Place of Publication: Sarasota, Fla.
Creation Date: 2010
Publication Date: 2010

Subjects

Subjects / Keywords: Invariant
Statistic
Phylogenetics
Genre: bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: A central problem of evolutionary biology involves constructing the evolutionary pathway of observed species. Phylogenetic trees are natural structures used to model such pathways. We develop a method for comparing phylogenetic trees with a notion towards constructing a best fit tree for a given data set. Our method involves defining a local geometry on the space of trees. This geometry involves a comparison of local algebraic invariants. We use this geometry to construct an algorithm for producing best fit phylogenetic trees. Numerical experiments suggest this as an interesting avenue of research.
Statement of Responsibility: by Ian Haywood
Thesis: Thesis (B.A.) -- New College of Florida, 2010
Electronic Access: RESTRICTED TO NCF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE
Bibliography: Includes bibliographical references.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The New College of Florida, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Local: Faculty Sponsor: McDonald, Patrick

Record Information

Source Institution: New College of Florida
Holding Location: New College of Florida
Rights Management: Applicable rights reserved.
Classification: local - S.T. 2010 H4
System ID: NCFE004268:00001


This item is only available as the following downloads:


Full Text

PAGE 1

LOCALALGEBRAICINVARIANTSTATISTICSFORAHEURISTICTO COMPAREPHYLOGENETICTREES BY IANTAYLORHAYWOOD AThesis SubmittedtotheDivisionofNaturalSciences NewCollegeofFlorida inpartialfulllmentoftherequirementsforthedegree BachelorofArts UnderthesponsorshipofPatrickMcDonald Sarasota,Florida May,2010

PAGE 2

Acknowledgments I'mverypooratacknowledgingthosemostimportanttomeandmystyleand skillofthewrittenwordhasatrophiedtothepointwhereIcannotgivetheproper addressofloveandrespecttoeachsingleindividualwhohashelpedmethroughmy education.ThefollowingisapoorfacsimileofgratitudetothoseIcanrecallvividly. PatMcDonaldforhispresence,hisguidance,andhisboundlessenergywhichhave driventhisthesis. Everysingleotherprofessorwhohaveshapedmypath,includingtheentireMathematicsfaculty,especiallypreviousadvisorsLeonKaganovskiyandEiriniPoimenidou. AspecialguestthankyoutoChrisHart,whohasquicklyandbrilliantlyshouldered hisshareofmyeducation. KevinHarrodattheLovelaceRespiratoryResearchInstituteforemployingme theprevioussummer,settinginmotionthepositiveenergyIneededtocompletemy thesis. Myparents,whoneverturnedskepticalorcritical,butlovingandsupportingover time. ErinDyles,whoseinnitepatiencethroughoutthisentireprocesshasbeenan invaluablestabilizingforce.Yourpresenceliveswithinthesepages. MyfriendsthroughoutmyNewCollegeexperience,especiallythosewhohave remainedhereaslongasIhave.Weareallintoxicatedbyourmutualfriendshipand itmakesusstronger. Finally,toeveryoneelsewhokeepsmeintheirheartsandminds.

PAGE 3

LOCALALGEBRAICINVARIANTSTATISTICSFORAHEURISTICTO COMPAREPHYLOGENETICTREES IanTaylorHaywood NewCollegeofFlorida,2010 ABSTRACT Acentralproblemofevolutionarybiologyinvolvesconstructingtheevolutionary pathwayofobservedspecies.Phylogenetictreesarenaturalstructuresusedtomodel suchpathways.Wedevelopamethodforcomparingphylogenetictreeswithanotion towardsconstructingabestttreeforagivendataset.Ourmethodinvolvesdening alocalgeometryonthespaceoftrees.Thisgeometryinvolvesacomparisonoflocal algebraicinvariants.Weusethisgeometrytoconstructanalgorithmforproducing besttphylogenetictrees.Numericalexperimentssuggestthisasaninteresting avenueofresearch. PatrickMcDonald DivisionofNaturalSciences

PAGE 4

Contents ITheory1 1Introduction1 2Genetics4 2.1Preliminaries...............................4 2.2Group-basedModels...........................10 3MaximumLikelihoodEstimation12 4HiddenMarkovModel17 5AlgebraicPropertiesofMarkovModels23 6Group-basedModels29 6.1FourieronGroups.............................29 6.2ToricIdeals................................35 6.3Labelings.................................37 6.4Group-basedFlattenings.........................42 7GeneralStochasticModel49 7.1VertexFlattenings............................51 7.2EdgeFlattenings.............................54 7.3SuciencyofFlattenings.........................57 8SpaceofTrees59 IIMethod,Data,andResults64 iv

PAGE 5

9Introduction64 10Hypothesis65 11Method67 12Results69 13Discussion75 14Conclusion76 ListofFigures 1Arootedevolutiontree,changesovertimeowdownwardawayfrom commonancestor,notdrawntoscale...................4 2Anunrootedevolutiontreee,onlytherelationsbetweentaxaareknown.5 3Asetof4alignmentswiththe2ndalignedbasehighlighted.7 4Aratematrixwiththeconditionofbeingsymmetric,theonlynecessaryconditionofGTR..........................9 5SimpleHMM...............................18 6Linegraphwiththreevertices......................24 7Arooted3-leaftree............................38 8Subtrees E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(fora6-leaftree T andedge v 8 ;v 9 ..43 9Asampleconsistentlabelingandedgepropagationon E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(..................................44 10Avertexatteningona6leaftree...................46 11Unrooted4-taxonphylogenetictree..................52 12A5-leafunrootedbinarytree.....................56 13Visualrepresentationofanedgeatteningalongedge v 9 ;v 10 58 v

PAGE 6

14Thespaceofall4-taxonunrootedbinarytrees.Alltreesin spacecollapsetheiredgeinto T 0 andextendtobecomeanother tree.....................................60 15Thelinkoftheoriginof T 4 ,representedbythe3bolddots. ThelabelscorrespondtothetreesinFigure........63 16Thelinkoftheoriginof T 5 ,shownhereinmostfamiliarform ofthePetersengraph.[22]......................64 17Resultsforinvariantstatisicheuristiconspaceoftreeswithhomogeneousbranchlengtht=.1.Group-based:+,General-Markov:x...70 18Resultsforinvariantmeasureheuristiconspaceoftreeswithhomogeneousbranchlengtht=.5,Group-based:+,General-Markov:x...71 19Asampleofallinvariantstatisticsforalltreesonaspacewithhomogeneousbranchlengtht=.1;1300group-basedattenings......72 20Asampleoftheinvariantstatisticsforalltreesonasinglespacewith homogeneousbranchlengtht=.5;1300generalMarkovattenings.74 vi

PAGE 7

PartI Theory 1Introduction Networktreesaremathematicalconstructsusedtovisuallyrepresenttherelationshipbetweenacollectionofobjects.Thesetreesaredenedusinga completesetofdistancesbetweenanytwoobjects.Ofimportancearebinary trees,directedorundirected,whichjointwopointsthatare"nearestneighbors"togethertoaninteriornode.Binarytreesarecapableofdetermining informationbetweenobjectsforavarietyofapplications.Asimpleexample isthevisualrepresentationofabinarysearchwithinasortedlist.The globaltopology ofatreeisdenedasthesetofallverticesandedgesof thetree.The localtopology ofatreeatagivenvertexoredgeisthesetof verticesandedgesinaconnectedcomponentsubtree.Subtreesobtainedby thedeletionofanedgeornode,producingsubtreesoftheoriginaltreethat areinandofthemselvesconnnectedbyalltheirrespectivenodesandedges. Binarytreeshavepropertiesverysimilartothevisualrepresentations ofthemacro-evolutionprocess.Coarsely,evolutioncanbedescribedfrom theintialpointofacommunityoforganismswithinaspeciesthatare somehowdividedtoallownaturalselectionpressurestocreatedierences betweenthetwocommunitiesandultimately,throughtheirinabilitytoshare geneticinformation,newspecies.Inthebinarytreewerepresenttheinitial communityasasinglenode,andthedivergencethroughnaturalselectionas abifurcationofasinglenodeconnectingtotwoothernodes.Abinarytree 1

PAGE 8

canrepresentanyseriesofdivergencesfromthethecommonancestorasa seriesofbifurcatingedgesfromthesinglenodecorrespondingtoanancestor, knownasthe root ,tothosenodesrepresentingthedescendants,knownas leaves Eachleafinthecontextofphylogenetictreesisreferredtoas taxon taxa thepluralarerepresentativeofanylevelofevolutionarydierence,from entirekingdomstosingleorganisms.Theconnectedhistoryofbifurcations relatingthesetofdescendantsistheevolutiontree,or phylogenetictree Untilrecently,therelationshipsbetweenorganismscouldonlybeattributed tophenotypiccharacteristicsthosecharacteristicswhicharededucedona largescalee.g.,hummingbirdbeaklengths,asopposedtothosecharacteristics atorbelowthescaleofthecell. ThetechnologicalinnovationsinDNAsequencinghaveledtonewmethods ofrelatingtaxabythedierencesintheirDNA.Thiscanbedoneby identifyingthosepartsofthegenome,or loci ,thatare orthologous ,orknown tobeevolvedfromacommonancestor.Incomparinglociwelookatthe DNAsequencethatcomprisesthesesharedsequenceothrologues. Each position inaDNAsequencecorrespondstoasingleletter.Wecreate anarrayofDNAsequencesfromthehomologousportionsofthegenome sharedbyeachtaxa;each i th columnreturningthe i th positionforeach DNAsequencebyorderingthehomologoussequencesinawaythatbestt thebiologicalintuitionorsomemathematicaloptimizationprograme.g.the Needleman-Wunschalgorithm.Thearraysareknownas alignments ,while asinglecolumnfromanarrayofalignmentsiscalledan alignedbase .By measuringthenumberoftimesweobservehowofteneachuniquealigned baseinthesetofalignmentsisobserved,wecandevelopthe"`distances"' 2

PAGE 9

betweenthetaxaandinferthephylogenetictreebestrelatingthem. Amongthemostpopularstatisticalmethodsinproducingaphylogenetic treeistheevaluationofaclassofstatisticalmodelsunder MaximumLikelihoodEstimation MLE.Themethodcorrespondstheexpectationofalignments tofunctionsmappingoveraspaceofparameters.TheendgoalofMLE isthenndingthoseparametersthatbesttthesefunctionstotheobserveddata.ThemostpopularsubclassofmodelsintheMLEmethod areMarkovmodels.IwillshowthatMarkovmodelsallowmanipulation thatproduceuniquenon-parametricobjectsknownas algebraicinvariants ThesecanbeusedtoanalyzecertaintopologicalfeaturesofthephylogenetictreeindependentofMLE.Amajordrawback,however,isthatthe naiveapproachtocomputeandanalyzeinvariantsindeterminingthecorrect phylogenetictreeistocomputeinvariantscorrespondingtoalltrees,acomputationwhosecomplexityisgreaterthanexponentialinthenumberofleaves. Givennaturalassumptionsontreestructure,Ishowthatalgebraicinvariants provideameansofdeterminingthetruetopologyofaphylogenetictree.I thenconstructthespaceofallpossibletreetopologies,wherethedistance metricbetweenuniquetreetopologiesisdenedbythenumberoflocal topological'changes'fromonetreetoanother.Fromthis,Iproposethe methodofcomparingnearestneighbortreesbytheirinvariants.Ishowthat byinitializingonarandomtree,evaluatingthistreeandalltreesclosestto it,choosingthebestinthissubset,thencontinuingtheprocesswiththis chosentreeuntilthebesttopologyisdetermined,Imaydeterminethebest ttopologyofthephylogenetictree.Thisisinfactthegreedyalgorithmon thespaceoftrees.Ishowthatthisapproachinthebestcasereducesthe 3

PAGE 10

Figure1:Arootedevolutiontree,changesovertimeowdownwardawayfromcommonancestor,notdrawntoscale. numberoftreestocomputefromexponentiallymanybythenaiveapproach, tolessthancubic.Iproposethatthoughtheideaofmeasuringstatisticsof treesforuseinagreedysearchonthespaceoftreeshasbeenproposed,the useofinvariantstocomputethestatisticalmeasureonagreedysearchhasnot beenpreviouslyproposed.Resultsofasmallexperimentusingthisapproach areincluded.Thoughthenumberofparametersandsamplesizeissmall,it isfoundthatthisapproachholdspromisetoproblemsinphylogenetics. 2Genetics 2.1Preliminaries DeoxyribonucleicAcidDNAisapolymerchaincomposedoffourmolecules: Adenine,Cytosine,Guanine,andThymine.Thesefourmoleculesare nucleotides ,orinterchangeably, bases .EachnucleotideinachainofDNAis 4

PAGE 11

Figure2:Anunrootedevolutiontreee,onlytherelationsbetweentaxaareknown. matchedagainstitschemicalcounterpartasa basepair toformahydrogen bond.Alongthestringofbases,thelocationofanucleotideintheDNAstring iscalledthe position .Thenumberofnucleotidesassociatedwiththegenomes ofanorganismvaryfromonseveralorders.Forexample,virusgenomesareas smallas3200nucleotides,andthehumangenomeisover3billionnucleotides. Nucleotidesubsequencesofgenomesthatdenebiologicaltraitsarereferred toas genes .Theprocessesrequiredtosupportthelifeofanorganism areinlargepartcontrolledbythetranscriptionandreplicationofgenes. TranscribingandtranslatinggenesintoproteinsthatperformbiologicalfunctionsisperformedbycollectionsofRibonucleicAcidRNA,whichare chemicallysimilartoDNA.Thesegenesareconservedbytheprocessof DNAreplicationinthedivisionofcellstoprovidechromosomesfornewly formedcells.Translationistheprocessofmanufacturingaminoacidsfrom transribedRNAthatinturnareusedtocreateanarrayofproteinsthat 5

PAGE 12

runmostprocessesinthecelland,eectively,theentireorganism.SubsequencesofDNAhavealsobeenfoundtocontainmost,ifnotall,ofthe informationnecessarytoproducemanyofthebiologicaltraitsofanorganism. A taxon isthebasicunitinphylogenetics,usedtodistinguishcollections oforganismsonassmallorlargeascaleasdesired.Ashandadog, forexample,aremembersofthetaxontype Phylum as Chordata ,duetotheir sharedtraitamongothersofhavingacentralskeletalstructure.Adog, ajackal,andthesharedgenesbetweenthemaremuchmoresimilarin comparisonandarecollectedinthetaxontype Genus as Canis .Thetaxon type species isacollectionoforganismsthatarebiologicallyabletobreed andshareDNA.Thecollectionofgenesoforganismsinthesamespecies andlocaledenesthe genepool ,thosegenesthatcanbeinheritedfrom breedingindividualswithinthesamelocale. Biologicaltraitsandtheirgenesareunderconstantpressurebynatural selction,andwhengenepoolsdiverge, speciation canoccur.Speciationisthe processbywhichtwopopulations,originallythesamespeciesbecometwo dierentspeciesandarenolongerabletomateandcombineDNAina commongenepool.Divergenceandnaturalselectionfromacommonancestor intheformationofnewspeciesisthemajorprocessinevolutionrst describedDarwin.Anexampleofthetreesofstudyinthisworkarealso usedtovisuallyinterperetthespeciationanddivergenceoftaxaovertime, Darwininhisbook TheOriginofSpecies ,includedadrawingofabifurcating evolutionary"`tree,"'withbranchesendingintaxacurrentlyexisting,and convergingtoacommonancestor,or root .Theevolutionarytreeovertime wasformallynameda phylogenetictree ,andthestudyofevolutionbythe 6

PAGE 13

Seq 1: Seq 2: Seq 3: Seq 4: A A T T 2 6 6 4 G C C C 3 7 7 5 T T C G C C C C Figure3:Asetof4alignmentswiththe2ndalignedbasehighlighted moleculargeneticsis Phylogenetics Animportantquestionthataroseimmediatelyaftertheintialdevelopment ofphylogeneticswas,"`Howtoproduceaphylogenetictreefromthecollection oftaxacurrentlyobservable?"'Atrstthisquestionwouldberesolvedby observingandcarefullyquantifyingthe phenotype ,orthephysicalcharacteristics ofanorganism.Solvingthisislimitedbytheidenticationofthosephysical characteristicsbetweentaxathatareassociatedbyacommonancestor.This hasseveralimmediateproblems,suchashowtoclassifyverysmallorganisms andphylogeneticallydistantorganismsthatshareacommontraitlikebirds andplatypi. Themodernphylogeneticmodelisbasedonthehypothesisofanevolutionary'clock',thatspeciesdivergefromacommonancestoranddonotconverge backtoashareddescendant,andthatthisdivergenceisoftenmeasuredby theexamingthesequencesdivergingbetweenspeciesatcommongenes.This hypothesisismadeformalonaphylogenetictreebyxingthegraphical structuretoadirected,bifurcatingtree.Allbinarytreeshavethefollowing properties: Allinteriornodesinbifurcatingtreeshaveexactlyoneparentnodeand twochildrennodes,whichrepresentsspeciation.Whenundirected,the interiornodeshaveexactlyvalencethree. 7

PAGE 14

Thesourcenodesrepresentthecommonancestorsinaphylogenetic tree. Thesinknodesleavesrepresentobservabletaxainaphylogenetictree. Ifthesinglecommonancestorisidentiable,itisrepresentedasthe uniquesourcenodeandreferredtoasthe root Amathematicalmodelfortheprocessofevolutionofageneticstring alongeachedgecanberepresentedtobeanindependentlyandidentically distributed i.i.d ,non-degenerativeprocessdenedbyaMarkovmatrix.That is,wesupposeeachtaxasequenceisaDNAstringoflength n ,thatthe rootstringisthegenesofacommonancestorforeachtaxaandthatthe i th positionalonganystringis position i .Bystackingtheobservedtaxa sequencessuchthateachpositionofthesequenceisthe i th columnofthe strings,weconstructalignments.Each i th columnofthealignmentsare referredtoasthe i th alignedbase FromeachpositionintheDNAstringassociatedtoaparentnode,we allowaposteriorprobabilitydistributions,knownas transition probabilites,to thesamepositionofthestringassociatedtoachildnodeinthetree.This istheprocessbywhichachildnode"`evolves"'fromitsparent.Allowing for v tobetheuniqueparentnodeofthechild v ,wedeneforevery edge v ;v aMarkovmatrix P v v = 0 B B B B B B B @ P v = A j v = A P v = A j v = C P v = A j v = G P v = A j v = T P v = C j v = A P v = C j v = C P v = C j v = G P v = C j v = T P v = G j v = A P v = G j v = C P v = G j v = G P v = G j v = T P v = T j v = A P v = T j v = C P v = T j v = G P v = T j v = T 1 C C C C C C C A 8

PAGE 15

Q = 0 B B @ )]TJ/F15 11.9552 Tf 9.299 0 Td [( r a + r b + r c r a r b r c r a )]TJ/F15 11.9552 Tf 9.298 0 Td [( r a + r d + r e r d r e r b r d )]TJ/F15 11.9552 Tf 9.298 0 Td [( r b + r d + r f r f r c r e r f )]TJ/F15 11.9552 Tf 9.299 0 Td [( r c + r e + r f 1 C C A Figure4:Aratematrixwiththeconditionofbeingsymmetric,theonlynecessary conditionofGTR Foreachofthefournucleotidesweidentifyfourprobabilitieswithcolumnsumtotalsofone. ThebiologicalassumptionsdonotallowtheuseofarbitraryMarkov matricesbetweennodes,andsoseveralmodelshavebeenproposedtoallow derivationsofMarkovmatricesasfunctionsofanon-degenerativeprocessover time.OneofthemostpopularmodelsistheGeneralTimeReversibleGTR model.Thismodelattributestheprocessofevolutionbetweentwosequencesto aratematrix[takentoacontinuousexponentialprocessovertime t .]InGTR foreachedgetheratematrixisasymmetricmatrixwithcolumntotals0[31]. InGTRtheMarkovmatrix P inEquationiscomputedbyan exponentiatingtime t overtheinstantaneousratematrix Q : P t = e Qt = 1 X i =0 1 i Q i t i Astime t increases,thereisagreaterprobabilitythatthelabelata positionforataxawilldegenerateintoanothernucleotide.Thegraphical representationoftimeonaMarkovmatrixisthelengthoftheedge correspondingtothephylogenetictree.Additionally,forthegeneralGTR modeltheaboveidentitycanonlybecomputednumerically,butonsome restrictionsoftheGTRthereareclosedalgebraicformson t 9

PAGE 16

2.2Group-basedModels Aspecialrestrictionontheratematricesassociatedtoedgesofthephylogenetic treemustbeconsideredforfuturereference.Ofrstconsiderationisthe Jukes-Cantormodel.Fixarate fortheratematrix Q tocreatethe Jukes-Cantor ratematrix: Q v v = 0 B B B B B B B @ )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.299 0 Td [(3 1 C C C C C C C A Asmentionedbefore,insomecaseswecanobtainanalgebraicformof thetransitionprobabilitymatrixfromcomputationoftheratematrix.This istrueforthisrestriction,andtheassociatedMarkovmatrixis: P v v = 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v 1 C C C C C C C A ThedevelopmentoftheJukes-Cantormodelwasmeanttosimplifyalgebraic methods,andisgenerallyexplainedasplausibleasthe a v willberelatively small,andthatvariationonpositionsisfairlyuniform.Thisdoesnot necessarilytthereality,asevenundertheassumptionthatevolutionoccurs astheprocessdescribedabove,theratematricesinwouldnotberestricted tobeuniform. AlessrestrictiveMarkovmatrixisthe Kimura 3-parametermodeldened by: 10

PAGE 17

M v v = 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v a v b v c v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v c v b v b v c v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v a v c v b v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v 1 C C C C C C C A TheKimura2-parametermodelisexactlylike.2,withtheadded restrction b v = c v .TheKimura-2andKimura-3modelsallowforsomeleeway inbiologicalplausibilitythattheJukes-Cantordidnot,asweareallowed dierentratesbetween transitions and transversions .Transitionsaresubstitutionsbetweenchemicallysimilarnucleotides,andaremuchmorecommon thantransversions,whicharethesetofallothersubstitutions. Byassumingadirectedbifurcatingtreetopology,andassociatingtoeach edgesetsoftransitionprobabilities,wedenea bayesiannetwork ;thatis,we havedenedamodelwiththefollowingcollectionofproperties[23]: Asetofvariablesandasetofdirectededgesbetweenvariables Eachvariablehasanitesetofmutuallyexclusivestates Thevariablesandedgesformadirectedacyclicgraph Thoughwehavedenedamodel,theassociatedtransitionprobabilites occurringinthemodelarenotdirectlyidentiable,andthequestionof howtoconstructthetreetopologycannotbeaddressed.Itremainsto beseenwhatotherformsofdatacanbeanalyzedfromtheobservations ofsequencesandhowtousethisdatatoconstructthecorrectphylogenetictree. 11

PAGE 18

Theaddresstheproblemofidentifyingthetransitionprobabilitiesthat besttthemodel,wemustmeasurestatisticsonalignedstrings.Ourmodel createsarandomwalkonMarkovmatricesfromroottoeachleaf,resulting inprobabilitiesthateachnucleotidewilloccuratagivenleafatasingle position.Thenforasinglepositiononallsequences,thenucleotidesobserved aretheresultsonarandomwalkoveraphylogenetictree.Theordered collectionofthesenucleotidesarrangedoveroneposition,isanalignedbase, andthemostusefulandpopularstatisticsaremeasuredoveroccurrencesof alignedbases.Fornotation,werefertothealignedbaseatposition j as a j andthenumberofoccurrencesobservedinthealignments u a j .Wereferto themeannumberoftimes a j isobservedby^ p a j = u a j n ,andtheindeterminate probabilityofthealignedbaseoccurring, p a j .Theindeterminateprobability isaconstructionofthemodelswecreate,andourpurposeistooptimze ourmodelsinawaysuchthatitallowsaminimaldierencebetweenthe evaluationof p a j andtheobservations ^ p a j 3MaximumLikelihoodEstimation Ofthemanystatisticalapproachesmeasuringalignedbases,themostpopular amongcomputationalbiologistsisMaximumLikelihoodEstimationMLE. MLEisageneralmethodthathasmanyclassesofmodelfunctionsovera parameterspace.Fornow,thisparameterspaceisacollectionofundened variables:= 1 ;:::; d ; i 2 C Let m bethelengthofthealignments, a 0 i thealignmentforposition i Forthearrayof n alignments[ a 0 1 ;a 0 2 ;:::;a 0 m ] ;a 0 i 2f A;C;G;T g n ,wecompute coordinateexpectationovertheparameterspace. 12

PAGE 19

f a 0 i : 1 ;:::; d 7!f 0 ; 1 g Fromthismodelwesupposethateachobservation a 0 i isindepedentfor i =1 ;:::;m ,alongthelengthofthealignments.Thentheprobabilitythatwe observetheseriesofalignedbases a 0 i forallpositions i inthealignmentsis theproductofall f a 0 i L a 0 1 ;:::;a 0 m j := m Y i =1 f a 0 i where L isreferredtoasthe Likelihood functionforallalignedbaseson. Example Fromtherst4alignedbasesinFigure,weidentifythe coordinatefunctionsandtheassociatedlikelihood: f AATT ;f GCCC ;f TTCG ;f CCCC L AATT;GCCC;TTCG;CCCC j = f AATT f GCCC f TTCG f CCCC Deneforthesetofall unique observations f a j g = f A;C;G;T g n the associatedsetofcoordinatefunctions: f a j : 1 ;:::; d 7! [0 ; 1]: a j 2f A;C;G;T g n ForthesetofcoordinatefunctionsinEquationdenethedirect product F ofallcoordinatefunctions: F : R d [0 ; 1] 4 n ; 7! f a 1 ;:::;f a 4 n Example Thedirectproductofcoordinatefunctionsforasetof4alignments 13

PAGE 20

is f AAAA ;f AAAC ;:::;f TTTG ;f TTTT Forthesetofcoordinatefunctions f a j andthesetofalignedbase observationcounts u a j ,thelikelihoodfunctiononthesetofalignmentsis L u a 1 ;:::;u a 4 n j = 4 n Y j =1 f a j u a j : Example Giventhesetofallnonzeroobservationcountsfor2alignmentsis u AA =1 ;u CA =2 ;u GC =4 ;u TG =3,thelikelihoodfunctionis: L u AA ;u AT ;:::;u TT j = f AA f 2 CA f 4 GC f 3 TG ItiseasilycheckedthatEquationsand13areequivalentonanyset ofalignments. TheproblemofMaximum-LikelihoodEstimationistomaximizethelikelihoodfunctionforasetofobservationcounts u a j over.Thisproducesa setofparametersthatbest'explain'thesetofcounts u a j asthemost likelyobservation,andreturnthesetofestimates f a j closestto^ p a j Theformof L looksintimidatingasis,butallowingfor L tobetwice continuouslydierentiable,solvingforthelocalandglobalmaximmumsof L isequivalenttosolvingforthelocalandglobalmaximumsofanystrictly increasing,twice-dierentiabletransformationof L .Thetransformationbest suitedinmostcasesfordeterminingthelocalandglobalmaximumsof L isthelogarithm. 14

PAGE 21

l := log L u a 1 ;:::;u a 4 n j = m X j =1 u a j f a j Thelogarithmtransformofthelikelihoodfunctionisthe log-likelihood ,and limitsthecomputationalcomplexitytoalinearfunctiononthecomplexity ofthecollectionoffunctions f a j Example FromExample,wecalculatethelog-likelihood: log L u AA ;u AT ;:::;u TT j = f AA +2 f CA +4 f GC +3 f TG Tondthebestsetofcoordinateexpectationsistoidentifytheglobal maximumofthelikelihoodfunctionandequivalentlythelog-likelihoodfunction. Anobviousmethodforobtainingthiswouldbetorstobtainacritical point.Acriticalpointofthelog-likelihoodfunctionisdenedbysatisfying for1 i d @log L @ i = 1 L @L @ i =0 Thesecriticalpoints,foranycaselargerthanone-dimension,donot implyglobalorlocalmaximaorminima.Toguaranteelocalmaximaof thelog-likelihoodfunction,allsecondpartialderivativesofthelog-likelihood functionmustbenegative. Example Thesimplestconstructionforacollectionofcoordinatefunctions f a j comprisingthelikelihoodfunctionisalinearcombinationof 1 ;:::; d f a j = d X i =1 b ji i + c j 15

PAGE 22

Thederivativeofthelog-likelihoodfunctionbyachosenindeterminate k withtheaboveasacoordinatefunctionforall j is: @log L @ k = m X j =1 u a j b jk d P i =1 b ji i + c j Forthecaseofasetof2alignments,thecoordinatefunctionswould havetheform f a 1 = f AA = b 11 1 + c 1 + b 12 2 + c 1 + + b 1 d d + c 1 f a 2 = f AC = b 21 1 + c 2 + b 22 2 + c 2 + + b 2 d d + c 2 f a 16 = f TT = b 1 + c 16 + b 2 + c 16 + + b d d + c 16 andthesetofallpartialderivativesofthelog-likelihoodwouldbe @log L @ 1 = m X j =1 u a j b j 1 P i =1 b ji i + c j @log L @ 2 = m X j =1 u a j b j 2 P i =1 b ji i + c j @log L @ d = m X j =1 u a j b jd P i =1 b ji i + c j Acriticalpointof L isobtainedwhenEquation14areallzeroforsome 2 R d Thesolutionforndingthezerosofsinglederivativesforalikelihood functioncanbedoneemployingcalculusorGroebnerbasesiftheloglikelihood 16

PAGE 23

functionisallowedarepresentationasapolynomialin R [ 1 ; 2 ;:::; d ]. TheMLEapproachisveryrichandthereareavarietyofmodelsthat havebeenpreviouslyproposedwithvaryinglevelsofsuccess[25]. 4HiddenMarkovModel The HiddenMarkovModelHMM isaclassofgraphicalmodelsthatinvolve actionsbetweennodesthatmayeitherbeobservedorhidden.Themodels themselvesaretusingMaximum-LikelihoodEstimation,withacollectionof coordinatefunctionsevaluatedtomaximizetheassociatedlikelihoodfunction. ThegraphicalconstructionofanHMMisaBayesiannetworkwithasingle sourcenode,associatedwithtwocomplementarysubsetsofitsvertexset.The rstsetconsistsofthosewhoselabelsmaybeobserved,andtheseconda collectionofnodeswhoselabelsarenotobserved,andaredeemed"hidden". Bythepropertiesofdirectedacyclicgraphs,wedonotallowmodelsto includeoutwardedgesfromobservednodes,astheirstateisobservableand henceare d-separated fromanycandidatenodesThereaderisreferredto [23]forfurtherreadingonthepropertiesofdirectedacylicgraphs.Theedges representanarrayofposteriorprobabilities,withanindependentdistribution associatedtotheuniquesourcenodeofthegraph.Toexplainthismethod further,wecomputeasmallexample. Example Imagineashortseriesofhiddenandobservedcoinips,two observedandtwohidden,suchthattheprobabilitiesforeachareconditional onanypreviouscoinip.Weassociatehiddenvariables v h 0 ;v h 1 and observedvariables f v o 2 ;v o 3 g withdirectededgesandtheirrespectiveprobabilites 17

PAGE 24

Figure5:SimpleHMM E T =[ v h 0 ;v h 1 ; v h 0 ;v o 2 ; v h 1 ;v o 3 ].Eachedgeinturnhasanassociated transition probabilitydistribution,andapriordistributionisassociatedtotheinitial vertex. =[ 0 ; 01 ; 02 ; 13 ]=[ P v h 0 ;P v h 1 j v h 0 ;P v h 2 j v h 0 ;P v h 3 j v h 1 ].Thenthecoordinatefunctionfortheobservation v o 2 = H;v o 3 = T is f HT = X i;j 2 [ H;T ] 0 i 0 ; 1 i;j 0 ; 2 i;H 1 ; 3 j;T Itiseasilycheckedthatthissatisesthenon-biased,independentcasefor binomialprobabilities. ByrestrictingBayesiannetworkstotrees,andattachingtoeachthe MarkovmatricesconstructedinSection,wecanevaluatethePhylogenetic treeproblembyMLE.The root istheuniquesourcevertexwithnodirected inwardedges.Someconstructionsallowasingleoutwarddirectededge,while othersdeletethisedgeandlabeltheresultingvalencetwonodeastheroot. 18

PAGE 25

Thisisamatterofpreference,asthetwoconstructionsareequivalentAllow forabifurcatingrootwithindependentdistributionvector r tohavethe form 0 r P e r forsomevectordistribution 0 r andtheadjacentoutwardedge posteriordistribution P e r .The leaves ofthetreeareallverticeswitha singledirectedinwardedgeandnodirectedoutwardedges. EachedgewillbeassociatedtoaMarkovmatrix,whichinturndescribes anarrayofparametersin. Weconstructourparametersforallpossibledistributionsontherootand edge-set: 8 i;j 2f A;C;G;T g ;v 2 V T = f r g ; v v i;j := P v = j j v = i r i := P r = i Allowingthevertexordering V T = f v 1 ;:::;v m g : = r A ;:::; r T ; v 1 v 1 A;A ; v 1 v 1 A;C ;::: :::; v 1 v 1 T;T ; v 2 v 2 A;A ;:::; v m v m T;T Withtheproperprobabilisticconditionsthat 19

PAGE 26

8 i;j 2f A;C;G;T g ; v v i;j 0 ; X k 2f A;C;G;T g v v i;k =1 8 i 2f A;C;G;T g ; r i 0; X k 2f A;C;G;T g r k =1 Forallpracticalcases,wesubstitutepositivityfornonnegativityfordistributions,asitassumedthatthereisachanceofpositionevolutiontoall bases,nomatterhowsmall. ThenwemayassociateforeachMarkovmatrix P v v asconstructedin Section2: P v v = 0 B B B B B B B @ v v A;A v v C;A v v G;A v v T;A v v A;C v v C;C v v G;C v v T;C v v A;G v v C;G v v G;G v v T;G v v A;T v v C;T v v G;T v v T;T 1 C C C C C C C A Remark Whetherthetransitionprobabilitiesaredependentorindependent ofachosenedgeisachoiceofwhethertheedgesareofthesamelength. Ifwerecall,theMarkovmatrixconstructedabovewasdependentontime. Fixinganedgelengthwouldbeequivalenttoxingatime t andthusxing theMarkovmatrixforeachedge. Eachuniqueobservation a j inthephylogeneticHMMisanalignedbase, withthelabelsinthealignedbasetheassociatedobservationsontheleaves ofthetree.Thenthecoordinatefunctions f a j aresumsovertheproducts oftheparametersassociatedtoeachedgeandtheindependentdistribution 20

PAGE 27

oftheroot. Example WeperturbtheHMMinFigureintoaphylogenetictreeby replacingthebinarylabelswithnucleotidelabels,andremovethenode v h 1 by thefollowingprocess: 8 i;j 2f A;C;G;T g ,let v h 0 v o 3 i;j = X k 2f A;C;G;T g v h i v h 0 v h 1 i;k v h 1 v o 3 k;j : Allowthetreetoonlycontainnodes v h 0 ;v o 2 ;v o 3 ,edges v h 0 ;v o 2 ; v h 0 ;v o 3 andprobabilities n v h 0 ; v h 0 v o 2 i;j ; v h 0 v o 3 i;j : 8 i;j 2f A;C;G;T g o .Theresult isarootedphylogenetictree,andtheproblemofdeterminingthevaluesof coordinatefunctionsfortwoalignmentsisequivalenttoevaluatingthismodel. Weordertheelementsinthesetofleaves L T astheorderofthe alignedbase, l 1 ;l 2 ;:::;l n .Lettheinteriorverticesbeallthosevertices withvalence3,hence IV T = V T )]TJ/F15 11.9552 Tf 12.005 0 Td [( L T [f r g .Let f v i 0 ;v i 1 ;:::;v i g bea partialorderingoftheinteriorverticesbydistanceawayfromroot,leastto greatest,suchthat v i )]TJ/F31 7.9701 Tf 6.586 0 Td [( n )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 ;v i )]TJ/F31 7.9701 Tf 6.587 0 Td [( n )]TJ/F31 7.9701 Tf 6.586 0 Td [(2 ;:::;v i aretheparentnodesof l 1 ;l 2 ;:::;l n respectively. Thenforanyalignedbase a j ,weassociatetheindividuallabelingonthe leavesby a j ;a j ;:::;a j n ,andtheexpectationofthealignedbaseon theleavesbythecoordinatefunction f a j = X k r X k v i o X k v i 1 X k v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 X k v i r k r v i o k r ;k v i o v i 1 k v i o ;k v i 1 v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 k v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 ;a j n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 v i k v i ;a j n 21

PAGE 28

whichisjustthesumoftheparameterproductsoverallrootandinterior labelings.Theresultinglikelihoodfunctionisthesameasabove,takinga productofeachalignedbaseexpectationforeachposition.Aclearerform ofthecoordinatefunctionistobuildasetof histories oftheinteriorlabels. Let IV T = V T )]TJ/F15 11.9552 Tf 12.79 0 Td [( L T [f r g bethesetofallinteriorverticesand H = k v v 2 IV T [f r g thesetofalllabelingsontheinteriorverticesplusthe root,thenthecoordinatefunction f l 1 l 2 l n k l 1 ;k l 2 ;:::;k l n = X k v 2 H r k r Y e M v v k v ;k v ItistrivialtoshowthatEquationsandareequivalent. TheindeterminatesforthegeneralMarkovmodeloneachedge-wise Markovmatrixis12werequirethecolumnsumstobe1,thenumberof indeterminatesacrossthewholemodelis3+12 j E T j =3+12 n )]TJ/F15 11.9552 Tf 9.495 0 Td [(3=24 n )]TJ/F15 11.9552 Tf 9.495 0 Td [(33, andthetotaldegreeofpolynomialtoevaluateforeachcoordinatefunction is2 n )]TJ/F15 11.9552 Tf 13.047 0 Td [(2.Whilethelogarithmtransformationrelievesusofevaluating indeterminateproductsofexponentiallygreatertotaldegree,itisapparent thatndingaglobalmaximumisindeeddicultforanymethod.Thisis usuallywhyafteracertainnumberoftaxa,theHMMonphylogenetictrees isperturbedtotradeecacyforeciency,ornotusedatall. Ingeneral,MLEmethodshavebeenfoundtohavecertainunsavory properties: [6]foundmultiplemaximumlikelihoodoptimafortheHMM,andfor certaintopologies,continuumofoptimaoccured. In[7],itisfoundthatphylogenetictreereconstructionbyMLEis NP-hard. 22

PAGE 29

Dependingonthemodel,suchasin[18],MLEonHMMhasalsobeen foundtobeinconsistent. ThoughtheseproblemsareespeciallyperniciousforthemostgeneralMarkov model,wecandrasticallyrestrictthedimensionoftheparameterspaceby choosingauniformMarkovmatrixforeachedge,andthenfurtherby restrictingthedegreesoffreedomoftheparametersinthematrix.These restrictionsandtheirimplicationswillcomeintouselateron. Thegeneralconsensusamongstcomputationalbiologistsisthat,ifnot intractable,maximum-likelihoodestimationispreferred.Thoughnoneofthe modelsaddressrealissuesinthephysicalprocessesinDNAreplicationand recombination,fortheapplicationsofcurrentinterest,MLEisthepreferred method. 5AlgebraicPropertiesofMarkovModels Maximum-LikelihoodEstimationisapowerfulapproachthatdirectlyaddresses thecommonlyassumedphylogenetictheory.However,parameterestimation alongthesemodelsbecomescomputationallyexpensiveasalignmentsizeand thenumberofleavesincreases.Itiswisetoconsiderotherpropertiesofthe HiddenMarkovModel. Apossibleavenueinrelievingourselvesfromtheburdenofexpensive parameterestimationistodiscussthealgebraicpropertiesofMarkovmodels. WeconsideragainourcoordinatefunctionformallyconstructedinEquation ,andperturbeachdeterminatecoordinatefunction f a j byrestricting theirdomain f 1 ; 2 ;:::; d g andcodomains p a j : a j 2f A;C;G;T g n tobe 23

PAGE 30

Figure6:Linegraphwiththreevertices indeterminate.Thatis,weallowthemtoremainunknownvariablesuntil substitutedforrealvalues.Denotetheinducedmap. ~ f a j : 1 ; 2 ;:::; d 7! p a j : Thenthedistributionoftheindeterminateexpectationsforallleaf-labels isadirectproductofallcoordinatefunctionsinEquationis ~ F : 7! p AA AA ;p AA AC ;:::;p TT TG ;p TT TT ; X a j 2f A;C;G;T g n p a j =1 : Weconstructfrom im ~ F thepolynomialring R [ p AA AA ;p AA AC ;:::;p TT TG ;p TT TT ],andwediscussthosefunctions E : im ~ F C andtheirvariousproperties.Oneofthemostnaturalproperty todiscussisthekernelofthemap E 24

PAGE 31

Example Westartwithanexamplefromanundirectedthreenodelinegraph G asdepictedinFigure,denedbythevertexset V G = f A;B;C g and undirectededgeset E G = f A;C ; C;B g .Eachnodeisassociatedanevent whosestatespaceis f 0 ; 1 g .Foreachedgeweattachthelogicalstatement of conditionaldependence Inclassicalprobability,wesaytwoevents A and B are independent if andonlyiftheysatisfy P A B = P A P B : Givenathirdevent C ,thisindependenceimplies P A B j C = P A j C P B j C : If A and B onlysatisfyFigureforsomeevent C ,wedescribe A and B as" conditionallyindependent given C ." Conversely,iftwoevents A and B donotsatisfyFigure,theyare depedendent ofeachother,andifthreeevents A B C donotsatisfyFigure A and B are conditionallydependent given C ." Forourthreenodelinegraph,leteachpairofeventsbeindpendent giventhesetallothersifandonlyiftheredoesnotexistanedgebetween theirassociatednodes.Thenin G ," A indpendentof C given B "isthe onlyindependencestatement. Let k 1 ;k 2 ;k 3 ;k 4 ;k 5 benotnecessarilyuniquestatesinthespace f 0 ; 1 g 5 25

PAGE 32

ofanyoftheevents A;B;C ,andlet p ijk := P A = i;B = j;C = k for allstates i;j;k .Thebelowfollowsfromthesingleindependencestatement associatedto G p k 1 k 5 k 2 p k 3 k 5 k 4 = P A = k 1 ;B = k 5 ;C = k 2 P A = k 3 ;B = k 5 ;C = k 4 = P A = k 1 ;C = k 2 j B = k 5 P B = k 5 P A = k 3 ;C = k 4 j B = k 5 P B = k 5 = P A = k 1 j B = k 5 P C = k 2 j B = k 5 P A = k 3 j B = k 5 P C = k 4 j B = k 5 P B = k 5 2 = P A = k 1 j B = k 5 P C = k 4 j B = k 5 P A = k 3 j B = k 5 P C = k 2 j B = k 5 P B = k 5 2 = P A = k 1 ;C = k 4 j B = k 5 P B = k 5 P A = k 3 ;C = k 2 j B = k 5 P B = k 5 = P A = k 1 ;B = k 5 ;C = k 4 P A = k 3 ;B = k 5 ;C = k 2 = p k 1 k 5 k 4 p k 3 k 5 k 2 Whichinturnimplies: p k 1 k 5 k 2 p k 3 k 5 k 4 )]TJ/F17 11.9552 Tf 11.956 0 Td [(p k 1 k 5 k 4 p k 3 k 5 k 2 =09 TheresultinEquationisinterestinginthatittellsusalogical propertyofthemodelwe'vedevelopedwithoutconstructingthemodelitself. SinceourmodelmustsatisfyEquationforanychoiceoflabelings k 1 ;k 2 ;k 3 ;k 4 ;k 5 2f 0 ; 1 g 5 ,thecontrapositiveisalsotrue:Ifourgraphical modeldoesnotsatisfyEquationforanychoiceoflabelings, A and C arenotconditionallyindependentgiven B .Hence,wecanmakeassertions 26

PAGE 33

ofthetopologyofaMarkovmodelwithlogicalindependenceconditions attachedtoedgesbyevaluatingasetofpolynomials.Thosepolynomialsin theringofindeterminates R [ f p ijk : 8 i;j;k g ]thatarecongruenttozeroare called algebraicinvariants .Thesetofalgebraicinvariantsofamodelform anidealintheringofindeterminates.Lettheidealofallinvariantsinduced by" AindependentofCgivenB "bedenoted I A;C j B : Supposewearegivenamodelwithevents A 1 ;A 2 ;:::A n ;B 1 ;B 2 ;:::;B n ;C 1 ;C 2 ;:::C n suchthatforall i A i isconditionallyindependentof C i given B i .Let M betheunionofallsuchindependentstatementsforthemodel.thenthe followingisanidealofalgebraicinvariantsforthewholemodel: I M = I A 1 ;C 1 j B 1 [ I A 2 ;C 2 j B 2 [[ I A n ;C n j B n Iftheindeterminatesoftheringaregivenalexicographicorder,thenthe setofallpossiblealgebraicinvariantsconstructedasabovearea Groebner Basis fortheideal[16].Groebnerbasesareobjectsdenedinpolynomial rings,andreadersshouldbereferredtotheexcellent Ideals,Varieties,and Algorithms [8]forfurtherdiscussion. Analgebraicinvariantnotyetdiscussedbuteasilydemonstrablebythe propertiesofasetofindeterminateprobabilitiesis: 27

PAGE 34

X k 1 ;k 2 ;:::;k n p k 1 k 2 k n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1=0 Theaboveexamplesaremeanttoshowhowalgebraicinvariantscan determinepropertiesofastatisticalmodel,includingtopologicalfeaturesona graphicalmodel. Remark In1987,CavenderandFelsenstein[5]andLake[20]independently derivedalgebraicinvariantsfromthephylogenetichiddenMarkovmodelofa phylogenetictree.Tospecifythatthealgebraicinvariantswerederivedfrom phylogeneticmodels,theyareoftenreferredas phylogeneticinvariants .These wereshownin[12]toberelativelyunpredictive.Earlyresearchinproducing phylogeneticinvariantsbydierentmethodsandonvariousmodelsweremostly analgebraicexercise.Acatalogofpapersdevelopinginvariantsintherst sevenyearsofresearchisfoundin[15]. Beforewecontinueonwardwithconstructinginvariants,weneedtodiscuss afewpropertiesoninvariantsthatwouldheadtoconfusion.Recallthatfor thegeneralproblemofproducingaphylogenetictreestrictlyfromthedata onthetaxawearenotallowedpriorinformationonanyoftheinterior vertices,andalsotheroot.Allofourconstructionsuptothispoint,and theconstructionsafterwards,supposetheexistenceofaroot,andinsome cases,useobjectsthatrepresentthedistributionacrosstheroot.However,the invariantswedescribedonotrepresentanyinformationacrosstherootor interiorvertices.Theproblemforconstructingphylogenetictreesisthennot toconstructdirected,rootedphylogenetictrees,butundirectedphylogenetic trees.Itshouldbenotedthatfor n taxa,thenumberofdirectedphylogenetic treesis n )]TJ/F15 11.9552 Tf 10.917 0 Td [(3!!= n )]TJ/F31 7.9701 Tf 6.586 0 Td [(2! 2 n )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 n )]TJ/F31 7.9701 Tf 6.586 0 Td [(1! ,thenbytheperturbationoflabelingtherootas 28

PAGE 35

anotherleaf.Thetopologiesofthe n -leafrootedtreeshaveanaturalbijection intothesetof n +1-leafunrootedtreetopologies.Thenthenumberof n -leafundirectedphylogenetictreesis n )]TJ/F15 11.9552 Tf 11.243 0 Td [(5!!= n )]TJ/F31 7.9701 Tf 6.587 0 Td [(4! 2 n )]TJ/F32 5.9776 Tf 5.756 0 Td [(2 n )]TJ/F31 7.9701 Tf 6.587 0 Td [(2! .[Sinceourgoalis onlytodiscovertheundirectedphylogenetictopologies,howcanwedetermine fromconstructionsofdirectedtrees?]Itturnsoutthatwhendetermining invariantsfromarootedmodel,thechoiceoftherootpositionisarbitrary. Corollary5.1 Let T bea n -taxontreeandlet a beoneofthetaxalabeling theleaves.Thenthephylogeneticinvariantideal I T forthegeneralMarkov modelon T rootedat a isidenticaltothephylogeneticinvariantideal I T forthegeneralMarkovmodelon T rootedat r ,where r isanyotherleaf, internalnode,ornewnodeinstertedonanedgeof T Proof See[1]. Thentheconstructionofinvariantsisindependentofthelocationof therootofthetree,henceallphylogeneticinvariantsdescribeundirected phylogenetictreemodels,andanyconstructioninvolvingrootdistributionsare independentoftherootdistribution.Nowwecandiscussinvariantsandthe constructionofundirectedphylogenetictreetopologies. 6Group-basedModels 6.1FourieronGroups RecallSection2,specicallythelabelraterestrictionsinEquations6and .2.Theserestrictionswerecomputationalshortcutsthatwerebegrudgingly upheldtoproducephylogenies.However,startingwith[13],analgebraically attractiveapproachwasderivedbythewayofharmonicanalysis. 29

PAGE 36

WereferspecicallytotheKimura-3parametermodeldevelopedinEquation .2.Thismodelissymmetricwith3degreesoffreedom,withanadded groupstructure.Moreprecisely,weconstructagroupdenedbytheelements f A;C;G;T g ,andwhosepermutationsaredeterminedbythemap: f 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.956 0 Td [(c v ;a v ;b v ;c v g$f A;C;G;T g ThenthesetofoperationsinducedontothearrayEquation.2is: 0 B B B B B B B @ ACGT CATG GTAC TGCA 1 C C C C C C C A Todrawclosertothepoint,weattributethefollowingreversiblemapto Equation: A $ ; 0 ;C $ ; 1 ;G $ ; 0 ;T $ ; 1 ThenEquationisisomorphictothegrouptablefor Z 2 Z 2 ; popularly knownasthe Kleinfour-group .Furthermore,indeterminingtheratein Equation.2frombases k 1 ;k 2 wendthegroupelementinEquation thatisequalto k 2 )]TJ/F17 11.9552 Tf 11.955 0 Td [(k 1 ,andreversingthemapinEquation. Remark Thetwo-stateJukes-Cantormodelhasasimilarmapinto Z 2 ,and thesetofallaminoacidwords,composedofall3-letterwordsintheset ofnucleotides,areallowedasimilarmapinto 6 L n =1 Z 2 : Thateachofthese modelshaverepresentationsasanitedirectproductof Z 2 isimportant. Wewillneedtodevelopnewnotationandobjects.Let G; +beanite 30

PAGE 37

abeliangroup.Allowing T := f z 2 C : j z j =1 g ,wedenethe characters ofa group G asallmaps : G T thatsatisfy g 1 + g 2 = g 1 g 2 ; 8 g 1 ;g 2 2 G Hencethecharactersaregrouphomomorphismson G ,andthesetofall characterson G aredenedasthe dualgroup of G ,denoted ^ G ; .When ^ G = Z 2 Z 2 ,denotetheelementsin ^ G as[1 ;;; ]. ^ G ; isisomorphic to G ; +bythefollowingmap. G ^ G ; [ ; 0 ; ; 1 ; ; 0 ; ; 1] 7! [1 ;;; ] Wealsousethenotation h g; i torepresentthenaturalpairingof ^ G and G : G ^ G T ; g; 7!h g; i = g 7 Thevalueforalloperationsarelisted: h g; i := ; 0 ; 1 ; 0 ; 1 11111 1 )]TJ/F15 11.9552 Tf 9.299 0 Td [(11 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 11 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 1 )]TJ/F15 11.9552 Tf 9.299 0 Td [(1 )]TJ/F15 11.9552 Tf 9.298 0 Td [(11 Theconceptworksinhigherordersof G .Thedualofthedirectsum G m = L m i =1 G isisomorphicto ^ G m underthenaturalpairing: h g 1 ;g 2 ;:::;g m ; 1 ; 2 ;:::; m i = m Y i =1 h g i ; i i Giventheobjectsdenedaboveandafunction f : G C ; ,wedenethe Fouriertransform of f by: 31

PAGE 38

^ f : ^ G C ; 7! X g 2 G h g; i f g Next,denethebinaryactionof convolution foranytwofunctions f 1 ;f 2 on G by f 1 f 2 g = X h 2 G f 1 g )]TJ/F17 11.9552 Tf 11.956 0 Td [(h f 2 h ;g 2 G Thefouriertransformhasafewinherentproperties: ^ 1 = 8 > > < > > : j G j ;if =1 0 ;otherwise and f 1 f 2 = ^ f 1 ^ f 2 Recovering f from ^ f isalsofoundbytheprocessof Fourierinversion : f g = j G j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 X 2 ^ G h g; i ^ f ; 8 g 2 G Remark Attributingasetofgroupbasedactionstorepresenttherandom walkonaMarkovmodelintroducesawealthofknowledgeconcerningharmonic analysisofgroupsthatunfortunatelywillonlybeutilizedsuperciallyto developourdesiredinvariants.Introductoryreferencestoapplicationsof harmonicanalysisofgroupswithapplicationstostatisticsinclude[10]and [21].Furthermore,whileharmonicanalysisonphylogeneticMarkovmodelsis introduced[13],andinturnproducesanumberofphylogeneticinvariants utilizingaformofgaussianelimination,amoreinterestingapproachin harmonicanalysiswillbeconsideredinstead. 32

PAGE 39

Weproposeavectordistribution v overtheelementsin G bythe following: P v g 1 ;g 2 = v g 2 )]TJ/F17 11.9552 Tf 11.956 0 Td [(g 1 ; 8 g 1 ;g 2 2 G ;v 2 V T ; Givenaleaf l ,deneallverticesinthepathfromtheroot r to l by l .Thenasetof G -valuedrandomvariables Z v v 2 V T overthevertices V T inducesasetof G -valuedrandomvariablesovertheleaves, Y l l 2 L by thefollowingsum: Y l = X v 2 l Z v Remark The"trick"totheharmonicanalysisin[13]istodevelopfourier analysisoncoordinateexpectationsbyndingarepresentationofalignedbase distributionsassummationsoverdistributionsofthe'ancestor'vertexlabels. Thefollowingdescribesthefouriertransformoncoordinateexpectations. Recalltheconstructionoftheindeterminatecoordinateexpectations p k 1 k 2 k n denedinEquation,andthepolynomialringassociatedtothem.As p k 1 k 2 k n onanyarrayofgroupelements k 1 ;k 2 ;:::;k n isnecessarilyafunction from G n intothereals,itmusthaveafouriertransform.EvansandSpeed [13]providethefollowingconstructionofitsfouriertransform.First,we deneforagivendirectedphylogenetictree,theset v ofallleavesthat are"below" v .Thatis,thereisadirectedpathfrom v to l inthetreeif andonlyif l 2 v Theorem6.1 Let p k 1 k 2 k n betheindeterminatejointdistributionofagroup basedmodelonaphylogenetictreeT.Thefouriertransformofthedistribution is 33

PAGE 40

q 1 n =^ 1 n Y v 2V T =r ^ f v Y l 2 v l withtherepresentation ^ r ^ r = X g 2 G n h 1 2 n ;g i g Theproofofwhichisreferredto[30]. Fromhere,[13]developsamethodforobtainingphylogeneticinvariants, thoughageneratingsetforallsuchinvariantsoftheKimura-3modelisnot dened. Itwasthearrivalof[1]thatsparkedtheideafortheconstructionof invariantsthatcoulddescribealocaltoplogicalfeatureofaphylogenetictree. Although[1]describedanunrestrictedMarkovmodelanddidnotproduce allinvariants,itdidhoweverinspire[Sturmfels,Sullivant;2005],whichdoes producealltheinvariantsformodelswithedgeratesrestrictedtoasingle groupstructure.Although[1]doespredate[27],itisillustrativetoshow exactlyhowthosein[27]developedageneratingsetforallgroup-based invariantsbeforeproceedingtothestill-openproblemofndingagenerating setofinvariantsforthegeneralmodel. Unfortunately,thenotationdevelopedtofullyexpressEquationisnot adequatetoconstructtheinvariantsin[27].Wewillrequirethedevelopment of ToricIdeals 34

PAGE 41

6.2ToricIdeals Therecentdevelopmentsofpolynomialringtheory,especiallyintherealmof Groebnerbases,hasledanexpansivesearchforspecialclassesofpolynmial ringideals.Onesuchclassthatisintriguingforusisthe ToricIdeal Werequiretheuseofaeld k formostinstances,weassumethat theeldis R .Fixanarrayofindeterminates x = x 1 ;x 2 ;:::;x n ,and denotethemonomials x u = x u 1 1 x u 2 2 x u n n .Considerthepolynomialring k [ x ]:= k [ x 1 ;:::;x n ].Theringisgeneratedbyallsumsandproductsover theindeterminateset f x 1 ;x 2 ;:::;x n g ,withcoecientsin k Let beasemigrouphomomorphismdenedbyanychoiceof nd dimensionalintegervectors A = a 1 ; a 2 ;:::; a n : : N n Z d ; u = u 1 ;:::;u n 7! u 1 a 1 + + u n a n Giventhissemigrouphomomorphism,wedenethehomomorphismfrom thepolynomialringin f x 1 ;x 2 ;:::;x n g tothepolynomialeld t 1 ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 1 ;t 2 ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 2 ;:::;t n ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 n foranychoiceofpowervectors f u i g on f x i g bya naturalliftingof f a 1 ;a 2 ;:::;a n g intothetermpowersof t j : ^ : k [ x ] k [ t ] ;x u i i 7! t a i u i Example Thecollectionofintegervectors A = f ; 1 ; 0 ; ; 3 ; 2 ; ; 1 ; 1 ; ; 0 ; 2 g inducesthefollowingmap^ on f x 1 ;x 2 ;x 3 ;x 4 g : ^ : x 1 7! t 2 1 t 2 ;x 2 7! t 3 2 t 2 3 ;x 3 7! t 1 t 2 t 3 ;x 4 7! t 2 1 t 2 3 : Thismapgeneralizestoallpolynomialsintheeld,e.g.: 35

PAGE 42

^ x 2 4 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 2 1 x 2 x 3 3 = t 2 1 t 2 3 2 )]TJ/F15 11.9552 Tf 11.955 0 Td [( t 2 1 t 2 2 t 3 2 t 2 3 t 1 t 2 t 3 3 = t 4 1 t 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(t 5 1 t 8 2 t 5 3 Thekernelof^ areallthosepolynomialsin k [ x 1 ;x 2 ;:::;x n ]thatare mappedto0.Thesetofallpolynomialsinthekernelofamap^ associated toachoiceofintegervectors A isanidealinthepolynomialring,denoted I A andreferredasa ToricIdeal Everyvector u 2 Z n canbewrittenuniquelyas u = u + )]TJ/F28 11.9552 Tf 12.152 0 Td [(u )]TJ/F15 11.9552 Tf 15.466 -4.338 Td [(suchthat u + and u )]TJ/F15 11.9552 Tf 16.327 -4.338 Td [(arenon-negativewithdisjointsupport,i.e.theyrepresentthe positiveandnegativedomainsof u Foreverytermorder in k [ x 1 ;x 2 ;:::;x n ] ; weconstructanitesetof vectors G ker suchthatthereducedGroebnerbasisof I A withrespect to isequalto n x u + )]TJ/F28 11.9552 Tf 11.955 0 Td [(x u )]TJ/F15 11.9552 Tf 13.975 -7.151 Td [(: u 2G o .Thereisanalgorithm: Groebnerbasisoftoricideal I A 1.Introduce n + d +1indeterminates t 0 ;t 1 ;:::;t d ;x 1 ;x 2 ;:::;x n .Let beany termorderwith f t i gf x j g 2.ComputethereducedGroebnerbasis G oftheideal D t 0 t 1 t d )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ;x 1 t a )]TJ/F32 5.9776 Tf 0 -6.117 Td [(1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(t a + 1 ;:::;x n t a )]TJ/F35 5.9776 Tf 0 -4.838 Td [(n )]TJ/F17 11.9552 Tf 11.955 0 Td [(t a + n E 3.Return:Theeliminationideal G k [ x 1 ;x 2 ;:::;x n ] 36

PAGE 43

Remark If A isentirelynonnegative, t 0 isnotnecessaryandwecancompute theGroebnerBasisfromtheideal h x i )]TJ/F28 11.9552 Tf 11.955 0 Td [(t a i : i =1 ;:::;n i .[25] Examplecontinued ComputingtheGroebnerbasisof I A withgraded lexicographicorderyields f x 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 1 x 2 x 4 g .Thisiseasilychecked: ^ x 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 1 x 2 x 4 = t 1 t 2 t 3 4 )]TJ/F15 11.9552 Tf 11.955 0 Td [( t 2 1 t 2 t 3 2 t 2 3 t 2 1 t 2 3 =0 TheunionofallGroebnerbases G withrespecttoalexicographicorder foralllexicographicordersisthe universalGroebnerbasis ,andisa generatingsetof ker ^ Nowthatwehaveasolidideaonwhattoricidealsare,wecanconstruct invariantsthatmanipulateEquationtocomputetoricidealsofthe fouriertransformsofexpectationcoordinates.In[27],thismethodconstructs invariantsbysplittingatreetopologyintoasetoflocaltopologies. 6.3Labelings Weproduceinvariantsofthegroup-basedmodelbyassigninglabelstothe edgesandproducingamapfromthesetoffouriercoordinates q 11 1 ;q 11 ;:::;q toasetofindeterminatesrepresentingthe labelingsofanedgeforeachedge.Thepolynomialsinthetoricideal constructedthiswayarethenreturnedtotheringofexpectationcoordinates R [ p AA A ;p AA C ;:::;p TT T ],whicharethesetofinvariantsinthegroup-rate model.Thiswillbemadeexplicitsoon. 37

PAGE 44

Figure7:Arooted3-leaftree Foreachcoordinate q 1 ; 2 ;:::; n in q 11 1 ;q 11 ;:::;q ,weassociate thefourierleaflabelings 1 ; 2 ;:::; n toleaves l 1 ;l 2 ;:::;l n Weattributetoeachedgeonaphylogenetictreewithafourierleaf labelings,afourierlabelingdeterminedbytheleavesproceedingtheedge. Recallthedenitionof e asthesetofallleavesdirectedawayfromthe edge e .Foreachedge e 2 E T ,let g e betheassociatededgelabeldened by: g e = Y l 2 e l Example ConsiderFigure.3,andletthefourierlabelingontheleaves l 1 ;l 2 ;l 3 be ;; ,thentheinducedlabelingsoftheedgesare g e 1 = l 1 =1 ;g e 2 = l 2 = ;g e 3 = l 3 = 38

PAGE 45

g e 4 = l 2 l 3 = = Foreach leaf-wisecoordinate q 1 2 n ,weproduceaninjectivemapinto asetofindeterminateswithcoordinatesassociatedtothefourieredgelabels inducedbyEquation.Letting m = j E T j ,wedenotethese edge-wise coordinates by q g e 1 g e 2 g e m ,suchthatfor e i = f l i g ; 8 i =1 ;:::;n .Thatis, welabelthoseedgesadjacenttoleaveswiththeirassociatedleaf-labels. Recallingthetransformintofouriercoordinates,Equation,thismap cannowbeconstructedasaproductofthefouriertransitionfunction ^ f actingonthefourieredgelabelsconstructedabove. q 1 n $ q g e 1 g e 2 g e m = Y e 2 E T ^ f e g e Bythisconstruction,theinjectivityofthemapfromleaf-coordinateindeterminatestoedge-coordinateindeterminatesismadeexplicit.Itshouldalso benotedthatthisproductdoesnotrelyontherootrepresentation^ ,and alledges e havedirectionawayfromtheinteriornodeclosesttotheroot node.Thisdistinctionwillsoonbedoneawaywith,howeverforcaseson subtrees,andedgeswillbedirectedawayfromachosennode. Alongwiththosegrouplabelsandcharacterlabels,wedeneathirdset oflabels L and labelingfunction L : ^ G !L Recallthatwerequireidenticalgroup-basedratematricesattachedto theMarkovmodelonthephylogenetictree,orelsetheisomorphismfrom transitionprobabilitiestogroupactionsdoesnotexist.Weidentifyunique labelsforeachuniqueelementinthearray.ForJukes-Cantor,thisarray 39

PAGE 46

[1 )]TJ/F15 11.9552 Tf 12.606 0 Td [(3 ;;; ]hasonlytwouniquelabels,whileinKimura-3,thearray [1 )]TJ/F15 11.9552 Tf 12.272 0 Td [( + + ;;; ]hasauniquelabeltoeachelement.ForKimura-3, whichisthemodelweareinterestedin,thisonlycreatesanotherbijection betweenlabelings,andistrivial.However,itshouldbenotedthatwerewe tochooseamodel,we'dbeforcedtoconsider linearinvariants ,whichare coordinateequalitiesthatareattributedtoequivalentlabelings.Thosemodels withlinearinvariantsmustthencreateinvariantsmodulototheidealof linearinvariants.Thepurposeofcreating L -labelsisthentodeterminethose linearinvariants. Fromthelabelfunction L inEquation,weinduceamap L T from theedge-coordinatelabelingsabovetoanarrayofedge L -labels,dened: L T : ^ G m !L m ; : g e 1 ;:::;g e m 7! L g e e 2 E T = L g e 1 ;L g e 2 ;:::;L g e m Weconsiderforalledges e 2 E T andlabels l 2L ,weproposethe indeterminatevariables a e l ,termed edge-labelindeterminates .Thisinducesa maprelatingeachfourieredge-coordinatetothepolynomialringofedge-label indeterminates. C [ q g e 1 g e m : 1 ;:::; n 2 ^ G n ] C [ a e l : e 2 E T ;l 2L ] ; q g e 1 g e m 7! Y e 2 E T a e l L g e : Thenthemapfromleaf-coordinateindeterminatestoedge-labelindetermi40

PAGE 47

nantsisthecompositionofmapsEquationsand: C [ q 1 n ] [ q : 2 im L T ] C [ a e l : e 2 E T ;l 2L ] : Theedge-labelindeterminatesforagivenedge-wisefouriercoordinate isthefourierequivalenttoasinglehistoryasinEquation.We areonlyinterestedinthealgebraicmanipulationtheyallow.Themapof fourierleaf-coordinatestotheedge-labelindeterminatesinEquationis exactlythemapwhoseassociatedtoricidealwewishtocompute.More precisely,thosebinomialsin[ q : 2 im L T ]thatreturn0whenmapped into C [ a e l : e 2 E T ;l 2L ].Theunionofallidealsacrossthechoiceofleaf labelsisthesetofallphylogeneticinvariantsforgroup-basedmodels. Sogiventhetransformationfromgroup-coordinatestofouriercoordinates, andfouriercoordinatestoedge-labelindeterminates,wewantthekernel.The naiveapproachwouldbetouseBuchberger'salgorithmoranotheralgorithm tocomputeGroebnerbases,whichwouldwork,butisadistastefulformany reason. Groebnerbasiscomputationis slow ,prohibitivelyslow.Tobeavoided asidefromthesimplestcases. Providesusnorealintuitionastothetopologicalinformationthe phylogeneticinvariantsmaybecapableofcommunicating. [27]hasalreadysolvedthersttwoproblemswiththeirapproachof attenings 41

PAGE 48

6.4Group-basedFlattenings Withtheconceptofatteningscomesthedirtytruthaboutourconstruction overdirectedphylogenetictrees.Theconstructiondoesnot,infact,relyupon adirection.Thisisadvantageous,however,asourpurposesinthispaperis todeducethetopologyoftheundirectedphylogenetictree,sincepractically wehavenorealintutitionastowherethepositionofthecommonancestor isinrelationtothetaxa.Forthepurposesofpropagatinglabelsfromleaves, wedirectpropagationawayfromtheedgesandrootswechoosetoseparate oursubtreesalong. Let T beaphylogenetictree,weconstructforanedge e 2 E T the uniquepairofconnectedcomponentsubtreeswhoseunionofedgesis E T andwhoseintersectionisstrictly e ,andchooseanarbitrarydirectionaway from e thatwilldene e .Notethatforourpurposes,wewillonlychoose interioredges.Denethesesubtrees T e; + and T e; )]TJ/F15 11.9552 Tf 15.529 1.793 Td [(bytheirrespectiveedge sets: E T e; + := e [f e g E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(:= E T )]TJ/F15 11.9552 Tf 11.955 0 Td [( e [f e g Thevertexsets V T e; + ;V T e; )]TJ/F15 11.9552 Tf 7.084 1.794 Td [(areobtainedfromtheunionsofthevertices fromtheedgesets.Itisclearthatthechoiceofdirectionthatdenes e isarbitrary. Example ObserveFigure.4,whichistheresultofsplittinga6-leafunrooted treealongtheedge v 8 ;v 9 ,producing T e; + and T e; )]TJ/F15 11.9552 Tf 7.084 1.793 Td [(.Labelingeachsubtree 42

PAGE 49

Figure8:Subtrees E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.794 Td [(fora6-leaftree T andedge v 8 ;v 9 as T e; + or T e; )]TJ/F15 11.9552 Tf 16.608 1.793 Td [(isdependentonanarbirtrarychoiceofdirectioninthe undirectedtree T Theactionofobtainingapairofsubtreesinthismannerwillbereferred toasa edgeattening along e .Themeaningofwhichwillbecomemore clear.Next,wedenetheimagesofthelabelingfunctiononthesesubtrees, andrequirepairsoflabelingstobe consistent : Denition Wesayachoiceoftwolabelings )]TJ/F36 11.9552 Tf 11.385 -4.338 Td [(2 im L T e; )]TJ/F15 11.9552 Tf 6.752 -3.264 Td [( ; + 2 im L T e; + are consistent if )]TJ/F15 11.9552 Tf 7.084 -4.338 Td [([ e ]= + [ e ]. Thatis,werequirethoselabelingsontheleafsetsofoursubtreesto propagateontoourchosenedge e withthesamelabel.Theintuitionis sound,astherandomwalkalongthecorrectphylogenetictreedescribedas groupactionsmust,overtime,produceconsistentactionsalongtheinterior edgestoobtainlabelingsontheleaf-setovertime. 43

PAGE 50

Figure9:Asampleconsistentlabelingandedgepropagationon E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.794 Td [( Althoughitisnotrequired,wewillusethetableaunotationof[27]to developourinvariants.Let f l i :1 ; 2 ;:::;d g forthefollowing tableau notation representasetof L -labelsontheedge-setforthetree T .Foranygiven d 2 N ,theedge-coordinatemonomial M = q l 1 q l 2 q l d hasanassociatedtableau: M = 2 6 6 6 6 6 6 6 4 l 1 l 2 l d 3 7 7 7 7 7 7 7 5 Let T 1 ;T 2 ;T 3 beconnectedcomponentsubgraphs,andtheirrespective consistentedgelabelings l;m;n .Let f l i ;m i ;n i ;i =1 ;:::;d g beacollection ofconsistent,notnecessarilyuniqueedgelabelings.Wedenotetheproduct of L -labeledge-coordinates q l 1 ;m 1 ;n 1 q l 2 ;m 2 ;n 2 q l d ;m d ;n d asthemonomial M with tableaunotation: 44

PAGE 51

M = 2 6 6 6 6 6 6 6 4 l 1 m 1 n 1 l 2 m 2 n 2 . . l d m d n d 3 7 7 7 7 7 7 7 5 Usingthisnotation,wehavethefollowing. Lemma6.2 Supposeforthesubtrees T 1 ;T 2 ;T 3 havetheedgesets e [ f e g ; f e g ;E T )]TJ/F15 11.9552 Tf 12.715 0 Td [( e [f e g ,andlet l 1 ;m;n 1 and l 2 ;m;n 2 betwonot necessarilydistinctconsistentlabelingsof T 1 ;T 2 ;T 3 .Then g = 2 6 4 l 1 mn 1 l 2 mn 2 3 7 5 )]TJ/F39 11.9552 Tf 11.955 27.616 Td [(2 6 4 l 1 mn 2 l 2 mn 1 3 7 5 liesintheideal I T;L Proof Thelabelings l 1 ;m and l 2 ;m areconsistentlabelingsforthesubtree T e; )]TJ/F15 11.9552 Tf 7.084 1.793 Td [(,andlabelings m;n 1 and m;n 2 areconsistentlabelingsforthesubtree T e; + .[27]. Thesetofallpossiblebinomialsconstructedbyconsistentlabelingson edgeatteningswithrespecttoagiven e aredenoted Quad e;T Next,weconstructthesetofconnectedcomponentsubtrees T 1 ;T 2 ;T 3 for eachinternalvertex v 2 V T bytheunorderedsetoftreesthatsatisfythe condition: V T 1 V T 2 = V T 1 V T 3 = V T 2 V T 3 = f v g ,asinFigure .4: Choosingsubtreesinthismanneriscalled vertexattening .Muchas weobtainedinvariantsbyachoiceofedgeattenings,wedothesamefor 45

PAGE 52

Figure10:Avertexatteningona6leaftree vertices. Denition Chooseavertex v withvalence c .Thenthesubtreecontaining onlythoseedgesadjacentto v istheclawtree K 1 ;c : Notethatforourpurposes,weonlychooseinteriorverticesonbifurcating trees,hence c =3. Lemma6.3 [27]Let T 1 ;T 2 ;:::;T c betheconnectedcomponentsubtreesof T denedbythepropertythat 8 i 6 = j;V T i V T j = v .Thatis,we constructoursetofsubtreesbyseparatingthemattheirjointvertex.Let f l 1 1 ; ;l c 1 ;:::; l 1 d ; ;l c d ; m 1 1 ; ;m c 1 ;:::; m 1 d ; ;m c d g beacollectionof consistentlabelingsontothesesubtrees.Thenthefollowingisabinomialin tableaunotation 46

PAGE 53

g = 2 6 6 6 6 4 l 1 1 l c 1 . . l 1 d l c d 3 7 7 7 7 5 )]TJ/F39 11.9552 Tf 11.955 38.377 Td [(2 6 6 6 6 4 m 1 1 m c 1 . . m 1 d m c d 3 7 7 7 7 5 Thesetgeneratedbytheconstructionofallsuchbinomialswithconsistent labelingsonthesubtreesassociatedtotheatteningon v isdenoted Ext B v T ,andisasubsetof I T;L .Furthermore,foreachrow i ,column j ,consider labelings L i j ;M i j 2 Im L T v;e j ;l j i suchthattheset L j i d j =1 isequaltothe multiset M j i d j =1 .Thenthebinomial g = 2 6 6 6 6 4 L 1 1 L c 1 . . L 1 d L c d 3 7 7 7 7 5 )]TJ/F39 11.9552 Tf 11.955 38.376 Td [(2 6 6 6 6 4 M 1 1 M c 1 . . M 1 d M c d 3 7 7 7 7 5 liesin Ext B v T ,andfromallchoicesofconsistentlabelingsand arrangementofmultisets,alltheassociatedbinomialsgenerate Ext B v T Thisisreallypretty.Asweseehere,byatteningthetreebyeach vertex,we'reabletoproduceinvariantsbasedontheindependenceofthe edgelabelingsfromeachother.Wedenoteby B v thegeneratingsetfor thesevertexatteningsforeach v .Thenextlemmasettlesoursearchfor groupbasedinvariants. Lemma6.4 Let T beatreewithconsistentlabelingand e anyinterior edge.Thengiventheconstructionoftheobjectsdescribedabove,theideal I T;L isexactly [ v Ext B v T [ [ e Quad e;T : andgivenatermorder,thebinomialconstructionsabovearealsoauniversal 47

PAGE 54

Groebnerbasis.[27] Thislemmaprovesthatthepreviousconstructionsofinvariantscreatea generatingsetforinvariantsofthefamilyofKimuramodels,andwhats more,determinesallinvariantsbytheirlocalstructuresalone.Thisisincrediblyuseful.Weshallseemoreinvariantsthataredeterminedbytheirlocal topologies,andthatwhenconsideringlocaltopologies,wecandiscusstrees thatareverymuchlikeeachother. Awatchfuleyemayhavenoticedthatwhilewecancreatebinomialsas aboveuptoadegree d ,wehavenotmentionedwhat d maybe.Thefollowing theoremputsthistorest: Theorem6.5 [27]Let T beanarbitrarybinaryrootedtree.Modulothe trivialinvariant q 00 0 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1.thephylogeneticinvariantidealoftheJukes-Cantorbinaryisgenerated bypolynomialsofdegree2 2.thephylogeneticinvariantidealoftheJukes-CantorDNAmodelis generatedbypolynomialsofdegree1,2,and3 3.thephylogeneticinvariantidealoftheKimura2-parametermodelis generatedbypolynomialsofdegree1,2,3and4 4.thephylogeneticinvariantidealoftheKimura3-parametermodelis generatedbypolynomialsofdegree2,3and4 Themethoddescribedincomputingpolynomialsofdegree d forthevertex atteningsareallowedtobeboundedby d =4toobtainwithaterm orderingaHilbertbasisforthissubsetofpolynomials. 48

PAGE 55

Remark Allthiswasstartedbythepropositionin[1]thatinvariantscould bededucedfromexploitingthelocaltopologyofaphylogenetictree.Thatall thegroup-basedinvariantsweredeterminedbythisnotionisratherincidental, asthegroup-basedmodelisveryrestrictiveonthepredictivecapabilities. Rememember,wewereforcedbythemodeltoxalledgestothesame length. 7GeneralStochasticModel ThissectionreturnstothelanguageofMarkovmodelsandawayfromspectral analysis.Forreference,weconsiderthestructureforageneralstochastic Markovmatrix. 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(d )]TJ/F17 11.9552 Tf 11.955 0 Td [(g )]TJ/F17 11.9552 Tf 11.955 0 Td [(jabc d 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a )]TJ/F17 11.9552 Tf 11.955 0 Td [(h )]TJ/F17 11.9552 Tf 11.955 0 Td [(kef gh 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(b )]TJ/F17 11.9552 Tf 11.955 0 Td [(e )]TJ/F17 11.9552 Tf 11.955 0 Td [(li jkl 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(c )]TJ/F17 11.9552 Tf 11.955 0 Td [(f )]TJ/F17 11.9552 Tf 11.955 0 Td [(i 1 C C C C C C C A UnliketheKimuramodels,therearenoidentiablealgebraicstructuresto attachmachineryto;theconstructionofalgebraicinvariantsforthegeneral MarkovmodelcarriedoutviamanipulationofthelinearalgebraonMarkov matrices. Remark Therstconstructionofinvariantsarethosefoundby[5]and[20], werebuiltforthegeneralstochasticmodel.Thesewereimportantintroductory discussions,theinvariantswhenevaluatedoverdatahavebeenfoundtomake weakstatisticalmetrics.Itisgenerallyassumedthattheseandotherlinear invariants.Thoseinvariantswithtotalpolynomialdegree=1,areoflittle predictivevalue[12]. 49

PAGE 56

Consideragainthecoordinatefunctionevaluatedoverthecollectionof histories: f l 1 l 2 l n k l 1 ;k l 2 ;:::;k l n = X k v 2 H r k r Y v 2 V T = f r g M v v k v ;k v Thisisarealfunctionoverthe n dimensionalspaceofthevaluesateach leaf.RecallthedistributionfunctionEquation,whoseimageisthearray p AA AA ;p AA AC ;:::;p TT TT .Wedeneabijectivefunctionfrom im ~ F to thespacetothe n -tensorproductof n rank-4arrays[1]. E l 1 l 2 l n l 1 ;l 2 ;:::;l n : p AA AA ;p AA AC ;:::;p TT TT 7! 0 B B B B B B B @ p AA AA p AA AC p AA AG p AA AT 1 C C C C C C C A 0 B B B B B B B @ p AA CA p AA CC p AA CG p AA CT 1 C C C C C C C A 0 B B B B B B B @ p TT TA p TT TC p TT TG p TT TT 1 C C C C C C C A Ifwelabelasingleleaf l j by k j 2f A;C;G;T g E mapstothe n )]TJ/F15 11.9552 Tf 12.073 0 Td [(1 tensor: E l 1 l 2 l n l 1 ;l 2 ;:::;l j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ;k j ;l j +1 ;:::;l n : p AA AA ;p AA AC ;:::;p TT TT 7! 0 B B B B B B B @ p AA k j AA p AA k j AC p AA k j AG p AA k j AT 1 C C C C C C C A 0 B B B B B B B @ p AA k j CA p AA k j CC p AA k j CG p AA k j CT 1 C C C C C C C A 0 B B B B B B B @ p TT k j TA p TT k j TC p TT k j TG p TT k j TT 1 C C C C C C C A Achoiceoflabelings k j 1 ;k j 2 ;:::;k j h 2f A;C;G;T g h foranychoiceof h 50

PAGE 57

leavesreturnsan n )]TJ/F17 11.9552 Tf 12.322 0 Td [(h -tensor.Whenallleavesareattributedalabeling k 1 ;k 2 ;:::;k n ,thetensorreturnstheassociatedindeterminateexpectation E l 1 l 2 l n k 1 ;k 2 ;:::;k n = p k 1 k 2 k n : Lettheleaf-labelat l j inducethemarginaldistributionover l 1 ;:::;l j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ;l j +1 ;:::;l n be: E l 1 l 2 l n k 1 ;k 2 ;:::;k j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ; ;k j +1 ;:::;k n = X k j 2f A;C;G;T g p k 1 k 2 k j )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 k j k j +1 k n Thenforalabeling k j 1 ;:::;k j n )]TJ/F32 5.9776 Tf 5.756 0 Td [(2 : k j 2 [ A;C;G;T; ] onachoiceof n )]TJ/F15 11.9552 Tf 12.442 0 Td [(2leavesreturnsa4 4matrixwithentriesin R [ p A A ;p A C ; ;p T T ]. Bymanipulatingthesematricesweproducecertainequalitiesandinturn, phylogeneticinvariants.Fornotationalpurposes,wedenethedistribution X by X k 1 k 1 k n := E l 1 l 2 l n k 1 ;k 2 ;:::;k n ;k i 2f l i ;A;C;G;T; g 7.1VertexFlattenings Werefertoasampleconstructionofinvariantsona4-taxaphylogenetic tree,thengeneralizetothosetreeswith 4taxa.Inthiscase,werefer totheevaluationofthetensoronthelabels k 1 ;k 2 ;k 3 ;k 4 2f A;C;G;T g 4 WerefertothetreeinFigure.1denedbythedirectededges E T = [ l 1 ;v 5 ; v 5 ;l 2 ; v 5 ;v 6 ; v 6 ;l 3 ; v 6 ;l 4 ].Wechoose l 1 asthe"root"',thatis,we directallMarkovmatricesasactingawayfrom l 1 ,althoughwewillevaluate l 1 asataxon. 51

PAGE 58

Figure11:Unrooted4-taxonphylogenetictree. X k 1 k 2 k 3 k 4 = l 1 k 1 X k 5 2f A;C;G;T g M l 1 v 5 k 1 ;k 5 M v 5 l 2 k 5 ;k 2 X k 6 2f A;C;G;T g M v 5 v 6 k 5 ;k 6 M v 6 l 3 k 6 k 3 M v 6 l 4 k 6 ;k 4 D l 1 = 0 B B B B B B B @ l 1 A 000 0 l 1 C 00 00 l 1 G 0 000 l 1 T 1 C C C C C C C A Denethediagonalmatrix P v 5 ;l 3 l 4 ; ij whichassociatesgivenleaves l 3 ;l 4 to values i;j ,respectivelybytheentryvalues 52

PAGE 59

Fork 2f A;C;G;T g ; P v 5 ;l 3 l 4 ; ij k;k = 4 X m =1 M v 5 v 6 k;m M v 6 l 3 m;i M v 6 l 4 m;j Fork 6 = l; P v 5 ;l 3 l 4 ; ij k;l =0 Denefor i;j;k;l 2f A;C;G;T g X l 1 l 2 ij = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; ij M v 5 l 2 X l 1 l 2 kl = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 X l 1 l 2 = D l 1 M l 1 v 5 M v 5 l 2 Thefollowingisaresultoflinearalgebgra: X l 1 l 2 ij X )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 l 2 X l 1 l 2 kl = = D l 1 M l 1 e P e;l 2 l 3 ; ij M v 5 l 2 M )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 v 5 l 2 M )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 v 5 D )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; ij P v 5 ;l 3 l 4 ; kl M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl P v 5 ;l 3 l 4 ; ij M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 M )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 v 5 l 2 M )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 v 5 D )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 D l 1 M l 1 e P e;l 2 l 3 ; ij M v 5 l 2 = X l 1 l 2 kl X )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 l 2 X l 1 l 2 ij Wemakethesubstitution X )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 l 2 = 1 det X l 1 l 2 Cof X l 1 l 2 T 53

PAGE 60

X l 1 l 2 ij Cof X l 1 l 2 T X l 1 l 2 kl = X l 1 l 2 kl Cof X l 1 l 2 T X l 1 l 2 ij X l 1 l 2 ij Cof X l 1 l 2 T X l 1 l 2 kl )]TJ/F17 11.9552 Tf 19.261 0 Td [(X l 1 l 2 kl Cof X l 1 l 2 T X l 1 l 2 ij =0 ThelefthandsideofEquationhas16entriesin R [ p AA A ;p AA C ; ;p TT T ],allofwhichmustbecongruenttozero.Hence wehave16algebraicinvariantsofdegree5. Next,wegeneralizethisconstructiontotreeswith 4taxa,withan emphasisastohowtheseareattenings.Chooseaninternalvertexand removethisvertexfromthetree.Thevalencyofinternalverticesareexactly 3,hencetheremovalreturns3disjointconnectedsubtrees.Choosetwo subtrees,andfromeachtreechooseexactlyoneleaf,denoted a 1 and a 2 : Then fromtheremainingtree,ordertheleaves a 3 ;:::;a m .Thenforanylabelings i 3 ;:::;i m ; j 3 ;:::;j m 2f A;C;G;T g m )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 X l 1 l 2 i 3 :::i m Cof X l 1 l 2 T X l 1 l 2 j 3 :::j m ::: = X l 1 l 2 j 3 :::j m ::: Cof X l 1 l 2 T X l 1 l 2 i 3 :::i m Thisisoneofahandfulofconstructionsin[1].Notethattheconstruction isdeterminedbythelocaltopologyaroundachoseninteriorvertex. 7.2EdgeFlattenings Denition Denean edgeattening ofanedge e onthetensor X l 1 l 2 l n asdenedabovebyremoving e fromtheedgesetof E T andreturning 54

PAGE 61

thetwoconnected,componentsubtrees T 1 and T 2 .Letthe split inducedby theatteningoforderedleavesbethepairofleafsets L T V T 1 and L T V T 2 .Reordertheleavessuchthat: f l 1 ;:::;l m g = L T V T 1 f l m +1 ;:::;l n g = L T V T 2 The atteningof X on e isthe4 m 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m matrix F = Flat e X denedforallmatrixpositions i 2f 1 ;:::; 4 m g ;j 2f 1 ;:::; 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m g ,asfollows: Ordertheelementsin J 1 = f A;C;G;T g m ;J 2 = f A;C;G;T g n )]TJ/F34 7.9701 Tf 6.586 0 Td [(m suchthat u 1 u 2 u m = J 1 [ i ] ;v 1 v 2 v n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m = J 2 [ j ]andset: F i;j = X u 1 ;:::;u m ;v 1 ;:::;v n )]TJ/F35 5.9776 Tf 5.756 0 Td [(m = p u 1 ;:::;u m ;v 1 ;:::;v n )]TJ/F35 5.9776 Tf 5.756 0 Td [(m : Thisconstructionreturnsanarrayofallindeterminateexpectations. Althoughwehavemadeitexplicitforlabelingsinthesetofnucleotides, theconstructionofatteningsareeasilygeneralizedtoanynumberofnite states.Weprovideanexampleinthe2-statecase. Example Considerthe5-taxonunrootedtreeasinFigurewithleaflabels in f 0 ; 1 g .Fromlookingatthepicture,weseetwonaturaledgeattenings onthe2 2 2 2 2produced.Thesplitsinducedbytheinternaledge atteningare ff l 1 ;l 2 g ; f l 3 ;l 4 ;l 5 gg and ff l 1 ;l 2 ;l 3 g ; f l 4 ;l 5 gg ,withmatrices 55

PAGE 62

Figure12:A5-leafunrootedbinarytree 0 B B B B B B B @ p 00000 p 00001 p 00010 p 00011 p 00100 p 00101 p 00110 p 00111 p 01000 p 01001 p 01010 p 01011 p 01100 p 01101 p 01110 p 01111 p 10000 p 10001 p 10010 p 10011 p 10100 p 10101 p 10110 p 10111 p 11000 p 11001 p 11010 p 11011 p 11100 p 11101 p 11110 p 11111 1 C C C C C C C A 0 B B B B B B B B B B B B B B B B B B B B @ p 00000 p 00001 p 00010 p 00011 p 00100 p 00101 p 00110 p 00111 p 01000 p 01001 p 01010 p 01011 p 01100 p 01101 p 01110 p 01111 p 10000 p 10001 p 10010 p 10011 p 10100 p 10101 p 10110 p 10111 p 11000 p 11001 p 11010 p 11011 p 11100 p 11101 p 11110 p 11111 1 C C C C C C C C C C C C C C C C C C C C A DevelopinganedgeatteningbyperturbingtheindeterminatetensornaturallyleadsustoquestionexactlywhattheMarkovmodelforthisconstruction 56

PAGE 63

is.Letthegraphicalmodelinitiatefromanode r andbifurcateimmediately intotwoedgeswhoseterminalnodesrepresentstatesin f A;C;G;T g n and f A;C;G;T g n )]TJ/F34 7.9701 Tf 6.586 0 Td [(m .ThentheMarkovmatricesassociatedtotheseedgesare M 1 and M 2 withsizes4 4 m and4 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m ,andtheMarkovmatrixproduct thatdenes Flat e X is Flat e X = M T 1 D r M 2 Where D r isthediagonalizationoftherootdistributionvector.Since theranksof M 1 and M 2 areatmost4,matrix Flat e X musthaveat mostrank4.Thisleadsustothenubofthisconstruction.Since Flat e X hasrank 4,thenevery5 5minorof Flat e X musthavedeterminant zero.Thedeterminantisadegree-5polynomialintheringofindeterminants. Hence,bytakingthedeterminantofany5 5minorof Flat e X we computephylogeneticinvariants,implyingthattheseinvariantswhenevaluatedunderdatawilltellusinformationaboutwhetherthesplittingofthe leavesinthiswayisvalidforthebestttopologyofthephylogenetictree[2]. Remark PleasenoteagainthatthisconstructionofMarkovmatricescanbe generalizedtoanyMarkovtreemodelwithanynitenumberofstatesat theleaves.Thenforleaveswith numberofstates,wemustndthe determinantsofthe +1 +1minorsofanedgeatteningthatwill returna +1-degreealgebraicinvariant. 7.3SuciencyofFlattenings Thelastthingtodiscussisexactlyhowdenseconstructionsofinvariants byvertexandedgeatteningsareintheidealofinvariantsforthegeneral 57

PAGE 64

Figure13:Visualrepresentationofanedgeatteningalongedge v 9 ;v 10 Markovmodel.Withoutgoingtoodeepintothealgebra,wementionafew keyfacts. Conjecture7.1 [2]Foranynumberofstates andandnumberoftaxa n thephylogeneticideal I T forthegeneralMarkovmodelonabinary n -taxon tree T isthesumoftheidealsassociatedtotheatteningsof X l 1 l 2 l n at theverticesof T Thatedgeatteningsareactuallyaspecialcaseofvertexatteningsis madeclearerin[2],andanimportantcorollarythebinarystatecaseis deduced. Corollary7.2 Theideal I T ofallphylogeneticinvariantsforthephylogenetic treemodelwithbinarystatesisgeneratedbythe 3 3 minorsforalledge atteningsofthemodel'sassociated n -tensor. 58

PAGE 65

Remark Thenotionofvertexatteningsismoreformal,andrequiresthe useofthesecantvarietiesin[19].In[2]itismaderelativelyclearthat theidealofinvariantsonthegeneralMarkovphylogenetictreemodelis foundbyconstructingallinvariantsofvertexandedgeattenings,muchas in[27].However,aconstructionofageneratingsetofvertexatteningsis notdeduced,andwhileagreatmanyofinvariantscanbeproduced,nding ageneratingsetfor I T isstillanopenproblem.Pleasereferto[19]and [2]foracleardescriptionofhowweknowthatvertexandedgeattenings generatetheideal I T Whilethesuciencyofatteningsisnotcompletelyunderstood,wedo havethemethodstoproduceagreatnumberofthem,wehaverelative condenceintheirstatisticalvalue,andinpracticalcasesitshardtobelieve thatwe'llrequirethegeneratingsetof I T todevelopusefulstatistics.Recall thattheemphasisofourinvariantshasbeentoobtain local information. Thisnotionwillbecomecleareraswediscussthespaceoftrees. 8SpaceofTrees Eachconstructionofinvariantsthatwe'vediscussedarederivedfromasubset ofthetopologicalpropertiesofthephylogenetictreeassociatedtoit.These invariantswhenevaluatedundertheobserveddatacanbeusedtoconstruct statisticsofeachtree,andthebestttreeshouldhavetheoptimumstatistic, thisisusuallytheminimumstatisticwhenthestatisticisametriclike L 1 overallinvariants.Thenthenextproblemincomparingtreesishowto mostecientlycomparethem,soastoreducethetotalnumberofinvariants computedforanysetofobserveddata.Forsomenaivetrials,suchasthose abovein[4],invariantsarecomputedforalltrees,theinvariantsevaluated 59

PAGE 66

Figure14:Thespaceofall4-taxonunrootedbinarytrees.Alltreesinspace collapsetheiredgeinto T 0 andextendtobecomeanothertree overtheobserveddata,thenastatisticiscomputedfortheseinvariantsto comparealltreestoeachother.Wemayalsocomparesubsetsoftrees basedontheirlocalfeaturesandchoosetheoptimaltree.Thatis,there isasystematicwaytoorganizetreesbyadistanceawayfromthesolutiontree. Werefertothe spaceoftrees T n asthesetofallundirectedbinary n leaftrees.Itshouldbeagainnotedthatthereexistsanaturualbijection fromundirectedbinary n leaftreestothedirectedbinary n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1leaftrees. Denition Denetheactionof collapsing anedgebydeletinganinternal edgeanddeningitstwoadjacentverticesasonevertex.Denetheaction of extending anedgeonavertexofvalence4bypairingtheadjacentfour edgesintotwoconnectedsubgraphs,thenjoiningthesubgraphswithanedge thatisadjacenttoallfourinitialedges.Denea rotation bytheactionof successivelycollapsingandextendinganinternaledge.Denethe rotational 60

PAGE 67

distance betweenanytwotreesastheminimalnumberofrotationsnecessary totransformone n -leafunrootedbinarytreetoanother. Properties Thereareexactly )]TJ/F31 7.9701 Tf 9.968 -4.977 Td [(4! 2!2! = 2=3uniquerotationsforanygiven internaledgeinabinarytree,thenumberofunorderedchoicesofasetoftwo inasetoffour.Foranundirectedphylogenetictreewith n 3taxa,there areexactly n )]TJ/F15 11.9552 Tf 12.118 0 Td [(3internaledges,hence3 n )]TJ/F15 11.9552 Tf 12.118 0 Td [(9numberofrotationspossible. Since1outof3rotationsisanidentitymap,thereare2rotationsforeach edgethatreturnanewtree,hencethereare2 n )]TJ/F15 11.9552 Tf 11.93 0 Td [(6numberofuniquetrees neighboringanytree. Remark Wenowhavearelationbetweentrees.Wesaythattwophylogenetic treesare neighbors ifthereexistsasinglerotationthatmapsonetreeto another.Itisclearthatthismapisinvertiblebyasinglerotation.Here,the ideaof localtopology becomesclear.Thelocaltopologyisasubsetofvertices andedgesthatdenehowconnectedcomponentsubtreesofaphylogenetic treearerelatedtoeachother,andwhatrotationsmaptopologiestoeach other. Givensetsofmapsthatrelatetreestoeachother,wecandenea graphicalspacewithuniquetreetopologiesasnodesconnectedbyedgesifthe associatedtreesarerelatedtoeachotherbyarotation.Denotethisspace by T n .Thisspacehasanumberofniceproperties. Wedeneamapfromtheinternaledgesofasingleundirectedtree IE T totherealspace R n )]TJ/F31 7.9701 Tf 6.587 0 Td [(3 bythefollowingmap. D : IE T R n )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 ; v i 1 ;v t 1 ;ie 2 ;:::;ie n )]TJ/F31 7.9701 Tf 6.587 0 Td [(3 7! d ie 1 ;d ie 1 ;:::;d ie n )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 ; 61

PAGE 68

where d ie = d v ;v isarealmetricwiththeproperty: d v 1 ;v 3 = d v 1 ;v 2 + d v 2 ;v 3 Theimageofthismapisanonnegativeconvexpolyhedraofdimension n )]TJ/F15 11.9552 Tf 12.092 0 Td [(3,andwedescribetheimageforatree'sinternaledgesisthetree's orthant .Weconstructspacesbythismapforeachtreein T n .Each n )]TJ/F15 11.9552 Tf 9.946 0 Td [(4-dim. faceofthetreeorthantcorrespondstoabinarytreewithasinglecollapsed edge.Theedgewhichismeasuredalongtheaxisat0.Thereareexactly threetreesthatcollapseaninterioredgetothisnon-binarytree,hencethe orthantsofthethreetreessharethisface.Wethenallowthefacesofthe orthantsforallpossible n )]TJ/F15 11.9552 Tf 11.429 0 Td [(5!!undirectedbinarytreestobeequivalentin T n iftheirrespectivefacesarecorrespondingtothesametree,deningan equivalencerelationbetweenthefacesoftreeorthants. Denition Inthespaceoftreesasaconvexpolyhedra,theunionofall pointswherethecoordinatesummetric L 1 isexactly1isdenedasthe linkoftheorigin .Thelinkoftheorigininanyorthantinturnformsa simplex,andthelinkoftheoriginisasimplicialcomplex. Example n=4:Thereexist3binaryrootedtreeswith1interioredge.Eachtree determineseach1-dimensionalorthantconvergingtoa0-dimensionalface attheorigin.Thelinkoftheoriginisexactlythethreepointsalong eachraywithlength1fromfromtheorigin. n=5:Thereare15binaryrootedtrees,eachwitha2-dimensionalorthant sharingone-dimensionalfaces.Thelinkoftheoriginhastheinteresting 62

PAGE 69

Figure15:Thelinkoftheoriginof T 4 ,representedbythe3bolddots.The labelscorrespondtothetreesinFigure propertyofbeingequivalenttothePetersengraph,showninFigure Thefollowingdescribesthesizeof T n Theorem8.1 Themaximumrotationaldistancebetweenanytwotreesinthe spaceofundirected, n -leaftreesis n log n ,ortheupperandlowerbounds areconstantsmultipliedby n log n Proof In[9]it'sshownthattheupperboundforrotationaldistance O n log n In[29]itisshownfor n log n Denition Anotnecessarilyuniquepathofminimumrotationaldistance in T n betweenanytwotreesisreferredtoasa geodesic Withanunderstandingofthegeometryonthespaceofphylogenetictrees, andinparticularthemaximumgeodesiclengthin T n ,wecancompletethis 63

PAGE 70

Figure16:Thelinkoftheoriginof T 5 ,shownhereinmostfamiliarformof thePetersengraph.[22] briefintroductionandmoveforwardwiththeinterestingexperiment.However, in[9],[3],[28],and[29],thereisextensivediscussionastothecombinatorial natureofbinarytrees,andthereisfurtherpropertiesofinterest. PartII Method,Data,andResults 9Introduction Thenextsectiontakesintoaccountthepreviousthreesections.Wehave obtainedwaystocomputeinvariantsthatareproducednaturallyfromlocal topologiesonthephylogenetictreewhichtheydescribe.Wehavediscussed thespaceofphylogenetictrees,andtheenlighteningfactthatthemaximum 64

PAGE 71

pathbetweenanytwotreesinthespaceoftreesislog-lineartothenumber ofleavesinthetree.Aninterestingquestionposediswhetherornotthe "greedyalgorithm",asrespecttoachosenmetric,couldnotreturntous thebestttreetopologystartingfromarandomintialpointinthetree. Wewillattempttoanswerthisbyproposingthatametricderivedfrom invariantsofatreeisagoodcandidateforourgreedyalgorithmonthe spaceoftrees. 10Hypothesis Bynow,threethingsshouldbeclear. 1.Thatthecomputationandevaluationofinvariantscantellusinformation aboutthetopologyofaphylogenetictree. 2.Wecancomparephylogenetictreesthataretopologicallysimilarby asubsetoflocaltopologicalproperties,denedhereasadierenceof rotationsbetweenanytwotrees. 3.Ifweareallowedawayofdeterminingwhichtreefromanysubsetof topologicallysimilartreesis"closer"tothebestttopologyinthe spaceoftreesconstructedabove,thenthepathfromanytreetothe bestttopologybychoosingateachsteptheclosesttreeismaximally log-linear O nlogn Thenifwecanconstructanalgorithmthatchoosesapathbyfollowing thegeodesicbetweenanytwotreesbasedonsequencealignments,thiswould improvethecomputationaleciencyforcomputingthebesttphylogenetic tree.Thenaturalalgorithmtorsttestisagreedyalgorithm.Wewouldlike toseeifagreedyalgorithmonthespaceoftreesscoredusingachosen 65

PAGE 72

statisticparameterizedovertheinvariantswhenevaluatedunderobserveddata terminatesatthebesttphylogenetictree.Furthermore,ifthestatistic strictlydecreaseswhenmovingbetweennearestneighbortreestowardsthe besttphylogenetictree,thiswouldmakethegreedyalgorithmverystrong. Howwellthestatisticusedintheexperimentperformsdependsonits validityoverthenumberofinvariantscomputedandthevalidityofthe statisticovertheobservedalignmentdata.Weuse validity asaweakerform ofconsistency,inthataparameterchangemayonlyimprovethecondence ofastatistic,butdoesnotimplytheprobabilitythatthestatisticwill produceanincorrectresultconvergestozero. Thevalidityandcomputationaleciencyofthethegreedyalgorithmdependonboththevalidityofthestatisticoveritsrespectiveparametersand thegeometryofthespaceoftrees.Weperformanumberofexperiments eachwith105trialsofthegreedyalgorithmforeachtreeinthespace, measuringthenumberoftrialsthatpredictthebestttreeoverthenumberofinvariantsevaluatedforthestatisticoneachtree.Thisistotest thestatisticalvalidityofourgreedyalgorithmoverthenumberofinvariants tocomputeforthestatisticofeachtreeandoverthepositionoftheinitialtree. IfthelengthoftheDNAsequenceisallowedtogrowlarge,andthe sequenceisgeneratedbytheprocessofpropagatingMarkovmatricesalonga treenetwork,theneachinvariantwillconvergetozero,sothateachinvariant ofthebestttreeisstatisticallyconsistenthenceanymetriconthem.I donottestagainstthis,asmanyapplicationsofphylogeneticsrestrictthe lengthofthealignment,henceitisinterestingtotestthealgorithmona 66

PAGE 73

relativelysmallalignmentlength. 11Method IusedthesoftwareSeq-Gen[26]forrandomlygeneratingsequencesbasedon thegeneraltimereversiblemodel.Ichoseasinglelabeledtreetopologyx asourbestttopology,thenIinputtedintoSeq-Genthebestttopology andalistofparametersstatedbelow: NumberofSequencesTaxa:6 Lengthofsequences:10000 Branchlengths:[Homogeneoust=.1,Homogeneoust=.5] Ratematrix:GTR.1,.1,.1,.1,.1,.1 Thus,wehaveexactlytwosetsofsequencedatadierentiatedbythe lengthofthetreebranchestheydescribe.Sincethebranchesareofhomogeneouslength,andtheratematrixisoftheJukes-Cantorclass,boththe group-basedandgeneralMarkovinvariantsshouldbestatisticallyconsistent onthedata. Thisprojectrequiredthenimblemovementbetweenstrings,algebra,and graphs.EachofwhicharehandledadmirablybyPython2.6andthemodules numpy [24]and networkx [17]. Weproduceatotalof8experimentsforeachchoiceofparameter,which areasfollows.First,thechoiceofbranchlengths,whichwereproducedfrom theSeq-Gendatat=.1,t=.5.Second,themethodbywhichtheinvariants 67

PAGE 74

areconstructed;wehavechosenheretocompareexactlytwomethods:general Markovmodeledge-atteninginvariantsandgroup-basedmodeledge-attening invariants.Third,wehavechosentoperformtrialsacrossthenumberof invariantstoproduce:100,300,500,700,900,1100,1300. Letthenumberofinterioredgesbe j IE T j ,andthetotalnumberof experimentsbe H .Weperformthefollowingalgorithmforeachtrialonthe spaceoftrees. 1.Chooseatrandomexactly H= j IE T j invariantsfromanedgeattening foreachinterioredge.Thisweighstheinvariantsobtainedforalledges equally. 2.Evaluateeachinvariantbysubstitutingtheobservedalignedbasedata intoeachindeterminate. 3.Sumtheabsolutevaluesofallevaluatedinvariantsobtainedthisway. Labelthisnumberforeachtreeits"invariantstatistic". 4.Dothisforalltreeswithinthespaceoftrees. 5.Foreachtree,startbylabelingthetreechosenthe'initial'and'current' tree.Evaluatetheinvariantstatisticofthetreeandallneighboring trees. 6.Choosethetreewiththelowerstscoringinvariantstatisticthe'current' tree. 7.Performthisuntilthesamecurrenttreeischosentwice,thisisthe 'terminal'tree. 8.Ifthe'terminal'treeisthebestttree,addtheinitialtreetothe "success"binoftreesforthespaceoftrees. 68

PAGE 75

9.Forthespaceoftreesineachexperiment,ndthemeannumberof "successes",whichisthesizeofthe"success"bindividedby n )]TJ/F15 11.9552 Tf 11.374 0 Td [(5!!, thenumberoftreesinthespace. Ifwehavebeenpayingattention,we'llrecallthatageodesicfromany treetothebesttphylogenetictreeisatmost O nlogn ,fornlarge, andthatthenumberofneighboringundirectedtreestoanygiventreeis 2 n )]TJ/F15 11.9552 Tf 12.221 0 Td [(3.Henceifthealgorithmaboveisoptimalandwechoosethetree closesttothebestttreeateachturn,thenumberoftreestocalculate theirinvariantstatisticis O n )]TJ/F15 11.9552 Tf 11.58 0 Td [(3 nlogn = O n 2 logn ,fornlargeenough. Thisissignicantlybetterthanevaluating n )]TJ/F15 11.9552 Tf 11.955 0 Td [(5!!trees. Remark Themetricdescribedinisthe L 1 metric.Thisisnotthe optimalchoice,butitshouldprovideuswitharoughestimateastothe accuracyofthetreeitdescribes,andisutilizedeectivelyin[4].Itshould alsobenotedthatatreemusthaveaninvariantstatisticlocalminimumin thespaceoftreesforittobeaterminatingtree.Thus,itisessentialthat thebestttreemustbealocalminimumforthedatanottoskewwildly. Wewillseethatthisdoeshappen,butdoesnotsignicantlyaectoverall results. 12Results Above,theresultsinFigureareresultsforallgroup-basedandgeneral Markovtrialsgiventhechoiceofhomogeneousbranchlengthswitht=.1. Thosetrialsmarkedbetweenanytwonumbers n and n +200arechosen tohaveevaluated n statistics.Thatis,thosetrialsbetween100and300 onthebottomaxiswerechosentohave100invariants.Inredarethose experimentsperformedforthegroup-basedattenings,anditisclearthat 69

PAGE 76

Figure17:Resultsforinvariantstatisicheuristiconspaceoftreeswithhomogeneous branchlengtht=.1.Group-based:+,General-Markov:x 70

PAGE 77

Figure18:Resultsforinvariantmeasureheuristiconspaceoftreeswithhomogeneous branchlengtht=.5,Group-based:+,General-Markov:x forthissetofdata,clearlywinsoutagainstthegeneralMarkovmodel attenings,whichareinbluealthoughforboththeimprovementsfromthe choiceof100invariantsinanexperimentto1300invariantsisnoticeable. Thatthegroup-basedmodelsachievesuchaccuracyonouralgorithmwith suchalownumberofinvariantsistheremarkablefacetofourcomputational results. Figureisthecollectionofresultsforallgroup-basedandgeneral Markovtrialsgiventhechoiceofhomogeneousbranchlengthswitht=.5.The treebranchesforthesetrialsareconsiderablelargerthantherstexperiment, andthegraphshowsaconsiderablechangeinthedata.Nowitisthe generalMarkovmodelinvariantsthatperformhighly,althoughtherearesome experiments,showninthebluex'sonthebottom,wherethebestttree 71

PAGE 78

Figure19:Asampleofallinvariantstatisticsforalltreesonaspacewithhomogeneous branchlengtht=.1;1300group-basedattenings wasnotalocalminimuminthetreespace,leadingtoacompletefailure inthetrial.Still,itishighlynotablehowthegroup-basedmodelshave deterioratedinaccuracydepsitewhatshouldbegreaterstatisticalevidenceas totherelationshipbetweenthetrees. Withtheconsistencyofthemethodofusinginvariantsasastatistic alongthespaceoftreesupheldbythesefewexperiments,Inowshowthe invariantstatisticsforalltreesontwosingletrialsfromthebestcasesabove. Figureshowsall105invariantstatisticsfromasingletrialof1300 group-basededgeatteningsforatreewithhomogeneousbranchlength t = : 1, whichisthebestcasefromthetwosetsofdataforgroup-basededge 72

PAGE 79

attenings,showninFigure.Theaxesareabitfacetious,andreferto thenumberofthetreesinthepathinsteadofrotations.Notethatthe invariantstatisticscoreisnotanerror,oratleastnotagraphingerror.If thetrialswereperformedagain,itwouldbeprudenttoaveragethisoutto somethingthesameorderasthegeneralMarkovinvariantstatisticsbelow. Thenumberofrotationsinthepathissimplyonelessthanthenumberof trees.Noticehowthereisonesinglepointforapathlengthof1,whichis thestatisticforthebestttreetopology.Thestatisticsforalltheother treesfollowaniceprogression,anditiseasytoseehowagreedyalgorithm onthisdatacould,atleastmostofthetime,followthegeodesic.There areafewdatapointsthatpointtoapathof6trees,or5rotations. Thisisactuallyamisdirectionfromthegeodesic,asthemaximumgeodesic is5rotations.Thereisalsoasubsetofpointsforthe4-treepathtrials thatareaslowasmost3-treepathtrials.Thisbehaviorisnotoptimal, butotherwiseinteresting.Whatisencouragingisthesteepdropobetween statisticsfortrialsalonga3-treepath,2-treepath,andthebestttree. Thistellsusthattheinvariantstatistics,atleastforthiscase,aregoodat determiningwhenthetreesareclosetothebestttree. Figureshowsall105invariantstatisticsfromanexperimentof1300 generalMarkovedgeatteningsforatreewithhomogeneousbranchlength t = : 5,whichisthebestcasefromthetwoexperimentsforgeneralMarkov edgeattenings,showninFigure2.Onceagainweobservealovely progression.ComparedtoFigure,thistrialdidnotndextralongpaths, sowemaynaivelyassumethatmosttreesfollowedthegeodesicorapath closetoone.Also,thedataforthesetsoftrialsforeachpathlengthare well-grouped,especiallycomparedtoFigure,althoughthedropofrom the3-path,2-path,andbestttreetopologiesareaboutthesame.That 73

PAGE 80

Figure20:Asampleoftheinvariantstatisticsforalltreesonasinglespacewith homogeneousbranchlengtht=.5;1300generalMarkovattenings 74

PAGE 81

twodierentmodelsshouldshowthispatternofprogressionimpliesastrong relationshipbetweenusinginvariantstatisticsandgreedysearchesonthespace oftrees. 13Discussion Ratherthancomparingallpossibletreetopologiesatriskofcomputational ineciency,wecanorganizethetreesbyarelationintoagraphicalspacewith advantageousgeometry.Thenifwecanproduceametricoverthealgebraic invariantsofthebesttphylogenetictreethatisminimizedconsistently, andstrictlyincreasesasthedistanceawayfromthebestphylogeneticas thedistanceawayfromthebesttphylogenetictreeincreases,wehavean ecientandgeometricallyinterestingmethodtoproducethebestphylogenetic tree. Fromthesmallamountoftrialsperformed,itisclearthatthepotential forthisapproachisnotunfounded,andwhollyinteresting.Ifweareallowed optimalresultsforeachtrial,thenwewillndstatisticsonnomorethan O n 2 logn trees,afarsmallernumberthan n )]TJ/F15 11.9552 Tf 12.988 0 Td [(5!!.Thereareother methodsthathavebeenproposedtoreducethenumberoftreestocompute usinginvariantstatistics.In[11],notably,isanapproachthatoptimallynds aninvariantstatisticfor O n 2 .Thisapproachndsthesingularvalue decompositionofedge-attenedmatrices,whichwhileanaturalmetriconall edgeattenings,doesnotallowanaturalextensiontomeasuringoverother attenings.Whilethisiscertainlynotprohibitivetoprediction,theapproach aboveallowsforachoiceofanytypeofinvariant,includingvertexattenings, althoughsadlytherewasnotenoughtimetoperformtrialsusingvertex attenings.UntilageneratingsetforthegeneralMarkovinvariantsarefound, 75

PAGE 82

weareonlyallowedsubsetsofgeneratingsets,andhowwechoosethese subsetsisapreferenceofwhichtopologicalpropertiesareofinteresti.e.in [1]theydescribeinvariantsthatpredictthe4-pointcondition. 14Conclusion Invariantshavebeenfoundtodiscusslocaltopologicalfeaturesofphylogenetic trees,andphylogenetictreeswhenrelatedtoeachotherbylocaltopological featuresconstructacombinatoriallypleasingspacetostudy.Thoughthedata sofarisminimal,thegeneralmethodofusingthespaceoftreestoreduce thecomputationofthenumberoftreeshassofarbeensupported,andfurther simulationsshouldbeconductedtorevealitspotenetial.Theadvantageof algebraicinvariantsisthattheyareofxedpolynomialdegree,andthatthey maybecomputedbeforeanydataiscollected,thenusedasmanytimes asdesired,reducingthetotalcomputationalcomplexity.Bythismethod, thecomplexityoftheconstructionofthebesttphylogenetictreeisthe complexityofthenumberoftreestoevaluate,andthismethodresolvesthat, atleastpartially. Forfurtherresearch,itwouldbeinterestingtodiscusswhatother propertiesthegeometryonthespaceoftreesholdsforthisgeneralmethod. Thegeometryofthespaceoftreesholdsarichstructure,andtheminimal useofthisstructurehasalreadyledtoagreatimprovementintheuse ofinvariantstatistics.Newexperimentscouldinvolvetheperturbationof parameters,changingbranchlengthsandratematricesinSeq-Gentoprovide newdata,aswellasactualbiologicaldata.Thereshouldandbytheauthor's willing,therewillbestudyintohowvertexatteningsfareunderthesetrials. 76

PAGE 83

Alsotoconsideristherecombinationofdierentinvariantmodelsintoone statisticalmeasure.Wealsocouldchooseinvariantsbasedonspeciclabelings, astherearesomealignedbasesinalignmentsthatoccurmorefrequently thanothers,andachoiceofinvariantsbasedonthealignedbaselabelings ofindeterminatescouldmanipulatethis.Thereshouldbethoughtintothe developmentofcomparingtreesbyinvariantstatisticsobtainedonlybythose atteningswhichdierentiatetwoneighboringtrees.Thisideaistrickierthan expected,asatreewillneighbortwotreesthatcannotbecomparedthis way.Still,evenanaivecomparisonlikethisshouldfarebetterthanthe L 1 Thesearetherstthingsthatcometomindwhenexploringthepotential ofthismethod,anditistheauthor'sopinionthattheseandotherfuture studieswillbeasenglighteningasthisone. 77

PAGE 84

References [1]Allman,E.andRhodes,J..PhylogeneticInvariantoftheGeneral MarkovModelofSequenceMutation. MathematicalBiosciences 186 113144. [2]Allman,E.andRhodes,J..PhylogeneticIdealsandVarietiesfor theGeneralMarkovModel. AdvancesinAppliedMathematics 40 127-148. [3]Billera,L.,Holmes,S.andVogtmann,K.001.GeometryoftheSpace ofPhylogeneticTrees. AdvancesinAppliedMathematics 27 733-767. [4]Casanellas,M.andFernandez-Sanchez,J..PerformanceofaNew InvariantsMethodonHomogeneousandNon-HomogeneousQuartetTrees. MolecularBiologyandEvolution 24 288-293. [5]Cavender,J.andFelsenstein,J..InvariantsofPhylogeniesina SimpleCasewithDiscreteStates. JournalofClassicaton 4 57-71. [6]Chor,B.,Hendy,M.,Holland,B.andPenny,D.000.MultipleMaxima ofLikelihoodinPhylogeneticTrees:AnAnalyticApproach. Molecular BiologyandEvolution 17 1529-2541. [7]Chor,B.andTuller,T..FindingaMaximumLikelihoodTreeis Hard. JournaloftheAssociationforComputingMachinery 53 722-744. [8]Cox,D.,Little,J.andO'Shea,D.. Ideals,Varieties,andAlgorithms 3rded.Springer,NewYork. [9]Culik,K.andWood,D..ANoteonSomeTreeSimilarityMeasures. InformationProcessingLetters 15 39-42. 78

PAGE 85

[10]Diaconis,P..GroupRepresentationsinProbabilityandStatistics. IMSLectureNotes-MonographSeries 11 InstituteofMathematicalStatistics, HaywardCa. [11]Eriksson,N..TreeConstructionusingSingularValueDecomposition. In:Pachter,L.andSturmfels,B. AlgebraicStatisticsforComputational Biology ,CambridgeUniversityPress,Cambridge. [12]Eriksson,N..UsingInvariantsforPhylogeneticTreeConstruction. EmergingApplicationsofAlgebraicGeometry 149 1-20. [13]Evans,S.andSpeed,T..InvariantsofSomeProbabilityModels UsedinPhylogeneticInference. TheAnnalsofStatistics 21 355-377. [14]Felsenstein,J.. InferringPhylogenies SinauerAssociates,Inc., Sunderland. [15]Ferretti,V.andSanko,D..PhylogeneticInvariantsforMore GeneralEvolutionaryModels. JournalofTheoreticalBiology 173 147-162. [16]Garcia,L.D.,Stillman,M.andSturmfels,B..AlgebraicGeometry ofBayesianNetworks. JournalofSymbolicComputation 39 331-355. [17]Hagberg,A.,Schult,D.andSwart,P.8.Exploringnetworkstructure, dynamics,andfunctionusingNetworkX.In:Varoquaux,G.,Vaught,T. andMillman,J. Proceedingsofthe7thPythoninScienceConference SciPy2008 Pasadena,11-15. [18]Huelsenbeck,J.,Ane,C.,Larget,B.andRonquist,F..ABayesian PerspectiveonaNon-parsimoniousParsimonyModel. SystemsBiology 57 406-419. 79

PAGE 86

[19]Landsberg,J.M.andManivel,L..OntheIdealsofSecantVarieties ofSegreVarieties. FoundationsofComputationalMathematics 4 397-422. [20]Lake,J.A..ARate-IndependentTechniqueforAnalysisofNucleic AcidSequences:EvolutionaryParsimony. MolecularBiologyandEvolution 4 167-191. [21]Mackey,G..HarmonicAnalysisastheExploitationofSymmetry -aHistoricalSurvey. BulletinoftheAmericanMathematicalSociety 3 543-698. [22]MainPage. WikimediaCommons ,Retrieved18:00,April16,2010 from http : ==commons:wikimedia:org=wiki=File : Petersengraphblue:svg [23]Jensen,F.andNielsen,T.. BayesianNetworksandDecisionGraphs 2nded.Springer,NewYork. [24]Oliphant,T..PythonforScienticComputing. ComputinginScience andEngineering9 10-20. [25]Pachter,L.andSturmfels,B..Statistics.In:Pachter,L.and Sturmfels,B. AlgebraicStatisticsforComputationalBiology ,Cambridge UniversityPress,Cambridge.3-42. [26]Rambaut,A.andGrassly,N.C.997.Seq-Gen:AnApplicationforthe MonteCarloSimulationofDNASequenceEvolutionAlongPhylogenetic Trees. ComputationalApplicationsofBiosciences 13 235-238. [27]Sturmfels,B.andSullivant,S..ToricIdealsofPhylogeneticInvariants. JournalofComputationalBiology 12 204-228. 80

PAGE 87

[28]Sleator,D.,Tarjan,E.andThurston,W..RotationDistance,Triangulations,andHyperbolicGeometry. JournaloftheAmericanMathematical Society 1 647-681. [29]Sleator,D.,Tarjan,E.andThurston,W..ShortEncodingsof EvolvingStructures. SIAMJournalonDiscreteMathematics 5 428-450. [30]Szekely,L.A.,Steel,M.A.andErdos,P.L..FourierCalculuson EvolutionaryTrees. AdvancesinAppliedMathematics 14 200-216. [31]Tavare,S..SomeProbabilisticandStatisticalProblemsinthe AnalysisofDNASequences. LecturesonMathematicsintheLifeSciences 17 57-86. 81


ERROR LOADING HTML FROM SOURCE (http://ncf.sobek.ufl.edu//design/skins/UFDC/html/footer_item.html)