|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Full Text | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PAGE 1 LOCALALGEBRAICINVARIANTSTATISTICSFORAHEURISTICTO COMPAREPHYLOGENETICTREES BY IANTAYLORHAYWOOD AThesis SubmittedtotheDivisionofNaturalSciences NewCollegeofFlorida inpartialfulllmentoftherequirementsforthedegree BachelorofArts UnderthesponsorshipofPatrickMcDonald Sarasota,Florida May,2010 PAGE 2 Acknowledgments I'mverypooratacknowledgingthosemostimportanttomeandmystyleand skillofthewrittenwordhasatrophiedtothepointwhereIcannotgivetheproper addressofloveandrespecttoeachsingleindividualwhohashelpedmethroughmy education.ThefollowingisapoorfacsimileofgratitudetothoseIcanrecallvividly. PatMcDonaldforhispresence,hisguidance,andhisboundlessenergywhichhave driventhisthesis. Everysingleotherprofessorwhohaveshapedmypath,includingtheentireMathematicsfaculty,especiallypreviousadvisorsLeonKaganovskiyandEiriniPoimenidou. AspecialguestthankyoutoChrisHart,whohasquicklyandbrilliantlyshouldered hisshareofmyeducation. KevinHarrodattheLovelaceRespiratoryResearchInstituteforemployingme theprevioussummer,settinginmotionthepositiveenergyIneededtocompletemy thesis. Myparents,whoneverturnedskepticalorcritical,butlovingandsupportingover time. ErinDyles,whoseinnitepatiencethroughoutthisentireprocesshasbeenan invaluablestabilizingforce.Yourpresenceliveswithinthesepages. MyfriendsthroughoutmyNewCollegeexperience,especiallythosewhohave remainedhereaslongasIhave.Weareallintoxicatedbyourmutualfriendshipand itmakesusstronger. Finally,toeveryoneelsewhokeepsmeintheirheartsandminds. PAGE 3 LOCALALGEBRAICINVARIANTSTATISTICSFORAHEURISTICTO COMPAREPHYLOGENETICTREES IanTaylorHaywood NewCollegeofFlorida,2010 ABSTRACT Acentralproblemofevolutionarybiologyinvolvesconstructingtheevolutionary pathwayofobservedspecies.Phylogenetictreesarenaturalstructuresusedtomodel suchpathways.Wedevelopamethodforcomparingphylogenetictreeswithanotion towardsconstructingabestttreeforagivendataset.Ourmethodinvolvesdening alocalgeometryonthespaceoftrees.Thisgeometryinvolvesacomparisonoflocal algebraicinvariants.Weusethisgeometrytoconstructanalgorithmforproducing besttphylogenetictrees.Numericalexperimentssuggestthisasaninteresting avenueofresearch. PatrickMcDonald DivisionofNaturalSciences PAGE 4 Contents ITheory1 1Introduction1 2Genetics4 2.1Preliminaries...............................4 2.2Group-basedModels...........................10 3MaximumLikelihoodEstimation12 4HiddenMarkovModel17 5AlgebraicPropertiesofMarkovModels23 6Group-basedModels29 6.1FourieronGroups.............................29 6.2ToricIdeals................................35 6.3Labelings.................................37 6.4Group-basedFlattenings.........................42 7GeneralStochasticModel49 7.1VertexFlattenings............................51 7.2EdgeFlattenings.............................54 7.3SuciencyofFlattenings.........................57 8SpaceofTrees59 IIMethod,Data,andResults64 iv PAGE 5 9Introduction64 10Hypothesis65 11Method67 12Results69 13Discussion75 14Conclusion76 ListofFigures 1Arootedevolutiontree,changesovertimeowdownwardawayfrom commonancestor,notdrawntoscale...................4 2Anunrootedevolutiontreee,onlytherelationsbetweentaxaareknown.5 3Asetof4alignmentswiththe2ndalignedbasehighlighted.7 4Aratematrixwiththeconditionofbeingsymmetric,theonlynecessaryconditionofGTR..........................9 5SimpleHMM...............................18 6Linegraphwiththreevertices......................24 7Arooted3-leaftree............................38 8Subtrees E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(fora6-leaftree T andedge v 8 ;v 9 ..43 9Asampleconsistentlabelingandedgepropagationon E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(..................................44 10Avertexatteningona6leaftree...................46 11Unrooted4-taxonphylogenetictree..................52 12A5-leafunrootedbinarytree.....................56 13Visualrepresentationofanedgeatteningalongedge v 9 ;v 10 58 v PAGE 6 14Thespaceofall4-taxonunrootedbinarytrees.Alltreesin spacecollapsetheiredgeinto T 0 andextendtobecomeanother tree.....................................60 15Thelinkoftheoriginof T 4 ,representedbythe3bolddots. ThelabelscorrespondtothetreesinFigure........63 16Thelinkoftheoriginof T 5 ,shownhereinmostfamiliarform ofthePetersengraph.[22]......................64 17Resultsforinvariantstatisicheuristiconspaceoftreeswithhomogeneousbranchlengtht=.1.Group-based:+,General-Markov:x...70 18Resultsforinvariantmeasureheuristiconspaceoftreeswithhomogeneousbranchlengtht=.5,Group-based:+,General-Markov:x...71 19Asampleofallinvariantstatisticsforalltreesonaspacewithhomogeneousbranchlengtht=.1;1300group-basedattenings......72 20Asampleoftheinvariantstatisticsforalltreesonasinglespacewith homogeneousbranchlengtht=.5;1300generalMarkovattenings.74 vi PAGE 7 PartI Theory 1Introduction Networktreesaremathematicalconstructsusedtovisuallyrepresenttherelationshipbetweenacollectionofobjects.Thesetreesaredenedusinga completesetofdistancesbetweenanytwoobjects.Ofimportancearebinary trees,directedorundirected,whichjointwopointsthatare"nearestneighbors"togethertoaninteriornode.Binarytreesarecapableofdetermining informationbetweenobjectsforavarietyofapplications.Asimpleexample isthevisualrepresentationofabinarysearchwithinasortedlist.The globaltopology ofatreeisdenedasthesetofallverticesandedgesof thetree.The localtopology ofatreeatagivenvertexoredgeisthesetof verticesandedgesinaconnectedcomponentsubtree.Subtreesobtainedby thedeletionofanedgeornode,producingsubtreesoftheoriginaltreethat areinandofthemselvesconnnectedbyalltheirrespectivenodesandedges. Binarytreeshavepropertiesverysimilartothevisualrepresentations ofthemacro-evolutionprocess.Coarsely,evolutioncanbedescribedfrom theintialpointofacommunityoforganismswithinaspeciesthatare somehowdividedtoallownaturalselectionpressurestocreatedierences betweenthetwocommunitiesandultimately,throughtheirinabilitytoshare geneticinformation,newspecies.Inthebinarytreewerepresenttheinitial communityasasinglenode,andthedivergencethroughnaturalselectionas abifurcationofasinglenodeconnectingtotwoothernodes.Abinarytree 1 PAGE 8 canrepresentanyseriesofdivergencesfromthethecommonancestorasa seriesofbifurcatingedgesfromthesinglenodecorrespondingtoanancestor, knownasthe root ,tothosenodesrepresentingthedescendants,knownas leaves Eachleafinthecontextofphylogenetictreesisreferredtoas taxon taxa thepluralarerepresentativeofanylevelofevolutionarydierence,from entirekingdomstosingleorganisms.Theconnectedhistoryofbifurcations relatingthesetofdescendantsistheevolutiontree,or phylogenetictree Untilrecently,therelationshipsbetweenorganismscouldonlybeattributed tophenotypiccharacteristicsthosecharacteristicswhicharededucedona largescalee.g.,hummingbirdbeaklengths,asopposedtothosecharacteristics atorbelowthescaleofthecell. ThetechnologicalinnovationsinDNAsequencinghaveledtonewmethods ofrelatingtaxabythedierencesintheirDNA.Thiscanbedoneby identifyingthosepartsofthegenome,or loci ,thatare orthologous ,orknown tobeevolvedfromacommonancestor.Incomparinglociwelookatthe DNAsequencethatcomprisesthesesharedsequenceothrologues. Each position inaDNAsequencecorrespondstoasingleletter.Wecreate anarrayofDNAsequencesfromthehomologousportionsofthegenome sharedbyeachtaxa;each i th columnreturningthe i th positionforeach DNAsequencebyorderingthehomologoussequencesinawaythatbestt thebiologicalintuitionorsomemathematicaloptimizationprograme.g.the Needleman-Wunschalgorithm.Thearraysareknownas alignments ,while asinglecolumnfromanarrayofalignmentsiscalledan alignedbase .By measuringthenumberoftimesweobservehowofteneachuniquealigned baseinthesetofalignmentsisobserved,wecandevelopthe"`distances"' 2 PAGE 9 betweenthetaxaandinferthephylogenetictreebestrelatingthem. Amongthemostpopularstatisticalmethodsinproducingaphylogenetic treeistheevaluationofaclassofstatisticalmodelsunder MaximumLikelihoodEstimation MLE.Themethodcorrespondstheexpectationofalignments tofunctionsmappingoveraspaceofparameters.TheendgoalofMLE isthenndingthoseparametersthatbesttthesefunctionstotheobserveddata.ThemostpopularsubclassofmodelsintheMLEmethod areMarkovmodels.IwillshowthatMarkovmodelsallowmanipulation thatproduceuniquenon-parametricobjectsknownas algebraicinvariants ThesecanbeusedtoanalyzecertaintopologicalfeaturesofthephylogenetictreeindependentofMLE.Amajordrawback,however,isthatthe naiveapproachtocomputeandanalyzeinvariantsindeterminingthecorrect phylogenetictreeistocomputeinvariantscorrespondingtoalltrees,acomputationwhosecomplexityisgreaterthanexponentialinthenumberofleaves. Givennaturalassumptionsontreestructure,Ishowthatalgebraicinvariants provideameansofdeterminingthetruetopologyofaphylogenetictree.I thenconstructthespaceofallpossibletreetopologies,wherethedistance metricbetweenuniquetreetopologiesisdenedbythenumberoflocal topological'changes'fromonetreetoanother.Fromthis,Iproposethe methodofcomparingnearestneighbortreesbytheirinvariants.Ishowthat byinitializingonarandomtree,evaluatingthistreeandalltreesclosestto it,choosingthebestinthissubset,thencontinuingtheprocesswiththis chosentreeuntilthebesttopologyisdetermined,Imaydeterminethebest ttopologyofthephylogenetictree.Thisisinfactthegreedyalgorithmon thespaceoftrees.Ishowthatthisapproachinthebestcasereducesthe 3 PAGE 10 Figure1:Arootedevolutiontree,changesovertimeowdownwardawayfromcommonancestor,notdrawntoscale. numberoftreestocomputefromexponentiallymanybythenaiveapproach, tolessthancubic.Iproposethatthoughtheideaofmeasuringstatisticsof treesforuseinagreedysearchonthespaceoftreeshasbeenproposed,the useofinvariantstocomputethestatisticalmeasureonagreedysearchhasnot beenpreviouslyproposed.Resultsofasmallexperimentusingthisapproach areincluded.Thoughthenumberofparametersandsamplesizeissmall,it isfoundthatthisapproachholdspromisetoproblemsinphylogenetics. 2Genetics 2.1Preliminaries DeoxyribonucleicAcidDNAisapolymerchaincomposedoffourmolecules: Adenine,Cytosine,Guanine,andThymine.Thesefourmoleculesare nucleotides ,orinterchangeably, bases .EachnucleotideinachainofDNAis 4 PAGE 11 Figure2:Anunrootedevolutiontreee,onlytherelationsbetweentaxaareknown. matchedagainstitschemicalcounterpartasa basepair toformahydrogen bond.Alongthestringofbases,thelocationofanucleotideintheDNAstring iscalledthe position .Thenumberofnucleotidesassociatedwiththegenomes ofanorganismvaryfromonseveralorders.Forexample,virusgenomesareas smallas3200nucleotides,andthehumangenomeisover3billionnucleotides. Nucleotidesubsequencesofgenomesthatdenebiologicaltraitsarereferred toas genes .Theprocessesrequiredtosupportthelifeofanorganism areinlargepartcontrolledbythetranscriptionandreplicationofgenes. TranscribingandtranslatinggenesintoproteinsthatperformbiologicalfunctionsisperformedbycollectionsofRibonucleicAcidRNA,whichare chemicallysimilartoDNA.Thesegenesareconservedbytheprocessof DNAreplicationinthedivisionofcellstoprovidechromosomesfornewly formedcells.Translationistheprocessofmanufacturingaminoacidsfrom transribedRNAthatinturnareusedtocreateanarrayofproteinsthat 5 PAGE 12 runmostprocessesinthecelland,eectively,theentireorganism.SubsequencesofDNAhavealsobeenfoundtocontainmost,ifnotall,ofthe informationnecessarytoproducemanyofthebiologicaltraitsofanorganism. A taxon isthebasicunitinphylogenetics,usedtodistinguishcollections oforganismsonassmallorlargeascaleasdesired.Ashandadog, forexample,aremembersofthetaxontype Phylum as Chordata ,duetotheir sharedtraitamongothersofhavingacentralskeletalstructure.Adog, ajackal,andthesharedgenesbetweenthemaremuchmoresimilarin comparisonandarecollectedinthetaxontype Genus as Canis .Thetaxon type species isacollectionoforganismsthatarebiologicallyabletobreed andshareDNA.Thecollectionofgenesoforganismsinthesamespecies andlocaledenesthe genepool ,thosegenesthatcanbeinheritedfrom breedingindividualswithinthesamelocale. Biologicaltraitsandtheirgenesareunderconstantpressurebynatural selction,andwhengenepoolsdiverge, speciation canoccur.Speciationisthe processbywhichtwopopulations,originallythesamespeciesbecometwo dierentspeciesandarenolongerabletomateandcombineDNAina commongenepool.Divergenceandnaturalselectionfromacommonancestor intheformationofnewspeciesisthemajorprocessinevolutionrst describedDarwin.Anexampleofthetreesofstudyinthisworkarealso usedtovisuallyinterperetthespeciationanddivergenceoftaxaovertime, Darwininhisbook TheOriginofSpecies ,includedadrawingofabifurcating evolutionary"`tree,"'withbranchesendingintaxacurrentlyexisting,and convergingtoacommonancestor,or root .Theevolutionarytreeovertime wasformallynameda phylogenetictree ,andthestudyofevolutionbythe 6 PAGE 13 Seq 1: Seq 2: Seq 3: Seq 4: A A T T 2 6 6 4 G C C C 3 7 7 5 T T C G C C C C Figure3:Asetof4alignmentswiththe2ndalignedbasehighlighted moleculargeneticsis Phylogenetics Animportantquestionthataroseimmediatelyaftertheintialdevelopment ofphylogeneticswas,"`Howtoproduceaphylogenetictreefromthecollection oftaxacurrentlyobservable?"'Atrstthisquestionwouldberesolvedby observingandcarefullyquantifyingthe phenotype ,orthephysicalcharacteristics ofanorganism.Solvingthisislimitedbytheidenticationofthosephysical characteristicsbetweentaxathatareassociatedbyacommonancestor.This hasseveralimmediateproblems,suchashowtoclassifyverysmallorganisms andphylogeneticallydistantorganismsthatshareacommontraitlikebirds andplatypi. Themodernphylogeneticmodelisbasedonthehypothesisofanevolutionary'clock',thatspeciesdivergefromacommonancestoranddonotconverge backtoashareddescendant,andthatthisdivergenceisoftenmeasuredby theexamingthesequencesdivergingbetweenspeciesatcommongenes.This hypothesisismadeformalonaphylogenetictreebyxingthegraphical structuretoadirected,bifurcatingtree.Allbinarytreeshavethefollowing properties: Allinteriornodesinbifurcatingtreeshaveexactlyoneparentnodeand twochildrennodes,whichrepresentsspeciation.Whenundirected,the interiornodeshaveexactlyvalencethree. 7 PAGE 14 Thesourcenodesrepresentthecommonancestorsinaphylogenetic tree. Thesinknodesleavesrepresentobservabletaxainaphylogenetictree. Ifthesinglecommonancestorisidentiable,itisrepresentedasthe uniquesourcenodeandreferredtoasthe root Amathematicalmodelfortheprocessofevolutionofageneticstring alongeachedgecanberepresentedtobeanindependentlyandidentically distributed i.i.d ,non-degenerativeprocessdenedbyaMarkovmatrix.That is,wesupposeeachtaxasequenceisaDNAstringoflength n ,thatthe rootstringisthegenesofacommonancestorforeachtaxaandthatthe i th positionalonganystringis position i .Bystackingtheobservedtaxa sequencessuchthateachpositionofthesequenceisthe i th columnofthe strings,weconstructalignments.Each i th columnofthealignmentsare referredtoasthe i th alignedbase FromeachpositionintheDNAstringassociatedtoaparentnode,we allowaposteriorprobabilitydistributions,knownas transition probabilites,to thesamepositionofthestringassociatedtoachildnodeinthetree.This istheprocessbywhichachildnode"`evolves"'fromitsparent.Allowing for v tobetheuniqueparentnodeofthechild v ,wedeneforevery edge v ;v aMarkovmatrix P v v = 0 B B B B B B B @ P v = A j v = A P v = A j v = C P v = A j v = G P v = A j v = T P v = C j v = A P v = C j v = C P v = C j v = G P v = C j v = T P v = G j v = A P v = G j v = C P v = G j v = G P v = G j v = T P v = T j v = A P v = T j v = C P v = T j v = G P v = T j v = T 1 C C C C C C C A 8 PAGE 15 Q = 0 B B @ )]TJ/F15 11.9552 Tf 9.299 0 Td [( r a + r b + r c r a r b r c r a )]TJ/F15 11.9552 Tf 9.298 0 Td [( r a + r d + r e r d r e r b r d )]TJ/F15 11.9552 Tf 9.298 0 Td [( r b + r d + r f r f r c r e r f )]TJ/F15 11.9552 Tf 9.299 0 Td [( r c + r e + r f 1 C C A Figure4:Aratematrixwiththeconditionofbeingsymmetric,theonlynecessary conditionofGTR Foreachofthefournucleotidesweidentifyfourprobabilitieswithcolumnsumtotalsofone. ThebiologicalassumptionsdonotallowtheuseofarbitraryMarkov matricesbetweennodes,andsoseveralmodelshavebeenproposedtoallow derivationsofMarkovmatricesasfunctionsofanon-degenerativeprocessover time.OneofthemostpopularmodelsistheGeneralTimeReversibleGTR model.Thismodelattributestheprocessofevolutionbetweentwosequencesto aratematrix[takentoacontinuousexponentialprocessovertime t .]InGTR foreachedgetheratematrixisasymmetricmatrixwithcolumntotals0[31]. InGTRtheMarkovmatrix P inEquationiscomputedbyan exponentiatingtime t overtheinstantaneousratematrix Q : P t = e Qt = 1 X i =0 1 i Q i t i Astime t increases,thereisagreaterprobabilitythatthelabelata positionforataxawilldegenerateintoanothernucleotide.Thegraphical representationoftimeonaMarkovmatrixisthelengthoftheedge correspondingtothephylogenetictree.Additionally,forthegeneralGTR modeltheaboveidentitycanonlybecomputednumerically,butonsome restrictionsoftheGTRthereareclosedalgebraicformson t 9 PAGE 16 2.2Group-basedModels Aspecialrestrictionontheratematricesassociatedtoedgesofthephylogenetic treemustbeconsideredforfuturereference.Ofrstconsiderationisthe Jukes-Cantormodel.Fixarate fortheratematrix Q tocreatethe Jukes-Cantor ratematrix: Q v v = 0 B B B B B B B @ )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.298 0 Td [(3 )]TJ/F15 11.9552 Tf 9.299 0 Td [(3 1 C C C C C C C A Asmentionedbefore,insomecaseswecanobtainanalgebraicformof thetransitionprobabilitymatrixfromcomputationoftheratematrix.This istrueforthisrestriction,andtheassociatedMarkovmatrixis: P v v = 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v a v a v a v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v 1 C C C C C C C A ThedevelopmentoftheJukes-Cantormodelwasmeanttosimplifyalgebraic methods,andisgenerallyexplainedasplausibleasthe a v willberelatively small,andthatvariationonpositionsisfairlyuniform.Thisdoesnot necessarilytthereality,asevenundertheassumptionthatevolutionoccurs astheprocessdescribedabove,theratematricesinwouldnotberestricted tobeuniform. AlessrestrictiveMarkovmatrixisthe Kimura 3-parametermodeldened by: 10 PAGE 17 M v v = 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v a v b v c v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v c v b v b v c v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v a v c v b v a v 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.955 0 Td [(c v 1 C C C C C C C A TheKimura2-parametermodelisexactlylike.2,withtheadded restrction b v = c v .TheKimura-2andKimura-3modelsallowforsomeleeway inbiologicalplausibilitythattheJukes-Cantordidnot,asweareallowed dierentratesbetween transitions and transversions .Transitionsaresubstitutionsbetweenchemicallysimilarnucleotides,andaremuchmorecommon thantransversions,whicharethesetofallothersubstitutions. Byassumingadirectedbifurcatingtreetopology,andassociatingtoeach edgesetsoftransitionprobabilities,wedenea bayesiannetwork ;thatis,we havedenedamodelwiththefollowingcollectionofproperties[23]: Asetofvariablesandasetofdirectededgesbetweenvariables Eachvariablehasanitesetofmutuallyexclusivestates Thevariablesandedgesformadirectedacyclicgraph Thoughwehavedenedamodel,theassociatedtransitionprobabilites occurringinthemodelarenotdirectlyidentiable,andthequestionof howtoconstructthetreetopologycannotbeaddressed.Itremainsto beseenwhatotherformsofdatacanbeanalyzedfromtheobservations ofsequencesandhowtousethisdatatoconstructthecorrectphylogenetictree. 11 PAGE 18 Theaddresstheproblemofidentifyingthetransitionprobabilitiesthat besttthemodel,wemustmeasurestatisticsonalignedstrings.Ourmodel createsarandomwalkonMarkovmatricesfromroottoeachleaf,resulting inprobabilitiesthateachnucleotidewilloccuratagivenleafatasingle position.Thenforasinglepositiononallsequences,thenucleotidesobserved aretheresultsonarandomwalkoveraphylogenetictree.Theordered collectionofthesenucleotidesarrangedoveroneposition,isanalignedbase, andthemostusefulandpopularstatisticsaremeasuredoveroccurrencesof alignedbases.Fornotation,werefertothealignedbaseatposition j as a j andthenumberofoccurrencesobservedinthealignments u a j .Wereferto themeannumberoftimes a j isobservedby^ p a j = u a j n ,andtheindeterminate probabilityofthealignedbaseoccurring, p a j .Theindeterminateprobability isaconstructionofthemodelswecreate,andourpurposeistooptimze ourmodelsinawaysuchthatitallowsaminimaldierencebetweenthe evaluationof p a j andtheobservations ^ p a j 3MaximumLikelihoodEstimation Ofthemanystatisticalapproachesmeasuringalignedbases,themostpopular amongcomputationalbiologistsisMaximumLikelihoodEstimationMLE. MLEisageneralmethodthathasmanyclassesofmodelfunctionsovera parameterspace.Fornow,thisparameterspaceisacollectionofundened variables:= 1 ;:::; d ; i 2 C Let m bethelengthofthealignments, a 0 i thealignmentforposition i Forthearrayof n alignments[ a 0 1 ;a 0 2 ;:::;a 0 m ] ;a 0 i 2f A;C;G;T g n ,wecompute coordinateexpectationovertheparameterspace. 12 PAGE 19 f a 0 i : 1 ;:::; d 7!f 0 ; 1 g Fromthismodelwesupposethateachobservation a 0 i isindepedentfor i =1 ;:::;m ,alongthelengthofthealignments.Thentheprobabilitythatwe observetheseriesofalignedbases a 0 i forallpositions i inthealignmentsis theproductofall f a 0 i L a 0 1 ;:::;a 0 m j := m Y i =1 f a 0 i where L isreferredtoasthe Likelihood functionforallalignedbaseson. Example Fromtherst4alignedbasesinFigure,weidentifythe coordinatefunctionsandtheassociatedlikelihood: f AATT ;f GCCC ;f TTCG ;f CCCC L AATT;GCCC;TTCG;CCCC j = f AATT f GCCC f TTCG f CCCC Deneforthesetofall unique observations f a j g = f A;C;G;T g n the associatedsetofcoordinatefunctions: f a j : 1 ;:::; d 7! [0 ; 1]: a j 2f A;C;G;T g n ForthesetofcoordinatefunctionsinEquationdenethedirect product F ofallcoordinatefunctions: F : R d [0 ; 1] 4 n ; 7! f a 1 ;:::;f a 4 n Example Thedirectproductofcoordinatefunctionsforasetof4alignments 13 PAGE 20 is f AAAA ;f AAAC ;:::;f TTTG ;f TTTT Forthesetofcoordinatefunctions f a j andthesetofalignedbase observationcounts u a j ,thelikelihoodfunctiononthesetofalignmentsis L u a 1 ;:::;u a 4 n j = 4 n Y j =1 f a j u a j : Example Giventhesetofallnonzeroobservationcountsfor2alignmentsis u AA =1 ;u CA =2 ;u GC =4 ;u TG =3,thelikelihoodfunctionis: L u AA ;u AT ;:::;u TT j = f AA f 2 CA f 4 GC f 3 TG ItiseasilycheckedthatEquationsand13areequivalentonanyset ofalignments. TheproblemofMaximum-LikelihoodEstimationistomaximizethelikelihoodfunctionforasetofobservationcounts u a j over.Thisproducesa setofparametersthatbest'explain'thesetofcounts u a j asthemost likelyobservation,andreturnthesetofestimates f a j closestto^ p a j Theformof L looksintimidatingasis,butallowingfor L tobetwice continuouslydierentiable,solvingforthelocalandglobalmaximmumsof L isequivalenttosolvingforthelocalandglobalmaximumsofanystrictly increasing,twice-dierentiabletransformationof L .Thetransformationbest suitedinmostcasesfordeterminingthelocalandglobalmaximumsof L isthelogarithm. 14 PAGE 21 l := log L u a 1 ;:::;u a 4 n j = m X j =1 u a j f a j Thelogarithmtransformofthelikelihoodfunctionisthe log-likelihood ,and limitsthecomputationalcomplexitytoalinearfunctiononthecomplexity ofthecollectionoffunctions f a j Example FromExample,wecalculatethelog-likelihood: log L u AA ;u AT ;:::;u TT j = f AA +2 f CA +4 f GC +3 f TG Tondthebestsetofcoordinateexpectationsistoidentifytheglobal maximumofthelikelihoodfunctionandequivalentlythelog-likelihoodfunction. Anobviousmethodforobtainingthiswouldbetorstobtainacritical point.Acriticalpointofthelog-likelihoodfunctionisdenedbysatisfying for1 i d @log L @ i = 1 L @L @ i =0 Thesecriticalpoints,foranycaselargerthanone-dimension,donot implyglobalorlocalmaximaorminima.Toguaranteelocalmaximaof thelog-likelihoodfunction,allsecondpartialderivativesofthelog-likelihood functionmustbenegative. Example Thesimplestconstructionforacollectionofcoordinatefunctions f a j comprisingthelikelihoodfunctionisalinearcombinationof 1 ;:::; d f a j = d X i =1 b ji i + c j 15 PAGE 22 Thederivativeofthelog-likelihoodfunctionbyachosenindeterminate k withtheaboveasacoordinatefunctionforall j is: @log L @ k = m X j =1 u a j b jk d P i =1 b ji i + c j Forthecaseofasetof2alignments,thecoordinatefunctionswould havetheform f a 1 = f AA = b 11 1 + c 1 + b 12 2 + c 1 + + b 1 d d + c 1 f a 2 = f AC = b 21 1 + c 2 + b 22 2 + c 2 + + b 2 d d + c 2 f a 16 = f TT = b 1 + c 16 + b 2 + c 16 + + b d d + c 16 andthesetofallpartialderivativesofthelog-likelihoodwouldbe @log L @ 1 = m X j =1 u a j b j 1 P i =1 b ji i + c j @log L @ 2 = m X j =1 u a j b j 2 P i =1 b ji i + c j @log L @ d = m X j =1 u a j b jd P i =1 b ji i + c j Acriticalpointof L isobtainedwhenEquation14areallzeroforsome 2 R d Thesolutionforndingthezerosofsinglederivativesforalikelihood functioncanbedoneemployingcalculusorGroebnerbasesiftheloglikelihood 16 PAGE 23 functionisallowedarepresentationasapolynomialin R [ 1 ; 2 ;:::; d ]. TheMLEapproachisveryrichandthereareavarietyofmodelsthat havebeenpreviouslyproposedwithvaryinglevelsofsuccess[25]. 4HiddenMarkovModel The HiddenMarkovModelHMM isaclassofgraphicalmodelsthatinvolve actionsbetweennodesthatmayeitherbeobservedorhidden.Themodels themselvesaretusingMaximum-LikelihoodEstimation,withacollectionof coordinatefunctionsevaluatedtomaximizetheassociatedlikelihoodfunction. ThegraphicalconstructionofanHMMisaBayesiannetworkwithasingle sourcenode,associatedwithtwocomplementarysubsetsofitsvertexset.The rstsetconsistsofthosewhoselabelsmaybeobserved,andtheseconda collectionofnodeswhoselabelsarenotobserved,andaredeemed"hidden". Bythepropertiesofdirectedacyclicgraphs,wedonotallowmodelsto includeoutwardedgesfromobservednodes,astheirstateisobservableand henceare d-separated fromanycandidatenodesThereaderisreferredto [23]forfurtherreadingonthepropertiesofdirectedacylicgraphs.Theedges representanarrayofposteriorprobabilities,withanindependentdistribution associatedtotheuniquesourcenodeofthegraph.Toexplainthismethod further,wecomputeasmallexample. Example Imagineashortseriesofhiddenandobservedcoinips,two observedandtwohidden,suchthattheprobabilitiesforeachareconditional onanypreviouscoinip.Weassociatehiddenvariables v h 0 ;v h 1 and observedvariables f v o 2 ;v o 3 g withdirectededgesandtheirrespectiveprobabilites 17 PAGE 24 Figure5:SimpleHMM E T =[ v h 0 ;v h 1 ; v h 0 ;v o 2 ; v h 1 ;v o 3 ].Eachedgeinturnhasanassociated transition probabilitydistribution,andapriordistributionisassociatedtotheinitial vertex. =[ 0 ; 01 ; 02 ; 13 ]=[ P v h 0 ;P v h 1 j v h 0 ;P v h 2 j v h 0 ;P v h 3 j v h 1 ].Thenthecoordinatefunctionfortheobservation v o 2 = H;v o 3 = T is f HT = X i;j 2 [ H;T ] 0 i 0 ; 1 i;j 0 ; 2 i;H 1 ; 3 j;T Itiseasilycheckedthatthissatisesthenon-biased,independentcasefor binomialprobabilities. ByrestrictingBayesiannetworkstotrees,andattachingtoeachthe MarkovmatricesconstructedinSection,wecanevaluatethePhylogenetic treeproblembyMLE.The root istheuniquesourcevertexwithnodirected inwardedges.Someconstructionsallowasingleoutwarddirectededge,while othersdeletethisedgeandlabeltheresultingvalencetwonodeastheroot. 18 PAGE 25 Thisisamatterofpreference,asthetwoconstructionsareequivalentAllow forabifurcatingrootwithindependentdistributionvector r tohavethe form 0 r P e r forsomevectordistribution 0 r andtheadjacentoutwardedge posteriordistribution P e r .The leaves ofthetreeareallverticeswitha singledirectedinwardedgeandnodirectedoutwardedges. EachedgewillbeassociatedtoaMarkovmatrix,whichinturndescribes anarrayofparametersin. Weconstructourparametersforallpossibledistributionsontherootand edge-set: 8 i;j 2f A;C;G;T g ;v 2 V T = f r g ; v v i;j := P v = j j v = i r i := P r = i Allowingthevertexordering V T = f v 1 ;:::;v m g : = r A ;:::; r T ; v 1 v 1 A;A ; v 1 v 1 A;C ;::: :::; v 1 v 1 T;T ; v 2 v 2 A;A ;:::; v m v m T;T Withtheproperprobabilisticconditionsthat 19 PAGE 26 8 i;j 2f A;C;G;T g ; v v i;j 0 ; X k 2f A;C;G;T g v v i;k =1 8 i 2f A;C;G;T g ; r i 0; X k 2f A;C;G;T g r k =1 Forallpracticalcases,wesubstitutepositivityfornonnegativityfordistributions,asitassumedthatthereisachanceofpositionevolutiontoall bases,nomatterhowsmall. ThenwemayassociateforeachMarkovmatrix P v v asconstructedin Section2: P v v = 0 B B B B B B B @ v v A;A v v C;A v v G;A v v T;A v v A;C v v C;C v v G;C v v T;C v v A;G v v C;G v v G;G v v T;G v v A;T v v C;T v v G;T v v T;T 1 C C C C C C C A Remark Whetherthetransitionprobabilitiesaredependentorindependent ofachosenedgeisachoiceofwhethertheedgesareofthesamelength. Ifwerecall,theMarkovmatrixconstructedabovewasdependentontime. Fixinganedgelengthwouldbeequivalenttoxingatime t andthusxing theMarkovmatrixforeachedge. Eachuniqueobservation a j inthephylogeneticHMMisanalignedbase, withthelabelsinthealignedbasetheassociatedobservationsontheleaves ofthetree.Thenthecoordinatefunctions f a j aresumsovertheproducts oftheparametersassociatedtoeachedgeandtheindependentdistribution 20 PAGE 27 oftheroot. Example WeperturbtheHMMinFigureintoaphylogenetictreeby replacingthebinarylabelswithnucleotidelabels,andremovethenode v h 1 by thefollowingprocess: 8 i;j 2f A;C;G;T g ,let v h 0 v o 3 i;j = X k 2f A;C;G;T g v h i v h 0 v h 1 i;k v h 1 v o 3 k;j : Allowthetreetoonlycontainnodes v h 0 ;v o 2 ;v o 3 ,edges v h 0 ;v o 2 ; v h 0 ;v o 3 andprobabilities n v h 0 ; v h 0 v o 2 i;j ; v h 0 v o 3 i;j : 8 i;j 2f A;C;G;T g o .Theresult isarootedphylogenetictree,andtheproblemofdeterminingthevaluesof coordinatefunctionsfortwoalignmentsisequivalenttoevaluatingthismodel. Weordertheelementsinthesetofleaves L T astheorderofthe alignedbase, l 1 ;l 2 ;:::;l n .Lettheinteriorverticesbeallthosevertices withvalence3,hence IV T = V T )]TJ/F15 11.9552 Tf 12.005 0 Td [( L T [f r g .Let f v i 0 ;v i 1 ;:::;v i g bea partialorderingoftheinteriorverticesbydistanceawayfromroot,leastto greatest,suchthat v i )]TJ/F31 7.9701 Tf 6.586 0 Td [( n )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 ;v i )]TJ/F31 7.9701 Tf 6.587 0 Td [( n )]TJ/F31 7.9701 Tf 6.586 0 Td [(2 ;:::;v i aretheparentnodesof l 1 ;l 2 ;:::;l n respectively. Thenforanyalignedbase a j ,weassociatetheindividuallabelingonthe leavesby a j ;a j ;:::;a j n ,andtheexpectationofthealignedbaseon theleavesbythecoordinatefunction f a j = X k r X k v i o X k v i 1 X k v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 X k v i r k r v i o k r ;k v i o v i 1 k v i o ;k v i 1 v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 k v i )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 ;a j n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 v i k v i ;a j n 21 PAGE 28 whichisjustthesumoftheparameterproductsoverallrootandinterior labelings.Theresultinglikelihoodfunctionisthesameasabove,takinga productofeachalignedbaseexpectationforeachposition.Aclearerform ofthecoordinatefunctionistobuildasetof histories oftheinteriorlabels. Let IV T = V T )]TJ/F15 11.9552 Tf 12.79 0 Td [( L T [f r g bethesetofallinteriorverticesand H = k v v 2 IV T [f r g thesetofalllabelingsontheinteriorverticesplusthe root,thenthecoordinatefunction f l 1 l 2 l n k l 1 ;k l 2 ;:::;k l n = X k v 2 H r k r Y e M v v k v ;k v ItistrivialtoshowthatEquationsandareequivalent. TheindeterminatesforthegeneralMarkovmodeloneachedge-wise Markovmatrixis12werequirethecolumnsumstobe1,thenumberof indeterminatesacrossthewholemodelis3+12 j E T j =3+12 n )]TJ/F15 11.9552 Tf 9.495 0 Td [(3=24 n )]TJ/F15 11.9552 Tf 9.495 0 Td [(33, andthetotaldegreeofpolynomialtoevaluateforeachcoordinatefunction is2 n )]TJ/F15 11.9552 Tf 13.047 0 Td [(2.Whilethelogarithmtransformationrelievesusofevaluating indeterminateproductsofexponentiallygreatertotaldegree,itisapparent thatndingaglobalmaximumisindeeddicultforanymethod.Thisis usuallywhyafteracertainnumberoftaxa,theHMMonphylogenetictrees isperturbedtotradeecacyforeciency,ornotusedatall. Ingeneral,MLEmethodshavebeenfoundtohavecertainunsavory properties: [6]foundmultiplemaximumlikelihoodoptimafortheHMM,andfor certaintopologies,continuumofoptimaoccured. In[7],itisfoundthatphylogenetictreereconstructionbyMLEis NP-hard. 22 PAGE 29 Dependingonthemodel,suchasin[18],MLEonHMMhasalsobeen foundtobeinconsistent. ThoughtheseproblemsareespeciallyperniciousforthemostgeneralMarkov model,wecandrasticallyrestrictthedimensionoftheparameterspaceby choosingauniformMarkovmatrixforeachedge,andthenfurtherby restrictingthedegreesoffreedomoftheparametersinthematrix.These restrictionsandtheirimplicationswillcomeintouselateron. Thegeneralconsensusamongstcomputationalbiologistsisthat,ifnot intractable,maximum-likelihoodestimationispreferred.Thoughnoneofthe modelsaddressrealissuesinthephysicalprocessesinDNAreplicationand recombination,fortheapplicationsofcurrentinterest,MLEisthepreferred method. 5AlgebraicPropertiesofMarkovModels Maximum-LikelihoodEstimationisapowerfulapproachthatdirectlyaddresses thecommonlyassumedphylogenetictheory.However,parameterestimation alongthesemodelsbecomescomputationallyexpensiveasalignmentsizeand thenumberofleavesincreases.Itiswisetoconsiderotherpropertiesofthe HiddenMarkovModel. Apossibleavenueinrelievingourselvesfromtheburdenofexpensive parameterestimationistodiscussthealgebraicpropertiesofMarkovmodels. WeconsideragainourcoordinatefunctionformallyconstructedinEquation ,andperturbeachdeterminatecoordinatefunction f a j byrestricting theirdomain f 1 ; 2 ;:::; d g andcodomains p a j : a j 2f A;C;G;T g n tobe 23 PAGE 30 Figure6:Linegraphwiththreevertices indeterminate.Thatis,weallowthemtoremainunknownvariablesuntil substitutedforrealvalues.Denotetheinducedmap. ~ f a j : 1 ; 2 ;:::; d 7! p a j : Thenthedistributionoftheindeterminateexpectationsforallleaf-labels isadirectproductofallcoordinatefunctionsinEquationis ~ F : 7! p AA AA ;p AA AC ;:::;p TT TG ;p TT TT ; X a j 2f A;C;G;T g n p a j =1 : Weconstructfrom im ~ F thepolynomialring R [ p AA AA ;p AA AC ;:::;p TT TG ;p TT TT ],andwediscussthosefunctions E : im ~ F C andtheirvariousproperties.Oneofthemostnaturalproperty todiscussisthekernelofthemap E 24 PAGE 31 Example Westartwithanexamplefromanundirectedthreenodelinegraph G asdepictedinFigure,denedbythevertexset V G = f A;B;C g and undirectededgeset E G = f A;C ; C;B g .Eachnodeisassociatedanevent whosestatespaceis f 0 ; 1 g .Foreachedgeweattachthelogicalstatement of conditionaldependence Inclassicalprobability,wesaytwoevents A and B are independent if andonlyiftheysatisfy P A B = P A P B : Givenathirdevent C ,thisindependenceimplies P A B j C = P A j C P B j C : If A and B onlysatisfyFigureforsomeevent C ,wedescribe A and B as" conditionallyindependent given C ." Conversely,iftwoevents A and B donotsatisfyFigure,theyare depedendent ofeachother,andifthreeevents A B C donotsatisfyFigure A and B are conditionallydependent given C ." Forourthreenodelinegraph,leteachpairofeventsbeindpendent giventhesetallothersifandonlyiftheredoesnotexistanedgebetween theirassociatednodes.Thenin G ," A indpendentof C given B "isthe onlyindependencestatement. Let k 1 ;k 2 ;k 3 ;k 4 ;k 5 benotnecessarilyuniquestatesinthespace f 0 ; 1 g 5 25 PAGE 32 ofanyoftheevents A;B;C ,andlet p ijk := P A = i;B = j;C = k for allstates i;j;k .Thebelowfollowsfromthesingleindependencestatement associatedto G p k 1 k 5 k 2 p k 3 k 5 k 4 = P A = k 1 ;B = k 5 ;C = k 2 P A = k 3 ;B = k 5 ;C = k 4 = P A = k 1 ;C = k 2 j B = k 5 P B = k 5 P A = k 3 ;C = k 4 j B = k 5 P B = k 5 = P A = k 1 j B = k 5 P C = k 2 j B = k 5 P A = k 3 j B = k 5 P C = k 4 j B = k 5 P B = k 5 2 = P A = k 1 j B = k 5 P C = k 4 j B = k 5 P A = k 3 j B = k 5 P C = k 2 j B = k 5 P B = k 5 2 = P A = k 1 ;C = k 4 j B = k 5 P B = k 5 P A = k 3 ;C = k 2 j B = k 5 P B = k 5 = P A = k 1 ;B = k 5 ;C = k 4 P A = k 3 ;B = k 5 ;C = k 2 = p k 1 k 5 k 4 p k 3 k 5 k 2 Whichinturnimplies: p k 1 k 5 k 2 p k 3 k 5 k 4 )]TJ/F17 11.9552 Tf 11.956 0 Td [(p k 1 k 5 k 4 p k 3 k 5 k 2 =09 TheresultinEquationisinterestinginthatittellsusalogical propertyofthemodelwe'vedevelopedwithoutconstructingthemodelitself. SinceourmodelmustsatisfyEquationforanychoiceoflabelings k 1 ;k 2 ;k 3 ;k 4 ;k 5 2f 0 ; 1 g 5 ,thecontrapositiveisalsotrue:Ifourgraphical modeldoesnotsatisfyEquationforanychoiceoflabelings, A and C arenotconditionallyindependentgiven B .Hence,wecanmakeassertions 26 PAGE 33 ofthetopologyofaMarkovmodelwithlogicalindependenceconditions attachedtoedgesbyevaluatingasetofpolynomials.Thosepolynomialsin theringofindeterminates R [ f p ijk : 8 i;j;k g ]thatarecongruenttozeroare called algebraicinvariants .Thesetofalgebraicinvariantsofamodelform anidealintheringofindeterminates.Lettheidealofallinvariantsinduced by" AindependentofCgivenB "bedenoted I A;C j B : Supposewearegivenamodelwithevents A 1 ;A 2 ;:::A n ;B 1 ;B 2 ;:::;B n ;C 1 ;C 2 ;:::C n suchthatforall i A i isconditionallyindependentof C i given B i .Let M betheunionofallsuchindependentstatementsforthemodel.thenthe followingisanidealofalgebraicinvariantsforthewholemodel: I M = I A 1 ;C 1 j B 1 [ I A 2 ;C 2 j B 2 [[ I A n ;C n j B n Iftheindeterminatesoftheringaregivenalexicographicorder,thenthe setofallpossiblealgebraicinvariantsconstructedasabovearea Groebner Basis fortheideal[16].Groebnerbasesareobjectsdenedinpolynomial rings,andreadersshouldbereferredtotheexcellent Ideals,Varieties,and Algorithms [8]forfurtherdiscussion. Analgebraicinvariantnotyetdiscussedbuteasilydemonstrablebythe propertiesofasetofindeterminateprobabilitiesis: 27 PAGE 34 X k 1 ;k 2 ;:::;k n p k 1 k 2 k n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1=0 Theaboveexamplesaremeanttoshowhowalgebraicinvariantscan determinepropertiesofastatisticalmodel,includingtopologicalfeaturesona graphicalmodel. Remark In1987,CavenderandFelsenstein[5]andLake[20]independently derivedalgebraicinvariantsfromthephylogenetichiddenMarkovmodelofa phylogenetictree.Tospecifythatthealgebraicinvariantswerederivedfrom phylogeneticmodels,theyareoftenreferredas phylogeneticinvariants .These wereshownin[12]toberelativelyunpredictive.Earlyresearchinproducing phylogeneticinvariantsbydierentmethodsandonvariousmodelsweremostly analgebraicexercise.Acatalogofpapersdevelopinginvariantsintherst sevenyearsofresearchisfoundin[15]. Beforewecontinueonwardwithconstructinginvariants,weneedtodiscuss afewpropertiesoninvariantsthatwouldheadtoconfusion.Recallthatfor thegeneralproblemofproducingaphylogenetictreestrictlyfromthedata onthetaxawearenotallowedpriorinformationonanyoftheinterior vertices,andalsotheroot.Allofourconstructionsuptothispoint,and theconstructionsafterwards,supposetheexistenceofaroot,andinsome cases,useobjectsthatrepresentthedistributionacrosstheroot.However,the invariantswedescribedonotrepresentanyinformationacrosstherootor interiorvertices.Theproblemforconstructingphylogenetictreesisthennot toconstructdirected,rootedphylogenetictrees,butundirectedphylogenetic trees.Itshouldbenotedthatfor n taxa,thenumberofdirectedphylogenetic treesis n )]TJ/F15 11.9552 Tf 10.917 0 Td [(3!!= n )]TJ/F31 7.9701 Tf 6.586 0 Td [(2! 2 n )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 n )]TJ/F31 7.9701 Tf 6.586 0 Td [(1! ,thenbytheperturbationoflabelingtherootas 28 PAGE 35 anotherleaf.Thetopologiesofthe n -leafrootedtreeshaveanaturalbijection intothesetof n +1-leafunrootedtreetopologies.Thenthenumberof n -leafundirectedphylogenetictreesis n )]TJ/F15 11.9552 Tf 11.243 0 Td [(5!!= n )]TJ/F31 7.9701 Tf 6.587 0 Td [(4! 2 n )]TJ/F32 5.9776 Tf 5.756 0 Td [(2 n )]TJ/F31 7.9701 Tf 6.587 0 Td [(2! .[Sinceourgoalis onlytodiscovertheundirectedphylogenetictopologies,howcanwedetermine fromconstructionsofdirectedtrees?]Itturnsoutthatwhendetermining invariantsfromarootedmodel,thechoiceoftherootpositionisarbitrary. Corollary5.1 Let T bea n -taxontreeandlet a beoneofthetaxalabeling theleaves.Thenthephylogeneticinvariantideal I T forthegeneralMarkov modelon T rootedat a isidenticaltothephylogeneticinvariantideal I T forthegeneralMarkovmodelon T rootedat r ,where r isanyotherleaf, internalnode,ornewnodeinstertedonanedgeof T Proof See[1]. Thentheconstructionofinvariantsisindependentofthelocationof therootofthetree,henceallphylogeneticinvariantsdescribeundirected phylogenetictreemodels,andanyconstructioninvolvingrootdistributionsare independentoftherootdistribution.Nowwecandiscussinvariantsandthe constructionofundirectedphylogenetictreetopologies. 6Group-basedModels 6.1FourieronGroups RecallSection2,specicallythelabelraterestrictionsinEquations6and .2.Theserestrictionswerecomputationalshortcutsthatwerebegrudgingly upheldtoproducephylogenies.However,startingwith[13],analgebraically attractiveapproachwasderivedbythewayofharmonicanalysis. 29 PAGE 36 WereferspecicallytotheKimura-3parametermodeldevelopedinEquation .2.Thismodelissymmetricwith3degreesoffreedom,withanadded groupstructure.Moreprecisely,weconstructagroupdenedbytheelements f A;C;G;T g ,andwhosepermutationsaredeterminedbythemap: f 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a v )]TJ/F17 11.9552 Tf 11.955 0 Td [(b v )]TJ/F17 11.9552 Tf 11.956 0 Td [(c v ;a v ;b v ;c v g$f A;C;G;T g ThenthesetofoperationsinducedontothearrayEquation.2is: 0 B B B B B B B @ ACGT CATG GTAC TGCA 1 C C C C C C C A Todrawclosertothepoint,weattributethefollowingreversiblemapto Equation: A $ ; 0 ;C $ ; 1 ;G $ ; 0 ;T $ ; 1 ThenEquationisisomorphictothegrouptablefor Z 2 Z 2 ; popularly knownasthe Kleinfour-group .Furthermore,indeterminingtheratein Equation.2frombases k 1 ;k 2 wendthegroupelementinEquation thatisequalto k 2 )]TJ/F17 11.9552 Tf 11.955 0 Td [(k 1 ,andreversingthemapinEquation. Remark Thetwo-stateJukes-Cantormodelhasasimilarmapinto Z 2 ,and thesetofallaminoacidwords,composedofall3-letterwordsintheset ofnucleotides,areallowedasimilarmapinto 6 L n =1 Z 2 : Thateachofthese modelshaverepresentationsasanitedirectproductof Z 2 isimportant. Wewillneedtodevelopnewnotationandobjects.Let G; +beanite 30 PAGE 37 abeliangroup.Allowing T := f z 2 C : j z j =1 g ,wedenethe characters ofa group G asallmaps : G T thatsatisfy g 1 + g 2 = g 1 g 2 ; 8 g 1 ;g 2 2 G Hencethecharactersaregrouphomomorphismson G ,andthesetofall characterson G aredenedasthe dualgroup of G ,denoted ^ G ; .When ^ G = Z 2 Z 2 ,denotetheelementsin ^ G as[1 ;;; ]. ^ G ; isisomorphic to G ; +bythefollowingmap. G ^ G ; [ ; 0 ; ; 1 ; ; 0 ; ; 1] 7! [1 ;;; ] Wealsousethenotation h g; i torepresentthenaturalpairingof ^ G and G : G ^ G T ; g; 7!h g; i = g 7 Thevalueforalloperationsarelisted: h g; i := ; 0 ; 1 ; 0 ; 1 11111 1 )]TJ/F15 11.9552 Tf 9.299 0 Td [(11 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 11 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 )]TJ/F15 11.9552 Tf 9.298 0 Td [(1 1 )]TJ/F15 11.9552 Tf 9.299 0 Td [(1 )]TJ/F15 11.9552 Tf 9.298 0 Td [(11 Theconceptworksinhigherordersof G .Thedualofthedirectsum G m = L m i =1 G isisomorphicto ^ G m underthenaturalpairing: h g 1 ;g 2 ;:::;g m ; 1 ; 2 ;:::; m i = m Y i =1 h g i ; i i Giventheobjectsdenedaboveandafunction f : G C ; ,wedenethe Fouriertransform of f by: 31 PAGE 38 ^ f : ^ G C ; 7! X g 2 G h g; i f g Next,denethebinaryactionof convolution foranytwofunctions f 1 ;f 2 on G by f 1 f 2 g = X h 2 G f 1 g )]TJ/F17 11.9552 Tf 11.956 0 Td [(h f 2 h ;g 2 G Thefouriertransformhasafewinherentproperties: ^ 1 = 8 > > < > > : j G j ;if =1 0 ;otherwise and f 1 f 2 = ^ f 1 ^ f 2 Recovering f from ^ f isalsofoundbytheprocessof Fourierinversion : f g = j G j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 X 2 ^ G h g; i ^ f ; 8 g 2 G Remark Attributingasetofgroupbasedactionstorepresenttherandom walkonaMarkovmodelintroducesawealthofknowledgeconcerningharmonic analysisofgroupsthatunfortunatelywillonlybeutilizedsuperciallyto developourdesiredinvariants.Introductoryreferencestoapplicationsof harmonicanalysisofgroupswithapplicationstostatisticsinclude[10]and [21].Furthermore,whileharmonicanalysisonphylogeneticMarkovmodelsis introduced[13],andinturnproducesanumberofphylogeneticinvariants utilizingaformofgaussianelimination,amoreinterestingapproachin harmonicanalysiswillbeconsideredinstead. 32 PAGE 39 Weproposeavectordistribution v overtheelementsin G bythe following: P v g 1 ;g 2 = v g 2 )]TJ/F17 11.9552 Tf 11.956 0 Td [(g 1 ; 8 g 1 ;g 2 2 G ;v 2 V T ; Givenaleaf l ,deneallverticesinthepathfromtheroot r to l by l .Thenasetof G -valuedrandomvariables Z v v 2 V T overthevertices V T inducesasetof G -valuedrandomvariablesovertheleaves, Y l l 2 L by thefollowingsum: Y l = X v 2 l Z v Remark The"trick"totheharmonicanalysisin[13]istodevelopfourier analysisoncoordinateexpectationsbyndingarepresentationofalignedbase distributionsassummationsoverdistributionsofthe'ancestor'vertexlabels. Thefollowingdescribesthefouriertransformoncoordinateexpectations. Recalltheconstructionoftheindeterminatecoordinateexpectations p k 1 k 2 k n denedinEquation,andthepolynomialringassociatedtothem.As p k 1 k 2 k n onanyarrayofgroupelements k 1 ;k 2 ;:::;k n isnecessarilyafunction from G n intothereals,itmusthaveafouriertransform.EvansandSpeed [13]providethefollowingconstructionofitsfouriertransform.First,we deneforagivendirectedphylogenetictree,theset v ofallleavesthat are"below" v .Thatis,thereisadirectedpathfrom v to l inthetreeif andonlyif l 2 v Theorem6.1 Let p k 1 k 2 k n betheindeterminatejointdistributionofagroup basedmodelonaphylogenetictreeT.Thefouriertransformofthedistribution is 33 PAGE 40 q 1 n =^ 1 n Y v 2V T =r ^ f v Y l 2 v l withtherepresentation ^ r ^ r = X g 2 G n h 1 2 n ;g i g Theproofofwhichisreferredto[30]. Fromhere,[13]developsamethodforobtainingphylogeneticinvariants, thoughageneratingsetforallsuchinvariantsoftheKimura-3modelisnot dened. Itwasthearrivalof[1]thatsparkedtheideafortheconstructionof invariantsthatcoulddescribealocaltoplogicalfeatureofaphylogenetictree. Although[1]describedanunrestrictedMarkovmodelanddidnotproduce allinvariants,itdidhoweverinspire[Sturmfels,Sullivant;2005],whichdoes producealltheinvariantsformodelswithedgeratesrestrictedtoasingle groupstructure.Although[1]doespredate[27],itisillustrativetoshow exactlyhowthosein[27]developedageneratingsetforallgroup-based invariantsbeforeproceedingtothestill-openproblemofndingagenerating setofinvariantsforthegeneralmodel. Unfortunately,thenotationdevelopedtofullyexpressEquationisnot adequatetoconstructtheinvariantsin[27].Wewillrequirethedevelopment of ToricIdeals 34 PAGE 41 6.2ToricIdeals Therecentdevelopmentsofpolynomialringtheory,especiallyintherealmof Groebnerbases,hasledanexpansivesearchforspecialclassesofpolynmial ringideals.Onesuchclassthatisintriguingforusisthe ToricIdeal Werequiretheuseofaeld k formostinstances,weassumethat theeldis R .Fixanarrayofindeterminates x = x 1 ;x 2 ;:::;x n ,and denotethemonomials x u = x u 1 1 x u 2 2 x u n n .Considerthepolynomialring k [ x ]:= k [ x 1 ;:::;x n ].Theringisgeneratedbyallsumsandproductsover theindeterminateset f x 1 ;x 2 ;:::;x n g ,withcoecientsin k Let beasemigrouphomomorphismdenedbyanychoiceof nd dimensionalintegervectors A = a 1 ; a 2 ;:::; a n : : N n Z d ; u = u 1 ;:::;u n 7! u 1 a 1 + + u n a n Giventhissemigrouphomomorphism,wedenethehomomorphismfrom thepolynomialringin f x 1 ;x 2 ;:::;x n g tothepolynomialeld t 1 ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 1 ;t 2 ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 2 ;:::;t n ;t )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 n foranychoiceofpowervectors f u i g on f x i g bya naturalliftingof f a 1 ;a 2 ;:::;a n g intothetermpowersof t j : ^ : k [ x ] k [ t ] ;x u i i 7! t a i u i Example Thecollectionofintegervectors A = f ; 1 ; 0 ; ; 3 ; 2 ; ; 1 ; 1 ; ; 0 ; 2 g inducesthefollowingmap^ on f x 1 ;x 2 ;x 3 ;x 4 g : ^ : x 1 7! t 2 1 t 2 ;x 2 7! t 3 2 t 2 3 ;x 3 7! t 1 t 2 t 3 ;x 4 7! t 2 1 t 2 3 : Thismapgeneralizestoallpolynomialsintheeld,e.g.: 35 PAGE 42 ^ x 2 4 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 2 1 x 2 x 3 3 = t 2 1 t 2 3 2 )]TJ/F15 11.9552 Tf 11.955 0 Td [( t 2 1 t 2 2 t 3 2 t 2 3 t 1 t 2 t 3 3 = t 4 1 t 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(t 5 1 t 8 2 t 5 3 Thekernelof^ areallthosepolynomialsin k [ x 1 ;x 2 ;:::;x n ]thatare mappedto0.Thesetofallpolynomialsinthekernelofamap^ associated toachoiceofintegervectors A isanidealinthepolynomialring,denoted I A andreferredasa ToricIdeal Everyvector u 2 Z n canbewrittenuniquelyas u = u + )]TJ/F28 11.9552 Tf 12.152 0 Td [(u )]TJ/F15 11.9552 Tf 15.466 -4.338 Td [(suchthat u + and u )]TJ/F15 11.9552 Tf 16.327 -4.338 Td [(arenon-negativewithdisjointsupport,i.e.theyrepresentthe positiveandnegativedomainsof u Foreverytermorder in k [ x 1 ;x 2 ;:::;x n ] ; weconstructanitesetof vectors G ker suchthatthereducedGroebnerbasisof I A withrespect to isequalto n x u + )]TJ/F28 11.9552 Tf 11.955 0 Td [(x u )]TJ/F15 11.9552 Tf 13.975 -7.151 Td [(: u 2G o .Thereisanalgorithm: Groebnerbasisoftoricideal I A 1.Introduce n + d +1indeterminates t 0 ;t 1 ;:::;t d ;x 1 ;x 2 ;:::;x n .Let beany termorderwith f t i gf x j g 2.ComputethereducedGroebnerbasis G oftheideal D t 0 t 1 t d )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 ;x 1 t a )]TJ/F32 5.9776 Tf 0 -6.117 Td [(1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(t a + 1 ;:::;x n t a )]TJ/F35 5.9776 Tf 0 -4.838 Td [(n )]TJ/F17 11.9552 Tf 11.955 0 Td [(t a + n E 3.Return:Theeliminationideal G k [ x 1 ;x 2 ;:::;x n ] 36 PAGE 43 Remark If A isentirelynonnegative, t 0 isnotnecessaryandwecancompute theGroebnerBasisfromtheideal h x i )]TJ/F28 11.9552 Tf 11.955 0 Td [(t a i : i =1 ;:::;n i .[25] Examplecontinued ComputingtheGroebnerbasisof I A withgraded lexicographicorderyields f x 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 1 x 2 x 4 g .Thisiseasilychecked: ^ x 4 3 )]TJ/F17 11.9552 Tf 11.955 0 Td [(x 1 x 2 x 4 = t 1 t 2 t 3 4 )]TJ/F15 11.9552 Tf 11.955 0 Td [( t 2 1 t 2 t 3 2 t 2 3 t 2 1 t 2 3 =0 TheunionofallGroebnerbases G withrespecttoalexicographicorder foralllexicographicordersisthe universalGroebnerbasis ,andisa generatingsetof ker ^ Nowthatwehaveasolidideaonwhattoricidealsare,wecanconstruct invariantsthatmanipulateEquationtocomputetoricidealsofthe fouriertransformsofexpectationcoordinates.In[27],thismethodconstructs invariantsbysplittingatreetopologyintoasetoflocaltopologies. 6.3Labelings Weproduceinvariantsofthegroup-basedmodelbyassigninglabelstothe edgesandproducingamapfromthesetoffouriercoordinates q 11 1 ;q 11 ;:::;q toasetofindeterminatesrepresentingthe labelingsofanedgeforeachedge.Thepolynomialsinthetoricideal constructedthiswayarethenreturnedtotheringofexpectationcoordinates R [ p AA A ;p AA C ;:::;p TT T ],whicharethesetofinvariantsinthegroup-rate model.Thiswillbemadeexplicitsoon. 37 PAGE 44 Figure7:Arooted3-leaftree Foreachcoordinate q 1 ; 2 ;:::; n in q 11 1 ;q 11 ;:::;q ,weassociate thefourierleaflabelings 1 ; 2 ;:::; n toleaves l 1 ;l 2 ;:::;l n Weattributetoeachedgeonaphylogenetictreewithafourierleaf labelings,afourierlabelingdeterminedbytheleavesproceedingtheedge. Recallthedenitionof e asthesetofallleavesdirectedawayfromthe edge e .Foreachedge e 2 E T ,let g e betheassociatededgelabeldened by: g e = Y l 2 e l Example ConsiderFigure.3,andletthefourierlabelingontheleaves l 1 ;l 2 ;l 3 be ;; ,thentheinducedlabelingsoftheedgesare g e 1 = l 1 =1 ;g e 2 = l 2 = ;g e 3 = l 3 = 38 PAGE 45 g e 4 = l 2 l 3 = = Foreach leaf-wisecoordinate q 1 2 n ,weproduceaninjectivemapinto asetofindeterminateswithcoordinatesassociatedtothefourieredgelabels inducedbyEquation.Letting m = j E T j ,wedenotethese edge-wise coordinates by q g e 1 g e 2 g e m ,suchthatfor e i = f l i g ; 8 i =1 ;:::;n .Thatis, welabelthoseedgesadjacenttoleaveswiththeirassociatedleaf-labels. Recallingthetransformintofouriercoordinates,Equation,thismap cannowbeconstructedasaproductofthefouriertransitionfunction ^ f actingonthefourieredgelabelsconstructedabove. q 1 n $ q g e 1 g e 2 g e m = Y e 2 E T ^ f e g e Bythisconstruction,theinjectivityofthemapfromleaf-coordinateindeterminatestoedge-coordinateindeterminatesismadeexplicit.Itshouldalso benotedthatthisproductdoesnotrelyontherootrepresentation^ ,and alledges e havedirectionawayfromtheinteriornodeclosesttotheroot node.Thisdistinctionwillsoonbedoneawaywith,howeverforcaseson subtrees,andedgeswillbedirectedawayfromachosennode. Alongwiththosegrouplabelsandcharacterlabels,wedeneathirdset oflabels L and labelingfunction L : ^ G !L Recallthatwerequireidenticalgroup-basedratematricesattachedto theMarkovmodelonthephylogenetictree,orelsetheisomorphismfrom transitionprobabilitiestogroupactionsdoesnotexist.Weidentifyunique labelsforeachuniqueelementinthearray.ForJukes-Cantor,thisarray 39 PAGE 46 [1 )]TJ/F15 11.9552 Tf 12.606 0 Td [(3 ;;; ]hasonlytwouniquelabels,whileinKimura-3,thearray [1 )]TJ/F15 11.9552 Tf 12.272 0 Td [( + + ;;; ]hasauniquelabeltoeachelement.ForKimura-3, whichisthemodelweareinterestedin,thisonlycreatesanotherbijection betweenlabelings,andistrivial.However,itshouldbenotedthatwerewe tochooseamodel,we'dbeforcedtoconsider linearinvariants ,whichare coordinateequalitiesthatareattributedtoequivalentlabelings.Thosemodels withlinearinvariantsmustthencreateinvariantsmodulototheidealof linearinvariants.Thepurposeofcreating L -labelsisthentodeterminethose linearinvariants. Fromthelabelfunction L inEquation,weinduceamap L T from theedge-coordinatelabelingsabovetoanarrayofedge L -labels,dened: L T : ^ G m !L m ; : g e 1 ;:::;g e m 7! L g e e 2 E T = L g e 1 ;L g e 2 ;:::;L g e m Weconsiderforalledges e 2 E T andlabels l 2L ,weproposethe indeterminatevariables a e l ,termed edge-labelindeterminates .Thisinducesa maprelatingeachfourieredge-coordinatetothepolynomialringofedge-label indeterminates. C [ q g e 1 g e m : 1 ;:::; n 2 ^ G n ] C [ a e l : e 2 E T ;l 2L ] ; q g e 1 g e m 7! Y e 2 E T a e l L g e : Thenthemapfromleaf-coordinateindeterminatestoedge-labelindetermi40 PAGE 47 nantsisthecompositionofmapsEquationsand: C [ q 1 n ] [ q : 2 im L T ] C [ a e l : e 2 E T ;l 2L ] : Theedge-labelindeterminatesforagivenedge-wisefouriercoordinate isthefourierequivalenttoasinglehistoryasinEquation.We areonlyinterestedinthealgebraicmanipulationtheyallow.Themapof fourierleaf-coordinatestotheedge-labelindeterminatesinEquationis exactlythemapwhoseassociatedtoricidealwewishtocompute.More precisely,thosebinomialsin[ q : 2 im L T ]thatreturn0whenmapped into C [ a e l : e 2 E T ;l 2L ].Theunionofallidealsacrossthechoiceofleaf labelsisthesetofallphylogeneticinvariantsforgroup-basedmodels. Sogiventhetransformationfromgroup-coordinatestofouriercoordinates, andfouriercoordinatestoedge-labelindeterminates,wewantthekernel.The naiveapproachwouldbetouseBuchberger'salgorithmoranotheralgorithm tocomputeGroebnerbases,whichwouldwork,butisadistastefulformany reason. Groebnerbasiscomputationis slow ,prohibitivelyslow.Tobeavoided asidefromthesimplestcases. Providesusnorealintuitionastothetopologicalinformationthe phylogeneticinvariantsmaybecapableofcommunicating. [27]hasalreadysolvedthersttwoproblemswiththeirapproachof attenings 41 PAGE 48 6.4Group-basedFlattenings Withtheconceptofatteningscomesthedirtytruthaboutourconstruction overdirectedphylogenetictrees.Theconstructiondoesnot,infact,relyupon adirection.Thisisadvantageous,however,asourpurposesinthispaperis todeducethetopologyoftheundirectedphylogenetictree,sincepractically wehavenorealintutitionastowherethepositionofthecommonancestor isinrelationtothetaxa.Forthepurposesofpropagatinglabelsfromleaves, wedirectpropagationawayfromtheedgesandrootswechoosetoseparate oursubtreesalong. Let T beaphylogenetictree,weconstructforanedge e 2 E T the uniquepairofconnectedcomponentsubtreeswhoseunionofedgesis E T andwhoseintersectionisstrictly e ,andchooseanarbitrarydirectionaway from e thatwilldene e .Notethatforourpurposes,wewillonlychoose interioredges.Denethesesubtrees T e; + and T e; )]TJ/F15 11.9552 Tf 15.529 1.793 Td [(bytheirrespectiveedge sets: E T e; + := e [f e g E T e; )]TJ/F15 11.9552 Tf 7.085 1.793 Td [(:= E T )]TJ/F15 11.9552 Tf 11.955 0 Td [( e [f e g Thevertexsets V T e; + ;V T e; )]TJ/F15 11.9552 Tf 7.084 1.794 Td [(areobtainedfromtheunionsofthevertices fromtheedgesets.Itisclearthatthechoiceofdirectionthatdenes e isarbitrary. Example ObserveFigure.4,whichistheresultofsplittinga6-leafunrooted treealongtheedge v 8 ;v 9 ,producing T e; + and T e; )]TJ/F15 11.9552 Tf 7.084 1.793 Td [(.Labelingeachsubtree 42 PAGE 49 Figure8:Subtrees E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.794 Td [(fora6-leaftree T andedge v 8 ;v 9 as T e; + or T e; )]TJ/F15 11.9552 Tf 16.608 1.793 Td [(isdependentonanarbirtrarychoiceofdirectioninthe undirectedtree T Theactionofobtainingapairofsubtreesinthismannerwillbereferred toasa edgeattening along e .Themeaningofwhichwillbecomemore clear.Next,wedenetheimagesofthelabelingfunctiononthesesubtrees, andrequirepairsoflabelingstobe consistent : Denition Wesayachoiceoftwolabelings )]TJ/F36 11.9552 Tf 11.385 -4.338 Td [(2 im L T e; )]TJ/F15 11.9552 Tf 6.752 -3.264 Td [( ; + 2 im L T e; + are consistent if )]TJ/F15 11.9552 Tf 7.084 -4.338 Td [([ e ]= + [ e ]. Thatis,werequirethoselabelingsontheleafsetsofoursubtreesto propagateontoourchosenedge e withthesamelabel.Theintuitionis sound,astherandomwalkalongthecorrectphylogenetictreedescribedas groupactionsmust,overtime,produceconsistentactionsalongtheinterior edgestoobtainlabelingsontheleaf-setovertime. 43 PAGE 50 Figure9:Asampleconsistentlabelingandedgepropagationon E T e; + E T e; )]TJ/F15 11.9552 Tf 7.085 1.794 Td [( Althoughitisnotrequired,wewillusethetableaunotationof[27]to developourinvariants.Let f l i :1 ; 2 ;:::;d g forthefollowing tableau notation representasetof L -labelsontheedge-setforthetree T .Foranygiven d 2 N ,theedge-coordinatemonomial M = q l 1 q l 2 q l d hasanassociatedtableau: M = 2 6 6 6 6 6 6 6 4 l 1 l 2 l d 3 7 7 7 7 7 7 7 5 Let T 1 ;T 2 ;T 3 beconnectedcomponentsubgraphs,andtheirrespective consistentedgelabelings l;m;n .Let f l i ;m i ;n i ;i =1 ;:::;d g beacollection ofconsistent,notnecessarilyuniqueedgelabelings.Wedenotetheproduct of L -labeledge-coordinates q l 1 ;m 1 ;n 1 q l 2 ;m 2 ;n 2 q l d ;m d ;n d asthemonomial M with tableaunotation: 44 PAGE 51 M = 2 6 6 6 6 6 6 6 4 l 1 m 1 n 1 l 2 m 2 n 2 . . l d m d n d 3 7 7 7 7 7 7 7 5 Usingthisnotation,wehavethefollowing. Lemma6.2 Supposeforthesubtrees T 1 ;T 2 ;T 3 havetheedgesets e [ f e g ; f e g ;E T )]TJ/F15 11.9552 Tf 12.715 0 Td [( e [f e g ,andlet l 1 ;m;n 1 and l 2 ;m;n 2 betwonot necessarilydistinctconsistentlabelingsof T 1 ;T 2 ;T 3 .Then g = 2 6 4 l 1 mn 1 l 2 mn 2 3 7 5 )]TJ/F39 11.9552 Tf 11.955 27.616 Td [(2 6 4 l 1 mn 2 l 2 mn 1 3 7 5 liesintheideal I T;L Proof Thelabelings l 1 ;m and l 2 ;m areconsistentlabelingsforthesubtree T e; )]TJ/F15 11.9552 Tf 7.084 1.793 Td [(,andlabelings m;n 1 and m;n 2 areconsistentlabelingsforthesubtree T e; + .[27]. Thesetofallpossiblebinomialsconstructedbyconsistentlabelingson edgeatteningswithrespecttoagiven e aredenoted Quad e;T Next,weconstructthesetofconnectedcomponentsubtrees T 1 ;T 2 ;T 3 for eachinternalvertex v 2 V T bytheunorderedsetoftreesthatsatisfythe condition: V T 1 V T 2 = V T 1 V T 3 = V T 2 V T 3 = f v g ,asinFigure .4: Choosingsubtreesinthismanneriscalled vertexattening .Muchas weobtainedinvariantsbyachoiceofedgeattenings,wedothesamefor 45 PAGE 52 Figure10:Avertexatteningona6leaftree vertices. Denition Chooseavertex v withvalence c .Thenthesubtreecontaining onlythoseedgesadjacentto v istheclawtree K 1 ;c : Notethatforourpurposes,weonlychooseinteriorverticesonbifurcating trees,hence c =3. Lemma6.3 [27]Let T 1 ;T 2 ;:::;T c betheconnectedcomponentsubtreesof T denedbythepropertythat 8 i 6 = j;V T i V T j = v .Thatis,we constructoursetofsubtreesbyseparatingthemattheirjointvertex.Let f l 1 1 ; ;l c 1 ;:::; l 1 d ; ;l c d ; m 1 1 ; ;m c 1 ;:::; m 1 d ; ;m c d g beacollectionof consistentlabelingsontothesesubtrees.Thenthefollowingisabinomialin tableaunotation 46 PAGE 53 g = 2 6 6 6 6 4 l 1 1 l c 1 . . l 1 d l c d 3 7 7 7 7 5 )]TJ/F39 11.9552 Tf 11.955 38.377 Td [(2 6 6 6 6 4 m 1 1 m c 1 . . m 1 d m c d 3 7 7 7 7 5 Thesetgeneratedbytheconstructionofallsuchbinomialswithconsistent labelingsonthesubtreesassociatedtotheatteningon v isdenoted Ext B v T ,andisasubsetof I T;L .Furthermore,foreachrow i ,column j ,consider labelings L i j ;M i j 2 Im L T v;e j ;l j i suchthattheset L j i d j =1 isequaltothe multiset M j i d j =1 .Thenthebinomial g = 2 6 6 6 6 4 L 1 1 L c 1 . . L 1 d L c d 3 7 7 7 7 5 )]TJ/F39 11.9552 Tf 11.955 38.376 Td [(2 6 6 6 6 4 M 1 1 M c 1 . . M 1 d M c d 3 7 7 7 7 5 liesin Ext B v T ,andfromallchoicesofconsistentlabelingsand arrangementofmultisets,alltheassociatedbinomialsgenerate Ext B v T Thisisreallypretty.Asweseehere,byatteningthetreebyeach vertex,we'reabletoproduceinvariantsbasedontheindependenceofthe edgelabelingsfromeachother.Wedenoteby B v thegeneratingsetfor thesevertexatteningsforeach v .Thenextlemmasettlesoursearchfor groupbasedinvariants. Lemma6.4 Let T beatreewithconsistentlabelingand e anyinterior edge.Thengiventheconstructionoftheobjectsdescribedabove,theideal I T;L isexactly [ v Ext B v T [ [ e Quad e;T : andgivenatermorder,thebinomialconstructionsabovearealsoauniversal 47 PAGE 54 Groebnerbasis.[27] Thislemmaprovesthatthepreviousconstructionsofinvariantscreatea generatingsetforinvariantsofthefamilyofKimuramodels,andwhats more,determinesallinvariantsbytheirlocalstructuresalone.Thisisincrediblyuseful.Weshallseemoreinvariantsthataredeterminedbytheirlocal topologies,andthatwhenconsideringlocaltopologies,wecandiscusstrees thatareverymuchlikeeachother. Awatchfuleyemayhavenoticedthatwhilewecancreatebinomialsas aboveuptoadegree d ,wehavenotmentionedwhat d maybe.Thefollowing theoremputsthistorest: Theorem6.5 [27]Let T beanarbitrarybinaryrootedtree.Modulothe trivialinvariant q 00 0 )]TJ/F15 11.9552 Tf 11.955 0 Td [(1 1.thephylogeneticinvariantidealoftheJukes-Cantorbinaryisgenerated bypolynomialsofdegree2 2.thephylogeneticinvariantidealoftheJukes-CantorDNAmodelis generatedbypolynomialsofdegree1,2,and3 3.thephylogeneticinvariantidealoftheKimura2-parametermodelis generatedbypolynomialsofdegree1,2,3and4 4.thephylogeneticinvariantidealoftheKimura3-parametermodelis generatedbypolynomialsofdegree2,3and4 Themethoddescribedincomputingpolynomialsofdegree d forthevertex atteningsareallowedtobeboundedby d =4toobtainwithaterm orderingaHilbertbasisforthissubsetofpolynomials. 48 PAGE 55 Remark Allthiswasstartedbythepropositionin[1]thatinvariantscould bededucedfromexploitingthelocaltopologyofaphylogenetictree.Thatall thegroup-basedinvariantsweredeterminedbythisnotionisratherincidental, asthegroup-basedmodelisveryrestrictiveonthepredictivecapabilities. Rememember,wewereforcedbythemodeltoxalledgestothesame length. 7GeneralStochasticModel ThissectionreturnstothelanguageofMarkovmodelsandawayfromspectral analysis.Forreference,weconsiderthestructureforageneralstochastic Markovmatrix. 0 B B B B B B B @ 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(d )]TJ/F17 11.9552 Tf 11.955 0 Td [(g )]TJ/F17 11.9552 Tf 11.955 0 Td [(jabc d 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(a )]TJ/F17 11.9552 Tf 11.955 0 Td [(h )]TJ/F17 11.9552 Tf 11.955 0 Td [(kef gh 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(b )]TJ/F17 11.9552 Tf 11.955 0 Td [(e )]TJ/F17 11.9552 Tf 11.955 0 Td [(li jkl 1 )]TJ/F17 11.9552 Tf 11.955 0 Td [(c )]TJ/F17 11.9552 Tf 11.955 0 Td [(f )]TJ/F17 11.9552 Tf 11.955 0 Td [(i 1 C C C C C C C A UnliketheKimuramodels,therearenoidentiablealgebraicstructuresto attachmachineryto;theconstructionofalgebraicinvariantsforthegeneral MarkovmodelcarriedoutviamanipulationofthelinearalgebraonMarkov matrices. Remark Therstconstructionofinvariantsarethosefoundby[5]and[20], werebuiltforthegeneralstochasticmodel.Thesewereimportantintroductory discussions,theinvariantswhenevaluatedoverdatahavebeenfoundtomake weakstatisticalmetrics.Itisgenerallyassumedthattheseandotherlinear invariants.Thoseinvariantswithtotalpolynomialdegree=1,areoflittle predictivevalue[12]. 49 PAGE 56 Consideragainthecoordinatefunctionevaluatedoverthecollectionof histories: f l 1 l 2 l n k l 1 ;k l 2 ;:::;k l n = X k v 2 H r k r Y v 2 V T = f r g M v v k v ;k v Thisisarealfunctionoverthe n dimensionalspaceofthevaluesateach leaf.RecallthedistributionfunctionEquation,whoseimageisthearray p AA AA ;p AA AC ;:::;p TT TT .Wedeneabijectivefunctionfrom im ~ F to thespacetothe n -tensorproductof n rank-4arrays[1]. E l 1 l 2 l n l 1 ;l 2 ;:::;l n : p AA AA ;p AA AC ;:::;p TT TT 7! 0 B B B B B B B @ p AA AA p AA AC p AA AG p AA AT 1 C C C C C C C A 0 B B B B B B B @ p AA CA p AA CC p AA CG p AA CT 1 C C C C C C C A 0 B B B B B B B @ p TT TA p TT TC p TT TG p TT TT 1 C C C C C C C A Ifwelabelasingleleaf l j by k j 2f A;C;G;T g E mapstothe n )]TJ/F15 11.9552 Tf 12.073 0 Td [(1 tensor: E l 1 l 2 l n l 1 ;l 2 ;:::;l j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ;k j ;l j +1 ;:::;l n : p AA AA ;p AA AC ;:::;p TT TT 7! 0 B B B B B B B @ p AA k j AA p AA k j AC p AA k j AG p AA k j AT 1 C C C C C C C A 0 B B B B B B B @ p AA k j CA p AA k j CC p AA k j CG p AA k j CT 1 C C C C C C C A 0 B B B B B B B @ p TT k j TA p TT k j TC p TT k j TG p TT k j TT 1 C C C C C C C A Achoiceoflabelings k j 1 ;k j 2 ;:::;k j h 2f A;C;G;T g h foranychoiceof h 50 PAGE 57 leavesreturnsan n )]TJ/F17 11.9552 Tf 12.322 0 Td [(h -tensor.Whenallleavesareattributedalabeling k 1 ;k 2 ;:::;k n ,thetensorreturnstheassociatedindeterminateexpectation E l 1 l 2 l n k 1 ;k 2 ;:::;k n = p k 1 k 2 k n : Lettheleaf-labelat l j inducethemarginaldistributionover l 1 ;:::;l j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ;l j +1 ;:::;l n be: E l 1 l 2 l n k 1 ;k 2 ;:::;k j )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 ; ;k j +1 ;:::;k n = X k j 2f A;C;G;T g p k 1 k 2 k j )]TJ/F32 5.9776 Tf 5.756 0 Td [(1 k j k j +1 k n Thenforalabeling k j 1 ;:::;k j n )]TJ/F32 5.9776 Tf 5.756 0 Td [(2 : k j 2 [ A;C;G;T; ] onachoiceof n )]TJ/F15 11.9552 Tf 12.442 0 Td [(2leavesreturnsa4 4matrixwithentriesin R [ p A A ;p A C ; ;p T T ]. Bymanipulatingthesematricesweproducecertainequalitiesandinturn, phylogeneticinvariants.Fornotationalpurposes,wedenethedistribution X by X k 1 k 1 k n := E l 1 l 2 l n k 1 ;k 2 ;:::;k n ;k i 2f l i ;A;C;G;T; g 7.1VertexFlattenings Werefertoasampleconstructionofinvariantsona4-taxaphylogenetic tree,thengeneralizetothosetreeswith 4taxa.Inthiscase,werefer totheevaluationofthetensoronthelabels k 1 ;k 2 ;k 3 ;k 4 2f A;C;G;T g 4 WerefertothetreeinFigure.1denedbythedirectededges E T = [ l 1 ;v 5 ; v 5 ;l 2 ; v 5 ;v 6 ; v 6 ;l 3 ; v 6 ;l 4 ].Wechoose l 1 asthe"root"',thatis,we directallMarkovmatricesasactingawayfrom l 1 ,althoughwewillevaluate l 1 asataxon. 51 PAGE 58 Figure11:Unrooted4-taxonphylogenetictree. X k 1 k 2 k 3 k 4 = l 1 k 1 X k 5 2f A;C;G;T g M l 1 v 5 k 1 ;k 5 M v 5 l 2 k 5 ;k 2 X k 6 2f A;C;G;T g M v 5 v 6 k 5 ;k 6 M v 6 l 3 k 6 k 3 M v 6 l 4 k 6 ;k 4 D l 1 = 0 B B B B B B B @ l 1 A 000 0 l 1 C 00 00 l 1 G 0 000 l 1 T 1 C C C C C C C A Denethediagonalmatrix P v 5 ;l 3 l 4 ; ij whichassociatesgivenleaves l 3 ;l 4 to values i;j ,respectivelybytheentryvalues 52 PAGE 59 Fork 2f A;C;G;T g ; P v 5 ;l 3 l 4 ; ij k;k = 4 X m =1 M v 5 v 6 k;m M v 6 l 3 m;i M v 6 l 4 m;j Fork 6 = l; P v 5 ;l 3 l 4 ; ij k;l =0 Denefor i;j;k;l 2f A;C;G;T g X l 1 l 2 ij = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; ij M v 5 l 2 X l 1 l 2 kl = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 X l 1 l 2 = D l 1 M l 1 v 5 M v 5 l 2 Thefollowingisaresultoflinearalgebgra: X l 1 l 2 ij X )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 l 2 X l 1 l 2 kl = = D l 1 M l 1 e P e;l 2 l 3 ; ij M v 5 l 2 M )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 v 5 l 2 M )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 v 5 D )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; ij P v 5 ;l 3 l 4 ; kl M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl P v 5 ;l 3 l 4 ; ij M v 5 l 2 = D l 1 M l 1 v 5 P v 5 ;l 3 l 4 ; kl M v 5 l 2 M )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 v 5 l 2 M )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 v 5 D )]TJ/F31 7.9701 Tf 6.587 0 Td [(1 l 1 D l 1 M l 1 e P e;l 2 l 3 ; ij M v 5 l 2 = X l 1 l 2 kl X )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 l 2 X l 1 l 2 ij Wemakethesubstitution X )]TJ/F31 7.9701 Tf 6.586 0 Td [(1 l 1 l 2 = 1 det X l 1 l 2 Cof X l 1 l 2 T 53 PAGE 60 X l 1 l 2 ij Cof X l 1 l 2 T X l 1 l 2 kl = X l 1 l 2 kl Cof X l 1 l 2 T X l 1 l 2 ij X l 1 l 2 ij Cof X l 1 l 2 T X l 1 l 2 kl )]TJ/F17 11.9552 Tf 19.261 0 Td [(X l 1 l 2 kl Cof X l 1 l 2 T X l 1 l 2 ij =0 ThelefthandsideofEquationhas16entriesin R [ p AA A ;p AA C ; ;p TT T ],allofwhichmustbecongruenttozero.Hence wehave16algebraicinvariantsofdegree5. Next,wegeneralizethisconstructiontotreeswith 4taxa,withan emphasisastohowtheseareattenings.Chooseaninternalvertexand removethisvertexfromthetree.Thevalencyofinternalverticesareexactly 3,hencetheremovalreturns3disjointconnectedsubtrees.Choosetwo subtrees,andfromeachtreechooseexactlyoneleaf,denoted a 1 and a 2 : Then fromtheremainingtree,ordertheleaves a 3 ;:::;a m .Thenforanylabelings i 3 ;:::;i m ; j 3 ;:::;j m 2f A;C;G;T g m )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 X l 1 l 2 i 3 :::i m Cof X l 1 l 2 T X l 1 l 2 j 3 :::j m ::: = X l 1 l 2 j 3 :::j m ::: Cof X l 1 l 2 T X l 1 l 2 i 3 :::i m Thisisoneofahandfulofconstructionsin[1].Notethattheconstruction isdeterminedbythelocaltopologyaroundachoseninteriorvertex. 7.2EdgeFlattenings Denition Denean edgeattening ofanedge e onthetensor X l 1 l 2 l n asdenedabovebyremoving e fromtheedgesetof E T andreturning 54 PAGE 61 thetwoconnected,componentsubtrees T 1 and T 2 .Letthe split inducedby theatteningoforderedleavesbethepairofleafsets L T V T 1 and L T V T 2 .Reordertheleavessuchthat: f l 1 ;:::;l m g = L T V T 1 f l m +1 ;:::;l n g = L T V T 2 The atteningof X on e isthe4 m 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m matrix F = Flat e X denedforallmatrixpositions i 2f 1 ;:::; 4 m g ;j 2f 1 ;:::; 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m g ,asfollows: Ordertheelementsin J 1 = f A;C;G;T g m ;J 2 = f A;C;G;T g n )]TJ/F34 7.9701 Tf 6.586 0 Td [(m suchthat u 1 u 2 u m = J 1 [ i ] ;v 1 v 2 v n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m = J 2 [ j ]andset: F i;j = X u 1 ;:::;u m ;v 1 ;:::;v n )]TJ/F35 5.9776 Tf 5.756 0 Td [(m = p u 1 ;:::;u m ;v 1 ;:::;v n )]TJ/F35 5.9776 Tf 5.756 0 Td [(m : Thisconstructionreturnsanarrayofallindeterminateexpectations. Althoughwehavemadeitexplicitforlabelingsinthesetofnucleotides, theconstructionofatteningsareeasilygeneralizedtoanynumberofnite states.Weprovideanexampleinthe2-statecase. Example Considerthe5-taxonunrootedtreeasinFigurewithleaflabels in f 0 ; 1 g .Fromlookingatthepicture,weseetwonaturaledgeattenings onthe2 2 2 2 2produced.Thesplitsinducedbytheinternaledge atteningare ff l 1 ;l 2 g ; f l 3 ;l 4 ;l 5 gg and ff l 1 ;l 2 ;l 3 g ; f l 4 ;l 5 gg ,withmatrices 55 PAGE 62 Figure12:A5-leafunrootedbinarytree 0 B B B B B B B @ p 00000 p 00001 p 00010 p 00011 p 00100 p 00101 p 00110 p 00111 p 01000 p 01001 p 01010 p 01011 p 01100 p 01101 p 01110 p 01111 p 10000 p 10001 p 10010 p 10011 p 10100 p 10101 p 10110 p 10111 p 11000 p 11001 p 11010 p 11011 p 11100 p 11101 p 11110 p 11111 1 C C C C C C C A 0 B B B B B B B B B B B B B B B B B B B B @ p 00000 p 00001 p 00010 p 00011 p 00100 p 00101 p 00110 p 00111 p 01000 p 01001 p 01010 p 01011 p 01100 p 01101 p 01110 p 01111 p 10000 p 10001 p 10010 p 10011 p 10100 p 10101 p 10110 p 10111 p 11000 p 11001 p 11010 p 11011 p 11100 p 11101 p 11110 p 11111 1 C C C C C C C C C C C C C C C C C C C C A DevelopinganedgeatteningbyperturbingtheindeterminatetensornaturallyleadsustoquestionexactlywhattheMarkovmodelforthisconstruction 56 PAGE 63 is.Letthegraphicalmodelinitiatefromanode r andbifurcateimmediately intotwoedgeswhoseterminalnodesrepresentstatesin f A;C;G;T g n and f A;C;G;T g n )]TJ/F34 7.9701 Tf 6.586 0 Td [(m .ThentheMarkovmatricesassociatedtotheseedgesare M 1 and M 2 withsizes4 4 m and4 4 n )]TJ/F34 7.9701 Tf 6.587 0 Td [(m ,andtheMarkovmatrixproduct thatdenes Flat e X is Flat e X = M T 1 D r M 2 Where D r isthediagonalizationoftherootdistributionvector.Since theranksof M 1 and M 2 areatmost4,matrix Flat e X musthaveat mostrank4.Thisleadsustothenubofthisconstruction.Since Flat e X hasrank 4,thenevery5 5minorof Flat e X musthavedeterminant zero.Thedeterminantisadegree-5polynomialintheringofindeterminants. Hence,bytakingthedeterminantofany5 5minorof Flat e X we computephylogeneticinvariants,implyingthattheseinvariantswhenevaluatedunderdatawilltellusinformationaboutwhetherthesplittingofthe leavesinthiswayisvalidforthebestttopologyofthephylogenetictree[2]. Remark PleasenoteagainthatthisconstructionofMarkovmatricescanbe generalizedtoanyMarkovtreemodelwithanynitenumberofstatesat theleaves.Thenforleaveswith numberofstates,wemustndthe determinantsofthe +1 +1minorsofanedgeatteningthatwill returna +1-degreealgebraicinvariant. 7.3SuciencyofFlattenings Thelastthingtodiscussisexactlyhowdenseconstructionsofinvariants byvertexandedgeatteningsareintheidealofinvariantsforthegeneral 57 PAGE 64 Figure13:Visualrepresentationofanedgeatteningalongedge v 9 ;v 10 Markovmodel.Withoutgoingtoodeepintothealgebra,wementionafew keyfacts. Conjecture7.1 [2]Foranynumberofstates andandnumberoftaxa n thephylogeneticideal I T forthegeneralMarkovmodelonabinary n -taxon tree T isthesumoftheidealsassociatedtotheatteningsof X l 1 l 2 l n at theverticesof T Thatedgeatteningsareactuallyaspecialcaseofvertexatteningsis madeclearerin[2],andanimportantcorollarythebinarystatecaseis deduced. Corollary7.2 Theideal I T ofallphylogeneticinvariantsforthephylogenetic treemodelwithbinarystatesisgeneratedbythe 3 3 minorsforalledge atteningsofthemodel'sassociated n -tensor. 58 PAGE 65 Remark Thenotionofvertexatteningsismoreformal,andrequiresthe useofthesecantvarietiesin[19].In[2]itismaderelativelyclearthat theidealofinvariantsonthegeneralMarkovphylogenetictreemodelis foundbyconstructingallinvariantsofvertexandedgeattenings,muchas in[27].However,aconstructionofageneratingsetofvertexatteningsis notdeduced,andwhileagreatmanyofinvariantscanbeproduced,nding ageneratingsetfor I T isstillanopenproblem.Pleasereferto[19]and [2]foracleardescriptionofhowweknowthatvertexandedgeattenings generatetheideal I T Whilethesuciencyofatteningsisnotcompletelyunderstood,wedo havethemethodstoproduceagreatnumberofthem,wehaverelative condenceintheirstatisticalvalue,andinpracticalcasesitshardtobelieve thatwe'llrequirethegeneratingsetof I T todevelopusefulstatistics.Recall thattheemphasisofourinvariantshasbeentoobtain local information. Thisnotionwillbecomecleareraswediscussthespaceoftrees. 8SpaceofTrees Eachconstructionofinvariantsthatwe'vediscussedarederivedfromasubset ofthetopologicalpropertiesofthephylogenetictreeassociatedtoit.These invariantswhenevaluatedundertheobserveddatacanbeusedtoconstruct statisticsofeachtree,andthebestttreeshouldhavetheoptimumstatistic, thisisusuallytheminimumstatisticwhenthestatisticisametriclike L 1 overallinvariants.Thenthenextproblemincomparingtreesishowto mostecientlycomparethem,soastoreducethetotalnumberofinvariants computedforanysetofobserveddata.Forsomenaivetrials,suchasthose abovein[4],invariantsarecomputedforalltrees,theinvariantsevaluated 59 PAGE 66 Figure14:Thespaceofall4-taxonunrootedbinarytrees.Alltreesinspace collapsetheiredgeinto T 0 andextendtobecomeanothertree overtheobserveddata,thenastatisticiscomputedfortheseinvariantsto comparealltreestoeachother.Wemayalsocomparesubsetsoftrees basedontheirlocalfeaturesandchoosetheoptimaltree.Thatis,there isasystematicwaytoorganizetreesbyadistanceawayfromthesolutiontree. Werefertothe spaceoftrees T n asthesetofallundirectedbinary n leaftrees.Itshouldbeagainnotedthatthereexistsanaturualbijection fromundirectedbinary n leaftreestothedirectedbinary n )]TJ/F15 11.9552 Tf 11.955 0 Td [(1leaftrees. Denition Denetheactionof collapsing anedgebydeletinganinternal edgeanddeningitstwoadjacentverticesasonevertex.Denetheaction of extending anedgeonavertexofvalence4bypairingtheadjacentfour edgesintotwoconnectedsubgraphs,thenjoiningthesubgraphswithanedge thatisadjacenttoallfourinitialedges.Denea rotation bytheactionof successivelycollapsingandextendinganinternaledge.Denethe rotational 60 PAGE 67 distance betweenanytwotreesastheminimalnumberofrotationsnecessary totransformone n -leafunrootedbinarytreetoanother. Properties Thereareexactly )]TJ/F31 7.9701 Tf 9.968 -4.977 Td [(4! 2!2! = 2=3uniquerotationsforanygiven internaledgeinabinarytree,thenumberofunorderedchoicesofasetoftwo inasetoffour.Foranundirectedphylogenetictreewith n 3taxa,there areexactly n )]TJ/F15 11.9552 Tf 12.118 0 Td [(3internaledges,hence3 n )]TJ/F15 11.9552 Tf 12.118 0 Td [(9numberofrotationspossible. Since1outof3rotationsisanidentitymap,thereare2rotationsforeach edgethatreturnanewtree,hencethereare2 n )]TJ/F15 11.9552 Tf 11.93 0 Td [(6numberofuniquetrees neighboringanytree. Remark Wenowhavearelationbetweentrees.Wesaythattwophylogenetic treesare neighbors ifthereexistsasinglerotationthatmapsonetreeto another.Itisclearthatthismapisinvertiblebyasinglerotation.Here,the ideaof localtopology becomesclear.Thelocaltopologyisasubsetofvertices andedgesthatdenehowconnectedcomponentsubtreesofaphylogenetic treearerelatedtoeachother,andwhatrotationsmaptopologiestoeach other. Givensetsofmapsthatrelatetreestoeachother,wecandenea graphicalspacewithuniquetreetopologiesasnodesconnectedbyedgesifthe associatedtreesarerelatedtoeachotherbyarotation.Denotethisspace by T n .Thisspacehasanumberofniceproperties. Wedeneamapfromtheinternaledgesofasingleundirectedtree IE T totherealspace R n )]TJ/F31 7.9701 Tf 6.587 0 Td [(3 bythefollowingmap. D : IE T R n )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 ; v i 1 ;v t 1 ;ie 2 ;:::;ie n )]TJ/F31 7.9701 Tf 6.587 0 Td [(3 7! d ie 1 ;d ie 1 ;:::;d ie n )]TJ/F31 7.9701 Tf 6.586 0 Td [(3 ; 61 PAGE 68 where d ie = d v ;v isarealmetricwiththeproperty: d v 1 ;v 3 = d v 1 ;v 2 + d v 2 ;v 3 Theimageofthismapisanonnegativeconvexpolyhedraofdimension n )]TJ/F15 11.9552 Tf 12.092 0 Td [(3,andwedescribetheimageforatree'sinternaledgesisthetree's orthant .Weconstructspacesbythismapforeachtreein T n .Each n )]TJ/F15 11.9552 Tf 9.946 0 Td [(4-dim. faceofthetreeorthantcorrespondstoabinarytreewithasinglecollapsed edge.Theedgewhichismeasuredalongtheaxisat0.Thereareexactly threetreesthatcollapseaninterioredgetothisnon-binarytree,hencethe orthantsofthethreetreessharethisface.Wethenallowthefacesofthe orthantsforallpossible n )]TJ/F15 11.9552 Tf 11.429 0 Td [(5!!undirectedbinarytreestobeequivalentin T n iftheirrespectivefacesarecorrespondingtothesametree,deningan equivalencerelationbetweenthefacesoftreeorthants. Denition Inthespaceoftreesasaconvexpolyhedra,theunionofall pointswherethecoordinatesummetric L 1 isexactly1isdenedasthe linkoftheorigin .Thelinkoftheorigininanyorthantinturnformsa simplex,andthelinkoftheoriginisasimplicialcomplex. Example n=4:Thereexist3binaryrootedtreeswith1interioredge.Eachtree determineseach1-dimensionalorthantconvergingtoa0-dimensionalface attheorigin.Thelinkoftheoriginisexactlythethreepointsalong eachraywithlength1fromfromtheorigin. n=5:Thereare15binaryrootedtrees,eachwitha2-dimensionalorthant sharingone-dimensionalfaces.Thelinkoftheoriginhastheinteresting 62 PAGE 69 Figure15:Thelinkoftheoriginof T 4 ,representedbythe3bolddots.The labelscorrespondtothetreesinFigure propertyofbeingequivalenttothePetersengraph,showninFigure Thefollowingdescribesthesizeof T n Theorem8.1 Themaximumrotationaldistancebetweenanytwotreesinthe spaceofundirected, n -leaftreesis n log n ,ortheupperandlowerbounds areconstantsmultipliedby n log n Proof In[9]it'sshownthattheupperboundforrotationaldistance O n log n In[29]itisshownfor n log n Denition Anotnecessarilyuniquepathofminimumrotationaldistance in T n betweenanytwotreesisreferredtoasa geodesic Withanunderstandingofthegeometryonthespaceofphylogenetictrees, andinparticularthemaximumgeodesiclengthin T n ,wecancompletethis 63 PAGE 70 Figure16:Thelinkoftheoriginof T 5 ,shownhereinmostfamiliarformof thePetersengraph.[22] briefintroductionandmoveforwardwiththeinterestingexperiment.However, in[9],[3],[28],and[29],thereisextensivediscussionastothecombinatorial natureofbinarytrees,andthereisfurtherpropertiesofinterest. PartII Method,Data,andResults 9Introduction Thenextsectiontakesintoaccountthepreviousthreesections.Wehave obtainedwaystocomputeinvariantsthatareproducednaturallyfromlocal topologiesonthephylogenetictreewhichtheydescribe.Wehavediscussed thespaceofphylogenetictrees,andtheenlighteningfactthatthemaximum 64 PAGE 71 pathbetweenanytwotreesinthespaceoftreesislog-lineartothenumber ofleavesinthetree.Aninterestingquestionposediswhetherornotthe "greedyalgorithm",asrespecttoachosenmetric,couldnotreturntous thebestttreetopologystartingfromarandomintialpointinthetree. Wewillattempttoanswerthisbyproposingthatametricderivedfrom invariantsofatreeisagoodcandidateforourgreedyalgorithmonthe spaceoftrees. 10Hypothesis Bynow,threethingsshouldbeclear. 1.Thatthecomputationandevaluationofinvariantscantellusinformation aboutthetopologyofaphylogenetictree. 2.Wecancomparephylogenetictreesthataretopologicallysimilarby asubsetoflocaltopologicalproperties,denedhereasadierenceof rotationsbetweenanytwotrees. 3.Ifweareallowedawayofdeterminingwhichtreefromanysubsetof topologicallysimilartreesis"closer"tothebestttopologyinthe spaceoftreesconstructedabove,thenthepathfromanytreetothe bestttopologybychoosingateachsteptheclosesttreeismaximally log-linear O nlogn Thenifwecanconstructanalgorithmthatchoosesapathbyfollowing thegeodesicbetweenanytwotreesbasedonsequencealignments,thiswould improvethecomputationaleciencyforcomputingthebesttphylogenetic tree.Thenaturalalgorithmtorsttestisagreedyalgorithm.Wewouldlike toseeifagreedyalgorithmonthespaceoftreesscoredusingachosen 65 PAGE 72 statisticparameterizedovertheinvariantswhenevaluatedunderobserveddata terminatesatthebesttphylogenetictree.Furthermore,ifthestatistic strictlydecreaseswhenmovingbetweennearestneighbortreestowardsthe besttphylogenetictree,thiswouldmakethegreedyalgorithmverystrong. Howwellthestatisticusedintheexperimentperformsdependsonits validityoverthenumberofinvariantscomputedandthevalidityofthe statisticovertheobservedalignmentdata.Weuse validity asaweakerform ofconsistency,inthataparameterchangemayonlyimprovethecondence ofastatistic,butdoesnotimplytheprobabilitythatthestatisticwill produceanincorrectresultconvergestozero. Thevalidityandcomputationaleciencyofthethegreedyalgorithmdependonboththevalidityofthestatisticoveritsrespectiveparametersand thegeometryofthespaceoftrees.Weperformanumberofexperiments eachwith105trialsofthegreedyalgorithmforeachtreeinthespace, measuringthenumberoftrialsthatpredictthebestttreeoverthenumberofinvariantsevaluatedforthestatisticoneachtree.Thisistotest thestatisticalvalidityofourgreedyalgorithmoverthenumberofinvariants tocomputeforthestatisticofeachtreeandoverthepositionoftheinitialtree. IfthelengthoftheDNAsequenceisallowedtogrowlarge,andthe sequenceisgeneratedbytheprocessofpropagatingMarkovmatricesalonga treenetwork,theneachinvariantwillconvergetozero,sothateachinvariant ofthebestttreeisstatisticallyconsistenthenceanymetriconthem.I donottestagainstthis,asmanyapplicationsofphylogeneticsrestrictthe lengthofthealignment,henceitisinterestingtotestthealgorithmona 66 PAGE 73 relativelysmallalignmentlength. 11Method IusedthesoftwareSeq-Gen[26]forrandomlygeneratingsequencesbasedon thegeneraltimereversiblemodel.Ichoseasinglelabeledtreetopologyx asourbestttopology,thenIinputtedintoSeq-Genthebestttopology andalistofparametersstatedbelow: NumberofSequencesTaxa:6 Lengthofsequences:10000 Branchlengths:[Homogeneoust=.1,Homogeneoust=.5] Ratematrix:GTR.1,.1,.1,.1,.1,.1 Thus,wehaveexactlytwosetsofsequencedatadierentiatedbythe lengthofthetreebranchestheydescribe.Sincethebranchesareofhomogeneouslength,andtheratematrixisoftheJukes-Cantorclass,boththe group-basedandgeneralMarkovinvariantsshouldbestatisticallyconsistent onthedata. Thisprojectrequiredthenimblemovementbetweenstrings,algebra,and graphs.EachofwhicharehandledadmirablybyPython2.6andthemodules numpy [24]and networkx [17]. Weproduceatotalof8experimentsforeachchoiceofparameter,which areasfollows.First,thechoiceofbranchlengths,whichwereproducedfrom theSeq-Gendatat=.1,t=.5.Second,themethodbywhichtheinvariants 67 PAGE 74 areconstructed;wehavechosenheretocompareexactlytwomethods:general Markovmodeledge-atteninginvariantsandgroup-basedmodeledge-attening invariants.Third,wehavechosentoperformtrialsacrossthenumberof invariantstoproduce:100,300,500,700,900,1100,1300. Letthenumberofinterioredgesbe j IE T j ,andthetotalnumberof experimentsbe H .Weperformthefollowingalgorithmforeachtrialonthe spaceoftrees. 1.Chooseatrandomexactly H= j IE T j invariantsfromanedgeattening foreachinterioredge.Thisweighstheinvariantsobtainedforalledges equally. 2.Evaluateeachinvariantbysubstitutingtheobservedalignedbasedata intoeachindeterminate. 3.Sumtheabsolutevaluesofallevaluatedinvariantsobtainedthisway. Labelthisnumberforeachtreeits"invariantstatistic". 4.Dothisforalltreeswithinthespaceoftrees. 5.Foreachtree,startbylabelingthetreechosenthe'initial'and'current' tree.Evaluatetheinvariantstatisticofthetreeandallneighboring trees. 6.Choosethetreewiththelowerstscoringinvariantstatisticthe'current' tree. 7.Performthisuntilthesamecurrenttreeischosentwice,thisisthe 'terminal'tree. 8.Ifthe'terminal'treeisthebestttree,addtheinitialtreetothe "success"binoftreesforthespaceoftrees. 68 PAGE 75 9.Forthespaceoftreesineachexperiment,ndthemeannumberof "successes",whichisthesizeofthe"success"bindividedby n )]TJ/F15 11.9552 Tf 11.374 0 Td [(5!!, thenumberoftreesinthespace. Ifwehavebeenpayingattention,we'llrecallthatageodesicfromany treetothebesttphylogenetictreeisatmost O nlogn ,fornlarge, andthatthenumberofneighboringundirectedtreestoanygiventreeis 2 n )]TJ/F15 11.9552 Tf 12.221 0 Td [(3.Henceifthealgorithmaboveisoptimalandwechoosethetree closesttothebestttreeateachturn,thenumberoftreestocalculate theirinvariantstatisticis O n )]TJ/F15 11.9552 Tf 11.58 0 Td [(3 nlogn = O n 2 logn ,fornlargeenough. Thisissignicantlybetterthanevaluating n )]TJ/F15 11.9552 Tf 11.955 0 Td [(5!!trees. Remark Themetricdescribedinisthe L 1 metric.Thisisnotthe optimalchoice,butitshouldprovideuswitharoughestimateastothe accuracyofthetreeitdescribes,andisutilizedeectivelyin[4].Itshould alsobenotedthatatreemusthaveaninvariantstatisticlocalminimumin thespaceoftreesforittobeaterminatingtree.Thus,itisessentialthat thebestttreemustbealocalminimumforthedatanottoskewwildly. Wewillseethatthisdoeshappen,butdoesnotsignicantlyaectoverall results. 12Results Above,theresultsinFigureareresultsforallgroup-basedandgeneral Markovtrialsgiventhechoiceofhomogeneousbranchlengthswitht=.1. Thosetrialsmarkedbetweenanytwonumbers n and n +200arechosen tohaveevaluated n statistics.Thatis,thosetrialsbetween100and300 onthebottomaxiswerechosentohave100invariants.Inredarethose experimentsperformedforthegroup-basedattenings,anditisclearthat 69 PAGE 76 Figure17:Resultsforinvariantstatisicheuristiconspaceoftreeswithhomogeneous branchlengtht=.1.Group-based:+,General-Markov:x 70 PAGE 77 Figure18:Resultsforinvariantmeasureheuristiconspaceoftreeswithhomogeneous branchlengtht=.5,Group-based:+,General-Markov:x forthissetofdata,clearlywinsoutagainstthegeneralMarkovmodel attenings,whichareinbluealthoughforboththeimprovementsfromthe choiceof100invariantsinanexperimentto1300invariantsisnoticeable. Thatthegroup-basedmodelsachievesuchaccuracyonouralgorithmwith suchalownumberofinvariantsistheremarkablefacetofourcomputational results. Figureisthecollectionofresultsforallgroup-basedandgeneral Markovtrialsgiventhechoiceofhomogeneousbranchlengthswitht=.5.The treebranchesforthesetrialsareconsiderablelargerthantherstexperiment, andthegraphshowsaconsiderablechangeinthedata.Nowitisthe generalMarkovmodelinvariantsthatperformhighly,althoughtherearesome experiments,showninthebluex'sonthebottom,wherethebestttree 71 PAGE 78 Figure19:Asampleofallinvariantstatisticsforalltreesonaspacewithhomogeneous branchlengtht=.1;1300group-basedattenings wasnotalocalminimuminthetreespace,leadingtoacompletefailure inthetrial.Still,itishighlynotablehowthegroup-basedmodelshave deterioratedinaccuracydepsitewhatshouldbegreaterstatisticalevidenceas totherelationshipbetweenthetrees. Withtheconsistencyofthemethodofusinginvariantsasastatistic alongthespaceoftreesupheldbythesefewexperiments,Inowshowthe invariantstatisticsforalltreesontwosingletrialsfromthebestcasesabove. Figureshowsall105invariantstatisticsfromasingletrialof1300 group-basededgeatteningsforatreewithhomogeneousbranchlength t = : 1, whichisthebestcasefromthetwosetsofdataforgroup-basededge 72 PAGE 79 attenings,showninFigure.Theaxesareabitfacetious,andreferto thenumberofthetreesinthepathinsteadofrotations.Notethatthe invariantstatisticscoreisnotanerror,oratleastnotagraphingerror.If thetrialswereperformedagain,itwouldbeprudenttoaveragethisoutto somethingthesameorderasthegeneralMarkovinvariantstatisticsbelow. Thenumberofrotationsinthepathissimplyonelessthanthenumberof trees.Noticehowthereisonesinglepointforapathlengthof1,whichis thestatisticforthebestttreetopology.Thestatisticsforalltheother treesfollowaniceprogression,anditiseasytoseehowagreedyalgorithm onthisdatacould,atleastmostofthetime,followthegeodesic.There areafewdatapointsthatpointtoapathof6trees,or5rotations. Thisisactuallyamisdirectionfromthegeodesic,asthemaximumgeodesic is5rotations.Thereisalsoasubsetofpointsforthe4-treepathtrials thatareaslowasmost3-treepathtrials.Thisbehaviorisnotoptimal, butotherwiseinteresting.Whatisencouragingisthesteepdropobetween statisticsfortrialsalonga3-treepath,2-treepath,andthebestttree. Thistellsusthattheinvariantstatistics,atleastforthiscase,aregoodat determiningwhenthetreesareclosetothebestttree. Figureshowsall105invariantstatisticsfromanexperimentof1300 generalMarkovedgeatteningsforatreewithhomogeneousbranchlength t = : 5,whichisthebestcasefromthetwoexperimentsforgeneralMarkov edgeattenings,showninFigure2.Onceagainweobservealovely progression.ComparedtoFigure,thistrialdidnotndextralongpaths, sowemaynaivelyassumethatmosttreesfollowedthegeodesicorapath closetoone.Also,thedataforthesetsoftrialsforeachpathlengthare well-grouped,especiallycomparedtoFigure,althoughthedropofrom the3-path,2-path,andbestttreetopologiesareaboutthesame.That 73 PAGE 80 Figure20:Asampleoftheinvariantstatisticsforalltreesonasinglespacewith homogeneousbranchlengtht=.5;1300generalMarkovattenings 74 PAGE 81 twodierentmodelsshouldshowthispatternofprogressionimpliesastrong relationshipbetweenusinginvariantstatisticsandgreedysearchesonthespace oftrees. 13Discussion Ratherthancomparingallpossibletreetopologiesatriskofcomputational ineciency,wecanorganizethetreesbyarelationintoagraphicalspacewith advantageousgeometry.Thenifwecanproduceametricoverthealgebraic invariantsofthebesttphylogenetictreethatisminimizedconsistently, andstrictlyincreasesasthedistanceawayfromthebestphylogeneticas thedistanceawayfromthebesttphylogenetictreeincreases,wehavean ecientandgeometricallyinterestingmethodtoproducethebestphylogenetic tree. Fromthesmallamountoftrialsperformed,itisclearthatthepotential forthisapproachisnotunfounded,andwhollyinteresting.Ifweareallowed optimalresultsforeachtrial,thenwewillndstatisticsonnomorethan O n 2 logn trees,afarsmallernumberthan n )]TJ/F15 11.9552 Tf 12.988 0 Td [(5!!.Thereareother methodsthathavebeenproposedtoreducethenumberoftreestocompute usinginvariantstatistics.In[11],notably,isanapproachthatoptimallynds aninvariantstatisticfor O n 2 .Thisapproachndsthesingularvalue decompositionofedge-attenedmatrices,whichwhileanaturalmetriconall edgeattenings,doesnotallowanaturalextensiontomeasuringoverother attenings.Whilethisiscertainlynotprohibitivetoprediction,theapproach aboveallowsforachoiceofanytypeofinvariant,includingvertexattenings, althoughsadlytherewasnotenoughtimetoperformtrialsusingvertex attenings.UntilageneratingsetforthegeneralMarkovinvariantsarefound, 75 PAGE 82 weareonlyallowedsubsetsofgeneratingsets,andhowwechoosethese subsetsisapreferenceofwhichtopologicalpropertiesareofinteresti.e.in [1]theydescribeinvariantsthatpredictthe4-pointcondition. 14Conclusion Invariantshavebeenfoundtodiscusslocaltopologicalfeaturesofphylogenetic trees,andphylogenetictreeswhenrelatedtoeachotherbylocaltopological featuresconstructacombinatoriallypleasingspacetostudy.Thoughthedata sofarisminimal,thegeneralmethodofusingthespaceoftreestoreduce thecomputationofthenumberoftreeshassofarbeensupported,andfurther simulationsshouldbeconductedtorevealitspotenetial.Theadvantageof algebraicinvariantsisthattheyareofxedpolynomialdegree,andthatthey maybecomputedbeforeanydataiscollected,thenusedasmanytimes asdesired,reducingthetotalcomputationalcomplexity.Bythismethod, thecomplexityoftheconstructionofthebesttphylogenetictreeisthe complexityofthenumberoftreestoevaluate,andthismethodresolvesthat, atleastpartially. Forfurtherresearch,itwouldbeinterestingtodiscusswhatother propertiesthegeometryonthespaceoftreesholdsforthisgeneralmethod. Thegeometryofthespaceoftreesholdsarichstructure,andtheminimal useofthisstructurehasalreadyledtoagreatimprovementintheuse ofinvariantstatistics.Newexperimentscouldinvolvetheperturbationof parameters,changingbranchlengthsandratematricesinSeq-Gentoprovide newdata,aswellasactualbiologicaldata.Thereshouldandbytheauthor's willing,therewillbestudyintohowvertexatteningsfareunderthesetrials. 76 PAGE 83 Alsotoconsideristherecombinationofdierentinvariantmodelsintoone statisticalmeasure.Wealsocouldchooseinvariantsbasedonspeciclabelings, astherearesomealignedbasesinalignmentsthatoccurmorefrequently thanothers,andachoiceofinvariantsbasedonthealignedbaselabelings ofindeterminatescouldmanipulatethis.Thereshouldbethoughtintothe developmentofcomparingtreesbyinvariantstatisticsobtainedonlybythose atteningswhichdierentiatetwoneighboringtrees.Thisideaistrickierthan expected,asatreewillneighbortwotreesthatcannotbecomparedthis way.Still,evenanaivecomparisonlikethisshouldfarebetterthanthe L 1 Thesearetherstthingsthatcometomindwhenexploringthepotential ofthismethod,anditistheauthor'sopinionthattheseandotherfuture studieswillbeasenglighteningasthisone. 77 PAGE 84 References [1]Allman,E.andRhodes,J..PhylogeneticInvariantoftheGeneral MarkovModelofSequenceMutation. MathematicalBiosciences 186 113144. [2]Allman,E.andRhodes,J..PhylogeneticIdealsandVarietiesfor theGeneralMarkovModel. AdvancesinAppliedMathematics 40 127-148. [3]Billera,L.,Holmes,S.andVogtmann,K.001.GeometryoftheSpace ofPhylogeneticTrees. AdvancesinAppliedMathematics 27 733-767. [4]Casanellas,M.andFernandez-Sanchez,J..PerformanceofaNew InvariantsMethodonHomogeneousandNon-HomogeneousQuartetTrees. MolecularBiologyandEvolution 24 288-293. [5]Cavender,J.andFelsenstein,J..InvariantsofPhylogeniesina SimpleCasewithDiscreteStates. JournalofClassicaton 4 57-71. [6]Chor,B.,Hendy,M.,Holland,B.andPenny,D.000.MultipleMaxima ofLikelihoodinPhylogeneticTrees:AnAnalyticApproach. Molecular BiologyandEvolution 17 1529-2541. [7]Chor,B.andTuller,T..FindingaMaximumLikelihoodTreeis Hard. JournaloftheAssociationforComputingMachinery 53 722-744. [8]Cox,D.,Little,J.andO'Shea,D.. Ideals,Varieties,andAlgorithms 3rded.Springer,NewYork. [9]Culik,K.andWood,D..ANoteonSomeTreeSimilarityMeasures. InformationProcessingLetters 15 39-42. 78 PAGE 85 [10]Diaconis,P..GroupRepresentationsinProbabilityandStatistics. IMSLectureNotes-MonographSeries 11 InstituteofMathematicalStatistics, HaywardCa. [11]Eriksson,N..TreeConstructionusingSingularValueDecomposition. In:Pachter,L.andSturmfels,B. AlgebraicStatisticsforComputational Biology ,CambridgeUniversityPress,Cambridge. [12]Eriksson,N..UsingInvariantsforPhylogeneticTreeConstruction. EmergingApplicationsofAlgebraicGeometry 149 1-20. [13]Evans,S.andSpeed,T..InvariantsofSomeProbabilityModels UsedinPhylogeneticInference. TheAnnalsofStatistics 21 355-377. [14]Felsenstein,J.. InferringPhylogenies SinauerAssociates,Inc., Sunderland. [15]Ferretti,V.andSanko,D..PhylogeneticInvariantsforMore GeneralEvolutionaryModels. JournalofTheoreticalBiology 173 147-162. [16]Garcia,L.D.,Stillman,M.andSturmfels,B..AlgebraicGeometry ofBayesianNetworks. JournalofSymbolicComputation 39 331-355. [17]Hagberg,A.,Schult,D.andSwart,P.8.Exploringnetworkstructure, dynamics,andfunctionusingNetworkX.In:Varoquaux,G.,Vaught,T. andMillman,J. Proceedingsofthe7thPythoninScienceConference SciPy2008 Pasadena,11-15. [18]Huelsenbeck,J.,Ane,C.,Larget,B.andRonquist,F..ABayesian PerspectiveonaNon-parsimoniousParsimonyModel. SystemsBiology 57 406-419. 79 PAGE 86 [19]Landsberg,J.M.andManivel,L..OntheIdealsofSecantVarieties ofSegreVarieties. FoundationsofComputationalMathematics 4 397-422. [20]Lake,J.A..ARate-IndependentTechniqueforAnalysisofNucleic AcidSequences:EvolutionaryParsimony. MolecularBiologyandEvolution 4 167-191. [21]Mackey,G..HarmonicAnalysisastheExploitationofSymmetry -aHistoricalSurvey. BulletinoftheAmericanMathematicalSociety 3 543-698. [22]MainPage. WikimediaCommons ,Retrieved18:00,April16,2010 from http : ==commons:wikimedia:org=wiki=File : Petersengraphblue:svg [23]Jensen,F.andNielsen,T.. BayesianNetworksandDecisionGraphs 2nded.Springer,NewYork. [24]Oliphant,T..PythonforScienticComputing. ComputinginScience andEngineering9 10-20. [25]Pachter,L.andSturmfels,B..Statistics.In:Pachter,L.and Sturmfels,B. AlgebraicStatisticsforComputationalBiology ,Cambridge UniversityPress,Cambridge.3-42. [26]Rambaut,A.andGrassly,N.C.997.Seq-Gen:AnApplicationforthe MonteCarloSimulationofDNASequenceEvolutionAlongPhylogenetic Trees. ComputationalApplicationsofBiosciences 13 235-238. [27]Sturmfels,B.andSullivant,S..ToricIdealsofPhylogeneticInvariants. JournalofComputationalBiology 12 204-228. 80 PAGE 87 [28]Sleator,D.,Tarjan,E.andThurston,W..RotationDistance,Triangulations,andHyperbolicGeometry. JournaloftheAmericanMathematical Society 1 647-681. [29]Sleator,D.,Tarjan,E.andThurston,W..ShortEncodingsof EvolvingStructures. SIAMJournalonDiscreteMathematics 5 428-450. [30]Szekely,L.A.,Steel,M.A.andErdos,P.L..FourierCalculuson EvolutionaryTrees. AdvancesinAppliedMathematics 14 200-216. [31]Tavare,S..SomeProbabilisticandStatisticalProblemsinthe AnalysisofDNASequences. LecturesonMathematicsintheLifeSciences 17 57-86. 81 |