... of common subexpression elimination [GM82], which appears particularly useful when flattening occurs. A simple technique using a hill-climbing method is easy to superimpose on the proposed strategy, but more ambitious techniques provide a topic for future research. Further, an extrapolation of common subexpression in logic queries can be seen in the following example: let both goals P(a, b, X) and P(a, Y, c) occur in a query. Then it is conceivable that computing P(a, Y, X) once and restricting the result for each of the cases may be more efficient.

Acknowledgments: We are grateful to Shamim Naqvi for inspiring discussions during the development of an earlier version of this paper.

References:

[AU79] Aho, A. and J. Ullman, Universality of Data Retrieval Languages, Proc. POPL Conf., San Antonio, TX, 1979.
[B40] Birkhoff, G., "Lattice Theory", American Mathematical Society, 1940.
[BMSU85] Bancilhon, F., D. Maier, Y. Sagiv and J. Ullman, Magic Sets and other Strange Ways to Implement Logic Programs, Proc. 5th ACM SIGMOD-SIGACT Symposium on Principles of Database Systems, pp. 1-16, 1986.
[BR86] Bancilhon, F. and R. Ramakrishnan, An Amateur's Introduction to Recursive Query Processing Strategies, Proc. 1986 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 16-52, 1986.
[D82] Daniels, D., et al., "An Introduction to Distributed Query Compilation in R*", Proc. of Second International Conf. on Distributed Databases, Berlin, Sept. 1982.
[GM82] Grant, J. and J. Minker, On Optimizing the Evaluation of a Set of Expressions, Int. Journal of Computer and Information Science, 11, 3 (1982), 179-189.
[IW87] Ioannidis, Y. E. and E. Wong, Query Optimization by Simulated Annealing, SIGMOD 87, San Francisco.
[KBZ86] Krishnamurthy, R., H. Boral and C. Zaniolo, Optimization of Nonrecursive Queries, Proc. of 12th VLDB, Kyoto, Japan, 1986.
[KRS87] Krishnamurthy, R., R. Ramakrishnan and O. Shmueli, "Testing for Safety and Effective Computability", Manuscript in Preparation.
[KT81] Kellogg, C. and L. Travis, Reasoning with data in a deductively augmented database system, in Advances in Database Theory: Vol 1, H. Gallaire, J. Minker, and J. Nicolas eds., Plenum Press, New York, 1981, pp. 261-298.
[Ll84] Lloyd, J. W., Foundations of Logic Programming, Springer Verlag, 1984.
[M84] Maier, D., The Theory of Relational Databases (pp. 542-553), Comp. Science Press, 1984.
[Na86] Naish, L., Negation and Control in Prolog, Journal of Logic Programming, to appear.
[Sel79] Selinger, P. G., et al., Access Path Selection in a Relational Database Management System, Proc. 1979 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 23-34, 1979.
[SZ86] Saccà, D. and C. Zaniolo, The Generalized Counting Method for Recursive Logic Queries, Proc. ICDT '86, Int. Conf. on Database Theory, Rome, Italy, 1986.
[TZ86] Tsur, S. and C. Zaniolo, LDL: A Logic-Based Data Language, Proc. of 12th VLDB, Kyoto, Japan, 1986.
[U85] Ullman, J. D., Implementation of logical query languages for databases, TODS, 10, 3 (1985), 289-321.
[UV85] Ullman, J. D. and A. Van Gelder, Testing Applicability of Top-Down Capture Rules, Stanford Univ. Report STAN-CS-85-146, 1985.
[V86] Villarreal, M., "Evaluation of an O(N**2) Method for Query Optimization", MS Thesis, Dept. of Computer Science, Univ. of Texas at Austin, Austin, TX.
[Z85] Zaniolo, C., The representation and deductive retrieval of complex objects, Proc. of 11th VLDB, pp. 458-469, 1985.
[Z86] Zaniolo, C., Safety and Compilation of Non-Recursive Horn Clauses, Proc. First Int. Conf. on Expert Database Systems, Charleston, S.C., 1986.

OPTIMIZATION OF COMPLEX DATABASE QUERIES USING JOIN INDICES

Patrick Valduriez
Microelectronics and Computer Technology Corporation
3500 West Balcones Center Drive
Austin, Texas 78759

ABSTRACT

New application areas of database systems require efficient support of complex queries. Such queries typically involve a large number of relations and may be recursive. Therefore, they tend to use the join operator more extensively.
A join index is a simple data structure that can improve significantly the performance of joins when incorporated in the database system storage model. Thus, as any other access method, it should be considered as an alternative join method by the query optimizer. In this paper, we elaborate on the use of join indices for the optimization of both non-recursive and recursive queries. In particular, we show that the incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer and thus offers additional opportunities for increasing performance.

1. Introduction

Relational database technology can well be extended to support new application areas, such as deductive database systems [Gallaire84]. Compared to the traditional applications of relational database systems, these applications require the support of more complex queries. Those queries generally involve a large number of relations and may be recursive. Therefore, the quality of the query optimization module (query optimizer) becomes a key issue to the success of database systems. The ideal goal of a query optimizer is to select the optimal access plan to the relevant data for an input query.

Most of the work on traditional query optimization [Jarke84] has concentrated on select-project-join (SPJ) queries, for they are the most frequent ones in traditional data processing (business) applications. Furthermore, emphasis has been given to the optimization of joins [Ibaraki84] because join remains the most costly operator. When complex queries are considered, the join operator is used even more extensively for both non-recursive queries [Krishnamurthy86] and recursive queries [Valduriez86a].

In [Valduriez87], we proposed a simple data structure, called a join index, that improves significantly the performance of joins. In this paper, we elaborate on the use of join indices in the context of non-recursive and recursive queries. We view a join index as an alternative join method that should be considered by the query optimizer as any other access method.

In general, a query optimizer maps a query expressed on conceptual relations into an access plan, i.e., a low-level program expressed on the physical schema. The physical schema itself is based on the storage model, the set of data structures available in the database system. The incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer, and thus offers additional opportunities for increasing performance.

Join indices could be used in many different storage models. However, in order to simplify our discussion regarding query optimization, we present the integration of join indices in a simple storage model with single attribute clustering and selection indices. Then we illustrate the impact of the storage model with join indices on the optimization of non-recursive queries, assumed to be SPJ queries. In particular, efficient access plans, where the most complex (and costly) part of the query can be performed through indices, can be generated by the query optimizer. Finally, we illustrate the use of join indices in the optimization of recursive queries, where a recursive query is mapped into a program of relational algebra enriched with a transitive closure operator.

2. Storage Model with Join Indices

The storage model prescribes the storage structures and related algorithms that are supported by the database system to map the conceptual schema into the physical schema. In a relational system implemented on a disk-based architecture, conceptual relations can be mapped into base relations on the basis of two functions, partitioning and replicating. All the tuples of a base relation are clustered based on the value of one attribute. We assume that each conceptual tuple is assigned a surrogate for tuple identity, called a TID (tuple identifier). A TID is a value unique for all tuples of a relation. It is created by the system when a tuple is instantiated. TIDs permit efficient updates and reorganizations of base relations, since references do not involve physical pointers.

The partitioning function maps a relation into one or more base relations, where a base relation corresponds to a TID together with an attribute, several attributes, or all the conceptual relation's attributes.
The rationale for a partitioning function is the optimization of projection, by storing together attributes with high affinity, i.e., frequently accessed together. The replicating function replicates one or more attributes associated with the TID of the relation into one or more base relations. The primary use of replicated attributes is for optimizing selections based on those attributes. Another use is for increased reliability provided by those additional data copies.

In this paper, we assume a simple storage model ... ) clustered on TID. Clustering is based on a hashed or tree structured organization. A selection index on attribute A of relation R is a base relation F(A, TID) clustered on A.

Let R1 and R2 be two relations, not necessarily distinct, and let TID1 and TID2 be identifiers of tuples of R1 and R2, respectively. A join index on relations R1 and R2 is a relation of couples (TID1, TID2), where each couple indicates two tuples matching a join predicate. Intuitively, a join index is an abstraction of the join of two relations. A join index can be implemented by two base relations F(TID1, TID2), one clustered on TID1 and the other on TID2.

Join indices are uniquely designed to optimize joins. The join predicate associated with a join index may be quite general and include several attributes of both relations. Furthermore, more than one join index can be defined between any two relations. The identification of various join indices between two relations is based on the associated join predicate. Thus, the join of relations R1 and R2 on the predicate (R1.A = R2.A and R1.B = R2.B) can be captured as either a single join index, on the multi-attribute join predicate, or two join indices, one on (R1.A = R2.A) and the other on (R1.B = R2.B).
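The definition of a join index given above can be illustrated with a minimal sketch. Python is used purely for illustration; the function names build_join_index and clustered_copy, the nested-loop construction, and the sample tuples are assumptions made for this example and do not come from the paper. A dict keyed by a TID stands in for a base relation clustered on that TID.

```python
from collections import defaultdict


def build_join_index(r1, r2, predicate):
    """Return the set of (tid1, tid2) couples whose tuples satisfy predicate.

    r1 and r2 map a TID to a tuple, represented here as a dict of attributes.
    The nested loop is for clarity only; it is not an efficient construction.
    """
    return {
        (tid1, tid2)
        for tid1, t1 in r1.items()
        for tid2, t2 in r2.items()
        if predicate(t1, t2)
    }


def clustered_copy(couples, on):
    """Group the couples by one TID column (on=0 for TID1, on=1 for TID2).

    A dict keyed by the clustering TID stands in for a clustered base relation.
    """
    copy = defaultdict(list)
    for couple in sorted(couples):
        copy[couple[on]].append(couple[1 - on])
    return dict(copy)


# Hypothetical sample relations R1 and R2, keyed by TID.
r1 = {1: {"A": "x", "B": 10}, 2: {"A": "y", "B": 20}}
r2 = {7: {"A": "x", "B": 10}, 8: {"A": "x", "B": 20}, 9: {"A": "y", "B": 20}}

# One join index per single-attribute predicate, and one on the conjunction.
ji_a = build_join_index(r1, r2, lambda t1, t2: t1["A"] == t2["A"])
ji_b = build_join_index(r1, r2, lambda t1, t2: t1["B"] == t2["B"])
ji_ab = build_join_index(
    r1, r2, lambda t1, t2: t1["A"] == t2["A"] and t1["B"] == t2["B"]
)

by_tid1 = clustered_copy(ji_a, on=0)  # copy of the index clustered on TID1
by_tid2 = clustered_copy(ji_a, on=1)  # copy of the index clustered on TID2

print(by_tid1)
print(by_tid2)
print(ji_ab == ji_a & ji_b)
```

On this sample data, intersecting the two single-attribute join indices yields the same couples as the single multi-attribute join index, so either design captures the conjunctive join; the two-index design simply maintains two structures instead of one.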
The choice between the alternatives is a database design decision based on join frequencies, update overhead, etc.

Let us consider the following relational database schema (key attributes are bold):

CUSTOMER(cname, city, age, job)
ORDER(cname, pname, qty, date)
PART(pname, weight, price, spname)

A (partial) physical schema for this database, based on the storage model described above, is (clustered attributes are bold)

C_PC(CID, cname, city, age, job)
City_IND(city, CID)
Age_IND(age, CID)
O_PC(OID, cname, pname, qty, date)
Cname_IND(cname, OID)
CID_JI(CID, OID)
OID_JI(OID, CID)

C_PC and O_PC are primary copies of CUSTOMER and ORDER relations. City_IND and Age_IND are selection indices on CUSTOMER. Cname_IND is a selection index on ORDER. CID_JI and OID_JI are join indices between CUSTOMER and ORDER for the join predicate (CUSTOMER.cname = ORDER.cname).

3. Optimization of Non-Recursive Queries

The objective of query optimization is to select an access plan for an input query that optimizes a given cost function. This cost function typically refers to machine resources such as disk accesses, CPU time, and possibly communication time (for a distributed database system). The query optimizer is in charge of decisions regarding the ordering of database operations, and the choice of the access paths to the data, the algorithms for performing database operations, and the intermediate relations to be materialized. These decisions are undertaken based on the physical database schema and related statistics. A set of decisions that lead to an execution plan can be captured by a processing tree [Krishnamurthy86]. A processing tree (PT) is a tree in which a leaf is a base relation and a non-leaf node is an intermediate relation materialized by applying an internal database operation. Internal database operations implement efficiently relational algebra operations using specific access paths and algorithms. Examples of internal database operations are exact-match select, sort-merge join, n-ary pipelined join, semi-join, etc. The application of algebraic transformation rules [Jarke84] permits generation of many candidate PT's for a single query.
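The notion of a processing tree can be made concrete with a small sketch over the physical schema of Section 2. Python is used only for illustration. The classes Leaf and Node, the operations select_city, restrict_ji and fetch, and all sample tuples are hypothetical; in particular, the index-based join shown here is a simplified stand-in and not the join algorithm specified in [Valduriez87].

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Leaf:
    """A leaf of the processing tree: a stored base relation."""
    name: str
    rows: List[tuple]

    def evaluate(self):
        return self.rows


@dataclass
class Node:
    """A non-leaf node: an intermediate relation produced by an operation."""
    op: Callable
    children: list

    def evaluate(self):
        return self.op(*[child.evaluate() for child in self.children])


def select_city(city):
    """Exact-match select on a selection index of (city, CID) rows."""
    return lambda index: [cid for c, cid in index if c == city]


def restrict_ji(cids, join_index):
    """Keep the (CID, OID) couples of the join index for the selected CIDs."""
    wanted = set(cids)
    return [(cid, oid) for cid, oid in join_index if cid in wanted]


def fetch(couples, c_pc, o_pc):
    """Fetch (cname, pname, qty) from the primary copies by TID."""
    customers = {row[0]: row for row in c_pc}
    orders = {row[0]: row for row in o_pc}
    return [(customers[c][1], orders[o][2], orders[o][3]) for c, o in couples]


# Hypothetical contents of the base relations.
C_PC = Leaf("C_PC", [(1, "Smith", "Austin", 40, "engineer"),
                     (2, "Jones", "Dallas", 35, "clerk")])
City_IND = Leaf("City_IND", [("Austin", 1), ("Dallas", 2)])
O_PC = Leaf("O_PC", [(10, "Smith", "bolt", 5, "1987-01-10"),
                     (11, "Jones", "nut", 2, "1987-02-01"),
                     (12, "Smith", "screw", 9, "1987-03-05")])
CID_JI = Leaf("CID_JI", [(1, 10), (1, 12), (2, 11)])

# One candidate PT: select on the city index, follow the join index,
# and touch the primary copies only to fetch the qualifying tuples.
pt = Node(fetch, [
    Node(restrict_ji, [Node(select_city("Austin"), [City_IND]), CID_JI]),
    C_PC,
    O_PC,
])

result = pt.evaluate()
print(result)
```

In this tree the selection and the join are both resolved on index relations before any primary copy is read, which is one of several equivalent trees an optimizer could consider for the same query.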
The optimization problem can be formulated as finding the PT of minimal cost among all equivalent PT's. Traditional query optimization algorithms [Selinger79] perform an exhaustive search of the solution space, defined as the set of all equivalent PT's, for a given query. The estimation of the cost of a PT is obtained by computing the sum of the costs of the individual internal database operations in the PT. The cost of an internal operation is itself a monotonic function of the operand cardinalities. If the operand relations are intermediate relations then their cardinalities must also be estimated. Therefore, for each operation in the PT, two numbers must be predicted: (1) the individual cost of the operation and (2) the cardinality of its result based on the selectivity of the conditions [Selinger79, Piatetsky84].

The possible PT's for executing an SPJ query are essentially generated by permutation of the join ordering. With n relations, there are n! possible permutations. The complexity of exhaustive search is therefore prohibitive when n is large (e.g., n > 10). The use of dynamic programming and heuristics, as in [Selinger79], reduces this complexity to 2^n, which is still significant. To handle the case of complex queries involving a large number of relations, the optimization algorithm must be more efficient. The complexity of the optimization algorithm can be further reduced by imposing restrictions on the class of PT's [Ibaraki84], limiting the generality of the cost function [Krishnamurthy86], or using a probabilistic hill-climbing algorithm [Ioannidis87].

Assuming that the solution space is searched by an efficient algorithm, we now illustrate the possible PT's that can be produced based on the storage model with join indices. The addition of join indices in the storage model enlarges the solution space for optimization. Join indices should be considered by the query optimizer as any other join method, and used only when they lead to the optimal PT. In [Valduriez87], we give a precise specification of the join algorithm using join index, denoted by JOINJI, and its cost. This algorithm takes as input two base relations R1(TID1, A1, B1, ...