... of common subexpression elimination [GM82], which appears particularly useful when flattening occurs. A simple technique using a hill-climbing method is easy to superimpose on the proposed strategy, but more ambitious techniques provide a topic for future research. Further, an extrapolation of common subexpressions in logic queries can be seen in the following example: let both goals P(a, b, X) and P(a, Y, c) occur in a query. Then it is conceivable that computing P(a, Y, X) once and restricting the result for each of the cases may be more efficient.

Acknowledgments: We are grateful to Shamim Naqvi for inspiring discussions during the development of an earlier version of this paper.

References:

[AU79] Aho, A. and J. Ullman, Universality of Data Retrieval Languages, Proc. POPL Conf., San Antonio, TX, 1979.
[B40] Birkhoff, G., "Lattice Theory", American Mathematical Society, 1940.
[BMSU85] Bancilhon, F., D. Maier, Y. Sagiv and J. Ullman, Magic Sets and Other Strange Ways to Implement Logic Programs, Proc. 5th ACM SIGMOD-SIGACT Symposium on Principles of Database Systems, pp. 1-16, 1986.
[BR86] Bancilhon, F., and R. Ramakrishnan, An Amateur's Introduction to Recursive Query Processing Strategies, Proc. 1986 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 16-52, 1986.
[D82] Daniels, D., et al., "An Introduction to Distributed Query Compilation in R*", Proc. of Second International Conf. on Distributed Databases, Berlin, Sept. 1982.
[GM82] Grant, J. and Minker, J., On Optimizing the Evaluation of a Set of Expressions, Int. Journal of Computer and Information Science, 11, 3 (1982), 179-189.
[IW87] Ioannidis, Y. E., Wong, E., Query Optimization by Simulated Annealing, SIGMOD 87, San Francisco.
[KBZ86] Krishnamurthy, R., Boral, H., Zaniolo, C., Optimization of Nonrecursive Queries, Proc. of 12th VLDB, Kyoto, Japan, 1986.
[KRS87] Krishnamurthy, R., Ramakrishnan, R., Shmueli, O., "Testing for Safety and Effective Computability", Manuscript in Preparation.
[KT81] Kellogg, C., and Travis, L., Reasoning with data in a deductively augmented database system, in Advances in Database Theory: Vol 1, H. Gallaire, J. Minker, and J. Nicolas eds., Plenum Press, New York, 1981, pp. 261-298.
[Ll84] Lloyd, J. W., Foundations of Logic Programming, Springer Verlag, 1984.
[M84] Maier, D., The Theory of Relational Databases, (pp. 542-553), Comp. Science Press, 1984.
[Na86] Naish, L., Negation and Control in Prolog, Journal of Logic Programming, to appear.
[Sel79] Selinger, P. G. et al., Access Path Selection in a Relational Database Management System, Proc. 1979 ACM-SIGMOD Intl. Conf. on Mgt. of Data, pp. 23-34, 1979.
[SZ86] Saccà, D. and C. Zaniolo, The Generalized Counting Method for Recursive Logic Queries, Proc. ICDT '86, Int. Conf. on Database Theory, Rome, Italy, 1986.
[TZ86] Tsur, S. and C. Zaniolo, LDL: A Logic-Based Data Language, Proc. of 12th VLDB, Kyoto, Japan, 1986.
[U85] Ullman, J. D., Implementation of logical query languages for databases, TODS, 10, 3, (1985), 289-321.
[UV85] Ullman, J. D. and A. Van Gelder, Testing Applicability of Top-Down Capture Rules, Stanford Univ. Report STAN-CS-85-146, 1985.
[V86] Villarreal, M., "Evaluation of an O(N**2) Method for Query Optimization", MS Thesis, Dept. of Computer Science, Univ. of Texas at Austin, Austin, TX.
[Z85] Zaniolo, C., The representation and deductive retrieval of complex objects, Proc. of 11th VLDB, pp. 458-469, 1985.
[Z86] Zaniolo, C., Safety and Compilation of Non-Recursive Horn Clauses, Proc. First Int. Conf. on Expert Database Systems, Charleston, S.C., 1986.

OPTIMIZATION OF COMPLEX DATABASE QUERIES USING JOIN INDICES

Patrick Valduriez
Microelectronics and Computer Technology Corporation
3500 West Balcones Center Drive
Austin, Texas 78759

ABSTRACT

New application areas of database systems require efficient support of complex queries. Such queries typically involve a large number of relations and may be recursive. Therefore, they tend to use the join operator more extensively.
A join index is a simple data structure that can improve significantly the performance of joins when incorporated in the database system storage model. Thus, as any other access method, it should be considered as an alternative join method by the query optimizer. In this paper, we elaborate on the use of join indices for the optimization of both non-recursive and recursive queries. In particular, we show that the incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer and thus offers additional opportunities for increasing performance.

1. Introduction

Relational database technology can well be extended to support new application areas, such as deductive database systems [Gallaire84]. Compared to the traditional applications of relational database systems, these applications require the support of more complex queries. Those queries generally involve a large number of relations and may be recursive. Therefore, the quality of the query optimization module (query optimizer) becomes a key issue to the success of database systems. The ideal goal of a query optimizer is to select the optimal access plan to the relevant data for an input query.

Most of the work on traditional query optimization [Jarke84] has concentrated on select-project-join (SPJ) queries, for they are the most frequent ones in traditional data processing (business) applications. Furthermore, emphasis has been given to the optimization of joins [Ibaraki84] because join remains the most costly operator. When complex queries are considered, the join operator is used even more extensively for both non-recursive queries [Krishnamurthy86] and recursive queries [Valduriez86a].

In [Valduriez87], we proposed a simple data structure, called a join index, that improves significantly the performance of joins. In this paper, we elaborate on the use of join indices in the context of non-recursive and recursive queries. We view a join index as an alternative join method that should be considered by the query optimizer as any other access method. In general, a query optimizer maps a query expressed on conceptual relations into an access plan, i.e., a low-level program expressed on the physical schema. The physical schema itself is based on the storage model, the set of data structures available in the database system. The incorporation of join indices in the storage model enlarges the solution space searched by the query optimizer, and thus offers additional opportunities for increasing performance.

Join indices could be used in many different storage models. However, in order to simplify our discussion regarding query optimization, we present the integration of join indices in a simple storage model with single attribute clustering and selection indices. Then we illustrate the impact of the storage model with join indices on the optimization of non-recursive queries, assumed to be SPJ queries. In particular, efficient access plans, where the most complex (and costly) part of the query can be performed through indices, can be generated by the query optimizer. Finally, we illustrate the use of join indices in the optimization of recursive queries, where a recursive query is mapped into a program of relational algebra enriched with a transitive closure operator.

2. Storage Model with Join Indices

The storage model prescribes the storage structures and related algorithms that are supported by the database system to map the conceptual schema into the physical schema. In a relational system implemented on a disk-based architecture, conceptual relations can be mapped into base relations on the basis of two functions, partitioning and replicating. All the tuples of a base relation are clustered based on the value of one attribute. We assume that each conceptual tuple is assigned a surrogate for tuple identity, called a TID (tuple identifier). A TID is a value unique for all tuples of a relation. It is created by the system when a tuple is instantiated. TID's permit efficient updates and reorganizations of base relations, since references do not involve physical pointers.

The partitioning function maps a relation into one or more base relations, where a base relation corresponds to a TID together with an attribute, several attributes, or all the conceptual relation's attributes.
The rationale for a partitioning function is the optimization of projection, by storing together attributes with high affinity, i.e., frequently accessed together.

The replicating function replicates one or more attributes associated with the TID of the relation into one or more base relations. The primary use of replicated attributes is for optimizing selections based on those attributes. Another use is for increased reliability provided by those additional data copies.

In this paper, we assume a simple storage model ... ) clustered on TID. Clustering is based on a hashed or tree structured organization. A selection index on attribute A of relation R is a base relation F(A, TID) clustered on A.

Let R1 and R2 be two relations, not necessarily distinct, and let TID1 and TID2 be identifiers of tuples of R1 and R2, respectively. A join index on relations R1 and R2 is a relation of couples (TID1, TID2), where each couple indicates two tuples matching a join predicate. Intuitively, a join index is an abstraction of the join of two relations. A join index can be implemented by two base relations F(TID1, TID2), one clustered on TID1 and the other on TID2.

Join indices are uniquely designed to optimize joins. The join predicate associated with a join index may be quite general and include several attributes of both relations. Furthermore, more than one join index can be defined between any two relations. The identification of various join indices between two relations is based on the associated join predicate. Thus, the join of relations R1 and R2 on the predicate (R1.A = R2.A and R1.B = R2.B) can be captured as either a single join index, on the multi-attribute join predicate, or two join indices, one on (R1.A = R2.A) and the other on (R1.B = R2.B).
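
As an illustration of the definition above, the following minimal Python sketch builds a join index as a list of (TID1, TID2) couples and keeps two copies of it, one grouped on TID1 and one on TID2. The function names (build_join_index, cluster_on), the in-memory dictionaries and the sample tuples are assumptions made for this sketch only; they are not taken from [Valduriez87], and the nested loop says nothing about how a real system would build or maintain the index.

```python
from collections import defaultdict

def build_join_index(r1, r2, predicate):
    """Return the couples (tid1, tid2) whose tuples satisfy the join predicate.

    r1 and r2 map a TID to a tuple (here a dict of attribute values).
    """
    return [(tid1, tid2)
            for tid1, t1 in r1.items()
            for tid2, t2 in r2.items()
            if predicate(t1, t2)]

def cluster_on(couples, position):
    """Group the couples on TID1 (position 0) or on TID2 (position 1)."""
    clustered = defaultdict(list)
    for couple in couples:
        clustered[couple[position]].append(couple[1 - position])
    return dict(clustered)

# Hypothetical sample data: TID -> tuple.
R1 = {1: {"A": "x", "B": 10}, 2: {"A": "y", "B": 20}}
R2 = {7: {"A": "x", "B": 10}, 8: {"A": "x", "B": 99}, 9: {"A": "y", "B": 20}}

# One join index on the multi-attribute predicate ...
ji_ab = build_join_index(
    R1, R2, lambda s, t: s["A"] == t["A"] and s["B"] == t["B"])
# ... or two join indices, one per attribute.
ji_a = build_join_index(R1, R2, lambda s, t: s["A"] == t["A"])
ji_b = build_join_index(R1, R2, lambda s, t: s["B"] == t["B"])

# The two copies of a join index, clustered on TID1 and on TID2.
ji_a_on_tid1 = cluster_on(ji_a, 0)
ji_a_on_tid2 = cluster_on(ji_a, 1)
```

With this sample data, the couples common to ji_a and ji_b are exactly the couples of ji_ab, so in this sketch either representation can answer the conjunctive join.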
The choice between a single multi-attribute join index and several single-attribute join indices is a database design decision based on join frequencies, update overhead, etc.

Let us consider the following relational database schema (key attributes are bold):

CUSTOMER(cname, city, age, job)
ORDER(cname, pname, qty, date)
PART(pname, weight, price, spname)

A (partial) physical schema for this database, based on the storage model described above, is (clustered attributes are bold)

C_PC(CID, cname, city, age, job)
City_IND(city, CID)
Age_IND(age, CID)
O_PC(OID, cname, pname, qty, date)
Cname_IND(cname, OID)
CID_JI(CID, OID)
OID_JI(OID, CID)

C_PC and O_PC are primary copies of the CUSTOMER and ORDER relations. City_IND and Age_IND are selection indices on CUSTOMER. Cname_IND is a selection index on ORDER. CID_JI and OID_JI are join indices between CUSTOMER and ORDER for the join predicate (CUSTOMER.cname = ORDER.cname).

3. Optimization of Non-Recursive Queries

The objective of query optimization is to select an access plan for an input query that optimizes a given cost function. This cost function typically refers to machine resources such as disk accesses, CPU time, and possibly communication time (for a distributed database system). The query optimizer is in charge of decisions regarding the ordering of database operations, and the choice of the access paths to the data, the algorithms for performing database operations, and the intermediate relations to be materialized. These decisions are undertaken based on the physical database schema and related statistics. A set of decisions that lead to an execution plan can be captured by a processing tree [Krishnamurthy86]. A processing tree (PT) is a tree in which a leaf is a base relation and a non-leaf node is an intermediate relation materialized by applying an internal database operation. Internal database operations implement efficiently relational algebra operations using specific access paths and algorithms. Examples of internal database operations are exact-match select, sort-merge join, n-ary pipelined join, semi-join, etc. The application of algebraic transformation rules [Jarke84] permits generation of many candidate PT's for a single query.
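
The processing tree just defined can be sketched as a small recursive data structure. The Python below is only an illustration: the class name PTNode, its fields, the helper functions and the particular plan are hypothetical, and the plan is not claimed to be the one an optimizer would choose. The leaf names are borrowed from the physical schema example and the operation names from the list of internal database operations given above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PTNode:
    """A node of a processing tree (PT)."""
    relation: str            # base relation (leaf) or intermediate relation
    operation: str = ""      # internal database operation; empty for a leaf
    operands: List["PTNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.operands

def base_relations(node: PTNode) -> List[str]:
    """Names of the leaves of the PT, from left to right."""
    if node.is_leaf():
        return [node.relation]
    return [name for child in node.operands for name in base_relations(child)]

def operations(node: PTNode) -> List[str]:
    """Internal operations in post-order, i.e. operands before their parent."""
    if node.is_leaf():
        return []
    ops = [op for child in node.operands for op in operations(child)]
    return ops + [node.operation]

# Hypothetical plan: select customers of one city through the selection
# index, restrict the join index to those customers, then fetch the orders.
t1 = PTNode("T1", "exact-match select", [PTNode("City_IND")])
t2 = PTNode("T2", "semi-join", [PTNode("CID_JI"), t1])
plan = PTNode("T3", "sort-merge join", [t2, PTNode("O_PC")])
```

Each non-leaf node names the intermediate relation it materializes, so walking the tree in post-order gives one possible execution order of the internal operations.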
The optimization problem can be formulated as finding the PT of minimal cost among all equivalent PT's. Traditional query optimization algorithms [Selinger79] perform an exhaustive search of the solution space, defined as the set of all equivalent PT's, for a given query. The estimation of the cost of a PT is obtained by computing the sum of the costs of the individual internal database operations in the PT. The cost of an internal operation is itself a monotonic function of the operand cardinalities. If the operand relations are intermediate relations then their cardinalities must also be estimated. Therefore, for each operation in the PT, two numbers must be predicted: (1) the individual cost of the operation and (2) the cardinality of its result based on the selectivity of the conditions [Selinger79, Piatetsky84].

The possible PT's for executing an SPJ query are essentially generated by permutation of the join ordering. With n relations, there are n! possible permutations. The complexity of exhaustive search is therefore prohibitive when n is large (e.g., n > 10). The use of dynamic programming and heuristics, as in [Selinger79], reduces this complexity to 2^n, which is still significant. To handle the case of complex queries involving a large number of relations, the optimization algorithm must be more efficient. The complexity of the optimization algorithm can be further reduced by imposing restrictions on the class of PT's [Ibaraki84], limiting the generality of the cost function [Krishnamurthy86], or using a probabilistic hill-climbing algorithm [Ioannidis87].

Assuming that the solution space is searched by an efficient algorithm, we now illustrate the possible PT's that can be produced based on the storage model with join indices. The addition of join indices in the storage model enlarges the solution space for optimization. Join indices should be considered by the query optimizer as any other join method, and used only when they lead to the optimal PT. In [Valduriez87], we give a precise specification of the join algorithm using join index, denoted by JOINJI, and its cost. This algorithm takes as input two base relations R1(TID1, A1, B1, ...
used to deflect the reading beam very fast. As a result, it is much faster to retrieve information from tracks that are located near the current location of the reading head. We call this a span access capability. The span access capability of optical disks has implications for scheduling algorithms and data structures that are appropriate for optical disks, as well as significant impact on retrieval performance [Christodoulakis87a]. In [Christodoulakis87] we also derive exact analytic cost estimates as well as approximations that are cheaper to evaluate, for the retrieval of records and longer objects such as text, images, voice, and documents (possibly crossing block boundaries) from CAV optical disks. These estimates may be used by query optimizers of traditional or multimedia data bases.

Retrieval Performance of CLV Optical Disks

Constant Linear Velocity (CLV) optical disks have different characteristics than the CAV optical disks. CLV optical disks vary the rotational speed so that the unit length of the track which is read passes under the reading mechanism in constant time, which is independent of the location of the track. This has implications on the rotational delay cost which, in CLV disks, depends on the track location. This also implies that, in CLV disks, the number of sectors per track varies (outside tracks have more sectors).
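
The dependence of rotational delay and track capacity on track location can be illustrated with a short computation. The Python sketch below treats each track as a circle of a given radius, assumes a fixed sector length along the track, and takes the average rotational delay to be half a revolution. These simplifications, the function names and all numeric values are assumptions chosen for illustration; they are not the analytic estimates of [Christodoulakis87b].

```python
import math

def rotation_period(radius_m, linear_velocity_m_s):
    """Time for one revolution when the track moves at constant linear velocity."""
    return 2 * math.pi * radius_m / linear_velocity_m_s

def average_rotational_delay(radius_m, linear_velocity_m_s):
    """Assumed average delay: half a revolution at this radius."""
    return rotation_period(radius_m, linear_velocity_m_s) / 2

def sectors_per_track(radius_m, sector_length_m):
    """Whole sectors of fixed length that fit on a circular track."""
    return math.floor(2 * math.pi * radius_m / sector_length_m)

# Hypothetical disk parameters (illustrative only).
LINEAR_VELOCITY = 1.2    # metres per second
SECTOR_LENGTH = 0.005    # metres of track per sector
INNER_RADIUS = 0.025     # metres
OUTER_RADIUS = 0.058     # metres

inner_delay = average_rotational_delay(INNER_RADIUS, LINEAR_VELOCITY)
outer_delay = average_rotational_delay(OUTER_RADIUS, LINEAR_VELOCITY)
inner_sectors = sectors_per_track(INNER_RADIUS, SECTOR_LENGTH)
outer_sectors = sectors_per_track(OUTER_RADIUS, SECTOR_LENGTH)
```

Under these assumptions both the rotation period and the track capacity grow linearly with the radius, so an outside track holds more sectors but also costs more rotational delay per access.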
The variable capacity of a track has many fundamental implications for the selection of data structures that are desirable for CLV optical disks and the parameters of their implementation, for the selection of access paths to be supported for data bases stored on CLV disks, as well as for the retrieval performance and the optimal query processing strategy to be chosen. (These implications are studied in detail in [Christodoulakis87b], in which it is shown that these decisions depend on the location of data placement on the disk.) Analytic cost estimates for the performance of retrieval of records and objects from CLV disks are also derived in [Christodoulakis87b]. These estimates may be used by traditional or multimedia query optimizers. It is shown that the optimal query processing strategy depends on the location of files on the CLV disk. This implies that query optimizers may have to maintain information about the location of files on the disk.

Estimation of Selectivities in Text

In multimedia information systems much of the content specification will be done by specifying a pattern of text words. Queries based on the content of images are difficult to specify, and image access methods are very expensive. Voice content is transformed to text content if a good voice recognition device is available. Thus accurate estimation of text selectivities is important in query optimization in multimedia objects.

There is another reason why accurate estimation of text selectivities is important. Frequently the user wants to have fast feedback on how many objects qualify in his query. If too many objects qualify, the user may want to restrict the set of qualifying objects by adding more conjunctive terms. If too few objects qualify, the user may want to increase the number of objects that he receives by adding more disjunctive terms. (Tradeoffs of precision versus recall are extensively described in the information retrieval bibliography.) Although such statistics may be found by traversing an index on text (possibly several times for complicated queries), indexes may not be the desirable text access methods in several environments [Haskin81].

Given a set of stop words (words that appear too frequently in English to be of practical value in content addressability), it is easy to give an analytic formula that calculates the average number of words that qualify in a text query [Christodoulakis and Ng87]. This analytic formula uses the fact that the distribution of words in a long piece of text is Zipf with known parameters. However, the average number of documents may not be a good enough estimate (in some cases) for query optimization or for giving an estimate of the size of the response to the user [Christodoulakis84]. More detailed estimates will have to consider selectivities of individual words and queries. This can be done using sampling. A sampling strategy looks at some blocks of text, counts the number of occurrences of a particular word or text pattern, and based on this extrapolates the probability distribution of the number of pattern occurrences to the whole data base. A potential problem with this approach is that in order to be confident about the statistics a large portion of the file may have to be scanned. Instead of blocks of the actual text file, blocks of the text signatures could be used when signatures are used as text access methods. Since more information exists in blocks of signatures than in blocks of the...