Intelligent Bioinformatics


Чтобы посмотреть этот PDF файл с форматированием и разметкой, скачайте его и откройте на своем компьютере.
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Theapplicationofartificialintelligence
techniquestobioinformaticsproblems
EdwardKeedwell
AjitNarayanan
SchoolofEngineering,ComputerScienceandMathematics
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Bioinformatics
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Theapplicationofartificialintelligence
techniquestobioinformaticsproblems
EdwardKeedwell
AjitNarayanan
SchoolofEngineering,ComputerScienceandMathematics
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
2005JohnWiley&SonsLtd,TheAtrium,SouthernGate,Chichester,
WestSussexPO198SQ,England
Telephone(+44)1243779777
Email(forordersandcustomerserviceenquiries):[email protected]
VisitourHomePageonwww.wileyeurope.comorwww.wiley.com
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Prefaceix
Acknowledgementxi
PART1INTRODUCTION1
1IntroductiontotheBasicsofMolecularBiology3
1.1Basiccellarchitecture
1.2Thestructure,contentandscaleofdeoxyribonucleicacid(DNA)
1.3Historyofthehumangenome
1.4Genesandproteins
1.5Currentknowledgeandthecentraldogma
1.6Whyproteinsareimportant
1.7Geneandcellregulation
1.8Whencellregulationgoeswrong
1.9So,whatisbioinformatics?
1.10Summaryofchapter
1.11Furtherreading
2IntroductiontoProblemsandChallenges
inBioinformatics31
2.1Introduction
2.2Genome
2.3Transcriptome
2.4Proteome
2.5Interferencetechnology,virusesandtheimmunesystem
2.6Summaryofchapter
2.7Furtherreading
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
3IntroductiontoArti“cialIntelligenceand
ComputerScience65
3.1Introductiontosearch
3.2Searchalgorithms
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
6.7Summaryofchapter
6.8References
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Itiswidelyrecognizedthatthe“eldofbiologyisinthemidstofadata
explosion.Aseriesoftechnicaladvancesinrecentyearshasincreased
theamountofdatathatbiologistscanrecordaboutdifferentaspectsof
anorganismatthegenomic,transcriptomicandproteomiclevels.This
datais,ofcourse,vitaltoadvancingourknowledge.Inrecentyears,the
disciplineof
hasallowedbiologiststomakefulluseofthe
advancesincomputerscienceandcomputationalstatisticsinanalysing
thisdata.However,asthevolumeofdatagrows,thetechniquesusedmust
becomemoresophisticatedtocaterforlarge-scaledataandnoise.Also,
giventhegrowthinbiologicaldata,thereisaneedtoextractinformation
thatwasnotpreviouslyknownfromthesedatabasestosupplementcur-
rentknowledge.Largedatabasesmaycontaininterestingpatternsthat,if
identi“edandauthenticatedbyfurtherlaboratoryandclinicalwork,can
leadtonoveltheoriesaboutthecausesofvariousdiseasesandalsopossi-
blytonewdrugsfortheirtreatment.Thedisciplineofbioinformaticshas
reachedtheendofits“rstphase,andthemotivationbehindthisbook
istocharacterizetheprinciplesthatmayunderliesecondphasebioin-
formatics.Thatis,secondphasebioinformaticsiswhenthediscipline,
insteadofbeinginformedbyjustcomputerscienceandcomputational
statistics,isalsoinformedbyarti“cialintelligencetechniques.
Asweshowinthisbook,thereareproblemsinbioinformaticsand
manyothersciencesthatcannotbesolvedsatisfactorilyevenwiththe
fastestcomputers.Clearly,amoreintelligentapproachisrequiredto
solvetheseincreasinglydif“cultbioinformaticsproblems,suchasgene
expressionanalysisandproteinstructureprediction.Thisbookattempts
toaddressthisbylookingatthelatestadvancesinarti“cialintelligence
technologyasappliedtocomputationalproblemsinbiology.Arti“cial
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
PREFACE
searchandoptimizationproblems,orhownaturehassolveditsown
problems,forexamplebyusingtheprinciplesofsurvivalofthe“ttest
inevolutionarycomputation.
Thisbookisdividedintothreeparts,eachcontaininganumberof
chapters.Thesepartsaredesignedtoallowreaderstoaccessthemate-
rialmostrelevanttothem.The“rstpart,
introducesthe
materialnecessarytounderstandthetechnologyandbiologyincluded
inthelaterchapters.Werecognizethatbioinformaticsishighlycross-
disciplinaryandthereforesome,allornoneofthesechaptersmaybe
relevanttothereader,dependingontheirbackground.Thenextpart,
CurrentTechniques
,describestheestablishedarti“cialintelligencetech-
niquesinbioinformaticsincludingprobabilistic,nearestneighbourand
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
Theauthorswouldliketothankeveryoneinvolvedwithproducingthis
bookincludingstaffattheDepartmentofComputerScienceandCentre
JWBK023-FMJWBK023-KeedwellApril5,200523:30CharCount=0
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
Part1
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
IntroductiontotheBasics
ofMolecularBiology
1.1Basiccellarchitecture
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
(b) Translation
(a) Transcription
Figure1.1
Anoverviewofatypicalhumancell
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
THESTRUCTURE,CONTENTANDSCALEOFDEOXYRIBONUCLEICACID
oroffwithincells,resultingindifferenttypesofcellproducingdifferent
proteinsfornormalgrowthandfunctioningoftheorganismasawhole.
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
transmissionelectronmicroscopy
(TEM),whereelectronsarebeamed
throughthesampleandanimageproducedresultingfromtheinteraction
oftheelectronswiththesample.TEMcanresolveorganellesandother
subcellularstructuresbutnotthecontentofchromosomes.Inother
words,itislikelythatchromosomecontentwillnotbeobserveddirectly
atthelevelofbases,whichmeansthatDNAsequenceswillneverbe
observeddirectly.Instead,
5

3

3

5


(b)
5

5

Double helix
3

Figure1.2
Thedouble-helixstructure
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
THESTRUCTURE,CONTENTANDSCALEOFDEOXYRIBONUCLEICACID
right-to-leftforthebottomstrand).Eachnucleotideisamoleculecon-
sistingofa“ve-carbonsugar(deoxyriboseforDNA),aphosphategroup,
andanitrogenousbase(aringcompoundcontainingnitrogen),witheach
carbonbeinggivenanumber1
to5
.Nucleotidesformachainwhen
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
HISTORYOFTHEHUMANGENOME
genewhichcontributestotheproductionofinsulinforbreakingdown
glucoseinthebloodhasadifferentformtothenormalformorversion
ofthegene.Otherdifferencesinallelicvaluesprovidenormalvariation
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
computationalpowerthroughthe1990smadeCelerasapproachpossi-
ble.CelerausedjustoneanonymouspersonsDNA,whereastheHGP
requiredcross-checkingwithseveralpeoplesDNA.Also,Celerarepeated
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
GENESANDPROTEINS
forstartingvariouschemicalreactionsinthecell.Proteinstherefore
contributetobiologicalstructureandfunction.Proteinsalsohavethree
otherfunctions:theycancarrysignals,theycantransportmoleculessuch
asoxygenandtheycanregulatecellprocesses,suchasdefencemecha-
nisms.Theprocessbywhichgenesaremadeintoproteinsisstartedby
RNApolymerase
comingintocontactwithachromosomeandidentify-
ingthestartpointofagene.Thesemoleculesopenupthedoublehelix
structuretoexposetheDNAstrandmakingupthegene,andacomple-
mentarycopyofthegeneismadeinthedirectioninwhichthegeneis
meanttoberead.TheprocessofcopyinggenesintomRNAiscalled
,andtheprocessofconvertingthemRNAintoproteiniscalled
Transcriptionstartswiththedouble-helixunwindingFigure1.3(a)and
exposingbasesthatrepresentthestartofagene.mRNAisthenformed,
wherebyacomplementarycopyofthegeneismade.Sincetranscription
proceedsinthe3
to5
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
5'3'
5'3'
To ribosomes
Splice pathway 1Splice pathway 2
Right exonLeft
Figure1.3
Theprocessoftranscription
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
CAAT box
Start pointRest of
TATA box
Figure1.4
Theinitiationstage
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
Theprocessoftranscriptionthereforeresultsina
copyofthegene,butthereisonecomplication.
(cytosine)inDNA
istranscribedas
(guanine),and
.However,while
(adenine)is
transcribedto
isnottranscribedto
.Instead,fortran-
scription,a“fthbasecalled
),whichisfunctionally
identicaltoadenine(
),isused.Faithfulcomplementarybasecopyingis
usedinsteadforanotherprocess,
Coding strand/
Template/DNA
Transcription
Figure1.5
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
GENESANDPROTEINS
anduse
ratherthan
.Therearethereforethreewaysthatagenecan
bedescribed:throughthetemplateorantisensestrand,throughthecod-
ingorsensestrand,andthroughthemRNAthatistranscribedfromthe
templateorantisensestrand.
Spliceosomeandtranscriptome
JustbecauseagenehasbeentranscribedintomRNAdoesnotmean
thatthetaskofmakingacopyofagenehas“nished.Genescontain
codingandnon-codingregions.Theseregionsaredifferentfromthe
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
intron.ItappearsthatinternalsplicingmechanismsrecognizethemRNA
mRNA grouped
into three bases
Free-floating
tRNA
Codons broken
up for reuse of
mRNA bases
Ribosome
Growing polypeptide
Protein folds into
complex structure
tRNA match
against codons
Amino
acid
Figure1.6
Theprocessoftranslationataribosome
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
GENESANDPROTEINS
tomatchthecodonagainsttheirmatchingelement.Ifamatchismade,
theattachedaminoacidisreleasedfromthetRNAandaddedtothe
aminoacidsequence(polypeptidechain)thatisbeingformedfromearlier
matchesagainstthemRNAthathasenteredtheribosome.Forinstance,
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
Table1.1
Secondbase
FirstbaseUCAGThirdbase
UUUUPheUCUSerUAUTyrUGUCysU
UUCPheUCCSerUACTyrUGCCysC
UUALeuUCASerUAAStopUGAStopA
UUGLeuUCGSerUAGStopUGGTrpG
CCUULeuCCUProCAUHisCGUArgU
CUCLeuCCCProCACHisCGCArgC
CUALeuCCAProCAAGlnCGAArgA
CUGLeuCCGProCAGGlnCGGArgG
AAUUIleACUThrAAUAsnAGUSerU
AUCIleACCThrAACAsnAGCSerC
AUAIleACAThrAAALysAGAArgA
apost-translationalprocesswherebyit“rstfoldsintoacomplexthree-
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
GENESANDPROTEINS
Table1.2
The20aminoacidsandtheircommonlyusedabbreviations;eachamino
GlycineGGLYGly
AlanineAALAAla
ValineVVALVal
LeucineLLEULeu
IsoleucineIILEIle
PhenylalanineFPHEPhe
ProlinePPROPro
SerineSSERSer
ThreonineTTHRThr
CysteineCCYSCys
TheRNApolymerasethenunravelstheappropriatepartoftheDNA
doublehelix.Free-”oatingbasesinthenucleusattachthemselvestothe
revealedDNAbasesonthetemplatestrand,formingacomplementary
sequencewhichbecomesthemessengerRNA.Thedoublehelixisre-
formedastranscriptioncontinuesalongtheunravelledDNAmolecule.
WhenaterminatingsequenceofbasesisfoundintheDNA,theresulting
messengerRNA,aftereditingtoremoveintronsandtoformalternative
splicedforms,isdispatchedtotheribosomes,wherecombinationsof
threebasesatatimeinthemessengerRNAareusedbytRNAtopro-
duceoneof20differentaminoacids.Sequencesoftheseaminoacids
(varyinginlengthfromafewhundredtoafewthousand)arecalled
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
Table1.3
Anexampleofageneanditsaminoacidtranslation,takenfrom
http://www.ncbi.nlm.nih.gov/entrez.ThemRNAsequenceisgivenin(a)andcon-
sistsof729bases.ThenameofthegeneisgivenbytheLocus“eldin(c),withits
de“nitionbelow.Auniqueaccessionnumberforthegeneisalsoprovided(Z22865).
Thesourceofthegeneisprovidedintermsofdiscoverersandauthors.TheCDS
valuesindicatethatonlybases13to618resultintranslationafterediting.Hence,
translationstartswiththeaugsequenceatpositions13,14and15in(a).This
1gaauucgggagcauggaccucagucuucucuggguacuuaugccccuagucaccauggcc
61uggggccaguauggcgauuauggauacccauaccagcaguaucaugacuacagcgaugau
121gggugggugaauuugaaucggcaaggcuucagcuaccaguguccccaggggcaggugaua
181guggccgugaggagcaucuucaguaagaaggaagguucugacagacaauggaacuacgcc
241ugcaugcccacgccacagagccucggggaacccacggagugcuggugggaggagaucaac
301agggcuggcauggaaugguaccagacgugcuccaacaaugggcugguggcaggauuccag
361agccgcuacuucgagucagugcuggaucgggaguggcaguuuuacuguugucgcuacagc
421aagaggugcccauauuccugcuggcuaacaacagaauauccaggucacuauggugaggaa
481auggacaugauuuccuacaauuaugauuacuauauccgaggagcaacaaccacuuucucu
541gcaguggaaagggaucgccaguggaaguucauaaugugccggaugacugaauacgacugu
601gaauuugcaaauguuuagauuugccacauaccaaaucugggugaaaggaaaggggcccuc
661cagcuuuccacugcagagaaagugguuguugcuccucgguauauguaaucauaauuguag
721aucgaauuc
MDLSLLWVLMPLVTMAWGQYGDYGYPYQQYHDYSDDGWVNLNRQGFSY
QCPQGQVIVAVRSIFSKKEGSDRQWNYACMPTPQSLGEPTECWWEEINRAGM
EWYQTCSNNGLVAGFQSRYFESVLDREWQFYCCRYSKRCPYSCWLTTEYPGH
YGEEMDMISYNYDYYIRGATTTFSAVERDRQWKFIMCRMTEYDCEFANV
LOCUSHSDERMATA729bp
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
CURRENTKNOWLEDGEANDTHECENTRALDOGMA
1.5Currentknowledgeandthe
'centraldogma'
Whathasbeendescribedsofaristhecentraldogmainbiology:thatone
genebyandlargeproducesonemRNAthatbyandlargeproducesone
protein.Morespeci“cally,thecentraldogmaconsistsofsixaxioms.
1DNAreplicatesitsinformationinaprocessthatinvolvesmanyrepli-
cationenzymes.
2DNAcodesfortheproductionofmessengerRNA(mRNA)during
3Ineukaryoticcells,themRNAisprocessed(essentiallybysplicing)
andmigratesfromthenucleustothecytoplasm.
4mRNAcarriescodedinformationtoribosomes,whereproteinissyn-
thesizedusingthemRNAduringtranslation.
5Proteinsdonotcodefortheproductionofproteins,RNAorDNA.
6Proteinsareinvolvedinalmostallbiologicalactivities,structuralor
Infact,transcriptionandtranslationareevenmorecomplicatedthan
previouslydescribed.Somegeneproductsarenottranslatedatallbut
functionintheirRNAformaftertranscription.Forinstance,genesthat
codefortRNAandribosomescannotbetranslated.Iftheycould,they
woulddependontRNAandribosomesforthetranslation!Instead,af-
tertranscriptiontheseRNAmoleculesexitthenucleusandperformtheir
rolesinthetranslationofnormallytranscribedandtranslatedgenes.This
hasledtotheidenti“cationofseveraldifferenttypesofRNA,themost
commonofwhicharemessengerRNA(mRNA),transferRNA(tRNA),
ribosomalRNA(rRNA)forthebuildingoftheribosomes,andsmall
nuclearRNA(snRNA)thathelpeditthemRNA.mRNAistheprimary
messengersynthesizedfromagenesegmentofDNAandcarryingthe
codeintothecytoplasmwhereproteinsynthesisoccurs.rRNA(ribo-
somal)inthecytoplasmandproteincombinetoformanucleoprotein
(ribosome)thatservesasthetranslationsiteandcarriestheenzymes
necessaryforproteinsynthesis.Severalribosomesmaybeattachedtoa
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
singlemRNAatanytime.tRNA(transfer)containsabout75nucleotides,
threeofwhichformatRNAanticodon,andoneaminoacid.ThetRNA
readsthecodeandcarriestheaminoacidtobeincorporatedintothe
developingprotein.
Also,thereisgrowingevidencethat,formanygenes,therearemany
moresplicevariantsthanpreviouslybelieved,wheredifferentcombi-
nationsofintronsandexonsexist.Itisnotasimplematterofintrons
beingremovedandtheremainingexonsbeingsentsequentiallytothe
ribosomes.Onegenecanproducemanydifferenttypesofpolypeptide
chaindependingonhowmanyintronsareremovedandhowtheex-
onsareshuf”edordifferentlyspliced.Thismeansthat,whilethehuman
genomemayconsistofonly30000genes,thegenesmayproducemany
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
WHYPROTEINSAREIMPORTANT
geneforthecolourofhair,itwillhavemanyforms,eachcodingfora
certainhaircolour.Therewillbetwosuchgenes…oneinheritedfrom
thefatherandtheotherfromthemother.Onemaybedominantand
theotherrecessive,inwhichcasehairwillbeproducedofsimilarcolour
tooneoftheparents,orbothcouldbedominantorrecessive,inwhich
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
GENEANDCELLREGULATION
onrandomcollisionsthousands,ifnotmillions,oftimespersdeepwithin
thenucleus(transcription)andatribosomes(translation),therewould
benocelldifferentiation,sinceeachcellwouldtranscribeandtranslate
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
asgenes.Thequestionofwherethese
elementscomefromwould
inturnrequirestillothergenes,whichrequiretheirown
adin“nitum.Thisstillleavesthequestionofwherethe“rst
comesfrom.Oneapproachistofocusontheabilityofagene,becauseof
itsbiomolecularstructureandcontent,to
withouttheneed
fortranscriptionmechanismssuchaspolymeraseandpromoters.An-
otherapproachistohypothesizethat,atthemomentoffertilization,the
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
SO,WHATISBIOINFORMATICS?
thesemutationsaffectgenesthatarecriticalforthetimingofcelldivision
),whichthenbecomeoncogenesthatinstructacellto
dividerepeatedlywithoutcontrol.Usually,thecellhasothergenesfor
counteringsuchmutations,butiftheseothergenesarealsoaffectedthen
thecellformsa
(amassofcells)thatcontinuestogrow.Somehow
cancercellsachieveimmortalityinthattelomerereductionwitheach
replicationdoesnotappeartoaffecttheabilityofthetumourtogrow.
Manytumoursare
ornon-malignantinthattheydonotposea
danger,aslongasthereisroomforgrowthofthetumour;butifthe
tumourblocksthenormalfunctioningofothercells,orifthetumour
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
INTRODUCTIONTOTHEBASICSOFMOLECULARBIOLOGY
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
FURTHERREADING
2Onegeneononestrandofthedouble-helix(thetemplate)isusedto
makethetranscript.Genesaretranscribedfromthe3
to5
end,and
sothemRNAissynthesizedfromthe5
to3
3mRNAiscomplementarytothesourceortemplatestrand,except
inDNAisreplacedby
inthemRNA.WhenDNAreplicates
JWBK023-01JWBK023-KeedwellMarch23,200510:21CharCount=0
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
IntroductiontoProblemsand
ChallengesinBioinformatics
2.1Introduction
Chapter1providedanoverviewofthebasicsofmolecularbiologyof
relevancetobioinformaticiansandalsointroducedsomeoftheinitial
problemsfacedbyresearchersinthearea.Thischapterexaminescurrent
andfuturechallengesinbioinformatics.Theproblemareasandchal-
lengesarepresentedaccordingtothe“eldofmolecularbiologyinwhich
theyoccur:thegenome,thetranscriptomeandtheproteome.Also,the
recentlyexpandingareaofgenesilencingandinterferencetechnology
willbecovered.
2.2Genome
Sequenceanalysis
Someoftheearliestproblemsingenomicsconcernedhowtomeasure
similarityofDNAandproteinsequences,eitherwithinagenome,or
acrossthegenomesofdifferentindividuals,oracrossthegenomesofdif-
ferentspecies.DNAandproteinscanbesimilarintermsoftheir
ortheirlinearsequenceofnucleotidesoraminoacids.The
fundamentalassumptionforDNAisthattwoDNAsequencesthatare
similarprobablysharethesamefunction,eveniftheyoccurindifferent
partsofthegenomeoracrosstwoormoregenomes.Thefundamental
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
aspossibleshouldbemadetotheoriginalstringssothatoptimalsimi-
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
Table2.1
Atableofinformationindicatingsharedgeneval-
uesacrossfourorganisms.Thegenevaluesareassumedtobe
binaryphenotypicvaluesforthesakeofexpositionalthough
inreallifegenevaluescanbeexpectedtobemuchmorecom-
plex,suchaslongstringsofDNA,aminoacidsormultivalued
phenotypes.0standsforgroundstateand1foradvanced
Gene1Gene2Gene3Gene4
OrganismA0000
OrganismB1000
OrganismC1101
OrganismD1011
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
ABCD
ABCD
ABCD
3
2
ABCD
Figure2.1
HennigArgumentationconsiderstheinformationprovidedbyeachgene
oneatatime
speci“cvalueforGene4(commonancestorforCandD),andthatC
andDsplitfromeachotherthroughtheacquisitionofspeci“cvaluesfor
Genes2and3(commonancestor).
HennigArgumentationissimplebutcanleadtocomplextreelabelling
wheninformationfromgenesinsubsequentcolumnscon”ictswithin-
formationalreadyincludedfromearliercolumns.Thiscaninturnleadto
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
ABCD
ABCD
ABCD
3
2
ABCD
Figure2.2
AnalternativeHennigArgumentation
(rowbyrow)ratherthangenebygene(columnbycolumn),withthe
purposeofminimizingthenumberofstatechangesrequired.The“rst
stepinWagnerTreeconstructionisto“ndtheorganismthathasfewest
advancedstates,where1standsforadvanced.Ahas0valuesacross
allgenesandthereforenoadvancedstates.
Gene1Gene2Gene3Gene4
OrganismA0000
OrganismB1000
OrganismC1100
OrganismD1110
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
B
A
1
BCA
BA
CC
DD
3
2
3
2
2
BA
CD
1100
A
CD
3
2
2
A
Figure2.3
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
1000
BA
�Š
c
a
�Š
c
a
�Š
g
11000
BA
00110
00110
a
�Š
c
a
�Š
c
g
�Š
a
�Š
t
BCDA
BCDA
�Š
c
a
�Š
c
g
�Š
t
A
accg
cccct
aaa
Figure2.4
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
momentthattheprobesareshortfragmentsofeachgenethatcanbe
foundinthegenomeofanorganism,andthatthemRNAorcDNAsam-
plesaretakenfromcellsortissuesofthatsameorganismundersome
condition.IfthesamplesareappliedtotheDNAarrayandstickto
someprobesbutnotothersthroughcomplementarybasepairing,that
tellsuswhichgenesareexpressedinthesampleandwhichgenesarenot
expressedinthesample(Figure2.5).
ThetotalmRNAfromanindividual(cellortissue)isextractedand
puri“ed.SincemRNAdoesnotremainstableforlong,cDNAversions
ofthemRNAarereversetranscribedsothatthemRNAandcDNAform
astablestructure.Thestrandsarethenfurtherampli“edortranscribed
togeneratefurthercDNAormRNA(calledcRNA)strandsbeforebeing
labelled.Typically,samplesfromonecellorindividualarelabelledgreen
andsamplesfromanothercellorindividualredtoallowfordifferential
Total mRNA from
Wash and
Figure2.5
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
theamountofmRNAinasamplecanbereaddirectlyfromtheprobe
cells.DNAprobesandmRNAfragmentsarefartoosmalltobereadin
thismanner.Instead,thearrayorgenechiphastobeconvertedintoa
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
d1
=
2
d2
= 1
d3
= Š2
d4
= 0.5
d5
= 0.5
|d1|
=
2
|d2|
= 1
|d3|
= 2
|d4|
= 0.5
|d5|
= 0.5
a4
=
0.5
a5
= 0.5
a2
= 1
a1
= 2
a3
= 2
Average
= Š4.5
signed ranks
2 1.5
1.5 1.5
1.5 3
1.5 3 4.5 4.5 12
27 1.5 1.5 3 4.5
28 1.5 1.5 3
4.5 4.5 10.5
29 1.5 1.5
3 4.5 4.5 12
30 1.5
1.5 3 4.5 4.5 13.5
1.5 1.5 3 4.5 4.5 13.5
32 1.5 1.5 3 4.5 4.5 15
10.5 are given the weight
signed ranks equal to
10.5 (4) are given the
p
value:
(10.5)
= (1

5 + 0.5

4)/32
= 0.21875
(j) Since the
p
value is not
significant (not below
Figure2.6
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
allowforchecksonnon-speci“ccross-hybridizationinthesample.That
is,outsideofthehumanbodymRNAnucleotidesarenotalwaysguar-
anteedtobindtotheircomplementarybasepairs,duetoheatdifferences
anddegradation,forinstance.Thesemismatchprobesarealsousedto
generateabsolutecallvaluesinthatthefewermismatchesthereare,the
morecon“denceonehasintheaccuracyoftheperfectmatched“gures.
InFigure2.6(2)ageneisprobedacrossseveralprobepairs(typically
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
berememberedthatmRNAlevelsdonotalwayscorrelatewithprotein
levels.ItisnotcurrentlyknownhowmuchmRNAactuallymakesitto
AlternativesplicevariantsofgenesthatarenotmeasuredonaDNA
chipmeanthatagenemaynotbeaccuratelymeasured.Also,DNAchips
cannotidentifypost-translationalmodi“cationsofaprotein.However,
perhapsthebiggestproblemwithDNAchipsconcernscurrentgeneex-
pressionanalysistechniques.Thesheervolumeofdata(geneexpression
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
computerstoanalyse.Ifageneatonetime-stepisrestrictedtoaffecting
only“veothergenesatthenexttimestep,orageneatasubsequent
time-steptobeaffectedbyonly“veothergenesattheprevioustime-
step,thequestionishowtoidentifyjustthesesmallnumbersofaffected
oraffectinggenesfromthehugenumbermeasured.Reverseengineering
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
Totipotent cells
NeuronsGlial cells
Figure2.7
Embryonicstemcells
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
2.4Proteome
Secondaryandtertiarystructureprediction
ProteinsaretheendresultoftranslationofmRNAbyribosomes.Once
proteinsequencesofaminoacidsleavetheribosomestheyfoldincom-
plexwaystoachieveanativestateorconformationinthecell.Thena-
tivestateofaproteinisahighlystablethree-dimensionalstructurethat
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
sequencesthatmakeupthatproteinalsoneedtobestored.Newprotein
sequencesarebeingaddedtoproteindatabasesasaresultofanalysing
mRNAsequences,whereredundantlytranscribedDNA(introns)have
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
Turn
Turn
-helix

Tertiary structure
Figure2.8
Proteinstructure
Forinstance,humanhaemoglobinconsistsoffourseparatepolypeptides
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
AAAAAAAAAA
Turn
Figure2.9
Computervisualizationofasecondarystructure
Therearecurrentlythreeapproachestoproteinfoldingprediction.
Comparativemodelling
(alsoknownasmodellingbyhomologyor
knowledge-basedmodelling)usesstructuraldatafromexperimentally
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
therearealsoexamplesoftwoproteinswithknownstructurewithsimilar
functionwheretheprimarysequenceinformationissigni“cantlydiffer-
entineachsequence.Typicallyathresholdvalueof30percentsequence
identityisrequiredtobeexceededbeforetwosequencesareconsidered
homologousformodelling.Whilethis“guremayappeartobelow,the
argumentisthatthethree-dimensionalstructureofproteinsisconserved
toagreaterextentthantheprimarysequence.Thatis,ahighdegreeof
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
escapedtheattentionofsomeresearchersthatlargeproteinsnaturally
foldwithinsecondsoftranslation,whereascomputermodelstakehours
orevendaystopredictthestructureoflesscomplexproteins.
Proteinfoldingisperhapsthebiggestprobleminbioinformaticscur-
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTERFERENCETECHNOLOGY
chancesof“ndingunique“ngerprintsbegintoworsen!Ideally,itwould
behelpfulifaproteincouldbesequencedinthesamewaythatagenecan
besequenced(throughcomplementarybasepairingtechniques).Amino
acidsdonothavecomplements,however.Peptidesequencingattemptsto
identifytheaminoacidsofaproteinorproteinfragmenteitherbywork-
ingfromoneendofthefragment(
sequencing),oneresidueat
atime,bycuttingtheresiduefromthesequenceandthenusingcomplex
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
affectothergenesthroughproteins,ofhowproteinsaffectotherproteins.
Whilethegenomeisstatic,inthatonceitischaracterizeditcanbeas-
sumedtobeconstant,theproteomeisdynamicandre”ectsthestateof
thecellandtheconditionsunderwhichitsurvives.Someproteinsare
producedonlywhenthecellsenvironmentisstressed(e.g.byheating).
Itispossiblethatthereisaspeci“cstressgeneforthatconditionthatonly
comesonwhenthestressconditionisapparent,butitisalsopossible
thatthecelldealswiththestresseitherbyproducingmorequantityof
aproteinorbymodifyingaproductofanalreadyexpressedgene.One
waytostudytheeffectsofproteinsisthroughknock-outtechnology
thateffectivelysilencesgenes.Ifgenescanbesilencedundercontrolled
conditions,theeffectsoftheabsenceofthegeneontheproteomecan
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTERFERENCETECHNOLOGY
Virusesandtheimmunesystem
Avirusisnotalivingentityorcell,sinceitlacksmanyoftheessential
componentsofacell,suchastranslationmachineryandcellulartrans-
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
Two copies of single stranded negative RNA
Viral
Lymphocyte
Viral DNA
(e) Viral mRNA
(f) Viral polyprotein
Protease
HIV virion with
Figure2.10
ThelifecycleofHIV
onthesurfaceofthecell(Figure2.10(b)).Theviralcontent(RNAand
proteins)isinjectedintothecell,andthereversetranscriptasemakesa
positivecopyofoneofthenegativestrandviralRNAtoformadouble
strand(Figure2.10(c)).Theviralintegrasetakesthedoublestrandinto
thenucleusandsplicesitintothecellsDNA(Figure2.10(d)).Normal
cellularmachinerythentranscribes(Figure2.10(e))andtranslatesthevi-
ralmRNAtoformonelongviralpolyproteinsequence(Figure2.10(f)).
Thethirdviralprotein,protease(Figure2.10(g))hasthetaskofcleaving
theviralpolyproteinintoconstituentparts(newcopiesofviralprotein,
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTERFERENCETECHNOLOGY
(either“xedtospeci“clocationsinthebodyorcirculat-
ingwiththeblood)thatswallowwholepathogensorclearupdebris.
Suchcellsaredirectedtopathogensthroughthestimulationof
(immunoglobulins)inresponseto
andothersubstances
producedbythepathogen.Alsopartoftheinnateresponsearethe
uralkiller
cellsthatdestroycellsinthebodythathavebeeninfected
topreventtheinfectionfromspreading.Iftheinnatesystemcannotdeal
withthepathogen,theadaptivesystemtakesover.Oneimportantpartof
theadaptivesystemconsistsof
(whitebloodcells)binding
approximatelytopathogens.Thiscanresultin
(cellspro-
ducedinbonemarrow)producingantibodiestobringthepathogento
theattentionofmacrophagesandphagocytesfordestruction,orcloning
themselvesinlargenumberswithevenmorespecializedbindingmech-
anismssothattheycaninactivatethepathogensdirectly.Approximate
bindingandcloningbyB-cellsprovidesuswiththeabilitytoidentify
anddealwithanynewpathogen.However,sinceapproximatebinding
andcloningcanleadtotheproductionofB-lymphocytesthatinadver-
tentlyattachthemselvestohealthy
(cellsthatarepartofthe
bodyandnotforeigntothebody),theimmunesystemrequires
T-cells
(cellsproducedinthethymus)toco-stimulateB-cellsonlyifthe
B-cellisnotattachedtoahealthy(non-antigenpresenting)self-cell.This
isparticularlyimportantinthecaseofvirusesthathaveinfectedself-
cells.Suchinfectedcellsproducefragmentsofthevirusontheirsurface
throughtheuseof
majorhistocompatability
.Ifhelper
T-cellsrecognizetheseviralfragmentsonthesurfaceofself-cells,itpro-
ducesaco-stimulustotheB-cellwhichthendestroystheinfectedcell.
OneofthecriticalpropertiesofHIVisthatitattacksthesehelperT-cells
(Figure2.10).Iftheseimmunesystemcellsbecomeinfected,theycan
nolongerprovidetheco-stimulationrequiredforB-cellstowork.The
immunesystemthenbecomessuf“cientlyweakened(AcquiredImmun-
ode“ciencySyndrome…AIDS)thatanypathogenthatwouldnormally
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
pathogensontheirownandwithoutthehelpofothercells.Bothpositive-
senseandnegative-senseRNAareproducedbydifferenttypesofvirus,
andthecellhadto“ndamechanismtopreventtheirexpression.Also,
double-strandedRNAcanbeproducedbyvirusesusingreversetran-
scriptase.Sinceallthreetypesofsequencewerefoundtosilencegenes
inmulticellularorganisms,thecurrenthypothesisisthattheunderly-
inggenesilencingmechanismsre”ectthemannerinwhichsinglecells
preventedinfection.
Thecurrentmodelofinterferenceisthatanenzymecalled
ure2.11(a))takestheintroduceddouble-strandedRNAandcutsitinto
small(20…25bp)sequencescalled
smallinterferingRNA
(siRNA)(Fig-
ure2.11(b)),whichinturn…afterseparatingintosinglestrands…bind
toanRNA-inducingsilencingcomplex(
)(Figure2.11(c)).These
dsRNA
Dicer
siRNA
RISC
(d)Activated
Cut
RISC prevents
translation
at ribosome
Figure2.11
Interferencetechnology
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
SUMMARYOFCHAPTER
becomeactivatedwhenthesiRNAunfolds(Figure2.11(d))andtheacti-
JWBK023-02JWBK023-KeedwellMarch28,200512:56CharCount=0
INTRODUCTIONTOPROBLEMSANDCHALLENGES
3Transcriptomicsisarelativelynewproblemareaarisingfromrecent
technologicaladvancesinDNAarrays(microarraysandgenechips).
Themajorproblemshere,apartfromobtainingthedata,istheanal-
ysisofthedatagiventhelargenumberofgenesmeasuredforacom-
parativelysmallnumberofsamples.Noveltechniquesmayneedto
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
IntroductiontoArti“cial
IntelligenceandComputer
3.1Introductiontosearch
Oneofthemostfundamentaltasksincomputerscienceis
.Many
problemscanbeconvertedintosearchproblems,includingthesimple
problemofaddingtwonumbers,suchas2
2.Thesearchrepresen-
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
representstepsintheproblem-solvingprocess.Oneofthenodesis
uniquelydistinguishedasthestartorinitialstate,andtheremaybeone
ormorenodesthatrepresentthegoalstateorstates.Thetaskofasearch
algorithmisto“ndasolutionpaththroughtheproblemspace,keeping
trackofthestepsfollowedandstatesvisited.Representingproblemsin
computerscienceandbioinformaticsassearchproblemsallowsthefull
C
E
G
F
B
S
7
9
1
1
2
18
72
2
2
2
6
8
8
8
Figure3.1
Agraphrepresentingamap
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
SEARCHALGORITHMS
isshorter.Aftertryinganumberofpossibilities,therouteSBACEFDG
maybearrivedat,whichgivesatotaldistanceof12eventhoughallcities
havebeenvisited.Toverifythatthisisindeedtheshortestroute,other
routeshavetobetried,butnowthereisabenchmarkof12whichcan
beusedtostopfurtherexaminationofparticularroutesifthedistance
is12ormoreandGhasnotbeenreached.
Thesolutiontothistaskisreachedquickly,butnowimaginethatthere
isareal-lifelabelledmapcontaining100ormorecitiesanddistances,
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
Table3.1
AmatrixrepresentationofthegraphinFigure3.1.Eachcityisnumbered
1,A
SABCDEFG
12345678
Ø726ØØØØ
7Ø219ØØØ
22Ø18Ø7ØØ
6118Ø818Ø
Ø9Ø8ØØ22
ØØ71ØØ2Ø
ØØØ822Ø8
ØØØØ2Ø8Ø
threearefollowed.ThisisrepresentedinFigure3.2(a).Thestartnodeis
calledthe
ofthetreeandisatthetoplevel.Atthenextlevelbelow
areallthenodes(A,B,C)thatcanbereachedfromthestartnode.The
treeatthisstageisoneleveldeep,andthetreeatthispointrepresentsthe
routesSA,SBandSC.ThenodesA,BandCare
nodesofS,andS
isthe
nodeofA,BandC,whichare
nodestoeachother.
Describingatalevelbelowallthechildnodesthatcanbereachedfrom
aparentnodeatthelevelaboveiscalled
theparentnode.At
thenextlevelofthetreeallthenodesthatcanbereachedfromA,Band
Carethendescribed(Figure3.2(b)).AandBhavethreechildrennodes
each,whereasChas“ve.Thetreeisnowtwolevelsdeepandrepresents
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
SEARCHALGORITHMS
123
4567891011121314
151617181920212223
...
goal node has been reached
Figure3.2
Orderedexpansionofasearchintoatreesearch
allthenodesthatcanbereachedfromSintwosteps.Forinstance,
theleft-mostpathinFigure3.2(b)describestherouteSAB,whereasthe
right-mostpaththerouteSCF.Itisssumedthat,whengeneratingchild
nodes,thereisnoneedtogobacktothenodewhichisitsparent.The
distancetravelledeachtimethesearchisextendedbyalevelcanalso
becalculated,butanotherpossibilityistogenerateallpossibleroutes
“rstwithoutwastingtimelookingupdistancesinthematrixandthen
calculatethetotaldistancesofallpathsattheend.
ThenextstepintheexpansionisgiveninFigure3.2(c).Onlyapartial
expansionofallnodesatthethirdlevelaredescribed,butalreadyitcan
beseenhowcomplexthesearchtreeisbecoming.Thesearchiscontinued
untilallpathsreachG(allroutesare
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
12
21
15
Figure3.3
Afragmentofthefullbreadth-“rstsearchtreeisshownhere(withdots
incirclesindicatingthattherearemoresiblingaswellaschildnodes
thatarenotrepresentedinthe“gure)
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
SEARCHALGORITHMS
(1followedby25zeros)paths.Evenifonebillionpathspersecond
couldbeprocessedonatrulyfastcomputer,thiswouldstilltakeover
7.5billionyearstocalculate.Generatingallpossibleroutesisclearly
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
346789111213
Figure3.4
Depth-“rstsearchandexpansionorder
thepathcanbeprunedwitharecordmadeofthedistanceforthatroute.
However,depth-“rstsearchesmustbesupportedbyotherchecksthat
ensurethatloopsarenotformed.Forinstance,ifalargedepthboundis
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
SABCDEFG
12345678
Ø726ØØØØ
7Ø219ØØØ
22Ø18Ø7ØØ
6118Ø818Ø
Ø9Ø8ØØ22
ØØ71ØØ2Ø
ØØØ822Ø8
302025155185Ø
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
Whenconsideringsearchproblemssuchasthis,thetotalnumberof
solutionsandtheircostcanbethoughtofasalandscape,withpeaksand
troughsrepresentingcollectionsofgoodandbadsolutions.Mostsearch
problemshavegraduatedpeaks,whichmeansthatifthealgorithmisat
2515
node does not take us
Actual cost of reaching G
=
18
Figure3.5
Simplehill-climbing
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
30
20
25
15
20
20
5
0
5
18
25
5
Figure3.6
Steepest-ascenthill-climbingwithestimatesofdistanceremaining
estimateddistanceawayfromGasD,sothatcanbeignoredalso.Then
Gisfound,whichisthegoalnode,sothatpathistaken.
However,thesearchisnotover.OneroutetoGwithcost18hasbeen
found.Thetaskisto“ndtheshortestdistanceroute,andotherroutes
mustbeexplored,especiallythoserejectedearlierinthesearchdueto
estimateddistancesremainingthatwerelargerthantheparentnode.
However,atleastthereisnowonepathwithanactualcostthatcanhelp
guidetheremainderofthesearch.
Steepest-ascenthill-climbing
isavariationofhill-climbingthatselects
thebestpossiblemoveateachpointandrequiresallchildnodesofa
parenttobegenerated“rstbeforeadecisionistakenastowhichchild
nodetoexpandfurther(Figure3.6).Again,evenaftertherouteSCDG
isfound,thesearchwillneedtoexplorethetreefurthertoseeifthere
areroutesofshorterdistancethan16.
Bothdepth-“rstandbreadth-“rstsearchesexpandonlyonenodeata
time.Anextensiontothisis
beamsearch
,wheretwoormorenodesare
expandedinparallelwiththeotherpathsbeingkeptinthebackground
forsubsequentchecking,ifrequired.Thenumberofnodesexploredin
parallelisgivenbya
beamwidth
.Anexampleofaheuristicbeamsearch
isgiveninFigure3.7.
ThesearchisstartedwithSanditsthreechildnodes.Fromtheesti-
mateddistancesremaining,twonodesAandCarechosenforfurther
expansionatthenextlevel.AhasthreechildnodesandChas“ve.
Threenodesallhaveestimateddistance“veremaining.Sincetwoof
thesenodesarethesame(D),oneischosenarbitrarilytoexpandfurther
withF.Atthethirdlevel,Gisfoundtwice,withdifferentactualdis-
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
25518
18
5
D
18Distance
Figure3.7
Anexampleofabeamsearchwithbeamwidthtwo(thatis,twonodes
areexpandedateachlevel)
tonodesnotfullyexpandedtoexpandthesefurthertoseeifashorter
routeexists.
3.4Optimalsearchstrategies
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
OPTIMALSEARCHSTRATEGIES
costofanotherrouteelsewhereinthesearchtree,atwhichpointajump
ismadetowhicheverpartialroutehasleastactualcostsofar.
Branch-and-boundandA*
Thisprocessofexpanding(branching)nodesandthenjumpingtowhich
everroutehasleastactualcost(bindingtothatroute)iscalled
)search.Anexampleofbranch-and-boundwork-
ingontheexamplegraphisgiveninFigures3.8and3.9.
1313
13169149
1313
91412
13169149
2615916
161412
Figure3.8
Anexampleofbranch-and-boundonthegraphinFigure3.1
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
271626
1016
1113
1616
420
16914
1412
Figure3.9
Conclusionofthebranch-and-boundsearch
Actualcostsofreachingaparticularnodeonaroutearegivennextto
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
OPTIMALSEARCHSTRATEGIES
actualcost,sothesearchisnowboundtothatpath(Figure3.8(c)).SAis
expandedtoSAB,SACandSAD[8],andSAChasthejointlowestcost
withSBACEF.AssumingthatSACisexpandedfurther[9],thelowest
costpathanywhereinthetreeisSBACEF(middleoftree).
TheconclusionofthesearchisgiveninFigure3.9.AfterSAChasbeen
expanded(Figure3.8(c)),thelowestactualcostpartialrouteisSBACEF
(withcost8).ThisisexpandedtoDandG[1],andalthoughGhas
beenfoundwithactualcost15,thesearchmustcontinuesincethereare
manyopennodesthatcouldstillreachGwithlesscost.SinceSABnow
hasthelowestroutecostof9[2],thisisexpanded,asaresultofwhich
SACE[3]isexpanded.ThisresultsinSBEhavingthelowestcostwith9,
andsothesearchisboundthere[4].Assumingthatthechildrennodes
ofSBEresultinrouteswithahighercostthan9,branchingcontinues
withSCAB[5],whichhasthelowestcostanywhereinthetree.Again
assumingthatthechildnodeshavegreatercost,thesearchisthenbound
toSCEF[6],whichalsohascost9.Finally,assumingthatthechildnodes
ofSCEFresultincostshigherthan10atsomepoint,thesearchisbound
backtoSBACEFD[7],andthegraphstruelowestcostpathof12units
ofdistanceisfound.However,allopennodeselsewhereinthetreeon
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
intherightdirection,somedistance-remainingestimatescanbeused
toguidethesearch.TheresultistheA
algorithm,whichisabranch-
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
OPTIMALSEARCHSTRATEGIES
27‡749‡2419‡1425‡719‡14
BDEF
[2]
(a)
27‡7
19‡14
43‡2321‡1616‡16
25‡719‡14
27‡2
21‡6
with cost 16
27‡7
25‡7
19‡14
27‡2
21‡5
34‡1621‡1622‡22
(c)
16‡16
27‡7
25‡7
27‡2
21‡6
39‡14
14‡9
[5]
[6]
[7]
(d)
16‡11
17‡17
40‡2013‡13
16‡16
with cost 13
Figure3.10
The“rsthalfofA
searchingtheexamplegraph
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
BCD
34/923/821/16
G
D
[15]
[13]
[12]
[11]
[10]
[16]
S
BC
E
F
D
(b)
A
35/2027/9
18/13
21/6
25/7
14/9
16/11
13/13
less cost
Figure3.11
The“nalpartofA*search
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
PROBLEMSWITHSEARCHTECHNIQUES
wayofreachingG,and“nallyGisfound[15].SinceSBACEGDGisless
actualcostthanthepreviouslyfoundroutetoG(SCEFDG),thelatter
ispruned[16].Sincetherearenomoreopennodestoexamine,A*has
foundtheshortest(leastactualcost)routefromStoG.
3.5Problemswithsearchtechniques
Therearethreemajorproblemswithanysearchtechniquethatusessome
S
BC
D
G
S
B
C
G
D
E
E
S
S
H
A
A
B
B
C
C
G
G
F
E
E
D
D
BCDE
Plateau
G
A
B
C
D
Actual direction
of search
Ridge
(c)
(b)
(a)
Figure3.12
Problemswithsearchtechniquesusingthedistanceremaining;Sisthe
startnodeandGthegoalnode
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
identicalornearlyidentical(Figure3.12(b)).Noobviousroutecanbe
foundtothegoal.Ridgesoccurwhenthesearchappearstobeheading
intherightdirection,butthefurtheronegoesthemoredistantthegoal
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
COMPLEXITYOFSEARCH
(assumingsomeformofbacktrackingsothatnodesgeneratedearlier
inthesearchtreearerevisitedtobeexpandedlaterinthesearchtree).
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
allthepathsifadepth-“rstsearchorA
isused.Unfortunately,mostof
thereallyinterestingproblems,includingseveralinbioinformatics,have
searchspacesandalgorithmswhichareexponentialinnature.Also,no
attempthasbeenmadetoassessthecostofactuallyrunningthealgorithm
intheabovecases,andthiscostmustalsobeincludedintheestimates
ofhowlonganalgorithmwilltaketo“ndtherightsolution.
3.7Useofgraphsinbioinformatics
ThegraphinFigure3.1providedanexampleofmapsthatcanbe
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
USEOFGRAPHSINBIOINFORMATICS
123567
ACAATG
ACAAATC
AGAATC
ACCGATC
Figure3.13
Agraphasautomaton
madetothenextcharacterofthesequence(C).Instate2,therearetwo
choices,andthelinktostate3isfollowedsincethecurrentcharacteris
C.ThenextcharacterisA,andsoon.Thenumbersbythefoursequences
atthebottomleftofthediagramprovideahistory(trace)ofthestates
entered.Ifacharacterisencounteredthatdoesnotpermittheexitfrom
astatethatisnota“nalstate,theautomatonstopsandarejectanswer
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
Table3.3
TwowaysofrepresentingtheautomatoninFigure3.13.Inthetoptable,
12345678
startA
2CG
3CA
5AT
6GC
goalnode
goalnode
234
352
556
(iv)oneT,followedby(v)eitheraGoraC.Thestatenumbersinthe
automatonareusefulforconvertingthiscomplexexpressionintosix
transitionrules,asfollows:
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
USEOFGRAPHSINBIOINFORMATICS
Thatis,asequencestartswithanAfollowedbystate2(
),according
torule(1).State2iseitheraCfollowedby
oraGfollowedby
,accordingtorule(2),andsoon.Boldfaceisusedtodistinguishthe
statesoftheautomatonfromthecharactersofthestring.Thespecial
startsymbol
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
3.8Grammars,languagesandautomata
Grammarsandautomatacanbeusedtostudythetypesofproblem
thatcanbesolvedwithacomputer.Agrammarisformallyde“nedto
consistofan
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
GRAMMARS,LANGUAGESANDAUTOMATA
non-terminalsymbolontheright-handsideoftherule(i.e.therulesare
rightlinear).Noticealsothattheonlywaytoexittherulesisatsome
pointtorewrite
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
AU
AU
AU
AU
AU
AU
A
NNNNN
N
NNNNNAAAAAAAAAAUUUUUUUUUUNNNNN
Secondary
structure
U
CG
CG
CG
CG
CG
NNNNN
N
NNNNNCCCCCCCCCCGGGGGGGGGGNNNNN
Primary structure
Secondary
structure
G
Figure3.14
LoopedRNAstructurewithequalnumbersofcomplementarybase
pairswithitscomplementbases,andsimilarlyfortheright-hand“g-
ureinvolvingCsandGs.Suchloopingcanformthebasisforremov-
ingintronsfrommRNA,wheretheloopiscutoutandthesequence
NNNNNNNNNNisleftastheexon.Also,non-translatedtranscripts
fortransferRNA(tRNA)canformsuchloops,calledhairpins,which
arerequiredforthecorrectfunctioningoftRNA.
Tocopewitharbitrarylongexpressionswherethereisaspeci“crela-
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
GRAMMARS,LANGUAGESANDAUTOMATA
2
3
U
G
C
G
U
A
5
4
2
3
U
Count
=
count
Š
1
Count
=
0
Count
=
count
Š
1
Count
=
count
+
1
Count
=
count
+
1
Count
=
count
Š
1
Count
=
count
Š
1
G
C
G
U
A
5
4
(a) Finite state automaton
(b) Push-down automaton
Figure3.15
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
is,a“nitestateautomatoncanacceptorgeneratestringsoftheform
(anynumberofXsfollowedbyanynumberofYs),someofwhich
willsatisfytheconstraintthat
bychance(forinstance,threeXs
followedbythreeYs).However,onceitisdecidedtoacceptonlyasub-
ACU
Figure3.16
Alinear-boundedautomaton(LBA)
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
GRAMMARS,LANGUAGESANDAUTOMATA
Twocountervariablesarenowrequiredtoacceptonlythosestringsthat
conformtoL
,thatis,onlystringscontainingone
ormoreAsfollowedbythesamenumberofCsfollowedbythesame
numberofUs.The“rstcountercountsthenumberofAsandthesecond
thenumberofCs.BothcountersaredecrementedwitheveryUfound.
Anystringthatdoesnot“nishinstate3withcount1andcount2equal
tozeroisrejectedbythisLBA.
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
thelanguage,theTuringMachinewilleventuallystopinanacceptstate.
If,however,thestringisnotastringofthelanguage,itispossiblethat
theTuringMachinewillnotstopinarejectstate.Itmayloopfor-
evertryingto“ndawayofacceptingthestring.Itwillnotbeknown
whentoswitchofftheautomaton,sinceweitwillnotbeknownifthe
TuringMachinewillloopforeverorwouldhavefoundtheanswer(which
maybeanaccept)themomentaftertheautomatonisswitchedoff.For
moststringsthatarenotpartofthelanguagetheTuringMachinewill
eventuallyhaltinarejectstate,butoneofthemostimportant“ndingsin
computerscienceisthatitisnotpossibletopredictforwhichnon-strings
theTuringMachinewillhaltinarejectstateandforwhichstringsitwill
loopforever(thehaltingproblem).Thatis,itisnotpossibleto“ndor
runanalgorithmthat,priortorunningtheTuringMachinewiththat
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
CLASSESOFPROBLEMS
anexponentialproblem,butwhenarouteisfounditcanbeveri“edby
simplyfollowingtherouteandseeingiftheroutestartsatSand“nishes
atG(i.e.lineartimedependingonthenumberofcitiesvisited).However,
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
INTRODUCTIONTOARTIFICIALINTELLIGENCE
3.10Summaryofchapter
1Graphsconsistofnodes(vertices)connectedbylinks(arcs).Many
problemsincomputerscienceandbioinformaticscanbeconverted
intoagraphformforsearchpurposes.
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
FURTHERREADING
JWBK023-03JWBK023-KeedwellMarch31,20052:55CharCount=0
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
Part2
CurrentTechniques
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
ProbabilisticApproaches
4.1Introductiontoprobability
Probabilityisthebranchofmathematicswhichisconcernedwiththe
likelihoodthateventswilloccur.Itisaninvaluabletoolforscientists
sinceaneventisoftennotguaranteedtooccur,butcanonlybecharac-
terizedbythefactthatitwilloccuranaveragenumberoftimesgivena
numberoftrials.Thisisalsoofparticularimportancetobioinformati-
ciansastheinteractionsandreactionsthatoccurinbiologicalprocesses
canoftenonlybecharacterizedbyprobabilitytheory.Probabilitytheory
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
insurancecompanyhastodecidewhattheinsurancepremiumshouldbe
tocoverapharmaceuticalcompanythatisplanningtointroduceanew
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
BAYESTHEOREM
coins.Theoutcomeofthesecondeventis
ontheoutcomeof
the“rst.
.Theprobability
sequencethereforeis
,whichis0.0315.If
representthefoureventsoftakingacoinfromthebag,thissequence
canberepresentedas
,where
standsforgiven
ordependingon.Thatis,theprobabilityof
A,B,C
,writtenas
)isnow
),where
...)meansthattheprobabilityof
dependsontheprobability
and...occurringinsequenceorasachainofevents.
4.2Bayes'Theorem
Examiningwhat
)isintheaboveexample,whiletheresultis
anotherwaytoexpressthisisasfollows:
isthejointprobabilityofAandBoccurring.Inotherwords,
forthisexampleabove,
(
A
)
=
9
.
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
Somesimplealgebraicmanipulationthenfollows:
(
A
)

P
(
AB
)
=
P
(
A
)
×
P
(
B
|
A
)

P
(
AB
)
=
P
(
B
)
×
P
(
A
|
B
)

P
(
A
|
B
)
=
P
(
A
)
×
P
(
B
|
A
)
(
B
)

P
(
A
|
B
)
=
P
(
B
|
A
)
×
P
(
A
)
Inotherwords,theprobabilityofanevent
occurringgiventhat
occurredhasbeentransformedintoaprobabilityofanevent
hasoccurred.ThisisBayesTheorem.Moreformally,Bayes
Theoremstates:
isthe
posteriorprobability
ofahypothesis
aftercon-
sideringtheevidence
)isthe
andgivestheproba-
bilityoftheevidence
(theconditionalprobability),
isthepriorprobabilityof
alone,and
)isanormalizingorscaling
constanttoensurethattheposteriorprobabilityaddsupto1.
BayesTheoremisveryusefulwhenreasoningunderconditionsof
uncertainty,sinceitallowsustoreasonaboutthepriorevent
ifa
subsequentevent
FCA
FCA
FCA
Tocalculatethisprobabilitya
probabilitytree
isdrawn,asinFigure4.1,
whichidenti“esallpossibleoutcomes.Itisassumedthatcoinsarenot
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
BAYESTHEOREM
)Second coin American (
Yes
Yes
0.715/14
YesNo
Figure4.1
Aprobabilitytreethatdescribesallpossiblesituationswhenthereare
15coinsinabagofwhich“veareAmericanand10areBritish
replacedinthebagafterbeingextracted.Hence,atthe“rstlevelofthe
treethereisa0.33probabilitythatthe“rstcoinisAmericanand0.66
probabilitythatthe“rstcoinisBritish.Oncethe“rstcoinistakenfrom
thebag,thesecondlevelofthetreedescribestheprobabilitiesofthe
secondcoinbeingAmerican,takingintoaccountthetwopossibilitiesat
thelevelabove.Forexample,ifthe“rstcoinhadbeenBritish(10/15),
thechancesofthesecondcoinalsobeingBritishare9/14.Bayes
Theoremallowsreasoningoverwhatthechancesofthe“rstcoinbeing
AmericanwereifonlyanAmericancoinwasobservedbeingtaken
fromthebagatthesecondstage(asasecondevent).Thisisreasoning
underuncertainty.Theleavesofthetree(bottom-mostlayer)canbe
referredtoas
yes,
yes,
noand
no,readingfromleft-to-right
respectively.
FCA
Thatis
),whichis0.29(thebottomleft-handyesbranch)in
Figure4.1,ismultipliedby
)alone(thetopleft-handyesbranch).
Inotherwords,
meanstheprobabilitytobefoundonthebranch
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
undernode
whichinturnisundernode
intheprobabilitytree.To
)…thedenominator…ismorecomplex.Therearetwo
possibilities:thebottomleftyesbranch,andthethirdfromleftbottom
yesbranch.Thatis,thesecondcoinbeingAmericanhastotakeinto
accountboththatthe“rstcoinwasAmericanandthatthe“rstcoinwas
notAmerican.Inotherwords,
)asdenominatorhastotakeinto
yes)and
yes)isthejointprobability
yes),while
no)isthejointprobabilityof
no).Puttingallthisintotheformulagives:
FCA
FCA
FCA
FCA
FCA
FCA
FCA
Inotherwords,ifthesecondcointhatistakenfromthebagisseentobe
American,thenthereisa31percentchancethatthe“rstcoinwasalso
American(hencea69percentchancethatthe“rstcoinwasBritish).
AnotheradvantageofBayesTheoremisthatitcantakeintoac-
countnewevidence.Considerthefollowingexample.Adrugsmanu-
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
BAYESTHEOREM
Took drugs (
Tests positive (
Yes
YesNo
Tests positive (
0.90.1
0.010.25
0.150.85
YesNo
Figure4.2
BayesTheoremcanalsotakeintoaccountnewinformation(newinfor-
mationinitalics)
havingtakendrugsgiventhattherandomroadsidetestshowedpositive
iscalculatedasfollows:
(
E
)

P
(
=
P
(
×
P
(
(

P
(
=
0
.
9
×
0
.
2
Tocalculate
,takeintoaccountthetwopossibilitiesoftestingpositive
havingtakendrugsandnothavingtakendrugs:
(

P
(
=
0
.
9
×
0
.
2
(
yes)
+
P
(
no)

P
(
=
0
.
9
×
0
.
2
.
9
×
0
.
2)
+
(0
.
15
×
0
.
8)

P
(
=
0
.
18
.
18
+
0
.
12
=
0
.
18
Inotherwords,thereisa60percentchancethatyourfrienddidindeed
takedrugs72hpriortotheroadsidetestandonlya40percentchance
thatsheistellingthetruth.Nowimaginethatnewinformationarrives
thattheroadsidedrugtestwillnowshowpositivefordriverswhohave
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
takendrugs99.9percentofthetimebutthatthenumberofdrug-free
driversshowingpositivehasgoneupto25percent(Figure4.2,new
probabilitiesin
attheleavesofthetree).Itisasimple
mattertoenterthenewvaluesintotheaboveequations:
.
99
×
0
.
2)
+
(0
.
25
×
0
.
8)
=
0
.
198
.
198
+
0
.
2
=
0
.
198
Thatis,thenewinformationincreasesthechancesofyourfriendtelling
thetruthto50…50.
Ifevenmorerecentinformationarrivesthatindicatesthattheorigi-
nalsurveyofthenumberofdriverswhohavetakendrugswaswrong
(20percent)andthattherevised“gureshouldbe5percent(Figure4.2,
italic“guresunder
),thiscanalsobeincorporated:
.
99
×
0
.
05)
+
(0
.
25
×
0
.
95)
=
0
.
05
.
05
+
0
.
24
=
0
.
05
Thatis,thenewinformationnowincreasestheprobabilitythatyour
friendistellingthetruthto83percent.Thisdramaticimprovementin
theprobabilitythatsheistellingthetruthresultsfromthehighfalse
positiverateaswellasreducedchancesthatshedroveaftertakingdrugs.
Finally,BayesTheoremcanalsobeusedtocalculatewhatsortoffalse
positiverateisrequiredtoensurethatdriverswhotestpositiveaftera
randomroadsidedrugtestareatleast80percentcertaintohavetaken
drugs,takingallthenewinformationintoaccount.Tryafalsepositive
rateof6percent:
.
99
×
0
.
05)
+
(
0.06
×
0
.
95)
=
0
.
05
.
05
+
0
.
06
=
0
.
05
Thefalsepositiverateisnotlowenough,sotry2percent:
.
99
×
0
.
05)
+
(
0.02
×
0
.
95)
=
0
.
05
.
05
+
0
.
02
=
0
.
05
.
07
=
0
.
71
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
.
99
×
0
.
05)
+
(
0.01
×
0
.
95)
=
0
.
05
.
05
+
0
.
01
=
0
.
05
Therequirementforsuchalowfalsepositiverateisaconsequenceof
only5percentofdriverstakingdrugs,accordingtothesurvey.
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
released
cured
dies
against
company
dollar
payout
Figure4.3
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
patientdies
drugreleased
drugreleased
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
morescarce,butagoodreferencesourceishttp://zlab.bu.edu/kasif/
0.3
b = 0.15
c = 0.8
b = 0.05
b = 0.8
c = 0.1
Figure4.4
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
S1S2S3
S1[0.9]00.50.5
S2[0.05]0.30.00.7
S3[0.05]00.80.2
parenthesesinTable4.1).Asequenceofstates,suchasS1,S2,S1,S3,
S3,S2,S1cannowbegivenaprobability,asfollows:
(S1,S2,S1,S3,S3,
S2,S1)
0.00324.Thatis,giventhat
thestartofthesequencehasprobability0.9,theprobabilityofthispar-
ticularsequenceofstatesistheproductofthetransitionprobabilitiesof
movingfromstatetostate,asgiveninTable4.1.
Ifonesymbolisattachedtoeachstate(thatis,theMarkovmodel
producesonespeci“edsymboleverytimeitentersastate),suchasa
forS1,bforS2andcforS3,thenthesequenceS1,S2,S1,S3,S3,S2,
S1producesthesymbolsequenceabaccba,againwithprobability
0.00324,usingjustthetransitionprobabilitiesinTable4.1.Whenonly
onesymbolisusedinsteadofastate,theprobabilitiesattachedtothese
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
S2,S1),whichhasprobabilityof0.00324ifonlyonesymbolisproduced
ineachstate,hasactualprobability0.9
0.00009675,where
“guresinboldgivetheprobabilityofaparticularsymbolbeingproduced
inaspeci“cstate.
Inmanycasestheinterestisonlyinthesymbolsequenceandtheprob-
abilityofthatsymbolsequencebeinggenerated,withouttherebeinga
needtoknowthestatespassedthrough.Whenthestatesarehidden,
inthatthesymbolsequenceanditsprobabilityareallthatisrequired,a
simpleformofahiddenMarkovmodel(HMM)results.Moreformally,
anHMMisdescribedasM
),where
isatableof
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
-AGTC
CAG-C
T-G-C
-AG-C
threeprimitivescanonlybeconnectedtoeachotherinthreeways.A
matchstateM
isconnectedtoinsertstateI
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
(b)
(c)
D2D3
D2D3
D2D3
M
A
M
T
C
T
G
C
A
M
G
C
Figure4.5
The“rstpartoftheconstructionofanHMMforthefoursequences
AGTC,CAGC,TGCandAGC
followedtocalculatetransitionprobabilitiesattheendoftheprocess
(Figure4.7).
ThethirdsequenceTGCisentered(Figure4.6(a)).SinceTdoesnot
matchM1,atransitiontoI0isrequiredtoinsertT.TnowjoinsC(from
thesecondsequence)inI0.SincethenextsymbolisGandatransitionto
I1fromI0isnotpermitted,atransitiontoD1tosignifythattheconsensus
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
Begin
D
(b)
A
M
Begin
D
G
C
G
C
T
TC
TC
Figure4.6
The“nalstagesofHMMconstruction
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
Begin
Match
profiles:
Insertion profiles:
Note that 

C 0.01
G 0.01
T 0.01
C 0.01
G 0.97
T 0.01
C 0.97
G 0.01
T 0.01
Trace:
Transition probabilities:
… M
C A G … C
T … G … C
… A G … C
A G C
Figure4.7
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
log0
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
Forinstance,thelogs-oddscoreforCAGTCis:
CAGTC
CAGTC
log0
log
023103…(5
log0
7677928…(5
16367901(usingnaturallogs)
Thatis,theHMMprobabilityofCAGTCis0.023103,andfromthisis
subtracted(withlogs,dividing
becomessubtracting
)“ve
(thelengthofthesequence)timestherandomchanceofthesequence
beingrandom(withlogs,raisingtothepowerbecomesmultiplication).
The“nalresultcanberoundedto3.164.Theabovecalculationwas
undertakenaftertheprobabilityofCAGTCwascalculated,butthelog-
oddsformulacanbeappliedfromthebeginningofasequenceasit
enterstheHMMsothatarunningnaturallogscorecanbemaintained
asthesymbolsareprocessedonebyoneandinsert,matchandtransition
probabilitiesareapplied.
ThreemajoradvantageswithHMMsarethatconservedregionsof
sequences(subsequencesthatoccurinallstrings,suchasthesamegene
acrossanumberofdifferentorganisms,thatcanthereforebeassumed
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
SUMMARYOFCHAPTER
cse.ucsc.edu/research/compbio/ismb99.tutorial.html).Thispagealso
containslinkstootherHMM-relatedpages.
4.5Summaryofchapter
1Probabilityisplayinganincreasinglyimportantroleinarti“cialin-
telligenceandtheabilityofsystemstoreasonunderconditionsof
uncertainty.OfparticularimportanceandinterestareBayesianand
HiddenMarkovModelapproaches,whichhavetheabilitytocalcu-
latetheprobabilityoffutureeventsandsequencesonthebasisofpast
eventsandsequences.Bothapproachesusespecialtypesofgraphfor
representinginformationaboutthedomain.
2AttheheartoftheBayesianapproachistheconceptofconditional
probability,i.e.theprobabilityofaprioreventAhavingoccurred
JWBK023-04JWBK023-KeedwellMarch24,200512:57CharCount=0
PROBABILISTICAPPROACHES
5HMMsareparticularyusefulfordealingwithmultiplealignments,
pro“lesandvariousprobabilisticmodelsofbiologicalsequences.A
numberofalgorithmsnowexistforconstructingHMMsforoptimal
multiplealignment.Therearetypicallytwophasestoconstructing
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NearestNeighbourand
ClusteringApproaches
5.1Introduction
Considereightprostatecancerpatientswhohavehadbiopsiesinaclinic,
withmeasurementoftwospeci“cgenesthroughgeneexpressionanaly-
sis:hepatomamRNAfortheserineproteasehepsin(accessionnumber
X07732)andthec-myconcogene(V00568).Doctorsattheclinicbelieve
theyhaveanovelwayofidentifyingmoreaccuratetherapeuticstrategies
forindividualpatientsbasedonthesegeneexpressionmeasurements.
Eachofthepatientsthenundergoesanindividualizedtherapyregime
consistingofvaryingcombinationsofandrogensuppressionandradi-
ation.Thesetherapeuticstrategiesprovesuccessful,therebyvindicating
therapeuticdiagnosisonthebasisofthemeasurementofthesetwogenes.
Anewpatiententerstheclinicandalsohasabiopsy.Thequestionnow
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
(a)
p
5
p
7
p
6
p
8
p
4
p
2
p
1
p
3
New patient
5
4
3
2
1
X07732
6
5
4
3
2
1
0
?
(b)
p
5
p
7
p
6
p
8
p
4
p
2
p
1
p
3
Mid-point = 3.5
Neutral zone
First
division
4
3
2
1
X07732
6
5
4
3
2
1
0
Figure5.1
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
(c)
p
5
p
7
p
6
p
8
p
4
p
2
p
1
p
3
Mid-point (a) = 3.5
Mid-point (b) = 3.0
Second division (b)
Second
division (a)
4
3
2
X07732
6
5
4
3
2
1
0
(d)
p
5
p
7
p
6
p
8
p
4
p
2
p
1
p
3
Third
division (a)
(Mid-point = 5.5)
Third
division (b)
(Mid-point = 1.5)
4
3
2
1
X07732
6
5
4
3
2
1
0
Yes
Yes
Yes
Yes
3.0?X07732
NoYes
NoYes
NoYes
Figure5.1
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
placeallfourmid-pointtestsatthelevelbelowsothat,byfollowingthe
treeandapplyingthetests,eachoftheeightcasesuniquelyfallsinaleaf
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURAPPROACH
Forexample,theaminoacidsequenceATSLVFWhassecondarystructure
,where
standsforhelix,
112
1002
0000000000000002
110202
1010
102
10000
1012
1000
10002
GPDEANQSTKRHVIM
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
Table5.2
Theconformationpredictiontableforthethreehomologuesincompar-
isonwiththesequencewithunknownstructure
hsc
Residue16
Residue26
159
Residue36
159
Residue46
159
Residue569
Residue669
Residue76
thesequenceswithknownsecondarystructure)is:1
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
residuerowofTable5.2,themaximum(andonly)scorefallsunderthe
column.The“rstresidueofthenewstrandisthereforepredictedto
partakeinahelixconformation.Forthesecondresidue,therearetwo
possibilities,with15for
and9for
Sincethemaximumvalueis15,the
secondresidueispredictedtopartakeinahelixconformation.Applying
thismaximumfunctiontoalltheotherresiduesresultsinaprediction
thatthenewstrandSTNGIYWhasconformation
,i.e.helix,
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
Table5.3
Atableoffourpatientswiththeirgeneexpressionmea-
surementsacross“vegenes.Thegenevaluesareabsentandpresent,
whicharecodedinthetableas0forabsentand1forpresent.
Patients1to4arereferredtoas
inthetext
Gene1Gene2Gene3Gene4Gene5
Patient1
Patient2
Patient3
Patient4
tree.Thisthenmeansthattherewillbedifferentdecisiontreesdepending
onwhichattributesareused.Also,thereisthepossibilitythatbyarbi-
trarilychoosingsomeattributesratherthanotherscertainattributesare
missedthatleadtoclearandseparateregionsofspaceforeachsample.
Forinstance,noisyattributesmaybeusedthatdonotdistinguishthe
samplewellandinsteadprojectthesamplesintoaverytightregionof
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
0.6
Similarity
(
p
1/
p
2)/(
p
3/
p
4)
p
1/
p
2
p
1
p
2
p
3
p
4
p
3/
p
4
Third iteration:
(
p
1/
p
2)/(
p
3/
p
4)=0.4
Second iteration:
(
p
1/
p
2)/
p
3=0.5
(
p
1/
p
2)/
p
4=0.3
p
3/
p
4=0.8
p
3/
p
4 chosen
to form second
cluster
p
1/
p
2=0.8
p
1/
p
3=0.4
p
1/
p
4=0.2
p
2/
p
3=0.6
p
2/
p
4=0.4
p
3/
p
4=0.8
p
1/
p
2 chosen to
form first cluster
Figure5.2
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
Thenextstageistochoosethematchthathasthehighestmatchingco-
ef“cient.Inthisexample,therearetwovaluesof0.8(
Chooseoneatrandom,say
,andformthe“rst
.Thecalcula-
tionofallpairwisematchingcoef“cientsisrepeated,butthistimeusing
asonepatientandtakingpartialmatchesintoaccount.So,the
averagematchingcoef“cientfor
5.Thatis,thereisnomatch(0score)forGene1,sinceboth
havevalue1forthisgenewhereas
has0;thereisa
par-
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
ADVANCEDCLUSTERINGTECHNIQUES
Table5.4
Ageneexpressionsampletable,withgenes
occupyingrowsandsamplescolumns
Patient1Patient2Patient3Patient4
Gene1
Gene2
Gene3
Gene4
Gene5
(0,1,2,3).First,thetableistransposedsothatgenesappearintherows
andsamplesinthecolumns(Table5.4).
Thegenepro“lescanbeplottedonagraph,asgiveninFigure5.3.
Thatis,eachlineonthegraphconnectsagenesvaluesacrossallpatients.
Thenextstageistoidentifyhowsimilareachgeneistoothergenes,given
thepro“lesacrossallsamples.
Euclideandistance
approachtothiscalculationisasfollows:
e
e
2
1234
Gene 5
Gene 1
Figure5.3
Geneexpressionpro“lesplottedonagraph
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
acrossallsamples.Thedifferenceissquaredtopreventnegativevalues.
So,forinstance,forthe“vegenesandadoptingapairwisecomparison
inthe“rstinstance:
(
e
e

Š
2)
(2
Š
1)
(1
Š
0)
(0
Š
1)

Thatis,Gene1hasvalues3,2,1and0acrossthefoursamples(patients)
andGene2hasvalues2,1,0and1forthesamefoursamples.Gene1
valuesconstitutethe“rstitemofeachpairofvaluesintheformula,
andgene2valuesconstitutetheseconditemofeachpair.Adoptingthe
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
ADVANCEDCLUSTERINGTECHNIQUES
Š
2
.
5)
(2
Š
1)
(1
Š
0)
(1
Š
1
.
5)

+
1
+
1
+
0
.
25
=
1.58
d
(
g
4
,
g
,
3
=

Š
2
.
5)
(0
Š
1)
(1
Š
0)
(0
Š
1
.
5)

.
25
+
1
+
1
+
2
.
25
=
2
.
55
d
(
g
5
,
g
,
3
=

Š
2
.
5)
(0
Š
1)
(3
Š
0)
(3
Š
1
.
5)

Themostsimilarpairattheendoftheseconditerationis
1with
(withvalue1.58).Thisbecomesanewprotogene,withnewaverages
calculatedforthisprotogene
byaddingthevaluesofallthree
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
Š
2
.
75)
(0
Š
1
.
5)
(1
Š
0
.
5)
(0
Š
0
.
75)

.
06
+
2
.
25
+
0
.
25
+
0
.
56
=
2
.
47
d
(
g
5
,
g
,
2
,
3
=

Š
2
.
75)
(0
Š
1
.
5)
(3
Š
0
.
5)
(3
Š
0
.
75)

Themostsimilarpairnowis
4with
.Thisgivesanewprotogene
withvalues0.5,0,2and1.5.The“nalstepistocomparethetwo
.
5
Š
0)
(0
Š
0)
(2
Š
3)
(1
.
5
Š
3)

.
25
+
0
+
1
+
2
.
25
=
1
.
87
.
4.Butpatientsamplescanalsobeclustered.ApplyingtheEuclidean
Table5.5
ThetransposedversionofthedatainTable5.4,inprepara-
tionforpatientclustering
Gene1Gene2Gene3Gene4Gene5
Patient13
Patient22
Patient31
Patient40
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
ADVANCEDCLUSTERINGTECHNIQUES
Therearetwoclustersof
1with
2,and
3with
4withvalue
2.65.Formsuperpatientsfromeachoftheseandcalculatethenew
0)and
0).These
twosuperpatientshavesimilarity:
The“nalclusteringsforbothgenesandpatientsareprovidedinFigure
belongtooneparticularclass,suchasbeingprostatecan-
cersufferers,whereas
belongtoanotherclass,suchasnormal,
theclusteringsinFigure5.4(b)wouldindicatethatGene5separates
asapairfrom
,andthatGene1andGene4further
(perhapsseverityofthecancer).
Gene 3
Gene 1
Gene 4
Gene 5
Gene clustering
Gene clustering
Gene expression values
Gene expression
4
p
3
p
2
p
1
0123
Figure5.4
(a)Representstheresultsofclusteringjustthegenes,whereas(b)rep-
resentstheresultsofclusteringthesamplesaswell.Also,eachgene
expressionvaluehasbeengivenadifferentshadesothatavisualrepre-
sentationofgeneclusteringisobtained
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
SUMMARYOFCHAPTER
techniquesintheshorttimethatithasexisted.Clusteringtechniques
inparticularareprobablyoneofthemostwellusedtechniquesin
problemareaswherethedatahashigh-dimensionality(forinstance
ingeneexpressionanalysis).Theycanbeusedintheirownrightto
JWBK023-05JWBK023-KeedwellMarch31,20052:56CharCount=0
NEARESTNEIGHBOURANDCLUSTERINGAPPROACHES
(thatis,datathatisnotmeasuredovertime)or,iftemporalgene
expressiondataisavailable,coregulatedgenes(thatis,genesthatare
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
Identi“cation(Decision)Trees
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
Thetaskofclassi“cationisonethatisprominentinalargenumberofap-
plicationareas.Itisessentiallythetaskofcreatingrulesorstructuresthat
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
Table6.1
WeatherLightGroundconditionUmpiresdecision
SunnyGoodDryPlay
OvercastGoodDryPlay
RainingGoodDryNoplay
OvercastPoorDryNoplay
OvercastPoorDampNoplay
RainingPoorDampNoplay
OvercastGoodDampPlay
SunnyPoorDryPlay
Fromthisdata,anidenti“cationtreecanbeconstructedthatcanshow
whicharetheimportantfactorsinmakingthedecision.Itisobvious
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
Noplay
Poor:yieldsfourexamples,oneofclassNoplayandthreeof
Noplay
Noplay
Noplay
NoticethatnoattentionispaidtotheothertwoattributesWeather
andGroundconditionwhentestingtheeffectivenessofLight.The
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
6.2Gaincriterion
Thegaincriterionisbasedontheamountofinformationthatatestonthe
dataconveys.Thisinformation-theorybasedapproachhasbeenshown
tobemoreeffectivethanasimpletallyofthenumberofindividuals
Theinformationconveyedbythisisthencomputedas…log
oftheprob-
ability.Thisgives:
Thisequationthereforecomputestheinformationconveyedfromeach
T
|

log
freq
(
C
T
)
T
|

.
(6.3)
T
|

log
freq
(
C
T
)
Thegaingivenbyaparticulartestcanbegivenbysubtractingtheresult
ofEquation6.4fromEquation6.3:
Thissoftwareanddocumentationisavailablefromhttp://www.rulequest.com.
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
GAINCRITERION
Theidenti“cationtreealgorithmproceedsthrougheachfeature,com-
putingthegaincriterionforeachfeature,selectsthebestofthese,and
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
WeatherLightGroundconditionUmpiresdecision
OvercastGoodDryPlay
OvercastPoorDryNoplay
OvercastPoorDampNoplay
OvercastGoodDampPlay
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
GAINCRITERION
SunnyRaining
GoodPoor
Weather
Figure6.1
The“naltreegeneratedbyexecutingtheidenti“cationtreealgorithmon
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
Gainratio
ThegainratioseeninQuinlan(1993)isamoresophisticatedversionof
itsforerunner,thegaincriterion.Thedif“cultywiththegaincriterion
T
|

log
|
T
Thisgivesa“nalgainratioequation:
(
X
)
.
(6.7)
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
OVERFITTINGANDPRUNING
circumstance,thegaincriterionvaluesremainunchangedforattributes
lightandgroundcondition,andtheattributeweatherisstillchosen
asthe“rstsplitinthetree.However,itsin”uencehasbeenreduced(from
0.5usinggaincriterionto0.33usinggainratio)sotheeffectonthegain
criterionexertedbythefactthattheattributesplitsthedataintomore
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
forreductiontoaleafwheretheleafclassi“cationisthemostfrequent
2theleafthatisreplacingthecurrentsubtree.
Ifrestrictedtothetrainingdata,thecurrentsubtreewillhavethefewest
errorseverytime,sosomemeasuremustbemadeoftheexpectederror
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
OVERFITTINGANDPRUNING
theresultsthatareobtained.Inadditiontothis,itisimportanttonote
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
APPLICATIONGUIDELINES
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES

Figure6.2
AnexampleofN-foldcross-validation
theapproachthatcanbeexpectedonnon-trainingdataandisespecially
usefulwheretheamountofdataisrestricted,whichisoftenthecasein
problemsinbiology.
Dataminingisbigbusinessandidenti“cationtreesoftwareformsarea-
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
BIOINFORMATICSAPPLICATIONS
describedinthisbook.Ifamorecut-downidenti“cationtreesoftware
packageisrequired,thenSee5(Windows)orC5(UNIX)
,developedby
RossQuinlan,representsaneatandef“cientimplementationoftheal-
gorithmsdiscussedhere.Inadditiontothis,See5iskeptup-to-datewith
thelatestadvancesinthe“eldsothatitincorporatesnewfeaturessuch
asboosting,cross-validationandfuzzythresholds.Ifasimpleand
quickalgorithmimplementationisrequiredwithrudimentaryvisualiza-
tionofresults,thenSee5ishighlyrecommended.Analternativetothis
packageisCART
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
P4P3P2P1P1'P2'P3'P4'
S4S3S2S1S1'S2'S3'S4'
aa2aa3aa4aa5aa6aa7aa8
Figure6.3
The“nalmaturationphaseofHIV
oftheviralgagandgag-polpolyproteinstoyieldthestructuralproteins
andenzymesofthevirusforfurtherinfection.
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
BIOINFORMATICSAPPLICATIONS
frombindingtoitssubstrate.Inhibitorsmustbecarefullyandspeci“cally
designedsothattheydonotaffectthenaturallyoccurringproteasesin
thehumanbody.
Asigni“cantamountofpotentialcleavagesitedataforHIVandHCV
hasbeenproducedthroughlaboratory
invitro
experiments,wherethe
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
BIOINFORMATICSAPPLICATIONS
Classificationofcancerbyusingdiagnosisdata
Agooddealoftheclassi“cationproblemsinbioinformaticsdataarere-
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
IDENTIFICATION(DECISION)TREES
includedthebasicstructurethatisemployedinC4.5.Theoriginalidea
QuinlancreditstoHovelandandHuntand
conceptlearningsystems
which,asmanyofthenotionsinthisbookseemtohavebeen,were
createdinthe1950s.
SinceQuinlandesignedC4.5andSee5,therehavenotbeenanysig-
ni“cantparadigm-shiftsinthewaythatdecisiontreesoftwareworks.
Therehasbeenanexplosioninthenumberofsoftwarepackageswhich
useidenti“cationordecisiontrees,butthetheorybehindthemremains
muchthesame.Advanceshavecomeintheshapeofimprovementsto
thewaythatsplitsareevaluated,howtheresultsarevisualizedandad-
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
Quinlan,J.R.(1993)
C4.5:ProgramsforMachineLearning
,MorganKauffman,San
Mateo,California.
Narayanan,A.,Wu,X.andZhang,R.Y.(2002)Miningviralproteasedatatoextract
cleavageknowledge.
,Suppl1,5…13.
Selbig,J.,Mevissen,T.andLengauer,T.(1999)Decisiontreebasedformationofcon-
sensusproteinsecondarystructureprediction.
,1039…1046.
JWBK023-06JWBK023-KeedwellMarch31,20053:1CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
Sigmoid function
(b)
�If (activation threshold)
output = 1 Else output = 0
1
1 + e

(activation)
Figure7.1
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
Figure7.2
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
Figure7.3
Thearchitectureofaself-organizingfeaturemap:themapitselfforms
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
t
)
OR
t
)
If output is 1, required = 0
Š

i
Š

i
t
)
If output is 0, required = 1
+

i
+

i
Figure7.4
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
t
)
w
x
+ 
+ 
w
+ 
+ 
Figure7.5
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
Fully connected
output layer
Full connection
18 training
00110000
...
12345678
9
8
7
5
4
3
2
12345678
9
8
7
6
5
4
2
(00000000)(00000000)(00110000)(00110000)(...)(...)...
Figure7.6
AnidealizedKSOM
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
APPLICATIONGUIDELINES
Unsupervisedlearningisespeciallyusefulwheredatahasbeencol-
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
BIOINFORMATICSAPPLICATIONS
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
BIOINFORMATICSAPPLICATIONS
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
BIOINFORMATICSAPPLICATIONS
oftenaccurateonthetrainingdata,buthavedisappointingaccuracyon
testdata.Thenotionisthatthenumberofgenesisreducedtothepoint
thatitcanbefruitfullyanalysed(incombinationwiththeweightswhich
helptoindicatetheimportanceofageneintheclassi“cation)butnot
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
SUMMARYOFCHAPTER
7.5Summaryofchapter
JWBK023-07JWBK023-KeedwellMarch31,20053:2CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Allele
Figure8.1
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Figure8.2
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Cross-over point
Parent 1
Parent 2
Offspring 1
Offspring 2
Cross-over
operation
Figure8.3
Singlepointcross-over
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Figure8.4
Uniformcross-over
uniformcrossoverwheretwochromosomespassthrougha“lter,where
thegreysquaresindicatethataswapofallelestakesplaceandthe
whitesquaresindicatethatthevaluespassthroughuntouched.Uniform
crossoverensuresthateachgeneonthechromosomehasanequalchance
ofbeingcrossedoverandrepresentsacrossoverwithoutpositionalbias.
Crossoveroperatorsthereforeensurethatmaterialtakenfromtwo
selectedparentsismergedinsuchawaythattheinformationthatcon-
tributedtotheparents“tnesswillbekeptintheoffspring.Thecrossover
operatorisstochasticasthecrossoverpointisselectedatrandom.One
possibledif“cultyisthat,forcertaintypesofproblem,crossovermay
resultinachromosomethatisnotpermittedforthetaskathand.For
instance,ifsomeallelicvaluestowardstheendofachromosomedepend
onallelicvaluesearlierinthechromosome(suchas,forinstance,aroute
“ndingproblemwhereonlyoneoccurrenceofacityornodeinamap
canoccurinachromosome),crossoverwillneedtobesupplemented
bysomecheckprocedurethatensuresthattheresultsofcrossoverstill
makesense.Somepostcrossoverproceduremayberequiredtomendthe
resultsofcrossoversothatsolutionsstillfallinthespaceofacceptable
solutions.Onceselectionandcrossoverhavetakenplace,thesolutions
aremutated.
Selectionandcrossoverensurethatthebestindividualshavethegreatest
chanceofprogressingintothenextpopulationandcansharetheirin-
formationtogivethebestpossiblesolutions.Bothprocessesincludean
elementofrandombehaviourbutafurtherrandomelementisrequiredto
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Generation 1
N
Generation 2
N
cross-over and
Figure8.5
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Cross-over and
Figure8.6
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
DV1DV2DV3DV4SumFitness
Chrom1
21441
Chrom2
416
749
Chrom32
1012
525
Chrom4
24576
Chrom50
758636
Chrom6
238
900
Chrom7624315225
Chrom8
749
Chrom94
Chrom101
19361
Average176.3
Ifbychancechromosomes4and10wereselectedandcrossedoverat
point2,theresultingchildrenwouldbecreated
Child1
30900
Child21
13169
Amutationcanthenoccurchangingthevariable1inchild2toa3which
Child2
11121
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Allele1Allele2Allele3Allele4SumFitness
Chrom1
21441
Chrom2
416
749
Chrom32
1012
525
Chrom4
24576
Chrom50
758636
Child1
Chrom7624315225
Chrom8
749
Child23
Chrom101
19361
Average278.3
ByusingjustonegenerationoftheGA,theaverage“tnessofthepopu-
lationhasrisenfrom176to278.Inadditiontothis,anindividualwith
a“tnessof900hasbeencreated,byfarthehighestfromthetwopop-
ulationsandalsomuchhigherthanthatofitsparents.Thisexample
illustratestwokeyconceptsoftheGA.
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
SolutionWeightStrength
1452.2
2301.5
3251.0
Adecreaseinweightisaccompaniedbyadecreaseinstrengthand
thereforeeachofthesesolutionsdoesnotdominateanyother.However,
ifafourthsolutionisincludedwhichhasaweightof22andastrengthof
2.5,thissolutiondominateseachofthe“rstthreesolutions.Thisfourth
solutiondominatestheotherthreebecauseitisatleastasgoodinevery
1020
Weight
304050
New solutions
Figure8.7
Aircraftdesignexample
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
APPLICATIONGUIDELINES
1020
Weight
304050
Rank 1
Figure8.8
Extendedaircraftdesignexample
meansthatinthecurrentpopulation,itisthebest(least-dominated)
solution.Thereisoftenmorethanonesolutionwithrankzeroandthis
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
APPLICATIONGUIDELINES
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
oftheseinteractions,takingasingleexpressionmeasurementatapartic-
ularmomentintimeasthebasisforastudywillnotnecessarilyyieldthe
requiredresults,sincethegeneexpressionofindividualsisadynamic,
notstaticphenomenon.Themeasurementofanindividualsampleover
timeprovidesuswiththerawinformationrequiredtodecipherwhich
genesaresubsequentlyaffectingtheexpressionlevelsofothergenesor
eventhemselves.Thisprocessisnotanexactscience;genesexpressedat
onetimestephaveanindirectaffectonothersthroughproteinproduc-
Time0isONTHENgene
Time1isOFF
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Soallthecombinationsofselecting“ve(
)elementsfromatotalof10(N)
is252.Thatis,ifthedataconsistsofmeasurementsof10genesandweas-
sumethatonly“vegenesatonetimestepaffectageneatthenexttimestep,
thereare252possiblecombinations.Select“vefrom100,though,and
thenumberofpossiblecombinationsrisesto75287520;select“vefrom
10000(atypicalgeneexpressionexperiment)andthenumbersbecome
unmanageable.Thetotalnumberofsolutionscannotbesearchedexhaus-
tivelytoguaranteeoptimality.Thisproblemisthereforeverydif“cultto
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Figure8.9
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
Figure8.10
Geneandweightpairschromosomerepresentationofageneregulatory
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
REFERENCESANDFURTHERREADING
8.6Summaryofchapter
JWBK023-08JWBK023-KeedwellMarch28,200514:21CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
Part3
FutureTechniques
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
Figure9.1
Asmallparsetreeusingtheintegersonetotwoandthestandard
mathematicaloperators;theresultofthistreecanbecomputedas
24((5
Figure9.1showsanexampletreewhichcouldbecreatedfromsimple
mathematicaloperatorsandtheintegersoneto“ve.Theinitialization


Figure9.2
Thegrowinitializationoperator…notethatateachiterationanyoper-
atororterminalcanbeaddedtothetreeandthattheresultingtreedoes
notnecessarilyhaveanevennumberofleaves;theresultofthistreecan
becomputedas15(3
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
Figure9.3
Thefullinitializationoperatorwithmaximumdepth2…ateachitera-
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
Yes
Arity 2Arity 2
Arity 0Arity 0Arity 0
5341
Figure9.4
MutationinGPisalsodifferentfromthatinGAs,againduetothe
maintenanceoftreestructure.InGAs,mutationcanoccurbysimply
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
5 5 / 2 4
Figure9.5
ConvertingparsetreestoPolishnotation
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
APPLICATIONGUIDELINES
variables(terminals)relatingtotheproblemaspossible.Generallyspeak-
ing,randomconstantsarealsoaddedasterminalstoallowGPagreater
freedominderivingequations.Thenumberandrangeoftheseconstants
willdependontheproblemitself.
Operatorselection
Operatorselectioncanbeatrickybusinessasthemorepowerfulopera-
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
BIOINFORMATICSAPPLICATIONS
andanimals.Whilstthelaboratorytestingstageofadrugisrelatively
cheap,itcanalsobefrustratingastherewillexistmanycompoundswith
similarpropertiesthathavetobetested.Pharmaceuticalcompaniesare
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
BIOINFORMATICSAPPLICATIONS
HTH-containing…whichwerelearntfromexistingfunctionalknowl-
edge.ThetaskforGPwastoassigngenestothecorrectclassesusing
theirgeneexpressionpro“lesoverthe79experiments.
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-09JWBK023-KeedwellApril1,20050:32CharCount=0
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CellularAutomata
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
elements,sinceeachcellcanbeoccupiedbyoneoftheelementsbyamod-
i“cationofitsstate(atitsmostsimple,occupiedornot-occupied).Inthis
fashiontwoadjacentoccupiedelementscanbethoughtofasreactingor
nei
Moore
nei
Figure10.1
TwopossibleneighbourhoodsforuseinaCA
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
eachofthecellswhichmakeupthegridinwhichthegliderisseen.The
Time
Time
Time
Time
Time
Figure10.2
AgraphicaldepictionofConwayslifeapplicationofCA
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
ofsystemsthatdependnotonlyonthecurrentstateofthesystem,but
thestateofthesysteminsomepreviousgeneration.Itisalsopossible
toincreasethedimensionsofthegriditself,yieldingthree-dimensional
systemsorevenfour-dimensionalwhentimeisconsidered.
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
APPLICATIONGUIDELINES
2Eachcellmustcontainonlyafewbitsofdataandconsistofa“nite
numberofstates.
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
BIOINFORMATICSAPPLICATIONS
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
Figure10.3
TheextendedvonNeumannneighbourhood
valuesforresultsoftheruns.Initialvelocitieswerefoundtovarywith
respecttothesubstrateconcentrationinaccordancewithMichaelis…
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
BIOINFORMATICSAPPLICATIONS
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
Register4
.Distributionoflocalmomentum,basedonthehardsphere
collisionmodel.
Register5
.Potentialenergystatusofthemoleculesonthecurrent
site.Thiswascomputedasafunctionoftheattraction/repulsionof
moleculesbothonthecurrentsiteandthoseintheneighbourhood.
Register6
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
transitionrulesbutcrucially,duetotheirlimitedlocalin”uenceanddis-
Insummary,theseexperimentscon“rmedwhatwasknownexperi-
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
CELLULARAUTOMATA
NeumannonthesuggestionofStanUlamtoprovideamorerealistic
modelforcomplexsystems.InfactvonNeumann,inadditiontobeing
aphysicistandmathematician,wasactuallymoreinterestedinthere-
ductionistbiologicalapplicationsseeninthisbook.Sincetheirinception,
CAhavebeenusedinmanyofthehardscience“eldssuchasphysics
and”uiddynamics,aswellasfascinatingmathematicians.Sowhilethey
maintainapedigreeofbeinghighlyintertwinedwithbiologyfromthe
verybeginning,theyhaveperhapsnotbeenusedasmuchaswouldhave
beenexpectedintherecentexplosioninbioinformaticsapplications.
10.5Summaryofchapter
1Cellularautomataconsistofagridofcellswhichcanadoptanumber
ofstates.
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
REFERENCESANDFURTHERREADING
Siehs,C.,Oberbauer,R.,Mayer,G.
JWBK023-10JWBK023-KeedwellMarch31,20053:5CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
IntelligentBioinformatics
EdwardKeedwellandAjitNarayanan
2005JohnWiley&Sons,Ltd
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
Figure11.1
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
Classi“cationEnumerationConversio
Acutelymphoblasticleukaemia11,0,0
Acutemyeloidleukaemia20,1,0
Unknown30,0,1
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
AC=P-2
AC=P-3
AC=P2
AC=A3
AC=A-3
�-normal
TestError:1:9
Rule:2796
AC=P-2
AC=P3
AC=A1
AC=P-1
AC=A2
�-myeloma
TestError:1:31
whereA
absentandP
present.Thereforeapositiveweightindicates
thatthisattributeinitsindicatedstate(presentorabsent)isrequired
toclassifythedata.Largerweightsindicatemorein”uenceontheclas-
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
REFERENCESANDFURTHERREADING
isoftentemperedbythefactthatthenatureofhybridalgorithms,and
particularlytheirhighdegreeofspecialization,meanthattheproblem-
independencewhichthesingletechniquespossessdoesnottransferto
thehybridtechnology.Nevertheless,hybridapproachesarebecoming
moreandmorepopularasthenumberofresearchersusingthestandard
techniquesincreases.
11.7Summaryofchapter
JWBK023-11JWBK023-KeedwellMarch31,20053:54CharCount=0
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
Note:FiguresandTablesareindicatedby
italicpagenumbers
,footnotesby
suf“xn
algorithm80…3
timecomplexity84
abinitio
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
bioinformaticsapplications
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
complementaryDNA(cDNA)41…2
complexityofsearch84…6
conceptlearningsystems170
conditionalprobability
comparedwithfrequentist
probability103…4
comparedwithjointprobability
confocalarrayscanners44
conformationmatrix133
conformationpredictiontable
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
falsepositives,andBayesTheorem
“nite-stateautomaton(FSA)89,91,
comparedwithpush-down
,94
“tnessevaluation
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
selectiontechniques225
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
identi“cationtreealgorithm150…1
exampleofuse153…4,
identi“cationtrees147…71
advantages159…60
applicationguidelines160…3
background169…70
bioinformaticsapplications
classi“cationofcancer167…8
HIVandHCVproteasecleavage
prediction163…6
secondaryproteinstructure
prediction168…9
cross-validationofdata161…2
disadvantages157…9
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
mRNA(messengerRNA)11,21…2
multi-layerperceptrons174
“rstdiscussed192
learningrule180…1
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
parsimonyprinciple(inphylogeny)
34…5,39…40
pathogens60
cancer-causing27
peptidesequencing57
peptide-mass“ngerprinting56…7
perceptrons174
“rstdiscussed192
learningrule179…80
perfect-match/mismatchstrategy44
phagocytes(scavengercells)60
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0
ROC(receiveroperating
characteristics)convexhull
approach233
JWBK023-INDJWBK023-KeedwellApril1,200520:53CharCount=0

Приложенные файлы

  • pdf 24056195
    Размер файла: 1 MB Загрузок: 0

Добавить комментарий