2004/12/13 Maximum-Density Segment @ EE.NTU
An optimal algorithm for identifying a maximum-density segment
Hsueh-I Lu (呂學一), Institute of Information Science, Academia Sinica
http://www.iis.sinica.edu.tw/~hil/
What do algorithm people do?
Inventing efficient recipes to solve combinatorial problems
A famous combinatorial problem
The Factorization Problem
– Input: a number N
– Output:
  “yes” if N is a prime number;
  a factorization of N if N is not a prime number.
– For example, N = 323264989793317.
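To make the input/output convention above concrete, here is a minimal sketch using plain trial division. This is only the naive method, not an efficient recipe: its running time grows roughly with the square root of N, which is exponential in the number of digits. The function names `factorize` and `answer` are made up for this illustration.

```python
def factorize(n: int) -> list[int]:
    """Return the prime factors of n (with multiplicity) by trial division."""
    if n < 2:
        raise ValueError("n must be at least 2")
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1 if d == 2 else 2  # after 2, try odd candidates only
    if n > 1:
        factors.append(n)  # whatever remains is prime
    return factors


def answer(n: int):
    """Follow the slide's convention: "yes" for a prime, else a factorization."""
    factors = factorize(n)
    return "yes" if len(factors) == 1 else factors
```

For a 15-digit input such as the example N this loop is still feasible, but for the challenge numbers on the following slides it is hopeless.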
OPEN QUESTION
Is there an efficient recipe for the
Why Factorization?
The security of many encryption schemes
is based upon the assumption that the
RSA factorization challenges
Challenge Number   Prize ($US)
RSA-576            $10,000
RSA-640            $20,000
RSA-704            $30,000
RSA-896            $75,000
RSA-1024           $100,000
RSA-1536           $150,000
US$10,000: RSA-576
1881988129206079638386972394616504
3980716356337941382700763356422988
8597152346654853190606065047430453
1738801130339671619969232120573403
1879550656996213051687593076502570
59
RSA-576 factored on December 3, 2003
3980750864240649373971255005503864
9119906436234252670840638518957594
6388957261768583317
4727721461074353025362230719730482
2463291469530209711645985217113052
0711256363590397527
At the same time, Adi Shamir gave two
US$20,000: RSA-640
3107418240490043721350750035888567
9300373460228427275457201619488232
0644051808150455634682967172328678
2437916272838033415471073108501919
5485290073377248227835257423864540
14691736602477652346609
US$200,000: RSA-2048
25195908475657893494027183240048398571429282126204
03202777713783604366202070759555626401852588078440
69182906412495150821892985591491761845028084891200
72844992687392807287776735971418347270261896375014
97182469116507761337985909570009733045974880842840
17974291006424586918171951187461215151726546322822
16869987549182422433637259085141865462043576798423
38718477444792073993423658482382428119816381501067
48104516603773060562016196762561338441436038339044
14952634432190114657544454178424020924616515723350
77870774981712577246796292638635637328991215483143
81678998850404453640235273819513786365643912120103
97122822120720357
Short of cash?
RSA 2003 (April ’03)
2002 Turing Award (June ’03)
The awarded paper
Only 7 pages.
– “A Method for Obtaining Digital Signatures
and Public Key Cryptosystems”,
Communications of the ACM 21, 120-126,
“PRIMES is in P”
Agrawal, Kayal, and Saxena
PRIMES is in P
The PRIMES problem:
– Input: a number N.
– Output:
“yes” if N is a prime number.
“no” if N is not a prime number.
Only 9 pages!
Running time is O(n^12), where n is the
NEW YORK TIMES, Aug. 8, 2002
Previous algorithmic results that caught
the attention of the New York Times
– 1984, Karmarkar’s algorithm for solving
linear programs.
– 1979, Khachian’s algorithm for solving linear
programs.
The latest version (v.3) of AKS’s paper
The running time is now improved from
What do algorithm people do?
Looking for important/interesting
combinatorial problems
Coming up with efficient recipes to solve
Bioinformatics
Finding a DNA segment with Max
GC-density in linear time
WABI J. Comput. Sys. Sci.
ESA SIAM J. Computing
DNA Sequences
[Chargaff and Vischer, 1949]
– DNA consisting of A, G, T, C
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T)
[Vischer, Zamenhof, Chargaff, 1949]
– Negative evidence against the widely believed
%A = %G = %T = %C.
Erwin Chargaff, 1905-
Observing
– %A ~ %T
– %G ~ %C
“A comparison of the
molar proportions
reveals certain
Double Helix
[Watson and Crick, Nature,
April 25, 1953]
– Biologist (age 23, fresh Ph.D.) +
Physicist (age 35, still a Ph.D.
student)
1962 Nobel Prize in Physiology or Medicine
DNA’s picture
[Alexander Rich, 1973]
– Structural biologist at MIT.
Celebrating 50 years of Double Helix (April 25, 1953 – 2003)
Francis Crick 1916-2004
Passed away on July 28, 2004
Maurice Wilkins 1916-2004
GC-content
Non-uniformity of nucleotide composition
– 25% - 75% in genomes of all organisms
– 40% - 50% in typical mammalian genomes
– 30% - 60% in human chromosomes
GC content
GC-content is positively correlated with
– gene length,
– gene density,
– patterns of codon usage,
The Problem
Input:
– an n-bit string S,
– an integer L.
Output:
– a substring S[i, j] of S with maximum density
over all substrings of S with at least L bits.
Example
S = 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1
L = 1: S[1, 1] = 1, density 1
L = 2: S[4, 5] = 1 1, density 1
L = 3: S[8, 11] = 1 0 1 1, density 3/4
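As a reference point for the problem statement, here is a minimal brute-force sketch that tries every substring with at least L bits. It takes quadratic time and is only meant to pin down the definition; the function name and the tie-breaking rule (the first maximum in scan order wins) are choices made for this illustration. Densities are kept as exact fractions to avoid rounding issues.

```python
from fractions import Fraction


def max_density_segment_bruteforce(bits, L):
    """Return (i, j, density) with 1-based inclusive indices, over all segments of length >= L."""
    n = len(bits)
    best = None
    for i in range(1, n - L + 2):
        total = sum(bits[i - 1 : i + L - 2])  # ones among the first L-1 bits of the segment
        for j in range(i + L - 1, n + 1):
            total += bits[j - 1]              # extend the segment to end at position j
            d = Fraction(total, j - i + 1)
            if best is None or d > best[2]:
                best = (i, j, d)
    return best


S = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(max_density_segment_bruteforce(S, 3))
```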
density of each segment in O(1) time
prefix-sum(i) = S[1]+S[2]+…+S[i],
– all n prefix sums are computable in O(n) time.
sum(i, j) = prefix-sum(j) – prefix-sum(i-1)
density(i, j) = sum(i, j) / (j-i+1)
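The three formulas above translate almost directly into code. The sketch below assumes 1-based positions as on the slide and stores prefix-sum(0) = 0 so that sum(i, j) needs no special case; the function names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate


def prefix_sums(bits):
    """prefix[i] = S[1] + ... + S[i], with prefix[0] = 0; computed in O(n) time."""
    return [0] + list(accumulate(bits))


def density(prefix, i, j):
    """Density of S[i, j] (1-based, inclusive) in O(1) time."""
    return Fraction(prefix[j] - prefix[i - 1], j - i + 1)


S = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
prefix = prefix_sums(S)
print(density(prefix, 8, 11))
```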
Good partners
Finding the best ending position g(i) for each i = 1, 2, …, n:
– among all j with j-i+1 ≥ L, g(i) is the one maximizing density(i, j).
Previous Work
[Huang, CABIOS ’94]
– O(nL) time.
– Key observation: no need to examine substrings longer than 2L.
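The key observation can be sketched as follows. A segment with 2L or more bits splits into two pieces of at least L bits each, and one of the two pieces is at least as dense as the whole, so it is enough to try lengths L through 2L-1. This is a sketch written for this illustration, not Huang's original code; it combines the length cap with prefix sums to reach O(nL) time.

```python
from fractions import Fraction
from itertools import accumulate


def max_density_segment_nL(bits, L):
    """Return (i, j, density), 1-based inclusive, trying only lengths L .. 2L-1."""
    n = len(bits)
    prefix = [0] + list(accumulate(bits))
    best = None
    for i in range(1, n - L + 2):
        last = min(n, i + 2 * L - 2)  # j = i + 2L - 2 gives length 2L - 1
        for j in range(i + L - 1, last + 1):
            d = Fraction(prefix[j] - prefix[i - 1], j - i + 1)
            if best is None or d > best[2]:
                best = (i, j, d)
    return best
```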
Recent Progress
[Lin, Jiang, Chao, Journal of Computer and System Sciences (JCSS), 2002]
– O(n log L) time.
– Techniques: Right-skew decomposition.
Our results
Reviewing Lin, Jiang, and Chao’s Algorithm
Right-Skew Substring
S[i, j] is right-skew if for each k = i, …, j-1
– density(i, k) ≤ density(k+1, j).
Examples in S = 1 0 0 1 1 0 0 1 0 1 1 0 1 0 0 1:
– S[2, 4] = 0 0 1
– S[3, 5] = 0 1 1
– S[7, 11] = 0 1 0 1 1
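The definition can be checked directly by testing every split point k. The sketch below does exactly that in time linear in the segment length; the function name is illustrative and positions are 1-based as on the slide.

```python
from fractions import Fraction


def is_right_skew(bits, i, j):
    """True if density(i, k) <= density(k+1, j) for every k = i, ..., j-1."""
    seg = bits[i - 1 : j]
    total = sum(seg)
    left = 0
    for length, b in enumerate(seg[:-1], start=1):
        left += b  # ones in the prefix of the given length
        if Fraction(left, length) > Fraction(total - left, len(seg) - length):
            return False
    return True


S = [1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
print(is_right_skew(S, 7, 11), is_right_skew(S, 1, 3))
```

A single-bit segment has no split point and is therefore right-skew by definition.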
Right-Skew Decomposition
Partition S into substrings S_1, S_2, …, S_k such that
– each S_i is a right-skew substring of S
– density(S_1) > density(S_2) > … > density(S_k)
An example
1 1 | 0 1 1 | 0 1 0 1 1 | 0 0 1
densities: 1 > 2/3 > 3/5 > 1/3
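One way to compute such a decomposition for a single string is a left-to-right scan that keeps a stack of blocks and merges the last two whenever the earlier one is no denser than the later one. The sketch relies on the fact that concatenating two right-skew blocks A and B with density(A) ≤ density(B) gives a right-skew block. It is a sketch for illustration, not necessarily the procedure used in the paper, and the function name is made up.

```python
from fractions import Fraction


def right_skew_decomposition(bits):
    """Return the blocks as (start, end) pairs, 1-based inclusive."""
    blocks = []  # each entry: [start, end, number_of_ones]
    for pos, b in enumerate(bits, start=1):
        blocks.append([pos, pos, b])
        # Merge while the second-to-last block is no denser than the last one.
        while len(blocks) >= 2:
            s1, e1, ones1 = blocks[-2]
            s2, e2, ones2 = blocks[-1]
            if Fraction(ones1, e1 - s1 + 1) <= Fraction(ones2, e2 - s2 + 1):
                blocks[-2:] = [[s1, e2, ones1 + ones2]]
            else:
                break
    return [(s, e) for s, e, _ in blocks]


print(right_skew_decomposition([1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1]))
```

Every bit is pushed once and every merge removes a block, so the scan takes linear time overall.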
Why RS-decomposition?
1. It suffices to search for g(i) among the boundaries of the RS-decomposition of S[i, n].
2. The boundaries’ “potential” of being a
Illustration
[Figure: positions i, i+L, and g(i), with a span of length L marked]
Preprocessing steps
1. RS-decomposition of S[i, n] for each i.
2. Jumping table that enables binary search
First preprocessing: All RS-decompositions
The RS-decomposition of each S[i, n]
– Linear time for each i = 1, …, n.
All n RS-decompositions
– [Lin et al.] O(n^2) time → O(n) time.
Key: nested structures
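The nested structure can be sketched with one pointer per position: p[i] is the end of the first right-skew block of S[i, n], and the remaining blocks of that suffix are exactly the decomposition of S[p[i]+1, n]. The sketch below fills p from right to left, so all n decompositions share storage of size O(n). The names `first_block_ends` and `decomposition_of_suffix` are made up for this illustration.

```python
from fractions import Fraction
from itertools import accumulate


def first_block_ends(bits):
    """p[i] = end of the first right-skew block of S[i, n]; 1-based, p[0] unused."""
    n = len(bits)
    prefix = [0] + list(accumulate(bits))

    def density(i, j):
        return Fraction(prefix[j] - prefix[i - 1], j - i + 1)

    p = [0] * (n + 1)
    for i in range(n, 0, -1):
        p[i] = i
        # Absorb the next block while the current block is no denser than it.
        while p[i] < n and density(i, p[i]) <= density(p[i] + 1, p[p[i] + 1]):
            p[i] = p[p[i] + 1]
    return p


def decomposition_of_suffix(p, i):
    """Read the RS-decomposition of S[i, n] off the shared pointers."""
    n = len(p) - 1
    blocks = []
    while i <= n:
        blocks.append((i, p[i]))
        i = p[i] + 1
    return blocks


p = first_block_ends([1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
print(decomposition_of_suffix(p, 1))
print(decomposition_of_suffix(p, 6))
```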
Second Preprocessing: Jumping Table
Why?
– Need a table that enables jumping over 2^k right-skew components in O(1) time for each k.
[Lin et al.] O(n^2) time → O(n log L) time.
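A table of this kind can be sketched with ordinary pointer doubling. The sketch assumes an array p in which p[i] is the end of the first right-skew component of S[i, n] (1-based, p[0] unused), and it takes the number of levels as a parameter; choosing about log2(L) levels would match the O(n log L) bound. The function name and the sentinel convention are assumptions of this illustration, not details taken from the paper.

```python
def build_jump_table(p, levels):
    """jump[k][i] = start of the component reached from i after skipping 2**k components.

    Index n + 1 is a sentinel meaning "past the end of S".
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    n = len(p) - 1
    jump = [[n + 1] * (n + 2) for _ in range(levels)]
    for i in range(1, n + 1):
        jump[0][i] = p[i] + 1
    for k in range(1, levels):
        for i in range(1, n + 1):
            # Skipping 2**k components = skipping 2**(k-1) components twice.
            jump[k][i] = jump[k - 1][jump[k - 1][i]]
    return jump


# p for the string 1 1 0 1 1 0 1 0 1 1 0 0 1 (hard-coded here to keep the sketch self-contained)
p = [0, 2, 2, 5, 5, 5, 10, 7, 10, 10, 10, 13, 13, 13]
jump = build_jump_table(p, 3)
print(jump[0][1], jump[1][1], jump[2][1])
```

Each table entry is a single lookup, which is what lets a binary search hop over many components at once.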
LJC’s Algorithm
Three main steps:
1. All RS-decompositions in O(n) time.
2. Jumping table in O(n log L) time.
3. For each i = 1, 2, …, n: binary search for g(i) in O(log L) time.
A new important observation
i < j < g(j) < g(i) implies
density(i, g(i)) is no more than
density(j, g(j))
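The observation is easy to test empirically on small inputs. The sketch below computes g(i) by brute force (ties broken toward the earliest ending position, an arbitrary choice for this illustration) and checks the implication for every pair i < j on every bit string up to a given length. It is a sanity check, not a proof; all names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate, product


def best_ends(bits, L):
    """g[i] = an ending position j >= i+L-1 maximizing density(i, j); None if infeasible."""
    n = len(bits)
    prefix = [0] + list(accumulate(bits))
    g = [None] * (n + 1)
    dens = [None] * (n + 1)
    for i in range(1, n - L + 2):
        for j in range(i + L - 1, n + 1):
            d = Fraction(prefix[j] - prefix[i - 1], j - i + 1)
            if dens[i] is None or d > dens[i]:
                g[i], dens[i] = j, d
    return g, dens


def observation_holds(bits, L):
    """Check: i < j < g(j) < g(i) implies density(i, g(i)) <= density(j, g(j))."""
    g, dens = best_ends(bits, L)
    n = len(bits)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            if g[i] is None or g[j] is None:
                continue
            if j < g[j] < g[i] and dens[i] > dens[j]:
                return False
    return True


def check_all_strings(max_n):
    """Exhaustively test every bit string of length 1..max_n and every L."""
    for n in range(1, max_n + 1):
        for bits in product((0, 1), repeat=n):
            for L in range(1, n + 1):
                if not observation_holds(bits, L):
                    return False
    return True
```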
Searching for all g(i) in linear time
We still need RS-decomp.
A generalized version
Input:
– an n-bit string S
– an integer L
– an integer U
Output: a substring S[i, j] of S with maximum
Difficulty
Idea: Partition the input into blocks of length U-L
For each index i, g(i) can only be in two
consecutive blocks.
[Figure: the input divided into consecutive blocks of length U-L, with index i marked]
Consider two cases separately
Case 1
Taking care of all indices i with the same
left block together.
Just like no U is specified.
Case 2
Taking care of all indices i with the same
left block together.
Need RS-decompositions for all prefixes of each block
ESA/SICOMP result
An Optimal Algorithm for the
Maximum-Density Segment Problem, with Kai-min
Chung, in Proceedings of the 11th
Annual European Symposium on
Algorithms, Budapest, Hungary,
Kai-min’s idea
For a segment
– i k j
For a feasible segment
lowest density prefix of
lowest density prefix of
lowest density prefix of
max-density segment got
Our algorithm
For each possible j, we find a “good” candidate S(i_j, j) and look for the max-density segment over all S(i_j, j).
[Figure: a candidate segment ending at j, with labels i_j, “removable prefix”, and L]
The features of our new algorithm
No need of the clever but somewhat
complicated right-skew decomposition.
As a result, our algorithm can process the
Some thoughts
Almost all algorithmic results start with
very simple observations.
Experience and skills (in analysis) are
certainly crucial, but being creative and
diligent is even more important in doing
good (algorithmic) research.
Would you like to join the algorithmic
adventure?