1-1 Central Dogma of Molecular Biology
DNA, RNA, and proteins are the three crucial macromolecules in living organisms.
The central dogma was introduced by Crick in 1958 to describe the process of producing
proteins from DNA through RNA (Figure 1-1).1 DNA is a biopolymer that
carries genetic information. DNA is duplicated before a cell undergoes self-replication.
DNA is also used to produce pre-RNA through transcription. Both duplication and
transcription of DNA occur in the cell nucleus. RNA is processed through RNA splicing
to remove the non-coding regions before translation. The mature RNA is transported to
the cytoplasm. Proteins are built based on the corresponding genetic code on the mature
RNA through translation.
Figure 1-1. The central dogma of molecular biology: genetic information flows from DNA through RNA to proteins.1 The solid arrows indicate the information flow that occurs in all eukaryotic cells. The dashed arrow indicates the information flow that occasionally occurs in viruses through reverse transcriptases.
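The information flow summarized in Figure 1-1 can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions: the codon table is only a small subset of the standard genetic code, the example sequence is invented for illustration, RNA splicing is omitted, and the function names transcribe and translate are hypothetical helpers rather than part of any library.

```python
# Toy illustration of DNA -> RNA -> protein. Splicing is not modeled.
# Only a small subset of the standard genetic code is listed here.
CODON_TABLE = {
    "AUG": "M",                                      # start codon (methionine)
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",                          # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate codon by codon from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    residues = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # X: codon not in toy table
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)


dna = "ATGTTTGGCAAATAA"  # invented example sequence
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna, protein)
```

Reverse transcription (the dashed arrow in Figure 1-1) would correspond to the inverse of transcribe, that is, replacing U with T.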
1-2 Proteins
Proteins are the end products of the central dogma. Based on the unique genetic
code carried by the RNA, each protein is composed of different types and numbers of
amino acids. Most amino acids are L-α-amino acids. Proteins are linear biopolymers with
peptide bonds linking an α-carboxyl group of one amino acid and an α-amino group of
another. The peptide bond is planar with six atoms in the same plane. The length of a
peptide bond is 1.32 Å, which is between a C-N single bond (1.49 Å) and a C=N double bond
(1.27 Å), suggesting partial double bond character.2 Each amino acid contains a
different side chain functional group, allowing proteins to perform various bioactivities.
Proteins are essential elements that control nearly all cellular functions. Proteins
serve several types of roles, including structural components,3 signal
transduction,4 catalysis,5 and immune response.6 Because proteins are responsible for almost all
bioactivities in the cell, studies that deepen the fundamental knowledge of
proteins should improve our understanding of nature and may enable technological
advancement.
1-3 Protein Folding and Function
In order to perform various biological functions, proteins must fold into
three-dimensional structures with high accuracy. Different protein structures give rise to
various protein functions.3, 7 For example, at least 15 distinct enzyme families require a
specific protein fold named αβ barrel to construct the appropriate active site geometry.8
If proteins are denatured or mutated and cannot fold correctly into the corresponding
three-dimensional shape, they lose their functions and may even cause protein misfolding
diseases such as Alzheimer’s,9 Parkinson’s,10 Huntington’s,11 and Creutzfeldt-Jakob
(prion) diseases.12 Alzheimer’s disease (AD) is a clinical syndrome caused by
neurodegeneration; an estimated 24.3 million people suffered from it in
2001.13 AD is related to the abnormal formation and accumulation of amyloid β peptide
(Aβ) and tau protein.14 Parkinson’s disease (PD) is a common neurological syndrome caused
by the abnormal aggregation of a stable tetrameric protein, α-synuclein (α-SYN), to
form insoluble fibrils.15 Prion disease is also caused by the aggregation of a
helix-containing protein called prion protein (PrP).16 All three diseases
involve the abnormal stacking of once structurally diverse proteins into β-sheet
structured amyloid fibrils. The exact conformation of a protein therefore plays an
important role in its function, and a thorough study of protein function at the
molecular level requires detailed structural analysis.
1-4 Hierarchy of Protein Structure
In 1952, Linderstrøm-Lang proposed the hierarchy of protein structure with four
levels: primary, secondary, tertiary, and quaternary.17 In Linderstrøm-Lang’s model,
each level was constructed by the elements of the previous level and was characterized
by specific patterns of interactions.17 The primary structure is the linear
sequence of amino acids in a protein, written from the
amino-terminal end (N) to the carboxyl-terminal end (C). The main-chain atoms are an
NH group bound to Cα, the central carbon atom (Cα) to which the side
chain (R) is attached, and a carbonyl group C’=O linked to the NH of the next residue.
The backbone is therefore composed of a repeating unit (NH-Cα-C’=O)n,
which serves as the common framework of the polypeptide chain (Figure 1-2). In order to
describe the structural properties of a protein, another method is introduced to
characterize the main chain. The repeating unit can be viewed as one central
carbon (Cα(n+1)) linked to the preceding (Cα(n)) and subsequent (Cα(n+2)) central carbons. As
discussed earlier, the peptide C-N bond has partial double bond character.18 This
character constrains the six main-chain atoms around each peptide bond
(Cα(n)-C’O-NH-Cα(n+1) and Cα(n+1)-C’O-NH-Cα(n+2)) to a rigid planar structure.2 Two
neighboring rigid planes are linked through the shared Cα atom and can
rotate about the N-Cα and Cα-C’ bonds. The two conventional dihedral angles for these
two bonds are named phi (φ) and psi (ψ), respectively (Figure 1-2).
Figure 1-2. The peptide bond and the dihedral angles φ and ψ in the backbone.
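The dihedral angles φ and ψ can be computed from atomic coordinates. The following Python sketch shows one common way to obtain a signed dihedral angle from four points, using only the standard library. The helper names (dihedral, phi, psi) are hypothetical, the test coordinates are invented, and the atom ordering assumed here is the usual convention: C’(i-1), N(i), Cα(i), C’(i) for φ and N(i), Cα(i), C’(i), N(i+1) for ψ.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond (-180 to 180)."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def phi(c_prev, n, ca, c):
    """Rotation about the N-Calpha bond (assumed atom order, see lead-in)."""
    return dihedral(c_prev, n, ca, c)


def psi(n, ca, c, n_next):
    """Rotation about the Calpha-C' bond (assumed atom order, see lead-in)."""
    return dihedral(n, ca, c, n_next)


# Invented test geometry: central bond along z, outer bonds at right angles.
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(quarter_turn)
```

A cis arrangement of the four points gives 0° and a trans arrangement gives ±180°, which matches the -180° to 180° range used for φ and ψ.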
The combinations of the dihedral angles are used to describe the structural
properties of the main chain. Most of the combinations of φ and ψ angles are not
allowed due to steric clashes between the peptide backbone and the side chains.19 G. N.
Ramachandran calculated and plotted the sterically allowed regions as Ramachandran
plots with the dihedral angles ranging from -180° to 180° (Figure 1-3).19 The allowed
regions depend on the permitted van der Waals contact distance and the combination of
dihedral angles.19
Figure 1-3. Ramachandran plot.19 The X axis is φ and the Y axis is ψ; both angles range from -180° to 180°.
Secondary structure is defined by patterns of hydrogen bonds between the
backbone amide and carbonyl groups. The basic secondary structures are α-helix
and β-sheet.20 The α-helix was first described by Pauling in 1951.21 The α-helix is a
right-handed coil with dihedral angles φ = -57° and ψ = -47°.22, 23 The coil-like structure
has 3.6 residues per turn and is characterized by consecutive, main-chain, i←i+4
hydrogen bonds between each carbonyl oxygen (i) and an amide hydrogen (i+4) on the
adjacent helical turn (Figure 1-4).24 One third of all protein residues adopt an α-helix
conformation, showing that helical proteins play important roles in living organisms.25
Figure 1-4. The structure of an α-helix (an α-helix from a four-α-helix bundle, PDB
2I7U).
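Two of the helix parameters quoted above lend themselves to a short worked example: the rotation per residue implied by 3.6 residues per turn, and the list of i←i+4 backbone hydrogen bonds in a helix of a given length. The sketch below is illustrative only; the function names are hypothetical, residues are numbered from 1, and every residue is assumed to be part of one ideal helix.

```python
RESIDUES_PER_TURN = 3.6  # value quoted in the text


def rotation_per_residue(residues_per_turn: float = RESIDUES_PER_TURN) -> float:
    """Angle in degrees by which each residue advances around the helix axis."""
    return 360.0 / residues_per_turn


def helix_hbond_pairs(n_residues: int, offset: int = 4):
    """List (i, i + offset) pairs: C=O of residue i accepts from N-H of i + offset."""
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


angle = rotation_per_residue()
pairs = helix_hbond_pairs(10)
print(round(angle, 1), pairs)
```

For an ideal 10-residue helix this yields six backbone hydrogen bonds, so the first four N-H groups and the last four C=O groups are left without an intrahelical partner.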
The β-sheet is another common secondary structure. It is a flat, plate-like configuration
containing multiple β-strands with inter-strand hydrogen bonds between backbone
C’=O and N-H on neighboring strands. β-Sheets can be further categorized into two
types: parallel and anti-parallel, distinguished by the orientation of the
inter-strand hydrogen bonds.26 A parallel β-sheet is characterized by a series of twelve-membered
hydrogen-bonded rings, while an anti-parallel β-sheet is characterized by an alternating
series of ten- and fourteen-membered hydrogen-bonded rings. The dihedral angles of
parallel and anti-parallel β-sheets are (φ = -119°, ψ = +113°) and (φ = -139°, ψ = +135°),
respectively. β-Hairpins are one of the simplest super-secondary structures, consisting of
two anti-parallel β-strands connected through a short loop region (Figure 1-5).27-29
Figure 1-5. The structure of a β-hairpin (the C-terminal β-hairpin of the GB1 protein,
PDB 2PLP).
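The ideal dihedral angles quoted above for the α-helix and for parallel and anti-parallel β-sheets can be used to label a given (φ, ψ) pair by its nearest ideal conformation. The Python sketch below is a deliberately simple illustration and an assumption of this example, not an established assignment method: secondary structure is defined in the text by hydrogen bond patterns, whereas this sketch only compares angles. The names CANONICAL, angular_difference, and nearest_canonical are hypothetical.

```python
import math

# Ideal (phi, psi) values in degrees, as quoted in the text.
CANONICAL = {
    "alpha-helix": (-57.0, -47.0),
    "parallel beta-sheet": (-119.0, 113.0),
    "anti-parallel beta-sheet": (-139.0, 135.0),
}


def angular_difference(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_canonical(phi: float, psi: float) -> str:
    """Return the name of the ideal conformation closest to (phi, psi)."""
    def distance(ref):
        return math.hypot(angular_difference(phi, ref[0]),
                          angular_difference(psi, ref[1]))
    return min(CANONICAL, key=lambda name: distance(CANONICAL[name]))


print(nearest_canonical(-60.0, -45.0))
print(nearest_canonical(-140.0, 130.0))
```

The periodic difference matters because φ and ψ wrap around at ±180°, so -179° and 179° are only 2° apart.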
Tertiary structure refers to the stable three-dimensional structure formed by a
polypeptide chain.30 Various recurring secondary structures assemble to form the
tertiary structure, which is required to perform different and precise protein functions.
X-ray analysis has revealed a significant relationship between function and structure.
Domains are the fundamental units of tertiary structure and are also closely related to
protein function. The concept of a domain was first introduced by Wetlaufer after X-ray
studies of hen lysozyme and papain,31, 32 and proteolysis studies of immunoglobulins.33,
34 Protein tertiary structures can be divided into four major classes based on the
secondary structure content of the domain: all-α domains, all-β domains, α+β domains,
and α/β domains.35 According to the Structural Classification of
Proteins (SCOP) database, which organizes proteins by sequence and structure, these common
folds account for 16.2%, 22.6%, 25.4%, and 23.4% of the total 87681 structural hits,
respectively.36 Pyruvate kinase is a phosphate group-transferring enzyme that plays a
crucial role in glycolysis. It contains three major domains: an all-β regulatory domain,
an α/β substrate binding domain, and an α/β nucleotide binding domain. Each
structurally different domain serves a different purpose in phosphate group transfer. A
typical tertiary structure has its nonpolar residues buried in the interior, forming a
hydrophobic core.37 Polar and charged residues are more frequently found on the
surface, where proteins can interact with the aqueous environment through the
hydrophilic side chains.37
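As a worked check of the SCOP figures quoted above, the percentages can be converted into approximate numbers of structural hits per class. The sketch below assumes that the four percentages map onto the four classes in the order listed in the text; because the percentages are rounded, the resulting counts are approximate.

```python
TOTAL_HITS = 87681  # total structural hits quoted in the text

# Percentages quoted in the text, assumed to follow the listed class order.
CLASS_SHARES = {
    "all-alpha": 16.2,
    "all-beta": 22.6,
    "alpha+beta": 25.4,
    "alpha/beta": 23.4,
}

# Approximate hit counts per class (percentages are rounded in the source).
counts = {name: round(TOTAL_HITS * pct / 100) for name, pct in CLASS_SHARES.items()}
covered = sum(CLASS_SHARES.values())  # share covered by the four classes
remaining = 100 - covered             # share not covered by these four classes

print(counts, round(covered, 1), round(remaining, 1))
```

The four classes together cover about 87.6% of the hits, so roughly 12.4% of the entries are not accounted for by these four classes in the quoted figures.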
Quaternary structure is the spatial assembly of multiple polypeptide chains.38
Examples of proteins with quaternary structure include hemoglobin, DNA polymerase,
and ion channels. Conformational change or re-orientation of individual polypeptides
can induce changes in quaternary structure or in the connections between polypeptides.
Through such structural changes, protein function can be regulated, allowing proteins to exert their
physiological functions.
Each level of protein structure is held together by characteristic interactions and
forces. Higher levels of protein structure are assembled from the structural units of
the lower level (Figure 1-6). Within the protein structure hierarchy, the secondary
structural level plays a key role in protein folding. Therefore, research on the factors
that affect the formation of secondary structure is important for understanding protein
structure formation and prediction.
[Figure 1-6 panel labels: Primary Structure, Secondary Structure, Tertiary Structure, Quaternary Structure]
Figure 1-6. Four hierarchical levels of protein structure (triosephosphate isomerase, PDB 8TIM).
1-5 Driving Force of Protein Folding
Proteins must fold into their native structures to carry out their functions. There are four
dominant forces in protein folding, all of which are non-covalent in
nature.39 These four forces are the hydrophobic effect, electrostatic interactions, hydrogen
bonding, and van der Waals forces.37, 39-46
Protein residues can be divided into two groups, polar and non-polar, depending on
their side chains. When a protein folds, most of the non-polar residues are buried inside
and form a hydrophobic core, while polar residues are mostly exposed to solvent. This
phenomenon is entropically favored and therefore leads to the increased stability of
proteins.37, 47, 48 The hydrophobic effect was first described by Kauzmann in 1959.
Polar residues, many of which are charged, are free to interact with their environment,
including solvent molecules and other polar functional groups. Electrostatic interactions
can be divided into three types: ion-ion, ion-dipole, and dipole-dipole.41, 49 A charged
side chain can interact with an oppositely charged functional group located on another
residue or the protein terminus. Dipoles are formed by the asymmetric distribution of
electrons due to the differences in electronegativity of the two atoms in a covalent bond.
Electrostatic interactions through ionic charges or dipoles contribute to protein stability
and the formation of protein structures.50, 51
A hydrogen bond is an interaction between a hydrogen atom in an X-H group and a
highly electronegative atom Y such as nitrogen, oxygen, or fluorine.40, 52 The partial
positive charge on the H atom interacts with the partial negative charge on the Y atom.40,
52 Such an interaction is important for stabilizing secondary and tertiary structures.44, 53,
54 The backbone hydrogen bond C=O···H-N is the most prevalent (68.1%), with
C=O···side chain (10.9%), N-H···side chain (10.4%), and side chain···side chain
hydrogen bonds (10.6%) accounting for the remainder of the hydrogen bonds in protein
structures.44
Another non-covalent interaction is the van der Waals force. The van der Waals force is a
dispersion force caused by the fluctuating polarization of nearby entities.55 In a
symmetrical molecule, there is no net charge separation on average. In reality, electrons
are mobile and might move towards one end of the molecule, forming a slightly negatively
charged end (δ-) and a slightly positively charged end (δ+).55 Individual van der Waals
interactions are very weak, yet a massive number of such weak forces can still
significantly influence protein structure and stability.56
1-6 RNA Recognition
RNA-protein interactions are important in various fundamental biological processes,
including transcription, translation,57 and RNA processing and modification.58 Both double
helical RNA and DNA are constructed from multiple complementary base pairs: A-U and C-G
in RNA, and A-T and C-G in DNA.59 There are three factors that control the binding affinity
between RNA and protein: electrostatic interactions between the positively
charged regions of the protein and the negatively charged phosphate groups on the RNA backbone,
hydrogen bonding, and the interactions between the RNA groove and the protein side
chains. Specific proteins bind to specific sites on specific RNAs. The appropriate
binding of such proteins acts as a switch for RNA activation or repression. Therefore,
studies on RNA-protein recognition are important for understanding many diseases
related to RNA.
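The complementary base pairing mentioned above (A-U and C-G in RNA, A-T and C-G in DNA) can be expressed as a simple lookup. The following Python sketch builds the complementary strand of a sequence and checks whether two strands can form a fully paired duplex. It is a simplification under stated assumptions: only the canonical pairs named in the text are considered (no G-U wobble pairs), the example sequences are invented, and the function names are hypothetical.

```python
# Canonical base pairs named in the text.
RNA_PAIRS = {"A": "U", "U": "A", "C": "G", "G": "C"}
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(seq: str, pairs: dict) -> str:
    """Return the complementary strand written 5' to 3' (reverse complement)."""
    return "".join(pairs[base] for base in reversed(seq.upper()))


def can_form_duplex(strand_a: str, strand_b: str, pairs: dict) -> bool:
    """True if strand_b pairs with every base of strand_a in antiparallel fashion."""
    return strand_b.upper() == complementary_strand(strand_a, pairs)


rna_partner = complementary_strand("GGCUCU", RNA_PAIRS)  # invented sequence
dna_partner = complementary_strand("ATGC", DNA_PAIRS)    # invented sequence
print(rna_partner, dna_partner)
```

The strand is reversed before complementing because the two strands of a double helix run antiparallel to each other.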
Human immunodeficiency virus (HIV) is a retrovirus that causes
acquired immune deficiency syndrome (AIDS).60 A retrovirus is a single-stranded RNA
virus that targets a host cell as an obligate parasite.61 In most viruses, DNA is
transcribed into RNA, and RNA is translated into viral protein. In retroviruses, however,
RNA is reverse-transcribed into DNA by a virally encoded reverse transcriptase, and
then integrated into the genome of the host cell by a virally encoded integrase.62 Most
retroviruses contain three common genes in RNA genomes: gag, pol, and env. These
genes contain the information necessary for building the structural proteins and
important enzymes for new virus particles. The gag and env genes code for the core
nucleocapsid polypeptides and surface-coat proteins of the virus, respectively.63 The pol
gene codes for the viral reverse transcriptase and other enzymes.64 In the HIV-1 viral
RNA genome, there are six additional regulatory genes (tat, rev, nef, vif, vpr, and vpu)
that code for proteins that control the infection by HIV and the production of new viral
particles.64 The tat gene encodes the Tat protein, which serves as a transcriptional
trans-activator by binding TAR RNA. The Tat protein is important for HIV-1
replication.
The trans-activator of transcription (Tat) protein contains a basic region that can
recognize RNA: RKKRRQRRR (residues 49 to 57). The Tat protein targets the
trans-activating responsive element (TAR) RNA located at the 5’ end of nascent HIV-1
transcripts.65 The TAR RNA contains a stem-loop structure composed of 59 nucleotides.
Two essential regions are the pentanucleotide loop (+29CUGGG+33) and the three-base
bulge (+22UCU+24), both within the region from +17 to +45. By interacting with this loop and bulge
region, the Tat protein alters the properties of the transcriptional complex and recruits
crucial enzymes, including the positive transcription elongation complex and RNA
polymerase II, for efficient production of full-length viral RNA.66 The Tat-TAR binding
provides a positive feedback cycle and allows HIV to have an explosive response once
the threshold amount of Tat protein is reached.67 Blocking this protein-RNA interaction
may repress the transcription of HIV-1 and serve as a potential treatment for
AIDS.68
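The composition of the Tat basic region quoted above (RKKRRQRRR, residues 49 to 57) can be examined with a few lines of Python. The sketch numbers each residue and counts the side chains expected to carry a positive charge, assuming that arginine (R) and lysine (K) are protonated at physiological pH. The variable names are illustrative and not taken from any library.

```python
BASIC_REGION = "RKKRRQRRR"  # sequence quoted in the text
FIRST_RESIDUE = 49          # residue number of the first position

# Pair each one-letter code with its residue number in the Tat protein.
numbered = [(FIRST_RESIDUE + i, aa) for i, aa in enumerate(BASIC_REGION)]

# Assumption: Arg (R) and Lys (K) side chains are positively charged.
positive = [(pos, aa) for pos, aa in numbered if aa in "RK"]
lysines = [pos for pos, aa in numbered if aa == "K"]

print(numbered)
print(len(positive), lysines)
```

Under this assumption, eight of the nine residues carry a positive charge, which is consistent with the electrostatic attraction to the negatively charged phosphate groups of the RNA backbone described in Section 1-6. The two lysines fall at positions 50 and 51.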
1-7 Post-Translational Modifications (PTMs)
Proteins are synthesized through the following biological steps: translation,
polymerization, termination, and processing.69 There are only 20 amino acids encoded
by the triplet nucleotide codons in mRNA. However, about 140 amino acid
derivatives have been identified in different proteins.70 These 20 encoded amino
acids must undergo various modifications to increase or even alter their functionalities.
Any modification that occurs after the completion of translation is considered a
post-translational modification (PTM).70
PTMs are a series of covalent processing events including peptide bond cleavage
and functional group attachment onto individual amino acids. Some common PTMs are
phosphorylation,71 acetylation,72 glycosylation,73 acylation,74 and methylation.75 PTMs
are responsible for protein function regulation and structural change.76
Protein methylation is a common post-translational modification that affects
thermal stability,77 cellular stress response,78 protein aging,79 gene regulation,80-82 and
transcriptional regulation.83 Protein methylation typically takes place on arginine (Arg)
or lysine (Lys) residues in the protein sequence.75 Lysine can be methylated once, twice,
or three times by lysine methyltransferases into monomethyllysine (Mmk),
dimethyllysine (Dmk), and trimethyllysine (Tmk), respectively.84 Lysine methylation
increases the effective radius of the positive charge and the hydrophobicity of the side chain. Such
methylated lysines play an important role in protein-protein and protein-nucleic acid
regulation.85, 86 For the Tat protein, several post-translational modifications have been
identified that modulate the interactions of Tat with TAR and other essential enzyme
complexes.87 These modifications include lysine methylation at the residue adjacent to
the basic region.87 Accordingly, in this thesis, we investigate the effect of various types
of lysine methylation on TAR RNA recognition by Tat47-57 derivatives.
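The three methylation states of lysine described above can be summarized in a small data structure. The Python sketch below tracks, for each state, the number of methyl groups, the number of N-H hydrogens left on the protonated side chain amino group (each methyl group replaces one hydrogen), and the monoisotopic mass added relative to unmodified lysine. The class name LysineState and the constant name are hypothetical; the mass of 14.01565 Da per methylation corresponds to a net addition of CH2 and is supplied here as general chemical knowledge rather than taken from the text.

```python
from dataclasses import dataclass

CH2_MONOISOTOPIC_MASS = 14.01565  # Da added per methylation (H replaced by CH3)


@dataclass(frozen=True)
class LysineState:
    name: str
    abbreviation: str
    methyl_groups: int

    @property
    def nh_hydrogens(self) -> int:
        """N-H hydrogens on the protonated side chain amino group (3 when unmodified)."""
        return 3 - self.methyl_groups

    @property
    def mass_shift(self) -> float:
        """Monoisotopic mass added relative to unmodified lysine, in Da."""
        return self.methyl_groups * CH2_MONOISOTOPIC_MASS


STATES = [
    LysineState("lysine", "Lys", 0),
    LysineState("monomethyllysine", "Mmk", 1),
    LysineState("dimethyllysine", "Dmk", 2),
    LysineState("trimethyllysine", "Tmk", 3),
]

for state in STATES:
    print(state.abbreviation, state.methyl_groups, state.nh_hydrogens,
          round(state.mass_shift, 5))
```

Each added methyl group removes one potential hydrogen bond donor from the side chain while enlarging it, which is one way to rationalize why the different methylation states might affect RNA recognition differently.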
1-8 Thesis Overview
Post-translational modifications are responsible for many protein behaviors. Lysine
methylation alters the physiological properties of the residue and may impact both
protein function and structure. Three variants of methylated lysine have been
identified in proteins. It is logical to assume that different numbers of methyl groups
attached to the side chain amino group should have different effects on proteins. In this
study, various types of methylated lysines are placed into two basic secondary structures:
α-helix and the simplest β-sheet model, “β-hairpin”, to investigate the effect of lysine