University of the Witwatersrand, Johannesburg
Wits Bioinformatics

Protein Three Dimensional Structure
and Function using PDB

Summer University 2011

Abdelkrim Rachedi

In this tutorial you will learn how to use the Protein Data Bank (PDB), the international repository for processing and distributing 3-D macromolecular structure data determined by X-ray crystallography, Nuclear Magnetic Resonance (NMR) and Electron Microscopy. You will explore primary, secondary and tertiary structure, ligand examples and ligand environments.

The primary goals of this tutorial are:

1. Learn to access the PDB, explore the data in its entries and understand what they mean.

2. Find and explore entries with bound ligands (inhibitors).

3. Explore the 3D environment of ligand binding sites.

  • Tutorial Part 1: PDB access & structural data

    - The PDB is a flat-file database: an archive of files containing structural data for macromolecules (nucleic acids and proteins) determined by X-ray crystallography, Nuclear Magnetic Resonance (NMR) and Electron Microscopy.

    - Data for each entry in the database is stored in a separate text file with a defined format, see pdb file format.

    - Each entry has a four-character PDB id, such as 3DFR, which can contain numbers and letters. The full name of each file takes the format pdbxxxx.ent, where xxxx is the PDB id in lower case, such as pdb3dfr.ent
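The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `pdb_archive_filename` is made up for this example, and the check is limited to what the text states (four characters, digits and letters).

```python
def pdb_archive_filename(pdb_id: str) -> str:
    """Return the archive-style file name (pdbxxxx.ent) for a PDB id."""
    pdb_id = pdb_id.strip()
    # A PDB id is four characters made of digits and letters.
    if len(pdb_id) != 4 or not (pdb_id.isascii() and pdb_id.isalnum()):
        raise ValueError(f"not a four-character PDB id: {pdb_id!r}")
    # The file name uses the id in lower case.
    return f"pdb{pdb_id.lower()}.ent"


print(pdb_archive_filename("3DFR"))  # pdb3dfr.ent
```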

    A.
    Go to the PDB WWW page, see Fig.1

    Fig.1: Main pdb page. Note the total number of entries in the PDB to date (71794).

    B.
    The top search bar supports a variety of search types, such as PDB id code, keywords, authors, titles, etc. To get help on what is offered, click on the help icon highlighted in red, see Fig.2

    Fig.2: Top Bar Search Help. Gives help about the types of queries that can be done

    C.
    Since you most likely do not have a particular PDB id in mind, type the keywords “dihydrofolate reductase” in the Search box, see Fig.3, and then click the Search button.



    Fig.3: Top of the main pdb page. Search box with the query text dihydrofolate reductase

    D.
    The Query Result Browser will display a results panel on the left side of the page showing 253 hits (structures), see Fig.4



    Fig.4: Typical Query Results page. Take note of the Query Refinements information, Fig.5. See for example Fig.6, which shows related hits in the SCOP database.



    Fig.5: Query Refinements information & options.



    Fig.6: Dihydrofolate Reductase hits in the SCOP database.

    E.
    Explore a PDB entry:

    View the content of one of the entries by clicking on the icon highlighted in red in Fig.7. This will open a page with the content of the entry.



    Fig.7: The icon highlighted in red, which when clicked displays the associated pdb entry.
    In this case the pdb entry id is 3FQ0

    The first two record types should be:

    HEADER    OXIDOREDUCTASE                          06-JAN-09   3FQ0              
    TITLE     STAPHYLOCOCCUS AUREUS DIHYDROFOLATE REDUCTASE COMPLEXED               
    TITLE    2 WITH NADPH AND 2,4-DIAMINO-5-(3-(2,5-DIMETHOXYPHENYL)PROP-           
    TITLE    3 1-YNYL)-6-ETHYLPYRIMIDINE (UCP120B)                                  
    

    Note that the second line, "TITLE", is further divided because of the length of the title.
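As a rough sketch of how the continued TITLE lines can be put back together, the Python below joins all TITLE records into one string. The function name `read_title` is hypothetical, and joining without a space after a trailing hyphen is an assumption made here so that the chemical name is not broken. It is not an official PDB rule.

```python
def read_title(pdb_lines):
    """Join all TITLE records of a PDB entry into a single string."""
    title = ""
    for line in pdb_lines:
        if not line.startswith("TITLE"):
            continue
        # Columns 1-6 hold the record name, columns 9-10 the continuation
        # number (blank on the first line); the text starts at column 11.
        part = line[10:].strip()
        if not title:
            title = part
        elif title.endswith("-"):
            # Assumption: a line ending in "-" was split inside a name.
            title += part
        else:
            title += " " + part
    return title


entry_lines = [
    "HEADER    OXIDOREDUCTASE                          06-JAN-09   3FQ0",
    "TITLE     STAPHYLOCOCCUS AUREUS DIHYDROFOLATE REDUCTASE COMPLEXED",
    "TITLE    2 WITH NADPH AND 2,4-DIAMINO-5-(3-(2,5-DIMETHOXYPHENYL)PROP-",
    "TITLE    3 1-YNYL)-6-ETHYLPYRIMIDINE (UCP120B)",
]
print(read_title(entry_lines))
```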

    Further down you should find the Uniprot cross reference id and the structure sequence:

    DBREF  3FQ0 A    1   157  UNP    Q2YY41   Q2YY41_STAAB     2    158             
    SEQRES   1 A  157  THR LEU SER ILE LEU VAL ALA HIS ASP LEU GLN ARG VAL          
    SEQRES   2 A  157  ILE GLY PHE GLU ASN GLN LEU PRO TRP HIS LEU PRO ASN          
    SEQRES   3 A  157  ASP LEU LYS HIS VAL LYS LYS LEU SER THR GLY HIS THR          
    SEQRES   4 A  157  LEU VAL MET GLY ARG LYS THR PHE GLU SER ILE GLY LYS          
    SEQRES   5 A  157  PRO LEU PRO ASN ARG ARG ASN VAL VAL LEU THR SER ASP          
    SEQRES   6 A  157  THR SER PHE ASN VAL GLU GLY VAL ASP VAL ILE HIS SER          
    SEQRES   7 A  157  ILE GLU ASP ILE TYR GLN LEU PRO GLY HIS VAL PHE ILE          
    SEQRES   8 A  157  PHE GLY GLY GLN THR LEU PHE GLU GLU MET ILE ASP LYS          
    SEQRES   9 A  157  VAL ASP ASP MET TYR ILE THR VAL ILE GLU GLY LYS PHE          
    SEQRES  10 A  157  ARG GLY ASP THR PHE PHE PRO PRO TYR THR PHE GLU ASP          
    SEQRES  11 A  157  TRP GLU VAL ALA SER SER VAL GLU GLY LYS LEU ASP GLU          
    SEQRES  12 A  157  LYS ASN THR ILE PRO HIS THR PHE LEU HIS LEU ILE ARG          
    SEQRES  13 A  157  LYS                                                          
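The DBREF line relates the residue numbers of the PDB chain to positions in the UniProt sequence. A minimal sketch of reading it in Python follows. It assumes the fields are separated by whitespace, which holds for this line but not for every entry (a strict reader would slice fixed columns), and the function name `parse_dbref` is made up for illustration.

```python
def parse_dbref(line):
    """Split a DBREF record into named fields (whitespace-based sketch)."""
    fields = line.split()
    # Field order: DBREF, PDB id, chain, first and last residue number in
    # the PDB chain, database name, accession, database entry name,
    # first and last position in the database sequence.
    return {
        "pdb_id": fields[1],
        "chain": fields[2],
        "seq_begin": int(fields[3]),
        "seq_end": int(fields[4]),
        "database": fields[5],
        "accession": fields[6],
        "db_seq_begin": int(fields[8]),
        "db_seq_end": int(fields[9]),
    }


ref = parse_dbref(
    "DBREF  3FQ0 A    1   157  UNP    Q2YY41   Q2YY41_STAAB     2    158"
)
# Difference between UniProt numbering and PDB residue numbering.
offset = ref["db_seq_begin"] - ref["seq_begin"]
print(ref["accession"], offset)  # Q2YY41 1
```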
    

    - Can you think of why the sequence data is called "SEQRES"?

    Tip: check the Uniprot entry http://www.uniprot.org/uniprot/Q2YY41 against the SEQRES.

    The Uniprot sequence is:

            10         20         30         40         50         60 
    MTLSILVAHD LQRVIGFENQ LPWHLPNDLK HVKKLSTGHT LVMGRKTFES IGKPLPNRRN 
    
            70         80         90        100        110        120 
    VVLTSDTSFN VEGVDVIHSI EDIYQLPGHV FIFGGQTLFE EMIDKVDDMY ITVIEGKFRG 
    
           130        140        150 
    DTFFPPYTFE DWEVASSVEG KLDEKNTIPH TFLHLIRKK 
    

    - What do you see?
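One way to make the comparison less error-prone is to convert the three-letter SEQRES codes to one-letter codes and look for the result inside the UniProt sequence. The sketch below does this for the first SEQRES line only. The function name `seqres_to_one_letter` is hypothetical, non-standard residues are simply mapped to "X", and the whitespace split assumes a non-blank chain id.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_to_one_letter(pdb_lines, chain="A"):
    """Return the SEQRES sequence of one chain as a one-letter string."""
    residues = []
    for line in pdb_lines:
        fields = line.split()
        # Fields: SEQRES, line number, chain id, chain length, residues...
        if len(fields) > 4 and fields[0] == "SEQRES" and fields[2] == chain:
            residues.extend(fields[4:])
    return "".join(THREE_TO_ONE.get(res, "X") for res in residues)


seqres_lines = [
    "SEQRES   1 A  157  THR LEU SER ILE LEU VAL ALA HIS ASP LEU GLN ARG VAL",
]
uniprot_start = "MTLSILVAHDLQRVIGFENQ"  # first 20 residues of Q2YY41

fragment = seqres_to_one_letter(seqres_lines)
print(fragment)                      # TLSILVAHDLQRV
print(uniprot_start.find(fragment))  # 1
```

The fragment is found at index 1 (0-based), meaning the SEQRES sequence starts at the second UniProt residue, which agrees with the DBREF line (PDB residues 1-157 map to UniProt positions 2-158).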

    Further down you should see the following ATOM records (the first two lines are only a column ruler and are not part of the file):

             1         2         3         4         5         6         7         8
    12345678901234567890123456789012345678901234567890123456789012345678901234567890
    
    ATOM     76  N   GLN A  11     -16.377  -5.868  50.479  1.00 13.87           N  
    ATOM     77  CA  GLN A  11     -15.458  -6.990  50.659  1.00 14.10           C  
    ATOM     78  C   GLN A  11     -15.736  -8.147  49.692  1.00 13.37           C  
    ATOM     79  O   GLN A  11     -15.240  -9.270  49.878  1.00 13.26           O  
    ATOM     80  CB  GLN A  11     -15.480  -7.438  52.121  1.00 14.28           C  
    ATOM     81  CG  GLN A  11     -15.032  -6.290  53.031  1.00 17.76           C  
    ATOM     82  CD  GLN A  11     -14.888  -6.667  54.485  1.00 21.55           C  
    ATOM     83  OE1 GLN A  11     -15.815  -7.196  55.113  1.00 24.16           O  
    ATOM     84  NE2 GLN A  11     -13.726  -6.354  55.047  1.00 23.67           N  
    ATOM     85  N   ARG A  12     -16.499  -7.838  48.638  1.00 12.04           N  
    ATOM     86  CA  ARG A  12     -16.901  -8.789  47.595  1.00 11.79           C  
    ATOM     87  C   ARG A  12     -17.879  -9.881  48.043  1.00 11.15           C  
    ATOM     88  O   ARG A  12     -18.064 -10.863  47.338  1.00 10.81           O  
    ATOM     89  CB  ARG A  12     -15.698  -9.385  46.857  1.00 11.69           C  
    ATOM     90  CG  ARG A  12     -15.090  -8.455  45.797  1.00 12.28           C  
    ATOM     91  CD  ARG A  12     -13.804  -9.032  45.203  1.00 12.87           C  
    ATOM     92  NE  ARG A  12     -12.757  -9.049  46.224  1.00 15.08           N  
    ATOM     93  CZ  ARG A  12     -12.419 -10.108  46.962  1.00 16.96           C  
    ATOM     94  NH1 ARG A  12     -13.012 -11.291  46.789  1.00 15.70           N  
    ATOM     95  NH2 ARG A  12     -11.468  -9.971  47.884  1.00 18.43           N  
    

    - What do the columns above mean?

    These lines give, for every atom in the protein, the atom serial number, atom name, residue name, polypeptide chain identifier, residue number, x, y and z coordinates, occupancy, thermal factor (B-factor) and element symbol.

    The occupancy tells you to what extent each atom occupies the given 3D position; a value of 1.0 represents full occupancy.

    The b-factor tells you about relative mobility of the atoms.

    CA is an alpha carbon. Every amino acid has an alpha carbon; glycine is the exception in having no beta carbon (CB), since its side chain is a single hydrogen atom. The atoms N, CA, C and O are part of the polypeptide backbone, also called the Main-Chain. Atoms belonging to the amino acid's radical group (R-group), also called the Side-Chain, start from CB, CG, etc. Hydrogen atoms are missing because they are usually not observed by X-ray diffraction of large molecules like proteins. NMR PDB entries contain hydrogen atoms because they are detected with the NMR technique.
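The main-chain and side-chain distinction can be expressed as a simple test on the atom name. The sketch below is illustrative only: the function name is made up, and it treats every atom name other than N, CA, C and O as side chain, so a terminal atom such as OXT would be counted as side chain too.

```python
BACKBONE_ATOMS = {"N", "CA", "C", "O"}


def split_main_and_side_chain(atom_names):
    """Split a residue's atom names into (main chain, side chain)."""
    main = [name for name in atom_names if name in BACKBONE_ATOMS]
    side = [name for name in atom_names if name not in BACKBONE_ATOMS]
    return main, side


# Atom names of GLN A 11, taken from the ATOM records shown above.
gln_atoms = ["N", "CA", "C", "O", "CB", "CG", "CD", "OE1", "NE2"]
main_chain, side_chain = split_main_and_side_chain(gln_atoms)
print(main_chain)  # ['N', 'CA', 'C', 'O']
print(side_chain)  # ['CB', 'CG', 'CD', 'OE1', 'NE2']
```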

    For details, see the ATOM record format below:
    
    COLUMNS        DATA TYPE       FIELD         DEFINITION
    ---------------------------------------------------------------------------------
     1 -  6        Record name     "ATOM  "
     7 - 11        Integer         serial        Atom serial number.
    13 - 16        Atom            name          Atom name.
    17             Character       altLoc        Alternate location indicator.
    18 - 20        Residue name    resName       Residue name.
    22             Character       chainID       Chain identifier.
    23 - 26        Integer         resSeq        Residue sequence number.
    27             AChar           iCode         Code for insertion of residues.
    31 - 38        Real(8.3)       x             Orthogonal coordinates for X in Angstroms.
    39 - 46        Real(8.3)       y             Orthogonal coordinates for Y in Angstroms.
    47 - 54        Real(8.3)       z             Orthogonal coordinates for Z in Angstroms.
    55 - 60        Real(6.2)       occupancy     Occupancy.
    61 - 66        Real(6.2)       tempFactor    Temperature factor.
    73 - 76        LString(4)      segID         Segment identifier, left-justified.
    77 - 78        LString(2)      element       Element symbol, right-justified.
    79 - 80        LString(2)      charge        Charge on the atom.
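Because the ATOM record is defined by fixed columns, it is safer to slice each line by position than to split it on spaces (fields can touch each other when values are large). The following is a minimal sketch for a subset of the fields in the table. The names `Atom` and `parse_atom_line` are chosen for this example and are not part of any library.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM record using the column ranges of the format table."""
    # Python slices are 0-based and end-exclusive, so columns 7-11 of the
    # table become line[6:11], columns 31-38 become line[30:38], and so on.
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        temp_factor=float(line[60:66]),
        element=line[76:78].strip(),
    )


example = "ATOM     76  N   GLN A  11     -16.377  -5.868  50.479  1.00 13.87           N"
atom = parse_atom_line(example)
print(atom.name, atom.res_name, atom.res_seq, atom.x, atom.temp_factor, atom.element)
```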
    

    - How many amino acids do you have in the list above? Give their names.

    - In both cases, highlight the main and side chain atoms

    - What type of secondary structure do they belong to? (check this Secondary Structure Format)

    F.
    NMR entries, hydrogen atoms and models:

    The PDB entry 2hm9 is an example of an NMR structure. Click pdb2hm9.pdb and explore its content. In particular, note the existence of hydrogen atoms and find out how NMR models are described. See also below the graphical display of the 2hm9 entry:


    Note in particular the multiple conformations of the ligand TRR (2,4-DIAMINO-5-(3,4,5-TRIMETHOXY-BENZYL)-PYRIMIDIN-1-IUM). See also the figures below:
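In the PDB file format, each NMR model is enclosed between a MODEL record and an ENDMDL record, and hydrogen atoms carry the element symbol H in columns 77-78. A short sketch that counts both is given below. The function name `summarize_models` and the helper `demo_atom` are made up for illustration, and the sample lines are synthetic rather than taken from 2hm9.

```python
def summarize_models(pdb_lines):
    """Return (number of MODEL records, number of hydrogen atoms)."""
    n_models = 0
    n_hydrogens = 0
    for line in pdb_lines:
        record = line[:6].strip()
        if record == "MODEL":
            n_models += 1
        elif record in ("ATOM", "HETATM"):
            # Element symbol sits in columns 77-78 (0-based slice 76:78).
            if line[76:78].strip().upper() == "H":
                n_hydrogens += 1
    return n_models, n_hydrogens


def demo_atom(serial, name, element):
    # Hypothetical helper: builds a simplified, synthetic ATOM line with the
    # element symbol right-justified in columns 77-78.
    return f"ATOM  {serial:5d} {name:<4}".ljust(76) + f"{element:>2}"


demo_lines = [
    "MODEL        1", demo_atom(1, "N", "N"), demo_atom(2, "H", "H"), "ENDMDL",
    "MODEL        2", demo_atom(1, "N", "N"), demo_atom(2, "H", "H"), "ENDMDL",
]
print(summarize_models(demo_lines))  # (2, 2)
```

The hydrogen count is summed over all models. To try it on the real entry, pass the lines of a local copy of the 2hm9 file instead of `demo_lines`.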

    Tutorial Part 2: Protein Entries & Ligands

    A.
    Click on the tab entitled "144 Ligands Hits" and explore the list of ligands, see Fig.9.



    Fig.9: Ligands result page showing DHFR entries (complexes) with bound ligands.

    Note that each hit has some PDB entries associated with it.

    B.

    Click on one of the associated-entries links, highlighted in Fig.10.a, to see which entries bind this particular ligand.



    Fig.10.a: An example of a ligand binding with 2 DHFR complex structures. Highlighted is the link to information about DHFR entries binding the ligand.

    Tutorial Part 3: Ligand's 3D-environment & Chemistry

    To explore the ligands' 3D-environment & Chemistry, select a pdb ID from the list above, as shown below:

  • suppose you selected the pdb id 3FQ0

  • load the web tool: Ligands Sites Explorer, see Fig.11
  • type 3FQ0 in the "PDB id:" box and press ENTER or click "Go"


  • Fig.11 Ligand Site Explorer page showing data for the pdb entry 3FQ0.

    Go to the Ligands table.


    Glycine SMILES string: [C@2H2]([C](=[O])[O-])[NH3+]

    Resources:

  • Visual Biochemistry
  • Jmol Wiki