Predicting Protein Structure

Understanding proteins, predicting their structures with AlphaFold2 and visualizing them in ChimeraX.
AI
Project
Deep Learning
Biology
Author

Mahmut Osmanovic

Published

October 24, 2024

1.0 | What are Proteins?

The word protein was coined by Swedish chemist Jöns Jacob Berzelius in 1838. It is derived from Greek, meaning “of the first rank”. He named them as such because of how fundamental he thought that they were for living organisms. Proteins are fundamentally assembled atoms. A specific assembly of these atoms form what is referred to as a basic structure, in this case, of amino acids. The figure below highlights the basic structure of amino acids.


On the left is an amine, a nitrogen atom with two hydrogen atoms. On the right is a carboxyl, a carbon atom with two oxygen atoms and a hydrogen. In the center is the “R” group. The “R” group is different for every amino acid. There are 20 standard amino acids. A long and interconnected chain of these amino acids is what makes up a protein. There are several transitional states between a small and basic amino acid chain and protein. They are visualized and described in the figure below.


The first state describes amino acids linked (by peptide bonds) in a linear sequence (polypeptide chain). The sequence is determined by the genetic code in the DNA.

Subsequently, the polypeptide chain locally folds into specific shapes due to the hydrogen bonds between the backbone NH and CO groups of the amino acids. Two of the most common shapes are the alpha helix and beta pleated sheet. These structures equip proteins with stability. They are often repeated in various regions of the protein. The state is called “Secondary Structure”.

The tertiary structure refers to the global 3D shape of a single polypeptide chain, including many secondary structures. It is stabilized by interactions between the side chains (R-groups) of amino acid.

Lastly, the Quaternary Structure represents the assembly of multiple tertiary structures. For example, collagen (a protein that strengthens joints and ligaments) is made up of three intertwined polypeptide chains.

2.0 | The Protein Folding Problem

The three global dimensional structure plays a crucial role in determining the functionality of the protein. The protein folding problem focuses on predicting a protein’s 3D structure based on its amino acid sequence. Each amino acid is represented by a letter in the alphabet.


Their linkage is represented by bonds. Determining the angles of bonds and their lengths is fundamental to understand the spatial structure of any protein. Note that proteins structures obviously are translation and rotation invariant.


3.0 | Prediction and Visualization

In order to experiment with protein structure predictions, I used AlphaFold2, the open source model. I found the alphabetical amino acid representation of the protein that I was interested in modeling on uniprot. Since I am a passionate weight lifter, I thought, why not predict the 3D structure of Myosin. Myosin is a protein that is responsible for muscle contraction. Note that there are various types of myosin proteins. If I choose a protein with an AA-chain (amino acid chain) that includes thousands of AAs, it becomes very likely that I will run out of RAM before the prediction process has completed. Typical RAM requirements are 32-64 GB, but I have 16 GB. Hence, I ensure that the chosen protein for prediction did not contain a long amino acid sequence.

The chosen one was Myosin I, X8JJ96_9AGAM. Subsequently I downloaded its amino acid sequence from uniprot. The amino acid sequence was provided as input to AlphaFold2, and after some slight configurations, it generated several protein structure predictions. The output files were in pdb format. It is a binary file that contains type and symbolic debugging information. Thereafter I downloaded its spatial representation from uniprot, it will serve as a reference point. I uploaded and visualized both individually in ChimeraX and subsequently aligned my prediction with the spatial structure on the uniprot website using the matchmaker tool.

Predicted vs. Real before alignment:

Predicted vs. Real after alignment:

The result is not a perfect match but aligns closely with the real structure. The AlphaFold2 code is specified in the appendix. The figure below highlights the architectural design that DeepMind used to construct AlphaFold. The code is open-sourced on their github, AlphaFold. Note their extensive usage of the python module JAX. JAX arrays perform phenomenally when similar types of executions occur iteratively, certainly as is the case in AlphaFold. In addition, parallelizing computation and taking gradients of any function is greatly simplified.


4.0 | Final Thoughts

Protein folding is a complex yet crucial problem in biology because a protein’s structure directly dictates its function. Using AlphaFold2 to predict protein structures has demonstrated just how much progress has been made in this area. Even with limited computational resources, I was able to predict the 3D structure of Myosin I with reasonable accuracy, which highlights the power of current models.

Although the prediction wasn’t an exact match to the real structure, the alignment was close enough to show the potential of these tools. As computational power grows and models become more advanced, this accuracy will continue to improve. It’s clear that AI is reshaping our approach to understanding biological structures.

This is just the start. With advancements in AI, computation, and biological research, we are likely heading towards a future where predicting the structure of any protein is not only feasible but fast and reliable. This will have profound implications across drug development, disease understanding, and biological engineering. We are standing at the brink of what could be a revolution in how we approach biological problems in particular, and in science in general.

Appendix

ALPHAFOLD2 - PYTHON CODE
"""
Original file is located at
    https://colab.research.google.com/drive/1m8C9DPbekGtjwEiUcxRTVnq2iagDmFuO

<img src="https://raw.githubusercontent.com/sokrypton/ColabFold/main/.github/ColabFold_Marv_Logo_Small.png" height="200" align="right" style="height:240px">

##ColabFold v1.5.5: AlphaFold2 using MMseqs2

Easy to use protein structure and complex prediction using [AlphaFold2](https://www.nature.com/articles/s41586-021-03819-2) and [Alphafold2-multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1). Sequence alignments/templates are generated through [MMseqs2](mmseqs.com) and [HHsearch](https://github.com/soedinglab/hh-suite). For more details, see <a href="#Instructions">bottom</a> of the notebook, checkout the [ColabFold GitHub](https://github.com/sokrypton/ColabFold) and [Nature Protocols](https://www.nature.com/articles/s41596-024-01060-5).

Old versions: [v1.4](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.4.0/AlphaFold2.ipynb), [v1.5.1](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.1/AlphaFold2.ipynb), [v1.5.2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.2/AlphaFold2.ipynb), [v1.5.3-patch](https://colab.research.google.com/github/sokrypton/ColabFold/blob/56c72044c7d51a311ca99b953a71e552fdc042e1/AlphaFold2.ipynb)

[Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all.
*Nature Methods*, 2022](https://www.nature.com/articles/s41592-022-01488-1)

Forked by Mahmut Osmanovic, Oct, 2024.
"""

#@title Input protein sequence(s), then hit `Runtime` -> `Run all`
from google.colab import files
import os
import re
import hashlib
import random

from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"

def add_hash(x,y):
  return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]

query_sequence = 'MTDIKLKGHKGLQTYSTHVSGELVGVVSEFEAKKHKVKVWQPALQPNTYPTVESVSYRLEDLKSGTIQVYRVAEGV' 
jobname = 'Myosin_I(X8JJ96)' 
num_relax = 0 
template_mode = "none"

use_amber = num_relax > 0

# remove whitespaces
query_sequence = "".join(query_sequence.split())

basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)

# check if directory with jobname exists
def check(folder):
  if os.path.exists(folder):
    return False
  else:
    return True
if not check(jobname):
  n = 0
  while not check(f"{jobname}_{n}"): n += 1
  jobname = f"{jobname}_{n}"

# make directory to save results
os.makedirs(jobname, exist_ok=True)

# save queries
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
  text_file.write(f"id,sequence\n{jobname},{query_sequence}")

if template_mode == "pdb100":
  use_templates = True
  custom_template_path = None
elif template_mode == "custom":
  custom_template_path = os.path.join(jobname,f"template")
  os.makedirs(custom_template_path, exist_ok=True)
  uploaded = files.upload()
  use_templates = True
  for fn in uploaded.keys():
    os.rename(fn,os.path.join(custom_template_path,fn))
else:
  custom_template_path = None
  use_templates = False

print("jobname",jobname)
print("sequence",query_sequence)
print("length",len(query_sequence.replace(":","")))

msa_mode = "mmseqs2_uniref_env" 
pair_mode = "unpaired_paired"

# decide which a3m to use
if "mmseqs2" in msa_mode:
  a3m_file = os.path.join(jobname,f"{jobname}.a3m")

elif msa_mode == "custom":
  a3m_file = os.path.join(jobname,f"{jobname}.custom.a3m")
  if not os.path.isfile(a3m_file):
    custom_msa_dict = files.upload()
    custom_msa = list(custom_msa_dict.keys())[0]
    header = 0
    import fileinput
    for line in fileinput.FileInput(custom_msa,inplace=1):
      if line.startswith(">"):
         header = header + 1
      if not line.rstrip():
        continue
      if line.startswith(">") == False and header == 1:
         query_sequence = line.rstrip()
      print(line, end='')

    os.rename(custom_msa, a3m_file)
    queries_path=a3m_file
    print(f"moving {custom_msa} to {a3m_file}")

else:
  a3m_file = os.path.join(jobname,f"{jobname}.single_sequence.a3m")
  with open(a3m_file, "w") as text_file:
    text_file.write(">1\n%s" % query_sequence)

model_type = "auto" 
num_recycles = "3"
recycle_early_stop_tolerance = "auto" 
relax_max_iterations = 200 
pairing_strategy = "greedy" 

max_msa = "auto" 
num_seeds = 1 
use_dropout = False 

num_recycles = None if num_recycles == "auto" else int(num_recycles)
recycle_early_stop_tolerance = None if recycle_early_stop_tolerance == "auto" else float(recycle_early_stop_tolerance)
if max_msa == "auto": max_msa = None

save_all = False
save_recycles = False 
save_to_google_drive = True 
dpi = 200 


if save_to_google_drive:
  from pydrive2.drive import GoogleDrive
  from pydrive2.auth import GoogleAuth
  from google.colab import auth
  from oauth2client.client import GoogleCredentials
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("You are logged into Google Drive and are good to go!")


display_images = True

import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from Bio import BiopythonDeprecationWarning
warnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)
from pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2

import os
import numpy as np
try:
  K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
except:
  K80_chk = "0"
  pass
if "1" in K80_chk:
  print("WARNING: found GPU Tesla K80: limited to total length < 1000")
  if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
    del os.environ["TF_FORCE_UNIFIED_MEMORY"]
  if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
    del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]

from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt

# For some reason we need that to get pdbfixer to import
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

def input_features_callback(input_features):
  if display_images:
    plot_msa_v2(input_features)
    plt.show()
    plt.close()

def prediction_callback(protein_obj, length,
                        prediction_result, input_features, mode):
  model_name, relaxed = mode
  if not relaxed:
    if display_images:
      fig = plot_protein(protein_obj, Ls=length, dpi=150)
      plt.show()
      plt.close()

result_dir = jobname
log_filename = os.path.join(jobname,"log.txt")
setup_logging(Path(log_filename))

queries, is_complex = get_queries(queries_path)
model_type = set_model_type(is_complex, model_type)

if "multimer" in model_type and max_msa is not None:
  use_cluster_profile = False
else:
  use_cluster_profile = True

download_alphafold_params(model_type, Path("."))
results = run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    custom_template_path=custom_template_path,
    num_relax=num_relax,
    msa_mode=msa_mode,
    model_type=model_type,
    num_models=5,
    num_recycles=num_recycles,
    relax_max_iterations=relax_max_iterations,
    recycle_early_stop_tolerance=recycle_early_stop_tolerance,
    num_seeds=num_seeds,
    use_dropout=use_dropout,
    model_order=[1,2,3,4,5],
    is_complex=is_complex,
    data_dir=Path("."),
    keep_existing_results=False,
    rank_by="auto",
    pair_mode=pair_mode,
    pairing_strategy=pairing_strategy,
    stop_at_score=float(100),
    prediction_callback=prediction_callback,
    dpi=dpi,
    zip_results=False,
    save_all=save_all,
    max_msa=max_msa,
    use_cluster_profile=use_cluster_profile,
    input_features_callback=input_features_callback,
    save_recycles=save_recycles,
    user_agent="colabfold/google-colab-main",
)
results_zip = f"{jobname}.result.zip"
os.system(f"zip -r {results_zip} {jobname}")

import py3Dmol
import glob
import matplotlib.pyplot as plt
from colabfold.colabfold import plot_plddt_legend
from colabfold.colabfold import pymol_color_list, alphabet_list
rank_num = 1 
color = "lDDT" 
show_sidechains = False 
show_mainchains = False 

tag = results["rank"][0][rank_num - 1]
jobname_prefix = ".custom" if msa_mode == "custom" else ""
pdb_filename = f"{jobname}/{jobname}{jobname_prefix}_unrelaxed_{tag}.pdb"
pdb_file = glob.glob(pdb_filename)

def show_pdb(rank_num=1, show_sidechains=False, show_mainchains=False, color="lDDT"):
  model_name = f"rank_{rank_num}"
  view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
  view.addModel(open(pdb_file[0],'r').read(),'pdb')

  if color == "lDDT":
    view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
  elif color == "rainbow":
    view.setStyle({'cartoon': {'color':'spectrum'}})
  elif color == "chain":
    chains = len(queries[0][1]) + 1 if is_complex else 1
    for n,chain,color in zip(range(chains),alphabet_list,pymol_color_list):
       view.setStyle({'chain':chain},{'cartoon': {'color':color}})

  if show_sidechains:
    BB = ['C','O','N']
    view.addStyle({'and':[{'resn':["GLY","PRO"],'invert':True},{'atom':BB,'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"GLY"},{'atom':'CA'}]},
                        {'sphere':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"PRO"},{'atom':['C','O'],'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
  if show_mainchains:
    BB = ['C','O','N','CA']
    view.addStyle({'atom':BB},{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})

  view.zoomTo()
  return view

show_pdb(rank_num, show_sidechains, show_mainchains, color).show()
if color == "lDDT":
  plot_plddt_legend().show()

#@title Plots {run: "auto"}
from IPython.display import display, HTML
import base64
from html import escape

# see: https://stackoverflow.com/a/53688522
def image_to_data_url(filename):
  ext = filename.split('.')[-1]
  prefix = f'data:image/{ext};base64,'
  with open(filename, 'rb') as f:
    img = f.read()
  return prefix + base64.b64encode(img).decode('utf-8')

pae = ""
pae_file = os.path.join(jobname,f"{jobname}{jobname_prefix}_pae.png")
if os.path.isfile(pae_file):
    pae = image_to_data_url(pae_file)
cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_coverage.png"))
plddt = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_plddt.png"))
display(HTML(f"""
<style>
  img {{
    float:left;
  }}
  .full {{
    max-width:100%;
  }}
  .half {{
    max-width:50%;
  }}
  @media (max-width:640px) {{
    .half {{
      max-width:100%;
    }}
  }}
</style>
<div style="max-width:90%; padding:2em;">
  <h1>Plots for {escape(jobname)}</h1>
  { '<!--' if pae == '' else '' }<img src="{pae}" class="full" />{ '-->' if pae == '' else '' }
  <img src="{cov}" class="half" />
  <img src="{plddt}" class="half" />
</div>
"""))

if msa_mode == "custom":
  print("Don't forget to cite your custom MSA generation method.")

files.download(f"{jobname}.result.zip")

if save_to_google_drive == True and drive:
  uploaded = drive.CreateFile({'title': f"{jobname}.result.zip"})
  uploaded.SetContentFile(f"{jobname}.result.zip")
  uploaded.Upload()
  print(f"Uploaded {jobname}.result.zip to Google Drive with ID {uploaded.get('id')}")

References

  1. Wikipedia. (Retrieved 24th of Oct, 2024). https://en.wikipedia.org/wiki/J%C3%B6ns_Jacob_Berzelius.
  2. Deepmind. (Retrieved 24th of Oct, 2024). https://educlick.uk/course/26
  3. Basic Structure of an aminacid. (Retrieved 24th of Oct, 2024). https://www.snexplores.org/article/scientists-say-amino-acid
  4. Myosin-I. (Retrieved 24th of Oct, 2024). https://www.uniprot.org/uniprotkb/X8JJ96/entry