1.0 | What are Proteins?
The word protein was coined by Swedish chemist Jöns Jacob Berzelius in 1838. It is derived from Greek, meaning “of the first rank”. He named them as such because of how fundamental he thought that they were for living organisms. Proteins are fundamentally assembled atoms. A specific assembly of these atoms form what is referred to as a basic structure, in this case, of amino acids. The figure below highlights the basic structure of amino acids.
On the left is an amine, a nitrogen atom with two hydrogen atoms. On the right is a carboxyl, a carbon atom with two oxygen atoms and a hydrogen. In the center is the “R” group. The “R” group is different for every amino acid. There are 20 standard amino acids. A long and interconnected chain of these amino acids is what makes up a protein. There are several transitional states between a small and basic amino acid chain and protein. They are visualized and described in the figure below.
The first state describes amino acids linked (by peptide bonds) in a linear sequence (polypeptide chain). The sequence is determined by the genetic code in the DNA.
Subsequently, the polypeptide chain locally folds into specific shapes due to the hydrogen bonds between the backbone NH and CO groups of the amino acids. Two of the most common shapes are the alpha helix
and beta pleated sheet
. These structures equip proteins with stability. They are often repeated in various regions of the protein. The state is called “Secondary Structure”.
The tertiary structure refers to the global 3D shape of a single polypeptide chain, including many secondary structures. It is stabilized by interactions between the side chains (R-groups) of amino acid.
Lastly, the Quaternary Structure represents the assembly of multiple tertiary structures. For example, collagen (a protein that strengthens joints and ligaments) is made up of three intertwined polypeptide chains.
2.0 | The Protein Folding Problem
The three global dimensional structure
plays a crucial role in determining the functionality of the protein. The protein folding problem focuses on predicting a protein’s 3D structure based on its amino acid sequence. Each amino acid is represented by a letter in the alphabet.
Their linkage is represented by bonds. Determining the angles of bonds and their lengths is fundamental to understand the spatial structure of any protein. Note that proteins structures obviously are translation and rotation invariant.
3.0 | Prediction and Visualization
In order to experiment with protein structure predictions, I used AlphaFold2, the open source model. I found the alphabetical amino acid representation of the protein that I was interested in modeling on uniprot. Since I am a passionate weight lifter, I thought, why not predict the 3D structure of Myosin
. Myosin is a protein that is responsible for muscle contraction. Note that there are various types of myosin proteins. If I choose a protein with an AA-chain (amino acid chain) that includes thousands of AAs, it becomes very likely that I will run out of RAM before the prediction process has completed. Typical RAM requirements are 32-64 GB, but I have 16 GB. Hence, I ensure that the chosen protein for prediction did not contain a long amino acid sequence.
The chosen one was Myosin I
, X8JJ96_9AGAM. Subsequently I downloaded its amino acid sequence from uniprot. The amino acid sequence was provided as input to AlphaFold2, and after some slight configurations, it generated several protein structure predictions. The output files were in pdb
format. It is a binary file that contains type and symbolic debugging information. Thereafter I downloaded its spatial representation from uniprot, it will serve as a reference point. I uploaded and visualized both individually in ChimeraX and subsequently aligned my prediction with the spatial structure on the uniprot website using the matchmaker tool.
Predicted vs. Real before alignment:
Predicted vs. Real after alignment:
The result is not a perfect match but aligns closely with the real structure. The AlphaFold2 code is specified in the appendix. The figure below highlights the architectural design that DeepMind used to construct AlphaFold. The code is open-sourced on their github, AlphaFold. Note their extensive usage of the python module JAX. JAX arrays perform phenomenally when similar types of executions occur iteratively, certainly as is the case in AlphaFold. In addition, parallelizing computation and taking gradients of any function is greatly simplified.
4.0 | Final Thoughts
Protein folding is a complex yet crucial problem in biology because a protein’s structure directly dictates its function. Using AlphaFold2 to predict protein structures has demonstrated just how much progress has been made in this area. Even with limited computational resources, I was able to predict the 3D structure of Myosin I with reasonable accuracy, which highlights the power of current models.
Although the prediction wasn’t an exact match to the real structure, the alignment was close enough to show the potential of these tools. As computational power grows and models become more advanced, this accuracy will continue to improve. It’s clear that AI is reshaping our approach to understanding biological structures.
This is just the start. With advancements in AI, computation, and biological research, we are likely heading towards a future where predicting the structure of any protein is not only feasible but fast and reliable. This will have profound implications across drug development, disease understanding, and biological engineering. We are standing at the brink of what could be a revolution in how we approach biological problems in particular, and in science in general.
Appendix
ALPHAFOLD2 - PYTHON CODE
"""
Original file is located at
https://colab.research.google.com/drive/1m8C9DPbekGtjwEiUcxRTVnq2iagDmFuO
<img src="https://raw.githubusercontent.com/sokrypton/ColabFold/main/.github/ColabFold_Marv_Logo_Small.png" height="200" align="right" style="height:240px">
##ColabFold v1.5.5: AlphaFold2 using MMseqs2
Easy to use protein structure and complex prediction using [AlphaFold2](https://www.nature.com/articles/s41586-021-03819-2) and [Alphafold2-multimer](https://www.biorxiv.org/content/10.1101/2021.10.04.463034v1). Sequence alignments/templates are generated through [MMseqs2](mmseqs.com) and [HHsearch](https://github.com/soedinglab/hh-suite). For more details, see <a href="#Instructions">bottom</a> of the notebook, checkout the [ColabFold GitHub](https://github.com/sokrypton/ColabFold) and [Nature Protocols](https://www.nature.com/articles/s41596-024-01060-5).
Old versions: [v1.4](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.4.0/AlphaFold2.ipynb), [v1.5.1](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.1/AlphaFold2.ipynb), [v1.5.2](https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.5.2/AlphaFold2.ipynb), [v1.5.3-patch](https://colab.research.google.com/github/sokrypton/ColabFold/blob/56c72044c7d51a311ca99b953a71e552fdc042e1/AlphaFold2.ipynb)
[Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all.
*Nature Methods*, 2022](https://www.nature.com/articles/s41592-022-01488-1)
Forked by Mahmut Osmanovic, Oct, 2024.
"""
#@title Input protein sequence(s), then hit `Runtime` -> `Run all`
from google.colab import files
import os
import re
import hashlib
import random
from sys import version_info
= f"{version_info.major}.{version_info.minor}"
python_version
def add_hash(x,y):
return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]
= 'MTDIKLKGHKGLQTYSTHVSGELVGVVSEFEAKKHKVKVWQPALQPNTYPTVESVSYRLEDLKSGTIQVYRVAEGV'
query_sequence = 'Myosin_I(X8JJ96)'
jobname = 0
num_relax = "none"
template_mode
= num_relax > 0
use_amber
# remove whitespaces
= "".join(query_sequence.split())
query_sequence
= "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
basejobname = add_hash(basejobname, query_sequence)
jobname
# check if directory with jobname exists
def check(folder):
if os.path.exists(folder):
return False
else:
return True
if not check(jobname):
= 0
n while not check(f"{jobname}_{n}"): n += 1
= f"{jobname}_{n}"
jobname
# make directory to save results
=True)
os.makedirs(jobname, exist_ok
# save queries
= os.path.join(jobname, f"{jobname}.csv")
queries_path with open(queries_path, "w") as text_file:
f"id,sequence\n{jobname},{query_sequence}")
text_file.write(
if template_mode == "pdb100":
= True
use_templates = None
custom_template_path elif template_mode == "custom":
= os.path.join(jobname,f"template")
custom_template_path =True)
os.makedirs(custom_template_path, exist_ok= files.upload()
uploaded = True
use_templates for fn in uploaded.keys():
os.rename(fn,os.path.join(custom_template_path,fn))else:
= None
custom_template_path = False
use_templates
print("jobname",jobname)
print("sequence",query_sequence)
print("length",len(query_sequence.replace(":","")))
= "mmseqs2_uniref_env"
msa_mode = "unpaired_paired"
pair_mode
# decide which a3m to use
if "mmseqs2" in msa_mode:
= os.path.join(jobname,f"{jobname}.a3m")
a3m_file
elif msa_mode == "custom":
= os.path.join(jobname,f"{jobname}.custom.a3m")
a3m_file if not os.path.isfile(a3m_file):
= files.upload()
custom_msa_dict = list(custom_msa_dict.keys())[0]
custom_msa = 0
header import fileinput
for line in fileinput.FileInput(custom_msa,inplace=1):
if line.startswith(">"):
= header + 1
header if not line.rstrip():
continue
if line.startswith(">") == False and header == 1:
= line.rstrip()
query_sequence print(line, end='')
os.rename(custom_msa, a3m_file)=a3m_file
queries_pathprint(f"moving {custom_msa} to {a3m_file}")
else:
= os.path.join(jobname,f"{jobname}.single_sequence.a3m")
a3m_file with open(a3m_file, "w") as text_file:
">1\n%s" % query_sequence)
text_file.write(
= "auto"
model_type = "3"
num_recycles = "auto"
recycle_early_stop_tolerance = 200
relax_max_iterations = "greedy"
pairing_strategy
= "auto"
max_msa = 1
num_seeds = False
use_dropout
= None if num_recycles == "auto" else int(num_recycles)
num_recycles = None if recycle_early_stop_tolerance == "auto" else float(recycle_early_stop_tolerance)
recycle_early_stop_tolerance if max_msa == "auto": max_msa = None
= False
save_all = False
save_recycles = True
save_to_google_drive = 200
dpi
if save_to_google_drive:
from pydrive2.drive import GoogleDrive
from pydrive2.auth import GoogleAuth
from google.colab import auth
from oauth2client.client import GoogleCredentials
auth.authenticate_user()= GoogleAuth()
gauth = GoogleCredentials.get_application_default()
gauth.credentials = GoogleDrive(gauth)
drive print("You are logged into Google Drive and are good to go!")
= True
display_images
import sys
import warnings
='ignore', category=FutureWarning)
warnings.simplefilter(actionfrom Bio import BiopythonDeprecationWarning
='ignore', category=BiopythonDeprecationWarning)
warnings.simplefilter(actionfrom pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2
import os
import numpy as np
try:
= os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
K80_chk except:
= "0"
K80_chk pass
if "1" in K80_chk:
print("WARNING: found GPU Tesla K80: limited to total length < 1000")
if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
del os.environ["TF_FORCE_UNIFIED_MEMORY"]
if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]
from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt
# For some reason we need that to get pdbfixer to import
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
0, f"/usr/local/lib/python{python_version}/site-packages/")
sys.path.insert(
def input_features_callback(input_features):
if display_images:
plot_msa_v2(input_features)
plt.show()
plt.close()
def prediction_callback(protein_obj, length,
prediction_result, input_features, mode):= mode
model_name, relaxed if not relaxed:
if display_images:
= plot_protein(protein_obj, Ls=length, dpi=150)
fig
plt.show()
plt.close()
= jobname
result_dir = os.path.join(jobname,"log.txt")
log_filename
setup_logging(Path(log_filename))
= get_queries(queries_path)
queries, is_complex = set_model_type(is_complex, model_type)
model_type
if "multimer" in model_type and max_msa is not None:
= False
use_cluster_profile else:
= True
use_cluster_profile
"."))
download_alphafold_params(model_type, Path(= run(
results =queries,
queries=result_dir,
result_dir=use_templates,
use_templates=custom_template_path,
custom_template_path=num_relax,
num_relax=msa_mode,
msa_mode=model_type,
model_type=5,
num_models=num_recycles,
num_recycles=relax_max_iterations,
relax_max_iterations=recycle_early_stop_tolerance,
recycle_early_stop_tolerance=num_seeds,
num_seeds=use_dropout,
use_dropout=[1,2,3,4,5],
model_order=is_complex,
is_complex=Path("."),
data_dir=False,
keep_existing_results="auto",
rank_by=pair_mode,
pair_mode=pairing_strategy,
pairing_strategy=float(100),
stop_at_score=prediction_callback,
prediction_callback=dpi,
dpi=False,
zip_results=save_all,
save_all=max_msa,
max_msa=use_cluster_profile,
use_cluster_profile=input_features_callback,
input_features_callback=save_recycles,
save_recycles="colabfold/google-colab-main",
user_agent
)= f"{jobname}.result.zip"
results_zip f"zip -r {results_zip} {jobname}")
os.system(
import py3Dmol
import glob
import matplotlib.pyplot as plt
from colabfold.colabfold import plot_plddt_legend
from colabfold.colabfold import pymol_color_list, alphabet_list
= 1
rank_num = "lDDT"
color = False
show_sidechains = False
show_mainchains
= results["rank"][0][rank_num - 1]
tag = ".custom" if msa_mode == "custom" else ""
jobname_prefix = f"{jobname}/{jobname}{jobname_prefix}_unrelaxed_{tag}.pdb"
pdb_filename = glob.glob(pdb_filename)
pdb_file
def show_pdb(rank_num=1, show_sidechains=False, show_mainchains=False, color="lDDT"):
= f"rank_{rank_num}"
model_name = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
view open(pdb_file[0],'r').read(),'pdb')
view.addModel(
if color == "lDDT":
'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
view.setStyle({elif color == "rainbow":
'cartoon': {'color':'spectrum'}})
view.setStyle({elif color == "chain":
= len(queries[0][1]) + 1 if is_complex else 1
chains for n,chain,color in zip(range(chains),alphabet_list,pymol_color_list):
'chain':chain},{'cartoon': {'color':color}})
view.setStyle({
if show_sidechains:
= ['C','O','N']
BB 'and':[{'resn':["GLY","PRO"],'invert':True},{'atom':BB,'invert':True}]},
view.addStyle({'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
{'and':[{'resn':"GLY"},{'atom':'CA'}]},
view.addStyle({'sphere':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
{'and':[{'resn':"PRO"},{'atom':['C','O'],'invert':True}]},
view.addStyle({'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
{if show_mainchains:
= ['C','O','N','CA']
BB 'atom':BB},{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
view.addStyle({
view.zoomTo()return view
show_pdb(rank_num, show_sidechains, show_mainchains, color).show()if color == "lDDT":
plot_plddt_legend().show()
#@title Plots {run: "auto"}
from IPython.display import display, HTML
import base64
from html import escape
# see: https://stackoverflow.com/a/53688522
def image_to_data_url(filename):
= filename.split('.')[-1]
ext = f'data:image/{ext};base64,'
prefix with open(filename, 'rb') as f:
= f.read()
img return prefix + base64.b64encode(img).decode('utf-8')
= ""
pae = os.path.join(jobname,f"{jobname}{jobname_prefix}_pae.png")
pae_file if os.path.isfile(pae_file):
= image_to_data_url(pae_file)
pae = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_coverage.png"))
cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_plddt.png"))
plddt f"""
display(HTML(<style>
img {{
float:left;
}}
.full {{
max-width:100%;
}}
.half {{
max-width:50%;
}}
@media (max-width:640px) {{
.half {{
max-width:100%;
}}
}}
</style>
<div style="max-width:90%; padding:2em;">
<h1>Plots for {escape(jobname)}</h1>
{ '<!--' if pae == '' else '' }<img src="{pae}" class="full" />{ '-->' if pae == '' else '' }
<img src="{cov}" class="half" />
<img src="{plddt}" class="half" />
</div>
"""))
if msa_mode == "custom":
print("Don't forget to cite your custom MSA generation method.")
f"{jobname}.result.zip")
files.download(
if save_to_google_drive == True and drive:
= drive.CreateFile({'title': f"{jobname}.result.zip"})
uploaded f"{jobname}.result.zip")
uploaded.SetContentFile(
uploaded.Upload()print(f"Uploaded {jobname}.result.zip to Google Drive with ID {uploaded.get('id')}")
References
- Wikipedia. (Retrieved 24th of Oct, 2024). https://en.wikipedia.org/wiki/J%C3%B6ns_Jacob_Berzelius.
- Deepmind. (Retrieved 24th of Oct, 2024). https://educlick.uk/course/26
- Basic Structure of an aminacid. (Retrieved 24th of Oct, 2024). https://www.snexplores.org/article/scientists-say-amino-acid
- Myosin-I. (Retrieved 24th of Oct, 2024). https://www.uniprot.org/uniprotkb/X8JJ96/entry