Protein Sequences

  • OpenProteinSet: An open-source corpus of more than 16 million multiple sequence alignments (MSAs), associated structural homologs from the Protein Data Bank, and AlphaFold2 protein structure predictions. It is designed to serve as training and validation data for tasks focused on protein structure, function, and design, as well as for large-scale multimodal machine learning research.
    Publication
    Link