C2H2ZF Repository

1. Software Requirements

We require the following software:

DSSP (version CMBI 2006) provides protein structural features;
X3DNA (version 2.0) is used to analyze and generate DNA structures
Matcher and Needle, from the EMBOSS package (version 6.5.0), produces local and global alignments, respectively;
BLAST (version 2.2.22) is employed to search homologs of a target protein;
MODELLER (version 9.9) is used to create structural models with all the templates similar to our target;
FIMO and TOMTOM from the MEME suite are used to scan a DNA sequence with a PositionWeight Matrix and to compare two PWMs, respectively.

2. Databases

Structural information is retrieved from the PDB repository and protein codes and sequences are extracted from UniProt (January 2019 release).

We select all transcription factors of the C2H2-ZF family as defined in CIS-BP database (version 1.62) with known structures in PDB to generate the internal database of structures (set PDB_DNA).

We rearrange the set of structures by separating them in chains and constructing a set of structures formed by single protein-chains interacting with a double-strand helix (single-chain PDBDNA).

After removing structures with methylated forms of DNA in the binding site, the set is obtained with the PDB codes:

'5t00', '5kkq', '1a1g', '6jnm', '4r2d', '1tf6', '4m9v', '4r2a', '1tf3', '4r2c', '4x9j', '2i13', '5und', '5ke7', '5ke8', '2drp', '5ke6', '1llm', '4m9e', '5vmv', '1p47', '5vmw', '5ei9', '5eh2', '2wbu', '6e93', '6e94', '1jk1', '1jk2', '3uk3', '5wjq', '2kmk', '6jnn', '5v3m', '6jnl', '4gzn', '5t0u', '5kea', '5keb', '1ubd', '5k5h', '5k5l', '5k5j', '5k5i', '1a1f', '1a1i', '1a1h', '1a1k', '1a1j', '1a1l', '5ke9', '4f6m', '4f6n', '2gli', '5yeg', '1un6', '5yef', '1g2f', '5yeh', '1g2d', '1zaa', '5yel', '5egb', '1aay', '4is1', '5vmu', '1f2i', '5v3j', '5yj3', '2hgh', '5vmz', '2lt7', '5vmx', '5vmy'

Binding information of Zinc-finger family C2H2-ZF is retrieved from bacteria one-hybrid (B1H) experiments.
The experiment distinguishes between Zinc-finger individual domains at the C-tail (F3 domain) and inner domain (F2 domain).

The experiment performs the screening of all 64 possible 3bp targets for interactions with C2H2-ZF domains from multiple large protein libraries based on Zif268 structure with six variable amino acid positions on each individual domains F2 and F3.

Here, we summarize the key points of the B1H experiment of Persikov et al. that we use in our work.
The experiment surveys systematically the DNA-binding landscape of C2H2-ZF domains. It relies on a repertoire of protein libraries of engineered modifications of the sequence of Zif268 allowing for each of the 20 amino acids at positions -1,1,2,3,5 and 6 of the alpha-helix (i.e. the hexamer core sequence) of domains F2 and F3. In the 3D space, these residues interact specifically with three nucleotides.

Attached is an example for the F2 domain, with the amino acid positions indicated by numbers and the three nucleotides represented with Ns and highlighted in red:

The modified Zif268 proteins are expressed as fusions to the omega subunit of RNA polymerase, which act as activation domain of the hybrid assay. Then, each of the 64 possible 3bp DNA target sequences, located 10bp upstream of a weak promoter driving the reporter genes HIS3 and URA3, is tested; when cells are grown on minimal media requiring the activation of HIS3, only a functional protein-DNA interaction leads survival of the bacteria.

The authors considered that the affinity of protein-DNA interaction was related to the growth rate in the B1H system. They suggested a computational approach to infer the affinity of a specific hexamer core sequence based on the frequency with which it was found in each of the 64 possible 3bp targets. The method for predicting the DNA- binding specificity for a core sequence was defined by the authors as the “lookup” procedure, which is based on finding the core sequence across all the protein selections.

They normalized the frequencies so that they summed up to 1 across the 64 possible 3bp targets, and obtained a binding profile with the probability distribution of the preference of a core sequence for each 3bp target. We retrieved from the repository provided by the authors the files with the logged and normalized frequencies of hexamer core sequences for each of the 64 possible 3bp targets.

Finally, the affinity of the protein-DNA interaction is calculated as the affinity percentile:

Where, f(seq) is the frequency of the particular hexamer core sequence and f is any other frequency on the set of hexamer core sequences binding the same 3bp.

3. Interface and triads of protein-DNA structures

The interface between a transcription factor and DNA is defined by the residues (aminoacids and nucleotides) in contact.

A general approach for protein-protein interactions is to consider that two residues are in contact if the distance between a pair of atoms from each residue is shorter than 5Å.

We define triads as a type of contacts between the protein and the double-strand DNA helix.

Triads are formed by three residues: one amino-acid and two contiguous nucleotides of the same strand.

The distance associated with a triad is defined by the distance between the Cb atom of the amino acid residue and the average position of the atoms of the nitrogen-base of the two nucleotides plus their complementary pairs in the opposite strand of the helix.

The triad also has an associated amino-acid residue number in the protein and a dinucleotide position in the DNA, defined by the sequence position of the first nucleotide of the dinucleotide.

We define the interface between a transcription factor and DNA as:

The set of triads with associated distances shorter than 15 Å, and their associated amino-acid residue number and dinucleotide position (e.g. a triad with amino-acid residue number p, dinucleotide in position q and associated distance d is represented as (triad, d, p, q)).

Specific features can be added on a triad, defining an extended-triad:

Hydrophobicity of the amino-acid. Amino-acid residues are split in Polar (P): {Arg, His, Lys, Asp, Glu, Ser, Thr, Asn, Gln, Cys, Gly} and Non-polar (N): {Ala, Ile, Leu, Met, Val, Phe, Trp, Tyr, Pro}.
Surface accessibility of the amino-acid. We use DSSP to calculate the percentage of accessibility of the residue in the unbound structure of the protein. If the percentage is smaller than 50% the amino-acid is buried (B), otherwise it is exposed (E).
Secondary structure of the amino-acid. We use DSSP to calculate the secondary structure of the protein. The amino-acid of the triad is either in regular secondary structure (H if in α-helix, E if in b-strand), or in a non-regular secondary structure (C)
Nitrogenous bases: We classify nucleotides by their nitrogenous bases in two types, purines (U): {A, G} and pyrimidines (Y): {C, T}.
Closest strand. We use X3DNA to define the strands forward and reverse of the DNA. Next, we calculate the distance of all atoms of the two nucleotides to the Cb of the amino-acid.
We define the strand closest to the amino-acid (i.e. with the atom at minimum distance) as either the strand of the two nucleotides of the triad or the strand of their complementary pair in the opposite strand, which can be either forward (F) or reverse (R).
Closest Groove. We calculate the distances between the Cb of the amino-acid and the closest phosphates of the dinucleotides in both strands (i.e. the strand of the two nucleotides of the triad and its complementary).
We calculate the positions of the closest phosphates in both strands (let be Pf and Pr, backbone phosphates of nucleotides f and r, respectively).
We select the closest phosphate of both and its corresponding strand.
Let assume that Pf is the closest phosphate and define its strand as “s”, being “S” the opposite strand.
Then, we consider the set of backbone phosphates in “S” around the position complementary of nucleotide “f” (6 nucleotides up and down).
Depending on their distance to “f” (towards 22Å is a major groove and towards 12 Å a minor groove), we classify them as part of the minor or major groove with respect to nucleotide “f”.
This is a classification of 12 nucleotides around the complementary of “f” in two groups:
1. Set at large distance (i.e. major groove);
2. Set at short distance (i.e. minor groove).
Necessarily, Pr is in the list classified in major or minor groove.
We use the classification of Pr to define the type of the closest groove of the amino-acid (i.e. we should say that this is the groove faced by the amino-acid, defined by the pair Pf and Pr, in closest proximity to the amino-acid).
The closest groove is defined as major groove (A) if Pr is in the list classified in major, otherwise it is defined as minor groove (I).
Chemical group of the nucleotides. We distinguish two main chemical groups of each nucleotide, the nitrogenous base (N) and the backbone (B) that includes the phosphate and sugar.
We calculate the distances between the Cb of the amino-acid and the atoms of the two nucleotides and their complementary.
We select the atom with the shortest distance as the closest atom between the nucleotides and the amino-acid.
We define the chemical group of the nucleotides of the triad as the chemical group to which belongs the closest atom (i.e. N or B)

Added features of triads can also be used on their own as feature-triads (or environment triads), and every extended-triad has an associated feature-triad, both associated with the same distance, amino-acid number and dinucleotide position.

As an example, let be a lysine residue and two nucleotides, adenosine and guanosine, forming the triad [K,(AG)] at 15.6Å, with lysine in residue number 32 and adenosine in 5, described as ([K,(AG)], 15.6, 32, 5).

If lysine surface is mainly exposed to solvent, in a a-helix conformation and the closest strand of DNA is the forward strand, the closest atom of the two nucleotides is a phosphate and the amino-acid faces the minor groove, the extended-triad is [{K,(p-H-E)},{(AG),(UU-F-I-B)}], where added features are (p-H-E) for the amino-acid and (UU-F-I-B) for the dinucleotide.

This produces a feature-triad defined as [(p-H-E),( UU-F-I-B)] at 15.6Å.

We require to define some functions on the sets of triads, extended-triads and featuretriads to extract some of the values collected from a complex structure and apply other functions:

f_td (triad, d, p, q) = (triad, d)
f_t (triad, d, p, q) = (triad)
f_d (triad, d, p, q) = (d)
f_a (triad, d, p, q) = (p)
f_n (triad, d, p, q) = (triad, d)

The same functions are applied to extended-triads and feature-triads accordingly modified.
We also define functions to substitute some of the elements of a triad, extended-triad or feature-triad (the example is given for etriads without loss of generality):

ε_α is a function that substitutes amino-acid residue “a” of the etriad by amino-acid-residue “r”, with the corresponding change of hydrophobicity but preserving the rest of features and measures associated with the triad.
η_v(etriad, n) is a function that substitutes dinucleotide v by n, in Δ = {A, C, G, T} × {A, C, G, T}, with the corresponding change of nitrogenous bases and preserving the rest of features and measures associated with the triad.

4. Statistical Potentials

We use the definition of statistical potentials described by Feliu et al, and Fornes et al. to define several scoring functions for the interaction between a protein and a DNA binding site using contact triads.

We use triads, their associated measures (i.e. distance, amino-acid number and dinucleotide position) and their added features to calculate the frequencies per distance in bins of 1Å (i.e. intervals [0,1], [1,2], [2,3], [3,4], [4,5] etc.) up to 30 Å.

We also calculate the frequencies using the distance as cut-off (i.e. the frequency of triads at distance shorter than “d”, with d= 1,2,3,4, etc.).

To obtain the frequencies, we first calculate the size (cardinality, defined by the function “Card”) of the sets of triads, associated with a distance (d), taken from the set of structures of protein-DNA interactions (PDBDNA) and grouped by their associated distance, limited to a maximum of 30Å (i.e. with i ∈ [1,30]).

The set of triads, associated with distances, amino-acid residue-number and dinucleotide position, is named 3Dset. Then, frequencies are defined using functions defined in section 3 as follows:

Where x = (triad, d, p, q) is a triad associated with a distance d, amino-acid residuenumber p and dinucleotide position q, taken from the set 3Dset, N_D is defined using bins and N_c using cut-offs. Similarly to 3Dset we define the sets e3Dset and f3Dset for extended-triads (etriad) and feature-triads (ftriad), and calculate L_D, L_C , M_D and M_C with i ∈ [1,30] as:

Then, we define the frequencies (F for triads, G for extended-triads and H for feature-triads) as:

Where N can be N_D or N_C, L can be L_D or L_C , and M can be M_D or M_C, depending on the approach to group the triads.
This definition forces us to consider independent the groups obtained by cut-offs, instead of using the ratios with respect to the limit at 30 Å.
We tested in artificial data that this approach preserves the curve of the statistical potential similar to the classical definition by bins of distances, but it’s less affected by the scarcity of data.

To define a reference-state for the statistical potential, we require two more frequencies, one for N_D and another for N_C , using the total number of triads in the database (triads):

Where triads ∈ triads, and it’s easy to proof that:

Where ftriads is the set of feature-triads and etriads the set of extended-triads, with ftriad> ∈ ftriads> and etriad> ∈ etriads. Using these definitions and following previous works, we define the potentials E3DC, ES3DC and PAIR per triad and distance d, using the round value of d (i.e. k), as follows:

Where, F, G, H and O are frequencies calculated by bins or using cut-offs.
The total potential of an interaction is calculated as the sum of the corresponding potential of all triads, feature-triads and extended-triads at distances shorter than 30Å.
We use each potential to score the quality (or potentiality) of the interaction.
Let be I, E and D the sets defined respectively as the set of all triads, extended-triads and feature-triads with their associated distances (d), amino-acid residue number (p) and dinucleotide position (q) in the binary interaction of a TF-DNA structure.

Therefore, we define the energy-based scores (as they are based on total potentials) as:

Similarly, we also use the potentials defined for triads, ftriads and etriads as scores of the quality of a single interaction between one dinucleotide and one amino-acid residue.
Then, the potential ES3DC of an extended-triad associated with distance d can be rewritten as:

Where k = 1 + int(d - 1). We then define two scoring terms, one distance independent (ES3DC_di) and another distance dependent (ES3DC_dd), as follows:

Using eq. 8, eq. 19 and eq. 20, we rewrite ES3DC(etriad, d) as:

And the global energy-based scores:

We consider ES3DCdd to evaluate the quality of an interaction because it is more specific than PAIR, as it uses extended-triads and distances, and it is also highly sensible, because it is normalized over the total of triads at a given distance with independence of the features.

However, we have to note that this can only be considered as a scoring of quality, because the complete statistical potential requires the other terms E3DC and ES3DCdi to complete ES3DC in equation 21.

Finally, due to the scarcity of data the curves of statistical potentials may be jagged.
Therefore, we use a sliding window of approximately W samples defined by distances (by bins or cut-off) to smooth the potential curves. Let be Scr(d) a distance-dependent score, as defined previously, then we define the smoothed score, Scr_smooth, as:

Where k_max and k_min are defined as:

k_max = min (30; 1 + int(d - 1) + int(W/2)
k_min = min (0; 1 + int(d - 1) - int(W/2)

Where, size_samples is the total of bins between k_min and k_max with defined score (Scr), which is around W (i.e. size_samples ≈ W)

5. Z-Scores

We have to note that frequencies are always smaller than 1, consequently their logarithm is smaller than 0.

However, statistical potentials are not necessarily negative, because by definition the use of a reference state is required, which is obtained by the sum of all triads at a given distance (also ftriad or etriad, depending on the potential of interest).

Consequently, the comparison with a reference state can change the sign of the final sum of terms (for example, PAIR and ES3DC can be negative while E3DC may be positive).

The variability of signs of the potentials affects the criterion of quality of the scores and they may become unclear.
Nevertheless, indistinctly of the sign, the best interaction between an amino-acid and a dinucleotide is produced at the distance where PAIR and ES3DC are minimum, because it implies the highest frequency of a triad (or etriad) with respect to all triads (or feature-triads).

For example, ES3DCdd has a minimum for the highest frequency of an extended-triad with respect to all triads.
We then define z-scores in order to follow a criterion that incorporates the sign to score the quality of the interaction between an amino-acid and a dinucleotide as a function of the distance.

We wish that the z-score identifies simultaneously the best distance associated with a triad and the best pair formed by one amino-acid and one dinucleotide.

Consequently, we construct a zscore function with any type of score, applying without loss of generality on an extended-triad (etriad) and an associated distance d, as:

Where: A is the set of all amino-acid types (i.e. Car{A} = 20), and we use the classical functions of average (μ) and standard deviation(σ), defined as:

Notice that we use the function ε (etriad, r) as above and that we use the term score instead of potential (P in previous section) to emphasize that this is used only as a criterion of the quality of the interaction.

It’s easy to proof that this definition satisfies our requirements for the new function zscore:

1) The minimum value of score produces a negative value of zscore (and viceversa, the maximum yields positive);
2) As the zscore compares the score of a particular amino-acid with all other, the etriad becomes ranked from minimum to maximum allowing us to select the best residue among amino-acids in the same position, depending on the quality criterion of the score.

Finally, if we choose to smooth the curves dependent on distance, the zscore has to be calculated first with the unsmoothed scores and subsequently smoothed to avoid introducing biases.

6. Structural Modeling of C2H2-ZF Complexes

Given the sequence of a TF (named protein target) and a DNA binding site fragment (named DNA target), we obtain the structure of the complex by means of homology modelling using the program MODELLER.

First, we search potential templates of the protein target among the set of sequences with known structure in PDBDNA using BLAST 15. We use the alignment obtained with matcher and the sequence and structure of each template to construct the models with MODELLER.

The DNA binding sequence of each testing 3-domain C2H2-ZF protein is formed by 9bp nucleotides, a set of 3bp bound by each individual domain.
For the selection of the binding sequences associated with each finger domain we use the same sequences as in the B1H experiment 11:

1) for the selection of F2, the finger sequences in F1 (N-tail domain) and F3 (C-tail domain) involved in the interface are RSDNLRA(F1) and RSANLVR (F3), respectively binding AAG and GAG; and 2) for the selection of F3, the finger sequences in F1 (N-tail domain) and F2 (inner domain) involved in the interface are RSDELTR (F1) and RSDNLRA (F2), respectively binding GCG and AAG.

The structure of Zif268 binding DNA is modelled with 23 different template structures to introduce structural variability, retrieved by codes:

1p47 (chain A), 1zaa (chain C), 1g2d(chain C), 1a1h(chain A), 1a1g(chain A), 1a1i(chain A), 1a1j(chain A), 1a1k(chain A), 1a1l(chain A), 1aay(chain A), 1jk1(chain A), 1jk2(chain A), 2kmk(chain A), 2wbu(chain A), 4r2a(chain A), 4r2c(chain A), 4r2d(chain A), 5ke6(chain A), 5ke7(chain A), 5ke8(chain A), 5ke9(chain A), 5kea(chain A) and 5keb(chain A).

We compare the sequence of Zif268 used in the experiment with the templates by a sequence alignment with CLUSTALW 16.
We identify as WT the original sequence, and as F2 and F3 the sequences used for the selection of F2 and F3 binding sequences, labeling by “X” the amino-acids that are modified.
This alignment is shown here for 1p47 used as template, highlighting in bold and red the attention of the specific binding sequence of each finger (blue box for F1, yellow box for F2 and green box for F3):

After modeling the structure of Zif268, we complete the complex by modeling the structure of the DNA binding sequence.
However, each template has DNA sequences of different length that do not correspond with the DNA used in the experiment.
The full DNA sequence of the experiment islonger (29bp) than the binding (9bp), which is shown next, embedded in positions 11 to 19 (nucleotides highlighted), labelling by “N” those under test.
Two DNA sequences are considered depending on the experiment, one for the selection of F2 and another for F3:

The structure of the full DNA sequence bound by Zif268 is obtained with the program X3DNA4 by modifying the DNA structure in the complex.
First, we locate the 9bp binding region in the structure of the template and identify the positions at 5’ and 3’ (first and last).

Next, we construct two frames of B-DNA structure with the lengths required to extend the template at 5’ and 3’ up to 29 nucleotides.
The lengths in both sides depend on the location of the 9bp binding sequence and the DNA sequence of the experiment.

Then, we use 3DNA/DSSR to perform a least-squares fitting that locates each base reference frame in the first and last positions.
Finally, the structure of DNA is completed with the right sequence.
We model the DNA structure by substitution of the nucleotides in each model with the corresponding nucleotides of the target.
We use the program X3DNA to substitute the nucleotides of one strand and automatically model its corresponding pair.

In order to understand the approach, we select one of the hexamer sequences in F2, “SAGSYN” (highlighted in bold), with highest affinity for “AAA” and we use the structure of 1p47 as template.
We use the program MODELLER with the alignment:

MODEL GTERPYACPVESCDRRFSRSDNLRAHIRIHTGQKPFQCRICMRNFSSAGSLYNHIRTHTGEKPFACDICGR
1p47_A --ERPYACPVESCDRRFSRSDELTRHIRIHTGQKPFQCRICMRNFSRSDHLTTHIRTHTGEKPFACDICGR

MODEL KFARSANLVRHTKIHLRGS/....................../......................*
1p47_A KFARSDERKRHTKIHLRQ-/....................../......................*

The dots in the alignment are used to indicate the ligand, which in this case corresponds to the DNA sequence, using the slash to separate the chains. MODELLER produces a model with the anticipated sequence on the protein, but the DNA sequence is still the same as in the original template (i.e. 5’-GTGGCGTGGGCGGCGTGGGCGT-3’ in chain “C” and the paired sequence in chain “D”). Then, we use the package X3DNA to modify the sequence in the structure of chains “C” and “D”, substituting the nucleotides by the expected sequence with the program “mutate_bases” from X3DNA package.

The modified sequence becomes then 5’-GCGGCCGCAAGAGAAAAAGTAA-3’, where the anticipated sequence “AAA” is highlighted in bold.
Finally, we use the program “fiber” from X3DNA package to complete the missing nucleotides (i.e. “CGAATTC”) with a B-DNA conformation.

The modeling of any other transcription factor formed by C2H2-ZF domains is similarly done, aligning the protein sequence of the query with the best template protein sequences of transcription factors formed by consecutive C2H2-ZF domains.
We filter out the alignments with gaps in the DNA binding helix of the templates. The schema for modeling is here summarized:

We use a database of 86 chain structures, with DNA complexed with the DNA binding domain, formed by continuous zinc-fingers, as the set of templates for the server in C2H2ZF_repo (codes of chains are also indicated):

'1zaa_C', '4x9j_A', '1p47_A', '1a1i_A', '1a1j_A', '1a1l_A', '1jk1_A', '4r2c_A', '5ke6_A', '5ke7_A', '1llm_C', '1llm_D', '1ubd_C', '2drp_A', '2lt7_A', '3uk3_C', '4f6m_A', '4f6n_A', '4is1_C', '5k5i_A', '5k5l_G', '5und_A', '5v3j_E', '5vmw_A', '5vmz_A', '6e93_A', '1un6_D', '5yef_A', '5yef_J', '2kmk_A', '1g2f_C', '1jk2_A', '1a1g_A', '1a1k_A', '1g2d_C', '4m9e_A', '4r2a_A', '4r2d_A', '5ke8_A', '5kea_A', '1f2i_G', '1p47_B', '1tf6_A', '2hgh_A', '4gzn_C', '5k5l_E', '5k5l_F', '5kkq_A', '5t0u_A', '5v3m_C', '5vmy_A', '5yeg_A', '5yel_B', '6jnl_A', '1un6_B', '5yef_B', '2wbu_A', '1a1f_A', '1a1h_A', '1aay_A', '5ke9_A', '5keb_A', '1f2i_H', '1tf3_A', '2gli_A', '2i13_A', '3uk3_D', '4is1_D', '4m9v_C', '5egb_A', '5eh2_F', '5ei9_E', '5k5h_A', '5k5j_A', '5t00_A', '5vmu_A', '5vmv_A', '5vmx_A', '5wjq_D', '5yeh_A', '5yj3_C', '5yj3_D', '6e94_A', '6jnm_A', '6jnn_A', '5yef_G'

We also model several structures with the complex of Zif268 binding a non-specific DNA region using the same approach. These structures are used as non-binding examples.
The non-binding sequence is taken randomly by selecting a region of the sequence of the weak promoter GAL1, constructing 10 DNA fragments of 29bp for each known binding.

A potential binding test in this region, which is part of the B1H experiment, may well represent the background. The forward weak promoter sequence of GAL1 is formed by 118bp that are shown here:

7. Use of Experimental TF-DNA binding to calculate statistical potentials

One of the main problems to obtain statistical potentials for all families and folds of TFs is the scarcity of known interactions.
We need to enlarge the number of interacting triads. Therefore, here we propose to use the experimental knowledge of TF-DNA interactions to derive interacting triads without requiring the complete knowledge of TF-DNA complex structures. We then use the sets of derived triads associated with distances to calculate the statistical potentials.

Let be a TF with experimentally known interactions with several DNA sequences. We define a mapping function (mapD) between the fragment of the DNA sequence from the structure (S_DNA), and each positive binding sequence (S_exp), as follows:

Where n_m is a dinucleotide, with n_m S_exp, v_m is also dinucleotide, with v_m ∈ S_DNA, m is the position of the dinucleotide in the alignment between S_exp and S_DNA, q_vis the position of the dinucleotide v_m in S_DNA, q_n is the position of the dinucleotide n_m in S_exp, and the position of each dinucleotide is defined (i.e. equal to) the position of the first nucleotide in the DNA sequence.

For the sake of simplicity, when the length of S_DNAis the same as S_exp and the position in the alignment, m, coincides with q_v: and q_n (i.e. m = q_n = q_v ), we write:

We can define a set of mapping functions with the alignments of all the DNA sequences extracted from the structures of TF-DNA complexes in PDBDNA that are aligned with positive binding sequences.
Then, we use the dinucleotide substitution function (as defined in section 3), η_v (etriad, n), of an extended-triad, etriad, containing a nucleotide v which is substituted by n in the dinucleotide, to generate more extended-triads.

There are other TFs for which the structure is not known but can be modelled.
This implies that the sequence of the TF can be aligned with sufficient percentage of identical residues to ensure its modeling. Then, we define another mapping (map_P) between the protein sequence of the TF with known experimental data on DNA binding and the TF sequence of a known structure (template), with:

Where, a_m is an amino-acid residue in the sequence of a template and r_m is the aminoacid in the sequence of a TF with experimental data, in position m of the alignment of both TF sequences that correspond with positions p_r and p_a for r_m and a_m, respectively. Also, for the sake of simplicity, if the position in the alignment, m, coincides with p_a and p_r (i.e. m = p_r = p_a ) we write:

We define the set of all mapping functions with all the alignments between the sequences of TFs with experimental annotation and the TFs with known structure (complexed with DNA). Also, we define etriads(a, v) as the set of extended-triads, extracted from the 3Dset, containing amino-acid residue a ∈ Δ (the set of 20 aminoacids) and dinucleotide v ∈ Δ = {A, C, G, T} × {A, C, G, T}.

We remind the definition of ε _a (etriad, r)as the function that substitutes amino-acid residue “a” of an extended-triad, etriad, by the amino-acid residue “r”. With all these definitions, we increase the set of extended-triads of the e3Dset to e3Dset’, using all the mapping functions mapD (simplified as map̃_D) and map_P (simplified as map̃_P). We use the simplified maps without loss of generality, as all sequences and alignments can be renumbered. Both mappings, map̃_D and map̃_P, are respectively defined with:

1) the alignments of the DNA sequences extracted from the structures in the PDB aligned with positive binding sequences;
and 2) the TFs that can be modelled using the alignment with their structural templates (the set is defined as PBM_3DSET). The new set of extended-triads, e3Dset’, is defined as:

And we recalculate L_D and L_C in equations eq.3 and eq.4 as:

Similar approach is also taken for the sets of triads and ftriads, modifying accordingly the corresponding functions to substitute amino-acid and dinucleotide residues of a triad or a featured-triad.

We proceed similarly with B1H experiments on C2H2-ZF family.
For each finger (F2 and F3) and combination of 3bp nucleotides, we collect all protein sequences producing significant binding signal in the B1H experiment.
We use the modelled structures of the DNA testing sequence of 29bp with different templates and introduce the mappings for the DNA sequence and the modified residues in F2 or F3 from the multiple sequence alignment.
For the DNA sequence the mapping is on the 3bp modified nucleotides, affecting 4 dinucleotides, while for the protein sequence the mapping affects 6 aminoacids, both mappings being different for F2 and F3 selections. This is, for the DNA sequence the mapping is n_m = map̃_D(v_m, where n_m is a dinucleotide of the 3bp under test, v_m is a dinucleotide of the 29bp of the modelled template, and the position, m, is between 11 and 19 (11-13 for F3 and 14-16 for F2).

For the substitution of residues of Zif268 in the interface we require a mapping of the native sequences RSANLVR (F3) and RSDNLRA (F2), for all selections of the binding finger. This mapping is (r_m, p_r) = map_p(a_m, p_a where m is the position in the alignment (residues 47-53 for F2 and 75- 81 for F3), p_r = m, the position in the template, p_a, depends on the template used to model Zif268 and a_m is the corresponding residue in the template in position m of the alignment.

From the template structure we extract the contacts (triads, etriads and ftriads) between amino-acids and dinucleotides and generate the statistical potentials.
However, this introduces a bias by overestimating the constant amino-acids and nucleotides that have not been modified in the experiment. Therefore, only the triads affecting the amino-acids and nucleotides under test are considered to generate the potentials. This is, we only consider the contacts between the 6 amino-acids labelled by “X” and dinucleotides containing one of the 3bp labelled by “N” to generate two potentials, one for F2 and another for F3 finger positions.

We restrict each set of Zif268 sequences to those with highest signal of the B1H binding experiment in order to obtain potentials more specific or associated with the strongest binding. We define three thresholds based on the affinity percentile of a sequence:

1) higher than 90%;
2) higher than 75%;
3) higher than 50%.

To calculate the affinity percentile of a sequence we follow the same definition as the authors.

Each sequence, seq in its corresponding domain, has a logged and normalized frequency of its observation (p_seq).
Hence, the affinity percentile is defined as the sum of all other frequencies lower or equal to the frequency of the sequence (e.g. an affinity percentile of 90% implies that the sequence is on the tail with highest number of observations, in the top 10%).

However, the number of selected sequences may be too different between experiments (i.e. the 3 nucleotides of the binding “TGA” may have 30 sequences with affinity percentile higher than 90, while AAA has only 2), which produces the opposite bias on the expected potential.

To avoid a bias on the number of sequences selected, we force to have around 500 sequences for all trinucleotide-binding experiments, by repeating as many times as we need each sequence (e.g. if only 2 sequences are selected for AAA and they are equally representative, we should repeat 250 times each).
Each sequence is repeated in proportion to the number of observations.

This is, for a sequence with observation p_seq, it is repeated 500 × p_seq/Σ_k∈A90pk where A₉₀ is the set of sequences with affinity percentile higher than 90 (or we use A₅₀ for affinity percentile higher than 50, etcetera). As a consequence of the approach, the contacts derived from the B1H experiment are limited to relatively short distances (the largest contacts are around 15-20Å).
However, we note that we also use contacts extracted from other structures of the C2H2-ZF family in the PDB and from the use of PBM experiments, covering larger distances up to 30 Å.

8. Scoring TF-DNA binding with structure

Scores of single domain structures

Given the structure of a protein-DNA complex, either experimentally obtained (i.e. from crystallography and identified by a PDB code) or modelled, we define several scores of the interaction based on statistical potentials.

First, we calculate the interface of the interaction and extract all triads, extended-triads and feature-triads associated with distances shorter than 30Å.
Then, the score of the interaction is defined as the sum of the scores (i.e. potential) of all triads with their associated distances (or extended-triads or feature-triads, depending on the type of score).

The same approach is applied for z-scores. Let Scr be a potential or a z-score and let C be the set of triads (extended-triads or feature-triads, depending on the definition of Scr) and their associated distances (d), amino-acid residue number (p) and dinucleotide position (q).
The score of the interaction is defined as:

We can obtain the score of a TF without knowing the structure of the TF-DNA binary complex if it can be modelled.
We use the structure of a template to generate the set of triads and the mapping of amino-acids derived from a sequence alignment, mapP as defined in section 7 (eq.28), between the TF sequence and the sequence of the template.

We also need the mapping of dinucleotides between the DNA sequence we wish to model and the DNA sequence in the template interface.
Instead of modeling the structure of the TF-DNA complex, we modify the scores by applying the substitution of the corresponding amino-acids, using the functions defined as in section 3, ε_a(triad, r) and η_v (triad, n), and the mappings map_P (in eq. 28) and map_D (in eq. 27) between the templates and the sequences of TF and DNA, respectively.
Here, instead of using simplified mappings (map̃_D and map̃_P), we generalize the formula by defining special functions, f₁ and f₂, to extract the dinucleotide or amino-acid positions in the DNA or protein sequences:

Where r is either a dinucleotide or an amino-acid residue, and q is a position of a dinucleotide or an amino-acid, respectively for DNA or protein sequences. Then we calculate the score of the interaction as:

Multiple domain TF

There are TF structures with more than one domain, such as the particular case of the C2H2-ZF family, where the TF has several domains like F2 (internal) and two more domains, one at the N-tail (F1) and another C-tail (F3). However, we apply two potentials for all domains, one obtained with B1H data on finger domains in F2 and another for F3. For example, we use the statistical potential ZES3DC calculated with variant sequences in F2 domain (ZES3DCF2) or in F3 (ZES3DCF3).

9. Construction of PWMs using Zif268 structure models.

Given the modelled structure of Zif268-DNA complex, we obtain the PWM by means of statistical potentials using scores or zscores (we use the zscore of ES3DCdd by default as example).

We focus on the specific nucleotides for three continuous fingers (9 bases) covering the whole binding site in sliding windows.
We collect the set of triads, extended-triads and feature-triads with their associated distances between protein and DNA and the associated amino-acid and dinucleotide positions (i.e. C_k, eC_k, fC_k, respectively), where the dinucleotide in the triad belongs in 9 nucleotide overlapping fragments (we name them DNA_F1 , DNA_F2 etc.).

We remind the substitution function η_v(etriad, n), to substitute the dinucleotide v of an etriad by the dinucleotide n. Similar functions are defined to substitute the dinucleotide in triads and feature-triads (these are only affected in the change of the nitrogenous bases).

Then, for each fragment DNA_FK , with k=1,N-8 and the binding site of length N, we obtain a test set with all possible DNA sequences (i.e. a total of 49). We define two mappings, map_DNAFK and map_seq, respectively for the native fragment sequence of the model and any sequence seq in the test set, both between sequence position and dinucleotides (i.e. map_seq(j) = ω_jω_j+1 with ω_j nucleotide in position j of seq). We calculate the score of any sequence of the test set with the triads, extended-triads and feature-triads, using the associated distances, residue number and dinucleotide positions from the complex structure (i.e. triads, extended-triads and feature-triads as C_k, eC_k, fC_k, respectively) and using the corresponding substitution function (e.g. η_v(etriad, n)). Let ZES3DC_dd be the z-score of ZES3DC_dd and assume we apply it on extended-triads without loss of generality, then the score of a sequence seq is:

Where, for each (etriad, d, p, q) ∈ eC_k, etriad, d, v and n are calculated using the functions as defined in section 3 and the mappings defined above:

We normalize the scores between 0 and 1, using the total set of scores obtained with all generated sequences (i.e. set {score_seq} ) and modifying the sign accordingly:

Where we have assumed that the best score is already the maximum. For example, for ZES3DC_dd we need to multiply the score by -1. Then, we rank the normalized scores and select only the DNA sequences producing the top scores over a cut-off threshold (often 0.9). This produces an alignment from which we calculate the PWM (i.e. frequencies of nucleotides in each position)

10. Score per Nucleotide: profiles of a DNA binding site

We define a nucleotide profile as a function on ℝ, f: ℕ ⟶ ℝ, of the nucleotide position in a DNA sequence.
Here, we define score-nucleotide profiles when the function is obtained with the scores and z-scores. Given a TF-DNA complex structure, 3D-TF, and a distance dependent score, score, obtained with statistical potentials (e.g. the smoothed z-score of ES3DCdd). Let be j a nucleotide position of the DNA sequence in the complex, we define a new score, Scr_score in j, as:

Here we use extended-triads without loss of generality, although depending on the statistical potential we can use triads or feature triads instead.

The set eC_j is the set of etriads with all associated distances (i.e. any d), and amino-acid residue numbers (i.e. any p), where dinucleotide in position q implies that j is the position of one of the nucleotides in the dinucleotide at q (i.e. q = j or q = (j − 1)).

This new score can be normalized by considering the contribution of the nucleotide to the total score or as the percentage of the contribution of all nucleotides of the DNA sequence. Assuming that the score is the smoothed z-score of ES3DC_dd and without loss of generality, the normalized nucleotide profile is:

Where factor_j = 2 for all j but the extremes at 5’ and 3’, in which is 1, and from equation 37 we write ZES3DC_dd as:

E is the set of extended-triads(with their associated distances, amino-acid numbers and dinucleotide positions) and length is the length of the DNA sequence. The factor of 2 in equation 39 is produced by the fact that each nucleotide is counted twice in eC_j, with the exception of the extreme positions in 5’ and 3’ where the nucleotides are only reckoned one time.

The curve of Scr_score(j) (raw or normalized) along the positions in the DNA sequence is defined as the nucleotide profile based on 3D-TF for the (raw or normalized) potential defined in score (e.g. the smoothed z-score of ES3DC_dd).

Hence, profiles defined upon scores derived from statistical potentials are dependent on the structure of the TF-DNA interaction complex.
We have to note that, if the structure of this complex has been modelled,several models may be considered (i.e. we define the set of models of TF-DNA as MDL_TF).

Besides, some models may be obtained using different templates, implying that these models introduce a relevant variability on the conformational space of the TF-DNA interaction. Consequently, several nucleotide profiles of scores are accumulated for the same DNA sequence. We then calculate the average and standard deviation of Scr_score(j) for all positions j along the DNA sequence with the nucleotide profiles of score based on each model structure m of the MDL_TF set (i.e. described as (Scr_score(j))_m), using following equations:

Equations in 41 describe two new nucleotide profile functions. The average function is defined as the nucleotide profile of score (e.g. the smoothed z-score of ES3DC_dd) and the RMSD defines its margins of error or variability.

The nucleotide profile can also be calculated with other “scores” different than those derived by statistical potentials, for example by the enrichment or the number of times that nucleotide j is selected by several PWMs assigned to a TF. This extends the definition of score nucleotide profiles to other scores different than those obtained with statistical potentials.

11. Construction of the experimental PWM

The experimental PWM of a Zif268 sequence obtained from B1H, with a specific hexamer fragment on F2 or F3, is calculated based on its affinities for different binding sites.

The binding site is formed by thee nucleotides flanked by two fixed nucleotides (G and A for F2, and two A for F3). All binding sites targeted by a specific hexamer-fragment with affinity higher than a threshold are stored and gapless aligned to construct the PWM (e.g. the top 10% threshold uses all DNA-bound sequences with affinity percentile higher than 90%, while for a threshold of 100% we use all detected sites with any not null affinity percentile).

We construct experimental PWMs for top 10%, top 25%, top 50% and for all targeted sites. These experimental PWMs are also named hexamerspecific PWMs, to distinguish from PWMs obtained with other experiments or with a different approach.

Extension of Methods