Dibenzoylhydrazines as Insect Growth Modulators: Topology-Based QSAR Modelling

Dibenzoylhydrazines Xa-(C6H5)a-CO-N-(t-Bu)-NH-CO-(C6H5)b-Yb are efficient insect growth regulators with high activity and selectivity toward lepidopteran and coleopteran pests. For 123 congeneric molecules, a quantitative structure-activity relationship model was built in the framework of the QSARINS package using 2D, topology-based, PaDEL descriptors. Variable selection by GA-MLR allowed building an efficient multilinear regression linking pEC50 values to nine structural variables. The robustness and quality of the model were carefully examined at various levels: data fitting (recall), leave-one (or some)-out cross validation, and internal and external validation (including random splitting), aspects not investigated in depth in previous works. Various machine learning approaches (Partial Least Squares regression, Projection Pursuit Regression, linear Support Vector Machine and Three-Layer Perceptron Artificial Neural Network) confirm the validity of the analysis, giving highly consistent results of comparable quality, with only a slight advantage for the three-layer perceptron.


Introduction
Insect growth regulators (IGR), which stop larval development and induce lethal processes during moulting, are efficient tools in insect control for crop protection. As stressed by Nakagawa et al. [1], since their introduction in the mid-1980s, diacylhydrazines (DAH), and among them dibenzoylhydrazines, of general formula Xa-(C6H5)a-CO-N(t-Bu)-NH-CO-(C6H5)b-Yb, have received increasing interest as larvicides, owing to their easy synthesis at affordable cost, and their high efficiency and specificity against lepidopteran and coleopteran pests. These molecules act as moulting-accelerating compounds, activating the ecdysone receptor, part of the receptor complex of the steroidal moulting hormone 20-hydroxyecdysone. This hormone regulates moulting and metamorphosis. An external high dose of moulting hormone agonists maintains a premature, abnormal moulting process, rapidly leading larvae to death [1][2][3]. Additionally, this hormone receptor is not present in mammals, making the ecdysone receptor an interesting target for larvicide development.
For limited homogeneous series of chemicals (with a fixed substitution pattern on one of the phenyl rings), several QSAR models linking the larvicidal activity of acylhydrazines to structural characteristics have been proposed, either as Hansch-Fujita [1,4,5] multilinear models or in the framework of Free Energy Relationships, involving "classical" electronic and steric substituent constants (Hammett's sigma and Verloop's constants) [5]. However, strong interactions in a di- or tri-substituted phenyl ring A bearing an ortho group required the introduction of "ad hoc" indicator variables. On the other hand, Partial Least Squares analyses of electronic and steric field contributions on grid points surrounding the substrates (CoMFA treatment) were also carried out on diverse insect species [1,3,6].
In this framework, an extended analysis was carried out by Wheelock et al. [3] for a large population of 172 compounds where both phenyl rings may be simultaneously substituted and the central moiety (CO)-N-(t-Bu)-NH-(CO) differently substituted. The CoMFA analysis used led to satisfactory performance in data fitting (recall), and gave some clues about the more important interactions with the corresponding receptor. But the determination coefficient Q2 in leave-one-out cross validation was poor (Q2 = 0.447 on 158 compounds), and the approach involved a huge number of variables before reduction by Partial Least Squares. External validation of the results (prediction tests) was not considered. Furthermore, calculated values for compounds experimentally observed as "inactive" were much higher than the activity threshold retained. In a more recent publication [2], Crisan et al. also examined a limited population of 33 compounds, in the framework of the MLR models from QSARINS [7] and pharmacophore search.
We present here various 2D topology-based QSAR models linking the pEC50 values (co-logarithm of the half-maximal effective concentration) to molecular structure. Topology-based models have been widely used with satisfactory performance in developing SAR and QSAR models, and have appeared very efficient for the prediction of new or unknown chemicals [8][9][10][11][12]. Although the physical interpretation of the selected 2D structural parameters is often more difficult than for quantum-derived descriptors, this approach gets rid of energy minimisation processes, structure alignment and (often heavy) MO calculations, and requires only the swift evaluation of structure-invariant descriptors, available from the knowledge of the structural formula alone [12]. It may be hoped that working on a few easily evaluated descriptors, rather than on a huge number of field values on nodes surrounding the aligned molecules, would lead to flexible models more easily applicable to the study of new potentially active chemicals.

Data set
In the present work, experimental data for a population of 126 dibenzoylhydrazines of the general formula given above were retrieved from the previous CoMFA study carried out by Wheelock et al. [3]. pEC50 values (co-logarithm of the minimal concentration producing an effect on 50% of the cells) were determined in a Bombyx mori cell-based high-throughput screening via a reporter gene assay. These pEC50 values (expressed in M) cover the range 4.33-8.91.
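As a minimal illustration of the activity scale (not part of the original workflow), pEC50 is simply the co-logarithm of the molar EC50:

```python
import math

def pec50(ec50_molar):
    """pEC50 = -log10(EC50), with EC50 expressed in mol/L (M)."""
    return -math.log10(ec50_molar)
```

For instance, an EC50 of 10 nM (1e-8 M) corresponds to pEC50 = 8, inside the 4.33-8.91 range covered by the data set.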
From the initial population of 133 compounds, 7 chemicals from the original study were discarded owing to observed inactivity (pEC50 < 4.00). However (vide infra), they may give some clues about the quality of the models we built, by comparison with the calculated activity values, acting as a somewhat "rough test set".
In the present work, the 126 compounds with precise activity values (4.33-8.91) were first ordered by decreasing activity (regardless of structural similarity) and identified by an ID number (1…126). The original Wheelock "names" W… are also indicated for easier retrieval. Structural formulae and activities (pEC50) are gathered in Table 1.

Descriptor generation and model selection
2D topological-type structural descriptors were generated with the PaDEL software [13], leading to an initial pool of about 1200 values per compound, to be introduced into the QSARINS software [7]. These descriptors encompass the nature of the atoms, autocorrelation vectors, elements of adjacency or distance matrices, E-states, etc. Elimination of descriptors with (nearly) constant values and pruning of pairs of highly inter-correlated values (correlation coefficient higher than 0.9) led to a reduced set of 168 (potentially significant) variables. Subsequent selection was carried out in QSARINS using OLS-MLR coupled with a genetic algorithm (GA) based procedure [14].
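The reduction logic can be sketched as follows (a minimal Python/NumPy sketch, not the actual QSARINS implementation; the 0.9 correlation cut-off is the one quoted above, while the near-constancy tolerance is an assumption):

```python
import numpy as np

def prune_descriptors(X, names, const_tol=1e-8, corr_max=0.9):
    """Drop (nearly) constant descriptor columns, then one member of each
    highly inter-correlated pair (|r| > corr_max)."""
    X = np.asarray(X, dtype=float)
    # 1) remove (nearly) constant descriptors
    keep = [j for j in range(X.shape[1]) if X[:, j].std() > const_tol]
    X, names = X[:, keep], [names[j] for j in keep]
    # 2) greedy pruning of correlated pairs: keep the first, drop the second
    r = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for i in range(len(names)):
        if i in dropped:
            continue
        for j in range(i + 1, len(names)):
            if j not in dropped and r[i, j] > corr_max:
                dropped.add(j)
    keep = [j for j in range(len(names)) if j not in dropped]
    return X[:, keep], [names[j] for j in keep]
```

On a toy matrix with one constant column and two perfectly correlated columns, only one representative of the correlated pair survives.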
This was carried out with an external validation step in mind [15,16]. To allow, in parallel, complementary internal validations, five subsets (m = 0 to m = 4) were created with different compositions of the training and prediction sets: in subset m, the prediction set is composed of the chemicals with (ID modulo 5) = m (that is, the remainder of the division of ID by 5 equals m). In other words, one in every five compounds is placed in prediction. For example, for subset 2, prediction is carried out on compounds ID = 2, 7, 12, …, 122. (Note that, in parallel, the full data set is accessible with m = 5.) This procedure, which we previously used in several applications [17][18][19], allows for a rather regular splitting all along the reactivity range, irrespective of structural similarities, for both training and prediction samples.
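The modulo-based splitting can be sketched as (illustrative pure Python; the function name is ours):

```python
def modulo_split(ids, m, k=5):
    """Split compound IDs into training / prediction sets:
    a compound goes to prediction if (ID mod k) == m.
    With m == k no ID matches, so the full set stays in training."""
    pred = [i for i in ids if i % k == m]
    train = [i for i in ids if i % k != m]
    return train, pred
```

For subset m = 2 over IDs 1…126 this places 2, 7, 12, …, 122 (25 compounds) in prediction and the remaining 101 in training.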
From the 168 initially retained variables (structural descriptors), further selection was carried out on subset m = 1. This variable selection was performed in a two-step process embedded in the QSARINS software. In a first step, all possible triples of descriptors were explored using an exhaustive selection procedure ("All Subsets"). In a second step, the pool of the 100 best models generated by All Subsets was extended via the GA to explore models of higher complexity, in order to find the model with the best Q2loo (and RMSE) satisfying the QUIK rule [16,20].
The settings for GA selection were: number of generations per size = 1000, population size = 100 models, mutation rate = 50%; 100 models were saved for each model dimension (i.e., number of variables included). In addition, to avoid overfitting and the inclusion of inefficient variables that would uselessly complicate the models or bring insignificant supplementary benefit, we limited the model complexity to at most 9 variables. This led to a ratio (number of compounds/number of descriptors) well above the currently admitted threshold of five according to the OECD guidance [21].
For subset m = 1, chosen to drive the selection, the quality (and robustness) of the model was determined by the Q2loo coefficient on the training part, and its predictive ability was examined on the prediction part. This corresponds to true "external validation", since the predicted chemicals were never involved in the development of the GA-based population of models. This allowed us to define a restricted pool of 9 descriptors (to be detailed below) that form the basis of the different proposed models (MLR and machine learning approaches). Besides the robustness and precision in data fitting, and the preceding external validation process, the quality of the MLR models built on these descriptors was also examined at several levels of internal validation: 1) on the remaining subsets (m = 0, 2, 3, 4) in recall, and in leave-one (or some)-out on the training part; 2) in prediction on the corresponding part of each subset; 3) in recall, loo and prediction on randomized samples. This was carried out first with OLS-MLR models and extended, with the same descriptor set, to various machine learning approaches (at least for the five created subsets).

Selected descriptors
The nine selected structural descriptors are, respectively:
1) GATS1c, GATS5e, GATS8s: Geary autocorrelation terms (lag 1, 5, or 8) weighted by charge, Sanderson electronegativity or I-state, respectively. They are calculated on centred property values wi, as the average squared property difference over the atom pairs at topological distance equal to the lag, scaled by the property variance over all atoms (so that both mean and standard deviation are accounted for [22]):

GATSk = [Σi<j δij (wi - wj)² / (2Δk)] / [Σi (wi - w̄)² / (A - 1)]

with A the number of atoms, Δk the number of atom pairs at (topological) distance k, and δij = 1 if the topological distance between atoms i and j equals the lag, zero otherwise.
2) SM1_Dzm: The Barysz distance matrix is a weighted distance matrix (built from the hydrogen-depleted graph) that simultaneously accounts for the presence of multiple bonds and heteroatoms in the chemicals [23]. SM1_Dzm is the spectral moment of order 1 of the Barysz matrix, weighted by mass.
3) VE3_Dzp: Logarithmic coefficient of the Randic-like sum of the last eigenvector (absolute values) of the Barysz matrix weighted by polarizabilities.
4) SpMax2_Bhp: Second largest absolute eigenvalue of the Burden modified matrix, weighted by (relative) polarizabilities.
5) SHBint6: Sum of E-state products of strengths for potential internal hydrogen bonds of path length 6. The electro-topological state (E-state) belongs to the topological indices based on the valence electron mobile (VEM) count [25][26][27].
6) MPC8: Molecular path count of order 8.
7) TopoShape: Petitjean topological shape index [28], relying on the notions of generalized radius and diameter.

The first trials on the full set (126 chemicals) with the selected 9 descriptors led to a rather "acceptable" MLR model: for subset m = 1, R2 = 0.711, Q2 = 0.657 and Q2pred = 0.544, but strong residuals (about 1.1 to 2.1 log units) were observed for compounds ID = 6, 12 and 61. So, we decided to discard these compounds and work on 123 chemicals. We checked that the selected 9 descriptors can be satisfactorily applied to this (slightly) reduced data set.
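As an illustration of the Geary autocorrelation family, a minimal sketch computing GATS(lag) from a topological distance matrix (following the standard textbook Geary definition; equivalent to dividing the ordered double sum by 2·(2Δk). The actual PaDEL implementation may differ in details such as property scaling):

```python
import numpy as np

def gats(w, D, lag):
    """Geary autocorrelation GATS(lag): averaged squared differences of the
    atomic property w over atom pairs at topological distance 'lag',
    scaled by the property variance over all A atoms."""
    w, D = np.asarray(w, float), np.asarray(D)
    A = len(w)
    delta = (D == lag)                  # Kronecker delta over ordered pairs
    n_pairs = delta.sum()               # ordered count = 2 * Delta_k
    num = (((w[:, None] - w[None, :]) ** 2) * delta).sum() / (2 * n_pairs)
    den = ((w - w.mean()) ** 2).sum() / (A - 1)
    return num / den
```

For a three-atom path graph with property values 1, 2, 3, GATS at lag 1 evaluates to 0.5 and at lag 2 to 2.0.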

Results and discussion
The relevance of the selected descriptors will first be examined in a multilinear regression (ordinary least squares) approach. They will then be applied to various machine learning methods using different representations of the descriptor space.

OLS-MLR model
Multilinear regression performance is now characterized at different levels: data fitting ("recall"), cross validation and prediction. Particular attention is devoted to robustness and accuracy.
In the OLS-MLR model (ordinary least squares multilinear regression), the relation between a (univariate) dependent variable y (here the activity) and several independent variables xi (structural descriptors) is expressed as:

y = Xb + e

where X is the matrix of the independent variables xi, b and e being the column vectors of coefficients and residuals, respectively. Minimizing the residuals by the OLS method gives b = (Xᵀ X)⁻¹ Xᵀ y, and the calculated response is ŷ = Xb.
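In code, the OLS solution can be sketched as (NumPy, solving the normal equations; an intercept column is prepended, which is an implementation convention rather than part of the formula above):

```python
import numpy as np

def ols_mlr(X, y):
    """OLS-MLR: b = (X^T X)^-1 X^T y (intercept column prepended);
    calculated response y_hat = X b."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    b = np.linalg.solve(X.T @ X, X.T @ np.asarray(y, float))
    return b, X @ b
```

On exactly linear data y = 1 + 2x, the recovered coefficients are (1, 2) and the calculated responses reproduce y.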
An important aspect of the MLR treatment is the "applicability domain" (AD) [7]. It characterizes "influential" objects: those that, in training, have a heavy weight in the definition of the model and, in prediction, points falling outside the AD, which must be considered with caution. In the leverage approach, the influence of each object on the regression result (its "leverage") is given by the corresponding diagonal element h of the "hat" matrix H:

H = X (Xᵀ X)⁻¹ Xᵀ

where X represents the matrix of the descriptors characterizing the samples.
For a study involving n training samples and k variables, objects with h larger than the threshold value h* = 3(k + 1)/n are considered outside the AD. The Williams plot (standardized residuals vs. hat diagonal values h) immediately highlights points outside the AD, as well as outliers with residuals larger than 2.5 times the standard deviation (the common norm). Six points fall outside the applicability domain (see the Williams plot, Figure 2). To more firmly establish the robustness and quality of the proposed model (and the corresponding set of descriptors), several confirmations were examined.
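The leverage computation and AD check described above can be sketched as (NumPy; h* = 3(k + 1)/n as defined above):

```python
import numpy as np

def hat_diagonal(X):
    """Leverages: diagonal of H = X (X^T X)^-1 X^T (intercept included)."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

def outside_ad(X, k):
    """Flag objects whose leverage exceeds h* = 3(k+1)/n."""
    h = hat_diagonal(X)
    return h > 3 * (k + 1) / len(h)
```

A useful sanity check is that the leverages sum to k + 1 (the trace of H equals the rank of the augmented X), and that a structurally extreme point is flagged while central points are not.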

Pluri internal validation processes
First, MLRs were built with the same structural descriptors for the other subsets previously defined (m = 0, 2, 3, 4) and for the full set (corresponding to m = 5). The results, gathered in Table 2 (training tr, loo cross validation lo and test te, respectively), show that satisfactory correlations of comparable quality are obtained in these various trials. Note that in these processes, when elaborating the corresponding MLRs (with the relevant descriptors selected on subset m = 1), the current training set (used to build the corresponding MLR) encompasses only 66% of the chemicals involved in the selection of the retained 9 descriptors (in subset m = 1). For easier comparison, the results obtained for m = 1 are repeated in the table.
On the other hand, in these treatments each compound is considered four times in training and once in prediction. To obtain a global estimate of the predictive ability, we also gathered in a single file the predicted values obtained for each subset (m = 0 to 4), and calculated the usual statistical parameters for the correlation of these grouped values with the experimental ones. The results are reported in the last line of Table 2.
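The statistic computed on the grouped predictions can be sketched as a standard predictive R2 (1 - PRESS/SS); whether QSARINS applies exactly this formula to the pooled file is an assumption on our part:

```python
import numpy as np

def pooled_r2(y_exp, y_pred):
    """Predictive R2 for pooled external predictions vs experimental values:
    1 - sum((y_exp - y_pred)^2) / sum((y_exp - mean(y_exp))^2)."""
    y_exp = np.asarray(y_exp, float)
    y_pred = np.asarray(y_pred, float)
    ss_res = ((y_exp - y_pred) ** 2).sum()
    ss_tot = ((y_exp - y_exp.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot
```

Perfect predictions give 1.0; predicting the experimental mean for every compound gives 0.0.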

Randomized leave-many-out cross validation
To confirm the choice of the selected descriptors, we carried out 2000 runs of cross validation with 20% of the data left out. For a more homogeneous sampling, we randomly selected, in the ID-ordered list of 123 compounds, 13 compounds from the more active half of the population and 13 from the less active half, to build the corresponding validation group. The histograms of R2 (fitting) and of R2 and Q2 (validation, i.e. prediction) are given in Figure 3. Similarly, we also verified that randomly shuffling the activity values (2000 runs) led to very low correlation coefficients (Q2 = 0.137 for the m = 1 subset, for example). The consistency of these results prompts us to consider that the nine selected structural variables lead to satisfactory fitting and prediction for the various populations studied. Presumably, this choice would not always be the optimal one when looking independently at each splitting, but we consider that it gives a unique set of structural variables, actually applicable to the various subsets, that can also be used for the other correlation methods we propose.
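The randomized leave-20%-out procedure can be sketched as follows (NumPy; an OLS refit per run, with an equal number of compounds drawn from each activity half of the ID-ordered list, as described above — a sketch of the sampling logic, not the QSARINS code):

```python
import numpy as np

def lmo_q2(X, y, n_runs=200, n_out_per_half=13, seed=1):
    """Randomized leave-many-out CV: each run leaves out n_out_per_half
    compounds from the more-active half and as many from the less-active
    half (activity-ordered data), refits OLS, and scores Q2 on the
    left-out block."""
    X = np.column_stack([np.ones(len(X)), np.asarray(X, float)])
    y = np.asarray(y, float)
    rng = np.random.default_rng(seed)
    n, half, q2 = len(y), len(y) // 2, []
    for _ in range(n_runs):
        out = np.concatenate([
            rng.choice(half, n_out_per_half, replace=False),
            rng.choice(np.arange(half, n), n_out_per_half, replace=False)])
        tr = np.ones(n, bool)
        tr[out] = False
        b = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
        press = ((y[out] - X[out] @ b) ** 2).sum()
        ss = ((y[out] - y[tr].mean()) ** 2).sum()
        q2.append(1.0 - press / ss)
    return np.array(q2)
```

On exactly linear data every run returns Q2 = 1, while shuffling the responses (the Y-scrambling check mentioned above) collapses Q2 toward zero or below.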

"Inactive compounds"
In the initial data set, seven compounds were discarded as "inactive" (pEC50 < 4). It is interesting to examine at what level our MLR model places their calculated pEC50. Table 3 gathers the evaluations from CoMFA (from the study by Wheelock et al. [3]) and from our model, eq. (1), established on subset m = 1; nearly identical results would be obtained with the MLR built on the whole set of 123 chemicals (m = 5).
Although these two series of values were not obtained under identical conditions (158 compounds in CoMFA, 123 in our work), it is noteworthy that the CoMFA predictions are largely overestimated, whereas our proposed values are more in agreement with the observed inactivity, at least for six compounds out of seven. Of course, external factors (other than the balance of structural influences taken into account by the selected descriptors), such as experimental uncertainty, activity cliffs, or a change in mechanism, may intervene in the experimentally observed activity. However, our results might be considered a good "rough external validation" of our model.
On the other hand, some of the methods employed here work on projections of the data into structural spaces of varied dimensionality, enlarged with SVM or reduced with PPR, or on transformed data (PLS). So it may be interesting to see whether these transformations affect the encoding of the structural information embedded in the descriptors, emphasizing or deadening some of their specific characteristics. It is also interesting to examine whether they could be used as alternatives equivalent to MLR, or outperform it. Indeed, in several publications, machine learning approaches have proved their ability to cope with nonlinear responses, and gave improved results compared to MLR correlation models [10,29,31,32]. It suffices here to recall some basic elements of these approaches.

Basics on used Machine Learning approaches
1) Partial Least Squares. PLS [43] operates not on the original variables but on components (latent variables) built from them, so as to best represent simultaneously the variability of both the structural descriptors and the response (pEC50). This is an important difference from Principal Component Regression, where the principal components optimize only the variable matrix. One advantage of PLS (not important here, however) is that it can be used when the variables are numerous, highly collinear and even more numerous than the samples (as, for example, in CoMFA analyses).
2) Projection Pursuit Regression. Developed by Friedman and Stuetzle [44], this non-parametric method relies on an (empirically determined) sum of nonlinear, local, smooth (univariate) ridge functions introduced iteratively. Schematically, given a trial direction vector a, the descriptor matrix X (k variables × n samples) is projected as Z = aᵀX. The model operates on these projections Z (linear combinations of the initial variables) and approximates the regression function (linking the property y to the associated predictors X) by a finite, empirically determined sum of smooth ridge functions of the new predictor variables Z. As there are infinitely many possible projections from a higher- to a lower-dimensional space, it is important to have an optimization technique to pursue a sequence of projections revealing the most interesting structure in the data set. Once the smoothing function is selected, the (tuneable) number of projections is automatically determined by optimizing the cross-validation results (Q2loo).
3) Support Vector Machine. First proposed by Vapnik [41] for classification, SVM was soon extended to regression and is now widely used in QSAR studies, including recent applications to nanoparticles [10,[30][31][32][33][34][35][36][37][38][39],42]. SVM is rooted in two key ideas. First, the robustness of the model is privileged over good performance in data fitting ("recall"). Second, using a kernel function, the data are projected into a higher-dimensional space where, it may be hoped, a simpler linear representation is possible. The parameters to be set are: a) the regularisation constant C, which controls the balance between precision and complexity of the model (too small a value gives limited importance to data fitting; too large a value complicates the model and may cause overfitting); b) the epsilon-insensitive loss, which defines the half-width of the "error tube" along the regression line within which deviations between observed and calculated values are ignored when building the model; c) the parameters of the kernel function (if any). As to the two most widely used kernel functions, the linear one (used here), k(x, x') = x·x' (where x, x' are descriptor vectors), does not require any parameter, whereas the Gaussian kernel, k(x, x') = exp(-‖x - x'‖²/σ²), depends on the "width" σ.
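The epsilon-insensitive loss underlying point b) can be sketched as:

```python
import numpy as np

def eps_insensitive(y_obs, y_calc, eps=0.1):
    """Vapnik's epsilon-insensitive loss: residuals inside the error tube
    of half-width eps contribute zero; larger residuals contribute their
    excess over eps."""
    r = np.abs(np.asarray(y_obs, float) - np.asarray(y_calc, float))
    return np.maximum(r - eps, 0.0)
```

A residual of 0.05 is ignored with eps = 0.1, while a residual of 0.5 contributes 0.4.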

4) Three Layer Perceptron (Back Propagation Artificial Neural Network)
The first layer is fed with the structural descriptors of the investigated pattern. This information, scaled by the weights of the input-hidden connections, is sent to the units of the "hidden layer". On each of these units, the inputs are summed and transmitted, through a transfer function (for example, a logistic one), to the output unit, where their weighted summation delivers the calculated property value. The connection weights (between units of successive layers) are iteratively optimised on a training set using a "back-propagation" algorithm operating from the output layer back to the input layer [45]. In the present application, only one hidden unit was introduced.
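The forward pass of such a 9-1-1 network can be sketched as (NumPy; a single logistic hidden unit and a linear output, with weights assumed already trained by back-propagation):

```python
import numpy as np

def tlp_predict(x, w_in, b_h, w_out, b_out):
    """Forward pass of a 9-1-1 three-layer perceptron: the nine descriptor
    values x, scaled by input-hidden weights w_in, feed one logistic hidden
    unit; a linear output unit returns the calculated property (pEC50)."""
    h = 1.0 / (1.0 + np.exp(-(np.dot(w_in, x) + b_h)))  # hidden activation
    return w_out * h + b_out                            # linear output
```

With zero input weights the hidden unit sits at its midpoint activation of 0.5, so the output reduces to 0.5·w_out + b_out.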

Results from Machine Learning approaches
In the present study, we used for the machine learning approaches the nine descriptors selected by MLR. Diverse other selection routines have been proposed, particularly in the framework of the caret package [46], but they often address classification problems and operate by backward selection. Clearly, using here the descriptors selected by MLR, and in view of their good results, a drastic improvement of performance was not to be expected. However, it was interesting to verify whether the changes in the structural space induced by the machine learning approaches may be beneficial.
Calculations were carried out in the framework of the CRAN R project, using the caret package [46,47] or homemade combinations of available R routines. Parameter adjustment involves the number of latent variables (PLS), the number of projections (PPR), the number of hidden units (TLP-ANN), and the regularisation parameter and noise (epsilon) in SVM. For the latter, although some programs have been proposed for this setting [48,49], we preferred a grid-type method: for 7 possible values of epsilon (from the default 0.10 up to 0.50), several values of the regularisation parameter C (1, 2, 4, …, 32) are examined, and our program automatically determines the best choice. For PPR, one projection was selected for all subsets. Similarly, for the ANN, the hidden layer encompasses only one neuron (corresponding to a 9-1-1 architecture of the network) and 100 iterations were carried out. For PLS, a high number of latent variables was necessary (7 to 9), in agreement with the observation that the selected variables are not deeply correlated; not surprisingly, the results are nearly identical to those obtained in MLR.
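The grid-type search over (C, epsilon) can be sketched as (pure Python; the exact epsilon grid used by the authors is not specified in the text, so the values below are illustrative, and `score` stands for a user-supplied cross-validated criterion such as Q2loo):

```python
from itertools import product

def grid_search(score, c_values=(1, 2, 4, 8, 16, 32),
                eps_values=(0.10, 0.15, 0.20, 0.25, 0.30, 0.40, 0.50)):
    """Exhaustive grid over (C, epsilon), returning the pair that
    maximizes the supplied score callable (e.g. a cross-validated Q2)."""
    return max(product(c_values, eps_values), key=lambda p: score(*p))
```

With a toy score peaked at C = 4 and epsilon = 0.2, the search returns that pair.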
For the diverse approaches used here, results are presented in Table 4 for the different subsets, in training (tr), loo cross validation (lo) and prediction (te). Line m = 5 corresponds to the results obtained in training and loo for the full population. The last line indicates the correlation observed between the experimental pEC50 values and the predictions gathered from the five subsets m = 0 to m = 4. Results for MLR (Table 2) are repeated for easier comparison. From Table 4, it appears that, for the diverse investigated approaches, results for the different subsets (m = 0, …, 4) and the full population (m = 5), with a same correlation method, are highly consistent and lead to similar statistical criteria. For example, looking at the recall results, R2 varied from 0.736 to 0.767 in MLR, and from 0.751 to 0.790 for TLP. In fact, this observation was not unexpected, since a common set of descriptors was used. Comparing now the different approaches, although the descriptor space was differently treated, MLR, PLS and linear SVM gave nearly identical statistical parameters in recall, leave-one-out cross validation and prediction, whereas PPR (in data fitting) and TLP (for fitting, loo and prediction) are slightly superior. As an additional remark, it may be seen that the R2 and Q2F2 values in prediction were close. This comes from the fact that, since the prediction compounds are regularly split along the reactivity scale, the mean values for the training and prediction sets are close.
Figure 4 illustrates some examples of these machine learning approaches at different levels: gathered predictions for TLP, LOO for SVM, and recall for PPR, compared with the MLR recall results, repeated here for easier comparison.

Structural information from selected descriptors
The importance of the various descriptors in the variations of the calculated pEC50 values is exemplified in Figure 5, which shows their contributions (coefficients of MLR equation (1) times the descriptor values), according to ID ordering. Owing to the relatively large number of descriptors (9) intervening in the correlation models, and to their topological character, it is difficult to associate them individually with definite influences. Schematically, three variables are directly related to "shape and volume": 1) TopoShape, something like a generalized diameter/radius ratio; 2) MPC8, a count of atomic paths; 3) SM1_Dzm, involving mass weighting. But topological distance factors are also present in the autocorrelation indices with lags 5 and 8, and in the constitution of the Barysz and Burden matrices (SM1_Dzm, VE3_Dzp and SpMax2_Bhp). Finally, electronic characteristics (charge, electronegativity, polarizability and E-states) also affect GATS1c, GATS5e and GATS8s, VE3_Dzp, SpMax2_Bhp and SHBint6. Note that the role of hydrogen bonding (at least between the carbonyls and the NH group and the receptor) was cited by Nakagawa et al. [1], whereas consideration of logP did not improve the CoMFA model, as noted by Wheelock et al. [3]. Presumably, the hydrophobic influence of the phenyl ring substituents (noted in previous publications) is more or less absorbed into the unfavourable steric contributions.
An immediate remark is that, for most of the descriptors, the variation ranges are of comparable extent, with a weight on the global pEC50 values of about 1.5 to 4 log units, so that their introduction in the MLR model is mandatory. Only a few variables are nearly invariant for most compounds and take different values only for a few isolated chemicals (SHBint6, and to a lesser degree GATS8s). Finally, TopoShape intervenes for only about 0.4 log unit. However, eliminating these (less efficient) variables is detrimental to the quality of the resulting model. Thus, for the whole set (123 compounds), eliminating one of these less important descriptors lowers R2 and Q2loo from 0.744 and 0.699 to values of about 0.705 and 0.660 (elimination of SHBint6, TopoShape or GATS8s). Similarly, eliminating two of these three variables led to R2 and Q2loo values of about 0.680 and 0.664.

Conclusion
In this paper, the activity of 123 dibenzoylhydrazines acting as growth regulators was investigated through 2D, topology-based QSAR models. In the framework of the open-source software R and PaDEL and the free application QSARINS, the approach relies on topology-based 2D characteristics. These descriptors, easily calculable and structure-invariant, avoid the choice of an active conformation, the subsequent energy minimization and (often heavy) quantum calculations, and are attainable from the knowledge of the molecular graph alone.
After descriptor selection by the MLR-genetic algorithm of QSARINS, multilinear regression and several machine learning approaches (PLS, PPR, SVM, TLP-ANN) were investigated. They gave satisfactory, highly consistent and robust results of comparable performance. The quality of the models was duly tested not only in data fitting, but also in cross validation, varied internal prediction steps and random splitting (aspects not investigated in depth in a previous extended CoMFA study [3]).