Estimation of Monthly Total Dissolved Solids Using ANN and LS-SVM Techniques in the Aji Chay River, Iran

This research follows on from diverse international efforts to safeguard one of the largest natural lakes in the world, Urmia Lake in north-west Iran. Two new numerical packages, based on Artificial Neural Network (ANN) and Least Squares Support Vector Machine (LS-SVM) models, were developed to estimate monthly Total Dissolved Solids (TDS) in the Aji Chay River, one of the main tributaries of Urmia Lake. A feed-forward back-propagation (FFB) model was used to obtain a set of coefficients for a linear model, and the radial basis function (RBF) kernel was employed for the LS-SVM model. The input data sets of both the ANN and LS-SVM models consist of six water quality parameters: TDS, Mg²⁺, Na⁺, Ca²⁺, Cl⁻ and SO₄²⁻, all collected on a monthly time scale over a period of 30 years from the Vanyar and Zarnagh stations in the Aji Chay watershed. The research demonstrated that both models can effectively predict the variability of TDS. For the Vanyar station, the ANN model (R² = 0.913, RMSE = 0.0032, Nash-Sutcliffe Efficiency (NSE) = 0.812) gave a more accurate estimation in terms of R² and RMSE than the LS-SVM model (R² = 0.871, RMSE = 0.097, NSE = 0.86). For the Zarnagh station, the analysis gave R² = 0.853, RMSE = 0.0162 and NSE = 0.854 for the LS-SVM model, and R² = 0.903, RMSE = 0.0091 and NSE = 0.85 for the ANN model.


Introduction
The evaluation and prediction of surface water quality is one of the central challenges in the water resource industries today. Due to the parametric complexity, the high cost of phenomenological water examinations [1][2][3][4][5][6] in both field and laboratory situations, and also the lack of experimental water quality data, many researchers have utilized data-driven techniques for water quality data retrieval [7][8][9][10][11][12].
Due to their reasonable accuracy and relatively low cost, data-driven modelling techniques, also known as black box models, have become widespread in recent decades [12][13][14][15][16]. The ANN is one such black box model, with a high potential for prediction in complicated non-linear systems. This technique requires a training or calibration phase, and can estimate both qualitative and quantitative parameters. It is relatively accurate in determining the standard deviation of data and, furthermore, is capable of modelling the fundamental relationship between inputs and outputs with a potential for generalization [17,18]. ANN models can be set up with a limited number of input variables; conversely, a large number of records is needed to provide quality training data. This is essential because data-driven methods have a limited capability to provide accurate forecasts of events that lie outside the range of the training data set. Furthermore, when an excessive number of variables is used as input, the most correlated variables tend to dominate the model and, consequently, it is not possible to utilize all the physical knowledge or available measurements. Nonetheless, this can be addressed by pre-processing techniques that select the most sensitive variables and thus reduce the input space [19,20].
Another advanced soft method is the LS-SVM, a least squares variant of the support vector machine proposed by Vapnik. It is based on statistical learning theory, and the standard SVM is trained using quadratic programming techniques. The LS-SVM has been used for time series estimation with acceptable levels of accuracy [21][22][23][24]. Nevertheless, this technique is more time consuming and has high computational requirements as a result of the constrained optimization it requires. The quality of surface water is usually described in terms of pH, ion concentrations, BOD, COD and total dissolved solids (TDS). The latter is the combined amount of inorganic and organic substances contained in water [25]. In the present study, ANN and LS-SVM soft methods were developed to estimate the contamination of water resources with high accuracy and low cost. The case study was performed on the Aji Chay River, a water source that is mainly used for agricultural purposes and ecological stability. This study is linked to the comprehensive international efforts to protect Urmia Lake. The Aji Chay River discharges into this highly endangered lake and potentially plays a main role in preventing its desiccation [26]. As such, the developed ANN and LS-SVM techniques were used to model the variation of TDS in the Aji Chay River, whose salinity is escalating due to increasing pollution levels.
The main objective of this research is the evaluation of monthly Total Dissolved Solids using ANN and LS-SVM techniques in the Aji Chay River [26][27][28]. Numerous researchers have previously studied the behavior of intelligent techniques in the prediction of water quality parameters, but few studies have compared ANN and LS-SVM models for evaluating the total dissolved solids of the Aji Chay River. This comparison forms the principal novelty of this research.

ANNs
ANNs can be used in water quality estimation and modeling. In this technique, feed-forward (FF) and backpropagation (BP) network patterns can be used for the present TDS simulation. It has been shown that the BP network pattern with a three-layered structure is well suited to predicting and evaluating water resources problems. As indicated in Fig. 1, a three-layered feed-forward neural network (FFNN) provides a general framework for expressing a nonlinear functional mapping between a set of input and output data, and can therefore be used to estimate water quality time series. In this figure, i, j and k indicate input layer, hidden layer and output layer neurons, respectively, and w denotes the weight applied by the operating neuron. The term "feed-forward" refers to neuron connectivity being defined from a neuron in the input layer to neurons in the hidden layer, or from a neuron in the hidden layer to neurons in the output layer; that is, the input and output layers are not connected to each other directly. The explicit expression for an output value of a three-layered FFNN is given by Eq. 1 [29][30][31][32].
$$\hat{y}_k = f_o\left[\sum_{j=1}^{M_N} W_{kj}\, f_h\left(\sum_{i=1}^{N_N} W_{ji}\, x_i + w_{jo}\right) + w_{ko}\right] \quad (1)$$
where N_N and M_N are the numbers of neurons in the input and hidden layers, W_ji is a weight in the hidden layer connecting the i-th neuron in the input layer to the j-th neuron in the hidden layer, w_jo is the bias for the j-th hidden neuron, f_h is the activation function of the hidden neuron, W_kj is a weight in the output layer connecting the j-th neuron in the hidden layer to the k-th neuron in the output layer, w_ko is the bias for the k-th output neuron, f_o is the activation function for the output neuron, x_i is the i-th input variable of the input layer, and ŷ_k and y_k are the computed and observed output variables, respectively. The weights in the hidden and output layers are not constant; their values are adjusted during network training.
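As a minimal sketch of the forward pass described by Eq. 1, the following Python function computes the outputs of a three-layer network from given weights and biases. The function name, the tanh hidden activation, the linear output activation and the example weights are illustrative assumptions, not the configuration used in this study.

```python
import math

def ffnn_forward(x, w_hidden, b_hidden, w_out, b_out,
                 f_h=math.tanh, f_o=lambda z: z):
    """Forward pass of a three-layer feed-forward network.

    w_hidden[j][i] links input i to hidden neuron j (W_ji), b_hidden[j] is w_jo,
    w_out[k][j] links hidden neuron j to output neuron k (W_kj), b_out[k] is w_ko.
    """
    # Inner sum of Eq. 1: weighted inputs plus bias, passed through f_h.
    hidden = [f_h(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Outer sum of Eq. 1: weighted hidden outputs plus bias, passed through f_o.
    return [f_o(sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(w_out, b_out)]

# Hypothetical example: 2 inputs, 2 hidden neurons, 1 output.
x = [0.5, 0.3]
w_hidden = [[0.4, -0.2], [0.1, 0.7]]
b_hidden = [0.0, 0.1]
w_out = [[1.0, -1.0]]
b_out = [0.2]
y_hat = ffnn_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y_hat)
```

Training (for example by backpropagation) would adjust the entries of `w_hidden`, `b_hidden`, `w_out` and `b_out`; the sketch only evaluates the network for fixed values.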

Least Squares Support Vector Machines
For the available training data set D, defined in Eq. 2, x_k is the input data, y_k is the output data, and R, K and N denote the real numbers, the kernel function and the natural numbers, respectively. Based on the Mercer principle, the kernel function K(·,·) is defined according to the mapping function ϕ(·). The LS-SVM is solved in the primal space, which leads to the optimization problem shown in Eq. 4 and Eq. 5. The parameter γ is an adjustable regularization parameter that sets the compromise between the training error and the model complexity, so that the resulting function has good generalization capability. The larger the value of γ, the smaller the regression error on the training data. The variable ω is the weight vector. The LS-SVM uses a different loss function from the standard SVM, and it converts the inequality constraints into equality constraints.
Using the Lagrange function, in which a_k ∈ R are the Lagrange multipliers, the optimal a and b can be obtained from the Karush-Kuhn-Tucker (KKT) conditions. After applying these conditions and eliminating the variables ξ and ω, a linear system of equations is obtained, where y = [y_1, y_2, …, y_k]^T, θ = [1, …, 1]^T and a = [a_1, …, a_n]^T. The values of a and b are obtained by solving Eq. 10. The resulting equation gives the LS-SVM regression function, which is used for the water quality investigation [33].
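The steps described in this section follow the standard LS-SVM regression formulation. As a hedged reconstruction (the symbol Ω for the kernel matrix and the input dimension n are assumptions, and the grouping of lines may not match the numbering of Eqs. 2 to 10), the derivation can be written as:

```latex
% Training set and kernel (Mercer condition)
D = \{(x_k, y_k)\}_{k=1}^{N}, \qquad x_k \in \mathbb{R}^{n},\; y_k \in \mathbb{R}
K(x_k, x_l) = \phi(x_k)^{T} \phi(x_l)

% Primal problem with equality constraints
\min_{\omega, b, \xi} \; J(\omega, \xi) = \tfrac{1}{2}\,\omega^{T}\omega + \tfrac{\gamma}{2} \sum_{k=1}^{N} \xi_k^{2}
\text{subject to} \quad y_k = \omega^{T}\phi(x_k) + b + \xi_k, \qquad k = 1, \dots, N

% Lagrange function
L(\omega, b, \xi, a) = J(\omega, \xi) - \sum_{k=1}^{N} a_k \left[ \omega^{T}\phi(x_k) + b + \xi_k - y_k \right]

% KKT conditions
\frac{\partial L}{\partial \omega} = 0 \;\Rightarrow\; \omega = \sum_{k=1}^{N} a_k \phi(x_k), \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{k=1}^{N} a_k = 0,
\frac{\partial L}{\partial \xi_k} = 0 \;\Rightarrow\; a_k = \gamma\, \xi_k, \qquad
\frac{\partial L}{\partial a_k} = 0 \;\Rightarrow\; \omega^{T}\phi(x_k) + b + \xi_k - y_k = 0

% Linear system after eliminating \xi and \omega
\begin{bmatrix} 0 & \theta^{T} \\ \theta & \Omega + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ a \end{bmatrix}
=
\begin{bmatrix} 0 \\ y \end{bmatrix},
\qquad \Omega_{kl} = K(x_k, x_l)

% Resulting LS-SVM regression function
y(x) = \sum_{k=1}^{N} a_k\, K(x, x_k) + b
```

With this sign convention the residual of each training point equals a_k / γ, which makes the role of γ as a trade-off between training error and model complexity explicit.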
The choice and construction of the kernel function is an important stage that affects the implementation of the LS-SVM, because it is the step that extends the LS-SVM from the linear to the nonlinear case. Common kernel functions include the linear kernel, the q-order polynomial kernel, the RBF kernel and the sigmoid kernel [34]. The (Gaussian) radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms. The RBF kernel on two samples x and x′, represented as feature vectors in some input space, is defined as
$$K(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)$$
where ‖x − x′‖² is the squared Euclidean distance between the two feature vectors and σ is a free parameter. An equivalent, but simpler, definition involves a parameter γ = 1/(2σ²):
$$K(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right)$$
Since the value of the RBF kernel decreases with distance and ranges between zero (in the limit) and one (when x = x′), it has a ready interpretation as a similarity measure. The feature space of the kernel has an infinite number of dimensions; for σ = 1, its expansion is shown in Eq. 13:
$$\exp\left(-\tfrac{1}{2}\lVert x - x' \rVert^{2}\right) = \sum_{j=0}^{\infty} \frac{(x^{T} x')^{j}}{j!}\, \exp\left(-\tfrac{1}{2}\lVert x \rVert^{2}\right) \exp\left(-\tfrac{1}{2}\lVert x' \rVert^{2}\right)$$
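A minimal NumPy sketch of LS-SVM regression with the RBF kernel is given below. It assumes the standard LS-SVM linear system (a bordered kernel matrix with the identity scaled by 1/γ added to its kernel block); the function names, the synthetic data and the values of the regularization parameter and σ are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq_dist = ((A ** 2).sum(axis=1)[:, None]
               + (B ** 2).sum(axis=1)[None, :]
               - 2.0 * A @ B.T)
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; a] = [0; y] for b and a."""
    n = X.shape[0]
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    solution = np.linalg.solve(M, rhs)
    return solution[0], solution[1:]  # bias b, multipliers a

def lssvm_predict(X_new, X_train, a, b, sigma=1.0):
    """y(x) = sum_k a_k K(x, x_k) + b."""
    return rbf_kernel(X_new, X_train, sigma) @ a + b

# Synthetic stand-in data: 40 samples of 5 normalized inputs (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 0.9, size=(40, 5))
y = X @ np.array([0.3, 0.2, 0.2, 0.2, 0.1])
b, a = lssvm_fit(X, y, gamma=100.0, sigma=1.0)
y_fit = lssvm_predict(X, X, a, b, sigma=1.0)
```

Because training reduces to a single linear solve, the cost grows with the cube of the number of training samples; in practice the regularization parameter and σ would be tuned, for example by cross-validation.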

Study area
This study was carried out in one of the largest watersheds in north-western Iran, the Aji Chay catchment, with a drainage area of 51,876 km² and a principal channel length of 265 km. The river originates in the southern regions of the Sabalan and Qusheh-dagh Mountains and eventually drains into the basin of Urmia Lake. Studying water quality in this river is important because the Aji Chay River, with an average discharge of 0.6 m³/s, is one of the main available sources of drinking water for West Azarbaijan province. It is also important environmentally because the river drains into Urmia Lake. The quality of Aji Chay River water therefore affects both the drinking water of West Azarbaijan province and the quality of Urmia Lake. To evaluate the accuracy of the developed ANN and SVM methods, 30 years of TDS data from the Aji Chay River were collected for the Vanyar station, located at 38°07′00″N and 46°24′18″E, and the Zarnagh station, located at 47°14′N 38'00E coordinates with an elevation of 1470 m above mean sea level (see Figure 2) [35].

Results and discussion
The inputs of the models were the monthly Mg²⁺ (magnesium), Na⁺ (sodium), Ca²⁺ (calcium), Cl⁻ (chloride) and SO₄²⁻ (sulfate) concentrations, with TDS as the output. The time series of observed TDS data (experimental data collected over 30 years from the Vanyar and Zarnagh stations) were randomly divided into two separate parts, 80% for training and 20% for testing, to operate the applied methods. Table 1 shows the statistical specification of the TDS data collected from the Vanyar and Zarnagh stations, and Table 2 presents the input data classification used for the identified models (M). A trial-and-error process was used to find the optimum percentages of data for training and testing, the aim being to obtain the best values of the performance criteria. Accordingly, three splits were evaluated: 30-70 (i.e. 30% of the data for testing and 70% for training), 20-80 and 40-60. Of these three, the 20-80 split gave the best performance criteria. Hence, in both the ANN and SVM models, 80% of the observed data were used for training and 20% for testing.
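The random split described above can be sketched in a few lines of Python. The function name, the fixed seed and the placeholder record count of 360 (30 years of monthly values, assuming no gaps) are assumptions for illustration only.

```python
import random

def split_train_test(records, test_fraction=0.2, seed=0):
    """Randomly split records into (train, test) lists."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(len(records) * test_fraction)
    test = [records[i] for i in indices[:n_test]]
    train = [records[i] for i in indices[n_test:]]
    return train, test

# Placeholder for the monthly records (hypothetical count).
records = list(range(360))

# The three candidate splits compared by trial and error.
for fraction in (0.2, 0.3, 0.4):
    train_part, test_part = split_train_test(records, fraction)
    print(f"test fraction {fraction}: {len(train_part)} train, {len(test_part)} test")
```

In a trial-and-error comparison, each candidate split would be used to train and test the model, and the split giving the best performance criteria would be retained.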
ANN models require that input data sets be normalized between 0.05 and 0.95. This was accomplished by applying Eq. 14 during model initialization.
$$x_{new} = 0.8\,\frac{x - x_{min}}{x_{max} - x_{min}} + 0.1 \quad (14)$$
The variables x_new, x, x_min and x_max are the normalized value of the original parameter, the original value, and the minimum and maximum values in the data set, respectively. Normalization places all inputs on a comparable scale, which gives a better representation of TDS variation and an acceptable regression correlation in testing.
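A short Python sketch of the normalization in Eq. 14 follows; the function names and the sample values are hypothetical. With the coefficients 0.8 and 0.1 as written in Eq. 14, the minimum of the data maps to 0.1 and the maximum to 0.9.

```python
def normalize(values, scale=0.8, offset=0.1):
    """Apply x_new = scale * (x - x_min) / (x_max - x_min) + offset."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant series")
    return [scale * (x - x_min) / (x_max - x_min) + offset for x in values]

def denormalize(normalized, x_min, x_max, scale=0.8, offset=0.1):
    """Invert the normalization to recover values in original units."""
    return [(v - offset) / scale * (x_max - x_min) + x_min for v in normalized]

# Hypothetical concentrations, not taken from the station records.
sample = [10.0, 20.0, 30.0]
scaled = normalize(sample)
restored = denormalize(scaled, min(sample), max(sample))
print(scaled, restored)
```

Keeping `x_min` and `x_max` from the training data is what allows model outputs to be converted back to physical units after prediction.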

The ANN model in preliminary test
For the ANN method, the observed data were split into 80% for training and 20% for testing, and the ANN was evaluated numerically on the basis of R² and RMSE. The applied ANN was trained using MATLAB version 7.8. The types of variables used in the present ANN technique, and their values, are listed in Table 2. In this table, Epochs refers to the number of training steps, MF is a membership function and "Trimf" is the abbreviation for triangular membership function (see Figure 3).

LS-SVM model in preliminary test
The training and testing data split for the developed LS-SVM technique is the same as for the ANN model (80%-20%), and the LS-SVM was evaluated using R², RMSE and NSE. The RBF kernel was selected for the LS-SVM numerical analysis, together with the regularization parameters.
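The three performance criteria can be computed as sketched below. The sketch assumes that R² is the squared Pearson correlation between observed and predicted values (which is why it can differ from NSE) and uses the usual definitions of RMSE and NSE; the function names and sample numbers are hypothetical.

```python
import math

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def nse(obs, pred):
    """Nash-Sutcliffe Efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst

def r_squared(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    mean_obs = sum(obs) / len(obs)
    mean_pred = sum(pred) / len(pred)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(obs, pred))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_pred = sum((p - mean_pred) ** 2 for p in pred)
    return cov ** 2 / (var_obs * var_pred)

# Hypothetical values: predictions offset from observations by a constant bias.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(rmse(observed, predicted), nse(observed, predicted), r_squared(observed, predicted))
```

In this example the predictions are perfectly correlated with the observations, so R² is 1, while the constant bias lowers NSE; this illustrates why the two indices are reported separately.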

Outputs of the analysis
The numerical TDS results in Tables 4 and 5 show that, for the Vanyar station data set, the ANN method gives R² = 0.913, RMSE = 0.0032 and NSE = 0.812 for the M2 pattern, while the LS-SVM model gives R² = 0.871, RMSE = 0.09 and NSE = 0.86 for the M2 pattern. For the Zarnagh station, the LS-SVM analysis gives R² = 0.853, RMSE = 0.016 and NSE = 0.853, and the ANN model gives R² = 0.903, RMSE = 0.009 and NSE = 0.806 (Tables 6 and 7). These results demonstrate the capability and workability of ANN and LS-SVM as efficient soft methods for estimating TDS. The lower RMSE and higher R² of the ANN technique compared with the corresponding LS-SVM results indicate greater precision of the ANN in predicting TDS. The relationship between observed and predicted TDS for the two investigated approaches is presented in the form of scatter plots and statistical indicators in Figures 4 to 7. Sensitivity analysis of TDS was performed for the more important parameters, Mg²⁺ and Ca²⁺, as shown in Figure 8 and Figure 9. The statistical evaluation of the models for predicting TDS using the developed ANN method at the Zarnagh and Vanyar stations is presented in Table 4 and Table 5, respectively. The same statistical indices were calculated to evaluate the predictive power of the developed LS-SVM method at the Zarnagh and Vanyar stations, as shown in Table 6 and Table 7. The statistical indicators for the sensitivity analysis of TDS for Mg²⁺ and Ca²⁺ are given in Table 8 and Table 9.

Sensitivity analysis
To find out which parameter had the greatest influence on the ANN outputs, a sensitivity analysis was conducted on the data set. Based on the configurations given in Table 3, the effect of each parameter on the change in the output was investigated while keeping the other parameters constant. The results show that, at the Vanyar station, Ca²⁺ is the most important parameter in the analysis, giving the highest accuracy and the minimum error, with the following values: R² = 0.9431, RMSE = 0.0043 and NSE = 0.86. Table 8 shows the result of the sensitivity analysis for Ca²⁺.
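One simple way to vary a single parameter while holding the others constant is a one-at-a-time sweep, sketched below. This is an illustrative assumption about the procedure, not the exact method of this study; `toy_model` and its coefficients are arbitrary placeholders standing in for a trained ANN, so its ranking says nothing about the real data.

```python
def one_at_a_time_sensitivity(model, baseline, ranges, steps=11):
    """Sweep each input over its range with the others fixed at baseline.

    Returns the span (max - min) of the model output for each input.
    """
    spans = {}
    for name, (low, high) in ranges.items():
        outputs = []
        for s in range(steps):
            trial = dict(baseline)  # copy so other inputs stay constant
            trial[name] = low + (high - low) * s / (steps - 1)
            outputs.append(model(trial))
        spans[name] = max(outputs) - min(outputs)
    return spans

def toy_model(inputs):
    """Hypothetical stand-in for a trained model (arbitrary coefficients)."""
    return (0.1 * inputs["Mg"] + 0.2 * inputs["Na"] + 0.4 * inputs["Ca"]
            + 0.2 * inputs["Cl"] + 0.1 * inputs["SO4"])

names = ["Mg", "Na", "Ca", "Cl", "SO4"]
baseline = {name: 0.5 for name in names}       # normalized mid-range values
ranges = {name: (0.1, 0.9) for name in names}  # normalized input range
spans = one_at_a_time_sensitivity(toy_model, baseline, ranges)
most_influential = max(spans, key=spans.get)
print(spans, most_influential)
```

For a nonlinear model such as an ANN, the result of such a sweep depends on the chosen baseline, so it is usually repeated at several baseline points.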

Conclusion
This study developed ANN and LS-SVM techniques to predict the TDS of rivers, taking the Aji Chay River as a case study. To assess the performance of the methods, the root mean square error (RMSE), the coefficient of determination (R²) and NSE were applied as performance indicators. The results demonstrated that, for the testing data of the Vanyar station, the ANN technique has a lower RMSE (0.0032) and a higher R² (0.913), with a reasonable NSE coefficient (0.812), compared with the LS-SVM method (R² = 0.871, RMSE = 0.097 and NSE = 0.86). The observed data correlate highly with the estimated values, and the ANN method estimates TDS more accurately than the LS-SVM method. The analysis of the Zarnagh station data shows R² = 0.903, RMSE = 0.0091 and NSE = 0.806 for the ANN, and R² = 0.853, RMSE = 0.016 and NSE = 0.85 for the SVM. The developed methods can also be used for predicting the TDS of other surface water bodies.