Pre-processing Protocol for Nonlinear Regression of Uneven Spaced-Data

: Regression of experimental or simulated data has important implications in sensitivity studies, uncertainty analysis, and prediction accuracy. The fitness of a model is highly dependent on the number of data points and the locations of the chosen points on the curve. The objective of the research is to find the best scheme for a nonlinear regression model using a fraction of total data points without losing any features or trends in the data. Six different schemes are developed by setting criteria such as equal spacing along axes, equal distance between two consecutive points, constraint in the angle of curvature, etc. A workflow is provided to summarize the entire protocol of data preprocessing, training and testing nonlinear regression models with various schemes using a simulated temperature profile from an enhanced geothermal system. It is shown that only 5% of data points are sufficient to represent the entire curve using a regression model with a proper scheme.


Introduction
Data collection, processing and interpretation have become pivotal tool to help making informed and risk evaluated decision for in every industry.The data collected via appropriate design of experiments and processed through various machine learning algorithms allow us to evaluate or optimize performance of any given setup/system.Previously, to evaluate or predict the behavior of any given system numerous physical/chemical experiments were carried out and the data was manually collected and analyzed.These methods were time and resource consuming and heavily relied on human accuracy.But today, most of the required data can be generated using highly tuned simulations combined with various machine learning algorithms which are custom built to include all the desired physics and chemistry or any other type of laws/interactions.Stewart Robinson (Robinson 2004) explains the what why and when to use the simulations in his book and how it helps us in saving tremendous amount of time, energy and also the required amount of material for a given study.Simulation is a systematic tool, which when used correctly can be powerful in helping solve complex problems and aids in developing an optimized system.Machine learning techniques take us one step further by helping us develop complex mathematical models, by taking simulation data into account.Beylkin et al. (Beylkin, Garcke et al. 2009) developed an algorithms based on multivariate linear regression to develop modes for scattered data.These virtual mathematical models can be further optimized and developed to imitate the real models existing in the real world, hence eradicating the need of performing the cumbersome experiments.Apart from being robust, these models allow us to reduce the error (by defining the tolerance) and thus making them even more accurate.A list of time series forecasting models is provided later.
Usually while performing curve fitting in linear models the R 2 value is considered as the benchmark to establish the fitness of the curve, with R 2 approaching 1 being the best.However, this is not applicable for nonlinear curve fitting.Spiess and Neumeyer (Spiess and Neumeyer 2010) shows how R 2 is an inadequate measure to validate the fitness of curve in nonlinear models.
Simulations can be carried out to analyze the performance of a system over period of time and also to predict future behavior.This is especially in the field of oil and gas engineering and heat transfer in geothermal reservoir where simulation proves to be a most reliable tool.Studies performed by Okouma et al. (Okouma Mangha, Ilk et al. 2012), Kamari et al. (Kamari, Mohammadi et al. 2017) and Alom et al. (Alom, Tamim et al. 2017) on decline curve analysis for shale oil and shale gas, uses the similar technique to forecast the oil/gas production.The three major steps of any simulation based study involves design of experiment, simulation design and data interpretation.When the objectives of the study are established the first step is to design the experiment by choosing the number of parameters to be studied, their range and the different combinations to evaluate the performance of any individual parameters.This is usually done to make the model robust and efficient.Shaibu et.al. (Shaibu, Cho et al. 2009) highlights the importance of robust design, especially for a time oriented data to improve the accuracy of the model.The combinations are usually done by using various design of experiments (DOE) techniques like the Box Behnken method (Box and Behnken 1960).This kind of DOE helps in singling out the effect of individual parameters and helps us visualize the combined effect.
After the DOE is developed, the next step in simulation is the simulation design.In this step, as per the nature of study to be performed, a relevant simulator is chosen or can be self-designed (provided that it incorporates all the correct equations required for the study).Once the simulator is chosen, the setup is designed to represent the system in a virtual environment in the exact manner or by making certain reasonable assumptions.All the parameters to be studied are defined as per the simulator requirements and depending on the nature of study a time period is chosen (if it is a non-steady case).After setting all the things in the desired format (predefined by the simulator), the simulation is ran and the output data is collected.The last step in the simulation based study is to verify the accuracy of the output data.This can be done by performing few test experiments and using the same parameters to run the simulations.Then the simulation results can be compared against the experiment results (or the standard results) and the error percentage is calculated.As per the desired tolerance for the error, the simulations can be redesigned and re-ran to get accurate results.
Once the results are deemed acceptable, the big challenge is to interpret the results and process the output data so that they can be used in developing machine learning algorithms.The data generated by the simulation is in a very raw/crude form and could not be directly used to develop machine learning or response surface models.One of the major problem faced is that the data points are not evenly distributed over a time period.This is caused because of the different convergence techniques used by most numerical simulators.Each simulator has a pre-defined convergence limit which is guided by the minimum time-step provided to the simulator.The minimum time-step is defined to make sure the equations converge and doesn't give any errors or doesn't introduce any artifacts in the results.The initial time-step is chosen carefully according to the nature of equations used and depending on the physical process and the time scale.For example, a simulation with a fixed time-step would generate large number of data points and would also require more time to run.Whereas, using adaptive time-step, the simulator initially generate data points at very small time interval and as the equations begin to converge, it gradually increases the time-step and hence increasing the interval for data point generation.This leads to comparatively small number of data points as compared to the fixed time-step but even then the result might contain unnecessary amount of data points.
In this study, we have considered data obtained from a simulator for a temperature decline curve for the enhanced geothermal system (Asai, Panja et al. 2018).The temperature profile in geothermal system has been simulated or determined by analytical solution in case of relatively simple system (Wu, Zhang et al. 2014, Hadgu, Kalinina et al. 2016, Mudunuru, Karra et al. 2017, Asai, Panja et al. 2018).Uneven intervals are observed in simulated results especially in time series data.Due to the nature of numerical solution method, small time steps are required initially near t=0.On the other hand, larger time steps are used in the later time when the system is more stabilized.To demonstrate this numerical fact, produced water temperature consists of 9936 points from an enhanced geothermal system is shown in Figure 1.The logarithmic scale is used to expand the early time.The actual shape of the temperature profile in linear scale is also shown inside the figure.The X-axis i.e., time is divided into five groups such a way that each group contains 20% data.The first 20% of the data is from 0-23 days.Whereas last 20% of the data is from 6496 days (from 3504 to 10000 days).It is evident from this distribution that the data is highly dense towards initial time period.The density of data (number of points per unit time) is reduced towards the terminal time.In some instances, localized dense data is also observed due curvature of the profile.More data points are required to represent any curvature i.e., changes in slope along curve compared to fixed slope or straight-line portion of the curve.
This uneven distribution of data points over the entire time period, poses a problem in regression of a nonlinear mathematical equation as the data points are drastically skewed towards the beginning of the simulation.Thus, deeming the data points towards the end of simulation as the outliers and hence sometimes model cannot be fitted properly.
The study focuses on tackling such uneven distribution in the simulation results by implementing smart and novel techniques on preprocessing of the data so that it could be used in regression or curve fitting.To validate proposed data processing techniques, we considered the simulated data for the temperature decline curve in an enhanced geothermal reservoir generated through a commercial simulator.
The objective of this study is to reduce the total number of points to represent the entire curve to facilitate post processing of data such as curve fitting.Curve fitting using regression is highly dependent on number points as well as the local density of points.Various schemes are investigated to reduce the total number points and to obtain a better fit.

Methodology
Various steps involved in developing the protocol to reduce number of points in an uneven time series for regression are discussed here.All steps are summarized in a workflow for better comprehension in Figure 2.

Normalization
It is observed in most of cases if not all that the ranges (minimum to maximum) and actual values of Y-axis and X-axis are not comparable in same scale.For example, in figure 1, the temperature (Y-axis) varies from 92 to 182 O C (range 90 O C) and time (X-axis) varies from 0 day to 10,000 days (range 10,000 days).In this case X-axis is more sensitive to regression compared to the Y-axis.To avoid this mathematical problem, it is advised to normalize the data to 0 to 1 for both axes.This will ensure the same ranges and actual values within the same minimum and maximum bracket.In the context of a geothermal system, the temperature and time are normalized as shown in Equations 1 and 2 Using the above formulae, the minimum temperature i.e., 92 O C becomes zero and maximum temperature 182 O C becomes 1.Similarly, minimum and maximum times become 0 and 1 too.Using the normalized data, six schemes are tried in this study to select certain number of points from total points of 9936 (see Figure 1) as shown in the Table 1.Following the normalized temperature and time from geothermal system, all schemes are discussed in the next sections.

Scheme 1: Entire data set
In this scheme, data set is kept unchanged i.e., entire data of 9939 points are chosen for regression.This is base case for comparison with other schemes to establish the effectiveness of the data preprocessing.All points are plotted in Figure 3.

Scheme 2: Equal division of X-axis
In this scheme, the temperatures are selected such a way that they have equal interval i.e., the entire range of data in the X-axis is divided equally.Spacing between two consecutive points in this scheme is calculated as given in equation 3 Any data point in this scheme is calculated by equation 4 To display the position of the points on the curve, about 1% of total data (100 points from 9939 points) are chosen as shown in Figure 4. Details discussion of sensitivity of number of points on curve fitting is provided in results section.1% data is considered as case 1 in the sensitivity study and it is used for all figures showing different schemes for demonstration.4 that the distance between two consecutive points varies depending on the curvature or slope of the curve with respect to time � ∆T ∆t �.More points are located at lower slope section such as the flat potion of the curve compared to higher slope section.At higher slope � ∆T ∆t �, same amount of change in time (∆t) has more changes in temperature(∆T), thus larger distance between two points is observed.Technically point density can be defined as the number of points in a fixed one dimensional length (may be curve or straight line).Same point density at any location on the curve can be found for the curve with constant slope i.e., for linear equation.

Scheme 3: Equal division of Y-axis
Like the equal division of X-axis, the entire range of Y-axis can be divided into equal interval as shown in Figure 5.
In the case of a geothermal system, temperature starts initially at maximum value and reduces towards the minimum as time goes.Any data point in this scheme is calculated by equation 6 Location of higher point density in the curve is opposite to the previous scheme.A higher point density is found in the higher slope section.At higher slope � ∆T ∆t � region, for same change in temperature(∆T), change in time(∆t) in time is less which ensures more points in the region.

Scheme 4: Equal division along curve
In this scheme, the entire curve is divided into equal pieces along the trajectory of the curve.First task in this scheme is to calculate the length of the entire curve.Next, the total length is divided into specified number to get desired number of points with equal spacing along the curve.The distance between two consecutive points (∆L) is calculated using equation 7 and demonstrated in Figure 6.
To calculate the length of the entire curve, distances between two consecutive points (as calculated using equation 7) are added together as shown in equation 11 If the entire length along the curve is divided into N equal spacing, then the interval is calculated as Placing points with ∆L interval along curve is calculated as As shown in equation 10 that the method to find out the points with ∆L interval with previous point is an iterative method.The current time step (ti) is assumed first, the temperature (Ti) is interpolated from the original data set.Then, using calculation shown in equation 10, current time step (ti) is calculated again.If the assumed and calculated time match, then current time step is accepted and proceed for next time step.The points calculated in this way are shown in Figure 7.In this scheme, it is clear that each section of the curve has same point density irrespective of the slope of the curve.

Scheme 5: Constraint in deflection
In the previous schemes, especially in the schemes 2 to 4, the fixed number of total points is selected based on their criteria.Selection of total number of points is totally knowledge based.The total number of points should be sufficient to capture all the features of the curvature.Scheme 5 is formulated to ensure that the curved sections with sharp changes in slope are considered during regression.In this scheme, sum of change in slopes (in terms of angle) in successive points or deflection in the curve is set as a criterion.The calculation of angle in this scheme is explained in Figure 8.
To calculate the total deflection from the (i+1)th point, angles between next few points such as θi,i+1, θi+1, i+2, θi+2, i+3 should be known.The total change in angle from point i to next k points is calculated as Then the percentage change in slope with respect to angle between i and i+1 points is calculated as Points are added until the calculated ∆θi is met the set criteria of deflection.If the calculated ∆θi according to equation 14 is acceptable for a given value (say 75%), then the k th point after i th point is acceptable as the next i+1 point in the scheme.The points calculated this way with 75% of deflection tolerance are shown in Figure 9 Figure 9: Scheme 5 where deflection is chosen as criteria to select representative points This scheme ensures that no parts of curves with sharp slope change are ignored for the regression.Intervals in the curve are irregular depending on the localized slope.In original data, sudden changes of slope are observed in many places and it causes some close spacing of points.This cause the different weightage of different section of the curve in the cost function.

Scheme 6: Mix of schemes 4 and 5
To reduce the number of total points further, the scheme 4 and 5 are mixed together.The scheme 6 is same as scheme 5 except that the points generated in scheme 4 are considered instead of original data set (scheme 1).In scheme 4, equally spaced points along curves are generated.A few neighboring points might not have sharp changes in slopes.Using this scheme those neighboring point could be merged together.The points are significantly reduced compared to figures 7 and 8 as shown in Figure 10.

Regression
After choosing points from experimental data based on the criteria set by each scheme, a mathematical model is fitted.Reviewing several time series forecasting models which are available now, De Gooijer and Hyndman (De Gooijer and Hyndman 2006) classified them into eight categories namely (i) exponential smoothing (Muth 1960, Gardner 1985, Snyder 1985), (ii) Autoregressive Integrated Moving Average (ARIMA) (Yule 1927, Box andJenkins 1970), (iii) seasonal models (Dagum 1982, Huyot, Chiu et al. 1986), (iv) state space and structural models and the Kalman filter (Kalman 1960, Schweppe 1965, Shumway and Stoffer 1982), (v) nonlinear models (Volterra 1930, Wiener 1958), (vi) long-range dependence models, e.g. the family of Autoregressive Fractionally Integrated Moving Average (ARFIMA) models (Ray 1993, Ray 1993), (vii) Autoregressive Conditional Heteroscedastic/Generalized Autoregressive Conditional Heteroscedastic (ARCH/GARCH) models (Engle 1982, Taylor 1987, Bollerslev, Engle et al. 1994) and (viii) count data forecasting (Croston 1972, Willemain, Smart et al. 1994).In this study, a nonlinear function, ( ̅ , ) is chosen for regression as shown in Equation 15 The above equation is originally applied by Arps (Arps 1945) for decline in oil rate.In case of normalized data (0 to 1), the ( � 0 ) becomes one, therefore only parameters required to determine using regression are b1 and b2.In the curve fitting method, a cost function which is generally the sum of the errors or square of the sum of errors is minimized.An optimization routine 'nlinfit' in Matlab (Mathworks Inc.) is used where the cost function is given by the Equation 17 The yi is the normalized temperature from the experiments or simulations and ( ̅ , ) is the normalized temperature predicted by the model for same normalized time.Total N points or observations are used for the fitting.The quality of the fitted function is evaluated by various statistical measurements such as coefficient of determination (R 2 ), error (e), mean square error (MSE), normalized root mean square error (NRMSE) etc. which are discussed in the Appendix A.

Results and Discussions
As described in the workflow (Figure 2), the results are analyzed for both training and test data sets.Temperature profiles and errors in predictability are discussed here for scheme 1 to 7 for various scenario.

Training of Models
In this study, temperature versus time data are simulated from an enhanced geothermal system.Details of the simulation of geothermal system is not relevant to this topic, therefore only the data from the simulation is presented here.Total 9939 points are collected to represent 0 to 10,000 days.
To study the sensitivity of the chosen number of data points on the fitted curve, 8 cases are investigated by varying the number of selected points (from 1 to 20 %) as given in Table 2.In scheme 1, 100% data is used, therefore only single case is applied for scheme 1 and not shown in the above table.For schemes 2 to 5, 1 to 20% of total data are used to estimate the model parameters in equation 15.In scheme 6, number of data points (0.08% to 6.9%) are less than the data points used in scheme 4 because of the nature of the scheme.In scheme 5, choosing any number of points is not straightforward like scheme 2 to 4. In this scheme, the constraint in deflection (angle of curvature) is set as criteria instead of number of points.Therefore, it is a trial and error method to choose an angle such a way that the scheme will have certain number of points.Angles for scheme 5 are 75, 15.5, 2.6, 1.07, 0.70, 0.235, 0.127 and 0.0785 for cases 1 to 8 respectively.
Model provided in equation 15 is trained for all the cases of various scheme.The model parameters are enlisted in Table 3. Model parameters b1 and b2 for scheme 1 are 1.40 and 72.3 respectively.It is evident that in the most of the schemes, parameters do not vary a lot for different cases i.e., the parameters are independent of the number of points chosen for training.Only noticeable variations in parameters are observed for different cases in scheme 6. Significantly low number of points compared other schemes may be the reason behind these results in scheme 6.On the other hand, the parameters vary for different scheme in fixed case.The fitted model from various scheme and the original observation are compared in Figure 11 for case 1 (1% data).Because of the fact that data was normalized, all points fall between 0 and 1, therefore, the errors in the range of -0.05 to 0.05 are not very significant.Errors vary with time for all schemes.The highest errors are observed at the end of the curve and at the point of deflection of the curve.It means that the fitted model had difficulty to represent the curvature section of the data.Scheme 3 had low errors initially but it started deviating in the later time.Surprisingly, scheme 1 where 100% data are used had significant errors.Schemes 2, 4, 5 and 6 showed better fit compared to other schemes by showing overall less error for entire time period.Although, scheme 2 had highest error in the initial portion.Schemes 4 and 5 are possibly the best fit where low error is observed in the curvature section.This is because of the nature of the selection criteria of points in schemes 4 and 5.All points are equally spaced on the curve providing equal weightage to each section of curve in curve fitting.On the other hand, in scheme 5, the curve section had sufficient point for more weightage in the curve fitting, in other words, the cost function defined in equation 17 is more influenced by this section.
Overall fitness of the model for various cases in different schemes is measured by the combined error such as coefficient of determination (R 2 ) and normalized root mean square error (NRMSE) as shown in Figure 13.Considering the results from figures 12 and 13, schemes 4 and 5 are the best choice to select points for regression.Scheme 2 could be another choice but the initial deviations (see figure 12) make it unsuitable.

Testing of Models
Fitness of a model is not always guaranteed by the error analysis from training data.Test data set which is essentially randomly picked points within the range of study is required to check the predictability of the model.In testing of various schemes, model parameters (b1 and b2) obtained from case 5 where 5% data were used in regression are chosen without any strict technical reasons.As shown in the workflow (figure 2), one hundred random values of time are chosen in the range of 0 to 1 for testing of the model.The predicted values from scheme 1 to scheme 6 are compared with the experimental/simulated data in Figure 14.The scheme 2 has the highest R2 and lowest NRMSE which indicate the best fit among all scheme, however, due the higher initial error (figure 16), this scheme may not be the best to predict for entire time period.Scheme 3 has the lowest R 2 and the highest NRMSE values which make it least suited scheme among all.Like training data set, schemes 4, 5 and 6 remain the best schemes where higher R 2 and lower NRMSE are observed.

Conclusions
Despite the large availability of data points, fitted functions often fail to represent all these points due to differences in data density.1% to 20% data points are selected using 6 different selecting criteria (schemes 1 to 6).It is shown that even 1% worth of data points is as good as the entire data set for regression as long as the proper scheme to select points from original data set is chosen.Scheme 4 (where distance between two consecutive points is fixed) and 5 (where angle of deflection is the criteria) are the most efficient schemes that provide a better fit of the mathematical model for any given number of points.Test data set which is chosen randomly confirms the robustness of these schemes.This study helps reduce the number of data points necessary during regression and improve the fit of any model when the data points are not evenly distributed.

Figure 1 :
Figure 1: The data density along the X-axis

Figure 2 :
Figure 2: Workflow for data preprocessing, training and testing of fitted model The workflow can be divided into three broad sections namely data preprocessing, training of mathematical model and testing of the fitted model.Individual components of the workflow such as normalization of data, followed by the various schemes and mathematical model for curve fitting are discussed in the next sections.

Figure 3 :
Figure 3: Scheme 1 where no changes are made and the entire data set is selected.

Figure 4 :
Figure 4: Scheme 2 where points are located with equal interval in X-axis i.e., time It is evident from the figure 4 that the distance between two consecutive points varies depending on the curvature or slope of the curve with respect to time �

Figure 5 :
Figure 5: Scheme 3 where points are located with equal interval in Y-axis i.e., temperature Spacing in this scheme is calculated as given in equation 5

Figure 6 :
Figure 6: The demonstration of intervals and distance between two consecutive points

Figure 7 :
Figure 7: Scheme 4 where points are chosen based on equal division of the length of the curve

Figure 10 :
Figure 10: Selection of points based on 75% cumulative deflection change on the equally spacing points along curve It is noticed that four points out of seven points are located on the deflection section.On the other hand, few points are located on the straight-line sections.

Figure 11 :
Figure 11: Fitted functions and using schemes 1 to 6 for case 1Although the values of model parameters, b1 and b2 (see Table2) vary from scheme to scheme, a good agreement is observed between fitted model and experimental data.This indicates that the ) vary from scheme to scheme, a good agreement is observed between fitted model and experimental data.This indicates that the combination of values b1 and b2 in the model (equation 15) works well in prediction.To visualize the difference between fitted model and experimental data, error is plotted in Figure12.

Figure 12 :
Figure 12: Errors using case 1 for schemes 1 to 6

Figure 13 :
Figure 13: Error analysis of fitted functions for various scheme and different cases (a) Coefficient of determination (R 2 ) (b) Normalized Root Mean Square Error (NRMSE) Different percentage in the X-axis in the above figures are the different cases as shown in table 2. Because scheme 1 has only one case (100% data), one R 2 and one NRMSE values are calculated which are shown by dotted blue lines in the figures.Although the coefficient of determinations (R 2 ) of fitted curves for all schemes except scheme 2 in figure 13(a) are high which an indication of good fit, the NRMSEs are also in the higher side for schemes 4 and 6 which is an indication of bad fitting.

Figure 14 :
Figure 14: Testing of various scheme for model fitness using randomly chosen 100 values of time The predicted values from different schemes have discernible differences.All schemes predicted well close to the observed values until the midway, after that values are underestimated.The highest differences are observed at the end.To differentiate each scheme, the error plots are shown in Figure 15.

Figure 15 :
Figure 15: Error in the predictions from scheme 1 to 6 for testing data Like training set, scheme 2 has the highest error in the initial time period but errors are low in the later time.Schemes 4, 5 and 6 have the low variations in error throughout the entire time period.This can be evident from the quantitative error analysis (R 2 and NRMSE) as shown in Figure 16.

Figure 16 :
Figure 16: Quantitative error analysis of various scheme for testing set (a) Coefficient of determination (R2) (b) Normalized Root Mean Square Error (NRMSE)

Table 1 :
Various schemes for selecting points to reduce the number of points

Table 2 :
Data utilization in each scheme for various cases.(100% data is used for scheme 1)

Table 3 :
Model parameters b1, b2 after regressions of models (equation 15) from various schemes for different cases