TPE.MS.ID.000519

An Improved Computational Learning-Based Model for Estimating Total Organic Carbon in Unconventional Shale Reservoirs

Christopher N Mkono, Zhao Yang, Hongji Liu, Chaohua Guo*

Unconventional resources have emerged as one of the crucial alternatives to rapidly depleting conventional hydrocarbon resources. The hydrocarbon potential of shale source rocks is assessed through organic richness indices such as total organic carbon (TOC). Accurate estimation of TOC is very important, since minor deviations from the expected results can waste investment and time, whereas a slight improvement in estimation practice can increase the value of an exploration project. The objective of this study is therefore to present a classification and regression tree (CART) computational learning-based model as an improved alternative for estimating TOC from well logging data. A conventional well log suite of bulk density, gamma-ray, deep resistivity, sonic transit time, spontaneous potential, and neutron porosity from the Mihambia, Mbuo, and Nondwa Formations of the Mandawa Basin, Tanzania, was used as input variables. Results from the developed CART TOC model were compared with those of random forest (RF) and backpropagation neural network (BPNN) models. The proposed CART model trained better and generalized better to unused testing data than RF and BPNN. The CART model achieved R, RMSE, and MAPE values of 0.9615, 0.0840, and 0.5035 for training and 0.9703, 0.1162, and 0.3722 for testing, respectively. The proposed model works with higher accuracy, and the sensitivity analysis indicates that gamma-ray, deep resistivity, and sonic transit time significantly influenced the model outcome.

Keywords: Total organic carbon, Classification and regression tree (CART), Machine learning, Well logging

As global oil and gas consumption rises and conventional oil reserves diminish, the world's large discovered shale resources have recently drawn much more attention. Such unconventional resources have emerged as one of the most essential substitutes for rapidly declining conventional resources. The exploration and exploitation of unconventional hydrocarbon resources such as shale oil and gas rely particularly on reliable and accurate evaluation of total organic carbon (TOC) content. TOC is a measure of the amount of organic matter present in a rock sample.^1,2 TOC content not only indicates the potential hydrocarbon-in-place and the quality of the source rock, but also offers important information about the wettability, porosity, rock texture, permeability, microstructure, and hydraulic fracturing design of shale reservoirs.

The most accurate estimate of TOC content comes from direct laboratory measurement of organic richness on core samples or from Rock-Eval pyrolysis.³ However, obtaining core samples from each well in the field and conducting laboratory tests on them is costly and time-consuming. As a result, core-based data are scarce and expensive. Well log data, in contrast, are a critical part of most well drilling programs and are easily accessible. Therefore, well logs are used to generate correlations that can be applied to the entire well from limited core sample data.

Different researchers have highlighted the relationship between TOC and geophysical well logs.^4-8 The idea focuses on the response of well log signals to the organic matter present. Acoustic, resistivity, and spectral gamma-ray log responses generally increase with increasing TOC, whereas bulk density decreases as TOC increases. Using data from Devonian shale formations, Schmoker⁹ introduced and developed the density log-based technique. Schmoker’s technique is empirical and assumes that any change in bulk density is due to the presence of kerogen. Passey¹⁰ suggested the ΔlogR technique for identifying source rocks by overlaying porosity logs and resistivity logs. Nevertheless, this is also an empirical method and was not developed from rock physics principles.¹¹ It is worth noting that the nonlinear relationship between well logs and TOC in many shale rocks may greatly reduce the accuracy of TOC estimates from both Schmoker’s and ΔlogR techniques.
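To make the ΔlogR idea concrete, the sketch below uses the commonly published sonic/resistivity form of Passey's relations. The constants (0.02, 2.297, 0.1688) come from that published form rather than from this paper, the sonic log is assumed to be in µs/ft, and the baselines, log readings, and level of organic metamorphism (LOM) are hypothetical values chosen only for illustration.

```python
import math

def delta_log_r(rt, rt_base, dt, dt_base):
    """Separation between the resistivity and sonic curves.

    rt, rt_base: deep resistivity and its baseline in a lean shale (ohm-m)
    dt, dt_base: sonic transit time and its baseline (us/ft)
    """
    return math.log10(rt / rt_base) + 0.02 * (dt - dt_base)

def toc_from_delta_log_r(dlr, lom):
    """Scale the curve separation to TOC (wt%) for a given maturity level (LOM)."""
    return dlr * 10 ** (2.297 - 0.1688 * lom)

# Hypothetical readings for one depth sample (not data from this study).
dlr = delta_log_r(rt=20.0, rt_base=2.0, dt=95.0, dt_base=80.0)
toc = toc_from_delta_log_r(dlr, lom=10.0)
print(f"dlogR = {dlr:.2f}, TOC = {toc:.2f} wt%")
```

Because the baselines and LOM must be picked by the interpreter for each well, the result depends on those choices, which is one reason data-driven models are attractive as an alternative.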

The successful application of computational intelligence (CI) in hydrocarbon exploration and exploitation in recent years has led to the adoption of intelligent learning models for predicting TOC from well log data.^12-23 Computational intelligence is a discipline that combines computational power with human intelligence to develop sophisticated and trustworthy solutions to highly nonlinear and complicated problems. CI models have the advantage of being able to learn and adapt to the dynamic conditions of the reservoir, such as the depositional and formation environment, while using the entire suite of well logs for better prediction of TOC.^24-27 Many studies indicate that, when these nonlinear algorithms are used correctly, TOC content can be predicted more accurately.^28-32 The artificial neural network (ANN) has been the most commonly used computational learning technique for predicting TOC.^33-39 Compared to traditional approaches such as ΔlogR, ANN performed excellently in these studies because of its ability to extract patterns between the input well logs and the measured TOC data. However, ANN requires constant tuning of parameters such as the number of hidden nodes, biases, and weights to reach the best performing model structure, and it suffers from intrinsic drawbacks such as overfitting, low computational speed, and convergence to local minima.

It is important to note that numerous studies have recommended novel concepts and enhanced learning algorithms as substitutes for the standard ANN. An integrated semi-supervised computational intelligence model was used to predict TOC accurately without the need for manual overlaying of log curves.⁴⁰ Tan⁴¹ used support vector regression (SVR) to predict TOC content in a gas-bearing shale and achieved better results. The application of an extreme learning machine (ELM) to predicting TOC in a shale gas reservoir was also investigated.⁴² Mahmoud⁴³ used a new artificial neural network (ANN) to establish an empirical equation for TOC prediction from conventional well log data. A self-adaptive differential evolution-artificial neural network (SaDE-ANN) model also showed high accuracy in predicting TOC from well log data.^21,44 Gaussian process regression (GPR) was also implemented to predict TOC.^45,46 However, to achieve optimal GPR estimates, the user must specify the best kernel function. Like ELM, most of these computational learning models require iterative tuning of training parameters to achieve the best performance.

Therefore, we propose the classification and regression tree (CART) model for predicting TOC using well log parameters as inputs. CART is a tree-based technique that, once pruned, is less prone to overfitting and can perform well even when the predictor variables are irregular. The performance of the CART model was further compared with the untested computational learning methods of random forest (RF) and backpropagation neural network (BPNN). The results of the present study establish the CART algorithm as a fairly new computational learning TOC model and an intelligent approach for the reliable prediction of TOC values. The rest of this paper is organized as follows: Section 2 presents the geological setting and data processing. Section 3 introduces three different methods for TOC estimation: BPNN, RF, and CART. Section 4 presents the results and discussion. Section 5 gives the conclusion.

Geological setting

The Mandawa Basin is located in southern coastal Tanzania, bounded by the Ruvuma saddle in the south and the Rufiji River in the north (Figure 1). The geological evolution of the Mandawa Basin has been studied by different researchers.^47-49 Karoo rifting, the Gondwana breakup, the East African rift system, and the opening of the Somali basin are the main factors that controlled the evolution of the Mandawa Basin.^50-52 The basin's depositional history was mainly influenced by the Gondwana breakup. Mandawa, Kilwa, Pindiro, Songosongo, and Mavuji are the five main groups found in the basin. Before the breakup of Gondwana, the depositional environment was continental, with both deltaic and fluvial deposits dominating the area.⁵³ With the subsequent development of rifting and drifting, a restricted marine embayment with barrier reefs formed from the Paleo-Tethys transgression, isolating several saline lagoons during the early to middle Jurassic.⁵⁴ In the late Jurassic the basin was subjected to rapid subsidence, which lasted until the early Cretaceous and led to the deposition of clastic sediments, including the fluvial and alluvial deposits of the Mandawa and Mavuji groups. From the Aptian to the Paleogene, a mid-to-outer shelf zone of the coastal Mandawa Basin subsided at a constant rate, which resulted in the formation of the Kilwa group.^55-63 The source rocks of the Mandawa Basin consist of the Nondwa shales of the lower Jurassic Pindiro Group and the Mbuo Claystone of the upper Triassic Pindiro Group.⁶⁴

Data descriptions

Conventional well log data, namely neutron porosity (NPHI), gamma-ray (GR), spontaneous potential (SP), deep lateral resistivity (LLD), sonic travel time (DT), and bulk density (RHOB), together with measured TOC values collected from the Mandawa Basin, were used in this study (Figure 2). In total, 56 TOC data points from two wells, Mbate and Mbuo, were used to train the intelligent models, while 27 TOC data points from the Mita Gamma well were used to test the validity of the developed models. The statistical features of the three wells, Mita Gamma, Mbate, and Mbuo, used to develop the models are summarized in Table 1.
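The split described above keeps whole wells together, so the test well is never seen during training. A minimal sketch of that arrangement follows; the record layout, the `split_by_well` helper, and all numeric values are hypothetical placeholders rather than data from this study.

```python
# Hypothetical records; the numbers are placeholders, not measurements from this study.
samples = [
    {"well": "Mbate", "GR": 0.41, "RHOB": 0.62, "TOC": 0.35},
    {"well": "Mbuo", "GR": 0.58, "RHOB": 0.44, "TOC": 0.61},
    {"well": "Mita Gamma", "GR": 0.37, "RHOB": 0.70, "TOC": 0.22},
    {"well": "Mbuo", "GR": 0.66, "RHOB": 0.39, "TOC": 0.74},
    {"well": "Mita Gamma", "GR": 0.52, "RHOB": 0.51, "TOC": 0.48},
]

TRAIN_WELLS = {"Mbate", "Mbuo"}
TEST_WELLS = {"Mita Gamma"}

def split_by_well(records, train_wells, test_wells):
    """Assign whole wells to either the training set or the testing set."""
    train = [r for r in records if r["well"] in train_wells]
    test = [r for r in records if r["well"] in test_wells]
    return train, test

train, test = split_by_well(samples, TRAIN_WELLS, TEST_WELLS)
print(len(train), "training samples,", len(test), "testing samples")
```

Splitting by well rather than by random row gives a stricter check of generalization, since neighbouring depth samples from the same well tend to be strongly correlated.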

Data processing

During data processing, feature selection (variable selection) was performed to identify and remove obsolete, unnecessary, and redundant data attributes that do not add to a predictive model's accuracy or may even reduce it. The Pearson correlation coefficient (R) was used to evaluate the relative impact of the input variables on the output (Equation 1). Values of R always lie in the range between −1 and 1. Values close to +1 indicate that two variables increase together, near-zero values indicate a weak relationship between the variable pair, and values close to −1 indicate an inverse relationship between the two variables.
$R_{ab} = \frac{\operatorname{cov}(a, b)}{\sigma_a \sigma_b}$, (1)
where $R_{ab}$ represents the correlation coefficient of variables a and b, $\operatorname{cov}(a, b)$ is their covariance, and $\sigma_a$ and $\sigma_b$ are the standard deviations of variables a and b, respectively. Well log data and measured TOC were both normalized to the range between 0 and 1 to reduce redundancy and to improve the integrity of the data. The normalization was done using Equation 2:
$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$, (2)
where $x$ represents the original value, $x_{norm}$ represents the normalized value of the dataset, $x_{max}$ is the maximum value, and $x_{min}$ is the minimum value. The selected technique enables the computational learning algorithm to execute faster, improves the accuracy of the model, reduces overfitting, and decreases the complexity of the model.⁶⁵ The relevance of the input data for predicting TOC is shown in Figure 3.
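The two preprocessing formulas can be sketched in a few lines of plain Python. The function names and the bulk density and TOC values below are hypothetical and serve only to show a strongly inverse correlation and a rescaled log.

```python
import math

def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # The 1/n factors in the covariance and variances cancel, so they are omitted.
    return cov / math.sqrt(var_a * var_b)

def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical bulk density (g/cc) and TOC (wt%) readings, not data from this study.
rhob = [2.65, 2.58, 2.50, 2.44, 2.40]
toc = [0.4, 0.9, 1.6, 2.1, 2.6]

r = pearson_r(rhob, toc)
rhob_n = min_max_normalize(rhob)
print(f"R(RHOB, TOC) = {r:.3f}")
print("normalized RHOB:", [round(v, 2) for v in rhob_n])
```

Min-max scaling is a linear transformation, so it changes the range of each log without changing its correlation with TOC.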

Back-Propagation Neural Network (BPNN)

The BPNN is a multi-layer feedforward network trained by error backpropagation. It comprises three types of layers: input, hidden, and output.⁶⁶ Neurons in the hidden and output layers carry biases, which are connections to units whose activation is always 1, so a bias acts like an additional weight. During the back-propagation learning phase, error signals are sent in the opposite direction, from the output toward the input. The BPNN serves as a way to solve the multi-layer perceptron training problem.⁶⁷ Its major advances were the adjustment of the internal network weights after each training epoch using the back-propagated error, and the use of a differentiable function at each node.

The flow of data in a BPNN is divided into two phases. In the first phase, the input data are propagated forward from the input layer to the output layer, which yields the actual output shown in Equation 3.⁶⁶ The BPNN model can be presented by the following equation:
$Y = f\left(\sum_{j=1}^{m} w_{j}\, f\left(\sum_{i=1}^{n} w_{ji} x_i + w_{j0}\right) + w_{0}\right)$, (3)
where $n$ represents the input vector dimension, $m$ is the number of hidden neurons, $Y$ is the output variable, $x_i$ stands for the input variables, and $w_{ji}$ and $w_j$ are the input-to-hidden and hidden-to-output connection weights. Note that $w_{j0}$ and $w_0$ stand for the bias weights. All of the connection weights (along with the bias weights) are initialized with small random numbers, and an iterative process is used to calculate the final values. The sigmoid activation function, $f$, is the most widely used and can be presented as in Equation 4:
$f(z) = \frac{1}{1 + e^{-z}}$, (4)
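The forward phase can be sketched as follows for a single output and one hidden layer, with the sigmoid applied at both layers. The network size (six inputs, three hidden neurons) and every weight and input value are placeholders chosen for illustration, not the trained network from this study.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: weighted sums plus biases, passed through the sigmoid twice."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights_j, x)) + bias_j)
        for weights_j, bias_j in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Six normalized log readings (n = 6) and m = 3 hidden neurons; all numbers are placeholders.
x = [0.45, 0.30, 0.62, 0.55, 0.20, 0.38]
hidden_weights = [
    [0.2, -0.1, 0.4, 0.3, 0.0, -0.2],
    [-0.3, 0.5, 0.1, -0.4, 0.2, 0.1],
    [0.1, 0.1, -0.2, 0.6, -0.1, 0.3],
]
hidden_biases = [0.05, -0.10, 0.00]
output_weights = [0.7, -0.5, 0.9]
output_bias = 0.1

y = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
print(f"network output: {y:.3f}")
```

Because the output neuron is also a sigmoid, the prediction always falls between 0 and 1, which matches a TOC target that has been min-max normalized.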
In the second phase, the errors between the target and calculated values are propagated backward from the output layer to the preceding layers, and the connection weights are adjusted to reduce the errors between the calculated and target outputs. The overall error can be calculated as the total sum of squared errors (TSS), as shown in Equation 5.
$TSS = \sum_{p=1}^{P} (T_p - C_p)^2$, (5)
where $T_p$ and $C_p$ represent the target and calculated signals, respectively, and $P$ represents the total number of training pairs. BPNN algorithms, however, have weaknesses such as slow convergence and a tendency to become trapped in local minima. The training algorithm used in this study was Levenberg-Marquardt (LM). The LM algorithm is an iterative technique for finding the minimum of a multivariate function expressed as the sum of squares of nonlinear real-valued functions.^68,69 It combines the Gauss-Newton and steepest descent methods. When the current solution is far from the correct solution, the algorithm effectively behaves as a steepest descent method. When the current solution is close to the correct solution, it becomes the Gauss-Newton method.⁷⁰
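The way LM moves between the two regimes can be seen in a deliberately small example. The sketch below fits a single parameter k in the toy model y = exp(k·x) rather than a neural network; the model, the data, the starting damping value, and the factor of 10 used to adjust it are all assumptions made for illustration.

```python
import math

def levenberg_marquardt(xs, ys, k=0.0, damping=1e-2, iterations=100):
    """Fit y = exp(k * x) by damped least squares on the single parameter k."""
    def sum_squared_error(k_val):
        return sum((y - math.exp(k_val * x)) ** 2 for x, y in zip(xs, ys))

    current = sum_squared_error(k)
    for _ in range(iterations):
        residuals = [y - math.exp(k * x) for x, y in zip(xs, ys)]
        jacobian = [x * math.exp(k * x) for x in xs]  # d(model)/dk at each point
        gradient = sum(j * r for j, r in zip(jacobian, residuals))
        curvature = sum(j * j for j in jacobian)
        # Large damping gives a short, gradient-descent-like step;
        # small damping approaches the Gauss-Newton step.
        step = gradient / (curvature + damping)
        trial = sum_squared_error(k + step)
        if trial < current:
            k, current = k + step, trial
            damping /= 10.0  # step accepted: trust the Gauss-Newton model more
        else:
            damping *= 10.0  # step rejected: fall back toward steepest descent
    return k

# Synthetic, noise-free data generated with k = 0.7.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [math.exp(0.7 * x) for x in xs]
k_fit = levenberg_marquardt(xs, ys)
print(f"fitted k = {k_fit:.6f}")
```

In a network the same update is applied to the full weight vector, with the scalar division replaced by solving a linear system in the Jacobian of the residuals.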

Random forest (RF)

Random forest is an ensemble learning method used for regression, classification, and other tasks. During training it builds multiple decision trees and outputs either the class chosen by the trees or the mean of their individual predictions.⁷¹ Random forest combines two methods, bagging and feature randomness, which help it obtain highly accurate results, avoid overfitting, and handle larger input datasets, making it suitable for prediction. Bagging is used to train each individual tree on a sample drawn from the training data.⁷² Feature randomness means that, to find a split at each node, each tree in the forest considers only a random subset of the variables. The resulting variation among the trees increases diversification and lowers the correlation between them. As a result, a random forest ends up with trees that are not only trained on different sets of data but also make decisions based on different features.⁷³ The general RF algorithm can be presented by Equation 6.
$f(x) = \frac{1}{N} \sum_{i=1}^{N} R_i(x)$, (6)
where $R_i(x)$ represents the result of the individual regression tree (RT) $i$, $f(x)$ is the RF result, and $N$ represents the number of trees.

A benefit of RF is that it can determine the relative importance of parameters, which can be obtained using two methods, Gini impurity (GI) and mean square error (MSE). The GI is used to estimate the quality of each split on each variable in a tree, and the MSE is used to determine the average decrease in prediction accuracy due to partitioning on each predictor.^71,74 The GI and MSE can be presented as Equations 7 and 8, respectively.
$GI = 1 - \sum_{i=1}^{n} p(i)^2$, (7)
$MSE = \frac{1}{m} \sum_{j=1}^{m} (y_j - \bar{y})^2$, (8)
where $p(i)$ represents the probability of randomly choosing an observation of class $i$, $n$ represents the number of classes, $m$ is the number of instances, $y_j$ is the label for an instance, and $\bar{y}$ is the mean given by Equation 9 below:
$\bar{y} = \frac{1}{m} \sum_{j=1}^{m} y_j$, (9)
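One way to read the MSE-based importance is as the drop in node MSE achieved by a split: predictors whose splits remove more error are ranked higher. The sketch below computes the node MSE and that decrease for one parent node; the `mse_decrease` helper and the TOC values are hypothetical.

```python
def mse(values):
    """Mean squared deviation of the node's values from the node mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def mse_decrease(parent, left, right):
    """Reduction in MSE when a parent node is partitioned into two children."""
    weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
    return mse(parent) - weighted

# Hypothetical normalized TOC values reaching one node.
parent = [0.2, 0.3, 0.8, 0.9]

# A split that separates low from high TOC removes most of the error ...
informative = mse_decrease(parent, [0.2, 0.3], [0.8, 0.9])
# ... while a split that mixes them removes very little.
uninformative = mse_decrease(parent, [0.2, 0.8], [0.3, 0.9])
print(f"informative: {informative:.4f}, uninformative: {uninformative:.4f}")
```

Summing such decreases over every node where a given well log is used, and averaging over the trees, gives a relative importance score for that log.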
Predictor variables of multiple types can unbalance the GI approach. The MSE approach was proposed to measure relative importance more accurately than the GI method.⁷⁵ As a result, the RMSE (root mean square error) approach was chosen to estimate relative importance in this study. The random forest algorithm may include the following steps:

i. Select random samples from the given dataset.
ii. Construct a decision tree for every sample and obtain the forecast from each tree.
iii. Vote over the forecasted results (for regression, average them as in Equation 6).
iv. Take the most voted (or averaged) prediction as the final output.

The working principle of the RF algorithm is illustrated in Figure 4.
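The steps above can be sketched for regression with very small trees. To stay short, each tree here is a one-split stump, so drawing one feature subset per tree is the same as drawing it per node; the stump simplification, the helper names, and the data are assumptions for illustration and not the configuration used in this study.

```python
import random

def fit_stump(rows, targets, feature_ids):
    """Best single split (lowest squared error) over the allowed features."""
    best = None
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((t - left_mean) ** 2 for t in left)
                     + sum((t - right_mean) ** 2 for t in right))
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:  # every allowed feature is constant in this sample
        mean = sum(targets) / len(targets)
        return (feature_ids[0], float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean

def fit_forest(rows, targets, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step i: bootstrap sample (rows drawn with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Feature randomness: this tree may only split on a random subset.
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        # Step ii: fit one tree on the sample.
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature_ids))
    return forest

def predict_forest(forest, row):
    # Steps iii and iv for regression: average the individual tree outputs.
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Hypothetical normalized logs (three features) and TOC targets.
rows = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.5], [0.3, 0.4, 0.1],
        [0.7, 0.3, 0.9], [0.8, 0.2, 0.4], [0.9, 0.1, 0.6]]
targets = [0.20, 0.25, 0.30, 0.80, 0.85, 0.90]

forest = fit_forest(rows, targets)
low = predict_forest(forest, [0.15, 0.85, 0.5])
high = predict_forest(forest, [0.85, 0.15, 0.5])
print(f"low-TOC query: {low:.3f}, high-TOC query: {high:.3f}")
```

A production forest would grow deeper trees and redraw the feature subset at every node, but the bootstrap, random feature selection, and averaging are the same ingredients.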

Classification and Regression Tree (CART)

The Classification and Regression Tree (CART) method was introduced as a decision tree approach to predictive modeling problems of either classification or regression.⁷⁶ It is a nonparametric modeling technique that uses a group of independent categorical or continuous variables to describe the response of the dependent variable. CART generates a classification tree when the dependent variable is categorical and a regression tree when the dependent variable is continuous. The output of CART is a decision tree in which each fork indicates a split on a predictor variable and each end node contains a prediction of the outcome variable.⁷⁷ The most valuable characteristic of CART is its ability to process various kinds of datasets. In addition, it can handle a huge amount of data easily.⁷⁸ CART models are simple to learn and operate, which gives them a significant advantage over other analytical models.

Building a CART model involves two main steps. The first step is to grow the decision tree. CART's basic principle is to identify the optimal feature in the dataset according to a splitting criterion. CART always chooses the feature with the lowest Gini index in the current dataset as the node split. The sample set to be categorized is separated into two sub-sample sets using the Gini index, and this step is repeated until the current sample sets are recognized as leaf nodes or a stopping requirement is met. The second step is to prune the decision tree. To obtain an optimal tree, the tree must be pruned to minimize overfitting. In general, pruning nodes controls the tree's complexity, which is determined by the number of leaves on the tree. A cross-validation approach is used to determine the best tree size.

The most often used criteria for splitting the trees are "Entropy" for the information gain and "Gini" for the Gini impurity, which can be represented mathematically as in Equation 10 and Equation 11.
$Entropy = -\sum_{i=1}^{k} P_i \log_2 P_i$, (10)
$Gini = 1 - \sum_{i=1}^{k} P_i^2$, (11)
where $P_i$ is the probability of class $i$ and $k$ is the total number of classes.
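Both criteria measure how mixed the classes in a node are, and both reach zero for a pure node. A short sketch, assuming a base-2 logarithm for the entropy and using made-up class probabilities:

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini impurity: chance that two random draws from the node differ in class."""
    return 1.0 - sum(p * p for p in probs)

# Hypothetical class mixes in a node with two classes.
for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, f"entropy = {entropy(probs):.3f}", f"gini = {gini(probs):.3f}")
```

The two measures rank candidate splits similarly in most cases; Gini avoids the logarithm and is slightly cheaper to evaluate.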

CART models use variance minimization to divide the data iteratively into progressively more homogeneous groups, using splitting criteria on the independent variables. The dependent data are divided into a sequence of right and left nodes that descend from the root node, as shown in the decision tree structure in Figure 5. The main weakness of this method is the risk of overfitting, which occurs when trees grown to their full size match the training data so closely that they cannot generalize effectively to new data.⁷⁹
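The variance-minimizing recursion and the overfitting risk can both be seen in a small regression tree. In the sketch below, limiting `max_depth` stands in for pruning, which is a simplification; the data are hypothetical and the functions are illustrative rather than the implementation used in this study.

```python
def sse(values):
    """Sum of squared deviations from the mean (variance times sample count)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Search every feature and midpoint threshold for the lowest total SSE."""
    best = None
    for feature in range(len(rows[0])):
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            cost = sse([targets[i] for i in left]) + sse([targets[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, feature, threshold, left, right)
    return best

def build_tree(rows, targets, max_depth):
    """Recursively split until max_depth is reached or no split is possible."""
    split = best_split(rows, targets) if max_depth > 0 and len(rows) > 1 else None
    if split is None:
        return {"leaf": sum(targets) / len(targets)}
    _, feature, threshold, left, right = split
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left],
                           [targets[i] for i in left], max_depth - 1),
        "right": build_tree([rows[i] for i in right],
                            [targets[i] for i in right], max_depth - 1),
    }

def predict(tree, row):
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def training_sse(tree, rows, targets):
    return sum((predict(tree, r) - t) ** 2 for r, t in zip(rows, targets))

# Hypothetical normalized (GR, RHOB) readings and TOC targets.
rows = [[0.10, 0.62], [0.25, 0.58], [0.40, 0.55],
        [0.55, 0.47], [0.70, 0.40], [0.90, 0.35]]
targets = [0.15, 0.22, 0.30, 0.52, 0.71, 0.88]

shallow = build_tree(rows, targets, max_depth=1)
deep = build_tree(rows, targets, max_depth=10)
print("training SSE, depth 1 :", round(training_sse(shallow, rows, targets), 4))
print("training SSE, depth 10:", training_sse(deep, rows, targets))
```

The fully grown tree reproduces every training target exactly, which is the memorization behaviour that pruning and cross-validated tree size are meant to prevent.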

Performance indicators

The statistical indicators used to judge the performance of the predictive models were the correlation coefficient (R), root mean square error (RMSE), and mean absolute percentage error (MAPE). R measures the strength and direction of the linear relationship between predicted and measured TOC, RMSE measures the square root of the average squared error and represents the stability or quality of the models, and MAPE describes the model in terms of its relative accuracy. The mathematical expressions for R, RMSE, and MAPE are given in Equations 12, 13, and 14.

$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}$, (12)
$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$, (13)
$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$, (14)
where $\hat{y}_i$ is the predicted TOC value from the models, $y_i$ represents the actual TOC value measured from core samples, $\bar{\hat{y}}$ and $\bar{y}$ are the mean values of the predicted and actual TOC, and $n$ represents the number of samples.
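The three indicators can be computed directly from paired lists of measured and predicted values. In this sketch MAPE is returned as a fraction (multiply by 100 for a percentage), which is an assumption about the convention behind the reported values; the TOC numbers are made up.

```python
import math

def correlation_r(actual, predicted):
    """Pearson correlation between measured and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

def rmse(actual, predicted):
    """Square root of the mean squared prediction error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Mean absolute error relative to the measured value, as a fraction."""
    n = len(actual)
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n

# Hypothetical measured and predicted TOC (wt%); measured values must be non-zero for MAPE.
measured = [1.0, 2.0, 4.0]
estimated = [1.1, 1.8, 4.4]

print(f"R = {correlation_r(measured, estimated):.4f}")
print(f"RMSE = {rmse(measured, estimated):.4f}")
print(f"MAPE = {mape(measured, estimated):.4f}")
```

MAPE is undefined when a measured value is zero, which matters if the target has been min-max normalized so that the smallest sample maps to 0.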

Training performance

During training, the optimal learning rate for CART and RF was determined using the widely used sequential trial-and-error method. For CART, the learning rate that generated the best TOC prediction was 0.12, with a maximum of 190 trees and a maximum of 13 nodes per tree. Similarly, for RF the learning rate that generated the best TOC prediction was 0.16, with a maximum of 200 trees and a maximum of 6 nodes per tree. The tuning parameter in the BPNN architecture was the number of hidden neurons, which was also obtained by the sequential trial-and-error method.
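A sequential trial-and-error search amounts to trying candidate settings one at a time and keeping the one with the lowest validation error. The sketch below shows that loop with a stand-in error function; `toy_validation_error` and its shape are invented for illustration and do not reproduce the tuning curves of this study.

```python
def sequential_search(candidates, validation_error):
    """Try each candidate in turn and keep the one with the lowest error."""
    best_value, best_error = None, float("inf")
    history = []
    for value in candidates:
        error = validation_error(value)
        history.append((value, error))
        if error < best_error:
            best_value, best_error = value, error
    return best_value, best_error, history

def toy_validation_error(max_nodes):
    """Stand-in for training a model and scoring it on held-out data."""
    return 0.10 + 0.002 * (max_nodes - 8) ** 2

best_nodes, best_error, history = sequential_search(range(2, 16), toy_validation_error)
print(f"best setting: {best_nodes} nodes, validation error {best_error:.3f}")
```

In practice `validation_error` would fit the model with the given setting and return its RMSE on data not used for fitting, so the chosen setting is not tuned to the training set alone.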

During training, the CART TOC model trained better than both RF and BPNN. CART had RMSE and MAPE values of 0.0840 and 0.5035, respectively, as shown in Table 2. The RF TOC model trained slightly worse, with R, RMSE, and MAPE values of 0.9522, 0.0968, and 0.5915, respectively, as seen in Figure 7. The BPNN TOC model had the worst training performance, with R, RMSE, and MAPE values of 0.9390, 0.1556, and 0.9053, respectively (Figure 6).

The CART TOC model also performed well in terms of the correlation coefficient. During training, CART obtained a high R-value of 0.9615, compared to 0.9522 and 0.9390 obtained by RF and BPNN, respectively, as seen in Figure 7. The scatter diagram plots measured TOC values against the TOC predicted by the trained CART, BPNN, and RF models. The tight cloud of data points about the diagonal line for the training data indicates the good prediction accuracy of the developed TOC models. The performance of the CART, BPNN, and RF models during training is compared with the measured TOC data in Figure 8. The results indicate that the CART TOC model predicted TOC more accurately than RF and BPNN during training.

Testing performance

Here, the 27 unused TOC data points from the Mita Gamma well were used to test the validity of the developed models. Table 3 summarizes the results obtained during the validation process (testing). The CART TOC model was the best performing model, generating predictions close to the actual TOC values. This is seen in Figure 9, where CART obtained the lowest RMSE and MAPE values of 0.1162 and 0.3722, respectively.

The lowest RMSE and MAPE scores from CART indicate that its TOC predictions do not deviate much from the measured TOC values. The extent of deviation from the measured TOC values can be examined visually in Figure 10. The lowest RMSE value of 0.1162 during testing therefore makes the proposed CART TOC model the best and most stable TOC model when compared to RF and BPNN. The RF TOC model produced RMSE and MAPE values of 0.1383 and 0.3874, respectively. BPNN produced RMSE and MAPE values of 0.5890 and 0.7272, respectively, which makes it a poorly performing model. The R-value for CART was the highest, at 0.9703, as indicated in Table 3. Compared to RF and BPNN, the CART model can be described as the most resilient to outliers when dealing with noisy data. The RF and BPNN models scored R-values of 0.9449 and 0.9122, respectively (Figure 11). Thus, the statistical error analysis ranks CART as the best performing TOC model.

The significance of the well log inputs for TOC prediction was determined from the mean relative influence of each variable in the CART regression tree. Figure 12 shows the TOC regression tree model built from the well log variables. It further shows the contribution of each input well log to the prediction of TOC. The CART model selected five of the six input well logs as the most important variables for TOC prediction. GR was the most important variable in predicting TOC, with 45 fields of GR less than 0.63 and an average of 0.195. RHOB was the second most important variable, with 33 fields, and it affected those fields with high RHOB. The third most important variable was SP, with 30 fields and an average of 0.111, followed by DT with 17 fields, and finally NPHI with 15 fields and an average of 0.067.

The present study examined the capability of the classification and regression tree (CART) model to predict TOC from petrophysical well logs of the Mihambia, Mbuo, and Nondwa Formations in the Triassic to mid-Jurassic of the Mandawa Basin, southeast Tanzania. The models were trained using well log data from the Mbuo and Mbate wells, while well log data from the Mita Gamma well were used to test the validity of the developed models. A well log suite of GR, SP, NPHI, DT, LLD, and RHOB was used as input to develop the TOC models. The proposed model was evaluated using the statistical measures RMSE, MAPE, and R.

The results for both the training and testing data revealed that the CART model estimated TOC with higher accuracy and better correlation with core data than the BPNN and RF models. The variable significance analysis was used to identify the contribution of each individual well log to model performance. It revealed that the well log parameters GR, SP, DT, NPHI, and RHOB contribute most to the performance of the CART model in TOC prediction. This makes CART a more reliable CI technique for attaining accurate TOC estimation.

Acknowledgments

The authors acknowledge support from the National Natural Science Foundation of China: No. 51704265 (Research on two component gas diffusion-convection model in enhancing shale gas recovery with CO2 injection, PI: Dr. Chaohua Guo), the Outstanding Talent Development Project of China University of Geosciences (CUG20170614), and the Fundamental Research Funds for National University, China University of Geosciences (Wuhan) (1810491A07).

Funding

None.

Declaration of Conflict of Interest

The authors declare no conflict of interest.

Article Type

Research Article

Publication history

Received date: 09 February, 2023
Published date: 20 February, 2023

Address for correspondence

Chaohua Guo, Associate Professor, Department of Petroleum Engineering, China University of Geosciences (Wuhan), Hubei, Wuhan, 430074, China

How to cite this article

Mkono CN, Yang Z, Liu H, Guo C. An Improved Computational Learning-Based Model for Estimating Total Organic Carbon in Unconventional Shale Reservoirs. Trends Petro Eng. 2023;3(1):1–12. DOI: 10.53902/TPE.2023.03.000519

Author Info

Christopher N Mkono,¹ Zhao Yang,² Hongji Liu,¹ Chaohua Guo¹*

¹Key Laboratory of Theory and Technology of Petroleum Exploration and Development in Hubei Province and Key Laboratory of Tectonics and Petroleum Resources China University of Geosciences (Wuhan), China

²School of Petroleum Engineering, Northeast Petroleum University, China

Trends in Petroleum Engineering
