Inter-Frame Video Compression based on Adaptive Fuzzy Inference System Compression of Multiple Frame Characteristics

ABSTRACT


I. Introduction
Digital video represents moving visual images in the form of encoded digital data. A digital video contains a collection of digital images, called frames, displayed sequentially. Typically, videos are stored in an uncompressed format. Video compression addresses the need for storage or bandwidth efficiency in video transmission: it minimizes the number of video data bits by encoding information, so the result contains fewer bits than the original. Lossless compression removes parts that are statistically redundant without losing important information; it is usually used for archiving audiovisual files and is otherwise rarely used [1]. Lossy compression eliminates information that is considered unnecessary [2].
There are several approaches to video compression. Inter-frame-based compression [3][4][5] uses at least two frames to compress the current frame. Intra-frame-based compression [6][7][8][9] applies the principles of image compression to each current frame. One commonly used lossless image compression method is Arithmetic Coding (AC); other algorithms of this type include Huffman codes [10], the mixture of non-parametric distributions [11], and the Integer Wavelet Transform (IWT) [12]. Block-based compression [13][14][15][16] groups video frames into coding blocks to predict, transform, quantize, and encode. The first frame of each block is predicted and encoded using the intra-frame-based concept. Both the intra-frame and inter-frame-based concepts are then applied to the remaining frames.
Video compression algorithms in common video compression standards (such as MPEG) typically consist of motion estimation, motion compensation, and frame difference [17]. Motion estimation [18][19][20] finds motion vectors that point to the best-matching macroblock in a reference frame or field. This process examines the previous and later frames to identify blocks that have not changed, and motion vectors are stored instead of the blocks. Motion compensation [21][22] describes the transformation from a reference image to the current image; the reference image can be a previous or a later image. Compression efficiency increases when the current image is synthesized accurately from the reference image. Motion compensation assumes that camera or object motion is the only difference between one frame and another. Based on differences between frames, the frame difference method [23][24][25] focuses mainly on the amount of data to be compressed. There are three frame types. An I-frame is the easiest because it does not require other frames to decode. A P-frame uses the previous frame as a reference to decompress. A B-frame uses both previous and subsequent frames as references to increase the compression ratio.
Image features play an essential role in computer vision and pattern recognition. An image feature is relevant information for the computing needs of a particular application; it is also considered to represent the uniqueness of an image. Various studies in this field focus on building powerful feature extraction methods, such as the Median Robust Extended Local Binary Pattern [26], the Local Weighting Pattern [27], and the Gray-Level Dynamic Range Modification Technique [28]. Image features are also widely used in developing various feature-based image compression algorithms [29][30][31][32].
As described above, there are several approaches to video compression. The inter-frame-based approach uses at least two frames to compress them into a single frame. Video compression algorithms other than motion estimation, motion compensation, and frame difference form a separate area of research, and this also applies to the inter-frame-based approach. This study applies an inter-frame-based approach that compresses video by utilizing image features obtained in a certain way. Each pair of adjacent frames (called an odd-even frame pair) is compressed into one compressed frame. Compression and decompression are based on the compressed feature of every odd-even frame pair, generated by an adaptive FIS. First, the adaptive FIS is trained using the features of all odd-even frame pairs. The trained adaptive FIS is then used as a codec (encoder-decoder) in the compression-decompression process. The features used are the simple statistical features "mean", "std" (standard deviation), "mad" (mean absolute deviation), "mean(std)", and "mean(mad)". This study assumes that the average DCT coefficient of all video frames is a video quality parameter.

II. Methods
This section presents details of the adaptive FIS training phase used to generate compressed features, and its use in inter-frame-based video compression. The performance of the compression-decompression results is measured by several selected parameters.

A. General Concept
Various video compression algorithms are generally based on motion estimation, motion compensation, and frame differences. Motion estimation and compensation are commonly used together: motion estimation produces a motion vector, which is used to compensate each frame in a certain way to produce a compressed frame. The frame difference approach determines a threshold value from the calculated differences between each pair of adjacent frames. The first or second frame of each pair is eliminated if the difference between the frames is equal to or less than the threshold. The remaining frames, after reconstruction, are considered the compressed video. This study proposes an inter-frame-based approach that combines each pair of adjacent frames into one compressed frame through its compressed feature. In general, the proposed method is shown in Figure 1. Each pair of adjacent frames is treated as an odd-even frame pair, so one video file has two sets of frames, odd and even. The proposed method consists of two stages: training and compression. The training stage builds the rule base adaptively. The training input is the set of feature pairs of all odd-even frame pairs; the training target is the set of averaged feature pairs, which are considered the compressed features. In the compression stage, the trained adaptive FIS is applied to each odd-even frame pair to produce the compressed feature, which is then used to compress the pair into a single compressed frame in a certain way.
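The pairing-and-collapsing structure described above can be sketched as follows. This is a minimal illustration, not the paper's method: `avg_codec` stands in for the trained adaptive FIS codec, and all function names are hypothetical.

```python
import numpy as np

def pair_frames(frames):
    """Group a frame list with an even length L into L/2 (odd, even) pairs."""
    assert len(frames) % 2 == 0, "the method assumes an even frame count"
    return list(zip(frames[0::2], frames[1::2]))

def compress_pairs(frames, compress_pair):
    """Collapse each odd-even pair into one compressed frame."""
    return [compress_pair(odd, even) for odd, even in pair_frames(frames)]

# Stand-in pair codec: plain frame averaging. The real method instead
# builds the compressed frame from the FIS-compressed feature.
avg_codec = lambda a, b: (a.astype(float) + b.astype(float)) / 2.0

frames = [np.full((4, 4, 3), v, dtype=np.uint8) for v in (10, 20, 30, 40)]
video_c = compress_pairs(frames, avg_codec)  # 4 frames -> 2 compressed frames
```

The pairing step halves the frame count, which is where the method's compression gain originates.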

B. Training Stage
A video file consists of multiple frames, each presented as an RGB image. Suppose a video file contains L frames, where each frame is an RGB image of size M×N; then each frame is mathematically declared as in (1).

1) Feature Extraction
Feature extraction is performed on each of the R, G, and B components, so each frame has a combination of the three component features. This study uses some simple statistical measures as features: mean, std (standard deviation), mad (mean absolute deviation), mean of std, and mean of mad, expressed as in (2) to (6).
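A sketch of the five statistical features is shown below. Since Eqs. (2)-(6) are not reproduced here, treating "mean(std)" and "mean(mad)" as averages of the per-channel values across R, G, B is an assumption.

```python
import numpy as np

def frame_features(frame):
    """frame: HxWx3 uint8 RGB image -> dict of the five statistical features."""
    ch = frame.reshape(-1, 3).astype(float)   # flatten to (H*W, 3) channel columns
    mean = ch.mean(axis=0)                    # per-channel mean
    std = ch.std(axis=0)                      # per-channel standard deviation
    mad = np.abs(ch - mean).mean(axis=0)      # per-channel mean absolute deviation
    return {
        "mean": mean,                         # three values, one per channel
        "std": std,
        "mad": mad,
        "mean(std)": std.mean(),              # assumed: average of channel stds
        "mean(mad)": mad.mean(),              # assumed: average of channel mads
    }

f = np.zeros((8, 8, 3), dtype=np.uint8)
f[:, :, 0] = 100                              # constant red channel
feats = frame_features(f)
```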
2) FIS Concept
FIS is a Fuzzy Logic-based inference system whose reasoning process adopts human reasoning abilities. The inference system consists of fuzzification, implication, aggregation, and defuzzification. Fuzzification maps each FIS input into a fuzzy input number using a fuzzy set constructed from a particular membership function. Implication maps each pair of fuzzy input numbers into a fuzzy output set by applying each rule of the rule base; if there are N rules, there will be N output fuzzy sets for each pair of fuzzy input numbers. Aggregation combines all implication results into one fuzzy output set. Defuzzification converts the fuzzy output set into one crisp number.
A fuzzy set represents linguistic values in the universe of discourse using a fuzzy MF (Membership Function). There are several types of fuzzy MF; one of the most commonly used is the triangular MF. A fuzzy set constructed by triangular MFs with three linguistic values (Low, Medium, High) is shown in Figure 2. Mathematically, the triangular MF is expressed as in (7).
There are various inference methods; one of the most commonly used is Mamdani, as shown in Figure 3. A fuzzy operator is a logical operation used to obtain a single truth value as the output of each rule from two or more fuzzy input numbers; the Mamdani method usually uses the AND (min) fuzzy operator. The implication method generates a fuzzy output set for each rule based on this single truth value; the Mamdani method typically uses the min implication, which trims the output fuzzy set at the single truth value. The application of one rule using the Mamdani method is illustrated in Figure 4. The inference process applies the rule base, a collection of rules made to make decisions, to each pair of fuzzy input numbers. A particular method then aggregates all the output fuzzy sets of the implication results; the Mamdani method usually uses the OR (max) aggregation. The aggregated output fuzzy set is then defuzzified using a particular method to obtain a single crisp number; the commonly used method is COA (Center of Area), mathematically expressed as in (8).
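The min-implication, max-aggregation, and COA steps above can be sketched for a single-input Mamdani system. The fuzzy sets and the two-rule base here are illustrative, not the paper's trained rule base.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular MF in the spirit of Eq. (7): rises on [a,b], falls on [b,c]."""
    x = np.asarray(x, dtype=float)
    return np.maximum(0.0, np.minimum((x - a) / (b - a), (c - x) / (c - b)))

u = np.linspace(0.0, 1.0, 1001)                 # sampled output universe
in_sets  = {"Low": (-0.5, 0.0, 0.5), "High": (0.5, 1.0, 1.5)}
out_sets = {"Small": (-0.5, 0.0, 0.5), "Large": (0.5, 1.0, 1.5)}
rules = [("Low", "Small"), ("High", "Large")]   # IF x is Low THEN y is Small, ...

def mamdani(x):
    agg = np.zeros_like(u)
    for ante, cons in rules:
        w = tri(x, *in_sets[ante])                        # firing strength
        clipped = np.minimum(w, tri(u, *out_sets[cons]))  # min implication
        agg = np.maximum(agg, clipped)                    # max aggregation
    return float((u * agg).sum() / agg.sum())             # COA, as in Eq. (8)
```

For an input at the far left of the universe (x = 0), only the first rule fires and the centroid of the clipped "Small" set is returned.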

3) Adaptive FIS
If c is the number of crisp inputs and l is the number of linguistic values, the maximum number of rules is r = l^c. The accuracy of the result depends on the created fuzzy sets and the number of rules in the rule base. An adaptive FIS is used primarily to reach the required accuracy within a specified target error. The term "adaptive" refers to the adaptation of the linguistic values, the parameters of the fuzzy MFs, and other elements during the FIS training process.
In this study, the adaptive FIS is used to renew the number of linguistic values. The adaptation results are used to build a new rule base from a crisp input data set. The adaptation is performed by partitioning the universe of discourse, starting from the smallest number of partitions (3 linguistic values). Each partition is labeled sequentially, and each crisp input is labeled according to the partition in which it lies; this process is called labeling. Suppose a fuzzy set has three linguistic values, as shown in Figure 5(a); the universe of discourse is partitioned into three parts, with the labeling results shown in Figure 5(b). If necessary, the partition process continues, as shown in Figure 5(c) and Figure 5(d). If F(o) and F(e) are the sets of odd {F_1, F_3, …, F_(L-1)} and even {F_2, F_4, …, F_L} frame features, respectively, the feature compression using the FIS is mathematically expressed as in (9).
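The labeling step can be sketched as below, assuming a universe of discourse normalized to [0, 1] partitioned into l equal parts (cf. Figure 5); the equal-width partitioning is an assumption drawn from the figure.

```python
import numpy as np

def label_inputs(x, l):
    """Label crisp inputs x in [0, 1] with the index (1..l) of the partition
    of the universe of discourse each input falls into."""
    x = np.asarray(x, dtype=float)
    labels = np.floor(x * l).astype(int) + 1   # 1-based partition labels
    return np.clip(labels, 1, l)               # keep x == 1.0 in the last bin

labels3 = label_inputs([0.10, 0.50, 0.95], 3)  # three linguistic values
```

Increasing l (to 5, 7, ...) refines the partition, which is how the training loop adapts the number of linguistic values.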
The training error of the adaptive FIS is stated using the MAPE (Mean Absolute Percentage Error), mathematically expressed as in (11).
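As a worked example, the MAPE between target features and FIS outputs can be computed as follows (the standard MAPE definition is assumed for Eq. (11)).

```python
import numpy as np

def mape(y, y_hat):
    """Mean Absolute Percentage Error between targets y and outputs y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

err = mape([100.0, 200.0], [98.0, 204.0])   # (2% + 2%) / 2 = 2%
```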
In general, the adaptive FIS training process is shown in Figure 7.


C. Compression and Decompression Stage
The trained adaptive FIS consists of two parts, FIS compression and FIS decompression, as stated in Eqs. (9) and (10). The trained adaptive FIS, which contains the trained rule base, is used for video compression and decompression. The compressed frame is generated as in (12).
where i = 1 … L/2, L is the number of frames, and d is the index of the RGB component. Finally, the compressed video file is obtained by reconstructing all compressed frames.
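A hedged sketch of the compressed-frame construction follows. Eq. (12) is not reproduced in this text, so the construction below (pair average rescaled per RGB component d so that its "mean" feature equals the FIS-compressed feature) is an assumption, not the paper's exact rule.

```python
import numpy as np

def compress_pair(odd, even, f_c):
    """odd, even: MxNx3 frames; f_c: three compressed feature values,
    one per RGB component d. Returns one compressed frame (assumed form)."""
    avg = (odd.astype(float) + even.astype(float)) / 2.0
    for d in range(3):                 # d indexes the RGB component
        m = avg[:, :, d].mean()
        if m > 0:
            avg[:, :, d] *= f_c[d] / m   # enforce the compressed feature
    return avg

odd  = np.full((4, 4, 3), 10.0)
even = np.full((4, 4, 3), 30.0)
fc   = np.array([15.0, 20.0, 25.0])    # hypothetical FIS-compressed features
frame_c = compress_pair(odd, even, fc)
```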
Referring to Eq. (10), the decompression of the compressed video using the FIS is mathematically expressed as in (13).
F(o') and F(e') are the sets of decompressed odd and even frame features from the compressed video, respectively. The set of compressed features F(c') is obtained from the set of compressed frames {F_c(1), …, F_c(L/2)}, mathematically expressed as in (14). The decompressed frame is generated as in (15). All the decompressed frames are reconstructed into a decompressed video file.

D. Performance Parameter
There are a variety of standard measurable parameters for assessing video compression performance. This study uses the following performance parameters.

1) Compression Ratio (CR)
CR states the degree of compression achieved. Commonly, a reasonably small CR is expected, to increase the efficiency of storage and of video data transmission. If V_org is the original video file and V_comp is the compressed video file, CR is expressed as in (16).
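Reading the section's "smaller CR is better" convention, CR is taken here as the compressed size expressed as a percentage of the original size; the exact form of Eq. (16) is an assumption.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """CR as compressed size over original size, in percent (smaller is better)."""
    return 100.0 * compressed_bytes / original_bytes

cr = compression_ratio(4_000_000, 1_000_000)   # a 4 MB file compressed to 1 MB
```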

2) Quality of compression result
Typically, video compression decreases video quality: the smaller the CR, the lower the quality of the compression result. This means the compression ratio and the quality are both important factors to consider. Two types of quality are measured in relation to video compression: the quality of the compression result and of the decompression result.
A gray image arranged as a series is considered a discrete signal. Signal energy is one of the important characteristics of a signal and can serve as a feature. The DCT is a signal transformation with a good energy compaction property, presenting the major energy components with only a few transformation coefficients. Suppose a discrete signal x(n) of length N; its DCT is mathematically expressed as in (17).
The variables X(k) are the DCT coefficients of x(n). The DCT coefficients of a frame are the average DCT coefficients of the R, G, and B components, and the average DCT coefficients over all frames are considered the energy feature of a video. Video manipulation for various purposes changes the average absolute DCT coefficients; in this study, this change is assumed to be a video quality change. If D_org and D_comp are the average DCT coefficients of the original and compressed video files, respectively, the percent change in video quality of the compression result is expressed as in (18).
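The energy feature and its percent change can be sketched as follows. A direct orthonormal DCT-II is used, and taking ∆Q as the percent change of the mean absolute coefficient follows the description of Eq. (18); the exact equation is assumed.

```python
import numpy as np

def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D signal, computed directly (O(N^2))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    C = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    X = 2.0 * C @ x
    X[0] *= np.sqrt(1.0 / (4 * N))     # orthonormal scaling, DC term
    X[1:] *= np.sqrt(1.0 / (2 * N))    # orthonormal scaling, AC terms
    return X

def energy_feature(frame_gray):
    """Mean absolute DCT coefficient of a flattened gray frame."""
    return float(np.abs(dct2_1d(np.ravel(frame_gray))).mean())

def delta_q(e_original, e_compressed):
    """Percent change of the energy feature (positive = assumed improvement)."""
    return 100.0 * (e_compressed - e_original) / e_original
```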
D̄ is the average absolute value of the DCT coefficients of the original video, and ΔD̄ is the average difference between the absolute coefficient values of the original and compressed videos, whereas ∆Q_c is the percent change of the absolute DCT coefficients between them. If the value is positive, the quality is considered improved, and vice versa; the illustration is shown in Figure 8. The set of DCT coefficients is also used to measure the percent quality change of the decompression result. If D_i and D'_i are the sets of DCT coefficients per frame of the original and decompressed video, the percent change in the quality of the decompression result is the average of ∆Q over all frames.
Another performance parameter used in this study to measure the quality of the video decompression result is the PSNR (Peak Signal-to-Noise Ratio). PSNR provides a quantitative measure of the distortion that occurs during the decompression process. If F and F' are the sets of frames of the original and decompressed video, PSNR is expressed as in (19).
MSE is the cumulative squared error between the decompressed and the original frame, so the PSNR in Eq. (19) is the average PSNR over all frames. If perfect decompression is assumed to have PSNR = 60 dB, equivalent to MSE = 0.065, then the PSNR performance difference is expressed as in (20).
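The per-frame PSNR and its average over a video, as described around Eq. (19), can be sketched as below; 8-bit frames with peak value 255 are assumed.

```python
import numpy as np

def psnr(frame, frame_dec):
    """PSNR (dB) between an original frame and its decompressed version."""
    mse = np.mean((frame.astype(float) - frame_dec.astype(float)) ** 2)
    if mse == 0:
        return float("inf")                     # identical frames
    return 10.0 * np.log10(255.0 ** 2 / mse)

def video_psnr(frames, frames_dec):
    """Average per-frame PSNR over the whole video."""
    return float(np.mean([psnr(f, g) for f, g in zip(frames, frames_dec)]))
```

Note the consistency check with the text: PSNR = 60 dB corresponds to MSE = 255^2 / 10^6 = 0.065.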
3) Total Performance (P)
The smaller the CR, the more parts of the video are successfully compressed. The smaller ∆Q_c, the better the compressed video quality. Thus, the compression performance can be expressed as in (21).
In the same way, the performance of decompression can be expressed as in (22).
Thus, total performance can be expressed as in (23).

E. Sample Video Files
This study used four sample video files. Table 1 shows the specifications of all samples. The samples were obtained by partitioning a long video file into several video files at random sampling positions; the partitioning aims to get video files with an even number of frames.

III. Result and Discussion
In this section, we present the compression of frames 21 and 22 of sample 1 from Table 1 using the "mean" feature. The adaptive FIS training used a target error E_target = 2%. Figure 9 illustrates the compression result using the adaptive FIS, with ∆Q_c = −27.61%. The compression-decompression results of all sample video files from Table 1 are presented in Table 2. Figure 11 shows the CR curves across samples. If the CR curves between samples correlated well visually, the compression result would depend only on the feature type; however, Figure 11 shows that the CR curves between samples are visually uncorrelated. This indicates that the compression result depends on the variation of the compressed features used in the compression process, which in turn depends on the feature type and the number of odd-even frame pairs used for adaptive FIS training. The average compression-decompression results by feature type are presented in Table 3. Overall, the proposed method yields an average CR = 25.39% with total performance P = 80.13%. Of the five selected feature types, the "mean" feature produced the best video decompression, with CR = 25.08% and P = 87.15%. Table 3 also presents various compression performance options by feature type as needed. The "mean(mad)" feature produced the best compression ratio (CR = 24.68%), making it suitable for storage efficiency requirements. The "std" feature produced the smallest quality change in the compression process (∆Q_c = 10.39%), making it suitable for video transmission without decompression; its best compression performance (P_c = 81.69%) also indicates this.

IV. Conclusion
In general, the results of this study prove that inter-frame-based video compression can be performed by applying feature compression to all odd-even frame pairs using an adaptive FIS. The resulting compression ratio depends on the type of feature used and the number of odd-even frame pairs used for adaptive FIS training. The proposed method yields an average CR = 25.39% with total performance P = 80.13%. Of the five selected feature types, the "mean" feature produced the best video decompression, with CR = 25.08% and P = 87.15%. The results also present various compression performance options by feature type as needed: the "mean(mad)" feature answers the need for storage efficiency, and the "std" feature suits video transmission without a decompression process. Further studies will focus on increasing the variety and uniqueness of compressed features through other features and frame selection techniques for adaptive FIS training, with the expected impact of improving inter-frame-based video compression performance.

Fig. 5 .
Fig. 5. The universe of discourse partition process: (a) a fuzzy set with three linguistic values; (b) the three-partition result of its universe of discourse; (c) a fuzzy set with five linguistic values; (d) the five-partition result of its universe of discourse. From Figure 5(b), partitioning the universe of discourse with three linguistic values gives: label 1: {0 … 0.33}, label 2: {0.34 … 0.66}, label 3: {0.67 … 1}. Suppose there is a pair of crisp input features A and B. The labeling result using three linguistic values is shown in Figure 6(a), where label(i) = round((label_A(i) + label_B(i))/2). The rule base generated from the labeling results is shown in Figure 6(b).

Fig. 6 .
Fig. 6. The rule-base generation: (a) an example of a labeling process; (b) the result. The percent decompression error relative to the original determines the termination of the adaptive FIS training process: training continues while the error exceeds E_target, and stops otherwise.

Fig. 11 .
Fig. 11.The comparison of CR between samples

Table 2 .
The results of the compression-decompression process

Table 3 .
The average of the compression-decompression results by feature type