Understanding Baseball Odd/Even - 1 Betting
Baseball Odd/Even - 1 betting is a popular form of wagering that focuses on predicting whether the total number of runs scored in a game will be odd or even. This type of bet adds an exciting layer to the traditional baseball betting experience, offering fans and bettors alike a unique way to engage with the sport. In this comprehensive guide, we'll delve into the intricacies of Odd/Even - 1 betting, explore expert predictions, and provide insights to help you make informed decisions.
What is Odd/Even - 1 Betting?
In Odd/Even - 1 betting, you predict whether the total runs scored by both teams in a game will be odd or even. Unlike traditional moneyline bets, which focus on predicting the winner, this bet centers on the aggregate score. It's a straightforward yet strategic form of wagering that can yield significant returns if predicted correctly.
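For concreteness, here is a tiny sketch (plain Python, with made-up scores) of how a final score settles this bet:

```python
# A final score resolves the bet by the parity of the combined run total.
# Scores below are illustrative, not real game data.
def odd_even_result(home_runs: int, away_runs: int) -> str:
    total = home_runs + away_runs
    return "odd" if total % 2 == 1 else "even"

print(odd_even_result(4, 3))  # 7 total runs -> "odd"
print(odd_even_result(2, 4))  # 6 total runs -> "even"
```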
Why Choose Odd/Even - 1 Betting?
- Simplicity: The concept is easy to grasp—just decide if the total score will be odd or even.
- Engagement: Adds an extra layer of excitement to watching games, as every run scored impacts your prediction.
- Potential for High Returns: Because this market is less heavily bet than the moneyline, the posted odds can sometimes offer higher payouts.
Key Factors Influencing Odd/Even - 1 Outcomes
To excel in Odd/Even - 1 betting, understanding various factors that influence game outcomes is crucial. Here are some key elements to consider:
Pitching Matchups
The effectiveness of pitchers plays a significant role in determining the total runs scored. Analyze pitcher statistics such as ERA (Earned Run Average), WHIP (Walks plus Hits per Inning Pitched), and recent performance trends. A strong pitching matchup usually points to a low-scoring game, which narrows the range of totals you need to weigh when judging whether the final score will land odd or even.
Batting Lineups
The offensive capabilities of each team's lineup are equally important. Look at team batting averages, on-base percentages, and slugging percentages. Teams with strong offensive stats tend to produce higher-scoring games, where a single late run can flip the total between odd and even.
Injuries and Player Availability
Injuries can significantly impact team performance. Check injury reports and player availability before placing your bet. A missing star player could alter the expected outcome dramatically.
Historical Performance
Review past performances between the two teams. Historical data can reveal patterns or tendencies that might influence future games.
Betting Strategies for Success
Analyzing Game Conditions
Weather conditions can affect gameplay significantly. For instance, wind direction might influence home run potential, impacting whether the total score ends up odd or even.
Leveraging Expert Predictions
# Support Vector Machine
## Overview
- An algorithm that finds the optimal separating hyperplane for classification
- Designed to be extensible, so it can be applied to many different kinds of datasets
## Definition
### Linear SVM
- Hyperplane: $w^T x + b = 0$
- The boundary separating the two classes is $w^T x + b = 0$, and the distance between the two classes is $\frac{2}{\lVert w \rVert}$
- The goal is to find the $w$ and $b$ that maximize this distance
### Nonlinear SVM
- Many datasets cannot be separated linearly
- A nonlinear SVM can be implemented using the kernel trick
- Kernel trick: a technique that obtains the result of a computation in a high-dimensional feature space while actually computing in the original data space, improving performance
## Mathematical Definition and Explanation
### The Maximum-Margin Problem
#### Basic Concepts
- The distance between points belonging to the two different classes is called the margin
- The margin is defined as the distance between the two points closest to the hyperplane $w^T x + b = 0$
- The margin is expressed as $\frac{2}{\lVert w \rVert}$ (it depends only on $w$)
#### Problem Statement

$$\text{minimize } \frac{1}{2}\lVert w \rVert^2$$
$$\text{s.t. } y_i(w^T x_i + b) \geqslant 1, \quad i = 1, \dots, m$$
To solve this problem, we can use Lagrange multipliers to transform it as follows:
$$L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$
where $\alpha_i \geqslant 0$.
The maximum-margin problem therefore becomes:
$$\text{minimize } L(w, b, \alpha) = \frac{1}{2}\lVert w \rVert^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i(w^T x_i + b) - 1 \right]$$
$$\text{s.t. } y_i(w^T x_i + b) \geqslant 1, \quad i = 1, \dots, m$$
$$\text{s.t. } \alpha_i \geqslant 0, \quad i = 1, \dots, m$$
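As a numeric illustration, here is a minimal sketch of this problem solved with scikit-learn (an assumption: the library is available; the toy points are made up, and a very large `C` approximates the hard-margin formulation above):

```python
# Minimal sketch of the maximum-margin problem on toy 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],   # class -1
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])  # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # C -> inf ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```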
#### Karush-Kuhn-Tucker Conditions (KKT Conditions)

Applying the KKT conditions to the problem above, we compute $\nabla_w L$, $\nabla_b L$, and the complementary slackness condition:

$$\nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \quad\Rightarrow\quad w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
$$\nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \quad\Rightarrow\quad \sum_{i=1}^{m} \alpha_i y_i = 0$$
$$\alpha_i \left[ y_i(w^T x_i + b) - 1 \right] = 0, \quad i = 1, \dots, m$$

The values of $(w, b, \alpha)$ that satisfy all of these equations solve the problem, which gives us the $(w, b)$ we are after.

The important concept here is the **support vector**.
Support vectors are the points for which the constraint holds with equality, $y_i(w^T x_i + b) = 1$.
Support vectors lie exactly on the margin; for every point not on the margin, complementary slackness forces $\alpha_i = 0$, so only the support vectors matter.
Therefore $(w, b)$ can be computed by considering only the support vectors:
$$w = \sum_{i \in SV} \alpha_i y_i x_i$$
The key quantity in this expression is $\sum_{i=1}^{m} \alpha_i y_i x_i$, which appears in the **decision function**:
$$f(x) = w^T x + b = \sum_{i=1}^{m} \alpha_i y_i \, x_i^T x + b$$
The decision function is used to decide which class an input $x$ belongs to: $x$ is assigned to one class when $f(x)$ is positive and to the other class when it is negative.
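A small sketch of evaluating this decision function directly from fitted support vectors (scikit-learn assumed; its `dual_coef_` attribute stores the products $\alpha_i y_i$; the toy data are made up):

```python
# Evaluate f(x) = sum_i (alpha_i y_i) x_i^T x + b from the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 2.5], [4.0, 4.5], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

x_new = np.array([3.0, 3.5])
f = clf.dual_coef_[0] @ (clf.support_vectors_ @ x_new) + clf.intercept_[0]
print("f(x) =", f, "-> class", 1 if f > 0 else -1)
# Agrees with the library's own evaluation:
print(np.isclose(f, clf.decision_function([x_new])[0]))
```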
### Nonlinear SVM and the Kernel Trick
#### Nonlinear SVM Definition
A nonlinear SVM can be approached in exactly the same way as above.
The only difference is that the input $x$ is transformed by a mapping function $\phi(x)$, and the hyperplane (i.e., the decision function) is found in terms of $\phi(x)$.
So how should $\phi$ be chosen?
One option is to try candidate functions one by one (for example, with a symbolic computing program) and keep whichever gives good results, but a better method exists.
#### The Kernel Trick, Summarized and Explained
Kernel trick: a technique that obtains the result of a feature-space computation while actually computing in the original data space.
Computing the decision function directly as $\sum_{i=1}^{m} \alpha_i y_i \, \phi(x_i)^T \phi(x)$ requires a large amount of computation (it works in the feature space).
But if we replace that inner product with a kernel function $K(x, x')$ and compute $K$ instead, we obtain the same result with far less computation (it works in the original data space).
In other words, the kernel trick reduces the amount of computation dramatically.
What we need to know:
**Feature space**: the space of inputs transformed by the mapping function $\phi$.
Example: if the original input is $x = (x_1, x_2, x_3, x_4)^T$ and the mapping is $\phi(x) = (x_1, x_2, x_3, x_4, x_1x_2, x_1x_3, x_1x_4, x_2x_3, x_2x_4, x_3x_4)^T$, the feature space is 10-dimensional (the 4 original coordinates plus the 6 pairwise products).
A hyperplane in the feature space corresponds to a nonlinear boundary in the original space, and richer mappings produce richer boundaries, so the choice of mapping controls model complexity.
The kernel trick lets us work with these feature-space hyperplanes without ever constructing the feature space explicitly, so both complexity control and a reduction in computation are possible.
**Kernel function**: the function needed to obtain the same result as the computation in the feature space.
Example: for the mapping $\phi$ above, the kernel function is the dot product of the two feature vectors,
$$K(x, x') = \phi(x)^T \phi(x') = \sum_{i=1}^{4} x_i x_i' + \sum_{i<j} (x_i x_j)(x_i' x_j')$$
so the dot product in the feature space can be computed with the kernel function $K$ instead.
The exact form is not the point; the example is only meant to convey the idea.
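A minimal numeric check of this idea, using a smaller quadratic map $\phi: \mathbb{R}^2 \to \mathbb{R}^3$ (chosen for brevity rather than the 10-dimensional example above), whose feature-space dot product collapses to $(x^T x')^2$:

```python
# Kernel trick demo: phi's dot product equals (x . x')^2, so the
# feature space never needs to be built explicitly.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def kernel(a, b):
    return (a @ b) ** 2  # computed entirely in the original space

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])
print(phi(a) @ phi(b))   # dot product in feature space: 16.0
print(kernel(a, b))      # same value, cheaper to compute: 16.0
```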
**Kernel trick theorem**: the kernel trick theorem guarantees that any kernel function $K$ of the form $K(x, x') = \phi(x)^T \phi(x')$ satisfies

$$K(a, a) \geqslant 0 \qquad\text{and}\qquad K(a, b) = K(b, a)$$

Proof:
$$K(a, a) = \phi(a)^T \phi(a) = \lVert \phi(a) \rVert^2 \geqslant 0$$
$$K(a, b) = \phi(a)^T \phi(b) = \phi(b)^T \phi(a) = K(b, a)$$
Therefore the kernel trick theorem holds.
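Both properties can be checked numerically on a Gram matrix $K_{ij} = K(x_i, x_j)$; a small sketch with made-up data:

```python
# Numeric check of the two kernel properties on a Gram matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
K = (X @ X.T) ** 2  # Gram matrix of the quadratic kernel K(a, b) = (a . b)^2

print(np.all(np.diag(K) >= 0))  # K(a, a) >= 0
print(np.allclose(K, K.T))      # K(a, b) = K(b, a)
```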
# Perceptron Algorithm
## Overview
* One-layer neural network algorithm
* Binary classification algorithm
## Definition
### Single Layer Neural Network
* Single layer neural network model consists of input layer and output layer only
* Input layer has n nodes corresponding to n features in data set
* Output layer has one node corresponding to one output value
### Perceptron Model
* Perceptron model uses weighted sum followed by step activation function
### Step Activation Function
* The step activation function takes any real input in $(-\infty, +\infty)$ and maps it to one of two discrete output values
* If the input value >= threshold, the output becomes the 'positive' class (+1); otherwise it becomes the 'negative' class (-1)
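A one-line sketch of this function (threshold zero and class labels encoded as +1/-1, as above):

```python
# Step activation: +1 when the weighted sum reaches the threshold, else -1.
def step(z: float, threshold: float = 0.0) -> int:
    return 1 if z >= threshold else -1
```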
### Weight Update Rule
* Weight update rule is based on perceptron learning rule
* If actual output == expected output , no change in weight values
* If actual output != expected output , weights are updated according to following equation
* $$
W_{new} = W_{old} + \Delta W = W_{old} + \eta\,(y - y')X
$$
$$
\begin{pmatrix} W_0 \\ W_1 \\ W_2 \\ \vdots \\ W_n \end{pmatrix}_{new}
=
\begin{pmatrix} W_0 \\ W_1 \\ W_2 \\ \vdots \\ W_n \end{pmatrix}_{old}
+ \eta\,(y - y')
\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}
=
\begin{pmatrix} W_0 + \eta(y-y')X_0 \\ W_1 + \eta(y-y')X_1 \\ W_2 + \eta(y-y')X_2 \\ \vdots \\ W_n + \eta(y-y')X_n \end{pmatrix}
$$
## Mathematical Definition & Explanation
### Problem Statement
Given a training dataset of m tuples with n features each, find the optimal weights ($W_{optimal}$) so that the perceptron classifies all training tuples correctly.
* Each training tuple is a feature vector $X = [X_0, X_1, X_2, \dots, X_n]$, where $X_0 = 1$ is the constant input for the bias term
* Each tuple has an associated class label $Y(X) \in \{+ve, -ve\}$
* Goal: find the optimal weights $W = [W_0, W_1, W_2, \dots, W_n]$, where $W_0$ is the bias term
### Finding Optimal Weights with the Perceptron
Step # | Description | Equation | Comment
--- | --- | --- | ---
Step #01 | Initialize weights randomly with small values | $W_i = rand(-smallValue, +smallValue),\ i = 0, \dots, n$ | Initial random weight values should be small so as not to hinder the convergence process
Step #02 | Calculate the weighted sum $Z(W,X)$: the dot product of the weight vector and the input feature vector | $Z(W,X) = W \cdot X = \sum_{i=0}^{n} W_i X_i = W_0 X_0 + W_1 X_1 + \dots + W_n X_n$ | $X_0 = 1$, so $W_0$ acts as the bias term
Step #03 | Apply the step activation function $Y(W,X)$: classify the weighted sum into the positive/negative class according to the threshold value (usually zero) | $Y(W,X) = \mathrm{step}(Z(W,X)) = \begin{cases} +1 & Z(W,X) \geqslant threshold \\ -1 & \text{otherwise} \end{cases}$ | |
Step #04 | If the predicted class differs from the actual class, update the weights with the perceptron learning rule | $W_{new} = W_{old} + \eta\,(y - y')X$ | Repeat Steps #02-#04 over all tuples until none is misclassified
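A compact, runnable sketch of Steps #01-#04 (the toy AND-style data and learning rate are illustrative choices, not part of the algorithm):

```python
# Perceptron training following the table above.
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=50):
    X = np.hstack([np.ones((len(X), 1)), X])        # prepend X_0 = 1 (bias input)
    w = np.random.uniform(-0.01, 0.01, X.shape[1])  # Step 1: small random weights
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w @ xi                       # Step 2: weighted sum Z(W, X)
            y_hat = 1 if z >= 0 else -1      # Step 3: step activation
            w += eta * (yi - y_hat) * xi     # Step 4: perceptron update rule
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])  # AND-like labels, linearly separable
print(train_perceptron(X, y))
```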
# Logistic Regression Algorithm
The weight update rule for logistic regression is based on the gradient descent algorithm instead of the perceptron learning rule used by the perceptron.
Gradient descent updates the weights iteratively until the cost/error reaches its minimum, the point where the slope becomes zero.
Gradient descent works best when the cost/error surface curve is bowl-shaped (convex).
Logistic regression uses a logarithmic cost/error function that is convex throughout its domain, so gradient descent can reach its global minimum.
Initial random weight values should be small: very large or very small starting values place the search in badly scaled regions of the surface where convergence is slow, while small values keep it in a well-behaved region where convergence is fast.
Hence it is advisable to use small random numbers as the initial weight values for faster convergence.
The logarithmic cost/error function never becomes negative; it ranges over $[0, +\infty)$.
It does become infinite in certain regions (as the predicted probability of the true class approaches zero).
Note:
Logistic regression differs from other models trained by gradient descent only in its loss function (cost/error function); all other steps remain the same.
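A minimal sketch of logistic regression trained by gradient descent on the convex log loss (toy one-feature data; the learning rate and epoch count are arbitrary choices):

```python
# Logistic regression via gradient descent on the log loss.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, epochs=1000):
    X = np.hstack([np.ones((len(X), 1)), X])  # bias column X_0 = 1
    w = np.zeros(X.shape[1])                  # small initial weights
    for _ in range(epochs):
        p = sigmoid(X @ w)                    # predicted probabilities
        grad = X.T @ (p - y) / len(y)         # gradient of the log loss
        w -= alpha * grad                     # gradient descent step
    return w

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(train_logreg(X, y))  # bias and slope place the boundary between the groups
```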
# Gradient Descent Algorithm
## Overview
The gradient descent algorithm finds a minimum point, iteration after iteration, by moving along the slope direction (the line tangent at the current point). The slope direction changes with every iteration.
## Definition
The gradient descent algorithm moves along the slope at the current point and stops iterating when it reaches a minimum point, where the slope becomes zero.
Mathematically speaking:
Given a cost/error function $J(\theta)$, find $\theta$ such that $\frac{dJ(\theta)}{d\theta} = 0$.
Note:
If the derivative of $J(\theta)$ never settles at zero (e.g., the slope keeps changing sign back and forth around the minimum), the algorithm stops after some fixed number of iterations instead.
The cost/error surface curve need not be bowl-shaped. If it is not, gradient descent may get stuck in a local minimum point, converge slowly, or never converge; to avoid such situations, variations of the gradient descent algorithm are used.
If the cost/error surface curve is bowl-shaped (convex), gradient descent converges quickly; to ensure good convergence we then only need an appropriate learning rate.
Note:
An appropriate learning rate matters in both cases: on a bowl-shaped surface it is essentially all that is needed, whereas on a surface that is not bowl-shaped a good learning rate alone cannot guarantee convergence.
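A tiny sketch of one bowl-shaped case, $J(\theta) = (\theta - 3)^2$ (a made-up cost, chosen so the answer is known in advance):

```python
# Gradient descent on J(theta) = (theta - 3)^2; J'(theta) = 2 * (theta - 3).
# The slope shrinks to zero as theta approaches the minimum at theta = 3.
alpha, theta = 0.1, 10.0       # learning rate and starting point (assumptions)
for _ in range(100):
    grad = 2 * (theta - 3)     # slope of the tangent at the current point
    theta -= alpha * grad      # step against the slope
print(theta)                   # ~= 3.0
```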
# Batch Gradient Descent Algorithm
Batch gradient descent performs one parameter update per epoch: each update uses the entire training dataset. For this reason its parameter updates are slow, and it requires more epochs than stochastic gradient descent.
Batch GD formula:
$$\theta := \theta - \alpha \nabla_\theta J(\theta)$$
where:
* $\alpha$ is the learning rate
* $\nabla_\theta J(\theta)$ is the gradient of $J(\theta)$, which may also be written in terms of partial derivatives
Note:
$$\nabla_\theta J(\theta) = \left( \frac{\partial J(\theta)}{\partial \theta_0}, \frac{\partial J(\theta)}{\partial \theta_1}, \frac{\partial J(\theta)}{\partial \theta_2}, \dots, \frac{\partial J(\theta)}{\partial \theta_n} \right)$$
Example:
Consider a linear regression problem with two variables $X_1$ and $X_2$:
$$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2$$
where $\beta_0$ is the intercept term and $\beta_1, \beta_2$ are the coefficients (weights) of $X_1$ and $X_2$.
The hypothesis for the $i$-th example is
$$h_\beta(X^{(i)}) = \beta_0 + \beta_1 X_1^{(i)} + \beta_2 X_2^{(i)} = \beta^T X^{(i)}$$
where $X^{(i)} = (X_0^{(i)}, X_1^{(i)}, X_2^{(i)})^T$, with the constant term $X_0^{(i)} = 1$.
The training dataset consists of $m$ training examples $\{(X^{(i)}, Y^{(i)})\}_{i=1}^{m}$.
The loss function is the mean squared error (MSE):
$$J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\beta(X^{(i)}) - Y^{(i)} \right)^2$$
The optimization problem is to minimize $J(\beta)$ over the parameters $\beta$.
The partial derivative of the loss with respect to each parameter $\beta_j$ is
$$\frac{\partial J(\beta)}{\partial \beta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\beta(X^{(i)}) - Y^{(i)} \right) X_j^{(i)}$$
so the batch GD update, applied simultaneously to every parameter with learning rate $\alpha$, is
$$\beta_j := \beta_j - \alpha \cdot \frac{1}{m} \sum_{i=1}^{m} \left( h_\beta(X^{(i)}) - Y^{(i)} \right) X_j^{(i)}, \quad j = 0, 1, 2$$
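A minimal numpy sketch of this worked example (the data are synthetic and the true coefficients are made up, so the result can be checked):

```python
# Batch gradient descent for yhat = b0 + b1*x1 + b2*x2 under the MSE cost.
# One parameter update per epoch, using the entire (synthetic) training set.
import numpy as np

rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # [1, x1, x2]
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.01 * rng.normal(size=100)

beta, alpha = np.zeros(3), 0.1
for _ in range(500):
    grad = X.T @ (X @ beta - y) / len(y)  # (1/m) * sum of (h(x) - y) * x
    beta -= alpha * grad                  # simultaneous update of b0, b1, b2
print(beta)  # approaches [2.0, -1.0, 0.5]
```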
# Mini-Batch Gradient Descent Algorithm
Mini-batch gradient descent performs one parameter update per mini-batch: each update uses only a small batch of the training data. For this reason its parameter updates are fast, and it requires fewer epochs than batch gradient descent, though more than stochastic gradient descent. Mini-batch gradient descent therefore falls somewhere between stochastic gradient descent and batch gradient descent.
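A sketch of the mini-batch loop, reusing the batch-GD gradient from the example above (`batch_size` is a tunable assumption; `X` is expected to carry a leading column of ones):

```python
# One update per shuffled mini-batch instead of one per epoch.
import numpy as np

def minibatch_gd(X, y, alpha=0.1, epochs=50, batch_size=16):
    rng = np.random.default_rng(0)
    beta = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(len(y))          # reshuffle each epoch
        for start in range(0, len(y), batch_size):
            b = idx[start:start + batch_size]  # indices of one mini-batch
            grad = X[b].T @ (X[b] @ beta - y[b]) / len(b)
            beta -= alpha * grad               # one update per mini-batch
    return beta
```

With the synthetic data from the batch example, `minibatch_gd(X, y)` recovers roughly the same coefficients in far fewer epochs, since each epoch now contains many updates.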