
Beyond Linear Classification#

Linear classifiers, including Support Vector Machines (SVMs), face fundamental limitations when confronted with certain datasets. In this section, we demonstrate these limitations and show how neural networks provide an elegant solution.

We start with the classic XOR dataset as an example of a problem that is not linearly separable.

The XOR Problem#

The XOR (exclusive or) dataset is a standard example used to illustrate the limitations of linear classification. The XOR function is defined as:

\[\begin{split} \text{XOR}(x_1, x_2) = \begin{cases} 1, & \text{if } x_1 \neq x_2 \\ 0, & \text{if } x_1 = x_2 \end{cases} \end{split}\]

When plotted in a 2D space, the points corresponding to the XOR function cannot be separated by a single straight line (or hyperplane). Linear classifiers, including linear SVMs, create decision boundaries that are straight lines. Therefore, they cannot correctly classify the XOR dataset.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the style for the plots
sns.set_style("whitegrid")
# Define the XOR dataset
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y = np.array([0, 1, 1, 0])

# Create colors for the two classes
colors = ['red' if label == 0 else 'blue' for label in y]

# Plot the XOR dataset
plt.figure(figsize=(6, 6))
plt.scatter(X[:, 0], X[:, 1], c=colors, s=100, edgecolor='k')
plt.title("XOR Dataset")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
sns.despine()
plt.show()

Linear classifiers attempt to find a boundary—such as a straight line or hyperplane—that separates classes. In the XOR case, no single straight line can separate the points where XOR is 1 from those where XOR is 0. Even if we stack multiple linear transformations, the resulting function remains linear:

\[ \text{linear}(\text{linear}(x)) = \text{linear}(x) \]

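This collapse is easy to verify numerically: composing two affine layers yields exactly one affine layer with combined weights. A minimal sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two affine layers: z1 = W1 x + b1, then z2 = W2 z1 + b2
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

x = rng.normal(size=2)

# Applying the layers one after the other...
stacked = W2 @ (W1 @ x + b1) + b2

# ...is identical to a single affine layer with collapsed parameters
W, b = W2 @ W1, W2 @ b1 + b2
collapsed = W @ x + b

print(np.allclose(stacked, collapsed))  # True
```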
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Define the XOR dataset
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
# XOR labels encoded as -1/+1, as expected by the hinge loss
y = np.array([-1, 1, 1, -1])

# Stack several purely linear layers: their composition is still a linear map
model = Sequential([
    Input(shape=(2,)),
    Dense(1, activation='linear'),
    Dense(10, activation='linear'),
    Dense(1, activation='linear')
])

model.compile(loss='hinge', optimizer='sgd')

# Train the model
model.fit(X, y, epochs=1000, verbose=0)

# Evaluate the model on the XOR dataset
predictions = model.predict(X)
print("Predictions on the XOR dataset:")
print(predictions)
Predictions on the XOR dataset:
[[ 0.        ]
 [-0.00534445]
 [ 0.629936  ]
 [ 0.62459147]]

Non-Linearity#

To classify datasets that are not linearly separable, neural networks introduce non-linear activation functions. By adding at least one hidden layer with a non-linear activation function, the network transforms the original input space into a new, intermediate representation.

This intermediate representation makes the XOR dataset linearly separable. Each neuron in the hidden layer learns to recognize specific patterns or clusters within the inputs, effectively reshaping the input data into a form that can be linearly separated by the output layer.

Example: XOR Classification#

Consider a neural network with the following structure that can correctly classify the XOR dataset:

  • Input Layer: Two neurons, receiving inputs \((x_1, x_2)\)

  • Hidden Layer: Two neurons, each with a non-linear activation function

  • Output Layer: One neuron, producing the XOR classification result

Typically, the hidden neurons perform transformations such as:

  • One hidden neuron might activate whenever at least one input is 1, acting like a logical OR.

  • The other hidden neuron might activate only when both inputs are 1, acting like a logical AND.

As a result, the output neuron receives transformed inputs from the hidden layer that are now linearly separable, enabling correct classification of the XOR dataset, which is impossible with a single-layer network.
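This construction can be written out by hand. A minimal sketch assuming ReLU hidden units and manually chosen (not learned) weights — one hidden unit acting roughly as OR, the other as AND, and an output that subtracts the AND case:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def xor_net(x1, x2):
    # Hidden layer (hand-picked weights, not trained):
    h1 = relu(x1 + x2)        # ~ logical OR: fires for (0,1), (1,0), (1,1)
    h2 = relu(x1 + x2 - 1)    # ~ logical AND: fires only for (1,1)
    # Output layer: "OR minus twice AND" reproduces XOR exactly
    return h1 - 2 * h2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"XOR({x1}, {x2}) = {xor_net(x1, x2)}")
```

The point of the sketch is that the hidden layer's transformed representation (h1, h2) is linearly separable even though the raw inputs are not.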

Activation Functions#

An activation function is a mathematical function applied to the output of a neuron (or node) in a neural network. Its primary purpose is to introduce non-linearity into the model, which allows the network to learn and represent complex patterns.

After the neuron computes a weighted sum of its inputs, the activation function transforms this sum into the neuron’s output. Mathematically, if the weighted sum is given by:

\[ z = W \cdot x + b, \]

then the neuron’s output is:

\[ a = \phi(z), \]

where \(\phi\) is the activation function.

Think of an activation function as a “decision maker” inside each brain cell (neuron) of our artificial brain (neural network). After a neuron collects information from other neurons, the activation function decides how strongly this neuron should “fire” or respond. Without activation functions, our neural network would just be doing fancy math that can only draw straight lines through data. But real-world problems rarely fit straight lines!
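A single neuron’s forward pass can be sketched in a few lines (the weights below are arbitrary illustrative values, not learned):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Illustrative parameters for one neuron with two inputs
W = np.array([0.5, -1.2])
b = 0.1
x = np.array([1.0, 0.5])

z = W @ x + b      # weighted sum of the inputs
a = sigmoid(z)     # activation transforms the sum into the output
print(a)           # 0.5 (z happens to be ~0 for these values)
```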

Common Activation Functions#

Sigmoid: The “S-curve”#

The sigmoid function transforms any input value into a number between 0 and 1, creating an S-shaped curve. This property makes it especially valuable for binary classification problems—situations where the network needs to decide between two options (like “spam or not spam”). Think of sigmoid outputs as confidence levels or probabilities:

  • Values close to 1 mean “strongly yes”

  • Values close to 0 mean “strongly no”

  • Values around 0.5 indicate uncertainty

The sigmoid function is defined as:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

The benefits of the sigmoid function:

  • Makes Sense as Probability: Outputs between 0 and 1 can be interpreted as “how likely” something is.

  • Smooth and Gradual: The gentle S-curve helps the network make fine distinctions between values.

  • Mathematically Friendly: Its derivative is easy to calculate, which makes training more efficient.

While useful, sigmoid has a key drawback: the vanishing gradient problem. For inputs with large absolute values (very positive or very negative), the function’s slope becomes nearly flat. This means:

  • Network training can slow down significantly

  • Deep networks with many sigmoid layers may struggle to learn

This limitation explains why, despite its intuitive appeal, sigmoid is now primarily used in output layers for binary classification rather than throughout entire networks.
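The flattening of the slope is easy to check numerically. The sigmoid’s derivative is \(\sigma(z)(1 - \sigma(z))\), which peaks at 0.25 at \(z = 0\) and collapses toward zero for large \(|z|\):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

# The slope shrinks from 0.25 toward (nearly) zero as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}  slope = {sigmoid_grad(z):.6f}")
```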

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Define the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate a range of input values
x = np.linspace(-10, 10, 400)

# Compute sigmoid activations over the input range
y_sigmoid = sigmoid(x)

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(x, y_sigmoid, label='Sigmoid', linewidth=2)

# Add labels and title
plt.xlabel('x')
plt.ylabel('Activation')
plt.title('Sigmoid Activation Function')
plt.legend(frameon=False)
plt.grid(False)
sns.despine()
plt.show()

Tanh: The “Centered S-curve”#

The hyperbolic tangent function (\(\tanh\)) is a powerful activation function that produces values between \(-1\) and \(1\). While structurally similar to sigmoid, tanh’s zero-centered output range offers significant advantages for neural network training.

The tanh function is defined as:

\[ \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} \]

The key benefits of tanh:

  • Zero-centered output: Unlike sigmoid (which outputs 0 to 1), tanh is centered around zero (-1 to 1).

  • Steeper gradient: The stronger slope helps networks learn faster in many situations.

  • Reduces bias: Zero-centered outputs prevent systematic bias in the next layer’s neurons.

When an activation function’s outputs are centered around zero:

  • The gradients are less likely to zigzag during training

  • Weight updates become more efficient and direct

  • The network can more easily learn both positive and negative correlations

For these reasons, tanh is often preferred over sigmoid for hidden layers in neural networks. However, tanh still suffers from the vanishing gradient problem at extreme values, just like sigmoid.

Think of tanh as an upgraded version of sigmoid that helps your network train more efficiently while maintaining the smooth, S-curve behavior that works well for many applications.
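One way to see the connection between the two functions: tanh is a rescaled, shifted sigmoid, \(\tanh(z) = 2\sigma(2z) - 1\), which is exactly where its zero-centered range comes from. A quick numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-5, 5, 11)
# tanh(z) equals 2 * sigmoid(2z) - 1, checked elementwise
match = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print(match)  # True
```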

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

# Generate a range of input values
x = np.linspace(-10, 10, 400)

# Compute tanh activations over the input range
y_tanh = np.tanh(x)

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(x, y_tanh, label='Tanh', linewidth=2)

# Add labels and title
plt.xlabel('x')
plt.ylabel('Activation')
plt.title('Tanh Activation Function')
plt.legend(frameon=False)
plt.grid(False)
sns.despine()
plt.show()

ReLU: The “Keep it if it’s positive”#

The Rectified Linear Unit (ReLU) has become the most widely used activation function in modern neural networks. Its brilliance lies in its simplicity:

\[ \text{ReLU}(z) = \max(0, z) \]

ReLU follows an incredibly straightforward rule:

  • For positive inputs: “Pass them through unchanged”

  • For negative inputs: “Convert them to zero”

This creates a hockey-stick shaped function that combines linear and non-linear properties. ReLU is computationally efficient, easy to optimize, and works well in practice. It has become the default choice for many types of neural networks.

ReLU’s introduction helped enable deep networks with many layers that were previously difficult to train. It’s a perfect example of how sometimes the simplest solutions work best in machine learning.

While ReLU has variants like Leaky ReLU to address its “dead neuron” problem (when neurons get stuck at 0), the original ReLU remains the default choice for most deep learning applications due to its balance of simplicity and effectiveness.
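For comparison, Leaky ReLU keeps a small slope for negative inputs instead of zeroing them out, so neurons can never get permanently stuck at 0; the slope value 0.01 below is a common choice, used here as an assumption:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    # Negative inputs keep a small slope (alpha) instead of dying to zero
    return np.where(z > 0, z, alpha * z)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(relu(z))        # negatives clipped to 0, positives unchanged
print(leaky_relu(z))  # negatives scaled by alpha, positives unchanged
```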

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

def relu(x):
    return np.maximum(0, x)

# Generate a range of input values
x = np.linspace(-10, 10, 400)

# Compute ReLU activations over the input range
y_relu = relu(x)

# Create the plot
plt.figure(figsize=(10, 6))
plt.plot(x, y_relu, label='ReLU', linewidth=2)

# Add labels and title
plt.xlabel('x')
plt.ylabel('Activation')
plt.title('ReLU Activation Function')
plt.legend(frameon=False)
plt.grid(False)
sns.despine()
plt.show()

Try using different activation functions in the code below to see how they affect the decision boundary for the XOR dataset.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import SGD

# XOR dataset
X = np.array([[0,0], [0,1], [1,0], [1,1]])
y = np.array([[0], [1], [1], [0]])

def train_xor(hidden_neurons, activation_fnc):
    model = Sequential([
        Input(shape=(2,)),  
        Dense(hidden_neurons,  activation=activation_fnc),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=SGD(learning_rate=0.6), loss='mse', metrics=['accuracy'])
    model.fit(X, y, epochs=2000, verbose=0)
    loss, acc = model.evaluate(X, y, verbose=0)
    print(f'Activation function {activation_fnc}, Hidden neurons: {hidden_neurons}, Accuracy: {acc:.2f}')
    predictions = model.predict(X)
    print(f'Predictions:\n{np.round(predictions, 2)}\n')

for activation_fnc in ['sigmoid', 'tanh', 'relu']:
    for neurons in [1, 2, 3]:
        train_xor(neurons, activation_fnc)