# Dealing with sparse or missing data

# Working with columns containg NULL values

In MySQL and other SQL dialects, NULL values have special properties.

Consider the following table containing job applicants, the companies they worked for, and the date they left the company. NULL indicates that an applicant still works at the company:

CREATE TABLE example
(`applicant_id` INT, `company_name` VARCHAR(255), `end_date` DATE);

+--------------+-----------------+------------+
| applicant_id | company_name    | end_date   |
+--------------+-----------------+------------+
|            1 | Google          | NULL       |
|            1 | Initech         | 2013-01-31 |
|            2 | Woodworking.com | 2016-08-25 |
|            2 | NY Times        | 2013-11-10 |
|            3 | NFL.com         | 2014-04-13 |
+--------------+-----------------+------------+

Your task is to compose a query that returns all rows after 2016-01-01, including any employees that are still working at a company (those with NULL end dates). This select statement:

SELECT * FROM example WHERE end_date > '2016-01-01';

fails to include any rows with NULL values:

+--------------+-----------------+------------+
| applicant_id | company_name    | end_date   |
+--------------+-----------------+------------+
|            2 | Woodworking.com | 2016-08-25 |
+--------------+-----------------+------------+

Per the MySQL documentation (opens new window), comparisons using the arithmetic operators <, >, =, and <> themselves return NULL instead of a boolean TRUE or FALSE. Thus a row with a NULL end_date is neither greater than 2016-01-01 nor less than 2016-01-01.

This can be solved by using the keywords IS NULL:

SELECT * FROM example WHERE end_date > '2016-01-01' OR end_date IS NULL;

+--------------+-----------------+------------+
| applicant_id | company_name    | end_date   |
+--------------+-----------------+------------+
|            1 | Google          | NULL       |
|            2 | Woodworking.com | 2016-08-25 |
+--------------+-----------------+------------+

Working with NULLs becomes more complex when the task involves aggregation functions like MAX() and a GROUP BY clause. If your task were to select the most recent employed date for each applicant_id, the following query would seem a logical first attempt:

SELECT applicant_id, MAX(end_date) FROM example GROUP BY applicant_id;

+--------------+---------------+
| applicant_id | MAX(end_date) |
+--------------+---------------+
|            1 | 2013-01-31    |
|            2 | 2016-08-25    |
|            3 | 2014-04-13    |
+--------------+---------------+

However, knowing that NULL indicates an applicant is still employed at a company, the first row of the result is inaccurate. Using CASE WHEN provides a workaround for the NULL issue:


SELECT
    applicant_id,
    CASE WHEN MAX(end_date is null) = 1 THEN 'present' ELSE MAX(end_date) END
    max_date
FROM example
GROUP BY applicant_id;

+--------------+------------+
| applicant_id | max_date   |
+--------------+------------+
|            1 | present    |
|            2 | 2016-08-25 |
|            3 | 2014-04-13 |
+--------------+------------+

This result can be joined back to the original example table to determine the company at which an applicant last worked:

SELECT 
  data.applicant_id,
  data.company_name,
  data.max_date
FROM (
  SELECT 
    *,
    CASE WHEN end_date is null THEN 'present' ELSE end_date END max_date
  FROM example
) data
INNER JOIN (
 SELECT
   applicant_id,
   CASE WHEN MAX(end_date is null) = 1 THEN 'present' ELSE MAX(end_date) END max_date
 FROM
   example
 GROUP BY applicant_id
) j
ON data.applicant_id = j.applicant_id AND data.max_date = j.max_date;

+--------------+-----------------+------------+
| applicant_id | company_name    | max_date   |
+--------------+-----------------+------------+
|            1 | Google          | present    |
|            2 | Woodworking.com | 2016-08-25 |
|            3 | NFL.com         | 2014-04-13 |
+--------------+-----------------+------------+

These are just a few examples of working with NULL values in MySQL.