Doppelganger Destroyer: A Drush Script

29 April, 2024
Doppelganger stick figures with one crossed out with a red X.

Doppelgänger noun
dop·​pel·​gäng·​er  (ˈdä-pəl-ˌgaŋ-ər)
   a ghostly counterpart

-Merriam-Webster

I worked on a site that had several CSV imports that would run on a continual basis. The content of one of the CSV files was employee directory entries. 

The employee ID in the CSV is compared to that in the node bundle. If a node exists with that employee ID, it is updated. If there is no match, a new node is created. This is normal behavior I'm sure you've seen before.
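The matching rule can be sketched in plain PHP, outside Drupal. This is a minimal illustration rather than the importer's actual code: the function name importRow(), the array standing in for the stored nodes, and the sample rows are all hypothetical.

```php
<?php
// Hypothetical in-memory stand-in for the employee nodes, keyed by employee ID.
// importRow() and the sample data are illustrative, not the real importer.
function importRow(array &$nodes, array $row): string {
    $id = $row['employee_id'];
    $action = array_key_exists($id, $nodes) ? 'updated' : 'created';
    // Only the CSV-provided value is written; other keys on the entry are kept.
    $nodes[$id]['name'] = $row['name'];
    return $action;
}

$nodes = [];
$first  = importRow($nodes, ['employee_id' => '00012345', 'name' => 'Luke Skywalker']);
$second = importRow($nodes, ['employee_id' => '00012345', 'name' => 'Luke S. Skywalker']);

echo $first, PHP_EOL;        // created
echo $second, PHP_EOL;       // updated
echo count($nodes), PHP_EOL; // 1
```

The second row reuses the same ID, so it updates the existing entry instead of adding a new one.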

It was discovered that many of the employees had almost duplicate nodes: doppelgängers. 'Almost' because the employee ID of the doppelgänger wasn't quite the same as each original. Early on in the life of the content, there had been an issue with the CSV files being encoded as Latin1 rather than UTF-8, which the system expected. The result was that imports were failing.

Until the source of the CSV files could be altered to produce files with the proper encoding, the files were imported as spreadsheets and exported with the proper encoding. In doing so, the spreadsheet software treated the employee ID as numeric, where it had been a numeric string. The difference? Leading 0's. They were dropped, unnoticed, and when the source encoding issue was fixed, the files started importing with a fixed-length numeric string with leading 0's as they should, which did not match the originals absent the leading 0's.
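To see why the stripped IDs stopped matching, here is a small plain-PHP illustration. The integer cast stands in for the spreadsheet's numeric conversion, which is an assumption about how the software behaved. The sample ID is made up, and the width of 8 is taken from the length check used later in this post.

```php
<?php
// Sample fixed-length ID as it might appear in the CSV (illustrative value).
$csvId = '00012345';

// Assumption: treating the value as a number drops the leading zeros,
// mimicked here with an integer cast.
$spreadsheetId = (string) (int) $csvId;

var_dump($spreadsheetId);            // string(5) "12345"
var_dump($spreadsheetId === $csvId); // bool(false)

// Left-padding back to the fixed width recovers the original string.
$restored = str_pad($spreadsheetId, 8, '0', STR_PAD_LEFT);
var_dump($restored === $csvId);      // bool(true)
```

Because the comparison is between strings, '12345' and '00012345' never match, so the importer creates a second node instead of updating the first.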

Under normal circumstances, because each new version of the CSV file contained all employees, the previous import could have been rolled back and the newer one imported, along with the leading 0's. The circumstances, though, were that since being imported, fields in the bundle other than those provided by the CSV had been manually populated. Rolling back would have erased those edits.

I decided to handle the issue by writing a script to remove the duplicates. Given that the script would only be run once, a drush script would be fine for the job.

Here is how I did it.

The tree for this project will be:

modules
-- custom
---- doppelganger_destroyer
------ src
-------- Commands
---------- DoppelgangerDestroyerCommands.php
------ doppelganger_destroyer.info.yml
------ doppelganger_destroyer.services.yml

There won't be a .module file, since it is not needed unless there's something to go in it.

doppelganger_destroyer.info.yml

name: Doppelganger Destroyer
description: Delete duplicate nodes.
core: 8.x
type: module
core_version_requirement: ^8 || ^9

The info.yml file is standard, containing the bare minimum statements necessary for a custom module.

 

doppelganger_destroyer.services.yml

services:
  doppelganger_destroyer.commands:
    class: \Drupal\doppelganger_destroyer\Commands\DoppelgangerDestroyerCommands
    tags:
      - { name: drush.command }

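The services.yml file registers a service, tagged as a drush command, that points to the class that will declare the drush command and provide its functionality. (Drush's own documentation uses drush.services.yml as the conventional name for this file.) This class is the final file to write.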

 

DoppelgangerDestroyerCommands.php

namespace Drupal\doppelganger_destroyer\Commands;

After the opening PHP tag, the namespace declaration identifies, in PSR-4 format, where this file is situated.

use Drush\Commands\DrushCommands;
use Drupal\entity;

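Then identify the source of the classes we will be making use of. Only DrushCommands is actually referenced by name in the class; the second use statement is not strictly needed, because the code reaches the entity API through the global \Drupal class.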

/**
 * Drush command file.
 */
class DoppelgangerDestroyerCommands extends DrushCommands {

Identify our class and that it extends the base class for drush commands.

    /**
     * Remove duplicate nodes.
     *
     * @command doppelganger:destroyer
     * @option dry-run Print two lists, employee id's to be selected and deleted
     * @aliases ddest
     */
    public function doppelgangerDestroyer($options = ['dry-run' => FALSE]) {

The annotations will be used by the drush command processor. The @command name will appear in the list of drush commands and will be accepted as a drush command. The command accepts a dry-run option that has it run through the motions of deleting the duplicate nodes without actually removing them. The command also has an alias, ddest, in the same way that 

drush cr

can be used in place of

drush cache-rebuild

The final thing that we do in this snippet is declare the base function. In the declaration, a default value is provided so that the dry-run option is assumed not to be requested unless it is given explicitly with:

drush ddest --dry-run

        $nids = $this->getApplicableNids();

The first thing we need to do is obtain a list of the nids (node IDs), if any, that need to be removed. In this example there is one method to determine the nodes that meet the criteria, but there could be more added, as needed. Let's create the method.

    private function getApplicableNids() {
        $nids = \Drupal::entityQuery("node")
            ->condition('type', 'employees')
            ->condition('field_employee_id', NULL, 'IS NOT NULL')
            ->execute();

        return $nids;
    }

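The method executes an entity query that returns the nid of any node of the desired bundle (employees), where the field in question (field_employee_id) isn't empty. Note that entity queries check node access by default, and from Drupal 9.2 onward the query is expected to call accessCheck() explicitly with TRUE or FALSE.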

        if ($nids) {
            $purge = [];
            $storage_handler = \Drupal::entityTypeManager()->getStorage("node");
            $confirm = null;

After receiving the response from getApplicableNids(), we check whether any nids were returned. If so, we initialize an array, $purge, that will hold the nodes that were or will be purged, depending on whether this is a dry run, for later reporting. We also retrieve a storage handler that will give us access to each node's fields, and declare a variable that will be used to verify deletion prior to performing it.

There are two runtime modes when executing our command: a dry run, where records are not actually deleted, and a live run, where records will be deleted. Some of the code applies to one mode or the other, and some to both.

            if (!$options['dry-run']) {
                $confirm = $this->io()->confirm('Do you want to delete all duplicates?', false);
            }

This tests whether actual deletion was requested, that is, whether the dry-run option is absent. If so, drush asks the user to confirm that this is the intent. The second argument makes "no" the default answer.

            foreach ($nids as $nid) {
                $node = $storage_handler->load($nid);
                $id = $node->get('field_employee_id')->value;
                if (strlen($id) < 8) {
                    $purge[] = $node;
                    if ($confirm) {
                        $storage_handler->delete([$node]);
                    }
                }
            }

The foreach loop will execute regardless of the mode...whether a dry-run or not. For each applicable nid that was returned, the node it identifies will be loaded. Since the entity query in our method bypassed nodes with no value in field_employee_id, we know that $id will contain a value when retrieved from the node.

The payload of our module is to check whether the length of the ID is less than 8. Why? A valid employee ID is a fixed-length, 8-character numeric string. A value that is 8 characters long is intact, with or without leading zeros, while a shorter value must have lost its leading zeros; otherwise its length would be 8.

So, if the length of the value is less than 8 characters, and if the user has confirmed that deletion is desired, the storage handler is told to delete the node.
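The length test can be tried in isolation. The sketch below is plain PHP with made-up IDs; isDoppelganger() is a hypothetical helper name, and the default width of 8 mirrors the check in the loop above.

```php
<?php
// Hypothetical helper: an ID shorter than the fixed width has lost its zeros.
function isDoppelganger(string $employeeId, int $width = 8): bool {
    return strlen($employeeId) < $width;
}

// Made-up sample IDs: three intact, two stripped of leading zeros.
$ids = ['00001111', '1111', '00002222', '2222', '12345678'];

// array_values() reindexes the filtered result from 0.
$doppelgangers = array_values(array_filter($ids, 'isDoppelganger'));

print_r($doppelgangers); // the two short IDs: 1111 and 2222
```

Keeping the test in a small function like this would also make it easy to adjust if the fixed width were ever something other than 8.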

            if ($confirm) {
                $this->writeln(sizeof($purge) . ' records deleted.');
            }
            else {
                foreach ($purge as $node) {
                    $this->writeln($node->get('field_employee_id')->value . ' ' . $node->get('title')->value);
                }
                $this->writeln('');
                $this->writeln(sizeof($purge) . ' records will be deleted.');
            }
        }

If deletion was confirmed, a count of the deleted records is output. Otherwise, whether a dry run was requested or implied by declining the verification request, the else branch outputs each node's employee ID and title on the command line, followed by a count of the nodes that would be deleted were deletion approved.
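The reporting branches can be sketched without Drupal or Drush. Below, buildReport() is a hypothetical function and plain arrays stand in for loaded nodes. It only mirrors the wording of the messages in the class, to illustrate how a dry run and a declined confirmation end up in the same branch.

```php
<?php
// Hypothetical stand-in for the command's reporting logic.
// $purge holds ['id' => ..., 'title' => ...] arrays instead of node objects.
// $confirm is null on a dry run, false when declined, true when approved.
function buildReport(array $purge, ?bool $confirm): array {
    if ($confirm) {
        return [count($purge) . ' records deleted.'];
    }
    $lines = [];
    foreach ($purge as $item) {
        $lines[] = $item['id'] . ' ' . $item['title'];
    }
    $lines[] = '';
    $lines[] = count($purge) . ' records will be deleted.';
    return $lines;
}

$purge = [
    ['id' => '1111', 'title' => 'Luke Skywalker'],
    ['id' => '2222', 'title' => 'Blu Druplicon'],
];

// Dry run (null) and declined confirmation (false) produce the same listing.
echo implode(PHP_EOL, buildReport($purge, null)), PHP_EOL;
echo implode(PHP_EOL, buildReport($purge, true)), PHP_EOL;
```

Both null and false are falsy in PHP, which is why a single if/else is enough to cover all three cases.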

Here is the complete class listing:

<?php
namespace Drupal\doppelganger_destroyer\Commands;
use Drush\Commands\DrushCommands;
use Drupal\entity;


/**
 * Drush command file.
 */
class DoppelgangerDestroyerCommands extends DrushCommands {

    /**
     * Remove duplicate nodes.
     *
     * @command doppelganger:destroyer
     * @option dry-run Print two lists, employee id's to be selected and deleted
     * @aliases ddest
     */
    public function doppelgangerDestroyer($options = ['dry-run' => FALSE]) {
        // The non dry-run might throw a gulp/curl error on local
        $nids = $this->getApplicableNids();
        if ($nids) {
            $purge = [];
            $storage_handler = \Drupal::entityTypeManager()->getStorage("node");
            $confirm = null;
            if (!$options['dry-run']) {
                $confirm = $this->io()->confirm('Do you want to delete all duplicates?', false);
            }
            foreach ($nids as $nid) {
                $node = $storage_handler->load($nid);
                $id = $node->get('field_employee_id')->value;
                if (strlen($id) < 8) {
                    $purge[] = $node;
                    if ($confirm) {
                        $storage_handler->delete([$node]);
                    }
                }
            }
            if ($confirm) {
                $this->writeln(sizeof($purge) . ' records deleted.');
            }
            else {
                foreach ($purge as $node) {
                    $this->writeln($node->get('field_employee_id')->value . ' ' . $node->get('title')->value);
                }
                $this->writeln('');
                $this->writeln(sizeof($purge) . ' records will be deleted.');
            }
        }
    }

    private function getApplicableNids() {
        $nids = \Drupal::entityQuery("node")
            ->condition('type', 'employees')
            ->condition('field_employee_id', NULL, 'IS NOT NULL')
            ->execute();

        return $nids;
    }
}

The drush command list output:

The drush output for help for our command:

The output of a dry run:

ddev drush ddest --dry-run
1111 Luke Skywalker
2222 Blu Druplicon

2 records will be deleted.

The output of a live run with deletion declined:

ddev drush ddest

 Do you want to delete all duplicates? (yes/no) [no]:
 > no

1111 Luke Skywalker
2222 Blu Druplicon

2 records will be deleted.

Lastly, the output of a live run with deletion approved:

ddev drush ddest

 Do you want to delete all duplicates? (yes/no) [no]:
 > yes

2 records deleted.

 
