Why I wrote my own symmetric difference?

Oct 20, 2022 · 4 min read

I worked on an Apache Maven Plugin (more accurately on the maven-dependency-plugin) while I realized that it has a dependency commons-collections4 (version 4.2) which is used only once in the whole code base. That means that only a single import statement of those packages (import org.apache.commons.collections4.*) in the code exists. The code looks like this (see AnalyzeDuplicateMojo):

 1private Set<String> findDuplicateDependencies( List<Dependency> modelDependencies )
 2{
 3    List<String> modelDependencies2 = new ArrayList<>();
 4    for ( Dependency dep : modelDependencies )
 5    {
 6        modelDependencies2.add( dep.getManagementKey() );
 7    }
 8
 9    // @formatter:off
10    return new LinkedHashSet<>(
11            CollectionUtils.disjunction( modelDependencies2, new LinkedHashSet<>( modelDependencies2 ) ) );
12    // @formatter:on
13}

That made me take a closer look on that. I analyzed the dependency a bit more and the commons-collections4 is a jar file of ca. 734 KiB in size and contains ca. 300 classes.

So I thought, does it really make sense to depend on a jar file for a few classes being used? So what exactly is the CollectionUtils.disjunction doing? Based on it's javadoc it is calculating the symmetric difference of two sets. An excerpt from Wikipedia:

...also known as the disjunctive union, is the set of elements which are in either of the sets, but not in their intersection. For example, the symmetric difference of the sets {1,2,3} and {3,4} is {1,2,4}.

So that brought me to the conclusion that it might be worth to implement that on my own to prevent depending on a library only for a few classes.

So after a bit of tests and some retries I got the following result:

 1class Util
 2{
 3
 4    public static <O> Collection<O> symmetricDifference( Iterable<? extends O> elementsOne,
 5                                                         Iterable<? extends O> elementsTwo )
 6    {
 7        Collection<O> resultOne = StreamSupport.stream( elementsOne.spliterator(), false ).collect( toList() );
 8        Collection<O> resultTwo = StreamSupport.stream( elementsTwo.spliterator(), false ).collect( toList() );
 9
10        HashSet<O> hashSet = new HashSet<>(resultOne);
11        hashSet.addAll( resultTwo );
12
13        Map<O, Long> cardinalityOne =
14                resultOne.stream().collect( groupingBy( identity(), counting() ) );
15        Map<O, Long> cardinalityTwo =
16                resultTwo.stream().collect( groupingBy( identity(), counting() ) );
17
18        List<O> newList = new ArrayList<>();
19
20        for ( O item : hashSet )
21        {
22            Long cardOne = cardinalityOne.getOrDefault( item, 0L );
23            Long cardTwo = cardinalityTwo.getOrDefault( item, 0L );
24            long max = Math.max( cardOne, cardTwo );
25            long min = Math.min( cardOne, cardTwo );
26            addItem( newList, item, max - min );
27        }
28
29        return newList;
30    }
31
32    private static <O> void addItem( List<O> items, O item, long count )
33    {
34        for ( int i = 0; i < count; i++ )
35        {
36            items.add( item );
37        }
38    }
39
40}

Of course, I've written unit tests to verify the code. The tests I've written at the beginning where to compare the results of my own code with the results of CollectionUtils.disjunction. Afterwards I have removed the calling to CollectionUtils.disjunction as well as the dependency.

Apart from that we can discuss the above code that it is not the optimal solution or could be simplified which is true. You can argue that using Iterable<? extends O> as parameters might be change by using Collection<O> instead which is correct. Still there are other options to simplify the code in different ways like using Collections.nCopies() and using summingInt(..) instead of counting() which results in type change from Long to Integer. In the end makes the code smaller and easier. So the final result looks like this:

 1class Util
 2{
 3
 4  static <O> Collection<O> symmetricDifference( Collection<O> elementsOne,
 5                                                Collection<O> elementsTwo )
 6  {
 7    Set<O> hashSet = new HashSet<>( elementsOne );
 8    hashSet.addAll( elementsTwo );
 9
10    Map<O, Integer> cardinalityOne =
11        elementsOne.stream().collect( groupingBy( identity(), summingInt( e -> 1 ) ) );
12    Map<O, Integer> cardinalityTwo =
13        elementsTwo.stream().collect( groupingBy( identity(), summingInt( e -> 1 ) ) );
14
15    return hashSet.stream().flatMap( item -> {
16      int cardOne = cardinalityOne.getOrDefault( item, 0 );
17      int cardTwo = cardinalityTwo.getOrDefault( item, 0 );
18      int max = Math.max( cardOne, cardTwo );
19      int min = Math.min( cardOne, cardTwo );
20      return Collections.nCopies( max - min, item ).stream();
21    } ).collect( toList() );
22  }
23
24}

So in the end it is worth to reconsider the usage of dependencies because in this case I've written a class with less than 20 lines of code to replace a dependency with 300+ classes of which only a few (3 If I analyzed that correctly) are really being used.