Review of the patch "abipkgdiff parallelization"

Dodji Seketeli dodji@seketeli.org
Thu Jan 1 00:00:00 GMT 2015


Hello,

This is the review of the patch named "abipkgdiff parallelization",
attached to the enhancement request
https://sourceware.org/bugzilla/show_bug.cgi?id=19081.

At the end of this message, I am attaching the initial patch to this
email as it is otherwise inaccessible via email because bugzilla doesn't
send us attachments along with the bug mails we get.

I am also replying to the comment accompanying the bugzilla attachment
here, as I believe it's more convenient to use email to discuss these
things.

> --- Comment #5 from Ondrej Oprala <ooprala at redhat dot com> ---
> So here's my take on this issue.
> I've been using the rpms of vtk to test this throughout.
> Prior to any threading attempts, running this:
> abipkgdiff vtk-6.1.0-26.fc23.x86_64.rpm vtk-6.2.0-8.fc23.x86_64.rpm --d1 
> vtk-debuginfo-6.1.0-26.fc23.x86_64.rpm --d2 
> vtk-debuginfo-6.2.0-8.fc23.x86_64.rpm
> 
> took a bit less than 4 mins:
> 230,08s user 6,60s system 101% cpu 3:53,03 total
> threads make the numbers much nicer \o/

Whoah!  I cannot wait to see this patch land in master ;-)

Thank you for working on this.

> 381,17s user 14,15s system 492% cpu 1:20,23 total
> 
> I tried to be as verbose as possible with the commit message, but if you 
> think I missed
> anything anywhere, please tell me, I want this to be as good as it can
> get.

Nah, the cover letter is great, thank you for taking the time to put
that together.  It's incredibly useful for maintenance to have extensive
context like that.

> Helgrind only complains about elf_version (and elf_begin), which we 
> already discussed on the IRC.

Good, good.

So, I like the spirit of the patch, really.

I just have some comments about its letter.

There we go.

> diff --git a/doc/manuals/abipkgdiff.rst b/doc/manuals/abipkgdiff.rst
> index 7b90d88..4ca6f6a 100644
> --- a/doc/manuals/abipkgdiff.rst
> +++ b/doc/manuals/abipkgdiff.rst
> @@ -89,6 +89,10 @@ Options
>      Do not search the *package2* for the presence of suppression
>      files.
>  
> +  * ``--no-parallel``
> +
> +    Do not extract packages or run comparisons in parallel.

Maybe, right before this sentence, it would be cool to say that by
default, if the system has several processors, the tool uses them to
perform several tasks in parallel.

[...]


> diff --git a/tools/abipkgdiff.cc b/tools/abipkgdiff.cc

[...]

Please, add a comment that describes the whole picture, saying what is
performed concurrently, what is performed sequentially, so that future
readers have the whole "modus operandi" in their heads.  Something like:

    As this program uses libpthread to perform several tasks
    concurrently, here is a coarse grain description of the sequence
    of actions performed, including where things are done
    concurrently.

    (steps 1/ to /5 are performed in sequence)

    1/ the first package and its debug info are extracted concurrently.
    One thread extracts the package (and maps its content) and another one
    extracts the debug info package.

    2/ the second package and its debug info are extracted in parallel.
    One thread extracts the package (and maps its content) and another one
    extracts the debug info package.

    3/ the file system trees of extracted packages are traversed to
    identify existing pairs and a list of arguments for future comparison
    is made.  The trees are traversed in concurrently.

    4/ comparisons are performed concurrently.

    5/ the reports are then emitted to standard output, always in the same
    order.

[...]

> -/// This contains the set of files of a given package.  It's populated
> +/// The key for getting the thread-local elf_file_paths vector, which
> +/// contains the set of files of a given package.  The vector is populated
>  /// by a worker function that is invoked on each file contained in the
> -/// package.  So this global variable is filled by the
> +/// package, specifically by the
>  /// {first,second}_package_tree_walker_callback_fn() functions.  Its content
> -/// is relevant only during the time after which the current package has been
> -/// analyzed and before we start analyzing the next package.
> -static vector<string> elf_file_paths;
> +/// is relevant only until the mapping of the packages elf files is done.
> +static pthread_key_t thr_key_elf_map;

I wouldn't call this variable a *map.  How about calling it
elf_file_paths_tls_key or something similar?

[...]

> +/// This map is used to keep environments for differing corpora
> +/// referenced.
> +static map<corpus_diff_sptr, abigail::ir::environment_sptr> env_map;

It would be informative to add that an environment needs to be kept
alive during the life time of all the libabigail objects that depend
on it.

[...]

> +/// Arguments passed to the package extraction functions
> +struct extract_args

Please call this type package_descriptor.  It really designates the
package as well as the the context that is necessary to extract it and
map its content.  This type is thus used by the functions that extract
and map the package and its associated debug info.

I believe the package_descriptor name is more specific (has more
specific meaning) than the more generic extract_args name.

> +{
> +  package &pkg;
> +  const options& opts;
> +  ftw_cb_type callback;
> +};
> +

Please add a comment describing the type below, just like what you did
for the type above.

> +struct compare_args
> +{
> +  const elf_file	elf1;
> +  const string&		debug_dir1;
> +  const elf_file	elf2;
> +  const string&		debug_dir2;
> +  const options&	opts;
> +
> +  /// Constructor for compare_args, which is used to pass
> +  /// information to the comparison threads.
> +  ///
> +  /// @param elf1 the first elf file to consider.
> +  ///
> +  /// @param debug_dir1 the directory where the debug info file for @p
> +  /// elf1 is stored.
> +  ///
> +  /// @param elf2 the second elf file to consider.
> +  ///
> +  /// @param debug_dir2 the directory where the debug info file for @p
> +  /// elf2 is stored.
> +  ///
> +  /// @param opts the options the current program has been called with.
> +  compare_args(const elf_file elf1, const string& debug_dir1,
> +	       const elf_file elf2, const string& debug_dir2,
> +	       const options& opts)

I think elf1 and elf2 arguments should be passed by reference, to
avoid a useless copy at function invocation time.

> +    : elf1(elf1), debug_dir1(debug_dir1), elf2(elf2),
> +    debug_dir2(debug_dir2), opts(opts)
> +  {}
> +};

> +typedef shared_ptr<compare_args> compare_args_sptr;

Please add a comment for this typedef.

[...]

> @@ -596,10 +660,14 @@ erase_created_temporary_directories_parent()
>  
>  /// Extract the content of a package.
>  ///
> -/// @param package the package we are looking at.
> -static bool
> -extract_package(const package& package)
> +/// @param pkg the package we are looking at.
> +/// @return via pthread_exit() a pointer to a boolean value of true upon
> +/// successful completion, false otherwise.
> +static void
> +extract_package(const package* pkg)

Rather than changing this function, I would add a wrapper to it,
called pthread_routine_extract_package() or something similar.  That
wrapper would call extract_package as it is right now, and either and
pthread_exit() with the appropriate value, depending on the return
value of extract_package.

The idea is that if we later want to use extract a package
independently from the pthread boilerplate, we could just call
extract_package.  Inherently, extracting a package is a basic action
that shouldn't not depend on pthread.

>  {
> +  bool *ret = new bool (true);
> +  const package& package = *pkg;
>    switch(package.type())
>      {
>      case abigail::tools_utils::FILE_TYPE_RPM:
> @@ -607,13 +675,13 @@ extract_package(const package& package)
>        if (!extract_rpm(package.path(), package.extracted_dir_path()))
>          {
>            cerr << "Error while extracting package" << package.path() << "\n";
> -          return false;
> +          goto FALSE;
>          }
> -      return true;
> +      goto TRUE;
>  #else
>        cerr << "Support for rpm hasn't been enabled.  Please consider "
>  	"enabling it at package configure time\n";
> -      return false;
> +      goto FALSE;
>  #endif // WITH_RPM
>        break;
>      case abigail::tools_utils::FILE_TYPE_DEB:
> @@ -621,20 +689,20 @@ extract_package(const package& package)
>        if (!extract_deb(package.path(), package.extracted_dir_path()))
>          {
>            cerr << "Error while extracting package" << package.path() << "\n";
> -          return false;
> +          goto FALSE;
>          }
> -      return true;
> +      goto TRUE;
>  #else
>        cerr << "Support for deb hasn't been enabled.  Please consider "
>  	"enabling it at package configure time\n";
> -      return false;
> +      goto FALSE;
>  #endif // WITH_DEB
>        break;
>  
>      case  abigail::tools_utils::FILE_TYPE_DIR:
>        // The input package is just a directory that contains binaries,
>        // there is nothing to extract.
> -      break;
> +      goto TRUE;
>  
>      case abigail::tools_utils::FILE_TYPE_TAR:
>  #ifdef WITH_TAR
> @@ -642,20 +710,23 @@ extract_package(const package& package)
>          {
>            cerr << "Error while extracting GNU tar archive "
>  	       << package.path() << "\n";
> -          return false;
> +          goto FALSE;
>          }
> -      return true;
> +      goto TRUE;
>  #else
>        cerr << "Support for GNU tar hasn't been enabled.  Please consider "
>  	"enabling it at package configure time\n";
> -      return false;
> +      goto FALSE;
>  #endif // WITH_TAR
>        break;
>  
>      default:
> -      return false;
> +      break;
>      }
> -  return true;
> +FALSE:
> +  *ret = false;
> +TRUE:
> +  pthread_exit(ret);
>  }

[...]

> @@ -742,145 +821,158 @@ set_diff_context_from_opts(diff_context_sptr ctxt,
>  
>  /// Compare the ABI two elf files, using their associated debug info.
>  ///
> -/// The result of the comparison is emitted to standard output.
> -///
> -/// @param elf1 the first elf file to consider.
> -///
> -/// @param debug_dir1 the directory where the debug info file for @p
> -/// elf1 is stored.
> +/// The result of the comparison is saved to a global corpus map.
>  ///
> -/// @param elf2 the second eld file to consider.
> +/// @args the list of argument sets used for comparison
>  ///
> -/// @param debug_dir2 the directory where the debug info file for @p
> -/// elf2 is stored.
> -///
> -/// @param opts the options the current program has been called with.
> -///
> -/// @return the status of the comparison.
> +/// @return the status of the comparison via pthread_exit().
>  static abidiff_status
> -compare(const elf_file& elf1,
> -	const string&	debug_dir1,
> -	const elf_file& elf2,
> -	const string&	debug_dir2,
> -	const options&	opts)
> +compare_binaries(list<compare_args_sptr> *args)
>  {

Likewise, rather than making this compare() function know about
pthread boiler plate, I'd rather leave it alone, and write a wrapper
function named pthread_routine_compare() that would take a vector of
compare_args_sptr (why a list, by the way?).  OK, of course, this
compare() function should be modified to take an environment in
parameter, and use that.  It should also "return" the resulting
corpus_diff_sptr, by parameter, rather than emitting the report to
standard output itself.  By parameter, I mean that the function should
take a reference to corpus_diff_sptr by parameter, and set that
parameter to the corpus_diff_sptr resulting from the comparison.  The
caller (here the pthread routine) would then arrange for
creating/handling the environment, pass it to compare(), and the
resulting corpus_diff_sptr appropriately.  Does that make sense?

[...]

>  }

[...]

>  
>  /// Extract the content of a package and map its content.
> +/// Also extract its accompanying debuginfo package.
>  ///
> -/// The extracting is done to a temporary
> -///
> -/// @param package the package to consider.
> +/// The extracting is done to a temporary directory.
>  ///
> -/// @param opts the options the current package has been called with.
> +/// @param a the set of arguments needed for successful extraction;
> +/// specifically the package itself, the options the current package has been
> +/// called with and a callback to traverse the directory structure.
>  ///
> -/// @return true upon successful completion, false otherwise.
> -static bool
> -extract_package_and_map_its_content(package& package,
> -				    const options& opts,
> -				    ftw_cb_type callback)
> +/// @return via pthread_exit() true upon successful completion, false
> +/// otherwise.
> +static void
> +prepare_package_set(struct extract_args *a)

I would prefer that we call this function

pthread_routine_extract_pkg_and_map_its_content(), as it says more
specifically what it does; it also says that it's to be used as a
pthread routine.

Of course, keep the extended comment that you added, saying that it
also extracts the accompanying debug info.

Also, please drop the 'struct' keyword in 'struct extra_args' (that
should now be package_descriptor) because it's not necessary here.

>  {
> -  if (!extract_package(package))
> -    return false;
> +  pthread_t thr_pkg, thr_debug;
> +  package& package = a->pkg;
> +  const options& opts = a->opts;
> +  ftw_cb_type callback = a->callback;
> +  bool is_debug, result = true;
> +  bool *child_result;
> +
> +#define PTHREAD_JOIN(thr,thr_ret) \
> +  do { \
> +    if (!pthread_join(thr, reinterpret_cast<void **>(&thr_ret))) \
> +      { \
> +	result &= *thr_ret; \
> +	delete thr_ret; \
> +      } \
> +    else \
> +      result = false; \
> +  } while (0)

Why not replacing this define by a function that would take the thread
id, a bool** thread_return_value, and a  bool& result?

> +
> +  // The debug-info package usually takes longer to extract than the main
> +  // package plus that package's mapping for ELFs and optionally suppression
> +  // specs, so we run it ASAP.
> +  if ((is_debug = !!package.debug_info_package()))

Here, the is_debug variable should be named 'has_debug_info_pkg',
because package.debug_info_package() doesn't say anything about the
fact that the package *itself* is a debug info package; rather it
returns the *associated* debug info package, if it's present.

Also, I don't think the '!!' is necessary, because the returned value
of package.debug_info_package() should be implicitly converted to
bool.  Or am I missing something?

[...]

>  }

[...]

>  
>  /// Prepare the packages for comparison.
> @@ -984,19 +1139,86 @@ prepare_packages(package&	first_package,
>  		 package&	second_package,
>  		 const options& opts)
>  {
> -  if (!extract_package_and_map_its_content(first_package, opts,
> -					   first_package_tree_walker_callback_fn)
> -      || !extract_package_and_map_its_content(second_package, opts,
> -					      second_package_tree_walker_callback_fn))
> -    return false;
> +  bool ret = true;
> +  bool *thread_retval;
> +  const int npkgs = 2;
> +  pthread_t extract_thread[npkgs];
> +
> +#define PTHREAD_JOIN(thr,thr_ret) \
> +  do { \
> +    if (!pthread_join(thr, reinterpret_cast<void **>(&thr_ret))) \
> +      { \
> +	ret &= *thr_ret; \
> +	delete thr_ret; \
> +      } \
> +    else \
> +      ret = false; \
> +  } while (0)

If what I said for the previous incarnation of this macro makes sense,
we should replace this one by a function as well.

> +
> +  extract_args ea[npkgs] =

On one hand you are using a variable (npkgs) here, but you still need
to know that the variable's value is '2', otherwise, the
initialization won't pass compilation, because the number of members
of the array is '2'.

So I'd rather write:

    extract_args ea[] =

(okay this should be package_descriptor, rather than extract_args).

> +    {
> +      {
> +	.pkg = first_package,
> +	.opts = opts,
> +	.callback = first_package_tree_walker_callback_fn
> +      },
> +      {
> +	.pkg = second_package,
> +	.opts = opts,
> +	.callback = second_package_tree_walker_callback_fn
> +      },
> +    };

[...]

> +}

> +/// Compare the added sizes of a ELF pair specified by @a1
> +/// with the sizes of a ELF pair from @a2.
> +///
> +/// Larger filesize strongly raises the possibility of larger debug-info,
> +/// hence longer diff time. For a package containing several relatively
> +/// large and small ELFs, it is often more efficient to start working on
> +/// the larger ones first. This function is used to order the pairs by
> +/// size, starting from the largest.
> +///
> +/// @a1 the first set of arguments containing also the size information about
> +/// the ELF pair being compared.
> +///
> +/// @a2 the second set of arguments containing also the size information about
> +/// the ELF pair being compared.
> +bool
> +elf_less(compare_args_sptr &a1, compare_args_sptr &a2)

This is really a nit, but throughout the project, function names must
have a verb in them, for better legibility.  So please call this
elf_size_is_less() or something like that.

> +{
> +  off_t s1 = a1->elf1.size + a1->elf2.size;
> +  off_t s2 = a2->elf1.size + a2->elf2.size;
> +
> +  return s1 > s2;
>  }
>  
>  /// Compare the ABI of two packages
> @@ -1037,6 +1259,8 @@ compare(package&	first_package,
>  
>    abidiff_status status = abigail::tools_utils::ABIDIFF_OK;
>  
> +  list<compare_args_sptr> elf_pairs;

Why not using a vector here?  I mean, it takes less size in memory,
and I am not sure there really is a speed penalty incurred by adding
elements to this container.

[...]


And, for the record, here is the original patch that you attached to
bugzilla.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Bug-19081-abipkgdiff-parallelization.patch
Type: text/x-patch
Size: 32625 bytes
Desc: Initial patch
URL: <http://sourceware.org/pipermail/libabigail/attachments/20150101/c054c922/attachment.bin>
-------------- next part --------------

Cheers,

-- 
		Dodji


More information about the Libabigail mailing list