Condor Project News > HTCondor helps with GIS analysis (August 24, 2016)

This article explains how the Clemson Center for Geospatial Technologies (CCGT) was able to use HTCondor to help a student analyze large amounts of GIS (Geographic Information System) data. The article contains a good explanation of how the data was divided up in such a way as to allow it to be processed using an HTCondor pool. Using HTCondor allowed the data to be analyzed in approximately 3 hours, as opposed to the 4.11 days it would have taken on a single computer.

News and Announcements from OSG Operations > TWiki Outage Update

We have completed the previously announced restoration of the TWiki to its state as of Monday 15/Aug. The system is behaving normally at this time, but we ask that you contact us if you encounter any unusual behavior.

We are also in the process of recovering changes made to content between Monday 15/Aug and Friday 19/Aug. If there is content you need restored and would like prioritized, please let us know.

The GOC regrets any inconvenience and is taking steps to ensure this will not recur.

News and Announcements from OSG Operations > TWiki Outage

We are currently encountering difficulties with TWiki and are restoring the
service from backup to its state as of Monday 15/Aug. We will attempt to recover
changes made after that time and will apprise you of the results of our efforts.
The GOC regrets any inconvenience and will inform you as soon as resolution and
further information is available.

Pegasus news feed > Pegasus 4.6.2 Release

We are happy to announce the release of Pegasus 4.6.2. Pegasus 4.6.2 is a minor release that includes improvements and bug fixes to the 4.6.1 release.
New features and improvements in 4.6.2 include:
  • support for kickstart wrappers that can setup a user environment
  • support for Cobalt and SLURM schedulers via the Glite interfaces
  • ability to do local copy of files in PegasusLite to the staging site, if the compute and staging sites are the same
  • support for setting up Pegasus Tutorial on Bluewaters using pegasus-init

New Features

  • [PM-1095] – pegasus-service init script
  • [PM-1101] – Add support for gsiscp transfers
    •  These will work like the scp ones, but with x509 auth instead of ssh public keys.
  • [PM-1110] – put in support for cobalt scheduler at ALCF
    • Pegasus was updated to use the HTCondor BLAHP support. ALCF uses the Cobalt scheduler to schedule jobs to the BlueGene system. The documentation has details on how the Pegasus task requirement profiles map to Cobalt parameters.
    • To use HTCondor on Mira, please contact the HTCondor team to point you to the latest supported HTCondor installation on the system.
  • [PM-1096] – Update Pegasus’ glite support to include SLURM
  • [PM-1115] – Pegasus to check for cyclic dependencies in the DAG
    • Pegasus now checks for cyclic dependencies that may exist in the DAX or that result from edges added automatically based on data dependencies.
  • [PM-1116] – pass task resource requirements as environment variables for job wrappers to pick up
    • The task resource requirements are also passed as environment variables for the jobs in the GLITE style. This ensures that job wrappers can pick up task requirement profiles as environment variables.

Improvements

  • [PM-1078] – pegasus-statistics should take comma separated list of values for -s option
  • [PM-1105] – Mirror job priorities to DAGMan node priorities
    • The job priorities associated with jobs in the workflow are now also associated as DAGMan node priorities, provided that HTCondor version is 8.5.7 or higher.
  • [PM-1108] – Ability to do local copy of files in PegasusLite to the staging site, if the compute and staging sites are the same
    • The optimization is implemented in the planner's PegasusLite generation code: when constructing the destination URLs for the output site, it checks that
      a) symlinking is turned on, and
      b) the compute site and the staging site for the job are the same.
      This means the shared-scratch directory used on the staging site is locally accessible to the compute nodes, so the file can be copied directly via the filesystem. Instead of creating a gsiftp URL, the planner creates a file URL in the PegasusLite wrappers for jobs running on the local site.
  • [PM-1112] – enable variable expansion for regex based replica catalog
    • Variable expansion for regex-based replica catalogs was not previously supported. This is now fixed.
  • [PM-1117] – Support for tutorial via pegasus-init on Bluewaters
    • pegasus-init was updated to support running the tutorial examples on Bluewaters. To use this, users need to log on to the Bluewaters login node and run pegasus-init. The assumption is that HTCondor is running on the login node, either in user space or as root.
  • [PM-1111] – pegasus planner and api’s should have support for ppc64 as architecture type
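The local-copy optimization in PM-1108 above boils down to a two-condition check. As a hypothetical sketch (not Pegasus's actual planner code; the function and parameter names are illustrative):

```python
def output_url(lfn, scratch_dir, compute_site, staging_site, symlinking_enabled):
    """Pick the destination URL scheme for a PegasusLite output transfer.

    Hypothetical sketch of the PM-1108 decision: when symlinking is enabled
    and the job's compute site is also its staging site, the shared-scratch
    directory is locally mounted, so a file:// URL lets the wrapper copy the
    file through the filesystem instead of through a gsiftp server.
    """
    path = f"{scratch_dir}/{lfn}"
    if symlinking_enabled and compute_site == staging_site:
        return f"file://{path}"                  # direct filesystem copy
    return f"gsiftp://{staging_site}{path}"      # remote GridFTP transfer
```

Both conditions must hold: symlinking signals that the scratch area is shared, and the matching sites guarantee the wrapper actually runs where that area is mounted.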

Bugs Fixed

  • [PM-1087] – dashboard and pegasus-metadata don’t query for sub workflows
  • [PM-1089] – connect_by_submitdir should seek for braindump.txt in the workflow root folder
  • [PM-1093] – disconnect in site catalog and DAX schema for specifying OSType
  • [PM-1099] – x509 credentials should be transferred using x509userproxy
  • [PM-1100] – Typo in rsquot, ldquot and rdquot
  • [PM-1106] – pegasus-init should not allow (or should handle) spaces in site name
  • [PM-1107] – pegasuslite signal handler race condition
  • [PM-1113] – make planner directory options behavior more consistent



Pegasus news feed > Soybean Science Blooms with Supercomputers

TACC (Texas Advanced Computing Center) has published a science highlight of the SoyKB project. Pegasus is used to orchestrate the computations running on TACC Wrangler and to automatically retrieve and store data in the CyVerse data store. Also highlighted is how the XSEDE ECSS (Extended Collaborative Support Service) can be used to get scientific workflow support on XSEDE.

Read the full article at:



News and Announcements from OSG Operations > GOC Service Update - Tuesday, August 23rd

The GOC will upgrade the following services beginning Tuesday, August 23rd at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.
Updates to esmond, rsv-perfsonar

Collector, Redirector, Ticket, Ticket exchange, OIM
Rebuild from new content management system
OIM, change default certificate signer from DigiCert to CILogon

Upgrade GlideinWMS to  VOs planning to run GlideinWMS 3.2.15 on their frontends will require all factories to run >=

Updates to wordpress

All Services
Operating system updates; reboots will be required. The usual HA mechanisms will be used, but some services will experience brief outages.

News and Announcements from OSG Operations > Announcing OSG Software versions 3.3.15 and 3.2.41

We are pleased to announce OSG Software versions 3.3.15 and 3.2.41.
This is the last release in the 3.2 series.

Both 3.3.15 and 3.2.41 include:
* CA certificates based on IGTF 1.76
* VO Package v67 - correction for ILC

Changes to OSG 3.3.15 include:
* SLURM scalability enhancements in the BLAHP
* Fixed a bug in the BLAHP where HTCondor could not remove a SLURM job
* Enable XRootD-HDFS to use native HDFS libraries if available
* Add an extension to the GridFTP server to report space usage on the server
* Fix GUMS to properly display long Pool Account lists
* The RSV service will now start even if its state file is corrupt
* Update GSI-OpenSSH from 5.7-4.3 to 7.1p2f
* voms-proxy-init generates RFC compliant proxies by default
* Configure voms-server for systemd startup in EL7
* Add voms-admin-client for EL7
* HTCondor 8.5.6 in the upcoming repository

Release notes and pointers to more documentation can be found at:

Need help? Let us know:

We welcome feedback on this release!

Erik Erlandson - Tool Monkey > Using Minimum Description Length to Optimize the 'K' in K-Medoids

Applying many popular clustering models, for example K-Means, K-Medoids and Gaussian Mixtures, requires an up-front choice of the number of clusters -- the 'K' in K-Means, as it were. Anybody who has ever applied these models is familiar with the inconvenient task of guessing what an appropriate value for K might actually be. As the size and dimensionality of data grows, estimating a good value for K rapidly becomes an exercise in wild guessing and multiple iterations through the free-parameter space of possible K values.

There are some varied approaches in the community for addressing the task of identifying a good number of clusters in a data set. In this post I want to focus on an approach that I think deserves more attention than it gets: Minimum Description Length.

Many years ago I ran across a superb paper by Stephen J. Roberts on anomaly detection that described a method for automatically choosing a good value for the number of clusters based on the principle of Minimum Description Length. Minimum Description Length (MDL) is an elegant framework for evaluating the parsimony of a model. The Description Length of a model is defined as the amount of information needed to encode that model, plus the encoding-length of some data, given that model. Therefore, in an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose own description is also efficient (This suggests connections between MDL and the idea of learning as a form of data compression).

For example, a model that directly memorizes all the data may allow for a very short description of the data, but the model itself will clearly require at least the size of the raw data to encode, and so direct memorization models generally stack up poorly with respect to MDL. On the other hand, consider a model of some Gaussian data. We can describe these data in a length proportional to their log-likelihood under the Gaussian density. Furthermore, the description length of the Gaussian model itself is very short; just the encoding of its mean and standard deviation. And so in this case a Gaussian distribution represents an efficient model with respect to MDL.

In summary, an MDL framework allows us to mathematically capture the idea that we only wish to consider increasing the complexity of our models if that buys us a corresponding increase in descriptive power on our data.

In the case of Roberts' paper, the clustering model in question is a Gaussian Mixture Model (GMM), and the description length expression to be optimized can be written as:

    L(X) = -Σ_{x∈X} ln p(x) + (P/2) ln|X|

In this expression, X represents the vector of data elements. The first term is the (negative) log-likelihood of the data, with respect to a candidate GMM having some number (K) of Gaussians; p(x) is the GMM density at point (x). This term represents the cost of encoding the data, given that GMM. The second term is the cost of encoding the GMM itself. The value P is the number of free parameters needed to describe that GMM. Assuming a dimensionality D for the data, then P = K(D + D(D+1)/2): D values for each mean vector, and D(D+1)/2 values for each covariance matrix.
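A minimal sketch of this measure, using scikit-learn's GaussianMixture as a stand-in density estimator (an assumption on my part; the formula itself does not prescribe an implementation):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_description_length(X, k, seed=0):
    """Description length of a K-component GMM fit to X (n x d):
    -sum_x ln p(x) + (P/2) ln|X|, with P = K*(D + D*(D+1)/2).
    """
    n, d = X.shape
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(X)
    neg_log_lik = -gmm.score_samples(X).sum()   # -sum_x ln p(x), in nats
    P = k * (d + d * (d + 1) // 2)              # mean + covariance parameters
    return neg_log_lik + 0.5 * P * np.log(n)    # add the model-encoding cost

# Two well-separated synthetic clusters; the smallest description length
# should occur at K=2.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
best_k = min(range(1, 6), key=lambda k: gmm_description_length(X, k))
```

Because the penalty grows as (P/2) ln|X|, each added Gaussian must buy at least that much log-likelihood to pay for itself; this is the same trade-off the BIC criterion encodes.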

I wanted to apply this same MDL principle to identifying a good value for K, in the case of a K-Medoids model. How best to adapt MDL to K-Medoids poses some problems. In the case of K-Medoids, the only structure given to the data is a distance metric. There is no vector algebra defined on data elements, much less any ability to model the points as a Gaussian Mixture.

However, any candidate clustering of my data does give me a corresponding distribution of distances from each data element to its closest medoid. I can evaluate an MDL measure on these distance values. If adding more clusters (i.e. increasing K) does not sufficiently tighten this distribution, then its description length will start to increase at larger values of K, thus indicating that more clusters are not improving our model of the data. Expressing this idea as an MDL formulation produces the following description length formula:

    L(X) = -Σ_{x∈X} ln p(||x − c_x||) + (P/2) ln|X| + K ln|X|

Note that the first two terms are similar to the equation above; however, the underlying distribution p(||x − c_x||) is now a distribution over the distances of each data element (x) to its closest medoid c_x, and P is the corresponding number of free parameters for this distribution (more on this below). There is now an additional third term, representing the cost of encoding the K medoids. Each medoid is a data element, and specifying each data element requires log|X| bits (or nats, since I generally use natural logarithms), yielding an additional K log|X| in description length cost.

And so, an MDL-based algorithm for automatically identifying a good number of clusters (K) in a K-Medoids model is: run K-Medoids clusterings on my data for a set of candidate K values, evaluate the MDL measure above for each, and choose the model whose description length L(X) is smallest!

As I mentioned above, there is also an implied task of choosing a form (or a set of forms) for the distance distribution p(||x − c_x||). At the time of this writing, I am fitting a gamma distribution to the distance data, and using this gamma distribution to compute log-likelihood values. A gamma distribution has two free parameters -- a shape parameter and a scale parameter -- and so currently the value of P is always 2 in my implementations. I elaborated on some back-story about how I arrived at the decision to use a gamma distribution here and here. An additional reason for my choice is that the gamma distribution has fairly good shape coverage, including two-tailed, single-tailed, and exponential-like shapes.
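Putting the pieces together, here is a minimal Python/SciPy sketch of the distance-based description length with a gamma fit (my own illustration, not the implementation linked above; `medoid_idx` holds the indices of the chosen medoids within X):

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import gamma

def kmedoids_description_length(X, medoid_idx):
    """L(X) = -sum_x ln p(||x - c_x||) + (P/2) ln|X| + K ln|X|,
    with p a gamma density fit to the point-to-medoid distances (P = 2),
    and K ln|X| the cost of naming the K medoids among the |X| points.
    """
    n, k = len(X), len(medoid_idx)
    d = cdist(X, X[medoid_idx]).min(axis=1)   # distance to closest medoid
    d = d[d > 0]                              # gamma support is x > 0; drop
                                              # the medoids' zero self-distances
    shape, loc, scale = gamma.fit(d, floc=0)  # 2 free parameters: shape, scale
    neg_log_lik = -gamma.logpdf(d, shape, loc=loc, scale=scale).sum()
    P = 2
    return neg_log_lik + 0.5 * P * np.log(n) + k * np.log(n)
```

Any K-Medoids solver can supply `medoid_idx` for each candidate K; the K whose clustering minimizes this value is the one to keep.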

Another observation (based on my blog posts mentioned above) is that my use of the gamma distribution implies a bias toward cluster distributions that behave (more or less) like Gaussian clusters, and so in this respect its current behavior is probably somewhat analogous to the G-Means algorithm, which identifies clusterings that yield Gaussian distributions in each cluster. Adding other candidates for distance distributions is a useful subject for future work, since there is no compelling reason to either favor or assume Gaussian-like cluster distributions over all kinds of metric spaces. That said, I am seeing reasonable results even on data with clusters that I suspect are not well modeled as Gaussian distributions. Perhaps the shape-coverage of the gamma distribution is helping to add some robustness.

To demonstrate the MDL-enhanced K-Medoids in action, I will illustrate its performance on some data sets that are amenable to graphic representation. The code I used to generate these results is here.

Consider this synthetic data set of points in 2D space. You can see that I've generated the data to have two latent clusters:


I collected the description-length values for candidate K-Medoids models having 1 up to 10 clusters, and plotted them. This plot shows that the clustering with minimal description length had 2 clusters:


When I plot that optimal clustering at K=2 (with cluster medoids marked in black-and-yellow), the clustering looks good:


To show the behavior for a different optimal value, the following plots demonstrate the MDL K-Medoids results on data where the number of latent clusters is 4:

[Plots: K4 raw data, K4 description-length curve, K4 clusters]

A final comment on Minimum Description Length approaches to clustering -- although I focused on K-Medoids models in this post, the basic approach (and I suspect even the same description length formulation) would apply equally well to K-Means, and possibly other clustering models. Any clustering model that involves a distance function from elements to some kind of cluster center should be a good candidate. I intend to keep an eye out for applications of MDL to other learning models, as well.


[1] "Novelty Detection Using Extreme Value Statistics"; Stephen J. Roberts; Feb 23, 1999
[2] "Learning the k in k-means"; Advances in Neural Information Processing Systems; Hamerly, G., & Elkan, C.; 2004

News and Announcements from OSG Operations > GOC Service Update Tuesday, August 9th at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, August 9th at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.

Modifications to configuration for bosco
Will not request a certificate
Redirect to https

Add configuration for /cvmfs/ to the repository

Condor Project News > HTCondor 8.5.6 released! (August 2, 2016)

The HTCondor team is pleased to announce the release of HTCondor 8.5.6. This development series release contains new features that are under development. This release contains all of the bug fixes from the 8.4.8 stable release. Highlights of the release are: The -batch output for condor_q is now the default; Python bindings for job submission and machine draining; Numerous Docker usability changes; New options to limit condor_history results to jobs since last invocation; Shared port daemon can be used with high availability and replication; ClassAds can be written out in JSON format; More flexible ordering of DAGMan commands; Efficient PBS and SLURM job monitoring; Simplified leases for grid universe jobs. Further details can be found in the Development Version History and the Stable Version History. HTCondor 8.5.6 binaries and source code are available from our Downloads page.

News and Announcements from OSG Operations > GOC Service Update Tuesday, July 26th at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, July 26th at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.

Condor Collector
Adding the service to new content management system.


No longer suppress emails during the host cert request process when the submitter is a GA

StashCache Redirector
Adding the service to new content management system.

All Services
Operating system updates; reboots will be required.
The usual HA mechanisms will be used, but some services will experience brief outages.

News and Announcements from OSG Operations > Planned Retirement of OSG BDII Information Service (March 31st, 2017)

OSG Collaborators,

OSG Operations and Technology are planning the retirement of the BDII information service located at on March 31st, 2017. We have been working with WLCG, ATLAS, and CMS to remove dependencies or replace the functionality within our HTCondor Collector service. This work is still ongoing. This message is to alert you to the upcoming deprecation date and to gather feedback on any other dependencies on the BDII that might exist.

If you are dependent in any way on the OSG BDII or information the OSG BDII supplies to the WLCG or EGI BDIIs please contact us at

Pegasus news feed > Workflow tutorial at XSEDE’16


Are you going to the XSEDE’16 conference and want to learn more about Pegasus? Pegasus will be presented as part of the Introduction to Scientific Workflow Technologies on XSEDE tutorial, Monday July 18th 1pm-5pm. We will also be available to meet with users individually during the conference.

We hope to see you there!

XSEDE’16 schedule:



Pegasus news feed > Pegasus Research Impact

The Pegasus team has released a Research Impact project to conduct data science analysis on the publications citing or using the Pegasus software. In this project, we collect citation and author data from Google Scholar, and conduct analyses on the number of citations (self-references and external references), the distribution of citation types (conference papers, journal articles, etc.), and author locations, among others.


Pegasus (est. 2001) has been widely adopted by the computational research community, and its performance and usage data have already demonstrated its ability to empower scientists to seamlessly run their simulations or data analyses on distributed systems. In this project, we aim to provide empirical data demonstrating how the Pegasus software has contributed to and impacted the research community.

These analyses will be conducted periodically (about every 3 months) and will aggregate information from all Pegasus publications (for the software) and the Pegasus project website. The first analysis collected data between 2005 and June 2016, representing 1,100+ citations from 2,400+ authors worldwide.

View the Pegasus Research Impact project analysis


News and Announcements from OSG Operations > Requested Update of CVMFS Client Version for NoVA experiment

Fabric for Frontier Experiments (FIFE) and the NoVA experiment depend on CVMFS Client version 2.2.3 or later.
OSG Operations requests all sites that would like to support FIFE and NoVA workflows update to the latest version of the CVMFS Client.
Details on the current release and update instructions can be found at

Thank you for your effort in keeping your OSG installation up to date.

News and Announcements from OSG Operations > Announcing OSG Software versions 3.3.14 and 3.2.40

We are pleased to announce OSG Software versions 3.3.14 and 3.2.40.

Both 3.3.14 and 3.2.40 include:
* CA certificates based on IGTF 1.75

Changes to OSG 3.3.14 include:
* HTCondor 8.4.8: bug fixes for bosco, schedd crash, memory leak using python
* GlideinWMS: improved efficiency, works with any HTCondor version
* HTCondor-CE 2.0.7: Add htcondor-ce-bosco sub-package
* BLAHP 1.18.21: SLURM improvements
* gridFTP 7.30-1.2: adler32 checksum support, fix deadlock
* osg-configure 1.4.1: improved configuration of bosco, GUMS
* xrootd-voms-plugin 0.4.0: added support for 'all' group selection
* osg-system-profiler 1.4.0: detect unconfigured trustmanager
* gridFTP-HDFS 0.5.4: fixed ability to list/remove empty directories
* cvmfs-config-osg 1.2.5: use new CVMFS fall-back policies
* bigtop-utils: Fix default JAVA_HOME to prevent crash in hdfs utils
* osg-voms 3.3-3: Remove voms-admin (EL7 only)

Release notes and pointers to more documentation can be found at:

Need help? Let us know:

We welcome feedback on this release!

News and Announcements from OSG Operations > Registration open for free Workflows Workshop

Registration is now open for a free Workflows Workshop to be held August 9-10 at multiple institutions across the country. Sponsored by the Blue Waters sustained-petascale computing project, this workshop will provide an overview of workflows and how they can enhance research productivity.  

A general session on the value of workflows will be followed by presentations and hands-on sessions with six different workflows.  The objective is to assist the community in understanding the capabilities of these various workflows and to get people started with their usage. These include:
- General overview of workflows; Why use them?, presented by Scott Callaghan, University of Southern California
- Copernicus, presented by Peter Kasson, University of Virginia
- Galaxy, presented by Dave Clements, Johns Hopkins University
- Makeflow/WorkQueue, presented by Nicholas Hazekamp, University of Notre Dame
- Pegasus, presented by Karan Vahi and Mats Rynge, Information Sciences Institute
- RADICAL Cybertools, presented by Shantenu Jha, Rutgers University
- Swift, presented by Mike Wilde, Argonne National Laboratory

The presentations will be followed by a question and answer period to address questions from the community. Additional information on the workshop is available at

The sites hosting this workshop include:
Georgia State University
NCSA, University of Illinois at Urbana-Champaign
Michigan State University
Oklahoma State University
Purdue University
Stanford University
Texas Tech University
University of Kentucky
University of Houston
University of Utah
University of Wyoming

You may register for this workshop through the XSEDE User Portal at: by August 2, 2016. There is a registration button for each site; be sure to select the site where you will be attending.

The Blue Waters sustained-petascale computing project is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications.