News and Announcements from OSG Operations > 2017 Open Science Grid All-Hands Meeting Registration Now Open!

OSG All Hands Meeting 2017 - San Diego Supercomputer Center - University
of California San Diego

Registration is now open for the All-Hands Meeting of the Open Science Grid,
hosted by the San Diego Supercomputer Center (http://www.sdsc.edu/), March 6-9,
2017, in La Jolla, CA.

Topics to be discussed include:

* Cyberinfrastructure partnerships: university research computing and HPC
centers, XSEDE XD providers, DOE laboratories, NSF Large Facility
computing organizations, and commercial cloud providers. Technologies,
strategies, ideas, and discussions on how OSG can foster partnerships
across the widest possible range of CI.

* How high throughput computing accelerates research, and how OSG can help
users scale up.

* Usability challenges and solutions for distributed high throughput
computing applications.

* Connecting virtual organizations, campus researchers and XSEDE users to
the OSG: command line, science gateways, and workflow frameworks.

* Training and education workshops.

* Serving more of the "long tail" of science with high throughput parallel
computing: incorporating multi-core, GPU and virtual cluster resources
into science workflows using shared and allocated distributed
infrastructure.

* Advanced network analytics services for national Science DMZ
infrastructure.

As has been the custom, the 2017 OSG AHM will be co-located with the U.S.
Large Hadron Collider (LHC at CERN) computing facility meetings.

Logistical information, registration and agenda are available at
https://www.opensciencegrid.org/AHM2017 (This will redirect to the
eiseverywhere.com domain.)

News and Announcements from OSG Operations > GOC Service Update - Tuesday, January 24th at 14:00 UTC

The GOC will upgrade the following services beginning Tuesday, January 24th at 14:00 UTC. The GOC reserves 8 hours in the unlikely event that unexpected problems are encountered.

OASIS
* Update cvmfs packages on oasis and oasis-replica to the latest version
* Update frontier-squid on oasis-replica to the latest version
* Enable garbage collection on oasis-replica for the osgstorage.org repositories
* Remove requirement in the oasis-replica install process that external repository servers are responding

Web Services
* Update all software packages associated with OSG Web Pages.

All Services
* Operating system updates; reboots will be required. The usual HA mechanisms will be used, but some services will experience brief outages.

Derek's Blog > Singularity on the OSG

Singularity is a container platform designed for use on computational resources. Several sites have deployed Singularity for their users and for the OSG. In this post, I will provide a tutorial on how to use Singularity on the OSG.

About Singularity

Singularity enables users to have full control of their environment. This means that a non-privileged user can “swap out” the operating system on the host for one they control.

Singularity can provide users with environments other than what is installed on the system. For example, suppose you have an application that installs easily on Ubuntu, but the system you are running on is RHEL6. In that case, you can create a Singularity image of Ubuntu, install the application into it, and then start the image on the RHEL6 system.

Creating your first Singularity (Docker) image

Instead of making a Singularity image as described here, we will create a Docker image and then use that image in Singularity. We are using a Docker image for a few reasons:

  • If you already have a Docker image, you can use that same image with Singularity.
  • If you are running your job on a Docker-encapsulated resource, such as Nebraska's Tier 2, then Singularity is unable to use its default images, because it cannot acquire a loopback device inside the container.

Creating a Docker image requires root or sudo access. It is usually done on your own laptop or on a machine that you own and to which you have root access.

Docker has a great page on creating Docker images, which I won't repeat here. A simple Docker image is easy to create by following the detailed instructions linked above.
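To give a rough sketch of that workflow, building and publishing an image usually comes down to a couple of commands; the Dockerfile, image name, and Docker Hub account below are hypothetical, so substitute your own:

# build the image from the Dockerfile in the current directory
docker build -t myuser/my-python-app:v1 .

# log in and push the image to Docker Hub
docker login
docker push myuser/my-python-app:v1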

Once you have uploaded the Docker image to Docker Hub, be sure to keep track of the name and version you will want to run on the OSG.

Running Singularity on the OSG

For a Singularity job, your job's executable starts the Docker image using Singularity on the worker node.

The submit file:

universe = vanilla
executable = run_singularity.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
Requirements = HAS_SINGULARITY == TRUE
output = out
error = err
log = log
queue

The important aspect is HAS_SINGULARITY in the requirements: it ensures the job only matches remote nodes that have the singularity command available.
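As an optional sanity check (just a sketch; it assumes you can run condor_status against the pool you submit to), you can list the execute slots that advertise this attribute:

condor_status -constraint 'HAS_SINGULARITY == TRUE' -autoformat Name

Only machines that advertise HAS_SINGULARITY will match the requirements expression above.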

The executable script, run_singularity.sh:

#!/bin/sh -x

# Run the singularity container
singularity exec --bind `pwd`:/srv  --pwd /srv docker://python:latest python -V

The option --bind `pwd`:/srv binds the current working directory into the Singularity container, while --pwd /srv sets /srv as the working directory when the container starts. The last arguments, python -V, specify the program that will run inside the Docker image; its output should be the version of Python installed in that image.

You can submit this job the normal way:

$ condor_submit singularity.submit

The resulting output should state what version of Python is available in the docker image.
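If you want to check on the job and its result, something like the following works (the file name out comes from the submit file above):

# watch the job in the queue
condor_q

# once it finishes, look at the captured stdout
cat out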

More complicated example

The example singularity command above is very basic: it only starts the image and runs the python inside it. Below is another example that runs a Python script brought along with the job. In this example we transfer an input Python script to run inside Singularity, and we bring back an output file that was generated inside the container.

#!/bin/sh -x

singularity exec --bind `pwd`:/srv  --pwd /srv docker://python:latest /usr/bin/python test.py

The contents of test.py are:

import sys
stuff = "Hello World: The Python version is %s.%s.%s\n" % sys.version_info[:3]

f = open('stuff.blah', 'w')
f.write("This is a test\n")
f.write(stuff)
f.close()

Also, it is necessary to modify the submit file to add a new line before the queue statement:

transfer_input_files = test.py

This tells HTCondor to transfer the input file test.py to the worker node along with the job.
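For reference, the complete submit file for this example would then look like this (the original submit file above with the new line added):

universe = vanilla
executable = run_singularity.sh
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
Requirements = HAS_SINGULARITY == TRUE
transfer_input_files = test.py
output = out
error = err
log = log
queue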

When the job completes, you should have a new file in the submission directory called stuff.blah. It will have the contents (in my case):

This is a test
Hello World: The Python version is 2.7.9

Conclusion

Singularity is a very useful tool for software environments that are too complicated to bring along for each job. It provides an isolated environment where the user can control the software, while using the computing resources of the contributing clusters.


News and Announcements from OSG Operations > Announcing OSG Software version 3.3.20

We are pleased to announce OSG Software version 3.3.20.

Changes to OSG 3.3.20 include:
- HTCondor 8.4.10*: Running under SELinux should work now, plus other bug fixes
- gratia-probe 1.17.2: Improved control over whether local jobs are reported to OSG
- Updated to XRootD 4.5.0
- Updated gridftp-hdfs to enable ordered data
- osg-configure 1.5.4: Further updates to support ATLAS AGIS
- Ensure HTCondor-CE gratia probe is installed when installing osg-ce-bosco
- Updated to VOMS 2.0.14
- Completed conversion of packages to use systemd-tmpfiles on EL 7

Changes to the Upcoming Repository include:
- Updated to HTCondor 8.5.8*
- Added Singularity (version 2.2) as a new, preview technology
- Updated to frontier-squid 3.5.23-3.1, a technology preview of version 3

*NOTE: When updating or installing HTCondor on an EL 7 system with SELinux
enabled, make sure that policycoreutils-python is installed before HTCondor.
This dependency will be properly declared in the HTCondor RPM in the next
release.
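For example, on an EL 7 system with SELinux enabled this might look like the following sketch (assuming yum and the condor package name used in the OSG repositories):

# install the SELinux policy utilities first
yum install policycoreutils-python

# then install or update HTCondor
yum install condor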

Release notes and pointers to more documentation can be found at:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/Release3320

Need help? Let us know:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/HelpProcedure

We welcome feedback on this release!

News and Announcements from OSG Operations > Emergency Service Downtime for OSG Connect, CI Connect, and Stash Data Services

OSG Connect, CI Connect and Stash data services will be in an emergency downtime today due to a data center issue.  Apologies for the inconvenience.

Sincerely,
The OSG User Support Team

News and Announcements from OSG Operations > GOC Service Update - Tuesday, January 10th at 14:00 UTC

The GOC will upgrade the following services beginning Tuesday, January 10th at 14:00 UTC. The GOC reserves 8 hours in the unlikely event that unexpected problems are encountered.

OIM
* Fix a bug with displaying CILogon as the cert signer

OASIS
* Add space to the data filesystem
* Add an extra disk volume to the oasis machine in preparation for the upgrade to EL 7
* Update cvmfs packages on oasis and oasis-replica to the latest version
* Update frontier-squid on oasis-replica to the latest version
* Enable garbage collection on oasis-replica for the osgstorage.org repositories
* Change the ligo.osgstorage.org configuration to not require lsb_release

News and Announcements from OSG Operations > OSG TWiki service restored

The problem causing instability on the TWiki has been addressed and the service has been restored to normal operation. Please contact us if you encounter any difficulties or unusual behavior. We apologize for any inconvenience and wish you a happy holiday.

News and Announcements from OSG Operations > Instability on OSG Twiki

Service instability on twiki.opensciencegrid.org was observed starting around 09:30 EST. This is under investigation; expect some intermittency until the service is fully restored. We apologize for any inconvenience.

News and Announcements from OSG Operations > Holiday Greetings from OSG Operations

Dear OSG Collaborators,

OSG Operations would like to wish you a happy holiday season and a joyous new year!

We'd like to thank the OSG resource providers and users worldwide who have helped us provide more than 1.3 billion hours of compute time this calendar year (see http://display.grid.iu.edu/ for a full summary). OSG Operations will continue to provide round-the-clock coverage for operational emergencies over the holiday season. You can continue to open tickets at https://ticket.grid.iu.edu/, send us mail at goc@opensciencegrid.org, or call us at +1 (317)-278-9699. However, we will be operating on holiday procedures beginning on Monday, December 26th, and resuming regular staffing on Tuesday, January 3rd. During this time non-emergency issues will be handled on a best-effort basis.

Operations Weekly Calls will resume on Jan 9.

Thanks to everyone for another successful year! We look forward to working with everyone in 2017!

Erik Erlandson - Tool Monkey > Converging Monoid Addition for T-Digest

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-tac-toe", Sussman replied. "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play", Sussman said. Minsky then shut his eyes. "Why do you close your eyes?" Sussman asked his teacher. "So that the room will be empty." At that moment, Sussman was enlightened.

Recently I've been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can "add" them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or Map-Reduce.

In the original Dunning/Ertl paper, they describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination. The clusters of the two input sketches are collected together, then randomly shuffled, and inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:

```scala
import scala.util.Random.shuffle

def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // randomly shuffle input clusters and re-insert to a new t-digest
  shuffle(ltd.clusters.toVector ++ rtd.clusters.toVector)
    .foldLeft(TDigest.empty)((d, e) => d + e)
}
```

I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently from either the individual inputs or the nominal underlying distribution.

To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated a t-digest sketch of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov-Smirnov D-statistic, which yields a distance between two cumulative distribution functions F and G: D = sup_x |F(x) - G(x)|. (Code for my data collections can be viewed here.) I ran multiple data collections and subsequent cumulative sums, and used those multiple measurements to generate the following box-plot. The result was surprising and a bit disturbing:

[plot1: box-plot of KS D-statistics for cumulative t-digest sums versus the underlying N(0,1) distribution, using randomized cluster insertion]

As the plot shows, the t-digest sketch distributions are gradually diverging from the underlying "true" distribution ~N(0,1). This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension any parallel sketching based on combining the partial sketches on data partitions in map-reduce-like environments.

Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest. Combining input clusters from largest to smallest would be like building the combined distribution from the middle outwards, toward the distribution tails. In the case where one t-digest had larger weights, it would also somewhat approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:

```scala
def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // insert clusters from largest to smallest
  (ltd.clusters.toVector ++ rtd.clusters.toVector).sortWith((a, b) => a._2 > b._2)
    .foldLeft(TDigest.empty(delta))((d, e) => d + e)
}
```

As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side-by-side, the results were surprising:

[plot2a: KS D-statistics for randomized insertion versus largest-to-smallest insertion, side by side]

As the plot demonstrates, not only was large-to-small insertion more stable, its D-statistics appeared to be getting smaller instead of larger. To see if this trend was sustained over longer cumulative sums, I plotted the D-stats for cumulative sums over 100 samples:

[plot2: KS D-statistics for cumulative sums over 100 samples]

The results were even more dramatic; these longer sums show that the standard randomized-insertion method continues to diverge, but in the case of large-to-small insertion the cumulative t-digest sums continue to converge towards the underlying distribution!

To test whether this effect might be dependent on particular shapes of distribution, I ran similar experiments using a Uniform distribution (no "tails") and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition doesn't seem to be sensitive to shape of distribution.

I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition. As the code above shows, it is easy to swap one implementation for the other; one or two lines of code may be sufficient. I hope this idea may be useful for other implementations in the community. Happy sketching!

Appendix: Plots with Alternate Distributions

[plot3: results for the Uniform distribution (no tails)]

[plot4: results for the Exponential distribution (one tail)]


News and Announcements from OSG Operations > Planned Retirement of OSG BDII


OSG Operations and Technology are planning the retirement of the BDII
information service located at is.grid.iu.edu on March 31st, 2017. We have
been working with WLCG, ATLAS, and CMS to remove dependencies or to replace
the functionality within our HTCondor Collector service. This work is
still ongoing. This message is to alert you to the upcoming deprecation
date and to gather feedback on any other dependencies on the BDII that
might still exist.

If you are dependent in any way on the OSG BDII, or on information the OSG
BDII supplies to the WLCG or EGI BDIIs, please contact us at
goc@opensciencegrid.org.

Condor Project News > HTCondor used by Google and Fermilab for a 160k-core cluster ( December 15, 2016 )

At SC16, the HTCondor Team, Google, and Fermilab demonstrated a 160k-core cloud-based elastic compute cluster. This cluster uses resources from the Google Cloud Platform provisioned and managed by HTCondor as part of Fermilab's HEPCloud facility. The following article gives more information on this compute cluster, and discusses how the bursty nature of computational demands is making the use of cloud resources increasingly important for scientific computing. Find out more information about the Google Cloud Platform here.

News and Announcements from OSG Operations > Announcing OSG Software version 3.3.19

We are pleased to announce OSG Software version 3.3.19.

Changes to OSG 3.3.19 include:
- Update HTCondor-CE to provide data needed by the ATLAS AGIS system
- Provide a way for Gratia to avoid reporting local, non-OSG jobs*
- Provide better hold messages when the Job Router does not route a job
- Update several packages to better integrate with systemd on EL7
- Make osg-configure more robust when writing Gratia probe configuration files
- Update to frontier-squid 3 in the upcoming repository

*NOTE: Earlier versions of the Gratia probes required modifications to the
HTCondor configuration. See the release notes to revert those changes.

Release notes and pointers to more documentation can be found at:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/Release3319

Need help? Let us know:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/HelpProcedure

We welcome feedback on this release!

News and Announcements from OSG Operations > OSG Connect downtime (Dec 19 - Dec 26)

To prepare for the upcoming year, we will be taking a one-week downtime of OSG Connect, starting Monday, December 19th, and running through Monday, December 26th.

We aim to accomplish two major service upgrades:
- Update of Stash to the latest long-term support release of Ceph (0.94.x -> 10.2.x)
- Migration of all home directories to a newer, faster array (40TB)

In addition to the service upgrades above, we will use the downtime to test and benchmark all Connect services and update system software. During this time, all Connect services will be down and users will not be able to log in.

Most importantly, please back up any critical data on Stash. We recommend using Globus to manage this; see the help desk documentation at bit.ly/globus-xfer. We will let you know once the downtime has been completed or if it needs to be extended. As always, please contact us if you have any questions or concerns.

Sincerely,
The OSG User Support Team
https://support.opensciencegrid.org/support/home

Condor Project News > HTCondor 8.5.8 released! ( December 13, 2016 )

The HTCondor team is pleased to announce the release of HTCondor 8.5.8. This development series release contains new features that are under development. This release contains all of the bug fixes from the 8.4.10 stable release. Enhancements in the release include: The starter puts all jobs in a cgroup by default; Added condor_submit commands that support job retries; condor_qedit defaults to the current user's jobs; Ability to add SCRIPTS, VARS, etc. to all nodes in a DAG using one command; Able to conditionally add Docker volumes for certain jobs; Initial support for Singularity containers; A 64-bit Windows release. Further details can be found in the Development Version History and the Stable Version History. HTCondor 8.5.8 binaries and source code are available from our Downloads page.

Condor Project News > HTCondor 8.4.10 released! ( December 13, 2016 )

The HTCondor team is pleased to announce the release of HTCondor 8.4.10. A stable series release contains significant bug fixes. Highlights of this release are: Updated SELinux profile for Enterprise Linux; Fixed a performance problem in the schedd when RequestCpus was an expression; Preserve permissions when transferring sub-directories of the job's sandbox; Fixed HOLD_IF_CPUS_EXCEEDED and LIMIT_JOB_RUNTIMES metaknobs; Fixed a bug in handling REMOVE_SIGNIFICANT_ATTRIBUTES. Further details can be found in the Version History. HTCondor 8.4.10 binaries and source code are available from our Downloads page.

