Pegasus news feed > Pegasus Workshop at USC, September 30th

Time: 2:30pm-4:30pm
Date:  Friday, September 30th, 2016
Location: VPD 106 (Verna and Peter Dauterive Hall, UPC).

Instructor:  The USC HPC and Pegasus team
Course Material: https://pegasus.isi.edu/tutorial/usc/

 

The USC HPC and Pegasus team is hosting a half-day workshop on September 30th, 2016 at the USC main campus. The workshop includes a hands-on component that requires an active HPC account. If you don’t have a USC HPC account and want to attend the workshop, the HPC team can now offer temporary HPC accounts for workshop attendees. To be eligible, you must have a USC NetID and must register via the Registration link below. This is a great way to check out HPC and learn about workflows if you do not have an HPC account.

Scientific Workflows via The Pegasus Workflow Management System on the HPC Cluster

Workflows are a key technology for enabling complex scientific applications. They capture the interdependencies between processing steps in data analysis and simulation pipelines, as well as the mechanisms to execute those steps reliably and efficiently in a distributed computing environment. They also enable scientists to capture complex processes to promote sharing and reuse, and provide provenance information necessary for the verification of scientific results and scientific reproducibility.

In this workshop, we will focus on how to model a scientific analysis as a workflow that can be executed on the USC HPC cluster using Pegasus WMS (http://pegasus.isi.edu). Pegasus allows users to design workflows at a high level of abstraction that is independent of the resources available to execute them and of the location of data and executables. It compiles these abstract workflows into executable workflows that can be deployed onto distributed resources such as local campus clusters, computational clouds, and grids such as XSEDE and the Open Science Grid. During the compilation process, Pegasus WMS performs data discovery, whereby it determines the locations of input data files and executables. Data transfer tasks are added to the executable workflow that are responsible for staging the input files in to the cluster and the generated output files back to a user-specified location. In addition to the data transfer tasks, data cleanup tasks (removing data that is no longer required) and data registration tasks (cataloging the output files) are added to the pipeline.

Through hands-on exercises, we will cover issues of workflow composition, how to design a workflow in a portable way, workflow execution and how to run the workflow efficiently and reliably on the USC HPC cluster. An important component of the tutorial will be how to monitor, debug and analyze workflows using Pegasus-provided tools. The workshop will also cover how to execute MPI application codes as part of a workflow.

This workshop is intended for both new and existing HPC users. It is highly recommended that you take the Introduction to Linux/Unix workshop if you haven’t worked in the Linux environment before. The participants will be expected to bring in their own laptops with the following software installed: SSH client, Web Browser, PDF reader. If you have any questions about either of these workshops, please send email to hpc@usc.edu and erinshaw@usc.edu. We look forward to seeing you there!


 


Derek's Blog > Running R at HCC

The full presentation.

There are many methods to run R applications at HCC. I can break these down into:

  1. Creating a traditional Slurm submit file that runs an R script. The vast majority of R users do this.
  2. Using a program, such as GridR, that will create the submission files for you from within R.

In this post, I will discuss and lay out the different methods of submitting jobs to HCC and the OSG. Further, these methods lie on a spectrum of difficulty.

Difficulty spectrum: each step is more difficult than the last. Running R on your laptop is much easier than running R on a cluster, and running R on a cluster is less difficult than running it on the Grid. But there are techniques to bring these closer together.

Creating Slurm submit files

Creating Slurm submit files and writing R scripts is the most common method used by R users at HCC. The steps in this workflow are:

  1. Create a Slurm submit file
  2. Write an R script that will read in your data and output it
  3. Copy data onto cluster from the laptop
  4. Submit Slurm submit file
  5. Wait for completion (you can ask to get an email)

More on the Slurm configuration is available at HCC Documentation.

A submit file for Slurm is below:

#!/bin/sh
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=TestJob
#SBATCH --error=TestJob.stderr
#SBATCH --output=TestJob.stdout
 
module load R/3.3
R CMD BATCH Rcode.R

This submit file describes a job that will run for up to 30 minutes and require 1024 MB of RAM per CPU. Below the #SBATCH lines is the actual script that will run on the worker node. The module command loads the newest version of R on HCC’s clusters, and the next command runs an R script named Rcode.R.
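
To tie the steps above together, here is a minimal sketch of the submit-and-check cycle from the command line. The file names rjob.submit and mydata.csv and the Crane login host are assumptions for illustration; Rcode.R and the TestJob output files come from the submit file above.

# 1. From your laptop: copy the R script, data, and submit file to the cluster
#    (the Crane login host name is an assumption here)
scp Rcode.R mydata.csv rjob.submit <username>@crane.unl.edu:~/project/

# 2. On the cluster login node: submit the job
cd ~/project
sbatch rjob.submit

# 3. Check the queue while the job runs
squeue -u $USER

# 4. After completion: R CMD BATCH writes console output to Rcode.Rout,
#    and the #SBATCH lines above direct the job's output/errors to TestJob.*
less Rcode.Rout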

A parallel submission is:

#!/bin/sh
#SBATCH --ntasks-per-node=16
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --mem-per-cpu=1024
#SBATCH --job-name=TestJob
#SBATCH --error=TestJob.stderr
#SBATCH --output=TestJob.stdout
 
module load R/3.3
R CMD BATCH Rcode.R

This submit file adds --ntasks-per-node and --nodes=1, which describe the parallel job. --ntasks-per-node specifies how many cores on a remote worker node are required for the job. --nodes describes the number of physical nodes that the job should span. All other lines are very similar to the previous single-core submission file.

The R code looks a bit different though. Here is an example:

library(parallel)

a <- function(s) { return (2*s) }
mclapply(c(1:20), a, mc.cores = 16)

This will run mclapply, which applies the made-up function a across the list c(1:20) using 16 cores.

Using GridR to submit processing

GridR is another method for farming processing out to a remote cluster. GridR is able to submit to HTCondor clusters; therefore, it is able to submit to the OSG through HTCondor.

The GridR package is hosted on Github. The wiki is very useful with examples and tutorials on how to use GridR.

Below is a working example of using GridR from HCC’s Crane cluster.

library(GridR)
# Initialize the GridR library for submissions
grid.init(service="condor.local", localTmpDir="tmp", bootstrap=TRUE, remoteRPath="/util/opt/R/3.3/gcc/4.4/bin/R", Rurl="https://www.dropbox.com/s/s27ngq1rp7e9qeb/el6-R-modified_1.0.tar.gz?dl=0")

# Create a quick function to run remotely
a <- function(s) { return (2*s) }

# Run the apply function, much like lapply.  In this case, with only 1 attribute to apply
grid.apply("x", a, 13, wait = TRUE)

# Output the results.
x

This R script submits jobs to the OSG from the Crane cluster. It will run the simple function a on remote worker nodes on the OSG.

The jobs can run anywhere on the OSG:

OSG Running Jobs

Jobs submitted to the OSG can run on multiple sites around the U.S. They will execute and return the results.

Conclusion

There are many methods of submitting R processing to clusters and the grid. One has to choose which best suits them.

The GridR method is easy for experienced R programmers, but it lacks the flexibility of the Slurm submit file method. The Slurm submit method requires learning some Linux and Slurm syntax, but offers the flexibility to specify multiple cores per R script or more memory per job.


News and Announcements from OSG Operations > GOC Service Update - Tuesday, September 27th at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, September 27th at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.

OIM
  * Changes to wording for instructions of command line host cert issuance
  * Changes to DNS query for host cert issuance via CILogon. We will try to resolve the DNS address more than once on failure.

Perfsonar
  * Configuration changes to extend 24-hour recovery limitations

Ticket Exchange
  * Configuration changes
  * GGUS synchronization format update
  * FNAL synchronization format update

OSG Website
  * Routine Wordpress version and plugin updates

All Services
  * Operating system updates; reboots will be required. The usual HA mechanisms will be used, but some services will experience brief outages. Additionally, the primary and backup LDAP and DNS servers used internally will be exchanged to allow maintenance on the original primary server

Pegasus news feed > $1M NSF award for Data Integrity Project


The four-year project, Scientific Workflow Integrity with Pegasus, is funded by a $1 million grant from the National Science Foundation (NSF) as part of its Cybersecurity Innovation for Cyberinfrastructure (CICI) program. Von Welch, director of IU’s Center for Applied Cybersecurity Research (CACR), is the project’s principal investigator.

The Pegasus Workflow Management System is popular among the research community for its ability to easily structure and execute large-scale data analyses. It benefits a wide range of scientific applications, including LIGO (the Laser Interferometer Gravitational-Wave Observatory), which announced the first direct detection of gravitational waves earlier this year, proving that Einstein’s theory was right.

IU will receive nearly half of the grant, $479,855, to increase cybersecurity within Pegasus’s computational science and give scientists added peace of mind by providing the means to validate their data. The remaining half has been awarded to the project’s collaborators: the Renaissance Computing Institute (RENCI) at the University of North Carolina ($230k) and the Information Sciences Institute (ISI) at the University of Southern California ($290k).

By digitally signing the data that is run through Pegasus, these improvements will strengthen consistency in results from multiple workflows. They’ll also allow users to see whether their data has changed since the last time a workflow was completed.

“Scientific data is a key part of scientific workflows and, ultimately, the science project,” said Welch. “By integrating support for data integrity into the popular workflow management tool Pegasus, we increase our trust in computational science in a manner that will be easy for scientists to use.”

Welch and Steven Myers, associate professor at IU’s School of Informatics and Computing, will lead the project team, which includes experts in cybersecurity and virtualization, alongside the Pegasus development team.

One of the challenges of the new project will be to make sure that the cryptography used for ensuring data integrity, such as the digital signatures, will scale appropriately to handle the increasingly large scientific datasets. Myers, an expert in cryptography, will guide the selection, implementation and deployment of the cryptographic systems, making sure they are efficient, and likely to maintain their security over the lengthy time periods scientific data is referenced and used.

“Cryptography can provide strong assurances of data integrity and records of its origin and modifications over the long periods of time that much scientific data is used and must be maintained,” said Myers. “Given the experimental costs of some of this data, having strong assurances is critical, as some groups have definite motive to modify the data, and the experiments are incredibly costly to reproduce if the data’s integrity is questioned.”

Scientists from a variety of disciplines, including astronomy, bioinformatics, earthquake science, gravitational wave physics, ocean science and neuroscience, have used Pegasus to run over 700,000 workflows over the last three years. However, Welch’s team aims to achieve solutions that will be generic enough to apply to other workflow systems and applications and help an even broader scope of researchers.

“I am very excited to work with the IU and RENCI teams to include new and critical data integrity solutions into Pegasus,” said Ewa Deelman, research professor and director at ISI. “The results of this work will benefit a number of science disciplines and will help scientists to have a higher degree of trust in their results and the results shared by their colleagues.”

 

Source: https://itnews.iu.edu/articles/2016/1m-nsf-award-goes-to-iu-led-data-integrity-project.php

 



News and Announcements from OSG Operations > Announcing OSG Software version 3.3.16

We are pleased to announce OSG Software version 3.3.16.

Changes to OSG 3.3.16 include:
* Updated most Globus Packages to latest available from EPEL
* Note: the Globus Toolkit now strictly checks host names against certificates
* BLAHP 1.18.25: Additional features supported for SGE, PBS Pro, and Slurm
* Update to GlideinWMS 3.2.15
* Fixed major scalability problem in GUMS on EL7
* HTCondor-CE 2.0.8: Support for Terena eScience, minor bug fixes
* The MyProxy server now produces RFC compliant proxies
* Fixed load-balancing in Globus GridFTP when using IPv6 addresses
* Added the HTCondor CREAM GAHP for EL7 platforms
* Completed porting components of OSG Software Stack to EL7
* Added RSV GlideinWMS Tester for VO Front-ends to test site support
* Updated lcas-lcmaps-gt4-interface to version 0.3.1
* VO Package v68: Added project8 VO

Release notes and pointers to more documentation can be found at:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/Release3316

Need help? Let us know:

https://www.opensciencegrid.org/bin/view/Documentation/Release3/HelpProcedure

We welcome feedback on this release!

News and Announcements from OSG Operations > Scheduled FermiLab Power Outage

This weekend, Fermilab will have a scheduled power outage in the Feynman Computer Center to repair an automatic power transfer switch. The transfer switch ensures that the lab’s computing services have redundant power. This scheduled outage will cause many services at the laboratory to be unavailable.

The outage date is Saturday, Sept. 17, the same day as a scheduled Wilson Hall cooling outage, and is expected to last more than 8 hours. Services may be affected starting Friday, September 16 at 4:00 PM and are estimated to be restored by Saturday at 6:00 PM, though the outage could last longer. (All times US central) At this time we expect Fermilab email, listserv and analog telephones to be operational.

The Open Science Grid services expected to be affected include the OSG VOMS, Gratia, Indico and DocDB. One (of three) OASIS replicas will be out of service and will be replaced temporarily by another elsewhere. Details provided by FNAL can be found here: https://fermi.service-now.com/kb_view_customer.do?sysparm_article=KB0012205

Updates will be provided via Twitter throughout the outage, so follow the Service Desk at https://twitter.com/FNALServiceDesk to stay informed.

News and Announcements from OSG Operations > Emergency Maintenance - psds0.grid.iu.edu - Wednesday, September 7th from 13:00-14:00 EDT

psds0.grid.iu.edu will be unavailable from 13:00-14:00 EDT while GOC engineers perform maintenance to increase the memory available to the host and resolve an issue with data collection. The GOC regrets any inconvenience this may cause.

News and Announcements from OSG Operations > GOC Service Update - Tuesday, September 13th at 13:00 UTC

The GOC will upgrade the following services beginning Tuesday, September 13th at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.

Ticket, Ticket Exchange, OIM
- Ticket update
- OIM: revocation of DigiCert user certs and host certs
- OIM: message for CMS and ATLAS users
- OIM: change default certificate signer from DigiCert to CILogon

Erik Erlandson - Tool Monkey > Encoding Map-Reduce As A Monoid With Left Folding

In a previous post I discussed some scenarios where traditional map-reduce (directly applying a map function, followed by some monoidal reduction) could be inefficient. To review, the source of inefficiency is in situations where the map operation is creating some non-trivial monoid that represents a single element of the input type. For example, if the monoidal type is Set[Int], then the mapping function ('prepare' in algebird) maps every input integer k into Set(k), which is somewhat expensive.

In that discussion, I was focusing on map-reduce as embodied by the algebird Aggregator type, where map appears as the prepare function. However, it is easy to see that any map-reduce implementation may be vulnerable to the same inefficiency.

I wondered if there were a way to represent map-reduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.

I'll begin by reviewing a standard map-reduce implementation. The following Scala code sketches out the definition of a monoid over a type B and a map-reduce interface. As this code suggests, the map function maps input data of some type A into some monoidal type B, which can be reduced (aka "aggregated") in a way that is amenable to parallelization:

``` scala
// for the parallel apply below; assumes pre-2.13 Scala, where parallel
// collections ship with the standard library
import scala.collection.parallel.ParSeq

trait Monoid[B] {
  // aka 'combine' aka '++'
  def plus: (B, B) => B

  // aka 'empty' aka 'identity'
  def e: B
}

trait MapReduce[A, B] {
  // monoid embodies the reducible type
  def monoid: Monoid[B]

  // mapping function from input type A to reducible type B
  def map: A => B

  // the basic map-reduce operation
  def apply(data: Seq[A]): B = data.map(map).fold(monoid.e)(monoid.plus)

  // map-reduce parallelized over data partitions
  def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.map(map).fold(monoid.e)(monoid.plus)
    }
    .fold(monoid.e)(monoid.plus)
}
```

In the parallel version of map-reduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal B value, followed by a final reduction of those intermediate results. This is the classic form of map-reduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.

Next I will present an alternative definition of map-reduce. In this implementation, the map function is replaced by a foldL function, which executes a single "left-fold" of an input object with type A into the monoid object with type B:

``` scala
// a map-reduce operation based on a monoid with left folding
trait MapReduceLF[A, B] extends MapReduce[A, B] {
  def monoid: Monoid[B]

  // left-fold an object with type A into the monoid B
  // obeys the type law: foldL(b, a) = b ++ foldL(e, a)
  def foldL: (B, A) => B

  // foldL(e, a) embodies the role of map(a) in standard map-reduce
  def map = (a: A) => foldL(monoid.e, a)

  // the map-reduce operation is now a single fold-left operation
  override def apply(data: Seq[A]): B = data.foldLeft(monoid.e)(foldL)

  // map-reduce parallelized over data partitions
  override def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.foldLeft(monoid.e)(foldL)
    }
    .fold(monoid.e)(monoid.plus)
}
```

As the comments above indicate, the left-folding function foldL is assumed to obey the law foldL(b, a) = b ++ foldL(e, a). This law captures the idea that folding a into b should be the analog of reducing b with a monoid corresponding to the single element a. Referring to my earlier example, if type A is Int and B is Set[Int], then foldL(b, a) => b + a. Note that b + a is directly inserting single element a into b, which is significantly more efficient than b ++ Set(a), which is how a typical map-reduce implementation would be required to operate.

This law also gives us the corresponding definition of map(a), which is foldL(e, a), or in my example: Set.empty[Int] + a, or just Set(a).

In this formulation, the basic map-reduce operation is now a single foldLeft operation, instead of a mapping followed by a monoidal reduction. The parallel version is analogous: each partition uses the new foldLeft operation, and the final reduction of intermediate monoidal results remains the same as before.
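
To make the efficiency point concrete, here is a small sketch of my own (not code from the post) instantiating the traits above for Int inputs and a Set[Int] monoid; the names setMonoid and setMapReduce are illustrative assumptions:

``` scala
// instantiate the traits above for A = Int, B = Set[Int]
val setMonoid = new Monoid[Set[Int]] {
  def plus = (x: Set[Int], y: Set[Int]) => x ++ y
  def e = Set.empty[Int]
}

val setMapReduce = new MapReduceLF[Int, Set[Int]] {
  def monoid = setMonoid
  // insert the single element directly, avoiding the intermediate Set(a)
  // that a map-based formulation would construct for every input element
  def foldL = (b: Set[Int], a: Int) => b + a
}

setMapReduce(Seq(1, 2, 2, 3))   // Set(1, 2, 3), computed as one foldLeft
setMapReduce.map(5)             // Set(5), i.e. foldL(e, 5)
```

The classic map-based formulation would have allocated a throwaway Set for every element before reducing; here each element is folded straight into the accumulating set.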

The foldLeft function is potentially a much more general operation, and it raises the question of whether this new encoding is indeed parallelizable as before. I will conclude with a proof that this encoding is also parallelizable. Note that the law foldL(b, a) = b ++ foldL(e, a) is a significant component of this proof, as it represents the constraint that foldL behaves like an analog of reducing b with a monoidal representation of element a.

In the following proof I use a Scala-like pseudocode, with the notation described in the leading comments:

```
// given an object mr of type MapReduceLF[A, B]
// and using notation:
//   f <==> mr.foldL
//   for b1, b2 of type B: b1 ++ b2 <==> mr.plus(b1, b2)
//   e <==> mr.e
//   [...] <==> Seq(...)
//   d1, d2 are of type Seq[A]

// Proof that map-reduce with left-folding is parallelizable,
// i.e. mr(d1 ++ d2) == mr(d1) ++ mr(d2)
mr(d1 ++ d2)
  == (d1 ++ d2).foldLeft(e)(f)                // definition of the map-reduce operation
  == d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)   // Lemma A
  == mr(d1) ++ mr(d2)                         // definition of map-reduce  (QED)

// Proof of Lemma A,
// i.e. (d1 ++ d2).foldLeft(e)(f) == d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)

// proof is by induction on the length of data sequence d2

// case: d2 has length zero, i.e. d2 == []
(d1 ++ []).foldLeft(e)(f)
  == d1.foldLeft(e)(f)                        // definition of empty sequence []
  == d1.foldLeft(e)(f) ++ e                   // definition of identity e
  == d1.foldLeft(e)(f) ++ [].foldLeft(e)(f)   // definition of foldLeft

// case: d2 has length 1, i.e. d2 == [a] for some a of type A
(d1 ++ [a]).foldLeft(e)(f)
  == f(d1.foldLeft(e)(f), a)                  // definition of foldLeft and f
  == d1.foldLeft(e)(f) ++ f(e, a)             // the type law f(b, a) == b ++ f(e, a)
  == d1.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // definition of foldLeft

// inductive step: assume the result for d2' of length <= n, and
// consider d2 of length n+1, i.e. d2 == d2' ++ [a], where d2' has length n
(d1 ++ d2).foldLeft(e)(f)
  == (d1 ++ d2' ++ [a]).foldLeft(e)(f)                              // definition of d2, d2', [a]
  == f((d1 ++ d2').foldLeft(e)(f), a)                               // definition of foldLeft and f
  == (d1 ++ d2').foldLeft(e)(f) ++ f(e, a)                          // type law f(b, a) == b ++ f(e, a)
  == d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ f(e, a)             // induction
  == d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // definition of foldLeft
  == d1.foldLeft(e)(f) ++ (d2' ++ [a]).foldLeft(e)(f)               // induction
  == d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)                         // definition of d2  (QED)
```


Miha's Blog > Interview with Monika Madlen Vetter, Ph.D.




Monika Madlen Vetter, PhD, works in the Department of Ecology and Evolution at the University of Chicago. Marco Mambelli and I interviewed her as part of our "Going out of the door" strategy to learn what users of the R statistical project do. The product we delivered in beta is Bosco R. We wanted to discover the data scientist, but Dr. Vetter is much more than just a data scientist. I am always looking for what fascinates people, and her story is amazing. She explains science in simple words, but the genomic study of plants is a complex science, normally hidden from the casual observer.

How important is the usage of statistics in your research?

MV: Very important!
My work has two components: conducting an experiment in the laboratory or green house and later analyzing the gathered data. I use R for most of my statistical and graphical analysis. More specifically, I use a method called genome wide association (GWA) study to identify the genetic loci controlling the interactions between plants and bacteria. 

What is a GWA?

MV: A genome-wide association study aims to identify genes controlling the variation in a trait.   Most traits, however, are complex – most diseases for instance. Many genes contribute to hypertension or diabetes in humans. Statistical methods help elucidate these complex genetic traits. Scientific knowledge progressed a lot since the first human genome was sequenced in 2000. We have begun to understand the genetic basis of many diseases using genome-wide association studies.

You work on the innate immunity of plants. We live in a world where particle physics dominates the headlines; this is not a widely covered theme in the media.

MV: *laughs* Yes, if I talk about my research, many people react with surprise when realizing that plants actually DO HAVE an immune system. Plants cannot run away to escape pathogens, which constantly threaten their survival and reproduction. They do not have antibodies and we therefore often describe the plant immune system as simple. Yet, it does a pretty good job, which is evident by a green world around us.
I investigate the evolution of innate immunity in plants. Immune receptors of the plant model species Arabidopsis thaliana recognize molecular signals which are unique to bacteria. The perception of these signals triggers a general and effective defense response, but is also accompanied by a reduction in plant growth. My current work identified several genetic loci which control these growth changes upon stimulation of the immune system. Another project investigates how plants shape the bacterial community within their leaves.

What is your biggest challenge on a daily basis?

MV: [thinking a bit] Perhaps the biggest challenge is to stay focused on the problem and one specific research question. So many interesting possibilities and questions distract me. I guess having many new thoughts and a creative mind is also what makes a good scientist.

Does your work move in the direction of pharmaceutical research?

MV: I am especially fascinated by how plants modulate their immune responses and growth in response to biotic and abiotic environments, but my work does not directly aim at developing an application or product. Basic knowledge does lead to innovation in the long run. A crop breeder might use this knowledge to make a plant more resistant to pests while maintaining yield, for instance.

What motivated you to select this career?

MV: I like to get to the bottom of things and I was interested in plant biology early in my childhood. My parents would have liked me to be a physician but I could not get around cutting someone open – even for the prospect of helping them. I was interested in lichens instead. Three totally different organisms come together to create a form of life with properties which none of them has by itself. How cool is that!

All creatures struggle to let in nutrients and vent wastes. We know a lot at the human level. What about the plant level?

MV: We declare waste as unwanted materials but one’s waste is another’s necessity. The photosynthesis of plants produces sugars from water and sunlight. What they release – their waste so to speak, is oxygen, which is crucial to most other life forms on the planet. Otherwise plants do not consume living matter so they do not have unwanted by-products they would need to get rid of.
Heavy metals can be a problem in plants. They either need an excretion system or a high tolerance when growing in soil contaminated by heavy metals such as cadmium, arsenic, mercury or lead. If the plant accumulates those metals, humans can harvest and dispose of the plants to clean the soil. However, it can also be a problem for human health if we eat these plants. Some plants accumulate heavy metals to become resistant to herbivores. There is a lot of fun research ongoing.
In terms of nutrients, plants struggle just as much as other organisms. Their growth will be limited if they lack certain minerals. You might know that from a ‘sad looking’ plant on your windowsill. It might not get all nutrients from its regular water supply. You need to fertilize or re-pot it, too.

How has the University of Chicago stimulated your work?

MV: The University of Chicago supplies fantastic research facilities, helps with bureaucracy and provides a stimulating research environment. My co-workers come from diverse  (biological) disciplines, which leads to different viewpoints and lively discussions.

What would be in your opinion the biggest achievement as a scientist?


MV: *laughs* Perhaps I am not idealistic enough to think that my research can solve grandiose problems of humanity. However, my research has relevance to food safety, pathogen resistance and stability of yield in crops. On a smaller scale, I am happy to share my passion about biological processes with students or laypeople.

August 8, 2013 in Chicago

Erik Erlandson - Tool Monkey > Supporting Competing APIs in Scala -- Can Better Package Factoring Help?

On and off over the last year, I've been working on a library of tree and map classes in Scala that happen to make use of some algebraic structures (mostly monoids or related concepts). In my initial implementations, I made use of the popular algebird variations on monoid and friends. In their incarnation as an algebird PR this was uncontroversial to say the least, but lately I have been re-thinking them as a third-party Scala package.

This immediately raised some interesting and thorny questions: in an ecosystem that contains not just algebird, but other popular alternatives such as cats and scalaz, what algebra API should I use in my code? How best to allow the library user to interoperate with the algebra library of their choice? Can I accomplish these things while also avoiding any problematic package dependencies in my library code?

In Scala, the second question is relatively straightforward to answer. I can write my interface using implicit conversions, and provide sub-packages that provide such conversions from popular algebra libraries into the library I actually use in my code. A library user can import the predefined implicit conversions of their choice, or if necessary provide their own.
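
As a rough illustration of that mechanism, here is a minimal sketch of my own (not from the post): a hypothetical internal trait MyMonoid, plus a sub-package object that derives it from a third-party instance, assuming cats' Monoid with its empty and combine members. All names here are illustrative.

``` scala
// hypothetical internal algebra trait used throughout my library
trait MyMonoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

// shipped as a separate sub-package (e.g. "mylib-cats") so the core
// library carries no cats dependency
object CatsConversions {
  // derive my internal type class from any in-scope cats.Monoid
  implicit def fromCats[A](implicit m: cats.Monoid[A]): MyMonoid[A] =
    new MyMonoid[A] {
      def empty = m.empty
      def combine(x: A, y: A) = m.combine(x, y)
    }
}

// a library user who prefers cats just imports the matching conversions:
// import CatsConversions._
// and any code requiring an implicit MyMonoid[A] now resolves via cats.
```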

So far so good, but that leads immediately back to the first question -- what API should I choose to use internally in my own library?

One obvious approach is to just pick one of the popular options (I might favor cats, for example) and write my library code using that. If a library user also prefers cats, great. Otherwise, they can import the appropriate implicit conversions from their favorite alternative into cats and be on their way.

But this solution is not without drawbacks. Anybody using my library will now be including cats as a transitive dependency in their project, even if they are already using some other alternative. Although cats is not an enormous library, that represents a fair amount of code sucked into my users' projects, most of which isn't going to be used at all. More insidiously, I have now introduced the possibility that the cats version I package with is out of sync with the version my library users are building against. Version misalignment in transitive dependencies is a land-mine in project builds and very difficult to resolve.

A second approach I might use is to define some abstract algebraic traits of my own. I can write my libraries in terms of this new API, and then provide implicit conversions from popular APIs into mine.

This approach has some real advantages over the previous. Being entirely abstract, my internal API will be lightweight. I have the option of including only the algebraic concepts I need. It does not introduce any possibly problematic 3rd-party dependencies that might cause code bloat or versioning problems for my library users.

Although this is an effective solution, I find it dissatisfying for a couple reasons. Firstly, my new internal API effectively represents yet another competing algebra API, and so I am essentially contributing to the proliferating-standards antipattern.


Secondly, it means that I am not taking advantage of community knowledge. The cats library embodies a great deal of cumulative human expertise in both category theory and Scala library design. What does a good algebra library API look like? Well, it's likely to look a lot like cats of course! The odds that I end up doing an inferior job designing my little internal vanity API are rather higher than the odds that I do as well or better. The best I can hope for is to re-invent the wheel, with a real possibility that my wheel has corners.

Is there a way to resolve this unpalatable situation? Can we design our projects to both remain flexible about interfacing with multiple 3rd-party alternatives, but avoid effectively writing yet another alternative for our own internal use?

I hardly have any authoritative answers to this problem, but I have one idea that might move toward a solution. As I alluded to above, when I write my libraries, I am most frequently only interested in the API -- the abstract interface. If I did go with writing my own algebra API, I would seek to define purely abstract traits. Since my intention is that my library users would supply their own favorite library alternative, I would have no need or desire to instantiate any of my APIs. That function would be provided by the separate sub-projects that provide implicit conversions from community alternatives into my API.

On the other hand, what if cats and algebird factored their libraries in a similar way? What if I could include a sub-package like cats-kernel-api, or algebird-core-api, which contained only pure abstract traits for monoid, semigroup, etc? Then I could choose my favorite community API, and code against it, with much less code bloat, and a much reduced vulnerability to any versioning drift. I would still be free to provide implicit conversions and allow my users to make their own choice of library in their projects.

Although I find this idea attractive, it is certainly not foolproof. For example, there is never a way to guarantee that versioning drift won't break an API. APIs such as cats and algebird are likely to be unusually amenable to this kind of approach. After all, their interfaces are primarily driven by underlying mathematical definitions, which are generally as stable as such things ever get. However, APIs in general tend to be significantly more stable than underlying code. And the most-stable subsets of APIs might be encoded as traits and exposed this way, allowing other more experimental API components to change at a higher frequency. Perhaps library packages could even be factored in some way such as library-stable-api and library-unstable-api. That would clearly add a bit of complication to library trait hierarchies, but the payoff in terms of increased 3rd-party usability might be worth it.


Condor Project News > HTCondor helps with GIS analysis ( August 24, 2016 )

This article explains how the Clemson Center for Geospatial Technologies (CCGT) was able to use HTCondor to help a student analyze large amounts of GIS (Geographic Information System) data. The article contains a good explanation of how the data was divided up in such a way as to allow it to be processed using an HTCondor pool. Using HTCondor allowed the data to be analyzed in approximately 3 hours, as opposed to the 4.11 days it would have taken on a single computer.

News and Announcements from OSG Operations > TWiki Outage Update

We have completed the previously announced restoration of the TWiki to its state as of Monday 15/Aug. The system is behaving normally at this time but we request you contact us if you encounter any unusual behavior.

We are also in the process of recovering changes to content made between Monday 15/Aug and Friday 19/Aug. If you have any content you need restored and would like prioritized, please let us know.

The GOC regrets any inconvenience and is taking steps to ensure this will not recur.

News and Announcements from OSG Operations > TWiki Outage

We are currently encountering difficulties with TWiki and are restoring the service from backup to its state as of Monday 15/Aug. We will attempt to recover changes made after that time and will apprise you of the results of our efforts. The GOC regrets any inconvenience and will inform you as soon as resolution and further information is available.

Pegasus news feed > Pegasus 4.6.2 Release

We are happy to announce the release of Pegasus 4.6.2. Pegasus 4.6.2 is a minor release of Pegasus that includes improvements and bug fixes to the 4.6.1 release.
New features and improvements in 4.6.2 include:
  • support for kickstart wrappers that can setup a user environment
  • support for Cobalt and SLURM schedulers via the Glite interfaces
  • ability to do a local copy of files in PegasusLite to the staging site, if the compute and staging sites are the same
  • support for setting up Pegasus Tutorial on Bluewaters using pegasus-init

New Feature

  • [PM-1095] – pegasus-service init script
  • [PM-1101] – Add support for gsiscp transfers
    •  These will work like the scp ones, but with x509 auth instead of ssh public keys.
  • [PM-1110] – put in support for cobalt scheduler at ALCF
    • Pegasus was updated to use the HTCondor Blahp support. ALCF has a cobalt scheduler to schedule jobs to the BlueGene system. The documentation has details on how the pegasus task requirement profiles map to Cobalt parameters. https://pegasus.isi.edu/docs/4.6.2/glite.php#glite_mappings .
    • To use HTCondor on Mira, please contact the HTCondor team to point you to the latest supported HTCondor installation on the system.
  • [PM-1096] – Update Pegasus’ glite support to include SLURM
  • [PM-1115] – Pegasus to check for cyclic dependencies in the DAG
    • Pegasus now checks for cyclic dependencies that may exist in the DAX or that arise from edges added automatically based on data dependencies
  • [PM-1116] – pass task resource requirements as environment variables for job wrappers to pick up
    • The task resource requirements are also passed as environment variables for the jobs in the GLITE style. This ensures that job wrappers can pick up task requirement profiles as environment variables.

Improvements

  • [PM-1078] – pegasus-statistics should take comma separated list of values for -s option
  • [PM-1105] – Mirror job priorities to DAGMan node priorities
    • The job priorities associated with jobs in the workflow are now also associated as DAGMan node priorities, provided that HTCondor version is 8.5.7 or higher.
  • [PM-1108] – Ability to do local copy of files in PegasusLite to staging site, if the compute and staging site is same
    • The optimization is implemented in the planner’s PegasusLite generation code. When constructing the destination URLs for the output site, the planner checks that
      a) symlinking is turned on, and
      b) the compute site and the staging site for the job are the same.
      This means the shared-scratch directory used on the staging site is locally accessible to the compute nodes, so the file can be copied directly via the filesystem. Instead of creating a gsiftp URL, the PegasusLite wrappers for jobs running on the local site use a file URL.
  • [PM-1112] – enable variable expansion for regex based replica catalog
    • Variable expansion for Regex based replica catalogs was not supported earlier. This is fixed now.
  • [PM-1117] – Support for tutorial via pegasus-init on Bluewaters
    • pegasus-init was updated to support running the tutorial examples on Bluewaters. To use this, users need to log on to the Bluewaters login node and run pegasus-init. The assumption is that HTCondor is running on the login node, either in user space or as root.
  • [PM-1111] – pegasus planner and api’s should have support for ppc64 as architecture type

Bugs Fixed

  • [PM-1087] – dashboard and pegasus-metadata don’t query for sub workflows
  • [PM-1089] – connect_by_submitdir should seek for braindump.txt in the workflow root folder
  • [PM-1093] – disconnect in site catalog and DAX schema for specifying OSType
  • [PM-1099] – x509 credentials should be transferred using x509userproxy
  • [PM-1100] – Typo in rsquot, ldquot and rdquot
  • [PM-1106] – pegasus-init should not allow (or should handle) spaces in site name
  • [PM-1107] – pegasuslite signal handler race condition
  • [PM-1113] – make planner directory options behavior more consistent

 



Pegasus news feed > Soybean Science Blooms with Supercomputers

TACC (Texas Advanced Computing Center) has published a science highlight of the SoyKB project. Pegasus is used to orchestrate the computations running on TACC Wrangler, automatically retrieving and storing data in the CyVerse data store. Also highlighted is how the XSEDE ECSS (Extended Collaborative Support Service) can be used to get scientific workflow support on XSEDE.

Read the full article at:

https://www.tacc.utexas.edu/-/soybean-science-blooms-with-supercomputers

 


News and Announcements from OSG Operations > GOC Service Update - Tuesday, August 23rd

The GOC will upgrade the following services beginning Tuesday, August 23rd at 13:00 UTC. The GOC reserves 8 hours in the unlikely event unexpected problems are encountered.
PerfSonar
Updates to esmond, rsv-perfsonar

Collector, Redirector, Ticket, Ticket exchange, OIM
Rebuild from new content management system
OIM, change default certificate signer from DigiCert to CILogon

Glidein
Upgrade GlideinWMS to 3.2.14.1. VOs planning to run GlideinWMS 3.2.15 on their frontends will require all factories to run >= 3.2.14.1.

WWW
Updates to wordpress

All Services
Operating system updates; reboots will be required. The usual HA mechanisms will be used, but some services will experience brief outages.

