Virtualization, Docker, Baremetal and more!


T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
If you've read some of my threads, you've probably noticed that the software we create and use runs on bare metal (or virtual machines, for that matter), and we've never been concerned about idle processing time/power because when we're processing data (which can be 24/7, all year) we run near capacity.

However, we're now looking at expanding the processing software into 3-5 other areas as well. Now that the virtualization bug has bitten, and I've played around with it over the last few months and learned more, I'm wondering what you guys think of something like this.

Servers/VMs/#s = estimates

# of bare-metal servers: 30
# of servers running a hypervisor: 6 (ESXi, Xen, etc.)

If we have, say, 10TB of data to analyze, we load up the queue with our jobs, let the BM and VM 'servers' take the jobs, process, and move on... It works great; we've been doing this for years locally and with Rackspace and DO "clouds" (VMs).

It's rather easy to do this with a single type of 'worker job' as far as load balancing / queue management goes, but when we start integrating other, unrelated software, all we care about is whether a machine is in use or not, and if it's not, we can use it.

This is rather simple to do with the APIs at RS, DO, Linode, and eventually VMware (I hope; I'm reading up more on this)... we just roll out a VM from the worker image, and off we go. However, we're not going to manage and run hypervisors on every bare-metal server.
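
(For reference, spinning up a worker at DO is just one API call against a saved snapshot. A rough sketch; the size slug and image ID are placeholders:)

# Create a droplet from a pre-built "worker" snapshot (size/image values are made up)
curl -s -X POST "https://api.digitalocean.com/v2/droplets" \
  -H "Authorization: Bearer $DO_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "worker-01", "region": "nyc3", "size": "2gb", "image": 12345678}'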

So, that leaves me thinking about using Docker or ??? and a load balancer.

As you can probably tell, virtualization, Docker, etc. are still new to me :) But I definitely see them as the future for eating up as many resources as I can and saving money.

Opinions, ideas, thoughts?

All of the "workers" analyze some sort of data, be it raw text, HTML, JavaScript, images, etc., and then spit the output out to the storage server, AWS, local storage and then a zip file, etc...
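
(The output step is nothing fancy; roughly this, with the bucket name made up:)

# Package this run's output and push it to object storage
OUT=results_$(date +%Y%m%d_%H%M).zip
zip -r "$OUT" ./output/
aws s3 cp "$OUT" s3://example-results-bucket/jobs/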
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
Another random idea: is all the software Linux-based? If it were, you could just set up a standard HPC cluster, submit jobs into the scheduler/queue, and have them run on available nodes. Most HPC systems are designed to run heterogeneous jobs, sometimes mixing and matching to fully use all the cores on a node (non-exclusive mode) in order to get complete fill. What you describe could be done pretty easily in that format, and you wouldn't suffer any of the penalties of a hypervisor.
 
  • Like
Reactions: whitey and T_Minus

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Chuckleb said: Another random idea: is all the software Linux-based? If it were, you could just set up a standard HPC cluster, submit jobs into the scheduler/queue, and have them run on available nodes. ...
Yes, it's all linux based at this time.

This interests me, please go on.
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
We use Warewulf for the cluster management and Torque and Maui for the scheduling and queueing.

What's nice is that we can build the nodes to be network-booted, no HDDs per node. All of our software is built out of a shared tree and our users load up the software they need, submit the jobs using qsub specifying the resources needed (# nodes, # cores, etc), and then the resource manager checks the validity of the job and shoves it out.

There are lots of walkthroughs on how to submit jobs (google for "cluster job submission").

Here's a typical setup:

My staff
- Sets up the cluster (go Dell C6100s!), all connected over IB (not necessarily needed for serial jobs, but it can give you 40Gb to storage).
- They build compilers, math libraries, etc. up in /software/modules
- They configure environment modules to load the correct versions of everything

Users
- module load the correct versions of libs and compilers they need
- build their specific software
- qsub the submission script
- a) job runs right away, or b) waits in queue until resources free
- watches output (or waits for output to be emailed)
- logs in and cries because their stuff didn't work

In your case, you'd probably build everything and store centrally, accessible from all nodes. You (or your users) can then follow some simple scripts to submit work. With a good user pool, you can be like me and be at 99% utilization 24x7... well that's because their jobs are parallel and run for weeks. ;)
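
For a flavor of what those submission scripts look like, here's a rough sketch of a Torque/PBS job script (queue name, paths, module versions, and resource numbers are just placeholders):

#!/bin/bash
#PBS -N analyze_chunk
#PBS -q batch
#PBS -l nodes=1:ppn=4,walltime=04:00:00
#PBS -j oe

# Start in the directory the job was submitted from
cd $PBS_O_WORKDIR

# Load the software environment the admins built
module load gcc/4.9.1 openmpi/1.5.5-gcc-4.9.1

# Do the actual work; the input file is handed in at submit time
./analyze --input "$INPUT_FILE" --output /scratch/$USER/out

You'd submit it with something like: qsub -v INPUT_FILE=/data/chunk01 analyze.sh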

If you're interested, we could do a call or PMs too. There are lots of little specifics. Also, if your cluster is small, there are some commercial support options that package up a full config and work well.
 
  • Like
Reactions: T_Minus

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
@Chuckleb wow, appreciate all the information you've unloaded onto me :)

Time to do some learning, and I'll be in touch for sure! Thanks again for providing more info / options.
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
I guess I would also say that virtualization is great for isolation, keeping a mix of things running across the pool of hardware. Clusters are for when you want to run similar things on multiple machines all the time, or when you're concerned about performance.

Now, with something like Docker, you can get the best of both worlds if you can package your software to run in a container, but there is overhead since all communications pass through the network stack; I've seen a performance hit of 5-10% depending on what I'm trying to do.

Another advantage of a container is that you can get back to a known point in the software lifecycle, knowing that a certain configuration will work. To accomplish the same thing on a cluster, you use Environment Modules, which basically load the software environment up to a known state. This is why we build all of our software, from the compilers all the way down the software stack, and pin things to specific version numbers. We save all source code and installers so that we always have access, and we record all the build strings for the software. We only install from repositories for basic things or as a last resort.

Here's an example of a modules config:

#%Module######################################################################
##
## PETSc with OpenMPI
##

# Help message
proc ModulesHelp { } {
    puts stderr "\tThe PETSc module\n"
}

# Global variables
set sys      [uname sysname]
set mach     [uname machine]
set hostname [uname nodename]

# Switch statement to figure out what to deploy on each cluster
switch -glob $hostname {
    cluster1.xxxx.edu {
        module load openmpi/1.5.5-gcc-4.9.1
        setenv PETSC_DIR /software/paros/petsc/3.4-openmpi-1.5.5-gcc-4.9.1
        setenv PETSC     /software/paros/petsc/3.4-openmpi-1.5.5-gcc-4.9.1
        setenv HYPRE     /software/paros/hypre/2.8.0b/gcc-openmpi-1.5.5
        setenv ACML      /software/x86_64/acml/4.4.0/gfortran64
        setenv HYPRE_V   2.8.0b
        system echo "Loaded: PETSc 3.4 with ACML 4.4.0, Hypre 2.8, GCC 4.9.1"
    }
    cluster2.xxxx.edu {
        module load openmpi/1.5.5-gcc-4.7.0-fix hypre/2.9.0b-openmpi-1.6.5-gcc-4.8.2
        set PETSC_HOME /software/aegean/petsc/3.4-openmpi-1.5.5-gcc-fix
        setenv PETSC_DIR $PETSC_HOME
        setenv PETSC     $PETSC_HOME
        setenv HYPRE     /software/aegean/hypre/2.9.0b
        setenv ACML      /software/x86_64/acml/4.4.0/gfortran64
        system echo "Loaded: PETSc 3.4.5 with ACML 4.4, Hypre 2.9b, GCC 4.9.1"
    }
}

This is probably why I can work with the Linuxbench stuff so well ;).
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
How do Torque/Maui/Moab compare to, say, StarCluster? I have a bit of experience with StarCluster but found it less than reliable, and reliability is important since our "jobs" are worth $ each and can never be lost/dropped. If Torque/Moab are a big improvement, then I should look into them. We have our own scheduler now, but I'm a bit worried about it as we grow beyond 20K CPU cores.

Chuckleb said: We use Warewulf for the cluster management and Torque and Maui for the scheduling and queueing. ...
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
I haven't compared schedulers, as most of the HPC systems out there use either SGE, SLURM, or Torque with Maui/Moab combinations. TACC uses SLURM for resource management, while Blue Waters uses Torque/Moab, I think. These systems are pretty large... BW is at about 450K cores, I think, and some of the scheduling you can do on them now is pretty amazing. On BW you can do topology-aware submissions; since they use a torus config for their interconnects, keeping jobs spatially local is important. You can do backup schedulers and whatnot as well to help with reliability.

I'd definitely give the Torque/Maui/Moab world a shot to see how it runs for you. I know it can schedule thousands of jobs at once, because one of our workflows spawns loads of single-core jobs. And the big HPC systems are very heavily scheduled. We've never had any jobs fail to submit and run on any of the HPC systems.

If you want to see one of the neatest visualizations of job packing, check out BW's top 10 jobs:
Blue Waters User Portal | Torus Viewer

Just looked up StarCluster. That's designed for cloud cluster compute, interesting; I'll have to look into it for other projects. If you run your own hardware, you'll want to do the Torque (or SLURM) / Maui (or Moab) setup, I think.

For cluster distributions, Warewulf and Rocks are the two that we are most familiar with. With Rocks, you can package "rolls" and distribute that way. Warewulf is a bit more flexible though, for updates and everything.
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Chuckleb said: I haven't compared schedulers, as most of the HPC systems out there use either SGE, SLURM, or Torque with Maui/Moab combinations. ...
I'll definitely give Torque/Moab a try. Our use case is slightly off the HPC norm since we schedule tens of thousands (and growing) of relatively small jobs at once, each of which spawns ~5-20 subjobs and takes 0.1-10 minutes. But other than that quirk, it's a scheduler problem through and through: a wide variety of jobs that vary 100x in size and complexity, a cluster of many, many nodes, a pile of different services, and a strong need to run everything as close to 100% CPU as possible.
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
dba said: I'll definitely give Torque/Moab a try. Our use case is slightly off the HPC norm since we schedule tens of thousands (and growing) of relatively small jobs at once, each of which takes 0.1-10 minutes. ...

Sounds along the lines of the different jobs we see too.

Some jobs may take seconds but there could be 1 million of them (I have that workload waiting for me!!!), or there may be 10 jobs that take 1 hour each with 2 CPUs but could use more if more CPU is available -- that's what I'm hoping for :) Or the 1 million jobs may spawn 10 million other jobs, etc...

Looking forward to reading up!

My "big goal" is to take our current projects and their jobs, centralize them, and make it VERY EASY for us to "add new jobs" to the system, so we can expand the 'software' without having to re-work or program something custom for job scheduling for each 'unique' job. A config file per job, created once, and off we go sounds good :) Think many, many "micro jobs" that we want to be able to 'add' to this new "system".
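
From what I've read so far, job arrays look like the closest fit for those piles of micro jobs; a hypothetical sketch (the file layout and script names are made up):

#!/bin/bash
#PBS -N micro_jobs
#PBS -l nodes=1:ppn=1,walltime=00:10:00
#PBS -t 0-9999

cd $PBS_O_WORKDIR
mkdir -p results

# Each array index grabs one line from a pre-built manifest of inputs
INPUT=$(sed -n "$((PBS_ARRAYID + 1))p" inputs.txt)
./process_one "$INPUT" > results/${PBS_ARRAYID}.out

One qsub of that script and the scheduler turns it into 10,000 queued tasks.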
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
Yeah, I think for your scales, Moab is what you want. My clusters are simple so the open source Maui works but we don't do 100,000 jobs normally. That's why the big guys use Moab.

With these schedulers, you can do the big runs and use the backfill queue for the tiny runs, getting maximum fill. Some HPC centers charge less for backfill to encourage this type of job balancing.
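
As a concrete (hypothetical; queue names vary by site) example of that split:

# Big parallel run goes to the normal queue
qsub -q batch -l nodes=8:ppn=16,walltime=72:00:00 big_run.sh

# Tiny single-core jobs go to the backfill queue to soak up leftover cores
qsub -q backfill -l nodes=1:ppn=1,walltime=00:15:00 tiny_job.sh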
 

dba

Moderator
Feb 20, 2012
1,477
184
63
San Francisco Bay Area, California, USA
Chuckleb said: Yeah, I think for your scales, Moab is what you want. ...
When I do give it a try, can I reach out to you for a bit of advice and guidance?
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Just to be sure: what is a job, and what is a task within a job? Are they the same, and does it matter?

(Made-up basic project)
A user creates a project on the web front-end and uploads 50 documents for their new project, then elects to have all 9-letter words removed from their documents. On the front end this is 1 job/project, but with scheduling software is this considered 50 jobs for 1 user?
 

Chuckleb

Moderator
Mar 5, 2013
1,017
331
83
Minnesota
@dba: sure, I can ask my guys for help too :)

@T_Minus: it doesn't matter; you could break it up either before submission or after it is handed to a node. It's up to how you do the parallelism.

For example, say we have a request to make a map to find pirate gold. The user specifies the region, and we get back the IDs of the images we need to process (align edges, color correct, cloud detect, etc.). From here, we can either stitch the map and then process the images, or break the images into n chunks. If we take the first route, we would submit a job doing this:

Collect images
Stitch
Process

If we break it up on the front end, it would be:
Spawn n jobs
Process image subset
Return to parent
Stitch

In the first case, if the software were parallelized, it could run n threads and complete faster than in the second case, where you submit n jobs each using a thread or a core.

Some HPC systems give priority to the first type of job, where you want to submit big jobs and use all the cores on the machine for one big task, while the latter is better for backfill since you know exactly what you need and can micro-allocate the job. The amount of overhead and complexity is about the same; it all depends on how well parallelized your task is and what requirements it has.

The latter model actually works well for Hadoop and distributed computing (SETI@home, Condor, etc.).

For your example, either model works, but the second may be the easiest.
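
If you go the second route, the submit side can be as simple as a loop plus a dependent stitch job; a rough sketch (the chunk files and script names are made up):

#!/bin/bash
# Submit one processing job per image chunk, collecting the job IDs
JOBIDS=""
for chunk in chunk_*.list; do
    jid=$(qsub -v CHUNK="$chunk" process_chunk.sh)
    JOBIDS="$JOBIDS:$jid"
done

# The stitch job only starts once every processing job has finished OK
qsub -W depend=afterok$JOBIDS stitch_map.sh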

Another thing to remember when using multicore machines is to bind processes to cores at run time; you gain performance from fewer context switches and cache misses. That goes even for serial, independent jobs, since the OS may move things around.
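
For example (the core numbers are arbitrary):

# Pin a serial worker to core 3 so the OS doesn't bounce it around
taskset -c 3 ./worker --input part03.dat

# Or bind both CPU and memory locality on a NUMA box
numactl --physcpubind=3 --membind=0 ./worker --input part03.dat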
 

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Thanks @Chuckleb

Right now our custom queue/job engine works like this, with Redis and Python workers:

- Front-end: Jobs are created by uploading data and selecting which priority queue
- Job creates queue for image subset/data to be processed
- Worker nodes (for this specific job) are polling for new jobs in the queue
- Worker nodes takes job, marks off in queue
- Worker completes job, updates [whatever/db/file/etc] & notifies user
- Job is removed from Queue, and images are gone from worker queue due to being processed
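
(Roughly, each worker is just a blocking pop loop. The real workers are Python, but this is the shape of it in shell with redis-cli; the key names are made up:)

#!/bin/bash
# Simplified shape of one worker: block on the queue, process, report back
while true; do
    # BLPOP returns the key name and the value; keep only the value (the job ID)
    job=$(redis-cli blpop jobs:pending 0 | tail -n 1)

    if ./process_job "$job"; then
        redis-cli lpush jobs:done "$job" > /dev/null
    else
        redis-cli lpush jobs:failed "$job" > /dev/null
    fi
done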

Of course we have monitoring, status checks, and other systems in place. It's simple and it works, but systems sit idle when their "specific jobs" aren't in. Because of how we've coded it as we grew from 1 project to 6+ now, without a central management system, each "type of job" has a different queue, different servers, etc. to use, which is very inefficient (yet still much cheaper than any SaaS VM rental).


Also, depending on the code, we have to fork certain workloads, so certain applications are getting re-written to scale better. But also, like you mentioned in example #1, breaking the processing apart into smaller pieces that can be done at once / on idle CPU, etc. is the ultimate goal.

Docker is appealing because of the existing management options, but it's still 100% new to me and sounds like 1 or 2 steps up from where I am now, and your suggestion sounds 10x better and like the ultimate goal for the hardware I'm sitting on.
 

J Hart

Active Member
Apr 23, 2015
145
100
43
44
I randomly stumbled into this thread while browsing around.

We also have our production HPC cluster running Torque with Moab, but I've been looking at building an alternative system using Mesos. Why change things when Torque works? Torque only works when you play by Torque's rules. The code base is a disaster; it is hard to even get it to build on anything other than CentOS. Anyway, for me the biggest problem with Torque is that it becomes a management nightmare at some point. If you have a very heterogeneous workload, you end up with hundreds of libraries of all different versions which you have to be able to load and unload, because people have applications that are all built against different APIs. Every new version of a piece of software requires us to fiddle around trying to select the right combination of parts to add into a module so people can actually use it. I would love for things to be in containers instead, because then we'd have the whole program, including the libraries, wrapped up in a neat little package. The users can build a package, test it, push it to the repo, and schedule jobs against it.

Anyway, I've gotten this system to mostly work. The system I have put up uses CoreOS on bare metal for workers and (at the moment) 1 master node running etcd, ZooKeeper, Mesos, and Chronos. I start up Mesos and the NFS mounts using fleet. Jobs are submitted and scheduled via Chronos for the moment; I'm still working on a Mesos framework for batch. Chronos picks an appropriate node or group of nodes and they grab the Docker containers needed. Permanent data is on the NFS mounts.
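
To give a feel for it, a Chronos job is just JSON POSTed at its REST endpoint; a rough sketch (the hostname, port, schedule, and image are placeholders, and field details may vary by Chronos version):

curl -s -X POST http://chronos.example.local:4400/scheduler/iso8601 \
  -H "Content-Type: application/json" \
  -d '{
        "name": "nightly-batch",
        "command": "/opt/run_batch.sh",
        "schedule": "R/2015-07-01T02:00:00Z/PT24H",
        "epsilon": "PT30M",
        "owner": "hpc@example.edu",
        "container": {
          "type": "DOCKER",
          "image": "registry.example.local/batch-worker:latest"
        }
      }'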
 

TheNewbGuy123

New Member
Sep 1, 2022
2
1
3
I just stumbled on this one, and it's touching on things I'm trying to work with as part of a new job ... curious if anyone else works in this space?

For me, I'm working for a school within a university faculty, and we have an aging HPC cluster currently running Slurm with Lmod for managing environment modules. There are a number of both pure-compute (VM) and vGPU resources, with jobs typically AI/ML related. They can either be quick (initial phases or development) or take days/weeks to complete.

Unfortunately, most staff don't care for the complexity of batch scheduling and have been looking at alternatives like GCP. Those that are using the system want to engage with it interactively (typically using Jupyter notebooks), which ties up resources. The most frequently requested feature is Docker, so we're looking at how to re-design the HPC environment to improve ease of use and deliver what staff want.

I've never set up a Kubernetes cluster before, so I'm working through that at the moment and should have a test bed set up by this weekend (exposing vGPU resources to those containers is another challenge, of course...), but I'm really struggling with the idea of batch scheduling on Kubernetes. The default kube-scheduler doesn't seem as comprehensive as, say, Slurm, and I'm not sure where to go with it. I just finished watching a YouTube clip on Kueflux, which aims to address the shortcomings, but it's all a little overwhelming at the minute.
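
For anyone following along, the closest built-in primitive I've found so far is the batch/v1 Job; a minimal sketch (the names, image, and GPU line are placeholders for my test bed):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: test-batch
spec:
  completions: 50      # total tasks to run
  parallelism: 4       # how many run at once
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.edu/ml-worker:latest   # placeholder image
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1   # requires the NVIDIA device plugin
EOF

It covers retries and fan-out, but out of the box there's nothing like Slurm's queueing and fair-share, which is where the add-on schedulers come in.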

What's everyone else out there doing, and what are the pain points?
 
  • Like
Reactions: T_Minus

T_Minus

Build. Break. Fix. Repeat
Feb 15, 2015
7,640
2,058
113
Wow, old thread :) It's crazy how far things have come, eh @Monoman :eek: Maybe in 2023 I'll have time to get back to THE SYSTEM :D