Crunching data on older 4U systems


Maritime

New Member
Nov 20, 2016
Hi guys,

First of all, I'm new to this forum and to the server/HPC world. I'll be grateful for any suggestions and comments on the dilemmas in this enthusiast project, which I hope will eventually become interesting to clients. So far it is entirely on my own budget and in a 'hobby' setting. Also, sorry for the length of the post :)

The problem I am dealing with is the processing of raw files of about 2-4 GB each, in projects that initially were up to 20-80 GB in size. After initial setup, processing a project of this size on an i7-4790K machine with an SSD (4 cores + HT) takes about 10-20 hours. Processing runs through approx. 50 subroutines, of which 25 are highly parallel (all 8 logical cores run at 100%). Only about 10 subroutines are not parallel at all. In terms of time allocation, the parallel subroutines consume about 60-80% of total project time. Each core needs at most 2-2.5 GB of RAM for calculation. Windows environment.

During processing, the initial files are joined and then split into smaller files, which are then used for statistical analysis. Each raw file is split into about 400-800 files (depending on project setup and initial file size). Replacing HDDs with SSDs showed a dramatic time reduction. Upon project completion, the data is 2-2.5x its initial size. Once processed, the whole project can be moved to a 'standard' PC for statistical analysis and visualization.

My idea for speeding things up is to purchase a used 4-socket server (such as the HP DL580 – the cheapest one – or some Dell/IBM equivalent) with E7-4870 processors and 256 GB of RAM. New, fast NVMe SSDs would be perfect EXCEPT that (as far as I know) they can't fit into this generation of servers. In addition, new projects could easily be 1 TB in size (meaning a minimum of 2 TB of disk, which is too expensive if PCIe SSDs are considered). Therefore, I think RAID 0 with standard 500 GB SSDs would be an excellent substitute (each project is completely removed after analysis, so potential errors would be noticed immediately). Going to a newer processor series would be too expensive, and using cheaper processors than the E7-4870 would be counterproductive, because both clock speed and core count are important.

An additional thing I would like to plan for is the possibility of buying an identical 4U server in the future and working on projects with 'double' the processing power. I've read that IBM is capable of this with its InfiniBand, but since HP is much cheaper, can the DL580 be used in the same combination? Does that mean two servers could work on one dataset even without RAID 0 on both units (which would be perfect)?

Loudness and power consumption are not a problem at this stage; I have a spare room for it and would connect through remote desktop.

How do I (wisely) start with this project? The budget could be up to 2500 EUR at the start. I was even thinking about a Xeon Phi, but the software is not written for it and would not benefit from it.

Thx for reading and commenting!
 

Patrick

Administrator
Staff member
Dec 21, 2010
I think you are right on the Xeon Phi not being optimized.

InfiniBand is certainly an inexpensive option, and you can use IB to get a high-bandwidth, low-latency network to a second server.

On the NVMe side, there are PCIe add-in cards that fit easily into existing servers. It is the U.2 form factor that can be a challenge. The other advantage of NVMe is latency, which will be much better than SATA SSDs.

I would imagine optimization of the software would have the biggest impact. When I hear of people getting a 3x performance boost, it is usually software tweaks.

I also think that going quad (older-generation) E7 would be less desirable than newer-generation E5 V3s.
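
To put rough numbers on the core-count question – a back-of-envelope Amdahl-style estimate, assuming ~70% of your current wall time is in the parallel subroutines and that they scale linearly (both loose assumptions):

    # Back-of-envelope Amdahl estimate. If a share s of current wall time
    # is spent in parallel subroutines that already saturate 8 threads,
    # moving to n threads shrinks only that share.
    def remaining_time(s, n, base_threads=8):
        return (1 - s) + s * base_threads / n

    for n in (8, 40, 80):
        print(n, "threads -> %.1fx faster" % (1 / remaining_time(0.7, n)))
    # 8 threads -> 1.0x, 40 threads -> ~2.3x, 80 threads -> ~2.7x

So even 10x the thread count buys roughly 2-3x overall, and per-core speed still matters for the serial 20-40%.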
 

Blinky 42

Active Member
Aug 6, 2015
PA, USA
How much profiling of the software have you been able to do, and is it something built in-house or a 3rd-party product? Do you have insight into what the process/algorithm is doing, to be able to tell what the limiting factors are? The more you know about the way the data is processed, the working-set size at each step, and the mix of IO wait vs CPU wait, the better you will be able to build a box optimized for the processing within budget.

Beyond the easy fixes of going with SSDs and 12G SAS or NVMe storage to maximize the IO potential of the system, a few things to consider (and you seem to be thinking along these lines already):
- Can you split the jobs up across multiple servers using just shared filesystems or do you need some sort of distributed shared memory? How fast must the interconnect be to not be the bottleneck?
- Can you benefit from any vector instructions or advanced synchronization techniques available in newer CPUs?
- Is the software available to run on GPUs, or can it be modified to do so?
- Are you better off adding a lot of memory, or faster memory?
- How big are intermediate files / items in the process if there are any - can you put those in a ramdisk or on the fastest device you have to speed things up?
- Are you better off disabling hyperthreading? Depending on memory usage patterns, the 2nd thread can make things much, much worse for some algorithms that hit main memory in patterns the CPU was not designed to handle. (A quick thread-sweep sketch follows this list.)
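
If the tool can be driven from the command line with a thread-count switch, a crude sweep like this tells you quickly where scaling flattens out (Python sketch; "processor.exe" and "--threads" are placeholders – substitute whatever your software actually accepts):

    import csv
    import subprocess
    import time

    # Placeholder command line - swap in the real executable and
    # thread-count option for your 3rd-party tool.
    TOOL = "processor.exe"

    with open("thread_sweep.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["threads", "wall_seconds"])
        for n in (4, 8, 16, 32, 40, 80):
            start = time.time()
            subprocess.run([TOOL, "--threads", str(n)], check=True)
            writer.writerow([n, round(time.time() - start, 1)])
            f.flush()  # keep partial results if a run dies

Run it on a representative small dataset overnight and plot threads vs wall time – the knee of that curve tells you how many real cores are worth paying for.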

Off the top of my head: if the entire working set plus input and output of the whole process fits in 256G, then get a dual-socket server with 256 or 512G of memory with your 2500 EUR. If that isn't fast enough, then have it serve up a big ramdisk to other worker nodes across 56Gb IB.

Good luck & have fun! It can be quite the learning experience to go through and build up something really optimized for a problem and see massive gains for seemingly minor adjustments.
 

KioskAdmin

Active Member
Jan 20, 2015
Yup. If it's internal software, it's time to optimize. 2500 EUR is enough to get a nice system, as said above.
 

Maritime

New Member
Nov 20, 2016
Thanks for the advice, I can see I'll learn a lot from you.

The software is a 3rd-party solution; I don't have any influence on it, but it is continuously improved (and I would prefer not to mention its name). The current version is optimized, and the amount of data that needs to be processed is the bottleneck.

Regarding your suggestions/questions, replies are below:

@Patrick
1. NVMe was an idea, but the questions are price and whether another disk can be added if the first becomes too small or wears out (it could easily write 1 TB/day). Anyhow, if the project becomes financially positive, it is definitely the way to go in the future.
2. Software optimization – I have no influence on it; I could point out bottlenecks to the authors, but only later, once some serious analysis has been done on 'professional' machines.
3. I can't find anything newer than V1 with 30-40 cores at approx. 2.5 GHz and 256 GB RAM within the budget. Maybe I'm wrong, but I can't imagine how a V3 with 20 cores could offer the same performance as 40 cores at the same clock speed.

@Blinky 42
1. I've done analyses on several datasets of different sizes on different computers. From the software I can only log the start and completion time of each subroutine (which I export to Excel for analysis). The number of fast cores plus a fast disk system definitely accounts for at least 80% of the result (assuming there is enough memory and disk, of course). Over the weekend I'll do a detailed analysis on one dataset. I would like to find some logging software (something that takes a CSV log of processor and disk usage over time), but I couldn't find any – so I've sketched something myself, see below this list.
2. I'm a noob with servers and even more of a noob with multi-server configurations :) The software has an internal option for how many threads it will attack the problem with, so if 2 units could be connected and synchronized to work with double the cores and RAM, I would say it should greatly reduce calculation time on big projects. How much or how fast the interconnect should be, I really don't know.
3. Don't know about vector instructions, since all tests were done only on the Haswell processor line.
4. Can't modify the software, and it can't run on GPUs. Older generations of Xeon Phi are unsupported as well (99.9% certainty). The new one, Knights Mill, might be useful since Windows would recognize its cores easily, but it is just too expensive even to think about (with a questionable outcome at best).
5. The program needs 2.5 GB of RAM per thread; with anything less, the calculation will stop at some point. So, 40 cores without HT = 100 GB, 40 cores with HT = 200 GB minimum. I have previously tested HT on/off but haven't seen any significant difference; will do a more detailed analysis over the weekend. Once the required memory is available, any surplus goes unused – then memory speed comes into consideration :)
6. I've already played with a RAM disk (great suggestion, thx!), but on a small dataset I noticed no significant difference compared to an SSD (which was totally shocking to me). It would be great to put datasets on a RAM disk, but: 20 cores x 2.5 GB/core = 50 GB, plus a 100 GB dataset (250 GB needed during calculation) = 300 GB minimum. For 80 threads (40 + HT) the result would be about 450 GB, and for a 1 TB dataset – 2760 GB :) But yes, 512 GB would theoretically be much more useful than 256 GB, at least for smaller projects that could fit on a RAM disk.
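
In the meantime, here is my own logger sketched in Python with the psutil package (untested on a server; the sampling interval and column choice are my own guesses – adjust as needed):

    import csv
    import time

    import psutil  # pip install psutil

    INTERVAL = 5  # seconds between samples

    with open("usage_log.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "cpu_pct", "read_MB", "write_MB", "ram_used_GB"])
        prev = psutil.disk_io_counters()
        psutil.cpu_percent()  # prime the counter; the first reading is meaningless
        while True:
            time.sleep(INTERVAL)
            cur = psutil.disk_io_counters()
            writer.writerow([
                time.strftime("%H:%M:%S"),
                psutil.cpu_percent(),  # CPU use since the previous sample
                round((cur.read_bytes - prev.read_bytes) / 2**20, 1),
                round((cur.write_bytes - prev.write_bytes) / 2**20, 1),
                round(psutil.virtual_memory().used / 2**30, 1),
            ])
            f.flush()
            prev = cur

Windows' built-in Performance Monitor (perfmon) can apparently log the same counters to CSV with a data collector set, so that may be the zero-install route.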

A noob's question to finish – why does nobody here like 4U systems? :)
 

gbeirn

Member
Jun 23, 2016
Well, if by 4U you mean 4-socket (U is just a measurement of rack height), I would guess that unless we use them at work, they are too cost-prohibitive for home or recreational use. Generally, unless you really need that density, most people would be better served by multiple 2-socket systems.

One thing to consider in your use case is which version of Windows you'll need, based on how many sockets the board has. I personally haven't dealt with more than dual-socket since the Windows 2000 era, which required the Enterprise Server edition ($$$$$).
 

Maritime

New Member
Nov 20, 2016
Yes, sorry, I meant 4U as 4 (processor) units... I have access to W10 Pro and Server 2012 as part of academic licensing. Both should recognize a multiprocessor environment, but the question is how many sockets each will allow...

 

gbeirn

Member
Jun 23, 2016
I know for sure that 10 Pro only supports up to 2 sockets. Server 2012 certainly supports up to 4, 8, or even more sockets, but I don't know the specifics of which edition.
 

MiniKnight

Well-Known Member
Mar 30, 2012
NYC
You need Windows Server. I don't think Essentials can use 4S.

@Maritime let's say you've got a 4x 8-core = 32-core machine. On newer systems you can get 2x 16C, or more like 2x 14C, for similar processing power. NUMA hurts performance when you've got 4 CPUs versus 2 or 1. If you're paying for space and power, 2 CPUs are often cheaper than 4 by $200/mo, so over a year it's cheaper to get 2 processors.

Saying people don't like 4 CPU systems is not true. They just cost a lot to operate and there are licensing costs that become bothersome as well.
 

BigDaddy

Member
Aug 8, 2016
I have a dual E5-2686 V3 setup running Windows 10 with 64 GB RAM, a 950 Pro, and a GTX 1060...
36 cores at an all-core turbo of 2.3 GHz, 72 threads, boosting up to 3.5 GHz. Cinebench is a bit over 3800, I think ~3300 without hyperthreading. Overkill for my usage, but an optimal thread count for yours. Good luck getting this many threads at this IPC for better money. Also find out what instructions your software utilizes; it might affect performance greatly.

Are Server 2016 licences also available? It's nice.

You should be able to get a QS E5-2686 V3 for ~500 USD each; I bought mine when they were cheaper :). They are a custom SKU made for Amazon. The mobo is easy to find – mine was $300. A 1 TB Samsung 960 Pro sells for ~600 and will easily handle 1 TB a day for a very long time. 512 GB 950 Pro refurbs are going for a bit over 200 USD if you need to pinch pennies. My ASRock mobo has onboard video that also works remotely, wake-on-LAN, etc., so there is no need for a video card if you're logging in remotely.

You can easily get 16x 16 GB DDR4 ECC for $1200, maybe a bit less if you are a good shopper. 32 GB LRDIMMs can be had for $190 or less each, so if you want the option of more RAM later, 256 GB can still be had for $1500. If you don't already have a power supply that can run a dual-socket board, you can get a great one for ~100 USD. Noctua makes great coolers for X99; you can find them for ~$50 each. I can't hear my PC run unless I am gaming. With an NVMe SSD and no graphics card, you can stick the build in just about any case.

I have no idea how much tech equipment costs in euros, but according to Google they are currently roughly equal to dollars in value. I see the budget issues. If you get a SAS RAID controller with battery backup for ~70 and a bundle of 400-500 GB drives for ~10 each and put them in RAID 6, you could have a decent 'reliable' storage solution on the cheap. Personally, I'd get a 960 Pro, or RAID two 950s, if the budget can be stretched and the added latency doesn't matter.

Whoever mentioned a ramdisk gets a sticker. This machine with SAS RAID 6 and 512 GB DDR4 ECC would be amazing – and expensive.


You could just buy something like this...
HP ProLiant DL580 G7 4x 10 CORE E7-4870 2.4GHz 512GB RAM 512MB FBWC NO HDD | eBay
Maybe not as fast, but it costs less. The versions with an E7-4850 cost even less, but their performance is also much lower.

Rough parts tally (USD):
- 1000 – 2x CPU
- 300 – motherboard
- 1200 – 256 GB RAM
- 150 – 2 TB SAS RAID array
- 100 – 850 W power supply
If you already have some parts, it'll fit in the budget.

Broadwell isn't much faster than Haswell. Until Knights Landing launches, this would be essentially equal to current-gen tech, barring instruction-set advances.

Hope this helps.
 

Maritime

New Member
Nov 20, 2016
Thx, BigDaddy, for the detailed post.

Core count and instruction set are important things to consider, and I would rather choose a 2-socket solution with 30 newer cores than 40 older ones, BUT the problem is the total investment. I would rather buy this first machine built and ready to use than build it from parts – I just don't have that much time and energy.

Overall, I'm sure both scenarios will be much better than an i7 @ 4.2 GHz :)

I’ve done some testing and results are as follows (same 38 GB dataset, same SSD and 32GB of RAM):
1. i5 4440 (4 threads) = 6h 51min
2. i7 4790k (4 threads) = 6h 21min
3. i7 4790k + HT (8 threads) = 5h 14min

With this dataset, end result is about 75 GB and 4500 small files in total (up to 500kB).

If you check my original post, my idea was exactly what you suggested – E7-4870s resulting in 40 cores / 80 threads, which would be well used in this environment, with an investment of about 2000 EUR including a 1 TB PCIe SSD and 256 GB of RAM.

Now, KNL is another very interesting idea but I would need to test it before purchase…