napp-it issues as of late

nosense

New Member
Mar 15, 2022
9
0
1
@gea, Currently I am unsure whether napp-it email alerts are working reliably. I've had on-and-off problems with napp-it email alerts in the distant past (OmniTI days), which I mitigated by writing some critical jobs independently of napp-it until a napp-it update eventually fixed the built-in email. Now there appear to have been major changes. I have TLS email enabled, but the menu no longer offers that option and it no longer appears to be honored, although it still shows up in the job's opt3 field. Whenever I do an email test through napp-it it fails, because the test now only uses your standard email script. I've tried to follow your email flow, but with all the duplicate email scripts and options I gave up. Besides, any patches I make there get overwritten by any napp-it update.

My mail provider prefers port 465 TLS, but will also accept port 587 STARTTLS. I know the Perl Net::SMTP::TLS module you use has been deprecated for many years, and maybe that's the problem. Anyhow, I also use a Perl script for my external critical jobs, based on the regular Net::SMTP module you use for standard email, because Net::SMTP has had TLS support for years. I just set the options SSL => 1 and Port => 465 and it is currently working correctly. It also works with Port => 587 followed by a starttls() call.
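For reference, the two transport styles can be exercised from the shell with openssl s_client, independently of any Perl module (a minimal sketch; the hostname is a placeholder):

```shell
# pick the right openssl s_client arguments for a given submission port:
# 465 = implicit TLS (smtps), 587 = explicit STARTTLS, anything else plain
smtp_tls_args() {
  host="$1"; port="$2"
  case "$port" in
    465) echo "-connect $host:465" ;;
    587) echo "-connect $host:587 -starttls smtp" ;;
    *)   echo "-connect $host:$port" ;;
  esac
}

# usage (placeholder host; not run here):
#   openssl s_client $(smtp_tls_args smtp.example.com 465) -quiet
```

If the handshake succeeds here but napp-it's test mail still fails, the problem is in the mail script rather than the provider or network.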

So I guess what I'm asking is more of a request, if Net::SMTP::TLS is indeed broken: can the napp-it email system be standardized on Net::SMTP, with an option for port and protocol? Net::SMTP appears to work with all major email providers, would even allow non-standard ports, and would seem to simplify your email handling scripts.

Also, I seem to have had many issues lately with OmniOS/napp-it, and I am aware of many of the changes under the hood. I've mitigated several of the issues by using http instead of https after seeing some of your recent comments. I've also upgraded from r151038 to r151040, but it's too early to tell whether that hurt or helped.
 

gea

Well-Known Member
Dec 31, 2010
2,816
974
113
DE
I am not aware of general problems with TLS email alerts. You need to install the SSL modules (see napp-it // webbased ZFS NAS/SAN appliance for OmniOS, OpenIndiana and Solaris : Downloads), then enable TLS in menu Jobs > TLS mail, set name + password in About > settings and create an alert job. For Gmail you must additionally allow "less secure apps" in the Gmail settings.

As an alternative you can use push alerts via a web API for SendinBlue (smtps), Pushover or Telegram. With napp-it Pro you can also use reports, where the send mechanism can be set per job.

mini_httpd (the webserver underneath napp-it) no longer supports https on the newest OmniOS 151040. This is why I am on the way to switching to Apache in the newest 22.dev.
 

nosense

Thanks gea, but I thought it was implied that I do have the modules installed and updated, since I use the same modules myself. I actually have a script that automatically updates the Perl modules any time OmniOS is updated. Also, as I explained, the enable-TLS option in menu Jobs > TLS mail no longer exists in napp-it! This is on multiple systems, so there is a disconnect between what you are saying and what is happening. Again, parts of napp-it think I have TLS enabled, other parts don't, and there is no option to enable/disable it anymore. I'm at a loss.
(screenshots attached)
 

gea

Can you add your napp-it release and selected menu set?

The menu set can be defined in About > settings and can be used to translate or limit menu items. You can switch between the selected set and en in the top-level menu to the right of Logout.

Use "en" for a complete set of menus.
 


nosense

Release 21.06a7
Settings are at en
There is no option to switch between menu sets to the right of Logout.
This is on 3 machines.

I did switch back to version 19.12b14 on one machine, and all menu options (incl. TLS email) are there, but there is still no option to the right of Logout. The other two machines previously had only 18.12 (I think w), and that version is useless on r151038 or r151040.

(screenshots attached)
 

nosense

@gea, I forgot to check email functionality again earlier, but I can now report that napp-it email does work in release 19.12b14 on r151038. The only change was backtracking from release 21.06a7.
 

gea

On a default setup you can always switch menu sets, e.g. from sol (Solaris) or de (German), to the full en (English) menu set. If there is no menu-switch option, I assume you have renamed /var/web-gui/_my/zfsos/_lib/lang/MY/ to /var/web-gui/_my/zfsos/_lib/lang/MY!, which makes a private menu set mandatory.
 

gea

hmm..
Every folder under /var/web-gui/data/napp-it/zfsos/ represents a napp-it menu item. If there is a Jobs TLS mail menu folder "/var/web-gui/data/napp-it/zfsos/15_Jobs and data services/04_TLS Email/" it is shown, unless the selected menu set hides the item, e.g. in /var/web-gui/_my/zfsos/_lib/lang/MY/about_menus.txt
or in a menu set under /var/web-gui/data/napp-it/zfsos/_lib/lang/
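That folder-to-menu mapping can be checked directly from the shell (a minimal sketch; the long path is the one quoted above):

```shell
# a napp-it menu folder that exists on disk is shown,
# unless the active menu set hides it
menu_state() {
  [ -d "$1" ] && echo "present" || echo "missing"
}

# usage (paths from this thread):
#   menu_state "/var/web-gui/data/napp-it/zfsos/15_Jobs and data services/04_TLS Email"
#   grep -i tls /var/web-gui/_my/zfsos/_lib/lang/MY/about_menus.txt  # hidden here?
```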

As this is the first time I have heard of such a problem, another option is to go back to an earlier BE or reinstall napp-it. This is usually faster than hunting for the cause.
 

nosense

Thanks gea, I did that with the 19.12b14 system, which never had 21.06 on it, and it did the same thing (en enabled but limited menus). Yes, there is a 04_TLS Email folder and it is populated on all machines. Again, this happened on three machines, so it is not a one-off.

Shortly after updating (a few hours) that 19.12b14 machine to 21.06a7, to check whether it would develop the email issue too, OmniOS became unresponsive (NFS working, but console, ssh, smb and napp-it not accessible). That machine had been on r151038 & 19.12b14 since 1/05/22 with no problems, but as soon as I upgraded to 21.06a7 it shortly became unresponsive (no other changes). I went back to 19.12b14 yesterday afternoon and so far no problems. Overall, it has been running since the OmniTI days (2016) without any issues (except napp-it email now and then) until now.

I say all this because of what happened on the machine that started this thread. When I moved it off r151036 & 18.12w7 sometime in early January (I don't have the exact date, because I deleted some of the BEs while trying to resolve the issue with a clean attempt), I initially went to r151038 & 21.06a5, but got the same OmniOS unresponsiveness as above (NFS working, but console, ssh, smb and napp-it not accessible). It would happen sporadically, 2-10 days apart. I then got rid of the new BEs and started over, which resulted in the same issues. I finally tried r151040, and again no change: random unresponsiveness. Currently I have gone back to r151036 & 18.12w7 on this machine, because of the unresponsive OmniOS, to see what happens. It ran on that for over a year with just OmniOS updates and no issues. Hopefully I can re-establish a baseline that works.

On a third machine I have r151038 & 21.06a5, which has had no issues since 12/22/21. However, that machine is just a backup replication repository for the other two machines; it hosts no VMs and runs no jobs except email alerts (which don't work because of the issue above). It basically just fetches nightly snaps.

In all the unresponsive cases there are no faults, ZFS is clean, and there are no crashes. The error logs do show issues after the reboot, but they are varied and all basically relate to smb being unresponsive, which it is (so I am not sure whether the errors indicate the cause or just the effect). I will note, however, that on the machine that first saw the problems, I found I needed to shut down smb/server and smb/client in order to upgrade from r151036 to r151038, and again to r151040. Because of the issues I was having, I had all the hosted ZFS VMs shut down (AIO) to make it a clean OmniOS-only environment when doing the update, and I noticed that the update failed every time exactly when OmniOS complained about the Domain Controller not being found (a normal periodic warning when the DC is offline). Rather than disjoining the domain, I just shut down the smb services to stop the DC-not-found warning, and the updates then proceeded correctly each time. Normally I upgrade OmniOS with everything running and the DC online, then shut down the hosted VMs and reboot during maintenance time, so the DC warning is never an issue; I'm not sure whether the warning would normally cause a problem.

Up until now I thought the unresponsiveness was an OmniOS issue, even though it would be the first noticeable OmniOS issue I've seen since 2016! Now, though, with the above data points, I suspect napp-it, since it is the only common thread that created the same issue on both machines. The third machine doesn't quite fit the story, although it is on 21.06a5 instead of a7, and napp-it doesn't really do much of anything on it. Plus, it is a vintage (2009-era) single-socket Xeon with SATA drives, while the other two machines are mid/new-vintage dual-Xeon SAS systems.

Again, I'm at a loss. I can't seem to get meaningful error info after the fact, I can't access the system during the condition to discover what is going on, and everything I check when things are running looks normal. Sorry for the long post; it's a frustrating situation. I'm investigating DTrace, but have never had to use it before, and it seems you need an idea of where to start before enabling a trace.
 

gea

You can quite easily check whether napp-it is involved in ssh, nfs or smb problems by running /etc/init.d/napp-it stop. This stops the webserver and all background services like monitoring or background acceleration. OmniOS does not need napp-it to function. Usually problems should also be visible in the system or fault logs.

It is unclear to me why smb should hinder an update or why some menus become invisible. After an AD disconnect you need a rejoin or an SMB server restart to function again. As ongoing updates may be part of the problem, have you tried a clean setup of 151038 LTS or 151040 stable?

Untypical problems in particular are often due to a combination of more than one problem, so reducing variations can help. Regarding updates, you need a current napp-it for a current OmniOS. This is due to changes in OmniOS, like Perl releases, OpenSSH or ZFS feature updates, for which napp-it needs ongoing modifications.
 

nosense

@gea I appreciate your checking my sanity!

If I do a napp-it stop, what about autojobs, will they still run? I know that the auto.pl script still runs under crontab. I don't need the webserver (GUI) per se, but I need the autojobs to keep running. I just haven't followed all the dependencies in your napp-it setup.

I don't mean to imply that smb inhibits the menus. AD works fine until smb goes on a sabbatical. The DC error happens when the DC is down and is normal, but it interrupts the console with the warning while the system is updating, and the update quits at that point. Again, I'm not sure whether this is normal or an issue, as I normally don't update this way.

Yes, I've been thinking about a clean install, but I'm usually inclined to uncover the underlying problem, because otherwise it usually crops up again down the road and you end up back at square one, especially in this case, since it is not confined to one system and so does not appear to be a one-off glitch.

Yes, I'm reducing variations now. I am aware of the napp-it/OmniOS version pairing requirements. 19.12b14 appears mostly compatible with all three versions r151036, 38 and 40, and that has worked well enough for some time. Yes, I'm aware that 19.12b14 has some things that don't work, but any I've seen are easily handled via the CLI if I need them.

I am re-baselining both systems to what I know has a good working history and will SLOWLY move forward to see what triggers the issue.

I also want to correct an earlier statement that there were no faults. In fact, the exact same fault occurred on both machines whenever this happens. Basically, I get an OmniOS disk failure:
(screenshot of the fault attached)
This is the SuperMicro SATA DOM boot drive. I discounted it as an effect rather than a cause. Maybe that is in some way wrong, but this drive boots ESXi and contains the initial datastore, which hosts other VMs besides OmniOS. ESXi and the other VMs running off that drive are having no issues, and OmniOS ZFS/NFS keeps working, so the drive did not fail. Again, I thought it nonsensical. The ESXi version has not changed since the issue started, but maybe there is an OmniOS driver issue?

After reviewing my post, I will leave the previous paragraph here in case it jogs someone's experience, but one system ran 2 1/2 months on r151038 & 19.12b14 without issues, and boom, upgrade to 21.06a7 and within hours I get the error above and the system is non-responsive, with no other changes. I'm not suspecting a driver issue for now; the data points indicate the error is effect rather than cause.
 

gea

A non-responsive system often indicates a disk problem, as napp-it reads pool and disk state on every menu load. You can verify problems at the console: a zfs list, zpool list or format command should return data without delay.
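A small timeout wrapper can make that "without delay" check objective (a generic POSIX sh sketch, nothing napp-it-specific):

```shell
# run a command, but report "timeout" if it has not finished within N
# seconds -- a crude hang detector, e.g.: runs_within 10 zpool list
runs_within() {
  limit="$1"; shift
  "$@" >/dev/null 2>&1 &
  pid=$!
  waited=0
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$limit" ]; then
      kill "$pid" 2>/dev/null
      echo "timeout"
      return 1
    fi
    sleep 1
    waited=$((waited+1))
  done
  echo "ok"
}
```

On a healthy pool, `runs_within 10 zpool list` and `runs_within 10 zfs list` should both print "ok"; a "timeout" points at a hung disk or pool.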

Jobs are triggered by cron, which executes auto.pl. This is independent of napp-it. You can disable or enable auto in menu Jobs. This remains valid even if you stop napp-it.
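For reference, the cron hook is an ordinary root crontab entry along these lines (the exact path and interval are illustrative, not verified against a current install; check your own with crontab -l as root):

```
# illustrative root crontab entry that drives napp-it auto jobs
* * * * * perl /var/web-gui/data/napp-it/zfsos/_lib/scripts/auto.pl
```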

btw
A Supermicro DOM is quite good, but by far not as reliable as a good SSD. When I tried one some time ago, it died within a year. I would replace it with an SSD. The OmniOS disaster recovery method is: back up the current BE to a datapool, reinstall a minimal OS, restore the BE, activate it and reboot.

Usually you keep OmniOS as minimal as possible. A clean reinstall (optionally save/restore /var/web-gui/_log/* with the napp-it settings and recreate users with the same uid) is done within half an hour.
 

nosense

@gea, I wanted to provide an update to the situation discussed above (It took a while, but I have found the culprit):

  • I took your suggestion to stop napp-it by running /etc/init.d/napp-it stop, but this did not resolve the problem. Also, the snapshot jobs quit operating when napp-it was stopped this way. So I continued to look elsewhere.
  • I continued to change minor/major versions of ESXi and system settings, minor/major versions of OmniOS and system settings, and minor/major versions of napp-it, but these also did not resolve the problem. In fact, as I progressed to newer releases the problem just happened more often.
  • Again, the problem is that OmniOS eventually becomes non-responsive except for NFS, which continues to work. ESXi and all other VMs, including ones on the same datastore as napp-it, continue to function normally. On one system, over a holiday week, OmniOS eventually panicked, but there were no logs for six days and no crash dump was generated! That's been the problem all along: there are no indications in any logs as to what is happening. Even logs for normal activity are not being generated. You also can't log into OmniOS via ssh or the console to probe when this situation happens. It's been very unpredictable and very frustrating.
  • I even spun up a new system on all-new hardware and manually created an environment and workload similar to two of my other systems. It suffered from the same problem on all versions of OmniOS 151038-151042 and all versions of napp-it 19-21. This system was on Proxmox instead of ESXi, which strongly indicated that the issue was in OmniOS or napp-it.
  • Again, the only way to stop it was to revert to a previous BE with 151036/18.12, except on the Proxmox system, which started on 151038 and had no earlier BE; we just blew that one away and started over.
  • Finally, I decided to dig in and try to actively monitor OmniOS services to see what was struggling. I went into SMF and noticed that napp-it runs as a legacy service. For grins, I disabled napp-it from ever starting via rc?.d edits so that I could cleanly monitor OmniOS by itself, and began monitoring some SMF suspects like smbd. Well, smbd was still running fine after four times the longest interval between any previous non-responsive episodes. I then disabled napp-it from ever starting via rc?.d on the other systems too (without the monitoring) and they also quit having non-responsive issues!
  • Now all systems have been operating for several weeks without any issues. The only thing that changed was keeping napp-it from ever starting!
  • I have created new services to handle all the added jobs that napp-it was running: mainly a monthly scrub and snapshots, along with their retention policies. Since napp-it doesn't have a way to manage retention of snapshots generated elsewhere, I already had a snapshot-retention service set up on a backup unit that just receives snapshots from the other systems, so I added snapshot generation to it. BTW, under napp-it my snapshots used to vary 12-26 seconds from run to run for the same job, and 2-4 seconds between snapshot jobs with the same parameters on separate filesystems. Without napp-it running, all my snapshots have zero variance from run to run or job to job.
  • So while the smoking gun is napp-it, I still don't know exactly what triggers it. Something gets changed during the initial rc?.d startup scripts that is not torn down when /etc/init.d/napp-it stop is run. I also suspect that the snapshot jobs/retention management may cause the napp-it jobs engine to run back onto itself if the previous job cycle has not completed yet (recursion). Either way, I noticed that on the original home systems there are snapshots missing that are still within the retention policy defined in napp-it. Why? Napp-it was the only entity managing them. Those snapshots were originally there and were/are still captured on the backup system.
  • I like napp-it overall; you have put together a fairly comprehensive front end for Solarish-type systems. However, I think the GUI functions need to be separated out, along the lines of something like Windows Admin Center, where the GUI just reads and sets things and only monitors while open. Any recurring jobs or persistent monitoring should go into lean/efficient/proper SMF-managed services (i.e. with heartbeat and recursion monitoring). Any changes to jobs, or viewing of that persistent monitoring, can then be done via the GUI when it is opened.
  • I'd appreciate your thoughts on any short-term fixes, and on the longer term.
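The retention handling mentioned above can be sketched as a small filter around zfs list (a sketch under the assumption that snapshots are listed oldest-first, as `zfs list -s creation` produces; the dataset name and keep count are placeholders):

```shell
# print all but the newest $1 snapshot names read from stdin
# (stdin must be sorted oldest-first); the printed names are the
# candidates for zfs destroy
snaps_to_destroy() {
  awk -v keep="$1" '{ lines[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print lines[i] }'
}

# usage sketch (placeholder dataset):
#   zfs list -H -t snapshot -o name -s creation -r tank/data \
#     | snaps_to_destroy 30 | xargs -n1 zfs destroy
```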
 

gea

Thank you for your testing.

Napp-it is basically a copy-and-run application with very minimal OS interaction; it only modifies sudo and PAM settings during setup. This also allows updating/downgrading napp-it versions without trouble. When it is running you see the following services, e.g. via ps axw:

- Apache webserver (most current napp-it) for http/https management on ports 80/143
- mini_httpd for remote management

You can kill these services without further effect on jobs or OS services.

- background acceleration and monitoring functions

You can kill these Perl background jobs, or disable them entirely in napp-it (top menu "mon" and "acc"), without further effect on jobs or OS services.

- cron job to start auto.pl and jobs

You can disable this cron job in napp-it menu Jobs > Auto Service. When it is enabled, jobs are started even when the napp-it web services are not running or have been stopped via /etc/init.d/napp-it stop.


What I would suggest:
1. Disable Mon and Acc, as these background jobs can generate a lot of load.
With some hardware problems they may become a problem themselves.

If this does not help:
2. Stop the napp-it web services (Apache and/or mini_httpd).

OS services and jobs (if cron and auto.pl are enabled) are not affected and should continue to operate.



btw: There was a problem report with the Intel Skylake-E/C620 series chipset.

Is this the case?
 

nosense

@gea

Thanks gea,

Maybe you missed it in my last post, but running /etc/init.d/napp-it stop causes snapshots, and presumably snap retention, to stop working. Maybe this is where the problem lies. This was with auto.pl still enabled (it is now disabled in cron, because I am handling the jobs myself).

Also, Mon and Acc are disabled/off and have been off for years.

Additionally, mini_httpd is not a listed process on my systems as you indicate. Rather, ports 81/82 are owned by /var/web-gui/data/tools/httpd/napp-it-mhttpd.

So yes, I have already tried all your suggestions and the problem persists. I am also concerned with the discrepancies that keep popping up between our discussion and what is actually happening on my systems [mini_httpd does not run as a separate process, the ports are owned by napp-it-mhttpd; when /etc/init.d/napp-it stop is run, autosnaps stop, they do not continue to run as you indicate]. Why am I seeing something completely different?

Again, even though I could live without the snapshots temporarily, running /etc/init.d/napp-it stop did not resolve the problem, so I initially assumed the problem lay somewhere other than napp-it. I only found out that it was indeed napp-it after disabling it completely in rc?.d so that I could focus on OmniOS. I am now trying to step through/limit napp-it features during startup to narrow down the trigger.

Are all these your rc?.d scripts?
/etc/rc2.d/S20sysetup
/etc/rc2.d/S81dodatadm.udaplt
/etc/rc2.d/S89PRESERVE
/etc/rc2.d/S98napp-it-poef
/etc/rc3.d/S99napp-it

I will add that once my machines left 151036/18.12, it was not possible to go back to either of those versions: the problem persisted even when the versions were downgraded. The only way to get back to stable was to boot a previous BE saved before any of the napp-it or OmniOS upgrades were made, OR, as I've said, OmniOS works fine on all versions 151038/151040/151042 as long as napp-it is blocked from ever starting.
Example: with OmniOS 151038 + napp-it 21.06 running at startup, the system becomes non-responsive in 3-36 hrs. Reboot the same system but run /etc/init.d/napp-it stop immediately (~3 min) after startup, and the system still becomes non-responsive in 3-36 hrs., plus autosnaps fail to run. Lastly, on the same system, comment out the rc?.d scripts listed above and reboot, and the system runs normally for 2+ weeks, as do two additional systems now that napp-it has been disabled from starting up.
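For anyone repeating this, the rc?.d disabling can be done reversibly by renaming the links rather than deleting them (a sketch; paths are the napp-it links discussed in this thread, and init only runs scripts whose names start with S or K):

```shell
# disable init links by prefixing an underscore; rename back to re-enable
disable_rc_links() {
  for f in "$@"; do
    if [ -e "$f" ]; then
      mv "$f" "$(dirname "$f")/_$(basename "$f")"
    fi
  done
}

# usage (napp-it links as listed earlier in this thread):
#   disable_rc_links /etc/rc2.d/S98napp-it-poef /etc/rc3.d/S99napp-it
```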
I have a copy of the VM with napp-it still enabled, to troubleshoot now that I have the primary systems stable and updated again. The problem is that it cannot really do the things the primary system it copied does, as those resources are owned by the primary system, so at best it mimics the original.
 

gea

napp-it init scripts:

/etc/init.d/napp-it-poef (needed for pools on encrypted files)
/etc/init.d/napp-it (start/stop napp-it)

The rc links define the startup order and the runlevel at which to start the scripts.

Napp-it earlier than 22.06 uses mini_httpd on ports 81/82, but https on port 82 no longer works due to OpenSSL changes in current OmniOS.

Current napp-it 22.06/22.dev therefore adds Apache on ports 80/143 for http/https.

If you stop napp-it, all running napp-it processes are stopped. This is needed to allow an up/downgrade. Cron and auto.pl will start new jobs on schedule.

Newer OmniOS releases need a newer napp-it that supports the newer Perl, or you will see a Tty.io error in some menus like User. Besides that, up/downgrade is possible.


Have you disabled the mon and acc background agents in napp-it?
 