What is your script when it comes to testing intermittent issues?


gigatexal
Say you've got a RESTful API server, and some users can connect to it just fine -- for them it's rock solid -- while others experience intermittent issues. What would be some things you'd try? I was asked this question in an interview and am interested in what you more seasoned networking folks/admins might do or ask.
 

gigatexal
The first thing I'd do is ask for more information. Second would be: are they using their own implementation or "ours"?

Next I'd segment the working users from the non-working ones to see if there are any clear differences that point to a 'quick fix'.
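
Something as simple as tallying attributes of the affected users against the happy ones usually surfaces the pattern. A rough sketch, with made-up field names standing in for whatever you actually track per user:

```python
from collections import Counter

# Hypothetical records pulled from support tickets / the CRM -- the fields
# (isp, client_version, region) are placeholders for whatever you track.
affected = [
    {"isp": "Comcast", "client_version": "1.9", "region": "us-west"},
    {"isp": "Comcast", "client_version": "2.1", "region": "us-west"},
]
healthy = [
    {"isp": "Verizon", "client_version": "2.1", "region": "us-east"},
    {"isp": "AT&T", "client_version": "2.1", "region": "us-east"},
]

def tally(users, field):
    """Count how often each value of `field` shows up in a group of users."""
    return Counter(u[field] for u in users)

for field in ("isp", "client_version", "region"):
    print(field,
          "| affected:", dict(tally(affected, field)),
          "| healthy:", dict(tally(healthy, field)))
```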

I'd also make sure I'm monitoring the API server every 3 minutes from 3 locations across the U.S.A. with something like WebsitePulse, plus something like New Relic. WebsitePulse can be configured to monitor specific protocols, so I'd watch HTTP/HTTPS externally while also doing server-side monitoring, network monitoring, and overall server latency monitoring.
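
As a sketch of the external-probe idea (the URL and interval here are placeholders, and a service like WebsitePulse would run the same check from multiple locations rather than one box):

```python
import time
import urllib.request

# Minimal external health probe using only the standard library.
API_URL = "https://api.example.com/health"   # placeholder endpoint
INTERVAL_SECONDS = 180                       # roughly "every 3 minutes"

while True:
    start = time.time()
    try:
        with urllib.request.urlopen(API_URL, timeout=10) as resp:
            elapsed = time.time() - start
            print(f"{time.ctime()} status={resp.status} latency={elapsed:.3f}s")
    except Exception as exc:  # timeouts, DNS failures, resets, etc.
        print(f"{time.ctime()} FAILED after {time.time() - start:.2f}s: {exc}")
    time.sleep(INTERVAL_SECONDS)
```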

If it's intermittent issues -- does that mean it works sometimes and not others with the same code, or is it intermittent across different client software, in which case we could nail it down to a software issue on their side?

I've done this exact thing, as we've written APIs for large newspapers all over CA to use.

I approached the question from a more software-centric angle. If you're using pre-made software and it's 100% the same for everyone, I'd jump to the user segmentation and the New Relic data to "see" what's really going on, and run network tests between the user and the server.
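
For the network tests, even something this simple, run from an affected user's machine toward the API host, helps separate a flaky network path from an application problem (host, port, and sample count are placeholders):

```python
import socket
import statistics
import time

HOST, PORT, SAMPLES = "api.example.com", 443, 20  # placeholders

times, failures = [], 0
for _ in range(SAMPLES):
    start = time.time()
    try:
        # Measure pure TCP connect time; no HTTP involved.
        with socket.create_connection((HOST, PORT), timeout=5):
            times.append(time.time() - start)
    except OSError:
        failures += 1
    time.sleep(1)

if times:
    print(f"connects={len(times)} failures={failures} "
          f"min={min(times):.3f}s max={max(times):.3f}s "
          f"mean={statistics.mean(times):.3f}s")
else:
    print(f"all {SAMPLES} connection attempts failed")
```

Repeated connect failures or wild swings in connect time point at the path between that user and the server; clean numbers push the suspicion back onto the application.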

If the API isn't already doing it, I'd also make it start recording the date/time of every request, and ask the affected users to submit trouble tickets with the date/time of the issue so we can check the log files for activity during that window.
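
The correlation itself can be dead simple. A rough sketch -- the log format and file name are made up, so adjust the parsing to whatever the API actually writes:

```python
from datetime import datetime, timedelta

def lines_near(log_path, ticket_time, window_minutes=5):
    """Yield log lines whose leading ISO timestamp falls near a ticket's time."""
    lo = ticket_time - timedelta(minutes=window_minutes)
    hi = ticket_time + timedelta(minutes=window_minutes)
    with open(log_path) as fh:
        for line in fh:
            parts = line.split()
            if not parts:
                continue
            try:
                stamp = datetime.fromisoformat(parts[0])
            except ValueError:
                continue  # skip lines that don't start with a timestamp
            if lo <= stamp <= hi:
                yield line.rstrip()

# e.g. a ticket reporting trouble around 2015-06-01 14:30
for hit in lines_near("api_access.log", datetime(2015, 6, 1, 14, 30)):
    print(hit)
```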

I'd love to hear a more 'network-centric' approach to troubleshooting this.

Thanks for your input T_Minus. From what I gathered, the question implied the same codebase (though I didn't ask whether the stack running the API service was changing, which probably would have earned me more brownie points) and pointed more toward a connectivity issue between the user and the service. I liked your idea about checking trouble tickets and other logs. That's the missing link: I kept getting stuck wondering how I was going to glean anything worthwhile from vanilla logs, but with the added context of "trouble tickets," as you call them, you get enough context around each issue to compare against the log entries. It's also known that most, if not all, of the remaining customers are not reporting any issues, connectivity or otherwise.

I was also surprised that when I asked why the logs were so vanilla to begin with, I got a muted, redirected response. I know you don't put code into production with debugging enabled, since it likely slows everything down, but things like New Relic's process monitoring, and even the instrumentation built into languages like C#, make for a low-overhead monitoring situation, right?
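
To make that concrete, here's the kind of cheap instrumentation I mean -- a rough sketch in Python rather than C#, with a made-up handler and threshold. Only requests that cross the threshold get logged, so the production overhead stays close to zero:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("api.timing")

def timed(threshold_seconds=0.5):
    """Log a handler's name and duration, but only when it runs slow."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                if elapsed >= threshold_seconds:
                    log.info("%s took %.3fs", func.__name__, elapsed)
        return wrapper
    return decorator

@timed(threshold_seconds=0.25)
def get_orders(user_id):        # stand-in for a real API handler
    time.sleep(0.3)             # stand-in for real work
    return {"user": user_id, "orders": []}

get_orders(42)
```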
 

gigatexal
What if the client refuses to work with you on troubleshooting for whatever reason? Say you get all the in-process monitoring going, and then they won't play ball and help generate data. What else could one look at?

I'm thinking that perhaps the app is sending back malformed packets for some reason; could bad packets drop connections? And if this user were a large consumer of resources (say many requests per second) hitting a single API server (you'd likely have a load-balanced cluster, but for the purposes of brainstorming...), I wonder if there's a kernel-level setting that's effectively dropping connections to avoid overload.
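
If I had to chase that kernel-level theory without the client's help, I'd at least look for evidence of it on the server. A rough, Linux-specific sketch -- purely a hypothesis about where drops could come from -- that checks the listen-backlog limits and the kernel's own drop counters:

```python
def read_sysctl(path):
    """Read a single sysctl value from /proc."""
    with open(path) as fh:
        return fh.read().strip()

# Accept-queue and SYN-backlog limits.
print("somaxconn:", read_sysctl("/proc/sys/net/core/somaxconn"))
print("tcp_max_syn_backlog:", read_sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog"))

# /proc/net/netstat holds TcpExt counters as header/value line pairs.
# ListenOverflows / ListenDrops climbing over time means the accept queue
# is overflowing and the kernel really is shedding connections.
with open("/proc/net/netstat") as fh:
    lines = fh.readlines()
for header, values in zip(lines[::2], lines[1::2]):
    if header.startswith("TcpExt:"):
        stats = dict(zip(header.split()[1:], values.split()[1:]))
        print("ListenOverflows:", stats.get("ListenOverflows"))
        print("ListenDrops:", stats.get("ListenDrops"))
```

Malformed application payloads wouldn't normally kill a TCP connection on their own, but resets from the app layer, accept-queue overflows, or an overzealous firewall in between could all produce this "works for some users, not for others" pattern.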