Network Latency Issue on 2 Servers


Bashed

Member
Apr 29, 2015
36
0
6
74
I'm having strange network latency and crawling-speed issues on a couple of Dell R610 servers. I have multiple other R610s with the same specs, connected to the same Cisco 3750 switch, and they're fine. I've done a lot of troubleshooting and still can't fix the problem, so I'd appreciate some help here. Observium shows no spikes on my network and no aggressive traffic from any other server that would cause bottlenecking of any kind.

1. Tried replacement NIC cards
2. Tried multiple ports on server NIC and Cisco switch
3. Tried Centos 6.8 and Centos 7
4. Upgraded NIC drivers via Dell OMSA
5. Replaced Cat5 cables
6. Rebooted Servers
7. GigE enabled on server and Cisco switch, full duplex
8. Tried both Google resolvers and data center resolvers

Public speed tests, though, are painfully slow:

Code:
[root@localhost ~]# wget -O /dev/null http://cachefly.cachefly.net/100mb.test
--2016-07-05 13:32:59--  http://cachefly.cachefly.net/100mb.test
Resolving cachefly.cachefly.net (cachefly.cachefly.net)... 205.234.175.175
Connecting to cachefly.cachefly.net (cachefly.cachefly.net)|205.234.175.175|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104857600 (100M) [application/octet-stream]
Saving to: ‘/dev/null’

0% [                                                                                                                                                                                                    ] 162,475     24.4KB/s  eta 73m 24s
Installed Dell OMSA this way:

Code:
INSTALL Dell System Update (DSU) AND OpenManage Server Administrator
---------------------------------------------------------------------

yum install perl wget -y
wget -q -O - https://linux.dell.com/repo/hardware/dsu/bootstrap.cgi | bash
yum install dell-system-update -y
yum install srvadmin-all -y
/opt/dell/srvadmin/sbin/srvadmin-services.sh start
dsu -u
 

Bashed

I can't; I only have that one switch. But I have a dozen other servers (mostly R610s, some Supermicro) that work fine on GigE.
 

aero

Active Member
Apr 27, 2016
349
89
28
54
I know you mentioned having gigabit and full duplex enabled, but are both sides set to auto-negotiate, or hard-coded to 1000/full? Have you checked ethtool on the servers and 'show int status' on the switch to make sure they are actually negotiating 1000/full?
 

j_h_o

Active Member
Apr 21, 2015
644
180
43
California, US
This doesn't look like anything to do with DNS; you're reporting HTTP download issues after the IPv4 address has already been resolved.

Any errors reported on the affected ports?
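A quick way to look for those from the server side is the driver's own statistics (a sketch; `em1` is the interface name used earlier in this thread, and `ethtool`/`iproute2` are assumed to be installed):

```shell
# NIC driver statistics: grep for anything error/drop-related
# (the bnx2 driver exposes many per-queue counters)
ethtool -S em1 | grep -Ei 'err|drop|discard|crc'

# The kernel's own RX/TX counters for the same interface
ip -s link show em1
```

Nonzero CRC or discard counters that keep climbing would point at cabling or a duplex problem rather than anything upstream.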

Any differences in NICs on affected/unaffected systems?

What does your upstream look like? Are you getting a single uplink to your router? Any QoS in place? Any NAT? Are all machines on the same subnet?

Are you able to reproduce throughput/latency locally to another R610? Run iperf or set up a dumb nginx instance in a VM and see if you can get decent speeds internally/locally?
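As a sketch of that internal test (assuming iperf3 is installed on both boxes; `10.0.0.10` is a placeholder for a known-good R610's address):

```shell
# On a known-good server: start iperf3 as a daemonized listener
iperf3 -s -D

# On a problem server: 30-second TCP test toward it,
# then the reverse direction (-R) to test the other path
iperf3 -c 10.0.0.10 -t 30
iperf3 -c 10.0.0.10 -t 30 -R
```

If both directions come in well under gigabit while good-to-good pairs hit line rate, that isolates the problem to the two servers or their switchports, independent of the upstream.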

Even a cheap 5-port gigabit switch would let you narrow down/isolate the problem. Can you work during a maintenance window and "just" drop down to a different dumb switch, hooking up only your uplink and a few boxes for testing, and see if this reproduces?
 

Bashed

On the other, properly working servers, 30 MB/s or greater is common. GigE uplink.

On these 2 servers, 200 KB/s or less is common; that's the issue. I can rarely break 1 MB/s.

I did a quick intranet test: scp from my NAS to one of the problematic servers runs at a slow 300-400 KB/s. NAS to a good server over the intranet via scp is a normal 20-30+ MB/s.

No QoS or NAT on the switch. I have dual failover Ethernet drops from the data center upstream (blended bandwidth).

EDIT: on a side note, how can I check the full VLAN config on the switch, to see if anything in the VLAN setup for these 2 boxes might be causing this? My network guy isn't around right now. I'm fairly fluent with Cisco gear, but I don't memorize everything.
 

Bashed

They're Broadcom NICs.

Code:
[root@localhost ~]# ethtool -i em1
driver: bnx2
version: 2.2.6
firmware-version: 7.12.19 bc 7.10.0 NCSI 2.0.13
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
Code:
[root@server2 ~]# ethtool em1
Settings for em1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supported pause frame use: No
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: g
        Wake-on: d
        Link detected: yes
Code:
[root@server2 ~]# lspci -v | grep Ethernet -A 1
01:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
        Flags: bus master, fast devsel, latency 0, IRQ 36
--
01:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
        Flags: bus master, fast devsel, latency 0, IRQ 48
--
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
        Flags: bus master, fast devsel, latency 0, IRQ 32
--
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
        Subsystem: Dell PowerEdge R610 BCM5709 Gigabit Ethernet
        Flags: bus master, fast devsel, latency 0, IRQ 42
 

Bashed

Output from switch on the 2 ports...

Code:
Cisco3750#show interface Gi1/0/13
GigabitEthernet1/0/13 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 001f.6c8a.938d (bia 001f.6c8a.938d)
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:00, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 0 bits/sec, 0 packets/sec
  5 minute output rate 1000 bits/sec, 1 packets/sec
     23470 packets input, 5854358 bytes, 0 no buffer
     Received 12 broadcasts (5 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 5 multicast, 0 pause input
     0 input packets with dribble condition detected
     240225 packets output, 19478755 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
Code:
Cisco3750#show interface Gi1/0/12
GigabitEthernet1/0/12 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 001f.6c8a.938c (bia 001f.6c8a.938c)
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is 10/100/1000BaseTX
  input flow-control is off, output flow-control is unsupported
  ARP type: ARPA, ARP Timeout 04:00:00
  Last input never, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 3000 bits/sec, 5 packets/sec
  5 minute output rate 182000 bits/sec, 6 packets/sec
     286214 packets input, 35687041 bytes, 0 no buffer
     Received 29 broadcasts (23 multicasts)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 watchdog, 23 multicast, 0 pause input
     0 input packets with dribble condition detected
     479017 packets output, 484189224 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out
 

namike

Member
Sep 2, 2014
70
18
8
43
You can do a 'show vlan' from the switch# prompt to check VLANs.

You can also do a 'show run interface g1/0/12' and g1/0/13 from the switch# prompt to check the switch port configuration itself.

The show interface output looks fine, other than the low packets/sec. If you want to eliminate the switch ports, move one of your "fast" servers over to the "slow" interface and see if it stays fast.

Are the servers acting fine otherwise? No weird issues like a failing HDD slowing things down, or some other oddity?

Edit: Is your box dual homed or anything like that also?

-Mike
 

Bashed


Code:
VLAN Name                             Status    Ports
---- -------------------------------- --------- -------------------------------
2    VLAN2                            active    Gi1/0/13
93   VLAN93                           active    Gi1/0/12
Code:
Cisco3750#show run interface g1/0/12
Building configuration...

Current configuration : 115 bytes
!
interface GigabitEthernet1/0/12
switchport access vlan 93
switchport mode access
speed 1000
duplex full
end

Cisco3750#show run interface g1/0/13
Building configuration...

Current configuration : 114 bytes
!
interface GigabitEthernet1/0/13
switchport access vlan 2
switchport mode access
speed 1000
duplex full
end
When you say "is your box dual homed", do you mean the network backbone? Not sure I quite understand the question. The network backbone is a tier-1 blend, redundant backbone. I have other similar spec'd R610's attached to the same Cisco switch and backbone, no issues.
 

namike

Other than what VLAN they're a member of (93 and 2, respectively), the ports look right. I assume these two boxes aren't trying to talk directly to each other?

Dual-homed meaning: do you have one NIC in your box going to one network/subnet and another NIC in the same box going to a different network/subnet?

You should only have one default gateway as well.

Any possibility to move the cables to eliminate the switch ports? How about trying different cables?
 

Bashed

Dual homed meaning do you have 1 NIC in your box going to one network/subnet and then another NIC in the same box going to a different network/subnet?
No, just one NIC / one port in use per server.

Any possibility to move the cables to eliminate the switch ports? How about trying different cables?
Already done; it's in my original post: replaced the NIC card, tried a different port on the server, a different port on the switch, and new cables.
 

namike

I'm sorry, I skimmed over those details in the OP. Does the box seem speedy otherwise, besides the network? Nothing weird like a failing HDD in an array or something else that could just be slowing the box down?

You can always run iperf between your boxes to verify local network speed between the good and bad servers, or between boxes on VLAN 2 and then between boxes on VLAN 93.
 

Bashed

Yes, I'm baffled here. I tried scp between my problematic servers and the local NAS and vice versa: slow. No problem between the good servers and the NAS locally.

The servers run SSDs in RAID 0 on an H700 PERC. I have similar good servers with the exact same config and no issues.
 

Bashed

How can it be an IP conflict? The Cisco can't route the same IPs to 2 different VLANs, and besides, these are 2 servers with issues, not one.
 

Bashed

Ran a full hardware diagnosis via the Dell Lifecycle Controller; everything passed. I'd appreciate any more help or brainstorming.

On a side note, the Cisco 3750 is stacked (2 units); each unit has 16 Gbps of forwarding capacity, 32 Gbps combined. Is it safe to assume bottlenecking isn't involved, given that Observium reads an average of about 80-100 MBps? That's far less than full capacity, but I'm just trying to figure out what else is left here.
 

wildchild

Active Member
Feb 4, 2014
389
57
28
The Cisco config seems fine. Sometimes it's something as simple as a batch of bad-quality cables.

I've also seen in the past that Broadcom NICs sometimes react strangely to hard-coded duplex and speed settings.

There have also been some serious issues with NetQueue on Broadcom NICs (depending on the driver/firmware used). In our case we hit it with VMware:
https://kb.vmware.com/selfservice/m...nguage=en_US&cmd=displayKC&externalId=2035701
Broadcom BNX2 driver version bnx2-2.0.23b for RHEL 5 - IBM System x and BladeCenter
broadcom NIC SLOW!
Symptoms very similar to yours.
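On that duplex point: the switchports earlier in this thread are hard-coded to `speed 1000` / `duplex full` while the server side still shows auto-negotiation on, which is worth double-checking. A sketch of verifying and aligning the two (interface name `em1` as used in this thread; ethtool settings are not persistent across reboots):

```shell
# Confirm what the server side actually negotiated
ethtool em1 | grep -E 'Speed|Duplex|Auto-negotiation'

# 1000BASE-T copper requires auto-negotiation at the link layer, so the usual
# recommendation is auto on BOTH sides: on the switchport, 'no speed' and
# 'no duplex'; on the server, make sure autoneg stays enabled:
ethtool -s em1 autoneg on
```

If the two sides disagree about negotiation, the classic symptom is exactly this: link comes up, but throughput collapses to a few hundred KB/s under load due to collisions/CRC errors.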