Resolved Intel e1000e driver bug on 82574L Ethernet controller causing network blipping

taken from: http://www.doxer.org/learn-linux/resolved-intel-e1000e-driver-bug-on-82574l-ethernet-controller-causing-network-blipping/

Earlier I posted a question about CentOS 6.2 intermittently losing its internet connection. I have finally found the right way to fix it.

First, this is a known bug in the Intel e1000e driver on Linux. It is a driver problem with the Intel 82574L (an MSI/MSI-X interrupt issue). The connection drops now and then, and nothing is logged when it happens, which makes troubleshooting very hard.
You can see more bug reports about this at https://bugzilla.redhat.com/show_bug.cgi?id=632650

Fortunately, we can resolve this by installing the kmod-e1000e package from ELRepo.org. To do so, follow the steps below (ignore lines with strikeouts):

Install kmod-e1000e offered by Elrepo

Import the public key:
rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org

To install ELRepo for RHEL-5, SL-5 or CentOS-5:
rpm -Uvh http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm

To install ELRepo for RHEL-6, SL-6 or CentOS-6:
rpm -Uvh http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm
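
As a rough sketch, the choice between the two release packages above can be scripted. The helper names below (el_major_version, elrepo_release_url) are made up for illustration, and the release string is assumed to look like the contents of /etc/redhat-release on EL5/EL6. Only the two URLs quoted in this article are covered.

```shell
# Hypothetical helper: extract the major version from a release string
# such as "CentOS release 6.2 (Final)".
el_major_version() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: map an EL major version to the release RPM URL
# quoted in this article. Anything other than 5 or 6 is rejected.
elrepo_release_url() {
    case "$1" in
        5) echo "http://elrepo.org/elrepo-release-5-3.el5.elrepo.noarch.rpm" ;;
        6) echo "http://elrepo.org/elrepo-release-6-4.el6.elrepo.noarch.rpm" ;;
        *) echo "unsupported EL major version: $1" >&2; return 1 ;;
    esac
}

# Usage (as root), left commented out so that sourcing this file
# changes nothing on the system:
#   major=$(el_major_version "$(cat /etc/redhat-release)")
#   rpm --import http://elrepo.org/RPM-GPG-KEY-elrepo.org
#   rpm -Uvh "$(elrepo_release_url "$major")"
```

Failing loudly on an unknown version is deliberate here: installing a release package built for the wrong EL major version is worse than stopping.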

Before installing the new driver, let’s see our old one:
[root@doxer sites]# lspci |grep -i ethernet
02:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

[root@doxer modprobe.d]# lsmod|grep e100
e1000e 219500 0

[root@doxer modprobe.d]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/kernel/drivers/net/e1000e/e1000e.ko
version: 1.4.4-k
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation
srcversion: 6BD7BCA22E0864D9C8B756A

Now let’s install the new kmod-e1000e offered by elrepo:
[root@doxer yum.repos.d]# yum list|grep -i e1000
kmod-e1000.x86_64 8.0.35-1.el6.elrepo elrepo
kmod-e1000e.x86_64 1.9.5-1.el6.elrepo elrepo

[root@doxer yum.repos.d]# yum -y install kmod-e1000e.x86_64

After installation, reboot your machine, and you’ll find the driver updated:
[root@doxer ~]# modinfo e1000e
filename: /lib/modules/2.6.32-220.7.1.el6.x86_64/weak-updates/e1000e/e1000e.ko
version: 1.9.5-NAPI
license: GPL
description: Intel(R) PRO/1000 Network Driver
author: Intel Corporation
srcversion: 16A9E37B9207620F5453F5E

[root@doxer ~]# lsmod|grep e100
e1000e 229197 0
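
The before/after comparison above can also be checked from a script. This is only a sketch: modinfo_field and uses_stock_driver are hypothetical helper names, and the parsing assumes the "name: value" layout shown in the modinfo output above. The path test relies on the stock driver living under kernel/drivers/, as in the first listing.

```shell
# Hypothetical helper: print the value of one field ($1) from modinfo
# output read on stdin, e.g. "version" or "filename".
modinfo_field() {
    awk -v key="$1:" '$1 == key { $1 = ""; sub(/^[ \t]+/, ""); print; exit }'
}

# Hypothetical helper: succeed if the modinfo output on stdin points at
# the driver shipped with the kernel (a path under kernel/drivers/).
uses_stock_driver() {
    case "$(modinfo_field filename)" in
        */kernel/drivers/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage, left commented out:
#   modinfo e1000e | modinfo_field version
#   if modinfo e1000e | uses_stock_driver; then
#       echo "still on the stock e1000e driver"
#   fi
```

For a single field, `modinfo -F version e1000e` prints the value directly without any parsing.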

Change kernel parameters

Append the following parameters to grub.conf kernel line:

pcie_aspm=off e1000e.IntMode=1,1 e1000e.InterruptThrottleRate=10000,10000 acpi=off
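
Editing grub.conf by hand works fine; the sketch below shows one way to do it from a script instead. The function name append_kernel_args is hypothetical, and /boot/grub/grub.conf is assumed to be the legacy GRUB config path used on CentOS 6. The function only prints the modified file, so nothing is changed until you copy the result into place yourself.

```shell
# Hypothetical helper: print the grub.conf named in $1 with the string
# $2 appended to every "kernel" line that does not already contain it.
append_kernel_args() {
    awk -v extra="$2" '
        $1 == "kernel" && index($0, extra) == 0 { $0 = $0 " " extra }
        { print }
    ' "$1"
}

# Usage (as root), left commented out:
#   args="pcie_aspm=off e1000e.IntMode=1,1 e1000e.InterruptThrottleRate=10000,10000 acpi=off"
#   cp /boot/grub/grub.conf /boot/grub/grub.conf.bak
#   append_kernel_args /boot/grub/grub.conf.bak "$args" > /tmp/grub.conf.new
#   diff /boot/grub/grub.conf.bak /tmp/grub.conf.new   # review the change
#   cp /tmp/grub.conf.new /boot/grub/grub.conf
# After the reboot, "cat /proc/cmdline" shows the parameters in effect.
```

Skipping lines that already contain the string makes the helper safe to run twice.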

Change NIC parameters (you should add these lines to /etc/rc.local)

# disable pause-frame autonegotiation
/sbin/ethtool -A eth0 autoneg off
# disable link autonegotiation
/sbin/ethtool -s eth0 autoneg off
# change the TX ring buffer; 4096 may be too large (consider 512)
# to increase the interrupt rate: ethtool -C eth0 rx-usecs 10 (10000 interrupts per second)
/sbin/ethtool -G eth0 tx 4096
# change the RX ring buffer
/sbin/ethtool -G eth0 rx 128
# disable Wake-on-LAN
/sbin/ethtool -s eth0 wol d
# turn off offloads
/sbin/ethtool -K eth0 tx off rx off sg off tso off gso off gro off
# enable TX pause
/sbin/ethtool -A eth0 tx on
# disable ASPM
/sbin/setpci -s 02:00.0 CAP_EXP+10.b=40
/sbin/setpci -s 00:19.0 CAP_EXP+10.b=40
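
A note on the setpci lines: CAP_EXP+10 is the PCIe Link Control register, whose two lowest bits select the ASPM mode. Writing 40 clears those bits but also overwrites the rest of the byte. The sketch below is a gentler variant that reads the current value and clears only the ASPM bits. The helper name aspm_disabled_byte is hypothetical, and the device address is taken from the lspci output earlier in this article.

```shell
# Hypothetical helper: given the current Link Control low byte in hex
# (as printed by "setpci -s DEV CAP_EXP+10.b"), print that byte with
# the ASPM control bits (bits 0 and 1) cleared and the others kept.
aspm_disabled_byte() {
    printf '%02x\n' $(( 0x$1 & 0xfc ))
}

# Usage (as root), left commented out:
#   dev=02:00.0
#   cur=$(/sbin/setpci -s "$dev" CAP_EXP+10.b)
#   /sbin/setpci -s "$dev" CAP_EXP+10.b="$(aspm_disabled_byte "$cur")"
```

For example, a current value of 43 becomes 40, while a value of 03 becomes 00 rather than being forced to 40.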

PS:

pcie_aspm: ASPM is short for Active-State Power Management, a power-saving mechanism for PCIe links; pcie_aspm=off turns it off.
acpi is short for Advanced Configuration and Power Interface.
apic is short for Advanced Programmable Interrupt Controller, which handles IRQ delivery. The APIC is one of several kinds of programmable interrupt controller (PIC) and is found on Intel and other x86 platforms.

Now reboot your machine and you should have steadier networking!

PS2:

The reason there are so many strikeouts in this article is that I struggled a lot with this problem. At first I thought it was caused by a bug in the e1000e kernel driver, and after some searching I installed the kmod-e1000e driver and modified the kernel parameters. Things got better for a short time. Later I found the issue was still there, so I tried compiling the latest e1000e driver from Intel, but that did not work either.

Later, I ran a script that monitored the network around the times the NIC went down. After the NIC had failed several times, I found that TX traffic was very high each time it failed (TX bytes went up by something like 5Gb in a very short time). Based on this, I realized the server might be under a DoS attack. Using ntop and tcpdump, I found that DNS traffic was very large, even though my host was not providing DNS services at all!
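
The monitoring script itself is not reproduced here, so the following is only a guess at its general shape. It samples the kernel's per-interface TX byte counter under /sys/class/net and logs a line whenever the rate crosses a threshold. The names tx_rate and watch_tx, the threshold and the log path are all assumptions for illustration.

```shell
# Hypothetical helper: bytes per second between two counter samples.
# $1 = earlier tx_bytes, $2 = later tx_bytes, $3 = seconds between them.
tx_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Hypothetical watcher: print a timestamped line whenever the TX rate
# on interface $1 exceeds $2 bytes/s, sampling every $3 seconds.
watch_tx() {
    iface=$1 limit=$2 interval=$3
    counter=/sys/class/net/$iface/statistics/tx_bytes
    prev=$(cat "$counter") || return 1
    while sleep "$interval"; do
        cur=$(cat "$counter") || return 1
        rate=$(tx_rate "$prev" "$cur" "$interval")
        [ "$rate" -gt "$limit" ] && echo "$(date '+%F %T') $iface TX $rate B/s"
        prev=$cur
    done
}

# Usage, left commented out (50 MB/s threshold, 5 s interval):
#   watch_tx eth0 50000000 5 >> /var/log/tx-spikes.log &
# When a spike is logged, look at what is on the wire, e.g. DNS only:
#   tcpdump -ni eth0 -c 1000 port 53
```

Logging only the spikes keeps the file small enough to line up against the times the NIC went down.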

Then I wrote some iptables rules to disallow DNS queries etc., and since then the host has been steady again! Traffic went back down to normal, and everything is on track now. I’m so happy and excited about this, as it is the first time I’ve stopped a DoS attack!
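
The exact rules are not listed above, so what follows is a minimal sketch under one assumption: the host runs no DNS server, so any packet arriving for port 53 can be dropped. The function name dns_block_rules is hypothetical. It only prints the commands, so they can be reviewed before anything is applied.

```shell
# Hypothetical helper: print iptables commands that drop inbound DNS
# queries (destination port 53) over both UDP and TCP.
dns_block_rules() {
    for proto in udp tcp; do
        echo "iptables -I INPUT -p $proto --dport 53 -j DROP"
    done
}

# Usage (as root), left commented out:
#   dns_block_rules          # review the commands first
#   dns_block_rules | sh     # apply them
#   service iptables save    # keep them across reboots on CentOS 6
```

These rules match on the destination port only, so replies to the host's own DNS lookups (which arrive from port 53, not to it) are not affected.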