Recently I worked as a software developer on an embedded system, where my role was to write or modify Linux device drivers to bring the system up. The network part of the system was based on Realtek’s RTL8211DN Ethernet transceiver. I ran into several problems with the driver for this chip, which I want to share here.

The first problem I encountered was incorrect link loss detection. Here are two logs from the U-Boot console that demonstrate the problem. The first was taken after connecting an active network cable to the UTP port of the transceiver, and the second after disconnecting the cable. Here and below I have omitted insignificant details from the console output.

 # mdio read 0x1
 1 - 0x7969
 # mdio read 0x1
 1 - 0x796d
 # mdio read 0x11
 17 - 0x7c1c
 # mdio read 0x11
 17 - 0x6c1c
 # mdio read 0x13
 19 - 0x6c00
 # mdio read 0x13
 19 - 0x0

The register at address 0x1 is the basic status register. Bit 2 of this register indicates link status, and it was set once the link had been established. The register at address 0x11 is the PHY-specific status register. Bit 10 of this register is of special interest here because it shows the current link status in real time; it is set as well. The register at address 0x13 is the interrupt status register. It indicates that four events have happened since it was last read: link status has changed, autonegotiation has completed, and link speed and duplex mode have changed. Reading the register acknowledged all of these interrupts, so the second read returns zero.
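The bit checks described above are easy to reproduce on the host. The following is only a sketch: the helper names are mine, and the bit positions are the ones quoted in the text (bit 2 of register 0x1, bit 10 of register 0x11). Note that the first read of register 0x1 returned 0x7969 with bit 2 still clear. IEEE 802.3 defines the link status bit as latching low, which is why the register has to be read twice to see the current state.

```python
def bit_is_set(value: int, bit: int) -> bool:
    """Return True if the given bit (0 = least significant) is set."""
    return bool((value >> bit) & 1)


def set_bits(value: int) -> list[int]:
    """Return positions of all set bits in a 16-bit register, highest first."""
    return [bit for bit in range(15, -1, -1) if bit_is_set(value, bit)]


# Values taken from the U-Boot log after the cable was connected.
BMSR_FIRST, BMSR_SECOND = 0x7969, 0x796D   # register 0x1, read twice
PHYSR = 0x7C1C                             # register 0x11, first read
INSR = 0x6C00                              # register 0x13, first read

print("link (reg 0x1, 1st read):", bit_is_set(BMSR_FIRST, 2))
print("link (reg 0x1, 2nd read):", bit_is_set(BMSR_SECOND, 2))
print("real-time link (reg 0x11):", bit_is_set(PHYSR, 10))
print("pending interrupt bits:", set_bits(INSR))
```

The last line reports four set bits for 0x6c00, which matches the four events listed above.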

I then disconnected the network cable and got the following log:

# mdio read 0x1
1 - 0x796d
# mdio read 0x1
1 - 0x796d
# mdio read 0x11
17 - 0x6c1c
# mdio read 0x11
17 - 0x6c1c
# mdio read 0x13
19 - 0x100
# mdio read 0x13
19 - 0x0

Neither the basic status register nor the PHY-specific status register reflected the link loss; both continued to provide false information to the system. The only indication of the missing link was the False Carrier interrupt. I had to handle this interrupt and force the chip to restart autonegotiation, after which it updated its registers to match the actual link status.
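The idea of the workaround can be sketched as follows. This is a host-side Python model, not the driver code: `FakeMdio` and `handle_interrupt` are hypothetical names, the False Carrier bit position (bit 8) is inferred from the 0x100 value in the log above, and the restart is done the standard IEEE 802.3 way by setting the autonegotiation enable and restart bits (12 and 9) in control register 0x0. The starting control register value is just an example.

```python
BMCR, INSR = 0x00, 0x13          # control and interrupt status registers
BMCR_ANENABLE = 1 << 12          # IEEE 802.3: autonegotiation enable
BMCR_ANRESTART = 1 << 9          # IEEE 802.3: restart autonegotiation
INSR_FALSE_CARRIER = 1 << 8      # assumption: inferred from the 0x100 reading


class FakeMdio:
    """Hypothetical stand-in for an MDIO bus: a dict of register values."""

    def __init__(self, regs):
        self.regs = dict(regs)

    def read(self, reg):
        value = self.regs.get(reg, 0)
        if reg == INSR:
            self.regs[INSR] = 0   # reading acknowledges pending interrupts
        return value

    def write(self, reg, value):
        self.regs[reg] = value & 0xFFFF


def handle_interrupt(bus):
    """Acknowledge interrupts; restart autonegotiation on False Carrier."""
    status = bus.read(INSR)
    if status & INSR_FALSE_CARRIER:
        bus.write(BMCR, bus.read(BMCR) | BMCR_ANENABLE | BMCR_ANRESTART)
        return True
    return False


bus = FakeMdio({BMCR: 0x1000, INSR: 0x0100})
print(handle_interrupt(bus), hex(bus.read(BMCR)))
```

In a real driver the register accesses would go through the kernel's PHY layer instead of a dictionary; the model only shows the decision being made.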

Interrupt acknowledgement

The second problem arose right after the first one. The transceiver datasheet is pretty terse about interrupt processing: it just states that the interrupt line is active low. I mistakenly assumed that the interrupt line returns to its idle state as soon as the interrupt has been acknowledged, and configured the interrupt handler to trigger on a low level on the line. Checking /proc/interrupts later revealed a problem:

# cat /proc/interrupts | grep phy
 86:         58          0  gpio-dwapb   0 Level     phy_interrupt

58 interrupts for a single event is insane! Debugging this by printing status messages from the interrupt handler and its bottom half showed that the processor kept taking interrupts for about 2 ms, even though the interrupt had been acknowledged within microseconds of the event:

 [   79.090851] rtl8211dn_work: phy_priv->irq_status = 0x7d00, phydev->state = 0x9
 [   79.090863] rtl8211dn_interrupt: phydev->state = 9
 [   79.090907] genphy_restart_aneg
 [   79.090929] rtl8211dn_work: phy_priv->irq_status = 0x2500, phydev->state = 0x9
 [   79.090939] rtl8211dn_interrupt: phydev->state = 9
 [   79.090966] rtl8211dn_work: phy_priv->irq_status = 0x0, phydev->state = 0x9
 [   79.090976] rtl8211dn_interrupt: phydev->state = 9
 [   79.091003] rtl8211dn_work: phy_priv->irq_status = 0x0, phydev->state = 0x9

 ...

 [   79.092867] rtl8211dn_interrupt: phydev->state = 9
 [   79.092893] rtl8211dn_work: phy_priv->irq_status = 0x0, phydev->state = 0x9
 [   79.092903] genphy_read_status
 [   79.092908] genphy_update_link

The debug messages showed that interrupts were generated even when the PHY’s interrupt status register was empty. My assumption was that the PHY did not return the interrupt line to its idle state immediately after the acknowledgement but only after a 2 ms timeout; I did not have an oscilloscope at the time to confirm this. I simply changed the interrupt trigger to falling edge, and this solution clashed with the first problem. I handled the False Carrier interrupt, acknowledged it and restarted autonegotiation, which in turn generated a link status change interrupt. I could not receive that interrupt: because it had arrived while the line was still asserted, the PHY kept holding the line low indefinitely, even past the 2 ms timeout, so no new falling edge ever occurred. This resulted in yet another bolt-on in the driver to work around the problem.
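A rough model shows why a low-level trigger produces this many interrupts. It assumes that one pass through the handler and its bottom half takes about 37 µs, which is the spacing between consecutive `rtl8211dn_work` messages in the log above, and that the line stays asserted for the whole span between the first and the last of those messages. Both numbers are estimates read off the log, and the function name is made up for this illustration.

```python
def count_level_interrupts(hold_us: int, pass_us: int) -> int:
    """Count handler invocations while the line is still held low.

    With a level trigger the handler is re-entered as long as the line
    is asserted, so it runs once per pass until the hold time expires.
    """
    count = 0
    elapsed = 0
    while elapsed < hold_us:
        count += 1          # handler fires because the line is still low
        elapsed += pass_us  # one top half plus bottom half round trip
    return count


# Timestamps (in microseconds) of the first and last rtl8211dn_work messages.
first_us = 79_090_851
last_us = 79_092_893
hold_us = last_us - first_us

print("observed burst length, us:", hold_us)
print("level-triggered interrupts:", count_level_interrupts(hold_us, 37))
```

The model gives 56 invocations over a burst of 2042 µs, which is in the same range as the 58 counted in /proc/interrupts.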

Later I got an oscilloscope and out of curiosity checked what was going on on the interrupt line:

Interrupt line

My assumption was correct: the line is deasserted about 2 ms after an interrupt is acknowledged. The markers are not visible in the screenshot, but they were set to measure the exact duration of the signal.

Interface autodetection

The RTL8211DN can automatically detect an active connection on either of its two ports, which the datasheet calls UTP and fiber. At least, this feature is stated in the datasheet, and it is the reason the chip was selected for the project. It turned out that autodetection is not… auto. The chip just waits for an active link on the UTP port for approximately 10 seconds and then switches to the fiber port if no connection has been detected within that time. After the switch it no longer detects an active connection on the UTP port. This behavior puzzled me for a while, because a previously working chip stopped detecting an active link on the UTP port once autodetection had been enabled. I dumped its registers from the driver and got the following:

[ 7.610098] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[ 8.610096] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[ 9.610093] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[10.610099] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[11.610095] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[12.610094] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[13.610098] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[14.610107] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[15.610097] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[16.610118] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[17.610102] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[18.610095] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[19.610096] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
[20.610095] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000

The change in the GBESR register marks the moment when the PHY switched to the fiber port. This happens soon after the system has started, and if I plugged a network cable into the UTP port after this moment, the PHY did not detect any link. The only way to fix this was a hard reset of the chip; a software reset through the control register did not work.
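The switch can be located in such a dump mechanically. The sketch below parses two of the lines above and decodes GBESR using the standard IEEE 802.3 extended status layout (bits 15 and 14 for 1000BASE-X, bits 13 and 12 for 1000BASE-T). Whether the RTL8211DN follows that layout exactly is my assumption, and the function names are invented for the example.

```python
import re

DUMP = """\
[11.610095] phy_dev: 0xeda50200, BMCR: 0x1000, BMSR: 0x7949, ANAR: 0x1e1, ANLPAR: 0x0,ANER: 0x4, GBCR: 0x300, GBSR: 0x0, GBESR: 0x3000
[12.610094] phy_dev: 0xeda50200, BMCR: 0x1140, BMSR: 0x109, ANAR: 0x20, ANLPAR: 0x0,ANER: 0x0, GBCR: 0x0, GBSR: 0x0, GBESR: 0x8000
"""

# Capture the timestamp and the GBESR value from each dump line.
LINE_RE = re.compile(r"\[\s*([\d.]+)\].*GBESR: (0x[0-9a-fA-F]+)")


def medium(gbesr: int) -> str:
    """Classify the reported gigabit capability from GBESR."""
    if gbesr & 0xC000:      # 1000BASE-X full/half duplex
        return "fiber"
    if gbesr & 0x3000:      # 1000BASE-T full/half duplex
        return "UTP"
    return "unknown"


def find_switch(dump: str):
    """Return (timestamp, old, new) for the first medium change, or None."""
    previous = None
    for match in LINE_RE.finditer(dump):
        stamp = float(match.group(1))
        current = medium(int(match.group(2), 16))
        if previous is not None and current != previous:
            return stamp, previous, current
        previous = current
    return None


print(find_switch(DUMP))
```

For the two sample lines this reports the change from UTP to fiber at timestamp 12.610094, about twelve seconds into the boot.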

I don’t know whether this particular chip came from a buggy batch as my personal Christmas present or whether all these quirks are there by design. Think twice if you are considering this chip for your next project.