My Internet connection has been failing. I thought the culprit was the telephone company. This is how I discovered I was wrong. Shame on me.
For the last 3 months (from 2009/12 to 2010/03) my Internet connection (ADSL with Prodigy Infinitum from Telmex) has been failing. It kept disconnecting every few minutes. Sometimes it stayed connected for several hours, then suddenly disconnected, reconnected a minute later, disconnected again, and so on.
Sometimes it just failed for hours, even days. I cannot work without an Internet link, so as a backup connection I signed up with E-go (a wireless service from MVS). It works perfectly but is a bit slow (512 kbps and 128 kbps) and has very variable latency.
I got so annoyed with the unreliability of the ADSL connection that I started looking for more options. There are not that many options around here, and some are just as unreliable (a 3G cell link, for example) or too expensive (a dedicated microwave link).
The first alternative I considered was one of the 3G offers from the cell companies around here (Iusacell, Telcel and Movistar). I dismissed this option quickly because every person I asked about them told me his or her personal horror story about the performance of the link, the support from the provider or the silly contract terms.
My second alternative was a dedicated connection. I dismissed most providers because they are too expensive and because half the companies I contacted never called back (maybe they are having such a great time selling their service that they don’t need more customers). However, I still noted two local providers, Interclan and Xcien. Interclan has been my provider in the past and I remember their support was very good. I have no experience with Xcien but was willing to try them, as their service is not that expensive.
My third option was to stay with Telmex and change my Internet plan to a Business Premium plan (a bit pricier than my current plan, but for 4 Mbps/768 kbps it looked good). However, if there was a problem with my ADSL line, this might not fix it, just make the problem more expensive.
So my fourth option was to stay with Telmex and my current plan but get my line fixed. This was my least expensive option, a very compelling reason to try it. I can tell you that dealing with the Infinitum tech support people can be a very frustrating experience (both for me and for them), and it can become a dead end by the time I explain to them that I don’t use Windows, or that my ADSL router is in bridge mode, or that the router is not even the 2Wire brand they provided.
So, I decided this: first, let the voice people check the line, as there was some audible noise on it. If the link was still unstable, check the ADSL hardware. If it was still unstable after that, reconfigure the network so I could use my router directly (not in bridge mode) with my laptop (which has Windows as a secondary OS) and call Prodigy tech support.
So, they checked and fixed the noise in my telephone line. But the Internet connection was still failing. So I moved to step 2, check the hardware.
My network looks like this:
An IPCop server is configured as a Red-Green-Blue network. The Internet link is connected using a TP-Link TD8840 router in bridge mode. The green switch is a TP-Link TL-SG1008D and the blue access point is a Linksys router configured as an AP.
The first test was to reset the router configuration and disconnect it from the main network. I used my laptop to access the router and reconfigure it. The TD8840 has a statistics page in its web interface (as most ADSL routers do) showing the ADSL signal conditions. The two factors I was most interested in were the attenuation and the SNR margin, as both can affect the connection (and the second is the one most related to frequent disconnections). What I saw was that the SNR margin changed constantly, going from 20 to 10 and then suddenly to 0.
Something that really puzzled me was that the router stopped responding for a few seconds when that happened. Checking the dmesg messages on the laptop, I saw that the router was dropping not only the ADSL link but also the Ethernet port. The router logs were not that helpful, as they were mysteriously gone after each drop. I even used the telnet interface of the router but lost the connection every time, as the Ethernet port was disconnecting too.
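If you copy the SNR margin readings off the statistics page at regular intervals, a few lines of Python can point out where the margin collapsed. This is only a rough sketch: the sample readings are made up to follow the 20, 10, 0 pattern described above, and the 6 dB floor is an assumed threshold, not something taken from the router or from Telmex.

```python
def find_margin_drops(samples, floor_db=6.0):
    """Return (first_index, last_index) for each run of readings below floor_db.

    floor_db is an assumed "too low" threshold; adjust it to taste.
    """
    runs = []
    start = None
    for i, margin in enumerate(samples):
        if margin < floor_db:
            if start is None:
                start = i          # a new low-margin run begins here
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:          # the readings ended during a low run
        runs.append((start, len(samples) - 1))
    return runs


# Illustrative readings in dB, typed in by hand (not real measurements).
readings = [20, 20, 10, 0, 0, 20, 19, 10, 0]
print(find_margin_drops(readings))
```

Each pair marks the first and last sample of one collapse, which makes it easier to line the drops up with whatever else was happening at the time.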
Ok, maybe the ADSL filter is toasted. Or the telephone cable. Or the router. I replaced the filter. No difference. I replaced the cable. No difference. Then, for a quick test, I replaced the router (and its power supply) with another TD8840 I keep as a spare. And that fixed the problem. Now the SNR margin was always around 20. And the connection was not dropped. Cool! Now that I was sure my old router was the problem, I reconnected everything as before and… it failed again. The new router failed just like the old one. The SNR margin went from 20 to 10, stayed there and then dropped to 0, and both the ADSL and Ethernet connections were lost. Ok, it seems it’s something else.
While I was thinking about possible causes, I was still puzzled about the Ethernet port being disconnected too. After a while I realized what was happening: the routers were rebooting. Were the disconnections causing the routers to reboot? Nooo, bad logic. The rebooting was causing the disconnections.
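One way to tell a reboot from a plain link drop is to watch an uptime counter: a link drop leaves it counting up, a reboot sends it back to near zero. The sketch below assumes you can somehow collect the router's uptime in seconds at regular intervals (the post does not say whether the TD8840 exposes one, so treat that as hypothetical); the sample values are invented.

```python
def count_reboots(uptimes):
    """Count how many times the uptime counter went backwards.

    uptimes is a list of uptime readings in seconds, oldest first.
    A reading lower than the previous one means the device restarted
    somewhere between the two samples.
    """
    return sum(1 for prev, cur in zip(uptimes, uptimes[1:]) if cur < prev)


# Invented readings taken once a minute: the counter resets twice.
samples = [100, 160, 220, 15, 75, 10, 70]
print(count_reboots(samples))
```

If the count stays at zero while the connection keeps dropping, the problem is on the line side; if it climbs, the box itself is restarting.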
Well, it worked when I used the new router with the new power supply. But now I was using the new router with the old power supply (lazy of me). So maybe the problem was the power supply. I replaced the power supply and… it failed. Ok. The successful test had been done with the power supply connected to the AC wall socket. So, just to be sure the success case was not a coincidence, I connected the router’s power supply to the wall socket and… it worked. I tried again, with the old power supply… it worked.
Oooook, time to move on to the next suspect. To understand why this was such a puzzle for me, you need to know how everything is connected:
As you can see, this is a real mess. But I’ll fix that later; for now the Internet link is the main problem.
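Written down as a table, the swap tests point at the culprit almost mechanically: look for the one factor whose value alone predicts whether the router stayed up. Here is a small sketch of that idea. The trial list is my reading of the tests described so far; the post does not say which router was used in the last two wall-socket tests, so "new" there is an assumption.

```python
def deciding_factors(trials):
    """Return the factors whose value alone determines the outcome.

    Each trial is a dict of factor -> value plus an "ok" boolean.
    A factor qualifies if it took more than one value and every value
    always led to the same outcome.
    """
    factors = [key for key in trials[0] if key != "ok"]
    found = []
    for factor in factors:
        outcomes = {}
        for trial in trials:
            outcomes.setdefault(trial[factor], set()).add(trial["ok"])
        if len(outcomes) > 1 and all(len(seen) == 1 for seen in outcomes.values()):
            found.append(factor)
    return found


trials = [
    {"router": "old", "psu": "old", "source": "ups strip", "ok": False},
    {"router": "new", "psu": "new", "source": "wall", "ok": True},
    {"router": "new", "psu": "old", "source": "ups strip", "ok": False},
    {"router": "new", "psu": "new", "source": "ups strip", "ok": False},
    # Router assumed to be the new one in the next two trials.
    {"router": "new", "psu": "new", "source": "wall", "ok": True},
    {"router": "new", "psu": "old", "source": "wall", "ok": True},
]
print(deciding_factors(trials))
```

Neither the router nor the power supply separates the working cases from the failing ones; only where the power came from does.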
As you can see, the ADSL router is connected to a power strip, which is connected to a UPS, which is connected to another power strip, which is connected to the AC wall socket. So I started measuring the voltage, beginning at the power strip closest to the router and working back to the AC wall socket. The output voltage at that power strip ranged from 106 to 112 V. A bit low. The UPS gave the same 106 to 112 V on its protected outlets but 120 to 130 V on its unprotected outlets. From there back to the wall socket the voltage was the same 120 to 130 V.
Just for another quick test, I reconnected the power strip to an unprotected outlet of the UPS. Now the router worked perfectly. So, the problem is the UPS. My first question was, is this output normal for this UPS? As I have the UPS documentation in a hidden box in a forgotten location (along with the manuals and documentation for every piece of hardware I own), I measured the output voltage of my other UPS (which is the same model). Its output was 118 to 120 V.
So it was the UPS. Well, it IS the UPS, as I have not replaced it yet. For now, I have reconnected the router’s power supply to an unprotected outlet on the UPS. It seems all the other devices (the switch, the access point and the IPCop server) can work with this low voltage, but the router cannot.
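The same walk along the power chain can be written as a tiny check: list the stages in order from the wall socket to the router and report the first one whose output sags. The readings are the ones quoted above; the 114 V lower limit is a hypothetical cut-off, picked only because it sits between the two groups of readings, and the stage names are mine.

```python
LOW_LIMIT_V = 114.0  # hypothetical cut-off, not taken from any spec


def first_low_stage(chain, low_limit=LOW_LIMIT_V):
    """Return the name of the first stage whose minimum output is below low_limit.

    chain is ordered from the wall socket towards the device, and each
    entry is (stage_name, min_volts, max_volts). Returns None if every
    stage is fine.
    """
    for name, v_min, _v_max in chain:
        if v_min < low_limit:
            return name
    return None


chain = [
    ("wall socket", 120, 130),
    ("power strip at the wall", 120, 130),
    ("UPS, unprotected outlet", 120, 130),
    ("UPS, protected outlet", 106, 112),
    ("power strip at the router", 106, 112),
]
print(first_low_stage(chain))
```

Everything upstream of the first low stage is healthy, so whatever sits at that stage is the prime suspect.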
So, all my anger was misplaced: it was not a Telmex problem, but a problem with the UPS. Go figure. By the way, the UPS is an Apollo 1075A (750VA/450W). Pretty cheap.
Update (2010/04/15): I have replaced the old UPS with an APC BACK-UPS 900VA 120V (for $175 USD, taxes included). Great device, very reliable and Linux-friendly (using the APC UPS Daemon, apcupsd).