IT Horror Stories Top Ten List
In recognition of National Systems Administrator Appreciation Day on July 30, 2010, Azaleos conducted a contest for the best IT Horror Stories related to deployment, migration and support of Microsoft Exchange, Active Directory, SharePoint, BlackBerry Enterprise Server and/or Office Communications Server. Certified and experienced Azaleos Support Engineers are on the job 24x7x365 working to ensure that something similar to the “bad days” that you read about in the final Top Ten list below doesn’t happen to our customers. Congratulations (?) to the winner of the new iPad, Najam S., for his 20-hour Active Directory marathon story.
#1: Vanishing Security Groups Najam S.
One day I woke up and checked email before driving into the office. There were numerous help desk requests: users were not getting their mapped drives, the student roster was not working, faculty members were not getting their mapped drives and staff members couldn’t access network resources. I realized that the day was destined to be a bad one. When I finally reached the office I was just in time for an IT all-hands meeting where we found out that almost all the security groups were missing from Active Directory. After a few minutes, our application folk came into the meeting and one of them told us that he wrote a script which “went crazy” and inadvertently deleted the security groups. At this point we started thinking, who’s going to hold him and who will hit him. Our first thought was that we should simply do a complete AD restore or start creating a new one. But there was another problem --- we didn't have the list of members and permissions. Luckily, we had a cloned AD machine that was only one week older than the current one. We started creating new security groups and adding members to them. We also had to check the folder permissions, where orphaned SIDs now appeared instead of group names. We matched each SID against the group SIDs on the cloned machine and assigned the permissions to the corresponding new group. Man, it was a horrible day. We worked for 20 hours continuously without a break and were finally able to bring the situation back to normal.
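The SID-matching step in this story lends itself to a small script. The following Python sketch is only an illustration: it assumes the group SIDs and names were exported from the week-old clone and that the folder ACLs were dumped as (folder, SID) pairs. The names `clone_groups`, `folder_acls` and `plan_restores`, and all sample values, are hypothetical and do not come from the original story.

```python
from collections import defaultdict


def plan_restores(clone_groups, folder_acls):
    """Map orphaned SIDs found on folders back to group names
    recorded in an older AD snapshot.

    clone_groups: dict of SID -> group name (exported from the clone)
    folder_acls:  iterable of (folder_path, sid) pairs from the file server
    Returns (plan, unknown): plan maps group name -> sorted folder list,
    unknown is the sorted list of SIDs the clone does not know about.
    """
    plan = defaultdict(set)
    unknown = set()
    for folder, sid in folder_acls:
        name = clone_groups.get(sid)
        if name is None:
            unknown.add(sid)       # newer than the clone, or not a group
        else:
            plan[name].add(folder)
    return {g: sorted(f) for g, f in plan.items()}, sorted(unknown)


# Hypothetical sample data, for illustration only.
clone_groups = {
    "S-1-5-21-1111-2222-3333-1105": "Faculty",
    "S-1-5-21-1111-2222-3333-1106": "Staff",
}
folder_acls = [
    (r"\\fs01\faculty", "S-1-5-21-1111-2222-3333-1105"),
    (r"\\fs01\shared", "S-1-5-21-1111-2222-3333-1105"),
    (r"\\fs01\shared", "S-1-5-21-1111-2222-3333-1106"),
    (r"\\fs01\archive", "S-1-5-21-1111-2222-3333-1999"),
]

plan, unknown = plan_restores(clone_groups, folder_acls)
```

The `plan` result lists, for each recreated group, the folders that need its permissions reassigned. The `unknown` list flags SIDs that the snapshot cannot explain and that would need manual review.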
#2: Outage Maid to Order Harris L.
Back when I worked for a major consumer electronics manufacturer we had an Exchange server at our site in Florida. Every Thursday afternoon after 18:00, it would go offline. Someone would have to drive down there and quickly power it back up by hand. It was always an ungraceful shutdown and we could never ID what was causing it. Other equipment nearby was not affected. After a month of this, we decided to stage a stake-out. I went down there in the afternoon and hung around for a few hours. After most of the staff went home, the cleaners arrived. Desks dusted, bins emptied and floor polished. It turned out the cleaning crew were unplugging the server so they could use our power outlet for their equipment. When they were done, they helpfully plugged the server back in, but the damage was already done. The solution was to move our equipment to another rack and leave that power outlet to the cleaners.
#3: Executive Privilege Adam S.
A couple of years back my network was severely hacked by someone who came in from the outside and deleted the main Exchange message store. Firewall logs had gotten the local IT admin nowhere, so we were called in to do a little snooping around. I wish I’d thought of it, but another guy on the team had the sense to run a wireless utility and he found a wide open Linksys wireless access point in about six seconds. The internal admin insisted there was no wireless running anywhere on the network. It took some sneaker netting, but we found the rogue AP in a senior exec’s office about 20 minutes later. It seemed he had seen how cheap they were at the local CompUSA and decided to plug one into the secondary network port in his office so he could use his notebook’s wireless instead of the wired connection because no wires “looks better.” Once we found the leak, we were able to patch it up and get Exchange running again. Needless to say, we also instituted much tighter controls on how peripherals could be deployed, along with communications letting employees know what is and isn’t acceptable as a corporate peripheral!
#4: Done IT Lately? Martha K.
One morning my CEO called to say that he was not able to receive mail on his BlackBerry. I proceeded to look into this case and found that the RADIUS server was down in the subsidiary office where the CEO was located, and VPN was not an option given the current network troubles. So I talked the very new and very green local sys admin through the process of checking the BES for SRP connectivity and the CEO's last contact times. While I was talking him through the SRP test, he stated that his Exchange email went offline. I asked him how he knew and he responded with "I just rebooted the Exchange server". I then used a 3rd party application to share screens with him and took control of his local box so I would have access to the environment, where he proceeded to ask me why I was doing what I was doing, every step of the way. It took me 4 hours to get him out of a degraded state. That's 4 hours of him asking, "Why are you clicking that? What does that do? Are you sure that will fix this? How long will this take?" The best part... the CEO's BlackBerry wasn't receiving mail... because it was turned off.
#5: Disappearing SharePoint Search Paul B.
My company was working with a fairly large SharePoint 2007 environment. One day we began experiencing some serious problems with the SP Search functionality. The system would kick off a search, and what would normally take an hour was taking 5+ hours and then just hanging. I spent a full day troubleshooting on my own before turning to the Microsoft Support Service phone line for assistance. Working with Microsoft, we collectively accrued an additional 130 hours of phone support time and escalations up through the most senior MSFT support engineers. No matter what we tried, we couldn’t get search to work. Finally, I just happened to be talking with one of my IT colleagues who owned our virtualization technology on VMware, and when I mentioned the problems I was having he told me that he didn’t think SP should be having those sorts of problems, especially with it running virtualized on top of VMotion. That’s when it hit me. First of all, I wasn’t aware that my colleague had set up my farms on top of VMotion to begin with! When I learned this, I quickly guessed and a short while later confirmed what had been happening. When a new search kicked off, things would run as planned and the initial search results would start to pour back. Because SP Search creates such a heavy demand on the servers, however, VMotion would detect this load and move the SP Servers to a new virtual guest instance. This move would break the connection between the search and the databases being searched and not allow the search to continue. Once we removed SP from the VMotion platform we were all systems go.
#6: Reply All Torture George P.
I was trying to figure out how I could more efficiently auto-populate Exchange distribution lists using PowerShell. So I wrote a script which looked at everyone in the entire Exchange GAL and divided the users into 4 different DLs --- each DL held about 5K users and was configured as a flat DL without any nested groups. Of course I did this test using a LIVE system. Well, what happened next was that one of the users who had been placed on one of my random DLs happened to notice that they had been put on that DL and hit reply all to say: “Hey, I don’t want to be on this DL, please take me off.” Of course, since a lemming mentality exists in many organizations, other members of this DL received this initial email and then either a) decided that they too wanted to be removed; or b) decided that they were unhappy that others were wasting their time by filling their inbox with useless “remove me” email clutter. An email storm was born! As a result of this storm, my Exchange Server system was brought to its knees and mail flow was offline for days until I could purge all of the reply alls from the system. This sort of spike was too much for Exchange to keep up with --- the mail queues were just too long and I had to shut all mail down, walk through the emails and clear them out using text editors before starting the servers back up. Part of the problem was that I had no easy way to tell all users to stop sending email, since email was our primary method of communication. We had to remove all the users from the DLs, which then also caused massive AD replication --- this also weighed heavily on the network --- a true pile-on scenario. As a result of this I subsequently made my DLs fully nested, with lower numbers of users and with specific permissions for who could send to the larger DLs.
I’m told that the new “Mail Tips” feature in Exchange 2010 and Outlook 2010 is a direct result of this sort of experience being all too common in many corporations. I guess my only consolation is that I wasn’t the only one to fall victim to this issue with Exchange.
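The redesign described in this story (smaller, nested lists with restricted senders) can be modeled in a few lines of Python. This is a simplified sketch, not Exchange's or PowerShell's API: the function names `build_nested_dls` and `may_send`, the leaf size of 500 and the addresses are all assumptions made for illustration.

```python
def build_nested_dls(users, leaf_size, parent_name, allowed_senders):
    """Split users into small leaf lists nested under one parent list.

    Only addresses in allowed_senders may mail the parent, so a stray
    reply-all from an ordinary member is rejected instead of fanning out.
    """
    if leaf_size < 1:
        raise ValueError("leaf_size must be at least 1")
    leaves = {}
    for start in range(0, len(users), leaf_size):
        leaf_name = f"{parent_name}-{start // leaf_size + 1:02d}"
        leaves[leaf_name] = users[start:start + leaf_size]
    return {
        "name": parent_name,
        "members": sorted(leaves),   # the parent holds only leaf lists
        "allowed_senders": set(allowed_senders),
        "leaves": leaves,
    }


def may_send(dl, sender):
    """Return True if sender is permitted to mail the parent list."""
    return sender in dl["allowed_senders"]


# Hypothetical data: 1,200 users split into leaves of at most 500.
users = [f"user{i}@example.com" for i in range(1200)]
dl = build_nested_dls(users, 500, "all-staff", ["it-comms@example.com"])
```

With this shape, a "remove me" reply from an ordinary member fails the `may_send` check on the parent list and never reaches the other members.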
#7: Storage Growth Hell Garth H.
I started seeing a huge increase in requests from my users for larger Exchange mailboxes. Relying on Exchange claims that the server could accommodate this, I set about purchasing enough hardware to increase individual mailbox sizes to 1GB -- barely. At the same time I also upgraded all 1K users to Outlook 2007 with the new mailbox search capabilities. So, in theory, the hardware and storage that I had procured in order to get my users to 1GB was perfect, as long as the users read the docs that I sent them telling them how to practice proper Inbox maintenance. The problem was that my users didn’t read the docs. All they saw was that they now had 1GB of mailbox space AND they started using Outlook's handy new search feature to turn their e-mail clients into “personal information managers.” Nobody deleted anything anymore --- everything was just left in their Inbox so that they could run quick searches against it, where all they needed to remember was a rough description of the attachment and the name of the person who might have sent it to them. Worse, they sent attachments to themselves just so the doc would be in the inbox somewhere. The result --- hardware that was pegged for a 3 year shelf life got maxed out inside of 3 months and I had to go back to my CIO to purchase more storage. If there was any saving grace it was that we actually saw a 35 percent decrease in the amount these users used their network home directories. Exchange and Outlook became the main network gateway for personal storage. So we were able to repurpose some storage from the file server machines for the e-mail infrastructure, but we still had to make several large and unscheduled server purchases to keep up with new demand.
#8: Exchange Start-up Abdul K.
The story begins in 2003, when I got an opportunity to work with one of the large steel companies in Saudi Arabia. Even though I had an MCSE background, my specialty was Outlook and helpdesk. I had hardly any hands-on experience with Exchange servers. It was the first day of my job, the company was running Exchange 5.5, and I had to take over in an emergency because the earlier sys admin had gone on emergency leave and never came back. On the second day of my job, I was going through various options in Exchange 5.5 and tried to Google some administration documents. As I was not that confident with my Exchange abilities, but very confident with Outlook, I went to all the users (there were around 85 mailbox users) and changed the delivery location of emails to a local PST. I guess that was what saved me and my job. On the third day on the job my Exchange server crashed and I had to re-install Exchange server, but I failed. I was nervous and didn’t even attempt to look at a KB on how to re-install Exchange. My GM called me and asked me what was happening -- somehow I tried to convince him that Exchange 5.5 was an older version, and I started to install the newer version, Exchange 2000. I went to one of the download sites and downloaded a copy of Exchange 2000 and installed it --- thank god this time I followed the KB and got it working somehow, then exported the emails from Outlook back to the mailboxes. Even today I’m still ashamed that I installed a pirated version of Exchange in a production environment, but later on I moved to a licensed version with Microsoft. This experience piqued my interest in Exchange server, and a year later I resigned from the job, came back to Bombay (INDIA), my home town, and joined Microsoft as an Exchange Server support engineer. At Microsoft, I had the opportunity to get training on Exchange 5.5, 2000 and 2003. From that point I never looked back, and to date Exchange has been my bread and butter as a Product Specialist.
#9: Root Canal Mike P.
I came into work expecting an easy day as I had an appointment at midday for root canal surgery! At around 9am, lots of stuff hit the fan. Someone somewhere in the organization (an Exchange Admin) had triggered a global replication of the complete Exchange public folder tree. Within 30 minutes the MTAs were flooded with PF replication messages. Normal mail traffic was obviously impacted and it became apparent that if any user wanted to send a message to someone else in the office it would be quicker to walk and speak to them personally...even if that person was on another continent! I quickly called the team together for an 'all hands' meeting. We needed a solution and we needed one fast. After about an hour, we discovered heuristic parameters on the MTA that would allow us to route only mail messages to specific MTAs and allocate alternative MTAs as PF only routes. We implemented the change and then I went off to have root canal surgery and was back in the office by 2pm, by which time normal service had been resumed. What a day!
#10: Overwriting AD Troy M.
Several years ago I had the ‘opportunity’ to fix a rather unique problem with Active Directory. A company I was doing a security audit for had managed to overwrite the domain Administrator account, which led to all kinds of interesting issues. What started this was that they had renamed the domain Administrator account to a common name. Then they had allowed a helpdesk admin to have Domain Administrator rights, and to install software on a server. You can see where this is going, right? In the course of installing the software on the server, and creating an account to run the software, the helpdesk admin managed to create a new account with the same name as the domain Administrator account, which overwrote the Domain Administrator account. The helpdesk administrator did not see any problems in the installation. It wasn’t until services using the Domain Administrator credentials started failing, GPOs stopped processing and users couldn’t log in that anyone realized there was a problem. Fortunately, the systems administrators who were still connected were able to continue working. Once the chewing-out from management was over, a new account was created and granted all the necessary rights to act as the Domain Administrator. It took a while to track down all the necessary rights for the domain admin account, since some of them require changes to the domain security policies. But then there were still lingering problems. It took me a while to figure this out, but the issue was that all over the domain, there were references to the old Domain Administrator account, which no longer existed. It had become an orphaned SID which would not resolve to an actual account. This caused problems when processing GPOs and other domain info.
To track down the orphaned SID on all the machines in the domain, I created a script to run against all machines in the domain which would look for the specific SID and generate a report so I could remove them all and add the new Domain Administrator account.
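A report of that kind can be sketched as follows. The original script is not shown in the story, so this Python version is only an illustration under stated assumptions: it presumes each machine's ACL entries were already exported to text in a made-up `resource|SID` format, and the machine names, resources and SIDs are hypothetical.

```python
import csv
import io


def find_sid(dumps, target_sid):
    """Return sorted (machine, resource) pairs whose entry mentions target_sid.

    dumps: dict of machine name -> text with one 'resource|SID' entry per
    line (an assumed export format, not a real tool's output).
    """
    hits = []
    for machine, text in dumps.items():
        for line in text.splitlines():
            line = line.strip()
            if not line or "|" not in line:
                continue  # skip blank or malformed lines
            resource, sid = line.rsplit("|", 1)
            if sid.strip().upper() == target_sid.upper():
                hits.append((machine, resource.strip()))
    return sorted(hits)


def to_csv(hits):
    """Render the hits as CSV text for the cleanup report."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["machine", "resource"])
    writer.writerows(hits)
    return buf.getvalue()


# Hypothetical sample data.
ORPHAN = "S-1-5-21-1111-2222-3333-500"
dumps = {
    "web01": "C:\\inetpub|S-1-5-21-1111-2222-3333-500\nC:\\logs|S-1-5-32-544\n",
    "db01": "D:\\data|S-1-5-32-544\n\n",
    "app01": "Service: Backup|s-1-5-21-1111-2222-3333-500\n",
}
hits = find_sid(dumps, ORPHAN)
report = to_csv(hits)
```

The comparison ignores case so that differently formatted exports still match. Each row of `report` names one place where the orphaned SID has to be removed and the new account added.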
