2 Days of Chat Channel and Login Snafus Caused by “Perfect Storm”

Written by Feldon. Posted in Game Updates & Maintenance

SmokeJumper and Rothgar have been weathering the storm of the Blackburrow -> Everfrost server merge. The merge knocked out in-game chat channels such as 1-9, 70-79, and cross-server chats on ALL servers for 24 hours. That was followed by 12 hours of an even deeper chat blackout on ALL servers, during which guild chat, group chat, public channels, cross-server chat, login to SOE games, and even access to the SOE forums were all intermittently interrupted.

SmokeJumper on the EQ2 Forums:

Unfortunately, this was a bit of a “Perfect Storm”.

The merge that happened on Live would not have affected you other than the chat and email that was outaged for the duration of that merge.

But there was another non-EQII-dev-team outage that was scheduled from 1am last night and that’s what you’re seeing now.

Rothgar added some more details to the discussion:

The mergers have been “planned” since way before the other network maintenance was planned. Yes, it was somewhat of a risk to start the merger before the other maintenance, but had we waited it would have pushed the merges off by another entire week.

Yes, we announced that mergers would begin the 16th. This is mainly due to the fact that we were required to give at least a 30 day notice. So from our standpoint, the earliest we could start the mergers would be the 16th, but we were never willing to start them until we had most of the kinks worked out. It doesn’t benefit you guys for us to kick off the merger if we know it’s going to cause all sorts of problems.

We could have waited until the merger process was ready to go and THEN announce it, but that would have added another 30 days delay in there. So we announced it well before it was ready and hoped for the best. Yes, we were trying to cut it close. Either way, you got the first merger just as soon as it was ready. It wasn’t really delayed, it just wasn’t ready yet. Unfortunately due to the announcement requirement, the appearance of the situation is that it was delayed.

After a merge is done, all mail and friends lists need to be moved from the old server ID to the new server ID. This process takes several hours and needs to be done when chat is down. So we have to disable our UCHAT system for everyone. This means that all channels that go through UCHAT, mail and friends lists are not available. Even though the 1-90 channels are server only, they still go through UCHAT. The guild channel, say, shout, etc. are all handled by the server. This is just something we have to deal with for the mergers to happen.
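Editor's illustration (not SOE's code): the step Rothgar describes, rewriting the server ID on mail and friends-list records while chat is offline, can be sketched roughly as below. The `Record` type, its field names, and the `chat_online` flag are all hypothetical; the source only says that records move from the old server ID to the new one and that chat must be down while this happens.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Record:
    """Hypothetical mail or friends-list entry keyed by a server ID."""
    owner: str
    server_id: int
    kind: str  # e.g. "mail" or "friends"


def migrate_records(records: List[Record], old_id: int, new_id: int,
                    chat_online: bool) -> List[Record]:
    """Return a copy of records with old_id rewritten to new_id.

    Refuses to run while chat is online, mirroring the stated requirement
    that the move be done with the chat system disabled.
    """
    if chat_online:
        raise RuntimeError("chat must be disabled before migrating records")
    # Records on other servers pass through unchanged.
    return [replace(r, server_id=new_id) if r.server_id == old_id else r
            for r in records]
```

In a sketch like this the chat system has to stay down for as long as the rewrite takes, which fits the several hours of unavailable channels, mail, and friends lists described above.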

Yes, there was an issue with the naming script, but it was fixed on our side and handled so that you don’t have to petition CS to take care of it. I’m not sure you could ask for a better resolution other than having the problem not occur in the first place.

The reports of people getting XX’s when they had no doppelganger on the other server were all false. I checked several of them myself and in all cases there was another character on the other server with the same name. People have the ability to hide their accounts and alts on EQ2Players and this is why you can’t always use it to see if another character exists.
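Editor's illustration (not SOE's naming script): a name-collision check of the kind discussed here might look like the sketch below. The suffix value, the case-insensitive comparison, and the rule that the incoming character is the one renamed are all assumptions; the source only says that characters whose names already exist on the other server ended up with X's.

```python
from typing import Dict, Iterable


def resolve_names(target_names: Iterable[str], incoming_names: Iterable[str],
                  suffix: str = "X") -> Dict[str, str]:
    """Map each incoming name to a name that is free on the target server.

    target_names must be the full server-side list, including characters
    hidden from public listings; a public profile site cannot supply that.
    """
    taken = {name.lower() for name in target_names}
    resolved = {}
    for name in incoming_names:
        candidate = name
        # Keep appending the (assumed) suffix until the name is unique.
        while candidate.lower() in taken:
            candidate += suffix
        taken.add(candidate.lower())
        resolved[name] = candidate
    return resolved
```

The point of the sketch is that the check runs against every character on the destination server, so a rename can be correct even when no matching character is visible on EQ2Players.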

As SJ mentioned, this maintenance was company-wide and out of our control. I’m not sure why EQII wasn’t on the list of affected games though. Maintenance happens. It’s in the EULA that servers will come down for maintenance. I hate to always point back at the EULA because it does suck when you can’t play, but the contract is there for a reason. We don’t do maintenance just because we’re bored and want to inconvenience all of our players.

The maintenance was related to network hardware, so that can affect all kinds of servers, not just game servers. Again, there isn’t much we can really do about this in the short term. Maintenance of this type is pretty rare so I don’t think it warrants any major changes.

When it was explained to SmokeJumper that ALL public chat channels, including 1-9, 70-79, Auction, etc., were knocked out on all other servers during the merge, he said:

That wasn’t supposed to have happened. I’ll check into it to see what we can do about avoiding that on future merges.

So then, if all chat channels are going to be down on all servers for at least 12 hours for each of the merger days (currently looking at 7-8 days of downtime), why not try to merge all servers at once and just swallow the pill? Rothgar lays out the reasons:

We discussed doing all of the mergers simultaneously but I think it would be a bad idea for several reasons.

First, no matter how much testing we do internally, there is always the possibility of something breaking that we didn’t catch. As much as we want to shield this from happening to even one player, we’d much rather discover a problem after a single merger than have it happen to every server at the same time.

Second, many of the servers are in the same data center and share some of the same networking and/or database hardware. By performing merges at the same time, they could potentially conflict with each other and would definitely slow things down a LOT. I wouldn’t be surprised if we would need 3-4 days of straight downtime for every server if we tried to do them all at once. And if we ran into a problem, it would extend the downtime for everyone, not just the one server.

So hopefully these reasons make sense. We have talked about doing two mergers at a time to speed things up, but that’s probably about as much as I’d feel comfortable doing.

As far as the patch to our Amsterdam data center, that entire process is handled by another department and I wasn’t aware of the network problems myself until it was already past the scheduled maintenance window. We do actually start the file transfer well in advance of the downtime and agree that we should probably not take the servers down if the transfer isn’t done. I’m not sure what the circumstances were around this situation. Perhaps it was close and they felt that it would finish soon. We will definitely discuss it and see what we can do to keep that from happening in the future.

Comments (2)

  • Loredena

    Since this actually relates to the sort of work I do for a living — yes, it was a perfect storm. I’ve had my systems have unexpected downtime due to a network patch elsewhere in the company that “shouldn’t” have impacted me – so I wasn’t told. I’ve had EU or AP run into snags on file transfers or installs, and due to timezone differences they’d often keep plugging away at it because they thought it would be done by the time they could reach someone in NA. It happens. It IS frustrating from the player/enduser side though.

  • Grump

    I am really impressed with SJ’s and Rothgar’s announcement on what exactly messed up and going into some detail about it. In the past SOE had a policy of not explaining what mucked up or why and leaving folks in the dark, which happened during the last server merger during the KoS run.