100% ha nas down!!!


DigitalDaz
28-11-2012, 20:13
On a serious note, you will notice that there was never an explanation of why the HA NAS went down.

If you do look at the claimed infrastructure you really do need to know how that could possibly happen given its supposed configuration.

The off-the-record explanation that I have had, though I can't remember where it came from, is truly frightening for your data, if true.

I heard that the two routers that provide the redundancy were BOTH flashed with an upgrade/update, whatever, without anyone testing that everything was OK after doing the first one. IIRC one of the jobs going on at that time seemed to back it up. If that was what happened, then all the redundancy in the world won't protect you.

The kit is generally fairly rock solid, what usually causes it all to fall down is human error.

DigitalDaz
28-11-2012, 20:02
Quote Originally Posted by clarge
Just going to add my 2 cents into this (sorry, new user). An SLA is a commitment to a customer, so offering a 100% SLA does not mean that the service will be up 100% of the time; it just means that they are going to try to do everything in their power to keep the server up 100% of the time. The SLA just holds them to certain requirements that must be met and also details your compensation in return. As has been stated, they will compensate you for the amount of downtime (which I know isn't much use to people). No service will ever be 100% up; it's impossible, as there will always be a situation where a failure could happen. So in a nutshell, they are not marketing the service wrongly by saying 100% SLA, as they are offering this. They are not saying that downtime will never happen; it's just in their best interest (and yours) to keep the service up as much as possible, because OVH do not want to be paying customers back all the time for downtime.

Cheers

Chris
Clearly, you haven't spent much time here. With regards to my HA NAS being down, I'm not sure they even accepted that it was, did they?

..and then there was the dodgy raid card that took 36 hours to get replaced

..and then there was the wrong disk that was replaced when one in the raid set went faulty.

...and then there was my private cloud that went down for a good two hours due to human error.

...and then I left

I truly got that downhearted in the end that I stopped reporting faults.

clarge
28-11-2012, 14:05
Just going to add my 2 cents into this (sorry, new user). An SLA is a commitment to a customer, so offering a 100% SLA does not mean that the service will be up 100% of the time; it just means that they are going to try to do everything in their power to keep the server up 100% of the time. The SLA just holds them to certain requirements that must be met and also details your compensation in return. As has been stated, they will compensate you for the amount of downtime (which I know isn't much use to people). No service will ever be 100% up; it's impossible, as there will always be a situation where a failure could happen. So in a nutshell, they are not marketing the service wrongly by saying 100% SLA, as they are offering this. They are not saying that downtime will never happen; it's just in their best interest (and yours) to keep the service up as much as possible, because OVH do not want to be paying customers back all the time for downtime.

Cheers

Chris

DigitalDaz
14-10-2012, 19:24
I did believe, it's as simple as that. I looked at the way OVH say they have it configured, e.g. multiple hosts connected to multiple disk sets with multiple power supplies etc.

Of course, nothing is flawless, but you should be able to stand back, look at how things are configured and make a fairly educated judgement as to whether you are within the limits that you are willing to tolerate/pay for etc.

I thought I had, clearly I was wrong but as yet there has still not been any explanation whatsoever.

One thing I have for certain decided is that relying on OVH HA anything is likely to leave you in deeper crap than not using it in the first place.

I would still be very interested to know why it did fail.

I'm out of here anyway, this was the last straw for me, for this particular operation at least.

I'm going RAID 10 colo with back up to NAS with a spare server and doing it in the UK.

How we use OVH in the future will depend on how Kimsufi changes.

All OVH kit without exception is getting dumped and moved to Kimsufi for the other projects.

If the special editions remain we will probably use those instead of OVH NAS; the bigger ones have twin 750GB hard drives in. I don't know how the DRBD thing works, but I figure strapping a couple of those together would give you a decent little backup store that should be fairly stable.

Shimon
14-10-2012, 16:46
Don't worry many of us are fanboys for the same reason Thelen is. I'm not ashamed.

Kode
14-10-2012, 14:37
Everything after
Quote Originally Posted by Andy
No, Thelen is what is known as someone who explains things in realistic terms.
was a good point well made, that bit, not so much.

Andy
13-10-2012, 23:56
No, neither would I, but you get my point.

Myatu
13-10-2012, 23:54
I wouldn't relate telling someone to "go somewhere else" as explaining things in realistic terms, sorry. Thelen has a habit of desultory ramblings of no value.

Anyway, I think the main point for DigitalDaz was why it happened in the first place, particularly when the incident report states no one on the HA service was affected. It would be something that OVH needs to address, to prevent it from happening again.

Andy
13-10-2012, 23:26
No, Thelen is what is known as someone who explains things in realistic terms.

If you want 100% availability on anything then you must have a completely redundant mirrored backup at a completely different location that you can fail over to automatically or at a moment's notice. If your 100% availability is so important to you, don't rely on a single point of failure, ever. By relying on OVH on their own that's exactly what you're doing, and you only have yourself to blame.

Nothing is ever 100% reliable no matter what OVH or anyone else says. This proves it. If you're blind enough to believe them then you obviously don't have the right experience.

Shimon
13-10-2012, 19:06
Thelen is what is known as a fanboy.

bago
12-10-2012, 11:51
My NAS HA has been down too. So it is not true that NAS HA were not affected (even if travaux says otherwise).

Kode
12-10-2012, 11:23
@Thelen do you work for ovh? Every time I see someone raise questions or issues you always shoot them down with something along the lines of "go somewhere else then, oh you can't for the same price", and then there are statements like
Quote Originally Posted by Thelen
Yea slowly, few more days for the new plans I think monday now :/
I don't care either way really, it would just make sense of your dogmatic attitude.

Thelen
12-10-2012, 02:29
The whole HA isn't necessarily affected for everyone though...

Anyway, if you don't want compensation for downtime, but want true uptime, you'll have to go elsewhere. (Oh wait, nowhere else offers a guaranteed uptime for any less than about $4000/month.)

DigitalDaz
10-10-2012, 13:42
The ticket was created by your own technician after I reported loss of connectivity on the 24 hour line:


Dear Customer,

An incident ticket 1157231 has just been created.

----------------------------------------------------------------------
Service : zpool-001583
Message :

zpool-001583 is inaccessible

issue related with the task : http://status.ovh.co.uk/?do=details&id=3491

You should have all the logs you need, the drives were showing as empty in the manager too.

And please, you are not going to seriously try and say that I was the only HA customer impacted by this? To even suggest that the HA was not impacted when clearly it was makes this whole incident even worse.

marks
10-10-2012, 11:52
To check the problem, I need at least the NAS name. Also, add an explanation of what exactly happened on your end and your NAS downtime (when it started and finished, with logs if possible), so I can investigate. There was an incident on the NAS:

http://status.ovh.net/?do=details&id=3491

but as it says there, HA shouldn't have been affected. That's why I need to look into the issue properly.

Regarding the 100%: this 100% is because the NAS system offers a duplicated network connection and physical storage, and thus we're confident to give you more than our usual 99.95% uptime. And we're ready to pay you compensation back from minute one if there is a problem (any SLA has to have a penalty in case it's broken).

DigitalDaz
10-10-2012, 10:21
Quote Originally Posted by marks
Send us the information about your downtime (start time, end time, your product) and we'll see if it qualifies for compensation.

100% availability means that we give compensation from the 1st second of downtime.
I'm not interested in the compensation one little bit. I'm more interested in why it happened and why anyone should believe that the same thing will not happen again tomorrow. I want to know that when I buy something that is badged as 100% available, it does indeed have more credibility than me offering it on a Kimi. I could offer guaranteed uptime of 100% on all my Kimsufi boxes and in three years I wouldn't have had to pay compensation. It wouldn't make it right though.

Why did it fail?

marks
10-10-2012, 09:53
Send us the information about your downtime (start time, end time, your product) and we'll see if it qualifies for compensation.

100% availability means that we give compensation from the 1st second of downtime.

DigitalDaz
10-10-2012, 09:44
Quote Originally Posted by ebony
I take it you were using the Hybrid High Availability NAS storage. The cost seems a little high for the space; a few 2G's would be a better choice, or maybe a few VPSes could work as well.
I kid you guys not when I say I have been using OVH for four years and I have only had problems since I started using the professional usage stuff.

Yes, Ebony, I am/was using the ha hybrid NAS.

A few weeks ago, human error took out my 100% available private cloud for two hours, yesterday I lost my 100% available NAS for about an hour and a half.

The only thing that we can 100% say is that OVH's 100% availability is not. It's pure sales bull. I mean, if the network setups are as described, this should not have been able to happen. I hope OVH will actually comment on this thread and say why it DID happen rather than maintaining on the status page that HA NAS customers were not affected when they clearly were.

By my calculations, even if this NAS does not go down at all again in a year, that downtime will make it about 99.98% available at best.
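
A quick way to check a figure like that is sketched below. It assumes roughly 90 minutes of downtime ("about an hour and a half") and a 365-day year; the function name and constant are made up for illustration and have nothing to do with how OVH measures its SLA.

```python
# Rough availability check; all names here are illustrative assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_pct(downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return availability over the period as a percentage."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Assumed outage length: 90 minutes, no further downtime that year.
    print(f"{availability_pct(90):.3f}%")
```

On those assumptions a single 90-minute outage already uses up more than the 52.56 minutes per year that a 99.99% target would allow.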

ebony
10-10-2012, 03:22
I take it you were using the Hybrid High Availability NAS storage. The cost seems a little high for the space; a few 2G's would be a better choice, or maybe a few VPSes could work as well.

Thelen
10-10-2012, 02:02
Hell yea. Happens at least every 3 months to one of the big ones. Not excusing it, just the nature of having a single point of failure.

You are right that 2 99.9% kimsufis would probably have less downtime and be cheaper.
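
As a back-of-the-envelope sketch of that claim: if two boxes each manage 99.9% and fail independently, the pair is only down when both are down at once. Independence and instant failover are big assumptions (a shared router or a bad update applied to both breaks them), and the function names are made up for the example.

```python
# Idealised redundancy arithmetic; names and assumptions are illustrative.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def parallel_availability(single: float, nodes: int = 2) -> float:
    """Availability of `nodes` redundant boxes, assuming independent
    failures and instant failover (the pair is down only if all are down)."""
    return 1.0 - (1.0 - single) ** nodes


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime in hours for a given availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    one = 0.999
    pair = parallel_availability(one)
    print(f"single: {downtime_hours_per_year(one):.2f} h/year")
    print(f"pair:   {downtime_hours_per_year(pair) * 3600:.1f} s/year")
```

On paper that is about 8.8 hours a year for one box against roughly half a minute for the pair, but only for as long as nothing takes both out at the same time.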

DigitalDaz
10-10-2012, 01:56
Are you saying something like this is common?

Something that is badged as 100% available is out for over an hour?

The reality is I would have been better off sticking to local disks. I'm running VMs off these because I believed them to be more resilient than local drives.

Have you seen the on paper spec for the thing that just totally fell to pieces?

Without a shadow of a doubt, two Kimsufis, one acting as a backup to the other would have had less downtime and been cheaper.

Thelen
10-10-2012, 01:39
Pretty fast restore, incidents like this have taken other large providers more than 4 hours.

DigitalDaz
09-10-2012, 22:11
It appears to be back up, just.

Down for at least an hour.

DigitalDaz
09-10-2012, 21:00
Title says it all.

OVH has just confirmed on the incident line that the HA NAS is indeed down