Title: Endian on SSD Issues
Post by: shcc on Tuesday 27 March 2012, 07:25:29 am

Hello All,

Got a bit of an issue here: we have deployed multiple Endian firewalls. Has anyone noticed a massive failure rate on SSDs? We have been noticing that they fail about 3 months after deployment. The boxes with spindle drives just keep rolling on with no major issues, while the SSD boxes die between 3 and 4 months. We have tried multiple brands of SSD and multiple capacities as well, from 8 GB to 64 GB. I'm stumped. I am a major advocate for SSDs; my boss, on the other hand, is starting to hate them with a passion. Any help or input would be much appreciated.

Title: Re: Endian on SSD Issues
Post by: mrkroket on Wednesday 28 March 2012, 03:53:40 am

I haven't worked with SSDs yet, but they are known for their limited number of write cycles.
You need to configure your OS correctly to avoid early wear-out.
https://wiki.archlinux.org/index.php/Solid_State_Drives#Tips_for_Minimizing_SSD_Read.2FWrites
I don't think an SSD is the best choice for some servers. The write rate on a firewall can be very high (think about logs and caches).

For high-speed firewalls, I would use ramdisks instead of SSDs. On a firewall the main disk usage is for logs, proxy cache and lookup tables (like DansGuardian). I'd make a big ramdisk (4-6 GB) and use it for these disk-intensive operations. I'd set up a cron job to flush the useful data (mainly logs) to a spinning HDD.
I know it's a complex environment and hard to set up, but I prioritize reliability.
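
As a rough illustration of the "flush from ramdisk to spinning disk" idea, here is a minimal Python sketch of what such a cron job could run. The function name, the `*.log` pattern and the directories are assumptions made up for this example; nothing here is part of EFW, and the demo uses temporary directories in place of real mount points.

```python
import shutil
import tempfile
from pathlib import Path


def flush_logs(ram_dir, disk_dir):
    """Copy every *.log file from the ramdisk directory to persistent storage.

    Returns the names of the files that were copied, in sorted order.
    """
    ram_dir, disk_dir = Path(ram_dir), Path(disk_dir)
    disk_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(ram_dir.glob("*.log")):
        # copy2 preserves timestamps, so the archived copy still shows
        # when the log was last written on the ramdisk.
        shutil.copy2(src, disk_dir / src.name)
        copied.append(src.name)
    return copied


if __name__ == "__main__":
    # Temporary directories stand in for the (hypothetical) ramdisk and HDD mounts.
    with tempfile.TemporaryDirectory() as tmp:
        ram = Path(tmp) / "ramdisk"
        hdd = Path(tmp) / "hdd"
        ram.mkdir()
        (ram / "firewall.log").write_text("dropped packet\n")
        print(flush_logs(ram, hdd))
```

Anything written to the ramdisk since the last flush is lost on a crash or power cut, so the cron interval is a trade-off between disk writes and how much log history you can afford to lose.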

A system with a 160 GB classic disk and 16 GB RAM can be cheaper than its 160 GB SSD counterpart with 4 GB, and you can achieve almost the same result (OK, the SSD would boot much faster). But unreliable servers are the worst thing ever, no matter how fast they go.
As IT staff your main priority should be stability, not speed. People can complain if the system is slow, but be sure that they will cry out if the system doesn't work at all.

Title: Re: Endian on SSD Issues
Post by: rosch on Wednesday 09 May 2012, 10:32:06 am

For me it was more than 4 months, if I remember correctly, but my Endian 2.4.1 started to mess up not long ago.
I replaced both SSDs with old spinning hard disks. There is not much writing going on anyway, at least for me. But I guess that depends a lot on which services you have enabled.

Anyway, what I could save from the logs is this:

Apr 30 01:50:35 efw kernel: [6532505.296049] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 30 01:50:35 efw kernel: [6532505.296055] ata2.00: failed command: FLUSH CACHE
Apr 30 01:50:35 efw kernel: [6532505.296064] ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Apr 30 01:50:35 efw kernel: [6532505.296066] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 30 01:50:35 efw kernel: [6532505.296070] ata2.00: status: { DRDY }
Apr 30 01:50:40 efw kernel: [6532510.345013] ata2: link is slow to respond, please be patient (ready=0)
Apr 30 01:50:45 efw kernel: [6532515.344015] ata2: device not ready (errno=-16), forcing hardreset
Apr 30 01:50:45 efw kernel: [6532515.344025] ata2: soft resetting link
Apr 30 01:50:50 efw kernel: [6532520.544012] ata2: link is slow to respond, please be patient (ready=0)
Apr 30 01:50:55 efw kernel: [6532525.390013] ata2: SRST failed (errno=-16)
Apr 30 01:50:55 efw kernel: [6532525.390023] ata2: soft resetting link
Apr 30 01:51:00 efw kernel: [6532530.590014] ata2: link is slow to respond, please be patient (ready=0)
Apr 30 01:51:05 efw kernel: [6532535.435013] ata2: SRST failed (errno=-16)
Apr 30 01:51:05 efw kernel: [6532535.435023] ata2: soft resetting link
Apr 30 01:51:11 efw kernel: [6532540.635013] ata2: link is slow to respond, please be patient (ready=0)
Apr 30 01:51:40 efw kernel: [6532570.470013] ata2: SRST failed (errno=-16)
Apr 30 01:51:40 efw kernel: [6532570.470023] ata2: soft resetting link
Apr 30 01:51:45 efw kernel: [6532575.517013] ata2: SRST failed (errno=-16)
Apr 30 01:51:45 efw kernel: [6532575.517018] ata2: reset failed, giving up
Apr 30 01:51:45 efw kernel: [6532575.517023] ata2.00: disabled
Apr 30 01:51:45 efw kernel: [6532575.517030] ata2.00: device reported invalid CHS sector 0
Apr 30 01:51:45 efw kernel: [6532575.517047] ata2: EH complete
Apr 30 01:51:45 efw kernel: [6532575.517096] sd 1:0:0:0: [sdb] Unhandled error code
Apr 30 01:51:45 efw kernel: [6532575.517099] sd 1:0:0:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Apr 30 01:51:45 efw kernel: [6532575.517104] sd 1:0:0:0: [sdb] CDB: Write(10): 2a 00 01 28 1e 01 00 00 08 00
Apr 30 01:51:45 efw kernel: [6532575.517117] end_request: I/O error, dev sdb, sector 19406337

Whether it is really failing, I don't know. These SSDs are only about 1 year old.
By the way, the SSDs are Crucial M4 64GB.
crucial.com/store/partspecs.aspx?imodule=CT064M4SSD2
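
For anyone comparing their own logs against the excerpt above, a small Python sketch like the following can tally the notable libata events per port. The regular expression and the three event categories are assumptions chosen to match the lines posted in this thread, not a complete parser for kernel messages; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches syslog lines such as
# "... kernel: [6532505.296055] ata2.00: failed command: FLUSH CACHE"
ATA_LINE = re.compile(r"kernel: \[\s*[\d.]+\] (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Count failed commands, failed soft resets and disabled devices per ATA port."""
    counts = Counter()
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("failed command"):
            counts[(port, "failed command")] += 1
        elif "SRST failed" in message:
            counts[(port, "SRST failed")] += 1
        elif message.startswith("disabled"):
            counts[(port, "disabled")] += 1
    return counts


SAMPLE = """\
Apr 30 01:50:35 efw kernel: [6532505.296055] ata2.00: failed command: FLUSH CACHE
Apr 30 01:50:55 efw kernel: [6532525.390013] ata2: SRST failed (errno=-16)
Apr 30 01:51:05 efw kernel: [6532535.435013] ata2: SRST failed (errno=-16)
Apr 30 01:51:45 efw kernel: [6532575.517023] ata2.00: disabled
""".splitlines()

if __name__ == "__main__":
    for (port, event), n in sorted(summarize_ata_events(SAMPLE).items()):
        print(port, event, n)
```

A port that ends in "disabled" after repeated "SRST failed" lines, as here, means the kernel gave up on the device; that by itself does not say whether the drive, the cable, the controller or the driver is at fault.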

Title: Re: Endian on SSD Issues
Post by: mrkroket on Thursday 10 May 2012, 12:05:29 am

Probably EFW is not SSD aware, so it can burn some sectors fairly fast.

Title: Re: Endian on SSD Issues
Post by: rosch on Thursday 10 May 2012, 01:09:36 am

Quote from: mrkroket on Thursday 10 May 2012, 12:05:29 am

Probably EFW is not SSD aware, so it can burn some sectors fairly fast.

Yep I agree, and I guess that's what happened. Unfortunately the sectors are not simply marked as dead as is usually done with spinning disks.
I tried to find official information about SSD support but couldn't find anything.
My 6 year old disks do the job just fine. At the same time you can almost switch off your heating system :D

Title: Re: Endian on SSD Issues
Post by: SharpieTM on Friday 11 May 2012, 04:22:41 am

Quote from: mrkroket on Thursday 10 May 2012, 12:05:29 am

Probably EFW is not SSD aware, so it can burn some sectors fairly fast.

SSDs do not write data the same way as conventional HDDs. They have wear-leveling, so even if you keep writing to the same file, the writes are spread over the whole NAND (not to the same spot on the NAND). This is all done by the SSD's controller and is completely hidden from the OS and the SATA controller.

Title: Re: Endian on SSD Issues
Post by: SharpieTM on Friday 11 May 2012, 04:38:47 am

Quote from: shcc on Tuesday 27 March 2012, 07:25:29 am

Are you saying that they are completely dead after you remove them from the machine? Can't be read on another computer??

This is very unfortunate, since I thought using an SSD would bring along a nice boost if using the HTTP PROXY caching.

I wonder if the failures are more related to the 2.6.32 kernel that is being used in EFW, or to the SATA controller used? From what I read (for Ubuntu at least), 2.6.33 was the first kernel that RELIABLY provided TRIM for SSDs. I did notice that my current EFW already uses 'noatime' in the fstab for mounting the drive. I have had good success with Linux and SSDs, so this was a surprising thread to read.
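
To check the 'noatime' point on your own box, a minimal Python sketch along these lines can list the fstab entries that lack the option. The sample fstab text is a hypothetical example invented for illustration, not copied from an EFW install, and the set of skipped pseudo-filesystem types is an assumption.

```python
# Filesystem types where noatime is irrelevant (assumed list, extend as needed).
SKIP_TYPES = {"swap", "proc", "sysfs", "tmpfs", "devpts"}


def mounts_without_noatime(fstab_text):
    """Return the mount points of on-disk filesystems mounted without noatime."""
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in SKIP_TYPES:
            continue
        if "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


SAMPLE_FSTAB = """\
# hypothetical example, not taken from an EFW install
/dev/sda1  /         ext3  defaults,noatime  1 1
/dev/sda2  /var/log  ext3  defaults          1 2
/dev/sda3  swap      swap  defaults          0 0
"""

if __name__ == "__main__":
    print(mounts_without_noatime(SAMPLE_FSTAB))
```

On a real system you would pass in the contents of /etc/fstab instead of the sample text.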

Title: Re: Endian on SSD Issues
Post by: SharpieTM on Friday 11 May 2012, 04:48:05 am

Some googling seems to point at 2.6.32 not being the best kernel for SSDs:

archivum.info/linux-ide@vger.kernel.org/2010-02/00243/bad-performance-with-SSD-since-kernel-version-2.6.32.html

In fact, there are a few people that complained about their SSDs not working right with that kernel, and once they used a new one, their issues went away.

So the question now is, how do we get a newer kernel on EFW?

Title: Re: Endian on SSD Issues
Post by: rosch on Friday 11 May 2012, 05:12:51 am

Maybe EFW 2.5.1's kernel is new enough. I don't know because I don't have such a machine running at this time.

As for SSD on Ubuntu, my laptop has been running fine for over a year with Lucid.

Title: Re: Endian on SSD Issues
Post by: hde on Saturday 19 May 2012, 01:33:09 am

I can confirm this issue; we've lost 20+ systems running SSDs. We tried different SSD manufacturers, different services running on Endian, etc., but the systems kept on crashing.
The only solution, if we wanted to stay with SSDs, was to configure RAID1 (using 2 SSDs), and that seems to solve the problem.

We've seen two kinds of error codes:

ata1.01: status: { DRDY ERR }
ata1.01: error: { UNC }

And ICRC error.

Both errors result in a complete system failure, so the Endian software is unable to boot up.

Title: Re: Endian on SSD Issues
Post by: rosch on Saturday 19 May 2012, 01:37:35 am

Quote from: hde on Saturday 19 May 2012, 01:33:09 am

I can confirm this issue, we've lost 20+ systems running SSD disks

I assume you also ran efw 2.4.1 on the SSD machines?

Title: Re: Endian on SSD Issues
Post by: mrkroket on Saturday 19 May 2012, 04:54:50 am

Quote from: hde on Saturday 19 May 2012, 01:33:09 am

40 SSDs, wow. That isn't a casual issue.
If you're swimming in money, that's an option. I prefer a reliable system to a fast but buggy one.
If you have to replace the SSDs in each firewall every 1 or 2 years, and keep track of when to replace them, who replaces them, downtime, etc., it isn't worth the effort.
File usage on a firewall is totally different from a desktop machine; maybe SSDs aren't suitable for logging use.

Server reliability must be measured in years, not months. I have had an EFW firewall running on a simple desktop machine since 2009 without problems.
Any IT system should prioritize in this order:
-Stability and reliability
-Failover capability
-Features
-Speed
-Heat
-Noise

If you improve speed, heat and noise, but add tons of stability problems, what are you achieving?!

Title: Re: Endian on SSD Issues
Post by: pwinterf on Wednesday 30 May 2012, 12:14:40 am

Can you guys tell me if your issues are the same as what I'm getting here?

I'm using a small 30 GB SSD.

regards peter

posted in general support

issue with 2.4.1
I've been having an issue with 2.4.1 for quite a while now that initially looked like a hardware issue.

After about 48-72 hours the disk light on the box is stuck on.

There is no response from the network, and although the menu is still displayed on the console, selecting any of the options (e.g. reboot) gives "/sbin/reboot not found".

I've now replaced the system completely, with the exception of the plug-in network cards.

I had also run with an alternative disk for a few days, just in case the original was faulty, and that did the same thing.

After several of these 48-72 hour periods of disk I/O lock-up, the disk becomes corrupt beyond recovery, and a re-install and re-apply of the rules restores operation.

The system is now running on a newer version of the Jetway Atom board that it originally ran on, but it still does it.

I can't move to 2.5.1 until I've resolved the issue with interzone DNS, which breaks as soon as I upgrade with the same rule set.

Any ideas?

regards peter

Title: Re: Endian on SSD Issues
Post by: rosch on Wednesday 30 May 2012, 12:46:39 am

Does your log give you hints about what is going on?
If it's close to the posts above then I'd say your SSDs were also failing.

About your spinning disks, you can verify them with smartctl: http://smartmontools.sourceforge.net/man/smartctl.8.html
Or use the palimpsest disk utility if you're running gnome.

So far 2.4.1 did not lockup with spinning disks for me, only on (initially) new SSDs, and that was after 6 months of operation.
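
If you want to script the smartctl check, a minimal Python sketch could pull the raw values of a few wear-related attributes out of the attribute table that `smartctl -A` prints for ATA drives. The sample table is invented for illustration (it is not output from any drive in this thread), the ten-column layout is assumed from typical smartctl output, and the choice of watched attributes is only a suggestion.

```python
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def parse_smart_attributes(text, watched=WATCHED):
    """Return {attribute_name: raw_value} for the watched SMART attributes.

    Expects rows with the columns
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE.
    """
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in watched and raw.isdigit():
            found[name] = int(raw)
    return found


# Made-up sample in the assumed column layout.
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4321
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""

if __name__ == "__main__":
    print(parse_smart_attributes(SAMPLE_OUTPUT))
```

Non-zero raw values for reallocated or pending sectors are usually worth a closer look, but a clean SMART table does not rule out controller or driver problems.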

Title: Re: Endian on SSD Issues
Post by: pwinterf on Tuesday 03 July 2012, 04:32:22 pm

I finally caught my system in the act. Usually it's so locked up that I can't review the logs, and since the disk is offline it can't write the errors, so I could never fathom what was causing it.

This time I SSHed in and checked dmesg and the logs, which were in cache waiting to be written out, and there were loads of ATA errors, retries, etc., just like one of the logs in this thread.

I'll get a magnetic disk to replace the SSD and see how that goes, but it looks like there's an issue with SSDs and Endian.

Regards peter

Title: Re: Endian on SSD Issues
Post by: rosch on Tuesday 03 July 2012, 07:03:02 pm

Quote from: pwinterf on Tuesday 03 July 2012, 04:32:22 pm

I'll get a magnetic disk to replace the SSD and see how that goes, but it looks like there's an issue with SSDs and Endian.
Regards peter

I tried with 2 magnetic 80 GB disks. After a couple of weeks the machine was unresponsive and the HD LED was stuck on, exactly the same scenario as with the SSDs.
The 2 disks are old, but SMART is not showing any reallocated sectors.
All of this makes me think there is a grave bug in endian 2.4.1.
Unfortunately efw-upgrade shows no updates..

Title: Re: Endian on SSD Issues
Post by: pwinterf on Wednesday 04 July 2012, 08:57:48 am

Thank you
Now I know it's not just me; you have exactly the same issue.

So I wonder if our systems have anything in common. Mine's a Jetway Atom board.

I suspect we may be able to drop in another kmod for the SATA controller.

Anyhow, now that I know it's Endian, I can stop wasting my time replacing hardware.

Regards peter

Title: Re: Endian on SSD Issues
Post by: shcc on Friday 13 September 2013, 05:30:43 am

This 8GB modular SanDisk SSD has been in place for 2+ years without a problem. Amazingly enough. It's the only SSD we have used out of many that has survived.
On EFW 2.4.1.

FYI

hdparm -I /dev/sda

/dev/sda:

ATA device, with non-removable media
Model Number: SanDisk pSSD-S2 8GB
Serial Number: 101337301313
Firmware Revision: SSD 6.00
Standards:
Supported: 8 7 6 5
Likely used: 8
Configuration:
Logical max current
cylinders 15525 15525
heads 16 16
sectors/track 63 63
--
CHS current addressable sectors: 15649200
LBA user addressable sectors: 15649200
LBA48 user addressable sectors: 15649200
device size with M = 1024*1024: 7641 MBytes
device size with M = 1000*1000: 8012 MBytes (8 GB)
Capabilities:
LBA, IORDY(may be)(cannot be disabled)
Queue depth: 1
Standby timer values: spec'd by Standard, no device specific minimum
R/W multiple sector transfer: Max = 1 Current = 1
DMA: *mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 (?)
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=120ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* NOP cmd
* READ BUFFER cmd
* WRITE BUFFER cmd
* Host Protected Area feature set
Look-ahead
* Write cache
* Power Management feature set
Security Mode feature set
* SMART feature set
* FLUSH CACHE EXT command
* Mandatory FLUSH CACHE command
* 48-bit Address feature set
* DOWNLOAD MICROCODE cmd
* SMART self-test
Security:
Master password revision code = 65534
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
Checksum: correct
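
As a quick sanity check on the capacity figures in the hdparm output above, the sector counts can be converted by hand. The Python sketch below assumes 512-byte logical sectors (the output does not state the sector size); the sector count and CHS geometry are copied from the listing.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def capacity_from_sectors(sectors, sector_bytes=SECTOR_BYTES):
    """Return (size in units of 1024*1024 bytes, size in units of 1000*1000 bytes)."""
    total_bytes = sectors * sector_bytes
    return total_bytes // (1024 * 1024), total_bytes // (1000 * 1000)


def chs_sectors(cylinders, heads, sectors_per_track):
    """Number of sectors addressable through cylinder/head/sector geometry."""
    return cylinders * heads * sectors_per_track


if __name__ == "__main__":
    lba_sectors = 15649200  # "LBA user addressable sectors" from the output
    print(capacity_from_sectors(lba_sectors))  # (7641, 8012)
    print(chs_sectors(15525, 16, 63))          # 15649200
```

Both results agree with what hdparm reports: 7641 MBytes with M = 1024*1024, 8012 MBytes with M = 1000*1000, and a CHS sector count equal to the LBA count.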
