<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Xyratex Storage Insights</title> <atom:link href="http://blog.xyratex.com/feed/" rel="self" type="application/rss+xml" /><link>http://blog.xyratex.com</link> <description>Xyratex Blog</description> <lastBuildDate>Tue, 01 May 2012 17:37:13 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.0</generator> <item><title>Live from HPCS &#8211; Canada</title><link>http://blog.xyratex.com/2012/05/01/live-from-hpcs-canada/</link> <comments>http://blog.xyratex.com/2012/05/01/live-from-hpcs-canada/#comments</comments> <pubDate>Tue, 01 May 2012 16:08:22 +0000</pubDate> <dc:creator>Peter Bojanic</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[ClusterStor]]></category> <category><![CDATA[Compute Canada]]></category> <category><![CDATA[Cray]]></category> <category><![CDATA[High Performance Computing Symposium]]></category> <category><![CDATA[HPCS]]></category> <category><![CDATA[I/O Bottleneck]]></category> <category><![CDATA[Lustre]]></category> <category><![CDATA[OpenSFS]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=207</guid> <description><![CDATA[Hello from Vancouver! I am here at the High Performance Computing Symposium as part of BCNET &#38; HPCS 2012 which is Canada’s premier HPC forum.  The three day Symposium looks to be an exciting one with close to 500 high profile delegates from various disciplines within the research and HPC communities.   Xyratex is participating in multiple [...]]]></description> <content:encoded><![CDATA[<p>Hello from Vancouver! I am here at the High Performance Computing Symposium as part of BCNET &amp; HPCS 2012 which is Canada’s premier HPC forum.  The three day Symposium looks to be an exciting one with close to 500 high profile delegates from various disciplines within the research and HPC communities.  </p><p>Xyratex is participating in multiple levels in the exhibition as well as leading several presentations.  We kicked things off bright and early Sunday morning with Lustre Community presentations by Kevin Canady who spoke about the Open Scalable File Systems (OpenSFS) community and myself, talking about the work that Xyratex is doing in partnership with Cray to mature Lustre® 2 for HPC. There were HPC storage site updates from Simon Fraser University and Clumeq and our colleagues from Whamcloud presented on Lustre releases and Lustre system administration.</p><p>Last June we announced the ClusterStor™ 3000 and shortly after we participated in HPCS in a limited fashion.  Today we are proud to be Gold Sponsors of HPCS and will be presenting “HPC Storage – Breaking the IO Bottleneck”, which highlights the unique architectural approach by Xyratex that addresses the industry’s increasing IO demands. ClusterStor delivers linear performance scalability, ease of installation and management, and enhanced storage system reliability at scale. Please visit us at Booth 19 in the Segal Centre to learn more about our innovative HPC Solutions.</p><p>For more information see the press release: <a
href="http://www.xyratex.com/Company/News/Detail.aspx?ID=324">http://www.xyratex.com/Company/News/Detail.aspx?ID=324</a></p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2012/05/01/live-from-hpcs-canada/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>When Size and Scale Matter</title><link>http://blog.xyratex.com/2012/03/08/when-size-and-scale-matter/</link> <comments>http://blog.xyratex.com/2012/03/08/when-size-and-scale-matter/#comments</comments> <pubDate>Thu, 08 Mar 2012 19:11:17 +0000</pubDate> <dc:creator>Mike Stolz</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[disk component technology]]></category> <category><![CDATA[disk processing solutions]]></category> <category><![CDATA[storage solutions]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=203</guid> <description><![CDATA[Late last year IDC recognized Xyratex as the world’s largest supplier of disk storage systems to the original equipment manufacturer (OEM) market in 2010. This supplier ranking was based on 2010 revenue and highlighted a substantial 69.5% year-over-year revenue growth and resulting  43.9% share of revenue earned by ten OEM suppliers that IDC tracks; a [...]]]></description> <content:encoded><![CDATA[<p>Late last year IDC recognized Xyratex as the world’s largest supplier of disk storage systems to the original equipment manufacturer (OEM) market in 2010. This supplier ranking was based on 2010 revenue and highlighted a substantial 69.5% year-over-year revenue growth and resulting  43.9% share of revenue earned by ten OEM suppliers that IDC tracks; a gain of over 14%. This report is typically issued in the 4<sup>th</sup> quarter of the next calendar year to allow time to compile the necessary data for analysis.</p><p>In the report, “Worldwide Enterprise Storage Systems 2010 Vendor Shares,” (IDC Worldwide Enterprise Storage Systems 2010 Vendor Shares, doc #229828, October 2011), IDC analyzed the market share data of the top 30 disk storage system vendors as well as major storage system OEM suppliers. In IDC’s assessment of OEM revenues for selected suppliers, Xyratex ranked first in 2010 with nearly $1.3 billion of enterprise data storage solutions revenue in the calendar year and shipping over three exabytes (3,000 petabytes) of storage through its customer base. To put that in perspective, that equates to over 3 billion gigabytes or 750 million DVDs.</p><p>Industry-wide increases in demand for data storage from both our tier one and emerging storage technology partners contributed significantly to our exceptional growth in 2010.  Building on this success, Xyratex shipped its <strong><em>one millionth</em></strong> data storage platform! 
That’s a lot of capacity and looking forward we see no slowdown in this growth.  </p><p>Another notable point is our continued investment in R&amp;D and the growth in our IP portfolio. With nearly 450 patents granted or pending, Xyratex has established a significant amount of data storage expertise.</p><p>But does any of this really matter?</p><p>We think it does, but don’t take our word for it. Our recent history can shed a bit of light on this:</p><p>Nearly all of the significant data storage acquisitions over the last 3-4 years have a common theme – the company being acquired leveraged Xyratex for the platform while they focused on their application. Hence EqualLogic, XiV, Data Domain, 3PAR and Compellent, which all based their solutions on Xyratex platforms, all proved such attractive targets for acquisition.</p><p>-       Often during these acquisitions, forecasting and product ramp are difficult to determine. We have become expert in managing the evolving data storage requirements through the transition period, helping the acquiring company manage the growth in demand.</p><p>-       Our customers enjoy the capability and efficiency of our worldwide operations to fulfill their storage demand requirements from the closest Xyratex facility. We have full operations capability in North America, Europe and Malaysia.</p><p>-       Few companies have better disk drive knowledge than Xyratex. Our expertise in Disk Processing Solutions and Disk Component Technologies provides Xyratex with a unique advantage in understanding the design requirements of new disk drive technologies well in advance of these drives becoming available in the market.</p><p>-       Most of our customers are serving the Enterprise Data Storage market where Reliability, Availability and Serviceability (RAS) are paramount. 
We are innovators in hardware and software RAS system design, combining the highest quality data storage platforms with competitively priced solutions.</p><p>-       Designing our ClusterStor HPC storage solutions taught us how to produce a robust and reliable scale-out solution combining Xyratex hardware and software architecture and design with the best-in-class Open Source software.</p><p>If you’re looking to enter the emerging Cloud market, have a vertical application, or have a great idea for the traditional IT user, we encourage you to look to Xyratex as your partner and take advantage of these capabilities for yourself.</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2012/03/08/when-size-and-scale-matter/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Webinar &#8211; InfiniBand-based Storage for Efficient and Scalable High Performance Computing</title><link>http://blog.xyratex.com/2012/02/10/webinar-infiniband-based-storage-for-efficient-and-scalable-high-performance-computing/</link> <comments>http://blog.xyratex.com/2012/02/10/webinar-infiniband-based-storage-for-efficient-and-scalable-high-performance-computing/#comments</comments> <pubDate>Fri, 10 Feb 2012 15:30:27 +0000</pubDate> <dc:creator>Michael Connolly</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[ClusterStor]]></category> <category><![CDATA[HPC]]></category> <category><![CDATA[InfiniBand]]></category> <category><![CDATA[Mellanox]]></category> <category><![CDATA[petascale computing]]></category> <category><![CDATA[scalable storage]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=200</guid> <description><![CDATA[I had the recent opportunity on January 25, 2012 to participate with our valuable partner Mellanox Technologies, Ltd. in a joint educational webinar. Mellanox is a leading supplier of high-performance, end-to-end connectivity solutions for data center servers and storage systems and Xyratex currently utilizes their InfiniBand technology as part of our ClusterStor™ 3000 storage solutions. [...]]]></description> <content:encoded><![CDATA[<p>I had the recent opportunity on January 25,<sup> </sup>2012 to participate with our valuable partner <a
href="http://www.mellanox.com/">Mellanox Technologies, Ltd.</a> in a joint educational webinar. Mellanox is a leading supplier of high-performance, end-to-end connectivity solutions for data center servers and storage systems and <a
href="http://www.xyratex.com/">Xyratex</a> currently utilizes their InfiniBand technology as part of our <a
href="http://www.xyratex.com/products/storage/clustered_filesystem_solutions/clusterstor_3000.aspx">ClusterStor™ 3000</a> storage solutions.</p><p> The interactive webinar focused on the advantages of InfiniBand-based storage in building petascale computing systems for high performance computing (HPC) environments as well as how to modernize today’s data centers, an important if not critical component of Cloud Computing services and managing exponential data growth. The presentation content included detail on the industry, Mellanox&#8217;s 40 and 56Gb/s InfiniBand interconnect solutions for optimum performance and elimination of storage I/O bottlenecks, and, of course, Xyratex ClusterStor solutions as the leading example to achieve high levels of throughput performance and enablement of efficient scaling from terabytes to tens of petabytes of data.</p><p>Presenting via webinar, as many of my colleagues are aware, is far different from presenting to live audiences: it is difficult to gauge the interest of the group and which areas of information to focus on, not to mention that questions are handled via chat boxes. Even with this challenging format, the webinar went very well with many follow-up questions and multiple parties contacting us directly for more information. In addition, many of those unable to attend the live presentation have now been taking advantage of the recorded presentation up on Mellanox’s website.</p><p> This webinar is a prime example of the value of industry partnerships between leading organizations. If you were unable to attend the live event, I urge you to have a listen to this enlightening presentation and find out more about the advantages of InfiniBand-based storage in building petascale computing and storage solutions for HPC in areas such as research, simulation, climate modeling and oil and gas exploration.</p><p> You can access it via this link: <a
href="http://www.mellanox.com/webinars/2012/InfiniBand-based-Storage/">http://www.mellanox.com/webinars/2012/InfiniBand-based-Storage/</a></p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2012/02/10/webinar-infiniband-based-storage-for-efficient-and-scalable-high-performance-computing/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Ultra-dense storage subsystems – some points to consider</title><link>http://blog.xyratex.com/2012/02/08/ultra-dense-storage-subsystems-%e2%80%93-some-points-to-consider/</link> <comments>http://blog.xyratex.com/2012/02/08/ultra-dense-storage-subsystems-%e2%80%93-some-points-to-consider/#comments</comments> <pubDate>Wed, 08 Feb 2012 15:35:51 +0000</pubDate> <dc:creator>Doug Donsbach</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[high density storage]]></category> <category><![CDATA[ultra-dense storage]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=197</guid> <description><![CDATA[In order to minimize the use of space in the modern data center, many storage solution architects and storage subsystem users are turning to innovative packaging schemes to stuff as many hard disk drives (HDDs) and/or solid state drives (SSDs) into as small a volume as possible.  While it’s simple to state the problem as [...]]]></description> <content:encoded><![CDATA[<p>In order to minimize the use of space in the modern data center, many storage solution architects and storage subsystem users are turning to innovative packaging schemes to stuff as many hard disk drives (HDDs) and/or solid state drives (SSDs) into as small a volume as possible.  While it’s simple to state the problem as being one of highly efficient use of the space available inside the enclosure for the purpose of mounting HDDs and SSDs, there are many complex considerations involved in actually engineering an ultra-dense storage subsystem.</p><p>For starters, there is the issue of heat.  Mounting a large number of drives in a relatively small volume requires a potentially huge volume of airflow to keep everything cool and comfortable inside the drive manufacturer’s temperature specifications.  The only practical way to deliver this airflow is through the use of large fans, which have the potential of generating high levels of power consumption and noise as they spin at high speeds.</p><p>Because the enclosure is expected to cool all of its components across a broad range of ambient temperatures, one option is to provide a relatively small number of preset fan speeds, and run the fans at those speeds based on ambient temperature.  
However, this approach is usually very inefficient because there will be variations in workload and drive types and configurations that will cause the fans to be running either too fast (and consuming more power and generating more noise) or too slow (and reducing the temperature margin available for the drives) at certain ambient temperatures.</p><p>A smarter approach to the cooling problem is to design in a large number of semiconductor temperature sensors in critical locations throughout the enclosure and use the enclosure management processor to integrate the temperature measurements from the sensors to arrive at an optimum setting for the speed of the fans.  This will always result in the speed of the fans being precisely that required to control the temperature of the components inside the enclosure, hence maximizing the life of the components while optimizing the noise levels and power consumption of the fans. Happiness all round!</p><p>Next up on the list of design points is vibration.  As drive performance and density have increased over the generations of HDDs, in some cases so has their susceptibility to vibration as a detriment to both performance and reliability.  Put a bunch of performance-optimized drives in an enclosure, get them all seeking at the same time, and you’ve created the potential for the vibration created by the drives to be great enough to interfere with the ability of an individual drive’s servo loop to keep the head on track.</p><p>When this happens, drive performance takes a nosedive and data reliability could be impacted as well.  And the fans we talked about earlier?  They’re also sources of vibration, and although our intelligent approach to fan speed control has them spinning just fast enough to keep things cool, it’s still important to realize that at the upper end of the ambient temperature range when the fans really do need to run at high speed, we don’t want them adding to the vibration problem.  
Fortunately, Xyratex has amassed a great deal of knowledge from its decades of producing drive process, inspection and test equipment.  When that expertise is coupled with careful modeling and a focus on both isolating the drives from vibration and minimizing sources of fan vibration, the enclosure is going to be a superior performer when the heads are seeking.</p><p>While I just touched a bit on drive performance, another important consideration is the topology of the data channels inside the enclosure.  A critical question to ask is whether the storage subsystem I/O architecture is up to the task of delivering all the performance possible from the drives inside.  After all, we’ve made sure that the vibration isn’t going to be a problem, so let’s get some I/Os going and see what kind of numbers we can wring out of this thing!  With a Xyratex engineered enclosure, you can be sure that you’ll have I/O channels designed from the I/O module host port to the connection to the drive that will allow you to take advantage of all of the performance from all the drives, all the time.</p><p>And with data flowing at maximum rates, what’s to keep one bit out of all those gigabits per second from taking an occasional detour into a place it shouldn’t go, resulting in data loss or a performance-robbing retry operation?  Careful control of signal integrity is what determines that, and as enclosures have progressed from dense to ultra-dense, so too have the lengths these critical data and control signals must travel from the I/O module to the drive.  Detailed modeling and analysis of the paths taken by the signals inside our enclosures, along with careful layout and optimal selection of PCB material and other devices, is the foundation of the Xyratex design process.  An error-free system is what you’ll get, and how important is that when you’ve packed all those drives into that space-saving enclosure?</p><p>Finally, let’s not skip an important area: usability.  
With a large number of drives in an ultra-dense enclosure, who wants to go through a tedious, one-at-a-time sequence installing the drives?  Along with our enclosure we developed a tool that allows you to install multiple drives into the enclosure in one easy motion, saving installation time and money – especially important in multi-petabyte installations!</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2012/02/08/ultra-dense-storage-subsystems-%e2%80%93-some-points-to-consider/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>FDMI™ &#8211; File Data Management Interface</title><link>http://blog.xyratex.com/2012/02/01/fdmi%e2%84%a2-file-data-management-interface/</link> <comments>http://blog.xyratex.com/2012/02/01/fdmi%e2%84%a2-file-data-management-interface/#comments</comments> <pubDate>Wed, 01 Feb 2012 19:19:01 +0000</pubDate> <dc:creator>Peter Braam</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[changelog]]></category> <category><![CDATA[FDMI]]></category> <category><![CDATA[File Data Management Interface]]></category> <category><![CDATA[IRODS]]></category> <category><![CDATA[peta-scale]]></category> <category><![CDATA[petascale]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=193</guid> <description><![CDATA[Data management is a hugely important aspect of storage, and yet interoperable standard approaches to perform it have been slow in coming.  A few frameworks have become widely used, with DPMI for backup probably leading the pack.  Emerging initiatives such as IRODS hold promise to bring order to rule-oriented data management and middleware.  Others [...]]]></description> <content:encoded><![CDATA[<p>Data management is a hugely important aspect of storage, and yet interoperable standard approaches to perform it have been slow in coming.  A few frameworks have become widely used, with DPMI for backup probably leading the pack.  Emerging initiatives such as IRODS hold promise to bring order to rule-oriented data management and middleware.  Others such as DMAPI for HSM have seen success but limited adoption, with a few vendors supplying applications and strict interoperability requirements.  This post is about a Xyratex initiative that addresses the performance and scalability requirements of data management at peta-scale and beyond.</p><p> With storage transitioning from single storage servers to multi-server clustered and cloud storage, data management problems get some new dimensions.  For example, a single file system client may not provide the bandwidth to replicate peta-scale storage.  Moreover, scanning file systems with hundreds of millions of files or more takes a long time in the best of circumstances, and can impose unacceptable load on the servers.  A further scalability concern arises from the great variety of data management that is needed:  it is not uncommon to require backup, replication, migration and audit services, all at the same time.  
Oh, and while we are at it, isn&#8217;t the file system checker we discussed last month another scalable data management tool?</p><p>The File Data Management Interface (FDMI) provides a few interfaces that assist in developing scalable management applications while addressing these sorts of concerns.  Moreover, if the interface is adopted by multiple file systems, the resulting data management components should interoperate.</p><p> There are only a few concepts in FDMI.  First, FDMI requires a changelog.  Such a changelog can be almost free of cost if it is built in conjunction with the file system recovery logs.  Some quick calculations show that a changelog doesn&#8217;t grow so fast and it&#8217;s often possible (but not required) to keep the entire history.   A first key observation is that while servers will need to record the changelog in a transactional, reliable manner, it is not desirable to also ask these same servers to become responsible for all kinds of data management.  Instead, the changelog should be exported, possibly simply as a file, and be available for parsing on any number of data management nodes in the cluster.  Each of these data management nodes will parse the changelog for its own purpose, and in particular, it may build indexed databases for its particular management task.  We already discussed the sorts of relationships that file system checkers need; replication of subtrees may need indices by path names, migration software may need indices mapping data to server nodes, and HSM may need indices of not recently used files to make its migration decisions.  By exporting the file system changelog we have created a scale-out data management framework, and do not impose significant new load on the servers.  </p><p> Changelogs, as is well known, eliminate the need for scanning: for example, a changelog-driven replicator has no need to scan the file system to find all changes; it can pick up at the changelog record where it previously left off.   
Doing this on a subtree of a file system does require an extra index, but is similar, in principle.   </p><p> Picking up where one left off involves an aspect beyond the changelog itself, and this introduces the second central FDMI concept, that of an update stream.   Suppose a batched sequence of replication requests is sent from a source to a recipient and the recipient crashes.    It would be very convenient if the recipient &#8211; upon restarting &#8211; could find the last changelog entry that it recorded in its storage.  This would tell the source where to resume to get the complete sequence to the recipient.  We call this recipient-based infrastructure an update stream, and it combines the creation of a record associated with the data management operation with the operation itself.    The stream itself names the source and in many cases the record corresponds to the source&#8217;s changelog record.   The operations associated with an update stream should be a superset of the normal file system APIs and include operations like opening files by inode, setting creation times and others that are not available through the POSIX API.</p><p> We have found that the framework we describe here allows one to implement many data management operations in a scalable manner.  However, there are some operations that require a different framework, and these are operations that require synchronous interception.   Last month we saw an example &#8211; the file system may need to synchronously invoke the repair tool before it can proceed.  Many synchronous management tasks such as snapshots and continuous data protection (CDP) are less tolerant of latency than the repair tool and require an interception framework akin to the Windows filter drivers.  However, a very large number of data management tasks can work at enormous scale with the primitives we have described.  
In a future post we will revisit the fine points of the parallel file replicator and discuss its scalability further.</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2012/02/01/fdmi%e2%84%a2-file-data-management-interface/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Repairing and Checking File Systems</title><link>http://blog.xyratex.com/2011/11/28/repairing-and-checking-file-systems/</link> <comments>http://blog.xyratex.com/2011/11/28/repairing-and-checking-file-systems/#comments</comments> <pubDate>Mon, 28 Nov 2011 16:36:30 +0000</pubDate> <dc:creator>Peter Braam</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[checksum]]></category> <category><![CDATA[data integrity]]></category> <category><![CDATA[file system]]></category> <category><![CDATA[fsck]]></category> <category><![CDATA[repair node]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=190</guid> <description><![CDATA[Checking and repairing large file systems has become one of the biggest challenges facing the technology.   The utility that does this goes by the name of &#8220;fsck&#8221;, and it iterates over the entire content of a file system checking for the integrity of the (relational) data structures.  The problem is that it takes a [...]]]></description> <content:encoded><![CDATA[<p>Checking and repairing large file systems has become one of the biggest challenges facing the technology.   The utility that does this goes by the name of &#8220;fsck&#8221;, and it iterates over the entire content of a file system checking for the integrity of the (relational) data structures.  The problem is that it takes a very long time to run fsck on current, large file systems.</p><p> Why do we need checkers?  Early file systems, like ext2, had no transactional recovery mechanisms and were checked and repaired after all &#8220;un-clean&#8221; shutdowns, such as abrupt power down of systems.  That was improved with the next generation of file systems, which had shadow trees (WAFL) or journals (ext3) for recovery.   Now almost all recovery of file systems proceeds without checking and repairing.   But it was not the end of fsck.  First, if the data on the storage device became somehow corrupted, fsck would always help out, but the recovery mechanisms would not.  Fsck might lose the name of some files, or some data blocks, but a usable file system, typically with almost all files in good shape, would emerge.   To fight this kind of corruption, RAID became the norm, and more recently RAID combined with checksumming techniques for data integrity, such as seen in ZFS and btrfs.   
Surprisingly, this still didn&#8217;t spell the death of fsck.</p><p> Embarrassingly, software bugs are nearly impossible to eliminate in the increasingly sophisticated file systems and can lead to data corruption that again escapes all mechanisms of repair, except fsck.   But fsck takes very long to run now and because the iteration over all the files is effectively the root cause of this, only modest progress can be expected.  To avoid the file system being unavailable while fsck is running, fsck is now being modified to run &#8220;in the background&#8221;.   This is a major step forward, because the file system will be available, but a process like this running in the background is going to either run for a very long time (during which an increased risk of further problems will exist) or it is going to impact performance.  Also, fsck may need substantial amounts of memory while it is running. </p><p>Fsck is an interesting program.  It builds up an alternative collection of tables describing the (possibly damaged) file system on the disk and uses these tables to perform the repairs.  It is important that the tables fsck uses are generated in a way that makes it unlikely that the file system itself and fsck would induce the same corruption.   The repair functionality itself replaces the file system&#8217;s on-disk tables with consistent data from the tables.  Indeed, the file system research group at the University of Wisconsin has described the repair functionality as a simple SQL program (and along the way discovered embarrassing errors in the existing fsck programs).   </p><p>The Xyratex file system team went one step further and observed that the tables that are needed to perform the repair can be maintained by simply processing the changelog for the file system.  
The changelog is usually present anyway for roll forward recovery, and can be used to provide numerous data management services, such as replication and migration, which we will describe in another post.</p><p>The changelog can be exported as a file to a special repair node.  The repair node builds and maintains a secondary set of tables, like we discussed above.  This eliminates the scanning performed by fsck &#8211; the tables are always ready for use.  Clearly this scheme is also easily adapted to complex multi-server file systems where intra-node inconsistencies can occur when distributed recovery fails.</p><p>There are two modes of repair, and one is novel.   While the tables are being built, consistency checks can be performed on them.  If something is determined to be awry by the repair node, preventive repair can be initiated by modifying the file system data structures that contain the corrupt data, just like an online fsck would do, but without the overhead of scanning, and before anything has gone wrong.   If instead the file system encounters an inconsistency that is unexpected, it can request a reactive repair.  The repair node will provide repair information, without the wait.  It will act very similarly to fsck when fsck finally has built its tables and is ready to repair.</p><p> In summary, the repair node will be a much better fsck &#8211; no scanning and waiting are required, it will offer preventive repairs and the functionality is offloaded from a usually stressed out metadata server.</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2011/11/28/repairing-and-checking-file-systems/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Storage Performance: Block vs. 
File System</title><link>http://blog.xyratex.com/2011/11/11/storage-performance-block-vs-file-system/</link> <comments>http://blog.xyratex.com/2011/11/11/storage-performance-block-vs-file-system/#comments</comments> <pubDate>Fri, 11 Nov 2011 19:43:44 +0000</pubDate> <dc:creator>Torben Kling Petersen</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[File system performance]]></category> <category><![CDATA[HPC]]></category> <category><![CDATA[I/O performance]]></category> <category><![CDATA[ldiskfs]]></category> <category><![CDATA[Lustre]]></category> <category><![CDATA[parallel file system]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=186</guid> <description><![CDATA[File system performance is often a major component of overall HPC system performance, and is heavily dependent on the nature of the application(s) generating the load. To achieve optimal performance, the underlying file system configuration must be balanced to match the application characteristics. Hence the basis for Lustre. However, regardless of the style of file [...]]]></description> <content:encoded><![CDATA[<p>File system performance is often a major component of overall HPC system performance, and is heavily dependent on the nature of the application(s) generating the load. To achieve optimal performance, the underlying file system configuration must be balanced to match the application characteristics. Hence the basis for Lustre. However, regardless of the style of file system (such as UFS, NTFS, ZFS, or any of the parallel file systems used today), the underlying disk structure dictates the performance capabilities of these file systems.</p><p>As soon as you format a disk, you start applying limitations to that disk in the form of block sizes, partitions, the presence of journals, versioning options, etc. In fact there are more than 50 different disk file systems (I stopped counting at 50 so ….), all with different advantages and disadvantages. The underlying disk file system used by Lustre is called “ldiskfs” and is very similar to the common Linux file system “Ext4.”</p><p>In addition to the disk file system, basic data safety requires some form of protection for the physical disks, and the most commonly used solution today is RAID-6. RAID-6 uses N+2 disks (where N is typically 5 or greater), and actual write performance is further compromised by the need to calculate 2 parity blocks for every stripe of data being written. 
As luck would have it (and some people would argue that “luck” does not have anything to do with it), read operations in a RAID-6 set are not associated with a performance penalty.</p><p> Even with disks formatted using Ext4 and bundled together in a RAID-6 set, we’re still talking about block storage. If we were to export this RAID set using iSCSI or Fibre Channel (not at all uncommon in SAN environments) we would still need additional tools to read and write data to the devices. </p><p>Measuring performance at the device level can be accomplished using one of several benchmarking tools such as dd, sgpdd-survey, iometer and others, and is often limited by the Host Bus Adapters being used. These tests measure the bare metal I/O performance of the raw hardware while bypassing as much of the kernel as possible. They commonly destroy any file system data on the device and should never be used on a disk array in production. Disks and arrays are very sensitive to request size. To identify the optimal request size for a given disk or RAID set, benchmark the disk with different record sizes ranging from 4 KB up to 1 or 2 MB. That said, any performance peak found might be totally irrelevant, as the intended application(s) may use different I/O characteristics. Note that performance is limited by the slowest disk in an array.</p><p> Enter the file systems. Once the block devices are created, a file system (especially a parallel file system) is applied to add functionality to the underlying block storage. This functionality can be to create a network sharable file system such as NFS or an HPC system such as Lustre. 
In the latter case, the Lustre file system is designed to remove the performance limitations of the underlying block storage by striping the data over multiple such devices in a parallel fashion and by tuning the I/O blocks to more closely match the optimal block sizes of the array (most modern disk systems peak at around 4 MB).</p><p>So how does this affect storage benchmarking?  There is a difference between block storage benchmarks and file system or application benchmarks. Understanding this difference is critical as it is the file system performance results that will directly affect your HPC application.</p><p>You need to be careful as you assess different storage systems and the different claims of performance. Sometimes performance is stated as the aggregate block storage bandwidth, which yields artificially inflated and inaccurate performance numbers. In many cases the performance numbers are indeed based on the actual RAID sets but are measured locally on the array, not from the clients that would access the storage. The only way to get a sense of how well a given storage system performs is to run a standard file system and/or application level benchmark from the local host and also take measurements from external clients. Only when you have a firm grip on every level of the storage array (disk and array performance, file system and network all the way up to the clients) can you understand the capabilities of the total solution.</p><p> Measuring performance often requires multiple external clients running multiple iterations of a file system or application benchmark such as IOzone, IOR, Bonnie++, OST-Survey etc. 
that also requires careful tuning to get peak performance out of the storage array.</p><p>We will cover these benchmarks and the differences in the next post…</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2011/11/11/storage-performance-block-vs-file-system/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Sizing Lustre For Throughput and/or Capacity</title><link>http://blog.xyratex.com/2011/11/07/sizing-lustre-for-throughput-andor-capacity/</link> <comments>http://blog.xyratex.com/2011/11/07/sizing-lustre-for-throughput-andor-capacity/#comments</comments> <pubDate>Mon, 07 Nov 2011 18:20:44 +0000</pubDate> <dc:creator>John Fragalla</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[HPC performance]]></category> <category><![CDATA[Lustre Capacity]]></category> <category><![CDATA[Lustre Throughput]]></category> <category><![CDATA[Object Storage Server]]></category> <category><![CDATA[Object Storage Target]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=168</guid> <description><![CDATA[Applications will dictate the throughput and capacity of an HPC storage solution using Lustre.  Lustre is a flexible file system that allows for scaling capacity horizontally or vertically to meet the application I/O requirements.  Scaling capacity horizontally, or scaling-out, will increase storage throughput and capacity by incorporating additional Object Storage Servers (OSS – processor, memory, [...]]]></description> <content:encoded><![CDATA[<p>Applications will dictate the throughput and capacity of an HPC storage solution using Lustre.  Lustre is a flexible file system that allows for scaling capacity horizontally or vertically to meet the application I/O requirements.  Scaling capacity horizontally, or scaling-out, will increase storage throughput and capacity by incorporating additional Object Storage Servers (OSS – processor, memory, and I/O) attached to dedicated storage enclosures.  Scaling capacity vertically, or scaling-up, will increase storage capacity behind fewer Object Storage Servers.  Below is a diagram that illustrates the flexibility of the Lustre file system and different storage configurations.</p><p
style="text-align: center;"><a
href="http://blog.xyratex.com/wp-content/uploads/2011/11/File-System-Failover2.png"><img
class="size-full wp-image-170  aligncenter" title="File System Failover" src="http://blog.xyratex.com/wp-content/uploads/2011/11/File-System-Failover2.png" alt="" width="243" height="255" /></a></p><p>Within Lustre, OSSs are dedicated storage servers attached to multiple disks within an enclosure configured as multiple Object Storage Targets (OSTs).  An OST is a group of drives configured as a particular RAID set, typically RAID 6 or RAID 5.   When scaling horizontally, the storage capacity is scaled linearly by adding more OSSs.  The more OSSs included in the Lustre file system, the more throughput is available for the application to read and write data.  When scaling vertically, there will typically be fewer OSSs but more storage capacity behind those OSSs.  This method is designed for applications that do not need more throughput as storage capacity increases.</p><p>When architecting a Lustre storage system to match the capacity and throughput requirements for applications running on an HPC system, it is typical to exceed the throughput requirement to meet the capacity requirement.  Sizing the capacity horizontally will increase throughput, which increases overall application performance, by matching the front-end OSS performance with the backend disk performance.  Below is a diagram illustrating horizontal scaling increasing front-end performance at the same rate as backend disk performance.</p><p
style="text-align: center;"><a
href="http://blog.xyratex.com/wp-content/uploads/2011/11/Balanced-Performance2.png"><img
class="size-full wp-image-179 alignnone" title="Balanced Performance" src="http://blog.xyratex.com/wp-content/uploads/2011/11/Balanced-Performance2.png" alt="" width="286" height="216" /></a></p><p>When sizing capacity vertically, the backend disks will usually outperform the front-end OSSs, which reduces overall application performance.  Vertical scaling does not take advantage of the additional disk throughput, which limits the overall storage throughput and leaves valuable bandwidth unused due to an imbalance of OSS and disk throughput.  Below is a diagram illustrating vertical scaling and an imbalance between front-end throughput and back-end disk performance.</p><p
style="text-align: center;"><a
href="http://blog.xyratex.com/wp-content/uploads/2011/11/imbalanced-perform-and-scal-pic-from-fragalla2.png"><img
class="size-medium wp-image-177 alignnone" title="imbalanced perform and scal pic from fragalla" src="http://blog.xyratex.com/wp-content/uploads/2011/11/imbalanced-perform-and-scal-pic-from-fragalla2-300x215.png" alt="" width="300" height="215" /></a></p><p>In Part 5, we will discuss the difference between block and file system throughput, and what is more important for application performance.</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2011/11/07/sizing-lustre-for-throughput-andor-capacity/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Xyratex Releases Lustre Migrator Tool</title><link>http://blog.xyratex.com/2011/11/03/xyratex-releases-lustre-migrator-tool/</link> <comments>http://blog.xyratex.com/2011/11/03/xyratex-releases-lustre-migrator-tool/#comments</comments> <pubDate>Thu, 03 Nov 2011 16:56:02 +0000</pubDate> <dc:creator>Michael Connolly</dc:creator> <category><![CDATA[Storage Insights]]></category> <category><![CDATA[Lustre]]></category> <category><![CDATA[lustre tools]]></category> <category><![CDATA[migrator]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=156</guid> <description><![CDATA[In our continued commitment and support of the Lustre® community, we have announced and released, in mid-October, a Lustre migrator tool that is available for use, licensed under the GPL 2.0, to the entire Lustre community. The Xyratex migrator tool converts Lustre file system MDS on-disk data structures from version 1.8 format to version 2.x [...]]]></description> <content:encoded><![CDATA[<p>In our continued commitment to and support of the Lustre<sup>®</sup> community, we announced and released, in mid-October, a Lustre migrator tool that is available to the entire Lustre community under the GPL 2.0. The Xyratex migrator tool converts Lustre file system MDS on-disk data structures from version 1.8 format to version 2.x native format and has been favorably tested against the latest Lustre community release, version 2.1. Our experienced software developers recommend upgrading the MDS/MDT component from Lustre 1.8 to 2.x level software to improve file system performance and to enable full 2.x functionality. Until the development and release of this migrator tool, migrating between versions was not an easy task, but our internally developed tool accomplishes it well and patches cleanly against Lustre 2.x versions. The migrator tool uses the “upgrade” mount option to change the on-disk data format of the MDS/MDT component to native Lustre 2.x, and the “restore” mount option to enable file system level backups (tar, rsync, cp); specifically, it restores the Object Index for restored files.</p><p> Coinciding with the migrator tool release, we announced the launch of the “Xyratex Lustre Community” page (<a
href="http://www.xyratex.com/technology/lustre.aspx">http://www.xyratex.com/technology/lustre.aspx</a>) where the migrator tool itself is available along with detailed procedural information on its use. In addition to the migrator tool, this Xyratex site provides Lustre-related information regarding Xyratex efforts and contributions within the community.</p><p> We sincerely hope the community benefits from our contributed Lustre tool and we value the community’s feedback on its capabilities and use. Comments can be directed to <a
href="mailto:lustre-devel@lists.opensfs.org">lustre-devel@lists.opensfs.org</a> or<br
/> <a
href="mailto:lustreinfo@xyratex.com">lustreinfo@xyratex.com</a></p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2011/11/03/xyratex-releases-lustre-migrator-tool/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Reflections from the HPC User Forum – San Diego</title><link>http://blog.xyratex.com/2011/10/31/reflections-from-the-hpc-user-forum-%e2%80%93-san-diego/</link> <comments>http://blog.xyratex.com/2011/10/31/reflections-from-the-hpc-user-forum-%e2%80%93-san-diego/#comments</comments> <pubDate>Mon, 31 Oct 2011 15:14:04 +0000</pubDate> <dc:creator>Peter Bojanic</dc:creator> <category><![CDATA[Storage Insights]]></category> <guid
isPermaLink="false">http://blog.xyratex.com/?p=154</guid> <description><![CDATA[I&#8217;ve had the privilege of participating in IDC HPC User Forum panel discussions for a few years and I always find the dialog engaging. Last month at the forum meeting in San Diego I joined six other Lustre community leaders from a cross section of organizations to discuss &#8220;The Future of Lustre&#8221;.  This was my [...]]]></description> <content:encoded><![CDATA[<p>I&#8217;ve had the privilege of participating in IDC HPC User Forum panel discussions for a few years and I always find the dialog engaging. Last month at the forum meeting in San Diego I joined six other Lustre community leaders from a cross section of organizations to discuss &#8220;The Future of Lustre&#8221;.</p><p> This was my first IDC meeting representing Xyratex, having joined the company almost a year ago. I have a long history with Lustre but my perspective has shifted in the past year working as a Lustre solution vendor. The community collaboration aspects of Lustre and its resilience to endure the churn of its commercial sponsors are of paramount importance. Roadmap advancement is important but never at the cost of quality and resilience. </p><p> I&#8217;m greatly encouraged by the progress our Lustre community is making. In watching the <a
href="http://insidehpc.com/2011/09/11/video-panel-the-future-of-lustre/">video</a> (courtesy of InsideHPC), I&#8217;m reminded of the optimistic outlook all of my colleagues shared. The future of Lustre is bright, indeed.</p> ]]></content:encoded> <wfw:commentRss>http://blog.xyratex.com/2011/10/31/reflections-from-the-hpc-user-forum-%e2%80%93-san-diego/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> </channel> </rss>
