• Drive Failure Modeling

  • This work is part of a TUBITAK 2232 project aimed at accurate device failure modeling using higher-order statistical models. The study includes hidden Markov models, data fitting, estimation of inter-arrival rates, and count models. It is part of a larger application, a summary of which is pictured below. The service layer shall provide the outcome of our modeling study along with various other data, such as Amazon prices, product quality based on reviews, factual data from the database, etc.
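As a rough illustration of the inter-arrival rate estimation mentioned above, here is a minimal sketch that fits an exponential model to the gaps between failures via maximum likelihood. This is a simplifying assumption for illustration only (the project itself targets richer models such as HMMs and count models), and the sample numbers are hypothetical:

```python
# Sketch: maximum-likelihood estimation of a failure inter-arrival rate,
# assuming exponentially distributed gaps between failures. This is a
# simplified stand-in for the project's richer statistical models, and
# the sample data below is made up for illustration.

def estimate_rate(inter_arrival_days):
    """MLE of the rate lambda (failures/day) for exponential gaps:
    lambda_hat = n / sum(x_i)."""
    n = len(inter_arrival_days)
    total = sum(inter_arrival_days)
    return n / total

gaps = [12.0, 7.5, 30.0, 18.5, 22.0]  # hypothetical days between failures
lam = estimate_rate(gaps)
mean_gap = 1.0 / lam  # expected days between failures under the model
print(f"rate = {lam:.4f} failures/day, mean gap = {mean_gap:.1f} days")
```

Under the exponential assumption, the reciprocal of the estimated rate directly gives the mean time between failures, which is the bridge to fleet-level metrics such as MTTF.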

As mentioned, the project has different components:

1. A large database of HDD and SSD drives with SMART and health information electronically stored. It would be hard to record all that data by myself, so I would like to thank Backblaze, who provide such data for analysis, based on around 50K different HDDs in their data center. One component of our software takes this data and generates an appropriate database.

2. Another component of this study is to develop programs that perform the modeling and enlarge the database with new results for each drive, indicating more accurate health information such as MTTF, MTTDL, predicted AFR, etc.

3. Finally, the last component consults other web pages such as Amazon to collect information about drives and enrich the contents of the database, which we can later share with users, based on their filtering options, through a thin web interface.
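To make the health metrics in item 2 concrete, here is a back-of-the-envelope sketch of how AFR and MTTF could be derived from Backblaze-style daily drive logs, assuming a constant failure rate. The fleet totals below are hypothetical, not actual Backblaze figures:

```python
# Sketch: deriving AFR and MTTF from aggregate drive-day counts, as one
# might with Backblaze-style daily SMART logs. Assumes a constant failure
# rate; the fleet totals below are hypothetical.

def afr(failures, drive_days):
    """Annualized failure rate: failures per drive-year of exposure."""
    drive_years = drive_days / 365.0
    return failures / drive_years

def mttf_hours(failures, drive_days):
    """Mean time to failure in hours, assuming a constant failure rate."""
    return (drive_days * 24.0) / failures

failures, drive_days = 85, 1_500_000  # hypothetical fleet totals
print(f"AFR  = {afr(failures, drive_days) * 100:.2f}% per year")
print(f"MTTF = {mttf_hours(failures, drive_days):,.0f} hours")
```

More accurate estimates are exactly what the higher-order models in this project are meant to supply, since real drives do not fail at a constant rate over their lifetime.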


Founsure 1.0


Founsure 1.0 is the very first version of a completely new erasure coding library based on simple XOR operations and fountain coding, exploiting Intel's SIMD instructions and multi-core architectures. This library will have performance comparable to the Jerasure library by Dr. Plank, yet shall provide excellent repair and rebuild performance, which seems to be the mainstream focus of the erasure coding community nowadays. You can find more information about this project here.
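Founsure's actual coding scheme is not described here, but the XOR principle it builds on can be shown with a generic single-parity example. This is an illustration of the idea only, not the library's algorithm or API:

```python
# Sketch: the XOR principle underlying XOR-based erasure codes.
# A generic single-parity example for illustration; this is NOT
# Founsure's actual coding scheme or API.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks):
    """Return a parity block: the XOR of all data blocks."""
    parity = data_blocks[0]
    for blk in data_blocks[1:]:
        parity = xor_blocks(parity, blk)
    return parity

def repair(surviving_blocks, parity):
    """Rebuild one lost data block from the survivors plus the parity:
    XOR cancels every surviving block out of the parity."""
    lost = parity
    for blk in surviving_blocks:
        lost = xor_blocks(lost, blk)
    return lost

blocks = [b"ABCD", b"EFGH", b"IJKL"]
p = encode(blocks)
# Lose blocks[1]; repair it from the other blocks and the parity.
recovered = repair([blocks[0], blocks[2]], p)
assert recovered == b"EFGH"
```

Because XOR is its own inverse, both encoding and repair reduce to streams of the same cheap operation, which is what makes SIMD and multi-core acceleration so effective for this class of codes.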

  • Hadoop Project


  • Sorry to say, but this is yet another version of Hadoop, an open-source framework for processing and querying vast amounts of data on large clusters of commodity hardware. A comprehensive introduction can be found here.
  • This project sparked my energy and interest while I was involved with Big Data applications at Quantum. The original distribution of Hadoop has the following issues:
    • ■ Whenever multiple machines run in cooperation with one another in parallel, ideally in a synchronized fashion, failures become the norm rather than the exception. These failures include, but are not limited to, switch or router breakdowns, and slave nodes experiencing overheating, crashes, drive failures, and/or memory leaks. There is no built-in protection mechanism within the original design of Hadoop except replication of the same object in different clusters, which can be very costly.

    • ■ No security model is defined: there is no safeguard against maliciously inserted or improperly unsynchronized data.

    • ■ Each slave compute node has a limited amount of resources: processing capability, drive capacity, memory, and network bandwidth. All these resources should be used effectively by the overlaid large-scale distributed system. For example, replication of data at various compute nodes for data resiliency is an inefficient way of providing data protection.

    • ■ Synchronization of compute nodes remains another important challenge that a distributed system design aims to solve. Asynchronous operation leaves some of the nodes unavailable, in which case the compute operations must still continue, ideally with only a small penalty proportional to the loss of computing power.

    • Under these circumstances, the rest of the distributed system should be able to recover from the component failure or transient error condition and continue to make progress on processing-intensive applications. In addition, that progress must be responsive, fast, and secure. Providing all these features is a huge challenge from an engineering perspective.
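To quantify why replication-only protection is costly, here is a back-of-the-envelope comparison of raw storage overhead for 3-way replication versus a generic (k, m) erasure code. The parameters are illustrative and not drawn from any specific deployment:

```python
# Sketch: storage overhead of n-way replication versus a (k, m) erasure
# code, quantifying why replication-only protection is costly. The
# parameters below are illustrative only.

def replication_overhead(copies):
    """Raw bytes stored per logical byte with n-way replication."""
    return float(copies)

def erasure_overhead(k, m):
    """Raw bytes stored per logical byte with k data + m parity blocks."""
    return (k + m) / k

rep = replication_overhead(3)   # 3.0x raw storage, tolerates 2 lost copies
ec = erasure_overhead(10, 4)    # 1.4x raw storage, tolerates 4 lost blocks
print(f"3-way replication:    {rep:.1f}x raw storage")
print(f"(10, 4) erasure code: {ec:.1f}x raw storage")
```

The erasure-coded layout tolerates more simultaneous losses at less than half the raw storage, which is the core motivation for replacing replication with coding in the data-protection component below.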

    Objective

    Main features of this project include:

    • Data protection - Purely XOR-based, extremely fast, and efficiently repairable erasure coding (I do not know how much I can disclose on that... but it will definitely possess better features than Microsoft's LRC (and Facebook's Xorbas), and probably R10/Raptor-type codes). The original designs of erasure codes do not, in general, respect the peculiar requirements of data storage businesses. Devising practical coding strategies that address data storage requirements is a huge challenge, and we intend to provide one tangible solution in this project. We also intend to bring the concept of data-sensitive durability into the picture and protect data based on the user's durability goals in a hybrid cloud environment. This part of the project involves modifications in the HDFS layer.
    • Data integrity/security - A sensitive-data-aware MapReduce (MR) framework for hybrid clouds. It steers data and computation through public and private clouds in such a way that no knowledge about sensitive data is leaked to the public machines. The novel feature of our secure MR solution is its exploitation of public compute resources even when processing sensitive data. This leads to remarkable execution-time improvements compared to other secure MR frameworks.
    • Node synchronization - Minimize the number of clusters unavailable for data processing through intelligent sync algorithms. (More to follow later.)
    • Power efficiency of the overall HPC system (large clusters) through HDFS, MapReduce functionality, and cluster tagging.

    We will post updates as the project proceeds... please stay tuned.
