Structured Data, Unstructured Data, and Semi-structured Data

W_法 ⋅ 4 months ago ⋅ 364 views

Data Structure

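          Before classifying distributed storage systems, it helps to understand the structure of the data they store. Every distributed storage system is designed to hold data of one or more structural types. Depending on whether the data has structure, it falls into one of two categories: structured data and unstructured data. Structured data in turn contains a special subclass: semi-structured data. In the rest of this article, "structured data" means all structured data other than semi-structured data.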

Structured Data

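          Structured data emerged during the digitization wave that took off in the 20th century. Two of the more useful definitions of structured data are quoted here: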

          The data in those neat columns and rows is what’s referred to as structured data.

          https://datascience.berkeley.edu/structured-unstructured-data/

      Structured data is comprised of clearly defined data types whose pattern makes them easily searchable.  

          https://www.datamation.com/big-data/structured-vs-unstructured-data.html

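          Structured data has no strict, universally agreed definition, but that does not get in the way of understanding it.

         Structured data is generally stored in relational databases and can be represented as two-dimensional relational tables. The schema of structured data (its attributes, data types, and the relationships among the data) is kept separate from the content, and the schema must be defined in advance. (杨传辉, 《大规模分布式存储系统》, Large-Scale Distributed Storage Systems)

         In short, structured data is the row data stored in the two-dimensional tables of a database. Before any structured data can be stored, a uniform schema has to be fixed.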
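
The schema-first requirement can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table name `users` and its columns are hypothetical, chosen only for illustration; the point is that the schema is declared before any row exists, and a row that does not fit it is rejected.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")

# The schema (attributes and data types) is declared before any data exists.
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, age INTEGER NOT NULL)"
)

# The content is stored separately from the schema, one row per record.
conn.executemany(
    "INSERT INTO users (id, name, age) VALUES (?, ?, ?)",
    [(1, "Alice", 30), (2, "Bob", 25)],
)

# Every row follows the same schema, so filtering by column is direct.
older_than_26 = conn.execute(
    "SELECT name FROM users WHERE age > 26"
).fetchall()

# A row with a field the schema does not define cannot be stored.
try:
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (3, 'Carol', 'c@example.com')"
    )
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True
```

Here the query works without inspecting individual rows because the column `age` is guaranteed to exist for each of them, which is what makes structured data easy to search.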

Unstructured Data

         Unstructured data has many definitions. The following are collected from relatively authoritative sources online:

       Unstructured data (or unstructured information) is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

         https://en.wikipedia.org/wiki/Unstructured_data Wiki

       Gartner defines unstructured data as content that does not conform to a specific, pre-defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly into database tables. Within the enterprise unstructured content takes many forms, chief amongst which are business documents (reports, presentations, spreadsheets and the like), email and web content.

    https://blogs.gartner.com/darin-stewart/2013/05/01/big-content-the-unstructured-side-of-big-data/

        Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schema. It may be textual or non-textual, and human- or machine-generated.

        https://www.datamation.com/big-data/structured-vs-unstructured-data.html

      Unstructured data is heterogeneous and variable in nature and comes in many formats, including text, document, image, video, and more. Unstructured data is growing faster than structured data. According to a 2011 IDC study, it will account for 90 percent of all data created in the next decade. As a new, relatively untapped source of insight, unstructured data analytics can reveal important interrelationships that were previously difficult or impossible to determine.

    https://www.intel.com/content/www/us/en/big-data/unstructured-data-analytics-paper.html Intel

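         Unstructured data is the opposite of structured data: it has no schema-level structure of its own and needs no predefined data model. Common examples include office documents in all formats, plain text, pictures, images, audio, and video. A more detailed list of examples follows: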

Typical human-generated unstructured data includes:

    Text files: Word processing, spreadsheets, presentations, email, logs.

    Email: Email has some internal structure thanks to its metadata, and we sometimes refer to it as semi-structured. However, its message field is unstructured and traditional analytics tools cannot parse it.

    Social Media: Data from Facebook, Twitter, LinkedIn.

    Website: YouTube, Instagram, photo sharing sites.

    Mobile data: Text messages, locations.

    Communications: Chat, IM, phone recordings, collaboration software.

    Media: MP3, digital photos, audio and video files.

    Business applications: MS Office documents, productivity applications.

Typical machine-generated unstructured data includes:

    Satellite imagery: Weather data, land forms, military movements.

    Scientific data: Oil and gas exploration, space exploration, imagery, atmospheric data.

    Digital surveillance: Surveillance photos and video.

    Sensor data: Traffic, weather, oceanographic sensors.

        https://www.datamation.com/big-data/structured-vs-unstructured-data.html
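
The email item in the list above can be made concrete with a small sketch based on Python's standard `email` package. The message text is invented for illustration. The headers are tagged fields that a parser can read by name, while the body is free text that the parser returns as one opaque string.

```python
from email import message_from_string

# An invented message, used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Lunch\n"
    "\n"
    "Hi Bob, are you free around noon on Friday? Maybe the usual place.\n"
)

msg = message_from_string(raw)

# Metadata: named header fields can be looked up directly.
sender = msg["From"]
subject = msg["Subject"]

# Message field: the parser only hands back the text as a single string.
# Extracting the day or the time from it would need separate text analysis.
body = msg.get_payload()
```

Nothing in the parsed result says that the body mentions a time and a day; that information exists only in the wording, which is why the body counts as unstructured.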

Semi-structured Data

      Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as self-describing structure.

        https://en.wikipedia.org/wiki/Semi-structured_data wiki

      Such information is typically “semistructured” in that there is some structure in the documents but not exactly a formal structure such as that imposed by a database schema or an XML DTD.

    http://infolab.stanford.edu/~maluf/papers/ideas05.pdf Semi-structured Data Management in the Enterprise: A Nimble, HighThroughput, and Scalable Approach

       https://www.w3schools.com/xml/xml_dtd_intro.asp  XML DTD intro

      Semi structured data does not have the same level of organization and predictability of structured data. The data does not reside in fixed fields or records, but does contain elements that can separate the data into various hierarchies.

    https://community.tealiumiq.com/t5/Universal-Data-Hub/Structured-Data-vs-Semi-Structured-Data/ta-p/15617#toc-hId--1333191080

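       Semi-structured data sits between unstructured data and structured data. It is generally self-describing. Its biggest difference from structured data is that the schema and the content are mixed together with no clear separation, and the schema does not need to be defined in advance. Common semi-structured formats include HTML, XML, and JSON. (杨传辉, 《大规模分布式存储系统》, Large-Scale Distributed Storage Systems)

        In short, semi-structured data is a special kind of structured data. Unlike structured data, it needs no data model defined in advance; unlike unstructured data, it has a certain amount of structure and is self-describing.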
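
The self-describing property can be sketched with JSON and Python's standard `json` module. The records below are invented for illustration. Each record carries its own field names, so two records in the same collection may have different fields and nesting without any schema being declared first.

```python
import json

# Invented records, used only for illustration.
raw = """
[
  {"name": "Alice", "age": 30},
  {"name": "Bob", "contact": {"email": "bob@example.com"}, "tags": ["admin"]}
]
"""

# No schema is declared; the keys inside the data describe its structure.
records = json.loads(raw)

# The two records expose different sets of fields.
field_sets = [sorted(record.keys()) for record in records]

# Readers must tolerate missing fields, since nothing guarantees them.
emails = [record.get("contact", {}).get("email") for record in records]
```

The trade-off is visible in the last line: because no schema guarantees that `contact` exists, the reading code has to handle its absence itself.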

