NLM CONVERSION TO BUILD ATOMIC PHYSICS CONTENT IN AN AGILE FASHION
JATS-CON, April 2, 2014
OSA The Optical Society & DCL Data Conversion Laboratory, Inc.

Scholarly publisher with 19 current and legacy journals, 300+ conference proceedings

OSA Governance: Build more flexible products and services!
How?
• Break 1917-2012 content into well-polished atomic pieces following an industry standard
• Develop infrastructure to manage and enrich content, to build new products and services in an agile fashion
• Budget allocated for five-year strategic plan

Some evidence of success
With content converted to NLM XML, we have developed:
• Enhanced article: Interactive HTML
• Derivative products: ImageBank
• Business Intelligence: New insights into author, topic, funding, and other trends

Citation data

Equation data

Legacy content (750,000 journal pages)
We expected this . . .

This . . . not so much
• JOURNAL AS COMIC BOOK
• SCHOOL YEARBOOK

1. Most confusing: Articles skipping pages, sometimes in two directions

2. Most shocking: legacy PDF not matching print
[Images: legacy print; legacy PDF for same article]
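The page-skipping pattern noted above (articles that jump forward and sometimes backward) can be surfaced early with a simple scan of each article's page sequence. The following is a minimal sketch, assuming the page numbers have already been extracted into a list; the function name and the sample page numbers are hypothetical and not taken from the OSA project.

```python
def find_page_jumps(pages):
    """Return (previous, current, direction) for every non-consecutive step.

    A repeated page number is reported as a "backward" step.
    """
    jumps = []
    for prev, cur in zip(pages, pages[1:]):
        if cur == prev + 1:
            continue  # normal continuation onto the next page
        direction = "forward" if cur > prev else "backward"
        jumps.append((prev, cur, direction))
    return jumps


# Hypothetical article: runs 412-413, continues on 430-431, then returns to 415
article_pages = [412, 413, 430, 431, 415]
for prev, cur, direction in find_page_jumps(article_pages):
    print(f"page {prev} -> {cur} ({direction} jump)")
```

A report like this only flags candidates; deciding how the jumped pages should be stitched together remains an editorial call.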

3. Most pervasive: nonscientific content tacked onto research articles
These are not the authors

Project specifications: two extremes

1. Hand the project over to the trusted vendor and be done with it
2. Spend up to a year doing heavy content analysis and spec creation

Data Conversion Laboratory
• We convert content from any format to any format.
• Expertise with JATS, and most industry standard DTDs and Schemas
• Established in 1981; a pioneer in the data conversion industry
• Over a billion pages converted
• Expertise in complex conversion projects: STM Publishing, eBooks, Technical documents, Educational Publishing, and Library Digitization
• Projects range from one book to entire libraries and legacy collections
• Infrastructure for large-scale projects, with automated tracking, quality assurance, and customer reporting for every item
• Industries include Publishing, Technical Societies, Aerospace, Government, Defense, Health Sciences, Libraries & Universities
• Publish DCLNews, a monthly newsletter devoted to XML and Electronic Publishing topics going to 7,000 subscribers

Thoughts on Managing a Large Legacy Conversion Effort
1) Phased Approach
2) Flexibility and Collaboration
3) Keep it Simple
4) Keep Monitoring Quality

1) Phased Approach
Why?
• Varied sources (PDF, XML, SGML)
• Content that changed over time
• Very large input corpus going back to 1917
• Allow for the quick, phased release of new OSA products
Strategy for OSA materials
• Focus on one source type at a time but keep the big picture in mind
• Convert newest material first
• Review and decide on conversion nuances as they came up

Source Material Challenges
XML
• OSA Proprietary DTD
• NLM v2.3 DTD
PDF
• PDF Normal
• PDF Image
SGML
• Multiple DTDs

2) Build Flexibility and Collaboration into the Conversion Process
• Develop an overall specification, with allowance for change as new scenarios are uncovered
• Software development sprints to incorporate changes
• Close collaboration with OSA to manage new situations affecting completed work and work in process

Tools Used to Retain Flexibility
• Client-Vendor collaboration for decision making
• Hub and Spoke processing
• Handling of conversion anomalies
• Quality assurance reviews
• Learning databanks

3) There's a Lot of Detail - Keep It Simple
• Fitting structures into the existing JATS tagging structure
• CALS to HTML table conversion
• MathML line break retention
• Cross-reference ranges
• Rendering limitations
• Unexpected content scenarios

Cross-Reference Ranges
• Bibliographic
• Figure
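To make the cross-reference range issue concrete: a printed citation such as "3-6" names only its endpoints, yet every reference in between needs to be reachable from the link. The sketch below shows one possible way to expand such a range into explicit target ids. The helper names are hypothetical, the "b" and "f" id prefixes are assumptions about the id scheme, and emitting a space-separated list assumes the target DTD declares the rid attribute as IDREFS; none of this is presented as the rule OSA and DCL actually adopted.

```python
import re

# Matches "3-6" or "3–6" (hyphen or en dash), with optional surrounding spaces
RANGE = re.compile(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*")


def expand_range(text, prefix):
    """Expand a printed range such as '3-6' into ['b3', 'b4', 'b5', 'b6']."""
    match = RANGE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a numeric range: {text!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError(f"descending range: {text!r}")
    return [f"{prefix}{n}" for n in range(first, last + 1)]


def range_to_rid(text, prefix):
    """Join the expanded ids into one space-separated attribute value."""
    return " ".join(expand_range(text, prefix))


print(range_to_rid("3-6", "b"))        # bibliographic range
print(range_to_rid("2\u20134", "f"))   # figure range written with an en dash
```

Raising an error on malformed or descending ranges, rather than guessing, keeps such cases visible for review alongside other conversion anomalies.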

Rendering Limitations
No CSS support for table character alignment
[Images: PDF rendering; HTML rendering]

Unexpected Content Scenarios
Missing text - Printed page problems

Unexpected Content Scenarios (cont.)
Jumping pages

Unexpected Content Scenarios (cont.)
Special characters with no corresponding Unicode
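Characters with no Unicode equivalent often survive conversion as private-use or unassigned code points, which makes them easy to miss in a visual pass. A minimal sketch of an automated scan follows; it uses Python's standard unicodedata module, and the function name and sample string are hypothetical. How such characters were actually resolved in this project is not stated in the slides.

```python
import unicodedata


def find_unmapped_chars(text):
    """Return (offset, code point) for private-use or unassigned characters.

    "Co" is the Unicode general category for private-use code points and
    "Cn" is the category reported for unassigned ones.
    """
    hits = []
    for offset, ch in enumerate(text):
        if unicodedata.category(ch) in ("Co", "Cn"):
            hits.append((offset, f"U+{ord(ch):04X}"))
    return hits


# Hypothetical extracted text containing a private-use glyph
sample = "beam width \ue123 = 2 mm"
for offset, code_point in find_unmapped_chars(sample):
    print(f"offset {offset}: {code_point} has no standard Unicode meaning")
```

Each hit would still need a human decision, for example substituting an image of the glyph or a described character.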

Unexpected Content Scenarios (cont.)
Non-standard Structure
____________________________________
Optical Activities in Industry
66 Summer Street, North Brookfield, Mass.
Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address

Unexpected Content Scenarios (cont.)
White space filler

4) Keep Checking Quality - Don't Get Too Far Ahead
• Visual review
• OSA Schematron
• Reporting stylesheets
• OCR and hyphenation spellchecker software
• QA software
• Learning databanks

Visual Review
• Correct entities are used
• Math displays correctly
• Table alignment is accurate
• Images correspond to the source

OSA Schematron
The Schematron includes over 300 checks

Warning:ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]

Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]

ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]

DCL QA Software
• Highlight any discrepancies between the specifications and the tagging
• Identify suspicious start of a paragraph
• Flag missing external files associated with the XML
• Find missing cross references to specified structures such as Tables and Figures

Hyphenation Spellchecker
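Two of the Schematron conditions quoted above, count(article-title) > 1 and a title matching 'Table' with no table-wrap present, can be approximated outside Schematron for quick spot checks. The sketch below uses Python's standard xml.etree.ElementTree; the function names, message wording, and sample document are hypothetical, and treating sec as the context element for the table check is an assumption. It is not the OSA Schematron itself.

```python
import xml.etree.ElementTree as ET


def check_article_titles(root):
    """Report journal citations holding more than one article-title."""
    messages = []
    for ref in root.iter("ref"):
        for citation in ref:
            if citation.get("publication-type") != "journal":
                continue
            if len(citation.findall("article-title")) > 1:
                messages.append(
                    f"ERROR: ref '{ref.get('id')}': journal citation "
                    "contains more than one article-title"
                )
    return messages


def check_table_sections(root):
    """Report sections titled with 'Table' that contain no table-wrap."""
    messages = []
    for sec in root.iter("sec"):
        title = sec.findtext("title", default="")
        if "Table" in title and sec.find(".//table-wrap") is None:
            messages.append(f"WARNING: no tables found but title reads '{title}'")
    return messages


# Hypothetical fragment that trips both checks
SAMPLE = """<article>
  <body>
    <sec><title>Figures and Tables</title><fig id="f1"/></sec>
  </body>
  <back>
    <ref-list>
      <ref id="b14">
        <mixed-citation publication-type="journal">
          <article-title>First title</article-title>
          <article-title>Second title</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>"""

root = ET.fromstring(SAMPLE)
for message in check_article_titles(root) + check_table_sections(root):
    print(message)
```

Keeping each rule in its own small function mirrors the one-assertion-per-message style of the reports shown above and makes it easy to add or retire rules as new scenarios appear.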

Reporting Stylesheets
Provides easier review of metadata components for a set of articles

OCR Tools
Modified versions of the fonts, designed to help distinguish between similar-looking characters (O vs 0, Z vs 2, 1 vs l), used within the proofreading phase

Learning Databanks
Ongoing updates made based on feedback and newly determined rules and structures
• Conversion software
• QA software
• Schematron
• Spellchecker and hyphenation software
• Editorial guidelines
• Image creation
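The look-alike pairs listed under OCR Tools (O vs 0, Z vs 2, 1 vs l) could also be pre-screened in software before proofreaders see the text. The following is a rough heuristic sketch, not a description of DCL's tooling, which the slides describe as modified fonts: it flags tokens that mix letters and digits when one side consists only of look-alike characters. All names are hypothetical.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")
LOOKALIKE_LETTERS = set("OZl")   # letters easily misread as 0, 2, 1
LOOKALIKE_DIGITS = set("021")    # digits easily misread as O, Z, l


def suspicious_tokens(text):
    """Return tokens mixing letters and digits where one side is all look-alikes."""
    hits = []
    for token in TOKEN.findall(text):
        letters = [c for c in token if c.isalpha()]
        digits = [c for c in token if c.isdigit()]
        if not letters or not digits:
            continue  # pure words and pure numbers are left alone
        if all(c in LOOKALIKE_LETTERS for c in letters) or all(
            c in LOOKALIKE_DIGITS for c in digits
        ):
            hits.append(token)
    return hits


print(suspicious_tokens("The 1O3 nm line was l0w in sample A7"))
```

The heuristic over-reports by design: a legitimate token such as H2O would be flagged too, so the output is a review list for proofreaders rather than an automatic correction.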

Conclusions
OSA has nearly completed a large backfile conversion project in close coordination with DCL. The project, which is based around NLM markup, has allowed OSA to enhance its publishing platform, build derivative products, and significantly improve its ability to gather business intelligence from a deep journal backfile.
We offer the following lessons learned:
• With large content projects, plan ahead but prepare to work in an agile fashion
• The content owner should stay engaged throughout the project to align real-time decisions with business aims
• Owner-vendor collaboration, when the right partners are involved, improves morale, attention to detail, and decision-making

Scott Dineen
Sr. Director Publishing Production & Technol.
The Optical Society
[email protected]

Devorah Ashlem
Senior Project Manager
Data Conversion Laboratory
[email protected]
