AWS Glue Now Supports Crawler History

2022-09-24 10:15:27 By : Mr. Grant Liu

AWS recently launched support for histories of AWS Glue Crawlers, which allows the interrogation of Crawler executions and associated schema changes for the last 12 months.

AWS Glue is a serverless data-integration service. It is a suite of AWS integrations built around two major components: its data-cataloging functionality, Glue Data Catalog (based on Apache Hive Metastore), and its extract-transform-load (ETL) pipeline capability, Glue ETL (based on Apache Spark). The Glue Data Catalog, whose changes the Crawler history feature displays, is the metadata store of the Glue ecosystem. The catalog houses table definitions, which describe the schema of data that lives outside of Glue, for example in Amazon Simple Storage Service (S3) or Amazon Relational Database Service (RDS). Glue ETL can then use the catalog as a reference to the sources and targets of its pipelines, as can other AWS analytics services such as Amazon Athena. Table definitions can be added manually or created using Crawlers.

Crawlers are jobs that create or update table definitions in a Glue Data Catalog when they complete. They can be run ad hoc or on a schedule, and they interrogate the target data source by classifying and grouping the scanned data. Crawlers use built-in classifiers to infer the data’s schema and format, and they can be extended with user-defined custom classifiers for more complex use cases.
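As a rough illustration of the ad hoc versus scheduled distinction, the sketch below assembles a request for the Glue `CreateCrawler` operation and either leaves the Crawler to its schedule or starts it immediately. It assumes a boto3-style Glue client is passed in; the helper names `build_crawler_request` and `create_and_start`, along with any crawler name, role ARN, database, and S3 path supplied to them, are hypothetical and not taken from the article.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue CreateCrawler operation.

    All values are caller-supplied placeholders. `schedule` is an optional
    cron expression such as "cron(0 12 * * ? *)"; when it is omitted the
    crawler only runs on demand.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule
    return request


def create_and_start(glue, request):
    """Create the crawler; start it right away only if it has no schedule.

    `glue` is assumed to behave like boto3.client("glue").
    """
    glue.create_crawler(**request)
    if "Schedule" not in request:
        glue.start_crawler(Name=request["Name"])
```

With boto3 installed and credentials configured, the client would typically come from `boto3.client("glue")`. Passing the client in as an argument keeps the helpers easy to exercise against a stub.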

For each Crawler run, the history feature shows contextual information such as the duration of the run, the associated computing costs, and the changes made to the metadata store.

Source: https://aws.amazon.com/es/blogs/big-data/set-up-and-monitor-aws-glue-crawlers-using-the-enhanced-aws-glue-ui-and-crawler-history/

Because AWS Glue is a collection of complementary tools, its individual components, rather than the offering as a whole, are usually what gets compared to other solutions. The Glue Data Catalog is often compared to the Apache Hive Metastore, while Glue ETL offers functionality that can also be found in AWS’s Elastic MapReduce (EMR) service. Yoni Augarten of lakeFS, in a comparison of Glue Catalog and Hive Metastore, recommended Hive for larger organizations heavily invested in the Hadoop ecosystem and Glue Catalog for smaller teams with more straightforward requirements.

The Crawler history feature can be used via the AWS console, programmatically via the ListCrawls Web API, or via any of the official AWS SDKs.
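A minimal sketch of the SDK route in Python follows, assuming a boto3-style Glue client whose `list_crawls` method wraps the ListCrawls API. The helper names `iter_crawls` and `summarize` are hypothetical. The response fields read here (`Crawls`, `NextToken`, `CrawlId`, `State`, `StartTime`, `EndTime`, `DPUHour`, `Summary`) reflect my understanding of the API's response shape and should be checked against the current documentation.

```python
import json


def iter_crawls(glue, crawler_name, page_size=50):
    """Yield every history entry for one crawler, following NextToken pages.

    `glue` is assumed to behave like boto3.client("glue").
    """
    token = None
    while True:
        kwargs = {"CrawlerName": crawler_name, "MaxResults": page_size}
        if token:
            kwargs["NextToken"] = token
        page = glue.list_crawls(**kwargs)
        yield from page.get("Crawls", [])
        token = page.get("NextToken")
        if not token:
            break


def summarize(crawl):
    """Reduce one history entry to run duration, DPU hours, and changes."""
    start, end = crawl.get("StartTime"), crawl.get("EndTime")
    duration = (end - start).total_seconds() if start and end else None
    raw = crawl.get("Summary")
    try:
        # Summary is assumed to be a JSON string describing catalog changes.
        changes = json.loads(raw) if raw else {}
    except ValueError:
        changes = {"raw": raw}  # keep unparsed text rather than lose it
    return {
        "crawl_id": crawl.get("CrawlId"),
        "state": crawl.get("State"),
        "duration_seconds": duration,
        "dpu_hours": crawl.get("DPUHour"),
        "changes": changes,
    }
```

A caller would typically write something like `for crawl in iter_crawls(boto3.client("glue"), "my-crawler"): print(summarize(crawl))`, where `my-crawler` is a placeholder name. The duration, DPU hours, and change summary correspond to the run time, computing cost, and metadata changes that the history view reports.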
