buarki

Site Reliability Engineer, Software Engineer, coffee addicted, traveler

Searching Castles Using Go, MongoDB, Github Actions And Web Scraping

June 8, 2024

It's Been A While!

It's been a while since the last one! TL;DR: work, life, study... you know :)

A Project Of Open Data Using Go

In March 2024, I created a small project to experiment with using Server-Sent Events (SSE) in a Go web server to continuously send data to a frontend client. It wasn't anything particularly fancy, but it was still pretty cool :)

The project involved a small server written in Go that served a minimalist frontend client created with raw HTML, vanilla JavaScript, and Tailwind CSS. Additionally, it provided an endpoint where the client could open an SSE connection. The basic goal was for the frontend to have a button that, once pressed, would trigger a server-side search to collect data about castles. As the castles were found, they would be sent from the server to the frontend in real-time. I focused on castles from the United Kingdom and Portugal, and the project worked nicely as you can see below:

App finding castles

The code of this minimalist project can be found here, and you can follow the README instructions to run it on your local machine.
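
For context, the SSE part of that little server boils down to a handler that streams events to the browser as they are produced. Below is a simplified sketch of that pattern in Go; it is not the original project's code, and the endpoint path and channel wiring are assumptions for illustration:

// A simplified SSE handler sketch, not the original project's code: it
// streams a "data:" event to the browser for each castle name received
// on a channel. The endpoint path and wiring are assumptions.
package main

import (
	"fmt"
	"net/http"
)

func sseHandler(castlesFound <-chan string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Headers required for an SSE stream.
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		w.Header().Set("Connection", "keep-alive")

		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}

		for {
			select {
			case <-r.Context().Done():
				// Client disconnected.
				return
			case name, ok := <-castlesFound:
				if !ok {
					return
				}
				// Each SSE message is a "data:" line followed by a blank line.
				fmt.Fprintf(w, "data: %s\n\n", name)
				flusher.Flush()
			}
		}
	}
}

func main() {
	// In the real app a button press triggers the search that feeds this channel.
	castlesFound := make(chan string)
	http.HandleFunc("/castles/sse", sseHandler(castlesFound))
	http.ListenAndServe(":8080", nil)
}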

A few days ago, I revisited this project and decided to expand it to include more countries. However, after several hours of searching, I couldn't find an official consolidated dataset of castles in Europe. I did find a few datasets focused on specific countries, but none that were comprehensive. Therefore, for the sake of having fun with Go and because I have a passion for history, I started the project Find Castles. The goal of this project is to create a comprehensive dataset of castles by collecting data from available sources, cleaning it, preparing it, and making it available via an API.

Why Does Go Really Shine For This Project?

Goroutines and channels! The biggest part of this project's code will be navigating through websites, collecting and processing data and, in the end, saving it to a database. By using Go we leverage the ease the language offers to implement these complex concurrent operations while keeping the maximum possible amount of hair :)

How Does It Work So Far?

So far I have implemented data collectors for only 3 countries: Ireland, Portugal and the United Kingdom. The reason is that finding a good reference for these countries did not take much effort.

The current implementation basically has two main stages: inspecting the website for the links containing castle data, and the data extraction per se. This process is the same for all countries, so an interface was introduced to establish a stable API for current and future enrichers:

type Enricher interface {
	// CollectCastlesToEnrich scans the source website for links pointing to castle pages.
	CollectCastlesToEnrich(ctx context.Context) ([]castle.Model, error)

	// EnrichCastle visits a castle's page and extracts its data.
	EnrichCastle(ctx context.Context, c castle.Model) (castle.Model, error)
}

If you want to see the implementation of at least one of them, here you can find the enricher for Ireland.
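
To give a concrete feel for what an enricher looks like, here is a minimal, hypothetical sketch of one; the type name, struct fields, URLs and regex patterns below are illustrative assumptions only, while the real Ireland enricher parses its actual source pages:

// Package enrichersketch is a hypothetical, simplified enricher used only
// for illustration; the project's real castle.Model, source URLs and
// parsing logic differ.
package enrichersketch

import (
	"context"
	"io"
	"net/http"
	"regexp"
)

// Model is a stripped-down stand-in for the project's castle.Model.
type Model struct {
	Name    string
	Country string
	Link    string
	City    string
}

type sketchEnricher struct {
	httpClient *http.Client
	listingURL string
}

// Made-up patterns for a fictional source page.
var (
	linkPattern = regexp.MustCompile(`<a href="(/castles/[^"]+)">([^<]+)</a>`)
	cityPattern = regexp.MustCompile(`<span class="city">([^<]+)</span>`)
)

func (e *sketchEnricher) fetch(ctx context.Context, url string) (string, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	resp, err := e.httpClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// CollectCastlesToEnrich scans the listing page for links to castle pages.
func (e *sketchEnricher) CollectCastlesToEnrich(ctx context.Context) ([]Model, error) {
	page, err := e.fetch(ctx, e.listingURL)
	if err != nil {
		return nil, err
	}
	var castles []Model
	for _, m := range linkPattern.FindAllStringSubmatch(page, -1) {
		castles = append(castles, Model{Name: m[2], Link: m[1], Country: "ie"})
	}
	return castles, nil
}

// EnrichCastle visits the castle's own page and extracts extra fields.
func (e *sketchEnricher) EnrichCastle(ctx context.Context, c Model) (Model, error) {
	page, err := e.fetch(ctx, c.Link)
	if err != nil {
		return c, err
	}
	if m := cityPattern.FindStringSubmatch(page); m != nil {
		c.City = m[1]
	}
	return c, nil
}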

Once we have enrichers able to scrape and extract data from proper sources, we can actually collect data using the executor package. This package manages the execution of enrichers by leveraging goroutines and channels, distributing the workload among the available CPUs.

The executor's current definition and constructor function can be seen below:

type EnchimentExecutor struct {
	enrichers map[castle.Country]enricher.Enricher
	cpus      int
}
 
func New(
	cpusToUse int,
	httpClient *http.Client,
	enrichers map[castle.Country]enricher.Enricher) *EnchimentExecutor {
	cpus := cpusToUse
	availableCPUs := runtime.NumCPU()
	if cpusToUse > availableCPUs {
		cpus = availableCPUs
	}
	return &EnchimentExecutor{
		cpus:      cpus,
		enrichers: enrichers,
	}
}
 
 

The execution process is basically a data pipeline in which the first stage looks for castles to be enriched, the next stage extracts data from the given sources, and the last one persists it in the DB.

The first stage spawns goroutines to find the castles, and as those castles are found they are pushed into a channel. We then merge those channels into a single one to be consumed by the next stage:

func (ex *EnchimentExecutor) collectCastles(ctx context.Context) (<-chan castle.Model, <-chan error) {
	var collectingChan []<-chan castle.Model
	var errChan []<-chan error
	for _, enricher := range ex.enrichers {
		castlesChan, castlesErrChan := ex.toChanel(ctx, enricher)
		collectingChan = append(collectingChan, castlesChan)
		errChan = append(errChan, castlesErrChan)
	}
	return fanin.Merge(ctx, collectingChan...), fanin.Merge(ctx, errChan...)
}
 
func (ex *EnchimentExecutor) toChanel(ctx context.Context, e enricher.Enricher) (<-chan castle.Model, <-chan error) {
	castlesToEnrich := make(chan castle.Model)
	errChan := make(chan error)
	go func() {
		defer close(castlesToEnrich)
		defer close(errChan)
 
		englandCastles, err := e.CollectCastlesToEnrich(ctx)
		if err != nil {
			errChan <- err
		}
		for _, c := range englandCastles {
			castlesToEnrich <- c
		}
	}()
	return castlesToEnrich, errChan
}
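
The fanin.Merge helper used above is a classic fan-in: it multiplexes several channels into one and closes the output once all inputs are drained. A minimal sketch of such a helper, assuming a generic signature (the project's actual fanin package may differ), could look like this:

// A minimal, generic fan-in sketch; the signature is an assumption and the
// project's actual fanin package may differ.
package fanin

import (
	"context"
	"sync"
)

// Merge multiplexes several input channels into a single output channel and
// closes the output once every input has been drained.
func Merge[T any](ctx context.Context, channels ...<-chan T) <-chan T {
	out := make(chan T)
	var wg sync.WaitGroup
	wg.Add(len(channels))

	for _, ch := range channels {
		go func(ch <-chan T) {
			defer wg.Done()
			for v := range ch {
				select {
				case out <- v:
				case <-ctx.Done():
					return
				}
			}
		}(ch)
	}

	go func() {
		wg.Wait()
		close(out)
	}()

	return out
}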
 
 

The second stage spawns a group of goroutines that listen to the output channel of the previous stage, and as castles are received, data is extracted by scraping their HTML pages. As the data extraction finishes, the enriched castles are pushed into another channel.

 
func (ex *EnchimentExecutor) extractData(ctx context.Context, castlesToEnrich <-chan castle.Model) (chan castle.Model, chan error) {
	enrichedCastles := make(chan castle.Model)
	errChan := make(chan error)
 
	go func() {
		defer close(enrichedCastles)
		defer close(errChan)
 
		for {
			select {
			case <-ctx.Done():
				return
			case castleToEnrich, ok := <-castlesToEnrich:
				if ok {
					enricher := ex.enrichers[castleToEnrich.Country]
					enrichedCastle, err := enricher.EnrichCastle(ctx, castleToEnrich)
					if err != nil {
						errChan <- err
					} else {
						enrichedCastles <- enrichedCastle
					}
				} else {
					return
				}
			}
		}
	}()
 
	return enrichedCastles, errChan
}
 
 

And the executor's main function, the one that ties it all together, is the one below:

func (ex *EnchimentExecutor) Enrich(ctx context.Context) (<-chan castle.Model, <-chan error) {
	castlesToEnrich, errChan := ex.collectCastles(ctx)
	enrichedCastlesBuf := []<-chan castle.Model{}
	castlesEnrichmentErr := []<-chan error{errChan}
	for i := 0; i < ex.cpus; i++ {
		receivedEnrichedCastlesChan, enrichErrs := ex.extractData(ctx, castlesToEnrich)
		enrichedCastlesBuf = append(enrichedCastlesBuf, receivedEnrichedCastlesChan)
		castlesEnrichmentErr = append(castlesEnrichmentErr, enrichErrs)
	}
 
	enrichedCastles := fanin.Merge(ctx, enrichedCastlesBuf...)
	enrichmentErrs := fanin.Merge(ctx, castlesEnrichmentErr...)
 
	return enrichedCastles, enrichmentErrs
}
 
 

The full current implementation of the executor can be found here.

The last stage just consumes the channel of enriched castles and saves them in bulk into MongoDB:

castlesChan, errChan := castlesEnricher.Enrich(ctx)
 
var buffer []castle.Model
 
for {
  select {
  case castle, ok := <-castlesChan:
    if !ok {
      if len(buffer) > 0 {
        if err := db.SaveCastles(ctx, collection, buffer); err != nil {
          log.Fatal(err)
        }
      }
      return
    }
    buffer = append(buffer, castle)
    if len(buffer) >= bufferSize {
      if err := db.SaveCastles(ctx, collection, buffer); err != nil {
        log.Fatal(err)
      }
      buffer = buffer[:0]
    }
  case err := <-errChan:
    if err != nil {
      log.Printf("error enriching castles: %v", err)
    }
  }
}
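
The db.SaveCastles call above is essentially a bulk insert. A hypothetical sketch of such a helper, assuming the official MongoDB Go driver (go.mongodb.org/mongo-driver) and a simplified castle struct, could look like this:

// A hypothetical sketch of a bulk-save helper using the official MongoDB Go
// driver (go.mongodb.org/mongo-driver); the Castle struct below is a
// simplified stand-in for the project's castle.Model.
package db

import (
	"context"

	"go.mongodb.org/mongo-driver/mongo"
)

// Castle is a stripped-down stand-in for the project's castle.Model.
type Castle struct {
	Name    string `bson:"name"`
	Country string `bson:"country"`
	Link    string `bson:"link"`
}

// SaveCastles persists the buffered castles with a single InsertMany call,
// which is what "saving in bulk" boils down to here.
func SaveCastles(ctx context.Context, collection *mongo.Collection, castles []Castle) error {
	docs := make([]interface{}, 0, len(castles))
	for _, c := range castles {
		docs = append(docs, c)
	}
	_, err := collection.InsertMany(ctx, docs)
	return err
}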
 
 

You can find the current version of the main.go here. This process runs periodically using a scheduled job created using Github Actions.

Next Steps

This project has a considerable roadmap ahead; below you can find the next steps listed.

1. Implement recursive crawling: to add more enrichers it must be possible to crawl a website recursively, because some sources list a huge number of castles spread across paginated pages.

2. Support multiple enrichment website sources for the same country: from what I could see, more than one good source can exist for a given country, so the executor must also support that.

3. Develop an official website: in the meantime, an official website for this project must be created to make the collected data available and, for sure, to show the progress. Such a site is in progress and you can already visit it here. Due to my lack of design skills the site is ugly as hell, but stay tuned and we'll get over it :)

4. Integrate machine learning to fill data gaps: something that will help a lot, especially in complementing data that is hard to find via the regular enrichers, is machine learning: by prompting these models with requests for hard-to-find data, we can efficiently fill in data gaps and enrich the dataset.

Contributions Are Welcome!

This project is open source and all collaborations are more than welcome! Whether you're interested in backend development, frontend design, or any other aspect of the project, your input is valuable.

If you find anything you want to contribute - especially on the frontend :) - just open an issue on the repository and ask for code review.