Collection Policy

PERMALINK

If you want to cite the Web Archive collection policy, please use the Archival Resource Key (ARK):

https://persist.lu/ark:70795/g7cr0x6xqg

Vision

These are the ambitions the Luxembourg Web Archive is striving for. Given the volatile, fast-paced nature of the Internet, the complexity and extent of the task and the limitations in terms of budget and staff, it is difficult to fulfill these objectives to completion.

Completeness

All content from the Luxembourg web is captured in the LWA so that no content from the live web is lost. The archived versions of websites are true to the original in every aspect of the user experience.

Preservation

The Luxembourg web is being preserved for the long term, for future generations.

Access

Access to the LWA is as open as legally possible, to as many audiences as possible.

Research use

The LWA is an essential source of information, actively used in national and international research.

Awareness

The legal deposit for websites and awareness about the necessity for web archiving are widely known by website owners, users and content producers.

Strategy

The Strategy addresses the different aspects of the Luxembourg Web Archive that need to be defined in order to fulfill the objectives described in the Vision.

Luxembourgish legislation

The legal deposit law defines the mission and mandate of the National Library of Luxembourg. However, the definitions of what should be archived are quite broad. Therefore, the BnL has to prioritise its resources in order to achieve optimal coverage of the Luxembourg web.

Resources

Both staff and budget are important factors in how well the BnL is able to accomplish its goals and what level of service can be offered to web archive users.

Technical developments

Web archiving is still a relatively young field, with many technical barriers. Developers and web archivists are working hard on keeping up with the evolution of the Internet and shifting challenges. New developments in crawling and archiving technologies can completely change the face of web preservation.

Demands by users and contributors

Today’s website owners are the web archive users of tomorrow, or perhaps of 10 or 20 years from now. Keeping in touch with publishers and content producers leads to a better understanding of the Luxembourg webspace and of what needs to be preserved right now.

Local and international collaborations

Legal deposit and digital preservation are not domains exclusive to the National Library. There are many areas where we need to work together with other archiving institutions, researchers and networks of web archiving practitioners, in order to meet common challenges and responsibilities.

what we are archiving:

The Luxembourg Web

There are different areas of interest defined by the Luxembourg legal deposit law, which concentrates on websites published on the territory of the Grand-Duchy. However, we also want to archive websites and content about Luxembourg and Luxembourgish citizens, from all corners of the Internet. The hardest part is finding the right addresses outside of the .lu top-level domain.

How do we decide to include an international website in the web archive?

Every website that is not registered in a .lu domain is checked manually before being added to our seed lists. There are different criteria to decide whether a website is relevant or not:
– Luxembourgish language
– Luxembourgish town in the website name
– main address/headquarters located in Luxembourg
– main contact phone number is a Luxembourgish phone number
– not more than 5 locations worldwide
– “Made in Luxembourg” logo on the homepage
– Common background images, such as the Philharmonie, Golden Lady, Luxembourg old town etc.
– “Impressum” or “legal mentions” lists the company/organisation as registered in Luxembourg
– The “About us” section mentions the company’s foundation or location in Luxembourg
– The “Team” section mentions that the founders or main team members are from Luxembourg
– Among the “projects”, “clients” or “partners” are many Luxembourgish companies or organisations

why the BnL?

Legal Deposit

As is the case for all national web archives, the legal deposit mandate does not translate to an obligation of completeness, due to the extent of these areas of interest.
The legal deposit for paper obliges publishers to hand in their publications to the National Library. In the case of websites, the obligation for publishers is fulfilled if the BnL is able to access and harvest their websites.

Article 9 et Article 10 de la Loi modifiée du 25 juin 2004 portant réorganisation des instituts culturels de l’Etat

Legilux.lu
Texte coordonné (consolidated text)

Règlement grand-ducal du 6 novembre 2009 relatif au dépôt légal, modifié par le règlement grand-ducal du 21 décembre 2017

Legilux.lu

what is off limits?

Restrictions

While the Luxembourg Web Archive strives to grow over time and expand its inventory of the Luxembourg web space, there are exceptions to this inclusive approach:
– Websites will not be harvested, or will only be partially harvested, if the technical resources necessary for their archival are not justified by, or are out of proportion to, their relevance to the Luxembourg legal deposit.
– No information will be permanently deleted from the Luxembourg Web Archive. In case a court decides certain content is illegal, the BnL will have to withdraw access to this content from the browsable web archive.
– Exclusions for privacy reasons are generally not applicable. The right to be forgotten is not absolute and does not supersede the right to public information.
– The LWA is not required to ask for permission to harvest and archive websites within the legal deposit. Transparency about interactions with websites is nonetheless important. During archiving operations, the LWA communicates a URL to website owners, which explains the legal deposit procedure (http://crawl.bnl.lu).

Objections?

Copyright

Copyright legislation states an exception for « research and private studies from terminals on the premises of the institution ». Publications collected under legal deposit become the property of the institution and can be made available to the public. In the future, we may see a clear exception for text and data mining.

Deletion requests

No content of the LWA will be permanently deleted, but we might withdraw access to certain contents from the browsable web archive.

Exception in GDPR

“The treatment of publications may include their archiving in regards to public interest, for scientific, statistical or historical reasons.”

The right to be forgotten is not absolute and does not supersede the right to public information.

how do you do it?

Methods

Since it is not possible to archive every website at every moment, we have to use our resources in a versatile manner. We operate different kinds of crawls for different types of websites or different aspects of the web.

.LU Domain Crawls

Domain crawls are operated 4 times a year and create a snapshot of all “.lu” addresses, plus additional lists of websites determined by the Luxembourg Web Archive.
These crawls cover a large number of websites at once, but can be slow to capture sites that change at a rapid pace, and can miss sites that disappear between two harvests.

Event Collections

Event collections try to harvest as much information as possible about a certain event over a limited time frame. The seed lists for event crawls are restrictive, but the frequency of captures is higher. There is always a start and end date to event crawls, which could be determined in advance (e.g. for elections), or could depend on the urgency of unexpected events (e.g. natural catastrophes or Covid-19).

Thematic Collections

Thematic collections cover a specific topic or field of interest that has a higher priority for the Luxembourg Web Archive. This could be linked to the pace of changing information, or the importance of the topic. The seed lists will expand over time and have additional harvests, complementing the coverage by domain crawls and event collections.

Growing lists

Event and thematic collections are often referred to as “Special collections”. Both types of collections help to improve the lists used for each domain crawl, which makes up the foundation of the Luxembourg Web Archive, where most of our data is captured. Moreover, the research for an event collection will help to build the basis for a thematic collection, which will keep on evolving when the event collection has ended. By developing the thematic collections, we will have a starting point to build upon, once the next event collection is launched. For example: the seed list for an election collection will flow into the thematic collection “Politics & Society”, which will eventually help to build the seed list for the next election event collection.

Special collections

what topics are there?

Thematic Collections

Art
Business & Industry
Creative Industries
Education & Research
Environment & Biodiversity
Expat Communities
Families & Pets
Fan-culture & Communities
Film
Health & Well-being
Humanitarian NGOs
Humanities
Luxembourg News Media
Literature
Memes & Internet Culture
Music
Performing Arts
Photography
Politics & Society
Religion & Spirituality
Sciences
Sports

The seed lists for these collections are evolving, and the number of topics and themes is also growing. We are open to suggestions and appreciate the input from website owners and content creators who would like to see their websites be part of the web archive. If you would like to contribute to one of these topics, feel free to use our suggestion form, where you can submit a website in less than 1 minute.

Suggest a website

how we record a website

Metadata

The information we use to describe a website on our seed lists, which we also ask our contributors to provide:

URL

The address of a resource (in this case a website) on the Internet. In most cases, we include whole domains in our thematic collections and domain crawls and try to avoid any path suffixes following the top-level domain (such as /home, /fr etc.).
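As a rough illustration of this convention, a suggested address could be reduced to its domain root before being added to a seed list. This is a minimal sketch only: the function name `seed_root` and the choice to assume `https` for addresses given without a scheme are hypothetical and do not describe any tool used by the LWA.

```python
from urllib.parse import urlsplit


def seed_root(url: str) -> str:
    """Reduce a URL to scheme + host, dropping path, query and fragment."""
    # Assumption: addresses typed without a scheme are treated as https.
    if "://" not in url:
        url = "https://" + url
    parts = urlsplit(url)
    # Host names are case-insensitive, so store them in lower case.
    return f"{parts.scheme}://{parts.netloc.lower()}/"


print(seed_root("https://www.example.lu/fr/home"))
```

With this sketch, suffixes such as /home or /fr are discarded and only the domain itself is kept as the seed.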

Title

The title or name of the website, usually found on top of the homepage.

Category

Within the topics and themes of the thematic collections, websites are grouped into categories, which further describe the institution, association, business or person behind the website. Categories are constantly adjusted or added according to the websites added to the collections.

Frequency

The frequency of captures is determined by the LWA team. Apart from seeds that would only be archived once, the standard minimum would be 4 captures per year. There is also the possibility to set the frequency to monthly, weekly or daily captures, if the pace of changing content requires more frequent crawls.

Depth

Depth of crawl describes how deep the crawler is supposed to dive into a website. In the standard setting, the crawler will include all pages from the seed domain. In the “one page” setting, the crawler will only capture the first page encountered and not follow any links on that page. This avoids harvesting a large amount of data from a bigger website where the other pages are not relevant to the areas of interest of the Luxembourg Web Archive.
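The two depth settings can be pictured as a scope rule applied to every link the crawler finds. The sketch below is an assumption-laden illustration, not the LWA's crawler configuration: the function `in_scope` is hypothetical, and treating subdomains as part of the seed domain is an assumption that the text above does not confirm.

```python
from urllib.parse import urlsplit


def in_scope(seed_url: str, candidate_url: str, one_page: bool = False) -> bool:
    """Decide whether a discovered URL belongs to the crawl of a seed."""
    if one_page:
        # "One page" setting: only the seed page itself, no links followed.
        return candidate_url == seed_url
    seed_host = urlsplit(seed_url).hostname or ""
    cand_host = urlsplit(candidate_url).hostname or ""
    # Standard setting: every page on the seed domain.
    # Assumption: subdomains count as part of the seed domain.
    return cand_host == seed_host or cand_host.endswith("." + seed_host)


print(in_scope("https://example.lu/", "https://example.lu/about"))
print(in_scope("https://example.lu/", "https://example.lu/about", one_page=True))
```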

Description, Keywords

An open form field used to describe the added website with any other information that would be useful in characterising its content, type of website, organisation behind the site etc.
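Taken together, the fields above can be thought of as one record per seed. The following is a minimal sketch of such a record, assuming hypothetical names (`Seed`, `Frequency`, `Depth`) and default values based on the standard settings mentioned in the text; it does not reflect the actual data model behind the LWA seed lists.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    ONCE = "once"
    QUARTERLY = "quarterly"  # standard minimum: 4 captures per year
    MONTHLY = "monthly"
    WEEKLY = "weekly"
    DAILY = "daily"


class Depth(Enum):
    FULL_DOMAIN = "full domain"  # standard setting: all pages of the seed domain
    ONE_PAGE = "one page"        # only the first page, no links followed


@dataclass
class Seed:
    url: str
    title: str
    category: str
    frequency: Frequency = Frequency.QUARTERLY
    depth: Depth = Depth.FULL_DOMAIN
    description: str = ""  # open field: description and keywords


# Hypothetical example entry, for illustration only.
seed = Seed(
    url="https://example.lu/",
    title="Example Association",
    category="Association",
    description="Placeholder entry, not a real seed.",
)
print(seed.frequency.value, seed.depth.value)
```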

Social Media

Apart from Twitter seeds, no social media profiles are being archived on a regular basis. However, if we add a website to a thematic collection, we also add the associated Facebook, Instagram, Twitter (and YouTube) profiles, in order to have a useful basis of addresses once the technical circumstances of social media archiving have changed.

Whom is it for?

Collaboration

There are different target audiences, who are at the same time users and contributors of the Luxembourg Web Archive. The layers indicate the proximity to the archive itself and the missions of the National Library. The interchange between raising awareness about web archiving and fostering collaboration through every layer will improve the quality and service provided by the archive to every group of target audiences.

BnL

The closest level to the web archive is the BnL staff, not only because they operate the LWA, but because all BnL staff have to be the first line of promoters and contributors to its collections. Since the LWA is contributing to the traditional collections of the BnL, the expertise used in building these collections should in turn help to grow the web archive.

Key Partners

Starting with cultural institutions, such as the CNA and the CNL, the LWA will continue to ask for the participation of key contributors in providing seed lists. These lists should form the basis for thematic collections, such as “Literature”, “Film”, “Photography” and “Music”. Other stakeholders will add seeds to these collections as well, such as the subject experts and organisations interested in seeing their line of work represented in the web archive.

Public Sector

Transparency and access to information are major concerns for the public sector. State and municipality websites contain vital information for every citizen of Luxembourg. Therefore, the government, administrations and municipalities have a high interest in having their content preserved, as a service and obligation to their constituents.

Students and researchers

There are different approaches to using a web archive as a source of information: close reading, data-centric views, or graph-centric views. Students and researchers are showing more and more interest in all of these approaches. However, the mere existence of web archives and their potential for scientific research are still widely unknown. Moreover, web archives are only useful to a research project if the collections meet its needs in content, metadata and access to data. In order to improve its collections and contents for the research questions of tomorrow, it is important to get researchers and students involved in contributing to collections, as well as using and processing data from the LWA. There could also be a possibility of integrating websites archived by researchers into the LWA for long-term preservation.

Professional use

Websites get regular updates, companies change plans, every business idea or product spawns a new website, start-ups come and go. The target audience of professional users should find answers to a number of problems in the LWA: companies have an interest in seeing their websites, products and services preserved in the archive. Professional networks and associations should be invested in seeing their sector of activity preserved and need to be included in the process of expanding thematic collections.

Between professional use and private use, a very important audience is represented by content creators and website owners. Any person creating or uploading content for a website intends for people to see and access this content on the Internet. The LWA maintains this access for future generations, providing a valuable service to creators and users alike.

Private use

Most people don’t have a day-to-day need to use a web archive, but might end up looking for a particular website that has disappeared, follow a URL leading to an error code, or try to find an image or document that they had seen before, but that is missing now. These needs are the most difficult for the LWA to address, since they concern every person and every element of every website. The involvement of every previous level of target audiences helps to improve the web archive and to meet those needs.

Since the LWA is asking for contributions, suggestions and help from all layers of the collaborative chart, these actors also demand a service in return. From public institutions to companies and private website owners, all stakeholders will have certain expectations of the harvesting and preservation of their online content. It is therefore crucial to focus not only on communicating the necessity and benefits of web archiving, but also on managing expectations and explaining the possibilities and limitations of our program.

Possibilities and limits

As we have mentioned above, there are limits to what can be achieved with regard to the vision and strategy of the Luxembourg Web Archive. Regarding the areas of interest and building of seed lists, the National Library tries to be as inclusive as possible. However, users and website owners have to be aware of what can be expected from captures of websites in the archive, what kind of gaps can occur and what kind of content is not being archived at all. The LWA cannot guarantee that an archived website will be 100% true to its original, including every document and media file with absolute certainty. Therefore, separate archiving of documents remains necessary if this degree of completeness is expected.

OVERVIEW IN FRENCH

Gaps in archiving different types of media

Due to a lack of resources and technical means, the BnL is currently not archiving:

Videos from YouTube channels, or other video platforms. On many websites, videos are being streamed from external platforms. It is very likely that these videos are not being archived either.

The same goes for music video channels, or other music streaming solutions.

The situation is similar for podcasts, which are often streamed via hosting platforms, which are not openly accessible (such as Spotify or iTunes).

These gaps are concerning and the National Library wants to be able to do more in these areas, since we are losing the creative work by Luxembourgish artists, musicians, actors, content creators and many others.
What is needed are more resources in terms of staff and budget to be able to create solutions tailored to the needs of the Luxembourg web.

The Team

The Luxembourg Web Archive is operated jointly by the National Library’s heritage department and its IT department.

While there are a number of people contributing to the objectives of the web archive in different ways, the core team consists of 3 people:

Technical manager

– manages domain crawls
– administration of technical infrastructure
– QA on the technical level
– point of contact for webmasters

Digital Curator

– manages special collections
– QA on the content level
– oversees collection policy
– point of contact for content owners
– administration of webarchive.lu

Software engineer

– develops software solutions
– adapts existing tools for specific needs
– data management
– indexing, search engines