Offline html2txt, or: concordancer for HTML files
Thread poster: Samuel Murray
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
Oct 5, 2017

Hello everyone

I have a bunch of HTML files that I want to search in a concordance-style search tool. I'm on Windows 7. Does anyone know of an offline HTML2TXT converter that does not insert line breaks (the files are HTML 2.0, with a couple of non-standard tags that can be ignored), or alternatively, a concordancer that can handle HTML files without having to load all of them before each search? There are about 600 000 files.

Thanks
Samuel


 
Michael Beijer
United Kingdom
Member (2009)
Dutch to English
tlCorpus? Oct 5, 2017

Samuel Murray wrote:

Hello everyone

I have a bunch of HTML files that I want to search in a concordance-style search tool. I'm on Windows 7. Does anyone know of an offline HTML2TXT converter that does not insert line breaks (the files are HTML 2.0, with a couple of non-standard tags that can be ignored), or alternatively, a concordancer that can handle HTML files without having to load all of them before each search? There are about 600 000 files.

Thanks
Samuel


I'm not 100% sure, but I am pretty sure that tlCorpus should be able to do it for you. It's also very cheap, at only €38.

http://tshwanedje.com/corpus/


 
Andriy Yasharov
Ukraine
Member (2008)
English to Russian
archivarius 3000 Oct 5, 2017

I use Archivarius 3000, which is a simple yet fast application for searching documents and e-mail on your desktop computer, your local network and removable drives (CD, DVD). Documents can be searched by content, much as with Internet search engines.
http://www.likasoft.com/index-en.shtml


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
TOPIC STARTER
TshwaneDJe Oct 5, 2017

Michael Joseph Wdowiak Beijer wrote:
I'm not 100% sure, but I am pretty sure that tlCorpus should be able to do it for you.


It works, but it requires that all files be loaded in advance. And there is no progress bar. I was able to use it with 500 files (the wait was about 1 minute), but I don't see myself using it with 500 000 files.
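
For comparison, here is a rough sketch of the "no loading in advance" approach: a Python generator that opens one file at a time and yields hits as it goes, so nothing is held in memory between files. The function name `search_tree`, the suffix list and the UTF-8 assumption are all mine, not something any of the tools mentioned here is known to do. It searches the raw HTML, so tag names can match too, and every search rescans the whole folder, which will be slow for hundreds of thousands of files.

```python
import os
import re

def search_tree(root, pattern, suffixes=(".html", ".htm")):
    """Yield (path, offset, matched_text) for every hit under root.

    Files are opened one at a time, so nothing is preloaded.
    Assumption: files are (mostly) UTF-8; undecodable bytes are replaced.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.lower().endswith(suffixes):
                continue  # skip anything that is not an HTML file
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            # One result per hit, not one result per file.
            for match in rx.finditer(text):
                yield path, match.start(), match.group(0)
```

Because it is a generator, you can stop after the first few hits (for example with `itertools.islice`) without scanning the remaining files.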


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
TOPIC STARTER
Archivarius 3000 Oct 5, 2017

Andriy Yasharov wrote:
I use Archivarius 3000...


The problem with Archivarius is that you have to create an index first before you can do any searches. Then, it doesn't display the results in a concordance-like display, but as a list of files with previews. Also, if there is more than one hit in a file, Archivarius shows only one entry for the entire file.


 
Jean Dimitriadis
English to French
BootCaT & AntConc Oct 5, 2017

Samuel,

I've thought of a solution that might work for you.

The idea is to:

- use BootCaT to create a txt corpus from a local folder holding your html files, and then
- use AntConc or a similar monolingual concordancer to search the data.

In more detail:

BootCaT: The BootCaT front-end is a (Java) graphical interface for the BootCaT toolkit. It automates the process of finding reference texts on the web and collating them in a single corpus - http://bootcat.dipintra.it/

Normally, you use BootCaT to quickly build a corpus from specific URLs or from websites that contain a combination of keywords (tuples) you define. But you do not want to use the web; you want an offline solution for your local files. BootCaT can do that as well.

Here’s how:

Launch BootCaT to start the wizard. Hit Next. Choose a corpus name and a language (in Options, you can also choose a destination folder for your corpora). Hit Next. Choose “Local files (advanced)” and select the folder that contains your html files. Hit Next. Press Build corpus. [Before pressing Build corpus, you may want to click on Show advanced options at the top and uncheck the option “Discard documents not in this language”.] In my test (a website downloaded using httrack), this went really quickly, but I guess 500,000 files will take a while. You can then open the corpus folder and find the “corpus.txt” file. Caveat: this does not convert the html files to separate txt files; it produces just one. However, this should be OK for concordance searches.

This txt file can be opened in AntConc - http://www.laurenceanthony.net/software.html - so that you can perform searches in a real concordancer.
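
To make "concordance search" concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. It is only an illustration of the display style, not how AntConc works internally; the function name `kwic` and the `width` parameter are made up for this example.

```python
import re

def kwic(text, pattern, width=40):
    """Yield one 'left [hit] right' line per match of pattern in text."""
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Flatten line breaks so every hit stays on a single output line.
        left = left.replace("\n", " ")
        right = right.replace("\n", " ")
        # Right-align the left context so the hits line up in a column.
        yield f"{left:>{width}} [{match.group(0)}] {right}"
```

For a single large file such as corpus.txt, you could call `kwic()` on one line or paragraph at a time instead of reading the whole file into memory.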

PS: A web version of BootCaT is also implemented in the Sketch Engine - https://the.sketchengine.co.uk/auth/corpora/

If you try this, do tell us how it went.

Jean

[Edited at 2017-10-05 15:42 GMT]


 
Jean Dimitriadis
English to French
@Samuel Oct 14, 2017

Hello Samuel,

Did you find a solution that allowed you to handle the html files as intended?

Jean


 
Samuel Murray
Netherlands
Member (2006)
English to Afrikaans
TOPIC STARTER
@Jean Oct 14, 2017

Jean Dimitriadis wrote:
Did you find a solution that allowed you to handle the html files as intended?


BootCaT worked well enough. But I've realised that what I really need is a good per-file HTML2TXT converter. BootCaT performs the conversion and a file merge at the same time. It doesn't always get the file encoding right, though. A number of my files came back with weird characters in them (though 99% of characters in such files were fine).
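
For anyone with the same need, a per-file converter along these lines can be sketched with Python's standard library alone. This is a minimal sketch under stated assumptions, not a tested tool: the names `TextExtractor`, `decode_bytes`, `html_to_text` and `convert_file` are invented here, the encoding fallback order (UTF-8, then Windows-1252, then Latin-1) is a guess that may not fit your files, and the list of inline tags is illustrative. Unknown tags are simply ignored, and all whitespace is collapsed so that no line breaks end up in the output.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Tags that may occur inside a word; every other tag is treated as a word break.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "tt", "code", "cite", "u", "font"}

class TextExtractor(HTMLParser):
    """Collect text nodes; tags (standard or not) contribute no text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; into & etc.
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag not in INLINE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def decode_bytes(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn (an assumed order), then fall back to Latin-1."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")  # maps every byte, so it cannot fail

def html_to_text(raw):
    """Convert the raw bytes of one HTML file to a single line of text."""
    parser = TextExtractor()
    parser.feed(decode_bytes(raw))
    parser.close()
    # Collapse runs of whitespace, including line breaks, to single spaces.
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

def convert_file(src, dst):
    """Write the text of one HTML file to one UTF-8 text file."""
    Path(dst).write_text(html_to_text(Path(src).read_bytes()), encoding="utf-8")
```

Trying UTF-8 first is deliberate: text in a legacy single-byte encoding rarely happens to be valid UTF-8, so a successful UTF-8 decode is a reasonably good sign, whereas Windows-1252 accepts almost any byte sequence and would hide a wrong guess.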


 

