For professional translations, visit timtranslates.com.
This will be the first of a series of posts in which I will give tips on how to adapt glossaries for importing into our terminology databases. The problem we often have is that the format we find on the web is not suitable — or not ideal — for importing into terminology databases, but there are usually ways of adapting the format so that we can import the glossary.
In this post, we are going to look at this general glossary. It doesn’t look too complicated to import, since the two languages are separated by tabs and each entry is on a new line. But some entries contain more than one term, so if we import the glossary in its current format, we’ll have to delete one of the terms each time we insert an entry into a translation. Or worse: because the entry contains five different words, our translation memory software does not pick up that a word in our text is also in the glossary.
What we ideally want is a separate entry for every synonym in one language combined with every synonym in the other. To start with, we need to copy the glossary into a file. For this demo, if you like, start by adding only the A page, then do the whole thing afterwards once you’ve followed these steps. So, paste the entire A page into a Word document, then switch on the display of non-printing characters so you can see the format. As you can see, in this glossary the two languages are separated by a tab.
At various points in the demonstration we shall observe how the changes we make have affected the line that currently reads (the arrow stands for the tab):
aberrance, aberrancy -> error; extravío; anormalidad
Save the document in txt format and close it, then download a program for editing files using regular expressions. I recommend using PowerGREP, and I will base the rest of my instructions on this program.
Open PowerGREP, and in the folder bar select the file you have just saved. Select the option to do a search and replace as the action type, and make sure you select Regular expression as the search type.
In the Search box, type in the following query, without the quotes (for “[type enter]”, literally press the enter key to show the carriage return symbol): “[type enter](.*)\t(.*); (.*)”.
- (.*) = any series of characters
- \t = tab
- ; is literal text, that is, find a semicolon (followed here by a space)
So, we are searching for a line break, followed by any series of characters, followed by a tab, followed by any series of characters, followed by a semicolon and a space, then any series of characters up to the end of the line. I will explain the brackets below.
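To see what the three sets of brackets capture, the same pattern can be tried outside PowerGREP. What follows is only a minimal Python sketch, not part of the PowerGREP workflow. It assumes the two languages are separated by a real tab character, and it uses `^` with `re.MULTILINE` to anchor at the start of a line instead of typing a line break. The variable names are illustrative.

```python
import re

# The test line, with a real tab between the English and the Spanish.
line = "aberrance, aberrancy\terror; extravío; anormalidad"

# Same pattern as in the Search box, anchored with ^ instead of a line break.
pattern = re.compile(r"^(.*)\t(.*); (.*)", re.MULTILINE)

match = pattern.search(line)
english, spanish_head, spanish_last = match.groups()

print(repr(english))       # the text before the tab
print(repr(spanish_head))  # the Spanish text before the last "; "
print(repr(spanish_last))  # the Spanish text after the last "; "
```

Because `.*` matches as much as it can, the second set of brackets runs up to the last semicolon on the line, and the third set holds only the final synonym.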
In the replace box, type: “[type enter]\1\t\2[type enter]\1\t\3”
Here, we are replacing what we have found with: a carriage return to replace the one we included in the search; \1, which stands for what we found in the first set of brackets in our search query, that is, the characters before the tab (the English); the tab again; \2, the contents of the second brackets (that is, the Spanish up to the last semicolon); a carriage return; \1 once more (that is, we repeat the English); and finally another tab followed by \3, the contents of the third brackets (that is, the Spanish after the last semicolon). Because (.*) matches as much as it can, the second brackets take everything up to the last semicolon on the line, so each replace splits off the final synonym.
Before starting the replace process, either make a copy of your original, or better, set up the program so it makes a copy in the same folder each time you do a replace. That way the built-in undo function will work, and it will be easy to delete the backups at the end.
If you like, do a preview to see what will be found and replaced. If you are happy with the preview, click on replace. You will now see that the last synonym has been split off into an additional entry.
Our test line has become:
aberrance, aberrancy -> error; extravío
aberrance, aberrancy -> anormalidad
But we still have two synonyms in the first line. To create a separate line for every synonym in Spanish, we must keep clicking on replace until no replacements are made. Once you have done a replacement that results in 0 matches, open the file up again and you will see that our test line has become:
aberrance, aberrancy -> error
aberrance, aberrancy -> extravío
aberrance, aberrancy -> anormalidad
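The "keep replacing until you get 0 matches" step can also be scripted. The sketch below is a rough Python equivalent, offered as an illustration rather than as a replacement for PowerGREP. It assumes a tab-separated file, works on the test line only, and again uses `^` with `re.MULTILINE` in place of the typed line break.

```python
import re

# Only the test line; a real glossary would have one entry per line.
text = "aberrance, aberrancy\terror; extravío; anormalidad\n"

# Search and replace strings equivalent to the ones typed into PowerGREP.
search = re.compile(r"^(.*)\t(.*); (.*)", re.MULTILINE)
replace = r"\1\t\2\n\1\t\3"

passes = 0
while True:
    # subn returns the new text and the number of replacements made.
    text, count = search.subn(replace, text)
    if count == 0:  # nothing left to split, i.e. "0 matches"
        break
    passes += 1

print(text)
print("passes that changed something:", passes)
```

Each pass splits off one synonym per line, so an entry with three Spanish synonyms needs two passes before the loop stops.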
We now want to do the same with the English synonyms. The English synonyms are separated by commas and appear before the tab, so we will have to change the syntax. Use the following syntax:
Search: (.*), (.*)\t(.*)
Replace: \1\t\3[type enter]\2\t\3
Our test line now shows:
aberrance -> error
aberrancy -> error
aberrance -> extravío
aberrancy -> extravío
aberrance -> anormalidad
aberrancy -> anormalidad
Again, sometimes there are more than two English synonyms, so we should keep replacing until we get 0 matches. Then open up your file again, and it should be ideal for importing. If you did not do so before, copy the whole glossary, letter by letter, into a single file, and repeat the process.
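The English pass can be sketched in the same way. This is again only an illustrative Python version of the PowerGREP step. It assumes the Spanish synonyms have already been split onto separate tab-separated lines.

```python
import re

# The test line after the Spanish pass, one Spanish synonym per line.
text = (
    "aberrance, aberrancy\terror\n"
    "aberrance, aberrancy\textravío\n"
    "aberrance, aberrancy\tanormalidad\n"
)

# English synonyms sit before the tab and are separated by ", ".
search = re.compile(r"^(.*), (.*)\t(.*)", re.MULTILINE)
replace = r"\1\t\3\n\2\t\3"

while True:
    text, count = search.subn(replace, text)
    if count == 0:  # no line has a comma before the tab any more
        break

print(text)
```

The result has one English term and one Spanish term per line, which is the shape we want for importing.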
The precise syntax to be used for these processes depends on how each glossary is formatted. I hope, though, that based on this exercise you will be able to convert other glossaries into formats that are ideal for importing into terminology databases. In future posts, I hope to show you how to convert some other glossaries. If there is a glossary you really want to import, but you are not sure how, let me know, and I’ll see if I can explain how to do it in a future post.
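For glossaries that use other separators, one alternative to repeated search and replace is to split each side of the entry and combine the pieces directly. This is a different technique from the PowerGREP steps described in this post, sketched here in Python. The function name `expand_entry` and its parameters are hypothetical and would need adjusting to each glossary's format.

```python
from itertools import product


def expand_entry(line, field_sep="\t", source_sep=", ", target_sep="; "):
    """Return one (source, target) pair per combination of synonyms."""
    source, target = line.split(field_sep, 1)
    sources = [s.strip() for s in source.split(source_sep) if s.strip()]
    targets = [t.strip() for t in target.split(target_sep) if t.strip()]
    # product pairs every source synonym with every target synonym.
    return list(product(sources, targets))


pairs = expand_entry("aberrance, aberrancy\terror; extravío; anormalidad")
for source, target in pairs:
    print(f"{source}\t{target}")
```

The pairs come out grouped by English term rather than by Spanish term, so the line order differs from the search-and-replace result, but the set of entries is the same.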
Please, please, please, please let me know if you have found this useful, as it has taken me quite a long time to write, and it would be nice to know if people have found it useful. Please also let me know if you can suggest any improvements to my message. Why not leave me a comment on my blog?
I found the post very useful, because I didn’t know about grep software at all, and I think it could be very useful to me. Do you know of any program like PowerGREP that is freeware?
Sorry for not writing to you in English, but my command of it isn’t good enough.
The possibility of using regular expressions is very interesting, not only for manipulating glossaries but also for correcting or standardising large translation memories in tmx or txt format. However, using them is not exactly easy, and it takes some effort in study and research. How about offering a few web addresses for regex manuals or tutorials?
Regards,
Pablo
Hi rmf
No problem. My blog is strictly multilingual! I don’t know of any free program.
Pablo, the truth is that I don’t have any specific reference site. Whenever I want to do something specific, I run a search to find the right command.
Way over my head, I’m afraid!!!