How to convert TMX to tab-delimited? Thread poster: Hans Lenting
| Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER
Stepan Konev wrote: Hans Lenting wrote: Replace them with a unique sign? Err.. what lines do you mean? I don't quite understand. Multi-line segments | | | Samuel Murray Netherlands Local time: 15:29 Member (2006) English to Afrikaans + ...
Some people use the term "new line" or "newline" for characters that are either a carriage return, a line feed, or a combination of both. The TMX specification uses the term "line-break". It is my understanding that TMX allows both characters, and both characters are assumed to have their actual meaning. Of course, it's possible that some converters also convert one to the other. For example, a converter might convert line feed (common under Linux) to carriage return + line feed (common under Windows). It can be difficult to decide whether converting them would be good or bad. What does the \n mean in both of your regular expressions? | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER A totally new line | Oct 14, 2022 |
Samuel Murray wrote: Some people use the term "new line" or "newline" for characters that are either a carriage return, a line feed, or a combination of both. The TMX specification uses the term "line-break". It is my understanding that TMX allows both characters, and both characters are assumed to have their actual meaning. Of course, it's possible that some converters also convert one to the other. For example, a converter might convert line feed (common under Linux) to carriage return + line feed (common under Windows). It can be difficult to decide whether converting them would be good or bad. What does the \n mean in both of your regular expressions? Guess what: new line. What I am referring to are segments that consist of several lines. When aligning source and target to tab-del, these segments become a problem. So my question was: should I replace the line-break with a ÿ or similar? | | | Stepan Konev Russian Federation Local time: 16:29 English to Russian Ah, line breaks | Oct 14, 2022 |
Well, the regex (\n|.) means a line break (\n) or (|) any character (.). Therefore it covers both single-line segments and multiline segments. However, manual work may indeed be needed to fix them (by removing the paragraph mark) before converting them into a 2-column table. A second option is to use a regex that only covers single-line segments, sacrificing the multiline segments. Not sure which is the lesser evil, though.
[Edited at 2022-10-14 23:16 GMT] | |
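A small Python sketch of the difference between "." and (\n|.), for illustration only. Python's re module stands in for the editor's regex engine here, which is an assumption (Notepad++ and Mac editors may differ in details), and the sample text is made up. The inner group is written as non-capturing, (?:\n|.), so that findall returns the whole segment content.

```python
import re

# Hypothetical sample: one single-line segment and one multi-line segment.
sample = "<seg>One line</seg>\n<seg>- First line\n- Second line</seg>"

# '.' alone does not match a line break in Python unless re.DOTALL is set,
# so only the single-line segment is found.
dot_only = re.findall(r"<seg>(.+?)</seg>", sample)

# '(?:\n|.)' matches a line break OR any other character,
# so the multi-line segment is found as well.
newline_or_dot = re.findall(r"<seg>((?:\n|.)+?)</seg>", sample)

print(dot_only)
print(newline_or_dot)
```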
Stepan Konev Russian Federation Local time: 16:29 English to Russian Replace line breaks with spaces | Oct 14, 2022 |
Hans Lenting wrote: So my question was: should I replace the line-break with a ÿ or similar. I don't know about your specific editor for Mac, but as regards Notepad++, it does not use soft line breaks. It simply inserts paragraph marks instead. That is why you have to fix it manually. If your editor uses line breaks, then replace them with spaces. This would make one segment out of two.
[Edited at 2022-10-14 19:24 GMT] | | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER Smart approach | Oct 14, 2022 |
Stepan Konev wrote: Well, the regex (\n|.) means either line break (\n) or (|) any character (.). Therefore it covers both single line segments and multiline segments. However manual work may be needed indeed to fix them (by removing the paragraph mark) before converting them into a 2-column table. A second option is to use a regex that only covers single line segments having sacrificed the multiline segments. Not sure which evil is lesser though. I’ll try to come up with a regex that replaces all line-breaks with ÿ, except when they are followed by <seg>.
[Edited at 2022-10-14 19:59 GMT] | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER These segments | Oct 14, 2022 |
<seg>- First line
- Second line
- Third line
</seg>
Something like:
Find: \n(?!(<seg>))
Replace: ÿ
[Edited at 2022-10-14 19:56 GMT] | |
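The Find/Replace pair can be tried out in Python as a rough check. This is only a sketch: Python's re module is assumed to behave like the editor for this pattern, and the two-segment sample is invented. Note that the pattern replaces every line break that is not directly followed by <seg>, including the one before </seg>.

```python
import re

# Hypothetical sample: a three-line segment followed by a one-line segment.
sample = (
    "<seg>- First line\n"
    "- Second line\n"
    "- Third line\n"
    "</seg>\n"
    "<seg>Next segment</seg>"
)

# Replace each line break with the marker unless '<seg>' follows immediately.
result = re.sub(r"\n(?!(<seg>))", "ÿ", sample)

print(result)
```

After the replacement, the only line break left in the sample is the one in front of the second <seg>, so each segment sits on a single line.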
Stepan Konev Russian Federation Local time: 16:29 English to Russian Checkbox in Notepad++ | Oct 14, 2022 |
Ok, I can't figure out a regex so far, but Notepad++ has a checkbox '. matches newline'. When it is checked, the match covers all new lines. Solution: use Windows.
Update: Got it: (?<=<seg>)(\r|\n|.)+?(?=</seg>) This regex seems to cover new lines.
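For anyone who wants to check the pattern outside Notepad++, here is a minimal Python sketch. The sample string is invented, and treating re.DOTALL as the counterpart of the '. matches newline' checkbox is an assumption about how the two engines correspond.

```python
import re

# Hypothetical sample with a Windows line ending inside the first segment.
sample = "<seg>- First line\r\n- Second line</seg>\n<seg>Single line</seg>"

# The pattern from the post: everything between <seg> and </seg>,
# where each character may be \r, \n or any other character.
pattern = re.compile(r"(?<=<seg>)(\r|\n|.)+?(?=</seg>)")

# group(0) is the whole match; the repeated group itself only keeps
# the last character it captured, so findall() is not used here.
contents = [m.group(0) for m in pattern.finditer(sample)]

# Same result with a plain '.' when the dot is allowed to match line breaks.
with_dotall = re.findall(r"(?<=<seg>).+?(?=</seg>)", sample, flags=re.DOTALL)

print(contents)
```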
[Edited at 2022-10-14 21:32 GMT] | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER
Hans Lenting wrote: <seg>- First line - Second line - Third line </seg> Something like: Find: \n(?!(<seg>)) Replace: ÿ Tested it, used the ¶ for the replacement. Better surround it with spaces, for better term recognition later, perhaps. Now every multi-line segment will be placed in its own "table cell", so that no lines are left out. | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER Further preparation | Oct 15, 2022 |
Further preparation:
Remove in-segment tags:
Find: <[eb]pt.*?> Replace: Nothing
Find: <ph.*?> Replace: Nothing
Convert entities:
Find: &amp; Replace: &
Same for: &lt;, &gt;, &apos; and &quot;
Remove all other entities:
Find: &.*?; Replace: Nothing
Further cleaning:
Remove dashed lines:
Find: _{2,} Replace: Nothing
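The preparation steps could be chained in a script roughly like this. It is a sketch under assumptions, not a tested converter: the function name clean_segment and the sample are hypothetical, and two details deliberately differ from the Find patterns above. Leftover entities are removed with the tighter pattern &#?\w+; so that a literal "&" followed later by a ";" is not swallowed, and &amp; is converted last so that text escaped as &amp;hellip; is not decoded twice.

```python
import re

def clean_segment(text: str) -> str:
    """Apply the clean-up steps to one segment (hypothetical helper)."""
    # Remove in-segment tags. As with the Find patterns, only the opening
    # <bpt ...>, <ept ...> and <ph ...> tags themselves are removed.
    text = re.sub(r"<[eb]pt.*?>", "", text)
    text = re.sub(r"<ph.*?>", "", text)
    # Convert four of the five predefined XML entities.
    for entity, char in (("&lt;", "<"), ("&gt;", ">"),
                         ("&apos;", "'"), ("&quot;", '"')):
        text = text.replace(entity, char)
    # Remove all other entities, leaving &amp; alone for now (assumption:
    # a tighter pattern than '&.*?;' is acceptable here).
    text = re.sub(r"&(?!amp;)#?\w+;", "", text)
    # Convert &amp; last to avoid double decoding.
    text = text.replace("&amp;", "&")
    # Remove dashed lines (runs of two or more underscores).
    text = re.sub(r"_{2,}", "", text)
    return text

sample = 'Click <bpt i="1" type="bold">&quot;OK&quot;<ept i="1"> &amp; wait&hellip; ____'
print(repr(clean_segment(sample)))
```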
[Edited at 2022-10-15 06:42 GMT] | | | Hans Lenting Netherlands Member (2006) German to Dutch TOPIC STARTER If it cannot be avoided | Oct 15, 2022 |
Stepan Konev wrote: Solution: use Windows I use Windows apps whenever I cannot avoid them. E.g. Trados. Or a silly SonicWall component to log in to a client’s Plunet portal (very irritating). | |
Samuel Murray Netherlands Local time: 15:29 Member (2006) English to Afrikaans + ...
Hans Lenting wrote: When aligning source and target to tab-del, these segments become a problem.

Yes, so you have to first replace all whitespace characters (except spaces, duh) with replacement characters. I wonder if it would be sufficient (and if it would "work" in your target CAT tool) if you were to replace them with numbered entities:
&#9; = horizontal tab
&#10; = line feed
&#13; = carriage return
&#13;&#10; = carriage return + line feed (i.e. Windows line endings)
So, if your text editor can't distinguish between carriage returns and line feeds, then you'd just replace \n with &#10; inside segments. Windows programs tend to "understand" both line feeds and carriage returns + line feeds, but not carriage returns on their own, but I'm not sure about Mac programs... I'm under the impression that the carriage return by itself is a Mac thing...?

When I convert TMs, and I have a replacement character in either the source or target text (I usually use {{LF}} and {{TAB}}), I add a flagging character such as ". " or "# " to the start of the source text in order to prevent it from being a 100% match to anything (and since the flagging character is clearly visible in the comparison window of the CAT tool, I know instantly that the segment is one that had some work done on it).

Stepan Konev wrote: Notepad++ has a checkbox '. matches newline'.

Different text editors do things differently: for some, "." includes vertical white space such as new lines, but for others it doesn't (so you have to specify it, or, as with N++, you have to check a box). Some text editors can't even handle regular expressions that span new lines, i.e. they consider the new line to be the end of the expressible content. Note that the regular expressions in text editors tend to be deliberately limited or customized because text editor users typically have very specific sets of things that they want to do.

Hans Lenting wrote: Find: \n(?!(<seg>)) Replace: ÿ Or just replace \n with ① and replace \t with ② throughout the file -- no need to restrict it to segments: since you're not going to use the TMX file after this, it doesn't matter if you replace tabs and new lines outside of segments with other stuff (unless your content extraction method relies on the assumption that <seg> elements always start and end on a line break, which is a dodgy assumption).
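The placeholder-plus-flag idea might look like this in Python. This is a sketch only: the function name to_row is hypothetical, the tokens and the "# " flag are the ones mentioned in the post, and folding all line-ending styles into one placeholder is a simplification that loses the distinction between carriage returns and line feeds.

```python
LF_TOKEN = "{{LF}}"    # placeholder for a line break
TAB_TOKEN = "{{TAB}}"  # placeholder for a tab
FLAG = "# "            # prefix marking rows that were altered

def to_row(source: str, target: str) -> str:
    """Build one tab-delimited row from a source/target pair (hypothetical helper)."""
    replaced = False
    cells = []
    for cell in (source, target):
        # Fold Windows (\r\n) and bare carriage-return endings into \n first.
        flat = cell.replace("\r\n", "\n").replace("\r", "\n")
        flat = flat.replace("\n", LF_TOKEN).replace("\t", TAB_TOKEN)
        replaced = replaced or flat != cell
        cells.append(flat)
    if replaced:
        # Flag the source so it never counts as a 100% match.
        cells[0] = FLAG + cells[0]
    return "\t".join(cells)

row = to_row("- First line\n- Second line", "- Eerste regel\n- Tweede regel")
plain = to_row("Hello", "Hallo")
print(row)
print(plain)
```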
[Edited at 2022-10-15 10:14 GMT] | | | Samuel Murray Netherlands Local time: 15:29 Member (2006) English to Afrikaans + ...
(ignore... I forgot that you're not trying to *fix* the TMX; you're trying to *convert* it).
[Edited at 2022-10-15 10:53 GMT] | | | Samuel Murray Netherlands Local time: 15:29 Member (2006) English to Afrikaans + ...
Since it would seem that you're trying to create your own TMX-to-text converter, you have to decide what you want to do with disallowed entities in the file. Ask yourself: if the TMX file contains this: <seg>Hello&hellip;</seg> what did the original CAT tool try to accomplish here? When you convert it to tabbed text, would you then change it to: Hello… or to: Hello&hellip; Did the CAT tool intend to have three dots in the text or did it intend to have something that actually looks like an entity? You can only know the answer to this question if you can have a look at the text inside the CAT tool's own editing field.
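The two possible outcomes can be made concrete with a short Python sketch. The segment text "Hello&hellip;" is an assumed example of an entity that XML does not define, and html.unescape is used here only as one way of decoding it; which result is right still depends on what the CAT tool meant.

```python
import html

# Hypothetical segment text containing an entity that XML does not define.
raw = "Hello&hellip;"

# Option 1: keep the text as it is, so it still looks like an entity.
as_literal = raw

# Option 2: decode it as an HTML entity, giving the ellipsis character.
as_character = html.unescape(raw)

print(as_literal)
print(as_character)
```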
[Edited at 2022-10-15 10:58 GMT]