Should we add translated-title and transliterated-title to the objectified title? #327

Open
denismaier opened this issue Jul 22, 2020 · 48 comments

@denismaier
Member

As a follow-up on converting titles into objects, I think we should discuss whether there is any value in adding alternate forms (translated or transliterated title forms) to these title objects. Maybe so:

title:
  main: "Война и миръ"
alternate:
  - main: "War and peace"
    type:  "translated"
  - main: "Vojna i mir"
    type: "transliteration"

Or just so:

title:
  main: "Война и миръ"
  translated:
    main: "War and peace"
  transliterated:
    main: "Vojna i mir"
@bdarcus
Member

bdarcus commented Jul 22, 2020

@fbennett - you know more about this area than us. Thoughts?

@fbennett
Member

fbennett commented Jul 23, 2020

There are various transliteration schemes (Roman, Cyrillic, many others), and some languages require a native transliteration for basic sorting (Japanese sorts by hiragana; Taiwan sorts Unicode glyphs directly; I'm not sure what they do in the PRC). So you need to provide for multiple transliterations, and key them by ID. BCP 47/RFC 5646 is a robust scheme covering pretty much everything. It can be validated loosely (regexp-wise) or tightly (by extending a regexp-wise scheme with a controlled list of allowed values; here's the raw data of the registry, batteries not included).
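
As a rough illustration of the "loose" end of that spectrum, here is a minimal Python sketch of a shape-only check. The pattern and the name `is_loosely_valid` are my own assumptions, not anything from Jurism or the RFC text; it covers only language, script, region, and variant subtags, and it does not consult the registry, so well-formed nonsense passes.

```python
import re

# Shape-only check: language[-script][-region][-variant...].
# Extensions (e.g. "-t-") and private-use subtags are deliberately out of scope.
LOOSE_TAG = re.compile(
    r"^[A-Za-z]{2,3}"                                   # primary language, e.g. "ja"
    r"(?:-[A-Za-z]{4})?"                                # optional script, e.g. "Hira"
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                   # optional region, e.g. "CH"
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*$"  # optional variants, e.g. "alalc97"
)


def is_loosely_valid(tag: str) -> bool:
    """Return True if the tag has a plausible BCP 47 shape (no registry lookup)."""
    return LOOSE_TAG.match(tag) is not None
```

A tight validator would additionally check each subtag against the registry's list of allowed values.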

Translation may be into multiple languages also, so the same applies there.

Also, the language of (what I call) the "headline" field may differ from that of the item. In a scheme that pre-parses titles into main, sub, and short elements, there would be a design decision over whether to apply a headline-field language to the entire set, or to the individual elements separately. There would also be an issue over whether to require full-parallel translation/transliteration of all sub-fields within the structured title (i.e. whether to allow a French version of the main title without also requiring a French version of the subtitle, and so forth).

Whatever scheme is applied to a structured title should also be available on some text fields, and on creator fields with their different structure. A "CSL JSON" export from Jurism would show the structures I've come up with over there, for what that's worth.

@bwiernik
Member

@fbennett Can you post a Jurism CSL JSON export with some translated/transliterated fields or point to a test with some? I'm not familiar enough with the Jurism GUI to make such an item quickly.

Also @fbennett could you explain what the source>target structure for field languages conveys?

On the one hand, a single translated-title field accomplishes the needs of most citation styles and is easy for styles, applications, and processors to implement.

On the other hand, implementing translated and transliterated forms of all fields allows for more robust multilingual support and handling of things like native Japanese sorting by hiragana. If we did something like this, I would suggest we make specific translated and transliterated elements of relevant fields, with an indication of the language. These elements would be understood as translations/transliterations of the fields, not of the item content. Handling transliteration/translation across the board would be a much bigger lift in terms of applications, styles, and processors (e.g., we would likely need something like the CSLm cs:alternative to handle general rendering of field translations; the GUI for multilingual fields in Jurism is much more complex than the corresponding GUI in Zotero). I would suggest that we make these and other expanded multilingual features a discrete CSL module. A processor might choose to support those features or not and can explicitly declare so.

@bwiernik
Member

For Frank's question of language codes applied to sub-elements of a title, I think we could adopt a general inheritance mechanism. If a subfield lacks a language code, it inherits from the parent field; if a field lacks a language code, it inherits from the item.
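
A minimal Python sketch of that fallback chain, purely for illustration. The function name `resolve_language` and the idea that a subfield may be an object carrying its own `language` key are assumptions; nothing in the schema defines them yet.

```python
from typing import Optional


def resolve_language(item: dict, field: str, subfield: Optional[str] = None) -> Optional[str]:
    """Look for a language on the subfield, then the field, then the item."""
    field_value = item.get(field)
    if isinstance(field_value, dict):
        if subfield is not None:
            sub_value = field_value.get(subfield)
            # Assumption: a subfield is either a plain string or an object
            # that may carry its own "language" key.
            if isinstance(sub_value, dict) and "language" in sub_value:
                return sub_value["language"]
        if "language" in field_value:
            return field_value["language"]
    return item.get("language")


# Hypothetical item used only to exercise the three levels.
example_item = {
    "language": "en",
    "title": {
        "main": "A title",
        "sub": {"value": "ein Untertitel", "language": "de"},
    },
    "container-title": {"language": "fr", "main": "Un titre"},
}
```

With this data, the subtitle resolves to `de`, the main title falls back to the item's `en`, and the container title resolves to `fr`.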

@fbennett
Member

Can you post a Jurism CSL JSON export with some translated/transliterated fields or point to a test with some?

Here is the data behind this citation:

Ume Kenjirō (梅謙次郎), Commentary on the Civil Code [民法要義], 5 vols. (Tokyo: Yuhikaku Publications, 1898).

[
    {
        "type": "book",
        "multi": {
            "main": {
                "event-place": "ja",
                "publisher": "ja",
                "publisher-place": "ja",
                "title": "ja"
            },
            "_keys": {
                "event-place": {
                    "en": "Tokyo",
                    "ja-alalc97": "Tōkyō",
                    "ja-Hira": "とうきょう"
                },
                "publisher": {
                    "en": "Yuhikaku Publications",
                    "ja-alalc97": "Yūhikaku shobō",
                    "ja-Hira": "ゆうひかくしょぼう"
                },
                "publisher-place": {
                    "en": "Tokyo",
                    "ja-alalc97": "Tōkyō",
                    "ja-Hira": "とうきょう"
                },
                "title": {
                    "en": "Commentary on the Civil Code",
                    "ja-alalc97": "Minpō yōgi",
                    "ja-Hira": "みんぽうようぎ"
                }
            }
        },
        "event-place": "東京",
        "language": "ja",
        "number-of-volumes": "5",
        "publisher": "有斐閣書房",
        "publisher-place": "東京",
        "title": "民法要義",
        "author": [
            {
                "family": "",
                "given": "謙次郎",
                "multi": {
                    "_key": {
                        "en": {
                            "family": "Ume",
                            "given": "Kenjiro"
                        },
                        "ja-Hira": {
                            "family": "うめ",
                            "given": "けんじろう"
                        },
                        "ja-alalc97": {
                            "family": "Ume",
                            "given": "Kenjirō"
                        }
                    },
                    "main": "ja"
                }
            }
        ],
        "issued": {
            "date-parts": [
                [
                    "1898"
                ]
            ]
        }
    }
]
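
To make the layout of the export concrete, here is a small Python sketch that reads one variant of a plain text field out of the top-level `multi._keys` map and falls back to the headline value. The helper name `field_variant` is hypothetical and this is not how citeproc-js itself selects variants; the sample data is abbreviated from the export above.

```python
def field_variant(item: dict, field: str, tag: str) -> str:
    """Return the variant of `field` keyed by `tag`, else the headline value."""
    variants = item.get("multi", {}).get("_keys", {}).get(field, {})
    return variants.get(tag, item.get(field, ""))


# Abbreviated copy of the export above.
sample = {
    "type": "book",
    "multi": {
        "main": {"title": "ja"},
        "_keys": {
            "title": {
                "en": "Commentary on the Civil Code",
                "ja-alalc97": "Minpō yōgi",
                "ja-Hira": "みんぽうようぎ",
            }
        },
    },
    "language": "ja",
    "number-of-volumes": "5",
    "title": "民法要義",
}
```

Asking for a tag that has no entry (for example `fr`) simply returns the headline title.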

@fbennett
Member

fbennett commented Jul 30, 2020

could you explain what the source>target structure for field languages conveys?

Jurism recognizes a vector in the Language field:

ja>en
or
en<ja

In those variables, the language code is mapped to the (English) name of the respective languages.
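
For readers unfamiliar with the notation, a hedged Python sketch of splitting such a vector into source and target. The names `LanguageVector` and `parse_language_vector` are made up for this example, and the handling of a plain code without an arrow is an assumption.

```python
from typing import NamedTuple, Optional


class LanguageVector(NamedTuple):
    source: Optional[str]  # language translated from, if given
    target: str            # language of the item as cited


def parse_language_vector(value: str) -> LanguageVector:
    """Split "ja>en" or "en<ja" into (source, target); the arrow points at the target."""
    value = value.strip()
    if ">" in value:
        source, target = (part.strip() for part in value.split(">", 1))
        return LanguageVector(source, target)
    if "<" in value:
        target, source = (part.strip() for part in value.split("<", 1))
        return LanguageVector(source, target)
    # Assumption: a plain code is just the item language, with no source.
    return LanguageVector(None, value)
```

Both `ja>en` and `en<ja` yield source `ja` and target `en`.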

@bwiernik
Member

Thanks @fbennett

Jurism recognizes a vector in the Language field

So this is for rendering the language in citations? So to say “In English” or “Translated from Japanese”?

@fbennett
Member

Yes, the variables are available in citations. We've used them for translated legal documents in theses, where the original has been destroyed or is no longer available.

@denismaier
Member Author

Jurism recognizes a vector in the Language field

So this is for rendering the language in citations? So to say “In English” or “Translated from Japanese”?

If I remember correctly, this can also be used for conditional rendering based on the language of the current document. Like, your item is en>fr, which means it's a French translation of an English item. If you're now writing an article in English, you can choose to only render the information about the English original, but omit information about the translation into French, which you don't really need in an English article. But if you're writing a French article you'll want to include information about the original and the translation in your citations. @fbennett Is that correct?

@denismaier
Member Author

I would suggest that we make these and other expanded multilingual features a discrete CSL module.

That sounds like a good approach. Perhaps you could elaborate a bit more how you think this might work?

Some time ago, @cormacrelf envisioned introducing syntax for enabling/disabling certain features or sets of features: https://discourse.citationstyles.org/t/csl-1-2-planning/1476/7
Multilingual support could fit into this.

@denismaier
Member Author

@fbennett Why is it that you have two multi objects in your CSL JSON export, one under author, and the other as a top level object containing standard variables? What's the reason for this? Is this better than just having one multi object at the top? Or one multi object under each variable?

@denismaier
Member Author

On the other hand, implementing translated and transliterated forms of all fields allows for more robust multilingual support and handling of things like native Japanese sorting by hiragana. If we did something like this, I would suggest we make specific translated and transliterated elements of relevant fields, with an indication of the language.

We could either adopt the current CSLm JSON or simplify a bit to something like:

title: An English title
title--de: Ein englischer Titel

Something like this has been on the table anyway, see https://juris-m.readthedocs.io/en/latest/dev-sync-simplification.html

Handling transliteration/translation across the board would be a much bigger lift in terms of applications, styles, and processors (e.g., we would likely need something like the CSLm cs:alternative to handle general rendering of field translations; the GUI for multilingual fields in Jurism is much more complex than the corresponding GUI in Zotero).

Strictly speaking, cs:alternative is not necessary as you can use transliterations and translations in Jurism even with CSL 1.0.1 styles without adjusting the styles. This is currently done by processor directives:

>>===== LANGPARAMS =====>>
{
    "institutions": [
        "orig"
    ],
    "persons": [
        "orig"
    ],
    "titles": [
        "orig",
        "translat"
    ],
    "journals": [
        "orig"
    ],
    "places": [
        "orig"
    ],
    "publishers": [
        "orig"
    ]
}
<<===== LANGPARAMS =====<<

So, this will instruct citeproc-js to use the original variables for all types of variables, but for title variables it will also use the translated variant.
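
For anyone who wants to inspect such a directive programmatically, here is a small Python sketch that pulls the JSON out from between the markers. It only parses the block shown above; the function name `parse_langparams` is hypothetical, and how citeproc-js actually consumes these settings is not modelled here.

```python
import json
import re

# Matches the marker lines exactly as they appear in the directive above.
BLOCK = re.compile(
    r">>===== LANGPARAMS =====>>(.*?)<<===== LANGPARAMS =====<<",
    re.DOTALL,
)


def parse_langparams(text: str) -> dict:
    """Return the JSON object found between the LANGPARAMS markers."""
    match = BLOCK.search(text)
    if match is None:
        raise ValueError("no LANGPARAMS block found")
    return json.loads(match.group(1))


# Abbreviated version of the directive above.
directive = """
>>===== LANGPARAMS =====>>
{
    "persons": ["orig"],
    "titles": ["orig", "translat"]
}
<<===== LANGPARAMS =====<<
"""
params = parse_langparams(directive)
```

Here `params["titles"]` lists both the original and the translated form, while `params["persons"]` lists only the original.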

@bdarcus
Member

bdarcus commented Jul 30, 2020 via email

@bwiernik
Member

I don't think we should do full ML anytime soon; certainly not for 1.1.

Agreed. There are three factors I'm considering.

  1. A separate translated-title variable covers most citation needs and is clear and simple to implement—it's just another title variable and it can be called like any other variable. That simplicity has value.
  2. A translated slot in the title object would either require additional forms (form="translated", form="translated-short", form="translated-main", form="translated-sub") or perhaps a new attribute (translated="true"). The new attribute might be better (e.g., a test could be <if variable="title" translated="true">).
  3. If we did create the option for full ML at some point, then the title object option makes that easily extensible. The translated slot can then be an object with elements marked by their locale.

With these considerations, making translated a part of the title object might be the more future-proof option. I would suggest just translated, rather than also transliterated (in a full ML solution, both could be represented using different locale codes).

If we did move translated to the title object, I suggest we pull the separate variable from v1.0.2.

@bwiernik
Member

If I remember correctly, this can also be used for conditional rendering based on the language of the current document. Like, your item is en>fr, which means it's a French translation of an English item. If you're now writing an article in English, you can choose to only render the information about the English original, but omit information about the translation into French…

I don't think we should adopt this. With the @related structure, we provide a more formal and integrated way of referring to original item information. If you are citing a translation, you should always cite it as a translation. If you want to cite the original, then cite that as a separate item instead.

@bwiernik
Member

We can discuss multilingual data structures in another thread, but my inclination would be for all of this to occur at the field level. So, any field might be an object with value, language, and translated elements. The translated element would be an array whose elements each hold value and language elements. Subordinate elements without a language would inherit language from their parent. That would have 3 benefits:

  1. It would permit simple indication that a field is a different language than the item (e.g., an English article published in a German journal).
  2. It would jibe with https://juris-m.readthedocs.io/en/latest/dev-sync-simplification.html
  3. It would provide a consistent structure for providing translations of one or more fields for an item.
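
One way to picture that field-level structure is the following Python sketch. The class names `Variant` and `TextField`, the method `in_language`, and the journal name are all invented for illustration; only the three element names (value, language, translated) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Variant:
    value: str
    language: str


@dataclass
class TextField:
    value: str
    language: Optional[str] = None  # None: inherit from the item
    translated: List[Variant] = field(default_factory=list)

    def in_language(self, language: str) -> Optional[str]:
        """Return the variant tagged exactly `language`, if one was supplied."""
        for variant in self.translated:
            if variant.language == language:
                return variant.value
        return None


# A German journal title attached to an otherwise English item (made-up data).
journal = TextField(
    value="Zeitschrift für Beispiele",
    language="de",
    translated=[Variant("Journal of Examples", "en")],
)
```

The same shape would serve for any other text field, which is the consistency benefit listed above.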

@bdarcus
Member

bdarcus commented Jul 30, 2020

If we did move translated to the title object, I suggest we pull the separate variable from v1.0.2.

My impulse is we should do this. The only reason I think not to is if it presented some future barrier to fuller ML support.

@denismaier - thoughts?

We can discuss multilingual data structures in another thread ...

Maybe take this comment and turn it into an issue ("reference in new issue"), for future reference?

@bwiernik
Member

if it presented some future barrier to fuller ML support

I think it would be the opposite; doing it would make fuller ML support easier.

Maybe take this comment and turn it into an issue ("reference in new issue"), for future reference?

Cool! Didn't know that button existed.

@bdarcus
Member

bdarcus commented Jul 30, 2020

A translated slot in the title object would either require additional forms (form="translated", form="translated-short", form="translated-main", form="translated-sub") or perhaps a new attribute (translated="true"). The new attribute might be better (e.g., a test could be <if variable="title" translated="true">).

So are we talking a PR with this:

title:
  translated: foo
  main: bar

.. or this?

title:
  translated: 
    main: foo
  main: bar

I guess the latter?

And then remove the translated-title variables from v1.0.2, and finally add a new attribute to access them in styles.

Maybe, per @denismaier's initial impulse, we call it alternate or variant; or even, to be more specific, language-alternate?

That would give more future flexibility, should we possibly need it.

@bwiernik
Member

The second option. language-alternate sounds good.

@fbennett
Member

@fbennett Why is it that you have two multi objects in your CSL JSON export, one under author, and the other as a top level object containing standard variables? What's the reason for this? Is this better than just having one multi object at the top? Or one multi object under each variable?

The aim was (and is) to maintain compatibility with CSL-JSON to the extent possible. Ordinary fields are strings, so it's not possible to give them a sub-field without changing the data type. Creator variables are already objects, so a subfield can be added without changing the data type; and since creator fields are dynamic, it makes sense to tie the variants to each name instance. CSLm-JSON just reflects that structure, which keeps exports simple.

@denismaier
Member Author

Yes, we should add translated-title to the title object, rename accordingly, and remove from 1.0.2.
That's a good move.

language-alternate sounds good. Or what about language-alternative?

In terms of structure, it should mirror the standard structure of title variables, so:

title:
  main: A title
  sub: with a subtitle
  language-alternate: 
    main: An alternate title
    sub: with a subtitle

Such a structure would be extensible if need arises. We can add language variables, type variables to indicate if the alternate is a translation or a transliteration, and convert language-alternate to an object or an array, if we need more than one alternate title. (This won't be needed for most citation needs, but if CSL JSON should serve as an exchange format then that's a different story.)

@bwiernik
Member

I don’t think a type is necessary. That will be clear from the language code (as in the CSLm JSON example above).

@bdarcus
Member

bdarcus commented Jul 30, 2020

language-alternate sounds good. Or what about language-alternative?

I'm agnostic.

In terms of structure, it should mirror the standard structure of title variables

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

@denismaier
Member Author

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

Yes, it's broader. I just wanted to point out that it shouldn't just be a flat string, but have distinct properties for title parts.

@bwiernik
Member

bwiernik commented Jul 30, 2020

This attribute is broader, and its values would be things like "translated." So I don't think they need to mirror each other; do they?

No, I don't think the values should be "translated", etc. We should go one of three ways.

  1. A single "language-alternate" field, whose structure matches the structure of a title variable exactly. This would be analogous to having a separate translated-title variable.
  2. An object whose properties are language codes (e.g., "fr" or "de-CH" or "ja-Hira"), each containing a title variable object (without further language-alternate fields).
  3. An array whose elements are each a title variable object with a mandatory language field. The language field needs to be a language code identifying the language and writing system (e.g., "fr" or "de-CH" or "ja-Hira"). Whether it's a translation or transliteration is clear from the language code.

Of these, 1 and 3 are compatible with each other. We could do 1 now, but then easily add 3 as an option in a future version or in a multilingual extension.
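
To illustrate why 1 and 3 can coexist, here is a hedged Python sketch of a reader that accepts either shape and always hands back a list. The function name `normalize_alternates` is made up; the key name `language-alternate` follows the proposal in this thread and is not part of any released schema.

```python
from typing import Any, Dict, List


def normalize_alternates(title: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return language-alternate as a list, whichever shape was supplied."""
    alternates = title.get("language-alternate")
    if alternates is None:
        return []
    if isinstance(alternates, dict):
        # Option 1: a single title object.
        return [alternates]
    # Option 3: an array of title objects.
    return list(alternates)


option_1 = {"main": "A title", "language-alternate": {"main": "Ein Titel"}}
option_3 = {
    "main": "A title",
    "language-alternate": [{"language": "de", "main": "Ein Titel"}],
}
```

A consumer written this way would keep working if the field were later widened from a single object to an array.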

@bwiernik
Member

My thinking is that we should make a solution that flows easily into having multiple alternates for multilingual support (or even just picking a translation based on the document locale). We could even fairly easily do (3) in v1.1 without the expectation of full ML support by:

  1. Adding a language element to the title object
  2. Making language-alternate an array

That honestly might be the most straightforward approach.

@bdarcus
Member

bdarcus commented Jul 30, 2020

Whether it's a translation or transliteration is clear from the language code.

How so?

You mean by virtue of it being under a language-alternate property?

@denismaier
Member Author

Whether it's a translation or transliteration is clear from the language code.

How so?

You mean by virtue of it being under a language-alternate property?

I guess "translation or transliteration" means two different things in that sentence... Certain language codes refer to transliterations: e.g. he-alalc97 transliterated according to the Library of Congress Romanization rules.

@bwiernik
Member

The BCP 47/RFC 5646 scheme Frank linked to defines language codes unambiguously not only for languages/locales but also for scripts and the like. It's summarized here. The basic structure is language-script-region, with each part following defined patterns.

For example, if an item with language: ru contains a language-alternate element with language: ru-Latn, that means it's a romanized transliteration of the title. A language-alternate element with language: en would mean an English translation of the title.

@bwiernik
Member

Put generally, "different language" = translation, "same language, different script" = transliteration.
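
That rule of thumb can be sketched in Python as follows. The helper names are invented, the subtag parsing is deliberately crude (it treats any four-letter alphabetic subtag as a script and any longer one as a variant such as alalc97), and the "same" label for tags that differ only by region is my own assumption.

```python
from typing import FrozenSet, Optional, Tuple


def _parts(tag: str) -> Tuple[str, Optional[str], FrozenSet[str]]:
    """Split a tag into (primary language, script, variants); crude by design."""
    subtags = tag.lower().split("-")
    rest = subtags[1:]
    script = next((s for s in rest if len(s) == 4 and s.isalpha()), None)
    variants = frozenset(
        s for s in rest if len(s) >= 5 or (len(s) == 4 and s[0].isdigit())
    )
    return subtags[0], script, variants


def classify_alternate(item_language: str, alternate_language: str) -> str:
    """Different language: translation. Same language, other script: transliteration."""
    item_primary, item_script, item_variants = _parts(item_language)
    alt_primary, alt_script, alt_variants = _parts(alternate_language)
    if item_primary != alt_primary:
        return "translation"
    if item_script != alt_script or item_variants != alt_variants:
        return "transliteration"
    return "same"  # e.g. only the region differs
```

So `ru` against `ru-Latn` comes out as a transliteration, and `ru` against `en` as a translation.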

@denismaier
Member Author

So 3. would come down to this:

title:
  main: A title
  sub: with a subtitle
  language-alternate: 
    - lang: de-CH
      main: An alternate title
      sub: with a subtitle

@bdarcus
Member

bdarcus commented Jul 30, 2020

Transliteration is not the issue; I should have made clear I was asking about the translation part.

How do you distinguish the original and translated title?

@bwiernik
Member

Close, I was thinking this:

title:
  main: The title of the German item being cited
  sub: with a subtitle
  language: de
  language-alternate: 
    - language: en
      main: English translation of the title
      sub: translation of the subtitle

@bwiernik
Member

How do you distinguish the original and translated title?

The title is the actual title of the item being cited (what's printed on the book). Things listed under language-alternate are the translations of the title into other languages (as in the English translation of the German title above) or transliterations.

@denismaier
Member Author

Ok. Any serious reasons not to go with 3 now?

@bwiernik
Member

If your question is: "I am citing an English translation of a Spanish book. How do I refer to the original Spanish title?" That would be stored in @related under original.

@denismaier
Member Author

denismaier commented Jul 31, 2020

Close, I was thinking this:

title:
  main: The title of the German item being cited
  sub: with a subtitle
  language: de
  language-alternate: 
    - language: en
      main: English translation of the title
      sub: translation of the subtitle

I was thinking language would be an inheritable property, right? If so, this option would allow for something like this:

language: en
title:
  main: A title in English
  sub: with a subtitle
  language-alternate: 
    - language: de
      main: An alternate title in German
      sub: with a subtitle
container-title:
  language: fr
  main: A title in French

@denismaier
Member Author

denismaier commented Jul 31, 2020

My thinking is that we should make a solution that flows easily into having multiple alternates for multilingual support (or even just picking a translation based on the document locale). We could even fairly easily do (3) in v1.1 without the expectation of full ML support by:

  1. Adding a language element to the title object
  2. Making language-alternate an array

That honestly might be the most straightforward approach.

I really like that approach, and I think we should adopt this, unless there are serious drawbacks to this.
The good thing about this is that it would give us some flexibility for titles, and it would be easily extensible to other variables later on.

But, if we adopt this, we'll also have to figure out how these language alternates will be accessible in styles. <if variable="title" language-alternate="true"> and such could work for one language alternate, but what if it is an array? Maybe test for the language attribute, like <if variable="title" language-alternate="de">?

@bwiernik
Member

How about a simpler syntax: <if variable="translated-title">, which asks whether there is a language alternate in the same language as the bibliography?
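
A rough Python sketch of the lookup such a test would imply, assuming the array form of language-alternate discussed above. The function names are hypothetical, matching is done on the primary language subtag only, and none of this is specified behaviour.

```python
from typing import Optional


def primary(tag: str) -> str:
    """Return the lowercased primary language subtag ("en-US" gives "en")."""
    return tag.split("-")[0].lower()


def translated_title(title: dict, bibliography_language: str) -> Optional[dict]:
    """Return the alternate in the bibliography language, if there is one."""
    wanted = primary(bibliography_language)
    if wanted == primary(title.get("language", "")):
        return None  # the title is already in the bibliography language
    for alternate in title.get("language-alternate", []):
        if primary(alternate.get("language", "")) == wanted:
            return alternate
    return None


german_title = {
    "main": "The title of the German item being cited",
    "sub": "with a subtitle",
    "language": "de",
    "language-alternate": [
        {
            "language": "en",
            "main": "English translation of the title",
            "sub": "translation of the subtitle",
        }
    ],
}
```

The test would be true for an English bibliography and false for a French or German one.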

@denismaier
Member Author

How about a simpler syntax: <if variable="translated-title">, which asks whether there is a language alternate in the same language as the bibliography?

Can you switch between transliterations and translations with that approach?

@bwiernik
Member

We could have <if variable="transliterated-title"> too.

Transliterations are a more involved question. For example, publications in Latin-script languages often want to print the transliteration instead of the original script version of a title (e.g., APA calls for transliteration "if possible or advisable"). So, we might want to offer a style-level option to substitute transliterations if available.

@denismaier
Member Author

denismaier commented Jul 31, 2020

We could have <if variable="transliterated-title"> too.

Ok.

Transliterations are a more involved question. For example, publications in Latin-script languages often want to print the transliteration instead of the original script version of a title (e.g., APA calls for transliteration "if possible or advisable"). So, we might want to offer a style-level option to substitute transliterations if available.

Jurism currently lets you cite a combination of the title in the original script, a transliteration, and a translation. I was aiming for something similar in the other thread. Style-level attributes could work. (But I imagine a cs:multilingual element could be simpler and more flexible.)

@denismaier
Member Author

Implementation details aside: Do we have a consensus that translated-title should be removed from 1.0.2? @bdarcus @bwiernik Other opinions on this @fbennett @cormacrelf @PaulStanley @adam3smith

@HughP

HughP commented Jan 12, 2021

For example, if an item with language: ru contains a language-alternate element with language: ru-Latn, that means it's a romanized transliteration of the title. A language-alternate element with language: en would mean an English translation of the title.

@bwiernik did you by chance notice the BCP 47 -t- extension? It is for marking content as transformed, such as in transliterations. Here is the RFC: https://tools.ietf.org/html/rfc6497

@bwiernik
Member

bwiernik commented Jan 16, 2021

@HughP Okay, that's interesting. That could potentially save the need to have distinct translated and transliterated fields. Instead, we could just enumerate by locale/script.

Edit: Though thinking about it, we wouldn't necessarily need to use -t because the main item language/script is already in the data. Per the RFC:

The 't' extension is not intended for use in structured data that
   already provides separate source and target language identifiers.
   For example, this is the case in localization interchange formats
   such as XLIFF.  In such cases, it would be inappropriate to use
   "ja-t-it" for the target language tag because the source language tag
   "it" would already be present in the data.  Instead, one would use
   the language tag "ja".

But we could still rely on the locale definitions to indicate whether a field is a transliteration or translation.
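
For reference, this is roughly what reading a -t- tag involves, as a Python sketch. The function name `split_transform_tag` is made up and the parsing is simplified: it takes everything before the `t` singleton as the target, and everything after it up to the first field key (such as `m0`) or other singleton as the source.

```python
from typing import Optional, Tuple


def split_transform_tag(tag: str) -> Tuple[str, Optional[str]]:
    """Split "ja-t-it" into (target "ja", source "it"); source is None without -t-."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "t" not in lowered[1:]:
        return tag, None
    index = lowered.index("t", 1)
    target = "-".join(subtags[:index])
    source_parts = []
    for subtag in subtags[index + 1:]:
        is_field_key = len(subtag) == 2 and subtag[0].isalpha() and subtag[1].isdigit()
        if len(subtag) == 1 or is_field_key:
            break  # reached a transform field or another extension
        source_parts.append(subtag)
    return target, "-".join(source_parts) or None
```

With separate source and target already in the data, as the RFC passage above notes, only the target part would normally be stored.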

@HughP

HughP commented Jan 17, 2021

Edit: Though thinking about it, we wouldn't necessarily need to use -t because the main item language/script is already in the data. Per the RFC:

Yes, that may be true. I think the discussion on this thread was debating the architecture of a set of key-value pairs. In the proposals I see, there is a hierarchical distinction between source language and target language, so these functions can always be distinguished via the hierarchy. Though for human readability there might be some utility in using -t as part of the key name.
A second application of the -t device might be in CSL exports. I'm not sure what plugins for MS Word and other tools use where the data is exported as XML. In those contexts the XML attribute xml:lang="en-t-it" may be appropriate.
At any rate, I was much encouraged by the suggestion to use BCP 47, as many of the languages I cite in my references fall outside the ISO 639-1 code set and therefore rely on ISO 639-3, which uses three-letter codes. BCP 47 describes when to use ISO 639-1 and when to use ISO 639-3.

@denismaier
Member Author

Though thinking about it, we wouldn't necessarily need to use -t because the main item language/script is already in the data.

It might nevertheless be useful once we go beyond the current model where each item has exactly one main language for the item as a whole. E.g., title and container-title might have different main languages. (But, of course, there may be other ways to indicate this kind of relationship as well.)
