Making a Dictionary in Ulaanbaatar:
Corpus-based Lexicography with Limited Financial and
Technical Resources
Stefan Engelberg
(Institut für Deutsche Sprache & Universität Mannheim)
Stefan Engelberg (IDS Mannheim), Facultés universitaires Saint-Louis, Bruxelles, October 2007 [Folie 1]
CONTENT
1) Mongolia and its languages
2) Publishing dictionaries in Mongolia
3) The lexicographic workplace: Free corpus-linguistic resources
4) Improving bilingual dictionaries
5) Outlook

Mongolia – basic data
population: 2 951 000 (estimate 2007) = 1.9 / km²
capital: Ulaanbaatar (> 1 000 000 inhabitants, fast growing)
government: stable parliamentary democracy
economic basis: agriculture (sheep, cattle, …), mining (copper, gold, coal, …)
Gross National Income (World Bank, measuring GNI per person, 2004): Mongolia: $600 (rank: ca. 132/175) (lowest group)
Human Development Index (United Nations Development Programme; measuring rate of literacy and life expectancy, 2007): Mongolia: 0.691 (rank: 116/177) (medium group)
tertiary education: about 200 private "colleges" in Ulaanbaatar; 6 universities offer degrees that are acknowledged in Germany
Fischer Weltalmanach 2007. Frankfurt/M.: Fischer 2006.

Major languages in Mongolia
Eastern Mongolian languages:
Khalkha Mongolian: ca. 2 400 000 speakers
Kalmyk-Oirat: ca. 210 000 speakers
Buriat: ca. 70 000 speakers
Darkhat: ca. 32 000 speakers
Turkic languages:
Kazakh: ca. 200 000 speakers
Tuvin: ca. 30 000 speakers
Other languages:
Chinese: ca. 35 000 speakers
Russian: ca. 4 000 speakers
(numbers of speakers extrapolated from numbers in Ethnologue relative to recent population growth)
Ethnologue: http://www.ethnologue.com.

Eastern Mongolian languages
(Mongolian is considered a branch of the Altaic language family; the internal classification of the Mongolian languages is controversial; in addition there is one Western Mongolian language: Mogholi, ca. 200 speakers in Afghanistan)
Group | Languages | Sp. Mongolia | Sp. China | Sp. Russia
Mongolian proper | Khalkha Mongolian, Peripheral Mongolian | ca. 2 400 000 | ca. 3 400 000 | –
Buriat | Mongolian B., Chinese B., Russian B. | ca. 70 000 | ca. 70 000 | ca. 320 000
Oirat-Kalmyk-Darkhat | Darkhat, Kalmyk-Oirat | ca. 240 000 | ca. 140 000 | ca. 180 000
Mongour | Kangjia, Tu, Bonan, Dongxiang, East Yugur | – | ca. 500 000 | –
Dagur | Daur | – | ca. 100 000 | –
mutually intelligible (6 260 000 speakers)
Ethnologue: http://www.ethnologue.com.

Foreign languages
Russian: formerly the first foreign language; widespread among older Mongolians.
English: the first foreign language since 2005.
German: about 30 000 second-language speakers.
Chinese: not widespread.
Six universities offer degrees in German that are acknowledged at German universities.

Monsudar publishers
The publisher
• Company: Monsudar publishers as part of the Admon company (printing and publishing).
• Dictionary department: Monsudar dictionary department founded two years ago (head: Bayarsaikhan).
• Publishing plan: a series of bilingual dictionaries (Mongolian – English / German / Chinese / Korean) in cooperation with foreign partners (Oxford, Pons, …).

The dictionary department
• Staff: head of department and 2½ staff.
• German-Mongolian dictionary: the head of department, 1 staff member and about 20 part-time freelancers (university lecturers, interpreters, translators, travel guides) are currently working on the German-Mongolian dictionary.
• Equipment: PCs and Internet connection (very slow) available to staff and most freelancers.
• Reference works: a small collection of reference works available in the editorial office, among them an older German-Mongolian dictionary (Vietze 1981). (Acquisition of the 10-volume "Duden – Großes Wörterbuch der deutschen Sprache" is beyond financial resources.)

The German-Mongolian / Mongolian-German dictionary
• German partner: cooperation with Pons (Klett, Stuttgart).
• Dictionary basis: Pons provides the German part of the German-Mongolian dictionary (identical to the new Pons Deutsch-Englisches Kompaktwörterbuch).
• Procedure German-Mongolian: The German part of the Pons German-English dictionary has been used unaltered by the editorial staff; the Mongolian lexicographers merely add translations.
• Procedure Mongolian-German: The Mongolian side has been compiled on the basis of an older dictionary and a manual collection of neologisms (done by a Mongolist).

Main problems of the German-Mongolian / Mongolian-German dictionary
• Dictionary basis: The empirical dictionary basis was insufficient.
• Microstructure: The structure of the articles excluded the use of the dictionary as an active dictionary.
Measures taken:
• Corpus-linguistic foundation: development of a corpus-based lexicographic workplace.
• Training: training of staff and freelancers (connection between dictionary use and dictionary structure).

Corpus-based lexicographic workplace
Contents of the CLW
1) Information: Basic information material on (i) dictionary structure, (ii) the major functions of the software installed, (iii) the compilation of own corpora.
2) Corpus analysis software: Installation of corpus analysis software:
• AntConc
• KWICFinder
• Leipzig Corpus Browser
• Co-occurrence Database / COSMAS II (Institut für Deutsche Sprache, link)
3) Corpora: Collection of corpora:
• German newspaper corpus of the Leipzig Corpus Collection (15 million text words)
• English newspaper corpus of the Leipzig Corpus Collection (21 million text words)
• Monsudar Mongolian corpus (under construction)
• Corpus-based frequency lists of German words: (i) based on the IDS corpus collection (2000 million text words), (ii) based on the German LCC corpus

Corpus-based lexicographic workplace
Corpus analysis software
Four freely available sources for corpus analysis:
• Corpus analysis software I: AntConc
• Corpus analysis software II: Corpus Browser
• Corpus analysis software III: COSMAS II & CCDB
• Corpus analysis software IV: KWICFinder

Corpus analysis software I: AntConc
AntConc
• Developer: Laurence Anthony, Faculty of Science and Engineering, Waseda University, Japan.
• Version: 3.2.1w (Windows), released March 10th, 2007.
• Search: offline.
• Software: installed on a local computer.
• Access: free download.
• Corpora: own (txt-files).
• Languages: all (Unicode), e.g. German, English, Romanian, Mongolian.
• URL: http://www.antlab.sci.waseda.ac.jp/antconc_index.html.

Corpus analysis software I: AntConc
(I) concordances (KWICs)
(II) frequencies / word lists
(III) clusters
(IV) co-occurrences

Search: concordances for geloven in the Dutch corpus of the Leipzig Corpus Collection (newspapers).
Search term (here: geloven)
Sort (here: alphabetically according to the word on the right of the search term)

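The sorted concordance described on this slide can be imitated in a few lines of Python. This is only a minimal sketch and not how AntConc works internally; the function name `kwic`, the simple word tokenisation, the context width and the Dutch sample sentences are all assumptions made for illustration.

```python
import re

def kwic(text, term, width=3):
    """Return (left context, keyword, right context) triples for `term`,
    sorted alphabetically by the words to the right of the keyword."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    # Sort on the right-hand context, starting with the word after the keyword.
    return sorted(hits, key=lambda hit: hit[2])

# Invented sample text (assumption), standing in for a newspaper corpus.
sample = "Wij geloven in de toekomst. Velen geloven dat het kan. Zij geloven in wonderen."
for left, keyword, right in kwic(sample, "geloven"):
    print(f"{left:>25} [{keyword}] {right}")
```

Sorting on the right context groups the complements of the verb (here geloven dat vs. geloven in), which is what makes the sorted concordance useful for spotting constructions.
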
Search: frequency list of all word forms in part of the English corpus of the LCC (newspapers)
Start (no search term)
Sort (here: according to frequency)

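A frequency list of word forms like the one on this slide can be approximated with the Python standard library. This is a minimal sketch under simplifying assumptions: the function name `frequency_list` is hypothetical, tokens are lower-cased, and the sample text is invented.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count all word forms in `text` and return them sorted by frequency."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()

# Invented sample text (assumption).
sample_text = "The cat and the dog and the bird"
for word, count in frequency_list(sample_text):
    print(count, word)
```
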
Search: clusters out of 2 words ending in off in part of the English corpus of the LCC
Search term (here: off)
Search term position (here: on right)
Size of cluster (here: clusters out of two words)

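The cluster search on this slide (two-word clusters with the search term on the right) corresponds roughly to counting n-grams that end in the search term. The following is a minimal sketch, not AntConc's own procedure; the function name `clusters` and the sample sentence are assumptions.

```python
import re
from collections import Counter

def clusters(text, term, size=2):
    """Count clusters of `size` words whose last (rightmost) word is `term`."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i in range(size - 1, len(tokens)):
        if tokens[i] == term:
            counts[" ".join(tokens[i - size + 1:i + 1])] += 1
    return counts.most_common()

# Invented sample text (assumption).
sample_clusters = "He took off. She took off early and paid off the loan."
print(clusters(sample_clusters, "off"))
```
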
co-occurrence analysis – the basic idea
1) Assumption: In a certain corpus, word X occurs 1000 times, word Y 100 times, word Z 10 times.
2) Probability: The combination XY is ten times as likely as the combination XZ. XY should occur ten times as often as XZ.
3) Observation: Actually, XZ occurs about as often as XY.
4) Conclusion: There is a close linguistic connection between X and Z (close beyond expectation).

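The reasoning in the four steps above can be checked with a small calculation. The sketch below compares observed with expected co-occurrence counts under the simplifying assumption that words are distributed independently; it is not the significance measure used by any of the tools discussed here. The corpus size and the observed counts are invented for the example, and only the frequencies of X, Y and Z come from the slide.

```python
def expected_cooccurrences(freq_a, freq_b, corpus_size):
    """Expected number of co-occurrences if the two words were independent
    (simplified: one co-occurrence slot per corpus position)."""
    return freq_a * freq_b / corpus_size

def association(observed, freq_a, freq_b, corpus_size):
    """Ratio of observed to expected co-occurrences (> 1: more often than chance)."""
    return observed / expected_cooccurrences(freq_a, freq_b, corpus_size)

corpus_size = 1_000_000          # assumed corpus size
f_x, f_y, f_z = 1000, 100, 10    # frequencies from the slide
obs_xy = obs_xz = 20             # assumed: XZ observed about as often as XY

ratio_xy = association(obs_xy, f_x, f_y, corpus_size)
ratio_xz = association(obs_xz, f_x, f_z, corpus_size)
print("XY observed/expected:", ratio_xy)
print("XZ observed/expected:", ratio_xz)
```

With equal observed counts, the ratio for XZ comes out about ten times higher than for XY, which is the "close beyond expectation" connection of step 4.
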
Search: co-occurrences for just in part of the English corpus of the LCC.
Search term (here: just)
Definition of search context (here: up to 2 words after the search term)
Frequency condition (here: at least 10 tokens)
Sort (here: according to significance of co-occurrence)
List of co-occurrence partner words with rank, frequency, and significance measure

Corpus analysis software I: AntConc
• can be recommended with smaller corpora (up to 20 million text words)
• strengths: sorted concordances, word lists, cluster analyses, keyword analyses
• less useful for co-occurrence analyses (too slow; larger corpora are needed)

Corpus analysis software II: Corpus Browser
Corpus Browser
• Developer: Volker Boehlke (University of Leipzig).
• Version: 1.00 (Windows).
• Search: offline.
• Software: locally installed.
• Access: free download.
• Corpora: integrated into the program; own corpora can be created.
• Languages: 14 languages (see next slide).
• URL: http://corpora.informatik.uni-leipzig.de/download.html.

The corpus size is measured by the number of sentences included in the corpus.
When downloaded as plain text files, the corpora can also be used with AntConc.

Search term (here: vite)
Results (for word):
• absolute frequency
• frequency class
• corpus examples
• significant left and right neighbors
• co-occurrences

Corpus analysis software III: COSMAS II & CCDB
COSMAS II & CCDB
• Developer: Institut für Deutsche Sprache (CCDB: Cyril Belica).
• Version: 1.2.1.
• Search: online.
• Software: installed locally (client) or as web interface.
• Access: free download of the client (registration).
• Corpora: corpora of the IDS.
• Languages: German.
• URL: https://cosmas2.ids-mannheim.de

Loading corpora with COSMAS II

Example for search in COSMAS II
Looking for: dass-clauses as sentential subject with the verb helfen (‘to help’).
Assumption: Sentential subjects with helfen mainly occur within the construction <[…] es […] hilft, dass/daß>.
Search: (es /+w3 &helfen) /+w1 (dass oder daß)
Examples
T04: Der SPD hat es nicht geholfen, dass der Sympathieträger und
B99: Uns könne es nur helfen, dass wir so früh den Weg zu
B02: Vielleicht hat es Metzelder geholfen, dass die Kollegen seinen
E96: Da wird es auch nicht helfen, dass der Publikumsrat
E99: Mir hat es viel geholfen, dass ich Kabuki-Theater
N98: "Uns könnte es helfen, daß gleichzeitig Landtagswahl ist",
P93: Saddam Hussein könnte es helfen, daß Zulieferstaaten ... eine volle
P98: "Wenn es Saddam hilft, daß Unscom von Diplomaten
R99: Was kann es nun helfen, daß inzwischen 13 der 15

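For readers without access to COSMAS II, the logic of the search above can be imitated on plain text. The sketch below reads the query as: es, followed within three words by a form of helfen, followed directly by dass or daß. That reading, the function name `matches_es_helfen_dass` and the hand-written, incomplete list of helfen forms are assumptions; COSMAS II itself resolves &helfen through lemmatisation.

```python
import re

# Partial, hand-written list of inflected forms of helfen (assumption).
HELFEN_FORMS = {"helfen", "helfe", "hilfst", "hilft", "helft",
                "half", "halfen", "geholfen"}
DASS_FORMS = {"dass", "daß"}

def matches_es_helfen_dass(sentence):
    """True if 'es' is followed within 3 words by a form of helfen
    that is immediately followed by dass/daß (punctuation is ignored)."""
    tokens = [t.lower() for t in re.findall(r"\w+", sentence)]
    for i, tok in enumerate(tokens):
        if tok != "es":
            continue
        for j in range(i + 1, min(i + 4, len(tokens))):
            if (tokens[j] in HELFEN_FORMS
                    and j + 1 < len(tokens)
                    and tokens[j + 1] in DASS_FORMS):
                return True
    return False

print(matches_es_helfen_dass("Der SPD hat es nicht geholfen, dass der Sympathieträger und"))
print(matches_es_helfen_dass("Es regnet, dass es kracht"))
```
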
Application example II: co-occurrences for bestehen
Question: co-occurrences for bestehen (in particular governed prepositions).
Co-occurrence analysis for bestehen as part of the CCDB (setting: do not ignore function words)
Primary co-occurrence partner of bestehen (here: aus)
Strength of the connection (here: 40683)
Secondary co-occurrence partners of bestehen + aus, here: aus Mitgliedern / Teilen / Ortsteilen bestehen, ‘consist of members / parts / suburbs’
Typical syntagmatic patterns in which the words co-occur, e.g. besteht aus […] [zwei|drei] Teilen, ‘consists of […] [two|three] parts’

results (among others):
aus: besteht […] aus (‘consists of […]’)
besteht […] aus […] Mitgliedern (‘consists […] of […] members’)
darin: besteht […] darin, dass (‘is […] that’)
die Schwierigkeit […] besteht […] darin, dass (‘the difficulty […] is […] that’)
darauf: besteht […] darauf, dass (‘insists […] that’)
er bestand […] darauf, dass (‘he insisted […] that’)
worin: worin […] besteht
worin […] besteht der Unterschied zwischen (‘what […] is the difference between’)
• governed prepositions: auf, aus, in
• prepositions auf and in occur in particular with prepositional complement clauses
• preposition in often occurs in interrogative sentences

Corpus analysis software III: COSMAS II & CCDB
• probably the best co-occurrence analysis available; easy access via the co-occurrence database
• very extensive search language for corpora
• working with COSMAS II needs some training

Corpus analysis software IV: KWICFinder
KWICFinder Key Word in Context Research Tool and Concordancer for the Web
• Developer: William Fletcher.
• Version: 0.98.22 (beta version), 11 Dec. 2006 (Windows).
• Search: online.
• Software: locally installed.
• Access: free download.
• Corpora: WWW.
• Languages: ca. 20 languages on the basis of the Latin script.
• URL: http://www.kwicfinder.com/KWiCFinder.html.

Corpus analysis software IV: KWICFinder
• produces concordances on the basis of WWW pages
• search can be restricted to pages with particular titles or in particular domains
• can be used to find examples for colloquial language (chat rooms) or examples for special / technical language

Employing corpus analysis software in dictionary making
The application domains in detail:
• determining relevant meaning variants (by studying concordances and co-occurrence analyses)
• identifying collocations and other fixed expressions (by evaluating co-occurrence analyses)
• choosing examples and typical contexts of usage (by evaluating cluster analyses and co-occurrence analyses)
• examining the lemma list (by comparing the existing list with frequency lists and lists of keyword searches)
• Example I: Identification of meaning variants and contexts of use
• Example II: Exploration of collocations and fixed expressions
• Example III: Identification of special vocabulary

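The fourth application domain, examining the lemma list against a frequency list, amounts to a simple comparison of two lists. The sketch below is one possible way to do it, assuming the frequency list is already sorted by frequency. The function name `missing_lemmas` and the sample data are hypothetical, and the comparison is on lower-cased word forms, so inflected forms would have to be reduced to lemmas beforehand.

```python
def missing_lemmas(frequency_list, lemma_list, top_n=1000):
    """Return those of the top_n most frequent words that have no entry
    in the existing lemma list (case-insensitive comparison)."""
    known = {lemma.lower() for lemma in lemma_list}
    return [word for word, _count in frequency_list[:top_n]
            if word.lower() not in known]

# Invented sample data (assumption): (word, frequency) pairs and a lemma list.
sample_frequencies = [("haus", 950), ("handy", 400), ("jurte", 12)]
sample_lemmas = ["Haus", "Jurte"]
print(missing_lemmas(sample_frequencies, sample_lemmas))
```
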
Example I: Identification of meaning variants and contexts of use
Article for abziehen (literally: to pull off) in Vietze's (1981) German-Mongolian dictionary.

Structure of the article
Lemma: abziehen
Inflection: <32a>
Grammatical variants
1: tr
  Translations
    general: (‘pull off’?)
    specific
      1: Fell (‘coat/fur’): ‘skin’
      2: Flüssigkeit (‘liquid’): (‘bottle’?)
      3: Math (‘mathematics’): ‘subtract’
      4: Typ (‘typography’): ‘run off’
  Examples
      1: das Rasiermesser ~ (‘the straight blade razor’): ‘sharpen’
      2: Rinde ~ (‘bark’): ‘pull off’?
      3: den Schlüssel ~ (‘the key’): ‘take out’
2: intr
  Translations
    specific
      1: sich entfernen: ‘go away’
      2: sich zurückziehen: ‘withdraw’
  Examples
      1: unverrichteterdinge ~ (‘go away without achieving anything’)

Step 1: co-occurrence analysis for abziehen (CCDB); function words not considered.
meanings covered in Vietze
meanings not covered in Vietze
Truppen abziehen, ‘to withdraw troops’
unverrichteter Dinge wieder abziehen, ‘to go away without having achieved anything’
wurden zwei Punkte abgezogen, ‘two points were deducted’
eine Show abziehen, ‘to make a scene’
die Haut abziehen, ‘peel (fruit), skin (an animal)’
vom Einkommen abziehen, ‘to deduct from the income’
den Zündschlüssel abziehen, ‘to take out the ignition key’
aus 20 Metern abziehen, ‘to shoot (a ball) vigorously from 20 m distance’
Botschafter (aus …) abziehen, ‘to withdraw the ambassador (from …)’
Kapital (aus …) abziehen, ‘to withdraw capital (from …)’
den Rauch abziehen lassen, ‘to let the smoke escape’

Step 2: using KWICFinder to collect concordances reflecting colloquial German from the internet
Enter search term: abziehen
Search in pages that show "chat" in their title.

Results

more meanings not covered in Vietze
(1) Die Leute die mich kennen, wissen, daß ich eigentlich eine ganz Friedfertige und Versöhnliche bin. Aber was hier einige Leute abziehen ... echt therapiebedürftig!!!
‘[…] what some people are pulling off here […]’
abziehen – ‘to pull off something’ (coll.)
Was ziehst du hier ab?
‘What are you pulling off here?’
(2) Die Suppe mit Salz abschmecken, mit verquirltem Eigelb abziehen und die Spargelstückchen hineingeben.
‘[…] thicken the soup with beaten egg yolk […]’
abziehen – ‘to thicken’ (gastr.)
er zieht die Suppe mit Eigelb ab
‘he thickens the soup with egg yolk’

(3) ich finde auch den preis etwas niedrig und der ebayer hat auch nur 2 bewertungen,habe deshalb ihn gefragt,ob wir das geschäft über den treuhandservice abwickeln können.jetzt warte ich auf seine antwort.nicht das der mich abziehen will,nur weil vielleicht zu wenig für das board geboten wurde.nicht mein problem.
‘[…] that he wants to swindle me […]’
abziehen – ‘to swindle / cheat’ (coll.)
er versuchte mich abzuziehen
‘he tried to swindle me’
(4) Bieretiketten kann mein einfach von der Flasche abziehen.
‘[…] Beer labels can be easily pulled off the bottle […]’
abziehen – ‘to pull off’
sie zog das Etikett von der Bierflasche ab
‘she pulled the label off the beer bottle’

more meanings not covered in Vietze
meanings covered in Vietze
Relevant for a general bilingual dictionary
intr. ‘withdraw’
‘take out (key)’
tr. ‘withdraw (troops, ambassador)’
‘withdraw (capital)’
‘deduct (points)’
intr. ‘go away’
(coll.) ‘swindle, cheat’
(math.) ‘subtract’
Irrelevant for a general bilingual dict.
‘skin (coat/fur)’
(coll., neg.) ‘do (something)’
intr. ‘escape (of smoke)’
‘pull off (label)’
(coll.) ‘shoot vigorously’
‘deduct (something from income)’
‘sharpen (a straight blade razor)’
(typogr.) ‘run off’
(youth) ‘extort’
‘pull off (bark)’
(youth) ‘tear off and rob’
(gastr.) ‘thicken (soup)’

Example II: Exploration of collocations and fixed expressions
Article from the new Monsudar German-Mongolian dictionary (preliminary version).
20 Flaschen à 8 Euro, ‘20 bottles at 8 Euros each’

Concordances for à in a selection of 1 million text words of the German corpus within the LCC
Fixed expression à la, ‘after the fashion of’ (5 out of 10 hits)
Fixed expression peu à peu, ‘bit by bit’ (1 out of 10 hits)

Co-occurrence analysis on the basis of the German Reference Corpus (2 billion text words); COSMAS II web interface
la as the most significant co-occurrence partner of à (log likelihood ratio: 135300)
peu as the second most significant co-occurrence partner of à (log likelihood ratio: 15974)
Both collocations, à la and peu à peu, are missing in the dictionary.

Example III: Identification of special vocabulary
Task: The German part of the German-Mongolian dictionary is supposed to contain those words used in German that are specific to Mongolian culture and typically occur in German texts related to Mongolia: Jurte, Airag, …
Step 1: Google search for texts containing "Mongolei".
Step 2: Copy all texts that seem suitable at first sight into a txt-file.
Step 3: Loading the text corpus resulting from this procedure in AntConc.
Step 4: Loading a reference corpus (e.g., newspaper texts) under Tool Preferences / Keyword List.
Step 5: Starting compilation of a keyword list (i.e., words typical for the special corpus compared to the reference corpus).
Step 6: Manual evaluation of the keyword list yields Airag, Airak, Jurte, Nomade, Tugrik, Yak, Khan, Obertongesang, Chuuschuur, Pferdekopfgeige, Schamane, Milchschnaps, etc.

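The keyword-list idea behind steps 3 to 5 can be sketched outside AntConc as well. The example below scores each word of the special corpus with a log-likelihood keyness value against the reference corpus, a measure commonly used for keyword lists. It is an illustration under stated assumptions, not a description of AntConc's internals. The function names and the two tiny sample "corpora" are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def keywords(special_text, reference_text):
    """Rank words that are over-represented in the special corpus
    by their log-likelihood keyness against the reference corpus."""
    spec = Counter(tokenize(special_text))
    ref = Counter(tokenize(reference_text))
    n_spec, n_ref = sum(spec.values()), sum(ref.values())
    scored = []
    for word, a in spec.items():
        b = ref.get(word, 0)
        # Expected frequencies if the word were equally typical of both corpora.
        e_spec = n_spec * (a + b) / (n_spec + n_ref)
        e_ref = n_ref * (a + b) / (n_spec + n_ref)
        ll = 2 * (a * math.log(a / e_spec)
                  + (b * math.log(b / e_ref) if b else 0.0))
        if a / n_spec > b / n_ref:   # keep only words typical of the special corpus
            scored.append((word, ll))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Invented miniature corpora (assumption).
special_corpus = "die jurte steht in der steppe die jurte ist rund"
reference_corpus = "die zeitung steht in der stadt die stadt ist gross"
for word, score in keywords(special_corpus, reference_corpus):
    print(f"{score:6.2f}  {word}")
```

As in step 6, the resulting list still needs manual evaluation, since a high keyness score only shows that a word is typical of the special corpus, not that it is culturally specific.
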
Corpus use in lexicography
big, expensive corpus-linguistic solutions (large, annotated corpora, tailor-made analysis software): allow for empirically sound, scientific dictionaries
small, inexpensive corpus-linguistic solutions (small, unannotated, plain-text corpora, free analysis software): much, much better than "corpus-free" dictionaries

Copy of the slides (on Monday) under:
http://www.ids-mannheim.de/ll/lehre/engelberg/talks/talks.html
[email protected]