This is the support web page for the article “SEDES: Metrical Position in Greek Hexameter”. Here you can find instructions for reproducing the calculations in our article.
Last updated: .
It was made with the command:
git archive --prefix=sedes-dhq2023/ --output=sedes-dhq2023.zip dhq2023
You can clone from the bundle with the command:
git clone sedes-dhq2023.bundle sedes-dhq2023
This is what you would get after reproducing the source environment
and running make
.
Our source code is stored in the Git repository at https://github.com/sasansom/sedes. We have tried to ensure that all the results reported in the DHQ article reflect the source code as it existed at tag dhq2023 (commit b0136a4eb9d5bb2a2b8cb077ac49a7ecd7248e59). But checking out that past tag alone is not enough, because of dependencies: specifically cltk and its grc_models_cltk corpus, which is independently versioned. The dependencies notably affect lemmatization, so you will need to use the same versions we used in order to obtain the same results.
After cloning the source code repository, check out the dhq2023 tag.
$ git clone https://github.com/sasansom/sedes $ cd sedes sedes$ git checkout dhq2023
Follow the documented setup instructions, but specify particular versions for cltk and nltk:
sedes$ python3 -m venv venv sedes$ source venv/bin/activate sedes$ pip3 install -U pip setuptools wheel sedes$ pip3 install --force-reinstall cltk==v1.1.5 bs4 lxml
Install the grc_models_cltk corpus. The command below will install the latest version; later we will revert it to an older version if necessary.
sedes$ python3 -c 'from cltk.data.fetch import FetchCorpus; FetchCorpus("grc").import_corpus("grc_models_cltk")'
The proper version of grc_models_cltk is 94c04acac4405e264322d825978a2f2a80d01da5. You will have to manually check out an old version if CLTK has installed a newer one. By default, CLTK installs a shallow clone of the corpus, which you will first have to promote into a full clone.
$ cd ~/cltk_data/grc/model/grc_models_cltk ~/cltk_data/grc/model/grc_models_cltk$ git fetch --unshallow ~/cltk_data/grc/model/grc_models_cltk$ git checkout 94c04acac4405e264322d825978a2f2a80d01da5 Note: switching to '94c04acac4405e264322d825978a2f2a80d01da5'. ~/cltk_data/grc/model/grc_models_cltk$ git log -1 commit 94c04acac4405e264322d825978a2f2a80d01da5 (HEAD, origin/master, origin/HEAD, master) Merge: 3ded3be a325a05 Author: Kyle P. Johnson <kyle@kyle-p-johnson.com> Date: Fri Apr 30 07:49:41 2021 -0700 Merge pull request #4 from diyclassics/lemma-refactor Remove blank lemmas from greek_lemmatized_sents
Be aware the CLTK corpora are a system-wide shared resource, and checking out an old version for SEDES may affect other projects you have on the same computer that use CLTK.
Now you may proceed as normal;
i.e., run make -j4
to run the processing pipeline.
Here we show our reasoning and justification for some specific claims and how to reproduce our calculations. These use a mix of shell commands and R and Python scripts.
…we have used 12 TEI texts from the Perseus Project, totaling about 73,000 lines, with a minimum length of 479 lines and a maximum of 21,356.
$ (echo "work,lines"; for a in corpus/*.xml; do echo "$a,$(xmlstarlet sel -t -m '//l' -v '"l"' -n -t -m '//lb' -v '"lb"' -n "$a" | wc -l)"; done) > corpus.csv $ R > x <- read.csv("corpus.csv") > x work lines 1 corpus/aratus.xml 1155 2 corpus/argonautica.xml 5834 3 corpus/callimachushymns.xml 941 4 corpus/homerichymns.xml 2342 5 corpus/iliad.xml 15683 6 corpus/nonnusdionysiaca.xml 21356 7 corpus/odyssey.xml 12107 8 corpus/quintussmyrnaeus.xml 8801 9 corpus/shield.xml 479 10 corpus/theocritus.xml 2527 11 corpus/theogony.xml 1042 12 corpus/worksanddays.xml 831 > sum(x$lines) [1] 73098 > summary(x$lines) Min. 1st Qu. Median Mean 3rd Qu. Max. 479 1017 2434 6092 9628 21356
NB: the line counts you would get by counting distinct line numbers in the CSV files are slightly different (smaller):
$ R > library("tidyverse") > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character()))) > x %>% select(work, book_n, line_n) %>% unique() %>% nrow() [1] 72972 > x %>% select(work, book_n, line_n) %>% unique() %>% group_by(work) %>% summarize(n = n()) # A tibble: 12 x 2 work n * <chr> <int> 1 Argon. 5834 2 Callim.Hymn 940 3 Dion. 21259 4 Hom.Hymn 2342 5 Il. 15683 6 Od. 12107 7 Phaen. 1155 8 Q.S. 8800 9 Sh. 479 10 Theoc. 2500 11 Theog. 1042 12 W.D. 831
In our notes during the writing of the paper, we believed that the difference in line counts between the xmlstarlet command and what is output by tei2csv had to do with duplicate line numbers in the TEI. Further inspection after the fact revealed that duplicate line numbers were only a small part of it. The other causes were:
head
elements at the beginning of each book of Dion..
We probably should have just reported the tei2csv counts, but it does not make a big difference in the end.
Table 1. Works in the full SEDES corpus.
$ R > library("tidyverse") > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character()))) > x %>% group_by(work) %>% summarize(n = n()) # A tibble: 12 x 2 work n * <chr> <int> 1 Argon. 38841 2 Callim.Hymn 6480 3 Dion. 126876 4 Hom.Hymn 16020 5 Il. 111865 6 Od. 87185 7 Phaen. 7752 8 Q.S. 60098 9 Sh. 3298 10 Theoc. 18071 11 Theog. 7040 12 W.D. 5856
Running SEDES from start to finish on every work in our corpus (12 TEI files, 73,000 lines, and 490,000 words) on a 2019 MacBook Pro takes about one minute.
sedes$ make clean sedes$ make -j 4 $(find corpus/ -name '*.xml' | sed -e 's/.xml/.csv/') &>/dev/null 75.90s user 3.42s system 314% cpu 25.250 total sedes$ make -j 4 expectancy.all.csv &>/dev/null 4.82s user 0.26s system 99% cpu 5.098 total sedes$ make -j 4 &>/dev/null 98.33s user 3.71s system 386% cpu 26.376 total
If no lemma is found by any of these techniques, the last-resort fallback is to use the word itself as the lemma. The fallback occurs for about 2% of words in the corpus (7% of unique words).
$ R > library("tidyverse") > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character()))) > sum(is.na(x$lemma)) / nrow(x) * 100 [1] 1.713399 > u <- x %>% select(word, lemma) %>% unique() > sum(is.na(u$lemma)) / nrow(u) * 100 [1] 6.75545
There are 1,526 entries in the list of overrides, about 2.1% of the lines in the corpus.
sedes/src$ python3 -c 'import known; print(len(known.KNOWN_SCANSIONS))' 1526 sedes/src$ R > 1526 / 73098 * 100 [1] 2.087608
For example, in our corpus, the lemma βοῦς (“cow”) appears 448 times, while χέλυς (“tortoise”) appears only 8 times.
$ R > library("tidyverse") > library("stringr") > x <- bind_rows(map_dfr(Sys.glob("corpus/*.csv"), read_csv, col_types = cols(line_n = col_character(), book_n = col_character()))) > filter(x, lemma == stringi::stri_trans_nfd("βοῦς")) %>% nrow() [1] 448 > filter(x, lemma == stringi::stri_trans_nfd("χέλυς")) %>% nrow() [1] 8
Figure 2. Histogram of z‑scores for all words across our entire corpus. This chart excludes about 33,000 words with undefined z‑scores. Over 95% of z‑scores lie in the interval [−1.75, +1.75], though the tail of negative values extends as far as −11.5.
sedes$ src/join-expectancy corpus/*.csv expectancy.all.csv > joined.all.csv sedes$ R > library("tidyverse") > data <- read_csv("joined.all.csv", col_types = cols(book_n = col_character())) > nrow(filter(data, abs(z) <= 1.75)) / nrow(filter(data, !is.na(z))) [1] 0.9609975 > summary(data$z) Min. 1st Qu. Median Mean 3rd Qu. Max. NA's -11.489 -0.830 0.219 0.000 0.830 1.796 27613 > summary(replace_na(data, list(z = 0.0))$z) Min. 1st Qu. Median Mean 3rd Qu. Max. -11.48912 -0.75853 0.08732 0.00000 0.80112 1.79559
First, consider the lemma μορφή (“shape; beauty”).
To find candidates for this example:
sedes$ R > library("tidyverse") > data <- read_csv("expectancy.all.csv", col_types = cols(x = col_integer(), sedes = col_factor())) > print(data %>% group_by(lemma) %>% filter(!is.na(z) & n() > 3 & max(x) / sum(x) > 0.90), n = 100) # A tibble: 67 x 4 # Groups: lemma [16] lemma sedes x z <chr> <fct> <int> <dbl> 1 ἀνάγκη 2.5 2 -3.72 2 ἀνάγκη 6.5 6 -3.59 3 ἀνάγκη 10.5 119 0.275 4 ἀνάγκη 12 1 -3.76 5 ἀοιδή 1 1 -3.51 6 ἀοιδή 2.5 5 -3.41 7 ἀοιδή 6.5 3 -3.46 8 ἀοιδή 7 1 -3.51 9 ἀοιδή 10.5 143 0.290 10 ἀοιδή 11 2 -3.49 11 αὔρα 1 6 -3.20 12 αὔρα 5 1 -3.42 13 αὔρα 6 1 -3.42 14 αὔρα 11 85 0.307 15 δηιοτής 1 2 -3.18 16 δηιοτής 3 5 -3.04 17 δηιοτής 8 1 -3.23 18 δηιοτής 9 77 0.322 19 ἦμος 1 76 0.229 20 ἦμος 2 1 -4.39 21 ἦμος 3 1 -4.39 22 ἦμος 9 2 -4.33 23 ἰάλλω 2.5 1 -3.77 24 ἰάλλω 4.5 4 -3.62 25 ἰάλλω 6.5 1 -3.77 26 ἰάλλω 10.5 81 0.272 27 κεραυνός 2.5 1 -5.63 28 κεραυνός 4.5 2 -5.60 29 κεραυνός 6.5 2 -5.60 30 κεραυνός 10.5 157 0.178 31 λιλαίομαι 2.5 4 -3.38 32 λιλαίομαι 6 1 -3.55 33 λιλαίομαι 6.5 71 0.291 34 λιλαίομαι 8 1 -3.55 35 μορφή 1 8 -3.01 36 μορφή 2 2 -3.15 37 μορφή 4 5 -3.08 38 μορφή 6 1 -3.18 39 μορφή 11 150 0.327 40 ὀδούς 2.5 3 -3.28 41 ὀδούς 4.5 3 -3.28 42 ὀδούς 8.5 1 -3.38 43 ὀδούς 10.5 76 0.303 44 ὀνομάζω 5 1 -3.54 45 ὀνομάζω 6 2 -3.46 46 ὀνομάζω 8 1 -3.54 47 ὀνομάζω 10 49 0.286 48 ῥᾴδιος 1 55 0.301 49 ῥᾴδιος 3 3 -3.26 50 ῥᾴδιος 4 1 -3.40 51 ῥᾴδιος 9 1 -3.40 52 σχέτλιος 1 56 0.327 53 σχέτλιος 3 1 -3.18 54 σχέτλιος 4 1 -3.18 55 σχέτλιος 9 4 -2.99 56 τόφρα 1 98 0.267 57 τόφρα 5 1 -3.86 58 τόφρα 6 1 -3.86 59 τόφρα 9 5 -3.69 60 τοὔνεκα 1 70 0.316 61 τοὔνεκα 3 2 -3.21 62 τοὔνεκα 5 1 -3.27 63 τοὔνεκα 9 4 -3.11 64 φαρέτρα 4 1 -4.16 65 φαρέτρα 6 1 -4.16 66 φαρέτρα 6.5 2 -4.09 67 φαρέτρα 10.5 68 0.243
ἀοιδή would also work for this example, but it would be one additional row and its greatest x is not the last listed sedes.
Next, consider the lemma δένδρεον (“tree”).
To find candidates for this example:
sedes$ R > library("tidyverse") > data <- read_csv("expectancy.all.csv", col_types = cols(x = col_integer(), sedes = col_factor())) > print(data %>% group_by(lemma) %>% arrange(z) %>% filter(sum(x) > 20 & n() < 7 & !is.na(z) & sum(z[1:n()-1] >= 0) == 0 & abs(min(z)) < 0.9 * abs(max(z))) %>% arrange(lemma, sedes), n = 100) # A tibble: 32 x 4 # Groups: lemma [8] lemma sedes x z <chr> <fct> <int> <dbl> 1 αἰγιαλός 1 9 -0.546 2 αἰγιαλός 3 9 -0.546 3 αἰγιαλός 7 13 1.39 4 αἰγιαλός 9 8 -1.03 5 δέκατος 4 12 1.33 6 δέκατος 6 8 -0.445 7 δέκατος 2 7 -0.889 8 δέκατος 8 7 -0.889 9 δένδρεον 1 19 1.30 10 δένδρεον 3 10 -1.02 11 δένδρεον 7 12 -0.506 12 δένδρεον 9 11 -0.764 13 ζάθεος 4 24 1.11 14 ζάθεος 10 7 -0.965 15 ζάθεος 6 7 -0.965 16 ζάθεος 2 9 -0.720 17 ζάθεος 8 7 -0.965 18 ἱμερόεις 1 19 -0.981 19 ἱμερόεις 3 50 1.09 20 ἱμερόεις 7 20 -0.914 21 ἱμερόεις 9 21 -0.847 22 κτέαρ 4 7 -0.845 23 κτέαρ 10 7 -0.845 24 κτέαρ 8 10 1.18 25 οἶστρος 1 13 -0.563 26 οἶστρος 3 11 -0.953 27 οἶστρος 9 15 -0.173 28 οἶστρος 11 23 1.39 29 οἶστρος 5 10 -1.15 30 χλοερός 4 7 -0.756 31 χλοερός 6 8 1.32 32 χλοερός 2 7 -0.756
Large negative z‑scores are only possible with frequently occurring lemmata. To reach a z‑score as low as −2, there must be at least 5 total instances of a lemma; for −5 there must be at least 26; and for −10 there must be at least 101.
sedes$ R > sd_pop <- function(x) { sd(x) * sqrt((length(x) - 1) / length(x)) } > do.call("rbind", lapply(1:101, function(v) with(list(x = c(1, v-1)), (x - mean(rep(x, x))) / sd_pop(rep(x, x))))) [,1] [,2] [1,] NA NA [2,] NaN NaN [3,] -1.414214 0.7071068 [4,] -1.732051 0.5773503 [5,] -2.000000 0.5000000 [6,] -2.236068 0.4472136 [7,] -2.449490 0.4082483 [8,] -2.645751 0.3779645 [9,] -2.828427 0.3535534 [10,] -3.000000 0.3333333 [11,] -3.162278 0.3162278 [12,] -3.316625 0.3015113 [13,] -3.464102 0.2886751 [14,] -3.605551 0.2773501 [15,] -3.741657 0.2672612 [16,] -3.872983 0.2581989 [17,] -4.000000 0.2500000 [18,] -4.123106 0.2425356 [19,] -4.242641 0.2357023 [20,] -4.358899 0.2294157 [21,] -4.472136 0.2236068 [22,] -4.582576 0.2182179 [23,] -4.690416 0.2132007 [24,] -4.795832 0.2085144 [25,] -4.898979 0.2041241 [26,] -5.000000 0.2000000 [27,] -5.099020 0.1961161 [28,] -5.196152 0.1924501 [29,] -5.291503 0.1889822 [30,] -5.385165 0.1856953 [31,] -5.477226 0.1825742 [32,] -5.567764 0.1796053 [33,] -5.656854 0.1767767 [34,] -5.744563 0.1740777 [35,] -5.830952 0.1714986 [36,] -5.916080 0.1690309 [37,] -6.000000 0.1666667 [38,] -6.082763 0.1643990 [39,] -6.164414 0.1622214 [40,] -6.244998 0.1601282 [41,] -6.324555 0.1581139 [42,] -6.403124 0.1561738 [43,] -6.480741 0.1543033 [44,] -6.557439 0.1524986 [45,] -6.633250 0.1507557 [46,] -6.708204 0.1490712 [47,] -6.782330 0.1474420 [48,] -6.855655 0.1458650 [49,] -6.928203 0.1443376 [50,] -7.000000 0.1428571 [51,] -7.071068 0.1414214 [52,] -7.141428 0.1400280 [53,] -7.211103 0.1386750 [54,] -7.280110 0.1373606 [55,] -7.348469 0.1360828 [56,] -7.416198 0.1348400 [57,] -7.483315 0.1336306 [58,] -7.549834 0.1324532 [59,] -7.615773 0.1313064 [60,] -7.681146 0.1301889 [61,] -7.745967 0.1290994 [62,] -7.810250 0.1280369 [63,] -7.874008 0.1270001 [64,] -7.937254 0.1259882 [65,] -8.000000 0.1250000 [66,] -8.062258 0.1240347 [67,] -8.124038 0.1230915 [68,] -8.185353 0.1221694 [69,] -8.246211 0.1212678 [70,] -8.306624 0.1203859 [71,] -8.366600 0.1195229 [72,] -8.426150 0.1186782 [73,] -8.485281 0.1178511 [74,] -8.544004 0.1170411 [75,] -8.602325 0.1162476 [76,] -8.660254 0.1154701 [77,] -8.717798 0.1147079 [78,] -8.774964 0.1139606 [79,] -8.831761 0.1132277 [80,] -8.888194 0.1125088 [81,] -8.944272 0.1118034 [82,] -9.000000 0.1111111 [83,] -9.055385 0.1104315 [84,] -9.110434 0.1097643 [85,] -9.165151 0.1091089 [86,] -9.219544 0.1084652 [87,] -9.273618 0.1078328 [88,] -9.327379 0.1072113 [89,] -9.380832 0.1066004 [90,] -9.433981 0.1059998 [91,] -9.486833 0.1054093 [92,] -9.539392 0.1048285 [93,] -9.591663 0.1042572 [94,] -9.643651 0.1036952 [95,] -9.695360 0.1031421 [96,] -9.746794 0.1025978 [97,] -9.797959 0.1020621 [98,] -9.848858 0.1015346 [99,] -9.899495 0.1010153 [100,] -9.949874 0.1005038 [101,] -10.000000 0.1000000