# OpenSubtitles Dataset

OpenSubtitles is a collection of multilingual parallel corpora compiled from a large database of movie and television subtitles contributed by users to the OpenSubtitles platform. It is distributed through the OPUS project: http://opus.nlpl.eu/OpenSubtitles. The OpenSubtitles2018 release spans 62 languages and 1,782 bitexts, with a total of 3,735,070 files, 22.10G tokens, and 3.35G sentence fragments. (An earlier release comprised 1,689 bitexts covering 2.6 billion sentences in 60 languages.)

The corpus was built by extracting text from multilingual subtitles of movies and TV series, covering a wide range of language pairs. Each bitext contains parallel sentences, i.e. an English sentence paired with the same sentence in another language, which makes the dataset a good resource for training neural translation models and, more broadly, for multilingual translation tasks that bridge gaps between languages. The corpus is also used to train and evaluate conversational response generation models, since segments of dialogue turns yield context-response pairs.

Example sentences, tokenized and lowercased as stored in the corpus:

    let 's go .
    stop monkey around !
    not something like that .
    who will play the perng mang ?
    who could that be except pai ?
    that 's his dream come true .

Two tools are useful here. The MiniXC/opensubtitles-dataloader package loads the OpenSubtitles v2018 dataset without having to load everything into memory at once and works well with PyTorch. The parse_opensubtitle_xml.py script downloads a zip containing the OpenSubtitles corpus in the specified languages and extracts the text from all the XML files (removing metadata) into JSONL:

    python3 parse_opensubtitle_xml.py
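The XML-to-JSONL extraction step can be sketched with the standard library alone. This is a minimal illustration, not the actual parse_opensubtitle_xml.py implementation; the `sentences_from_xml`/`to_jsonl` names and the inline sample document are hypothetical, though they roughly follow the OPUS subtitle XML layout, where each `<s>` element is one sentence, `<w>` children are word tokens, and `<time>` children are timing metadata.

```python
import json
import xml.etree.ElementTree as ET


def sentences_from_xml(xml_text):
    """Extract plain sentences from an OPUS-style subtitle XML document.

    Each <s> element holds one subtitle sentence; its <w> children are
    word tokens. Anything else (e.g. <time> elements) is dropped.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        words = [w.text for w in s.iter("w") if w.text]
        if words:
            sentences.append(" ".join(words))
    return sentences


def to_jsonl(xml_text):
    """Serialize the extracted sentences as JSONL, one record per line."""
    return "\n".join(json.dumps({"text": t}) for t in sentences_from_xml(xml_text))


# Hypothetical sample mimicking the tokenized OPUS subtitle format.
sample = """<document>
  <s id="1"><time id="T1S" value="00:00:01,000" /><w>let</w><w>'s</w><w>go</w><w>.</w></s>
  <s id="2"><w>stop</w><w>monkey</w><w>around</w><w>!</w></s>
</document>"""

print(to_jsonl(sample))
```

A real script would additionally walk the extracted zip and stream each XML file through this kind of function rather than reading everything up front.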
## Dataset Summary

This dataset contains parallel sentences for the language pairs distributed by OPUS. To load a language pair that isn't part of the predefined configurations, specify the two language codes as a pair; the valid pairs are listed in the Homepage section of the Dataset Description. The corpus can also serve as the basis for a conversational dataset of context-response pairs drawn from dialogue turn segments.

### Source Data

The raw data consists of a full database dump of the OpenSubtitles website, encompassing a total of 3.98 million subtitle files.
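As a sketch of how the context-response pairs mentioned above might be derived, the snippet below treats consecutive subtitle lines as dialogue turns and pairs each line with its preceding context. The exact turn segmentation and context windowing used in published conversational datasets varies; `context_response_pairs` and its `context_size` parameter are illustrative assumptions, not the corpus's official recipe.

```python
def context_response_pairs(turns, context_size=2):
    """Build (context, response) pairs from consecutive subtitle lines.

    Each line after the first becomes a response; the up-to-`context_size`
    preceding lines, joined in order, form its context.
    """
    pairs = []
    for i in range(1, len(turns)):
        context = " ".join(turns[max(0, i - context_size):i])
        pairs.append((context, turns[i]))
    return pairs


# Example sentences from the corpus, treated as consecutive dialogue turns.
turns = [
    "who will play the perng mang ?",
    "who could that be except pai ?",
    "that 's his dream come true .",
]

for context, response in context_response_pairs(turns):
    print(repr(context), "->", repr(response))
```

In practice one would also reset the context at file boundaries, since lines from different movies are unrelated.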