API

word_vectors

Read, write, and convert between different word vector serialization formats.

class word_vectors.FileType(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]

An enumeration of the supported word vector file types.

GLOVE = 'glove'

The format used by GloVe. See read_glove() for a description of the file format and common pre-trained embeddings that use this format.

W2V_TEXT = 'w2v-text'

The text format introduced by Word2Vec. See read_w2v_text() for a description of the file format and common pre-trained embeddings that use this format.

W2V = 'w2v'

The binary format used by Word2Vec and pre-trained GoogleNews vectors. See read_w2v() for a description of the file format and common pre-trained embeddings that use this format.

LEADER = 'leader'

Our new Leader file format. See read_leader() for a description of the file format.

FASTTEXT = 'w2v-text'

The file format used to distribute FastText vectors; it is simply the word2vec text format. See read_w2v_text() for a description of the file format.

NUMBERBATCH = 'w2v-text'

The file format used to distribute Numberbatch vectors; it is simply the word2vec text format. See read_w2v_text() for a description of the file format.

classmethod from_string(value)[source]

Convert a string into the Enum value.

Parameters:

value (str) – The string specifying the file type.

Returns:

The Enum value parsed from the string.

Raises:

ValueError – If the string cannot be parsed into an Enum value.

Return type:

FileType
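A minimal sketch of how an enum like this can behave (not the library's actual implementation; the underscore/hyphen normalization in from_string is an assumption). Note how FASTTEXT and NUMBERBATCH, sharing the value 'w2v-text', automatically become aliases of W2V_TEXT:

```python
from enum import Enum

class FileType(Enum):
    # Sketch mirroring the documented members; duplicate values
    # (FASTTEXT, NUMBERBATCH) become aliases of W2V_TEXT automatically.
    GLOVE = "glove"
    W2V_TEXT = "w2v-text"
    W2V = "w2v"
    LEADER = "leader"
    FASTTEXT = "w2v-text"
    NUMBERBATCH = "w2v-text"

    @classmethod
    def from_string(cls, value):
        # Normalize the string and look it up among the member values.
        normalized = value.lower().replace("_", "-")
        for member in cls:
            if member.value == normalized:
                return member
        raise ValueError(f"Unable to parse {value!r} into a FileType")
```

With aliasing, `FileType.FASTTEXT is FileType.W2V_TEXT` holds, so code dispatching on the file type needs no special cases for FastText or Numberbatch files.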

word_vectors.INT_SIZE = 4

The size of an int32 in bytes used when reading binary files.

word_vectors.FLOAT_SIZE = 4

The size of a float32 in bytes when reading a binary file.

word_vectors.LONG_SIZE = 8

The size of an int64 in bytes when reading binary files.

word_vectors.LEADER_HEADER = 3

The number of elements in the Leader format header.

word_vectors.LEADER_MAGIC_NUMBER = 38941

A magic number used to identify a Leader format file.

word_vectors.read

Read word vectors from a file.

We provide a main read() function for reading vectors from a file. The serialization format can be explicitly provided by passing a FileType or automatically inferred using sniff(). There are also several convenience functions for reading from specific formats.

word_vectors.read.read(f, file_type=None)[source]

Read vectors from a file.

This function can dispatch to one of the following word vector format readers: read_glove(), read_w2v_text(), read_w2v(), or read_leader().

Check the documentation of a specific reader to see a description of the file format as well as common pre-trained vectors that ship with this format.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Note

Without a specified file type this function uses word_vectors.read.sniff() to determine the word vector format and dispatches to the appropriate reader.

I haven’t seen a sniffing failure, but if your file type can’t be determined you can pass file_type explicitly or call the specific reading function yourself.

Parameters:
  • f (str | TextIO | BinaryIO) – The file to read from.

  • file_type (FileType | None) – The vector file format. If None the file is sniffed to determine format.

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]
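The returned (vocab, vectors) pair can be illustrated with hand-built data (the words and values below are made up, not output of read()):

```python
import numpy as np

# The documented contract: the vocab maps each word to its row index
# in the vectors matrix of shape [vocab size, vector size].
vocab = {"the": 0, "dog": 1, "barks": 2}
vectors = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=np.float32)

# Looking up a word's vector is a row index into the matrix.
dog_vector = vectors[vocab["dog"]]
```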

word_vectors.read.read_with_vocab(f, user_vocab, initializer=<function uniform_initializer.<locals>._unif_initializer>, keep_extra=False, file_type=None)[source]

Read vectors from a file subject to user-provided vocabulary constraints.

This function can dispatch to one of the following word vector format readers: read_glove_with_vocab(), read_w2v_text_with_vocab(), read_w2v_with_vocab(), or read_leader_with_vocab().

Check the documentation of a specific reader to see a description of the file format as well as common pre-trained vectors that ship with this format.

When provided a vocabulary this function will not reorder it. If you pass in that the word dog is index 12 then in the resulting vocabulary it will still be index 12.

When collecting extra vocabulary (words that are in the pre-trained embeddings but not in the user vocab) these will all be at the end of the vocabulary. Again the indices of user provided words will not change.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Note

Without a specified file type this function uses word_vectors.read.sniff() to determine the word vector format and dispatches to the appropriate reader.

I haven’t seen a sniffing failure, but if your file type can’t be determined you can pass file_type explicitly or call the specific reading function yourself.

Parameters:
  • f (str | IO) – The file to read from.

  • user_vocab (Dict[str, int]) – A specific vocabulary the user wants to extract from the pre-trained embeddings.

  • initializer (Callable[[int], ndarray]) – A function that takes the vector size and generates a new vector. This is used to generate a representation for a word in the user vocab that is not in the pre-trained embeddings.

  • keep_extra (bool) – Should you also include vectors that are in the pre-trained embedding but not in the user provided vocab?

  • file_type (FileType | None) – The vector file format. If None the file is sniffed to determine format.

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]
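The index-preservation and keep_extra semantics described above can be sketched as follows. This is not the library's implementation, just a minimal illustration that assumes the user vocab's indices are dense (0..n-1):

```python
import numpy as np

def read_with_vocab_sketch(pretrained, user_vocab, initializer, keep_extra=False):
    """Sketch: user indices are never reordered, missing words are
    initialized, and extra pre-trained words go at the end."""
    vector_size = len(next(iter(pretrained.values())))
    vocab = dict(user_vocab)
    rows = [None] * len(vocab)
    for word, idx in user_vocab.items():
        # Use the pre-trained vector when present, else initialize a new one.
        vec = pretrained.get(word, initializer(vector_size))
        rows[idx] = np.asarray(vec, dtype=np.float32)
    if keep_extra:
        # Extra words are appended after all user-provided indices.
        for word, vec in pretrained.items():
            if word not in vocab:
                vocab[word] = len(rows)
                rows.append(np.asarray(vec, dtype=np.float32))
    return vocab, np.stack(rows)

pretrained = {"dog": [1.0, 2.0], "cat": [3.0, 4.0]}
vocab, vectors = read_with_vocab_sketch(
    pretrained, {"dog": 1, "bird": 0}, lambda n: np.zeros(n), keep_extra=True
)
```

Here "dog" keeps its user-assigned index 1, "bird" (missing from the pre-trained set) gets an initialized vector at index 0, and the extra word "cat" is appended at index 2.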

word_vectors.read.read_glove(f)[source]

Read vectors from a glove file.

The GloVe format is a pure text format. Each (word, vector) pair is represented by a single line in the file. The line starts with the word, a space, and then the float32 text representations of the elements in the vector associated with that word. These vector elements are also separated by spaces.

The main vectors distributed in this format are the GloVe vectors (Pennington et al., 2014).
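The line layout can be sketched with a minimal parser (an illustration of the format, not the library's reader; note the first-occurrence-wins rule for duplicates):

```python
import io
import numpy as np

def parse_glove(f):
    # Each line is "word elem1 elem2 ..." with space-separated values.
    vocab, rows = {}, []
    for line in f:
        word, *elems = line.rstrip("\n").split(" ")
        if word in vocab:  # first occurrence of a duplicate word wins
            continue
        vocab[word] = len(rows)
        rows.append(np.array(elems, dtype=np.float32))
    return vocab, np.stack(rows)

vocab, vectors = parse_glove(io.StringIO("dog 0.1 0.2\ncat 0.3 0.4\n"))
```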

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:

f (str | TextIO) – The file to read from

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_glove_with_vocab(f, user_vocab, initializer=<function uniform_initializer.<locals>._unif_initializer>, keep_extra=False)[source]

Read vectors from a glove file subject to user vocabulary constraints.

See read_glove() for a description of the file format and common pre-trained embeddings that use this format.

When provided a vocabulary this function will not reorder it. If you pass in that the word dog is index 12 then in the resulting vocabulary it will still be index 12.

When collecting extra vocabulary (words that are in the pre-trained embeddings but not in the user vocab) these will all be at the end of the vocabulary. Again the indices of user provided words will not change.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:
  • f (str | TextIO) – The file to read from.

  • user_vocab (Dict[str, int]) – A specific vocabulary the user wants to extract from the pre-trained embeddings.

  • initializer (Callable[[int], ndarray]) – A function that takes the vector size and generates a new vector. This is used to generate a representation for a word in the user vocab that is not in the pre-trained embeddings.

  • keep_extra (bool) – Should you also include vectors that are in the pre-trained embedding but not in the user provided vocab?

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_w2v_text(f)[source]

Read vectors from a text based w2v file.

One of two different vector serialization formats introduced in the word2vec software (Mikolov et al., 2013).

The word2vec text format is a pure text format. The first line is two integers, represented as text and separated by a space, that specify the number of types in the vocabulary and the size of the word vectors respectively. Each following line represents a (word, vector) pair. The line starts with the word, a space, and then the float32 text representations of the elements in the vector associated with that word. These vector elements are also separated by spaces.

One can see that this is actually the same as the GloVe format, except that GloVe removed the header line.

The main embeddings distributed in this format are FastText (Bojanowski et al., 2017) and NumberBatch (Speer et al., 2017).
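A sketch of parsing this layout (an illustration, not the library's reader): the header lets a reader pre-allocate the matrix before reading the GloVe-style lines that follow.

```python
import io
import numpy as np

def parse_w2v_text(f):
    # Header: "<vocab size> <vector size>", then GloVe-style lines.
    vocab_size, vector_size = map(int, f.readline().split())
    vocab = {}
    vectors = np.empty((vocab_size, vector_size), dtype=np.float32)
    for i, line in enumerate(f):
        word, *elems = line.rstrip("\n").split(" ")
        vocab[word] = i
        vectors[i] = np.array(elems, dtype=np.float32)
    return vocab, vectors

vocab, vectors = parse_w2v_text(io.StringIO("2 3\ndog 1 2 3\ncat 4 5 6\n"))
```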

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:

f (str | TextIO) – The file to read from

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_w2v_text_with_vocab(f, user_vocab, initializer=<function uniform_initializer.<locals>._unif_initializer>, keep_extra=False)[source]

Read vectors from a Word2Vec text file subject to user vocabulary constraints.

See read_w2v_text() for a description of the file format and common pre-trained embeddings that use this format.

When provided a vocabulary this function will not reorder it. If you pass in that the word dog is index 12 then in the resulting vocabulary it will still be index 12.

When collecting extra vocabulary (words that are in the pre-trained embeddings but not in the user vocab) these will all be at the end of the vocabulary. Again the indices of user provided words will not change.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:
  • f (str | TextIO) – The file to read from.

  • user_vocab (Dict[str, int]) – A specific vocabulary the user wants to extract from the pre-trained embeddings.

  • initializer (Callable[[int], ndarray]) – A function that takes the vector size and generates a new vector. This is used to generate a representation for a word in the user vocab that is not in the pre-trained embeddings.

  • keep_extra (bool) – Should you also include vectors that are in the pre-trained embedding but not in the user provided vocab?

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_w2v(f)[source]

Read vectors from a word2vec file.

One of two different vector serialization formats introduced in the word2vec software (Mikolov et al., 2013).

The word2vec binary format is a mix of textual and binary representations. The first line is two integers (as text, separated by a space) representing the number of types in the vocabulary and the size of the word vectors respectively. (word, vector) pairs follow. The word is represented as text followed by a space; after the space, each element of the vector is stored as a binary float32.

The most well-known pre-trained embeddings distributed in this format are the GoogleNews vectors.

Note

There is no formal definition of this file format; the only definitive reference on it is the original implementation in the word2vec software.

Due to the lack of a definition (and no special handling of it in the code) there are no explicit statements about the endianness of the binary representations. Most code just uses numpy.frombuffer, and that seems to work now that most people have little-endian machines. However, due to the lack of explicit direction on this encoding, I would advise caution when loading vectors that were trained on big-endian hardware.
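A sketch of this layout (an illustration, not the library's reader; it assumes little-endian floats and no trailing newline after each vector, details that real files from different writers may vary on):

```python
import io
import struct
import numpy as np

def parse_w2v_binary(f):
    # Text header, then for each word: the utf-8 word, a space, and
    # vector_size binary float32s (little-endian assumed).
    vocab_size, vector_size = map(int, f.readline().split())
    vocab = {}
    vectors = np.empty((vocab_size, vector_size), dtype=np.float32)
    for i in range(vocab_size):
        word = bytearray()
        while (c := f.read(1)) != b" ":
            word.extend(c)
        vocab[word.decode("utf-8")] = i
        vectors[i] = np.frombuffer(f.read(4 * vector_size), dtype="<f4")
    return vocab, vectors

buf = b"1 2\n" + b"dog " + struct.pack("<2f", 0.5, 1.5)
vocab, vectors = parse_w2v_binary(io.BytesIO(buf))
```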

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:

f (str | BinaryIO) – The file to read from

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_w2v_with_vocab(f, user_vocab, initializer=<function uniform_initializer.<locals>._unif_initializer>, keep_extra=False)[source]

Read vectors from a Word2Vec file subject to user vocabulary constraints.

See read_w2v() for a description of the file format and common pre-trained embeddings that use this format.

When provided a vocabulary this function will not reorder it. If you pass in that the word dog is index 12 then in the resulting vocabulary it will still be index 12.

When collecting extra vocabulary (words that are in the pre-trained embeddings but not in the user vocab) these will all be at the end of the vocabulary. Again the indices of user provided words will not change.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • user_vocab (Dict[str, int]) – A specific vocabulary the user wants to extract from the pre-trained embeddings.

  • initializer (Callable[[int], ndarray]) – A function that takes the vector size and generates a new vector. This is used to generate a representation for a word in the user vocab that is not in the pre-trained embeddings.

  • keep_extra (bool) – Should you also include vectors that are in the pre-trained embedding but not in the user provided vocab?

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_leader(f)[source]

Read vectors from a leader file.

This is our fully binary vector format.

The file starts with a header for the Leader format, a 3-tuple whose elements are: a magic number, the size of the vocabulary, and the size of the vectors. These numbers are represented as little-endian unsigned long longs that have a size of 8 bytes.

Following the header there are (length, word, vector) tuples. The length is the length of this particular word, encoded as a little-endian unsigned integer. The word is stored as utf-8 bytes. After the word, the vector is stored with each element as a little-endian float32 (4 bytes).
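The byte layout can be sketched with a writer built from struct (an illustration of the format as described above, not the library's implementation):

```python
import io
import struct

LEADER_MAGIC_NUMBER = 38941

def write_leader_sketch(words, vectors):
    # Header: three little-endian uint64s (magic, vocab size, vector size),
    # then (length, word, vector) records with a uint32 length and
    # little-endian float32 vector elements.
    out = io.BytesIO()
    out.write(struct.pack("<3Q", LEADER_MAGIC_NUMBER, len(words), len(vectors[0])))
    for word, vec in zip(words, vectors):
        encoded = word.encode("utf-8")
        out.write(struct.pack("<I", len(encoded)))
        out.write(encoded)
        out.write(struct.pack(f"<{len(vec)}f", *vec))
    return out.getvalue()

buf = write_leader_sketch(["dog"], [[0.5, 1.5]])
magic, vocab_size, vector_size = struct.unpack("<3Q", buf[:24])
```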

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:

f (str | BinaryIO) – The file to read from

Returns:

The vocab and vectors.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.read_leader_with_vocab(f, user_vocab, initializer=<function uniform_initializer.<locals>._unif_initializer>, keep_extra=False)[source]

Read vectors from a Leader file subject to user vocabulary constraints.

See read_leader() for a description of the file format.

When provided a vocabulary this function will not reorder it. If you pass in that the word dog is index 12 then in the resulting vocabulary it will still be index 12.

When collecting extra vocabulary (words that are in the pre-trained embeddings but not in the user vocab) these will all be at the end of the vocabulary. Again the indices of user provided words will not change.

Note

In the case of duplicated words in the saved vectors we use the index and associated vector from the first occurrence of the word.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • user_vocab (Dict[str, int]) – A specific vocabulary the user wants to extract from the pre-trained embeddings.

  • initializer (Callable[[int], ndarray]) – A function that takes the vector size and generates a new vector. This is used to generate a representation for a word in the user vocab that is not in the pre-trained embeddings.

  • keep_extra (bool) – Should you also include vectors that are in the pre-trained embedding but not in the user provided vocab?

Returns:

The vocab and vectors. The vocab is a mapping from word to integer and vectors are a numpy array of shape [vocab size, vector size]. The vocab gives the index offset into the vector matrix for some word.

Return type:

Tuple[Dict[str, int], ndarray]

word_vectors.read.sniff(f, buf_size=1024)[source]

Figure out what kind of vector file it is.

Parameters:
  • f (str | TextIO) – The file we are sniffing.

  • buf_size (int) – How many bytes to read in when sniffing the file.

Returns:

The guessed file type.

Return type:

FileType

word_vectors.read.read_leader_header(buf)[source]

Read the header from the leader file.

The header for the Leader format is a 3-tuple. The elements of this tuple are: a magic number, the size of the vocabulary, and the size of the vectors. These numbers are represented as little-endian unsigned long longs that have a size of 8 bytes.

Note

The magic number is used to make sure this is an actual Leader file and that we are not just trying to extract word vectors from a random binary file. The magic number is 38941.

Parameters:

buf (bytes) – The beginning of the file we are reading the header from.

Returns:

The vocab size and the vector size.

Raises:

ValueError – If the magic number doesn’t match.

Return type:

Tuple[int, int]

word_vectors.read.verify_leader(buf)[source]

Check if a file is in the leader format by comparing the magic number.

Parameters:
  • buf (bytes) – The beginning of the file we are trying to determine is a Leader formatted file.

Returns:

True if the magic number matched, False otherwise.

Return type:

bool
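A sketch of this check (an illustration, not the library's implementation): unpack the first 8 bytes as a little-endian uint64 and compare against the magic number.

```python
import struct

LEADER_MAGIC_NUMBER = 38941

def verify_leader_sketch(buf):
    # The first 8 bytes of a Leader file hold the magic number as a
    # little-endian uint64; anything else is not a Leader file.
    if len(buf) < 8:
        return False
    return struct.unpack("<Q", buf[:8])[0] == LEADER_MAGIC_NUMBER

good = struct.pack("<Q", LEADER_MAGIC_NUMBER)
```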

word_vectors.write

Write Word Vectors to a file.

We provide the main write() function that can write to various vector serialization formats based on the passed FileType. There are also several convenience functions for writing specific formats.

word_vectors.write.write(wf, vocab, vectors, file_type, max_len=None)[source]

Write word vectors to a file.

This function dispatches to one of the following word vector format writers based on the value of file_type: write_glove(), write_w2v_text(), write_w2v(), or write_leader().

Parameters:
  • wf (str | IO) – The file we are writing to.

  • vocab (Dict[str, int] | Iterable[str]) – The vocab mapping words -> ints.

  • vectors (ndarray) – The vectors as a np.ndarray.

  • file_type (FileType) – The format to use when writing the vectors to disk.

  • max_len (int | None) – The maximum length of a word in vocab. Only used when writing Leader vectors.

Raises:

ValueError – If an unsupported file type is passed.

word_vectors.write.write_glove(wf, vocab, vectors)[source]

Write vectors to a glove file.

See word_vectors.read.read_glove() for a description of the file format and examples of common pre-trained embeddings that use this format.

Parameters:
  • wf (str | TextIO) – The file we are writing to

  • vocab (Dict[str, int] | Iterable[str]) – The vocab of words -> ints.

  • vectors (ndarray) – The vectors as a np.ndarray.
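A sketch of the GloVe writer (an illustration of the output layout, not the library's implementation): one "word elem1 elem2 ..." line per pair, emitted in vocab index order.

```python
import io

def write_glove_sketch(wf, vocab, vectors):
    # Emit one line per (word, vector) pair, ordered by vocab index.
    for word, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
        wf.write(word + " " + " ".join(str(e) for e in vectors[idx]) + "\n")

out = io.StringIO()
write_glove_sketch(out, {"dog": 0, "cat": 1}, [[0.5, 1.5], [2.5, 3.5]])
```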

word_vectors.write.write_w2v_text(wf, vocab, vectors)[source]

Write vectors in the word2vec format in a text file.

See word_vectors.read.read_w2v_text() for a description of the file format and examples of common pre-trained embeddings that use this format.

Parameters:
  • wf (str | TextIO) – The file we are writing to

  • vocab (Dict[str, int] | Iterable[str]) – The vocab of words -> ints

  • vectors (ndarray) – The vectors we are writing

word_vectors.write.write_w2v(wf, vocab, vectors)[source]

Write vectors to the word2vec format as a binary file.

See word_vectors.read.read_w2v() for a description of the file format and examples of common pre-trained embeddings that use this format.

Parameters:
  • wf (str | BinaryIO) – The file we are writing to

  • vocab (Dict[str, int] | Iterable[str]) – The vocab of words -> ints.

  • vectors (ndarray) – The vectors as a np.ndarray.

word_vectors.write.write_leader(wf, vocab, vectors, max_len=None)[source]

Write vectors to a leader file.

See word_vectors.read.read_leader() for a description of the file format.

Parameters:
  • wf (str | BinaryIO) – The file we are writing to.

  • vocab (Dict[str, int] | Iterable[str]) – The vocab of words -> ints.

  • vectors (ndarray) – The vectors as a np.ndarray.

  • max_len (int | None) – The length of the longest word in vocab as utf-8 bytes.

word_vectors.convert

Convert between word vector formats.

We provide the main convert() function for converting between arbitrary formats based on the passed FileType (or by sniffing the input file with sniff() when not provided) as well as several convenience functions for converting between different pairs of formats.

word_vectors.convert.convert(f, output=None, output_file_type=FileType.LEADER, input_file_type=None)[source]

Convert vectors from one format to another.

Parameters:
  • f (str | TextIO | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

  • output_file_type (FileType) – The vector serialization format to use when writing out the vectors.

  • input_file_type (FileType | None) – An explicit vector format to use when reading.

word_vectors.convert.w2v_to_leader(f, output=None)[source]

Convert binary Word2Vec formatted vectors to the Leader format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.glove_to_leader(f, output=None)[source]

Convert GloVe formatted vectors to the Leader format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.w2v_text_to_leader(f, output=None)[source]

Convert text Word2Vec formatted vectors to the Leader format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.w2v_to_w2v_text(f, output=None)[source]

Convert binary Word2Vec formatted vectors to the text Word2Vec format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.w2v_to_glove(f, output=None)[source]

Convert binary Word2Vec formatted vectors to the GloVe format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.w2v_text_to_glove(f, output=None)[source]

Convert text Word2Vec formatted vectors to the GloVe format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.w2v_text_to_w2v(f, output=None)[source]

Convert text Word2Vec formatted vectors to the binary Word2Vec format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.glove_to_w2v(f, output=None)[source]

Convert GloVe formatted vectors to the binary Word2Vec format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.glove_to_w2v_text(f, output=None)[source]

Convert GloVe formatted vectors to the text Word2Vec format.

Parameters:
  • f (str | TextIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.
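Since the GloVe and word2vec text formats differ only by the header line, this particular conversion reduces to counting lines and prepending "&lt;vocab size&gt; &lt;vector size&gt;". A self-contained sketch (not the library's implementation, which operates on files rather than strings):

```python
def glove_to_w2v_text_sketch(glove_text):
    # Count the (word, vector) lines and measure the vector size from the
    # first line, then prepend the word2vec text header.
    lines = glove_text.splitlines()
    vector_size = len(lines[0].split(" ")) - 1
    header = f"{len(lines)} {vector_size}"
    return "\n".join([header, *lines]) + "\n"

converted = glove_to_w2v_text_sketch("dog 0.1 0.2\ncat 0.3 0.4")
```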

word_vectors.convert.leader_to_w2v(f, output=None)[source]

Convert Leader formatted vectors to the binary Word2Vec format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.leader_to_w2v_text(f, output=None)[source]

Convert Leader formatted vectors to the text Word2Vec format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.convert.leader_to_glove(f, output=None)[source]

Convert Leader formatted vectors to the GloVe format.

Parameters:
  • f (str | BinaryIO) – The file to read from.

  • output (str | None) – The name for the output file. If not provided we use the input file name with a modified extension.

word_vectors.utils

Utilities for working with word vector I/O.

word_vectors.utils.find_space(buf, offset)[source]

Find the first space starting from offset and return the word spanning from offset to that space, along with the new offset.

Parameters:
  • buf (bytes) – The bytes buffer we are looking for a space in.

  • offset (int) – Where in the buffer we start looking.

Returns:

A (word, offset) tuple where word is the text (decoded from utf-8) starting at the original offset and running up to the first space, and offset is the index of the location just after that space.

Return type:

Tuple[str, int]
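A sketch of the documented behavior (not the library's implementation): scan for the next space, decode the bytes before it, and return the index just past the space so repeated calls walk through the buffer.

```python
def find_space_sketch(buf, offset):
    # Locate the next space, decode the word before it, and return the
    # position just past the space.
    i = buf.index(b" ", offset)
    return buf[offset:i].decode("utf-8"), i + 1

word, new_offset = find_space_sketch(b"dog 0.5 1.5", 0)
```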

word_vectors.utils.is_binary(f, block_size=512, ratio=0.3, text_characters=b' !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\n\r\t\x0c\x08')[source]

Guess if a file is binary or not.

This is based on the implementation from here

Parameters:
  • f (str | BinaryIO) – The file we are testing.

  • block_size (int) – The amount of the file to read in for checking.

  • ratio (float) – The ratio of non-text characters above which we assume the file is binary.

  • text_characters (bytes) – Characters that we define as text characters; the ratio of these to other characters is used to determine whether the file is binary.

Returns:

True if the file is binary, False otherwise

Return type:

bool
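The heuristic can be sketched as follows (an illustration, not the library's implementation; the NUL-byte short circuit is a common refinement in such detectors and is an assumption here):

```python
# Printable ASCII plus the whitespace/control bytes from the documented default.
TEXT_CHARACTERS = bytes(range(32, 127)) + b"\n\r\t\x0c\x08"

def is_binary_sketch(block, ratio=0.3):
    # Delete the known text characters; if too large a fraction of the
    # block remains, call it binary.
    if b"\x00" in block:  # NUL bytes are a strong binary signal
        return True
    if not block:
        return False
    nontext = block.translate(None, TEXT_CHARACTERS)
    return len(nontext) / len(block) > ratio
```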

word_vectors.utils.bookmark(f)[source]

Bookmark where we are in a file so we can return.

This is a context manager that lets us save our spot in an open file, do some operations on that file, and then return to the original spot.

This is very useful for things like sniffing a file. If the file is already open and you read in some bytes to estimate the format, you need to remember to reset to the start or else you will get wrong results. This context manager automates this.

>>> f.tell()
120
>>> with bookmark(f):
...     _ = f.read(1024)
...     print(f.tell())
1144
>>> f.tell()
120
Parameters:

f (IO) – The file we are bookmarking.
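A sketch of how such a context manager can be built with contextlib (not the library's implementation): record the position with tell() and restore it with seek() in a finally block, so the spot is restored even if an exception occurs inside the with body.

```python
import contextlib
import io

@contextlib.contextmanager
def bookmark_sketch(f):
    # Remember where we are, let the caller do arbitrary reads, then
    # seek back no matter what happened inside the block.
    spot = f.tell()
    try:
        yield
    finally:
        f.seek(spot)

f = io.StringIO("word 0.1 0.2\n" * 100)
f.seek(5)
with bookmark_sketch(f):
    _ = f.read(20)
after = f.tell()
```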

word_vectors.utils.to_vocab(words)[source]

Convert a series of words to a vocab mapping strings to ints.

Parameters:

words (Iterable[str]) – The words in the vocab

Returns:

The Vocabulary

Return type:

Dict[str, int]
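A sketch of this conversion (not the library's implementation; the first-occurrence-wins handling of duplicates is an assumption consistent with the readers' documented behavior):

```python
def to_vocab_sketch(words):
    # Enumerate words in order, skipping duplicates so the first
    # occurrence keeps its index.
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab
```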

word_vectors.utils.create_output_path(path, file_type)[source]

Create the output path by stripping the extension and adding a new one based on the vector format.

Parameters:
  • path (str | IO | PurePath) – The path to the input file.

  • file_type (FileType) – The vector format we are converting to.

Returns:

The new output path with an extension determined by the file type.

Return type:

str

word_vectors.utils.uniform_initializer(unif)[source]

Create a vector initialization function that takes a vector size as input.

Parameters:

unif (float) – The bounds that the new vector will be initialized within

Returns:

A function that returns a uniformly random vector between -unif and unif.

Return type:

Callable[[int], ndarray]
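A sketch of this factory (not the library's implementation): a closure over the bound that draws a fresh uniform random vector of the requested size, matching the initializer signature used by the read_*_with_vocab functions.

```python
import numpy as np

def uniform_initializer_sketch(unif):
    # Return a closure matching Callable[[int], ndarray] that samples
    # each element uniformly from [-unif, unif).
    def _unif_initializer(vector_size):
        return np.random.uniform(-unif, unif, size=(vector_size,)).astype(np.float32)
    return _unif_initializer

init = uniform_initializer_sketch(0.25)
vec = init(50)
```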