Vectorize io

output_mode: Optional specification for the output of the layer. Values can be "int", "binary", "count" or "tf-idf", configuring the layer as follows:

"int": Outputs integer indices, one integer index per split string token. When output = "int", 0 is reserved for masked locations; this reduces the vocab size to max_tokens-2 instead of max_tokens-1.
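The index arithmetic for "int" mode can be illustrated with a small pure-Python sketch. This is not the layer's implementation: `int_encode` and `effective_vocab_size` are hypothetical helpers, and placing the OOV token at index 1 is an assumption made for the illustration.

```python
MASK_INDEX = 0  # reserved for masked (padded) locations in "int" mode
OOV_INDEX = 1   # assumption: the single OOV token takes the next index


def effective_vocab_size(max_tokens, output_mode="int"):
    """Number of real tokens that fit under a max_tokens cap."""
    # "int" mode reserves mask + OOV; the other modes reserve OOV only.
    reserved = 2 if output_mode == "int" else 1
    return max_tokens - reserved


def int_encode(tokens, vocabulary):
    """Map each token to an integer index, one index per token."""
    # Real tokens start after the two reserved indices.
    lookup = {tok: i + 2 for i, tok in enumerate(vocabulary)}
    return [lookup.get(tok, OOV_INDEX) for tok in tokens]


vocab = ["the", "cat", "sat"]
print(int_encode(["the", "dog", "sat"], vocab))  # [2, 1, 4]
print(effective_vocab_size(10, "int"))           # 8
print(effective_vocab_size(10, "count"))         # 9
```

The unknown token "dog" falls back to the OOV index, and no real token ever receives index 0.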

max_tokens: The maximum size of the vocabulary for this layer. If None, there is no cap on the size of the vocabulary. Note that this vocabulary contains 1 OOV token, so the effective number of tokens is (max_tokens - 1), or (max_tokens - 2) when output_mode is "int".

standardize: Optional specification for standardization to apply to the input text. Values can be None (no standardization), 'lower_and_strip_punctuation' (lowercase and remove punctuation) or a Callable. Default is 'lower_and_strip_punctuation'.

split: Optional specification for splitting the input text. Values can be None (no splitting), 'whitespace' (split on ASCII whitespace), or a Callable. Default is 'whitespace'.

ngrams: Optional specification for ngrams to create from the possibly-split input text. Values can be None, an integer or tuple of integers; passing an integer will create ngrams up to that integer, and passing a tuple of integers will create ngrams for the specified values in the tuple. Passing None means that no ngrams will be created.
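The three forms of the ngrams argument can be sketched in plain Python. `make_ngrams` is a hypothetical helper written for this illustration, not part of the layer. Joining the pieces of an ngram with a single space, and the order of the returned list, are assumptions of the sketch.

```python
def make_ngrams(tokens, ngrams=None):
    """Recombine split tokens into ngrams, following the ngrams argument."""
    if ngrams is None:
        return list(tokens)  # None: no ngrams are created
    if isinstance(ngrams, int):
        sizes = range(1, ngrams + 1)  # integer: every size up to that integer
    else:
        sizes = ngrams  # tuple: exactly the listed sizes
    out = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            # Assumption: ngram pieces are joined with one space.
            out.append(" ".join(tokens[i:i + n]))
    return out


tokens = ["the", "cat", "sat"]
print(make_ngrams(tokens))        # ['the', 'cat', 'sat']
print(make_ngrams(tokens, 2))     # ['the', 'cat', 'sat', 'the cat', 'cat sat']
print(make_ngrams(tokens, (3,)))  # ['the cat sat']
```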

This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).

If desired, the user can call this layer's adapt() method on a dataset. When this layer is adapted, it will analyze the dataset, determine the frequency of individual string values, and create a 'vocabulary' from them. This vocabulary can have unlimited size or be capped, depending on the configuration options for this layer; if there are more unique values in the input than the maximum vocabulary size, the most frequent terms will be used to create the vocabulary.

The processing of each sample contains the following steps:

1. standardize each sample (usually lowercasing + punctuation stripping)
2. split each sample into substrings (usually words)
3. recombine substrings into tokens (usually ngrams)
4. index tokens (associate a unique int value with each token)
5. transform each sample using this index, either into a vector of ints or a dense float vector.

Some notes on passing Callables to customize splitting and normalization for this layer:

1. Any callable can be passed to this Layer, but if you want to serialize this object you should only pass functions that are registered Keras serializables (see tf.keras.utils.register_keras_serializable for more details).
2. When using a custom callable for standardize, the data received by the callable will be exactly as passed to this layer. The callable should return a tensor of the same shape as the input.
3. When using a custom callable for split, the data received by the callable will have the 1st dimension squeezed out - instead of [["string to split"], ["another string to split"]], the Callable will see ["string to split", "another string to split"]. The callable should return a Tensor with the first dimension containing the split tokens - in this example, we should see something like [["string", "to", "split"], ["another", "string", "to", "split"]]. This makes the callable site natively compatible with tf.strings.split().
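The adapt() behaviour and the processing steps described above can be approximated in plain Python, without TensorFlow. This is a sketch under stated assumptions, not the layer's code: `standardize`, `split`, `adapt` and `count_vector` are hypothetical stand-ins, punctuation is limited to ASCII, and the OOV token is assumed to occupy slot 0 of the dense vector.

```python
import string
from collections import Counter


def standardize(text):
    """Step 1: lowercase and strip ASCII punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def split(text):
    """Step 2: split on whitespace."""
    return text.split()


def adapt(samples, max_tokens=None):
    """Count token frequencies and keep the most frequent terms."""
    counts = Counter(tok for s in samples for tok in split(standardize(s)))
    # One slot is held back for the OOV token (the non-"int" case).
    cap = None if max_tokens is None else max_tokens - 1
    return [tok for tok, _ in counts.most_common(cap)]


def count_vector(text, vocabulary):
    """Steps 4-5: index tokens, then build a dense count vector."""
    # Assumption: slot 0 collects out-of-vocabulary tokens.
    index = {tok: i + 1 for i, tok in enumerate(vocabulary)}
    vec = [0] * (len(vocabulary) + 1)
    for tok in split(standardize(text)):
        vec[index.get(tok, 0)] += 1
    return vec


data = ["The cat sat.", "The cat ran!", "The cat?", "The dog sat."]
vocab = adapt(data, max_tokens=4)
print(vocab)                                          # ['the', 'cat', 'sat']
print(count_vector("The dog sat on the cat", vocab))  # [2, 2, 1, 1]
```

With max_tokens=4 only the three most frequent terms survive, so "ran" and "dog" are later counted as OOV along with any unseen word. Step 3 (ngrams) is skipped here to keep the sketch short.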
TextVectorization(max_tokens=None, standardize="lower_and_strip_punctuation", split="whitespace", ngrams=None, output_mode="int", output_sequence_length=None, pad_to_max_tokens=False, vocabulary=None, **kwargs)