Importing California’s Sales & Use Tax

Here we review the structure of a web site of laws and regulations to be imported into the Linguist™.

The following screen shows the law at the time this article was written:[1]

The arrow list contains links to each chapter, as shown in the following HTML from the page:

Following the link for Chapter 1 leads to the following page:

The links on this page have the same structure.

Following the link for Section 6006 leads to the following page:

This page includes headings and content of the law as shown in the following HTML from the page:

Getting the Content of the Law

The following Java code processes these pages as follows:

1.       process all pages linked within a list of class “arrow-list-law”

2.       save the XML containing each division of law content
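As a rough illustration of the first step, the following sketch collects the link targets found inside any list of class “arrow-list-law”. It assumes the list is a `<ul>` containing plain `<a href="...">` anchors and uses regular expressions only to stay self-contained; the class name `ArrowListLinks` and the sample file names are hypothetical. Fetching the linked pages and saving the XML are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: collects link targets from lists of class "arrow-list-law". */
final class ArrowListLinks {
    // A whole <ul class="... arrow-list-law ..."> ... </ul> block (assumes such lists are not nested).
    private static final Pattern LIST = Pattern.compile(
        "<ul\\b[^>]*class=\"[^\"]*\\barrow-list-law\\b[^\"]*\"[^>]*>(.*?)</ul>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // The href attribute of each anchor inside such a list.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the href values of all anchors inside matching lists, in document order. */
    static List<String> linksIn(String html) {
        List<String> links = new ArrayList<>();
        Matcher list = LIST.matcher(html);
        while (list.find()) {
            Matcher href = HREF.matcher(list.group(1));
            while (href.find()) {
                links.add(href.group(1));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample page; the real pages have more markup.
        String page = "<ul class=\"arrow-list-law\">"
            + "<li><a href=\"chapter-1.html\">Chapter 1</a></li>"
            + "<li><a href=\"chapter-2.html\">Chapter 2</a></li></ul>"
            + "<a href=\"elsewhere.html\">not in the list</a>";
        System.out.println(linksIn(page));
    }
}
```

Since the chapter pages use the same list structure, the same routine could be applied again to each page it discovers. A real HTML parser would be more robust than regular expressions for pages that are not this regular.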

Here is an example of the XML saved as a result:

Note that there are some extra divisions that are not shown above.

When the XML files of law content are imported, the following code

The child elements of a law content division are not limited to paragraphs.  Besides paragraphs with “level” classes, there are paragraphs of class “history” and “small”, as shown below:

The following code drops the history paragraphs:[2]
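A minimal sketch of such a filter follows, assuming the decision is made on the paragraph's class attribute alone. The class and method names are hypothetical and this is not the article's own listing. A switch is used so that further classes can be added as cases later.

```java
/** Hypothetical filter deciding which paragraphs of a law content division to keep. */
final class ParagraphFilter {
    /** Returns true if a paragraph with the given class attribute should be kept. */
    static boolean keep(String cssClass) {
        switch (cssClass == null ? "" : cssClass) {
            case "history":
                return false; // history paragraphs are dropped
            default:
                return true;  // "level" paragraphs, "small" paragraphs, and anything else
        }
    }

    public static void main(String[] args) {
        System.out.println(keep("history"));
        System.out.println(keep("small"));
    }
}
```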

When processing a paragraph, special care is given to the text of any law section number followed by a bolded title.

The following code processes these elements, which may remove some of the text of the paragraph, and then writes a tab-separated line to the given file.
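To make the idea concrete, here is a small sketch that separates a leading section number and a bold (or strong) title from the remaining text and returns one tab-separated line. It assumes the paragraph arrives as an HTML fragment such as `6006. <b>"Sale."</b> "Sale" means and includes:`; the class name, the three-column layout, and the regular expression are illustrative assumptions rather than the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits "number. <b>title</b> text" into tab-separated columns. */
final class SectionHeading {
    // A section number such as "6006." or "6006.5.", then a bold or strong title, then the rest.
    private static final Pattern HEADING = Pattern.compile(
        "^\\s*(\\d+(?:\\.\\d+)?)\\.\\s*<(b|strong)>(.*?)</\\2>\\s*(.*)$", Pattern.DOTALL);

    /** Returns "number TAB title TAB text"; number and title are empty when there is no heading. */
    static String toTabSeparated(String paragraphHtml) {
        Matcher m = HEADING.matcher(paragraphHtml);
        if (m.matches()) {
            // The number and title are removed from the text that remains in the last column.
            return m.group(1) + "\t" + m.group(3).trim() + "\t" + m.group(4).trim();
        }
        return "\t\t" + paragraphHtml.trim();
    }

    public static void main(String[] args) {
        System.out.println(toTabSeparated("6006. <b>\"Sale.\"</b> \"Sale\" means and includes:"));
    }
}
```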

In addition, the bold text of any “small” paragraph is separated from the text of the note, as are any italicized litigants, together with the case reference in which they occur.

The following code writes additional lines to the file for the span and bolded items, removing their text from the result returned:

Note that the hyphen left behind is removed by the cleanup routine shown below.

The following shows the resulting file in Excel:

Sorting on column I showed lines that are not part of any chapter of the law, which can be removed by adding the following “if” to the code shown previously.

Sorting on column F also showed some additional asterisked items, which can also be omitted by adding another “if” to the above code.

The same may be done for the following:

These changes are not necessary, however, in the following section, which only processes paragraphs at some level (not “small” paragraphs) within a chapter.

Importing Sections into an Outline of Groups

At this point we have a tab-delimited file with columns for:

·         the part of the law (i.e., Sales and Use Tax Law)

·         articles of the law (which contain sections and may occur within chapters)

·         sections of the law (which may be subsections but are not organized as such)

Also, any single piece of the law may have a few lines, possibly including a line with a bold or strong title for the given content.

The importer ignores any paragraphs that are small or otherwise not at some “level” as well as any that are not part of a chapter, as in the following filtered view:
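The rule just described can be sketched as a simple predicate. This assumes that each row carries the paragraph's class and the chapter it belongs to, and that the “level” classes begin with the text "level"; both the class name `ImportFilter` and the example values are hypothetical.

```java
/** Hypothetical predicate for the rows that the importer keeps. */
final class ImportFilter {
    /**
     * cssClass: the paragraph class, e.g. "level-1", "small", or "history" (exact names are assumed).
     * chapter:  the chapter the row belongs to, empty or null when it is outside any chapter.
     */
    static boolean shouldImport(String cssClass, String chapter) {
        boolean atSomeLevel = cssClass != null && cssClass.startsWith("level");
        boolean inChapter = chapter != null && !chapter.trim().isEmpty();
        return atSomeLevel && inChapter;
    }

    public static void main(String[] args) {
        System.out.println(shouldImport("level-1", "Chapter 1"));
        System.out.println(shouldImport("small", "Chapter 1"));
    }
}
```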

There are some issues yet to be addressed here, such as the following from the “TEXT” column:

Note that these lines have outline structure as in:

·         6006. “Sale.” "Sale" means and includes:

o   (a) Any transfer of title or possession, exchange, or barter, conditional or otherwise, in any manner or by any means whatsoever, of tangible personal property for a consideration. "Transfer of possession" includes only transactions found by the board to be in lieu of a transfer of title, exchange, or barter.

o   (b) The producing, fabricating, processing, printing, or imprinting of tangible personal property for a consideration for consumers who furnish either directly or indirectly the materials used in the producing, fabricating, processing, printing, or imprinting.

Note the section number and title, which should be in the resulting outline.  The fragment introducing the subsections should also be imported for parsing.

The following lines indicate that the issue of “fixing up” the outline from the text can involve multiple levels:

These lines should have the following outline structure:

·         (a)

o   (1)A "retail sale" or "sale at retail" means a sale for a purpose other than resale in the regular course of business in the form of tangible personal property.

o   (2) When tangible personal property is delivered by an owner or former owner thereof, or by a factor or agent of that owner, former owner, or factor to a consumer or to a person for redelivery to a consumer, pursuant to a retail sale made by a retailer not engaged in business in this state, the person making the delivery shall be deemed the retailer of that property. He or she shall include the retail selling price of the property in his or her gross receipts or sales price.

·         (b)

o   (1) Notwithstanding subdivision (a), a "retail sale" or "sale at retail" shall include a sale by a convicted seller of tangible personal property with a counterfeit mark, a counterfeit label, or an illicit label on that property, or in connection with that sale, regardless of whether the sale is for resale in the regular course of business.

o   (2) For purposes of this subdivision, all of the following shall apply:

§  (A) A "convicted seller" means a person convicted of a counterfeiting offense, including, but not limited to, a violation under Section 350 or 653w of the Penal Code or Section 2318, 2319, or 2320 of Title 18 of the United States Code on or after the date of sale.

§  (B) "Counterfeit mark" has the same meaning as that term is defined in Section 2320 of Title 18 of the United States Code.

Note that there is no text with which to label (a) or (b), which may seem unfortunate but is nonetheless common.

This can go even further with cases such as:

·         (b) (1) (A) The Taxpayers' Rights Advocate may order the release of any levy or notice to withhold issued pursuant to this part or, within 90 days from the receipt of funds pursuant to a levy or notice to withhold, order the return of any amount up to two thousand three hundred dollars ($2,300) of moneys received, upon his or her finding that the levy or notice to withhold threatens the health or welfare of the taxpayer or his or her spouse and dependents or family.

Two kinds of break address these issues (the first two items below); the remaining items describe the stack that tracks the nesting:

·         consider breaking a section between a colon and a following parenthesized index starting from its initial value (e.g., 1, a, A, i).

·         break a section between an initial parenthesized index and a following parenthesized index starting from its initial value

·         at each break, push the current section on a stack and recursively process the line beginning with the initial parenthesized index

·         pop the stack when encountering a non-initial parenthesized index beginning a line to be processed

·         reset the stack each time a new section is encountered

Here is the core routine, which processes the lines written above:

The matchers handle section labels at the beginning of the line and in the limited embedded case shown above.
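The stack discipline described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the original routine: the class name `OutlineSketch`, the row format (path, tab, text), and the regular expressions are hypothetical, the method name `pushPopOrIncrement` only echoes the name used in this article, and Roman numeral indices are deliberately left out. An initial index (1, a, or A) pushes a new level; any other index pops back to the level of the same kind and replaces it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the stack-based outline fix-up. */
final class OutlineSketch {
    private static final Pattern SECTION = Pattern.compile("^(\\d+(?:\\.\\d+)?)\\.\\s*");
    private static final Pattern LABEL = Pattern.compile("^\\(([0-9]+|[A-Za-z]+)\\)\\s*");
    // An initial index, either at the very start of the text or right after a colon.
    private static final Pattern EMBEDDED = Pattern.compile("(?:^|:\\s*)(\\((?:1|a|A)\\))");

    private final Deque<String> stack = new ArrayDeque<>(); // open indices, innermost on top
    private final List<String> rows = new ArrayList<>();
    private final List<String> warnings = new ArrayList<>();
    private String section = "";

    List<String> rows() { return rows; }
    List<String> warnings() { return warnings; }

    void process(String line) {
        String text = line.trim();
        Matcher s = SECTION.matcher(text);
        Matcher l = LABEL.matcher(text);
        if (s.lookingAt()) {            // new section: reset the stack
            section = s.group(1);
            stack.clear();
            text = text.substring(s.end());
        } else if (l.lookingAt()) {     // line begins with a parenthesized index
            pushPopOrIncrement(l.group(1));
            text = text.substring(l.end());
        }
        Matcher e = EMBEDDED.matcher(text);
        if (e.find()) {                 // break here and process the remainder recursively
            emit(text.substring(0, e.start(1)).trim());
            process(text.substring(e.start(1)));
        } else {
            emit(text);
        }
    }

    private void pushPopOrIncrement(String label) {
        if (label.equals("1") || label.equals("a") || label.equals("A")) {
            stack.push(label);          // initial index: open a new, deeper level
            return;
        }
        while (!stack.isEmpty() && kind(stack.peek()) != kind(label)) {
            stack.pop();                // close deeper levels of a different kind
        }
        if (stack.isEmpty()) {
            warnings.add("section " + section + " reaches (" + label + ") without an initial index");
        } else {
            stack.pop();                // replace the previous index at this level
        }
        stack.push(label);
    }

    private static char kind(String label) {
        char c = label.charAt(0);
        return Character.isDigit(c) ? '1' : Character.isLowerCase(c) ? 'a' : 'A';
    }

    private void emit(String text) {
        StringBuilder path = new StringBuilder(section);
        for (Iterator<String> it = stack.descendingIterator(); it.hasNext();) {
            path.append('(').append(it.next()).append(')'); // outermost index first
        }
        rows.add(path + "\t" + text);
    }
}
```

Feeding it a line such as `(a) (1) First.` within a section yields one row for `(a)` with empty text and one row for `(a)(1)` with the text, which matches the unlabeled parents seen in the outline above. A section whose first index is not an initial one is still accepted, but a warning is recorded.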

This works out well for this site, but there are exceptions, so the code needs to be careful and informative.

The following, for example, starts with subsection (b):

The implementation of “pushPopOrIncrementSection” logs a warning for this case.  In the course of getting clean lines for all the content, there were a couple more things to fix, like an extra closing parenthesis and 3 cases that required custom code or hand tweaking.

In the following, there are two sections labeled 6011(c)(4)(B), the first of which has an extra closing parenthesis.

This did not require custom code or hand tweaking, because the routines that support “pushPopOrIncrementSection” recognize this case and implement an increment, logging the modification and any gaps (which did not occur on this site except in the one case above).  The same body of code handles nested integer and alphabetic indices, including Roman numerals, with case sensitivity.
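One way to test whether an index directly follows another, across integers, single letters, and Roman numerals, is sketched below. This is an assumption about how such a check could be written, not the supporting routines themselves; the class name `IndexOrder` is hypothetical, Roman numerals are recognized only up to an arbitrary limit of 100, and labels are assumed not to mix upper and lower case.

```java
/** Hypothetical helper: decides whether one index directly follows another. */
final class IndexOrder {
    private static final int[] VALUES = {100, 90, 50, 40, 10, 9, 5, 4, 1};
    private static final String[] SYMBOLS = {"c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};

    /** Lowercase Roman numeral for n. */
    static String toRoman(int n) {
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < VALUES.length; k++) {
            while (n >= VALUES[k]) {
                sb.append(SYMBOLS[k]);
                n -= VALUES[k];
            }
        }
        return sb.toString();
    }

    /** Value of a well-formed Roman numeral up to 100 in either case, or -1 otherwise. */
    static int fromRoman(String s) {
        String lower = s.toLowerCase();
        for (int n = 1; n <= 100; n++) {
            if (toRoman(n).equals(lower)) {
                return n;
            }
        }
        return -1;
    }

    private static boolean sameCase(String a, String b) {
        return Character.isUpperCase(a.charAt(0)) == Character.isUpperCase(b.charAt(0));
    }

    /** True if next is the immediate successor of prev as an integer, a letter, or a Roman numeral. */
    static boolean isSuccessor(String prev, String next) {
        if (prev.matches("\\d{1,9}") && next.matches("\\d{1,9}")) {
            return Integer.parseInt(next) == Integer.parseInt(prev) + 1;
        }
        if (!prev.matches("[A-Za-z]+") || !next.matches("[A-Za-z]+") || !sameCase(prev, next)) {
            return false; // mixed kinds or mixed case never follow one another
        }
        if (prev.length() == 1 && next.length() == 1 && next.charAt(0) == prev.charAt(0) + 1) {
            return true;  // alphabetic step, e.g. (h) to (i)
        }
        int p = fromRoman(prev);
        int n = fromRoman(next);
        return p > 0 && n == p + 1; // Roman step, e.g. (i) to (ii)
    }
}
```

Checking the alphabetic reading before the Roman one lets "(i)" serve either as the letter after "(h)" or as the numeral before "(ii)". When `isSuccessor` returns false for two indices of the same kind, the caller can log the gap.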

This was an example that we handled by hand before switching to special-case code:

Here is another such example.  Note the embedding of section (e) within (d)(2):

Overall, writing the code and cleaning up the few special cases that remained took a few hours of Java programming.

The following shows the lines written by “writeGroupInfo”:

All that remains is to use the Linguist APIs to create groups and split the content into sentences (or fragments thereof) and add them to the corresponding groups.

Separating Paragraphs and Sentences

The steps discussed above have effectively separated section numbers and titles (where they occur) from the text within such sections.  The next step in the import process is to break any paragraphs of text into their constituent sentences.

Here is what Wikipedia has to say on the remaining matter:

Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences; however, sentence boundary identification can be challenging due to the potential ambiguity of punctuation marks. In written English, a period may indicate the end of a sentence, or may denote an abbreviation, a decimal point, an ellipsis, or an email address, among other possibilities.

The Linguist provides a variety of methods for segmenting sentences, including the following:

1.       break the content (e.g., of a section) into a sequence of tokens[i]

a.       each token matches some potentially complex regular expression (or its pattern is ‘null’)

b.       each token encountered is passed to the tokenizer’s listeners with the index at which it starts

2.       accumulate the tokens of the content to be split

3.       encapsulate the text of the content and its tokens as a ‘Paragraph’

4.       split the tokens of the ‘paragraph’ into a sequence of sentences (or fragments thereof)

a.       each sentence matches some pattern over tokens or their patterns (or its pattern is null)

b.       each sentence encountered is passed to a method as a ‘ParagraphSentence’

5.       write the split sentence (or fragment) as a tab-delimited line along with its group information
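The shape of this pipeline can be illustrated with plain Java, without the Linguist's own classes. In the sketch below, the interface `TokenListener`, the class `SentenceSplitSketch`, and the token pattern are all hypothetical stand-ins: tokens are matched by one regular expression, delivered to a listener with their start index, accumulated, and then split wherever a sentence-ending token occurs. It is far simpler than pattern-based splitting and does not handle abbreviations or closing quotation marks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the tokenize, accumulate, and split steps. */
final class SentenceSplitSketch {
    /** Receives each token together with the index at which it starts. */
    interface TokenListener {
        void encountered(String token, int start);
    }

    // Numbers such as $2,300.50 or 6006.5 are single tokens, so their periods do not end a sentence.
    private static final Pattern TOKEN =
        Pattern.compile("\\$?\\d+(?:,\\d{3})*(?:\\.\\d+)?|\\w+|\\S");

    static void tokenize(String text, TokenListener listener) {
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            listener.encountered(m.group(), m.start());
        }
    }

    /** Splits text into sentences, keeping any trailing fragment as a final element. */
    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        List<Integer> starts = new ArrayList<>();
        tokenize(text, (token, start) -> { tokens.add(token); starts.add(start); });

        List<String> sentences = new ArrayList<>();
        int begin = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (begin < 0) {
                begin = starts.get(i);      // first token of a new sentence
            }
            String t = tokens.get(i);
            if (t.equals(".") || t.equals("?") || t.equals("!")) {
                sentences.add(text.substring(begin, starts.get(i) + 1));
                begin = -1;
            }
        }
        if (begin >= 0) {
            sentences.add(text.substring(begin).trim()); // fragment without end punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        for (String s : split("The levy may return up to $2,300.50 of moneys received. Section 6006.5 applies here.")) {
            System.out.println(s);
        }
    }
}
```

Keeping the start index of every token is what allows each sentence to be cut from the original text, so position and length information can be written alongside it.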

The following code takes the tab-separated file of sections and their content produced above and writes a file of sentences per group.

The following listener accumulates the tokens of a ‘paragraph’: the Lexicon’s ‘process(String)’ method invokes the listener’s ‘encountered’ method for each token it encounters.

Then a ‘Paragraph’ is constructed from the text and its tokens.  The ‘split(Continuation)’ method invokes the ‘resultDelivered(ParagraphSentence)’ method, which writes each sentence to the file (along with position, length, and pattern information).

As a result of running this splitter, we obtain the following:

Reviewing more than two hundred of these revealed no issues in splitting the sentences (which is better than is commonly expected).

Importing Sentences and Groups into a Knowledge Base

Given a file of sentences to be imported into groups, such as that produced above, the Linguist APIs allow programmatic access to a knowledge base.

The following code imports the file produced above into a knowledge base at the URL shown below:

The method that gets the group into which a sentence should be imported handles the case where the “ARTICLE” is empty and where a parenthesized subsection belongs within a parent section, as shown below:
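The two cases mentioned here can be sketched as a function that builds the chain of group names for a row. The naming scheme (a parent section “6006” containing “6006(a)”, which contains “6006(a)(1)”) and the class name `GroupPath` are assumptions for illustration; the Linguist API calls that actually create the groups are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: computes the nested group names for one imported row. */
final class GroupPath {
    private static final Pattern LABEL = Pattern.compile("\\(([0-9A-Za-z]+)\\)");

    /** Returns the group names from outermost to innermost, skipping an empty article. */
    static List<String> of(String chapter, String article, String section) {
        List<String> path = new ArrayList<>();
        path.add(chapter);
        if (article != null && !article.trim().isEmpty()) {
            path.add(article.trim());       // the ARTICLE column may be empty
        }
        int paren = section.indexOf('(');
        String current = paren < 0 ? section : section.substring(0, paren);
        path.add(current);                  // the parent section, e.g. "6006"
        Matcher m = LABEL.matcher(section);
        while (m.find()) {
            current += m.group();           // each parenthesized index nests one level deeper
            path.add(current);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(of("Chapter 1", "", "6006(a)(1)"));
    }
}
```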

The next step might be any or all of the following:

1.       analyze the tokens found in the text versus the Linguist’s vocabulary and fix typos or extend the lexicon as appropriate

2.       analyze the text for multiple word expressions that carry significant mutual information

3.       programmatically parse the sentences in advance of any disambiguation efforts

Each of these is addressed in other documents.

[1] sales-and-use-tax-law-chapters.html

[2] Switch statements are used in case other sections are to be processed or other classes are to be dropped.

[i] In the most general case, tokenization may produce a lattice of tokens, which the Linguist can handle.  In practice, especially if tokenization is sophisticated enough, such ambiguities are rare.