I've been working on some tools that integrate with the USPTO (both from the application side and the validation side) for quite a few years now, and they've been making a TON of formatting changes recently. A lot of their PDF forms have changed, they're requiring XML versions of all data we submit, they're handling classifications differently, etc. Their process always felt like it was stuck in the past, handled manually by humans; now it feels like they're moving everything toward automated intake and initial reviews. I imagine this change is for the same reasons, and that's a hefty fee to force it.
This also likely means I will need to rework our systems to spit out docx instead of/in addition to PDFs, which will be a nightmare to do. So that's fun.
The consolation is that, if I remember correctly, docx is just a zip file containing xml.
I made an xlsx exporter in actionscript3 (lol) years ago and it worked like this. What I ultimately did was make a "template" document; my code just injected strings into key spots, zipped it up in memory, and handed it back to you as a file.xlsx. Probably took me 3 days?
I didn't have the benefit of libraries so I imagine this is significantly easier in less hobbled environments, nodejs or whatever probably has a kitchen sink package to do it.
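For anyone who wants to reproduce the trick today, the whole approach fits in a few lines of Python with nothing but the standard library. This is just a rough sketch of the template-injection idea described above; the template path and the `{{PLACEHOLDER}}` convention are made up for illustration:

```python
import zipfile

# Sketch of the "template injection" approach: take a hand-made .docx/.xlsx
# template containing placeholder strings, swap in real values, and write a
# new archive. Paths and placeholder names are hypothetical.
def fill_template(template_path: str, output_path: str, values: dict[str, str]) -> None:
    with zipfile.ZipFile(template_path) as src, \
         zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # The document body lives in XML parts (word/document.xml for
            # .docx; sheet and sharedStrings XML for .xlsx).
            if item.filename.endswith(".xml"):
                text = data.decode("utf-8")
                for key, value in values.items():
                    # Real code should XML-escape the injected values.
                    text = text.replace("{{" + key + "}}", value)
                data = text.encode("utf-8")
            dst.writestr(item, data)

fill_template("template.docx", "out.docx", {"APPLICANT": "Jane Doe"})
```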
That's exactly right. There are definitely nodejs docx templating packages (I've worked on codebases that used them in the past), but they're certainly not required provided your documents are reasonably simple.
If anything, generating a pdf from various input files/structured text has been a much harder task. We generated docx files to allow for easy modification by non-technical staff, but to generate a pdf we had to use a headless instance of libreoffice since pandoc was struggling with the rendering.
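For reference, the headless LibreOffice route is basically one shell command per file; a minimal sketch (file and directory names are placeholders, and the binary may be called `libreoffice` rather than `soffice` depending on the system):

```python
import subprocess

# Convert a DOCX to PDF with a headless LibreOffice instance.
subprocess.run(
    ["soffice", "--headless", "--convert-to", "pdf", "--outdir", "out/", "input.docx"],
    check=True,
)
```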
ISO standards are not free, which is very cumbersome and is why many ISO standards are not implemented correctly (like 8601). I consider ISO proprietary and bad for that reason.
It is for the .doc format and the other Office formats up to and including Microsoft Office 2008, and does not provide a standard for the post-2008 Office formats (the ones ending with an `x` in the file extension), AFAIK.
You're mistaken: ISO 29500 and ECMA-376 (actually three somewhat different standards all in all, as I understand it) are both for "Office Open XML" (totally not intended to be confused with Open Office and Open Office XML), which is docx/pptx/xlsx etc.
ODT (OpenDocument) is also XML-in-a-ZIP and is also an ISO standard (older than Microsoft's, in fact).
I don't think the older variants (.doc, .xls etc.) are standardized, but iirc Microsoft did eventually release some documentation on them.
> Due to aggressive automated scraping of FederalRegister.gov and eCFR.gov, programmatic access to these sites is limited to access to our extensive developer APIs.
Apparently. Then a captcha and a button to request access, which if you complete, returns a 500 Internal Server Error.
… my tax dollars are hard at work, I see.
The Wayback Machine hasn't got a snapshot, either, it seems.
Seriously, I've had my share of horror stories on government systems. Do they just not dogfood their own product? Where is their QA team? It's atrocious how basic tasks are so broken all the time.
Government can't afford to hire the competent talent, only the scraps after everyone else (even the consulting bodyshops) are done. The top GS pay bracket is lower than entry level engineers at many companies (not just FAANG, but also defense, F500 companies, etc.)
One of the many things that I found tempting about working for the U.S. Digital Service was that, while the GS-15 pay grade is definitely way less than I'd make in the private sector, my spouse's family is military/government and the difference between "hippy programming thingy" and "has a GS-15/O-6 job" would've been night and day. The one puts me in a pile of stereotypes, but the other says "oh, he's basically the bureaucratic equivalent of a Captain, that's very respectable."
Different circles I guess. While my spouse's family considers government jobs to be stable and somewhat respectable, there is a lot more respect for FAANG and other high paying jobs. One is respectable, the other is prestigious.
That’s all perspective. Many people who realize you work for the government will see you as a do-nothing bureaucratic leech, and for most government employees that is the correct assumption.
I was wondering why docx format would be chosen instead of PDF but they answer it pretty completely here if anyone else is interested: https://www.uspto.gov/patents/docx
It actually seems like a sane choice to me. PDF is good for rendering, but horrible for parsing. DOCX is a ZIP file with XML data. Maybe ODT or whatever would've been a better choice, I don't know what the format is like. But if you disregard the usual knee-jerk "but it's Microsoft!" reaction, it doesn't seem like a bad choice.
The Office Open XML file format is extremely complex, and takes up around 6,500 pages (compared to ~1000 for ODF). One thing you notice when reading the DOCX spec is that they designed it with the sole constraint that DOC files could easily be converted to DOCX. For example, you'll frequently see compatibility tags like "autoSpaceLikeWord95", "footnoteLayoutLikeWW8", "useWord2002TableStyleRules", and "lineWrapLikeWord6" that expose internal implementation details. Rather than creating a useful standard allowing all users to store their documents in a clean, portable way, Microsoft decided to make their standard faithfully reproduce all of the quirks and bugs of their legacy binary formats. It's so difficult to correctly implement the Office Open XML standard that even Microsoft took until Office 2013 to do so (the standard was approved in 2006).
So true. "It's XML, so it must be easy to parse and manipulate" is such a naive, even misleading attitude. If what you do is take a byzantine, legacy-encrusted implementation and just serialise its data structures to an XML representation, very little has been gained.
[edit: but I will grant that almost anything is better than attempting to parse useful content from PDF.]
I had a CTO who told me that "anytime someone says something will be simple because it uses XML, just repeat what they said and use the word text instead of XML. Then tell me if it sounds smart."
Microsoft "adopted" XML purely to assuage various regulators that their new format was totally gonna be open, interoperable and standardized. It is a pure fig leaf but judging by all the comments in here mentioning "XML", it seems to have worked.
I think they pretty much had to do that to preserve the formatting of existing documents for users who are force-upgraded by their employers to new Office versions. But it seems a scraper that just wants the information in the document can ignore almost all of those tags.
Interesting! How do they compare feature-wise? I feel like there must be things each of them support that the other one doesn't, but I don't know how consequential they are.
OCR has come a long way, so much so that visually interpreting a PDF is about as error-prone as parsing XML output from Microsoft in non-Microsoft software.
Try extracting tabular data from a PDF! With XML it's trivial, but for PDF you need highly specialized software packages to do this. One of the best, pdfplumber, is largely based [1] on a Master's thesis titled Algorithmic Extraction of Data in Tables in PDF Documents [2].
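To make the contrast concrete, this is roughly what table extraction looks like with pdfplumber (the file name is a placeholder), and even with the library doing the heavy lifting you often end up tuning extraction settings per layout:

```python
import pdfplumber

# Extract tables from every page of a PDF. Results frequently need
# per-document tuning of pdfplumber's table settings.
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        for table in page.extract_tables():
            for row in table:
                print(row)  # each row is a list of cell strings (or None)
```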
This was mostly aimed at the various ways the XML document may or may not conform to any number of XSD types. What we 'see' as a table might not be described and stored as a table in the same way in the XML. And by XML I mean whatever XML Office (the one from Microsoft) generates.
A 6,000-page spec, with attributes that specify whether the data is tabular based on various properties (be it columns and rows, or just plain text with start and stop pointers) and that may or may not render it visually as a table, is error prone, even across first-party implementations (the first-party desktop versions on Windows vary, as well as those on macOS, Android, iOS, and their web offering).
If there were one simple data structure describing the table, with all other aspects being optional, then yes, an XML-based format would be easier than OCR. But that's not the case I was pointing at.
> for PDF you need highly specialized software packages to do this.
Not really, or at least not all that specialized. You need:
a: a pdf-to-raster-image converter (i.e. any working PDF viewer, plus maybe the X server it talks to)
b: a reasonably decent OCR system capable of scanning tables (definitely nontrivial, but hardly "highly specialized" since things other than PDFs display data in tables). (See the sketch below.)
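The sketch: pdftoppm (from poppler) to rasterize, then Tesseract for the OCR. Both tool choices and the file names are just examples, and recovering real table structure takes more than plain page-level OCR:

```python
import glob
import subprocess

# Rasterize the PDF to page images, then OCR each one.
# pdftoppm writes page-1.png, page-2.png, ...; tesseract writes <base>.txt.
subprocess.run(["pdftoppm", "-r", "300", "-png", "scan.pdf", "page"], check=True)
for image in sorted(glob.glob("page-*.png")):
    out_base = image.rsplit(".", 1)[0]
    subprocess.run(["tesseract", image, out_base], check=True)
```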
I guess, but you were replying to "I was wondering why docx format would be chosen instead of PDF" seemingly unconvinced, so I assumed you thought PDF would've made more sense.
Neither of these are sane. PDF and DOCX are both built to ape analog paper processes. It's been 40 years of this computer thing, it's time to get rid of the 70s office facsimiles.
I think the knee-jerk is against any alternative to Office, not against Office. Statistically speaking, trying to use anything reasonable that doesn't genuflect to Microsoft's monopoly is what seems to be met with a knee-jerk reply such as yours. As in, there are probably more people who don't care but hate the complaining about libre stuff than there are advocates for libre stuff.
> But it doesn't really say why they chose to build on docx.
Having worked directly with their teams in the past (although not on this), a lot of their systems seemed to evolve naturally over time based on the needs present. In that industry, a large majority of the documents being passed back and forth are DOCX. So my semi-educated guess is that someone built a system to handle some simple intake tasks for DOCX applications because a large majority of applications already were DOCX, it evolved over a few years, and when they finally decided to fully automate the process, they built upon what they had, which only supported DOCX, and it was cheaper/easier to mandate that everyone submit in that format than to build a new system or add support for others.
I get that you've worked with them, it seems, but I would argue that your hunch is wrong here.
Regulatory processes, business systems, and international integration are plagued by the complexities of PDF OCR. OCR creates systemic issues and forces complex system architectures. I'm sure XML is a typical downstream format for parsing anyway. Using DOCX enhances the quality of the overall set of integrations.
They could just use standardized application forms the way they do for research reports they require (the "ISA ###" forms). Those forms are easily parsable by things like pdftk and don't require any OCR.
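For what it's worth, pulling the filled-in fields out of such a form is a one-liner with pdftk; a minimal sketch (the form file name is a placeholder):

```python
import subprocess

# Dump the fillable form fields from a PDF using pdftk. The output is a
# plain-text list of field records (name, type, value).
result = subprocess.run(
    ["pdftk", "form.pdf", "dump_data_fields"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```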
I don't necessarily disagree with your point (since it makes complete sense); I just want to point out that they already have a system in place for this using other means (although even there they are moving toward XML instead, likely because of what a pain it is to deal with text that exceeds the area of an input field in PDFs).
As a few others have mentioned, the parsing alone means DOCX is a huge win over PDF. I had to parse a bunch of PDF data related to COVID and it was always a PITA. Every time they changed their layout even a little bit, I had to rewrite parts of my extractor. The worst part? The headers/metadata showed it was all made in Word, so they could have exported to DOCX as well as PDF if they wanted to, but they only provided PDF.
>But it doesn't really say why they chose to build on docx.
> Is requiring the DOCX format just adding another step in the process for applicants?
> Actually, it's the opposite. The USPTO conducted a study and found that over 80% of applicants are authoring their applications in DOCX format (through writing tools such as Microsoft Word). Because the files are originally in a DOCX format, uploading the original file eliminates the step for the applicant to convert the document to PDF prior to submission. Instead, the applicant is able to save the step of converting because our system will do that automatically.
I guess you have little exposure to the industry. My experience is the vast majority already use Word or something else which supports DOCX. I cannot think of another format which practitioners have easy access to and would use. PDF just needs to go away for this process.
I actually do have a lot of exposure to patents, and I know everyone uses DocX already. I'm just saying that that web page doesn't say why they chose DocX, only why you should use DocX.
Sure it does. I'll grant that it does not say why they chose it over other alternatives, which I'm thinking is what you're looking for. Are there really any alternatives? The only real alternative I can think of is the OpenDocument Format, and I don't consider it a real alternative. As they say on that page, 80% of their users already deal with DOCX, so 80+% of them would have to convert to ODF. I can't imagine ODF having any benefit worth requiring most people to convert their documents before sending.
To me, the salient question is why is the government officially adopting a proprietary file format? Why is it important to optimize for the trivial convenience of patent applicants?
That's a standard for a file format, but that doesn't necessarily mean it is a free/open format. I don't think the FSF, for instance, considers it open (maybe because of patent issues). I'll leave that for you to decide, but just mean to suggest a standard doesn't automatically mean free/open.
Interestingly, a section in the wiki article linked mentioned the standard proposal was controversial because ODF already existed (and ODF was considered less complicated as a specification).
Nevertheless, good point. It depends on what you mean by "open".
There's this one guy I've dealt with. He uses an editor he wrote himself. He'll convert his documents to Pages and then use Pages for any other conversion needed.
Pages reminds me of using various desktop publishing apps. FrameMaker, Ventura, InDesign.
I used to be able to eke out accurate page layout in MS Word, but it was always fraught and more difficult. But if you’re not perfectionist about layout precision, MS Word provides the same meta-data encoding function through the application of style (‘this is a heading’, ‘this is a definition’, etc).
Coool. What unique features does it have / how does it work / what does it look like / what were the real-world factors that contributed to the insanity of deciding to build it? xD
Oh right, I bet Pages is up there. I dunno though—I know lots of patent attorneys who are surprisingly non-technical and probably haven’t given a thought to cloud security issues.
It sounds like they're going to try to catch and remove that - one of the bullet points under DOCX Benefits on https://www.uspto.gov/patents/docx reads:
> Privacy: provides automatic metadata detection (e.g. author and comments) and removal features to support the submission of only substantive information in the DOCX file.
And, then further down in the FAQ it says:
> What happens to the metadata in DOCX files?
> Metadata is generally removed by applicants prior to submission. However, if metadata is found during the validation process, it is automatically removed prior to submission. Examples of metadata include author, company, last modified by, etc. The only information that is preserved is the size, page count, and word count.
> Outgoing DOCX documents (i.e. Office actions) from the USPTO to applicants will also have metadata removed.
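That's obviously not their pipeline, but stripping the same kind of core metadata yourself before filing is only a few lines with python-docx, assuming that library is available (file names are placeholders):

```python
from docx import Document  # python-docx

# Blank out the core document properties (author, last-modified-by, comments,
# title) before submitting. This only covers the obvious core metadata; things
# like tracked changes or embedded comments need separate handling.
doc = Document("application.docx")
props = doc.core_properties
props.author = ""
props.last_modified_by = ""
props.comments = ""
props.title = ""
doc.save("application-clean.docx")
```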
So, $100-400 depending on the size (CFR section 1.16(u)). That feels... excessive, but if processing a DOCX is automatic and PDFs require humans (I'm assuming), it makes sense.
Office Open XML (docx), to be fair, is a non-proprietary standard. Microsoft has been forced into that by the EU; and they keep messing with it, but it is actually technically "open".
It is open only on paper - at least half of it is MSO-specific compatibility cruft, plus real documents often contain binary blobs that aren't described in the standard. The name "open" is a lie in this case. Moreover, it made it into ISO only because of the corruption of these standardization bodies. It wasn't even properly reviewed!
Due to decades of work reverse engineering the behavior of MS Office (and not only since the introduction of DOCX - many of the behaviors specified in DOCX reference behaviors of older versions of Office).
It would be a monumental effort to create another implementation and completely impossible without referring to MS Office as a reference implementation.
It's completely unrelated. LibreOffice can open most of the completely binary DOC or XLS files as well; they were just reverse engineered. It has nothing to do with openness.
Correlation does not imply causation. The fact that Libreoffice can open binary proprietary formats does not automatically imply that all formats it uses are binary and proprietary. DOCX is an open standard.
DOCX is not an open standard; there are many articles[1][2] on why it isn't - it's even covered in the Wikipedia article [3]. The ISO standards committee was just a rubber stamp, Microsoft puppets ensuring vendor lock-in for decades.
I stand corrected, sorry. Emulating some other proprietary software really shouldn't be a paragraph in something we call an "open standard". Thank you for the links.
If you’re not perfectionist about layout precision, MS Word provides the same meta-data encoding function through the application of style (‘this is a heading’, ‘this is a definition’, etc), as one can get from structured document layout applications like FrameMaker, Ventura, InDesign, Quark.
If the PTO has provided an MS Word Style Template and a document schema (document template), it is dead easy to extract a useful XML encoding for further analysis. There is a lot you can ignore in an MS Word file. Dead easy to write XPaths and XQueries that provide an API for the original DOCX document collection.
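As an illustration of how little of the file you actually need: pulling every paragraph and its style name out of word/document.xml is just a zip read plus a namespace-aware query. The stdlib sketch below uses the limited XPath that ElementTree supports (lxml would give you full XPath); the file name is a placeholder:

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used throughout word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

with zipfile.ZipFile("spec.docx") as z:
    root = ET.fromstring(z.read("word/document.xml"))

# Print each paragraph's style name and concatenated text runs.
for para in root.iterfind(".//w:p", NS):
    style = para.find("w:pPr/w:pStyle", NS)
    style_name = style.get(f"{{{W}}}val") if style is not None else "Normal"
    text = "".join(t.text or "" for t in para.iterfind(".//w:t", NS))
    if text:
        print(style_name, "|", text)
```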
Some aspects of patent administration are being assisted by machine learning, which might mean that statistically a few jobs have already been lost, maybe not examiners per se.
Patent examining seems pretty complex analytically, perhaps some aspects of patent drafting, making a common description from a bunch of 3D renders (like scene description) might be the next low-hanging fruit after classification?
The USPTO uses a lot of automation, but frequently not where they should.
Note: I am a current USPTO patent examiner, and this is my opinion, not that of the USPTO or US govt.
The USPTO apparently has two contractors to classify patent documents. I've heard that some sort of AI system is used for classification, in combination with a lot of poorly paid contractors. In my experience, the classification is so frequently wrong that this is clearly not working. It might seem okay to upper management, who never has to actually deal with the classification being inaccurate. But examiners aren't happy with it.
Many people are calling for AI search. The new head of the USPTO mentioned it during a recent all-hands meeting. Unfortunately, the people who propose AI search don't seem to realize that 1. the USPTO has at least 5 AI search tools at their disposal (PLUS, More Like This, Dialog's similarity search, IP.com's similarity search, and Google Patents similar documents) and 2. none of these AI search tools work that well. In my experience, most of the time these tools don't return useful documents. (I still try them for every application as there's little downside.) The documents are usually close but it's rare that I'll actually use one of these documents in a prior art rejection. AI search sounds good to people who have never searched for patents and particularly have never used the existing AI search tools. AI search technology probably won't be good for a decade or more.
In contrast, tools to analyze patent claims for various problems (basically, linters) have been available for around 30 years and can be quite useful in my opinion. But the USPTO has no such tool available to examiners, and analysis under 112(b), etc., is almost always done manually. I wrote my own tool, which I run on my USPTO computer on a regular basis.
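To give a flavor of what such a linter checks, here's a toy version of one classic 112(b)-style check, lack of antecedent basis. This is not plint and not how any real tool necessarily does it, just an illustration:

```python
import re

# Toy antecedent-basis check: flag "the X" / "said X" terms never introduced
# earlier in the claim with "a X" / "an X". Real claim language is far messier
# (plural terms, multi-word terms, Markush groups, ...).
def check_antecedent_basis(claim: str) -> list[str]:
    introduced = {m.group(1).lower()
                  for m in re.finditer(r"\ban?\s+(\w+)", claim, flags=re.IGNORECASE)}
    problems = []
    for m in re.finditer(r"\b(?:the|said)\s+(\w+)", claim, flags=re.IGNORECASE):
        if m.group(1).lower() not in introduced:
            problems.append(f"no antecedent basis for '{m.group(0)}'")
    return problems

print(check_antecedent_basis(
    "A widget comprising a lever, wherein the lever engages the housing."
))  # flags "the housing"
```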
There are a huge number of opportunities to streamline USPTO operations with automation. Why are IDS forms not computer-readable? Why is so much information not auto-filled? Why do I have to fill out "bib data sheets" for every application? Why do I have to manually upload my search history when the system could easily automatically grab it for me? Etc.
And automation isn't enough. There should be more data validation in the process, as a lot of problems can be automatically caught at the time of filing or when I post an office action. That's when fixing these problems would be easiest.
> Could you elaborate on this? I'm still on the fence about even having independent dotfiles in a separate repo.
I'm not quite sure what specifically you want me to elaborate on, but I'll write about how similar my work and personal computers are.
My USPTO computer is almost entirely independent of my personal computer. I run Linux on my personal computer, and the USPTO computer is Windows. The repository for my tool, plint, is on both my USPTO and personal computers. With that being said, I make all the commits to the GitHub repository on my personal computer.
> If I may ask yet another question: if your tools make you more efficient, does that impact perception of other examiners' "KPI metrics?"
plint doesn't seem to have any effect on how I am perceived or how other examiners are perceived. I've told some examiners about plint. Some don't seem to be interested, others like the idea. I don't know of anyone else using plint on a regular basis.
Also, using a tool like this doesn't necessarily make an examiner more efficient in the KPIs that upper management cares about. The most "efficient" approach would be to ignore 112(b) problems unless they are particularly obvious. In fact, I think that's what the incentives encourage: doing the bare minimum and moving on to the next application. However, a tool like this makes quality examination more efficient. I'd like to do as high quality a job as I can given the absurd time restrictions on my work. USPTO upper management gives a lot of lip service to quality, but quality isn't rewarded like "production" is. (Again, all this is my opinion and I don't speak for anyone else.)
Thank you for the reply, and for your frankness. Maybe one of the ways to make things better is from the inside, just by being there in this moment rather than at any other time, with what we can work with.
Our fair advantage is in the authoring of computing instructions, and that's as good a chance as we may get.
I hope we will both have stories to share a decade hence.
Current USPTO patent examiner here. The job is regarded as difficult by the vast majority of examiners, as it's based on a quota system and examiners don't get enough time to do a quality job. I don't know of a single examiner who works only one day a week. Many work unpaid overtime to meet their quotas, euphemistically called "voluntary overtime", which is rarely voluntary. There's about 50% attrition in the first year for new hires right now. Does that sound easy to you? I don't know any other job that hires white-collar professionals with such high attrition.
You can find patent examiners who say that being a patent examiner is relatively easy. These folks typically changed from a notoriously difficult job in law or academia. So it's only relatively easier. Beyond these cases, there do seem to be some examiners who have an easy time, but that's quite rare in my experience, and I'd question the quality of their work.
As for why I took the job: I came from academia, so the job is easier in some respects. The quota isn't entirely bad: It's hard to meet, yes, but it's also mostly objective. If an examiner's numbers are good, the USPTO is happy with them. That contrasts with my experience in academia, where one's job performance is often subjective. If a researcher's boss decides they don't like the researcher for whatever reason (office politics, the researcher is socially awkward, etc.), there might be nothing the researcher could do to recover from that. The job also does give me a lot of freedom in other ways: I can live mostly anywhere I want to, I'm not expected to work all the time (production numbers beat looking busy by working all the time), there are few IP restrictions (aside from that I can't get a US patent), etc. The main problems for me are that the quota is too high and that I don't care for the technology I'm assigned (though the technology could be a lot worse).
(It should go without saying that this comment is my own opinion and not that of the USPTO, US govt., my previous employers, etc.)
In case the overaggressive HN filter ate it again, it's https://www.federalregister.gov/documents/2022/04/28/2022-09027/filing-patent-applications-in-docx-format
Also dang, recently HN's filters are too aggressive - from sentence-casing USPTO (to Uspto which isn't even a pronounceable acronym) to removing Twitter's search queries.