DMS Document Management System


Tido

@w0ndersp00n, I moved this here so as not to hijack the Mayan thread.

 

Quote

I have been looking for a DMS for a long time. There is another DMS, Paperless, with a similar idea but smaller: https://paperless.readthedocs.io/en/latest/setup.html#setup-installation-docker  Could this be a better solution?

 

 

Quote

I've looked at Paperless myself as well. About a year ago the developer announced he wouldn't be very active anymore. I just noticed that the community took over, so I quickly did a test and these are my findings:

Paperless, just like Mayan, only offers an AMD64 image, so you have to build ARM images yourself

On my C2, building Paperless takes about 10 to 15 minutes, compared to about 1 hour for Mayan

Memory consumption out of the box is about 50 MiB, versus Mayan's 350 MiB on my C2

Building the image is way more fiddly, and scripting it would require too much effort

  • You need to clone the Git repo
  • You need to edit some config files
  • And from that moment you can start building

My conclusion would be that anyone who would like to use Paperless should build the image following the instructions in the Paperless manual. After you've built the image, you can probably build new versions again after refreshing from Git.

 

After playing around with Paperless, I found it less suitable for me. I am also looking for a DMS, and Mayan offers way more flexibility and more features regarding sorting files, searching, metadata etc. Paperless doesn't even allow for a web upload of a file, which is something that I need.

 

Building it sounds a bit similar to this: https://github.com/armbian/autotests  After cloning, Igor runs an initial part of the script. Once it has configured itself (and you have adjusted the configuration file), you run the script again. A similar approach could be taken for Paperless, I guess.

 

I agree, Mayan has business-like options. Do you want it for business or private use?

Not sure what exactly you mean by web upload. According to its documentation, Paperless offers: "Currently, there are three options: the consumption directory, IMAP (email), and HTTP POST."

 


No problem! I want it for private use. It’s true what you say about Paperless, that it supports web upload (HTTP POST). The big difference is that you need to create/write/develop the form to upload a document yourself. A lot of functionality is missing out of the box.

 

In Mayan, you simply go to the settings and enable a web form for the upload, and it works. So Mayan has a larger footprint, but is also more complete in my opinion. Actually, there is next to nothing you can configure from the web UI with Paperless, in contrast to Mayan.

 

Maybe I haven’t looked at it long enough, but if such features were user-friendly to enable and work with, then I’d probably opt for Paperless.

 

Still some functions that make me feel that Mayan also is a great solution for personal use:

  • Custom metadata (such as who sent a document, which category, date, etc.)
  • Custom indexes: it’s possible to navigate documents using the custom metadata (e.g. I want to look through my invoices which I received in January of 2020)
  • Besides these configuration steps everything works out of the box

Have you installed Paperless and compared it to Mayan? I’m very curious if you may find it different.


8 hours ago, w0ndersp00n said:

The big difference is that you need to create/write/develop the form to upload a document yourself. A lot of functionality is missing out of the box.

I bought a Brother scanner ADS-1100W that works with Linux.

I wrote a document/collection for myself about a paperless office and how I would like it to be: how to name documents and keep a folder structure, so that if the DMS goes on strike, I'll still find my documents.

I never looked at HTTP POST; it never seemed necessary to me. I would scan documents and get some help finding the right name for each document with Autokey-py3.

 

9 hours ago, w0ndersp00n said:

Have you installed Paperless and compared it to Mayan? I’m very curious if you may find it different.

Not yet. I tried Paperless quite a while ago and it showed some potential, but since containers should easily live next to each other, I will try it again and do some tests with some documents.

 

My focus until now has been naming, keeping a folder structure, indexing (tesseract), metadata (in PDF for example), and search & find.

 


19 hours ago, Tido said:

I will try it again.

@w0ndersp00n, I failed miserably :(   This morning I spent 2 hrs trying to install it on the same SD card that holds my Mayan install.

So I tried again on a fresh Debian. Because the Docker steps look like a lot and I am not familiar with Docker, I thought I would try it this way. However, just

apt install lxc
   followed by:
lxc launch ubuntu: paperless
  fails in this way: Command 'lxc' not found.

 

Aha, I found something that I may have to read first to understand what I am doing, unless you suggest trying the Docker route anyway?

Setting up LXC on Debian desktop: https://gudok.xyz/lxcdeb/

While the creators themselves suggest:  If using Ubuntu, we recommend you use Ubuntu 18.04 LTS as your container host. LXC bugfix releases are available directly in the distribution package repository shortly after release and those offer a clean (unpatched) upstream experience. https://linuxcontainers.org/lxc/getting-started/

 

oh gosh, I just wanted to test it.


I don't know why it failed. But here are the steps I took to get Paperless running next to Mayan on my dev/test Odroid:

 

git clone https://github.com/the-paperless-project/paperless.git
cd paperless
cp docker-compose.yml.example docker-compose.yml
cp docker-compose.env.example docker-compose.env
docker-compose up -d

I edited docker-compose.env to change variables for Timezone and OCR Languages:

 

TZ=Europe/Amsterdam
PAPERLESS_OCR_LANGUAGES=nld deu eng

With these steps, Paperless will be built and run on port 8000. This process takes about 15 minutes on my Odroid C2.

 

Maybe you can try again with these steps.

 

One thing I did notice already: the consumption of PDF files takes waaaay longer with Paperless than with Mayan. But in my short test this application actually seems worthwhile. I'm going to test it further as well.


On 4/11/2020 at 10:25 AM, Tido said:

Building the image is way more fiddly, and scripting it would require too much effort

 

12 hours ago, w0ndersp00n said:

With these steps, Paperless will be built and run on port 8000.

Thank you. Contrary to your initial comment (fiddly), it was indeed easy.

 

How to start the instance on your next login:

List all Docker containers: docker ps -a

docker start paperless_    (just hit Tab, so it automagically completes the command)

 

However, when I got to the login I did not know the credentials. For Armbian they are clearly stated on the website, but I could not find them here, so I searched the issues:  https://github.com/the-paperless-project/paperless/issues/578

To create the user, run the following in the Paperless Docker folder:   docker exec -it paperless_webserver_1 ./manage.py createsuperuser

 

What was your experience?

 

12 hours ago, w0ndersp00n said:

consumption of PDF files takes waaaay longer

Did you feed it an invoice, or what was your test file?

 

Edited by Tido
crossed out, I forgot to return to the documentation and read until the end.

Via an FTP connection I put two PDF invoices into the /paperless/consume folder. Error message in the log:

PARSE FAILURE for /consume/Kaufbeleg_digitec_AMD_Ryzen_186722.pdf: Thumbnail (gs) failed at ['gs', '-q', '-sDEVICE=pngalpha', '-o', '/tmp/paperless/paperless-rtfp6afe/gs_out.png', '/consume/Kaufbeleg_digitec_AMD_Ryzen_186722.pdf']

 


@w0ndersp00n,

While writing to you the other day, I thought: I want the software to scan and index and, based on the words found with OCR, guess what the document is (invoice, bank, tax; should an invoice be named differently if it is for telephone or insurance?) and change the file name to something like:  2020-02-26_IN_Parking.

If the DMS cannot rename the file, I want another piece of software to take over this task, because I have to scan quite a few paper folders.
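
A minimal sketch of that renaming idea, in Python. It is not part of Paperless or Mayan. The keyword table and the fallback code `XX` are hypothetical; only the `IN` code and the `2020-02-26_IN_Parking` shape come from the example above. The OCR text and the date are assumed to be supplied by whatever tool ran the OCR.

```python
import re
from datetime import date

# Hypothetical keyword-to-code table. "IN" for invoices is taken from the
# example name above; the other codes are made up for illustration.
TYPE_CODES = {
    "invoice": "IN",
    "bank": "BK",
    "tax": "TX",
}


def guess_type_code(ocr_text: str, default: str = "XX") -> str:
    """Return the code of the first keyword found in the OCR text."""
    lowered = ocr_text.lower()
    for keyword, code in TYPE_CODES.items():
        if keyword in lowered:
            return code
    return default


def build_name(doc_date: date, ocr_text: str, subject: str) -> str:
    """Build a name such as 2020-02-26_IN_Parking (extension not included)."""
    # Replace anything that is awkward in a file name with a hyphen.
    safe_subject = re.sub(r"[^A-Za-z0-9-]+", "-", subject).strip("-")
    return f"{doc_date.isoformat()}_{guess_type_code(ocr_text)}_{safe_subject}"
```

Called as `build_name(date(2020, 2, 26), ocr_text, "Parking")`, this produces the kind of name described above whenever the OCR text contains the word "invoice". A plain keyword lookup like this will misclassify documents that mention several keywords, so it is a starting point at best.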

Last but not least, I don't want the documents to sink into some database; I want to keep my folder structure:  7-Folder Structure.

 

Yesterday evening I was looking for films on YT and came across this one : Open Source Document Management System - Papermerge

Warning, it is pretty new and maybe not as feature rich, but it looks good what he shows:   https://www.youtube.com/watch?v=U_x8fOhuMTI

  https://github.com/ciur/papermerge

 


It’s true that the initial setup of Paperless is quite OK. However, as you noticed, you have to use the command line in order to change settings and set up the user. Basically all setup happens via the CLI.

 

In my case I also had to disable inotify. This is once again a specific setting which needs to be changed before starting the container, because otherwise it would never detect any of my files in the consume directory.

 

The consumption process on my side is also less reliable than with Mayan.

  • It’s very slow. Where Mayan processes a PDF including OCR within seconds, I need to wait minutes with Paperless.
  • When Paperless is unable to execute OCR, or detects a language for which you haven’t installed tesseract-ocr, then the consumption fails and your document is never added. There is no way to override it. For some reason Paperless thought one of my documents was in Portuguese, so it simply didn’t process it.
  • It’s impossible to add files other than PDF, JPG and TIFF. Mayan will convert DOCX and ODT to PDF while processing the file.

On the other hand:

  • Paperless gives me the ability to quickly search on correspondent, date and text in the file which is the basic functionality I’m looking for. This works out of the box, while you need to configure this with Mayan.

So I’m still very much on the fence regarding Paperless... I still prefer Mayan, but its resource requirements are through the roof...

 

So neither Mayan nor Paperless will (by default) rename files. Mayan will not do that because it’s metadata-driven. You should use metadata, and I think you can set up rules to add specific metadata when some text has been found in the OCR data.

Paperless can’t do that either, but according to the manual you can probably write a hook script that does it after consumption.

 

Also, most DMS systems can export all files, if desired, using specific naming conventions.

 

37 minutes ago, Tido said:

Yesterday evening I was looking for films on YT and came across this one : Open Source Document Management System - Papermerge

Warning, it is pretty new and maybe not as feature rich, but it looks good what he shows:   https://www.youtube.com/watch?v=U_x8fOhuMTI

  https://github.com/ciur/papermerge

 

 

I have seen this application as well. But my biggest gripe is that it tries to be a file system, with conventions such as folders. For me the file names are irrelevant. My requirements are:

  • Categorise on sender (correspondent)
  • Categorise on type of document (tagging)
  • OCR
  • Search on date, correspondent, tag and document text

I don’t want another file system. I want to use metadata in order to find my files. So with these requirements Paperless should be a fit, but I’m still not liking the small issues I’ve found until now.

 

Edit: I just found out about https://github.com/the-paperless-project/paperless/blob/1c956652f360e58409c8fca148b7662585dd1087/paperless.conf.example. Maybe this file should be added to volumes in order to change settings somewhat more easily.


So after playing around with Paperless for a few hours and with a few test runs, I have to admit I'm starting to get more positive. The downside still is that documentation is lacking and that development seems to have slowed down quite a bit. On the plus side, when you know what to do, this tool can work very well for you.

 

Some less positive points:

  • It doesn't support Office-type of files. So no docx, odt, xlsx, etc. This issue isn't too big, since I'm using it mostly for files I received/scanned. And if I receive an Office-type of document, it isn't very hard converting it to PDF.
  • It does support TXT, CSV and MD files. However, the default Dockerfile doesn't account for this functionality. So you'll need to add 'ghostscript-fonts' to the added packages, or add it to the consumer container manually afterwards if you want to use this.
  • While consuming, it's impossible to add new tags or correspondents due to SQLite being locked.

 

Some more positive points:

  • By additionally mounting paperless.conf via the volumes in both containers, a lot of configuration changes can be made. This is easier than sh-ing into the containers.
    • E.g. encryption of saved files can be enabled
    • You can set the date order (default YMD, I set it to DMY)
      • This is very worthwhile, because with this, Paperless will read the OCR'ed file and detect the date of the file. I successfully tested this with some invoices.
    • You can set the default language for tesseract. I set it to Dutch, and this really speeds up the process of consuming documents, instead of trying English first, before any other languages.
    • Also I've set the OCR quality to 150dpi instead of 300dpi, which also helps speeding up the OCR process. Since my scanner is high quality, I don't assume this will become an issue.
    • And most important: you can enable "forgiving OCR", which means that a file is consumed, even if OCR failed!
  • In tags and correspondents you can set some Match keywords to search for. If Paperless finds those, the correspondent and/or tags will automatically be added to the consumed document. I added an invoice from a retailer, and after consuming, the date was correct, the correspondent was correct and the right tags were added.

 

I'm going to scan some files and upload them to my development box to test if everything will still work as expected. If it does, I might replace Mayan with Paperless.

 

Edit: In the image you can see two POS-tickets I scanned, which were from 2015. Paperless automatically found the correct correspondent, MediaMarkt. In the first run I tried to see if it would automatically find the correct date (July 2), which was correct. On the second run I wanted to know if automatic tagging worked (bon), and that worked.

 

It isn't flawless, especially with these POS-tickets degrading over time. E.g. for some IKEA tickets Paperless was unable to determine the correct date. I guess that this issue will be smaller with recent tickets without the amount of degradation these have.

 

[Image: Paperless.png, screenshot of the two scanned POS tickets in Paperless]

 


12 hours ago, w0ndersp00n said:

I still prefer Mayan, but its resource requirements are through the roof...

I read your comments and I want to dive deep into your findings, but because of my own ideas I also felt I should read the Papermerge docs, which I have now done. I could see the potential there to index all files but leave the structure as it is.

On the other hand, I agree with you that metadata is key, but at least part of it should be kept inside the document. AND before you put everything into the DMS, testing the export capabilities seems essential to me.

 


I haven’t tested exporting documents yet. I did notice that when downloading a file, the filename will become ‘Date of the file_Title of document’.

 

Title of document is the original filename, if you didn’t change it.

 

Documentation regarding exporting all files: https://paperless.readthedocs.io/en/latest/utilities.html#the-exporter

 

I’ve been using Paperless for a few moments now and I have to say it really seems to be what I was looking for.

 

On a technical level I’ve seen that the documents are saved with numeric filename (00000010.pdf), while Mayan uses UIDs. I don’t know which is better, but since Paperless seems to have quite a large community of home users, I assume that any issues regarding this would’ve been noticed by now.


This sounds promising :)

1 hour ago, w0ndersp00n said:

I have to say it really seems to be what I was looking for.

Who would have thought so a couple days ago - lucky strike.

 

Quote

on their documentation:

If you’re using Docker, you can set a restart-policy in the  docker-compose.yml  to have the containers automatically start with the Docker daemon.

Mayan simply boots automagically within the boot process.  How do you handle it with Paperless?

 


18 hours ago, Tido said:

Mayan simply boots automagically within the boot process.  How do you handle it with Paperless?

 

This is the same with Paperless as it is with Mayan: you set the restart policy in the docker-compose file or in the docker run command. I use “unless-stopped”.


On 4/15/2020 at 4:51 PM, w0ndersp00n said:

I use “unless-stopped”

Sounds good to me. Now I just need to figure out how and where to place this setting.  Here are all options: https://docs.docker.com/compose/compose-file/#/deploy#restart

docker-compose.yml:

services:
    webserver:
        build: ./
        # uncomment the following line to start automatically on system boot
        # restart: always

always - Always restart the container if it stops. If it is manually stopped, it is restarted only when Docker daemon restarts or the container itself is manually restarted.

unless-stopped - Similar to always, except that when the container is stopped (manually or otherwise), it is not restarted even after Docker daemon restarts.

 

For beginners like me:  Just open the file in your favorite text editor and remove the leading #, but the line must then line up with the build: line above it. Delete spaces until it aligns.

 

 

Although the above works, somebody recommended doing this with "systemd".

https://stackoverflow.com/questions/30449313/how-do-i-make-a-docker-container-start-automatically-on-system-boot/39493500#39493500

 

Edited by Tido

Voila, back in town :)

 

On 4/13/2020 at 9:21 AM, w0ndersp00n said:

In my case I also had to disable inotify.

I don't understand. The man page says: Inotify can be used to monitor individual files, or to monitor directories.

So this seems to be a necessary function. I looked in the Paperless log and saw by accident that the  PARSE FAILURE for /consume/Kaufbeleg_digitec_AMD_Ryzen_186722.pdf  was actually caused by missing permissions on the file I had pasted there.  So, quick and dirty: chmod 0777 *

 

Reading my comment above: my PDF files were scanned, and it grabbed the right invoice date from the file. As I am writing this... is your step below maybe only necessary for scanned OCR docs, and if so, why? I mean, the sorting order gets destroyed.

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

You can set the date order (default YMD, I set it to DMY)

  • This is very worthwhile, because with this, Paperless will read the OCR'ed file and detect the date of the file. I successfully tested this with some invoices.

 

 

On 4/13/2020 at 9:21 AM, w0ndersp00n said:

So neither Mayan or Paperless will (by default) rename files.

I guess I wanted too much. So, let's start small:

The scanner shall put the scanned documents into a folder. Based on the document, I would like to use a naming scheme, and a tool with shortcuts can help to rename the scanned document easily (Autokey-py3).  Finally, it should be put into the consume folder of Paperless. Something like this could move it based on a rule: https://github.com/benjaminoakes/maid   or a tool that, 5 minutes after the file name has changed from the scanner's format, puts it into the consume folder of Paperless.
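
The "move it 5 minutes later" rule could look roughly like the Python sketch below. It is not maid and not a Paperless feature. The scanner naming pattern `SCAN_000001.pdf` and the folder names are assumptions. Because a rename does not reliably update a timestamp that can be read back later, the sketch uses the last modification time as an approximation of "left alone for 5 minutes".

```python
import re
import shutil
import time
from pathlib import Path

# Assumption: the scanner writes names like SCAN_000001.pdf. Adjust to your device.
SCANNER_NAME = re.compile(r"^SCAN_\d+\.pdf$", re.IGNORECASE)
MIN_AGE_SECONDS = 5 * 60


def ready_to_consume(path: Path, now: float) -> bool:
    """True if the name is no longer a scanner default and the file is at least 5 minutes old."""
    if not path.is_file() or SCANNER_NAME.match(path.name):
        return False
    return now - path.stat().st_mtime >= MIN_AGE_SECONDS


def move_ready_files(inbox: Path, consume: Path) -> list:
    """Move every ready file from inbox to consume and return the moved names."""
    now = time.time()
    moved = []
    for path in sorted(inbox.iterdir()):
        if ready_to_consume(path, now):
            shutil.move(str(path), str(consume / path.name))
            moved.append(path.name)
    return moved
```

Run from cron or a systemd timer every minute, `move_ready_files(Path("/scans/inbox"), Path("/paperless/consume"))` would leave untouched scans and freshly written files where they are and hand over only the renamed ones. Both paths are placeholders.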

 

Import/export:  the file name is changed to lowercase during import.

 

"Correspondents you can set some Match keywords"  - in your case either a person or a shop name?
Did you have to add  MediaMarkt  so it would match it to the bon?

 


 

2 hours ago, Tido said:

Voila, back in town :)

 

I don't understand. The man page says: Inotify can be used to monitor individual files, or to monitor directories.

So this seems to be a necessary function. I looked in the Paperless log and saw by accident that the  PARSE FAILURE for /consume/Kaufbeleg_digitec_AMD_Ryzen_186722.pdf  was actually caused by missing permissions on the file I had pasted there.  So, quick and dirty: chmod 0777 *

 

In my case `disable-notify` is needed, because my consume folder is on an NFS network location. inotify doesn't work on network locations, so new files would never be processed. Because I'm using NFS, permissions are not an issue (if you set the UID and GID to the same values as on the NFS server).
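
To illustrate what a consumer has to do when inotify events never arrive, here is a minimal polling sketch in Python. It is not Paperless's own consumer code. The function names and the 10-second interval are assumptions chosen for the example.

```python
import time
from pathlib import Path


def snapshot(folder: Path) -> dict:
    """Map each file name in the folder to its (size, mtime) pair."""
    result = {}
    for p in folder.iterdir():
        if p.is_file():
            st = p.stat()
            result[p.name] = (st.st_size, st.st_mtime)
    return result


def stable_new_files(previous: dict, current: dict, seen: set) -> list:
    """Names whose size and mtime did not change between two polls and that were not reported yet."""
    return sorted(
        name for name, info in current.items()
        if previous.get(name) == info and name not in seen
    )


def watch(folder: Path, interval: float = 10.0):
    """Yield each file once it has stopped changing, by polling instead of using inotify."""
    seen = set()
    previous = snapshot(folder)
    while True:
        time.sleep(interval)
        current = snapshot(folder)
        for name in stable_new_files(previous, current, seen):
            seen.add(name)
            yield folder / name
        previous = current
```

Comparing two consecutive snapshots also avoids picking up a file that is still being copied over the network, at the cost of a delay of up to two polling intervals.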

 

Quote

Reading my comment above: my PDF files were scanned, and it grabbed the right invoice date from the file. As I am writing this... is your step below maybe only necessary for scanned OCR docs, and if so, why? I mean, the sorting order gets destroyed.

 

Regarding the date: in the Netherlands we use DMY dates. I noticed that Paperless wouldn't detect any correct date when this was set to YMD.
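
Why the date order matters can be shown with plain Python. This sketch only illustrates the ambiguity; it does not reproduce how Paperless parses dates, and the hyphen-separated format strings are an assumption.

```python
from datetime import datetime

# One strptime format per date order; the separator is an assumption.
FORMATS = {"DMY": "%d-%m-%Y", "MDY": "%m-%d-%Y", "YMD": "%Y-%m-%d"}


def parse_with_order(text: str, order: str):
    """Parse text under the given date order, or return None if it does not fit."""
    try:
        return datetime.strptime(text, FORMATS[order]).date()
    except ValueError:
        return None
```

With the string `02-07-2015`, the DMY order yields 2 July 2015, the MDY order yields 7 February 2015, and the YMD order finds no valid date at all, which is in line with no correct date being detected under the YMD setting.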

 

Quote

I guess I wanted too much. So, let's start small:

The scanner shall put the scanned documents into a folder. Based on the document, I would like to use a naming scheme, and a tool with shortcuts can help to rename the scanned document easily (Autokey-py3).  Finally, it should be put into the consume folder of Paperless. Something like this could move it based on a rule: https://github.com/benjaminoakes/maid   or a tool that, 5 minutes after the file name has changed from the scanner's format, puts it into the consume folder of Paperless.

 

I guess this idea is nice, but I still don't understand how that script would be able to define the correct filename. E.g. my scanner scans to my NAS and it sets a default filename, which always is the serial number of the scanner, appended with a number. So ABC7483393939_000001.pdf. There would be no way for any software to know what this file is. So that's where Paperless and OCR come in, reading and automatically tagging it.

 

I guess the only way to make this work is indeed to scan to a 'pre-Paperless' location and pre-process it. Paperless does however support post-OCR scripts, so maybe it would be worthwhile to check that out. I assume you can change the title (filename) of the consumed file based on the OCR results.

 

Quote

"Correspondents you can set some Match keywords"  - in your case either a person or a shop name?
Did you have to add  MediaMarkt  so it would match it to the bon?

 

Yes, so in the match field I've added `mediamarkt "media markt" redcoon`, so all scans containing those strings will automatically be assigned to correspondent MediaMarkt.
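
The behaviour described here, where any of the listed terms triggers the correspondent, can be approximated in a few lines of Python. This is a sketch of "any term" substring matching only. Paperless's real matcher may differ (for example, in how it treats word boundaries or other matching modes), and the function names are made up.

```python
import shlex


def parse_match_field(field: str) -> list:
    """Split a match field into lowercase terms, keeping quoted phrases together."""
    return [term.lower() for term in shlex.split(field)]


def matches_any(field: str, ocr_text: str) -> bool:
    """True if at least one term from the match field occurs in the OCR text."""
    text = ocr_text.lower()
    return any(term in text for term in parse_match_field(field))
```

For the field `mediamarkt "media markt" redcoon`, the terms are `mediamarkt`, `media markt` and `redcoon`, so a receipt whose OCR text contains "Media Markt" would be assigned even though the name is written as two words.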


After reading the whole documentation it is time to wrap up some of the open points.

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

The downside still is that documentation is lacking

Do you have particular things in mind, or is it just in general? Or is it maybe not structured very well?

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

need to add 'ghostscript-fonts' to the added packages,

Is this a large package or rather a small one, so that it makes sense to include it in any case?

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

While consuming, it's impossible to add new tags or correspondents due to SQLite being locked.

But there is a command to re-run the tagging algorithm

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

additionally mounting paperless.conf via the volumes in both containers

Where did you find this information, or how did you do that? Did you define the volumes outside of /var/lib/docker/?

 

On 4/16/2020 at 11:18 PM, w0ndersp00n said:

in the Netherlands we use DMY dates.

I guess this is the case for Europe. However, it did it correctly for me without any changes, but reading all the documentation I have to re-check if it extracted the date from the file name.

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

I've set the OCR quality to 150dpi instead of 300dpi,

But you still scan documents at 300 dpi?

 

On 4/16/2020 at 11:18 PM, w0ndersp00n said:

So ABC7483393939_000001.pdf. There would be no way for any software to know what this file is.

One can support that process if the scanner allows you to configure Button 1 | Button 2 to scan with a different name, as described in the documentation:

The example below is for a Brother ADS-2400N, a scanner that allows different names to different hardware buttons (useful for handling multiple entities in one instance), but insists on adding  _<count>  to the filename.

# Brother profile configuration, support "Name_Date_Count" (the default
# setting) and "Name_Count" (use "Name" as tag and "Count" as title).
PAPERLESS_FILENAME_PARSE_TRANSFORMS=[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]
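
To see what these two transforms do to a file name, they can be applied by hand with Python's `re` module. This is a sketch under two assumptions that the quoted documentation does not state: that matching is case-insensitive and that the first matching transform wins. The sample file names are invented.

```python
import json
import re

# The value of PAPERLESS_FILENAME_PARSE_TRANSFORMS from above, as a JSON string.
RAW = r'''[{"pattern":"^([a-z]+)_(\\d{8})_(\\d{6})_([0-9]+)\\.", "repl":"\\2\\3Z - \\4 - \\1."}, {"pattern":"^([a-z]+)_([0-9]+)\\.", "repl":" - \\2 - \\1."}]'''
TRANSFORMS = json.loads(RAW)


def transform(filename: str) -> str:
    """Apply the first transform whose pattern matches; otherwise return the name unchanged."""
    for t in TRANSFORMS:
        # Assumption: case-insensitive matching, first match wins.
        new_name, count = re.subn(t["pattern"], t["repl"], filename, flags=re.IGNORECASE)
        if count:
            return new_name
    return filename
```

With these rules, `invoice_20200425_121500_000001.pdf` becomes `20200425121500Z - 000001 - invoice.pdf`, and `invoice_000001.pdf` becomes ` - 000001 - invoice.pdf`, so the button name ends up last and the counter in the middle.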

 

On 4/16/2020 at 11:18 PM, w0ndersp00n said:

support POST-OCR scripts

Indeed, something I have to investigate for documents from the scanner.

Not necessary for documents you receive via Email and rename it... in this situation my idea: a tool can help with keyboard-shortcuts to easy rename the document (Autokey-py3)  :)

 

In the documentation I also came across this suggestion. As I understand it, the OCR text is only stored within Paperless, but if you export the PDF you may want it to be searchable too?

Pre-Processing: A free tool to OCR a PDF and add a text "layer" in the original file, making a searchable PDF.   As tesseract is mentioned as well, does this have the side effect of running OCR twice?

 

When I write a text document or a spreadsheet, I have been maintaining the metadata for years already, because I believe that some day I will have a CMS that can read that data, and this will support my search for it  :thumbup:

Metadata-driven: reading the metadata from PDFs and others would be useful. Did you come across something that supports this in general, not only for paperless?

 

On 4/13/2020 at 1:16 PM, w0ndersp00n said:

And if I receive an Office-type of document, it isn't very hard converting it to PDF.

Did you look for or write a parsers.py?

 


On 4/13/2020 at 1:16 PM, w0ndersp00n said:

This is very worthwile, because with this, Paperless will read the OCR'ed file and detect the date of the file. I successfully tested this with some invoices.

While searching the issues for some help/ideas regarding changing the filename script, I came across this issue/PR that would explain to me how Paperless found the document date even though I didn't make the change you mentioned.

 

Add support for a heuristic that extracts the document date from its text #291  https://github.com/the-paperless-project/paperless/pull/291

 


On 4/16/2020 at 11:18 PM, w0ndersp00n said:

Paperless however does support POST-OCR scripts, so maybe it would be worthwile to check that out

Fair enough, but where do I place such a file? In the documentation it says:

# After a document is consumed, Paperless can trigger an arbitrary script if you like. This script will be passed a number of 
# arguments for you to work with.  The default is blank, which means nothing will be executed.  For more
# information, take a look at the docs:
# http://paperless.readthedocs.org/en/latest/consumption.html#hooking-into-the-consumption-process
PAPERLESS_POST_CONSUME_SCRIPT="/path/to/an/arbitrary/script.sh"

/home/paperless/  because this one is created in step 7 of the Docker installation?  (I guess so, unless I simply got it through one of my many rounds of trial and error.)

Well, I'll give it a shot:   /home/paperless/post_consume_script.sh
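
A first post-consume script does not need to do more than record what it was called with, which also shows whether the hook fires at all. The Python sketch below makes no assumption about which arguments Paperless passes; it simply logs all of them. The log location is hypothetical, and the script has to be executable and reachable at the configured path from inside the container that runs the consumer.

```python
#!/usr/bin/env python3
import sys
import tempfile
from datetime import datetime
from pathlib import Path

# Hypothetical log location inside the container's temp directory.
LOG_FILE = Path(tempfile.gettempdir()) / "paperless_post_consume.log"


def format_entry(args: list, now: datetime) -> str:
    """One log line: timestamp followed by every argument the script received."""
    return now.isoformat(timespec="seconds") + " " + " | ".join(args) + "\n"


def main(argv: list) -> int:
    """Append one line per consumed document to the log file."""
    with LOG_FILE.open("a", encoding="utf-8") as log:
        log.write(format_entry(argv[1:], datetime.now()))
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Once a log line shows up after each consumed document, the argument list in it tells you what a later renaming step would have to work with.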

 


On 4/25/2020 at 12:25 PM, Tido said:

After reading the whole documentation it is time to wrap up some of the open points.

 

Do you have particular things in mind, or is it just in general? Or is it maybe not structured very well?

 

In general, a lot of things are not documented or are documented as comments within the code and config files. So besides the documentation, you need to read the code and config files to figure out how everything works.

 

On 4/25/2020 at 12:25 PM, Tido said:

Is this a large package or rather a small one, so that it makes sense to include it in any case?

 

This is a small package, but probably not worthwhile since it's only helpful for storing text-based documents, which isn't the core use case.

 

On 4/25/2020 at 12:25 PM, Tido said:

But there is a command to re-run the tagging algorithm

 

True, but it doesn't fix the fact that the database is locked during operations, making it sometimes annoying to work with Paperless. I'm going to find out if it is possible to use MariaDB or PostgreSQL, since they only lock records and not the complete database for writing.

 

On 4/25/2020 at 12:25 PM, Tido said:

Where did you find this information  or  how did you do that. Did you define the Volumes outside of /var/lib/docker/ ?

 

I didn't, it's not documented. I've read the docker documentation and found out how I was able to use my own locations for volumes. I'll upload my docker-compose.yml sometime.

 

 

On 4/25/2020 at 12:25 PM, Tido said:

But you still scan documents at 300 dpi?

 

Yes, the scanned documents are still 300dpi. I actually think it is 600dpi in my case.

 

On 4/25/2020 at 12:25 PM, Tido said:

Did you look for or write a parsers.py?

 

Yes, some people are trying this, but it isn't very easy since Paperless at its core wasn't designed for this.

 

 

17 hours ago, Tido said:

fair enough, but where do I place such a file? In the documentation it says: 


# After a document is consumed, Paperless can trigger an arbitrary script if you like. This script will be passed a number of 
# arguments for you to work with.  The default is blank, which means nothing will be executed.  For more
# information, take a look at the docs:
# http://paperless.readthedocs.org/en/latest/consumption.html#hooking-into-the-consumption-process
PAPERLESS_POST_CONSUME_SCRIPT="/path/to/an/arbitrary/script.sh"

/home/paperless/  because this one is created in step 7 of the Docker installation?  (I guess so, unless I simply got it through one of my many rounds of trial and error.)

Well, I'll give it a shot:   /home/paperless/post_consume_script.sh

 

You should probably place the script in a folder on your host and then use file mapping to map a file from the host to the container: https://docs.docker.com/compose/compose-file/#volumes

 

 


 

5 hours ago, w0ndersp00n said:

find out if it is possible to use MariaDB or PostgreSQL, since they only lock records

Looks like you are not the only one: https://github.com/the-paperless-project/paperless/issues?q=is%3Aissue+Mysql

 

On 4/25/2020 at 12:25 PM, Tido said:

additionally mounting paperless.conf via the volumes in both containers

I found that in the meantime. You can add the .conf file in the   docker-compose.env.  Just like you wrote before, the documentation is spread everywhere. Since we have found some gaps, we can submit a PR or two to clean it up, so that more people use it and improve it :)

 

6 hours ago, w0ndersp00n said:
Quote

parsers.py?

 

Yes, some people are trying this, but it isn't very easy since Paperless at its core wasn't designed for this.

Do you have a link?  Have you seen anywhere that people try to read the metadata from documents?

 

5 hours ago, w0ndersp00n said:

You should probably place the script in a folder on your host and then use file mapping to map a file from the host to the container: https://docs.docker.com/compose/compose-file/#volumes

docker-compose.env: will this be read every time the container is started, or only if I re-create the image?  Because the   paperless.conf   is in there, and within its TWEAK section I have declared my path:  PAPERLESS_POST_CONSUME_SCRIPT="/home/paperless/script.sh"

However, as I wrote over here, nothing happened :-(

 


@w0ndersp00n have you ever heard of: https://teedy.io/

Because Eugen of Papermerge mentioned its 1.4 release on https://www.reddit.com/r/selfhosted/comments/ibwuj2/papermerge_14_out/?sort=old  I learned about Teedy. Selfhosting is free.

Teedy is an open source, lightweight document management system for individuals and businesses.

Oh, while reading the readme.md:  Java 8 

 

Docker for  ARM  was mentioned on Reddit:  https://hub.docker.com/r/jdreinhardt/teedy/tags

 


I never heard of Teedy before... It has a beautiful UI though. But considering it's written in Java, I can't call it lightweight. Every Java app I run uses about 6 times the amount of memory compared to Python apps. Rust or C/C++ would be best, considering most ARM boards don't get more than 4 GB of RAM.

 

I might try it out on an old C2, to get a feeling of how it compares to the other ones.


Hello again,

 

any progress on that front?

On 9/7/2020 at 9:17 PM, w0ndersp00n said:

I might try it out on an old C2, to get a feeling of how it compares to the other ones.

 

You won't believe it, but Paperless-NG (something I had thought about myself) is born: https://github.com/jonaswinkler/paperless-ng  and it sounds promising: https://github.com/the-paperless-project/paperless/issues/711

Apart from that, this large PR has finally received approval, and the author wrote to me that your problem should be fixed by it as well: https://github.com/the-paperless-project/paperless/pull/652

 

You think this is enough? No, there is more: https://github.com/eikek/docspell  a new project (around 12-14 months old) that sounds interesting too; programming languages: Scala 41.8%, Elm 41.7%.

Last but not least, I haven't looked further into https://github.com/ciur/papermerge. He once wrote that he is working on it full-time, but if he doesn't make any money from it and there is no active community around it, I guess it will sooner or later go down the same road as Paperless.
