VB.NET - Reading the contents of a PDF

  #1 (permalink)  
Old 08-01-2006
WhapSumi's Avatar
Curious

Join Date: Jul 2006
Posts: 9
VB.NET - Reading the contents of a PDF

Hey all,

I’m currently creating a document management system in VB/ASP.NET for a hospital in the Cleveland area. I'm using a SQL database to store information about each document, such as contents, location, author, and other info. I've been able to read the contents of almost every document type programmatically using the MS Office Clipboard. However, I have not been able to read PDFs. I know there are some add-ins for .NET that may give me the ability to do this, but they are not open source or free to use. I’ve tried using IE to open the PDF first, but I have not been able to figure out how to copy the contents to the clipboard programmatically using IE. Does anyone have an idea how to do this, or maybe another way of doing it? Any help would be greatly appreciated!
  #2 (permalink)  
Old 08-01-2006
Zythryn's Avatar
Creating
Platinum Subscription
Sponsor
Re: VB.NET - Reading the contents of a PDF

Sorry, I can't help offhand, although I am pretty sure we did something like this at work.

Try this thread, some of the information there may be helpful for you:

http://groups.google.com/group/micro...6c45f7cf2713b5
__________________
"Treat the earth well: it was not given to you by your parents; it was loaned to you by your children. We do not inherit the earth from our ancestors, we borrow it from our children.

(Ancient Indian Proverb)"

  #4 (permalink)  
Old 08-02-2006
C1ay's Avatar
¿42?
Hypography Staff Member
Administrator
Senior Editor
Editor

Join Date: Feb 2005
Location: 33.78N 84.66W
Posts: 5,756
Re: VB.NET - Reading the contents of a PDF

Sorry, I think you're up a creek without a paddle. Generating PDFs automatically from text or marked-up content is pretty straightforward; going the other way is not. In PDFs that I have tried to convert back to text, I have found quite a variety of techniques used to prevent this.

In one document I worked on, I found that when I selected all of the text in the PDF and pasted it into a file editor, all of the words appeared backwards even though they displayed correctly in the PDF. I had to write a script to actually reverse each word.
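
A minimal sketch of that kind of fix-up script, in Python rather than VB.NET. It assumes the pasted text has each word's letters reversed while the word order and spacing are intact; the function name `reverse_each_word` is made up for illustration. Punctuation attached to a word gets reversed along with it, so a real document would probably need extra handling.

```python
import re


def reverse_each_word(text: str) -> str:
    """Reverse the characters of every run of non-whitespace, leaving whitespace untouched."""
    # \S+ matches one "word" at a time; the slice [::-1] reverses it.
    return re.sub(r"\S+", lambda match: match.group(0)[::-1], text)


if __name__ == "__main__":
    # Hypothetical sample of text pasted from such a PDF.
    print(reverse_each_word("ehT kciuq nworb xof"))
```

Using a regular expression instead of `split()` and `join()` keeps the original spacing and line breaks exactly as they were pasted.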

In another, I found that the text transferred via the clipboard had hidden characters inserted between all of the characters in the text. They didn't display, but they were embedded in the binary information.
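
A rough Python sketch of how such hidden characters could be stripped. Which characters were actually inserted is not stated above, so this assumes they were non-printing Unicode control or format characters (NUL bytes, zero-width spaces, soft hyphens and the like); the function name `strip_hidden_characters` is hypothetical.

```python
import unicodedata

# Whitespace control characters that are worth keeping in extracted text.
KEEP = {"\n", "\r", "\t"}


def strip_hidden_characters(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, except newlines and tabs."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )


if __name__ == "__main__":
    # Hypothetical clipboard text with a zero-width space, a NUL and a soft hyphen mixed in.
    dirty = "P\u200bD\x00F\u00ad text\n"
    print(repr(strip_hidden_characters(dirty)))
```

If the inserted characters turn out to be ordinary visible ones, this filter will not catch them and a document-specific rule would be needed instead.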

In yet another, I found the whole document was actually an inverted mirror image of the file displayed as a PDF, i.e. the whole document was turned around backwards and then flipped upside down. It required a custom script to straighten out as well.

For plain PDFs where such techniques are not used you might be successful, but given the great variety of techniques I have seen for preventing document copying, you will need an equal variety of program functions to handle them.
__________________
Clay

Editor and Forum Administrator
stego anyone?
Add yourself to Hypography's Frappr.
"There are only 10 kinds of people in the world --
.....Those who understand binary, and those who don't."
"Draw no conclusions before their time."
  #5 (permalink)  
Old 08-02-2006
WhapSumi's Avatar
Re: VB.NET - Reading the contents of a PDF

Well, I figured out a way to do it, but it requires the full version of Adobe Acrobat installed on your PC so that you can access the Adobe APIs (which doesn't technically qualify as a free way to do it). Here is the code I used to read the contents of a PDF. You will have to add a reference to the Adobe APIs in your project:

' doctext and doctype are assumed to be String variables declared elsewhere
' (for example, as fields on the form); they are not declared in this snippet.
Dim objPDFDoc As New AcroPDDoc
Dim objPDFPage As AcroPDPage
Dim objPDFRectTemp As Object
Dim objPDFRect As New AcroRect
Dim objPDFTextSelection As AcroPDTextSelect
Dim sbText As New System.Text.StringBuilder

' Integer (32-bit) is enough for page and word counts; Long is 64-bit in VB.NET.
Dim intPageCount As Integer
Dim intPage As Integer
Dim intTextIndex As Integer

' Open reports failure through its return value, so check it before reading.
If objPDFDoc.Open(tbdocdisplaypath.Text) Then

    intPageCount = objPDFDoc.GetNumPages

    For intPage = 0 To intPageCount - 1

        ' Build a rectangle that covers the whole page.
        objPDFPage = objPDFDoc.AcquirePage(intPage)
        objPDFRectTemp = objPDFPage.GetSize
        objPDFRect.Left = 0
        objPDFRect.right = objPDFRectTemp.x
        objPDFRect.Top = objPDFRectTemp.y
        objPDFRect.bottom = 0

        ' Select all of the text on this page.
        objPDFTextSelection = objPDFDoc.CreateTextSelect(intPage, objPDFRect)

        ' The selection can be Nothing if it could not be created (e.g. no text on the page).
        If objPDFTextSelection IsNot Nothing Then
            For intTextIndex = 0 To objPDFTextSelection.GetNumText - 1
                sbText.Append(objPDFTextSelection.GetText(intTextIndex))
            Next
        End If

        ' Separate pages with a line break.
        sbText.Append(vbCrLf)

    Next

    doctext = doctext & sbText.ToString()
    objPDFDoc.Close()

End If

doctype = "PDF"
  #6 (permalink)  
Old 08-02-2006
C1ay's Avatar
Re: VB.NET - Reading the contents of a PDF

Try it on this one... http://cepi.auctionsourceonline.com/pdf/gearboxes.pdf
  #7 (permalink)  
Old 08-02-2006
alexander's Avatar
Resident USSRian
Hypography Staff Member
Administrator
Gallery Curator
Dev Team Member
Re: VB.NET - Reading the contents of a PDF

You could look into how gpdf or xpdf do it... except that you would probably need to write some API integration between the gpdf libraries and VB.
__________________
And remember that great question that Pierre-Simon Laplace and Sir Isaac Newton, Andrei Markov and David Hilbert, Richard Feynman and Enrico Fermi, Albert Einstein and Edmund Halley did not come to ask throughout all of their dedication and work: "Who the hell is IMing me?"


This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 License.
  #8 (permalink)  
Old 08-02-2006
WhapSumi's Avatar
Re: VB.NET - Reading the contents of a PDF

C1ay,

That PDF has security enabled with password protection, so my program can't read it (unless you know the password). I would first have to write a program to hack it or use a third-party hacker tool in my code. That, however, would make the import process unbearably slow, which it already seems to be. It took my program 7 hours to import the contents of 15 PDFs of 450 pages each, totalling about 36 MB. I'm pretty sure it's not my program; the Adobe APIs seem to be causing the slowness. I'm also pretty sure that the azzholes at Adobe put this lovely feature in by design (I've always hated Adobe applications. There is a special place in hell set aside for Adobe programmers, I'm sure of it).

I still need to find a better solution for extracting text from PDFs so my quest continues tomorrow... I'm still open to any suggestions from anyone.
  #9 (permalink)  
Old 08-02-2006
C1ay's Avatar
Re: VB.NET - Reading the contents of a PDF

I'm only pointing out yet another PDF copy protection. The data you're after is in the PDF file, and you'll be OK as long as you're only trying to extract data that isn't protected one way or another. OTOH, I've seen a variety of methods employed to prevent what you're trying to do.
  #10 (permalink)  
Old 08-03-2006
WhapSumi's Avatar
Re: VB.NET - Reading the contents of a PDF

Thanks. I wasn't aware of how extensive Adobe's copy protection was. I'm going to tinker with OCR today to see if I can convert protected PDF files to TIFF images and then use OCR to "read" the document. I'll let you know how it goes.
Copyright © 2000-2008 Hypography
Part of the Hypography - Science for Everyone Network