How do you excise a random list of eBooks from a library database?

I had a new project tossed my way: remove a set of eBooks from our database, which holds about 30,000 eBooks. The fun part is that the only thing in the removal list to match on is a small number buried inside a large field, the URL used to view the eBook.

I set about building a Python script to do this rather than have the librarians look up each book separately.

Below is the final version of a program that takes a list of all the eBooks at our library and a list of eBooks to remove from the catalog, and produces a list of ISBN numbers so I could edit the records.
The removal list only came with a special URL used to reach each book, so I had to isolate the unique ID in that URL in both the removal file and the list of all eBooks and compare the two. I have tried to comment heavily so that the program makes sense and helps someone.
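The heart of the job is snipping the value that sits between "&AN=" and "&site=" out of the URL. Here is that one step on its own as a minimal sketch (the URL below is made up for illustration); the full program follows.

# Standalone sketch of the ID-snipping step; the URL is a made-up example
url = "http://example.com/login?custid=s0000000&db=nlebk&AN=8945&site=ehost-live"
start = url.find("&AN=") + len("&AN=")  # index just past "&AN="
end = url.find("&site=", start)         # index of "&site=" after the ID
unique_id = url[start:end]              # gives "8945"
print unique_id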


# Created on a Windows system with IDLE

# Read in the deleted_ebooks file and build lists of the unique IDs and titles
# Read ebooksall2.txt one line at a time and compare it to the Remove list
# Write each matching record2 line out to export.txt.
# The goal is to extract the item id of ebooks the library will delete
# due to consortium ebook deletion.

print " first the housekeeping" # So you know we have started
# Open file for read only access, change the filename to your needs
record = open(r"c:\data\ebookRemoval\deleted_ebooks.txt", "r")

# Open a file for output, write access, change the filename to your needs
output = open(r"c:\data\ebookRemoval\export.txt","w")

# Open files for output, write append access, change the filenames to your needs
output3 = open(r"c:\data\ebookRemoval\etitle.txt","a")
output2 = open(r"c:\data\ebookRemoval\isbn.txt","a")

# Set up the working variables
a = 0  # position just left of the book id number in the url
a1 = "&AN=" # Text to compare, found just before the id
b = 0  # position just right of the book id number in the url
b1 = "&site="  # Text to compare, found just after the id

w = "" # holds the element of WorkingRemove for testing
m = ""  # holds the result of finding the unique ID in the URL
m1 = "" # used as temp storage in the searching process
m2 = "" # used as temp storage in the searching process
n = ""  # temp storage for Remove[x] comparing the unique ID in url

u = "856  4"  # Text to compare  for 856 field for URL
ui = 0   # element index in list of 856

isbn1 = "035 " # 035 field text to compare 
isbni = ""  # Temporary storage of isbn field. As 035 it holds any kind of system comtrol number
isbn = "" # The isbn snipped from isbn1
i = 0     # The element index of the isbn field

xxx = "not present" # put in list index 0 for to use missing fields. 

ExportTitle = "" # Temporary storage of Title and "|"
Title = []  # The Title of the removed ebook ooutput to the title.txt file to help locate failed comparisons

Remove = [] # The list of unique ID's snipped from the deleted list url's
r = "" # temporary storage of the unique ID while building WorkingRemove

WorkingRemove = [] # The entire input from, record, one file read, split by "|"

WorkingEbook = [] # The entire input from, record2, one file read, split by "|"

# Setup loop to read through the entire file for processing
# Load a line of text from the file object record into Line until the end of the file
print " build the list of items being removed to search for" # So you know we started this part
for Line in record:
      # Split the record into a list called WorkingRemove by the pipe symbol
      WorkingRemove = Line.split("|")

      m1 = WorkingRemove[4]   
      for x in range(0,(len(m1)-1)): # find the unique id location in this url
                   if m1[x:x+4] == a1: # find just before
                        a = x
      for x in range(0,(len(m1)-1)): # find just after
                   if m1[x:x+6] == b1:
                        b = x
      r = m1[a+4:b] # snip out the unique ID with the locations found
      Remove.append(r) # Append the unique ID to the list Remove
      Title.append(WorkingRemove[0]) # Capture the title at the matching index of Remove
      lengthRemove = len(Remove) # Update lengthRemove to the current length of the list Remove

record.close() # Close the deleted eBooks list file

print "Now we search through the list of items comparing the uique book id to the one in our database and saving the isbn of that title."
# Open file for read only access, change the filename to your needs
record2 = open("c:\data\ebookRemoval\ebooksall2.txt", "r")

# The loop below will cycle through each record in the file of items from the library until it reaches the end.

for Line2 in record2:
      # Split the record into a list called WorkingEbook by the pipe symbol
      WorkingEbook = Line2.split("|")
      WorkingEbook.insert(0,xxx) # Gives us a "not present" message for missing fields
      ui = 0 # Reset so a missing 856 field falls back to the "not present" entry at index 0
      i = 0  # Reset so a missing 035 field falls back to the "not present" entry at index 0

      for y in range(0,(len(WorkingEbook)-1)):
              w = WorkingEbook[y] # Put the field being tested in w

              if ui == 0:
                if w[:5] == u[:5]: # Test for the 856 field
                    ui=y           # This is the list index with the url and unique id in it
              if i == 0:                  
                if w[:4] == isbn1: # Test for the 035 field, that has the ISBN or other control number in it
                    i = y          # This is the list index with the control number in it

      isbni = WorkingEbook[i] # Put the list item into isbni for editing
      isbn = isbni[17:31] # Snip out the ISBN from the surrounding data (positions specific to this export)
      m2 = WorkingEbook[ui] # Copy the entire url field to m2
      m1 = m2[39:]  # Remove the local section by starting the copy after it

      for x in range(0,len(m1)):  # Find the unique id location in this url
              if m1[x:x+4] == a1: # Find the text just before the unique id
                   a = x          # The position just before the unique id
      for x in range(0,len(m1)):   # Find the unique id location in this url
              if m1[x:x+6] == b1:  # Find the text just after the unique id
                   b = x           # The position just after the unique id
      m = m1[a+4:b] # Snip out the unique ID with the locations just found and place it in m
      for x in range(0,len(Remove)): # Start the comparison loop based on the length of the Remove list
             n = Remove[x]      # Copy the deleted ID for testing to n
             if m == n: # Compare the current record ID to the deleted ID
                 output.write(Line2)    # If it matches, write out the entire record for reference
                 isbnout = isbni+"|"    # I chose to write out the entire 035 element due to variations in how the field was filled
                                        # Only half the items had an ISBN in the record
                                        # Simpler to use edit tools later to remove unwanted text
                 ExportTitle = Title[x]+"|" # Build the title with "|" delimiter
                 output2.write(isbnout) # Append isbnout, the entire 035 field, to the file at output2
                 output3.write(ExportTitle) # Output the matched Title
# Close all the files and end
record2.close()
output.close()
output2.close()
output3.close()

print " All Done!"
print lengthRemove # This is how many titles we tested for removal
print Remove       # This is the list of unique ids snipped from the urls for testing against

raw_input() # To keep the window open on gui systems

Here are two records of each type for use as examples.

This is the deleted eBook example:

100 Top Internet Job Sites : Get Wired, Get Hired in Today's New Job Market {Career Savvy Series} |9780585273907||2000||Ackley, Kristina.|ebook|Computer Science |Impact,||45732492|collection 1
101 Home Office Success Secrets |9780585323176||2000||Kanarek, Lisa.|ebook|Business, Economics and Management |Career Press,||45843138|collection 1

This is an example of the eBooks in the system:

01506cam 2200349La 45{92}0 |001 ocm42856319\ |003 OCoLC |005 20050111035605.0 |006 m\\\\\\\\u\\\\\\\\ |007 cr\cn- |008 970522s1997\\\\nyua\\\\sb\\\\001\0\eng\d |035 \\$a(Sirsi) i9780585089164 |040 \\$aN{dollar}T$cN{dollar}T$dOCL$dOCLCQ |020 \\$a0585089167 (electronic bk.) |050 14$aBF575.A85$bD37 1997eb |082 04$a158.2$221 |049 \\$aAMFA |100 1\$aDavidson, Jeffrey P. |245 14$aThe complete idiot's guide to assertiveness$h[electronic resource] /$cby Jeff Davidson. |246 3\$aIdiot's guide to assertiveness |246 3\$aAssertiveness |260 \\$aNew York :$bAlpha Books,$cc1997. |300 \\$axxiv, 337 p. :$bill. ;$c23 cm. |504 \\$aIncludes bibliographical references (p. [323]-324) and index. |533 \\$aElectronic reproduction.$bBoulder, Colo. :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in multiple electronic file formats.$nAccess may be limited to NetLibrary affiliated libraries. |650 \0$aAssertiveness (Psychology) |655 \7$aElectronic books.$2local |710 2\$aNetLibrary, Inc. |776 1\$cOriginal$z0028619641$w(DLC) 97073152$w(OCoLC)37792767 |856 4\$3Bibliographic record display$u,ip,url,cpid&custid=s6243341&db=nlebk&AN=8945&site=ehost-live |994 \\$a92$bAMF |999 \\$aBF575 .A85 D37 1997 eb$wLC$c1$i100020-1001$lONLINE$mMAIN$rY$sY$tONLINE-BK$u2/3/2005 
 01621cam 2200373La 45{92}0 |001 ocm42922747\ |003 OCoLC |005 20050111035606.0 |006 m\\\\\\\\u\\\\\\\\ |007 cr\cn- |008 970131s1997\\\\flu\\\\\sb\\\\000\0\eng\d |035 \\$a(Sirsi) i9780585105444 |040 \\$aN{dollar}T$cN{dollar}T$dOCL$dOCLCQ |020 \\$a0585105448 (electronic bk.) |050 14$aBF575.I5$bM55 1997eb |082 04$a158.2$221 |049 \\$aAMFA |100 1\$aMiller, Keith. |245 10$aCompelled to control$h[electronic resource] :$brecovering intimacy in broken relationships /$cJ. Keith Miller ; with a foreword by John Bradshaw. |250 \\$aRev. ed. |260 \\$aDeerfield Beach, Fla. :$bHealth Communications,$c1997. |300 \\$axviii, 258 p. ;$c22 cm. |504 \\$aIncludes bibliographical references (p. 253-256). |533 \\$aElectronic reproduction.$bBoulder, Colo. :$cNetLibrary,$d1999.$nAvailable via the World Wide Web.$nAvailable in multiple electronic file formats.$nAccess may be limited to NetLibrary affiliated libraries. |650 \0$aIntimacy (Psychology) |650 \0$aControl (Psychology) |650 \0$aInterpersonal conflict. |650 \0$aInterpersonal relations. |655 \7$aElectronic books.$2local |710 2\$aNetLibrary, Inc. |776 1\$cOriginal$z1558744614$w(DLC) 97005376$w(OCoLC)36364161 |856 4\$3Bibliographic record display$u,ip,url,cpid&custid=s6243341&db=nlebk&AN=11743&site=ehost-live |994 \\$a92$bAMF |999 \\$aBF575 .I5 M55 1997 eb$wLC$c1$i100021-1001$lONLINE$mMAIN$rY$sY$tONLINE-BK$u2/3/2005 

Finding Python in a haystack

My latest torment, I mean task, was to take two files from an eBook supplier for analysis. One was a list of the eBooks we have purchased and the other was the usage statistics for the eBooks.

The task was to take the statistics from the one file, but only for the eBooks we own, and put them in a separate file.

After determining which field matched (the ISBN, after some mass editing) and changing the delimiter of the files from tabs to a "|", I began the journey with Python.
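The delimiter change itself only takes a few lines; something like this would do it, assuming plain tab-separated exports (the filenames here are placeholders, not the supplier's real ones):

# One-off conversion from tab-delimited to pipe-delimited; filenames are placeholders
src = open("statistics_tabs.txt", "r")
dst = open("statistics_pipes.txt", "w")
for line in src:
    dst.write(line.replace("\t", "|"))
src.close()
dst.close()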

My first thought was that I would open both files and read each purchased record, comparing it to the records in the statistics file and saving the matched records to a new file. Awesome idea, poor logic. Python did it great for the first purchased record, then quit reading from the statistics file, most likely because the inner loop had exhausted the statistics file object on the first pass, leaving nothing for later records to read.

Here is the code:
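A rough sketch of that nested-loop idea (filenames and field positions are placeholders, not the original script):

# Sketch of the first attempt: for every purchased record, loop over the statistics file again
purchased = open("purchased.txt", "r")
stats = open("statistics.txt", "r")
matched = open("matched_stats.txt", "w")

for pline in purchased:
    isbn = pline.split("|")[0]            # assumed ISBN position in the purchased file
    for sline in stats:                   # BUG: the file object is exhausted after the first pass,
        if sline.split("|")[0] == isbn:   # so every later purchased record reads nothing
            matched.write(sline)

purchased.close()
stats.close()
matched.close()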


I then opted for a different logic path. Read in the purchased file first, putting all the ISBNs into a list, then read each statistics entry and compare it to the ISBN list. HUZZAH, it works, but it takes a long time.

Here is that code:
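And a matching sketch of the second approach, again with placeholder filenames and field positions rather than the original script:

# Sketch of the working approach: collect the purchased ISBNs first, then scan the statistics once
purchased = open("purchased.txt", "r")
owned_isbns = []
for pline in purchased:
    owned_isbns.append(pline.split("|")[0])   # assumed ISBN position
purchased.close()

stats = open("statistics.txt", "r")
matched = open("matched_stats.txt", "w")
for sline in stats:
    if sline.split("|")[0] in owned_isbns:    # linear search of a list, hence the long run time
        matched.write(sline)
stats.close()
matched.close()

Swapping the list for a set (owned_isbns = set(owned_isbns)) would turn each membership test into a hash lookup instead of a scan of the whole list, which should cut the run time considerably.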


I have left some debugging lines in, and I hope I commented enough to make it easy to follow. Let me know what you think.
