So one of the subjects I'm doing this semester is Informatics 2. It's basically an introductory programming/comp.sci course. Coming up is a major assessment, worth 25% of our final unit mark. For this assessment, we have to write a web application to search/view movies from a database.
Simple enough, and the code is fairly trivial. I was working on code to scrape Amazon.com for movie images, so we could display them alongside the search results. Now, we have to use this online system called IVLE, which was written by the university (apparently). IVLE is essentially an online web-based IDE for developing in Python. It stores your files on a central server at the uni, and you develop your scripts through your browser and execute them remotely on the university's servers.
On the surface it sounds pretty good. If only the damn thing actually worked.
You see, IVLE is so buggy that at times it becomes near-unusable. Sometimes IVLE will incorrectly copy/paste text. Sometimes it won't properly save your work, causing you to lose data. And every now and then it has a habit of inserting phantom text into your code at random, causing all sorts of weirdness.
But those problems I can live with. Those bugs can be worked around with enough effort. No, the real problem is that IVLE fails at doing the one thing it was designed for: running scripts.
I can't do any work if I can't actually run my scripts through IVLE. If the IDE is crappy and buggy, I can work around that with enough effort. But this is just ridiculous.
So I'm trying to write a Python script to scrape Amazon for movie images. All it involves is unpickling about 4200 files, then for each movie downloading a page from Amazon, scraping the page for an image link, and downloading the image to a file.
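For the "scrape the page for an image link" step, here is a minimal sketch in modern Python 3 (not the code from my actual script). It assumes, purely for illustration, that the image we want is the first `<img>` tag on the page that has a `src` attribute; the real Amazon markup would need a more specific rule. The names `FirstImageFinder` and `first_image_url` are made up for this example.

```python
from html.parser import HTMLParser


class FirstImageFinder(HTMLParser):
    """Remembers the src of the first <img> tag that has one."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; keep only the first src seen.
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def first_image_url(html):
    """Return the src of the first <img> with a src attribute, or None."""
    finder = FirstImageFinder()
    finder.feed(html)
    return finder.src
```

Feeding it the downloaded page text gives back either a URL string or `None`, so the caller can simply skip movies with no image.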
Firstly, doing it one by one is too slow, since urllib is synchronous (blocking). So I used Python threading to run a bunch of threads to scrape multiple images in parallel. Problem solved, right? Well, IVLE can't even handle that. Creating a measly 32 threads doesn't work - I get a "cannot create any more threads" exception. You're kidding me, right? I had to drop it down to 16 simultaneous threads.
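For anyone curious what "a bunch of threads with a cap" looks like, here is a rough sketch. It is not my original code: it uses `concurrent.futures.ThreadPoolExecutor` from modern Python 3, which wasn't available in the Python I was stuck with at the time. `fetch_all` and `fetch_one` are hypothetical names; `fetch_one` stands in for whatever downloads and saves one image.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 16  # the most threads IVLE would let me create


def fetch_all(items, fetch_one, max_workers=MAX_WORKERS):
    """Call fetch_one(item) for every item, using at most max_workers threads.

    Returns a dict mapping each item to its result, or to the exception
    it raised, so one failed download doesn't kill the whole run.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool never runs more than
        # max_workers of them at once.
        futures = {pool.submit(fetch_one, item): item for item in items}
        for future, item in futures.items():
            try:
                results[item] = future.result()
            except Exception as exc:
                results[item] = exc
    return results
```

The point of a pool is that the thread count stays fixed no matter how many movies there are, so 4200 downloads never turn into 4200 threads.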
Well, 16 threads should still run pretty fast, right? Sure. IF IVLE ACTUALLY ALLOWED ME TO RUN MY SCRIPT. You see, IVLE is so buggy that 90% of the time, it throws up an "internal Python console error". And then the Python console becomes unusable - resetting it has no effect (it just throws up another internal console error).
So I have to log out of IVLE, and log back in again. This time I successfully start the script, but I soon discover that there's a minor bug in my program, so I hit "interrupt" to stop the program. I fix the bug, and then try running it again. OH WAIT, I SUPPOSE IVLE DOESN'T ACTUALLY LIKE PEOPLE ACCOMPLISHING ANY WORK. It throws another "internal Python console error".
Now rinse and repeat this experience. Many, many times. I spent 4 hours battling with IVLE and its incessant internal Python console errors. Obviously, IVLE either doesn't like my script or it handles creation/destruction of threads really, really badly.
Now, after spending an afternoon battling with IVLE, I finally got it to run without any problems. And after a few minutes, with only half the images downloaded, IVLE threw an error saying that I'd exceeded the CPU time limit. CPU time limits are fair enough; the simple solution is just to run the script a second time and skip over the work done in the first pass.
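Skipping the work from the first pass can be as simple as checking which image files already exist. Here is a small sketch in modern Python 3, with made-up names (`image_path`, `pending`, `save_image`) and an assumed layout of one `<movie_id>.jpg` file per movie. It writes to a temporary `.part` file first, so that an image cut off halfway by the CPU limit isn't mistaken for a finished one on the next run.

```python
import os


def image_path(image_dir, movie_id):
    """Where the finished image for a movie lives (assumed naming scheme)."""
    return os.path.join(image_dir, f"{movie_id}.jpg")


def pending(movie_ids, image_dir):
    """Return the ids that still have no finished image on disk."""
    return [m for m in movie_ids
            if not os.path.exists(image_path(image_dir, m))]


def save_image(image_dir, movie_id, data):
    """Write image bytes to a .part file, then rename it into place."""
    final = image_path(image_dir, movie_id)
    tmp = final + ".part"
    with open(tmp, "wb") as f:
        f.write(data)
    # The rename only happens once the whole file has been written.
    os.replace(tmp, final)
```

On the second run, you'd only hand `pending(all_ids, image_dir)` to the download threads instead of the full list.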
Wait, you mean I have to run the script a second time? And go through all that pain again!? Already I can hear IVLE laughing at me from beyond the deepest gates of Hell as I vainly try to finish my work before the due date.
Eugh.
Friday, October 23, 2009