Wikipedizer - Academic Dishonesty, with effort

And it can't even do my assignments by itself.

Let me set the scene. You're a student who's fed up with your teacher's busy-work assignments. You've had enough of going onto Wikipedia to copy-paste the definitions of 25-30 vocab words and terms. Thirty instances of searching on Wikipedia!

If only you had a magical program where you could give it all the terms, all the words, and it would chug away and spit out a finished assignment. You would have the five minutes you spent googling to yourself again! Imagine what you could do with all that time!

...but that could never be reality. You know your school disabled all Google Docs plugins and custom scripts. You can't be bothered to see if such a thing already exists. If only you were a novice at coding and could find a way to run C# code on a Chromebook, then maybe there'd be a chance. But that's not the case...

...or is it?

Chapter One: Cue Moonman

"Hey, Vsauce! Michael here." -Michael Stevens, of Vsauce

If you haven't picked up on it, I'm a novice in C#. I had little to no experience outside of making Keep Talking and Nobody Explodes mods with Unity, but if I could just get some input and output I thought I would be fine.

I started by looking at my options. I first tried making a plugin-type thing for Google Docs. I was delighted, and then promptly horrified, to see "Apps Script" available to me. I became horrified because Google's documentation of how to write Apps Script code was horrible! It took me an hour to get a simple test script going. I ran it, and a blessing in disguise popped up in my face: I did not have the permissions to edit a document with my code. I'd have to get permission from an administrator to actually run it, and I doubted the higher-ups would let some random snippet of code run willy-nilly, so I promptly gave up on Apps Script. And all seemed lost...

...it really didn't. I'd love to hype it up and say "but then I remembered I could use GitHub Codespaces to do exactly what I need", but that's just not true; I had GitHub Codespaces in my back pocket as a backup the whole time. I'd been on it during school before, and soon enough I was logged into my account and ready to do some coding! I had completely forgotten how to set up a C# Codespace, but it couldn't be that hard, right?

Spoiler alert: it took me a while. That was kind of my own fault; I was following a guide built around Microsoft's C# Codespace template, but in my defense, there is literally a C# option when making a new Codespace. After figuring that out, and doing all of the setup, I was ready to go.

Chapter Two: google dot com "how to scrape wikipedia"

What's a "web page"? Something ducks flock on?

After a little bit of coding, I already had a setup where you could enter as many terms as you wanted, it would go through those terms one by one and come up with a definition for each, and then it would spit the list back out with the definitions attached. Of course, the "coming up with definitions" part wasn't coded yet. That was Wikipedia's job anyway.
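Here's roughly the shape of that skeleton (a simplified sketch, not the exact code; GetDefinition is just a placeholder name for everything the next chapters cover):

    using System;
    using System.Collections.Generic;

    class Wikipedizer
    {
        static void Main()
        {
            // Collect terms until the user enters a blank line.
            var terms = new List<string>();
            Console.WriteLine("Enter terms, one per line (blank line to finish):");
            string? line;
            while (!string.IsNullOrWhiteSpace(line = Console.ReadLine()))
            {
                terms.Add(line.Trim());
            }

            // Look up each term and print the finished list.
            foreach (var term in terms)
            {
                Console.WriteLine($"{term}: {GetDefinition(term)}");
            }
        }

        // Placeholder -- this is where the Wikipedia scraping goes.
        static string GetDefinition(string term)
        {
            return "(definition goes here)";
        }
    }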

I thought this would be the hardest part of the project because I had zero familiarity with how to deal with web pages in general, much less with C# code. So there I was, googling away, asking "how can I get the html from a page C#???".

Every answer I got went way over my head until I found one that actually worked. I shamelessly pasted it in, and in a matter of seconds I was getting the HTML of Wikipedia pages! This was huge in its own right: now I could look at the source code of web pages again (the school had turned off Inspect Element). The program still needed the HTML for itself, of course, but having it hand me a copy to reference on the fly was a huge bonus.
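I won't pretend to reproduce the exact snippet I pasted, but the gist of fetching a page with .NET's HttpClient looks something like this (the User-Agent string and the way the URL is built are my own guesses at sensible defaults):

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class Scraper
    {
        // One HttpClient, reused for every request.
        private static readonly HttpClient client = new HttpClient();

        static async Task Main()
        {
            // Wikipedia prefers a descriptive User-Agent on automated requests.
            client.DefaultRequestHeaders.UserAgent.ParseAdd("Wikipedizer/1.0");

            string term = "Osmosis";                       // example term
            string url = "https://en.wikipedia.org/wiki/" + term.Replace(' ', '_');

            // Download the raw HTML of the article page.
            string html = await client.GetStringAsync(url);

            Console.WriteLine($"Downloaded {html.Length} characters of HTML.");
        }
    }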

The code still goes over my head, but I would like to say I now know a bit more about dealing with the internet in C#. It's an extremely niche bit of knowledge; I doubt I'm going to need to scrape the HTML of an entire page again for a long while, but it's cool and nice and fun, I guess.

I was surprised and pleased; the hardest part of the project was over, and faster than I expected, too! Now, it was on to the hardest-er part!

Chapter Three: How and how not to get paragraphs

"Measure once, cut twice, reglue, cut again" -Someone that isn't me

So I had the HTML. Great job, script! But it's not enough to just slap the HTML onto my assignment and call it "job done"; I think my teacher would definitely notice the abundance of HTML tags. I needed a way to excise the paragraphs from the HTML code.

Here's where I made a big dumb-dumb decision; if you're thinking of scraping the HTML from Wikipedia and then algorithmically extracting text from said HTML, learn from my mistakes. I originally tried finding the specific {div}s (using {curly brackets} because HTML does not like angle brackets inside of regular text) that the summary paragraph would be contained in, since I knew the layout wasn't likely to change between articles. This was a very bad idea, because I wanted everything inside the paragraph tag; that meant lots of recursion and weird stuff happening in my script, and it was all just messy and bad and not good.

The better approach I took later relies on the fact that no paragraph tags exist before the article begins, so you can just split the HTML into a LOT of pieces of text, breaking it into individual chunks at the start of each tag. Then, if a chunk of text starts with "{p}", the script knows it's a paragraph, and it copies everything after that until it runs into a {/p} tag. Huzzah!
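In code, that splitting approach comes out to something like this (a sketch of the idea, not my script line for line):

    using System.Collections.Generic;

    static class ParagraphFinder
    {
        // Split the HTML at every '<' so each chunk starts with a tag,
        // then collect the contents of every {p}...{/p} pair we run into.
        public static List<string> GetParagraphs(string html)
        {
            var paragraphs = new List<string>();
            bool insideParagraph = false;
            string current = "";

            foreach (string chunk in html.Split('<'))
            {
                if (chunk.StartsWith("p>") || chunk.StartsWith("p "))
                {
                    // A paragraph tag opens: start collecting from after its '>'.
                    insideParagraph = true;
                    current = "";
                    int close = chunk.IndexOf('>');
                    if (close >= 0) current = chunk.Substring(close + 1);
                }
                else if (chunk.StartsWith("/p>"))
                {
                    // The paragraph closes: store what we collected.
                    insideParagraph = false;
                    paragraphs.Add(current);
                }
                else if (insideParagraph)
                {
                    // Some other tag inside the paragraph: keep it (with its '<' back)
                    // so the cleanup steps in Chapter Four can deal with it later.
                    current += "<" + chunk;
                }
            }
            return paragraphs;
        }
    }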

I now had the paragraph tag by itself. I thought this would be the end of it; little did I know, it was only the halfway point.

Chapter Four: Making the paragraph readable

"Clean up, clean up, everybody clean up!" -The Clean Up Song

At this point I had the paragraph, but I still had to clean up HTML tags, broken Unicode characters (probably because I suck at coding), citation markers (these things -> [1]), and any information in parentheses (it's likely not needed). The code for all of these is pretty similar, except maybe the code for removing citations. I'll cover them in the order the program removes them.

Step 1: Removing Citation Markers - This is the step the program takes first, because it's the only practical point to remove them. The program depends on the HTML tags to spot the markers; once the tags are gone, the markers become indistinguishable from regular text and just append extra letters, like thisb. So the program breaks the paragraph down further into its individual tags, and uses that to remove everything inside any {sup} tag, which is where the markers live.
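A minimal sketch of that step, assuming the paragraph still has its tags in it (class and method names here are mine, not the program's):

    using System.Text;

    static class CitationStripper
    {
        // Remove everything inside {sup}...{/sup}, which is where
        // Wikipedia keeps its citation markers like [1].
        public static string RemoveCitations(string html)
        {
            string[] chunks = html.Split('<');
            var result = new StringBuilder(chunks[0]);   // text before the first tag
            bool insideSup = false;

            for (int i = 1; i < chunks.Length; i++)
            {
                string chunk = chunks[i];
                if (chunk.StartsWith("sup"))
                {
                    insideSup = true;                    // a citation marker starts
                }
                else if (chunk.StartsWith("/sup>"))
                {
                    insideSup = false;                   // it ends; keep the text after the tag
                    result.Append(chunk.Substring("/sup>".Length));
                }
                else if (!insideSup)
                {
                    result.Append('<').Append(chunk);    // keep every other tag untouched
                }
            }
            return result.ToString();
        }
    }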

Step 2: Removing HTML Tags - The remaining steps are pretty interchangeable, but my logic was that removing HTML tags as early as possible would lower the odds of something breaking. They all put a spin on the same formula: look for a character, in an HTML tag's case that's "<", and remove everything starting from that character and ending at a different character, in this case ">". HTML is a little special, though, because tags can be nested; that means keeping track of how many layers of tags you're in so you don't stop deleting text prematurely.
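That formula, with the layer counter, comes out to roughly this (again, a sketch rather than the real code):

    using System.Text;

    static class TagStripper
    {
        // Delete everything between '<' and its matching '>', keeping a
        // depth counter so a stray '<' before the previous '>' doesn't
        // make us stop deleting too early.
        public static string RemoveTags(string text)
        {
            var result = new StringBuilder();
            int depth = 0;

            foreach (char c in text)
            {
                if (c == '<')
                {
                    depth++;               // a tag starts (or nests)
                }
                else if (c == '>' && depth > 0)
                {
                    depth--;               // a tag ends
                }
                else if (depth == 0)
                {
                    result.Append(c);      // normal text, keep it
                }
            }
            return result.ToString();
        }
    }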

Step 3: Removing Broken Unicode - Broken Unicode showed up on my end as &#??;, with the question marks being a number. An "&" is unlikely to appear in a summary, in my opinion, but just in case, I also check the character after the "&" before I start deleting. This introduces an edge case where the program breaks if the "&" is the very last character, because there is no character after the last character, but I can just tell it not to do the check in that case.
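Here's what that could look like, edge case included (a sketch under the assumption that every broken character follows the &#...; pattern):

    using System.Text;

    static class EntityStripper
    {
        // Delete numeric character references like &#160; -- only start
        // deleting when '&' is immediately followed by '#', and never
        // look past the end of the string.
        public static string RemoveEntities(string text)
        {
            var result = new StringBuilder();
            int i = 0;
            while (i < text.Length)
            {
                if (text[i] == '&' && i + 1 < text.Length && text[i + 1] == '#')
                {
                    // Skip ahead to the terminating ';' (or the end of the text).
                    while (i < text.Length && text[i] != ';') i++;
                    i++;                      // step past the ';' itself
                }
                else
                {
                    result.Append(text[i]);
                    i++;
                }
            }
            return result.ToString();
        }
    }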

Step 4: Removing Parentheses - Information in parentheses is typically pronunciation guides or other things that aren't useful for what I'm using the article for. When the program runs into an opening parenthesis "(", it also deletes the space character " " before it, to prevent two space characters ending up right next to each other. I've had one article with a pair of parentheses nested inside another, so the layer counting from the HTML tag removal gets reused here as well.
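And the last cleanup step, sketched the same way:

    using System.Text;

    static class ParenStripper
    {
        // Delete "(...)" sections, including nested pairs, and eat the
        // space right before the opening parenthesis so we don't leave
        // a double space behind.
        public static string RemoveParentheses(string text)
        {
            var result = new StringBuilder();
            int depth = 0;

            foreach (char c in text)
            {
                if (c == '(')
                {
                    if (depth == 0 && result.Length > 0 && result[result.Length - 1] == ' ')
                    {
                        result.Length--;   // drop the space before the "("
                    }
                    depth++;
                }
                else if (c == ')' && depth > 0)
                {
                    depth--;
                }
                else if (depth == 0)
                {
                    result.Append(c);
                }
            }
            return result.ToString();
        }
    }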

The program runs every paragraph tag it collected through a loop, applies all the steps to each one, and chooses the first paragraph that has a period "." in it, to help filter out unrelated paragraph-tagged text. After that, it simply takes the first sentence and uses it as the definition. At this point, all the HTML tags and formatting blemishes are gone automatically. Hooray!
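Tied together with the helper sketches above, that loop would look something like this:

    using System.Collections.Generic;

    static class DefinitionPicker
    {
        // Run every paragraph through the cleanup steps, keep the first
        // one that actually contains a period, and return its first sentence.
        public static string PickDefinition(List<string> paragraphs)
        {
            foreach (string raw in paragraphs)
            {
                string cleaned = CitationStripper.RemoveCitations(raw);
                cleaned = TagStripper.RemoveTags(cleaned);
                cleaned = EntityStripper.RemoveEntities(cleaned);
                cleaned = ParenStripper.RemoveParentheses(cleaned);

                int period = cleaned.IndexOf('.');
                if (period >= 0)
                {
                    return cleaned.Substring(0, period + 1).Trim();   // first sentence only
                }
            }
            return "(no definition found)";
        }
    }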

...so we're done, right? We grab a Wikipedia page using a term we put in, the program grabs the first sentence and spits it back out as text for us! That's what we wanted! We're done!

Chapter Five: OH HO HO HO HO HO HO, NO YOU'RE NOT!!

-Giant Floating Head of Brian David Gilbert's Health Insurance Video

That's all well and good...for the pages that work. There are still a lot of pages that don't work, like Absolutism, which brings you to a "[SOMETHING] may refer to..." page. Some terms just won't have pages at all, but it'd at least be nice to have these disambiguation pages work. Just this, and nothing more, and we'll actually be done.

Let's see what we can do about it. Let's go back to before we do any of that wacky cleanup stuff and check whether the first paragraph contains the phrase "may refer to". That check redirects us to a chunk of code where we can handle these pages specially.

"May refer to" pages are comprised of several links that have the name of the thing you're looking for. The plan would be to present all of these options to the user which they can pick from, and then use that page.

So go ahead and reuse that code for finding {p}s, but look for {a href="} tags instead. Instead of just sending over every tag with a link in it, you'll need to filter them first so you don't present a lot of unnecessary baggage alongside the options you actually want. I do this by making sure the URL the link points to contains the term from the "may refer to" page.
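Sketched out, the detection and the link-gathering might look like this (real Wikipedia markup puts extra attributes on its links, so treat this as the idea rather than something battle-tested):

    using System;
    using System.Collections.Generic;

    static class Disambiguator
    {
        // The tell-tale phrase that marks a disambiguation page.
        public static bool IsDisambiguation(string firstParagraph)
        {
            return firstParagraph.Contains("may refer to");
        }

        // Pull every /wiki/ link out of the HTML and keep only the ones
        // whose target actually mentions the term we searched for.
        public static List<string> GetOptions(string html, string term)
        {
            var options = new List<string>();
            string needle = term.Replace(' ', '_');              // URLs use underscores

            foreach (string chunk in html.Split('<'))
            {
                if (!chunk.StartsWith("a href=\"/wiki/")) continue;

                // The link target sits between the first pair of quotes.
                int start = chunk.IndexOf('"') + 1;
                int end = chunk.IndexOf('"', start);
                if (end < 0) continue;

                string target = chunk.Substring(start, end - start);
                if (target.IndexOf(needle, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    options.Add(target.Substring("/wiki/".Length));
                }
            }
            return options;
        }
    }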

Afterwards, it's really easy: present all the links we gathered as a list and ask the user to choose one. They're the actual link targets after the ".../wiki/" part of the URL, but they should be pretty self-explanatory.

Just for safekeeping, I also gave users the option to skip the page and fill it in later, plus an "I'm feeling lucky" option that randomly chooses one, for funsies.
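The picking step itself is simple; here's a sketch of it, skip and lucky options included (the names and prompts are made up):

    using System;
    using System.Collections.Generic;

    static class OptionPicker
    {
        // Show the disambiguation options and let the user pick one,
        // skip the term, or let the program pick at random.
        public static string? Choose(string term, List<string> options)
        {
            Console.WriteLine($"\"{term}\" may refer to:");
            for (int i = 0; i < options.Count; i++)
            {
                Console.WriteLine($"  {i + 1}. {options[i]}");
            }
            Console.Write("Pick a number, 's' to skip, or 'l' for I'm feeling lucky: ");

            string input = Console.ReadLine()?.Trim() ?? "s";
            if (input == "s") return null;                                       // fill in later
            if (input == "l") return options[new Random().Next(options.Count)];  // for funsies

            if (int.TryParse(input, out int choice) && choice >= 1 && choice <= options.Count)
            {
                return options[choice - 1];
            }
            return null;   // anything unexpected: treat it as a skip
        }
    }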

Epilogue: It's okay I guess

Thousands of minutes in funding, yet still no real-world use for Wikipedizer.

So, let's recap.

Prompted by crappy busy-work assignments, I wrote a program where I can put in a list of terms and it spits out each term's Wikipedia summary in a neat, compiled list. When it works, it really works!

It's not perfect, but I never expected it to be. On the most recent assignment of this type, running it through the program left 2 out of 25 terms without a Wikipedia page. That's okay, though; if I wanted a program that gave me answers 100% of the time, I would've gone to ChatGPT, and then I wouldn't care what it spits out, because I'm pretty sure my teacher wouldn't either. But if I don't care about the results, I suppose that leaves one question to ask: why did I do all of this?

I like to think it's less about automating my work and more about proving a point. This type of assignment is so monotonous that I can write a program to do almost the entire thing for me; does that not sound like a huge issue? This type of assignment needs to die...and I did my part, one line of code at a time.

...

...are the teachers gone? Okay, you and I both know that I just wanted to automate my workflow so I didn't have to do these assignments anymore. That's the long and short of it. Red peng, OUT!!

*shreds on an air guitar whilst kickflipping on my skateboard right as the background explodes behind me*

And as they kickflipped away, bright lights from the explosion dimming, a few words fade into the screen as you approach the end of the article...

THE END