How would a historian tag my blog?

Tagging blog posts is good practice for bloggers. It is an easy way to classify content, which helps readers search for information. I created this blog for a Digital History class, but so far I have neither categorized nor tagged my posts. The class is led by a historian, mainly for historians. How could I be sure of choosing the proper historian tags? How would a historian tag my blog? In this post, I present my experiment.

The idea is to study how historians tag their own blogs and apply their tags to mine. To do this, I need to analyze the content of historians’ posts and of my own posts, and deduce which of their tags are relevant to my blog. Let’s go!

The first step was to look for appropriate blogs. I thought that I should use at least two tagged blogs. My first option was Bill Turkel’s blog, williamjturkel.net/updates/, but Bill does not usually tag this blog, so I searched Google for “digital history blog”. I found one called Digital History Hacks (2005-08) and, yeah!, it is another (older) blog of Bill’s (coincidence?). I decided that the second blog should be one of my classmates’. I explored all their blogs and the only one full of tags was Dave’s. So Backwards with Time was my second choice.

Once I had my two blogs, the second step was to select their posts. 20 posts would be enough: 15 from Bill’s blog and 5 from Dave’s. For each post, I extracted the 10 most frequent words (I call this set the keywords) and associated them with the tags supplied by the bloggers. I used Wordle to do this. Wordle is an online tool that reads a text and shows its word cloud.
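The keyword extraction described here can be approximated in a few lines of code. The following is only a minimal sketch, not what Wordle does internally: the function name `top_keywords`, the tiny stop-word list and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list (an assumption);
# a real list of common words would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "with", "as", "on", "are", "be"}

def top_keywords(text, n=10, stop_words=STOP_WORDS):
    """Return the n most frequent words in text, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]

# Made-up sample text, only to show the call.
sample = "History is digital. Digital history is the history of the digital past."
keywords = top_keywords(sample, n=3)
print(keywords)
```

Discarding common words by hand, as done in this experiment, corresponds to extending the stop-word list until only the meaningful words remain.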

Fig.1 Word cloud

Wordle also offers a tool to count words and sort them by frequency. I had to discard some common words and select the important ones. For the example in Fig.1, one of Bill’s posts, I chose the following keywords: {physical, past, machine, fabrication, spaces, historians, history, digital, humanist, computer}. The tags supplied by Bill were: {bricolage, DIY, fabrication, hacking, physical computing}. Here is the complete set of posts along with extracted keywords and bloggers’ tags. And here are my posts (remember that my blog has no tags, which is the reason for this experiment):

a)
https://antoniojimenezmavillard.wordpress.com/2011/10/31/a-new-scarcity/
Keywords: web, information, knowledge, machine, data, semantic, historians, abundance, technologies, computer.

b)
https://antoniojimenezmavillard.wordpress.com/2011/10/16/neither-field-nor-fad-nor-fashion/
Keywords: digital, history, internet, technologies, field, future, fashion, research, world, past.

c)
https://antoniojimenezmavillard.wordpress.com/2011/10/02/what-is-real/
Keywords: life, real, world, virtual, Internet, users, cyberspace, facebook, examples, friends.

d)
This entry corresponds to this post, which did not exist when I did the experiment. However, I knew what it would be about, so I added some keywords according to my own judgment.
Keywords: blog, data, historians, online, programming, tag.

The third step was to apply an Artificial Intelligence technique to deduce the tags for my posts from their keywords, using the keywords and tags of the 20 historians’ posts. In the Machine Learning field, this collection of 20 posts is called the training set. The technique I chose was the ID3 algorithm, which can deduce the value of one attribute from other attributes. ID3 works with a set of examples (the training set). Each example has attributes, and one of them is the target (the one we want to learn about). In the training set, all the information is provided. For example, Fig.2 shows “real facts” about the days on which a team played ball.

Fig.2 Example of training set

What will happen on day D15 and later? ID3 can deduce whether or not the team will play based on the outlook, temperature, humidity and wind of that day. To do so, ID3 builds a decision tree* (Fig. 3):

Fig.3 Decision tree

Or expressed in rule format*:

  • IF outlook = “sunny” AND humidity = “high” THEN play ball = “no”
  • IF outlook = “sunny” AND humidity = “normal” THEN play ball = “yes”
  • IF outlook = “overcast” THEN play ball = “yes”
  • IF outlook = “rain” AND wind = “strong” THEN play ball = “no”
  • IF outlook = “rain” AND wind = “weak” THEN play ball = “yes”
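The five rules above translate directly into code. This is a minimal sketch: the function name `play_ball` and the example day are hypothetical, chosen only to show how a new day such as D15 would be classified.

```python
def play_ball(day):
    """Apply the five rules above to a dict describing one day."""
    outlook = day["outlook"]
    if outlook == "sunny":
        if day["humidity"] == "high":
            return "no"
        if day["humidity"] == "normal":
            return "yes"
    elif outlook == "overcast":
        return "yes"
    elif outlook == "rain":
        if day["wind"] == "strong":
            return "no"
        if day["wind"] == "weak":
            return "yes"
    # No rule covers this combination of values.
    raise ValueError(f"no rule matches {day!r}")

# Hypothetical day D15 (its values are an assumption, not taken from Fig.2).
d15 = {"outlook": "rain", "humidity": "high", "wind": "strong"}
print(play_ball(d15))
```

Note that temperature never appears: the rules do not need it to classify a day.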

* In order to build the smallest tree (the simplest rules), the “best” attributes must be placed at the top of the tree, that is, the attributes whose split leaves the least entropy in the resulting subsets (entropy measures how mixed a set is: it is zero when all the examples in a set share the same target value).
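To make the entropy criterion concrete, here is a small sketch that compares two attributes. The counts are an assumption: Fig.2 is not reproduced in the text, so they are taken from the classic 14-day play-ball table used in machine learning textbooks (9 "yes" days and 5 "no" days), which the rules above are consistent with. The function names are hypothetical.

```python
from math import log2

def entropy(yes, no):
    """Entropy (in bits) of a set with the given yes/no counts."""
    total = yes + no
    result = 0.0
    for count in (yes, no):
        if count:  # a zero count contributes nothing
            p = count / total
            result -= p * log2(p)
    return result

def information_gain(overall, partitions):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(overall)
    remainder = sum((y + n) / total * entropy(y, n)
                    for y, n in partitions.values())
    return entropy(*overall) - remainder

# Assumed (yes, no) counts from the classic textbook table.
overall = (9, 5)
by_outlook = {"sunny": (2, 3), "overcast": (4, 0), "rain": (3, 2)}
by_wind = {"weak": (6, 2), "strong": (3, 3)}

gain_outlook = information_gain(overall, by_outlook)
gain_wind = information_gain(overall, by_wind)
print(round(gain_outlook, 3), round(gain_wind, 3))
```

Under these assumed counts, splitting on outlook removes far more entropy than splitting on wind, which is why outlook ends up at the top of the tree.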

The pseudocode of ID3 can be found here.
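For readers who prefer running code to pseudocode, the following is a compact sketch of ID3. It is not the Java implementation used in this experiment. The data is an assumption: the classic 14-day play-ball table from machine learning textbooks, used here because Fig.2 is not reproduced in the text. All function and variable names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy (in bits) of the target values in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attr, target):
    """Information gain obtained by splitting rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def id3(rows, attributes, target):
    """Return a nested dict {attribute: {value: subtree or label}}."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:      # pure node: make a leaf
        return labels[0]
    if not attributes:             # nothing left to split on: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, rest, target)
    return {best: branches}

COLUMNS = ["outlook", "temperature", "humidity", "wind", "play"]
# Assumed training set (classic textbook table), one day per line.
TABLE = """
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no
"""
rows = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]
tree = id3(rows, COLUMNS[:-1], "play")
print(tree)
```

Each path from the root of the returned dictionary to a leaf reads as one IF ... THEN rule.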

My goal was to apply ID3 to teach the computer, given the training set of posts, the rules for when a tag appears in a post and when it does not (that is, its classification). In this experiment, an example is a post, an attribute is a keyword and the target is a tag.

First of all, I had to set up the total set of keywords: the intersection of my own keywords and the historians’ (19 in total): {blog, computation, data, digital, examples, historians, history, information, knowledge, past, reality, research, users, virtual, web, world, online, programming, tag}. Note that some keywords with similar meanings were merged into one (for instance, computer and computation). Secondly, I had to choose which tags from the training set were relevant to my blog (that is, the tags most likely to apply to my posts). I selected these 30: {browser, computational history, data mining, digital, digital history, diy, entropy, google, history, html, interdisciplinarity, machine learning, markup language, ocr, online research, physical computing, programming, reality, representation, search, search engines, teaching, technology, text mining, thing knowledge, turing test, virtual reality, web resources, wikipedia, wikis}.

At this point, I had 20 posts, 19 keywords and 30 tags. For each post, what is the correspondence between its keywords and its tags?

Fig.4 Training set
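A training set of this kind can be represented as one row of yes/no values per post, plus a yes/no label for the tag being learned. The sketch below encodes the single historian's post whose keywords and tags are given in the text (the one from Fig.1). The function name `encode` and the `SYNONYMS` table are hypothetical, and only the computer/computation merge is stated in this post.

```python
KEYWORDS = ["blog", "computation", "data", "digital", "examples", "historians",
            "history", "information", "knowledge", "past", "reality", "research",
            "users", "virtual", "web", "world", "online", "programming", "tag"]

# Merged near-synonyms; only this pair is mentioned in the post (assumption:
# any other merges would be added here in the same way).
SYNONYMS = {"computer": "computation"}

def encode(post_keywords, post_tags, target_tag):
    """Return (features, label): yes/no per keyword, and yes/no for target_tag."""
    normalized = {SYNONYMS.get(k, k) for k in post_keywords}
    features = {k: "yes" if k in normalized else "no" for k in KEYWORDS}
    tags = {t.lower() for t in post_tags}
    label = "yes" if target_tag in tags else "no"
    return features, label

# Keywords and tags of the post shown in Fig.1, as listed in the text.
bill_keywords = ["physical", "past", "machine", "fabrication", "spaces",
                 "historians", "history", "digital", "humanist", "computer"]
bill_tags = ["bricolage", "DIY", "fabrication", "hacking", "physical computing"]

features, label = encode(bill_keywords, bill_tags, "physical computing")
print([k for k, v in features.items() if v == "yes"], label)
```

The label is kept apart from the features because some names, such as history or programming, are both a keyword and a tag. Repeating the call with a different `target_tag` gives the last column for each of the 30 tags.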

The training set is displayed in Fig.4. For each post, “yes” means that the post has the keyword and “no” means the opposite. The fourth step was to program the ID3 algorithm and represent this information properly. I had to add a last column with the tag I wanted to learn and fill in its value for each of the historians’ posts. I repeated the process 30 times (once for each tag)! The programming language I know best is Java, and I found an implementation of ID3 in this language, so all I needed to do was create a project in Eclipse, provide the training set in a proper format and run the application. As an example, here is the rule for learning when a post has to be tagged/classified as “history”:

  • IF knowledge = “no” AND world = “no” THEN history = “no”
  • IF knowledge = “no” AND world = “yes” AND digital = “yes” THEN history = “no”
  • IF knowledge = “no” AND world = “yes” AND digital = “no” THEN history = “yes”
  • IF knowledge = “yes” THEN history = “yes”

That means: “if the keyword knowledge belongs to the post, then the post is tagged with history; otherwise, if knowledge doesn’t belong but world does and digital doesn’t, then the post is also tagged with history”. This rule expresses the idea that (traditional) history is linked to knowledge and world, but not to digital. On the other hand, the rule learned for digital history says that this tag is related in some sense to online, history, digital, knowledge, historians and web (this rule is too complex to be shown).

The fifth step was to check, for every learned rule, whether my posts satisfy it. According to the example above, the post titled A new scarcity gets the tag history. Actually, it is the only post where I mention History in the past and compare it with current methods of doing History. I did this step manually, i.e., I analyzed every rule and checked whether the tag was suitable for each of my posts.
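This manual check could also be automated. As a sketch, the function below encodes the learned rule for “history” shown earlier and applies it to the keywords of my four posts as listed above. The function name `tagged_history` and the dictionary layout are hypothetical.

```python
def tagged_history(keywords):
    """The learned rule for the tag "history", applied to a set of keywords."""
    if "knowledge" in keywords:
        return True                    # knowledge = yes -> history = yes
    if "world" not in keywords:
        return False                   # knowledge = no, world = no -> no
    return "digital" not in keywords   # world = yes: yes only without digital

# Keywords of my posts, copied from the list earlier in this post.
MY_POSTS = {
    "A new scarcity": {"web", "information", "knowledge", "machine", "data",
                       "semantic", "historians", "abundance", "technologies",
                       "computer"},
    "Neither field nor fad nor fashion": {"digital", "history", "internet",
                                          "technologies", "field", "future",
                                          "fashion", "research", "world", "past"},
    "What is real?": {"life", "real", "world", "virtual", "internet", "users",
                      "cyberspace", "facebook", "examples", "friends"},
    "This post": {"blog", "data", "historians", "online", "programming", "tag"},
}

history_posts = [title for title, kw in MY_POSTS.items() if tagged_history(kw)]
print(history_posts)
```

One such function per tag, applied to every post, would reproduce the whole fifth step.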

Results

The result was a set of learned tags* for my posts. My blog was originally untagged but now it has its own tags!

a)
https://antoniojimenezmavillard.wordpress.com/2011/10/31/a-new-scarcity/
Keywords: web, information, knowledge, machine, data, semantic, historians, abundance, technologies, computer.
Tags: browser, data mining, digital history, google, history, html, interdisciplinarity, machine learning, markup language, technology, text mining, virtual reality, web resources.

b)
https://antoniojimenezmavillard.wordpress.com/2011/10/16/neither-field-nor-fad-nor-fashion/
Keywords: digital, history, internet, technologies, field, future, fashion, research, world, past.
Tags: digital history, google, online research, search engines, teaching, technology, virtual reality, web resources, wikipedia.

c)
https://antoniojimenezmavillard.wordpress.com/2011/10/02/what-is-real/
Keywords: life, real, world, virtual, Internet, users, cyberspace, facebook, examples, friends.
Tags: digital, digital history, history, online research, reality, virtual reality.

d)
This entry corresponds to this post, which did not exist when I did the experiment. However, I knew what it would be about, so I added some keywords according to my own judgment.
Keywords: blog, data, historians, online, programming, tag.
Tags: data mining, digital history, entropy, machine learning, ocr, programming, text mining, turing test, wikis.

*green = suitable tag
*red = unsuitable tag

Conclusions

This experiment has shown me some interesting results:

  1. About 75% of the tags are suitable for the post content.
  2. Every post was tagged as “digital history” (good… after all, this is a blog for a Digital History class).
  3. A new scarcity was tagged with relevant tags such as “google“, “markup language” (remember that this post mentions XML) or “web resources“.
  4. Neither field nor fad nor fashion was classified as “online research“, “search engines” or “web resources“.
  5. “What is real?” was tagged with coherent tags such as “reality” or “virtual reality” (just the topics of this post).
  6. This post has been tagged with “data mining“, “entropy” (remember that entropy was not a keyword of this post and, even so, it has been correctly classified), “machine learning“, “programming“, “text mining” and “turing test” (a test of a machine’s ability to exhibit intelligent behaviour).

These results may not be perfect, and the methodology is far from the scientific method. However, it is a good first approach. And it proves once more the power of Artificial Intelligence, in this case Machine Learning and the ID3 algorithm.

About mavillard

Graduate student of PhD. in Hispanic Studies at The University of Western Ontario.