Oct 21, 2011

Visualizing the schooX dataset with Gephi, a colorful poster.

So, schooX and Gephi.
First of all, a few words about these two platforms.



schooX - the academy for Self Learners - is a newly founded startup which helps you collect and share content (wikis, videos, slides, etc.) and optionally organize it into courses, always accessible within schooX. This eliminates the need to revisit different media hubs to track down knowledge you already found in the past.

Gephi is a visualization tool based on graph theory and the gexf file format. Feed it data, select a layout algorithm, tweak colors and settings and voilà: a beautiful and informative image.
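For a taste of the format, a minimal hand-written gexf file could look something like the sketch below. This is based on the GEXF 1.2 draft schema, not on the schooX data, and just shows one user node connected to one tag node:

<?xml version="1.0" encoding="UTF-8"?>
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="0" label="java"/>
      <node id="1" label="Charalampos Arapidis"/>
    </nodes>
    <edges>
      <edge id="0" source="0" target="1"/>
    </edges>
  </graph>
</gexf>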


The Model
In schooX each user has a set of collected articles, and each article is described by user-assigned tags. The union of all these tags forms the user's tag cloud. In graph-theory terms, a user is a node, each tag is a node, and an edge connects a user to every tag in their cloud. We will let Gephi lay out all user nodes and their respective edges, but first we have to get the dataset.


The Process
This first part of the article will detail steps 1, 2 and 3 of the process.
  1. Extract the model in csv format (user-tags)
  2. Parse schooX user profiles with cURL
  3. Store html profile pages locally.
  4. Grep or XPath user tag clouds from the htmls.
  5. Create the csv file and import into Gephi.
  6. Perform cluster analysis and graph drawing layout algorithms.
  7. Export and Gimp the final image with different fx. 

Final Result
Before detailing the process, some images of the final result.
We can identify user concentration around topics such as social media, software development, medical sciences, etc. Interesting... the startup's dataset is already taking shape, forming communities around some hot topics. On to the images (available on flickr too):
schooX Network Graph
schooX Network Graph with Labels
Glass Clustering Effect
Network Painting - qualifies as an abstract poster?
Dark Neon Abstract


Parsing data, assessing the model
Since we haven't got direct access to any dataset or database, we can only do the... unthinkable :) Create a schooX account, parse all users' public tag clouds and build our model in csv format. Each row of the csv file will contain two columns and will be imported into Gephi:
  • tag name
  • unique user name
schooX' entry point is:
http://www.schoox.com/login/index.php
User profile pages containing the user's tag cloud follow this URL pattern:
http://www.schoox.com/user/7969/tag-cloud/
For example, 7969 is my user id. After some trial and error we can find the highest user id, since an error page comes up in place of a profile page when no user exists with that id.

The cURL (see-url) command
We will use curl (see-url) to log in to schooX and store the login cookie, fetch all users' profile pages, and store them locally as html files. Later, we will extract each user's tags from these pages.
Login to schooX and store cookie:
 curl --cookie-jar cjar --data 'username=arapidhs@gmail.com' --data 'password=********' --output /dev/null http://www.schoox.com/login/index.php
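A quick sanity check that the stored cookie actually logs us in - the 'logout' marker is just a guess at what an authenticated page contains:

 curl --cookie cjar --silent http://www.schoox.com/ | grep -i logout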
Starting from id 1 up to the max user id, we execute curl consecutively to fetch all user profiles:
#!/bin/bash
# fetch every user's tag-cloud page, reusing the stored login cookie
for i in {1..200000}
do
    curl --cookie cjar --output "/home/arapidhs/tmp/${i}.html" "http://www.schoox.com/user/${i}/tag-cloud/"
done
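As a side note, the 'trial and error' hunt for the last user id mentioned earlier can be made systematic with a binary search. The sketch below assumes ids are assigned sequentially without gaps and that a missing profile returns a non-200 status code; if schooX serves a normal 200 error page instead, you would grep the body for an error marker. The upper bound of 1000000 is just a generous guess:

#!/bin/bash
# binary search for the highest existing user id (sketch;
# assumes contiguous ids and a non-200 status for missing users)
lo=1; hi=1000000
while [ "$lo" -lt "$hi" ]
do
    mid=$(( (lo + hi + 1) / 2 ))
    code=$(curl --cookie cjar --silent --output /dev/null --write-out '%{http_code}' "http://www.schoox.com/user/${mid}/tag-cloud/")
    if [ "$code" = "200" ]
    then
        lo=$mid
    else
        hi=$(( mid - 1 ))
    fi
done
echo "highest user id: $lo"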

Our tmp local directory now stores all user profiles and tag clouds in separate html pages. Each html page is named after the corresponding user's id.
Have a look:


In part 2 we will parse these pages with Tidy to generate a csv model file to import into Gephi.
The csv will look something like this, with each row representing an edge of the network:

coefficient,Charalampos Arapidis
paok,Charalampos Arapidis
neural,Charalampos Arapidis
desktop,Charalampos Arapidis
minimal,Charalampos Arapidis
analog,Charalampos Arapidis
Fourier,Charalampos Arapidis
Java,Charalampos Arapidis
visualization,Charalampos Arapidis
software,Charalampos Arapidis
subversion,Charalampos Arapidis
social,Charalampos Arapidis
processing,Charalampos Arapidis
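As a preview of part 2, the extraction might look roughly like the sketch below. The markup details are assumptions - I am guessing the user's full name sits in the page title and that each tag is rendered as a link containing /tagged/ - the real pages may differ:

#!/bin/bash
# rough extraction sketch (part 2 preview); the <title> and
# /tagged/ link patterns are guesses about the page markup
for f in /home/arapidhs/tmp/*.html
do
    html=$(tidy -quiet -asxhtml "$f" 2>/dev/null)
    # assume the profile <title> holds the user's full name
    user=$(printf '%s' "$html" | sed -n 's|.*<title>\([^<]*\)</title>.*|\1|p' | head -1)
    # assume each tag is a link to /tagged/<tag>
    printf '%s' "$html" | grep -o 'href="[^"]*/tagged/[^"]*"' | sed 's|.*/tagged/||; s|"$||' | while read -r tag
    do
        echo "${tag},${user}" >> model.csv
    done
done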


See you soon, and happy collecting!



Aug 29, 2011

Oracle Certified MySQL Associate uCertify prepkit review

Back from vacation, feeling refreshed and ready for new things :)

Here is a review of a preparation kit I received from uCertify for the Oracle Certified MySQL Associate exam. I found it worthwhile to share my experience with the kit for developers interested in training for a technical certification.

The kit is by uCertify, a company which specializes in the certification training domain, providing many kits for various technical certificates.
uCertify has developed a prepkit engine which runs all the available certification preparation kits.
Exploring the kit I found three types of tests: two of them approximate the final test, while the third is a little more difficult than the actual certification test.

The user interface is elegant and responsive, and I liked the extensive reporting after having taken some tests.
Yes, I love stats. The tests cover all aspects of database usage, from terminology to transactions.
It is also great that you can clear your doubts about a question, or actually learn from the tests, since the feedback is immediate and articles, extensive explanations, etc. are included within the kit.


performance report
sample question
The tests can be taken in two distinct modes: learning and exam mode.
Learning mode is the feature I liked the most, providing feedback and insight immediately after each question, thus helping you avoid repeating the same mistakes and really understand what went wrong and why.

Do they actually help?

I think they do.
The kit identified my SQL weaknesses correctly: transactions and import/export procedures. Could I improve using the prepkit? To test this I created a custom test - you can do that - with questions concerning the above fields. I ran it in learning mode, following up on the explanations after each question. I took a break for a day, retook the simulated tests and yes! My stats had improved.

Most important is that the improvement does not feel mechanical. I actually gained knowledge from the tests, and they helped me get a feel for the final exam.

sample ER question
articles and concepts included in the kit






Jul 29, 2011

Do not use Java 7 yet

Java 7, released today, seems to have introduced some nasty bugs caused by hotspot compiler optimizations that miscompile certain loops. Code containing loops may be affected by this bug.


If you use Java 7, use this switch when starting the JVM:
-XX:-UseLoopPredicate
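For example, launching a (hypothetical) app.jar with loop predication disabled:

 java -XX:-UseLoopPredicate -jar app.jar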

Uwe Schindler, an Apache Lucene PMC member, tweeted a warning earlier today, as the bug affected the Apache Lucene and Solr projects, causing wrong compilation of some loops.

The warning mail and the full story from LucidImagination:

From: Uwe Schindler
Date: Thu, 28 Jul 2011 23:13:36 +0200
Subject: [WARNING] Index corruption and crashes in Apache Lucene Core / Apache Solr with Java 7

Hello Apache Lucene & Apache Solr users,
Hello users of other Java-based Apache projects,

Oracle released Java 7 today. Unfortunately it contains hotspot compiler
optimizations, which miscompile some loops. This can affect code of several
Apache projects. Sometimes JVMs only crash, but in several cases, results
calculated can be incorrect, leading to bugs in applications (see Hotspot
bugs 7070134 [1], 7044738 [2], 7068051 [3]).

Apache Lucene Core and Apache Solr are two Apache projects, which are
affected by these bugs, namely all versions released until today. Solr users
with the default configuration will have Java crashing with SIGSEGV as soon
as they start to index documents, as one affected part is the well-known
Porter stemmer (see LUCENE-3335 [4]). Other loops in Lucene may be
miscompiled, too, leading to index corruption (especially on Lucene trunk
with pulsing codec; other loops may be affected, too - LUCENE-3346 [5]).

These problems were detected only 5 days before the official Java 7 release,
so Oracle had no time to fix those bugs, affecting also many more
applications. In response to our questions, they proposed to include the
fixes into service release u2 (eventually into service release u1, see [6]).
This means you cannot use Apache Lucene/Solr with Java 7 releases before
Update 2! If you do, please don't open bug reports, it is not the
committers' fault! At least disable loop optimizations using the
-XX:-UseLoopPredicate JVM option to not risk index corruptions.

Please note: Also Java 6 users are affected, if they use one of those JVM
options, which are not enabled by default: -XX:+OptimizeStringConcat or
-XX:+AggressiveOpts

It is strongly recommended not to use any hotspot optimization switches in
any Java version without extensive testing!

In case you upgrade to Java 7, remember that you may have to reindex, as the
unicode version shipped with Java 7 changed and tokenization behaves
differently (e.g. lowercasing). For more information, read
JRE_VERSION_MIGRATION.txt in your distribution package!

On behalf of the Lucene project,
Uwe

[1] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7070134
[2] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7044738
[3] http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7068051
[4] https://issues.apache.org/jira/browse/LUCENE-3335
[5] https://issues.apache.org/jira/browse/LUCENE-3346
[6] http://s.apache.org/StQ

Better play safe and expect an update from Oracle, I guess...