CPU-intensive and Shuffle-intensive Jobs scale differently

As a curious observer of Hadoop/MapReduce technology, I have always wondered what kinds of jobs suit it best. As reported in [1], “we find that scale-out works better for CPU-intensive tasks since there are more cores and more aggregate memory bandwidth. Scale-up works better for shuffle-intensive tasks since it has fast intermediate storage and no …



Multi-Word TagCloud on Web N-gram Now

Check out the tag clouds below. Can you see why they are interesting? Compare the two tag clouds generated from the same text (a corpus of the titles of about 2000 data.gov datasets) and see why they differ. [Figures: Novel Multi-word TagCloud; Conventional Single-word TagCloud] Highlights: Meaningful Visualization. As you may see from the …



Putting open Facebook data into Linked Data Cloud

I recently built a proof-of-concept demo for getting Facebook data (public data only) into LOD via their recently announced Graph API. The demo is available at http://sam.tw.rpi.edu/ws/face_lod.html. It is fairly straightforward to convert the JSON object into RDF and make the URI dereferenceable. Now the data are linkable, but not yet linked to other LOD data. …
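To give a feel for the JSON-to-RDF step, here is a minimal sketch in Python. It is not the code behind the demo. The base URI, the vocabulary namespace, and the field names (`id`, `name`, `link`) are all hypothetical, and nested values are simply skipped. It emits N-Triples lines from a flat JSON object.

```python
import json

def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

def json_to_ntriples(obj, base_uri, vocab_uri):
    """Turn a flat JSON object with an "id" key into N-Triples lines.

    base_uri and vocab_uri are assumptions made for this sketch; they
    are not the namespaces used by the actual demo.
    """
    subject = f"<{base_uri}{obj['id']}>"
    triples = []
    for key, value in sorted(obj.items()):
        # Nested objects and lists are skipped in this sketch.
        if key == "id" or value is None or isinstance(value, (dict, list)):
            continue
        predicate = f"<{vocab_uri}{key}>"
        if isinstance(value, str) and value.startswith(("http://", "https://")):
            obj_term = f"<{value}>"  # treat URL strings as resources
        else:
            obj_term = f'"{escape_literal(str(value))}"'  # everything else is a plain literal
        triples.append(f"{subject} {predicate} {obj_term} .")
    return triples

# Hypothetical sample object, not real Graph API output.
sample = json.loads('{"id": "12345", "name": "Example Page", "link": "http://example.org/page"}')
for line in json_to_ntriples(sample, "http://example.org/fb/", "http://example.org/vocab#"):
    print(line)
```

Making the subject URI dereferenceable is then a matter of serving these triples at that URI, which this sketch does not cover.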



Sameas Network

Sameas Network is a network of URIs that are interconnected by the owl:sameAs relation. It is an interesting network because it is not a conventional social network but rather a socially contributed directed graph (DAG) connecting “equivalent” identities. Our recent study [1] crawls the sameas network following linked data principles: starting from a given seeding …
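The general idea of crawling outward from a seed can be sketched as a breadth-first traversal. This is an illustration only, not the crawler used in the study. `fetch_sameas` is a hypothetical callback standing in for dereferencing a URI and extracting the objects of its owl:sameAs statements, and the toy graph below is made up.

```python
from collections import deque

def crawl_sameas(seed, fetch_sameas, max_uris=1000):
    """Breadth-first crawl of owl:sameAs links starting from a seed URI.

    fetch_sameas(uri) is a hypothetical callback that returns the URIs
    the given URI declares owl:sameAs. max_uris is a soft cap: no new
    URI is expanded once that many have been discovered.
    """
    seen = {seed}
    queue = deque([seed])
    edges = []  # directed (source, target) sameAs links
    while queue and len(seen) < max_uris:
        uri = queue.popleft()
        for target in fetch_sameas(uri):
            edges.append((uri, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, edges

# Toy in-memory graph standing in for dereferenced linked data.
toy = {
    "http://example.org/a": ["http://example.org/b", "http://example.org/c"],
    "http://example.org/b": ["http://example.org/a"],
    "http://example.org/c": ["http://example.org/d"],
}
uris, links = crawl_sameas("http://example.org/a", lambda u: toy.get(u, []))
print(len(uris), "URIs,", len(links), "sameAs links")
```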



Three principles for building government dataset catalog vocabulary

There is ongoing interest in vocabularies for government dataset publishing. There are a number of proposals, such as DERI dcat, Sunlight Lab’s guidelines, and RPI’s proposal on Data-gov Vocabulary. Based on our experience with data.gov catalog data, we found the following principles useful for consolidating the vocabulary-building process and potentially bringing consensus: …

