Tuesday, May 24, 2011

Link Extraction from Tweets Using Java Functions

A common problem in analyzing tweets is extracting the links (URLs) they contain. Once you have the links, you can run all sorts of analytics on them, such as determining the most popular links for a given day. Extraction can be done with a regular expression (see RegexParser below), but most links shared on Twitter are shortened with services such as bit.ly and fb.me. To understand what a tweet actually links to, you need to resolve each short link to the URL it points to. In short: reverse the shortening, or expand it! The following code provides two functions that let you extract a URL and resolve it to its final destination URL in Java.


 
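package routines;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Parsers {
    // Default pattern: matches URLs with or without an http(s) scheme
    private static final String URL_REGEX = "((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?((/)?([-\\w/_\\.]*(\\?\\S+)?)?)*)";
    // Upper bound on redirects, so a redirect loop cannot hang the caller
    private static final int MAX_REDIRECTS = 10;
    private static final String USER_AGENT = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.205 Safari/534.16";

    // Returns the first match of regexPattern in stringToParse, or "" if there is none.
    // Falls back to the default URL pattern when regexPattern is null or empty.
    public static String RegexParser(String stringToParse, String regexPattern) {
        String regex = (regexPattern == null || regexPattern.isEmpty()) ? URL_REGEX : regexPattern;
        Matcher m = Pattern.compile(regex).matcher(stringToParse);
        return m.find() ? m.group() : "";
    }

    // Follows redirects one hop at a time and returns the final destination URL.
    // If a request fails, returns the last URL resolved so far.
    public static String ExpandURL(String urlString) {
        String resolvedURL = urlString;
        try {
            for (int i = 0; i < MAX_REDIRECTS; i++) {
                URL current = new URL(resolvedURL);
                HttpURLConnection connection = (HttpURLConnection) current.openConnection();
                // Apply the same settings to every hop, not only the first one
                connection.setInstanceFollowRedirects(false);
                connection.setRequestProperty("User-Agent", USER_AGENT);
                try {
                    int status = connection.getResponseCode();
                    String location = connection.getHeaderField("Location");
                    // Stop when the response is not a redirect or carries no target
                    if (status / 100 != 3 || location == null) {
                        break;
                    }
                    // Location may be relative, so resolve it against the current URL
                    resolvedURL = new URL(current, location).toString();
                } finally {
                    connection.disconnect();
                }
            }
        } catch (Exception e) {
            // Network or parsing failure: keep the last URL resolved so far
        }
        return resolvedURL;
    }
}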
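
To tie this back to the analytics mentioned at the top (for example, the most popular links for a given day), here is a minimal sketch that pulls every link out of a batch of tweets and counts how often each one appears. The class name LinkCounter, its two methods, and the simplified pattern (which only matches links with an explicit http:// or https:// scheme) are assumptions made for illustration and are not part of the routine above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only
class LinkCounter {
    // Simplified, assumed pattern: a scheme followed by non-whitespace characters.
    // Trailing punctuation (e.g. a period right after the link) would be included.
    private static final Pattern URL_PATTERN = Pattern.compile("https?://\\S+");

    // Returns every link found in the text, in order of appearance
    static List<String> extractAll(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            links.add(m.group());
        }
        return links;
    }

    // Counts how often each link appears across a batch of tweets
    static Map<String, Integer> countLinks(List<String> tweets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tweet : tweets) {
            for (String link : extractAll(tweet)) {
                counts.merge(link, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

In practice you would probably pass each link through ExpandURL before counting, so that different short links pointing to the same page are counted together.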
