finding most likes on a tag on instagram

Finally, a technical post. I’m on the NYC FRC (FIRST Robotics Competition) planning committee. Even though this year’s competition was cancelled, the Instagram photo contest was not. The idea was that students post pictures to a specific hashtag and the ones with the most likes win.

On a planning call this week, someone noted that Instagram no longer makes the number of likes easy to see, making it a pain to find the posts with the most likes. That sounded like something a computer would be good at, so I volunteered to take care of it.

There were only 86 submissions, so mousing over each one by hand and keeping a list wouldn’t have been terrible. I probably could have gotten it done faster that way than by automating it. But where’s the fun in that?

Attempt #1 (failed) – API

There is a documented API for getting posts by hashtag. It requires a business or creator account, and I have neither. This page says anyone can get a creator account. I don’t see that option, possibly because my account is new or private. And I don’t want to make my account public, so I’m not going down this road.
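For the record, the documented flow is two Graph API calls: look up the hashtag’s id with ig_hashtag_search, then ask for its top_media. Below is a minimal sketch of just the URL building. The user id, token, and field list are placeholders and from my reading of the docs, not something I was able to run, since the whole problem is that I don’t have the required account.

```java
import java.net.URI;

public class HashtagApiUrls {
    // Hypothetical sketch: builds the documented ig_hashtag_search URL.
    // userId and accessToken are placeholders; a real call needs the
    // business/creator account I don't have.
    static URI hashtagSearch(String userId, String tag, String accessToken) {
        return URI.create("https://graph.facebook.com/ig_hashtag_search"
                + "?user_id=" + userId
                + "&q=" + tag
                + "&access_token=" + accessToken);
    }

    // With the hashtag id from the first call, top_media returns the
    // most popular posts for that hashtag (field names are an assumption)
    static URI topMedia(String hashtagId, String userId, String accessToken) {
        return URI.create("https://graph.facebook.com/" + hashtagId
                + "/top_media?user_id=" + userId
                + "&fields=id,like_count"
                + "&access_token=" + accessToken);
    }

    public static void main(String[] args) {
        System.out.println(hashtagSearch("USER_ID", "frcnyc2020", "TOKEN"));
        System.out.println(topMedia("HASHTAG_ID", "USER_ID", "TOKEN"));
    }
}
```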

Attempt #2 (failed) – screenscraping

I know the URL of the hashtag page, and it is available without a login. Great. I can just use Selenium to scrape the data. Except I couldn’t get this working. The page uses progressive rendering, so I tried code from Stack Overflow to page down. I used ChromeDriver so I could confirm it really was scrolling. It was. But the Selenium driver still never saw all the images, so I had to abandon that approach.

private void scrollToBottomOfPage() {
   JavascriptExecutor js = (JavascriptExecutor) driver;
   try {
      long lastHeight = ((Number) js.executeScript("return document.body.scrollHeight")).longValue();
      while (true) {
         js.executeScript("window.scrollTo(0, document.body.scrollHeight);");
         Thread.sleep(2000); // give the progressive rendering time to load more
         long newHeight = ((Number) js.executeScript("return document.body.scrollHeight")).longValue();
         if (newHeight == lastHeight) {
            break; // nothing new loaded; we are at the bottom
         }
         lastHeight = newHeight;
      }
   } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
   }
}

Attempt #3 (failed) – logging in

While watching it in ChromeDriver, I noticed a prompt about logging in, so I thought maybe that was the problem. I wrote some sloppy Selenium code to log in and saw the same behavior. It did log in, but I still only got a subset of the images. (I would have refactored the timeout, hard-coded credentials, and loop if it had helped.)


List<WebElement> x = driver.findElements(By.tagName("input"));

// TODO change timeout to a wait until
try {
   Thread.sleep(5000);
} catch (InterruptedException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
}

WebElement username = driver.findElement(By.name("username"));
WebElement password = driver.findElement(By.name("password"));

// TODO don't hard code
username.sendKeys("myUsername");
password.sendKeys("myPassword");

// TODO rewrite
List<WebElement> l = driver.findElements(By.tagName("button"));
for (WebElement webElement : l) {
   if (webElement.getAttribute("type").equals("submit")) {
      webElement.click();
   }
}

Attempt #4 (failed) – save file

At this point, I decided to stop messing with Selenium and just download the data myself. I opened the web page and scrolled to the bottom to load all the images, then saved the page in Chrome to get all the files. And… still didn’t have everything. This suggests the page is set up not to keep everything loaded, and no amount of fiddling with Selenium was going to work.

Attempt #5 (failed) – network traffic

The files are all downloaded by my browser at some point, so I used Chrome’s network traffic monitor (in developer tools). Unfortunately, you can’t get the actual Instagram post URL from the CDN (content delivery network) link used for the image.

Attempt #6 (success kinda) – JSON

The “kinda” is because I don’t have paging working, and the ?__a=1 API is deprecated.

Then I found this post, which tells me I can append ?__a=1 to the hashtag URL to get the results as JSON. Whoo hoo! This returns the data. Then it was just a matter of parsing it and creating the report.

That worked. The completed code is below and on GitHub.

package com.jeanneboyarsky.instagram;

import java.util.*;
import java.util.Map.*;
import java.util.stream.*;

import org.junit.jupiter.api.*;
import org.openqa.selenium.*;
import org.openqa.selenium.htmlunit.*;

import com.fasterxml.jackson.databind.*;

public class CountLikesIT {

  private static final String TAG = "frcnyc2020";

  private WebDriver driver;

  @BeforeEach
  void setup() {
    driver = new HtmlUnitDriver();
  }

  @AfterEach
  void tearDown() {
    if (driver != null) {
      driver.quit();
    }
  }

  @Test
  void graphQlJson() throws Exception {
    // "count" shows up 258 times (this is three times per image)
    // 1) edge_media_to_comment
    // 2) edge_liked_by
    // 3) edge_media_preview_like - looks same as #2
    String json = getJson();

    ObjectMapper objectMapper = new ObjectMapper();
    JsonNode rootNode = objectMapper.readTree(json);
    List<JsonNode> nodes = rootNode.findValues("node");
    Map<String, Integer> result = nodes.stream()
        // node occurs at multiple levels; we only want the ones that go with posts
        .filter(this::isPost)
        .collect(Collectors.toMap(this::getUrl, this::getNumLikes,
            // ignore duplicates by choosing either
            (k, v) -> v));
    printDescendingByLikes(result);
  }

  private String getUrl(JsonNode node) {
    JsonNode shortCodeNode = node.findValue("shortcode");
    return "https://www.instagram.com/p/" + shortCodeNode.asText();
  }

  private int getNumLikes(JsonNode node) {
    JsonNode likeNode = node.get("edge_liked_by");
    return likeNode.get("count").asInt();
  }

  private boolean isPost(JsonNode node) {
    return node.findValue("display_url") != null;
  }

  private String getJson() {
    driver.get("https://www.instagram.com/explore/tags/" + TAG + "/?__a=1");
    String pageSource = driver.getPageSource();
    return removeHtmlTagsSinceReturnedAsWebPage(pageSource);
  }

  private String removeHtmlTagsSinceReturnedAsWebPage(String pageSource) {
    String openTag = "<";
    String closeTag = ">";
    String anyCharactersInTag = "[^>]*";
    String regex = openTag + anyCharactersInTag + closeTag;
    return pageSource.replaceAll(regex, "");
  }

  private void printDescendingByLikes(Map<String, Integer> result) {
    Comparator<Entry<String, Integer>> comparator =
        Comparator.comparing((Entry<String, Integer> e) -> e.getValue()).reversed();
    result.entrySet().stream()
        .sorted(comparator)
        .map(e -> e.getValue() + "\t" + e.getKey())
        .forEach(System.out::println);
  }
}

performance tuning selenium – firefox vs chrome vs headless

I’m the co-volunteer coordinator for NYC FIRST. Every year we are faced with a problem: we want to export the volunteer data including preferences for offseason events. The system provides an export feature but does not include a few fields we want. A few years ago, my friend Norm said “if only we could export those fields.” I’m a programmer; of course we can!

So I wrote him a program to do just that: export-vol-data on GitHub. And fittingly, he “paid” me with free candy from the NYC FIRST office. Once a year we meet, Norm gives his credentials to the program, and we wait. And wait. And wait. This year NYC FIRST had more events than ever before, so it took a really long time. I wanted to tune it.

Getting test data

The problems with tuning have been:

  1. I have no control over when people volunteer for the event. It’s hard to performance test when the data set keeps changing.
  2. The time period when I have access to the event is not the time period that I have the most free time.

Norm solved these problems by creating a test event for me. I started over the summer, but then got accepted to speak at JavaOne and was really busy getting ready for that. When I went back to it, someone had deleted my test event. Norm solved that problem by creating a new event called “TEST EVENT FOR SOFTWARE DEVELOPMENT – DO NOT ENROLL OR DELETE, please. – FLL”. One person did volunteer for it anyway, but only one, so it still worked for testing.

Performance tuning

I tried the following performance improvements based on our experience exporting in April 2017.

  1. SUCCESS: Run the program on the largest events first. (It’s feasible to manually export the data for small events. Plus those have largely people who also volunteered at a larger event.) This allows us to run for the events with the most business value first. It also allows us to abort the program at any time.
  2. SUCCESS: Skip events and roles with zero volunteers. For some reason, it takes a lot longer to load a page with no volunteers. So skipping this makes the program MUCH faster.
  3. SKIP: Add parallelization. I wound up not doing this because the program is so fast now.
  4. FAILED: Switch from the Firefox driver to PhantomJS. I knew the site didn’t function with HtmlUnitDriver. I thought maybe it would work with PhantomJS, an in-memory driver with better JavaScript support. Alas, it didn’t.
  5. FAILED: Try to go directly to URLs with data. FIRST prevents this from working. You can’t simply simulate the REST calls externally.
  6. SUCCESS: Switch from the Firefox driver to the Chrome driver. This made a huge difference in both performance and stability. The program would crash periodically in Firefox, and I was never able to figure out why. I have retry/resume logic, but having to manually click “continue” makes it slower.
  7. UNKNOWN: I added support for headless Chrome to the program. It doesn’t seem noticeably faster, though. And it is fun for Norm and me to watch the program “click” through the site. So I left it as an option, but not the default.
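Improvements 1 and 2 above boil down to ordering the work before any browser automation runs: drop empty events and do the biggest ones first. A standalone sketch of that idea, with made-up event names and volunteer counts (the real numbers come from the volunteer site):

```java
import java.util.*;
import java.util.stream.*;

public class EventOrdering {
    // Drop events with zero volunteers (their pages load slowly for nothing)
    // and sort the rest by volunteer count descending, so the events with the
    // most business value are exported first and the run can be aborted anytime
    static List<String> eventsWorthScraping(Map<String, Integer> volunteerCounts) {
        return volunteerCounts.entrySet().stream()
                .filter(e -> e.getValue() > 0)
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("NYC Regional", 120);   // made-up numbers
        counts.put("Offseason Demo", 0);
        counts.put("FLL Qualifier", 35);
        System.out.println(eventsWorthScraping(counts));
        // prints [NYC Regional, FLL Qualifier]
    }
}
```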


Like any good programming exercise, some things worked and some didn’t. The program is an order of magnitude faster now than at the start though, so I declare this a success!