Perry -- Katy vs Rick

Katy versus Rick -- and the winning Perry is . . .
Would you believe? A methodological note

This is the puzzle. Rick Perry is running for the Republican Party's nomination for the presidency in 2012. If you want to find the Twitter messages that refer to him, how do you do it?

There are too many Perrys for a search for 'Perry' to work very well. That is much less true for Bachmann, Gingrich, Huntsman, Romney and Santorum. 'Paul' is the general case: imagine what you would get if you searched for 'paul.' If you search for 'Ron Paul' you do not have the same problem. But 'Perry' is a special case. There is, after all, Katy, who is a good deal more widely known than Rick. If you want your search to find messages referring to an aging teenage singer, 'Perry' works very well.

So, how bad is it? I did a search for 'Perry' with the idea that it would incorporate references to both Perrys. The search ran from 10:55 a.m. to 5:55 p.m. on January 5. That was a date on which they were both in the news. Rick Perry had just finished very poorly in the Iowa Republican caucuses, went to Texas to think, thought for about 8 hours, and announced that he was going to take his campaign to South Carolina. And Katy Perry was going through her first divorce. For the two weeks before the Iowa caucuses, Rick Perry was mentioned in an average of 3,800 Twitter messages per 24-hour day. I do not follow Katy, so I cannot make the same comparison.

The search found 41,339 messages containing 'Perry.' Searching those messages for 'Rick' found 6,601 messages. The search for 'Katy' found 12,366 messages. Apparently, celebrity teenage singers get about twice as much attention as candidates for the Republican nomination for the presidency.
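The split described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used: the note does not say how 'Rick' and 'Katy' were matched, so the case-insensitive whole-word matching, the `classify` function, the separate 'both' bucket, and the sample messages are all assumptions.

```python
import re

# Assumption: case-insensitive whole-word matching. The note does not say
# how the first names were matched within the 'Perry' messages.
RICK = re.compile(r"\brick\b", re.IGNORECASE)
KATY = re.compile(r"\bkaty\b", re.IGNORECASE)

def classify(messages):
    """Count messages that mention Rick, Katy, both, or neither."""
    counts = {"rick": 0, "katy": 0, "both": 0, "neither": 0}
    for text in messages:
        has_rick = bool(RICK.search(text))
        has_katy = bool(KATY.search(text))
        if has_rick and has_katy:
            counts["both"] += 1
        elif has_rick:
            counts["rick"] += 1
        elif has_katy:
            counts["katy"] += 1
        else:
            counts["neither"] += 1
    return counts

# Hypothetical sample messages, invented for illustration.
sample = [
    "Rick Perry heads to South Carolina",
    "Katy Perry divorce news",
    "Perry is staying in the race",
    "Did you hear the new Perry single?",
]
counts = classify(sample)
print(counts)
```

The 'neither' bucket is the one that matters for the rest of the argument: these are messages that contain 'Perry' but give no first name to say which one.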

But the real problem is the 22,372 messages that contain neither 'Rick' nor 'Katy.' Just under half the messages contained the full name of one or the other. What to do with the other half? One can assume that the distribution in the 'other' half is about the same as in the half with their full names, which is roughly one message about Rick for every two about Katy. That would give 6,600 + 7,400 for Rick Perry and 12,366 + 15,000 for Katy Perry. That would suggest that a search for 'Rick Perry' finds less than half the messages about the presidential candidate.
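The arithmetic behind that assumption can be checked directly. The sketch below uses the counts reported in the text; the `allocate` function and the variable names are mine. It splits the unlabeled messages in exact proportion to the named counts, which gives roughly 7,800 for Rick and 14,600 for Katy, rather than the rounded 7,400 and 15,000 that a one-third / two-thirds split produces. Either way, the conclusion is the same: the first-name search finds less than half of the estimated Rick Perry messages.

```python
def allocate(unlabeled, rick_named, katy_named):
    """Split unlabeled messages in proportion to the named counts."""
    total_named = rick_named + katy_named
    rick_extra = unlabeled * rick_named / total_named
    return rick_extra, unlabeled - rick_extra

# Counts reported in the text.
total, rick_named, katy_named = 41_339, 6_601, 12_366
unlabeled = total - rick_named - katy_named  # 22,372, as in the text

rick_extra, katy_extra = allocate(unlabeled, rick_named, katy_named)

# Share of the estimated Rick Perry messages that the name search finds.
coverage = rick_named / (rick_named + rick_extra)

print(f"unlabeled messages: {unlabeled:,}")
print(f"estimated extra Rick: {rick_extra:,.0f}, extra Katy: {katy_extra:,.0f}")
print(f"share of Rick messages found by name: {coverage:.0%}")
```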

Another option is to augment the search with other information we have about the candidate. He is from Texas; perhaps adding 'Texas' to the search would produce additional tweets. That only added 461 messages. What about 'Republican'? That was less productive; there were only 190 instances. 'Governor' produced 195 hits. If you put the three together, 764 messages are found. That does not tap very many of the 7,400 that might be there.
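The combined figure (764) is smaller than the sum of the three separate counts (461 + 190 + 195 = 846), presumably because some messages contain more than one of the terms. A minimal sketch of that kind of count follows, assuming simple case-insensitive substring matching; the `keyword_hits` function and the sample messages are hypothetical.

```python
def keyword_hits(messages, keywords):
    """Return per-keyword counts and the number of messages matching any keyword."""
    lowered = [m.lower() for m in messages]
    terms = [k.lower() for k in keywords]
    per_keyword = {k: sum(k in m for m in lowered) for k in terms}
    any_match = sum(any(k in m for k in terms) for m in lowered)
    return per_keyword, any_match

# Hypothetical 'Perry' messages with no first name, invented for illustration.
sample = [
    "Perry the Texas governor stays in the race",
    "Republican Perry regroups",
    "Perry concert tickets on sale",
    "Texas loves Perry",
]
per_keyword, any_match = keyword_hits(sample, ["Texas", "Republican", "Governor"])
print(per_keyword)
print(any_match)
```

A message that mentions both 'Texas' and 'governor' is counted once in the combined total but once under each keyword, so the combined total can never exceed the sum of the parts.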

Let me not drag this out. What if you search for Bachmann and Cain? 6,228! How did that happen?

RT @kellyoxford: Cain, Perry, Bachmann all claimed God told them to run for President, and all are out of the race. God is hilarious.

This retweet, or something very much like it, went wild during the day on January 5. It was incorrect, since Perry decided to continue, but that did not stop the spread of the tweets. That, plus Republican + Texas + governor, gets you very close to the presumed 7,400 unidentified Rick Perry messages.

The methodological point is simple -- humility in reporting the results of searching. Republican, Texas, and governor are available every day. At least that is a good supposition. Unfortunately, they do not identify many of the 'missing' mentions of the candidate. Something completely idiosyncratic, specific to a single day, did a great job with the missing references. But two days before or two days after, it would just not be there.

If Perry versus Perry were the only search that was problematic, this would only be a footnote on the analysis of the race for the Republican nomination. But the problem is quite general. A second example: when Ted Kennedy died, you needed to make three searches to get what seemed a reasonably complete collection. The first phrase used in Twitter messages was "Senator Ted Kennedy." But by the second day people were much more likely to use "Ted Kennedy," and overall "Ted Kennedy" was used much more than "Senator Ted Kennedy." I was able to capture 55,165 messages containing "Ted Kennedy," but only 5,000 of them came on the first day, when there were 13,000 messages containing "Senator Ted Kennedy." Then a third phrase appeared: the Kennedy family used #tedkennedy. While there were not many messages with this hashtag, they were different from the others because of their focus. I could give many more examples of searches with this sort of problem.
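Combining several phrase searches into one collection amounts to taking the union of their results. The sketch below is a hypothetical illustration, not the tool used for the Kennedy collection: the `combined_search` function, the message ids, and the sample texts are invented. It assumes plain case-insensitive substring matching, under which "Ted Kennedy" also matches any message containing "Senator Ted Kennedy"; the search service behind the counts above may have behaved differently.

```python
def combined_search(messages, phrases):
    """Return ids matching any phrase, plus the set of ids matched by each phrase."""
    per_phrase = {
        p: {mid for mid, text in messages.items() if p.lower() in text.lower()}
        for p in phrases
    }
    combined = set().union(*per_phrase.values())
    return combined, per_phrase

# Hypothetical messages keyed by an invented message id.
sample = {
    1: "Senator Ted Kennedy has died",
    2: "Remembering Ted Kennedy",
    3: "Thank you all for your kindness #tedkennedy",
    4: "Kennedy Space Center launch today",
}
phrases = ["Senator Ted Kennedy", "Ted Kennedy", "#tedkennedy"]
combined, per_phrase = combined_search(sample, phrases)
print(sorted(combined))
```

Keeping the per-phrase sets alongside the union makes it possible to see which phrase contributed which messages, which is how a phrase like the hashtag shows up as small in number but distinct in content.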

The general point is the necessity of learning the language people are using when writing Twitter messages. We are used to doing survey research. We get to write the questions -- put the questions into language -- and respondents have to answer. With social media, it is not the language we would use that is relevant. What is important in searching is the language the people writing the messages use. Their language must be used for effective searching.