OSdata tech blog
collecting social media for your website
how to build a major motion picture movie studio tool
This is an example of embarrassing technological ignorance by rich and powerful businessmen.
Warner Brothers is looking to spend a bunch of money investing in some company to build a project that could easily be assigned as a newbie coding project for junior- or senior-level college students (or very bright sophomores or freshmen, or even high school students).
And yes, I am mocking the rich and powerful Warner Brothers executives for being unable to get this up and running in a couple of weeks, because this is all existing technology, and a single skilled professional working full time should be able to build it all quickly.
Here is the key paragraph from the Los Angeles Times, Business section, Wednesday, June 18, 2014, "Speed Dating for Tech Start-Ups" by Paresh Dave and Andrea Chang:
Warner Bros. hoped to track every Twitter post and YouTube video that mentions one of its movies, and promote some of that content on the studio's own website.
random picture forwarded by a friend
The expensive part of the project is the servers to collect, store, and evaluate the information. The programming and development could be done with home equipment that many American teenagers already own. The two most expensive parts are a working desktop (or laptop) computer and a smart phone.
The typical middle class college student already has the required equipment, and a computer science major should be able to get the school to provide the server access in exchange for independent study credit.
Over the next few months I will return to this topic and write some real working code for each part of this project and release it for free under the Apache License 2.0. Don't worry, I'll only write the core portions of the task and leave the fun customization parts for each ambitious newbie to attempt on his or her own. If this were a full-time job, I'd be embarrassed to take that long, but I have to write the code and blogs in my spare time.
The upcoming code (and, yes, I will be adding real working, fully tested code) will be in PHP and MySQL, but the principles should translate easily into your favorite scripting language and SQL.
Both Twitter and YouTube have APIs for accessing their systems.
The student will need to go to each site and sign up for an account (and the typical middle class student probably already has the accounts).
Then upgrade the account to a developer account. This is an expensive step. Both Twitter and YouTube require that the developer be rich enough to own a mobile phone and have that mobile phone connected to an active phone number. They don't do this out of overt hatred for the poor, but rather as a lazy corporate method of confirming the identity of those who access their systems. But the requirement does have the unintended racist consequence of locking out many inner-city youth. And, as Google proudly announced a month ago, the company hires predominantly White and Asian workers, who might not consider the negative consequences of their decisions on minorities and the poor.
[infographic: Google workforce demographics. Tech: 83% men. Leadership: 72% White. Overall: 30% Asian. Tech: 1% Black. Leadership: 1% Hispanic. All categories: less than 1% Other.]
Yes, I am a militant advocate of equal rights for everyone, even minorities, the poor, the elderly, the handicapped, gays and lesbians, and others that corporate America tries to ignore.
Twitter: Twitter Developers
YouTube: Google Developers
Both Twitter and YouTube use OAuth to establish communications.
OAuth is deterministic and well specified. It is fairly straightforward to write the code to establish an OAuth connection yourself, and there are a whole bunch of free, open source libraries that can be used by lazy programmers.
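To show how deterministic it is, here is a minimal sketch of the heart of OAuth 1.0a as Twitter uses it: building the HMAC-SHA1 request signature. I'm sketching in Python (the code I eventually publish here will be PHP, but the algorithm is identical in any language); the function name and parameters are my own invention, and a real client would also generate the nonce and timestamp and assemble the Authorization header.

```python
import base64
import hashlib
import hmac
import urllib.parse


def oauth1_signature(method, url, params, consumer_secret, token_secret=""):
    """Compute the HMAC-SHA1 signature OAuth 1.0a (and Twitter) requires."""
    # 1. Percent-encode every key and value, then sort (RFC 5849, sec. 3.4.1).
    encoded = sorted(
        (urllib.parse.quote(str(k), safe=""), urllib.parse.quote(str(v), safe=""))
        for k, v in params.items()
    )
    param_string = "&".join(k + "=" + v for k, v in encoded)
    # 2. Signature base string: METHOD&encoded-URL&encoded-params.
    base_string = "&".join(
        urllib.parse.quote(part, safe="")
        for part in (method.upper(), url, param_string)
    )
    # 3. Signing key: encoded consumer secret, "&", encoded token secret.
    signing_key = (
        urllib.parse.quote(consumer_secret, safe="")
        + "&"
        + urllib.parse.quote(token_secret, safe="")
    )
    digest = hmac.new(signing_key.encode(), base_string.encode(), hashlib.sha1)
    return base64.b64encode(digest.digest()).decode()
```

Same inputs, same signature, every time; that is what makes the protocol easy to implement and easy to test.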
Twitter: OAuth FAQ
Once you have your developer account and have successfully connected to it, the next step is to gather the information.
YouTube's method is clumsy. You will have to make repeated searches on the appropriate keywords and page through the results.
YouTube: YouTube Data API (v3) NOTE: Be careful, Google's own search engine will send you first to the deprecated version 2.0!
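Here is a hedged Python sketch of that repeated-search loop. The endpoint and parameter names (part, q, maxResults, pageToken, nextPageToken) come from the v3 search.list reference; the fetch callable is my own device, deliberately abstract so you can plug in whatever HTTP client (or test stub) you like.

```python
import urllib.parse

SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(api_key, query, page_token=None):
    """Assemble one search.list request URL for the v3 Data API."""
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "maxResults": 50,   # the v3 per-page ceiling
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)


def collect_all(fetch, api_key, query):
    """Page through every result, following nextPageToken until it runs out.

    fetch is any callable taking a URL and returning the parsed JSON dict."""
    items, token = [], None
    while True:
        page = fetch(build_search_url(api_key, query, token))
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            return items
```

Note the quota implication: every page is a separate billable API request, which is exactly why this method is clumsy compared with a stream.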
Twitter makes gathering the information easy with a streaming API.
Twitter: The Streaming APIs
You could gather the Twitter information through search requests using their REST API, but the streaming approach is going to collect the information much more efficiently and in real time.
You will need to open a streaming connection on your server.
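A Python sketch of the receiving end of that connection. The endpoint is the documented statuses/filter URL; I've elided the actual connection (a long-lived, OAuth-signed HTTP POST carrying your track keywords) because the part worth getting right is handling the one-JSON-object-per-line format and the blank keep-alive lines.

```python
import json

# The statuses/filter endpoint documented for the streaming API.
STREAM_URL = "https://stream.twitter.com/1.1/statuses/filter.json"


def iter_tweets(line_iter):
    """Yield parsed tweets from a stream of raw lines.

    The streaming API delivers one JSON object per line and periodically
    sends blank keep-alive lines; those must be skipped, not parsed."""
    for raw in line_iter:
        line = raw.strip() if isinstance(raw, str) else raw.decode().strip()
        if not line:
            continue    # keep-alive newline
        yield json.loads(line)
```

With a real HTTP client you would feed the response's line iterator (for example, iter_lines() from the requests library) straight into iter_tweets.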
collecting the stream
decouple collection and processing
Because of the high volumes involved, it is important to decouple the collection and processing of the information. This means that one (or more) server(s) establish and collect the streams and place the results into a temporary store.
Other servers are fed the information in the temporary store for actual processing. This can be handled by treating the collection server(s) as a really large buffer and having a simple process that farms out the data to the next available processing server.
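The shape of that decoupling, sketched in a single Python process: threads stand in for the processing servers, and a bounded queue stands in for the temporary store. All the names here are my own; in production the buffer would be spooled files or a message queue shared between machines, but the structure is the same.

```python
import queue
import threading


def run_pipeline(raw_items, worker_count, process):
    """Collect into a bounded buffer; separate workers drain and process it."""
    buffer = queue.Queue(maxsize=10000)
    results, lock = [], threading.Lock()
    STOP = object()     # sentinel telling a worker to quit

    def worker():
        while True:
            item = buffer.get()
            if item is STOP:
                return
            out = process(item)
            with lock:
                results.append(out)

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for item in raw_items:      # the collector: enqueue only, never process
        buffer.put(item)
    for _ in workers:           # one STOP per worker shuts everything down
        buffer.put(STOP)
    for w in workers:
        w.join()
    return results
```

The key property is that the collector only enqueues; no matter how slow processing gets, collection keeps up until the buffer itself fills.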
As an example, during the final one minute of the Brazil penalty-shootout win over Chile in the 2014 World Cup (soccer or football, depending on your nationality), there were 389,000 tweets (Twitter) about the game and a total of 16.4 million tweets throughout that day (Saturday, 28 June 2014) about the World Cup in general. At the peak of the Seattle Seahawks win over the Denver Broncos in the 2014 Super Bowl (American football), there were 382,000 tweets about the game and a total of 24.9 million tweets throughout the entire game.
As you can see, for popular topics you should expect a very large stream of tweet data. You simply can't take the time to process these tweets and still keep up with the incoming flow. You must separate collection from processing.
You need to write a simple parser (or use a free open source API).
You then need to filter out all of the unrelated tweets.
And finally you need to store the information (aggregation).
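Those three steps (parse, filter, aggregate) are mostly plain string and dictionary work. A Python sketch, with a hypothetical watch list of titles standing in for whatever the studio is currently promoting:

```python
import json
from collections import Counter

# Hypothetical watch list; a studio would load its current titles instead.
MOVIE_TITLES = {"godzilla", "edge of tomorrow"}


def relevant(tweet, titles=MOVIE_TITLES):
    """Filter step: keep only tweets whose text mentions a watched title."""
    text = tweet.get("text", "").lower()
    return any(title in text for title in titles)


def aggregate(raw_lines, titles=MOVIE_TITLES):
    """Parse each line, drop the junk, and tally mentions per title."""
    counts = Counter()
    for line in raw_lines:
        try:
            tweet = json.loads(line)
        except ValueError:
            continue    # malformed line: skip it rather than crash the run
        text = tweet.get("text", "").lower()
        for title in titles:
            if title in text:
                counts[title] += 1
    return counts
```

A real filter would be fuzzier than substring matching (hashtags, misspellings, sarcasm), which is exactly the fun customization part I'm leaving to you.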
If you are gathering enough information (which might happen with the motion picture studio example), you need to use the map and reduce method. Split the job among multiple processes/servers and then combine the results.
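The map-and-reduce shape, sketched in Python. The map_fn parameter is my own device: by default it runs sequentially, but the same code fans out across cores if you hand it multiprocessing.Pool.map, or across servers if you ship the chunks elsewhere.

```python
from collections import Counter


def count_chunk(lines):
    """Map step: tally word occurrences within one chunk of text."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
    return counts


def map_reduce(lines, workers=4, map_fn=map):
    """Split the input, count each chunk, merge the partial tallies."""
    chunks = [lines[i::workers] for i in range(workers)]
    return sum(map_fn(count_chunk, chunks), Counter())   # the reduce step
```

The reduce step works because the per-chunk tallies are independent and Counters merge by addition; that independence is the whole trick of map and reduce.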
Just as a hint, you might seriously consider doing this processing with BASH. The pipes and excellent text processing tools of the Unix/BASH or Linux/BASH combination might be exactly what you need.
How much data are you really storing?
It is entirely possible that you might be able to get away with a standard SQL database (such as PostgreSQL, MySQL, or even Oracle). If you need to go to something bigger (beyond around 5-10 terabytes of information), there are literally dozens of NoSQL solutions available.
If you can do the job with a simple SQL database, it will lower your costs and make the project easier to build and maintain.
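To make the simple-SQL option concrete, here is a hedged sketch: the table and column names are my own invention, and while the code I publish will target MySQL, I'm using SQLite here because it keeps the sketch self-contained and the SQL is nearly identical.

```python
import sqlite3

# Hypothetical schema. MySQL would use DATETIME for posted_at; storing
# ISO-8601 text works in SQLite and keeps date grouping trivial.
SCHEMA = """
CREATE TABLE mention (
    id        INTEGER PRIMARY KEY,
    source    TEXT NOT NULL,   -- 'twitter' or 'youtube'
    movie     TEXT NOT NULL,
    body      TEXT,
    posted_at TEXT
);
CREATE INDEX mention_by_movie ON mention (movie, posted_at);
"""


def daily_counts(conn, movie):
    """Mentions per day for one movie: the roll-up a report would display."""
    return conn.execute(
        "SELECT substr(posted_at, 1, 10) AS day, count(*) "
        "FROM mention WHERE movie = ? GROUP BY day ORDER BY day",
        (movie,),
    ).fetchall()
```

One indexed table and a GROUP BY gets you surprisingly far before you need anything fancier.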
Now that you have parsed, filtered, and otherwise processed your incoming information, you need to make it available to human beings.
You need reporting software that provides the business people with aggregate information.
And you will need a system that plucks selected messages and videos for automatic placement on the company website.
When you look at the coding involved, you will see that these two different kinds of reporting have more in common than they are different. Although the output presentation is very different, both tasks will share much of the same code base.
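A toy Python illustration of that shared code base: one ranking function feeding two presentations. The field names (retweets, id) are placeholders for whatever engagement metric and identifier you actually aggregate.

```python
def top_mentions(mentions, limit=5):
    """Shared core: rank collected mentions by an engagement metric."""
    return sorted(mentions, key=lambda m: m["retweets"], reverse=True)[:limit]


def executive_summary(mentions):
    """Presentation one: an aggregate line for the business side."""
    top = top_mentions(mentions, limit=1)
    return "%d mentions collected; top post earned %d retweets" % (
        len(mentions), top[0]["retweets"])


def website_picks(mentions, limit=3):
    """Presentation two: the same ranking, as IDs to embed on the site."""
    return [m["id"] for m in top_mentions(mentions, limit=limit)]
```

Change the ranking once and both reports improve together, which is the payoff of keeping the shared core separate from the presentation layers.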
As you can see, there is real work involved here, but each step is clearly identifiable and involves techniques that are in some cases 50 or 60 years old.
A simple version of this system that handles just one movie (or one song or one celebrity or one model of automobile or one fast food product) is something well within the capabilities of a bright college student or a small team of reasonably competent college students.
Sad that the executives at Warner Brothers couldn't get this job done in a couple of weeks.