Feeds:
Posts
Comments

Ever wonder how newspapers and analysts obtain information about private companies?  It’s still a mystery to me but these researchers have figured out a way to triangulate this information from news reports about public companies!  The basic intuition is that using the wisdom of the crowds, in this case reporters and journalists, news stories tend to connect different companies together.  Perhaps these news stories could become good predictive variables for revenue estimation.  Here’s what they did.

They first constructed a social network of both public and private companies based on news stories.  Companies that were mentioned in the same story were connected by a link.   Eight months of news stories on Yahoo Finance were used.  The more mentions of two particular companies together, the stronger the weight of the link between these two companies.   For two datasets, this method generated 2 lists of over 6000 companies each.

Then they used these social network to generate certain scores for each company.  These scores included weighted in-degree, weighted out-degree, out-degree minus in-degree, pagerank, HITS and betweenness.

These social network scores, along with publicly available financial information about a portion of the companies, were fed into a data mining engine using logistic regression and decision tree.  The goal was to see if they could classify the CRR (the company revenue relation between any 2 companies).

Results

They evaluated the performance of the predictions using 3 methods:

precision = number of correctly predicted positive (negative) instances / number of predicted positive (negative) instances ;
recall =number of correctly predicted positive (negative) instances / number of actual positive (negative) instances ;
accuracy = number of correctly predicted instances / number of instances

All these came out to be between 70-80%!

“The two dyad degree-based attributes, NWD and WDOD, fail to predict revenue relations well, whereas the four node degree-based and six node centrality-based attributes produce results nearly as good as those from using all 12 attributes together.”

So what

Many financial analysts who do not have access to the financials of private companies still need to perform valuation analysis.  This method of combining news and public company financial information provides another way to forecast the revenues of private companies.

References

Zhongming Ma, Olivia R.L. Sheng , Gautam Pant.  Discovering company revenue relations from news: A network approach.  Decision Support Systems.  2009 vol 47, p408-414.

Short post – we launched Sonamine Telco Churn Predictor at Prepaid Mobile 2009 show in Lisbon yesterday.  It was a great rush and already we have demoed it to many interested telcos.  B-eye-Network picked it up in their news and we expect more news to follow.  Stay tuned.  BTW, you can read more on our www.sonamine.com website.

Back from relaunching http://www.sonamine.com…and free eval product.

How do companies “get the word out” about a new product?  Used to be in the old days, you put out a press release and then you track the number of “press hits”.  Then you estimate the readership of these press hits to find overall coverage of the press release.

The new web20 social world essentially provides a more advanced form of this PR release…Thanks to a flowering of social media consulting sites, every business is posting information on facebook, twitter etc…  Do you ever wonder how much this information spreads?   How can we compare the old PR methods with the new social PR methods?  After all, we are getting past the hype peak for social media…

This study looks at how pictures were “faved” in Flickr and how subsequent social friends also faved it.  Here’s some data details first : 100 days of Flickr snapshots, covering about 25% of total user base which grew from 1.6M to 2.5M during that period, there were 34.7M markings of “faved” on 11.3M photos.  By timing when the photo was faved, researchers could study propagation of these photos through the network.

Questions being asked : How long does it take to spread to first user; then to subsequent users?  What are some of the factors involved.  A social cascade was defined as user A faving a photo, A being friend of B, and B faving the same photo before friending A.  Basically, A found (and faved) the photo by seeing her friend B’s fav photos.

cascade150% of the first cascade step occurred in less than 3 days, while 50% of the subsequent cascade step occurred in 50 days.  ”an order of magnitude larger than the time before the first step of the cascade.”

Good so far, but what I found really intriguing was the application of epidemic theory here.  Reproductive rate is the number of subsequent infections that result from a single infected person.  It is sort of a multiplier effect.  Cha et. al. tried to create a predictive function of this number, so that they could predict it based on known properties of the social network.   In particular,  they tried to create predictive function based on the degree (think friends) of each node and the % of friends that were infected.   The correlation between the predicted value and the empirical value is shown below.

cascade1Basically, for small values, the predictive function is pretty good.  In the words of the authors,  ”Given the transmission probability of a picture derived from a short time series of user activity, we can then predict the expected spread of the photo for any network structure for which the node degree distribution k is known. This means that we can predict the reproduction number and the resulting spread of these 1,000 photos when they are adopted into other online social networks such as Facebook, Livejournal, and MySpace.”

34,734,221 favorite markings over 11,267,320
distinct photos

So what

Direct online marketers can understand how information spreads within your house email list.  It’s easy to run a series of seed campaigns, track the forward links, utilize these techniques to document the “viral” nature of their house list.  Also you can use the same model to understand and track the “earned media” value of social networks.

More work

It would be really interesting next to see how the local neighborhood characteristics of the infectees correlated with the initial infection lag time…more to come.

Reference

Characterizing social cascades in Flickr.  Cha, M., Mislove, A., Adams, B., Gummadi, K.P.   WOSN’08, August 18, 2008, Seattle, Washington, USA

When you change a mobile service plan or phone number, the first thing you’ll probably do is let your friends know, just in case they encounter problems reaching you at your new number or new service plan.  So it’s natural to think that this “churn” behavior would propagate through the social network.

This study studies whether this “churn” spreads through a call graph network, and how you could use such a call graph to predict people who will churn in subsequent months.  Some basics first.  Study data is a small subset from a large mobile carrier during Mar 2007; the final size after some pruning and cleaning was 2.1 million users and 9.3 million calls.  Many of these users are prepaid card users for whom the carrier has no demographic information.  The authors then studied how probability of churning in April, May, June and July is related to each user’s local neighborhood of friends.

Result 1 – probability of churning is closely related to how many of your neighbors have churned.  The diagram below shows that your likelihood to churn rises steadily with the number of churner neighbors until a plateau.

churn neighborsResult 2 – probability of churn is related to the number of neighbors who have churned AND are also connected.

The authors then proceed to study how certain variables, including social network metrics, might accurately predict people who churn.  They used a standard decision tree classifier.  This is a standard supervised machine learning or data mining problem.  There were 3 types of input data

  1. Usage information (called DT1) – call frequency, number of calls, number of friends, call volume, duration of calls etc
  2. Connectivity (called DT2) – These are usage statistics but broken down by churner or non-churner neighbors.  So 2 examples are “number of churner neighbors” and “number of non-churner neighbors who have churners as neighbors”  The second example is closely related to the concept of eigenvector centrality of the churner network.
  3. Inter-connectivity (called DT3) – These are network or graph type metrics which can only be calculated using graph methods.  ”Number of adjacent pairs in the set of churner neighbors”, “Number of pairs in the churner friends connected by path length of 2″, number of pairs of churner friends whose shortest paths only include churner neighbors, “Total call volume on edges connecting adjacent churner friends”.

Results – Much more accurate churn prediction using social network metrics D3 and D2, compared to D1 alone.  See graph below.

churn neighborsRecall that a lift curve shows how many of the actual churners were predicted by the decision tree classifer as you walk through all the subscribers in the dataset.  When you walk through all 100% of the subscribers, you will catch 100% of the churners.  But that is not very good, you need many marketing promotions or calls to use that.  On the other hand, if you can identify 50% of the churners after walking through 10% of the subscribers, that’s pretty good.

The graph shows that using usage data (DT1), after walking through 10% of the base, they could predict about 10% of the churners.  Not good.

Using DT2, after walking through 10% of the base, they could predict about 18% of the churners.  A good improvement, but still not good.

Using DT3, after after walking through 10% of the base, they could predict about 40% of the churners.  This is very good.

Nick’s back-of-the-envelop business case calculation on why this matters

DT1

DT3

Total subscriber count

2,000,000

2,000,000

Churn rate

0.06

0.06

# churners

120,000

120,000

% of churners identified in 10% of subscribers

0.1

0.4

# churners identified

12,000

48,000

Conversion rate of discount offer

0.1

0.1

# churners converted/saved

1,200

4,800

# churners lost

118,800

115,200

Discount offer Rate

0.05

0.05

Revenue from saved churners calculated as (1-discount offer)*number of saved churners

1,140

4,560

Number of non churners in 10% of subscribers

188,000

152,000

Number of non churner who took the discount

18,800

15,200

Lost revenue from discount to non-churners

-940

-760

Return on investment on churn marketing campaign 200 3,800

Using DT1, it is probably not worth the effort to organize a marketing campaign to save a net of 200 subscribers.  But with DT3, it becames a profitable option to give a 5% discount offer to churners!

So what?

So if this social network prediction works, why aren’t all the telco providers jumping on this?  There are some companies like Idiro, Xtract and Datanetis that are starting to offer churn solutions around this idea.

I think there are 2 main reasons why the telco uptake has been slow.

(1) scalability, performance and reliability – telcos don’t just have 2 million subscribers, most have tens of millions of subscribers.  This type of social network analytics must score the data in minutes to have any operational value.

Shameless plug here : Check out http://www.sonamine.com for a high performance engine that can handle hundreds of millions of nodes and connections in minutes instead of hours.

(2) telcos have been battling churn for a long time and have their own processes that work.  Using social network analytics must therefore integrate with their current processes.  Complete solutions that spawn a different marketing and analytical process will be difficult to integrate and show proof of success.

References

Social ties and their relevance to churn in mobile telecoms networks.  K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, A. Nanavati, A. Joshi.  EDBT’08.  March 25-30, 2008.  Nantes, France.

Smoking in the US has dropped from 45% to 21% in the past 4 decades.  The latest research in smoking cessation studies how social networks influence the likelihood of quitting.   In this study, the authors were examining 6 areas :  (1) are there social network clusters of smokers and non-smokers (2) relationship between one person’s smoking behavior and the smoking behavior of his/her social network contacts (3) the dependence of 2 on the types of social ties (4) the influence of education on the spread of smoking (5) does cessation behavior occur in sub-networks (6) do smokers occupy special positions on the the social network?

The data consisted of 12067 subjects over a period of 32 years.  Individual behaviors were collected every few years.   Clusters of smokers were identified by finding smokers that were “fully connected”.   In graph theory, this translates to a complete graph.  By calculating certain network attributes such as eigenvector centrality, the researchers created a logistic regressions to identify the effect of centrality on smoking behavior.

Results
Although the % of smokers dropped by 50%, the cluster size of smokers remained relatively unchange.  This indicates that smokers are quitting in connected social groups.  

smokers1At the same time, the eigenvector centrality of non smokers fell, relegating them to the periphery of the network.  

smokers2

Among the various other findings, I find this one most interesting:  They found that geographical “distance did not modify the intensity of the effect of the contact’s smoking behavior on the behavior of the subject.  That is, smoking behavior was related between subjects and their contacts, regardless of how far apart they were geographically.”

So what?

There are various smoking cessation social networks out there such as quitnet.com etc.  Since there is evidence that geographical distance does not alter the effect of social networks, we can hypothesize that online social networks might have the same effect as real life face-to-face networks.  Each of these online quitting networks can therefore further tailor their offerings if they can mine their network data and score their users with network “smoking” attributes.  Artificially creating clusters of quitters from the isolated smokers might be a viable intervention option. 

Reference

The collective dynamics of smoking in a large social network.  Kristakis, N. and Fowler, J. H.  New England Journal of Medicine  2008:358:2249-58

The AT&T communications network is a complex sprawl of fiber optic cables linked by routing stations and nodes.  In 1997, AT&T handled anywhere from 250 to 290 million calls per day.  Amazingly, 99.98% of all calls were completed on the first attempt.  To achieve this feat, AT&T uses a patented Real Time Network Routing (RTNR) algorithm which is implemented by the Fast Automatic Restoration (FASTAR) system.  These algorithms and systems allow network planners to include restoration capacity into the implementation of the network, and also allow the routing nodes in the system to redirect traffic to spare capacity routes when there are any failures in the network, whether they are from fibre cuts or power outages. 

Deciding how much restorative capacity to include and how to route affected calls is a complex network (graph) problem.  FASTAR works by first having sentinel nodes tell it when there are failed fibre or nodes.  FASTAR then works out the affected demand, ie. calls, and where their best alternative restorative path should be.  FASTAR proceeds to send new instructions to the routing nodes for the affected calls, to send the calls down the alternate path.  

AT&T labs and their network planning team came up with an alternate way to model this restorative capacity problem using a graph model and linear programming.  The constraints were two fold – to restore as many affected calls as possible; and a cost minimization to implement the restorative capacity.   The final network being modeled here was pretty ‘large’ – hundreds of nodes, thousands of connected arcs, tens of thousands of demand units.  The resulting LP model had millions of variables and constraints.  As a result, data aggregation was performed to ensure quick analysis times.  

It took 9 months to build the model.  The LP run using aggregated data occured on a Sun enterprise 3000 machine with 2GB of RAM using CPLEX v4 as the LP solver completed in 2 hours.  The unaggregated data run took 6 CPU days. 

Results

Using 1997 network plan as a baseline, the new model found that it needed 39% fewer additions to the network than the current system!  For the 1998 planning session, the new model showed savings of 36% without any degradation in performance.  With no incremental new builds and some small additional cabling, the model returned over 6800 demand units that were previously reserved for restorative capacity into active service, thus providing more capacity for revenue without incremental costs!

So what?

This application shows that in some cases, real time network network analysis is required.  As a result, some data size and analytic compromises were made.  Faster computation options should allow more data to be used in the future.

References

Optimizing restoration capacity in the AT&T network.  Ambs K., et. al.  Interfaces 30, 2000, p. 26-44

In 2006, the Water Distribution Systems Analysis Symposium put out a challenge to identify the most cost effective way to detect contaminants in a given water system.  They put out two data sets, each complete with information about water flow, pressure, population affected.  The two datasets had 129 nodes (pipe junctions, distribution points) and 12527 nodes respectively.

The challenge was to identify where to place 20 sensors so that they would minimize 4 different variables in the event of specific contamination scenarios: expected time of detection, expected population affected prior to detection, expected volume of consumed contaminated water prior to detection, and detection likelihood.

Leskovec and team were able to use network analysis techniques to accomplish this task.   One key contribution they made in this paper was to improve the speed of the technique.  In their words, “we need(ed) to simulate 3.6 million contamination scenarios, each of which takes approximately 7 seconds….We ran the simulation for a month on a cluster of about 40 machines”.  Without the improvements that resulted in a compression of the data and thus fitting the raw data into RAM, the simulations would have taken 1000x longer.   That’s not a typo… 1000x longer.

So what?

Some very useful network/graph techniques are going to be available for commercial use when better algorithms and distributed systems come together!

References

Z1 = expected time of detection,
2 Z2 = expected population affected prior to detection,
3 Z3 = expected volume of consumed contaminated water prior to detection,
4 Z4 = detection likelihood,

Cost effective outbreak detection in networks.  Leskovec, J., et. al.  KDD’07 Aug 12-15, 2007.  San Jose, California, USA.

Companies are deploying communities in droves to drive down support costs.  Folks like Microsoft, Intel and Best Buy all have communities, blogs and bulletin boards centered around product information and support.  Sonamine will embark on a project to figure out how community managers can further use network analytics to improve their communities.   The output will be whitepaper that will be shared with the participating community managers.

Any community managers or media strategists who want to get involved are welcome to contact me at nick@sonamine.com

Link to project page here.

Cancer treatment has been improving at great strides.  One important facet used to decide on the course of treatment is how likely the cancer will spread aggressively.  Metastasis is the term used to describe the cancer spreading.  If doctors think a cancer will spread aggressively, then the recommended treatment will also be more aggressive.  More aggressive treatment tends to result in more side effects.  

The aggressiveness of cancers have been studied in wide scale protein marker studies, ie.  what proteins and genes can be used to predict the level of aggression.  So far, the gene and protein predictors have been “scalar” and “independent”.  This study uses a novel approach – look at network of interactions between various proteins, and use these networks to predict metastatis.  

So the method is fairly straightforward.  The dataset consisted of 8141 genes studied from 2 sets of breast cancer patients.  Some of these breast cancer patients developed metastatic cancer, some did not.  To find the subnetworks, they implemented a simple algorithm that used 2 elements – a scoring function for the subnetwork and a “greedy node addition” step.  They started with a random node, and added a neighboring protein node.  The scoring function was calculated to see if new node added any value to the scoring function.  By going through this whole process, they found 149 and 243 significant subnetworks in the 2 data sets respectively.  

Each subnetwork was then measured with a activity score that is different from the discriminant scoring function used to find them.  This activity score was then analyzed with logistic regression to see how well they predicted whether a particular patient would become metastatic.  

At a fixed sensitivity of 90%, the subnetwork markers achieved 70.1%  and 72.2%  accuracy, measured as the percentage of correct classifications, compared 62 and 63% with using only gene markers.  An AUC analysis also shows that these subnetworks do a better job of predicting whether a particular cancer patient has an aggressive cancer.

auc

At a fixed sensitivity of
90%, the subnetwork markers achieved 70.1% (van de Vijver
et al, 2002) and 72.2% (Wang et al, 2005) accuracy, measured
as the percentage of correct classifications using the technique
of five-fold cross-validation within each data set. This
accuracy compares favorably with those reported in the
original studies (van de Vijver et al, 2002; Wang et al, 2005)
(62 and 63%; see Supplementary Table S1).

 

So what?

This study allows us now to start looking for groups of genes together that affect our health, not just individual genes.

References

Network based classification of breast cancer metastasis.  Chuang, H.Y. et al.  Molecular Systems Biology 3:140.  2007

An interesting question:  If a martian were to look at the NFL match schedules, would it be able to correctly identify the NFC and AFC, with the east, west, north south divisions and the four teams in each of them? 

Mann and team decided to use graph based techniques to tease this out.  The first step is to recognize that the game schedule is actually a graph in disguise.  Teams are nodes in the graph, when 2 teams play, they are connected by an edge.   

Next to identify hierarchies and clusters in the network, one assumes that the amount of “connectivity” within each cluster is much higher than outside the cluster.  So one way to separate out connectivity is to think in terms of flow of something, like water perhaps, through the network graph.  If a particular set of nodes (teams) are highly connected, then the flow through these nodes is very high.  That’s quite intuitive since there are more “pipes” among the nodes.

So the problem is really a graph problem.  You want to “cut” the graph into the smallest number of subgraphs while maintaining the highest flow in each of the subgraph and having the highest maximum concurrent flow.  The order of the cuts determine both the clusters and the hierarchical ordering of the subgraphs.

So they were able to correctly deduce the hierarchical structure from the team match schedule, for both NFL and NCAA.  The matrix below shows the 2 diagonal sections for NFC and AFC, with the further subdivisions in different shades of grey.  Neat.

games

So what?
Our human social network also has a hierarchical structure organized along age.  We speak less to our parents, but more to our friends.  Some of these techniques may be used to identify parent child clusters from call or facebook social graphs! 

References

The use of sparsest cuts to reveal the hierarchical community structure of social networks.  Mann, C. F., Matula, D.W., Olinik, E.V.  Social Networks, 30 (2008) 223-234.

Older Posts »