Isidre Royo

Isidre Royo is a Product Manager for OpenText Analytics, specializing in predictive analytics and Big Data, based in Barcelona.

Data Quality is the Key to Business Success


In the age of transformation, all successful companies collect data, but one of the most expensive and difficult problems to solve is the quality of that information. Data analysis is useless if we don't have reliable information, because the answers we derive from it could deviate greatly from reality, and as a consequence we could make bad decisions. Most organizations believe the data they work with is reasonably good, but they recognize that poor-quality data poses a substantial risk to their bottom line (The State of Enterprise Quality Data 2016, 451 Research). Meanwhile, the idiosyncrasies of Big Data are only making the data quality problem more acute: information is being generated at increasingly faster rates, and larger data volumes are innately harder to manage.

Data quality challenges

There are four main drivers of dirty data:

Lack of knowledge. You may not know what certain data mean. For example, does the entry "2017" refer to a year, a price ($2,017.00), the number of widgets sold (2,017), or an arbitrary employee ID number? This can happen because the structure is too complex, especially in large transactional database systems, or because the data source is unclear (particularly if that source is external).

Variety of data. This is a problem when you're trying to integrate incompatible types of information. The incompatibility can be as simple as one data source reporting weights in pounds and another in kilograms, or as complex as different database formats.

Data transfers. Employee typing errors can be reduced through proofreading and better training. But a business model that relies on external customers or partners to enter their own data runs a greater risk of "dirty" data, because it can't control the quality of their inputs.

System errors, caused by server outages, malfunctions, duplicates, and so forth.

Dealing with dirty data?

Correcting a data quality problem is not easy. For one thing, it is complicated and expensive, and the benefits aren't apparent in the short term, so it can be hard to justify to management. And as I mentioned above, the data gathering and interpretation process has many vulnerable places where error can creep in. Furthermore, both the business processes from which you're gathering data and the technology you're using are liable to change at short notice, so quality correction processes need to be flexible. Therefore, an organization that wants reliable data quality needs to build in multiple quality checkpoints: during collection, delivery, storage, integration, recovery, and analysis or data mining.

The trick is having a plan

Monitoring so many potential checkpoints, each requiring a different approach, calls for a thorough quality assurance plan. A classic starting point is analyzing data quality when it first enters the system, often via manual input, or where the organization may not have standardized data input systems. The risk here is that data entry can be erroneous, duplicated, or overly abbreviated (e.g., "NY" instead of "New York City"). In these cases, data quality experts' guidance falls into two categories. First, you can act preventively on the process architecture: building integrity checkpoints, enforcing existing checkpoints better, limiting the range of data that can be entered (for example, replacing free-form entries with drop-down menus), rewarding successful data entry, and eliminating hardware or software limitations (for example, if a CRM system can't pull data straight from a sales revenue database).
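As a minimal sketch of what such a preventive checkpoint can look like in practice (in Python, with hypothetical field names and business rules, not a specific OpenText feature), the snippet below validates a record against a closed list of cities and a few simple range rules before it is stored, the programmatic equivalent of replacing free-form entries with a drop-down menu.

```python
# Minimal sketch of an entry-time integrity checkpoint (hypothetical fields and rules).
ALLOWED_CITIES = {"New York City", "Los Angeles", "Chicago", "Houston"}  # closed list, like a drop-down
MAX_UNIT_PRICE = 10_000.00                                               # assumed business rule

def validate_order(record: dict) -> list:
    """Return a list of validation errors; an empty list means the record can be stored."""
    errors = []
    if record.get("city") not in ALLOWED_CITIES:
        errors.append(f"city {record.get('city')!r} is not in the approved list")
    price = record.get("unit_price")
    if not isinstance(price, (int, float)) or not 0 < price <= MAX_UNIT_PRICE:
        errors.append(f"unit_price {price!r} is missing or out of range")
    if not record.get("customer_id"):
        errors.append("customer_id is required")
    return errors

# An over-abbreviated entry is rejected at the door instead of polluting the repository.
print(validate_order({"city": "NY", "unit_price": 2017.0, "customer_id": "C-42"}))
```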
The other option is to strike retrospectively, focusing on data cleaning or diagnostic tasks (error detection). Experts recommend these steps:

Analyzing the accuracy of the data, either by making a full inventory of the current situation (trustworthy but potentially expensive) or by examining work and audit samples (less expensive, but not 100% reliable).

Measuring the consistency and correspondence between data elements; problems here can affect the overall truth of your business information.

Quantifying system errors in analysis that could damage data quality.

Measuring the success of completed processes, from data collection through transformation to consumption. One example might be how many "invalid" or "incomplete" alerts remain at the end of a pass through the data.

Your secret weapon: "data provocateurs"

None of this will help if you don't have the whole organization involved in improving data quality. Thomas C. Redman, an authority in the field, presents a model for this in a Harvard Business Review article, "Data Quality Should Be Everyone's Job." Redman says it's necessary to involve what he calls "data provocateurs": people in different areas of the business, from top executives to new employees, who will challenge data quality and think outside the box for ways to improve it. Some companies are even offering awards to employees who detect process flaws where poor data quality can sneak in. This not only cuts down on errors, it has the added benefit of promoting the idea throughout the company that clean, accurate data matters.

Summing up

Organizations are rightly concerned about data quality and its impact on their bottom line. The ones that take measures to improve their data quality are seeing higher profits and more efficient operations, because their decisions are based on reliable data. They also see lower costs from fixing errors and spend less time gathering and processing their data. The journey towards better data quality requires involving all levels of the company. It also requires assuming costs whose benefits may not be visible in the short term, but which will eventually end up boosting these companies' profits and competitiveness.


CRM for BI? There’s No Substitute for Full-Powered Analytics


As CRM systems get more sophisticated, many people think these applications will answer all of their business questions – for example, "Which campaigns are boosting revenues the most?" or "Where are my opportunities getting stuck?" Unfortunately, those people are mistaken.

You have a CRM

Some vendors of CRM (customer relationship management) platforms like to talk about their analytic capabilities. Usually, those CRM systems have reporting capabilities and let users build dashboards or charts to follow the evolution of their business: tracking pipeline, campaign effectiveness, lead conversion, quarterly revenue and so on. But from the perspective of analytics, this is only the smallest fraction of the value of the information your CRM is capturing from your sales teams and your customers.

A CRM system is perfect for its intended purpose: managing the relationship between your sales force and your customers (or potential customers) and all the information related to them. And yes, it gathers lots of data in its repository, ready to be mined. What you may not realize is that the tool that collects transactional data is not necessarily the best tool for taking advantage of it. It is critical for a CRM to be agile and fast in its interactions with your sales force; if it's not, it interferes with sales people selling, building relationships with customers, and contacting prospects. So an ideal CRM system should be architected to collect, structure, and deploy transactional data in a way the platform can manage easily. Here's the bad news: this kind of agile, transactional data structure isn't great for analytics.

Complex questions and CRMs don't match

Some CRM vendors try to add business intelligence capabilities to their applications, but they face a basic problem: data that is optimized for quick transactional access is not prepared to be analyzed, and in this scenario there are lots of questions that can't be answered easily:

Which accounts have gone unattended in the last 30 days?
Who owns specific accounts?
Where are my opportunities getting stuck?
At a given point in the sales cycle, which accounts are generating the most revenue?
How profitable is this account?
Which campaigns are influencing my opportunities?
Which products are sold together most frequently?

These are only a few of the hundreds of questions that can't be answered with the data and analytic techniques a CRM offers on its own. It becomes impossible if you want to blend CRM data with other external data, because a closed system like a CRM doesn't have that capability.

CRM and full-powered analytics software

CRM is a perfect tool for some things. However, it isn't up to the demands of answering the ever more complex questions companies need answered, and getting a 360-degree view of the business shouldn't be an obstacle to growing and increasing revenue. OpenText™ Big Data Analytics helps companies blend and prepare their data, and provides a fast, agile and flexible toolbox that lets business analysts look for answers to the questions decision-makers require. CRM data is one important part of this equation; being more competitive using full-powered analytics software is another.
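As a small illustration of how simple one of those questions becomes once CRM data is exported into an analytical environment, here is a sketch in Python with pandas (the file and column names are hypothetical, not a documented CRM export format) that answers "Which accounts have gone unattended in the last 30 days?"

```python
# Minimal sketch: "Which accounts have gone unattended in the last 30 days?"
# Assumes CRM data has been exported to two hypothetical files:
#   accounts.csv   -> account_id, account_name, owner
#   activities.csv -> account_id, activity_date
import pandas as pd

accounts = pd.read_csv("accounts.csv")
activities = pd.read_csv("activities.csv", parse_dates=["activity_date"])

# Last recorded activity per account, joined back to the full account list.
last_touch = (activities.groupby("account_id")["activity_date"]
              .max().rename("last_activity").reset_index())
report = accounts.merge(last_touch, on="account_id", how="left")

# Unattended = no activity at all, or the latest activity is older than 30 days.
cutoff = pd.Timestamp.today() - pd.Timedelta(days=30)
unattended = report[report["last_activity"].isna() | (report["last_activity"] < cutoff)]

print(unattended[["account_name", "owner", "last_activity"]].sort_values("last_activity"))
```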


Big Data: The Key is Bridging Disparate Data Sources


People say Big Data is the difference between driving blind in your business and having a full 360-degree view of your surroundings. But adopting Big Data is not only about collecting data. You don't get a Big Data club card just for changing your old (but still trustworthy) data warehouse into a data lake (or even worse, a data swamp).

Big Data is not only about sheer volume of data. It's not about making a muscular demonstration of how many petabytes you have stored. To make a Big Data initiative succeed, the trick is to handle widely varied types of data, disparate sources, datasets that aren't easily linkable, dirty data, and unstructured or semi-structured data. At least 40% of the C-level and high-ranking executives surveyed in the most recent NewVantage Partners' Big Data Analytics Survey agree. Only 14.5% are worried about the volume of the data they're trying to handle.

One OpenText prospect's Big Data struggle is a perfect example of why the key challenge is not data size but complexity. Recently, OpenText™ Analytics got an inquiry from an airline that needed better insights in order to head off customer losses. This low-cost airline had made a discovery about its loyal customers: some of them, without explanation, would stop booking flights. These were customers who used to fly with the airline every month or even every week, but were now disappearing unexpectedly. The airline's CIO asked why this was happening.

The IT department struggled to push SQL queries against different systems and databases, exploring common scenarios for why customers leave. They examined:

The booking application, looking for lost customers (or "churners"). Who has purchased flights in previous months but not the most recent month? Which were their last booked flights?

The customer service ticketing system, to find whether any of the "churners" identified in the booking system had a recent claim. Were any of those claims solved? Closed by the customer? Was there any hint of customer dissatisfaction? What are the most commonly used terms in their communications with the airline – for example, prices? Customer support? Seats? Delays? And what was the tone or sentiment around such terms? Were they calm or angry? Merely irked, or furious and threatening to boycott the airline?

The database of flight delays, looking for information about the churners' last bookings. Were there any delays? How long? Were any of these delayed flights cancelled?

Identifying segments of customers who left the company during the last month, whether due to unresolved claims or too many delayed or cancelled flights, would be the first step towards winning them back. So at that point, the airline IT department's most important job was to answer the CIO's question: may I have this list of customers?

The IT staff needed more than a month to get answers to these questions, because the three applications and their databases didn't share information effectively. First they had to move long lists of customer IDs, booking codes, and flight numbers from one system to another, then repeat the process when the results weren't useful. It was a nightmare crafted of dispersed data, complex SQL queries, transformation processes, and lots of effort – and it delivered answers too late for the decision-maker. A new month came with more lost customers. That's when the airline realized it needed a more powerful, flexible analytics solution that could effortlessly draw from all its various data sources.
Intrigued by the possibilities of OpenText Analytics, the airline asked us to demonstrate how we could solve its problems. Using Big Data Analytics, we blended the three disparate data sources, and in just 24 hours we were able to answer the questions: OpenText™ Big Data Analytics had worked its magic.

The true value of Big Data is getting answers out of data coming from several diverse sources and different departments. This is the pure 360-degree view of the business that everyone is talking about. But without an agile and flexible way to get that view, value is lost in delay. Analytical repositories that use columnar technologies – the technologies OpenText Analytics solutions are built on – are there to help answer questions fast when a decision-maker needs answers to business challenges.
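To make the blending step concrete, here is a minimal sketch in Python with pandas (the file and column names are hypothetical, and this is an illustration of the idea rather than how the product does it) that joins a bookings extract with claims and flight-delay data in a single pass, instead of shuttling ID lists between three systems.

```python
# Minimal sketch of blending three disparate sources to profile churners
# (hypothetical files and columns).
import pandas as pd

bookings = pd.read_csv("bookings.csv", parse_dates=["flight_date"])  # customer_id, flight_no, flight_date
claims = pd.read_csv("claims.csv")                                   # customer_id, status
delays = pd.read_csv("delays.csv")                                   # flight_no, delay_minutes, cancelled

# Churner candidates: flew in earlier months but not in the most recent month.
latest_month = bookings["flight_date"].max().to_period("M")
last_flight = bookings.groupby("customer_id")["flight_date"].max()
churners = last_flight[last_flight.dt.to_period("M") < latest_month].index

# Blend: each churner's flights, joined with delay info and unresolved claims.
churner_flights = bookings[bookings["customer_id"].isin(churners)]
profile = (churner_flights
           .merge(delays, on="flight_no", how="left")
           .merge(claims[claims["status"] != "solved"], on="customer_id", how="left"))

# The list the CIO asked for: churners with an open claim or a cancelled/badly delayed flight.
flagged = profile[(profile["cancelled"] == True) |          # noqa: E712 (cancelled may be missing/NaN)
                  (profile["delay_minutes"] > 60) |
                  profile["status"].notna()]
print(flagged["customer_id"].drop_duplicates())
```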


Data Scientist: What Can I Do For You?


After attending our first Enterprise World, I have just one word to define it: intense. My memory holds a huge number of incredible moments: spectacular keynotes, lots of demos and amazing breakout sessions. Now, trying to digest all of these experiences and collecting the opinions, suggestions and thoughts of the customers who visited our booth, I remember a wonderful conversation with a customer about data mining techniques, their best approaches and where we can help with our products. From the details and the way he framed his questions, it was pretty clear that I was talking with a data scientist, or at least someone who deeply understands this amazing world of data mining, machine learning algorithms and predictive analytics.

Just to put it in context: data scientists usually maintain a professional skepticism about applications that provide an easy-to-use interface, without a lot of options and knobs, for running algorithms for prescriptive or predictive analytics. They love to tweak algorithms, writing their own code or accessing and modifying all the parameters of a given data mining technique, just to obtain the best model for their business challenge. They want full control over the process, and that is fully understandable: it is their comfort zone.

Data scientists push back against concepts like the democratization of predictive analytics, and they have good reasons. I agree with a large number of them. Most data mining techniques are pretty complex, difficult to understand, and need a lot of statistics knowledge just to say, "Okay, this looks pretty good." Predictive models need to be maintained and revised frequently, based on your business needs and the amount of data you expect to use during the training and testing process. More often than you can imagine, models can't be reused for similar use cases. Each business challenge has its own related data, and that data is what defines how a prescriptive or predictive model should be trained, tested, validated and, ultimately, applied in the business.

On the other hand, a business analyst or a business user without a PhD can take advantage of predictive applications that package the most common algorithms in a box (a black box) and start answering questions about the business. Moreover, their companies often can't afford the expensive compensation of a data scientist, so they have to deal with all of this by themselves.

But what can we do for you, data scientist?

The journey starts with the integration of distinct sources (databases, text files, spreadsheets or even applications) into a single repository where everything is connected. Exploring and visualizing complex data models with several levels of hierarchy offers a better approach to the business model than the usual one-huge-table method. Having an analytical repository that reflects how the business flows helps with one of the hardest parts of the data scientist's job: problem definition. Collecting data is just the beginning; there is a huge list of tasks related to data preparation, data quality and data normalization. This is where the business analyst or the data scientist loses much of their precious time, and we are here to help them, accelerating the path from raw data to value. Once they have clean data, a data scientist can begin analyzing it: finding patterns, correlations and hidden relationships.
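To ground that, here is a tiny sketch of the prepare-then-explore loop in Python with pandas; the dataset, column names and the churned flag are all hypothetical, and this is a stand-in for the kind of work described above, not a description of the product's drag-and-drop workflow.

```python
# Minimal sketch of the prepare-then-explore loop (hypothetical dataset and columns).
import pandas as pd

df = pd.read_csv("customers.csv")        # blended from several sources; churned is a 0/1 flag

# Preparation: deduplicate, fill gaps, normalize one numeric attribute.
df = df.drop_duplicates(subset="customer_id")
df["monthly_spend"] = df["monthly_spend"].fillna(0)
df["spend_zscore"] = (df["monthly_spend"] - df["monthly_spend"].mean()) / df["monthly_spend"].std()

# Exploration: which numeric attributes move together with churn?
numeric = df.select_dtypes("number")
print(numeric.corr()["churned"].sort_values(ascending=False))
```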
OpenText Big Data Analytics can help by providing an agile solution for performing all of this analysis. Moreover, everything is calculated fast, using all of your data (your big data), in a flexible trial-and-error environment. So, the answer to my question: OpenText Big Data Analytics can cut the time spent on preparation and increase the time spent where it is really needed, on analysis and decision making, even when the company is dealing with big data. So why don't you try it in our 30-day Free Trial, or ask us for a demo?


Here’s Your No. 1 Tool for Fraud Detection

In the song "Talk About the Blues," the experimental band The Jon Spencer Blues Explosion declares that "the blues is number 1." If you're a blues fan, you probably associate the number 1 with Jon Spencer. But if you're interested in fraudulent or anomalous numbers, you might just associate it with Frank Benford.

Here's why: back in 1938, Benford observed that 1 is more likely than any other number to be the first digit (also called the most significant digit) of a natural number. He determined this based on analysis of 20 different sources of numbers, ranging from articles in Reader's Digest to population sizes to drainage rates of rivers. At a remarkable rate, the first digit of the numbers Benford studied was either a 1 or a 2 – and most frequently a 1.

But Benford wasn't the first person to discover this. The astronomer Simon Newcomb noticed it in 1881, while thumbing through books of logarithm tables. Newcomb noticed that pages with log tables for numbers beginning with 1 or 2 were grubbier than the pages for numbers beginning with 8 or 9. After some mathematical exploration, Newcomb proposed a law stating that natural numbers were much more likely to begin with a 1 or a 2 than with any other digits. In fact, he said that natural numbers have 1 as their first digit about 30 percent of the time.

Newcomb's observation wasn't discussed much for more than 50 years. Then Benford (who worked as a physicist for the General Electric Company) tested Newcomb's law on 20 different data sets. Based on his calculations (his distribution is shown above), Benford declared that Newcomb's law was "certain" – and, without hesitation, he applied his own name to the phenomenon. (Smart guy!)

Now known as Benford's Law, the idea has come to acquire an aura of mystery. After all, if a collection of numbers is truly natural – that is, "occurring commonly and obviously in nature" – shouldn't their first digits be identically distributed across all numbers from 1 to 9? Benford's Law is mysterious, yes, but it works. It's now widely used by investigators looking for fraudulent numbers in tax returns, credit card transactions, and other big data sets that are not aggregated. Obviously, it doesn't work with dates, postal codes and other "preformed" numbers. A helpful basic discussion of Benford's Law is available via Khan Academy.

Its simplicity makes Benford's Law really easy to apply in automatic audit procedures. You only need to compare the first digits of the set of numbers you want to analyze against the distribution in Benford's Law. If certain values in your data deviate from what Benford's Law dictates, those numbers probably aren't natural. Instead, they may have been invented or manipulated, and a deeper analysis is required to find the problem.

For example, consider the distribution shown in the dots and black line on the chart below, and compare it to the blue bars and numbers, which represent Benford's distribution. You can clearly see (because we've circled it in red) that more than 20 percent of the numbers in the data set have 2 as their most significant digit, even though Benford's Law says they should represent less than 18 percent. This tells us something is fishy, and it may be worthwhile to dig deeper into the underlying numbers.

This kind of analysis doesn't have to end with the most significant digit.
You can also analyze the second, third and fourth digits; each has its own distribution that will let you isolate possibly fraudulent numbers from millions of legitimate transactions for further analysis. You can apply Benford's Law to your data really easily with OpenText Actuate Big Data Analytics: just follow the step-by-step guide in our Free Trial Resource Center. Cross-check this information against incomes, tax returns, revenues, and financial transactions. If there is something strange or fraudulent in your data, you will find it. And instead of singing the blues like the lip-syncing actors below, you'll sing the praises of Frank Benford.
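To make the check concrete: the expected Benford frequency of a leading digit d is log10(1 + 1/d). Here is a minimal sketch in Python (assuming only a plain list of positive amounts; the 5-point deviation threshold is arbitrary, chosen for illustration) that compares observed first-digit frequencies against that expectation and flags digits that deviate.

```python
# Minimal sketch of a first-digit Benford check (toy data; threshold is illustrative).
import math
from collections import Counter

def first_digit(x: float) -> int:
    """Most significant digit: the first non-zero digit of the number's decimal form."""
    for ch in f"{abs(x):.15g}":
        if ch.isdigit() and ch != "0":
            return int(ch)

def benford_check(amounts, threshold_pct=5.0):
    observed = Counter(first_digit(a) for a in amounts if a)
    total = sum(observed.values())
    for d in range(1, 10):
        expected = math.log10(1 + 1 / d) * 100       # Benford's expected share, in percent
        actual = 100 * observed.get(d, 0) / total
        flag = "  <-- deviates, worth a closer look" if abs(actual - expected) > threshold_pct else ""
        print(f"digit {d}: expected {expected:5.1f}%  observed {actual:5.1f}%{flag}")

benford_check([1023.50, 2784.00, 130.25, 1999.99, 2450.10, 118.00, 3020.00, 1750.40])
```

In real use you would run this over millions of rows rather than a toy list, and the same idea extends to the second and later digits, each with its own expected distribution.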


Too Much Customer Churn? Don’t Panic, Use Analytics

Customer turnover, or churn, is a fact of life for many businesses. But predicting customer behavior can help mitigate losses due to churn, and may even help turn customer loyalty around. It's an eye-opening experience when a business we consult with sees how this process works. And even if they discover a huge churn problem using analytics, the main message is this: don't panic.

What do I mean? For starters, some of our customers began their analytics journey by testing data from their own enterprises. After drilling through reports, dashboards and scorecards, they assume that the first step in the journey is complete. They are happy when they see the "what happened" knowledge, as at least one big question has been answered. But, as often happens, several new questions appear once the "what happened" answer is unveiled: Why did this happen? What made it happen? Will it happen again? Is there a trend? How do my actions affect customer churn?

One customer we worked with assumed he knew all about his business challenges. The numbers and the charts told the story, and everything pointed to the same pain: valuable customers were fleeing, slowly but steadily, even though the lead generators were going full throttle. At this point, knowledge about customers and their behaviors was critical. But usually that means working with a lot of data and several heterogeneous sources: demographic data, old transactions, customer service interactions, browsing data from the company's website, and responses to marketing campaigns. In many cases, in the era of the Internet of Things (IoT), the list is even longer.

Once this information is integrated and running in a fast columnar database, identifying churners is easy. The real challenge is determining whether a customer is not going to come back. With this set of customers, it is possible to find the attributes that define a churner. OpenText Big Data Analytics uses a profiling algorithm based on Z-score statistical analysis and compares all the desired attributes against the set of customers.

Classifying and re-classifying data are some of the most common tasks that data scientists and business analysts do, and it's very important for companies to classify their customers. (Read more about why in my previous blog post, Why You Need to Classify Your Customers.) OpenText Big Data Analytics offers several algorithms to classify objects, including Naïve Bayes (recently added in the 5.1 release), Decision Tree and Logistic Regression. It is always a good idea to compare all of the available models with your data to determine which data mining technique performs better.

The next step, after training and tuning the classification model to identify churners, is to apply the model to your loyal customers – the ones who have not yet fled. The classification algorithm searches for the patterns that define potential churners among the loyal customers and marks them as "probable future churners." Now it's time to take care of those customers.

Admittedly, taking care of customers requires time and resources, and the company will probably need to prioritize some customers above others. So why not start with the group of customers that generates a higher profit? Yes, I'm talking about classifying customers again. You will never be finished with classifying customers.
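To give a feel for the profiling idea, here is a rough sketch (the column names are hypothetical, and this is not the product's own implementation) that measures, in Z-score terms, how far the churner group's average sits from the overall customer base on each attribute; the attributes with the largest absolute scores are the ones that most distinguish churners.

```python
# Rough sketch of Z-score profiling: which attributes most separate churners from
# the overall customer base? (hypothetical columns; not the product's own algorithm)
import pandas as pd

customers = pd.read_csv("customers.csv")        # one row per customer; churned is a 0/1 flag
attributes = ["monthly_spend", "support_tickets", "days_since_last_purchase"]

churners = customers[customers["churned"] == 1]
profile = {}
for col in attributes:
    mean_all, std_all = customers[col].mean(), customers[col].std()
    # How many standard deviations the churner group's mean sits from the overall mean.
    profile[col] = (churners[col].mean() - mean_all) / std_all

print(pd.Series(profile).sort_values(key=abs, ascending=False))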
If you want a simple step-by-step guide on how to look for churners, try our OpenText Big Data Analytics Free Trial and find the Customer Analytics samples that focus on analyzing and predicting churn. After that, try the techniques with your own customer data. If you find too many churners, please don't panic. Just do analytics. And let us know what you find.


Why You Need to Classify Your Customers


When you are trying to understand the behavior of your customers, you initially look at their history of interactions with your company, their attributes, their metrics, their scores, their marketing response ratios. There are a lot of sources and a lot of data you can use to increase your knowledge, and that knowledge helps you make things better, increase your sales, and invest your marketing budget for a higher ROI. So from your list of customers you know who is the most profitable, the churner, the loyal, the perseverant, the annoying, the occasional buyer, and so on. At the end of this descriptive process you get some sets of customers. But trying to find what defines a customer who belongs to one of those segments, using all their attributes and interactions, can be a herculean task. Things get worse if your intention is to apply this knowledge to classify new customers.

When you want to classify, the best option is to rely on data mining techniques. There are several algorithms that can be used to classify objects by their attributes: decision trees, logistic regression, the Naïve Bayes classifier (pictured below), neural networks and more. And one classic best-practice rule is that you need to compare distinct classifiers with your training data. This comparison will tell you which one of them does the job better than the others. With the resulting model, you can classify the set of objects: you can know who is likely to churn, or who would love that product whose sales need a boost.

Imagine doing that with millions of rows of data. It's insane. Each iteration and algorithm takes a lot of time calculating each model with all your data. That's where OpenText Big Data Analytics offers the best results. And we have recently extended the library of data mining classifier algorithms: OpenText Big Data Analytics 5.1 comes with a Naïve Bayes classifier algorithm in its engine. This classifier is one of the simplest of its kind, but its simplicity doesn't mean it isn't powerful and trustworthy in certain cases. It depends on the data.

So, why don't you download OpenText Big Data Analytics 5.1 in our 30-day free trial? You will be able to compare three classification algorithms, using millions of rows, in a drag-and-drop experience. You only need to decide what to classify. What's your business challenge?
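For readers who want to see what that kind of comparison looks like in code, here is a minimal sketch using scikit-learn as a stand-in (the feature columns and the churned label are hypothetical); it cross-validates the three classifier families mentioned above on the same training data and reports how each performs.

```python
# Minimal sketch: compare three classifier families on the same training data
# (scikit-learn as a stand-in; hypothetical feature columns and target).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

data = pd.read_csv("customers.csv")
X = data[["monthly_spend", "support_tickets", "tenure_months"]]
y = data["churned"]                      # 0/1 label: the segment we want to predict

models = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:20s} mean AUC = {scores.mean():.3f}")
```

Whichever model scores best on the held-out folds is the one worth training on the full data set and applying to new customers.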
