Gordon Linoff is one of the founders of Data Miners, Inc. He is a co-author of 7 books related to data mining and databases. He is also one of the top contributors on Stack Overflow for MySQL. If you want to learn about databases, Gordon is a role model worth following. He has contributed a great deal and loves to interact with people on the forums.
In his interview with Cloudways, Gordon talks about databases and data mining, offers advice on distributed database design, and explains how he became a top contributor on Stack Overflow. Enjoy the interview 🙂
Cloudways: Gordon, what made you choose Database and Data Mining as your permanent field? Who was the inspiration?
Gordon: Ever since I was a kid, I was always interested in mathematics. If I had one inspiration, it was my older brother ... and Martin Gardner’s Mathematical Games column in Scientific American. Of course, I didn’t realize the difference between mathematics, computer science, and statistics until I went through college. When I started working with large complex data and data analysis, it wasn’t called data mining; the advanced algorithms were more likely to be thought of as artificial intelligence.
Cloudways: How was your experience as a Database Marketing Consultant at New York Times? What were your key tasks? In general, what changes do you see in New York Times since you left?
Gordon: Actually, I have been consulting with the New York Times on-and-off for about fifteen years. Initially, a major part of my work was in customer-centric forecasting for the home delivery population. That project was launched because their subscriber numbers were deviating significantly from what was expected, which encouraged them to try another technique. That technique is customer-centric forecasting, which uses survival analysis as the underlying technology.
I do still work with the Times. Some of the work still involves forecasting and lifetime value calculations. Of course, the business has become much more digital-oriented, so the focus has shifted more to the online world.
Cloudways: How would you compare SQL with NOSQL databases like MongoDB, Firebase, etc.? Can they compete with SQL in terms of designing a database system or managing relational databases?
Gordon: NOSQL databases and SQL databases are often solving different problems. I am certainly a fan of traditional relational databases, and I do think they are more appropriate for many of the complex analyses needed for understanding business data. In particular, the widespread adoption of analytic functions is a boon for analyses. The increasing support for textual and GIS data is also very important.
On the other hand, NOSQL databases certainly have their place. The term refers to a broad range of database technologies, including key-value stores, document databases, and graph databases. One of the big advantages is on the data collection side. By relaxing some of the ACID properties of traditional databases (atomicity, consistency, isolation, and durability), they are able to achieve the much higher throughputs needed for operational online processes.
Cloudways: What are the key points to keep in mind while architecting a database and creating its schema?
Gordon: That is too broad a question for a simple answer. Well, perhaps I do have some key points, but they aren’t very high-level. Use consistent naming conventions. All tables should have a primary key (preferably auto-incremented integers). Most tables should have columns indicating the date/time the row was inserted, who inserted the row, and on what machine. And, I have a strong preference for inserting new rows rather than updating existing data — that is, nothing gets overwritten, so history is there. Beyond that, the actual data structure depends on the problem(s) the application is there to solve.
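The points above can be sketched in a few lines. This is only an illustration, assuming SQLite through Python's standard sqlite3 module; the table name, column names, and helper functions are hypothetical and do not come from the interview. It shows an auto-incremented integer primary key, audit columns for when, by whom, and on what machine a row was inserted, and an insert-only pattern in which a change is recorded as a new row rather than an update.

```python
import os
import socket
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
SCHEMA = """
CREATE TABLE customer_address (
    customer_address_id INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id         INTEGER NOT NULL,
    city                TEXT    NOT NULL,
    inserted_at         TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    inserted_by         TEXT    NOT NULL,
    inserted_on_host    TEXT    NOT NULL
);
"""


def record_address(conn, customer_id, city):
    """Append a new row; earlier rows are never updated, so history is kept."""
    conn.execute(
        "INSERT INTO customer_address "
        "(customer_id, city, inserted_by, inserted_on_host) "
        "VALUES (?, ?, ?, ?)",
        (customer_id, city, os.environ.get("USER", "unknown"), socket.gethostname()),
    )


def current_city(conn, customer_id):
    """Treat the most recently inserted row as the current value."""
    row = conn.execute(
        "SELECT city FROM customer_address "
        "WHERE customer_id = ? "
        "ORDER BY customer_address_id DESC LIMIT 1",
        (customer_id,),
    ).fetchone()
    return row[0] if row else None


def history_length(conn, customer_id):
    """Count every version ever recorded for a customer."""
    return conn.execute(
        "SELECT COUNT(*) FROM customer_address WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
record_address(conn, 1, "Boston")
record_address(conn, 1, "New York")  # a move is a new row, not an UPDATE
```

Here `current_city(conn, 1)` reads the latest row, while the earlier "Boston" row stays in the table as history. In a production database the audit columns would more likely be filled by defaults or triggers than by application code.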
Cloudways: Have you ever created a Distributed Database Design? What are the difficulties that occur in them and how can we resolve them easily?
Gordon: Distributed databases are challenging. The key point when designing them is understanding the underlying requirements. For instance, a large bank needed to have a fail-safe system for some of their operational systems. This required replicating transactions as they occurred, to ensure that they appeared across all systems when committed (and maintaining a history of such transactions when a machine was offline). The solution involved having all systems serving customer requests. If one went down, then there would be a degradation in performance but the system would still be available to the users.
Other distributed databases may not have such arduous requirements. For instance, a copy of operational data may be needed, but it can be loaded into an analytic machine once per day or per hour. For this purpose, technologies such as log-shipping are quite sufficient.
Nowadays, there are many options for distributed processing. The important thing is to understand your requirements, so you can adequately evaluate the options.
Cloudways: Large databases tend to become a resource hog on the servers. Do you have any suggestions for large websites and enterprise level products to improve the performance of the database?
Gordon: This is definitely a big concern. My advice in such cases is two-fold. First, have top-notch DBAs available to be sure that you are doing the right thing. Second, evaluate at the application level to be sure the database is being used efficiently.
Cloudways: Hosting large databases can be tricky. Do you think hosting them on cloud servers, like AWS EC2 instances, will give a boost to database performance?
Gordon: Cloud servers are clearly here to stay. They may not meet every company’s needs, but they provide easy scalability, which is one reason why Amazon originally got into this business. The servers required for their peak around Christmas-time were not needed most of the rest of the year.
For decision support databases, scalability is not the only consideration. One of the big issues with analytics is getting the results from the lab to the front-line. Does the cloud server make it easy to schedule jobs? Does it provide easy “sandboxes” for analysts to work in? Does it provide consistent performance over time? Does it readily integrate with other decision support technologies, whether Excel, Google Sheets, Pentaho, Tableau, or something else? Does it integrate with other systems that might need the results? The database is only going to be successful if it communicates with the rest of the organization.
Cloudways: What is the idea behind Data Miners? What are the key services that you provide to your customers?
Gordon: Data Miners has been in business since 1997. We are a boutique consulting company specializing in big data and data mining solutions.
Cloudways: You are one of the top contributors at Stack Overflow for databases. How do you find time to answer user questions while also running your own company and designing databases?
Gordon: You would be surprised how much time I spend waiting for computers to do something. That is actually how I started being involved on Stack Overflow. Some of the questions there are very interesting and require thought. Others are more like: “Oh, why is someone suffering over that? That answer is pretty easy.”
Cloudways: What is your advice to beginners and students looking to build a career in Database Development? How do you motivate them?
Gordon: I don’t want to over-motivate people. Some people are drawn to data, to analysis, to operationalizing it. These people are probably better candidates for data work than, say, those who know how to act, sing, cook, or argue in court.
I do think that my background in mathematics was incredibly useful as a basic foundation for critical thinking. One of the courses I took in college explained computer architecture by starting with NAND gates and working its way up to operating systems. Understanding how data flows through computers is tremendously useful — and I think even more important when dealing with the massively scalable architectures available today. Thinking in terms of data is actually very hard; I have to constantly revise what I think is happening based on what data is actually showing.
Cloudways: Which SQL fork would you suggest for data mining? Which data mining algorithm would you prefer the most for marketing research?
Gordon: Any reasonable relational database that supports window/analytic functions is fine. I strive to be vendor neutral. In the world of free software, Postgres, SQL Server Express, and Oracle Express are all quite powerful. For my books, I have used primarily SQL Server and Postgres.
My favorite algorithm is survival analysis. Instead of asking whether or not something will happen, survival analysis asks when it will happen. This is immensely applicable to subscription relationships with customers, and for understanding them in terms of life-time value and forecasting. It is also very useful for repeated events: how long until someone will return to my web site, for instance.
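To make the "when will it happen" idea concrete, here is a minimal sketch of the Kaplan-Meier estimator, one standard survival analysis technique. The interview does not say which method Gordon uses, so the choice of estimator, the function name, and the sample data are all assumptions made for illustration. Customers who are still subscribed when observation ends are treated as censored: they count toward the population at risk up to that point but never as a stop.

```python
from collections import Counter


def kaplan_meier(durations, observed):
    """Estimate a survival curve from customer tenures.

    durations: how long each customer was followed (for example, days subscribed)
    observed:  True if the customer stopped at that time,
               False if still active when observation ended (censored)

    Returns a list of (time, survival_probability) pairs, one for each
    time at which at least one stop was observed.
    """
    stops = Counter(t for t, e in zip(durations, observed) if e)
    exits = Counter(durations)  # every customer leaves the risk set at their duration
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = stops.get(t, 0)
        if d:
            # Probability of surviving past t, given survival up to t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Made-up tenures in days; the False entries are customers who are still active.
durations = [30, 60, 60, 90, 120]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

With this sample, the estimated probability of remaining a subscriber is about 0.8 past day 30, 0.6 past day 60, and 0.3 past day 90. A curve like this is the kind of input that tenure-based forecasting and lifetime value calculations build on.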
Cloudways: Amazon indicated that you have written 7 books on Data Mining mostly co-authored with Michael J. A. Berry. What was the motivation behind sharing so much about data mining with the world? Are you planning to publish any new book in the future?
Gordon: Actually, my most recent book is Data Analysis Using SQL and Excel, Second Edition, of which I am the only author.
Once upon a time, Michael and I worked together at a consulting company and we put together a presentation on data mining for what was then NationsBank (and is now Bank of America). A friend of ours, Susan Osterfelt, recommended us to an editor at Wiley to write a book on data mining. That was the first edition of Data Mining Techniques. This book is one of the most successful books in this area, a reference used for almost twenty years now. The most recent edition was published in 2011.
Cloudways: Cloudways provides one-click installation of popular PHP-based apps like WordPress, Magento, Drupal and others with MySQL and MariaDB. Would you be interested in seeing Cloudways give clients the ability to launch standalone DigitalOcean, Vultr, AWS EC2 and Google Cloud (GCE) servers with MySQL and MariaDB?
Gordon: Digital hosting environments can be a great way for companies to take advantage of advanced technology that might be too expensive or cumbersome for them to maintain. I’m always interested in learning about new applications of technology.
Ahmed was a PHP community expert at Cloudways, a managed PHP hosting cloud platform. He is a software engineer with extensive knowledge in PHP and SEO. He loves watching Game of Thrones in his free time. Follow Ahmed on Twitter to stay updated with his work.