Welcome to "AigoraCast", conversations with industry experts on how new technologies are transforming sensory and consumer science!
AigoraCast is available on Apple Podcasts, Stitcher, Google Podcasts, Spotify, PodCast Republic, Pandora, and Amazon Music. Remember to subscribe, and please leave a positive review if you like what you hear!
Gemma Hodgson has an MSc in Medical Statistics and worked in the pharmaceutical industry as a statistician for over 15 years. Gemma now runs Qi Statistics Ltd, and gives statistical training and consultancy advice to a wide variety of customers in sensory and consumer science, pharmaceuticals, and manufacturing. Gemma is known for her approachable attitude and skills at converting complex concepts into ideas that can be applied in practice.
Transcript (Semi-automated, forgive typos!)
John: Gemma, welcome to the show.
Gemma: Thanks, John. Pleased to be here.
John: Great. So, Gemma, as you know, this podcast is conversations with industry experts on how new technologies are impacting sensory and consumer science. One new technology that a lot of people are talking about is data science. But I know that you have a number of thoughts on the question of whether data science is even a new technology. So would you like to start by talking a little bit about what you think about data science and statistics?
Gemma: Yeah. Thanks, John. I suppose this is a topic that I've discussed many times with my colleagues at Qi, but also more recently at conferences and very recently at the RSS, the Royal Statistical Society, because we hosted a debate between data scientists and statisticians as to whether or not data science is actually just the emperor's new clothes, as we like to term it. And it was interesting to see the two takes, because I think there's massive confusion, even within the community of data scientists and statisticians, as to what constitutes data science and what constitutes statistics, but also about whether we are talking about the tools or the people. You know, is a statistician someone who does statistics, or is a statistician someone who solves problems, thinks about the business context and acts more like a consultant? And what occurred to me is, if we're struggling, how are businesses, employers and the general public supposed to understand what's going on when the people doing it can't seem to agree?
John: Right. Yeah. And I see these kinds of feuds all the time on social media. It seems like whenever two things are similar, you're always going to have camps and tribes that form, and you'll have people on the data science side saying, oh, statisticians, all they do is compute p-values, they don't solve business problems. What do you say to people who would point fingers from this so-called data science side at statisticians, with the claim that statisticians are not solving real problems?
Gemma: Well, I suppose my initial thought is probably one we shouldn't put on air. But generally, I think it actually comes back to us statisticians. I think we can take the blame for that, because for so long we have promoted p-values and significance testing and hypothesis tests. And there are many examples of statisticians who actually do just churn numbers. We need to be out there educating and showing businesses and employers what we can do: that, you know, we're not just number crunchers, we're problem solvers. And all the things that are quoted as what a data scientist does, i.e., understand the business context and, you know, use coding, use modern statistical methods, look at big data, machine learning, they're all tools that are completely within the statistician's capability and are actually being used by statisticians. But if we're not telling anybody about the value we're adding, then we can only really blame ourselves.
John: That's interesting. Why do you think historically statisticians allowed themselves to get kind of painted into this corner? What do you think was behind that characterization?
Gemma: Well, I think some of it is actually what statisticians wanted to do. I think statisticians were happy to sit at their computers with their stats tables and solve numerical problems. But certainly within most companies these days, the role of a statistician has definitely changed. And there are many companies I work with where, as a statistician, you're there not just to run statistical models, because often there's a piece of software that does that, or there's someone within the company who can do that. What you are there to do is to listen to the problem that the scientist has, be that a consumer scientist or, you know, a pharmaceutical scientist or somebody in business, and then try to suggest ways to help them break the problem down into things that can actually be looked at. What types of data should we be looking at, rather than just letting them go off, gather all the data they've got and dig around till they find something? It's about applying the right data to the right problem and answering the right question. So often what they're actually trying to find out isn't what the data they've collected is showing them, and I think as statisticians that's what we're there to help them work out.
John: And what role do you see experimental design playing in this larger paradigm you've described? I mean, first off, I definitely support data science as an activity, so I don't want our listeners to think that I am somehow against data science. But I think that oftentimes in data science there is this approach where the data is almost like found data; there isn't an experimental design aspect to it. And it seems like sometimes the thinking is, okay, if we just get enough data, then we can mine it and find some patterns or whatever. Whereas I think one of the things that's often missing from data science is the question of designing an experiment specifically to answer a question, rather than just saying, okay, what do you have? This reminds me of some of the things Chris was talking about on our last AigoraCast episode. He was saying that a lot of times the solution is not technological, it is better science. And in this case, experimental design seems like it's not getting...
Gemma: And that was one of the things that came out of the discussion the other day. One of the guys at the RSS made a good point. He said that just because we've got a big data set doesn't mean it's better. And, you know, sometimes a small, carefully designed sample that is representative of the population you're looking at will give you much better results than just collecting the opinion of everybody across the whole world who's filled in an Internet questionnaire, and so generating masses and masses of data. And I think a lot of the data science methods have been generated or improved or fine-tuned to be able to address the problems of big data. But that doesn't necessarily mean that they're better methods of solving business problems.
John: Right. So you would say that, I mean, certainly if you have a bias right in your data set, making it larger isn't going to fix that problem.
Gemma: Exactly. In fact, you're just going to make it worse. Well, this is the other thing that we were talking about the other day. If you think about machine learning, if machine learning looks at what's being put into the computer and then spots patterns, it will also be able to spot biased patterns and repeat them, and so just add to the bias. So, you know, this business that machine learning is unbiased or big data is unbiased is not true at all, because it learns from what it's given.
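The point made in this exchange, that a larger sample does not repair a biased one, can be illustrated with a small simulation. This is a minimal sketch and not something discussed on the show: the population of liking scores, the opt-in response rule and every name in the code are hypothetical, chosen only to mimic a self-selected internet questionnaire.

```python
import random
import statistics


def compare_samples(seed=0):
    """Compare a small random sample with a large self-selected one."""
    rng = random.Random(seed)

    # Hypothetical population: 100,000 liking scores centred on 5.0.
    population = [rng.gauss(5.0, 1.5) for _ in range(100_000)]
    true_mean = statistics.fmean(population)

    # Small designed sample: 200 people drawn at random.
    small_random = rng.sample(population, 200)

    # Large "found" sample: the chance of responding rises with the score.
    # This response rule is an assumption made purely for illustration.
    large_biased = [
        score for score in population
        if rng.random() < min(1.0, max(0.0, (score - 2.0) / 6.0))
    ]

    return {
        "true_mean": true_mean,
        "small_random_mean": statistics.fmean(small_random),
        "large_biased_mean": statistics.fmean(large_biased),
        "large_biased_n": len(large_biased),
    }


if __name__ == "__main__":
    for name, value in compare_samples().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the self-selected sample is far larger than the designed one, yet its mean stays shifted away from the population mean, because collecting more responses of the same kind only estimates the biased quantity more precisely.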
John: Right. Right. That's, of course, a big ethical question when you've got hiring decisions or loan applications being approved based on historically trained models. Yeah. OK, so then, Gemma, let me ask you, just generally speaking, about your clients. Are your clients coming to you saying, OK, we need data science? Is that a term that they're now using? What's your experience with that?
Gemma: Well, we have had that, I mean, not specifically with my sensory and consumer clients, because obviously they're much better educated. But we do get this sense that people think they're missing out if they're not doing data science. So I have had one client who said, I think we ought to be doing some data science; can you do that as well as the statistics? As if it's a thing that's different from statistics. And actually, at a conference, I had somebody say to me, you know, you'd get loads more business if you deleted all those references to statistics on your website and replaced them with analytics or data science, because you're putting people off. And I thought, oh my goodness. They said, you're really out of date, we don't call it statistics anymore. And I was talking to someone from a university in the U.K. and they've rebranded their university course. They haven't changed the modules. They haven't changed the content. They've added one module about big data, renamed their statistics course and called it an MSc in data science. And the applications have gone through the roof. So basically, people want to learn data science, but they don't want to learn statistics, because they think they're completely different beasts. And I agree with you that there are aspects of data science that statistics doesn't traditionally cover, like the computational side and the big data, but, you know, to me, they're just new statistical methods that we've got to add in, really. They are new parts of a toolkit. It's like saying Bayesian statistics is one thing and frequentist is a completely different beast. They're still statistics, just different ways of approaching it. But I think the general understanding of what data science actually is is really, really low, because no one is good at describing it. I mean, what would your definition of data science be? Let's see if it's the same as mine.
John: OK, well, here's my definition. I believe data science is the intersection of statistics, computer science and industrial design.
Gemma: Alright. That's pretty similar to mine then.
John: Yeah, I do think that there is a side of statistics, mathematical statistics, that, I mean, for me as a pure mathematician, my personality is such that if left to my own devices, I would just sit in a room and solve math problems, and they would become increasingly irrelevant over time, right? And that's what started to happen in math grad school. I noticed, you know, when you do like twelve hours of math a day, you start to become a very strange person, because you're just sitting in a room by yourself with your thoughts and you're working on things. And things become interesting because of their complexity and not because of their value in a larger sense. Right. And I do think there's a branch of statistics, you know, mathematical statistics, that went down that path, where they started to merge. You know, you can look at the relationship between statistics and measure theory, I mean, it gets increasingly abstract, right? Where people are just investigating problems for their own sake. And that needs to happen, right? Because your deep insights are going to come from the people doing these high-level things. So that has to happen. But I do think that statistics allowed itself to get dominated by the mathematical side. And I think that may be part of what...
Gemma: And I think maybe what we're talking about is the difference between, as you say, statistical theory and applied statistics. And I'm very much an applied statistician. So honestly, statistics is a tool in my toolkit. Like, I've also got a graphics toolkit. I've got a statistical toolkit of methods I can use. But ultimately, my goal is to solve real-world problems using my tools. And I need to improve my coding if I want to do more AI or machine learning type methods, or I may need to look up some new methods. But ultimately, my job as an applied statistician hasn't changed. There are just new tools out there. And I think that's where people get muddled, because data scientists see their role as a scientific role in which they use statistics, whereas they see statisticians as just using statistics but not doing anything else. And, like you say, I think it's the separation between applied and maybe pure.
John: All right. But I think there is an irony, and tell me what you think about this, in the use of the term data science, because for me, science is trying to understand things at a deeper level than simply predicting what's going to happen. You know, you think about the Babylonians: they could predict with great accuracy where the planets were going to be, but they had no reasonable model for what was actually going on, right? Whereas, certainly with Kepler, there was an actual model as to why things were the way they were. So if you build a statistical model where you have parameters that correspond to known quantities, I mean, they can be latent quantities, but in a, you know, so-called general regression model they might correspond to the impact of some variable on the outcome, right? Then I think a statistical model that's well developed is going to be very informative, because you can read that model and have some sense of the underlying structure, of what is going on in the world that's driving the behavior that you're seeing. Right. Whereas a lot of the data science models are just black boxes. Okay, you can do some things with random forests to try to figure out variable importance. But at the extreme end, if you have a deep neural network, God knows what's going on inside that thing. And so a lot of times, you know, I see data scientists on social media taking the attitude: well, who cares why it works, as long as the predictions are accurate, or at least reasonably accurate on whatever measure you're going to use.
Gemma: I think that's the danger. And I think certainly as applied statisticians or data scientists, whatever you want to call them, our job at the moment, because of where the science is, you know, it's relatively new, is that we have a responsibility to employers and to businesses. You know, we're a community of highly numerate people, but that doesn't necessarily mean that all the business owners who might want to use machine learning techniques or AI or statistics generally are highly numerate. And I think we're bad as a community at communicating what's useful and what's not. Because not every business needs machine learning, and I think the thought now is that if you're not doing it, you're missing out. And that's simply not true. And I think we're doing people a disservice when they're investing in data scientists, thinking, I must have a data scientist because that's what I need, when they don't need that. And I think it's our job to educate people. But to be able to educate people, we've really got to pin down what it is we all do and what value we can add to businesses. And we can add value, but I think we're bad at doing that. I think we just say, yeah, we gave you a model. You know, it may be completely useless, but at least we did one, that kind of thing.
John: Right. Right. Yeah. That's right. Yeah. You've got, you're explaining a lot of variability. Yeah, that's right.
Gemma: I mean, I had a case recently when I was talking to someone who was trying to hire a data scientist, and they said that a lot of the applications they were getting were from people who'd just done a six-month or one-year online course from a data company and could now program in Python. They'd changed their CV and they were now data scientists. Now, that's not true for everyone, but as the employer said to me, I know even less about data science than they do, so how do I know if they're any good or not? But I need a data scientist and I'm finding it really hard. And I think that's the problem: we need something like a set of criteria. It's not a regulated thing, so how can we pick the good ones from the bad ones when we don't know what training they've had? I think it's a problem that will go away with time, because universities are now running all these data science courses. In five years' time, ten years' time, it won't be an issue, but at the moment, I think it really is an issue.
John: Yeah, it's kind of a lawless space, so to speak, like the Wild West, the American Wild West. I don't know how wild these things are. So, can I ask you about the computer science piece of this, then? To what extent do you find your own work is informed by computer science? Because, I mean, I oftentimes see data science as computer scientists rediscovering statistics. So there's that side of data science. There are some valuable things, though, in terms of being able to write scripts. I mean, obviously, you've been writing scripts in, you know, SAS or SPSS, I'm sure, for a long time. So what are the kinds of computer science tools that you find yourself maybe using more now, or that you would recommend to your clients?
Gemma: Well, even though it's basic, I still think that where technology has had the greatest impact in the last few years is not necessarily in the really complex stuff, but in the use of visualizations. I mean, even something as basic as a PCA map, although there we're not talking about machine learning, or something like clustering: the way we can now visualize things, and the way people who don't have much statistical coding ability can get quite accurate, reliable visualizations so that they can understand and communicate. I think that's where the real gold is at the moment because, as you say, half the time it's data that we just happen to have, not necessarily data that we collected deliberately. And certainly the clients I work with, because more and more they're expected to show data, because that's the new fashion, they're desperate to have something simple. You know, they say to me, can you take everything off it that they might ask me a question about? Don't have anything on there that says a statistical word, or I'm going to have to explain it. So, you know, a lot of the clients I have are the people between me and the business managers, and they want easy, simple tools, which technology has now given us, to do very complex things. In the background there might be a really complex statistical model or some kind of machine learning thing that's been running for six days. But actually the end result is a picture.
John: Right. Yeah, I mean, I definitely see that with visualization. This has been a push. You know, there was a keynote speech at SSP a few years ago on storytelling with data. I don't know if you were at that SSP, but the idea was to figure out what story you're trying to tell with your research, and then to have visualizations that clearly communicate that story. I think that's extremely important. It helps to show that statisticians are not disconnected from the business problems. We can say, look, here we can clearly show you the outcome of our research in an easy-to-understand picture. I mean, one of the things I've been thinking about is that with computer-aided visuals, we have the ability to quickly generate extremely attractive graphics, right? But the downside of that is, especially when I first got into data science, I kind of fell in love with the fact that I could easily make hundreds of beautiful pictures, right? You know, you still have to learn to focus on, like, what are the key points that I'm trying to make? Right? So what are some of the principles that you would advise your clients on when it comes to preparing, you know, visualizations?
Gemma: Well, I suppose the first one for me, and maybe it's just because I'm nervous, is to check that the visualization is actually showing what the numbers are saying, because it's very easy to create a visualization, but that doesn't mean it's correct. And that is the biggest danger, I think, with this kind of, oh, I've got three lines of R code that I just cut and paste. You know, sometimes you can't check if that's right because it's come out of a black-box piece of software. So definitely check that. But also, in fact, I was talking to someone only this morning who had a map, and the axes had been stripped off the map. I think it was a PCA map or something like that. It looked a bit strange, but it showed something, so it had been put up in a meeting. And actually, when they went back to the original map, the percentage of variation explained was, I think, 6% on one axis and 5% on the other. That had been stripped off. So, yes, you know, be careful to look for what's not there on the visualizations, because people do that all the time, or present graphs without a scale. And then you find out that this massive difference you're seeing is actually, you know, less than one microgram or whatever the unit is. So what I say to my clients is that it's what's not there in the visualizations that is the most dangerous thing.
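One way to guard against the stripped-axis problem described here is to build the percentage of variation explained into the axis labels themselves, so it travels with the map. The following is a minimal sketch under stated assumptions: the function name and the eigenvalues are hypothetical, picked so that the first two components explain about 6% and 5% as in the anecdote, and a real analysis would take the eigenvalues from its own PCA output.

```python
import math


def explained_variance_labels(eigenvalues):
    """Return one axis label per component with its share of total variance."""
    total = math.fsum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [
        f"PC{index} ({100.0 * value / total:.1f}% of variance)"
        for index, value in enumerate(eigenvalues, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical eigenvalues for 20 standardized attributes (total variance 20).
    eigenvalues = [1.2, 1.0] + [17.8 / 18] * 18
    x_label, y_label = explained_variance_labels(eigenvalues)[:2]
    print(x_label)
    print(y_label)
```

With labels like these on the two axes, a reader can see at a glance that the map summarizes only a small fraction of the variation, even if the chart is copied into a slide without its surrounding text.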
John: That's interesting, because there definitely is this trend to take everything extraneous away. Right. You know, you'll read that, right? That you want to only have the important information. But yeah, that's a deep point, actually, that what you've taken away may actually have been very important.
Gemma: I mean, even at a really, really basic level. I had a client recently where, I mean, I did it myself, so I know why it went wrong. I was plotting some figures and they wanted it all on one graph. And one of the numbers was way, way higher than the others. It was in the thousands and the others were like ones, twos and threes. And so, of course, when you put that on one graph, the difference between ten and five is lost because all you can see is the 3000 number. I knew that was the case because I'd seen the numbers, and I had a table underneath, so, you know, that was fine. But when the client looked at it, they didn't look at the table and they just said, oh, so those other two groups are the same? And I said, well, no, they're quite different, but we're looking at them on a different scale. So people do like to forget scales. And actually, the difference between the other two groups was of more importance than the other category, but it was being dominated by that. It's a bit like when you look at Pareto charts and, you know, all the things that are not important are collected in one category because they haven't happened very often, but they turn out to be the most serious things, like heart failure or something.
John: Right. Exactly. All right. Well, we've got only about five minutes left. So I was hoping at the end of this you could talk about what you see that needs to happen in, say, the next five years. Let me just ask this question: what are the pressing problems that you see your clients have? And then what do we, especially on the more quantitative side, you know, the statisticians or data scientists among us, need to do in order to help solve those problems?
Gemma: So I think my biggest thing for the clients would be that people need reassurance about this fear of missing out, because if I'm completely honest, the number of businesses who really, really need to change everything they're doing is probably very small. I think a lot of our sensory and consumer scientists are already looking at data that is, in broad terms, big data. I mean, people think that big data has to mean zillions and zillions of observations. But, you know, some of the data sets that scientists already have are pretty big. I think people need to focus on what the business questions are first and then on the most appropriate way of solving them, rather than saying, we need to be doing machine learning, we need to be using artificial intelligence, so let's crowbar our data into something and see if there's a business question we could answer. And I think people are panicking that they're missing out and they're missing the point that, you know, these are just more tools that have come on the market. This is not necessarily beyond the remit of the people you already have in your company, and the skills and the data, you know, I think a lot of people already have enough. And so for the customers and for the clients, I would say, you know, I think the pace of change in this world is being blown out of proportion. And for us as a scientific community of statisticians or data scientists, whatever you want to call yourself, it's about agreeing between us on some standard terminology and helping people outside of our discipline understand what we're doing. Explaining some of the math. I mean, you hear a lot about machine learning as if it is one thing. Machine learning covers a thousand different techniques. Not a thousand, but a number. So, you know, when a client comes to you and says, can I have machine learning? They don't even know what it is exactly.
You know, if you say to people, actually, logistic regression was around, you know, a hundred years ago, they say, no, I don't want that, I want machine learning. Well, that's one of the methods, you know. That's the thing. I don't think there's enough awareness, because we're not telling anyone, partly because there are a lot of statisticians out there who are terrified by the concept of machine learning and artificial intelligence, because they think they're going to have to learn a whole load of new stuff, which they perhaps are. So we're joining in with this kind of black-box thing that, yeah, there's loads out there. So I think it's about being open, being more transparent, upskilling the statisticians and data scientists, and equally educating the data scientists that statisticians have been doing a science-around-data role, you know, ever since they became consultants, and that applied statistics is not different.
John: All right. Yeah, I mean, I see that, too. I was at the Symposium on Data Science and Statistics last May in Seattle. I gave a talk there and I definitely see that kind of convergence of the fields. Certainly the statistics community would like there to be convergence, because they don't want to get left behind. Right. Yeah. The American Statistical Association, I think, pushed away the data scientists for a while, and then they realized that all the jobs were going away, too. And they said, wait a minute, come back. And, you know, they're trying to...
Gemma: The proposal from this Royal Statistical Society debate the other night was that we should actually be getting together with the similar bodies in the computer science world and the maths world and coming up with a set of either communications or tools that businesses could use to assess what it is they need and what's actually out there that they could do. But we need to work together. They were looking to the RSS to lead this, because obviously so much of it is statistics. But it was certainly seen as something where we need to unite to be able to preach the same message, because at the moment all we're doing is almost fighting each other over who's more important.
John: Right. It doesn't help anybody. And so, just to kind of wrap up here, I mean, I would say our client bases are pretty similar, you know, typically CPG companies or pharmaceuticals or people doing consumer research. Do you think that there is a need for consumer scientists to learn more statistics and data science? Or do you think it is more incumbent on people like you and me to support them with, you know, easy-to-use tools like dashboards or things like that?
Gemma: I suppose my feeling on it is that there's a reason that a consumer scientist is a consumer scientist. If they'd wanted to be a statistician or a data scientist, they would have been one. So I'm not sure that we should be teaching everybody to have skills in every field. I think what we should be doing is educating them in what we can offer and where we can add value, and then making it easy for them to access and understand the services that we can give them. A bit like we have done, you know, for the last 20 years with all the sensory statistical methods that are out there now. Not all companies do their own, but some do. But the general understanding of statistics has improved. The fact that there is variability out there, that's a good one for people to understand, and I think people have got that now. Probably 20 years ago, they might not have done. So I think with the basics, it's always going to be our job to keep on pushing the basics out there. But in terms of the specialized techniques, they haven't got time. You know, if I want someone to come and fix my boiler, I'm not going to learn how to become a plumber. I'm going to call a plumber. So, you know, why should statistics and data science be any different?
John: Right. And so then it's just our job as plumbers that maybe we can fix the boilers.
Gemma: Whether the boiler needs fixing or not.
John: Right. Yeah, right. Exactly. That's right. OK, Gemma, it's been an extremely informative discussion, and I really like what you had to say about the fear of missing out, because you're right that a lot of times there is this infatuation with using tools just for the sake of using them. And it's not, maybe, actually what's needed. So, yeah, I think that's really good. Alright. Well, any last points? Do you have any new courses coming up, or where can people find you?
Gemma: We had a recent upgrade to our website, so hopefully all our courses are much easier to find than in the past. But yeah, we've actually got a course next week in New York, and then we've got a course coming up in November, and then we've got a full list of courses for next year. So, yep. And I think there might even be one in there on some techniques in machine learning. It's just an old course that we've rebranded. No, not really.
John: That's funny. I'm sure that's all new content. And which website can people go to in order to...
Gemma: Go to www.qistatistics.co.uk
John: Okay, perfect! Alright, Gemma. Well thank you so much for being on the show and I'll look forward to seeing you at conferences in the near future. Hopefully we can continue to work together to educate people.
Gemma: Yeah, definitely. Definitely. Thanks, John.
John: Okay, great. Thanks a lot, Gemma.
Gemma: Okay, bye!
John: Okay, that's it for this week. Remember to subscribe to our podcast to hear more conversations like this one in the future. And if you have any questions about any of the material we discussed or recommendations about who should be on the show in the future, please feel free to contact me on aigora.com or to connect with me through LinkedIn. Thanks for listening. And until next time.