Why moving beyond batch is the future of data processing
In this live panel discussion, Joe Reis of Ternary Data and Kostas Pardalis, Head of Product at RudderStack, discuss the future of batch data. They bring clarity to the artificial distinction made between real-time and batch data, and they articulate how new technologies are allowing data professionals to functionally erase the distinction.
Joe brings a unique perspective, informed by over 20 years of experience working with data in varied roles, and Kostas lends his expertise around the data pipelines that are making it all possible.
Here's what Kostas and Joe cover:
How did we get where we are today with batch systems?
Why is batch still so popular?
What can you do with streaming?
Why is streaming the right approach?
What are the pitfalls of streaming?
Why is moving beyond batch the future?
Joe calls himself a "Recovering data scientist". He's a data architect and data engineer helping companies build solid data foundations.
Well, thank you so much for joining us today, everyone. I am Brooks, I'm with RudderStack. I'm a marketing manager, and I am excited to have you all here for our webinar today on why beyond batch is the future. We are very excited to have a great panel of folks to chat about data processing with you today. What we are going to cover is how we got where we are today with batch systems. Then we're going to talk about why batch is still popular, move into talking a bit about streaming and why it's the right approach, cover some of the pitfalls of event streaming, and talk about what the future really looks like for moving beyond batch.
So to really give us the lowdown on all of this, like I said, we have a great group of panelists, Joe Reis and Kostas Pardalis. I will do quick intros, then let you all expand a bit, and quickly hit the catalyst for our webinar today. About a month ago, Joe was on The Data Stack Show podcast, which Kostas hosts, and we got into a conversation about batch data versus real-time. Joe made the point that essentially all data is real-time, and that batching data is something that we as humans do because of technical limitations, but that's starting to change. So that's the catalyst for our webinar, and we're really excited to dig deep on that subject today.
Joe has a varied background across lots of different data roles, in operations, data science, and data engineering, and is now in a consulting role as the CEO and co-founder of Ternary Data. Kostas is Head of Product at RudderStack, where we are simplifying data pipelines and providing a unified solution for moving data. So with that said, Joe, I'll let you expand a little bit, and Kostas, you can do the same, and then we can jump into our conversation.
Joe Reis (02:38)
Cool. Hey, thanks. Yeah, so my name's Joe Reis. As mentioned, I'm with Ternary Data. We're a data engineering and data architecture consulting firm based in Salt Lake City. I think we're known for helping companies evaluate the right data stacks and architectures. Then, unlike a lot of other services companies that just come in and charge you hourly for a bunch of butt-in-chair hours, we actually teach you how to use these tools in production. So we empower your teams to be better versions of your data selves.
Great. Kostas, do you want to share a little bit?
Kostas Pardalis (03:24)
Yeah, of course. I'm Kostas Pardalis. As Brooks said earlier, I'm Head of Product here at RudderStack. Before RudderStack, I was CEO and one of the co-founders of Blendo. Blendo was a cloud-based ETL solution, so again, I was building products around moving data around. I have an engineering background, and I think I've been working with data-related technologies for more than 10 years now. That's my passion, especially building products for data that are used by engineers and developers. So yeah, I had an amazing chat as part of the [inaudible 00:04:21] with Joe. Many insights came from there, and this is a good opportunity to expand on those insights and have a deeper conversation around the difference between batch and streaming, what's going on, and what's going to happen in the near future in terms of these technologies and practices.
All right. I appreciate the intros, Joe and Kostas, and I'm excited to jump into our conversation about data processing today. Like I said, we really want to start with how we got to where we are today with batch systems. So Joe, I'll let you take this one first. Also, does either of you have a read on the "how did I get here" guy?
Joe Reis (05:12)
Yeah, it's Talking Heads, David Byrne.
Yeah, that's right. Well, Joe, can you help us figure out, and help David Byrne here understand, how we got where we are with batch systems?
Joe Reis (05:27)
Well, I can certainly answer how David Byrne got to where he is, and it's a classic video and song. But as far as how we got to batch, I think it's an interesting question, because on one hand, we've always existed in a world where data has been generated as events. But batch systems, I think, accommodated the limitations of computational systems. And if we're talking about an analytics use case, which I think is the focus of today's chat, a lot of it had to do with the limitations of storage and compute historically.
So if you harken back to the early days of computing, say the 50s, 60s, 70s, and 80s, data might be entered and saved in real time, but getting the data back out of those systems required a lot more overhead than you'd find with today's systems, where you can read records automatically as changes occur. And so I think that's really where the limitations started. If you look at older analytical systems, we'll start with Teradata, for instance; I think that was one of the first modern data warehouses, back in the late 70s.
It was a very expensive system and it was completely batch, right? So you had to load it in batch, and you had to read from it very delicately because it didn't have a lot of horsepower, but for the time it was very sophisticated. As time went on, and as people started using commodity hardware to solve a lot of distributed computing problems, both for storing and writing and reading data, I think the notions started changing in terms of what could be possible with batch systems. Of course, this also parallels the use of logs in systems, right? Databases have always had the notion of time-based logs, but data from databases was typically read in batch. Now that compute and storage have caught up with the needs of streaming, maybe things are changing a bit. So what's your take, Kostas?
Kostas Pardalis (08:10)
Yeah, I agree with you, Joe. I think it has to do with the evolution of how we work with data, and to better understand why we are here today, we need to start from where we started, right? Databases are not something new; they're one of the first things we created as part of this technology. But there are different workloads when it comes to databases. We have the OLAP case, which is mainly for analytics purposes, and we have OLTP, the transactional databases, Postgres for example, that we use to drive an application. These two things need to communicate. So traditionally, we have a transactional database, and the data is created there first, right?
So you need a process that is going to pull the data from there and push it into a data warehouse, which is an OLAP system, to do the analytics. It made a lot of sense to do it in a batch fashion, not because the data did not have a streaming nature. Of course it did, but we were putting that buffer in between, which was the transactional database. And of course there were technological limitations, and there were cost-related issues, all the things that Joe mentioned. So it made a lot of sense to just follow a batch process. And I would add another dimension here, which is how easy it is to reason about the process itself, how you can identify issues and how you can react to issues, on a streaming system compared to a batch system.
A batch system is usually a bit safer. It's easier, if something goes wrong, to repeat the process, and it's harder for data to get lost along the way. So it made a lot of sense to have this kind of architecture, and of course, the needs from the market were not that real-time, right? Twenty years ago, marketers didn't ask for real-time insights; none of the functions of the company were mature enough to go and use real-time data. Now these things are changing, and today we are facing a completely different reality. It makes much more sense, and it's much more feasible, to have streaming systems.
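The pull-and-push pattern Kostas describes, a periodic job that reads new rows from the transactional database and loads them into the warehouse, can be sketched in a few lines. This is a minimal illustration, not RudderStack's or Blendo's implementation: SQLite stands in for both the OLTP source and the OLAP warehouse, and the `orders` table and high-water-mark logic are invented for the example.

```python
import sqlite3

# Stand-in databases: SQLite plays both the transactional (OLTP) source
# and the analytical (OLAP) warehouse so the sketch is self-contained.
oltp = sqlite3.connect(":memory:")
olap = sqlite3.connect(":memory:")

oltp.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
oltp.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(10.0,), (25.5,), (7.25,)])
oltp.commit()

olap.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

def run_batch(last_seen_id: int) -> int:
    """One batch ETL run: pull rows newer than the high-water mark,
    push them into the warehouse, and return the new high-water mark."""
    rows = oltp.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    if rows:
        olap.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        olap.commit()
        last_seen_id = rows[-1][0]  # remember how far we got
    return last_seen_id

# A "nightly" run copies everything new; re-running is safe because the
# high-water mark tells us where the previous batch stopped.
watermark = run_batch(0)
total = olap.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The safety property Kostas mentions is visible here: if a run fails, you simply re-run it from the last committed watermark, and the source database acts as the buffer in between.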
Kostas, we've hit on a lot of the reasons it made sense to use batch: it's safer, and the needs just weren't there for real-time. Before we get to streaming, Joe, tell us a little bit about why batch is still popular. If we do see use cases for more real-time, and the technology is advancing, why are folks still using batch?
Joe Reis (11:38)
I think the simple answer is that it's really convenient, for a lot of reasons. So let's unpack the convenience part. A lot of analytical systems, OLAP databases, still tend to favor a batch paradigm, right? Even if you're talking about popular data warehouses, a lot of those aren't built to ingest streaming data. You have to micro-batch it to ingest it, or you're probably going to break it. So columnar databases are widely used, but there are definitely some limitations. And I would say people have grown to tolerate batch. I think we'll touch on that in a bit.
And I think that's going to be less and less of a t