Why New Relic
Easy instrumentation for cloud-based architecture and powerful analytics capabilities enabling teams to uncover performance improvements
- “Instrument everything” approach led to full operational visibility and optimized system performance
- Significant cost savings due to insight into amount of scaling needed, especially during peak traffic times
- Improved customer SLAs by decreasing transaction response times by 75%
In the wild world of e-commerce, fraud is a constant threat. Stolen credit card numbers are abundant and available, and fraudsters are constantly refreshing their cons.
Retailers operating a digital storefront are walking a tightrope between accepting a transaction that might be fraudulent and rejecting a valid customer. Unlike in-store purchases, if a retailer completes an online sale made with a stolen card, they—not the credit card company—are on the hook for the chargeback. They lose the revenue, the goods, and the cost of fulfillment.
That’s where Riskified comes in. The five-year-old New York-based company uses machine learning algorithms to analyze transactions, evaluating each shopper’s risk and delivering a definitive “approve” or “decline” answer on a purchase attempt. Not only does Riskified need to make accurate decisions, but it needs to do so at speed and scale for global brands including Wish, Prada, lastminute.com, Finish Line, Peloton, GOAT, Gucci, Mattel, and many more.
Riskified is so sure of its fraud-detection solution that it assumes the liability, guaranteeing 100% of the cost for any chargebacks its merchant customers incur as a result of an incorrectly approved order. According to Riskified’s co-founder and CTO Assaf Feldman, “Fraud is an unfortunate reality in e-commerce, and many merchants wind up rejecting a significant number of good purchases due to fear of loss. We developed a solution that would approve more good orders—instantly and at scale. To stay ahead of the 'bad guys,' that solution needs to be continually evaluated and optimized.”
Instrumenting everything to achieve optimal performance
Originally built with Ruby and running on Amazon Web Services (AWS), the Riskified platform has evolved to include services in Docker containers using Scala on the backend and Angular on the frontend. With millions of dollars hanging in the balance every second, Riskified can’t afford poor performance or slowdowns. “From day one, New Relic has provided us with the visibility and metrics to make better informed decisions,” says Feldman. “We can quantify the impact of any changes made to any part of the platform.”
Whenever a team at Riskified launches a new service, developers instrument it with New Relic so they have a holistic view of the application, including the proper AWS CloudFormation and container usage to get the optimal performance for the platform. This approach has paid off in many instances, including a particularly impactful event when Haim Ashkenazi, Riskified’s head of DevOps, got an alert on a low Apdex score for a server cluster. Calls to the service had increased from around 10 million per day to more than 1 billion in a couple of hours, and performance had plummeted.
The sudden 100x increase taxed Riskified’s normal autoscaling policies and impacted many parts of the system. Using New Relic APM and customized New Relic Insights dashboards, Riskified was able to retune its scaling policies and database metrics and restore performance within roughly 20 minutes.
Without New Relic, the process would have involved a substantial amount of trial and error to reach an adequate number of servers, and at a significant cost. Instead, Ashkenazi was able to adjust the scaling policy to handle the increased volume, buying time (and saving the company an unnecessary expense) while he figured out what happened. “It took less than 20 minutes to stabilize and resolve the problem,” recalls Ashkenazi. “And I could analyze exactly what part of the application was impacted.” In the end, it turned out that one of Riskified’s biggest customers had released an incorrectly configured version of its mobile app, resulting in overuse of the API.
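For readers unfamiliar with the Apdex score that triggered the alert, the following is a minimal sketch of the standard Apdex calculation: requests at or under a threshold T count as satisfied, requests between T and 4T count as tolerating (half weight), and everything slower counts as frustrated. The sample response times and the 0.5-second threshold are illustrative assumptions, not Riskified's data or configuration.

```python
def apdex(response_times_s, threshold_s):
    """Return the Apdex score (0.0 to 1.0) for response times in seconds."""
    if not response_times_s:
        raise ValueError("need at least one response time")
    # Satisfied: at or under the threshold T.
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    # Tolerating: slower than T but no slower than 4T; counts half.
    tolerating = sum(
        1 for t in response_times_s if threshold_s < t <= 4 * threshold_s
    )
    # Frustrated requests (slower than 4T) contribute nothing.
    return (satisfied + tolerating / 2) / len(response_times_s)


# Illustrative samples only (seconds); T = 0.5 s is an assumed threshold.
samples = [0.2, 0.3, 0.9, 1.5, 3.0]
score = apdex(samples, threshold_s=0.5)
```

A sudden flood of slow responses pushes more requests into the tolerating and frustrated buckets, which is why a traffic spike like the one described above shows up as a falling score.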
Transaction response times reduced by 75%
As Riskified continues to scale, Feldman’s ongoing quest is to reduce transaction time—the amount of time it takes to send merchants that yes-or-no decision. He says, “It’s no longer just a question of monitoring software performance, it’s also making sure that we’re quickly providing decisions on transactions and meeting our customers’ SLAs. New Relic helps us get there.”
New Relic’s APM dashboards pointed Riskified in the right direction, and its engineers then added method-level instrumentation to the areas of the application code that were taking too long to execute.
“Moving from 1-second decisions to 200-millisecond decisions is very difficult, since it’s all in the very low level of the code structure and efficiency,” says Feldman. “New Relic’s base capabilities were extremely helpful, and we were able to quickly and easily make custom additions to those capabilities to meet our demanding standards.” Feldman and his team created a variety of dashboards with New Relic Insights to monitor very specific elements of their operation that were critical to their success.
Using these dashboards, Riskified’s teams were able to combine the analytics of Insights and APM to drill down to the class and method level or the message level, systematically identifying areas that could be improved and fixing them. As a result, over the course of four months Riskified was able to slash its average transaction response time by 75%—from 800 milliseconds to 200 milliseconds or less.
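The idea behind method-level instrumentation can be sketched in a few lines. The example below is a generic, standard-library-only illustration and does not use the New Relic agent API: the `timed` decorator, the `timings_ms` store, and the `score_order` function are all hypothetical names introduced here, and the decision rule inside `score_order` is a placeholder rather than anything resembling a fraud model.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory store: function name -> list of durations (ms).
timings_ms = defaultdict(list)


def timed(func):
    """Record the wall-clock duration of every call to func."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if func raises, so slow failures are visible too.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            timings_ms[func.__name__].append(elapsed_ms)
    return wrapper


@timed
def score_order(order_total):
    # Placeholder logic for illustration only.
    return "approve" if order_total < 1000 else "decline"


def slowest(n=3):
    """Return the n functions with the highest average duration."""
    averages = {name: sum(v) / len(v) for name, v in timings_ms.items()}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


decision = score_order(120.0)
```

Ranking methods by average duration, as `slowest` does, is one simple way to decide where optimization effort is likely to pay off first.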
Monitoring and scaling on Black Friday
For Black Friday (the most important shopping day of the year) and throughout the entire holiday shopping season, Ashkenazi notes, “You don’t take chances. You go as large as you can.”
To ensure its platform was ready to meet the expected uptick in request volume, Riskified used New Relic Infrastructure to perform stress tests, challenging its system to find the largest load the machines could handle without failing. Analyzing the data from New Relic, Ashkenazi and team determined that the cluster’s latency was not the best parameter for creating the policy for the autoscaling group. “New Relic helped us identify the right network for this type of server and create an effective scaling policy using the amount of network bytes transmitted per minute instead,” explains Ashkenazi.
The team monitored the platform for the entirety of Black Friday, and all systems performed superbly under the new policy. Ashkenazi can now sleep soundly. “I don't wake up at night. We have it handled. If we design the system correctly with scaling groups and policies, taking messages and responding accordingly, most events handle automatically,” he says. Significantly, the new policy has resulted in cost savings, as it enables Riskified to define its infrastructure and manage its spend with AWS effectively.
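To make the network-based scaling policy more concrete, here is a minimal sketch of how a group size could be derived from network bytes transmitted per minute. It is not the AWS Auto Scaling API and not Riskified's actual policy: the function name, the per-instance target, and the group size limits are all hypothetical values chosen for illustration.

```python
import math


def desired_instances(total_bytes_out_per_min, target_bytes_per_instance,
                      min_size, max_size):
    """Return how many instances keep per-instance network output near target."""
    if target_bytes_per_instance <= 0:
        raise ValueError("target_bytes_per_instance must be positive")
    # Round up so no instance is asked to carry more than the target.
    needed = math.ceil(total_bytes_out_per_min / target_bytes_per_instance)
    # Clamp to the group's configured bounds.
    return max(min_size, min(max_size, needed))


# Hypothetical numbers: 4.5 GB/min in total, 500 MB/min target per instance.
size = desired_instances(
    total_bytes_out_per_min=4_500_000_000,
    target_bytes_per_instance=500_000_000,
    min_size=2,
    max_size=50,
)
```

Clamping to a minimum and maximum keeps the group from shrinking to zero during quiet periods or growing without bound during an anomaly, which ties the policy directly to infrastructure spend.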
Continually advancing a DevOps culture
At Riskified, DevOps isn’t just the name of a team. “We’re trying to adopt DevOps as a culture,” Feldman says. “Everybody is responsible for the reliability of the system and performance.”
Feldman continues, “New Relic provides us not just efficiency but also ownership. Because every team can manage much of the infrastructure requirements themselves, they’re constantly thinking about improvements and can ensure that nothing falls through the cracks.”
Every technical staffer within Riskified uses New Relic APM and Infrastructure, thanks to the “instrument everything” philosophy Feldman has instilled in the company. Whenever a team at Riskified launches a new service, it sends all necessary observability data to New Relic. This enables every team to have the same view of operations, along with real-time insights via dashboards that offer a holistic view of the system, as well as custom views tailored to specific team needs and system processes.
Less risk to the bottom line
Using New Relic helps Riskified deliver on its innovative business model and customer guarantee to connect legitimate shoppers with the online retailers they love. Riskified is now able to pinpoint and respond to a performance issue faster than ever with a more elastic platform, eliminating safe havens for e-commerce fraudsters.