“Are they using data from just 2012 and forward, or are [sic] they using data for 50, 75, or 100 years ago that also take into account these difficult periods?”
I think the answers are "mostly data from 2014 onward" and "no".
First of all, the majority of the data that they (and the other loan-based AI companies) use is data relevant to the specific person applying. So they really aren’t using data from “50, 75, or 100 years ago”. Data that old just doesn’t apply as well when trying to make AI-based decisions about another person.
In the 10-K (as well as the S-1) they share a little bit about the data they use, and you can draw some conclusions from that.
They started in 2014 with only 3rd party data feeding into their basic algorithms and expanded from there. Today they still use 3rd party data but have significantly augmented that with what they call Training Data Points (the secret sauce for training their AI algorithms).
The 3rd party data includes “standard credit attributes”, education, employment, and other factors including “macroeconomic signals”. This data will include some applicant-specific history, such as changes to the credit attributes over time, changes in education and employment, salary, prior defaults, etc. This is data that all lenders and algorithms have access to.
The Training Data Points are repayment events drawn from their own data on applicants/users. As of Dec 2020 they had ~10.5M data points providing over 17 billion cells of data (double from 1.5 years earlier). That is over 1,600 cells of data per repayment event. They are tracking many, many attributes, which probably include things like the timing of repayments relative to the due date, the method of payment, whether the payment was made from a computer or a mobile device, etc. One of the Chinese companies in the same space even used data points such as how quickly you typed on your phone as part of the training data.
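For anyone who wants to check the per-event figure, here is a quick back-of-the-envelope sketch in Python. The variable names are mine, and the inputs are the rounded figures quoted above (~10.5M events, "over 17 billion" cells), so treat the result as approximate.

```python
# Rounded figures quoted in the post (as of Dec 2020); both are approximate.
repayment_events = 10.5e6   # ~10.5M training data points (repayment events)
total_cells = 17e9          # "over 17 billion" cells of data

# Average number of data cells recorded per repayment event.
cells_per_event = total_cells / repayment_events

print(f"about {cells_per_event:,.0f} cells per repayment event")
# prints: about 1,619 cells per repayment event
```

Since 17 billion is a floor ("over 17 billion"), the real average is probably a bit higher than this.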
As the training data has grown, so has the number of modeling techniques UPST has implemented. The increase in modeling techniques is linked to the increase in their internal training data points, because this is where they can create their moat. They are looking for relationships between behaviors and the probability of default for each specific borrower, which they can then apply to the next applicant, separate from higher-level macroeconomic events. The more data points they have, the more finely they can train the algorithms for better results. And if they are the only ones with access to this data, they stand above the rest.
Success in the AI space requires mainly two things: very good algorithms and lots and lots of data. The more data you have, the more training you can do and the more 3rd or 4th level variables you can analyze for meaningful correlation.
Based on this, I believe that while the 3rd party data will contain some historical data on the specific applicants and economic conditions, the majority of the analysis is done on more recent data collected from their internal databases at the specific applicant/user level. The end result they are trying to achieve is to issue loans with the lowest default rates in ALL economic situations. If they are right in what they are doing, then the impact of the next recession will be lower for their approved loans than for the industry as a whole. But only time will tell.
Best,
borngiantsfan
- no position in UPST at this time, but getting close