Big Data Analysis of the Romanian Trade Register

⚠️
The analyses presented here are not 100% accurate, because the source information extracted from BERC (in PDF or JSON format) is not in a form that guarantees such precision. For instance, Bucharest uses a distinct format for its registration certificates, so its documents were parsed poorly in this analysis.

We invite all interested parties to read the disclaimer for clear details.

Anyone who wants a copy of the statistical results, or additional information, can request it at: office@incorpo.ro.

Introduction

Incorpo.ro is a LawTech company dedicated to automating and simplifying complex legal tasks. Our goal is to eliminate bureaucracy and streamline time-consuming processes, helping people save valuable resources.

As part of this goal, we set out to develop a software robot capable of identifying and correcting errors in files before they are submitted to the Trade Register. Fewer errors mean faster file admission and, consequently, happier clients.

To train the model to understand the behavior of registrars, as well as the legal and extra-legal (customary) reasons for delay, an in-depth analysis of existing data was necessary.

In this article, we present the methodology used for the big data analysis of the activity of the Romanian Trade Register, the key results obtained, and their implications for streamlining the process of registering commercial companies.

Data and collection process

The data used in this analysis comes from the Electronic Bulletin of the Trade Register, a public source that includes general interest information about commercial companies and their registrations. The use of this data for the stated purpose of informing the public about the functionality of the Trade Register as a public interest institution complies with legal and ethical provisions.

The data collection process involved downloading the electronic bulletins for the year 2024 and extracting relevant information using web scraping and PDF processing techniques.

Analysis Methodology

The analysis of the collected data was performed using Python scripts that processed the extracted information and generated illustrative visualizations of key performance indicators. The aspects investigated include:

  1. The speed of processing files at the county level
  2. The percentage of accepted, rejected and postponed files per county
  3. The efficiency of individual registrars, measured by the number of entries processed, working days, and average daily/hourly productivity
  4. Frequency of resolution types depending on the time of pronouncement
  5. The most common reasons for rejecting files, identified through natural language processing (NLP) and clustering techniques

Key results presentation

The speed of request processing

A first indicator of the efficiency of the Trade Register is the speed with which applications for the registration of commercial companies are processed. Our analysis showed that, in most counties, applications are resolved within 1-3 working days, a remarkable interval compared to other public institutions in Romania.

The graph above illustrates the distribution of processing times for the municipality of Bucharest, highlighting that most decisions are made within the first 5 days after the request is submitted.
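As a minimal sketch of how a processing time in working days could be measured, the snippet below counts Monday-to-Friday days between a submission date and a decision date. The function name `working_days_between` and the sample dates are hypothetical, and public holidays are ignored, which may differ from what the actual scripts do.

```python
from datetime import date, timedelta


def working_days_between(submitted: date, decided: date) -> int:
    """Count Monday-Friday days after `submitted` up to and including `decided`.

    Assumption: public holidays are not excluded.
    """
    if decided < submitted:
        raise ValueError("decision date precedes submission date")
    days = 0
    current = submitted
    while current < decided:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days += 1
    return days


# Hypothetical example: submitted on a Friday, decided the following Tuesday.
elapsed = working_days_between(date(2024, 3, 1), date(2024, 3, 5))
print(elapsed)
```

A request decided on the day it was submitted counts as 0 working days under this convention.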

Accepted, rejected, and postponed file percentage

Another important aspect is the distribution of decisions made by the Trade Register according to the final result: admission, rejection or postponement. Our analysis showed that, on average, over 93% of the applications filed are admitted, directly or after a postponement.

The graph above presents the status of files for all counties, highlighting the high weight of accepted requests and the relatively low percentages of rejections and postponements.
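The per-county shares can be sketched roughly as follows. The records, county names, and the function `outcome_shares` are hypothetical. For simplicity each decision is counted on its own, so a file admitted after a postponement would need an extra step to be attributed to its final outcome.

```python
from collections import Counter, defaultdict

# Hypothetical records: (county, outcome) pairs as they might look after extraction.
records = [
    ("Cluj", "admitted"),
    ("Cluj", "admitted"),
    ("Cluj", "postponed"),
    ("Cluj", "rejected"),
    ("Iasi", "admitted"),
    ("Iasi", "admitted"),
    ("Iasi", "admitted"),
    ("Iasi", "postponed"),
]


def outcome_shares(rows):
    """Return {county: {outcome: percentage}} rounded to one decimal."""
    counts = defaultdict(Counter)
    for county, outcome in rows:
        counts[county][outcome] += 1
    shares = {}
    for county, counter in counts.items():
        total = sum(counter.values())
        shares[county] = {
            outcome: round(100 * n / total, 1) for outcome, n in counter.items()
        }
    return shares


shares = outcome_shares(records)
for county, by_outcome in sorted(shares.items()):
    print(county, by_outcome)
```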

The efficiency of individual registrars

Our analysis also tracked the individual performance of registrars, measured by workload and average productivity. The results showed that while there are differences between registrars, most process a significant number of applications and maintain a steady work pace.

The most productive registrars of the Trade Register, period 01.01.2024-01.07.2024 (average number of files resolved per day, with a single record, sampled)
📊
The most efficient registrar identified is Ovidiu Bugeag, who processed 4,257 entries in 105 working days, an average of 40.54 entries per day and 5.07 entries per hour.

At the opposite pole are registrars such as Maria-Cornelia Măglașu, who processes only 3.64 files per day, that is, 0.46 files per hour worked.

Note: The data is of public interest, but we invite registrars to exercise their right of reply if they wish to clarify the situation.

These results suggest that while there is room for improvement, most registrars are carrying out their duties with professionalism and efficiency.

💡
It is noteworthy that many registrars have a total of only 30-60 working days, which may mean that they are at an early stage of their career, on maternity leave, or affected by other circumstances that reduce their recorded activity.

We adapted the analysis to compute the average over working days only, ignoring days on which a registrar issued no resolutions (which is why each registrar has at least 1 file per working day). The drawback of this approach is that it can hide people who are not actually working.
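The averaging rule described above can be sketched as follows. The function `productivity` and the sample data are hypothetical. The 8-hour working day is an assumption on our part, although it is consistent with the published figures (40.54 per day against 5.07 per hour).

```python
from collections import defaultdict
from datetime import date

HOURS_PER_DAY = 8  # assumption: an 8-hour working day


def productivity(resolutions):
    """Compute files/day and files/hour per registrar.

    `resolutions` is an iterable of (registrar, decision_date) pairs. Only days
    on which a registrar issued at least one resolution count as working days.
    """
    files = defaultdict(int)
    active_days = defaultdict(set)
    for registrar, day in resolutions:
        files[registrar] += 1
        active_days[registrar].add(day)
    result = {}
    for registrar, total in files.items():
        per_day = total / len(active_days[registrar])
        result[registrar] = {
            "files": total,
            "working_days": len(active_days[registrar]),
            "per_day": round(per_day, 2),
            "per_hour": round(per_day / HOURS_PER_DAY, 2),
        }
    return result


# Hypothetical data: registrar "A" resolves 4 files over 2 active days.
sample = [
    ("A", date(2024, 1, 8)),
    ("A", date(2024, 1, 8)),
    ("A", date(2024, 1, 8)),
    ("A", date(2024, 1, 10)),
]
stats = productivity(sample)
print(stats["A"])

# The article's top figure follows the same arithmetic: 4,257 entries / 105 days.
top_per_day = round(4257 / 105, 2)
top_per_hour = round(4257 / 105 / HOURS_PER_DAY, 2)
print(top_per_day, top_per_hour)
```

Because days without any resolution never enter `active_days`, the per-day average cannot fall below 1, which is the side effect noted above.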

Frequency of resolution types by hour

An interesting analysis looked at the distribution of resolution types (admission, rejection, postponement) depending on the time of pronouncement. The results highlighted certain patterns, such as a higher frequency of postponements in the early morning hours and a concentration of admissions between 10:00 and 15:00.

The graph above illustrates these trends for the municipality of Bucharest, suggesting possible opportunities for optimizing working hours and resource allocation.
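A breakdown of this kind can be sketched by counting resolution types per hour of pronouncement. The timestamps and labels below are hypothetical sample data, not values from the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical resolutions: (timestamp of pronouncement, resolution type).
resolutions = [
    (datetime(2024, 2, 5, 7, 15), "postponement"),
    (datetime(2024, 2, 5, 7, 40), "postponement"),
    (datetime(2024, 2, 5, 10, 5), "admission"),
    (datetime(2024, 2, 5, 11, 30), "admission"),
    (datetime(2024, 2, 5, 11, 55), "rejection"),
    (datetime(2024, 2, 5, 14, 20), "admission"),
]

# Count each resolution type per hour of the day.
by_hour = Counter((moment.hour, kind) for moment, kind in resolutions)

for (hour, kind), n in sorted(by_hour.items()):
    print(f"{hour:02d}:00 {kind}: {n}")
```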

💡
Furthermore, it is commendable that people are working even before official hours, at 6 a.m. or 7 a.m., a pleasant surprise that appears in quite a few counties.

It is clear that many people are working beyond the regular schedule, and the results show in the institution's above-average performance.

Reasons for rejection of applications

Using NLP and clustering techniques, we analyzed the texts of the rejection decisions to identify the most frequent reasons invoked by registrars. The results highlighted issues such as the lack of supporting documents, non-compliance with legal requirements regarding the company's object of activity or its name, as well as formal errors in drafting the applications.

t-SNE (silhouette-based clustering plus elbow method, 87 clusters): postponement decisions by ORC registrars

The t-SNE visualization shows how well the different categories of postponement are separated, and how effective the model was at categorizing them.

The image shows clearly separated clusters forming, which is a good sign. Below are the aggregated patterns; the clusters were summarized with AI models to extract the patterns common to all members of each cluster.
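To illustrate what a silhouette score measures, here is a small pure-Python sketch. It is not the pipeline used in the analysis: the sample texts are hypothetical, word-set Jaccard distance stands in for whatever text representation was actually used, and the t-SNE projection is not shown. The idea is only that a labelling which groups similar reasons together scores higher than one that mixes them.

```python
def jaccard_distance(a, b):
    """Distance between two word sets: 1 - (shared words / all words)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def mean_silhouette(points, labels, dist):
    """Mean silhouette coefficient of a labelling with at least two clusters."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical postponement reasons, reduced to word sets.
texts = [
    "missing supporting document",
    "supporting document missing from file",
    "company name not available",
    "company name reserved by another company",
]
points = [set(text.split()) for text in texts]

by_topic = mean_silhouette(points, [0, 0, 1, 1], jaccard_distance)
mixed = mean_silhouette(points, [0, 1, 0, 1], jaccard_distance)
print(round(by_topic, 3), round(mixed, 3))
```

Choosing the number of clusters then amounts to repeating the clustering for several candidate counts and keeping one with a high mean silhouette, cross-checked against the elbow of the error curve.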

Analysis of the most frequent reasons for postponement (Summary)

Interpretation and implications

Our analysis results highlight, on the whole, a good level of efficiency and professionalism in the activity of the Trade Register compared to other public institutions in Romania.

The processing times for applications are reasonable, and the high rate of approval decisions suggests the correctness and compliance of the registration process.

However, the analysis also identified some opportunities for improvement, such as:

  1. Optimizing resource allocation and working hours according to the observed pattern of resolutions by hourly interval
  2. Offering guidance and additional support to applicants to reduce frequent errors in application submissions
  3. Clarifying areas that are currently interpreted by convention rather than on the basis of clear rules:
    1. Rejections because the administrator's mandate has an indefinite duration (this should be replaced with the supplementary 3-year period, cf. the Civil Code)
    2. Lack of clear reasoning for rejections in certain situations, which leaves them unfounded.
    3. The CAEN dilemma (entrepreneurs are required to declare that they do not sell weapons or ammunition, and that they will not engage in activities for which they do not hold permits).

      These declarations amount to a declaration of "not intending to commit crimes" and are, in practice, largely devoid of legal effect.

Conclusions

The big data analysis of the Romanian Trade Register, conducted by the Incorpo.ro team, has provided valuable insights into the efficiency and challenges of this key institution in the Romanian business ecosystem. By employing advanced data processing techniques and visualization methods, we were able to identify both strengths and opportunities for improvement.

Our results underscore the importance of continuous investment in innovative technological solutions, such as process automation and the application of artificial intelligence, to further enhance the efficiency and quality of the services provided by the Trade Register.

On the other hand, we believe that the registrars' solid effort should be rewarded, and that it is economically irrational to pay over-performers a flat rate.

Even without a mathematical analysis of the data, the charts make it evident that working outside business hours is a habit of registrars in most counties; registrars in Botoșani even resolve files between 22:00 and 23:00 with comparatively high frequency.

We believe that the Trade Register is a good case study for the shortcomings of the public remuneration system, where performance is discouraged. We will return with a more detailed analysis of hourly remuneration in proportion to the number of files completed, to highlight the flaws of the current system and the potential for legislative change that would reward the sustained efforts of the majority of registrars.

Over 60% of registrars process more than 25 files per day, which over an 8-hour working day means they resolve a file roughly every 19 minutes or faster, a good figure.

At the same time, we must be careful that the effort does not become excessive, and that the speed needed to meet the standards does not come at the cost of the registrars' due diligence.

We hope that this analysis provides a solid basis for constructive discussions and concrete actions towards optimizing the activity of the Trade Register, for the benefit of the Romanian business environment and the economy as a whole.

We invite readers to explore the extensive information provided in the GitHub repository, which contains more charts for each county regarding the admissibility rate, working hours, etc.

For reply rights (if applicable), dataset requests, and other inquiries, we remain available at:

office@incorpo.ro
+40786833325

Disclaimer, Information regarding potential errors, etc.

At the indirect request of a person who responded to the post, I decided to provide a better example of how the analysis was conducted, where the data comes from, and what it actually reveals:

  1. We took the information from the electronic bulletin of the trade register, which we used for the analyses. We took everything from the year 2024, from all the counties in the country, up to 01.07.2024.
  2. I extracted the text from each document and used regular expressions (regex), which proved able to extract information efficiently from most documents managed by the Trade Register, approximately 90%. Regex is a way to search for "rules" in text, for example by instructing the program to read everything that comes after "Registrar of the trade register, [HERE IS THE NAME]."
  3. We counted how many rejection or acceptance decisions each registrar's name appeared in, and aggregated the information. Because some documents were produced with OCR and lost part of their meaning, we post-filtered the displayed information.
    Post-filters:
    1. At least 30 different days must be identified, so if there are anomalies, they should persist for 30 distinct calendar days. This way, we eliminate both new employees and anyone else who, for other reasons, may not be performing at the same level. You can't condemn a beginner for working more slowly.
    2. We have largely tried to unite common names, where we found them. Subsequently, after reasonable criticism from Mr. Alex Marin, we also aggregated based on name similarity, to eliminate situations where the same name is present in different forms in different places. For example: a misspelling, lack of diacritics, lack of "-" in the name.
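The two post-filters above can be sketched as follows. The helper names `normalize_name` and `filter_registrars`, as well as the sample names, are hypothetical. The normalization shown here (removing diacritics, hyphens and case differences) is one plausible way to merge spelling variants, not necessarily the exact rule applied in the analysis.

```python
import unicodedata
from collections import defaultdict
from datetime import date, timedelta

MIN_DISTINCT_DAYS = 30  # threshold stated in the post-filters above


def normalize_name(name: str) -> str:
    """Strip diacritics, turn hyphens into spaces, collapse spaces, lowercase."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.replace("-", " ").lower().split())


def filter_registrars(rows):
    """Keep registrars seen on at least MIN_DISTINCT_DAYS distinct dates."""
    days = defaultdict(set)
    for name, day in rows:
        days[normalize_name(name)].add(day)
    return {name for name, seen in days.items() if len(seen) >= MIN_DISTINCT_DAYS}


# Hypothetical data: one person under two spellings (35 days in total),
# and another registrar with only 10 days of activity.
start = date(2024, 1, 1)
rows = [("Ștefania-Ioana Pop", start + timedelta(days=i)) for i in range(20)]
rows += [("STEFANIA IOANA POP", start + timedelta(days=i)) for i in range(20, 35)]
rows += [("Dan Ionescu", start + timedelta(days=i)) for i in range(10)]

kept = filter_registrars(rows)
print(kept)
```

Without the normalization step, neither spelling would reach 30 days on its own and the registrar would be dropped, which is why the order of the two filters matters.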

Legitimate risks: Regex matching on text carries a degree of inaccuracy whenever the rules used to identify resolutions are not sufficient to capture all the information. For example, even now, there are major differences in what the Bucharest dataset reveals, because the registrars in Bucharest do not use the standard template.
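A minimal sketch of such a rule, and of how it fails on a non-standard template, is shown below. The pattern is built from the English phrase quoted in this disclaimer purely for illustration; the real documents are in Romanian and their exact wording is not reproduced here. The sample texts and the function `extract_registrar` are hypothetical.

```python
import re

# Illustrative pattern based on the English phrase quoted in the disclaimer;
# the actual Romanian template wording is an unknown here.
REGISTRAR_PATTERN = re.compile(
    r"Registrar of the trade register,\s*(?P<name>[^\n.,;]+)"
)


def extract_registrar(text: str):
    """Return the registrar name following the marker phrase, or None."""
    match = REGISTRAR_PATTERN.search(text)
    return match.group("name").strip() if match else None


standard = (
    "Resolution no. 1\n"
    "Registrar of the trade register, Jane Doe\n"
    "Admits the request."
)
non_standard = "Resolution no. 2\nSigned by: Jane Doe"

found = extract_registrar(standard)
missed = extract_registrar(non_standard)
print(found, missed)
```

The second document contains the same name but not the marker phrase, so the rule silently returns nothing. This is the kind of gap that affects documents which do not follow the standard template.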

Accusations regarding bad faith, settling of scores, revenge, etc.: The analysis was conducted internally to identify the most frequent reasons for delay, a purpose which, in my personal opinion, is sincere and against which I see no valid criticism.

The analysis covers registrars from all over Romania. I don't know most of them and have nothing against any of them, and the scores, even with a ±10% error, are good overall. They show high efficiency, which, by the way, I emphasized in the article.

Finally, if we were of ill intent, I don't think we would have published positive examples, and certainly not under the brand we want to build as being based on good faith, trust, and competence.


Request Re-evaluation + Result

As a reverification of the data analysis was requested, especially in relation to the registrars of the Trade Register, we proceeded to perform this check to identify any major discrepancies in the results.

Optimizations for the robustness of the analysis process:

We've made a number of improvements to the data collection and processing:

  1. Data saving process optimization by implementing a semaphore system (mutex lock) to prevent race conditions and inconsistencies caused by concurrent access to files.
  2. Extending the analysis period to 04.07.2024 by crawling all published bulletins, including those that had not previously been available, thus ensuring complete data coverage.
  3. The inclusion in the corpus of data regarding the municipality of Bucharest, by modifying the regular expressions (regex) used for extracting information, thus eliminating the initial omission of this administrative entity treated separately from the counties. We assume that most of the changes in the analysis results come from here - the previous analysis did not include the Municipality of Bucharest.
  4. Improving the process of identifying county names by using a fuzzy search algorithm (fuzzy string matching), to allow for more flexible matching and to handle variations caused by OCR processing or deviations from the standard writing style.
  5. Implementation of name permutation management for registrars (e.g., "John Doe" and "Doe John" are treated as the same person) by applying a sorted name search algorithm, along with the previously mentioned fuzzy search.
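Items 4 and 5 can be sketched with Python's standard `difflib` module. This is an illustration under our own assumptions: the reference list of counties is a small hypothetical subset, the cutoff and threshold values are arbitrary, and the published code may rely on a different fuzzy-matching library.

```python
import difflib

# Hypothetical subset of county names used as the reference list.
COUNTIES = ["Botosani", "Brasov", "Bucuresti", "Cluj", "Iasi"]


def match_county(raw: str, cutoff: float = 0.75):
    """Return the closest known county name, or None if nothing is close enough."""
    lowered = {county.lower(): county for county in COUNTIES}
    hits = difflib.get_close_matches(raw.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def name_key(name: str) -> str:
    """Order-independent key: lowercase tokens sorted alphabetically."""
    return " ".join(sorted(name.lower().replace("-", " ").split()))


def same_person(a: str, b: str, threshold: float = 0.9) -> bool:
    """Compare the sorted-token keys of two names with a similarity ratio."""
    ratio = difflib.SequenceMatcher(None, name_key(a), name_key(b)).ratio()
    return ratio >= threshold


print(match_county("Bucurest1"))                    # OCR-style digit for a letter
print(same_person("John Doe", "Doe John"))          # permuted order
print(same_person("Jonathan Doe", "Doe Jonathn"))   # permuted and misspelled
print(same_person("John Doe", "Jane Roe"))          # different people
```

Sorting the tokens first makes the comparison independent of name order, so the fuzzy ratio only has to absorb spelling noise.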

Secondary Analysis:

At the end of the analysis, after applying these improvements, we proceeded with a comparative analysis of the results to identify potential errors in the first analysis. Thus:

  • As for the number of working days, the average difference was -2.0 days, and the median was 7.0 days, with a variation between -91 and 13 days. This variation suggests that while there were significant changes for some registrars (e.g., Georgeta Pacuraru with a decrease of 91 days), overall the impact was moderate.
  • Regarding the number of files processed, the average change was 292.43 files, and the median was 348.5 files, with a variation between -678 and 863 files. These values indicate incremental adjustments for most registrars, except for cases such as Daniela Oprișan, who recorded an increase of 863 processed files.
  • Daily productivity showed an average change of 2.61 files per day and a median change of 2.35 files per day, ranging from -7.11 to 14.94 files per day. These figures suggest that while there were significant improvements for some registrars (e.g., Ioana Cătălina Florea with an increase of 14.94 files per day), the changes were negative for others (e.g., Mihaela Vicol with a decrease of 7.11 files per day).
  • Hourly productivity had an average change of 0.33 files per hour and a median change of 0.29 files per hour, with variations between -0.89 and 1.87 files per hour. These values indicate relatively minor adjustments for most registrars.
  • Regarding changes in the ranking, there was a median improvement of 3.0 positions and an average improvement of about 1 position (-1.07). Although there were some re-rankings, they were not substantial overall, and most registrars maintained their approximate relative positions.
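The comparison between the two runs boils down to per-registrar differences summarized by mean, median, minimum and maximum. A minimal sketch with hypothetical figures (the identifiers `R1` to `R4` and all values are invented for illustration):

```python
from statistics import mean, median

# Hypothetical files-per-day figures for the same registrars in two runs.
first_run = {"R1": 20.0, "R2": 31.5, "R3": 12.0, "R4": 25.0}
second_run = {"R1": 22.0, "R2": 30.5, "R3": 18.0, "R4": 25.0}

# Only registrars present in both runs can be compared.
common = sorted(first_run.keys() & second_run.keys())
deltas = {name: round(second_run[name] - first_run[name], 2) for name in common}

summary = {
    "mean": round(mean(deltas.values()), 2),
    "median": round(median(deltas.values()), 2),
    "min": min(deltas.values()),
    "max": max(deltas.values()),
}
print(deltas)
print(summary)
```

A mean that sits far from the median, as with the working-day differences reported above, usually points to a few large outliers rather than a uniform shift.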

In conclusion, the re-analysis showed that while the improvements made refined the results and increased the accuracy of the study, they did not lead to fundamental changes in the initial conclusions. The initial analysis seems to have been generally solid and fair, and the adjustments made strengthened the findings without significantly altering them.

We believe that this process of re-verification and improvement of the analysis demonstrates our commitment to accuracy, transparency, and responsiveness to the feedback we receive.

The results of this study, thus revised, provide an even more detailed and substantiated picture of the activity of the registrars of the Trade Register.

Real-time updates:

Review - Top 10 - 01.01.2024-03.07.2024 (Including Bucharest Municipality) - Files per hour
Review - Bottom 10 - 01.01.2024-03.07.2024 (Including Bucharest Municipality) - Files per hour

Transparency commitment

Given the criticism regarding the apparently opaque analysis process, we have published the code used in the analysis below, to increase transparency in the process. We have attached the files used in the analysis, as well as the preliminary information from the new analysis.

GitHub - Incorpororo/analiza-big-data-onrc